Pre-installation tasks for DP Profiler Agent for HDP 3.x versions
Perform these tasks before you try to install the Data Profiler agent on the cluster.
Ensure that you have downloaded the required software from the customer portal, following the instructions provided as part of the product procurement process.
DSS includes the following parts:
- DSS app that needs to be installed on the DataPlane host
- Cluster agent software that needs to be installed on every cluster that is managed by DSS. The cluster agent software consists of an MPack package and the profiler service package.
Ensure that the clusters are running the required version of HDP.
Important: The MPack package for the DataPlane profiler agent includes MPack files for both the HDP 2.6.5 and HDP 3.0 versions. Unzip the package and identify the MPack to use depending on your version of HDP. You can identify the MPack from the name of the file: the HDP 3.x MPack has the string hdp3 in its name, and the HDP 2.6.x MPack has the string hdp2 in its name. Make sure you install the MPack that corresponds to your HDP version.
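The file-name rule for choosing the MPack can be sketched in a few lines of Python. This is only an illustration: the function name pick_mpack and the sample file names are hypothetical, and the only facts taken from this page are the hdp2 and hdp3 markers in the file names.

```python
from typing import Iterable, Optional


def pick_mpack(filenames: Iterable[str], hdp_version: str) -> Optional[str]:
    """Return the first file whose name carries the marker for the HDP major version."""
    # "3.1.0" -> "3", "2.6.5" -> "2"
    major = hdp_version.strip().split(".")[0]
    # Markers as described in the note above: hdp2 for HDP 2.6.x, hdp3 for HDP 3.x.
    marker = {"2": "hdp2", "3": "hdp3"}.get(major)
    if marker is None:
        return None  # unsupported HDP version
    for name in filenames:
        if marker in name.lower():
            return name
    return None


# Hypothetical file names, standing in for the contents of the unzipped package.
files = ["profiler-agent-mpack-hdp2.tar.gz", "profiler-agent-mpack-hdp3.tar.gz"]
print(pick_mpack(files, "3.1.0"))
```

Returning None for an unknown version or a missing file makes it harder to install the wrong MPack by accident.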
Ensure that the following HDP components are installed and configured:
- Spark2 with Livy for Spark2 and Spark Thrift Server for Spark2
- Ensure that Hive Interactive is enabled.
- If you plan to sync users from LDAP into Ranger, ensure a dpprofiler user is created in LDAP and synced into Ranger.
- Ensure that Ranger integration for HDFS and Hive is enabled.
- Make sure that HDFS Audit logging for Ranger is enabled.
- Make sure you install the Hive client and the HDFS client on the machine where you plan to install the DataPlane Profiler Agent.
- Restart the services as required.
Make sure the YARN queues meet the following resource requirements for a default DSS configuration:
- RAM should be greater than or equal to 24 GB.
- CPU cores should be greater than or equal to 12.
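As a quick sanity check, the two minimums above can be expressed as a small Python helper. The function and constant names are hypothetical, and the sketch assumes you already know the RAM and core counts available to the queue; it does not query YARN.

```python
MIN_RAM_GB = 24      # minimum RAM from the requirements above
MIN_CPU_CORES = 12   # minimum CPU cores from the requirements above


def meets_dss_minimums(ram_gb: float, cpu_cores: int) -> bool:
    """Return True when both the RAM and the CPU core minimums are satisfied."""
    return ram_gb >= MIN_RAM_GB and cpu_cores >= MIN_CPU_CORES


# Example values chosen for illustration only.
print(meets_dss_minimums(ram_gb=32, cpu_cores=16))
print(meets_dss_minimums(ram_gb=16, cpu_cores=16))
```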
Update the YARN parameters as follows:
Set the yarn.scheduler.capacity.maximum-am-resource-percent parameter on YARN > Scheduler (call this value x) so that x multiplied by the total memory in YARN is greater than or equal to 8 GB.
The equation appears as follows:
(x * total_memory_in_yarn) >= 8G
For example, for 16 GB it is advised to set x to 0.5.
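The inequality can be rearranged to x >= 8 / total_memory_in_yarn, which gives the smallest acceptable value of x. The Python sketch below computes that value and rounds it up to two decimal places so the inequality still holds. The function name is hypothetical, and the two-decimal rounding is an assumption made for readability, not a requirement stated on this page.

```python
import math

REQUIRED_AM_MEMORY_GB = 8  # right-hand side of the inequality above


def min_am_resource_fraction(total_yarn_memory_gb: float) -> float:
    """Smallest x, rounded up to two decimals, with x * total memory >= 8 GB."""
    if total_yarn_memory_gb < REQUIRED_AM_MEMORY_GB:
        # x cannot exceed 1.0, so less than 8 GB of YARN memory can never satisfy the rule.
        raise ValueError("total YARN memory must be at least 8 GB")
    x = REQUIRED_AM_MEMORY_GB / total_yarn_memory_gb
    # Round up, never down, so the inequality is not violated by rounding.
    return math.ceil(x * 100) / 100


print(min_am_resource_fraction(16))  # matches the 16 GB example above: 0.5
```

Any value of x at or above the computed minimum (and no greater than 1.0) satisfies the inequality.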
All these resources must be allocated exclusively for the profiler agent and the profilers. It is advisable to use a separate queue.
Make sure a stable Hive LLAP instance is available with the following minimal requirements.
Consider the following parameters:
- a = Average number of executors for sensitive/tablestats profilers
- b = Average RAM per executor for sensitive/tablestats profilers
- c = Average RAM per application master for sensitive/tablestats profilers
- y = RAM available in YARN for the dpprofilers queue
The following formula will determine the minimal requirements:
LLAP will have x more jobs accessing data in Hive through LLAP with each having a parallelism.