DPS Installation and Setup

Chapter 2. Planning for a DPS Installation

Before installing DPS, you must review various aspects of your HDP environment and prepare them for the installation. This includes items such as the operating system version, cluster security, and the node configuration requirements for DPS Platform and its associated services.

Review the following information before you start, to ensure that your environment is properly configured for a successful DPS installation.

Support Matrix information

You can find the most current information about interoperability for this release on the Support Matrix. The Support Matrix tool provides information about:

  • HDP and Ambari

  • Operating Systems

  • Databases

  • Browsers

  • JDKs

To access the tool, go to: https://supportmatrix.hortonworks.com.

Requirements for DataPlane Service Platform Host

Hortonworks DataPlane Service (DPS™) is composed of a platform (DPS Platform) and the services that plug into the platform (DLM, DSS, etc.), which are all installed on the same host node. DPS also includes engines and agents that are installed on the clusters used with DPS.

You should install the DPS Platform on a host remote to the cluster. The DPS Platform host must meet the requirements identified in the following sections.

All clusters registered with DPS must be managed by Apache Ambari.

DPS Support Matrix Information

See the Requirements for DataPlane Service Platform Host and the Hortonworks Support Matrix for details regarding supported operating systems, databases, software, etc.

Required Docker Versions

Docker 1.12.x
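
A quick way to confirm the installed Docker version on the intended DPS Platform host is shown below; the exact output varies by build.

    # Confirm the installed Docker version on the DPS Platform host
    docker --version
    # Output should report a 1.12.x release, for example: Docker version 1.12.6, ...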

Other Software Requirements

On each DPS Platform host, ensure that the following software is available:

  • yum and rpm

  • tar and wget

  • bash shell
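
As a quick sanity check before installation, you can confirm that these utilities are present on the host. The loop below is a minimal sketch.

    # Confirm that the required utilities are available on the DPS Platform host
    for cmd in yum rpm tar wget bash; do
        command -v "$cmd" >/dev/null 2>&1 || echo "Missing required utility: $cmd"
    done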

Processing and Memory Requirements

The DPS Platform host requires the following:

  • Multicore processor, with minimum 8 cores

  • Minimum 16 GB RAM

See the HDP and Ambari Support Matrices for requirements.
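
You can verify the core count and installed memory on the intended host with standard Linux commands, for example:

    # Check CPU core count and total memory on the DPS Platform host
    nproc                          # expect 8 or more cores
    grep MemTotal /proc/meminfo    # expect roughly 16 GB (about 16,000,000 kB) or more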

Port and Network Requirements

Have the following ports available and open:

Port Number | Purpose | Required to be open?
80 | Where DPS Platform runs. | Yes
443 | For SSL-based communication. | Yes
8443 | Where the Apache Knox instance for login runs. | Yes
8500 | For debugging using Consul. This port must be available, but it is optional to have it open. | No
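
Before installing, you can confirm that none of these ports are already in use on the DPS Platform host and, if necessary, open them in the firewall. The example below assumes firewalld; adjust it for whatever firewall tooling your operating system uses.

    # List any listeners already bound to the DPS Platform ports
    ss -ltn | grep -E ':(80|443|8443|8500)\b'

    # Open the required ports (firewalld example; adjust for your firewall)
    sudo firewall-cmd --permanent --add-port=80/tcp --add-port=443/tcp --add-port=8443/tcp
    sudo firewall-cmd --reload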

It is recommended that you use a DNS server to resolve host names. If you resolve host names from an /etc/hosts file, you must add the names to the hosts file of each DPS container. Follow the instructions in Add Host Entries to /etc/hosts Files in the DPS Administration guide.
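
If you do use /etc/hosts instead of DNS, entries follow the standard format shown below. The addresses and host names are placeholders; the steps for applying these entries inside the DPS containers are in the DPS Administration guide.

    # Example /etc/hosts entries (placeholder addresses and host names)
    172.16.0.11   dps-host.example.com       dps-host
    172.16.0.21   cluster-node1.example.com  cluster-node1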

LDAP and AD Support Requirements

To use LDAP or Active Directory (AD), you must use the same LDAP or AD instance across all HDP clusters managed by DataPlane, as well as for the DataPlane Service itself.

HDP 2.6.3 Apache Component Requirements

The following additional Apache components are required for DPS Platform support:

Component | Purpose | Comments
Knox | User authentication with LDAP (SSO) | Knox must be enabled on clusters before you can register the clusters with DPS.
Ambari | Cluster registration in DPS | All clusters used with DPS must be managed by Ambari.

SmartSense Requirements

A SmartSense ID is required to install DPS Services (DLM and DSS).

You can retrieve the SmartSense ID from the Hortonworks Support Portal, under the Tools tab.

Additional DPS Requirements and Recommendations

Understanding the requirements and recommendations indicated below can help to avoid common issues during and after DPS installation.

  • Prior to starting the installation, you must have downloaded the required tarballs and MPacks from the customer portal, following the instructions provided as part of the product procurement process. A generic MPack registration example follows this list.

  • You need to have root access to the nodes on which all DPS services will be installed.

  • If you are using AWS, do not use the public DNS to access DPS.

    Use a public IP address or set up and use a DNS (Route 53) fully qualified domain name (FQDN).

  • Every host name used with DPS must be resolvable by DNS or configured in the /etc/hosts file on the DPS container, so that host names can be resolved between all cluster nodes.

    Using a DNS server is the recommended method, unless you are using Amazon Web Services (AWS). If host names are instead added to /etc/hosts, you must explicitly register the cluster host names within the DPS Docker containers; it is not sufficient to have the host names in the /etc/hosts file on the DPS Platform host. See the DPS Platform Administration guide for instructions.

  • If you are not using the LDAP server packaged with DPS, you need your corporate LDAP settings to configure LDAP.

    Ensure that you have the correct settings if you use your own LDAP, because most of the settings cannot be changed in DPS after they are set.

  • Use the default Knox user.

    If you choose a host for Knox other than the one the Ambari Server recommends by default, the proxyuser rules change and you are prompted for a restart.

  • When enabling DPS Platform and DLM, and installing Knox, follow the automated Ambari placement recommendations to avoid requiring a restart.

  • Important: Do not change the cluster name in Ambari after registering the cluster with DPS Platform.
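
As referenced above, management packs downloaded from the customer portal are typically registered with the Ambari Server using the ambari-server install-mpack command. The file name below is a placeholder; use the exact MPack file and instructions provided with your download.

    # Register a downloaded MPack with the Ambari Server (file name is a placeholder)
    ambari-server install-mpack --mpack=/tmp/example-dps-service-mpack.tar.gz --verbose
    ambari-server restart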

DPS Service Requirements for HDP Clusters

DPS Support Matrix Information

See the Requirements for DataPlane Service Platform Host and the Hortonworks Support Matrix for details regarding supported HDP configurations.

All clusters used with DPS must be managed by Ambari.

Configuring Cluster Security for DPS Services

The following tables list the minimum actions that you must perform on each HDP cluster to configure security for DPS and to onboard clusters for each of the DPS services. You can perform additional security-related tasks as appropriate for your environment and company policies.

Table 2.1. Minimum Security Requirements Checklist for DPS

Task | Instructions | Found in... | Comments
Enable Knox in Ambari | Install Knox | Apache Knox Gateway User's Guide | Services required in the Knox topology for DPS are Ambari, AmbariUI, JobTracker, NameNode, Ranger, RangerUI, and ResourceManager.
Enable Ranger in Ambari | Installing Ranger Using Ambari | HDP Security guide |
Configure a reverse proxy with Knox | Configuring the Knox Gateway | HDP Security guide | The Knox Gateway is not required, but is recommended.
Configure SSO topology | Form-based Identity Provider (IdP) | HDP Security guide |
Configure LDAP with Ambari | Configuring Ambari Authentication with LDAP or Active Directory Authentication | HDP Security guide |
Synchronize required LDAP users and groups with Ambari | Synchronizing LDAP Users and Groups | HDP Security guide | You must disable LDAP pagination. Users registering clusters in DPS must have the Admin role in Ambari.
Configure LDAP with Ranger | Configuring Ranger Authentication with UNIX, LDAP, or AD | HDP Security guide | Required for DSS and if using Ranger with DLM
Configure LDAP with Knox for proxy authentication | Setting Up LDAP Authentication | HDP Security guide |
Configure Knox for HA | Setting Up Knox Services for HA | HDP Security guide | Required only if clusters are configured for HA
Configure Knox SSO for Ambari | Setting up Knox SSO for Ambari | HDP Security guide | If done on an existing cluster, at login you see a Knox page and must log in with your LDAP credentials.

If you are performing Hive replication with the Data Lifecycle Manager (DLM) service, ensure that the following tasks were completed during cluster installation. You must configure Ranger, via Ambari, on clusters used to replicate Hive databases.

Table 2.2. Minimum Security Requirements Checklist for DLM

Task | Instructions | Found in... | Comments
Configure LDAP with Ranger | Configuring Ranger Authentication with UNIX, LDAP, or AD | HDP Security guide | Required if using Ranger with DLM
Configure user synchronization for policy administration | Configure Ranger User Sync | HDP Security guide | Required only if using Ranger
Configure Ranger plugin for HDFS | Enabling Ranger Plugins: HDFS | HDP Security guide | Required only if using Ranger
Configure Ranger plugin for Hive | Enabling Ranger Plugins: Hive | HDP Security guide | Required only if using Ranger
Configure Ranger plugin for Knox | Enabling Ranger Plugins: Knox | HDP Security guide | Required only if using Ranger
Configure Ranger HDFS plugin for Kerberos | Ranger Plugins - Kerberos: HDFS | HDP Security guide | Required only if using Ranger
Configure Ranger Hive plugin for Kerberos | Ranger Plugins - Kerberos: Hive | HDP Security guide | Required only if using Ranger
Configure Ranger Knox plugin for Kerberos | Ranger Plugins - Kerberos: Knox | HDP Security guide | Required only if using Ranger
Configure Knox SSO for Ranger | Setting up Knox SSO for Ranger | HDP Security guide |

If you are using the Data Steward Studio (DSS) service, ensure that the following tasks were completed during cluster installation. You must configure Apache Atlas and Apache Knox SSO before you can use DSS.

Table 2.3. Minimum Security Requirements Checklist for DSS

Task | Instructions | Found in... | Comments
Enable Atlas in Ambari | Installing and Configuring Apache Atlas Using Ambari | HDP Data Governance guide |
Configure LDAP with Atlas | Customize Services | HDP Data Governance guide | Adapt the instructions for Ranger
Configure Ranger plugin for Atlas | Enabling Ranger Plugins: Atlas | HDP Security guide |
Configure Knox SSO for Atlas | Setting up Knox SSO for Atlas | HDP Security guide |
Configure Knox SSO for Ranger | Setting up Knox SSO for Ranger | HDP Security guide |

Data Lifecycle Manager (DLM) Installation Requirements and Recommendations

The clusters on which you install the Data Lifecycle Manager (DLM) Engine must meet the requirements identified in the following sections. After the DLM Engine is installed and properly configured on a cluster, the cluster can be used for DLM replication.

Important: Clusters used as source and destination in a DLM replication relationship must have exactly the same configurations for LDAP, Kerberos, Ranger, Knox, HA, etc.

DLM Support Matrix Information

See the Requirements for Clusters Used With Data Lifecycle Manager Engine and the Hortonworks Support Matrix for details regarding supported operating systems, databases, software, etc.

Port and Network Requirements

Have the following ports available and open:

Default Port Number | Purpose | Comments | Required to be open?
25968 | Port for the DLM Engine (Beacon) service on hosts. | Accessibility is required from all clusters. Beacon is the internal name for the DLM Engine; you will see the name Beacon in some paths, commands, etc. | Yes
8020 | NameNode host | | Yes
50010 | All DataNode hosts | | Yes
8080 | Ambari server host | | Yes
10000 | HiveServer2 host | Binary mode port (Thrift) | Yes
10001 | HiveServer2 host | HTTP mode port | Yes
2181 | ZooKeeper hosts | | Yes
6080 | Ranger port | | Yes
8443 | Knox port | | Yes
8050 | YARN port | | Yes
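
Once the ports are open, you can spot-check reachability of key cluster endpoints from the DPS Platform host. The example below assumes the nc (netcat) utility is available and uses placeholder host names.

    # Spot-check that the DLM Engine and Ambari ports are reachable (placeholder host names)
    nc -z -w 5 cluster-node1.example.com 25968 && echo "DLM Engine (Beacon) port reachable"
    nc -z -w 5 ambari-node1.example.com 8080 && echo "Ambari port reachable"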

HDP 2.6.3 Apache Component Requirements

The following additional Apache components are required for DLM support:

Component | Purpose | Comments
Hive 1 | For replicating Hive database content | Hive 2 queries are supported, but for replication, HiveServer2 with Hive 1 is always used.
HDFS | For replicating HDFS data |
Knox | Authentication federation from DPS | Knox must be enabled on clusters before you can register the clusters with DPS.
Ranger | Authorization on clusters during replication | Ranger is optional for HDFS replication, but required for Hive replication.

Additional DLM Requirements and Recommendations

Understanding the requirements and recommendations indicated below can help to avoid common issues during and after installation of the DLM service.

  • Apache Hive should be installed during initial installation, unless you are certain you will not use Hive replication in the future.

    If you decide to install Hive after creating HDFS replication policies in Data Lifecycle Manager, all HDFS replication policies must be deleted and then recreated after adding Hive.

  • Clusters used in DLM replication must have symmetrical configurations.

    That is, each cluster in a replication relationship must be configured exactly the same for Kerberos, LDAP, High Availability (HA), Apache Ranger, and so forth.

Data Steward Studio (DSS) Installation Requirements and Recommendations

The clusters on which you install the DSS Profiler Agent must meet the requirements identified in the following sections. After the Profiler Agent is installed and properly configured on a cluster, the cluster can be used by DSS.

Data Steward Studio (DSS) is provided as Evaluation Software with Hortonworks DPS 1.0. Evaluation Software is provided without charge and pursuant to the DataPlane Service Terms of Use. Evaluation Software may be used only for internal business, evaluation, and non-production purposes. Feedback on Evaluation Software is welcomed and can be submitted through your regular support channels.

DSS Support Matrix Information

See the Requirements for Data Steward Studio Profiler and the Hortonworks Support Matrix for details regarding supported operating systems, databases, software, etc.

Other Software Requirements

DSS has no additional software requirements.

Port and Network Requirements

Have the following ports available and open:

Port Number | Purpose | Required?
21900 | The Profiler web service runs on this port. | Yes. Required for DataPlane to access profiled data from the profiler datastore.
8999 | Livy runs on this port. | Yes. Livy is the observer for profilers and is required for submitting profiler jobs.
21000 | Atlas | Required if you are installing in a different DMZ.
6080 | Ranger | Required if you are installing in a different DMZ.
8443 | Knox | Required if you are installing in a different DMZ.
8080 | Ambari | Yes

HDP 2.6.3 Apache Component Requirements

The following additional Apache components are required for DSS support:

Component | Purpose | Comments
Atlas | For Hive metadata availability and storage of univariate statistics |
Ranger | For access log availability for usage profiling |
Spark 2 | For profiler computation, both univariate and Ranger profilers |
Livy Server 2 | Job server for profilers |
HDFS | For registering and sharing profiler .jars | Co-located on the Profiler Agent node.
Hive | For column profiling |