4. DistCp Security Settings

The following settings may affect DistCp:

  • The HDP version of the source and destination clusters.

  • Whether or not the HDP clusters have security set up.  

When copying data from a secure cluster to a non-secure cluster, the following configuration setting is required for the DistCp client:

<property>
  <name>ipc.client.fallback-to-simple-auth-allowed</name>
  <value>true</value>
</property>
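For a one-off copy, this setting can also be passed on the command line instead of being edited into the client configuration. A sketch of such an invocation (host names, ports, and paths are illustrative placeholders, not values from this document):

```shell
# Run from the secure cluster. Hosts, ports, and paths below are
# placeholders; substitute your own NameNode addresses.
hadoop distcp \
  -D ipc.client.fallback-to-simple-auth-allowed=true \
  hdfs://secure-nn:8020/source/path \
  webhdfs://insecure-nn:50070/target/path
```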

When copying data from a secure cluster to a secure cluster, the following configuration setting is required in core-site.xml:

<property>
  <name>hadoop.security.auth_to_local</name>
  <value></value>
  <description>Maps kerberos principals to local user names</description>
</property> 
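When the two secure clusters share one Kerberos realm, the default mapping is usually sufficient; when they use different realms, explicit mapping rules are needed so principals from either realm resolve to local user names. A hedged example (REALM1 and REALM2 are placeholder realm names, not values from this document):

```xml
<property>
  <name>hadoop.security.auth_to_local</name>
  <!-- REALM1/REALM2 are illustrative; replace with the source and
       destination realms. Each rule strips the realm, so e.g.
       user@REALM1 maps to the local user name "user". -->
  <value>
    RULE:[1:$1@$0](.*@REALM1)s/@.*//
    RULE:[1:$1@$0](.*@REALM2)s/@.*//
    DEFAULT
  </value>
  <description>Maps kerberos principals to local user names</description>
</property>
```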

HDP Version

The HDP version of the source and destination clusters can determine which type of file systems should be used to read the source cluster and write to the destination cluster. For example, when copying data from a 1.x cluster to a 2.x cluster, it is impossible to use “hdfs” for both the source and the destination, because HDP 1.x and 2.x have different RPC versions, and the client cannot understand both at the same time. In this case the WebHdfsFilesystem (webhdfs://) can be used in both the source and destination clusters, or the HftpFilesystem (hftp://) can be used to read data from the source cluster.  
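A cross-version copy along these lines might look like the following sketch (host names, ports, and paths are placeholders; note that hftp is read-only, so it can appear only on the source side):

```shell
# Read from the HDP 1.x source over hftp (read-only) and write to the
# HDP 2.x destination over hdfs, avoiding the RPC version mismatch.
# Hosts, ports, and paths are illustrative.
hadoop distcp \
  hftp://hdp1-namenode:50070/source/path \
  hdfs://hdp2-namenode:8020/target/path
```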

Security

The security setup can affect whether DistCp should be run on the source cluster or the destination cluster. The rule-of-thumb is that if one cluster is secured and the other is not secured, DistCp should be run from the secured cluster -- otherwise there may be security-related issues.

Examples

  • distcp hdfs://hdp-1.3-secure webhdfs://hdp-2.0-insecure

    In this case distcp should be run from the secure source cluster. Currently there may be issues associated with running distcp in this scenario. A possible solution is discussed here.

  • distcp hdfs://hdp-2.0-secure hdfs://hdp-2.0-secure

    One issue here is that the SASL RPC client requires that the remote server's Kerberos principal must match the server principal in its own configuration. Therefore, the same principal name must be assigned to the applicable NameNodes in the source and the destination cluster. For example, if the Kerberos principal name of the NameNode in the source cluster is nn/host1@realm, the Kerberos principal name of the NameNode in the destination cluster must be nn/host2@realm, rather than, say, nn2/host2@realm.
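One way to satisfy this constraint is to use the _HOST placeholder in both clusters' hdfs-site.xml, which Hadoop expands to each NameNode's own host name at runtime, keeping the short name and realm identical on both sides. A sketch (the principal follows the nn/host@realm example above):

```xml
<!-- Identical in the source and destination clusters' hdfs-site.xml.
     Hadoop substitutes _HOST with the local NameNode host name, so the
     effective principals become nn/host1@realm and nn/host2@realm. -->
<property>
  <name>dfs.namenode.kerberos.principal</name>
  <value>nn/_HOST@realm</value>
</property>
```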
