
DistCp Between HA Clusters

To copy data between HA clusters, use the dfs.internal.nameservices property in the hdfs-site.xml file to explicitly specify the name services belonging to the local cluster, while continuing to use the dfs.nameservices property to specify all of the name services in the local and remote clusters.

Use the following steps to copy data between HA clusters:

  1. Modify the following properties in the hdfs-site.xml file of the HDFS client (a consolidated example follows these sub-steps):

    1. Add both name services to the dfs.nameservices property:

      dfs.nameservices = HAA,HAB

    2. Add the dfs.internal.nameservices property on each cluster:

      • On the HAA cluster, add the following details of the local cluster:

        dfs.internal.nameservices = HAA

      • On the HAB cluster, add the following details of the local cluster:

        dfs.internal.nameservices = HAB

    3. Add the dfs.ha.namenodes.<nameservice> property for the remote cluster:

      dfs.ha.namenodes.HAB = nn1,nn2

    4. Add the dfs.namenode.rpc-address.<nameservice>.<nn> properties:

      dfs.namenode.rpc-address.HAB.nn1 = <NN1_fqdn>:8020

      dfs.namenode.rpc-address.HAB.nn2 = <NN2_fqdn>:8020

    5. Add the following properties to enable distcp over WebHDFS and secure WebHDFS:

      dfs.namenode.http-address.HAB.nn1 = <NN1_fqdn>:50070

      dfs.namenode.http-address.HAB.nn2 = <NN2_fqdn>:50070

      dfs.namenode.https-address.HAB.nn1 = <NN1_fqdn>:50470

      dfs.namenode.https-address.HAB.nn2 = <NN2_fqdn>:50470

    6. Add the dfs.client.failover.proxy.provider.<nameservice> property:

      dfs.client.failover.proxy.provider.HAB = org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider

    Note: The properties listed above enable copying data from the HAA cluster to the HAB cluster only. To be able to copy data from HAB to HAA, add the properties to the HDFS client on the HAA cluster as well.
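
    Taken together, the additions from this step might look like the following hdfs-site.xml fragment on the HAA cluster side (the side where dfs.internal.nameservices = HAA). This is a sketch only; <NN1_fqdn> and <NN2_fqdn> remain placeholders for the HAB NameNode host names:

    <!-- hdfs-site.xml additions on the HAA cluster (HDFS client side).         -->
    <!-- <NN1_fqdn> and <NN2_fqdn> are placeholders for the HAB NameNode hosts. -->
    <property>
      <name>dfs.nameservices</name>
      <value>HAA,HAB</value>
    </property>
    <property>
      <name>dfs.internal.nameservices</name>
      <value>HAA</value>
    </property>
    <property>
      <name>dfs.ha.namenodes.HAB</name>
      <value>nn1,nn2</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.HAB.nn1</name>
      <value><NN1_fqdn>:8020</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.HAB.nn2</name>
      <value><NN2_fqdn>:8020</value>
    </property>
    <property>
      <name>dfs.namenode.http-address.HAB.nn1</name>
      <value><NN1_fqdn>:50070</value>
    </property>
    <property>
      <name>dfs.namenode.http-address.HAB.nn2</name>
      <value><NN2_fqdn>:50070</value>
    </property>
    <property>
      <name>dfs.namenode.https-address.HAB.nn1</name>
      <value><NN1_fqdn>:50470</value>
    </property>
    <property>
      <name>dfs.namenode.https-address.HAB.nn2</name>
      <value><NN2_fqdn>:50470</value>
    </property>
    <property>
      <name>dfs.client.failover.proxy.provider.HAB</name>
      <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
    </property>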

  2. Add the following property to the mapred-site.xml file of the HDFS client on the local cluster (a per-job, command-line alternative is sketched after the note below):

    <property>
      <name>mapreduce.job.send-token-conf</name>
      <value>
        yarn.http.policy|^yarn.timeline-service.webapp.*$|^yarn.timeline-service.client.*$|hadoop.security.key.provider.path|
        hadoop.rpc.protection|dfs.nameservices|^dfs.namenode.rpc-address.*$|^dfs.ha.namenodes.*$|
        ^dfs.client.failover.proxy.provider.*$|dfs.namenode.kerberos.principal|dfs.namenode.kerberos.principal.pattern|
        mapreduce.jobhistory.principal
      </value>
    </property>
    
    Note: This property enables copying data from the HAA cluster to the HAB cluster only. To be able to copy data from HAB to HAA, add mapreduce.job.send-token-conf to the HDFS client on the HAA cluster as well.
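
    If you prefer not to edit mapred-site.xml, the same allowlist can usually be passed per job with the generic -D option instead. This is a sketch only; note that the value must be supplied as one single-quoted string so that the shell does not interpret the | and $ characters:

    # Sketch: per-job alternative to editing mapred-site.xml.
    # The value is single-quoted because it contains shell metacharacters (| and $).
    hadoop distcp \
      -Dmapreduce.job.send-token-conf='yarn.http.policy|^yarn.timeline-service.webapp.*$|^yarn.timeline-service.client.*$|hadoop.security.key.provider.path|hadoop.rpc.protection|dfs.nameservices|^dfs.namenode.rpc-address.*$|^dfs.ha.namenodes.*$|^dfs.client.failover.proxy.provider.*$|dfs.namenode.kerberos.principal|dfs.namenode.kerberos.principal.pattern|mapreduce.jobhistory.principal' \
      hdfs://HAA/tmp/testDistcp hdfs://HAB/tmp/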

  3. Restart the HDFS service, and then run the distcp command using the name service URIs. For example:

    hadoop distcp hdfs://HAA/tmp/testDistcp hdfs://HAB/tmp/
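
    The dfs.namenode.http-address and dfs.namenode.https-address properties added earlier also let distcp read and write over WebHDFS. As a sketch (assuming WebHDFS is enabled on both clusters), the same copy could be re-run over WebHDFS, with -update to copy only files that differ and -p to preserve file attributes:

    # Sketch: the same copy over WebHDFS, re-runnable with -update,
    # preserving file attributes with -p. Use swebhdfs:// where WebHDFS runs over SSL.
    hadoop distcp -update -p webhdfs://HAA/tmp/testDistcp webhdfs://HAB/tmp/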