Data Movement and Integration

Chapter 11. Enabling Mirroring and Replication with Azure Cloud Services

You can enable secure connectivity between your on-premises Apache Hadoop cluster and a Microsoft Azure Data Factory service in the cloud. You might want to create a hybrid data pipeline for uses such as maintaining sensitive data on premises while leveraging the cloud for nonsensitive data, providing disaster recovery or replication services, and using the cloud for development or test environments.

Prerequisites

  • You must have created a properly configured data factory in the cloud.

  • Your environment must meet the HDP versioning requirements described in "Replication Between HDP Versions" in Creating Falcon Entity Definitions.

[Note]Note

Due to changes in Hive, Falcon supports only one metastore in the Oozie HCatalog URI used for Hive table feeds. This applies even if you have multiple metastores configured.
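
For illustration, the following is a sketch of a supported single-metastore HCatalog URI, reusing the table and partition names from the Hive example later in this chapter (the metastore host, port, and database are placeholder values):

    hcat://metastore-host:9083/default/partitionedcdrs/yearno=${YEAR};monthno=${MONTH}

A URI that references more than one metastore host is not supported.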

Connect the Azure Data Factory to Your On-premises Hadoop Cluster

  1. In the Data Factory editor, create a new Hadoop cluster.

    As part of the creation process, you must identify the following linked services. You need the names of these services when configuring the Falcon instance.

    • Hadoop cluster Linked Service

      Represents the on-premises Hadoop cluster as a data factory compute resource. You can use this resource as the compute target for your data factory Hive and Pig jobs. This linked service references the service bus (transport) linked service, and is identified as the OnPremisesHadoopCluster linked service.

    • Service Bus Namespace

      Contains information about a unique Azure service bus namespace that is used for communicating job requests and status information between the data factory and your on-premises Hadoop cluster. The service is identified as TransportLinkedService.

    The JSON objects for the linked services can be found in the ./Data Factory JSONs/LinkedServices folder.

  2. Configure an on-premises Falcon instance to connect with the data factory Hadoop cluster.

    Add the following properties to the Falcon conf/startup.properties file, making these substitutions:

    • Replace <your azure service bus namespace> with the name you assigned to the service bus in step 1.

    • Replace <your Azure service bus SAS key> with the credentials for the Azure bus namespace taken from the Azure web portal.

    ######### ADF Configurations start #########
    
    # A String object that represents the namespace
    
    *.microsoft.windowsazure.services.servicebus.namespace=<your azure service bus namespace>
    
    # Request and status queues on the namespace
    
    *.microsoft.windowsazure.services.servicebus.requestqueuename=adfrequest
    *.microsoft.windowsazure.services.servicebus.statusqueuename=adfstatus
    
    # A String object that contains the SAS key name
    *.microsoft.windowsazure.services.servicebus.sasKeyName=RootManageSharedAccessKey
    
    # A String object that contains the SAS key
    *.microsoft.windowsazure.services.servicebus.sasKey=<your Azure service bus SAS key>
    
    # A String object containing the base URI that is added to your Service Bus namespace to form the URI to connect
    # to the Service Bus service. To access the default public Azure service, pass ".servicebus.windows.net"
    *.microsoft.windowsazure.services.servicebus.serviceBusRootUri=.servicebus.windows.net
    
    # Service bus polling frequency (in seconds)
    *.microsoft.windowsazure.services.servicebus.polling.frequency=60
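
    Before restarting Falcon, you can confirm that the properties were saved correctly; a quick check, assuming the default HDP configuration directory /etc/falcon/conf (adjust the path if your installation differs):

      # List the Azure Service Bus properties just added to startup.properties
      grep 'windowsazure.services.servicebus' /etc/falcon/conf/startup.properties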
  3. Restart Falcon from the Ambari UI.

    1. Click the Services tab and select the Falcon service.

    2. On the Summary page, verify that the Falcon service status is Stopped. If the service is still running, click Service Actions > Stop and wait for it to stop.

    3. Click Service Actions > Start.

      A dialog box displays.

    4. (Optional) Turn off Falcon's maintenance mode by clicking the checkbox.

      Maintenance Mode suppresses alerts.

    5. Click Confirm Start > OK.

      On the Summary page, the Falcon status displays as Started when the service is available.

    After restarting Falcon, the Azure data factory and the on-premises Falcon instance should successfully connect.
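
    If you prefer to script the restart, you can stop and start the Falcon service through the Ambari REST API instead of the UI. The following is a minimal sketch; the admin credentials, Ambari host, and cluster name (mycluster) are placeholders for your own values:

      # Stop the Falcon service
      curl -u admin:admin -H 'X-Requested-By: ambari' -X PUT \
        -d '{"RequestInfo":{"context":"Stop Falcon"},"Body":{"ServiceInfo":{"state":"INSTALLED"}}}' \
        http://ambari-host:8080/api/v1/clusters/mycluster/services/FALCON

      # Start the Falcon service
      curl -u admin:admin -H 'X-Requested-By: ambari' -X PUT \
        -d '{"RequestInfo":{"context":"Start Falcon"},"Body":{"ServiceInfo":{"state":"STARTED"}}}' \
        http://ambari-host:8080/api/v1/clusters/mycluster/services/FALCON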

  4. Create datasets and data pipelines to run Hive and Pig jobs that target the on-premises Hadoop cluster.

    Example

    The following JSON objects define the input dataset, output dataset, and pipeline used to test running a Hive script on an on-premises HDP cluster. The Hive script referenced by scriptPath must already exist in HDFS on the cluster; a sketch of uploading it follows the JSON.

    JSONs for all the objects can be found in the ./Data factory JSONs/ folder.

    {
        "name": "OnpremisesInputHDFSForHadoopHive",
        "properties": {
            "published": false,
            "type": "CustomDataset",
            "linkedServiceName": "OnpremisesHadoopCluster",
            "typeProperties": {
                "tableName": "partitionedcdrs",
                "partitionedBy": "yearno=${YEAR};monthno=${MONTH}"
            },
            "availability": {
                "frequency": "Day"
                "interval": 1
            },
            "external": true,
            "policy": {
                "executionPriorityOrder": "OldestFirst"
            }
        }
        }
    
    {
        "name": "OnpremisesOutputHDFSForHadoopHive",
        "properties": {
            "published": false,
            "type": "CustomDataset",
            "linkedServiceName": "OnpremisesHadoopCluster",
            "typeProperties": {
                "tableName": "callsummarybymonth",
                "partitionedBy": "yearno=${YEAR};monthno=${MONTH}"
            },
            "availability": {
                "frequency": "Day",
                "interval": 1
            }
        }
    }
    
    {
        "name": "TestHiveRunningOnHDPHiveInHDFS",
        "properties": {
            "description": "Test pipeline to run Hive script on an on-premises HDP cluster (Hive is in HDFS location)",
            "activities": [
                {
                    "type": "HadoopHive",
                    "typeProperties": {
                        "runOnBehalf": "ambari-qa",
                        "scriptPath": "/apps/falcon/adf-demo/demo.hql",
                        "Year": "$$Text.Format('{0:yyyy}',SliceStart)",
                        "Month": "$$Text.Format('{0:%M}',SliceStart)",
                        "Day": "$$Text.Format('{0:%d}',SliceStart)"
                    },
                    "inputs": [
                        {
                            "name": "OnpremisesInputHDFSForHadoopHive"
                        }
                    ],
                    "outputs": [
                        {
                            "name": "OnpremisesOutputHDFSForHadoopHive"
                        }
                    ],
                    "policy": {
                        "executionPriorityOrder": "OldestFirst",
                        "timeout": "00:05:00",
                        "concurrency": 1,
                        "retry": 1
                    },
                    "scheduler": {
                        "frequency": "Day",
                        "interval": 1
                    },
                    "name": "HiveScriptOnHDPCluster",
                    "linkedServiceName": "OnpremisesHadoopCluster"
                }
            ],
            "start": "2014-11-01T00:00:00Z",
            "end": "2014-11-02T00:00:00Z",
            "isPaused": false,
            "hubName": "hwkadftest1026_hub",
            "pipelineMode": "Scheduled"
        }
    }
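
    The scriptPath in the Hive activity points to a script stored in HDFS on the on-premises cluster, so the script must be uploaded before the pipeline runs. The following is a minimal sketch, assuming the script has been saved locally as demo.hql and that you run the commands as a user with write access to /apps/falcon (for example, the falcon service user):

      # Create the target directory and upload the Hive script referenced by scriptPath
      hdfs dfs -mkdir -p /apps/falcon/adf-demo
      hdfs dfs -put demo.hql /apps/falcon/adf-demo/demo.hql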

Configuring to Copy Files From an On-premises HDFS Store to Azure Blob Storage

You must configure access in your HDFS environment to be able to move data to and from the blob storage.

Prerequisite Setup:

Before you copy files, you must ensure the following:

  • You have created a Microsoft Azure storage blob container.

  • You have the blob credentials available.

    For Apache Falcon replication jobs, HDFS requires Azure blob credentials to move data to and from the Azure blob storage.

  1. Log in to Ambari at http://[cluster ip]:8080.

  2. Select HDFS and click Configs > Advanced.

  3. Expand the Custom core-site section and click Add Property.

  4. Add the Azure credential as a key/value property.

    • Use the following format for the key, replacing account_name with the name of the Azure blob: fs.azure.account.key.account_name.blob.core.windows.net

    • Use the Azure blob account key for the value.
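
    For example, if the storage account were named mystorageaccount (a placeholder), the property would look like the following, with the value copied from the storage account's access keys in the Azure portal:

      fs.azure.account.key.mystorageaccount.blob.core.windows.net=<storage account access key>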

  5. (Optional) Take an HDFS checkpoint.

    If a recent checkpoint does not exist, the NameNode(s) can take a very long time to start up.

    1. Log in to the NameNode host.

    2. Put the NameNode in Safe Mode (read-only mode).

      sudo su hdfs -l -c 'hdfs dfsadmin -safemode enter'

    3. Once in Safe Mode, create a checkpoint.

      sudo su hdfs -l -c 'hdfs dfsadmin -saveNamespace'

  6. Restart the relevant components from the Ambari UI.

    1. Click the Services tab and select HDFS.

    2. Click Service Actions > Restart All.

    3. Click Confirm Restart All > OK.

    4. Verify that the status is Started for HDFS, YARN, MapReduce2, Oozie, and Falcon.

    [Note]Note

    If you restart the components at the command line, restart them in the following order: HDFS, YARN, MapReduce2, Oozie, Falcon.

  7. Test if you can access the Azure blob storage through HDFS.

    Example: hdfs dfs -ls -R wasb://[container-name]@[account-name].blob.core.windows.net/
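
    Beyond listing the container, you can round-trip a small file to confirm both write and read access; a quick sketch using the same placeholder container and account names:

      # Write a test file to the blob container, then read it back
      echo "wasb test" > /tmp/wasb-test.txt
      hdfs dfs -put /tmp/wasb-test.txt wasb://[container-name]@[account-name].blob.core.windows.net/tmp/wasb-test.txt
      hdfs dfs -cat wasb://[container-name]@[account-name].blob.core.windows.net/tmp/wasb-test.txt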

  8. You can now set up the on-premises input dataset, an output dataset in Azure blob storage, and a pipeline that copies the data with the replication activity. The sample JSON objects are shown below.

    {
        "name": "OnpremisesInputHDFSForHadoopMirror",
        "properties": {
            "published": false,
            "type": "CustomDataset",
            "linkedServiceName": "OnpremisesHadoopCluster",
            "typeProperties": {
                "folderPath": "/apps/hive/warehouse/callsummarybymonth/yearno=${YEAR}/monthno=${MONTH}"
            },
            "availability": {
                "frequency": "Day",
                "interval": 1
            },
            "external": true,
            "policy": {
                "executionPriorityOrder": "OldestFirst"
            }
        }
    }
    
    {
        "name": "AzureBlobDatasetForHadoopMirror",
        "properties": {
            "published": false,
            "type": "AzureBlob",
            "linkedServiceName": "AzureBlobStorage",
            "typeProperties": {
                "folderPath": "results/${YEAR}/${MONTH}",
                "format": {
                    "type": "TextFormat"
                }
            },
            "availability": {
                "frequency": "Day",
                "interval": 1
            }
        }
    }
    
    {
        "name": "TestReplicate2Azure",
        "properties": {
            "description": "Test pipeline to mirror data on onpremises HDFS to azure",
            "activities": [
                {
                    "type": "HadoopMirror",
                    "typeProperties": {},
                    "inputs": [
                        {
                            "name": "OnpremisesInputHDFSForHadoopMirror"
                        }
                    ],
                    "outputs": [
                        {
                            "name": "AzureBlobDatasetForHadoopMirror"
                        }
                    ],
                    "policy": {
                        "executionPriorityOrder": "OldestFirst",
                        "timeout": "00:05:00",
                        "concurrency": 1,
                        "retry": 1
                    },
                    "scheduler": {
                        "frequency": "Day",
                        "interval": 1
                    },
                    "name": "MirrorData 2Azure",
                    "linkedServiceName": "OnpremisesHadoopCluster"
                }
            ],
            "start": "2014-11-01T00:00:00Z",
            "end": "2014-11-02T00:00:00Z",
            "isPaused": false,
            "hubName": "hwkadftest1026_hub",
            "pipelineMode": "Scheduled"
        }
    }
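
    After the pipeline has run, you can confirm that the mirrored files reached the blob container by listing the results path defined in the AzureBlobDatasetForHadoopMirror dataset; a quick check for the November 2014 slice used in this example, again with placeholder container and account names:

      hdfs dfs -ls -R wasb://[container-name]@[account-name].blob.core.windows.net/results/2014/11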