Data Movement and Integration

Chapter 2. HDP Data Movement and Integration

Enterprises that adopt a modern data architecture with Hadoop must reconcile data management realities when they bring existing and new data from disparate platforms under management. As Hadoop is deployed in corporate data and processing environments, data movement and lineage must be managed centrally and comprehensively to give security, data governance, and administration teams the oversight they need to ensure compliance with corporate data management standards. Hortonworks offers the HDP Data Movement and Integration Suite (DMI Suite) to provide that comprehensive management for data movement into and out of Hadoop.

Use cases for data movement and integration (DMI) include the following:

  • Definition and scheduling of data manipulation jobs, including:

    • Data transfer

    • Data replication

    • Mirroring

    • Snapshots

    • Disaster recovery

    • Data processing

  • Monitoring and administration of data manipulation jobs

  • Root cause analysis of failed jobs

  • Job restart, rerun, suspension, and termination

  • Workflow design and management

  • Ad-hoc bulk data transfer and transformation

  • Collection, aggregation, and movement of large amounts of streaming data

Intended Audience

Administrators, operators, and DevOps team members who are responsible for the overall health and performance of the HDP ecosystem use DMI for management, monitoring, and administration, all of which are performed through the Falcon Dashboard.

Database Administrators

Responsible for establishing recurring transfers of data between relational database management systems (RDBMSs) and Hadoop.

Business Analysts or other business users

Need the ability to perform ad-hoc ETL and analytics with a combination of Hadoop-based and RDBMS-based data.

DevOps

Responsible for:

  • Maximizing the predictability, efficiency, security, and maintainability of operational processes.

    Use the DMI Suite to create abstractions of data sources, data sets, and target systems, along with the jobs and processes for importing, exporting, disaster recovery, and processing.

  • Designing workflows composed of various action types, including Java, Apache Hive, Apache Pig, and Apache Spark actions; Hadoop Distributed File System (HDFS) operations; and SSH, shell, and email actions.

  • Collecting, aggregating, and moving streaming data, such as log events.

Data Movement Components

The HDP Data Movement and Integration Suite (DMI Suite) leverages the following Apache projects:

Apache Falcon

Management and abstraction layer to simplify and manage data movement in Hadoop
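
As an illustration of this abstraction layer, a Falcon feed entity declares a data set's location, frequency, and retention policy once, and Falcon generates and schedules the underlying jobs. The following sketch is hypothetical; the feed, cluster, and path names are illustrative:

```xml
<feed name="rawInputFeed" xmlns="uri:falcon:feed:0.1">
  <!-- How often new instances of this data set arrive -->
  <frequency>hours(1)</frequency>
  <clusters>
    <cluster name="primaryCluster" type="source">
      <validity start="2016-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <!-- Falcon deletes instances older than 30 days -->
      <retention limit="days(30)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <!-- Hourly partitioned HDFS path for the feed's data -->
    <location type="data" path="/data/raw/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  <ACL owner="falcon" group="users" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>
```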

Apache Oozie

Enterprise workflow operations
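
For example, an Oozie workflow is defined as an XML document that chains actions (Hive, Pig, Spark, shell, and so on) with explicit transitions for success and failure. A minimal hypothetical sketch with a single Hive action follows; all names and the script are illustrative:

```xml
<workflow-app name="etl-example" xmlns="uri:oozie:workflow:0.5">
  <start to="cleanse"/>
  <action name="cleanse">
    <hive xmlns="uri:oozie:hive-action:0.5">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <!-- Hive script to run for this step -->
      <script>cleanse.hql</script>
    </hive>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Hive action failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>
```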

Apache Sqoop

Bulk data transfers between Hadoop and relational databases
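
A typical Sqoop use is a scheduled incremental import of a database table into HDFS. The following command is a hypothetical sketch: the connection string, credentials, table, and paths are illustrative, and running it requires a Hadoop cluster with Sqoop installed.

```shell
# Illustrative nightly incremental import of an "orders" table from MySQL.
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username sqoop_user \
  --password-file /user/sqoop/.db_password \
  --table orders \
  --incremental lastmodified \
  --check-column updated_at \
  --merge-key order_id \
  --target-dir /data/sales/orders
```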

Apache Flume

Distributed, reliable service for collecting, aggregating, and moving large amounts of streaming data
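
A Flume agent is configured in a Java properties file that wires sources, channels, and sinks together. The following hypothetical configuration tails an application log and writes events to HDFS; the agent, component, and path names are all illustrative:

```properties
# Name the components of agent "agent1"
agent1.sources  = logSource
agent1.channels = memChannel
agent1.sinks    = hdfsSink

# Source: tail an application log file
agent1.sources.logSource.type = exec
agent1.sources.logSource.command = tail -F /var/log/app/app.log
agent1.sources.logSource.channels = memChannel

# Channel: buffer events in memory
agent1.channels.memChannel.type = memory
agent1.channels.memChannel.capacity = 10000

# Sink: land events in date-partitioned HDFS directories
agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.hdfs.path = /data/logs/%Y-%m-%d
agent1.sinks.hdfsSink.hdfs.fileType = DataStream
agent1.sinks.hdfsSink.channel = memChannel
```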

In addition, the DMI Suite integrates other Apache APIs to simplify the creation of complex processes, validate user input, and provide integrated management and monitoring.

Beyond the underlying components, the DMI Suite provides powerful user interfaces that simplify and streamline creation and management of complex processes.