1. Understand the Basics

The Hortonworks Data Platform consists of three layers. Brief usage sketches for several of these components appear after the list below.

  • Core Hadoop: The basic components of Apache Hadoop.

    • Hadoop Distributed File System (HDFS): A special-purpose file system designed to work with the MapReduce engine. It provides high-throughput access to data in a highly distributed environment.

    • MapReduce: A framework for performing high-volume, distributed data processing using the MapReduce programming paradigm.

  • Essential Hadoop: A set of Apache components designed to ease working with Core Hadoop.

    • Apache Pig: A platform for creating higher-level data flow programs in Pig Latin, the platform’s native language, which are compiled into sequences of MapReduce programs.

    • Apache Hive: A tool for creating higher-level, SQL-like queries in HiveQL, the tool’s native language, which are compiled into sequences of MapReduce programs.

    • Apache HCatalog: A metadata abstraction layer that insulates users and scripts from how and where data is physically stored.

    • Templeton: A component that provides a set of REST-like APIs for HCatalog and related Hadoop components.

  • Supporting Components: A set of components that allow you to monitor your Hadoop installation and to connect Hadoop with your larger compute environment.

    • Apache Oozie: A server-based workflow engine optimized for running workflows that execute Hadoop jobs.

    • Apache Sqoop: A component that provides a mechanism for moving data between HDFS and external structured datastores. Sqoop can be integrated with Oozie workflows.
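
As a concrete example of the Core Hadoop layer, the following is a minimal word-count sketch written against the standard org.apache.hadoop.mapreduce API: the mapper emits (word, 1) pairs, the reducer sums them, and the input and output locations are HDFS paths supplied on the command line when the job is submitted. The class names and paths are illustrative only.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Mapper: emits (word, 1) for every word in each line read from HDFS.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reducer: sums the counts for each word and writes the totals back to HDFS.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output are HDFS paths supplied on the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }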
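
Pig Latin statements are usually entered at the pig command line, but they can also be driven from Java through Pig's PigServer class. The sketch below assumes a cluster running in MapReduce mode; the input and output paths and the word-count logic are illustrative.

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigWordCount {
      public static void main(String[] args) throws Exception {
        // Each registered Pig Latin statement is compiled into MapReduce stages
        // and executed on the cluster when the final STORE is requested.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        pig.registerQuery("lines = LOAD '/user/hdp/input' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");
        pig.store("counts", "/user/hdp/wordcount-output");  // writes the result to HDFS
      }
    }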
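
Assuming a HiveServer2 instance is available, HiveQL queries can also be issued from Java over Hive's JDBC driver, as in the sketch below. The host name, credentials, and the weblogs table are placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuery {
      public static void main(String[] args) throws Exception {
        // Load the Hive JDBC driver and connect to a HiveServer2 endpoint
        // (host, port, database, and credentials are placeholders).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                 "jdbc:hive2://hive-host:10000/default", "hive", "");
             Statement stmt = con.createStatement();
             // Hive compiles this HiveQL into MapReduce jobs before returning rows.
             ResultSet rs = stmt.executeQuery(
                 "SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page")) {
          while (rs.next()) {
            System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
          }
        }
      }
    }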
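
Because Templeton's APIs are plain HTTP, any REST client can call them. The sketch below checks the server status with Java's built-in HttpURLConnection; the host name and user are placeholders, and 50111 is assumed to be the port Templeton listens on.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class TempletonStatus {
      public static void main(String[] args) throws Exception {
        // GET the Templeton status resource; user.name identifies the caller.
        URL url = new URL("http://templeton-host:50111/templeton/v1/status?user.name=hdp");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader in = new BufferedReader(
                 new InputStreamReader(conn.getInputStream()))) {
          String line;
          while ((line = in.readLine()) != null) {
            System.out.println(line);  // a healthy server replies with a small JSON status document
          }
        }
      }
    }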
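
An Oozie workflow is described by a workflow.xml file stored in HDFS and is normally submitted with the oozie command-line tool; the sketch below does the same through the Oozie Java client API. The host names, ports, and application path are placeholders.

    import java.util.Properties;

    import org.apache.oozie.client.OozieClient;
    import org.apache.oozie.client.WorkflowJob;

    public class SubmitWorkflow {
      public static void main(String[] args) throws Exception {
        // Point the client at the Oozie server (host and port are placeholders).
        OozieClient client = new OozieClient("http://oozie-host:11000/oozie");

        // Job properties; the application path is an HDFS directory containing workflow.xml.
        Properties conf = client.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/hdp/apps/sample-wf");
        conf.setProperty("nameNode", "hdfs://namenode:8020");
        conf.setProperty("jobTracker", "jobtracker-host:8021");

        // Submit and start the workflow, then poll until it leaves the RUNNING state.
        String jobId = client.run(conf);
        while (client.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
          Thread.sleep(10 * 1000);
        }
        System.out.println("Workflow " + jobId + " finished: "
            + client.getJobInfo(jobId).getStatus());
      }
    }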
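
Sqoop itself is driven from the command line; a typical import copies a table from a relational database into an HDFS directory. The sketch below simply launches such a command from Java with ProcessBuilder, assuming the sqoop executable is on the PATH; the JDBC URL, table, credentials, and target directory are placeholders.

    public class SqoopImport {
      public static void main(String[] args) throws Exception {
        // Import the "orders" table from an external database into HDFS.
        // Connection details and paths are placeholders for illustration.
        ProcessBuilder pb = new ProcessBuilder(
            "sqoop", "import",
            "--connect", "jdbc:mysql://db-host/sales",
            "--username", "etl",
            "--password", "secret",
            "--table", "orders",
            "--target-dir", "/user/hdp/orders",
            "--num-mappers", "4");
        pb.inheritIO();                       // stream Sqoop's output to this console
        int exitCode = pb.start().waitFor();  // non-zero means the import failed
        System.exit(exitCode);
      }
    }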

For more information on the structure of the HDP, see Understanding Hadoop Ecosystem.