1. Understand the Basics

The Hortonworks Data Platform consists of three layers of components. A coordinated and tested set of these components is sometimes referred to as the Stack.

  • Core Hadoop 2: The basic components of Apache Hadoop.

    • Hadoop Distributed File System (HDFS): A special-purpose file system designed to provide high-throughput access to data in a highly distributed environment.

    • YARN: A resource negotiator for managing high-volume distributed data processing. Previously part of the first version of MapReduce.

    • MapReduce 2 (MR2): A set of client libraries for computation using the MapReduce programming paradigm and a History Server for logging job and task information. Previously part of the first version of MapReduce.

  • Essential Hadoop: A set of Apache components designed to ease working with Core Hadoop 2.

    • Apache Pig: A platform for creating higher-level data flow programs that can be compiled into sequences of MapReduce programs, using Pig Latin, the platform’s native language.

    • Apache Hive: A tool for creating higher-level SQL queries that can be compiled into sequences of MapReduce programs, using HiveQL, the tool’s native language. Included with Apache HCatalog.

    • Apache HCatalog: A metadata abstraction layer that insulates users and scripts from how and where data is physically stored. Now part of Apache Hive. Includes WebHCat (originally named Templeton), which provides a set of REST APIs for HCatalog and related Hadoop components.

    • Apache HBase: A distributed, column-oriented database that provides the ability to access and manipulate data randomly in the context of the large blocks that make up HDFS.

    • Apache ZooKeeper: A centralized tool for providing services to highly distributed systems. ZooKeeper is necessary for HBase installations.

  • Hadoop Support: A set of components that allow you to monitor your Hadoop installation and to connect Hadoop with your larger compute environment.

    • Apache Oozie: A server-based workflow engine optimized for running workflows that execute Hadoop jobs.

      Running the current Oozie examples requires some reconfiguration from the standard Ambari installation. See Using HDP for Workflow and Scheduling (Oozie).

    • Apache Sqoop: A component that provides a mechanism for moving data between Hadoop and external structured data stores. Can be integrated with Oozie workflows.

    • Apache Flume: A log aggregator. This component must be installed manually. It is not supported in the context of Ambari at this time.

      See Installing and Configuring Flume for more information.

    • Ganglia: An open source tool for monitoring high-performance computing systems.

    • Nagios: An open source tool for monitoring systems, services, and networks.

You must always install HDFS, but you can select components from the other layers based on your needs.
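
The selection rules stated in this section can be summarized in a short sketch. It is illustrative only and is not part of Ambari: the function name check_selection and the three constants are hypothetical, and the rules encoded are only the ones mentioned above (HDFS is always required, HBase needs ZooKeeper, and Flume must be installed manually). Other dependencies may exist that this overview does not describe.

```python
# Illustrative sketch only: these names are hypothetical, not an Ambari API.
# The rules below are limited to what this overview states.
REQUIRED = {"HDFS"}                    # "You must always install HDFS"
DEPENDS_ON = {"HBase": {"ZooKeeper"}}  # "ZooKeeper is necessary for HBase installations"
MANUAL_INSTALL = {"Flume"}             # not supported in the context of Ambari


def check_selection(selected):
    """Return a list of problems found in a planned component selection."""
    selected = set(selected)
    problems = []
    # Components that every installation must include.
    for name in sorted(REQUIRED - selected):
        problems.append(f"{name} must always be installed")
    # Components whose stated dependencies are missing from the selection.
    for component, needs in sorted(DEPENDS_ON.items()):
        if component in selected:
            for dep in sorted(needs - selected):
                problems.append(f"{component} requires {dep}")
    # Components that have to be installed outside Ambari.
    for name in sorted(MANUAL_INSTALL & selected):
        problems.append(f"{name} must be installed manually, outside Ambari")
    return problems


if __name__ == "__main__":
    for problem in check_selection(["HDFS", "HBase", "Flume"]):
        print(problem)
```

An empty list means the selection satisfies the rules listed here; anything else is a note to resolve before starting the installation.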
