Introduction

Welcome to Hortonworks Data Cloud.

Hortonworks Data Cloud (HDCloud) for Amazon Web Services (AWS) is a service that allows you to quickly launch clusters for use cases involving analyzing and processing vast amounts of data. Powered by the Hortonworks Data Platform, Hortonworks Data Cloud is an easy-to-use, cost-effective, and scalable solution for handling big data use cases with Apache Hadoop, Apache Hive, and Apache Spark.

Use Cases

Ephemeral on-demand clusters: Spin up a Hadoop cluster within minutes, and start modeling and analyzing your data sets immediately. Instead of going through infinite configuration options, choose from a set of prescriptive cluster configurations. You can add additional nodes on demand. When you are done with your analysis, you can give the resources back to the cloud, reducing your costs.

Spark or Hive clusters: Spin up Spark or Hive clusters, depending on your specific data processing (Spark) or data analytics (Hive) tasks.

Automation: Automatically create clusters, run specific jobs, and then terminate the clusters.

Shared S3 data: Since data can be shared with external applications via S3, you can collect and publish data across applications to S3, and then use this data for analysis. This means that data for analysis can be collected and stored on S3 even while no Hadoop clusters are active. Similarly, Hadoop applications can publish data to S3 for access so that your data persists after you terminate the cluster.
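
For example, an application outside the cluster can publish a data set to a shared bucket with a few lines of Python using the AWS SDK; the bucket name and object key below are placeholders, not resources created by Hortonworks Data Cloud.

    # A minimal sketch of publishing a data set to S3 so it outlives any cluster.
    # "my-analytics-bucket" and the object key are hypothetical; use your own.
    import boto3

    s3 = boto3.client("s3")

    # Upload a local file; it stays in S3 whether or not a cluster is running.
    s3.upload_file("events-2017-01.csv", "my-analytics-bucket",
                   "raw/events/events-2017-01.csv")

    # A Hadoop job on a later cluster can read the same object through the
    # s3a:// scheme: s3a://my-analytics-bucket/raw/events/events-2017-01.csv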

Architecture

The following graphic illustrates the high-level architecture of Hortonworks Data Cloud:

Primary Components

The two primary components of Hortonworks Data Cloud are the cloud controller and one or more clusters being managed by that controller. The cloud controller and the cluster nodes run on EC2 instances.

The cloud controller is a web application that communicates with AWS services to create AWS resources on your behalf. Once the AWS resources are in place, the cloud controller uses Apache Ambari to deploy and configure the cluster on those EC2 instances (based on your choice of HDP version and cluster configuration). Once your cluster is deployed, you can use the cloud controller to scale it.
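
Because Apache Ambari manages the deployed cluster, you can verify what the controller set up by querying the Ambari REST API on the cluster's master node. The sketch below is an illustration only; the host placeholder, port, and default credentials are assumptions to replace with your own values.

    # A minimal sketch: list the clusters that Ambari is managing.
    # The host placeholder, port, and credentials are assumptions; use your own.
    import requests

    AMBARI_URL = "http://<master-node-public-dns>:8080/api/v1/clusters"

    response = requests.get(AMBARI_URL, auth=("admin", "admin"))
    response.raise_for_status()

    for item in response.json()["items"]:
        print(item["Clusters"]["cluster_name"])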

A cluster, used for storing and processing data, includes three node types: master, worker, and compute.

For the purposes of instance scaling and management, cluster instances are deployed into three auto scaling groups: one for the master node, one for the worker nodes, and one for the compute nodes. For more information on auto scaling groups, see the AWS documentation.
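
The auto scaling groups are ordinary AWS resources, so you can inspect them with the AWS SDK. In the sketch below, filtering on a cluster name is only an assumption about how your groups are named; adapt it to your environment.

    # A minimal sketch: list auto scaling groups and their sizes in the current region.
    # Filtering on a cluster name is an assumption about group naming; adapt as needed.
    import boto3

    autoscaling = boto3.client("autoscaling")
    cluster_name = "my-cluster"  # hypothetical cluster name

    paginator = autoscaling.get_paginator("describe_auto_scaling_groups")
    for page in paginator.paginate():
        for group in page["AutoScalingGroups"]:
            if cluster_name in group["AutoScalingGroupName"]:
                print(group["AutoScalingGroupName"],
                      "desired:", group["DesiredCapacity"],
                      "running:", len(group["Instances"]))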

AWS Services

The following AWS services are used by Hortonworks Data Cloud:

Network and Security

In addition to the Amazon EC2 instances created for the cloud controller and cluster nodes, Hortonworks Data Cloud deploys the following network and security AWS resources on your behalf:

Amazon RDS

When creating a cluster, you have the option to have a Hive Metastore database created with the cluster or to use an external Hive Metastore backed by Amazon RDS. Using an external Amazon RDS database for the Hive Metastore allows you to preserve the Hive Metastore metadata and reuse it between clusters. For more information, see the Managing Metastores documentation.

Furthermore, you have the option to use an external Amazon RDS database to store cloud controller configuration information for upgrade and recovery purposes. For more information, see the Amazon RDS Instance documentation.
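
If you provision such an external database yourself, it is a standard Amazon RDS instance that you can create with the AWS SDK. All identifiers, sizes, and credentials below are placeholders, and the database engine shown is an assumption; check the documentation referenced above for the supported engines and settings.

    # A minimal sketch of creating an external RDS database with boto3.
    # Every identifier, size, and credential here is a placeholder, and the engine
    # choice is an assumption; confirm supported settings in the HDCloud docs.
    import boto3

    rds = boto3.client("rds")

    rds.create_db_instance(
        DBInstanceIdentifier="hive-metastore-db",     # hypothetical instance name
        DBInstanceClass="db.t2.medium",
        Engine="postgres",
        AllocatedStorage=20,                          # storage in GiB
        MasterUsername="metastore",
        MasterUserPassword="replace-with-a-secret",
        DBName="hive",
    )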

Amazon S3

Hortonworks Data Cloud provides seamless access to Amazon S3 buckets, in which you can store data for an extended period of time. You can copy data sets from S3 to HDFS for analysis and then copy the results back to S3 when done. For more information, see the Data Storage on Amazon S3 documentation.
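
For example, you can stage an S3 data set into HDFS with Hadoop's distcp tool and publish the results back afterwards. The bucket and paths below are placeholders, and the sketch assumes it runs on a cluster node where the hadoop command is available.

    # A minimal sketch of copying data between S3 and HDFS with distcp, run on a
    # cluster node. The bucket name and paths are placeholders.
    import subprocess

    bucket = "my-analytics-bucket"  # hypothetical bucket name

    # Stage the input data from S3 into HDFS for analysis.
    subprocess.run(
        ["hadoop", "distcp", "s3a://" + bucket + "/raw/events/", "hdfs:///data/events/"],
        check=True,
    )

    # ... run Hive or Spark jobs against hdfs:///data/events/ ...

    # Publish the results back to S3 so they persist after the cluster is terminated.
    subprocess.run(
        ["hadoop", "distcp", "hdfs:///data/results/", "s3a://" + bucket + "/results/"],
        check=True,
    )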

Get Started

This section helps you get Hortonworks Data Cloud up and running in your AWS environment.

To get started:

  1. Meet the prerequisites.
  2. Review available AWS regions and select the region in which you would like to launch the cloud controller.
  3. Review available cluster configurations and select your desired configuration.
  4. Launch a cloud controller instance that you will use to provision a cluster.
  5. Log in to the cloud controller UI and create a cluster.
  6. Log in to Ambari and SSH to the cluster nodes.
  7. Manage your clusters in the cloud controller UI: scale up, scale down, and, when you are done, terminate them.

Note

The Hortonworks Data Cloud software runs in your AWS environment. You are responsible for the AWS charges incurred while running Hortonworks Data Cloud and the clusters that it manages. To learn more about AWS pricing, see the service-specific pricing pages or the AWS Simple Monthly Calculator.

Prerequisites

To use Hortonworks Data Cloud, you need the following:

  1. AWS account: If you already have an AWS account, log in to the AWS Management Console. Alternatively, you can create a new AWS account.
  2. A key pair in a selected region: The Amazon EC2 instances that you create for Hortonworks Data Cloud will be accessible via the key pair that you provide during installation. Refer to the AWS documentation for instructions on how to create a key pair in a selected region (a minimal example follows this list).
  3. Subscription to Hortonworks Data Cloud AWS Marketplace products: To launch Hortonworks Data Cloud for AWS, you must subscribe to two AWS Marketplace products: Hortonworks Data Cloud - Controller Service (allows you to launch the cloud controller) and Hortonworks Data Cloud - HDP Services (allows the cloud controller to create HDP clusters). Refer to the Subscribe documentation.
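
As an example for the key pair prerequisite, you can create a key pair with the AWS SDK for Python and save the private key locally; the region, key name, and output path below are placeholders, and the AWS Console or CLI works just as well.

    # A minimal sketch of creating an EC2 key pair in a chosen region with boto3.
    # The region, key name, and output path are placeholders; pick your own.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-west-2")

    key_pair = ec2.create_key_pair(KeyName="hdcloud-keypair")  # hypothetical key name

    # Save the private key; you need it to SSH to the controller and cluster nodes.
    with open("hdcloud-keypair.pem", "w") as pem_file:
        pem_file.write(key_pair["KeyMaterial"])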

AWS Regions

Not all AWS services are supported in all regions (for details, see the AWS Region Table). Therefore, Hortonworks Data Cloud can only be launched in the following regions:

Region Name             Region
US East (N. Virginia)   us-east-1
US West (Oregon)        us-west-2
EU Central (Frankfurt)  eu-central-1
EU West (Dublin)        eu-west-1
Asia Pacific (Tokyo)    ap-northeast-1

Cluster Configurations

You can create different types of Apache Hive and Apache Spark clusters. After you have launched the cloud controller and it's time to create a cluster, you will be prompted to choose the HDP Version and the Cluster Type.

HDP Version: HDP 2.6 Cloud

Cluster Type     | Services                     | Description
Data Science     | Spark 1.6, Zeppelin 0.7.0    | Includes Spark 1.6 with Zeppelin.
Data Science     | Spark 2.1, Zeppelin 0.7.0    | Includes Spark 2.1 with Zeppelin.
EDW - Analytics  | Hive 2 LLAP, Zeppelin 0.7.0  | Includes Hive 2 LLAP.
EDW - ETL        | Hive 1.2.1, Spark 1.6        | Includes Hive and Spark 1.6.
EDW - ETL        | Hive 1.2.1, Spark 2.1        | Includes Hive and Spark 2.1.
BI               | Druid 0.9.2                  | Includes a Technical Preview of Druid.

HDP Version: HDP 2.5 Cloud

Cluster Type     | Services                     | Description
Data Science     | Spark 1.6, Zeppelin 0.6.0    | Includes Spark 1.6 and Zeppelin.
EDW - ETL        | Hive 1.2.1, Spark 1.6        | Includes Hive and Spark 1.6.
EDW - ETL        | Hive 1.2.1, Spark 2.0        | Includes a Technical Preview of Spark 2.0.
EDW - Analytics  | Hive 2 LLAP, Zeppelin 0.6.0  | Includes a Technical Preview of Hive 2 LLAP.

For a full list of services included in each of the configurations, refer to Cluster Services.

Choosing Your Configuration

When creating a cluster, you can choose a more stable cluster configuration for a predictable experience, or you can try the latest capabilities by choosing a more experimental configuration. The following configuration classification applies:

  • Stable configurations are the best choice if you want to avoid issues and other problems with launching and using clusters.
  • If you want to use a Technical Preview version of a component within an HDP release, choose a configuration that includes that component.
  • The most cutting-edge configurations include Technical Preview components in a Technical Preview HDP release.