Creating a Cluster

Note

If you have not subscribed to the Hortonworks Data Cloud - HDP Services AWS Marketplace product, you will receive an error message when you attempt to create a cluster. Refer to the Subscribe documentation for more information.

  1. Browse to your running cloud controller. For example: https://ec2-11-111-111-11.compute-1.amazonaws.com/

    To determine the URL for your cloud controller, refer to Obtaining the Cloud Controller URL.

  2. Log in using the email address and password that you provided in the CloudFormation template when launching the controller.

  3. Click CREATE CLUSTER. The cluster creation form is displayed.

    By default, only basic options are shown. You can expand SHOW ADVANCED OPTIONS to view additional options.

  4. For (OPTIONAL) CHOOSE CLUSTER TEMPLATE, select a cluster template to use, if desired. This option is only available if you have saved at least one cluster template prior to launching this form. Refer to Managing Cluster Templates for more information.

  5. For GENERAL CONFIGURATION, enter the following parameters:

    Parameter Description
    Cluster Name Enter the name for this cluster. The name:
    • Must start with a letter.
    • Must include 5-20 characters.
    • Can include only lowercase letters, numbers, and -.
    HDP Version Choose the HDP version to use for this cluster.
    Cluster Type Choose the type of cluster configuration to use. Refer to Cluster Configurations for more information.

    You can expand SHOW ADVANCED OPTIONS to view additional options:

    Parameter Description
    Tags You can optionally add custom tags that will be displayed on the CloudFormation stack and on EC2 instances. Refer to Tagging Resources for more information.
    Custom Properties Allows you to paste or upload a JSON file with custom properties set to specific values. Refer to Custom Properties for more information.
    Node Recipes Allows you to upload scripts that will be run pre- or post- cluster deployment. Refer to Node Recipes for more information.
    Flex Subscription This option will appear if you have configured your deployment for a Flex Subscription.
  6. For HARDWARE & STORAGE, the default options allow you to select the instance types for the cluster nodes and the count for the worker nodes.

    Sizing Guidelines

    For a small cluster (fewer than 20 nodes), we recommend using:

    • Master node: m4.4xlarge (16 vCPU, 64.0 GB Memory)
    • Worker node: m4.4xlarge (16 vCPU, 64.0 GB Memory), with EBS-based storage

    For a medium-size cluster (20-100 nodes), we recommend using:

    • Master node: d2.4xlarge (16 vCPU, 122 GB Memory)
    • Worker node: d2.4xlarge (16 vCPU, 122 GB Memory) and 12 x 2000 GB HDD, which is suitable for storing HDFS data and for distributing YARN local directories.

    For an LLAP cluster, we recommend using d2.8xlarge (36 vCPU, 244 GB Memory), with 24 x 2000 GB HDD. You may also consider using i2.8xlarge with attached SSD storage.

    Parameter Description
    Master Instance Type Choose the instance type for the master node.
    Worker Instance Type Choose the instance type for the worker nodes.
    Worker Instance Count Enter the number of worker nodes. The cluster will be created with one master node and this number of worker nodes.
    Compute Instance Type Choose the instance type for the compute nodes.
    Compute Instance Count Optionally, enter the number of compute nodes. These nodes are solely used for processing data. If you set this to 0 (default value), you will be able to add compute nodes later.
    Use Spot Instances

    Check this option to use EC2 spot instances as your compute nodes. Next, enter your bid price. The price that is pre-loaded in the form is the current on-demand price for your chosen EC2 instance type.

    In general, if you choose to use spot instances as compute nodes when creating your cluster, any additional nodes that you add to that cluster will be using spot instances. Likewise, the bid price that you submit will be used for adding additional compute nodes. After creating a cluster, you can view the bid price in the cluster details.

    If you decide not to use spot instances, any compute nodes that you add to your cluster will be using standard on-demand instances.

    For more information, refer to Using Spot Instances.

    Auto repair

    The auto repair option is available for worker nodes and compute nodes that run on on-demand instances. It is not available for compute nodes using spot instances.

    • ON (default) - Failed worker and compute nodes will be repaired automatically.
    • OFF - Failed worker and compute nodes will not be repaired automatically, but you will have an option to repair or delete failed worker and compute nodes manually.

    For more information, refer to Node Auto Repair.

    You can expand SHOW ADVANCED OPTIONS to view additional options related to master and worker node storage. For more information, refer to Instance Storage Settings.

  7. For NETWORK, the default options allow you to configure remote access to the cluster instances and protected gateway access to the cluster web UIs and services.

    Parameter Description
    Remote Access Allow connections to the inbound ports for the cluster node instances from this address range. Must be a valid CIDR IP. For example:
    • 10.0.0.0/24 will allow access from 10.0.0.0 through 10.0.0.255.
    • 0.0.0.0/0 will allow access from all IP addresses.
    Refer to Security Groups for more information on the inbound ports that are used with cluster node instances.
    Protected Gateway Access to Ambari and Zeppelin Web UIs This option is checked by default. This option provides password-protected access to the cluster web UI for Ambari and Zeppelin. See Protected Gateway for more information.
    Protected Gateway Access to Hive JDBC This option is checked by default. This option provides password-protected access to Hive JDBC. Refer to Protected Gateway for more information.
    Protected Gateway Access to Cluster Components (NN, RM, JHS, SHS) This option is checked by default. This option provides password-protected access to the cluster web ports for the HDFS NameNode, YARN ResourceManager, MapReduce JobHistory Server, and Spark History Server. Refer to Protected Gateway for more information.

    You can expand SHOW ADVANCED OPTIONS to view additional options:

    Parameter Description
    Use existing VPC and subnet Specify whether to deploy the cluster into an existing VPC and subnet. See Existing VPC for more information.
  8. For SECURITY, the default options allow you to configure the SSH key and the cluster user credentials.

    Parameter Description
    SSH Key Name Name of an existing EC2 key pair to enable SSH access to the cluster instances.
    Cluster User Set the default username and password for the cluster (including Zeppelin, Ambari and the Protected Gateway). See Protected Gateway for more information.

    You can expand SHOW ADVANCED OPTIONS to view additional options:

    Parameter Description
    Instance Role Specify the role that will allow the EC2 instances in the cluster to access S3. The options are:
    • Create new AWS Role to grant S3 access (default) - A new role will be created and used.
    • Select an existing AWS Role - Specify an Instance Profile ARN for the existing role that you want to use. The role must have the s3access policy attached (see the example policy after this table).
    • Do not assign AWS Role.
    Hive Metastore Specify whether to use an external Amazon RDS instance for the Hive metastore. See Hive Metastore for more information.
    Druid Metastore This option is only available for HDP 2.6 clusters that use the BI: Druid configuration. Specify whether to use an external Amazon RDS instance for the Druid metastore. See Druid Metastore for more information.
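
    If you choose "Select an existing AWS Role", the role's attached policy must grant the cluster instances access to the S3 buckets that they use. The following is a minimal sketch of such a policy, assuming a hypothetical bucket named my-example-bucket; the actual s3access policy created by the cloud controller may grant a different set of actions and resources:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
          "Resource": [
            "arn:aws:s3:::my-example-bucket",
            "arn:aws:s3:::my-example-bucket/*"
          ]
        }
      ]
    }

    A policy of this shape would need to be attached to the role referenced by the Instance Profile ARN before you select it in the form.
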
  9. For AUTO SCALING, click on the ON/OFF button to enable the feature. Once enabled, you can define auto scaling policies for adding or removing nodes. Refer to Auto Scaling for more information.

  10. After choosing the options, click CREATE CLUSTER. You have an option to save this cluster configuration as a cluster template, to receive an email when the cluster creation is complete, or to show the JSON that can be used to create a cluster using the Command Line Interface (CLI).

Custom Properties

When creating a cluster, you can optionally include custom cluster configuration properties. This option allows you to set service configuration properties automatically as part of the cluster create process.

Under GENERAL CONFIGURATION, expand SHOW ADVANCED OPTIONS and the option to add custom properties will be displayed. You have two options: paste the JSON directly into the form, or upload a JSON file containing the custom properties.

In either case, the JSON structure is a list of configuration property maps keyed by configuration type:

[
  {
    "configuration-type" : {
      "property-name" : "property-value",
      "property-name2" : "property-value"
    }
  },
  {
    "configuration-type2" : {
      "property-name" : "property-value"
    }
  }
]

For example, to set configurations for core-site and hdfs-site, submit the following JSON, which adds the properties and property values to the core-site.xml and hdfs-site.xml respectively:

[
  {
    "core-site" : {
      "property-name" : "property-value",
      "property-name2" : "property-value"
    }
  },
  {
    "hdfs-site" : {
      "property-name" : "property-value"
    }
  }
]
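
As a more concrete illustration, the following sketch uses real property names with placeholder values (the values shown are examples, not recommendations): it sets the HDFS trash interval in core-site.xml and the block replication factor in hdfs-site.xml:

[
  {
    "core-site" : {
      "fs.trash.interval" : "4320"
    }
  },
  {
    "hdfs-site" : {
      "dfs.replication" : "2"
    }
  }
]

Here, fs.trash.interval is expressed in minutes, and dfs.replication controls how many copies of each HDFS block are stored.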

Common configuration types include:

Service Common Configuration Types
HDFS hdfs-site, core-site
YARN yarn-site
Hive hive-site, hiveserver2-site, hive-interactive-site, hiveserver2-interactive-site, hive-metastore-site
Spark spark-defaults

If you need to set additional properties when the cluster is already running, refer to Setting Configuration Properties.

Instance Storage Settings

When creating a cluster, you can adjust the advanced instance storage settings, such as storage type, volume count, and size for the master and worker nodes.

Under HARDWARE & STORAGE, expand SHOW ADVANCED OPTIONS and the available storage options for the master and worker nodes will be displayed. Once displayed, clicking CHANGE exposes the advanced storage settings for storage type, the volume count, and the size of each volume.

Parameter Description
Storage Type The available types are: "General Purpose (SSD)", "Ephemeral", or "Throughput Optimized HDD".
Count The number of storage volumes to include in the instance.
Size The size of each volume (in GB).
Encryption If you selected "General Purpose (SSD)", you can optionally configure encryption for your EBS volumes. For more information, refer to AWS documentation.

Hive Metastore

When creating a cluster, you have an option to have a Hive metastore database created with the cluster, or to use an external Hive metastore that is backed by Amazon RDS.

If you choose to use an external Amazon RDS instance for the Hive metastore, you can choose from the list of previously registered Hive metastores or you can register a new metastore. See Managing Shared Metastores for more information on registering shared metastores.

Parameter Description
Do not use an external Amazon RDS instance. A database will be created with the cluster. The database will be destroyed when the cluster is terminated, and the metastore data will not be preserved.
Register new Hive metastore... Enter connection information for an existing database on an existing Amazon RDS instance and this Hive metastore will be automatically registered and used with the cluster. See Managing Shared Metastores for more information.
List of registered Hive Metastores If you have previously registered a Hive metastore for the HDP version chosen for this cluster, you can select it from the list. This option is only available if you have previously registered at least one Hive metastore for the chosen HDP version.

Druid Metastore

The option to use an external Druid metastore is only available for HDP 2.6 clusters that use the BI: Druid configuration.

When creating an HDP 2.6 cluster using the BI configuration, you have an option to have a Druid metastore database created with the cluster, or you can use an external Druid metastore that is backed by Amazon RDS.

If you choose to use an external Amazon RDS instance for the Druid metastore, you can choose from the list of previously registered Druid metastores or you can register a new metastore. See Managing Shared Metastores for more information on registering shared metastores.

Parameter Description
Do not use an external Amazon RDS instance. A database will be created with the cluster. The database will be destroyed when the cluster is terminated, and the metastore data will not be preserved.
Register new Druid metastore... Enter connection information for an existing database on an existing Amazon RDS instance and this Druid metastore will be automatically registered and used with the cluster. See Managing Shared Metastores for more information.
List of registered Druid Metastores If you have previously registered a Druid metastore for the HDP version chosen for this cluster, you can select it from the list. This option is only available if you have previously registered at least one Druid metastore for the chosen HDP version.

Existing VPC

You can optionally choose to install the cluster into a different VPC (and subnet) than the one in which the cloud controller instance is running. By default, the cluster node instances are installed into the same VPC as the cloud controller instance, but in a new subnet.

For instructions on how to create an Amazon VPC for use with an Amazon RDS instance, refer to this Amazon tutorial.