Chapter 7. Real-Time Data Analytics with Druid

	Important
	The HDP distribution of Druid is a Hortonworks Technical Preview. Do not use Druid in your production systems. If you have questions about using a Technical Preview component of HDP, contact Support by logging a case on the Hortonworks Support Portal.

Druid is an open-source data store designed for online analytical processing (OLAP) queries on event data. Druid supports the following data analytics features:

Streaming data ingestion
Real-time queries
Scalability to trillions of events and petabytes of data
Sub-second query latency

These traits make this data store particularly suitable for enterprise-scale business intelligence (BI) applications in environments that require minimal latency. With Druid you can have applications running interactive queries that "slice and dice" data in motion.

A common use case for Druid is to provide a data store that can return BI about streaming data that comes from user activity on a website or multidevice entertainment platform, from consumer events sent over by a data aggregator, or from any other large-scale set of relevant transactions or events from Internet-connected sources.

Druid is licensed under the Apache License, version 2.0.

Content Roadmap

The following table provides links to Druid information resources. The table points to resources that are not contained in this HDP Data Access Guide.

	Important
	The hyperlinks in the table and many others throughout this documentation jump to content published on the druid.io site. Do not download any Druid code from this site for installation in an HDP cluster. Instead, install by selecting Druid as a Service in an Ambari-asssisted HDP installation, as described in Installing and Configuring Druid.

Table 7.1. Druid Content Roadmap in Other Sources

Type of Information	Resources	Description
Introductions	About Druid (Source: druid.io)	Introduces the feature highlights of Druid, and explains in which environments the data store is best suited. The page also links to comparisons of Druid against other common data stores.
	Druid Concepts (Source: druid.io)	This page is the portal to the druid.io technical documentation. While the body of this page describes some of the main technical concepts and components, the right-side navigation pane outlines and links to the topics in the druid.io documentation.
	Druid: A Real-time Analytical Data Store (Source: druid.io)	This white paper describes the Druid architecture in detail, performance benchmarks, and an overview of Druid issues in a production environment. The extensive References section at the end of the document point to a wide range of information sources.
Tutorial	Druid Quickstart (Source: druid.io)	A getting started tutorial that walks you through a Druid package, installation, and loading and querying data. The installation of this tutorial is for instructional purposes only and not intended for use in a Hortonworks Hadoop cluster.
Developing on Druid	Developing on Druid (Source: druid.io)	Provides an overview of major Druid components to help developers who want to code applications that use Druid-ingested data. The web page links to another about segments, which is an essential entity to understand when writing applications for Druid.
Data Ingestion	Batch Data Ingestion Loading Streams (Source: druid.io)	These two pages introduce how Druid can ingest data from both static files and real-time streams.
Queries of Druid Data	Querying (Source: druid.io)	Describes the method for constructing queries, supported query types, and query error messages.
Best Practices	Recommendations (Source: druid.io)	A list of tips and FAQs.

Architecture

Druid offers streaming ingestion and batch ingestion to support both of these data analytics modes. A Druid cluster consists of several Druid node types and components. Each Druid node is optimized to serve particular functions. The following list is an overview of Druid node types:

Realtime nodes ingest and index streaming data that is generated by system events. The nodes construct segments from the data and store the segments until these segments are sent to historical nodes. The realtime nodes do not store segments after the segments are transferred.

Historical nodes are designed to serve queries over immutable, historical data. Historical nodes download immutable, read-optimized Druid segments from deep storage and use memory-mapped files to load them into available memory. Each historical node tracks the segments it has loaded in ZooKeeper and transmits this information to other nodes of the Druid cluster when needed.

Broker nodes form a gateway between external clients and various historical and realtime nodes. External clients send queries to broker nodes. The nodes then break each query into smaller queries based on the location of segments for the queried interval and forwards them to the appropriate historical or realtime nodes. Broker nodes merge query results and send them back to the client. These nodes can also be configured to use a local or distributed cache for caching query results for individual segments.

Coordinator nodes mainly serve to assign segments to historical nodes, handle data replication, and to ensure that segments are distributed evenly across the historical nodes. They also provide a UI to manage different data sources and configure rules to load data and drop data for individual datas sources. The UI can be accessed via Ambari Quick Links.

Middle manager nodes are responsible for running various tasks related to data ingestion, realtime indexing, segment archives, etc. Each Druid task is run as a separate JVM.

Overlord nodes handle task management. Overlord nodes maintain a task queue that consists of user-submitted tasks. The queue is processed by assigning tasks in order to the middle manager nodes, which actually run the tasks. The overlord nodes also support a UI that provides a view of the current task queue and access to task logs. The UI can be accessed via Ambari Quick Links for Druid.

	Tip
	For more detailed information and diagrams about the architecture of Druid, see Druid: A Real-time Analytical Data Store Design Overview on druid.io.

Installing and Configuring Druid

Use Apache Ambari to install Druid. Manual installation of Druid to your HDP cluster is not recommended.

After you have a running Ambari server set up for HDP, you can install and configure Druid for your Hadoop cluster just like any other Ambari Service.

	Important
	You must use Ambari 2.5.0 or a later version to install and configure Druid. Earlier versions of Ambari do not support Druid as a service. Also, Druid is available as an Ambari Service when you run with the cluster with HDP 2.6.0 and later versions.

If you need to install and start an Ambari server, see the Getting Ready section of the Apache Ambari Installation Guide.

If a running Ambari server is set up for HDP, see Installing, Configuring, and Deploying a HDP Cluster and the following subsections. While you can install and configure Druid for the Hadoop cluster just like any other Ambari Service, the following subsections contain Druid-specific details for some of the steps in the Ambari-assisted installation.

Interdependencies for the Ambari-Assisted Druid Installation

To use Druid in a real-world environment, your Druid cluster must also have access to other resources to make Druid operational in HDP:

ZooKeeper: A Druid instance requires installation of Apache ZooKeeper. Select ZooKeeper as a Service during Druid installation. If you do not, the Ambari installer does not complete. ZooKeeper is used for distributed coordination among Druid nodes and for leadership elections among coordinator and overlord nodes.
Deep storage: HDFS or Amazon S3 can be used as the deep storage layer for Druid in HDP. In Ambari, you can select HDFS as a Service for the deep storage of data. Aternatively, you can set up Druid to use Amazon S3 as the deep storage layer by setting the druid.storage.type property to s3. The cluster relies on the distributed filesystem to store Druid segments so that there is permanent backup of the data.
Metadata storage: The metadata store is used to persist information about Druid segments and tasks. MySQL, Postgres, and Derby are supported metadata stores. You have the opportunity to select the metadata database when you install and configure Druid with Ambari.
Batch execution engine: Select YARN + MapReduce2 as the appropriate execution resource manager and execution engine, respectively. Druid hadoop index tasks use MapReduce jobs for distributed ingestion of large amounts of data.
(Optional) Druid metrics reporting: If you plan to monitor Druid performance metrics using Grafana dashboards in Ambari, select Ambari Metrics System as a Service.

	Tip
	If you plan to deploy high availability (HA) on a Druid cluster, review the High Availability in Druid Clusters section below to learn what components to install and how to configure the installation so that the Druid instance is primed for a HA environment.

Assigning Slave and Client Components

On the Assign Slaves and Clients window of Ambari, generally you should select Druid Historical and Druid MiddleManager for multiple nodes. (You may also need to select other components that are documented in the Apache Ambari Installation Guide.)The purpose of these components are as follows:

Druid Historical: Loads data segments.

Druid MiddleManager: Runs Druid indexing tasks.

Configuring the Druid Installation

Use the Customize Services window of Ambari installer to finalize the configuration of Druid.

Select Druid > Metadata storage type to access the drop-down menu for choosing the metadata storage database. When you click the drop-down menu, notice the tips about the database types that appear in the GUI.

To proceed with the installation after selecting a database, enter your admin password in the Metadata storage password fields. If MySQL is your metadata storage database, also follow the steps in the Setting up MySQL for Druid section below.

Toggle to the Advanced tab. Ambari has Stack Advisor, which is configuration wizard. Stack Advisor populates many configuration settings based on the your previously entered settings and your environment. Although Stack Advisor provides many default settings for Druid nodes and other related entities in the cluster, you might need to manually tune the configuration parameter settings. Review the following druid.io documentation to determine if you need to change any default settings and how other manual settings operate:

Setting up MySQL for Druid

Complete the following task only if you chose MySQL as the metadata store for Druid.

On the Ambari Server host, stage the appropriate MySQL connector for later deployment.
1. Install the connector.
  RHEL/CentOS/Oracle Linux
  yum install mysql-connector-java*
  SLES
  zypper install mysql-connector-java*
  Debian/Ubuntu
  apt-get install libmysql-java
2. Confirm that mysql-connector-java.jar is in the Java share directory.
  ls /usr/share/java/mysql-connector-java.jar
3. Make sure the Linux permission of the .JARfile is 644.
4. Execute the following command:
  ambari-server setup --jdbc-db=mysql --jdbc-driver=/usr/share/java/mysql-connector-java.jar
Create the Druid database.
- The Druid database must be created prior using the following command:
  # mysql -u root -p
  CREATE DATABASE <DRUIDDATABASE> DEFAULT CHARACTER SET utf8;
  replacing <DRUIDDATABASE> with the Druid database name.
Create a Druid user with sufficient superuser permissions.
- Enter the following in the MySQL database admin utility:
  # mysql -u root -p
  CREATE USER '<DRUIDUSER>'@'%' IDENTIFIED BY '<DRUIDPASSWORD>';
  GRANT ALL PRIVILEGES ON <DRUIDDATABASE>.* TO '<DRUIDUSER>'@'%’;
  FLUSH PRIVILEGES;
  replacing <DRUIDUSER> with the Druid user name and <DRUIDPASSWORD> with the Druid user password.

Security and Druid

	Important
	Place the Druid endpoints behind a firewall. More robust security features that will remove the need to deploy a firewall around Druid are in development.

You can configure Druid nodes to integrate with a Kerberos-secured Hadoop cluster to enable authentication between Druid and other HDP Services. You can enable the authentication with Ambari 2.5.0+, which lets you secure the HTTP endpoints by including a SPNEGO-based Druid extension. The mechanism by which Kerberos security uses keytabs and principals to strengthen security is described in Kerberos Overview and Kerberos Principals.

A benefit of enabling authentication in this way is that it can connect Druid Web UIs to the core Hadoop cluster while maintaining Kerberos protection. The main Web UIs to use with HDP are the Coordinator Console and the Overlord Console. See Coordinator Node and Overlord Node on the druid.io documentation website for more information about the consoles for these nodes.

Securing Druid Web UIs and Accessing Endpoints

Enabling SPNEGO-based Kerberos authentication between the Druid HTTP endpoints and the rest of the Hadoop cluster requires running the Ambari Kerberos Wizard and manually connecting to Druid HTTP endpoints in a command line. After authenticating successfully to Druid, you can submit queries through the endpoints.

Procedure 7.1. Enabling Kerberos Authentication in Druid

Prerequisite: The Ambari Server and Services on the cluster must have SPNEGO-based Kerberos security enabled. See the Ambari Security Guide if you need to configure and enable SPNEGO-based Kerberos authentication.

	Important
	The whole HDP cluster is down after you configure the Kerberos settings and initialize the Kerberos wizard in the following task. Ensure you can have temporary down-time before completing all steps of this task.

Follow the steps in Launching the Kerberos Wizard (Automated Setup) until you get to the Configure Identities window of the Ambari Kerberos Wizard. Do not click the Next button until you adjust advanced Druid configuration settings as follows:

On the Configure Identities window:

Review the principal names, particularly the Ambari Principals on the General tab. These principal names, by default, append the name of the cluster to each of the Ambari principals. You can leave the default appended names or adjust them by removing the -cluster-name from the principal name string. For example, if your cluster is named druid and your realm is EXAMPLE.COM, the Druid principal that is created is druid@EXAMPLE.COM.
Select the Advanced tab > Druid drop-down menu.

Use the following table to determine for which Druid properties, if any, you need to change the default settings. In most cases, you should not need to change the default values.

Table 7.2. Advanced Druid Identity Properties of Ambari Kerberos Wizard

Property	Default Value Setting	Description
`druid.hadoop.security.spnego.` `excludedPaths`	`['status']` If you want to set more than one path, enter values in the following format: `['/status','/condition']`	Specify here HTTP paths that do not need to be secured with authentication. A possible use case for providing paths here are to test scripts outside of a production environment.
`druid.hadoop.security.spnego.` `keytab`	`keytab_dir/spnego.service.keytab`	This is the SPNEGO service keytab that is used for authentication.
`druid.hadoop.security.spnego.` `principal`	`HTTP/_HOST@realm`	This is the SPNEGO service principal that is used for authentication.
`druid.security.extensions.` `loadlist`	`[druid-kerberos]`	This indicates the Druid security extension to load for Kerberos.

Confirm your configuration. Optionally, you can download a CSV file of the principals and keytabs that Ambari can automatically create.

Click Next to proceed with kerberization of the cluster. During the process, all running Services are stopped. Kerberos configuration settings are applied to various components, and keytabs and principals are generated. When the Kerberos process finishes, all Services are restarted and checked.

	Tip
	Initializing the Kerberos Wizard might require a significant amount of time to complete, depending on the cluster size. Refer to the GUI messaging on the screen for progress status.

Procedure 7.2. Accessing Kerberos-Protected HTTP Endpoints

Before accessing any Druid HTTP endpoints, you need to authenticate yourself using Kerberos and get a valid Kerberos ticket as follows:

Log in via the Key Distribution Center (KDC) using the following kinit command, replacing the arguments with your real values:
```
kinit -k -t keytab_file_path user@REALM.COM
```
Verify that you received a valid Kerberos authentication ticket by running the klist command. The console prints a Credentials cache message if authentication was successful. An error message indicates that the credentials are not cached.
Access Druid with a curl command. When you run the command, you must include the SPNEGO protocol --negotiate argument. (Note that this argument has double hyphens.) The following is an example command. Replace anyUser, cookies.txt, and endpoint with your real values.
```
curl --negotiate -u:anyUser -b ~/cookies.txt -c ~/cookies.txt -X POST -H'Content-Type: application/json' http://_endpoint
```

Submit a query to Druid in the following example curl command format:

curl --negotiate -u:anyUser -b ~/cookies.txt -c ~/cookies.txt -X POST -H'Content-Type: application/json'  http://broker-host:port/druid/v2/?pretty -d @query.json

High Availability in Druid Clusters

Ambari provides the ability to configure the High Availability (HA) features available in Druid and other HDP Stack Services.

Configuring Druid Clusters for High Availability

HA by its nature requires a multinode structure with somewhat more sophisticated configuration than a cluster without HA. Do not use local storage in an HA environment.

Configure a Cluster with an HDFS Filesystem

Prerequisites

MySQL or Postgres must be installed as the metadata storage layer. Configure your metadata storage for HA mode to avoid outages that impact cluster operations. See either High Availability and Scalability in the MySQL Reference Manual or High Availability, Load Balancing, and Replication in the PostgreSQL Documentation, depending on your storage selection. Derby does not support a multinode cluster with HA.

At least three ZooKeeper nodes must be dedicated to HA mode.

Steps

Enable Namenode HA using the wizard as described in Configuring NameNode High Availability
Install the Druid Overlord, Coordinator, Broker, Realtime, and Historical processes on multiple nodes that are distributed among different hardware servers.
- Within each Overlord and Coordinator domain, ZooKeeper determines which node is the Active node. The other nodes supporting each process are in Standby state until an Active node stops running and the Standby nodes receive the failover.
- Multiple Historical and Realtime nodes also serve to support a failover mechanism. But for Broker and Realtime processes, there are no designated Active and Standby nodes.
- Muliple instances of Druid Broker processes is required for HA. Recommendations: Use an external, virtual IP address or load balancer to direct user queries to multiple Druid Broker instances. A Druid Router can also serve as a mechanism to route queries to multiple broker nodes.
Ensure that the replication factor for each data source is set greater than 1 in the Coordinator process rules. If no data source rule configurations were changed, no action is required because the default value is 2.