Using Druid and Apache Hive

How Druid indexes Hive data

Before you can create a Druid datasource based on Hive data, you must understand how the data of a Hive external table is mapped to the column orientation and segment files of Druid.

Mapping of a Hive external table to a Druid file

Each Druid segment consists of the following objects to facilitate fast lookup and aggregation:
Timestamp column
Druid populates the timestamp column based on the time granularity you set for the imported Hive data and the time range of data selected in the Hive external table. This column is essential for indexing the data in Druid because Druid itself is a time-series database. The timestamp column must be named __time.
Dimension columns
The dimension columns are used to set string attributes for search and filter operations. To index a Hive-sourced column as a Druid dimension column, you must cast the column as a string type.
Metric columns
Metric columns index numeric values intended for use as aggregates or measures. To index a Hive-sourced column as a Druid metric column, you must cast the column as a Hive numeric data type.
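
The mapping above can be sketched as a Hive CREATE TABLE AS SELECT statement that materializes a Druid datasource. This is a minimal sketch, assuming a hypothetical source table `sales` with columns `order_date`, `region`, and `revenue`; the storage handler class and the `druid.segment.granularity` table property are part of the Hive-Druid integration, while all table and column names here are illustrative.

```sql
-- Sketch: create a Druid datasource from a Hive table
-- (table and column names are hypothetical examples).
CREATE TABLE sales_druid
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.segment.granularity" = "MONTH")
AS
SELECT
  CAST(order_date AS TIMESTAMP) AS `__time`,  -- required timestamp column
  CAST(region AS STRING) AS region,           -- string cast: indexed as a dimension
  CAST(revenue AS DOUBLE) AS revenue          -- numeric cast: indexed as a metric
FROM sales;
```

Each SELECT expression determines the Druid column type: the `__time` alias supplies the timestamp column, string-cast columns become dimensions, and numeric-cast columns become metrics.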

The following figure shows an example of how Druid data can be categorized into the three column types.