Using Druid and Apache Hive

How Druid indexes Hive data

Before you can create a Druid datasource based on Hive data, you must understand how the data of a Hive external table is mapped to the column orientation and segment files of Druid.

Mapping of a Hive external table to a Druid file

Each Druid segment consists of the following objects to facilitate fast lookup and aggregation:
Timestamp column
Druid populates the timestamp column based on the time granularity you set for the imported Hive data and the time range of data selected in the Hive external table. This column is essential for indexing the data in Druid because Druid itself is a time-series database. The timestamp column must be named __time.
Dimension columns
The dimension columns are used to set string attributes for search and filter operations. To index a Hive-sourced column as a Druid dimension column, you must cast the column as a string type.
Metric columns
Metric columns index numeric values intended for use as aggregates or measures. To index a Hive-sourced column as a Druid metric column, you must cast the column as a Hive numeric data type.
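
The mapping above can be sketched as a Hive CREATE TABLE AS SELECT statement that materializes a Druid datasource. This is a minimal sketch, assuming a hypothetical source table `sales` with columns `order_date`, `region`, and `revenue`; the storage handler class and the `druid.segment.granularity` table property are part of the Hive-Druid integration, while all table and column names here are illustrative.

```sql
-- Sketch: create a Druid datasource from a Hive table
-- (table and column names are hypothetical examples).
CREATE TABLE sales_druid
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.segment.granularity" = "MONTH")
AS
SELECT
  CAST(order_date AS TIMESTAMP) AS `__time`,  -- required timestamp column
  CAST(region AS STRING) AS region,           -- string cast: indexed as a dimension
  CAST(revenue AS DOUBLE) AS revenue          -- numeric cast: indexed as a metric
FROM sales;
```

Each SELECT expression determines the Druid column type: the `__time` alias supplies the timestamp column, string-cast columns become dimensions, and numeric-cast columns become metrics.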

The following figure shows an example of how Druid data can be categorized into the three column types.