Developing Apache Spark Applications

Selecting a Connector

Use the following information to select an HBase connector for Spark.

The two connectors are designed to meet the needs of different workloads. In general, use the Hortonworks Spark-HBase Connector for SparkSQL, DataFrame, and other fixed-schema workloads. Use the RDD-Based Spark-HBase Connector for RDDs and other flexible-schema workloads.

Hortonworks Spark-HBase Connector

The connector developed by Hortonworks works with DataFrames and supports optimizations such as partition pruning, predicate pushdown, and scanning. It is highly optimized to push filters down to the HBase level, which speeds up workloads. The tradeoff is limited flexibility, because you must specify your schema up front. The connector uses the standard Spark DataSource API for query optimization.
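The following minimal sketch shows what specifying the schema up front can look like. It assumes the JSON catalog layout and data source name documented in the shc repository README; the table name, column names, and the helper functions `build_catalog` and `read_hbase_table` are hypothetical and exist only for illustration.

```python
import json

# Data source name used by the Hortonworks connector (shc), per its README.
SHC_FORMAT = "org.apache.spark.sql.execution.datasources.hbase"


def build_catalog(namespace, table, rowkey, columns):
    """Return a JSON catalog string mapping DataFrame columns to HBase cells.

    `columns` maps a DataFrame column name to a
    (column family, qualifier, type) tuple. The row key column uses the
    reserved column family "rowkey".
    """
    return json.dumps({
        "table": {"namespace": namespace, "name": table},
        "rowkey": rowkey,
        "columns": {
            name: {"cf": cf, "col": col, "type": dtype}
            for name, (cf, col, dtype) in columns.items()
        },
    })


# Hypothetical table layout, used only for illustration.
catalog = build_catalog(
    namespace="default",
    table="sensor_readings",
    rowkey="key",
    columns={
        "sensor_id": ("rowkey", "key", "string"),
        "temperature": ("cf1", "temp", "double"),
        "active": ("cf1", "active", "boolean"),
    },
)


def read_hbase_table(spark, catalog_json):
    """Load the HBase table described by the catalog as a DataFrame.

    `spark` is an existing SparkSession started with the shc package
    available; it is passed in so this sketch needs no Spark import.
    """
    return (
        spark.read
        .options(catalog=catalog_json)
        .format(SHC_FORMAT)
        .load()
    )
```

With a running session, `df = read_hbase_table(spark, catalog)` would return a DataFrame, and a filter such as `df.filter(df.temperature > 30.0)` is the kind of predicate the connector can try to push down to HBase. Whether a given filter is actually pushed down depends on the connector version and the column types involved.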

The connector is open source. The Hortonworks Spark-HBase Connector library is available as a downloadable Spark package at https://github.com/hortonworks-spark/shc. The repository README file contains information about how to use the package with Spark applications.

For more information about the connector, see the A Year in Review blog.

RDD-Based Spark-HBase Connector

The RDD-based connector is developed by the Apache community. It is designed for full flexibility: you can define the schema on read, which makes it suitable for workloads where the schema is undefined at ingestion time. However, the architecture has some performance tradeoffs.
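To illustrate the idea of defining the schema on read, the sketch below decodes raw byte cells with decoders that the reader supplies at query time. It is plain Python and does not call the connector, whose API is provided for Scala and Java in the Apache HBase repository; the sample rows, the qualifiers, and the `decode_row` helper are all made up for illustration.

```python
import struct

# Raw cells as they might come back from a scan: (row key, {qualifier: bytes}).
# These rows are made-up illustration data.
raw_rows = [
    (b"row1", {b"cf1:temp": struct.pack(">d", 21.5), b"cf1:note": b"ok"}),
    (b"row2", {b"cf1:temp": struct.pack(">d", 30.25)}),  # no "note" cell
]

# The schema lives with the reader, not with the stored data:
# each qualifier is paired with a function that interprets its bytes.
decoders = {
    b"cf1:temp": lambda b: struct.unpack(">d", b)[0],  # big-endian double
    b"cf1:note": lambda b: b.decode("utf-8"),
}


def decode_row(row, decoders):
    """Apply reader-supplied decoders to one raw row.

    Cells without a decoder are skipped, and rows may carry different
    sets of cells, which is what makes the schema flexible.
    """
    key, cells = row
    record = {"key": key.decode("utf-8")}
    for qualifier, value in cells.items():
        decode = decoders.get(qualifier)
        if decode is not None:
            record[qualifier.decode("utf-8")] = decode(value)
    return record


records = [decode_row(row, decoders) for row in raw_rows]
```

In an RDD-style workflow, a function like `decode_row` would typically be applied to each scanned row in a map step. Because every row is decoded in application code, the query engine has less opportunity to optimize the read, which is one way to understand the performance tradeoff mentioned above.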

Refer to the following table for other factors that might affect your choice of connector, along with source repos and code examples.

Table 8.1. Comparison of the Spark-HBase Connectors

                             Hortonworks Spark-HBase Connector  RDD-Based Spark-HBase Connector
Source                       Hortonworks                        Apache HBase community
Apache Open Source?          Yes                                Yes
Requires a Schema?           Yes: Fixed schema                  No: Flexible schema
Suitable Data for Connector  SparkSQL or DataFrame              RDD
Main Repo                    shc git repo                       Apache hbase-spark git repo
Sample Code for Java         Not available                      Apache hbase.git repo
Sample Code for Scala        shc git repo                       Apache hbase.git repo
Sample Code for SparkSQL     Perform DataFrame git repo         Apache hbase.git repo