Spark Guide

Chapter 1. Introduction

Hortonworks Data Platform (HDP) supports Apache Spark 1.4.1, a fast, large-scale data processing engine.

Deep integration of Spark with YARN allows Spark to run as a cluster tenant alongside other engines such as Hive, Storm, and HBase, all running simultaneously on a single data platform. This gives you the flexibility to choose the right processing tool for each job. Instead of creating and managing a set of dedicated clusters for Spark applications, you can store data in a single location, access and analyze it with multiple processing engines, and share cluster resources among them. In a modern data architecture where multiple processing engines run on YARN and access data in HDFS, Spark on YARN is the leading Spark deployment mode.

Spark Features

Spark on HDP supports the following features:

  • Spark Core

  • Spark on YARN

  • Spark on YARN on Kerberos-enabled clusters

  • Spark History Server

  • DataFrame API

  • Spark MLlib

  • Optimized Row Columnar (ORC) files

  • Support for Hive 0.13.1, including the collect_list UDF

  • The ML Pipeline API in PySpark

The following features are available as technical previews:

  • Spark SQL

  • Spark Streaming

  • Spark Thrift Server

  • Dynamic Executor Allocation

  • SparkR

The following features and associated tools are not officially supported by Hortonworks:

  • Spark Standalone

  • GraphX

  • Apache Zeppelin

  • IPython

Spark on YARN uses YARN services for resource allocation, running Spark Executors in YARN containers. Spark on YARN supports workload management and Kerberos security features. It has two modes:

  • YARN-Cluster mode, optimized for long-running production jobs.

  • YARN-Client mode, best for interactive use such as prototyping, testing, and debugging. Spark Shell runs in YARN-Client mode only.
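The two modes are selected when an application is submitted. As a minimal sketch, the helper below builds (but does not run) a spark-submit command line for either mode. It assumes the Spark 1.x master values yarn-cluster and yarn-client; the function name, jar path, and main class are hypothetical placeholders, not values from this guide.

```python
# Sketch only: assembles a spark-submit argument list for a chosen YARN mode.
# Assumptions: Spark 1.x master strings ("yarn-cluster", "yarn-client");
# the jar path and class name used below are hypothetical placeholders.

YARN_MASTERS = {
    "cluster": "yarn-cluster",  # driver runs inside YARN; suited to production jobs
    "client": "yarn-client",    # driver runs on the submitting host; suited to interactive use
}


def build_submit_command(mode, app_jar, main_class, app_args=()):
    """Return the argv list for submitting a Spark application to YARN."""
    if mode not in YARN_MASTERS:
        raise ValueError("mode must be 'cluster' or 'client', got %r" % (mode,))
    return [
        "spark-submit",
        "--master", YARN_MASTERS[mode],
        "--class", main_class,
        app_jar,
    ] + list(app_args)


# Hypothetical application: same jar, submitted in each mode.
cluster_cmd = build_submit_command("cluster", "/path/to/app.jar", "com.example.MyApp", ["10"])
client_cmd = build_submit_command("client", "/path/to/app.jar", "com.example.MyApp", ["10"])

print(" ".join(cluster_cmd))
print(" ".join(client_cmd))
```

Only the --master value differs between the two commands. Because Spark Shell runs in YARN-Client mode only, an interactive session would use the yarn-client value.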

Table 1.1. Spark - HDP Version Support

HDP      Ambari   Spark
2.3.2    2.1.2    1.4.1
2.3.0    2.1.1    1.3.1
2.2.9    2.1.1    1.3.1
2.2.8    2.1.1    1.3.1
2.2.6    2.1.1    1.2.1
2.2.4    2.0.1    1.2.1

Table 1.2. Spark Feature Support by Version

Feature                                     1.2.1   1.3.1   1.4.1
Spark Core                                  Yes     Yes     Yes
Spark on YARN                               Yes     Yes     Yes
Spark on YARN, Kerberos-enabled clusters    Yes     Yes     Yes
Spark History Server                        Yes     Yes     Yes
Spark MLlib                                 Yes     Yes     Yes
Hive 0.13.1, including collect_list UDF             Yes     Yes
ML Pipeline API (PySpark)                                   Yes
DataFrame API                                       TP      Yes
ORC Files                                           TP      Yes
Spark SQL                                   TP      TP      TP
Spark Streaming                             TP      TP      TP
Spark Thrift Server                                 TP      TP
Dynamic Executor Allocation                         TP      TP
SparkR                                                      TP
Spark Standalone
GraphX

   TP: Tech Preview
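
Reading the two tables together answers the question "what is the status of a feature on a given HDP release?". The sketch below illustrates that lookup with the data from Table 1.1 and a subset of the rows from Table 1.2. The names HDP_TO_SPARK, FEATURE_STATUS, and feature_status are hypothetical, and None stands for a blank table cell.

```python
# Sketch only: chains Table 1.1 (HDP -> Spark) with Table 1.2 (feature status).
# All names here are illustrative; None represents a blank cell in Table 1.2.

# Table 1.1: HDP version -> bundled Spark version
HDP_TO_SPARK = {
    "2.3.2": "1.4.1",
    "2.3.0": "1.3.1",
    "2.2.9": "1.3.1",
    "2.2.8": "1.3.1",
    "2.2.6": "1.2.1",
    "2.2.4": "1.2.1",
}

# Column order of Table 1.2
SPARK_VERSIONS = ("1.2.1", "1.3.1", "1.4.1")

# Table 1.2 (subset of rows): status per Spark version, in column order
FEATURE_STATUS = {
    "Spark Core": ("Yes", "Yes", "Yes"),
    "DataFrame API": (None, "TP", "Yes"),
    "ORC Files": (None, "TP", "Yes"),
    "Spark SQL": ("TP", "TP", "TP"),
    "SparkR": (None, None, "TP"),
}


def feature_status(hdp_version, feature):
    """Return 'Yes', 'TP', or None for a feature on the given HDP release."""
    spark_version = HDP_TO_SPARK[hdp_version]
    column = SPARK_VERSIONS.index(spark_version)
    return FEATURE_STATUS[feature][column]


print(feature_status("2.3.2", "DataFrame API"))
print(feature_status("2.3.0", "DataFrame API"))
print(feature_status("2.2.6", "DataFrame API"))
```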