Apache Spark Component Guide

Configuring Spark for Wire Encryption

You can configure Spark to protect sensitive data in transit by enabling wire encryption.

In general, encryption protects data by making it unreadable without a passphrase or digital key. Data can be encrypted both while it is in transit and when it is at rest:

  • "In transit" encryption refers to data that is encrypted when it traverses a network. The data is encrypted between the sender and receiver process across the network. Wire encryption is a form of "in transit" encryption.

  • "At rest" or "transparent" encryption refers to data stored in a database, on disk, or on other types of persistent media.

Apache Spark supports "in transit" wire encryption of data for Apache Spark jobs. When encryption is enabled, Spark encrypts all data that is moved across nodes in a cluster on behalf of a job, including the following scenarios:

  • Data that is moving between executors and drivers, such as during a collect() operation.

  • Data that is moving between executors, such as during a shuffle operation.

Spark does not support encryption for connectors accessing external sources; instead, the connectors must handle any encryption requirements. For example, the Spark HDFS connector supports transparent encrypted data access from HDFS: when transparent encryption is enabled in HDFS, Spark jobs can use the HDFS connector to read encrypted data from HDFS.

Spark does not support encrypted data on local disk, such as intermediate data written to a local disk by an executor process when the data does not fit in memory. Additionally, wire encryption is not supported for shuffle files, cached data, and other application files. For these scenarios you should enable local disk encryption through your operating system.

Note

The following instructions enable SSL for Spark. Starting with Spark 2.0 (currently in technical preview), you can also enable HTTPS on the History Server UI, for browsing job history data.

Configuration Instructions

Use the following steps to configure Spark for wire encryption:

  1. On each node, create keystore files, certificates, and truststore files.

    1. Create a keystore file:

      keytool -genkey -alias <host> -keyalg RSA -keysize 1024 -dname CN=<host>,OU=hw,O=hw,L=paloalto,ST=ca,C=us -keypass <KeyPassword> -keystore <keystore_file> -storepass <storePassword>

    2. Create a certificate:

      keytool -export -alias <host> -keystore <keystore_file> -rfc -file <cert_file> -storepass <storePassword>

    3. Create a truststore file:

      keytool -import -noprompt -alias <host> -file <cert_file> -keystore <truststore_file> -storepass <truststorePassword>

  2. Create one truststore file that contains the public keys from all certificates.

    1. Log on to one host and import the certificate for that host into the combined truststore file:

      keytool -import -noprompt -alias <hostname> -file <cert_file> -keystore <all_jks> -storepass <allTruststorePassword>

    2. Copy the <all_jks> file to the other nodes in your cluster, and repeat the keytool command on each node with that node's certificate.

  3. Enable Spark authentication.

    1. Set spark.authenticate to true in the yarn-site.xml file:

      <property>
        <name>spark.authenticate</name>
        <value>true</value>
      </property>

    2. Set the following properties in the spark-defaults.conf file:

      spark.authenticate true
      spark.authenticate.enableSaslEncryption true

  4. Enable Spark SSL.

    Set the following properties in the spark-defaults.conf file:

    spark.ssl.enabled true
    spark.ssl.enabledAlgorithms TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA
    spark.ssl.keyPassword <KeyPassword>
    spark.ssl.keyStore <keystore_file>
    spark.ssl.keyStorePassword <storePassword>
    spark.ssl.protocol TLS
    spark.ssl.trustStore <all_jks>
    spark.ssl.trustStorePassword <allTruststorePassword>
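
The combined truststore in step 2 can also be assembled in one pass if every node's certificate has first been copied to a single host. The following is a minimal sketch of that variant, not the procedure described above. It assumes that the certificates are stored as <host>.cer in a directory named by CERT_DIR. The CERT_DIR, ALL_JKS, DRY_RUN, and ALL_TRUSTSTORE_PASSWORD variables and the import_cert function are hypothetical names introduced for illustration. By default the script only prints the keytool commands it would run, so you can review them first.

```shell
#!/bin/sh
# Sketch only: import one certificate per host into a combined truststore.
# Assumption (not from this guide): certificates are named <host>.cer and
# have already been copied into CERT_DIR on the host running this script.
set -eu

CERT_DIR="${CERT_DIR:-./certs}"   # hypothetical directory of exported certificates
ALL_JKS="${ALL_JKS:-./all.jks}"   # combined truststore file (<all_jks>)
DRY_RUN="${DRY_RUN:-1}"           # 1 = print commands only, 0 = run keytool

# Print (or run) the keytool import command for a single host.
import_cert() {
  host="$1"
  if [ "$DRY_RUN" = "1" ]; then
    # A placeholder is printed instead of the real password.
    echo "keytool -import -noprompt -alias $host -file $CERT_DIR/$host.cer -keystore $ALL_JKS -storepass <allTruststorePassword>"
  else
    # ALL_TRUSTSTORE_PASSWORD must be set in the environment for a real run.
    keytool -import -noprompt -alias "$host" -file "$CERT_DIR/$host.cer" \
      -keystore "$ALL_JKS" -storepass "$ALL_TRUSTSTORE_PASSWORD"
  fi
}

# Each command-line argument is treated as one host name.
for host in "$@"; do
  import_cert "$host"
done
```

If the script is saved under a name such as build_truststore.sh (a hypothetical file name), running it with the host names as arguments prints one keytool command per host. Setting DRY_RUN=0 and ALL_TRUSTSTORE_PASSWORD runs the imports; the resulting <all_jks> file still has to be copied to every node.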