Configuring Spark for Wire Encryption
You can configure Spark to protect sensitive data in transit by enabling wire encryption.
In general, wire encryption protects data by making it unreadable without a passphrase or digital key. Data can be encrypted while it is in transit and while it is at rest:
"In transit" encryption refers to data that is encrypted when it traverses a network. The data is encrypted between the sender and receiver process across the network. Wire encryption is a form of "in transit" encryption.
"At rest" or "transparent" encryption refers to data stored in a database, on disk, or on other types of persistent media.
Apache Spark supports "in transit" wire encryption of data for Apache Spark jobs. When encryption is enabled, Spark encrypts all data that is moved across nodes in a cluster on behalf of a job, including the following scenarios:
Data that is moving between executors and drivers, such as during a collect() operation.
Data that is moving between executors, such as during a shuffle operation.
Spark does not support encryption for connectors accessing external sources; instead, the connectors must handle any encryption requirements. For example, the Spark HDFS connector supports transparent encrypted data access from HDFS: when transparent encryption is enabled in HDFS, Spark jobs can use the HDFS connector to read encrypted data from HDFS.
Spark does not support encrypted data on local disk, such as intermediate data written to a local disk by an executor process when the data does not fit in memory. Additionally, wire encryption is not supported for shuffle files, cached data, and other application files. For these scenarios you should enable local disk encryption through your operating system.
Enabling Spark wire encryption also enables HTTPS on the History Server UI, for browsing historical job data.
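If you need different SSL settings for an individual component such as the History Server, Apache Spark also supports per-component overrides of the form spark.ssl.&lt;namespace&gt;.* (see the Apache Spark security documentation for the namespaces your version supports). For example, the port value below is only a placeholder:

```
spark.ssl.historyServer.enabled true
spark.ssl.historyServer.port 18480
```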
On each node, create keystore files, certificates, and truststore files.
- Create a keystore file:

keytool -genkey \
    -alias <host> \
    -keyalg RSA \
    -keysize 1024 \
    -dname CN=<host>,OU=hw,O=hw,L=paloalto,ST=ca,C=us \
    -keypass <KeyPassword> \
    -keystore <keystore_file> \
    -storepass <storePassword>
- Create a certificate:

keytool -export \
    -alias <host> \
    -keystore <keystore_file> \
    -rfc \
    -file <cert_file> \
    -storepass <StorePassword>
- Create a truststore file:

keytool -import \
    -noprompt \
    -alias <host> \
    -file <cert_file> \
    -keystore <truststore_file> \
    -storepass <truststorePassword>
Create one truststore file that contains the public keys from all certificates.
- Log on to one host and import the truststore file for that host:

keytool -import \
    -noprompt \
    -alias <hostname> \
    -file <cert_file> \
    -keystore <all_jks> \
    -storepass <allTruststorePassword>
- Copy the <all_jks> file to the other nodes in your cluster, and repeat the keytool command on each node.
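When the cluster has many nodes, it can help to script the per-host import commands rather than typing them by hand. The sketch below is illustrative only (the helper name, host names, and file paths are placeholders, not part of Spark); it builds one keytool -import command line per host certificate, matching the step above:

```python
# Illustrative helper: generate the "keytool -import" command for each
# host certificate so the same command can be repeated on every node.
import shlex

def truststore_import_cmds(hosts, all_jks, storepass):
    """Return one keytool -import command line per host certificate."""
    cmds = []
    for host in hosts:
        argv = [
            "keytool", "-import", "-noprompt",
            "-alias", host,
            "-file", f"{host}.cer",   # certificate exported in the previous step
            "-keystore", all_jks,     # combined truststore file (<all_jks>)
            "-storepass", storepass,
        ]
        cmds.append(" ".join(shlex.quote(a) for a in argv))
    return cmds

for cmd in truststore_import_cmds(["node1", "node2"], "all.jks", "changeit"):
    print(cmd)
```

Each generated line can then be run on the host that holds the combined truststore, before copying <all_jks> out to the rest of the cluster.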
Enable Spark authentication.
- Set the spark.authenticate property to true:

<property>
  <name>spark.authenticate</name>
  <value>true</value>
</property>
- Set the following properties in the spark-defaults.conf file:

spark.authenticate true
spark.authenticate.enableSaslEncryption true
Enable Spark SSL.
Set the following properties in the spark-defaults.conf file:

spark.ssl.enabled true
spark.ssl.keyPassword <KeyPassword>
spark.ssl.keyStore <keystore_file>
spark.ssl.keyStorePassword <storePassword>
spark.ssl.protocol TLS
spark.ssl.trustStore <all_jks>
spark.ssl.trustStorePassword <allTruststorePassword>
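Spark also accepts any of these properties per job as spark-submit --conf flags instead of cluster-wide defaults. A minimal sketch of assembling those flags follows; the helper name and the placeholder paths and passwords are illustrative, not part of Spark:

```python
# Illustrative sketch: build per-job spark-submit arguments carrying the
# same wire-encryption properties that spark-defaults.conf would hold.
def encryption_conf_args(keystore, keystore_pass, key_pass,
                         truststore, truststore_pass):
    props = {
        "spark.authenticate": "true",
        "spark.authenticate.enableSaslEncryption": "true",
        "spark.ssl.enabled": "true",
        "spark.ssl.protocol": "TLS",
        "spark.ssl.keyStore": keystore,
        "spark.ssl.keyStorePassword": keystore_pass,
        "spark.ssl.keyPassword": key_pass,
        "spark.ssl.trustStore": truststore,
        "spark.ssl.trustStorePassword": truststore_pass,
    }
    args = []
    for name, value in props.items():
        args += ["--conf", f"{name}={value}"]
    return args

args = encryption_conf_args("host.jks", "storePw", "keyPw", "all.jks", "trustPw")
# Pass these to spark-submit ahead of the application file.
```

Keep in mind that command-line flags can expose passwords in process listings and shell history, so spark-defaults.conf with restricted file permissions is usually the safer place for them.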
Enable HTTPS for the Spark UI.
(Optional) If you want to enable on-disk block encryption, which applies to both shuffle and RDD blocks on disk, complete the following steps:
- Add the following properties to the spark-defaults.conf file for Spark:

spark.io.encryption.enabled true
spark.io.encryption.keySizeBits 128
spark.io.encryption.keygen.algorithm HmacSHA1
- Enable RPC encryption.
- Add the following properties to the spark-defaults.conf file: