Running Spark in Docker Containers on YARN
Apache Spark applications usually have a complex set of required software dependencies. Spark applications may require specific versions of these dependencies (such as Pyspark and R) on the Spark executor hosts, sometimes with conflicting versions. Installing these dependencies creates package isolation and organizational challenges, which have typically been managed by specialized operations teams. Virtualization solutions such as Virtualenv or Conda can be complex and inefficient due to per-application dependency downloads.
Docker support in Apache Hadoop 3 enables you to containerize dependencies along with an application in a Docker image, which makes it much easier to deploy and manage Spark applications on YARN.
To enable Docker support in YARN, refer to the following documentation:
"Configure YARN for running Docker containers" in the HDP Managing Data Operating System guide.
"Launching Applications Using Docker Containers" in the Apache Hadoop 3.1.0 YARN documentation.
Links to these documents are available at the bottom of this topic.
Containerized Spark: Bits and Configuration
The base Spark and Hadoop libraries and related configurations installed on the gateway hosts are distributed automatically to all of the Spark hosts in the cluster using the Hadoop distributed cache, and are mounted into the Docker containers automatically by YARN.
In addition, any binaries (–files, –jars, etc.) explicitly included by the user when the application is submitted are also made available via the distributed cache.
YARN Client Mode
In YARN client mode, the driver runs in the submission client’s JVM on the gateway machine. Spark client mode is typically used through Spark-shell.
The YARN application is submitted as part of the SparkContext initialization at the driver. In YARN Client mode the ApplicationMaster is a proxy for forwarding YARN allocation requests, container status, etc., from and to the driver.
In this mode, the Spark driver runs on the gateway hosts as a java process, and not in a YARN container. Hence, specifying any driver-specific YARN configuration to use Docker or Docker images will not take effect. Only Spark executors will run in Docker containers.
During submission, deploy mode is specified as
–deploy-mode=client with the following executor container
Settings for Executors
spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=<spark executor’s docker-image> spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=<any volume mounts needed by the spark application>
YARN Cluster Mode
In the "classic" distributed application YARN cluster mode, a user submits a Spark job to be executed, which is scheduled and executed by YARN. The ApplicationMaster hosts the Spark driver, which is launched on the cluster in a Docker container.
During submission, deploy mode is specified as
–deploy-mode=cluster. Along with the executor’s Docker container
configurations, the driver/app master’s Docker configurations can be set through
environment variables during submission. Note that the driver’s Docker image can be
customized with settings that are different than the executor’s image.
Additional Settings for Driver
spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=<docker-image> spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/etc/passwd:/etc/passwd:ro
In the remainder of this topic, we will use YARN client mode.
In this example, Spark-R is used (in YARN client mode) with a Docker image that includes the R binary and the necessary R packages (rather than installing these on the host).
/usr/hdp/current/spark2-client/bin/sparkR --master yarn --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=spark-r-demo --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/etc/passwd:/etc/passwd:ro
This example shows how to use PySpark (in YARN client mode) with Python3 (which is part of the Docker image and is not installed on the executor host) to run OLS linear regression for each group using statsmodels with all the dependencies isolated through the Docker image.
The Python version can be customized using the
PYSPARK_PYTHON environment variables on the Spark driver and
PYSPARK_DRIVER_PYTHON=python3.6 PYSPARK_PYTHON=python3.6 pyspark --master yarn --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker --conf spark.executorEnv. YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=pandas-demo --conf spark.executorEnv. YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/etc/passwd:/etc/passwd:ro
Running Containerized Spark Jobs Using Zeppelin
To run containerized Spark using Apache Zeppelin, configure the Docker image, the runtime volume mounts, and the network as shown below in the Zeppelin Interpreter settings (under User (e.g.: admin) > Interpreter) in the Zeppelin UI.
Configuring the Livy Interpreter
You can also configure Docker images, volume, etc. for other Zeppelin interpreters.
You must restart the interpreter(s) in order for these new settings to take effect. You can then submit Spark applications as before in Zeppelin to launch them using Docker containers.