Managing Data Operating System

Example of Running PySpark with a Docker Image

You can run a PySpark program with a Docker image that contains Python 3 binaries.

The following example runs PySpark with Docker containers on YARN in client mode. Use the Docker configuration and the Dockerfile shown in the following sections.

Docker Configuration

PYSPARK_DRIVER_PYTHON=python3.6 \
PYSPARK_PYTHON=python3.6 pyspark --master yarn \
  --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
  --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=pandas-demo \
  --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/etc/passwd:/etc/passwd:ro
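
If the same launch command has to be generated for different images or mount lists, it can help to assemble it programmatically. The following is a minimal sketch, not part of the product: the helper name build_pyspark_command and its parameters are hypothetical. It reuses only the environment variables and spark.executorEnv properties from the configuration above, and joins multiple mounts with commas.

```python
import shlex


def build_pyspark_command(image, mounts, python_bin="python3.6", master="yarn"):
    """Return (env, argv) for launching pyspark with Docker-backed executors.

    Hypothetical helper: it only mirrors the settings shown in the
    Docker configuration and does not start any process.
    """
    # Variables placed in front of the pyspark command.
    env = {
        "PYSPARK_DRIVER_PYTHON": python_bin,
        "PYSPARK_PYTHON": python_bin,
    }
    # Variables passed to the executors through spark.executorEnv.*
    executor_env = {
        "YARN_CONTAINER_RUNTIME_TYPE": "docker",
        "YARN_CONTAINER_RUNTIME_DOCKER_IMAGE": image,
        "YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS": ",".join(mounts),
    }
    argv = ["pyspark", "--master", master]
    for key, value in executor_env.items():
        argv += ["--conf", f"spark.executorEnv.{key}={value}"]
    return env, argv


env, argv = build_pyspark_command("pandas-demo", ["/etc/passwd:/etc/passwd:ro"])

# Render a single shell-safe command line for inspection or logging.
command_line = " ".join(
    [f"{name}={shlex.quote(value)}" for name, value in env.items()]
    + [shlex.quote(arg) for arg in argv]
)
print(command_line)
```

Keeping the executor settings in one dictionary makes it easier to see which values reach the containers and which apply only to the driver on the host.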

Dockerfile

ENV PYTHON_VERSION 36u
RUN yum -y install https://centos7.iuscommunity.org/ius-release.rpm
RUN yum -y install python$PYTHON_VERSION python$PYTHON_VERSION-devel python$PYTHON_VERSION-pip python$PYTHON_VERSION-virtualenv

ENV PYSPARK_PYTHON python3.6
ENV PYSPARK_DRIVER_PYTHON python3.6

RUN ln -s /usr/bin/python3.6 /usr/local/bin/python


RUN wget https://bootstrap.pypa.io/get-pip.py

RUN python get-pip.py

RUN pip3.6 install numpy
RUN pip3.6 install pandas
RUN pip3.6 install --upgrade --no-deps statsmodels
RUN pip3.6 install patsy
RUN pip3.6 install scikit-learn
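
After building the image, it can be useful to confirm that the libraries installed by the Dockerfile are importable by the interpreter the executors will use. The following is a small sketch under stated assumptions: the script and the function name missing_modules are illustrative, and the module list simply mirrors the pip3.6 install lines above (scikit-learn is imported as sklearn).

```python
import importlib.util

# Import names for the packages installed by the Dockerfile.
REQUIRED_MODULES = ["numpy", "pandas", "statsmodels", "patsy", "sklearn"]


def missing_modules(names):
    """Return the names that the current interpreter cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules: " + ", ".join(missing))
else:
    print("All required modules are available.")
```

Running the script with python3.6 inside a container started from the image reports any package that did not install, before a Spark job fails on an executor.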

Example Program