Apache Spark Component Guide

Specifying Which Version of Spark to Run

More than one version of Spark can run on a node. If your cluster runs Spark 1, you can install Spark 2 and test jobs on Spark 2 in parallel with a working Spark 1 environment. After verifying that all scripts and jobs run successfully with Spark 2 (including any changes required for compatibility), you can then transition jobs from Spark 1 to Spark 2. For more information about installing a second version of Spark, see Installing Spark.

Use the following guidelines for determining which version of Spark runs a job by default, and for specifying an alternate version if desired.

  • By default, if only one version of Spark is installed on a node, your job runs with the installed version.

  • By default, if more than one version of Spark is installed on a node, your job runs with the default version for your HDP package. In HDP 2.6, the default is Spark version 1.6.

  • If you want to run jobs on the non-default version of Spark, use one of the following approaches (a short shell sketch of both appears after this list):

    • If you use full paths in your scripts, change spark-client to spark2-client; for example:

      change /usr/hdp/current/spark-client/bin/spark-submit

      to /usr/hdp/current/spark2-client/bin/spark-submit.

    • If you do not use full paths, but instead launch jobs using commands on your PATH, set the SPARK_MAJOR_VERSION environment variable to the desired version of Spark before you launch the job.

      For example, if Spark 1.6.3 and Spark 2.0 are both installed on a node and you want to run your job with Spark 2.0, set:

      export SPARK_MAJOR_VERSION=2

      You can set SPARK_MAJOR_VERSION in automation scripts that use Spark, or in your manual settings after logging on to the shell.

      Note: The SPARK_MAJOR_VERSION environment variable can be set by any user who logs on to a client machine to run Spark. The scope of the environment variable is local to the user session.
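
The following is a minimal shell sketch of both approaches, assuming the standard HDP client layout described above. The --version flag simply prints the Spark version banner, so you can substitute your own job arguments:

    # Approach 1: call the Spark 2 client by its full path
    /usr/hdp/current/spark2-client/bin/spark-submit --version

    # Approach 2: set SPARK_MAJOR_VERSION so that unversioned commands
    # on the PATH resolve to Spark 2
    export SPARK_MAJOR_VERSION=2
    spark-submit --version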

The following example submits a SparkPi job to Spark 2:

  1. Log on to a host where Spark 2.0 is installed.

  2. Change to the Spark 2 client directory:

    cd /usr/hdp/current/spark2-client/

  3. Set the SPARK_MAJOR_VERSION environment variable to 2:

    export SPARK_MAJOR_VERSION=2

  4. Run the SparkPi example:

    ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
        --master yarn \
        --deploy-mode client \
        --num-executors 1 \
        --driver-memory 512m \
        --executor-memory 512m \
        --executor-cores 1 \
        examples/jars/spark-examples*.jar 10

    Note that the path to spark-examples*.jar differs from the path used in Spark 1.x.
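
To confirm which version an unversioned command resolves to, you can print the version banner. This is a quick check, assuming the wrapper script under /usr/bin honors SPARK_MAJOR_VERSION as described above:

    export SPARK_MAJOR_VERSION=2
    /usr/bin/spark-submit --version    # the banner should report a 2.x version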

To change the setting later, either unset the environment variable or set it to the desired version.
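
A minimal sketch, assuming an interactive shell session:

    # Revert to the cluster default for subsequent jobs in this session
    unset SPARK_MAJOR_VERSION

    # ...or direct subsequent jobs to Spark 1 instead
    export SPARK_MAJOR_VERSION=1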