Apache Spark Component Guide

Adding Libraries to Spark

Spark comes equipped with a selection of libraries, including Spark SQL, Spark Streaming, and MLlib. However, if you want to use a custom library with a Spark application (such as a compression library or Magellan), you can use one of the following two spark-submit script options:

  • The --jars option transfers associated .jar files to the cluster.

    Specify a comma-separated list of .jar files.

  • The --packages option pulls dependencies directly from the online Spark packages repository, specified as Maven coordinates (see the example after this list).

    This approach requires an internet connection.

For example, you can use the --jars option to add codec files. The following example adds the LZO compression library:

spark-submit --driver-memory 1G --executor-memory 1G --master yarn-client --jars /usr/hdp/2.3.0.0-2557/hadoop/lib/hadoop-lzo-0.6.0.2.3.0.0-2557.jar test_read_write.py

For more information about the two options, see Advanced Dependency Management on the Apache Spark "Submitting Applications" web page.

Note

If you launch a Spark job that references a codec library without specifying where the codec resides, Spark returns an error similar to the following:

Caused by: java.lang.IllegalArgumentException: Compression codec com.hadoop.compression.lzo.LzoCodec not found.

To address this issue, specify the codec file with the --jars option in your job submit command, as in the following example:

spark-submit --driver-memory 1G --executor-memory 1G --master yarn-client --jars /usr/hdp/2.3.0.0-$BUILD/hadoop/lib/hadoop-lzo-0.6.0.2.3.0.0-$BUILD.jar test_read_write.py