Apache Spark Component Guide
Chapter 6. Running PySpark in a Virtual Environment

For many PySpark applications, specifying dependencies with --py-files is sufficient. However, --py-files is inconvenient in scenarios such as the following:

  • A large PySpark application has many dependencies, including transitive dependencies.

  • A large application needs a Python package that requires C code to be compiled before installation.

  • You want to run different versions of Python for different applications.
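
To make the limits of --py-files concrete, the following minimal sketch packages a pure-Python package into a zip archive suitable for --py-files and reports any compiled extension files it finds, because Python cannot import compiled extension modules directly from a zip archive. The function name build_py_files_zip and the package and file names in the usage note are hypothetical and chosen only for illustration.

```python
import zipfile
from pathlib import Path

# File suffixes of compiled extension modules. Python's zip importer
# cannot load these, so they are left out of the archive and reported.
COMPILED_SUFFIXES = {".so", ".pyd"}


def build_py_files_zip(package_dir, zip_path):
    """Zip a package directory for use with --py-files.

    Returns the list of compiled files that were skipped. A non-empty
    list suggests that the package is a poor fit for --py-files.
    (Hypothetical helper, not part of Spark.)
    """
    package_dir = Path(package_dir)
    skipped = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(package_dir.rglob("*")):
            if not path.is_file():
                continue
            if path.suffix in COMPILED_SUFFIXES:
                skipped.append(path)
                continue
            # Store entries relative to the parent directory so that the
            # package name is the top-level folder inside the archive.
            archive.write(path, path.relative_to(package_dir.parent).as_posix())
    return skipped
```

Assuming a package directory named mypkg and an application file named app.py, you could call build_py_files_zip("mypkg", "deps.zip") and then submit the application with spark-submit --py-files deps.zip app.py. If the function returns any skipped files, a virtual environment is likely the better option.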

In these situations, you can create a virtual environment, which provides an isolated Python runtime for the application. HDP 2.6 supports VirtualEnv for PySpark in both local and distributed environments, which eases the transition from a local environment to a distributed one.

Note: This feature is currently supported only in YARN mode.
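
As a rough illustration of a YARN-mode submission with VirtualEnv enabled, the following sketch assembles a spark-submit argument list. The spark.pyspark.virtualenv.* property names are assumptions based on the HDP VirtualEnv feature and are not stated in this chapter; verify them against Using VirtualEnv with PySpark before relying on them. The function name and all paths are hypothetical.

```python
def virtualenv_submit_args(app, requirements, virtualenv_bin):
    """Build a spark-submit argument list for a VirtualEnv-enabled job.

    Hypothetical helper. The property names below are assumptions and
    should be checked against the HDP documentation for your release.
    """
    conf = {
        "spark.pyspark.virtualenv.enabled": "true",
        "spark.pyspark.virtualenv.type": "native",
        # Path to a pip-style requirements file listing the dependencies.
        "spark.pyspark.virtualenv.requirements": requirements,
        # Path to the virtualenv executable used to create the environment.
        "spark.pyspark.virtualenv.bin.path": virtualenv_bin,
    }
    # The feature is supported only in YARN mode, so the master is fixed.
    args = ["spark-submit", "--master", "yarn"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args
```

The resulting list can be joined with spaces for display or passed to subprocess.run on a host where spark-submit is installed, for example virtualenv_submit_args("app.py", "/path/to/requirements.txt", "/path/to/virtualenv").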

For more information, see Using VirtualEnv with PySpark.