Apache Spark Component Guide

Enabling Spark SQL User Impersonation for the Spark Thrift Server

By default, the Spark Thrift server runs queries under the identity of the operating system account running the Spark Thrift server. In a multi-user environment, queries often need to run under the identity of the end user who originated the query; this capability is called "user impersonation."

When user impersonation is enabled, the Spark Thrift server runs Spark SQL queries as the submitting user. By running queries under the user account of the submitter, the Thrift server can enforce user-level permissions and access control lists. Associated data cached in Spark is visible only to queries from the submitting user.

User impersonation enables granular access control for Spark SQL queries at the level of files or tables.

User impersonation is controlled by the doAs property. When doAs is set to true, the Spark Thrift server launches an on-demand Spark application to handle a user's queries; that application is shared only by connections from the same user. The Spark Thrift server forwards incoming queries to the appropriate Spark application for execution, which keeps the Thrift server itself extremely lightweight: it merely acts as a proxy, forwarding requests and responses. When all of a user's connections to the Spark Thrift server are closed, the corresponding Spark application also terminates.
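
For example, once impersonation is enabled, you can observe the per-user behavior by connecting through Beeline as two different users. This is a minimal sketch; the host name is a placeholder, and the port assumes 10015, commonly used for the Spark Thrift server in HDP (adjust both for your cluster):

    # Connect as user "alice"; queries on this connection run as alice
    beeline -u "jdbc:hive2://sts-host.example.com:10015/default" -n alice

    # A connection as "bob" prompts the Thrift server to launch a separate
    # on-demand Spark application for bob's queries
    beeline -u "jdbc:hive2://sts-host.example.com:10015/default" -n bob

While connections from both users are open, the YARN ResourceManager UI shows one Spark application per user; once all of a user's connections close, that user's application terminates.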

Prerequisites

Spark SQL user impersonation is supported for Apache Spark 1.x, versions 1.6.3 and later.

If you plan to enable storage-based authorization, complete the instructions in Configuring Storage-based Authorization in the Data Access Guide before enabling user impersonation.

Enabling User Impersonation on an Ambari-managed Cluster

To enable user impersonation for the Spark Thrift server on an Ambari-managed cluster, complete the following steps:

  1. Enable doAs support. In Ambari, navigate to the “Advanced spark-hive-site-override” section of the Spark configuration and set hive.server2.enable.doAs=true.

  2. Add DataNucleus jars to the Spark Thrift server classpath. Navigate to the “Custom spark-thrift-sparkconf” section and set the spark.jars property as follows:

    spark.jars=/usr/hdp/current/spark-thriftserver/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-thriftserver/lib/datanucleus-core-3.2.10.jar,/usr/hdp/current/spark-thriftserver/lib/datanucleus-rdbms-3.2.9.jar
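
    The DataNucleus jar versions vary across HDP releases. Before setting spark.jars, you can confirm the exact file names installed on your cluster and adjust the property value to match; this check assumes the standard HDP installation path:

    ls /usr/hdp/current/spark-thriftserver/lib/datanucleus-*.jar
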
  3. (Optional) Disable the YARN application for the Spark Thrift server master. Navigate to the "Advanced spark-thrift-sparkconf" section and set spark.master=local. This prevents the launch of a spark-client HiveThriftServer2 application master, which is not needed when doAs=true because queries are executed by the on-demand Spark applications launched on behalf of each user. When spark.master is set to local, SparkContext uses only the local machine for driver and executor tasks.

    (When the Thrift server runs with doAs set to false, you should set spark.master to yarn-client, so that query execution leverages cluster resources.)

  4. Restart the Spark Thrift server.

Enabling User Impersonation on a Cluster Not Managed by Ambari

To enable user impersonation for the Spark Thrift server on a cluster not managed by Ambari, complete the following steps:

  1. Enable doAs support. Add the following setting to the /usr/hdp/current/spark-thriftserver/conf/hive-site.xml file:

    <property>
        <name>hive.server2.enable.doAs</name>
        <value>true</value>
    </property>
  2. Add DataNucleus jars to the Spark Thrift server classpath. Add the following setting to the /usr/hdp/current/spark-thriftserver/conf/spark-thrift-sparkconf.conf file:

    spark.jars=/usr/hdp/current/spark-thriftserver/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-thriftserver/lib/datanucleus-core-3.2.10.jar,/usr/hdp/current/spark-thriftserver/lib/datanucleus-rdbms-3.2.9.jar
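
    The DataNucleus jar versions differ across HDP releases; confirm the file names installed on your cluster and adjust the spark.jars value to match (this check assumes the standard HDP installation path):

    ls /usr/hdp/current/spark-thriftserver/lib/datanucleus-*.jar
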
  3. (Optional) Disable the YARN application for the Spark Thrift server master. Add the following setting to the /usr/hdp/current/spark-thriftserver/conf/spark-thrift-sparkconf.conf file:

    spark.master=local

    This prevents the launch of an unused spark-client HiveThriftServer2 application master, which is not needed when doAs=true because queries are executed by the on-demand Spark applications launched on behalf of each user. When spark.master is set to local, SparkContext uses only the local machine for driver and executor tasks.

    (When the Thrift server runs with doAs set to false, you should set spark.master to yarn-client, so that query execution leverages cluster resources.)

  4. Restart the Spark Thrift server.
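
    On a cluster not managed by Ambari, restarting typically means stopping and starting the Thrift server with the scripts shipped under the Spark installation. This is a sketch that assumes the standard HDP layout and that the server runs as the hive service account; substitute the account used on your cluster:

    su - hive -c "/usr/hdp/current/spark-thriftserver/sbin/stop-thriftserver.sh"
    su - hive -c "/usr/hdp/current/spark-thriftserver/sbin/start-thriftserver.sh"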

For more information about user impersonation for the Spark Thrift server, see Using Spark SQL.