Cloud Data Access

Configuring Access to Google Cloud Storage

Access from a cluster to Google Cloud Storage is possible through a service account. Configuring access to Google Cloud Storage involves the following steps.

Table 6.1. Overview of Configuring Access to Google Cloud Storage

Step: Creating a service account on Google Cloud Platform and generating a key associated with it.

Considerations:

  • You may need to contact your Google Cloud Platform admin in order to complete these steps.

  • If you already have a service account, you do not need to perform these steps as long as you can provide the service account key. If you have a service account but do not know its key, you should be able to generate a new key.

Step: Modifying permissions of the Google Cloud Storage bucket so that you can access it by using your service account key.

Considerations:

  • You may need to contact your Google Cloud Platform admin in order to complete these steps.

  • You will typically perform these steps for each bucket that you want to access.

  • You do not need to perform these steps if your service account has project-wide access to all buckets on the account.

Step: Placing the service account key on all nodes of the cluster and setting related properties in Ambari.

Considerations:

  • These configuration steps are appropriate for a single-user cluster.

  • Only one configuration per cluster is recommended; that is, you should use one service account key per cluster. If required, it is possible to use multiple service account keys with the same cluster; in this case, each key must be available on all nodes, and each job-specific configuration must point to one selected key (see the sketch after this table).
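
For the multiple-key case, an individual job can point to its own key by overriding the keyfile property on the command line. A minimal sketch, assuming a JSON key stored at the per-user path used later in this document and a placeholder bucket name; whether a given job honors the override depends on how it loads its Hadoop configuration:

    # Override the cluster-wide key for a single command (path and bucket are placeholders).
    hadoop fs \
      -D google.cloud.auth.service.account.json.keyfile=${HOME}/.credentials/storage.json \
      -ls gs://mytestbucket/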


Create a GCP Service Account

You must create a Google Cloud Platform service account and generate an access key (in JSON or P12 format; JSON is preferred). If you are using a corporate GCP account, it is likely that only your GCP admin can perform these steps. Example steps are described below.

These steps assume that you have a Google Cloud Platform account. If you do not have one, you can create it at https://console.cloud.google.com.

Steps

  1. In the Google Cloud Platform web console, navigate to IAM & admin > Service accounts.

  2. Click +Create Service Account.

  3. Provide the following information:

    • Under Service account name, enter a name for your service account.

    • Under Role, select the project-level roles that the account should have.

    • Check Furnish a new private key and select JSON or P12. We recommend using JSON.

  4. Click Create. The file containing the key is downloaded to your machine. The generated file name is usually long, so you may want to rename the file (later steps in this document assume it is named google-access-key.json).

Later you will need to place this key on your cluster nodes.
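
If you prefer the command line, the same account and key can be created with the gcloud CLI. A minimal sketch, assuming gcloud is installed and authenticated; my-project and hdp-gcs-access are placeholder names, and the sketch does not grant any project-level roles (those can be granted separately by your admin):

    # Create the service account (account name and project are placeholders).
    gcloud iam service-accounts create hdp-gcs-access \
        --display-name="HDP GCS access" --project=my-project

    # Generate a JSON key for the account and download it as google-access-key.json.
    gcloud iam service-accounts keys create google-access-key.json \
        --iam-account=hdp-gcs-access@my-project.iam.gserviceaccount.com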

Modify GCS Bucket Permissions

You or your GCP admin must set the bucket permissions so that your service account can access the bucket that you want to use from the cluster. Storage Object Admin is the minimum role required to access the bucket. Example steps are described below.

Steps

  1. In the Google Cloud Platform web console, navigate to Storage > Browser.

  2. Find the bucket for which you want to edit permissions.

  3. Click the options menu next to the bucket and select Edit bucket permissions.

  4. In the Permissions tab, set the bucket-level permissions:

    • Click Add members and enter the service account that you want to use to access the bucket.

    • Under Roles, select Storage Object Admin or another role that allows accessing the bucket. For more information, refer to Cloud Storage IAM Roles in GCP documentation.

    • When done, click Add.

After performing these steps, the bucket-level permissions will be updated.
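
The same grant can also be made from the command line with gsutil (installed with the Cloud SDK). A sketch using the placeholder service account email and bucket name from the earlier examples:

    # Grant the service account the Storage Object Admin role on one bucket.
    gsutil iam ch \
        serviceAccount:hdp-gcs-access@my-project.iam.gserviceaccount.com:roles/storage.objectAdmin \
        gs://mytestbucket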

Configure Access to GCS from Your Cluster

After obtaining the service account key, perform these steps on your cluster. The steps below assume that your service account key is called google-access-key.json. If you chose a different name, make sure to update the commands accordingly.

Steps

  1. Place the service account key on all nodes of the cluster.

    Note the following about where to place the file:

    • Make sure to use an absolute path such as /etc/hadoop/conf/google-access-key.json (where google-access-key.json is your JSON key).

    • The path must be the same on all nodes.

    • In a single-user cluster, /etc/hadoop/conf/google-access-key.json is appropriate. Permissions for the file should be set to 444.

    • If you need to use this option with a multi-user cluster, you should place this in the user's home directory: ${USER_HOME}/.credentials/storage.json. Permissions for the file should be set to 400.

    There are many ways to place the file on the hosts. For example, you can create a hosts file listing all the hosts, one per line, and then run the following:

    for host in `cat hosts`; do
      scp -i <Path_to_ssh_private_key> google-access-key.json \
        <Ssh_user>@$host:/etc/hadoop/conf/google-access-key.json
    done
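
    The recommended file permissions can be set the same way. A sketch, assuming the SSH user is allowed to run sudo on each host:

    # Set read-only (444) permissions on the key on every host, as recommended above.
    for host in `cat hosts`; do
      ssh -i <Path_to_ssh_private_key> <Ssh_user>@$host \
        "sudo chmod 444 /etc/hadoop/conf/google-access-key.json"
    done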

  2. In the Ambari web UI, set the following properties under custom-core-site.

    To set these properties in custom-core-site, navigate to HDFS > Configs > Custom core-site and click Add Property. The JSON and P12 properties cannot be set at the same time.

    • If using a key in the JSON format (recommended), set the following properties:

      google.cloud.auth.service.account.json.keyfile=<Path-to-the-JSON-file> 
      fs.gs.working.dir=/

    • If using a key in the P12 format, set the following properties:

      fs.gs.service.account.auth.email=<Your-Service-Account-email>
      fs.gs.service.account.auth.keyfile=<Path-to-the-p12-file>
      fs.gs.working.dir=/

      Note

      Setting fs.gs.working.dir configures the initial working directory of a GHFS instance. This should always be set to "/".

  3. Save the configuration change and restart affected services. Additionally, depending on which services you use, you must restart other services that access cloud storage, such as Spark Thrift Server, HiveServer2, and Hive Metastore; these will not be listed as affected by Ambari, but they require a restart to pick up the configuration changes.
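
    After the restart, you can confirm from any cluster node that the new values are live. A quick check using hdfs getconf (shown for the JSON-key properties set above):

    # Print the effective values of the properties set in Ambari.
    hdfs getconf -confKey google.cloud.auth.service.account.json.keyfile
    hdfs getconf -confKey fs.gs.working.dir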

  4. Test access to the Google Cloud Storage bucket by running a few commands from any cluster node. For example, you can use the command listed below (replace “mytestbucket” with the name of your bucket):

    hadoop fs -ls gs://mytestbucket/
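
    If you also want to verify writes, a few more round-trip checks (paths and names are examples):

    # Create a directory, upload a small file, read it back, then clean up.
    hadoop fs -mkdir gs://mytestbucket/testdir
    echo "test" > /tmp/testfile.txt
    hadoop fs -put /tmp/testfile.txt gs://mytestbucket/testdir/
    hadoop fs -cat gs://mytestbucket/testdir/testfile.txt
    hadoop fs -rm -r gs://mytestbucket/testdir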

After performing these steps, you should be able to start working with the Google Cloud Storage bucket(s).