Accessing Cloud Data

Configure Access to GCS from Your Cluster

After obtaining the service account key, perform these steps on your cluster. The steps below assume that your service account key is called google-access-key.json. If you chose a different name, make sure to update the commands accordingly.

Steps

  1. In the Ambari web UI, set the following three properties under custom-core-site. To do so, navigate to HDFS > Configs > Custom core-site and click Add Property.

    • Set the following properties:

      fs.gs.auth.service.account.email=<VALUE1>
      fs.gs.auth.service.account.private.key.id=<VALUE2>
      fs.gs.auth.service.account.private.key=<VALUE3>

      You can obtain the values for these parameters as follows:

      Parameter                                    Value
      fs.gs.auth.service.account.email             The client_email field extracted from the credential's JSON file.
      fs.gs.auth.service.account.private.key.id    The private_key_id field extracted from the credential's JSON file.
      fs.gs.auth.service.account.private.key       The private_key field extracted from the credential's JSON file.

      The JSON key file downloaded in the previous step is plain text. Open it in a text editor to extract the values needed for the configuration above; a command-line sketch for extracting them appears at the end of this step.

      Note

      For enhanced security, these values can also be configured using Hadoop CredentialProvider.

    • Configure the following properties if they are not already set by Ambari:

      fs.gs.working.dir=/
      fs.gs.path.encoding=uri-path

      Note

      Setting fs.gs.working.dir configures the initial working directory of a GHFS instance. This should always be set to "/".

      Setting fs.gs.path.encoding sets the path encoding to be used, and allows for spaces in the filename. This should always be set to "uri-path".
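
    The following is a minimal sketch of extracting the three values from the
    command line rather than a text editor. It assumes python3 is available on
    the node and that the key file is named google-access-key.json and sits in
    the current directory; adjust the file name and path if yours differ.

      # Print each value needed for custom-core-site, one field per command.
      python3 -c 'import json; print(json.load(open("google-access-key.json"))["client_email"])'
      python3 -c 'import json; print(json.load(open("google-access-key.json"))["private_key_id"])'
      python3 -c 'import json; print(json.load(open("google-access-key.json"))["private_key"])'

    Note that the private_key value is a multi-line PEM block; copy it in full,
    including the BEGIN and END marker lines.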

  2. Save the configuration change and restart affected services. Additionally, depending on which services you use, you must also restart other services that access cloud storage, such as Spark Thrift Server, HiveServer2, and Hive Metastore. Ambari will not list these as affected, but they require a restart to pick up the configuration changes.
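
    If you prefer to restart these services from the command line instead of
    the Ambari web UI, you can use the Ambari REST API. The commands below are
    only a sketch: they assume the Ambari server runs on ambari-host:8080, the
    cluster is named mycluster, and admin credentials are used; substitute your
    own values.

      # Stop the Hive service (HiveServer2 and Hive Metastore) ...
      curl -u admin:admin -H 'X-Requested-By: ambari' -X PUT \
        -d '{"RequestInfo":{"context":"Stop HIVE"},"Body":{"ServiceInfo":{"state":"INSTALLED"}}}' \
        http://ambari-host:8080/api/v1/clusters/mycluster/services/HIVE

      # ... then start it again once the stop request has completed.
      curl -u admin:admin -H 'X-Requested-By: ambari' -X PUT \
        -d '{"RequestInfo":{"context":"Start HIVE"},"Body":{"ServiceInfo":{"state":"STARTED"}}}' \
        http://ambari-host:8080/api/v1/clusters/mycluster/services/HIVE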

  3. Test access to the Google Cloud Storage bucket by running a few commands from any cluster node. For example, you can use the command listed below (replace “mytestbucket” with the name of your bucket):

    hadoop fs -ls gs://mytestbucket/
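
    If the listing succeeds, you can also verify write access, for example with
    the commands below (again, replace "mytestbucket" with the name of your
    bucket; "testdir" is just an illustrative directory name):

      hadoop fs -mkdir gs://mytestbucket/testdir
      hadoop fs -put /etc/hosts gs://mytestbucket/testdir/
      hadoop fs -cat gs://mytestbucket/testdir/hosts
      hadoop fs -rm -r gs://mytestbucket/testdir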

After performing these steps, you should be able to start working with the Google Cloud Storage bucket(s).