HDFS Administration Guide

Configuring and Using HDFS Data at Rest Encryption

After the Ranger KMS has been set up and the NameNode and HDFS clients have been configured, an HDFS administrator can use the hadoop key and hdfs crypto command-line tools to create encryption keys and set up new encryption zones.

The overall workflow is as follows:

  1. Create an HDFS encryption zone key that will be used to encrypt the file-level data encryption key for every file in the encryption zone. This key is stored and managed by Ranger KMS.

  2. Create a new HDFS folder. Specify required permissions, owner, and group for the folder.

  3. Using the new encryption zone key, designate the folder as an encryption zone.

  4. Configure client access. The user associated with the client application needs sufficient permission to access encrypted data. In an encryption zone, the user needs file/directory access (through POSIX permissions or Ranger access control), as well as access for certain key operations. To set up ACLs for key-related operations, see the Ranger KMS Administration Guide.

After permissions are set, Java API clients and HDFS applications with sufficient HDFS and Ranger KMS access privileges can read and write files in the encryption zone.
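The steps above can be sketched as a short command sequence. The key name ezkey1, the zone path /zone_encr, and the encr:hadoop owner/group are placeholders, and the Ranger KMS ACL setup (step 4) is not shown:

```shell
# Sketch of the encryption zone workflow; runs only where an HDFS client is installed.
if command -v hdfs >/dev/null 2>&1; then
  hadoop key create ezkey1 -size 128                         # 1. create the zone key in Ranger KMS
  hdfs dfs -mkdir /zone_encr                                 # 2. create an empty directory
  hdfs dfs -chown encr:hadoop /zone_encr                     #    set owner/group as needed
  hdfs crypto -createZone -keyName ezkey1 -path /zone_encr   # 3. designate it an encryption zone
  hdfs crypto -listZones                                     # verify the new zone
fi
```

Each command is described in detail in the sections that follow.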

Prepare the Environment

HDP supports hardware acceleration with Advanced Encryption Standard New Instructions (AES-NI). Compared with the software implementation of AES, hardware acceleration offers an order of magnitude faster encryption/decryption.

To use AES-NI optimization you need CPU and library support, described in the following subsections.

CPU Support for AES-NI optimization

AES-NI optimization requires an extended CPU instruction set for AES hardware acceleration.

There are several ways to check for this; for example:

$ cat /proc/cpuinfo | grep aes

Look for 'aes' in the flags field of the output.

Library Support for AES-NI optimization

You will need a version of the libcrypto.so library that supports hardware acceleration, such as OpenSSL 1.0.1e. (Many OS versions have an older version of the library that does not support AES-NI.)

A version of the libcrypto.so library with AES-NI support must be installed on HDFS cluster nodes and MapReduce client hosts -- that is, any host from which you issue HDFS or MapReduce requests. The following instructions describe how to install and configure the libcrypto.so library.

RHEL/CentOS 6.5 or later

On HDP cluster nodes, the installed version of libcrypto.so supports AES-NI, but you will need to make sure that the symbolic link exists:

$ sudo ln -s /usr/lib64/libcrypto.so.1.0.1e /usr/lib64/libcrypto.so

On MapReduce client hosts, install the openssl-devel package:

$ sudo yum install openssl-devel

Verifying AES-NI Support

To verify that a client host is ready to use the AES-NI instruction set optimization for HDFS encryption, use the following command:

hadoop checknative

You should see a response similar to the following:

15/08/12 13:48:39 INFO bzip2.Bzip2Factory: Successfully loaded & initialized native-bzip2 library system-native
15/08/12 13:48:39 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
Native library checking:
hadoop:  true /usr/lib/hadoop/lib/native/libhadoop.so.1.0.0
zlib:    true /lib64/libz.so.1
snappy:  true /usr/lib64/libsnappy.so.1
lz4:     true revision:99
bzip2:   true /lib64/libbz2.so.1
openssl: true /usr/lib64/libcrypto.so

If you see true in the openssl row, Hadoop has detected the right version of libcrypto.so and optimization will work.

If you see false in this row, you do not have the correct version; install or link libcrypto.so as described above.

Create an Encryption Key

Create a "master" encryption key for the new encryption zone. Each key will be specific to an encryption zone.

Ranger supports AES/CTR/NoPadding as the cipher suite. (The associated property is listed under HDFS -> Configs in the Advanced hdfs-site list.)

Key size can be 128 or 256 bits.

Recommendation: create a new superuser for key management. In the following examples, superuser encr creates the key. This separates the data access role from the encryption role, strengthening security.

Create an Encryption Key using Ranger KMS (Recommended)

In the Ranger Web UI screen:

  1. Choose the Encryption tab at the top of the screen.

  2. Select the KMS service from the drop-down list.

To create a new key:

  1. Click on "Add New Key":

  2. Add a valid key name.

  3. Select the cipher name. Ranger supports AES/CTR/NoPadding as the cipher suite.

  4. Specify the key length, 128 or 256 bits.

  5. Add other attributes as needed, and then save the key.

For information about rolling over and deleting keys, see Using the Ranger Key Management Service in the Ranger KMS Administration Guide.

[Warning]Warning

Do not delete an encryption key while it is in use for an encryption zone. This will result in loss of access to data in that zone.

Create an Encryption Key using the CLI

The full syntax of the hadoop key create command is as follows:

hadoop key create <keyname> [-cipher <cipher>]
                            [-size <size>]
                            [-description <description>]
                            [-attr <attribute=value>]
                            [-provider <provider>]
                            [-help]

Example:

# su - encr

# hadoop key create <key_name> [-size <number-of-bits>]

The default key size is 128 bits. The optional -size parameter supports 256-bit keys, and requires the Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction Policy File on all hosts in the cluster. For installation information, see the Ambari Security Guide.

Example:

# su - encr

# hadoop key create key1

To verify creation of the key, list the metadata associated with the current user:

# hadoop key list -metadata

For information about rolling over and deleting keys, see Using the Ranger Key Management Service in the Ranger KMS Administration Guide.

[Warning]Warning

Do not delete an encryption key while it is in use for an encryption zone. This will result in loss of access to data in that zone.

Create an Encryption Zone

Each encryption zone must be defined using an empty directory and an existing encryption key. An encryption zone cannot be created on top of a directory that already contains data.

Recommendation: use one unique key for each encryption zone.

Use the crypto createZone command to create a new encryption zone. The syntax is:

-createZone -keyName <keyName> -path <path>

where:

  • -keyName: specifies the name of the key to use for the encryption zone.

  • -path specifies the path of the encryption zone to be created. It must be an empty directory.

[Note]Note

The hdfs service account can create zones, but cannot write data unless the account has sufficient permission.

Recommendation: Define a separate user account for the HDFS administrator, and do not provide access to keys for this user in Ranger KMS.

Steps:

  1. As HDFS administrator, create a new empty directory. For example:

    # hdfs dfs -mkdir /zone_encr

  2. Using the encryption key, make the directory an encryption zone. For example:

    # hdfs crypto -createZone -keyName key1 -path /zone_encr

    When finished, the NameNode will recognize the folder as an HDFS encryption zone.

  3. To verify creation of the new encryption zone, run the crypto -listZones command as an HDFS administrator:

    -listZones

    You should see the encryption zone and its key. For example:

    $ hdfs crypto -listZones 
    /zone_encr  key1
    [Note]Note

    The following property (in the hdfs-default.xml file) causes listZone requests to be batched. This improves NameNode performance. The property specifies the maximum number of zones that will be returned in a batch.

    dfs.namenode.list.encryption.zones.num.responses

    The default is 100.
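    For example, to raise the batch size, the property could be set in hdfs-site.xml (the value shown here is illustrative):

    ```xml
    <property>
      <name>dfs.namenode.list.encryption.zones.num.responses</name>
      <value>200</value>
    </property>
    ```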

To remove an encryption zone, delete the root directory of the zone. For example:

hdfs dfs -rm -R /zone_encr

Copy Files from/to an Encryption Zone

To copy existing files into an encryption zone, use a tool like distcp.

Note: for separation of administrative roles, do not use the hdfs user to create encryption zones. Instead, designate another administrative account for creating encryption keys and zones. See Creating an HDFS Admin User for more information.

The files will be encrypted using a file-level key generated by the Ranger Key Management Service.

DistCp Considerations

DistCp is commonly used to replicate data between clusters for backup and disaster recovery purposes. This operation is typically performed by the cluster administrator, via an HDFS superuser account.

To retain this workflow when using HDFS encryption, a new virtual path prefix has been introduced, /.reserved/raw/. This virtual path gives super users direct access to the underlying encrypted block data in the file system, allowing super users to distcp data without requiring access to encryption keys. This also avoids the overhead of decrypting and re-encrypting data. The source and destination data will be byte-for-byte identical, which would not be true if the data were re-encrypted with a new EDEK.

[Warning]Warning

When using /.reserved/raw/ to distcp encrypted data, make sure you preserve extended attributes with the -px flag. This is necessary because encrypted attributes such as the EDEK are exposed through extended attributes; they must be preserved to be able to decrypt the file. For example:

sudo -u encr hadoop distcp -px hdfs://cluster1-namenode:50070/.reserved/raw/apps/enczone hdfs://cluster2-namenode:50070/.reserved/raw/apps/enczone

This means that if the distcp operation is initiated at or above the encryption zone root, it will automatically create a new encryption zone at the destination (if one does not already exist).

Recommendation: To avoid potential mishaps, first create identical encryption zones on the destination cluster.
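As a quick illustration of the raw namespace, a superuser can list a zone through the /.reserved/raw/ prefix (the zone path /zone_encr is a placeholder); reading a file through this path returns the stored ciphertext rather than the plaintext:

```shell
# List an encryption zone through the raw virtual path; skipped where no HDFS client exists.
if command -v hdfs >/dev/null 2>&1; then
  hdfs dfs -ls /.reserved/raw/zone_encr
fi
```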

Copying between encrypted and unencrypted locations

By default, distcp compares file system checksums to verify that data was successfully copied to the destination.

When copying between an unencrypted and encrypted location, file system checksums will not match because the underlying block data is different. In this case, specify the -skipcrccheck and -update flags to avoid verifying checksums.
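A hypothetical copy from an unencrypted directory into an encryption zone on the same cluster might look like the following (both paths are placeholders):

```shell
# Copy into an encryption zone without checksum verification; skipped where no Hadoop client exists.
if command -v hadoop >/dev/null 2>&1; then
  hadoop distcp -update -skipcrccheck /data/plain /zone_encr/plain
fi
```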

Read and Write Files from/to an Encryption Zone

Clients and HDFS applications with sufficient HDFS and Ranger KMS permissions can read and write files from/to an encryption zone.

Overview of the client write process:

  1. The client writes to the encryption zone.

  2. The NameNode checks to make sure that the client has sufficient write access permissions. If so, the NameNode asks Ranger KMS to create a file-level key, encrypted with the encryption zone master key.

  3. The NameNode stores the file-level encrypted data encryption key (EDEK) generated by Ranger KMS as part of the file's metadata, and returns the EDEK to the client.

  4. The client asks Ranger KMS to decrypt the EDEK to a data encryption key (DEK), and uses the DEK to write encrypted data. Ranger KMS checks the user's permissions before decrypting the EDEK and returning the DEK to the client.

Overview of the client read process:

  1. The client issues a read request for a file in an encryption zone.

  2. The NameNode checks to make sure that the client has sufficient read access permissions. If so, the NameNode returns the file's EDEK and the encryption zone key version that was used to encrypt the EDEK.

  3. The client asks Ranger KMS to decrypt the EDEK. Ranger KMS checks whether the end user has permission to decrypt the EDEK.

  4. Ranger KMS decrypts and returns the (unencrypted) data encryption key (DEK).

  5. The client uses the DEK to decrypt and read the file.

The preceding steps take place through internal interactions between the DFSClient, the NameNode, and Ranger KMS.
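The EDEK/DEK envelope pattern behind these steps can be sketched with the openssl CLI. This is a toy illustration of the concept only, not the actual Ranger KMS protocol; all key material here is generated locally:

```shell
# Toy sketch of envelope encryption (the EDEK/DEK pattern used for HDFS encryption zones).
set -e
tmpdir=$(mktemp -d)
master_key=$(openssl rand -hex 32)   # stands in for the encryption zone key held by Ranger KMS
dek=$(openssl rand -hex 32)          # per-file data encryption key (DEK)

# "EDEK": the DEK encrypted under the master key, as stored in the file's metadata
edek=$(printf '%s' "$dek" | openssl enc -aes-256-ctr -pbkdf2 -pass pass:"$master_key" -base64 -A)

# Write path: encrypt the file contents with the DEK
printf 'secret data\n' > "$tmpdir/plain.txt"
openssl enc -aes-256-ctr -pbkdf2 -pass pass:"$dek" -in "$tmpdir/plain.txt" -out "$tmpdir/cipher.bin"

# Read path: decrypt the EDEK back to the DEK, then decrypt the file
dek2=$(printf '%s' "$edek" | openssl enc -d -aes-256-ctr -pbkdf2 -pass pass:"$master_key" -base64 -A)
openssl enc -d -aes-256-ctr -pbkdf2 -pass pass:"$dek2" -in "$tmpdir/cipher.bin" -out "$tmpdir/decrypted.txt"
cmp "$tmpdir/plain.txt" "$tmpdir/decrypted.txt" && echo "round trip OK"
```

Because only the master key can recover the DEK, a party holding the EDEK and ciphertext alone (as the NameNode and DataNodes do) cannot read the data.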

In the following example, the /zone_encr directory is an encrypted zone in HDFS.

To verify this, use the crypto -listZones command (as an HDFS administrator). This command lists the root path and the zone key for the encryption zone. For example:

# hdfs crypto -listZones
/zone_encr  key1

Additionally, the /zone_encr directory has been set up for read/write access by the hive user:

# hdfs dfs -ls /
 …
drwxr-x---   - hive   hive            0 2015-01-11 23:12 /zone_encr

The hive user can, therefore, write data to the directory.

The following examples use the copyFromLocal command to move a local file into HDFS.

[hive@blue ~]# hdfs dfs -copyFromLocal web.log /zone_encr
[hive@blue ~]# hdfs dfs -ls /zone_encr
Found 1 items
-rw-r--r--   1 hive hive       1310 2015-01-11 23:28 /zone_encr/web.log

The hive user can read data from the directory, and can verify that the file loaded into HDFS is readable in its unencrypted form.

[hive@blue ~]# hdfs dfs -copyToLocal /zone_encr/web.log read.log
[hive@blue ~]# diff web.log read.log
[Note]Note

For more information about accessing encrypted files from Hive and other components, see Configuring HDP Services for HDFS Encryption.

Users without access to KMS keys will be able to see file names (via the -ls command), but they will not be able to write data or read from the encrypted zone. For example, the hdfs user lacks sufficient permissions, and cannot access the data in /zone_encr:

[hdfs@blue ~]# hdfs dfs -copyFromLocal install.log /zone_encr
copyFromLocal: Permission denied: user=hdfs, access=EXECUTE, inode="/zone_encr":hive:hive:drwxr-x---

[hdfs@blue ~]# hdfs dfs -copyToLocal /zone_encr/web.log read.log
copyToLocal: Permission denied: user=hdfs, access=EXECUTE, inode="/zone_encr":hive:hive:drwxr-x---

Delete Files from an Encryption Zone

You cannot move data from an encryption zone to the global Trash bin outside of the zone.

To delete files from an encryption zone, use one of the following approaches:

  1. When deleting the file via the CLI, use the -skipTrash option. For example:

    hdfs dfs -rm -skipTrash /zone_name/file1

  2. (Hive only) Use PURGE, as in DROP TABLE ... PURGE. This skips the Trash bin even if the trash feature is enabled.