Chapter 3. Data Protection

Encryption is applied to electronic information to ensure its privacy and confidentiality. Wire encryption protects data as it moves into and through a Hadoop cluster over RPC, HTTP, Data Transfer Protocol (DTP), and JDBC.

The following list describes how data is protected while it is in motion:

  • Clients typically communicate directly with the Hadoop cluster. This data can be protected using:

    • RPC encryption: Clients interact directly with the Hadoop cluster through RPC. A client uses RPC to connect to the NameNode (NN) to initiate file read and write operations. RPC connections in Hadoop use Java’s Simple Authentication and Security Layer (SASL), which supports encryption.

    • Data Transfer Protocol: The NN gives the client the address of the first DataNode (DN) from which to read the block or to which to write it. The actual data transfer between the client and a DN uses Data Transfer Protocol, which can also be encrypted.

  • Users typically communicate with the Hadoop cluster using a browser or command-line tools. This data can be protected using:

    • HTTPS encryption: Users typically interact with Hadoop using a browser or a component CLI, while applications use REST APIs or Thrift. Encryption over HTTP is implemented with SSL support across the Hadoop cluster and for individual components such as Ambari.

    • JDBC: HiveServer2 implements encryption with the Java SASL protocol’s quality of protection (QOP) setting. With this setting, data moving between HiveServer2 and a JDBC client can be encrypted.

  • Additionally, communication between processes within the cluster can be protected using:

    • HTTPS encryption during shuffle: Starting in HDP 2.0, encryption during shuffle is supported. Data moves between the Mappers and the Reducers over HTTP in a step called shuffle. The Reducer initiates the connection to the Mapper to request data and acts as the SSL client.
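Both the RPC and JDBC items above rely on the SASL quality of protection (QOP) setting. The following Java fragment is a minimal sketch of how the three protection levels relate to SASL QOP tokens. It assumes the value names used by the Apache Hadoop `hadoop.rpc.protection` property (`authentication`, `integrity`, `privacy`) and the `sasl.qop` parameter of the Apache Hive JDBC URL, neither of which is named in this chapter. The class `WireQop`, its methods, and the host and principal shown in the usage note are hypothetical and are not part of Hadoop or Hive.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import javax.security.sasl.Sasl;

// Hypothetical helper for illustration only; not a Hadoop or Hive class.
final class WireQop {

    // Translates a protection level (assumed to follow the value names of
    // hadoop.rpc.protection) into the matching SASL QOP token.
    static String toSaslQop(String protection) {
        switch (protection.trim().toLowerCase(Locale.ROOT)) {
            case "authentication":
                return "auth";       // authentication only
            case "integrity":
                return "auth-int";   // authentication plus integrity checks
            case "privacy":
                return "auth-conf";  // authentication, integrity, and encryption
            default:
                throw new IllegalArgumentException(
                        "Unknown protection level: " + protection);
        }
    }

    // Builds the property map that a Java SASL client or server would receive.
    // Sasl.QOP is the standard JDK key "javax.security.sasl.qop".
    static Map<String, String> saslProperties(String protection) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put(Sasl.QOP, toSaslQop(protection));
        return props;
    }

    // Builds a HiveServer2 JDBC URL that requests the given protection level.
    // Assumes a Kerberos-secured HiveServer2; all arguments are placeholders.
    static String hiveJdbcUrl(String host, int port, String database,
                              String principal, String protection) {
        return "jdbc:hive2://" + host + ":" + port + "/" + database
                + ";principal=" + principal
                + ";sasl.qop=" + toSaslQop(protection);
    }
}
```

For example, `WireQop.hiveJdbcUrl("hs2.example.com", 10000, "default", "hive/_HOST@EXAMPLE.COM", "privacy")` yields a URL ending in `;sasl.qop=auth-conf`. Only `auth-conf` encrypts the payload; `auth` and `auth-int` leave the data readable on the wire.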

This chapter provides an overview of over-the-wire encryption in Hadoop. Data can be moved in and out of Hadoop over RPC, HTTP, Data Transfer Protocol, and JDBC. Network traffic over each of these protocols can be encrypted to provide privacy for data movement.
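Because each protocol is encrypted through its own setting, it can help to check them together. The sketch below is a hypothetical audit helper that takes a flat map of configuration properties and reports which protocols have encryption turned on. The property names are those used by Apache Hadoop 2.x and Apache Hive (`hadoop.rpc.protection`, `dfs.encrypt.data.transfer`, `dfs.http.policy`, `hive.server2.thrift.sasl.qop`, `mapreduce.shuffle.ssl.enabled`). They are not taken from this chapter and may differ in a particular HDP release. The `dfs.http.policy` check covers only the HDFS web endpoints, so other components such as Ambari would need their own checks.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical audit helper for illustration only; not a Hadoop class.
final class WireEncryptionAudit {

    // Each row: protocol label, property name, value that enables encryption.
    // Property names are assumed from Apache Hadoop 2.x and Apache Hive.
    private static final String[][] CHECKS = {
        {"RPC", "hadoop.rpc.protection", "privacy"},
        {"Data Transfer Protocol", "dfs.encrypt.data.transfer", "true"},
        {"HTTP (HDFS web endpoints)", "dfs.http.policy", "HTTPS_ONLY"},
        {"JDBC (HiveServer2)", "hive.server2.thrift.sasl.qop", "auth-conf"},
        {"Shuffle", "mapreduce.shuffle.ssl.enabled", "true"},
    };

    // Returns, per protocol, whether the supplied settings enable encryption.
    // A missing property counts as "not encrypted".
    static Map<String, Boolean> audit(Map<String, String> settings) {
        Map<String, Boolean> result = new LinkedHashMap<>();
        for (String[] check : CHECKS) {
            String actual = settings.get(check[1]);
            boolean enabled = actual != null
                    && actual.trim().equalsIgnoreCase(check[2]);
            result.put(check[0], enabled);
        }
        return result;
    }
}
```

In practice the input map would be filled from the cluster's core-site.xml, hdfs-site.xml, mapred-site.xml, and hive-site.xml files. The sketch leaves that loading step out so that it stays free of Hadoop dependencies.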

