Chapter 3. HDFS Permissions

The Hadoop Distributed File System (HDFS) enforces permissions the same way on Windows and Linux deployments. The HDFS permissions model for files and directories shares much of the POSIX model; each file and directory is associated with an owner and a group.

 1. HDFS and the HadoopUsers Group

On each node where HDP is installed, HDP sets up a HadoopUsers group and creates a hadoop user in that group. The hadoop user is the superuser in HDP. This user:

  1. Is the owner of the HDP services installed on each Windows Server node.

  2. Is the HDFS superuser. This superuser can modify the permissions of any HDFS directory or file, regardless of owner.

  3. Is the Oozie proxy user.

  4. Is the WebHCat proxy user.
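
To illustrate what superuser status means for permission checks, the following is a minimal sketch of a POSIX-style owner/group/other check with a superuser bypass. It is a simplified model for explanation only, not HDFS source code; the names Inode, may_access, and may_chmod are hypothetical, and the rule that only the owner or the superuser may change a file's mode is assumed from the standard HDFS permissions model.

```python
from dataclasses import dataclass

# Per the text above, the hadoop user is the HDFS superuser in HDP.
SUPERUSER = "hadoop"

READ, WRITE, EXECUTE = 4, 2, 1  # POSIX-style permission bits


@dataclass
class Inode:
    """Hypothetical stand-in for an HDFS file or directory."""
    owner: str
    group: str
    mode: int  # octal mode, for example 0o640


def may_access(user, groups, inode, want):
    """Return True if `user` (a member of `groups`) holds all bits in `want`."""
    if user == SUPERUSER:
        return True  # permission checks never fail for the superuser
    if user == inode.owner:
        bits = (inode.mode >> 6) & 0o7  # owner class
    elif inode.group in groups:
        bits = (inode.mode >> 3) & 0o7  # group class
    else:
        bits = inode.mode & 0o7         # other class
    return (bits & want) == want


def may_chmod(user, inode):
    """Assumed rule: only the owner or the superuser may change the mode."""
    return user == SUPERUSER or user == inode.owner


report = Inode(owner="alice", group="analysts", mode=0o640)
print(may_access("bob", {"analysts"}, report, READ))   # group member can read
print(may_access("bob", {"analysts"}, report, WRITE))  # but cannot write
print(may_chmod("hadoop", report))                     # superuser can chmod
```

In this model the hadoop user passes every check regardless of the file's owner or mode, which is why control over that account amounts to full control over the data in HDFS.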

Note

HDP depends on the user accounts on each cluster node to enforce access rules on the data in HDFS.

 1.1. Active Directory Groups and Users

HDP resolves membership in the machine's local groups and skips groups that come from Active Directory. Although Active Directory groups are unsupported, Active Directory users are supported. You can create local groups on all nodes in the cluster and manage group membership individually on each node. These local groups can contain Active Directory users.

For a Windows domain user, such as CORP\$win_username, the Hadoop code ignores the domain portion and treats the user identity as just the username, $win_username. File ownership in HDFS and job submissions display as $win_username. Consequently, if the cluster is joined to multiple domain controllers, and the same username exists in multiple domains, Hadoop assumes they are the same user: DOMAIN1\$win_username = DOMAIN2\$win_username = DOMAIN3\$win_username.
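
The effect of dropping the domain portion can be sketched in a few lines of Python. This is only an illustration of the behavior described above, not the Hadoop implementation; the function name hadoop_short_name and the sample accounts are hypothetical.

```python
def hadoop_short_name(windows_account: str) -> str:
    # Keep only the part after the last backslash. If there is no
    # backslash, rpartition leaves the whole string in the last slot.
    return windows_account.rpartition("\\")[2]


# Hypothetical accounts: the same username in two domains, plus a bare name.
accounts = [r"DOMAIN1\alice", r"DOMAIN2\alice", "alice"]

short_names = {hadoop_short_name(a) for a in accounts}
print(short_names)  # all three collapse to one identity: {'alice'}
```

Because all three accounts map to the same short name, anything owned by one of them in HDFS is effectively owned by all of them.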

 1.2. Protecting Against Impersonation

If a user can create new accounts on a machine that has direct access to the HDP cluster, that user can create a hadoop account and use it to gain administrative access to HDP services.

You can manage user access on Windows in much the same way as on a non-secured Linux cluster, by:

  • Putting all of the cluster nodes behind a firewall.

  • Allowing HDFS client access and MapReduce job submission only from specific machines (or a specific subnet).

  • Giving users accounts with non-administrator permissions on those machines.
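
The idea of restricting access to specific machines or a subnet can be expressed as a simple membership test. The sketch below uses Python's standard ipaddress module; in practice this rule would be enforced by the firewall rather than by application code, and the 10.20.0.0/24 subnet and the function name is_allowed_client are assumptions made up for illustration.

```python
import ipaddress

# Hypothetical subnet from which HDFS clients and job submissions are allowed.
ALLOWED_NETWORKS = [ipaddress.ip_network("10.20.0.0/24")]


def is_allowed_client(address: str) -> bool:
    """Return True if `address` falls inside any allowed network."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in ALLOWED_NETWORKS)


print(is_allowed_client("10.20.0.17"))   # inside the allowed subnet
print(is_allowed_client("192.168.1.5"))  # outside the allowed subnet
```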