User Guide

Pattern-Based Anonymization Rules

Pattern-based rules anonymize data that matches one or more patterns. An optional extract pattern narrows each match to the part that should be anonymized.

Required and Optional Fields

  • name

  • description (optional)

  • rule_id (should be set to Pattern)

  • patterns

  • extract (optional)

  • include_files (optional)

  • exclude_files (optional)

  • action (optional, default value is ANONYMIZE)

  • replace_value (optional, applicable only when action=REPLACE)

  • shared (optional, default value is true)

  • enabled (optional, default value is true)

For more information on each field, refer to Fields Used for Defining Anonymization Rules.
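The field list above can be read as a small validation step. The following Python sketch is only an illustration of that reading, not the product's own loader: normalize_rule, REQUIRED, and DEFAULTS are hypothetical names, and the defaults are copied from the list above.

```python
import json

# Fields that the list above does not mark as optional.
REQUIRED = ("name", "rule_id", "patterns")
# Default values stated in the list above.
DEFAULTS = {"action": "ANONYMIZE", "shared": True, "enabled": True}

def normalize_rule(raw_json):
    """Parse a rule definition, check required fields, and fill in defaults (hypothetical helper)."""
    rule = json.loads(raw_json)
    missing = [field for field in REQUIRED if field not in rule]
    if missing:
        raise ValueError("missing required fields: " + ", ".join(missing))
    # replace_value is documented as applicable only when action=REPLACE.
    action = rule.get("action", DEFAULTS["action"])
    if "replace_value" in rule and action != "REPLACE":
        raise ValueError("replace_value applies only when action is REPLACE")
    # Values given in the rule take precedence over the defaults.
    return {**DEFAULTS, **rule}

rule = normalize_rule('{"name": "EMAIL", "rule_id": "Pattern", "patterns": ["[a-z]+@[a-z]+"], "shared": false}')
print(rule["action"], rule["shared"], rule["enabled"])
```

In this sketch a rule that omits action, shared, or enabled still ends up with a complete set of values, which is how the defaults in the list are meant to be understood.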

Rule Definition Example (without extract)

    {
      "name": "EMAIL",
      "rule_id": "Pattern",
      "patterns": ["(?<![a-z0-9._%+-])[a-z0-9._%+-]+@[a-z0-9.-]+\\.[a-z]{2,6}(?![a-z0-9._%+-])$?"],
      "shared": false
    }

The content of the input file version.txt is:

Hadoop 2.7.3.2.5.0.0-1245
Subversion git@github.com:hortonworks/hadoop.git -r cb6e514b14fb60e9995e5ad9543315cd404b4e59
Compiled by jenkins on 2016-08-26T00:55Z

The content of the output file version.txt, with the email address anonymized, is:

Hadoop 2.7.3.2.5.0.0-1245
Subversion ‡qpe@unqfay.mjp‡:hortonworks/hadoop.git -r cb6e514b14fb60e9995e5ad9543315cd404b4e59
Compiled by jenkins on 2016-08-26T00:55Z
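A pattern can be tried against sample content before it is added to a rule. The following Python sketch is one way to do that locally; it is not the anonymization engine itself. It assumes case-insensitive matching (the pattern lists only lowercase ranges), and find_matches is a hypothetical helper. The optional $? suffix from the rule above is left out because Python's re module does not accept a quantifier on an anchor, and other regex syntax may also differ between engines.

```python
import re

# EMAIL pattern after JSON unescaping: "\\." in the rule file is "\." in the regex.
EMAIL_PATTERN = r"(?<![a-z0-9._%+-])[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,6}(?![a-z0-9._%+-])"

def find_matches(pattern, text):
    """Return every substring of text that the pattern matches (hypothetical helper)."""
    # Case-insensitive matching is an assumption, not documented engine behavior.
    return [m.group(0) for m in re.finditer(pattern, text, flags=re.IGNORECASE)]

sample = (
    "Hadoop 2.7.3.2.5.0.0-1245\n"
    "Subversion git@github.com:hortonworks/hadoop.git -r cb6e514b14fb60e9995e5ad9543315cd404b4e59\n"
    "Compiled by jenkins on 2016-08-26T00:55Z\n"
)
print(find_matches(EMAIL_PATTERN, sample))
```

Only git@github.com is reported, which is the same span that appears anonymized in the output file above.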

Rule Definition Example (with extract)

    {
      "name": "KEYSTORE",
      "rule_id": "Pattern",
      "patterns": ["oozie.https.keystore.pass=([^\\s]*)", "OOZIE_HTTPS_KEYSTORE_PASS=([^\\s]*)"],
      "extract": "=([^\\s]*)",
      "include_files": ["java_process.txt", "pid.txt", "ambari-agent.log", "oozie-env.cmd"],
      "shared": false
    }

The content of the input file oozie-env.cmd is:

oozie.https.keystore.pass=abcde
set OOZIE_HTTPS_KEYSTORE_PASS=12345

The rule first applies its configured patterns to the content of the input file:

"oozie.https.keystore.pass=([^\\s]*)", "OOZIE_HTTPS_KEYSTORE_PASS=([^\\s]*)"

The first pattern matches oozie.https.keystore.pass=abcde and the second matches OOZIE_HTTPS_KEYSTORE_PASS=12345.

Next, the extract pattern "=([^\\s]*)" is applied to each match to identify abcde and 12345, which are the values to be anonymized.

The content of the output file oozie-env.cmd is:

oozie.https.keystore.pass=‡vvdwa‡
set OOZIE_HTTPS_KEYSTORE_PASS=‡zdowg‡

The values of oozie.https.keystore.pass and OOZIE_HTTPS_KEYSTORE_PASS have been anonymized.
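The two-stage behavior described above (match with a pattern, then narrow the match with the extract pattern) can be sketched in Python as follows. This is an illustration under assumptions, not the product's implementation: mask_extracted is a hypothetical helper, the fixed "****" placeholder stands in for whatever replacement the engine generates, and matching is done case-sensitively here.

```python
import re

# Patterns and extract pattern from the KEYSTORE rule, after JSON unescaping.
PATTERNS = [r"oozie.https.keystore.pass=([^\s]*)", r"OOZIE_HTTPS_KEYSTORE_PASS=([^\s]*)"]
EXTRACT = r"=([^\s]*)"

def mask_extracted(text, placeholder="****"):
    """Replace only the extracted part of each pattern match (hypothetical helper)."""
    def replace(match):
        whole = match.group(0)
        # Stage 2: apply the extract pattern inside the text matched in stage 1.
        inner = re.search(EXTRACT, whole)
        if inner is None:
            return whole
        start, end = inner.span(1)
        return whole[:start] + placeholder + whole[end:]

    for pattern in PATTERNS:
        # Stage 1: find every occurrence of the rule pattern.
        text = re.sub(pattern, replace, text)
    return text

sample = "oozie.https.keystore.pass=abcde\nset OOZIE_HTTPS_KEYSTORE_PASS=12345\n"
print(mask_extracted(sample))
```

The property names and the leading set stay untouched; only the text captured by the extract pattern is replaced.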

For more examples, refer to Examples of Pattern-Based Anonymization Rules.

Examples of Pattern-Based Anonymization Rules

This section includes examples of commonly used pattern-based anonymization rules.

Example 1: Mask by pattern across all log files, without extract pattern

To mask all email addresses in all log files, use the following rule definition:

{
  "name": "EMAIL",
  "rule_id": "Pattern",
  "patterns": ["(?<![a-z0-9._%+-])[a-z0-9._%+-]+@[a-z0-9.-]+\\.[a-z]{2,6}(?![a-z0-9._%+-])"],
  "include_files": ["*.log*"],
  "shared": false
}

Example 2: Mask by pattern across all log files, with extract pattern

To mask encryption keys that are logged in the format Key=12.., where the value consists of 64 hexadecimal characters, use the following rule definition:

{
  "name": "ENC_KEYS",
  "rule_id": "Pattern",
  "patterns": ["Key=[a-f\\d]{64}\\s"],
  "extract": "=([a-f\\d]{64})",
  "include_files": ["*.log*"],
  "shared": false
}

Input data, test.log, is:

encryption key=1234567890adc1234567aaabc1234567890adc1234567aaabc12345678901234 for keystore
derby.system.home=null

Output data, test.log, with the encryption keys anonymized, is:

encryption key=‡8697685738fnx1736987qigyx7611731027yds0096404hlsph91727138403654‡ for keystore
derby.system.home=null
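Note that the rule pattern is written as Key= while the sample line contains key= in lowercase. The Python sketch below checks the pattern both ways. It suggests, as an inference from this example rather than a documented guarantee, that the pattern only matches the sample when case is ignored. The variable names are illustrative.

```python
import re

# Pattern and extract pattern from the ENC_KEYS rule, after JSON unescaping.
PATTERN = r"Key=[a-f\d]{64}\s"
EXTRACT = r"=([a-f\d]{64})"

line = "encryption key=1234567890adc1234567aaabc1234567890adc1234567aaabc12345678901234 for keystore\n"

# Case-sensitive search: "Key=" does not occur in the sample line.
case_sensitive = re.search(PATTERN, line)

# Case-insensitive search (assumed engine behavior): "key=" plus 64 hex characters matches.
case_insensitive = re.search(PATTERN, line, flags=re.IGNORECASE)

# The extract pattern isolates the 64-character value inside the match.
key = re.search(EXTRACT, case_insensitive.group(0), flags=re.IGNORECASE).group(1)

print(case_sensitive is None, len(key))
```

If a rule does not seem to fire on real data, letter case in the pattern is one of the first things worth checking.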

Example 3: Mask by pattern across all files, except a few files

To mask email addresses in all files, except hdfs-site.xml and .properties files, use the following rule definition:

{
  "name": "EMAIL",
  "rule_id": "Pattern",
  "patterns": ["(?<![a-z0-9._%+-])[a-z0-9._%+-]+@[a-z0-9.-]+\\.[a-z]{2,6}(?![a-z0-9._%+-])"],
  "exclude_files": ["*.properties", "hdfs-site.xml"],
  "shared": false
}

Input data, version.txt, is:

Hadoop 2.7.3.2.5.0.0-1245
Subversion git@github.com:hortonworks/hadoop.git -r cb6e514b14fb60e9995e5ad9543315cd404b4e59
Compiled by jenkins on 2016-08-26T00:55Z

Output file version.txt, with an anonymized email address, is:

Hadoop 2.7.3.2.5.0.0-1245
Subversion ‡qpe@unqfay.mjp‡:hortonworks/hadoop.git -r cb6e514b14fb60e9995e5ad9543315cd404b4e59
Compiled by jenkins on 2016-08-26T00:55Z
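The include_files and exclude_files lists in these examples use wildcard patterns such as *.log* and *.properties. The following Python sketch shows one plausible way to decide whether a rule applies to a file. It assumes shell-style wildcard matching on the file name and assumes that an exclusion wins over an inclusion; neither is confirmed here, and rule_applies is a hypothetical helper.

```python
from fnmatch import fnmatchcase

def rule_applies(file_name, include_files=None, exclude_files=None):
    """Decide whether a rule should process file_name (hypothetical helper)."""
    # Assumption: a file that matches any exclude pattern is always skipped.
    if exclude_files and any(fnmatchcase(file_name, glob) for glob in exclude_files):
        return False
    # Assumption: when include_files is given, the file must match at least one entry.
    if include_files:
        return any(fnmatchcase(file_name, glob) for glob in include_files)
    # With no include list, the rule applies to every file that is not excluded.
    return True

exclude = ["*.properties", "hdfs-site.xml"]
for name in ("version.txt", "hdfs-site.xml", "log4j.properties"):
    print(name, rule_applies(name, exclude_files=exclude))
```

Under these assumptions, version.txt is processed by the rule in Example 3, while hdfs-site.xml and any .properties file are skipped.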