Using Apache HiveQL
Also available as:
PDF

Generate surrogate keys

You can use the built-in SURROGATE_KEY user-defined function (UDF) to automatically generate numerical Ids for rows as you enter data into a table. The generated surrogate keys can replace wide, multiple composite keys.

Joins using the generated keys are faster than joins using strings. Using generated keys does not force data into a single node by a row number. You can generate keys as abstractions of natural keys. Surrogate keys have an advantage over UUIDs, which are slower and probabilistic.

The SURROGATE_KEY UDF generates a unique Id for every row that you insert into a table.

It generates keys based on the execution environment in a distributed system, which includes a number of factors, such as internal data structures, the state of a table, and the last transaction id. Surrogate key generation does not require any coordination between compute tasks.

The UDF takes no arguments or two arguments as follows:

  • Write Id bits
  • Task Id bits
  1. Create a students table in the default ORC format that has ACID properties.
    CREATE TABLE students (row_id INT, name VARCHAR(64), dorm INT);
  2. Insert data into the table. For example:
    INSERT INTO TABLE students VALUES (1, 'fred flintstone', 100), (2, 'barney rubble', 200);
  3. Create a version of the students table using the SURROGATE_KEY UDF.
    CREATE TABLE students_v2 
    (`ID` BIGINT DEFAULT SURROGATE_KEY(),
     row_id INT,
     name VARCHAR(64), 
     dorm INT, 
     PRIMARY KEY (ID) DISABLE NOVALIDATE);
  4. Insert data, which automatically generates surrogate keys for the primary keys.
    INSERT INTO students_v2 (row_id, name, dorm) SELECT * FROM students;
  5. Take a look at the surrogate keys.
    SELECT * FROM students_v2;
    +-----------------+---------------------+-------------------+-------------------+
    | students_v2.id  | students_v2.row_id  | students_v2.name  | students_v2.dorm  |
    +-----------------+---------------------+-------------------+-------------------+
    | 1099511627776   | 1                   | fred flintstone   | 100               |
    | 1099511627777   | 2                   | barney rubble     | 200               |
    +-----------------+---------------------+-------------------+-------------------+               
  6. Add the surrogate keys as a foreign key to another table, such as a student_grades table, to speed up subsequent joins of the tables.
    ALTER TABLE student_grades ADD COLUMNS (gen_id BIGINT);
     
    MERGE INTO student_grades g USING students_v2 s ON g.row_id = s.row_id
    WHEN MATCHED THEN UPDATE SET gen_id = s.id;
  7. Perform fast joins on the surrogate keys.