Hash functions

Hash functions are mathematical algorithms that systematically and irreversibly transform original data into an unrecognisable form. Input data of any size (e.g. a name or a personal identification number) is transformed into a fixed-length string called a hash value or checksum.
A hash function is a one-way function: it is computationally infeasible to calculate the original value from the hash value alone. The same input value always produces the same hash value, whereas even a small change to the input data produces a completely different hash value.
With hash functions, there is no need to create a key file. However, the original values used to identify the persons must be stored in a separate table, and the hash values, once generated, must remain fixed so that they can be matched back to the original values.
The table of original values should be stored separately from the pseudonymised research data, in a location with access control, encryption and other security measures for sensitive data. It should be accessible only to the researcher who needs to know the real identities of the individuals (e.g. the Principal Investigator). When using hash functions for pseudonymisation, it is recommended to use widely used algorithms from the SHA-2 family (e.g. SHA-256) or SHA-3. Hash functions are available both in Python (e.g. the hashlib library) and in R (e.g. the digest package).
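
The properties described above (fixed length, same input giving the same hash value, a small change giving a different hash value) can be checked with a minimal Python sketch using the standard hashlib library. The helper name sha256_hex and the altered spelling "Līga Ozols" are illustrative assumptions, not part of any prescribed procedure.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Return the SHA-256 hash value of text as a hexadecimal string."""
    # hashlib operates on bytes, so the text is encoded as UTF-8 first.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


first = sha256_hex("Līga Ozola")
again = sha256_hex("Līga Ozola")     # identical input
changed = sha256_hex("Līga Ozols")   # hypothetical input with one letter changed

print(first == again)    # True: the same input always gives the same hash value
print(first == changed)  # False: a small change gives a different hash value
print(len(first), len(sha256_hex("a much longer input " * 100)))  # 64 64
```

Note that the result depends on the exact characters and the encoding, so spelling, capitalisation and spacing of the identifiers should be standardised before hashing.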

Example

Using the hash function in pseudonymisation step by step:
  1. Select an appropriate hashing algorithm (e.g. SHA-256)
  2. Select the identifier to be pseudonymised (e.g. “Jānis Bērziņš”, “Līga Ozola”, “Kārlis Priedītis”)
  3. The hash algorithm converts this text into a fixed-length string of numbers and letters; for SHA-256 this is 64 hexadecimal characters (e.g. “7b9d67f94873e2d4c7874bc5742227c7b0d44fad343e29e9686dd2608b489985”)
  4. This hash value replaces the original identifier throughout the dataset

Original data
Name, surname    | Faculty          | Level of physical activity
Jānis Bērziņš    | Computer Science | Low
Līga Ozola       | Medical          | Medium
Kārlis Priedītis | Social Sciences  | Medium

In the pseudonymised dataset, each first and last name is replaced by its hash value, calculated with the SHA-256 hash function.

Pseudonymised dataset (example)
Hash value                                                       | Faculty          | Level of physical activity
7b9d67f94873e2d4c7874bc5742227c7b0d44fad343e29e9686dd2608b489985 | Computer Science | Low
9f9dda4086fb4a6430ea3518aa5f724dc9d1c134d0eee44580edca19b62e0d3f | Medical          | Medium
f3cf16d99e4e05d5ec35528bcbbe9d4fa40d86e7a19bba202d8f07b9a9b0d663 | Social Sciences  | Medium

Table of original values (contains personal data; kept separately in a secure place)
Name, surname
Jānis Bērziņš
Līga Ozola
Kārlis Priedītis
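
The whole example can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names pseudonymise and reidentify, the dictionary keys and the in-memory lists are hypothetical, and in practice the table of original values would be written to a separate, secured location. The hash values produced by this sketch depend on the exact spelling and encoding of the names and may differ from the example values shown in the table above.

```python
import hashlib


def pseudonymise(identifier: str) -> str:
    """Return the SHA-256 hash value of an identifier (UTF-8 encoded)."""
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()


# Original data (contains personal data).
original_data = [
    {"name": "Jānis Bērziņš", "faculty": "Computer Science", "activity": "Low"},
    {"name": "Līga Ozola", "faculty": "Medical", "activity": "Medium"},
    {"name": "Kārlis Priedītis", "faculty": "Social Sciences", "activity": "Medium"},
]

# Pseudonymised dataset: the name is replaced by its hash value.
pseudonymised_data = [
    {
        "hash_value": pseudonymise(row["name"]),
        "faculty": row["faculty"],
        "activity": row["activity"],
    }
    for row in original_data
]

# Table of original values: to be kept separately in a secure place.
original_values = [row["name"] for row in original_data]


def reidentify(hash_value, names):
    """Find the original value by re-hashing each stored name (no key file needed)."""
    for name in names:
        if pseudonymise(name) == hash_value:
            return name
    return None


for row in pseudonymised_data:
    print(row["hash_value"], row["faculty"], row["activity"], sep=" | ")
```

Because the same input always gives the same hash value, the researcher holding the table of original values can match a hash value back to a person by hashing the stored names again, which is why no separate key file is required.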
