Hash functions
Hash functions are mathematical algorithms that systematically and irreversibly transform original data into an unrecognisable form. Input data of any size (e.g. a name or a personal identification number) is transformed into a fixed-length string called a hash value or checksum.
A hash function is a one-way function: it is almost impossible to recover the original value from the hash value. The same input value always produces the same hash value, while even a small change to the input data produces a completely different hash value.
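The properties described above can be checked with a minimal sketch using Python's standard `hashlib` module. The helper name `sha256_hex` and the sample spellings are illustrative assumptions, not part of the original guidance.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Return the SHA-256 hash value of `text` as a hexadecimal string."""
    # The text is encoded as UTF-8 so that letters with diacritics are hashed consistently.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


first = sha256_hex("Līga Ozola")
again = sha256_hex("Līga Ozola")
changed = sha256_hex("Liga Ozola")  # a single character differs (ī -> i)
long_input = sha256_hex("a much longer input value " * 100)

print(first == again)                # True: same input, same hash value
print(first == changed)              # False: small change, different hash value
print(len(first), len(long_input))   # 64 64: fixed length regardless of input size
```

Because every character counts, the exact spelling and encoding of an identifier need to be kept consistent whenever hash values are meant to match.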
With hash functions, there is no need to create a key file. However, the original values that will be used to identify the persons must be stored in a separate table, and the hashing must be kept unchanged so that hash values can be matched back to those original values.
The table of original values should be stored separately from the pseudonymised research data, in a place where access control, encryption and security measures for sensitive data are ensured. It should be accessible only to the researcher who needs to know the real identities of the individuals (e.g. the Principal Investigator) and not to others. When using hash functions for pseudonymisation, it is recommended to use widely used algorithms such as the SHA-2 family (e.g. SHA-256) or SHA-3. Hash functions are available both in Python (e.g. the hashlib library) and in R (e.g. the digest package).
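As a brief illustration of the algorithms named above, the following sketch shows how a SHA-2 and a SHA-3 variant might be selected in Python's `hashlib`. The variable names are illustrative assumptions.

```python
import hashlib

# Illustrative identifier, encoded as UTF-8 bytes before hashing.
identifier = "Kārlis Priedītis".encode("utf-8")

sha2_digest = hashlib.sha256(identifier).hexdigest()    # SHA-256, part of the SHA-2 family
sha3_digest = hashlib.sha3_256(identifier).hexdigest()  # SHA3-256, part of the SHA-3 family

# hashlib.new() selects an algorithm by its name instead of a dedicated constructor.
by_name = hashlib.new("sha256", identifier).hexdigest()

print(len(sha2_digest), len(sha3_digest))  # both are 64 hexadecimal characters
print(sha2_digest == by_name)              # True: same algorithm, same input
```

Whichever algorithm is chosen, it should stay the same for the whole dataset, since different algorithms give different hash values for the same identifier.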
Example
Using the hash function in pseudonymisation step by step:
- Select an appropriate hashing algorithm (e.g. SHA-256)
- Select the identifier to be pseudonymised (e.g. “Jānis Bērziņš”, “Līga Ozola”, “Kārlis Priedītis”)
- The hash algorithm converts this text into a fixed-length string of digits and letters (for SHA-256, 64 hexadecimal characters)
- This hash value replaces the original identifier throughout the dataset
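The steps above can be sketched in Python as follows. This is a minimal illustration, assuming the dataset is held as a list of dictionaries. The function name `pseudonymise` and the field names `name`, `faculty`, `activity` and `hash_value` are hypothetical.

```python
import hashlib


def pseudonymise(identifier: str) -> str:
    """Steps 1 and 3: hash the identifier with SHA-256 and return the hex string."""
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()


# Step 2: records whose identifier (the name) is to be pseudonymised.
records = [
    {"name": "Jānis Bērziņš", "faculty": "Computer Science", "activity": "Low"},
    {"name": "Līga Ozola", "faculty": "Medical", "activity": "Medium"},
    {"name": "Kārlis Priedītis", "faculty": "Social Sciences", "activity": "Medium"},
]

# Step 4: the hash value replaces the original identifier in every record.
pseudonymised = [
    {
        "hash_value": pseudonymise(record["name"]),
        "faculty": record["faculty"],
        "activity": record["activity"],
    }
    for record in records
]

for row in pseudonymised:
    print(row["hash_value"], row["faculty"], row["activity"])
```

The printed hash values depend on the exact spelling and encoding of each name, so the same conventions would need to be applied every time the identifiers are hashed.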
Original data
| Name, surname | Faculty | Level of physical activity |
|---|---|---|
| Jānis Bērziņš | Computer Science | Low |
| Līga Ozola | Medical | Medium |
| Kārlis Priedītis | Social Sciences | Medium |
In the pseudonymised dataset, the first and last name are replaced by the hash value computed with the SHA-256 hash function.
Pseudonymised dataset (example)
| Hash value | Faculty | Level of physical activity |
|---|---|---|
| 7b9d67f94873e2d4c7874bc5742227c7b0d44fad343e29e9686dd2608b489985 | Computer Science | Low |
| 9f9dda4086fb4a6430ea3518aa5f724dc9d1c134d0eee44580edca19b62e0d3f | Medical | Medium |
| f3cf16d99e4e05d5ec35528bcbbe9d4fa40d86e7a19bba202d8f07b9a9b0d663 | Social Sciences | Medium |
Table of original values (contains personal data; kept separately in a secure place)
| Name, surname |
|---|
| Jānis Bērziņš |
| Līga Ozola |
| Kārlis Priedītis |
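The following sketch illustrates why a table of original values can stand in for a key file: the holder of that table can recompute the hash values and match them against the pseudonymised dataset. It is a minimal illustration under assumptions. The names `pseudonymise`, `secure_table`, `lookup` and `reidentify` are hypothetical, and an in-memory buffer stands in for a file that in practice would be written to a secured, access-controlled location.

```python
import csv
import hashlib
import io


def pseudonymise(identifier: str) -> str:
    """Return the SHA-256 hash value of the identifier as a hex string."""
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()


original_values = ["Jānis Bērziņš", "Līga Ozola", "Kārlis Priedītis"]

# Table of original values, written as CSV. The StringIO buffer is a stand-in
# for a separate file kept apart from the pseudonymised research data.
secure_table = io.StringIO()
writer = csv.writer(secure_table)
writer.writerow(["Name, surname"])
for name in original_values:
    writer.writerow([name])

# Re-identification: recompute the hash of every original value and build a
# mapping from hash value back to name.
lookup = {pseudonymise(name): name for name in original_values}


def reidentify(hash_value: str):
    """Return the original name for a hash value, or None if it is unknown."""
    return lookup.get(hash_value)


print(reidentify(pseudonymise("Līga Ozola")))  # Līga Ozola
print(reidentify("0" * 64))                    # None
```

Matching only works if the same algorithm, spelling and encoding are used as when the dataset was pseudonymised.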