Codebook or data dictionary

A codebook or data dictionary is a structured document that describes the variables in a dataset and how they are arranged, thus ensuring transparency, reproducibility and efficient data analysis. It serves as a guide for researchers and other stakeholders, facilitating data management and reuse of the dataset.
When depositing a dataset in a repository, the codebook must be submitted together with the dataset as a separate document.
Where a description of the content, structure and variables of the dataset is already included in the ReadMe file, there is no need to prepare a separate codebook.
The codebook usually contains the following information:
  • Variable names
  • Variable labels (a short description of the variable or its full name)
  • Data type, e.g. text, numeric, date
  • Units, range and format
  • Codes for categorical data and their meanings (e.g. for a Likert scale variable, 1 – Strongly disagree, 2 – Disagree, 3 – Neutral, 4 – Agree, 5 – Strongly agree)
  • Calculations used to produce derived variables in the dataset, or notes on the relationships between variables
  • If necessary, a comment section, e.g. an explanation of how missing values were flagged in the dataset (NA/999), the type of measurement obtained (self-reported or otherwise measured), information on the measuring equipment, etc.
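As an illustration of the fields listed above, the following is a minimal sketch of a codebook kept as one record per variable and exported to CSV. The variable names (`age`, `satisfaction`, `bmi`), the column names and all values are hypothetical, chosen only for illustration; the actual layout should follow the requirements of the repository in question.

```python
import csv
import io

# Column names are an assumption; they mirror the fields listed above.
FIELDS = ["variable", "label", "type", "units_range_format",
          "codes", "derivation", "comment"]

# Hypothetical codebook entries, one record per variable in the dataset.
CODEBOOK = [
    {
        "variable": "age",
        "label": "Age of participant",
        "type": "numeric",
        "units_range_format": "years",
        "codes": "999 = missing",
        "derivation": "",
        "comment": "Self-reported",
    },
    {
        "variable": "satisfaction",
        "label": "Overall satisfaction (Likert scale)",
        "type": "numeric",
        "units_range_format": "1-5",
        "codes": "1 = Strongly disagree; 2 = Disagree; 3 = Neutral; "
                 "4 = Agree; 5 = Strongly agree",
        "derivation": "",
        "comment": "Self-reported",
    },
    {
        "variable": "bmi",
        "label": "Body mass index",
        "type": "numeric",
        "units_range_format": "kg/m^2",
        "codes": "",
        "derivation": "weight_kg / (height_m ** 2)",
        "comment": "Derived from measured weight and height",
    },
]


def codebook_to_csv(entries):
    """Serialise codebook records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(entries)
    return buffer.getvalue()


csv_text = codebook_to_csv(CODEBOOK)
print(csv_text)
```

Keeping one row per variable and one column per field makes the codebook easy to read alongside the dataset and easy to store as plain CSV.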
It is recommended to use codes for missing data instead of leaving fields blank, and to standardise these codes across the dataset (e.g. NA = not answered).
If values are missing for more than one reason, each reason may be given its own code (e.g. 999 = missed, 888 = not applicable, 777 = data entry error). It is important to choose codes that lie outside the range of possible values for the variable (e.g. missing age values can safely be coded as 999, as no participant’s age can take this value).
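A short sketch of how such codes might be handled before analysis is given below. It uses the three example codes from the text; the `ages` list, the range limits and the helper function names are hypothetical and serve only as illustration.

```python
# Missing-value codes taken from the example in the text.
MISSING_CODES = {999: "missed", 888: "not applicable", 777: "data entry error"}


def codes_outside_range(codes, minimum, maximum):
    """Return True if no missing-value code falls inside the valid range."""
    return all(not (minimum <= code <= maximum) for code in codes)


def split_missing(values, codes):
    """Separate valid values from coded missing values, counting each reason."""
    valid = []
    reasons = {}
    for value in values:
        if value in codes:
            reason = codes[value]
            reasons[reason] = reasons.get(reason, 0) + 1
        else:
            valid.append(value)
    return valid, reasons


# Hypothetical age column; 0-120 is an assumed plausible range for age in years.
ages = [34, 999, 51, 888, 27, 999]
assert codes_outside_range(MISSING_CODES, 0, 120)

valid_ages, missing_counts = split_missing(ages, MISSING_CODES)
print(valid_ages)       # values safe to use in calculations
print(missing_counts)   # number of missing values per reason
```

Checking the codes against the variable's valid range before using them guards against a real measurement being mistaken for a missing value.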
After creating the codebook, remember to save it in, or convert it to, a sustainable and widely compatible format. For more information on preferred formats, see “Preparing the dataset for long-term storage”.

Example

Example dataset (the first 10 rows are shown):
Example of a codebook corresponding to the dataset:
