Research Data Management

A practical guide to best practices for research data and code management

Anonymizing data

Anonymizing data is an essential procedure to protect the privacy and confidentiality of individuals whose information is included in a dataset. This practice involves transforming data so that individuals cannot be identified from it, either directly or indirectly, while preserving the dataset's utility for analysis and research. In research studies that involve human participants, anonymization should always be considered alongside obtaining informed consent for data sharing or imposing access restrictions.

Anonymizing research data can be time-consuming and therefore costly. Early planning can help reduce the costs. The HKU Policy on Research Integrity describes ethical data collection and storage. In some cases, the researcher may wish to store two copies of the data: the original held in a dark archive, with or without an embargo period, and the other redacted or anonymized for sharing.

To minimize privacy risks, researchers are strongly advised to anonymize datasets that contain personal identifiers before retaining them for the long term. Personal data in research information should never be disclosed unless a participant has given consent to do so, ideally in writing.

Personal identifiers and sensitive data

A person's identity can be disclosed from direct identifiers such as names, addresses, telephone numbers or pictures. These can be easily redacted. More problematic are indirect identifiers, which could identify someone when linked to other publicly available information sources. These include information on workplace, occupation, salary, age, etc.
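The difference between the two kinds of identifier can be sketched in a few lines of Python. This is only an illustrative sketch, not a recommended procedure: the column names, the band widths and the sample records are all hypothetical. It drops the direct identifiers, replaces exact age and salary with ranges, and then counts how many records share each combination of the remaining indirect identifiers.

```python
from collections import Counter

# Hypothetical column names, chosen only for illustration.
DIRECT_IDENTIFIERS = {"name", "address", "phone"}
INDIRECT_IDENTIFIERS = ("workplace", "occupation", "age_band", "salary_band")


def to_band(value, width):
    """Replace an exact number with a range such as '30-39'."""
    low = (value // width) * width
    return f"{low}-{low + width - 1}"


def anonymize(record):
    """Drop direct identifiers and coarsen age and salary into bands."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    out["age_band"] = to_band(out.pop("age"), 10)            # assumed 10-year bands
    out["salary_band"] = to_band(out.pop("salary"), 20000)   # assumed 20,000-wide bands
    return out


def smallest_group(records, keys=INDIRECT_IDENTIFIERS):
    """Size of the rarest combination of indirect identifiers."""
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return min(counts.values())


# Made-up sample records.
participants = [
    {"name": "A. Chan", "address": "1 Example Rd", "phone": "0000 0001",
     "workplace": "Hospital X", "occupation": "nurse", "age": 34, "salary": 41000},
    {"name": "B. Lee", "address": "2 Example Rd", "phone": "0000 0002",
     "workplace": "Hospital X", "occupation": "nurse", "age": 38, "salary": 52000},
    {"name": "C. Wong", "address": "3 Example Rd", "phone": "0000 0003",
     "workplace": "Hospital X", "occupation": "surgeon", "age": 51, "salary": 150000},
]

shared = [anonymize(p) for p in participants]
print(shared[0])
print("smallest group size:", smallest_group(shared))
```

In this made-up sample the two nurses become indistinguishable from each other, but the surgeon remains the only record with that combination of workplace, occupation, age band and salary band. A group of size one means that removing names alone has not prevented indirect identification, so further generalization or suppression would be needed before sharing.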

The Hong Kong Personal Data (Privacy) Ordinance defines personal data as ‘any information relating directly or indirectly to a living individual, from which it is practicable for the identity of the individual to be directly or indirectly ascertained, and in a form in which access to or processing of the data is practicable.’

According to the European Union’s General Data Protection Regulation (GDPR), sensitive data, also referred to as "special categories of personal data", require a higher level of security because of their nature and their potential impact on individuals' privacy and rights. Sensitive data include information that reveals an individual's racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership. The regulation also treats as sensitive genetic data, biometric data (when processed to uniquely identify a person), data concerning health, and data concerning a person's sex life or sexual orientation.

 

More Resources:

The HKU Data Protection Office, Data Protection Awareness Training

Resources for anonymization or de-identification 


The UK Data Service Guide: 

Australian Research Data Commons: De-identifying data 

Office of the Privacy Commissioner for Personal Data, Hong Kong: Guidance on Personal Data Erasure and Anonymisation 

UK Anonymisation Network: The Anonymisation Decision Making Framework