Image source: Open Science Badge from the Center for Open Science
Before getting into the concept of Open Data, it is important to understand what “data” refers to. Data are any type of information that has been collected, observed, generated, or created to validate original research findings.
According to NASA (2024), they include:
Primary data / Raw data | Data that are directly collected or created by researchers. |
Secondary data / Processed data | Data that are used by someone other than the person who collected or generated them. Often, this includes data that have been processed from their raw state to be more readily usable by others. |
As defined by the Open Knowledge Foundation in the Open Data Handbook, open data is:
“data that can be freely used, re-used and redistributed by anyone - subject only, at most, to the requirement to attribute and sharealike.”
In this context, open data should also be:
Sufficiently described and documented with appropriate metadata so that they can be easily understood and reused
Accessible with appropriate license, copyright, and citation information
Findable in an accredited or trustworthy resource, accompanied by a history of changes and versioning
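As a rough illustration of the three points above, the sketch below shows what a minimal dataset metadata record might look like, together with a small check for missing fields. The field names and example values are hypothetical and chosen only for illustration; real repositories define their own metadata schemas, so this is not the format of any particular service.

```python
import json

# Hypothetical field names for illustration only; real repositories
# define their own metadata schemas.
REQUIRED_FIELDS = [
    "title",
    "description",
    "creators",
    "license",
    "citation",
    "identifier",
    "version",
    "changelog",
]


def missing_fields(record):
    """Return the required fields that are absent or empty in the record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


# Placeholder values throughout; none of this describes a real dataset.
example_record = {
    "title": "Example survey responses",
    "description": "Anonymized responses with a codebook describing each variable.",
    "creators": ["A. Researcher"],
    "license": "CC-BY-4.0",
    "citation": "Researcher, A. (2024). Example survey responses (Version 1.0) [Data set].",
    "identifier": "",  # e.g. a DOI, typically assigned by the repository
    "version": "1.0",
    "changelog": ["1.0: initial release"],
}

if __name__ == "__main__":
    print(json.dumps(example_record, indent=2))
    print("Missing fields:", missing_fields(example_record))
```

Here the check reports only `identifier` as missing, since a persistent identifier is usually issued when the dataset is deposited.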
Open data can be of value to multiple stakeholders in many areas. Individual researchers can benefit from opening their data in the following ways:
Secure your research data in publicly accessible repositories for as long as they are of continuing value
Gain higher visibility for the research findings associated with the shared data
Increase your chances of being cited and credited for your data. One study found that journal articles with statements linking to data in a repository are associated with up to 25% higher citation impact on average (Colavizza et al., 2020)
Comply with the growing number of journal and funder requirements to share research data when an article is published or funding is awarded
Open data can also benefit the entire scientific community. It can:
Accelerate scientific discovery by providing greater access to research data, which facilitates replication
Improve the integrity of scientific research and scholarly records and reduce academic fraud
Enhance the global emergency response to intercontinental crises, e.g. the COVID-19 pandemic
Foster a global research culture of transparency and validation, making science inclusive and diverse with equitable sharing of knowledge
The FAIR data principles are a set of guidelines that help researchers make better use of their research data and engage a broader audience with them. First introduced in 2016 (Wilkinson et al., 2016), the principles aim to improve the discoverability, accessibility and reusability of data that are shared openly, making them more valuable and maximizing their use, re-use and impact. The principles have since been widely accepted and adopted by academic communities and research institutions worldwide.
The FAIR data principles specify that shared data must be Findable, Accessible, Interoperable, and Reusable.
Preparing your research data properly for sharing is crucial to ensuring their reusability, especially when they are to be shared or published openly.
Implementing good research data management practices across the research cycle can help you make your data open. These practices cover data management planning, documentation, organization, formatting, anonymization, and licensing.
Read more about best practices in research data management on the Research Data Services website.
Research data can be shared in a variety of locations. Some researchers opt to share data directly via email or personal websites, or to upload their data files to the journal publisher's website alongside their journal article. These options may seem convenient, but they are often not the best choice because accessibility is limited, long-term preservation is not guaranteed, and versioning is not supported.
In compliance with the FAIR data principles, a long-term repository that provides a permanent identifier is the most recommended option for data sharing.
When considering which data repository best suits your data, ask the following questions:
Does your funding sponsor (or journal publisher) require or recommend a specific data repository?
Is there a domain-specific repository that is widely used in your research field?
Does your organization/institution offer a data repository?
Do you think the tools offered by the repository for data discovery and distribution are suitable for your data?
Does the repository provide open data access and support the FAIR data principles (e.g. by offering persistent identifiers such as DOIs, and data licensing)?
According to the Open Science Training Handbook, as recommended by OpenAIRE, researchers may consider the following order of preference:
Use a disciplinary repository established for your research domain with recognized standards in your discipline
Use an institutional research data repository
Use other general repositories that are designed to accommodate multi-disciplinary research data
Search for other data repositories in a global registry such as re3data or FAIRsharing.
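The order of preference above can be sketched as a simple fall-through selection. This is only an illustrative sketch: the category labels, the input mapping, and the function name are hypothetical, and the repository names in the example are just sample inputs.

```python
# Categories from the list above, most preferred first.
PREFERENCE_ORDER = ["disciplinary", "institutional", "generalist"]

# Last resort: search a global registry rather than pick a known repository.
FALLBACK = ("registry search", "re3data or FAIRsharing")


def choose_repository(available):
    """Return (category, name) for the most preferred repository available.

    `available` maps a category label to a repository name, or to None or ""
    when no suitable repository of that kind exists. This mapping is a
    hypothetical input format used only for this sketch.
    """
    for category in PREFERENCE_ORDER:
        name = available.get(category)
        if name:
            return category, name
    return FALLBACK


if __name__ == "__main__":
    options = {
        "disciplinary": None,  # no suitable domain repository identified
        "institutional": "HKU DataHub",
        "generalist": "Zenodo",
    }
    print(choose_repository(options))  # ('institutional', 'HKU DataHub')
```

A repository that is required or recommended by a funder or journal would normally be checked before applying this order; the sketch leaves that step out.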
An external disciplinary or data-type-specific repository often follows discipline-specific metadata standards and data curation practices. Primary consideration should be given to these repositories, as they can better facilitate the discoverability, understanding, and reusability of shared datasets in a specific area of study. Researchers and research students are encouraged to seek advice from their colleagues or supervisors on locating suitable discipline-specific repositories in their subject fields.
For instance, the Trans-NIH BioMedical Informatics Coordinating Committee (BMIC) maintains a list of National Institutes of Health (NIH) supported domain-specific repositories here.
Researchers can also identify appropriate discipline-specific repositories by referring to a data repository registry like re3data.org or exploring the catalogue of databases available in the FAIRsharing collection.
The University of Hong Kong has an institutional data repository, HKU DataHub, powered by Figshare. HKU researchers and students can cite, store, publish and share their research data and other digital materials on this repository. It serves as a persistent ‘home’ for research data generated by HKU community members, providing long-term storage, global open access, identifier generation, and secure protection.
Visit HKU DataHub to explore research data and other digital scholarly outputs published by HKU researchers, and see HKU DataHub: The Guide for guidance on using the repository.
Several generalist repositories offer cost-free services and accounts for researchers from multiple disciplines to deposit their research data. They usually accept all types of data and are most commonly used for data that cannot go into a domain- or discipline-specific repository.
The Generalist Repository Ecosystem Initiative (GREI)
The GREI is an NIH initiative that brings together seven generalist repositories in a collaborative working group. The participating repositories are listed in the table below:
Figshare | The general version of the HKU institutional data repository. It offers 20 GB of free storage with a free private account. HKU researchers are recommended to use DataHub, which uses the same engine and interface, for a larger storage quota. |
Zenodo | Run by the CERN data centre to support long-term preservation and the open science movement in Europe. |
OSF (Open Science Framework) | A free and open-source project management tool that supports researchers in open science best practices throughout the entire project lifecycle. |
Harvard Dataverse Repository | A free data repository open to all researchers from any discipline, both inside and outside of the Harvard community. |
Dryad | An open data publishing platform and community committed to the open availability and routine re-use of all research data. |
Mendeley Data | A free and open generalist data repository, owned by Elsevier, for creating, sharing, accessing and citing FAIR data globally. |
Vivli | An independent, non-profit organization that has developed a global clinical research data sharing platform. The platform focuses on sharing individual participant-level data from completed clinical trials to serve the international research community. |
A comparison chart of the generalist repositories participating in the GREI is available here.
Colavizza, G., Hrynaszkiewicz, I., Staden, I., Whitaker, K., & McGillivray, B. (2020). The citation advantage of linking publications to research data. PLoS ONE, 15(4), e0230416. https://doi.org/10.1371/journal.pone.0230416
FASEB. (2024, July 29). Choosing Your Generalist Repository. https://dataworks.faseb.org/helpdesk/kb/choosing-a-generalist-repository
OpenAIRE. (2024, July 29). Guides for Researchers: How to find a trustworthy repository for your data. https://www.openaire.eu/find-trustworthy-data-repository
Open Knowledge Foundation. (2024). Open Data Handbook. What is Open Data? https://opendatahandbook.org/guide/en/what-is-open-data/
NASA. (2024). Open Science 101. https://nasa.github.io/Transform-to-Open-Science/os101-modules/
Wilkinson, M., Dumontier, M., Aalbersberg, I., et al. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3, 160018. https://doi.org/10.1038/sdata.2016.18