
Open Science

Open Data

osf open science badge for open data

Image source: Open Science Badge from the Centre for Open Science

 

What is Open Data? 

Before getting into the concept of Open Data, it is important to understand what “data” refers to. Data are any type of information that has been collected, observed, generated, or created to validate original research findings.

According to NASA (2024), they include: 

Primary data / Raw data

Data that are directly collected or created by researchers, including but not limited to: 

  • Responses to interviews, questionnaires, and surveys 

  • Data acquired from recorded measurements, including remote sensing data 

  • Data acquired from physical samples and specimens that form the base of many studies

  • Data generated from models and simulations 

Secondary data / Processed data

Data that are used by someone other than the person who collected or generated them. Often, this includes data that have been processed from their raw state to be more readily usable by others.

 

As defined by the Open Knowledge Foundation in the Open Data Handbook, open data is:

“data that can be freely used, re-used and redistributed by anyone - subject only, at most, to the requirement to attribute and sharealike.”

 

In this context, open data should also be: 

  • Sufficiently described and documented with appropriate metadata so that they can be easily understood and reused 

  • Accessible with appropriate license, copyright, and citation information 

  • Findable in an accredited or trustworthy resource, accompanied by a history of changes and versioning

Benefits of Open Data


Open data can be of value to multiple stakeholders in many areas. By opening their data, individual researchers can:

  • Secure your research data in publicly accessible repositories as long as they are of continuing value 

  • Get higher visibility of the research findings associated with the shared data 

  • Have more chances to be cited and receive credit for the data. One study found that journal articles with statements linking to data in a repository are associated with up to 25% higher citation impact on average (Colavizza et al., 2020)

  • Comply with the growing number of journal and funder requirements to share research data upon article publication or receipt of funding

 

Open data can also benefit the entire scientific community. For example, it can:

  • Accelerate scientific discovery by providing greater access to research data that facilitate replications 

  • Improve integrity of scientific research and scholarly records and reduce academic fraud 

  • Enhance global emergency response to intercontinental crises, e.g. the COVID-19 pandemic

  • Foster a global research culture of transparency and validation, making science inclusive and diverse with equitable sharing of knowledge  

The FAIR data principles


The FAIR data principles are a set of guidelines that help researchers make better use of their research data and reach a broader audience with them. First introduced by Wilkinson et al. (2016), the principles aim to improve the discoverability, accessibility and reusability of data that are shared openly, making the data more valuable and maximizing their use, re-use and impact. The principles have since been widely accepted and adopted by academic communities and research institutions worldwide.

The FAIR data principles specify that shared data must be Findable, Accessible, Interoperable, and Reusable.

Findable

F1. (Meta)data are assigned a globally unique and persistent identifier 

F2. Data are described with rich metadata (defined by R1 below) 

F3. Metadata clearly and explicitly include the identifier of the data they describe 

F4. (Meta)data are registered or indexed in a searchable resource 

Accessible

A1. (Meta)data are retrievable by their identifier using a standardised communications protocol 

A1.1 The protocol is open, free, and universally implementable 

A1.2 The protocol allows for an authentication and authorisation procedure, where necessary 

A2. Metadata are accessible, even when the data are no longer available 

Interoperable

I1. (Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation 

I2. (Meta)data use vocabularies that follow FAIR principles 

I3. (Meta)data include qualified references to other (meta)data 

Reusable 

R1. (Meta)data are richly described with a plurality of accurate and relevant attributes 

R1.1 (Meta)data are released with a clear and accessible data usage license 

R1.2 (Meta)data are associated with detailed provenance 

R1.3 (Meta)data meet domain-relevant community standards 
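
As a rough illustration of how a few of these principles might translate into a machine-readable record, the sketch below checks a metadata dictionary for a persistent identifier (F1), a title (F2), a usage license (R1.1), and provenance (R1.2). The field names (`identifier`, `title`, `license`, `provenance`), the helper `missing_fair_fields`, and the example DOI are hypothetical and chosen only for illustration; they are not prescribed by the FAIR principles or by any particular repository.

```python
# Hypothetical field names mapped to the FAIR principle each one illustrates.
REQUIRED_FIELDS = {
    "identifier": "F1: globally unique and persistent identifier",
    "title": "F2: rich descriptive metadata",
    "license": "R1.1: clear and accessible data usage license",
    "provenance": "R1.2: detailed provenance",
}


def missing_fair_fields(record: dict) -> list[str]:
    """Return the names of required fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


# Example record; the DOI below is a made-up placeholder, not a real dataset.
example_record = {
    "identifier": "https://doi.org/10.1234/example.dataset",
    "title": "Survey responses on commuting habits (illustrative)",
    "license": "CC BY 4.0",
    "provenance": "",  # left empty on purpose to show a failed check
}

for field in missing_fair_fields(example_record):
    print(f"Missing {field} ({REQUIRED_FIELDS[field]})")
```

Passing a check like this does not make a dataset FAIR; it only flags obvious gaps before deposit. Repositories typically apply their own, richer metadata schemas.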

Preparing data for sharing


Properly preparing your research data for sharing is crucial to ensuring their reusability, especially when you plan to share or publish them openly.

Implementing good research data management practices across the research cycle makes it easier to produce open data. These practices cover data management planning, documentation, organization, formatting, anonymization, and licensing.
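
To make one of these preparation steps concrete, the sketch below replaces a direct identifier column in a small CSV with a keyed hash before sharing. It is an illustration under stated assumptions: the column names, the sample rows, and the helper functions `pseudonymize` and `pseudonymize_csv` are hypothetical. Keyed hashing is pseudonymization rather than full anonymization, so other columns may still allow re-identification and should be reviewed separately.

```python
import csv
import hashlib
import hmac
import io


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Map an identifier to a stable keyed hash (shortened for readability)."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:12]


def pseudonymize_csv(text: str, id_column: str, secret_key: bytes) -> str:
    """Return a copy of the CSV text with one column pseudonymized."""
    reader = csv.DictReader(io.StringIO(text))
    output = io.StringIO()
    writer = csv.DictWriter(
        output, fieldnames=reader.fieldnames, lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        row[id_column] = pseudonymize(row[id_column], secret_key)
        writer.writerow(row)
    return output.getvalue()


# Made-up sample data; the key must be kept private and never shared.
raw = (
    "participant_id,age_group\n"
    "alice@example.org,30-39\n"
    "bob@example.org,40-49\n"
)
shared = pseudonymize_csv(raw, "participant_id", secret_key=b"keep-this-key-private")
print(shared)
```

The same key always yields the same pseudonym, so records belonging to one participant remain linkable across files without exposing the original identifier.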

Read more about best practices in research data management on the Research Data Services website.

Where to share research data

Locations for data sharing 

Research data can be shared in a variety of locations. Some researchers opt to share data directly via email or personal websites, or to upload data files to the journal publisher's website alongside their article. These options may seem convenient, but they are often not the best choice because accessibility is limited, long-term preservation is not guaranteed, and versioning is not supported.

In compliance with the FAIR data principles, a long-term repository that provides a permanent identifier is the most recommended option for data sharing. 
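
One common form of permanent identifier is the DOI, which can be turned into a resolvable link through the `https://doi.org/` resolver. The sketch below normalizes a DOI string and builds that link. The helper name `doi_to_url` and its validation rule are assumptions made for illustration, and the check is deliberately loose; the example DOI is the one for Wilkinson et al. (2016) in the reference list.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Build a resolver URL from a DOI given in bare, 'doi:' or URL form."""
    doi = doi.strip()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Loose sanity check only: DOIs start with "10." and contain a slash.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"Not a DOI: {doi!r}")
    # Percent-encode characters that are unsafe in a URL path.
    return DOI_RESOLVER + quote(doi, safe="/")


print(doi_to_url("doi:10.1038/sdata.2016.18"))
# prints: https://doi.org/10.1038/sdata.2016.18
```

Because the link is built from the identifier rather than from the hosting site's address, a citation that uses it should keep working even if the repository reorganizes its pages.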

Selecting a repository


When considering which data repository would best suit your data, ask the following questions:

  • Does your funding sponsor (or journal publisher) require or recommend a specific data repository? 

  • Is there a domain-specific repository that is widely used in your research field?

  • Does your organization/institution offer a data repository? 

  • Do you think the tools offered by the repository for data discovery and distribution are suitable for your data? 

  • Does the repository provide open data access and support the FAIR data principles (e.g. offer persistent identifiers like DOI, data licensing)? 

 

According to the Open Science Training Handbook, as recommended by OpenAIRE, researchers may consider repositories in the following order of preference:

  1. Use a disciplinary repository established for your research domain with recognized standards in your discipline

  2. Use an institutional research data repository

  3. Use other general repositories that are designed to accommodate multi-disciplinary research data

  4. Search for other data repositories in a global registry such as re3data or FAIRsharing

 

1. Disciplinary Repository 

An external disciplinary or data-type-specific repository often follows discipline-specific metadata standards and data curation practices. Priority should be given to these repositories, as they can better facilitate the discoverability, understanding, and reusability of shared datasets in a specific area of study. Researchers and research students are advised to ask their colleagues or supervisors for help in locating suitable discipline-specific repositories in their subject fields.

For instance, the Trans-NIH BioMedical Informatics Coordinating Committee (BMIC) maintains a list of domain-specific repositories supported by the National Institutes of Health (NIH) here.

Researchers can also identify appropriate discipline-specific repositories by referring to a data repository registry like re3data.org or exploring the catalogue of databases available in the FAIRsharing collection. 

 

2. HKU Institutional Data Repository: DataHub 

The University of Hong Kong has an institutional data repository, HKU DataHub, powered by Figshare. HKU researchers and students can cite, store, publish and share their research data and other digital materials on this repository. It serves as a persistent ‘home’ for research data generated by HKU community members, providing long-term storage, global open access, identifier generation, and secure protection.

Visit HKU DataHub to explore research data and other digital scholarly outputs published by HKU researchers, and see HKU DataHub: The Guide for guidance on how to use the repository.

 

3. Generalist Repository 

Several generalist repositories offer cost-free services and accounts for researchers from multiple disciplines to deposit their research data. They usually accept all types of data and are most commonly used for data that cannot go into a domain- or discipline-specific repository.  

 


The Generalist Repository Ecosystem Initiative (GREI) 

GREI is an NIH initiative that brings together seven generalist repositories in a collaborative working group. The repositories participating in GREI are listed below:

  • Figshare is the general version of the HKU institutional data repository. It offers 20 GB of free storage with a cost-free private account. HKU researchers are recommended to use DataHub, which uses the same engine and interface, for a larger storage quota.

  • Zenodo is run by the CERN data centre to support long-term preservation and the open science movement in Europe.

  • OSF (Open Science Framework) is a free and open-source project management tool that supports researchers in open science best practices throughout the entire project lifecycle.

  • Harvard Dataverse Repository is a free data repository open to all researchers from any discipline, both inside and outside of the Harvard community.

  • Dryad is an open data publishing platform and community committed to the open availability and routine re-use of all research data.

  • Mendeley Data is a free and open generalist data repository, owned by Elsevier, for creating, sharing, accessing and citing FAIR data globally.

  • Vivli is an independent, non-profit organization that has developed a global clinical research data sharing platform. The platform focuses on sharing individual participant-level data from completed clinical trials to serve the international research community.

A comparison chart of the generalist repositories participating in GREI is available here

References

Colavizza, G., Hrynaszkiewicz, I., Staden, I., Whitaker, K., & McGillivray, B. (2020). The citation advantage of linking publications to research data. PLoS ONE, 15(4), e0230416. https://doi.org/10.1371/journal.pone.0230416

FASEB. (2024, July 29). Choosing Your Generalist Repository. https://dataworks.faseb.org/helpdesk/kb/choosing-a-generalist-repository

NASA. (2024). Open Science 101. https://nasa.github.io/Transform-to-Open-Science/os101-modules/

Open Knowledge Foundation. (2024). Open Data Handbook: What is Open Data? https://opendatahandbook.org/guide/en/what-is-open-data/

OpenAIRE. (2024, July 29). Guides for Researchers: How to find a trustworthy repository for your data. https://www.openaire.eu/find-trustworthy-data-repository

Wilkinson, M., Dumontier, M., Aalbersberg, I., et al. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3, 160018. https://doi.org/10.1038/sdata.2016.18