LibGuides: Open Science: Open Data

What is Open Data

Image source: Open Science Badge from the Center for Open Science

What is Open Data?

Before getting into the concept of Open Data, it is important to understand what “data” refers to. Data are any type of information that has been collected, observed, generated, or created to validate original research findings.

A detailed list of the different types of research data is available on our Research Data Management guide.

The Open Knowledge Foundation gives the following definition in its Open Data Handbook:

“Open data is data that can be freely used, re-used and redistributed by anyone - subject only, at most, to the requirement to attribute and sharealike.”

In this context, open data should also be:

Sufficiently described and documented with appropriate metadata so that they can be easily understood and reused

Accessible with appropriate license, copyright, and citation information

Findable in an accredited or trustworthy resource, accompanied with history of changes and versioning

Benefits of Open Data

Open data can be of value to multiple stakeholders in many areas. Individual researchers can benefit from opening their data in the following ways:

Secure your research data in publicly accessible repositories for as long as they are of continuing value

Gain greater visibility for the research findings associated with the shared data

Increase the chances of being cited and receiving credit. One study found that journal articles with statements linking to data in a repository have up to 25% higher citation impact on average (Colavizza et al., 2020)

Comply with the increasing journal requirements and/or funder requirements on research data sharing upon journal article publication and/or receipt of awarded funding

Open data can also benefit the entire scientific community by helping to:

Accelerate scientific discovery by providing greater access to research data that facilitate replications

Improve integrity of scientific research and scholarly records and reduce academic fraud

Enhance the global emergency response to intercontinental crises, e.g. the COVID-19 pandemic

Foster a global research culture of transparency and validation, making science inclusive and diverse with equitable sharing of knowledge

The FAIR data principles

The FAIR data principles are a set of guidelines that help researchers make better use of their research data and reach a broader audience with them. First introduced in 2016 (Wilkinson et al., 2016), the principles aim to improve the discoverability, accessibility and reusability of data that are shared openly, making them more valuable and maximizing their use, re-use and impact. The principles have since been widely accepted and adopted by academic communities and research institutions worldwide.

The FAIR data principles specify that shared data should be Findable, Accessible, Interoperable, and Reusable.

Findable

F1. (Meta)data are assigned a globally unique and persistent identifier

F2. Data are described with rich metadata (defined by R1 below)

F3. Metadata clearly and explicitly include the identifier of the data they describe

F4. (Meta)data are registered or indexed in a searchable resource

Accessible

A1. (Meta)data are retrievable by their identifier using a standardised communications protocol

A1.1 The protocol is open, free, and universally implementable

A1.2 The protocol allows for an authentication and authorisation procedure, where necessary

A2. Metadata are accessible, even when the data are no longer available

Interoperable

I1. (Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation

I2. (Meta)data use vocabularies that follow FAIR principles

I3. (Meta)data include qualified references to other (meta)data

Reusable

R1. (Meta)data are richly described with a plurality of accurate and relevant attributes

R1.1 (Meta)data are released with a clear and accessible data usage license

R1.2 (Meta)data are associated with detailed provenance

R1.3 (Meta)data meet domain-relevant community standards
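Some of the principles above can be checked mechanically against a dataset's metadata record. The Python sketch below is a minimal illustration, not an official FAIR assessment tool. The field names (`identifier`, `description`, `license`, `provenance`), the `fair_gaps` function, and the placeholder DOI are all hypothetical, and only a subset of the principles (F1, F2, R1.1, R1.2) is covered.

```python
# Hypothetical mapping from metadata field names to the FAIR principle each
# one supports. The field names are illustrative, not a metadata standard.
REQUIRED_FIELDS = {
    "identifier": "F1 (globally unique, persistent identifier)",
    "description": "F2 (rich descriptive metadata)",
    "license": "R1.1 (clear data usage license)",
    "provenance": "R1.2 (detailed provenance)",
}


def fair_gaps(record: dict) -> list[str]:
    """Return the principles whose supporting field is missing or blank."""
    gaps = []
    for field, principle in REQUIRED_FIELDS.items():
        value = record.get(field)
        if value is None or not str(value).strip():
            gaps.append(principle)
    return gaps


# Made-up record for illustration; the DOI is a placeholder, not a real dataset.
record = {
    "identifier": "https://doi.org/10.1234/example-dataset",
    "description": "Example record used only to illustrate the check.",
    "license": "CC BY 4.0",
    # "provenance" is deliberately left out.
}

for gap in fair_gaps(record):
    print("Missing:", gap)
```

A check like this only confirms that a field is present. Whether the metadata are rich enough or follow community standards (R1.3) still needs human judgement.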

Preparing data for sharing

Properly preparing your research data for sharing is crucial to ensuring their reusability, especially when you plan to share or publish them openly.

Good research data management practices across the research cycle make it easier to produce open data. These practices cover data management planning, documentation, organization, formatting, anonymization, and licensing.

Read more about best practices for research data management on the Research Data Management Guide.
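As one illustration of the data documentation step, a first draft of a data dictionary can be generated directly from a tabular file. The following Python sketch uses only the standard library. The column names, the sample values, and the `draft_data_dictionary` and `guess_type` helpers are made up for this example, and the type guess is deliberately crude.

```python
import csv
import io


def guess_type(values: list[str]) -> str:
    """Crudely classify a column as integer, number, or text."""
    non_empty = [v for v in values if v.strip()]
    if not non_empty:
        return "empty"
    for label, cast in (("integer", int), ("number", float)):
        try:
            for v in non_empty:
                cast(v)
            return label
        except ValueError:
            continue
    return "text"


def draft_data_dictionary(csv_text: str) -> list[dict]:
    """List each column with a guessed type and a count of missing values."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return []
    entries = []
    for column in rows[0].keys():
        values = [row[column] or "" for row in rows]
        entries.append({
            "column": column,
            "type": guess_type(values),
            "missing": sum(1 for v in values if not v.strip()),
        })
    return entries


# Made-up sample data for illustration only.
sample = "participant_id,age,height_cm,site\n1,34,172.5,HK\n2,,165.0,HK\n"
for entry in draft_data_dictionary(sample):
    print(entry)
```

A draft like this is only a starting point. Units, allowed values, and a plain-language description of each variable still need to be written by the researcher.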

Where to share research data

Locations for data sharing

Research data can be shared in a variety of locations. Some researchers may opt to share data directly via email or personal websites, or to upload data files to the journal publisher's website alongside their article. These options may seem convenient, but they are often not the best choice because accessibility is limited, long-term preservation is not guaranteed, and versioning is not supported.

In compliance with the FAIR data principles, a long-term repository that provides a permanent identifier is the most recommended option for data sharing.
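A persistent identifier such as a DOI can be turned into a resolvable link through the doi.org resolver, so the data can be retrieved by identifier over a standard, open protocol (HTTPS). The Python sketch below only builds the request and does not send it. The helper names are hypothetical, the validation pattern is a simplified check rather than a full DOI grammar, and the `Accept` value is the CSL JSON media type used in DOI content negotiation, which may not be supported for every record.

```python
import re
import urllib.request

# Simplified pattern: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def normalize_doi(raw: str) -> str:
    """Strip common prefixes and return the bare DOI, or raise ValueError."""
    doi = raw.strip()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"Not a recognisable DOI: {raw!r}")
    return doi


def build_metadata_request(raw_doi: str) -> urllib.request.Request:
    """Prepare (but do not send) an HTTPS request for citation metadata."""
    url = "https://doi.org/" + normalize_doi(raw_doi)
    return urllib.request.Request(
        url, headers={"Accept": "application/vnd.citationstyles.csl+json"}
    )


# DOI of the FAIR principles paper cited in this guide.
request = build_metadata_request("doi:10.1038/sdata.2016.18")
print(request.full_url)
```

Passing the prepared request to `urllib.request.urlopen` would perform the actual lookup. That step is left out here so the example runs without network access.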

Selecting a repository

When deciding which data repository best suits your data, consider the following:

Does your funding sponsor (or journal publisher) require or recommend a specific data repository?

Is there a domain-specific repository that is widely used in your research field?

Does your organization/institution offer a data repository?

Do you think the tools offered by the repository for data discovery and distribution are suitable for your data?

Does the repository provide open data access and support the FAIR data principles (e.g. offer persistent identifiers like DOI, data licensing)?

According to the Open Science Training Handbook, as recommended by OpenAIRE, researchers may consider the order of preference as follows:

Use a disciplinary repository established for your research domain with recognized standards in your discipline

Use an institutional research data repository

Use other general repositories that are designed to accommodate multi-disciplinary research data

Search for other data repositories in a global registry such as re3data or FAIRsharing.
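The order of preference above amounts to a simple fall-through decision. A minimal Python sketch, with a hypothetical function name and parameter names and example repository names passed in by the caller, might look like this:

```python
from typing import Optional


def choose_repository(
    disciplinary: Optional[str] = None,
    institutional: Optional[str] = None,
    generalist: Optional[str] = None,
) -> str:
    """Return the first available option, following the order of preference."""
    candidates = [
        ("disciplinary", disciplinary),
        ("institutional", institutional),
        ("generalist", generalist),
    ]
    for kind, name in candidates:
        if name:
            return f"{name} ({kind} repository)"
    # Nothing suitable is known yet, so fall back to searching a registry.
    return "search a registry such as re3data or FAIRsharing"


print(choose_repository(institutional="HKU DataHub", generalist="Zenodo"))
```

The sketch deliberately ignores funder or journal mandates, which, where they exist, would normally be checked before applying this order.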

1. Disciplinary Repository

An external disciplinary or data-type-specific repository often follows discipline-specific metadata standards and data curation practices. These repositories should be considered first, as they better facilitate the discoverability, understanding, and reusability of shared datasets in a specific area of study. Researchers and research students are advised to ask their colleagues or supervisors for help in locating suitable discipline-specific repositories in their subject fields.

For instance, the Trans-NIH BioMedical Informatics Coordinating Committee (BMIC) maintains a list of National Institutes of Health (NIH) supported domain-specific repositories.

Researchers can also identify appropriate discipline-specific repositories by referring to a data repository registry like re3data.org or exploring the catalogue of databases available in the FAIRsharing collection.

2. HKU Institutional Data Repository: DataHub

The University of Hong Kong has an institutional data repository, HKU DataHub, powered by Figshare. HKU researchers and students can store, publish, share and cite their research data and other digital materials on this repository. It serves as a persistent ‘home’ for research data generated by HKU community members, providing long-term storage, global open access, identifier generation, and secure protection.

Visit HKU DataHub to explore research data and other digital scholarly outputs published by HKU researchers. For guidelines on how to use the repository, see HKU DataHub: The Guide.

To enhance the visibility of their research within the HKU community, HKU researchers who deposit their data primarily in a disciplinary repository may create an item record in DataHub with a link to where the data are stored. Read the guide for uploading linked files to DataHub.

3. Generalist Repository

Several generalist repositories offer cost-free services and accounts for researchers from multiple disciplines to deposit their research data. They usually accept all types of data and are most commonly used for data that cannot go into a domain- or discipline-specific repository.

The Generalist Repository Ecosystem Initiative (GREI)

The GREI is an NIH initiative that brings together seven generalist repositories in a collaborative working group. The participating repositories are described below:

Figshare is the general-purpose version of the platform that powers the HKU institutional data repository. A free account includes 20 GB of private storage. HKU researchers are recommended to use DataHub, which uses the same engine and interface, for a larger storage quota.

Zenodo is run by the CERN data centre to support long-term preservation and the open science movement in Europe.

OSF (Open Science Framework) is a free and open-source project management tool that supports researchers throughout their entire project lifecycle in open science best practices.

Harvard Dataverse Repository is a free data repository open to all researchers from any discipline, both inside and outside of the Harvard community.

Dryad is an open data publishing platform and community committed to the open availability and routine re-use of all research data.

Mendeley Data is a free and open generalist data repository, owned by Elsevier, for creating, sharing, accessing and citing FAIR data globally.

Vivli is an independent, non-profit organization that has developed a global clinical research data sharing platform. The platform focuses on sharing individual participant-level data from completed clinical trials to serve the international research community.

References

Colavizza, G., Hrynaszkiewicz, I., Staden, I., Whitaker, K., & McGillivray, B. (2020). The citation advantage of linking publications to research data. PLoS ONE, 15(4), e0230416. https://doi.org/10.1371/journal.pone.0230416

FASEB. (2024, July 29). Choosing Your Generalist Repository. https://dataworks.faseb.org/helpdesk/kb/choosing-a-generalist-repository

OpenAIRE. (2024, July 29). Guides for Researchers: How to find a trustworthy repository for your data. https://www.openaire.eu/find-trustworthy-data-repository

Open Knowledge Foundation. (2024). Open Data Handbook. What is Open Data? https://opendatahandbook.org/guide/en/what-is-open-data/

NASA. (2024). Open Science 101. https://nasa.github.io/Transform-to-Open-Science/os101-modules/

Wilkinson, M., Dumontier, M., Aalbersberg, I., et al. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3, 160018. https://doi.org/10.1038/sdata.2016.18