LibGuides: Research Data Management: Data Organization

File structures

Creating an organized file structure is critical for managing research data efficiently. By organizing files systematically, researchers can ensure that data are easily located and accessible.

First, consider the best hierarchy for your files and decide whether a deep or shallow hierarchy is preferable. When planning a hierarchical folder structure, aim for a balance between breadth and depth, so that no single category grows too large and you do not have to click through endless folders to find a file. As a rule of thumb, limit the hierarchy to three or four levels and keep each folder to no more than ten items.

Second, determine the criteria for organization. Research project files can be organized by:

Research activities, e.g. interviews, surveys, or focus groups

Data/File types, e.g. images (.tiff, .jpeg or .png), text (.rtf, .docx or .pdf), spreadsheets (.csv), or coding files (.py or .R), etc.

Other contextual information, e.g. project, experiment, date, time, location, etc.

Folder and sub-folder names should reflect their contents. Avoid using the names of researchers or staff. An example is shown below:

Image source: Harvard University Research Data Management, Directory Structure, File Organization
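
As a text-only illustration of the guidelines above, the following Python sketch creates a shallow, three-level folder hierarchy organized by research activity. The project name, folder names, and the helper create_project_folders are hypothetical and are not taken from the Harvard example; adapt them to your own project.

```python
from pathlib import Path

# Hypothetical layout: project > research activity > data stage (three levels).
LAYOUT = {
    "interviews": ["raw", "transcripts", "analysis"],
    "surveys": ["raw", "cleaned", "analysis"],
    "documentation": [],
}


def create_project_folders(root, project="Study2", layout=LAYOUT):
    """Create the folder hierarchy under root and return the created paths."""
    project_dir = Path(root) / project
    created = []
    for activity, subfolders in layout.items():
        activity_dir = project_dir / activity
        activity_dir.mkdir(parents=True, exist_ok=True)
        created.append(activity_dir)
        for name in subfolders:
            sub_dir = activity_dir / name
            sub_dir.mkdir(exist_ok=True)
            created.append(sub_dir)
    return created
```

Calling create_project_folders("path/to/research") would build the folders under that location; existing folders are left untouched.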

In some databases or software tools, a tag-based system may be available. The system allows users to assign each file one or more tags or labels. This makes it easier for files to have overlapping categories. Hence, the files can be categorized or sorted in multiple ways simultaneously (by subject, by author, and by the project it relates to, for example).
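
The idea behind a tag-based system can be sketched in a few lines of Python. This is only an illustration of the concept, not how any particular tool implements tagging; the file names and tags are hypothetical.

```python
from collections import defaultdict

# Hypothetical files, each carrying several tags (activity, project, author).
file_tags = {
    "20240104_Study2_Interview_Subject1.docx": {"interviews", "Study2", "Chan"},
    "20240110_Study2_Survey_Responses.csv": {"surveys", "Study2", "Lee"},
    "20240215_Study3_Interview_Subject4.docx": {"interviews", "Study3", "Chan"},
}

# Invert the mapping so that each tag points to every file that carries it.
files_by_tag = defaultdict(set)
for filename, tags in file_tags.items():
    for tag in tags:
        files_by_tag[tag].add(filename)

# The same file can now be found through any of its tags.
print(sorted(files_by_tag["interviews"]))
print(sorted(files_by_tag["Study2"]))
```

Unlike a folder, which holds each file in exactly one place, the same file appears under every tag assigned to it.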

File naming

A file naming convention is a framework for naming your files in a way that describes what they contain and how they relate to other files. A well-structured file naming system ensures that files are easily identifiable, searchable, and organized, thus facilitating smooth collaboration and data retrieval. Several best practices for naming your data files are listed below:

Be consistent and descriptive

It is important to develop a consistent approach to naming files. In collaborative research especially, agree on file naming conventions early, when the project begins. This will help you and your collaborators create and locate data files more easily.

Keep file names short but comprehensible, e.g. 20240104_Study2_Interview_Subject1.docx. Avoid overly generic names such as “data.csv” or “file1.docx”.

Include key elements for clarity

Try to incorporate the following key elements in your file names:

Project or experiment name or acronym;

Unique identifiers for experiment or sample IDs;

Location;

Date or date range in format YYYYMMDD;

Version numbers such as v01, v02, final.
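
One way to apply these elements consistently is to generate file names with a small helper instead of typing them by hand. The Python sketch below is only an illustration: the function build_filename, the order of the elements, and the sample values are assumptions, and your team may prefer a different order.

```python
from datetime import date


def build_filename(project, identifier, day, version, extension, location=None):
    """Join key elements with underscores: date, project, location, ID, version."""
    parts = [day.strftime("%Y%m%d"), project]  # date in YYYYMMDD format
    if location:
        parts.append(location)
    parts.append(identifier)
    parts.append(f"v{version:02d}")  # zero-padded version number, e.g. v01
    return "_".join(parts) + "." + extension.lstrip(".")


print(build_filename("Study2", "Subject1", date(2024, 1, 4), 1, "docx"))
# prints: 20240104_Study2_Subject1_v01.docx
```

Putting the date first in YYYYMMDD format means that sorting the files by name also sorts them chronologically.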

Avoid special characters and spaces

Special characters (e.g., @ # ? ! & % $ * : ; . , ^ “ ( ) < > / \ ~ |, etc.) and spaces can cause issues in different operating systems and software. Use underscores (_), hyphens (-) or capitalize the first letter of each word (i.e. Camel case) to separate words within a file name. For example, “file-name.doc”, “file_name.doc”, “FileName.doc”.
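
Existing file names can be cleaned up programmatically. The following Python sketch replaces spaces and special characters with underscores while leaving the file extension intact. The function sanitize_filename is hypothetical, and it assumes that only unaccented letters, digits, underscores and hyphens should be kept, which may be too strict for names in other scripts.

```python
import re


def sanitize_filename(name):
    """Replace spaces and special characters in the base name with underscores."""
    stem, dot, extension = name.rpartition(".")
    if not dot:  # no extension present
        stem, extension = name, ""
    # Assumption: keep only letters, digits, underscores and hyphens;
    # any run of other characters becomes a single underscore.
    stem = re.sub(r"[^A-Za-z0-9_-]+", "_", stem).strip("_")
    return f"{stem}.{extension}" if extension else stem


print(sanitize_filename("Interview notes (draft #2).docx"))
# prints: Interview_notes_draft_2.docx
```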

Use leading zeros for sequential numbering

When numbering files, use leading zeros to maintain numerical order in digital file systems (e.g., 001, 002, instead of 1, 2). This practice ensures that files are sorted correctly.
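
The effect of leading zeros can be seen in a short Python example that sorts file names as plain text, which is how many systems order them. The file names are hypothetical.

```python
numbers = (1, 2, 10)

unpadded = [f"sample_{n}.csv" for n in numbers]
padded = [f"sample_{n:03d}.csv" for n in numbers]  # pad to three digits

# Without leading zeros, file 10 is sorted before file 2.
print(sorted(unpadded))
# prints: ['sample_1.csv', 'sample_10.csv', 'sample_2.csv']

# With leading zeros, text order matches numerical order.
print(sorted(padded))
# prints: ['sample_001.csv', 'sample_002.csv', 'sample_010.csv']
```

Choose the number of digits according to the largest number of files you expect, e.g. three digits for up to 999 files.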

Document your file naming conventions

Create a document outlining the naming conventions used in your project, so that others in your lab, research team or department can follow this standard. For example, you can document your file naming convention in a README.txt file and share it with your team members.

More resources:

Harvard University Research Data Management, File Naming Conventions

NYU Libraries, Research Data Management, File Organization

UK Data Service, Research Data Management, Formatting Data, Organising

University College London, Naming conventions for electronic records

File formats

Proper file formatting ensures the longevity, accessibility, and usability of research data. Consider the following questions when choosing file formats suitable for your digital data (UK Data Service, 2024):

What format is best suited for data creation?

What format is best suited for data analyses and other planned uses?

What format is best suited for long-term sustainability and sharing of data?

Should you choose an open versus a proprietary format?

Should the format be lossy or not?

Is the format suitable for conversion?

For long-term accessibility and sustainability, the best practice is to use open, lossless and standard formats whenever possible. Open formats are not tied to specific software or vendors, which helps data remain accessible over time. Examples include CSV for tabular data, TXT for plain text, and TIFF for images. These formats are widely supported and can be read by a variety of software applications. Using them minimizes the risk of losing access to your data files if proprietary software becomes obsolete in the future.

For more details, refer to the table of file formats recommended by the UK Data Service. These formats are considered acceptable and are recommended especially for sharing, re-use, and long-term preservation.

After considering open formats, you may still have good reasons for choosing a closed or proprietary file format for your research. However, when you archive or preserve your data once your research is completed, you are advised to convert the original closed format and save an extra copy of your data in an open format.

Versioning (Version Control)

Version control is an important practice for keeping track of changes (e.g. additions, deletions, replacements) in your data files, ensuring that the latest version is being used, and preventing outdated information from being incorporated into current versions of your data files.

When developing a suitable version control strategy for managing your research data files systematically, you may consider whether:

Files are used by single or multiple users;

Files are stored in one or multiple locations simultaneously;

Versions across users or locations need to be synchronized or not (e.g. original raw data files are accessible only to the PI while the anonymized version is shared with collaborators in another location).

While the above factors may lead researchers to adopt different versioning practices, you are advised to consider the best practices listed below (UK Data Service, 2024):

Decide how many versions of a file to keep, which versions to keep, for how long and how to organize versions.

Identify milestone versions to keep, e.g. major versions rather than minor versions.

Uniquely identify different versions of files using a systematic naming convention, such as using version numbers or dates. For example, HealthSurvey-2024-APR-01 or HealthSurvey_v02.

Record changes made to a file when a new version is created.

Record relationships between items where needed, for example, between code and the data file it is run against; between data file and related documentation or metadata; or between multiple files.

Track the location of files if they are stored in a variety of locations, and regularly synchronize them if they need to be identical.
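
Two of these practices, systematic version naming and recording changes, can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the helper names next_version_name and change_log_entry, the base_vNN naming pattern, and the tab-separated log format are all hypothetical, and dedicated version control software may suit larger projects better.

```python
import re
from datetime import date


def next_version_name(existing_names, base, extension):
    """Return the next file name in the series base_vNN.extension."""
    pattern = re.compile(rf"^{re.escape(base)}_v(\d+)\.{re.escape(extension)}$")
    versions = []
    for name in existing_names:
        match = pattern.match(name)
        if match:
            versions.append(int(match.group(1)))
    next_number = max(versions, default=0) + 1  # start at v01 if none exist
    return f"{base}_v{next_number:02d}.{extension}"


def change_log_entry(filename, description, day):
    """Format one line for a plain-text change log kept alongside the data."""
    return f"{day.isoformat()}\t{filename}\t{description}"


existing = ["HealthSurvey_v01.csv", "HealthSurvey_v02.csv"]
new_name = next_version_name(existing, "HealthSurvey", "csv")
print(new_name)
# prints: HealthSurvey_v03.csv
print(change_log_entry(new_name, "Removed duplicate responses", date(2024, 4, 1)))
```

Each new version receives a unique name, and the log line records what changed and when, so earlier versions never need to be overwritten.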

More resources:

Harvard University Research Data Management, Version Control

UK Data Service, Research Data Management, Formatting Data, Versioning

University College London, Version Control of Electronic Records