Data Organization
Introduction
For data organization, we suggest using the 5S methodology, which is built around five keywords (Assmann et al., 2022):
- Sort: delete unnecessary files.
- Set in order: develop and document naming conventions and folder structures.
- Shine:
  - Comply with conventions.
  - Develop routines.
- Standardize:
  - Document rules and responsibilities.
  - Develop best practices and Standard Operating Procedures (SOPs).
- Sustain:
  - Regularly check whether rules are followed.
  - Implement improvements if necessary.
File naming
File names should ideally make it possible to link each file to a specific experiment or data collection (Bobrov et al., 2021). Within your research group, it is recommended (Bobrov et al., 2021; Bres et al., 2022) to:
- Choose a file and folder naming convention.
- Document your convention, for instance in Standard Operating Procedures (SOPs).
- Make the documentation available to all research group members.
- Stay consistent.
Recommendations for naming conventions
If you need to choose a file and folder naming convention, it is recommended (Assmann et al., 2022; Bobrov et al., 2021; Bres et al., 2022) to follow these guidelines:
- Favor alphabetically sortable names (e.g. starting with the date: YYYY-MM-DD).
- Limit file names to a maximum of 32 characters (32CharactersLooksExactlyLikeThis.txt). Short names are easier to find and keep paths short, whereas long names can cause technical problems. Thus, select a name that is as short as possible and as long as necessary.
- Favor names that reflect and are unique to the content (i.e. person, project ID/part, sample ID, experiment ID, status, data, version number and/or software name).
- Use periods only before file extensions.
- Do not use special characters or whitespace, which can confuse both machines and humans.
- Use leading zeros when using sequential numbering:
  - For a sequence of 1-10: 01, ..., 10
  - For a sequence of 1-100: 001, ..., 010, ..., 100
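Several of the conventions above can be checked automatically. The following is a minimal sketch, not a definitive implementation: the function names `check_file_name` and `padded_number` are hypothetical, the 32-character limit is assumed to apply to the name without its extension, and "special characters" is assumed to mean anything other than letters, digits, underscores and hyphens. Because the date prefix is given only as an example of a sortable name, that check is optional.

```python
import re
from datetime import datetime

MAX_STEM_LENGTH = 32  # limit from the list above, counted without the extension (assumption)
ALLOWED_STEM = re.compile(r"[A-Za-z0-9_-]+")  # assumed set of "safe" characters
DATE_PREFIX = re.compile(r"\d{4}-\d{2}-\d{2}")


def check_file_name(file_name: str, require_date_prefix: bool = False) -> list[str]:
    """Return the violated conventions; an empty list means the name passes."""
    stem, dot, _extension = file_name.rpartition(".")
    if not dot:  # no period at all, so the whole name is the stem
        stem = file_name
    problems = []
    if len(stem) > MAX_STEM_LENGTH:
        problems.append("name too long")
    if "." in stem:
        problems.append("period outside the file extension")
    if not ALLOWED_STEM.fullmatch(stem.replace(".", "")):
        problems.append("whitespace or special character")
    if require_date_prefix:
        prefix = stem[:10]
        try:
            if not DATE_PREFIX.fullmatch(prefix):
                raise ValueError(prefix)
            datetime.strptime(prefix, "%Y-%m-%d")  # rejects impossible dates such as 2024-13-40
        except ValueError:
            problems.append("no YYYY-MM-DD date prefix")
    return problems


def padded_number(number: int, highest: int) -> str:
    """Pad `number` with leading zeros to the width of `highest`."""
    return str(number).zfill(len(str(highest)))
```

With these assumptions, `check_file_name("Test data 2016.xlsx")` reports the whitespace, and `padded_number(7, 100)` yields `007`, which sorts correctly before `010` and `100`.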
Examples of file names
- Good structure: YYYY-MM-DD_JV_ProjectID_ExperimentID, with the IDs linked to a table containing data documentation such as metadata (Bobrov et al., 2021).
- Good names (Bres et al., 2022):
  - 2016-01-04_ProjectA_Ex1Test1_SmithE_v1.0.xlsx
  - 2000_USNM_379221_01_tiff
  - USNM_379221_01.tiff
- Bad names (Bres et al., 2022):
  - Test data 2016.xlsx
  - Meeting notes Jan 17
  - Notes Eric.txt
  - Final FINAL last version.docx
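Generating names from their components, rather than typing them by hand, is one way to stay consistent with a structure such as YYYY-MM-DD_JV_ProjectID_ExperimentID. The sketch below assumes that "JV" stands for a person's initials; the function `build_file_name` and its parameters are hypothetical.

```python
from datetime import date


def build_file_name(day: date, initials: str, project_id: str,
                    experiment_id: str, extension: str) -> str:
    """Join the components with underscores, date first so names sort chronologically."""
    stem = "_".join([day.isoformat(), initials, project_id, experiment_id])
    return f"{stem}.{extension}"
```

For example, `build_file_name(date(2016, 1, 4), "JV", "ProjectA", "Ex1", "xlsx")` returns `2016-01-04_JV_ProjectA_Ex1.xlsx`.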
Tools for simultaneous renaming of files
Multiple OS
Linux
Mac
Unix
- mv command
Windows
- Advanced Renamer
- Altap Salamander
- Ant Renamer
- Bulk Rename Utility
- ExifToolGUI
- Rename-It!
- Total Commander
- WildRename
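As a platform-independent alternative to the tools listed above, batch renaming can also be scripted. The following is a minimal sketch using Python's standard `pathlib` module; the function names `plan_renames` and `apply_renames` are hypothetical. It separates planning from execution so that the planned changes can be reviewed before any file is touched.

```python
from pathlib import Path


def plan_renames(folder: Path, old: str, new: str) -> list[tuple[Path, Path]]:
    """List (source, target) pairs for files in `folder` whose name contains `old`."""
    plan = []
    for source in sorted(folder.iterdir()):
        if source.is_file() and old in source.name:
            target = source.with_name(source.name.replace(old, new))
            plan.append((source, target))
    return plan


def apply_renames(plan: list[tuple[Path, Path]]) -> None:
    """Rename the files, refusing to overwrite anything that already exists."""
    for source, target in plan:
        if target.exists():
            raise FileExistsError(f"refusing to overwrite {target}")
        source.rename(target)
```

A typical use would be to call `plan_renames(folder, " ", "_")` to replace whitespace with underscores, inspect the returned pairs, and only then pass them to `apply_renames`. Working on a backed-up copy of the data is advisable.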
File versioning
If you decide to version your files, keep the following in mind (Bres et al., 2022):
- Decide how to version files with project partners.
- Define and write down what counts as a version change.
- Document version changes.
Options for file versioning include (Bres et al., 2022):
- In file names
- Within data (e.g. header, comment field)
- In text files (e.g. README file)
- Within a Version Control System (VCS) (e.g. git, Apache Subversion)
Manual file versioning
If you decide to version your files manually, it is recommended to:
- Use a version control table.
- Define responsibilities for completion of files.
- Use semantic versioning, MAJOR.MINOR.PATCH (Bobrov et al., 2021; Bres et al., 2022), for example:
  - Ex1Test1_SmithE_v1.0.0.xlsx
  - Ex1Test1_SmithE_v1.2.5.xlsx
  - Ex1Test1_SmithE_v2.1.1.xlsx
- Save milestone versions.
- Store obsolete versions separately after backup.
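Semantic version numbers embedded in file names, as in the examples above, can be read and incremented programmatically. The sketch below assumes that the version appears as a `_vMAJOR.MINOR.PATCH` suffix directly before the file extension; the functions `parse_version` and `bump_version` are hypothetical. Following the usual semantic versioning rule, incrementing a part resets the parts to its right to zero.

```python
import re

# Assumed pattern: "_v1.2.5" at the very end of the name, before the extension.
VERSION_SUFFIX = re.compile(r"_v(\d+)\.(\d+)\.(\d+)$")


def parse_version(file_name: str) -> tuple[int, ...]:
    """Extract (major, minor, patch) from a name like Ex1Test1_SmithE_v1.2.5.xlsx."""
    stem = file_name.rsplit(".", 1)[0]
    match = VERSION_SUFFIX.search(stem)
    if match is None:
        raise ValueError(f"no version suffix found in {file_name!r}")
    return tuple(int(part) for part in match.groups())


def bump_version(file_name: str, part: str) -> str:
    """Return the file name for the next 'major', 'minor' or 'patch' version."""
    major, minor, patch = parse_version(file_name)
    if part == "major":
        major, minor, patch = major + 1, 0, 0
    elif part == "minor":
        minor, patch = minor + 1, 0
    elif part == "patch":
        patch += 1
    else:
        raise ValueError(f"unknown version part {part!r}")
    stem, extension = file_name.rsplit(".", 1)
    base = VERSION_SUFFIX.sub("", stem)
    return f"{base}_v{major}.{minor}.{patch}.{extension}"
```

Sorting with `sorted(names, key=parse_version)` orders versions numerically, which matters because plain alphabetical sorting would place v1.10.0 before v1.2.0.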
Folder structure
Recommendations for folder structure
For a good folder structure, it is recommended to:
- Invest time in planning the folder structure.
- Choose a folder structure that is (Bobrov et al., 2021; Bres et al., 2022):
  - Clear (i.e. self-explanatory and intuitive to navigate, also for other team members)
  - Comprehensive
  - Efficient
  - Hierarchical, which increases findability
- Use at most (Bres et al., 2022):
  - 4 levels
  - 10 elements per folder
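The two numeric limits above lend themselves to a routine check. The following is a minimal sketch under two assumptions: the project folder itself counts as level 0, and "elements" means files and subfolders together. The function name `check_structure` is hypothetical.

```python
from pathlib import Path

MAX_LEVELS = 4     # maximum nesting depth below the project folder
MAX_ELEMENTS = 10  # maximum number of files and subfolders per folder


def check_structure(root: Path) -> list[str]:
    """Report folders that are nested too deeply or hold too many elements."""
    problems = []
    folders = [root] + sorted(p for p in root.rglob("*") if p.is_dir())
    for folder in folders:
        relative = folder.relative_to(root)
        level = len(relative.parts)  # the project folder itself is level 0 (assumption)
        if level > MAX_LEVELS:
            problems.append(f"{relative}: {level} levels deep")
        elements = sum(1 for _ in folder.iterdir())
        if elements > MAX_ELEMENTS:
            problems.append(f"{relative}: {elements} elements")
    return problems
```

Calling `check_structure` on a project folder returns an empty list when both limits are respected, so it could be run as part of the regular checks that the 5S "Sustain" step calls for.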
Example of folder structure
- Project
- Data
- Raw_data
- Processed_data
- Documentation
- Code
- Src
- Output
- Plots
- Documentation
- Protocols
- Data
- Manuscripts
- Conference_reports
- Administrative_information
Further resources
- File naming and folder hierarchy: MIT Libraries
- Naming and organizing your files and folder: MIT Libraries Data Management Services 2020
- README: File & Folder Schema: MIT Libraries Data Management Services 2018
5S methodology
- The 5S Methodology in Research Data Management: Lang et al. 2021
- 5S Data: Setz dich auf deine 5 Buchstaben und organisiere deine Daten! (Coffee Lecture): Research Data Management Thuringia (TKFDM)
File naming
- Batch file renaming tools: Malinowski 2020
- Best practices: Malinowski 2020
- Convention worksheet: Briney 2020
- File naming: ELIXIR Belgium
- File naming examples: Briney et al. 2020, Table 1.
Folder structure
- Simple Open Data template: de Plaa 2021
- Template for research repositories: Colomb et al. 2020
Data organization in spreadsheets
- Data Organization in Spreadsheets: Broman & Woo 2017
- Six tips for better spreadsheets: Perkel 2022
- Tidy data for librarians: Library Carpentry
Tools
- FAIR4Health Data Curation Tool
- G-Node Infrastructure (GIN) = Modern Research Data Management for Neuroscience
References
- Assmann, C., Gadelha, L., Markus, K., & Vandendorpe, J. (2022). Workshop on Research Data Management.
- Bobrov, E., Adam, L.-S., Söring, S., Jäckel, D., Herwig, A., Lindstädt, B., Vandendorpe, J., & Shutsko, A. (2021). Workshop on Research Data.
- Bres, E., Rudolf, D., Lindstädt, B., & Shutsko, A. (2022). Research Data Management in Medical and Biomedical Sciences.