Data Reuse

Page last modified on: 2024, October 31

Data discovery

Strategies to search for data

The Consortium of European Social Science Data Archives (CESSDA) (CESSDA Data Management Expert Guide, n.d.) has produced a list of steps in data discovery. The main ones are outlined below, and you can look at their website for the sub-steps.

Develop a clear picture of the research data you need
Locate appropriate data resources
Set up a search query and search the data resource
Select data candidates
Evaluate data quality

CESSDA also suggests three steps to adjust your search strategy (CESSDA Data Management Expert Guide, n.d.):

Use appropriate words in appropriate fields
Broaden your scope
Narrow your scope

Other tips from the Center for Open Science (2023) include citation chaining (i.e. mining the citations in relevant literature to find more sources), looking at previous reuse of the data, and documenting your search strategy. Documentation helps you avoid repeating searches within one repository and lets you replicate the same strategy in other repositories. To document your search strategy properly, keep a record of the search terms, the filters and other refinements used, the dates of the searches, and the repositories searched.
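
A search log of this kind can be kept in a simple tabular file. The following is a minimal sketch in Python, not a prescribed format: the class name `SearchRecord`, its field names, and the example values are all hypothetical and only mirror the items listed above (terms, filters, other refinements, dates, repositories).

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class SearchRecord:
    """One search run against one repository (field names are illustrative)."""
    repository: str
    search_date: str  # ISO 8601 date, e.g. "2024-01-15"
    terms: str
    filters: str
    refinements: str


def write_search_log(records, handle):
    """Write the records as CSV so the log can be shared and re-read later."""
    writer = csv.DictWriter(handle, fieldnames=[f.name for f in fields(SearchRecord)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))


# Hypothetical example entries; replace with your own searches.
log = [
    SearchRecord("Repository A", "2024-01-15", "phage AND host range", "type=dataset", "none"),
    SearchRecord("Repository B", "2024-01-16", "phage AND host range", "type=dataset", "English only"),
]

buffer = io.StringIO()  # use open("search_log.csv", "w", newline="") to write a file
write_search_log(log, buffer)
print(buffer.getvalue())
```

Keeping one row per search and repository makes it easy to see which term and filter combinations have already been tried and to rerun them elsewhere.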

Services to search for data

Resources to facilitate data reuse in microbiology

Listed below are widely used resources in microbiology that facilitate the reuse of raw data found in the data repositories (see section above). These so-called “secondary databases” provide added value through additional data types, for example from data integration or from the processing of raw data. For each resource, the FAIRsharing and re3data pages are linked when available. On the FAIRsharing page, you will find information such as which journals endorse the resource (under “Collections & Recommendations” and then “In Policies”). On the re3data page, you will find information such as the above-mentioned criteria to select a trusted resource. DB = database.

Domain, Data Type	Data repository	FAIRsharing	re3data
Viruses, Knowledge resources	ViralZone	FAIRsharing	re3data
	International Committee for the Taxonomy of Viruses ICTV	-	-
Viruses, Virus-host databases	Virus-HostDB	-	-
	Viral Host-Range DB VHRDB	FAIRsharing	-
Viruses, Sequence analysis platforms	NCBI Virus	FAIRsharing	-
	(BV-BRC)	FAIRsharing	re3data
Viruses, Nucleic acid sequence downloads	RVDB	-	-
	(inphared)	-	-
Viruses, macromolecular structures	VIPERdb	FAIRsharing	re3data
Viruses, Protein sequences	Virus Orthologous Groups (VOGdb)	-	-
	Phage Orthologous Groups (PHROGs)	-	-
Viruses, -omics data sets	IMG/VR	FAIRsharing	-
	Multi-Omics Portal of Virus Infection (MVIP)	-	-
All, Protein sequence search	InterPro	FAIRsharing	re3data

Services where data can be published

Interdisciplinary and discipline-specific repositories
Data reports
Data journals (see e.g. here)

Registries of data repositories

Registry of Research Data Repositories (re3data.org)
OpenAIRE Explore
OpenDOAR
FAIRsharing.org
Master Data Repository List

Search engines

NCBI Datasets
Google
- Dataset Search
- Keyword + “data set”
Library search engines
- Bielefeld Academic Search Engine (BASE)
- LIVIVO – The Search Portal for Life Sciences
Discipline-specific search engines
- Bacterial and Viral Bioinformatics Resource Center (BV-BRC)
- NFDI4Chem Search
- Study Hub NFDI4Health COVID-19
- TerrestrialMetagenomeDB
Mendeley Data

(Meta)data aggregators

Data selection

Below is a list of criteria for selecting trustworthy data sets (Bres et al., 2022; Sielemann et al., 2020). Following Sielemann et al. (2020), several questions to consider are listed for each criterion.

Integrity of the source
- Is the source/submitter associated with data fabrication/plagiarism?
- Is the way missing values are handled documented?
Biases
- How was the data generated?
- Is the data generation clearly and precisely documented?
Missing meta information (sparsity)
- Do you have all the relevant information?
- Is the information understandable and consistent?
Integration of data sets from different sources
- Is the data comparable?
- Are the methods used for data generation and analysis well-documented and comparable?
Quality issues
- Is the quality high enough to reach your goals?
- Are there any scores/hints available to check the quality of the data set?
Copyright/Legal issues
- Are there any restrictions on the reuse and publication of the data, especially due to the Nagoya Protocol?
Further documentation
- Is the research purpose/(hypo-)thesis well documented?
- Is it documented whether the data are raw or processed?

Data provenance

The provenance of research data can be defined as “a documented trail that accounts for the origin of a piece of data and where it has moved from to where it is presently” (National Library of Medicine, 2022). As suggested by Schröder et al. (2022), it can be accounted for by answering questions based on the W7 provenance model:

W1: Who participated in the study? [List of all researchers involved in an experiment and their affiliations]
W2: Which biological and chemical resources and which equipment were used in the study? [Resources and equipment used in an experiment, including all details such as the lot number and the passage information]
W3: How was a particular file created? [Sequence of activities that led to the creation of a particular file]
W4: When was an activity conducted? [Date and time point of a particular activity, its duration]
W5: Why was the experiment done? [Objective]
W6: Where was the experiment conducted? [Institution where the experiment was conducted]
W7: What was the order of the stimulation parameters in a particular experiment?
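
The W7 questions above can be captured in a small structured record so that unanswered questions become visible. The sketch below is an illustration only, not the implementation used by Schröder et al. (2022): the class `ProvenanceRecord`, its field names, and the example values are hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ProvenanceRecord:
    """One field per W7 question (names chosen for this sketch)."""
    who: list    # W1: researchers and affiliations
    which: list  # W2: resources and equipment, with lot/passage details
    how: list    # W3: ordered activities that produced the file
    when: str    # W4: date/time and duration of the activity
    why: str     # W5: objective of the experiment
    where: str   # W6: institution
    what: str    # W7: order of the stimulation parameters

    def missing(self):
        """Return the names of the W questions that are still unanswered."""
        return [name for name, value in asdict(self).items() if not value]

    def to_json(self):
        """Serialise the record so it can be stored next to the data file."""
        return json.dumps(asdict(self), indent=2)


# Hypothetical example with two questions left open.
record = ProvenanceRecord(
    who=["A. Researcher (Example Institute)"],
    which=["Cell line X, passage 5", "Microscope Y"],
    how=["stimulate cells", "acquire images", "export to CSV"],
    when="2024-01-15T10:00, 2 h",
    why="",
    where="Example Institute",
    what="",
)
print(record.missing())
```

Storing the JSON output alongside each data file keeps the provenance trail with the data it describes.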

Data reuse

Benefits and drawbacks

Making data reusable benefits researchers who publish their data, researchers who reuse data, and society.

Researchers who publish their data see an increase in their scientific reputation, citations, and collaborations (Rehwald et al., 2022; Pauls et al., 2023). In addition, they not only comply with the FAIR Data Principles, but also avoid bias in the body of evidence (Institute of Medicine, 2015) and increase transparency and thus trust in research (Engelhardt et al., 2022; Rehwald et al., 2022; Pauls et al., 2023). Finally, by sharing their resources and perspectives, researchers who publish their data enable other researchers to build on their work, accelerating scientific discovery (Engelhardt et al., 2022; Institute of Medicine, 2015; Rehwald et al., 2022).

Researchers who reuse data can recycle unique data by performing secondary analyses to answer new research questions and/or apply new methods (Rehwald et al., 2022; Pauls et al., 2023). Reusing data in this way saves resources such as time, energy, and money (Engelhardt et al., 2022; Finnish Social Science Data Archive, n.d.; Rehwald et al., 2022; Pauls et al., 2023). Data reuse also increases collaboration and, over time, enables the comparison of different samples (Rehwald et al., 2022; Pauls et al., 2023). Indeed, data reuse is essential for interdisciplinary experiments and cross-cutting research approaches (Pavone, 2020).

Making data reusable can also benefit society. It reduces unnecessary experimentation (Rehwald et al., 2022), avoids duplication of data collection, and minimizes collection from hard-to-reach, vulnerable or over-researched populations (Finnish Social Science Data Archive, n.d.; Rehwald et al., 2022). It also enables replication and thus promotes reproducibility. Finally, it benefits teaching and improves the link between academia and industry (Rehwald et al., 2022).

As suggested by Sielemann et al. (2020), there are also challenges, limitations, and risks associated with data reuse.

For researchers who publish their data, preparing data sets for reuse is time-consuming.

For researchers who reuse data, there are risks such as unknown data quality and a lack of normalization (i.e. “the same data is stored multiple times in the same database under different names/identifiers”). There is also the challenge of comparing and integrating data sets from different sources (Sielemann et al., 2020).
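
One way to check for the redundancy described in the quote above is to group entries by a checksum of their content. This is a minimal sketch under stated assumptions, not a method taken from Sielemann et al. (2020): the function name `find_redundant_entries` and the example identifiers are hypothetical, the content is assumed to be a plain sequence string, and only whitespace and letter case are normalised before hashing.

```python
import hashlib
from collections import defaultdict


def find_redundant_entries(records):
    """Group identifiers whose content is identical.

    `records` maps an identifier to its content (here: a sequence string).
    Returns a sorted list of identifier groups that have more than one member.
    """
    groups = defaultdict(list)
    for identifier, content in records.items():
        # Strip whitespace and ignore case; real data may need stricter rules.
        normalised = "".join(content.split()).upper()
        digest = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
        groups[digest].append(identifier)
    return sorted(sorted(ids) for ids in groups.values() if len(ids) > 1)


# Hypothetical entries: the first two hold the same sequence under different names.
entries = {
    "entry_a1": "ACGTACGT",
    "entry_b7": "acgt acgt",
    "entry_c3": "TTTTGGGG",
}
print(find_redundant_entries(entries))
```

Exact-match hashing only finds identical content; near-duplicates (for example, sequences differing by a few bases) would need a similarity-based comparison instead.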

Successful cases of data reuse

Case 1: FishBase (Pavone, 2020)

Various data sources have been combined into a digital catalogue of fish, known as FishBase. The data in FishBase were processed using a new algorithm to create a new data set. This new data set was combined with other data to create AquaMaps, a tool for predicting the natural occurrence of marine species based on environmental parameters. This led to an increase in citations of FishBase (e.g. Coro et al. 2018) and to a report on EU fish stocks, the evidence for which was debated in the European Parliament in 2017. In addition, climate change predictions from AquaMaps and NASA were merged to create a climate change timeline.

Case 2: TerrestrialMetagenomeDB

TerrestrialMetagenomeDB is a public repository of curated and standardised metadata for terrestrial metagenomes.

Further cases in microbiology

See Sielemann et al. 2020.

Relevant licenses and terms of use

See Licenses.

Data citation

Common standards for data citation

Interdisciplinary

DataCite 2019: Creator (PublicationYear): Title. Version. Publisher. (resourceTypeGeneral). Identifier
FORCE 11: Author(s), Year, Data set title, Data repository or archive, Version, Global persistent identifier (preferably as a link)
BibGuru
DOI Citation Formatter
How to Cite Data sets and Link to Publications
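
The DataCite pattern listed above can be assembled from metadata fields with a few lines of code. The sketch below is illustrative only: the function name `format_datacite_citation` is hypothetical, joining several creators with "; " and treating version and resource type as optional are assumptions of this sketch, and the example metadata (including the DOI) is made up.

```python
def format_datacite_citation(creators, year, title, publisher, identifier,
                             version=None, resource_type=None):
    """Build 'Creator (PublicationYear): Title. Version. Publisher. (resourceTypeGeneral). Identifier'."""
    parts = [f"{'; '.join(creators)} ({year}): {title}."]
    if version:
        parts.append(f"{version}.")
    parts.append(f"{publisher}.")
    if resource_type:
        parts.append(f"({resource_type}).")
    parts.append(identifier)
    return " ".join(parts)


# Hypothetical metadata for illustration.
citation = format_datacite_citation(
    creators=["Doe, J.", "Roe, R."],
    year=2020,
    title="Example data set",
    publisher="Example Repository",
    identifier="https://doi.org/10.1234/example",
    version="1.0",
    resource_type="Dataset",
)
print(citation)
```

In practice, a tool such as the DOI Citation Formatter listed above can generate the citation directly from a DOI; the function only makes the order of the elements explicit.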

For nucleic acid sequences and functional genomics

Code citation

Code citation allows for greater recognition of research software. Some major platforms and tools offer code citation: GitHub, GitLab, JabRef, Zenodo, and Zotero (Code Citation Was Made Possible by Research Software Engineers in Germany and the Netherlands, n.d.).

How-tos

How to make your data reusable?

Properly document your data with metadata (Pavone, 2020).
Use common metadata standards and terminologies (Pavone, 2020).
Standardise your data.
Share your raw data with an open license.

How to get the most out of existing data?

See Wood-Charlson et al. (2022).

References

CESSDA Data Management Expert Guide. (n.d.). Retrieved August 24, 2023, from https://dmeg.cessda.eu/
Bres, E., Rudolf, D., Lindstädt, B., & Shutsko, A. (2022). Research Data Management in Medical and Biomedical Sciences.
Sielemann, K., Hafner, A., & Pucker, B. (2020). The reuse of public datasets in the life sciences: potential risks and rewards. PeerJ. https://doi.org/10.7717/peerj.9954
National Library of Medicine. (2022). Data Provenance. National Library of Medicine. https://www.nnlm.gov/guides/data-glossary/data-provenance#:~:text=Definition,to%20where%20it%20is%20presently
Schröder, M., Staehlke, S., Groth, P., Nebe, J. B., Spors, S., & Krüger, F. (2022). Structure-based knowledge acquisition from electronic lab notebooks for research data provenance documentation. J. Biomed. Semantics, 13(1), 4.
Rehwald, S., Leimer, S., Lindstädt, B., Shutsko, A., & Vandendorpe, J. (2022). Workshop on Research Data Management in Medical and Biomedical Sciences.
Pauls, C., Feeken, C., Steen, E.-E., Lindstädt, B., Vandendorpe, J., & Markus, K. (2023). Workshop on Research Data Management.
Institute of Medicine. (2015). Guiding Principles for Sharing Clinical Trial Data. In Sharing Clinical Trial Data: Maximizing Benefits, Minimizing Risk. National Academies Press (US).
Engelhardt, C., Biernacka, K., Coffey, A., Cornet, R., Danciu, A., Demchenko, Y., Downes, S., Erdmann, C., Garbuglia, F., Germer, K., Helbig, K., Hellström, M., Hettne, K., Hibbert, D., Jetten, M., Karimova, Y., Kryger Hansen, K., Kuusniemi, M. E., Letizia, V., … Zhou, B. (2022). D7.4 How to be FAIR with your data. A teaching and training handbook for higher education institutions. https://doi.org/10.5281/ZENODO.6674301
Finnish Social Science Data Archive. (n.d.). Why are research data managed and reused? Retrieved August 17, 2023, from https://www.fsd.tuni.fi/en/services/data-management-guidelines/why-are-research-data-managed-and-reused/
Pavone, G. (2020). Data Reuse Stories. Some concrete cases involving several institutions and consortia in Europe. https://www.openaire.eu/blogs/data-reuse-stories-some-concrete-cases-involving-several-institutions-and-consortia-in-europe
Code citation was made possible by research software engineers in Germany and the Netherlands. (n.d.). Retrieved August 23, 2023, from https://www.esciencecenter.nl/news/code-citation-was-made-possible-by-research-software-engineers-in-germany-and-the-netherlands/
Wood-Charlson, E. M., Crockett, Z., Erdmann, C., Arkin, A. P., & Robinson, C. B. (2022). Ten simple rules for getting and giving credit for data. PLOS Computational Biology, 18(9), 1–11. https://doi.org/10.1371/journal.pcbi.1010476