
Data Reuse




Page last modified on: 2024, October 14


Data discovery

Strategies to search for data

The Consortium of European Social Science Data Archives (CESSDA) (CESSDA Data Management Expert Guide, n.d.) has produced a list of steps in data discovery. The main ones are outlined below, and you can look at their website for the sub-steps.

  1. Develop a clear picture of the research data you need
  2. Locate appropriate data resources
  3. Set up a search query and search the data resource
  4. Select data candidates
  5. Evaluate data quality

CESSDA also suggests three steps to adjust your search strategy (CESSDA Data Management Expert Guide, n.d.):

  1. Use appropriate words in appropriate fields
  2. Broaden your scope
  3. Narrow your scope

Other tips and tricks from the Center for Open Science 2023 include citation chaining (i.e. mining the citations in relevant literature to find more sources), looking at previous reuse, and documenting your search strategy. Documentation helps you avoid repeating searches within one repository and replicate the same strategy in other repositories. To document your search strategy properly, keep a record of the terms used, the filters and other refinements applied, the dates, and the repositories searched.
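The record-keeping described above can be sketched as a small search log. This is a minimal illustration rather than a prescribed format: the class name, the field names, the repository name and the example query are all assumptions made for the example.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields
from datetime import date


@dataclass
class SearchRecord:
    """One search run against one repository (field names are illustrative)."""
    repository: str   # where the search was run
    terms: str        # search terms exactly as typed
    filters: str      # filters applied in the repository interface
    refinements: str  # any other refinements, e.g. field restrictions
    searched_on: str  # ISO date of the search


def already_searched(log, repository, terms):
    """Return True if the same terms were already run in this repository."""
    return any(r.repository == repository and r.terms == terms for r in log)


def to_csv(log):
    """Serialise the log to CSV text so it can be stored with the project."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(SearchRecord)])
    writer.writeheader()
    for record in log:
        writer.writerow(asdict(record))
    return buffer.getvalue()


# Hypothetical example entry; repository name and query are made up.
log = [
    SearchRecord(
        repository="Example Repository A",
        terms="soil metagenome AND 16S",
        filters="year >= 2020",
        refinements="title and abstract fields only",
        searched_on=date(2024, 10, 14).isoformat(),
    )
]

# Check the log before repeating a query, then reuse the same terms elsewhere.
print(already_searched(log, "Example Repository A", "soil metagenome AND 16S"))
print(to_csv(log))
```

Keeping the log in a plain CSV file means it can be versioned alongside the analysis and read without special software.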

Services to search for data

Resources to facilitate data reuse in microbiology

Listed below are widely used resources in microbiology that facilitate the reuse of raw data found in data repositories (see section above). These so-called “secondary databases” provide added value through additional data types, for example from data integration or from the processing of raw data. For each resource, the FAIRsharing and re3data pages are linked when available. On the FAIRsharing page, you will find information such as which journals endorse the resource (under “Collections & Recommendations” and then “In Policies”). On the re3data page, you will find information such as the above-mentioned criteria to select a trusted resource. DB = database.

Domain, Data Type | Data repository | FAIRsharing | re3data
Viruses, Knowledge resources | ViralZone | FAIRsharing | re3data
Viruses, Knowledge resources | International Committee for the Taxonomy of Viruses ICTV | - | -
Viruses, Virus-host databases | Virus-HostDB | - | -
Viruses, Virus-host databases | Viral Host-Range DB VHRDB | FAIRsharing | -
Viruses, Sequence analysis platforms | NCBI Virus | FAIRsharing | -
Viruses, Sequence analysis platforms | (BV-BRC) | FAIRsharing | re3data
Viruses, Nucleic acid sequence downloads | RVDB | - | -
Viruses, Nucleic acid sequence downloads | (inphared) | - | -
Viruses, Macromolecular structures | VIPERdb | FAIRsharing | re3data
Viruses, Protein sequences | Virus Orthologous Groups (VOGdb) | - | -
Viruses, Protein sequences | Phage Orthologous Groups (PHROGs) | - | -
Viruses, -omics data sets | IMG/VR | FAIRsharing | -
Viruses, -omics data sets | Multi-Omics Portal of Virus Infection (MVIP) | - | -
All, Protein sequence search | InterPro | FAIRsharing | re3data

Services where data can be published

Registries of data repositories

Search engines

(Meta)data aggregators

Data selection

Below is a list of criteria for selecting trustworthy data sets (Bres et al., 2022; Sielemann et al., 2020). Following Sielemann et al. (2020), several questions to consider are listed for each criterion.

  • Integrity of the source
    • Is the source/submitter associated with data fabrication/plagiarism?
    • Is the way missing values are handled documented?
  • Biases
    • How was the data generated?
    • Is the data generation clearly and precisely documented?
  • Missing meta information (sparsity)
    • Do you have all the relevant information?
    • Is the information understandable and consistent?
  • Integration of data sets from different sources
    • Is the data comparable?
    • Are the methods used for data generation and analysis well-documented and comparable?
  • Quality issues
    • Is the quality high enough to reach your goals?
    • Are there any scores/hints available to check the quality of the data set?
  • Copyright/Legal issues
    • Are there any restrictions on the reuse and publication of the data, especially due to the Nagoya Protocol?
  • Further documentation
    • Is the research purpose/(hypo-)thesis well documented?
    • Is it documented whether the data are raw or processed?
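The "missing meta information" questions above can be partly automated. The sketch below flags metadata fields that are absent or empty for a candidate data set. The list of required fields and the example record are hypothetical; which fields count as relevant depends on your research question and on the repository.

```python
# Hypothetical set of fields considered essential for a given project.
REQUIRED_FIELDS = ["organism", "sampling_date", "method", "license", "processing_level"]


def missing_metadata(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or empty in a metadata record."""
    missing = []
    for name in required:
        value = record.get(name)
        if value is None or str(value).strip() == "":
            missing.append(name)
    return missing


# Made-up metadata record for one candidate data set.
candidate = {
    "organism": "Escherichia coli",
    "sampling_date": "2021-05-03",
    "method": "shotgun metagenomics",
    "license": "",  # present but empty, so it is reported as missing
}

print(missing_metadata(candidate))
```

A check like this only covers sparsity. Whether the available information is understandable and consistent still has to be judged by reading the documentation.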

Data provenance

The provenance of research data can be defined as “a documented trail that accounts for the origin of a piece of data and where it has moved from to where it is presently” (National Library of Medicine, 2022). As suggested by Schröder et al. (2022), it can be accounted for by answering questions based on the W7 provenance model:

  • W1: Who participated in the study? [List of all researchers involved in an experiment and their affiliations]
  • W2: Which biological and chemical resources and which equipment were used in the study? [Resources and equipment used in an experiment, including all details such as the lot number and the passage information]
  • W3: How was a particular file created? [Sequence of activities that led to the creation of a particular file]
  • W4: When was an activity conducted? [Date and time point of a particular activity, its duration]
  • W5: Why was the experiment done? [Objective]
  • W6: Where was the experiment conducted? [Institution where the experiment was conducted]
  • W7: What was the order of the stimulation parameters in a particular experiment?
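One way to keep the answers to the seven questions together with a data file is a small structured record. The sketch below is an illustration only: the class, its field names and all example values are assumptions and are not taken from Schröder et al. (2022).

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class W7Provenance:
    """Answers to the seven W questions for one experiment (illustrative layout)."""
    who: list    # W1: researchers involved and their affiliations
    which: list  # W2: resources and equipment, incl. lot numbers and passages
    how: list    # W3: sequence of activities that produced a file
    when: str    # W4: date, time point and duration of an activity
    why: str     # W5: objective of the experiment
    where: str   # W6: institution where the experiment was conducted
    what: str    # W7: order of the stimulation parameters

    def unanswered(self):
        """Return the names of the questions that are still empty."""
        return [name for name, value in asdict(self).items() if not value]


# Made-up example values.
record = W7Provenance(
    who=["A. Researcher (Example Institute)"],
    which=["Cell line X, passage 12", "Reagent Y, lot 0001"],
    how=["seeding", "stimulation", "imaging", "export to CSV"],
    when="2024-10-14T09:30, 2 h",
    why="Test the effect of stimulation on cell adhesion",
    where="Example Institute",
    what="",  # not yet documented
)

print(record.unanswered())
print(json.dumps(asdict(record), indent=2))
```

Storing the record as JSON next to the data file keeps the provenance trail readable for both people and software.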

Data reuse

Benefits and drawbacks

Making data reusable benefits researchers who publish their data, researchers who reuse data, and society.

Researchers who publish their data see an increase in their scientific reputation, citations, and collaborations (Rehwald et al., 2022; Pauls et al., 2023). In addition, researchers who publish their data not only comply with the FAIR Data Principles, but also help avoid bias in the body of evidence (Institute of Medicine, 2015) and increase transparency, and thus trust, in research (Engelhardt et al., 2022; Rehwald et al., 2022; Pauls et al., 2023). Finally, by sharing their resources and perspectives, researchers who publish their data enable other researchers to build on their work, accelerating scientific discovery (Engelhardt et al., 2022; Institute of Medicine, 2015; Rehwald et al., 2022).

Researchers can recycle unique data by performing secondary analyses to answer new research questions and/or with new methods (Rehwald et al., 2022; Pauls et al., 2023). Reusing data in this way saves resources such as time, energy, and money (Engelhardt et al., 2022; Finnish Social Science Data Archive, n.d.; Rehwald et al., 2022; Pauls et al., 2023). Data reuse also increases collaboration and, over time, enables the comparison of different samples (Rehwald et al., 2022; Pauls et al., 2023). Indeed, data reuse is essential for interdisciplinary experiments and cross-cutting research approaches (Pavone, 2020).

Making data reusable can also benefit society. It reduces unnecessary experimentation (Rehwald et al., 2022), avoids duplication of data collection, and minimizes collection from hard-to-reach, vulnerable or over-researched populations (Finnish Social Science Data Archive, n.d.; Rehwald et al., 2022). It also enables replication and thus promotes reproducibility. Finally, it benefits teaching and improves the link between academia and industry (Rehwald et al., 2022).

As suggested by Sielemann et al. (2020), there are also challenges, limitations, and risks associated with data reuse.

For researchers who publish their data, preparing data sets for reuse is time-consuming.

For researchers reusing data, there are risks such as unknown data quality and a lack of normalization (i.e. “the same data is stored multiple times in the same database under different names/identifiers”). Comparing and integrating data sets from different sources is a further challenge (Sielemann et al., 2020).
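The risk of the same data being stored several times under different identifiers can be checked for exact duplicates by comparing content rather than names. The sketch below hashes each sequence after removing whitespace and case differences. The identifiers and sequences are made up, and this approach only finds identical content, not near-duplicates or differently processed versions of the same data.

```python
import hashlib
from collections import defaultdict


def normalise(sequence):
    """Remove all whitespace and unify case so formatting does not hide duplicates."""
    return "".join(sequence.split()).upper()


def find_duplicates(entries):
    """Group identifiers whose normalised sequences are identical.

    entries maps an identifier to its sequence. The result is a sorted list
    of identifier groups, one group per duplicated sequence.
    """
    by_digest = defaultdict(list)
    for identifier, sequence in entries.items():
        digest = hashlib.sha256(normalise(sequence).encode("utf-8")).hexdigest()
        by_digest[digest].append(identifier)
    return sorted(sorted(ids) for ids in by_digest.values() if len(ids) > 1)


# Hypothetical entries: ID_001 and ID_002 hold the same sequence.
entries = {
    "ID_001": "ATGCGT",
    "ID_002": "atgc gt\n",
    "ID_003": "ATGCGA",
}

print(find_duplicates(entries))
```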

Successful cases of data reuse

Case 1: FishBase (Pavone, 2020)

Various data sources have been combined into a digital catalogue of fish, known as FishBase. The data in FishBase were processed using a new algorithm to create a new dataset. This new dataset was combined with other data to create AquaMaps, a tool for predicting the natural occurrence of marine species based on environmental parameters. This led to an increase in citations of FishBase (e.g. Coro et al. 2018) and a report on EU fish stocks, the evidence for which was debated in the European Parliament in 2017. In addition, climate change predictions from AquaMaps and NASA were merged to create a climate change timeline.

Case 2: TerrestrialMetagenomeDB

TerrestrialMetagenomeDB is a public repository of curated and standardised metadata for terrestrial metagenomes.

Further cases in microbiology

See Sielemann et al. 2020.

Relevant licenses and terms of use

See Licenses.

Data citation

Common standards for data citation

Interdisciplinary

For nucleic acid sequences and functional genomics

Code citation

Code citation allows for greater recognition of research software. Some major platforms and tools offer code citation: GitHub, GitLab, JabRef, Zenodo, and Zotero (Code Citation Was Made Possible by Research Software Engineers in Germany and the Netherlands, n.d.).

How-tos

How to make your data reusable?

  • Properly document your data with metadata (Pavone, 2020).
  • Use common metadata standards and terminologies (Pavone, 2020).
  • Standardise your data.
  • Share your raw data with an open license.

How to maximize already existing data?

See Wood-Charlson et al. (2022).

Get Help

If you have any further questions about the management and analysis of your microbial research data, please contact us: helpdesk@nfdi4microbiota.de (by emailing us you agree to the privacy policy on our website: Contact)

References

  1. CESSDA Data Management Expert Guide. (n.d.). Retrieved August 24, 2023, from https://dmeg.cessda.eu/
  2. Bres, E., Rudolf, D., Lindstädt, B., & Shutsko, A. (2022). Research Data Management in Medical and Biomedical Sciences.
  3. Sielemann, K., Hafner, A., & Pucker, B. (2020). The reuse of public datasets in the life sciences: potential risks and rewards. PeerJ. https://doi.org/10.7717/peerj.9954
  4. National Library of Medicine. (2022). Data Provenance. National Library of Medicine. https://www.nnlm.gov/guides/data-glossary/data-provenance#:~:text=Definition,to%20where%20it%20is%20presently
  5. Schröder, M., Staehlke, S., Groth, P., Nebe, J. B., Spors, S., & Krüger, F. (2022). Structure-based knowledge acquisition from electronic lab notebooks for research data provenance documentation. J. Biomed. Semantics, 13(1), 4.
  6. Rehwald, S., Leimer, S., Lindstädt, B., Shutsko, A., & Vandendorpe, J. (2022). Workshop on Research Data Management in Medical and Biomedical Sciences.
  7. Pauls, C., Feeken, C., Steen, E.-E., Lindstädt, B., Vandendorpe, J., & Markus, K. (2023). Workshop on Research Data Management.
  8. Institute of Medicine. (2015). Guiding Principles for Sharing Clinical Trial Data. In Sharing Clinical Trial Data: Maximizing Benefits, Minimizing Risk. National Academies Press (US).
  9. Engelhardt, C., Biernacka, K., Coffey, A., Cornet, R., Danciu, A., Demchenko, Y., Downes, S., Erdmann, C., Garbuglia, F., Germer, K., Helbig, K., Hellström, M., Hettne, K., Hibbert, D., Jetten, M., Karimova, Y., Kryger Hansen, K., Kuusniemi, M. E., Letizia, V., … Zhou, B. (2022). D7.4 How to be FAIR with your data. A teaching and training handbook for higher education institutions. https://doi.org/10.5281/ZENODO.6674301
  10. Finnish Social Science Data Archive. (n.d.). Why are research data managed and reused? Retrieved August 17, 2023, from https://www.fsd.tuni.fi/en/services/data-management-guidelines/why-are-research-data-managed-and-reused/
  11. Pavone, G. (2020). Data Reuse Stories. Some concrete cases involving several institutions and consortia in Europe. https://www.openaire.eu/blogs/data-reuse-stories-some-concrete-cases-involving-several-institutions-and-consortia-in-europe
  12. Code citation was made possible by research software engineers in Germany and the Netherlands. (n.d.). Retrieved August 23, 2023, from https://www.esciencecenter.nl/news/code-citation-was-made-possible-by-research-software-engineers-in-germany-and-the-netherlands/
  13. Wood-Charlson, E. M., Crockett, Z., Erdmann, C., Arkin, A. P., & Robinson, C. B. (2022). Ten simple rules for getting and giving credit for data. PLOS Computational Biology, 18(9), 1–11. https://doi.org/10.1371/journal.pcbi.1010476