Data discovery
Strategies to search for data
The Consortium of European Social Science Data Archives (CESSDA) (CESSDA Data Management Expert Guide, n.d.) has produced a list of steps in data discovery. The main ones are outlined below, and you can look at their website for the sub-steps.
- Develop a clear picture of the research data you need
- Locate appropriate data resources
- Set up a search query and search the data resource
- Select data candidates
- Evaluate data quality
CESSDA also suggests three steps to adjust your search strategy (CESSDA Data Management Expert Guide, n.d.):
- Use appropriate words in appropriate fields
- Broaden your scope
- Narrow your scope
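The broaden/narrow adjustments above can be sketched as a small query builder. This is only an illustration, assuming a repository whose search field accepts Boolean `AND`/`OR` operators and quoted phrases; the function name `build_query` and the example terms are hypothetical, and the actual query syntax differs between repositories.

```python
def build_query(synonyms, required=()):
    """Combine search terms into one Boolean query string.

    synonyms: alternative terms joined with OR (adding terms broadens the search).
    required: extra terms joined with AND (adding terms narrows the search).
    """
    def quote(term):
        # Quote multi-word terms so they are searched as a phrase.
        return f'"{term}"' if " " in term else term

    broad = " OR ".join(quote(t) for t in synonyms)
    parts = [f"({broad})"] + [quote(t) for t in required]
    return " AND ".join(parts)


# Broad search: any of the synonyms may match.
print(build_query(["phage", "bacteriophage"]))
# Narrowed search: the same synonyms, restricted by additional terms.
print(build_query(["phage", "bacteriophage"], required=["soil", "metagenome"]))
```

Keeping the synonyms and the restricting terms in separate arguments makes it easy to adjust one without touching the other when a search returns too few or too many hits.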
Other tips from the Center for Open Science (2023) include citation chaining (i.e. mining the citations in relevant literature to find more sources), looking at previous reuse of a data set, and documenting your search strategy. Documentation avoids repeating searches in one repository and helps you replicate the same strategy in other repositories. To document your search strategy properly, record the search terms, the filters and other refinements used, the search dates, and the repositories searched.
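One possible way to keep such a record is a simple CSV log. The sketch below assumes one row per search; the class `SearchRecord`, the function `log_search`, the column names, and the example values are hypothetical choices for illustration, not a prescribed format.

```python
import csv
from dataclasses import dataclass, asdict, fields
from pathlib import Path


@dataclass
class SearchRecord:
    repository: str   # where the search was run
    terms: str        # search terms as typed
    filters: str      # filters applied (e.g. organism, date range)
    refinements: str  # any other refinements
    searched_on: str  # date of the search, ISO format (YYYY-MM-DD)


def log_search(path, record):
    """Append one search record to a CSV file, writing the header only once."""
    path = Path(path)
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=[f.name for f in fields(SearchRecord)])
        if is_new:
            writer.writeheader()
        writer.writerow(asdict(record))


# Illustrative record; all values are placeholders.
example = SearchRecord(
    repository="Example repository",
    terms="(phage OR bacteriophage) AND soil",
    filters="data type: metagenome",
    refinements="sorted by date",
    searched_on="2023-08-24",
)
```

Calling `log_search("search_log.csv", example)` after each search builds up a table that can be reviewed later or reused when moving to another repository.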
Services to search for data
Resources to facilitate data reuse in microbiology
Listed below are widely used microbiology resources that facilitate the reuse of raw data found in data repositories (see section above). These so-called “secondary databases” provide added value through additional data types, for example from data integration or from the processing of raw data. For each resource, the FAIRsharing and re3data pages are linked when available. On the FAIRsharing page, you will find information such as which journals endorse the resource (under “Collections & Recommendations” and then “In Policies”). On the re3data page, you will find information such as the above-mentioned criteria to select a trusted resource. DB = database.
Domain, Data Type | Data repository | FAIRsharing | re3data |
---|---|---|---|
Viruses, Knowledge resources | ViralZone | FAIRsharing | re3data |
Viruses, Knowledge resources | International Committee on Taxonomy of Viruses (ICTV) | - | - |
Viruses, Virus-host databases | Virus-HostDB | - | - |
Viruses, Virus-host databases | Viral Host-Range DB (VHRDB) | FAIRsharing | - |
Viruses, Sequence analysis platforms | NCBI Virus | FAIRsharing | - |
Viruses, Sequence analysis platforms | Bacterial and Viral Bioinformatics Resource Center (BV-BRC) | FAIRsharing | re3data |
Viruses, Nucleic acid sequence downloads | RVDB | - | - |
Viruses, Nucleic acid sequence downloads | inphared | - | - |
Viruses, Macromolecular structures | VIPERdb | FAIRsharing | re3data |
Viruses, Protein sequences | Virus Orthologous Groups (VOGdb) | - | - |
Viruses, Protein sequences | Phage Orthologous Groups (PHROGs) | - | - |
Viruses, -omics data sets | IMG/VR | FAIRsharing | - |
Viruses, -omics data sets | Multi-Omics Portal of Virus Infection (MVIP) | - | - |
All, Protein sequence search | InterPro | FAIRsharing | re3data |
Services where data can be published
- Interdisciplinary and discipline-specific repositories
- Data reports
- Data journals
Registries of data repositories
- Registry of Research Data Repositories (re3data.org)
- OpenAIRE Explore
- OpenDOAR
- FAIRsharing.org
- Master Data Repository List
Search engines
- NCBI Datasets
- Google
  - Dataset Search
  - Keyword + “data set”
- Library search engines
  - Bielefeld Academic Search Engine (BASE)
  - LIVIVO – The Search Portal for Life Sciences
- Discipline-specific search engines
  - Bacterial and Viral Bioinformatics Resource Center (BV-BRC)
  - NFDI4Chem Search
  - Study Hub NFDI4Health COVID-19
  - TerrestrialMetagenomeDB
- Mendeley Data
(Meta)data aggregators
Data selection
Below is a list of criteria for selecting trustworthy data sets (Bres et al., 2022; Sielemann et al., 2020). Following Sielemann et al. (2020), each criterion is accompanied by several questions to consider.
- Integrity of the source
  - Is the source/submitter associated with data fabrication/plagiarism?
  - Is the way missing values are handled documented?
- Biases
  - How was the data generated?
  - Is the data generation clearly and precisely documented?
- Missing meta information (sparsity)
  - Do you have all the relevant information?
  - Is the information understandable and consistent?
- Integration of data sets from different sources
  - Is the data comparable?
  - Are the methods used for data generation and analysis well-documented and comparable?
- Quality issues
  - Is the quality high enough to reach your goals?
  - Are there any scores/hints available to check the quality of the data set?
- Copyright/Legal issues
  - Are there any restrictions for the reuse and publication of the data, especially due to the Nagoya protocol?
- Further documentation
  - Is the research purpose/(hypo-)thesis well documented?
  - Is it documented whether the data are raw or processed?
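Part of the “missing meta information” criterion can be screened automatically before reading the documentation in detail. The following is a minimal sketch, assuming the metadata of a data set is available as a dictionary; the function name `missing_fields`, the list of required fields, and the placeholder values treated as empty are hypothetical and should be adapted to your own research question.

```python
def missing_fields(metadata, required):
    """Return the required fields that are absent or effectively empty."""
    # Strings commonly used as placeholders for "no value" (an assumption).
    empty_values = (None, "", "na", "n/a", "not provided")
    missing = []
    for name in required:
        value = metadata.get(name)
        if isinstance(value, str):
            value = value.strip().lower()
        if value in empty_values:
            missing.append(name)
    return missing


# Hypothetical field names and record, for illustration only.
REQUIRED = ["organism", "collection_date", "geo_location", "sequencing_platform"]
record = {
    "organism": "Escherichia coli",
    "collection_date": "",
    "sequencing_platform": "NA",
}
print(missing_fields(record, REQUIRED))
# -> ['collection_date', 'geo_location', 'sequencing_platform']
```

A check like this only flags gaps; whether the remaining information is understandable and consistent still has to be judged by a person.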
Data provenance
The provenance of research data can be defined as “a documented trail that accounts for the origin of a piece of data and where it has moved from to where it is presently” (National Library of Medicine, 2022). As suggested by Schröder et al. (2022), it can be documented by answering questions based on the W7 provenance model:
- W1: Who participated in the study? [List of all researchers involved in an experiment and their affiliations]
- W2: Which biological and chemical resources and which equipment were used in the study? [Resources and equipment used in an experiment, including all details such as the lot number and the passage information]
- W3: How was a particular file created? [Sequence of activities that led to the creation of a particular file]
- W4: When was an activity conducted? [Date and time point of a particular activity, its duration]
- W5: Why was the experiment done? [Objective]
- W6: Where was the experiment conducted? [Institution where the experiment was conducted]
- W7: What was the order of the stimulation parameters in a particular experiment?
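The seven questions above can be captured in a small machine-readable record stored next to the data. This is a sketch only: the class `ProvenanceRecord`, its field names, and the example values are hypothetical, and JSON is assumed as the storage format purely for convenience.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ProvenanceRecord:
    who: list    # W1: researchers involved and their affiliations
    which: list  # W2: resources and equipment (incl. lot number, passage)
    how: list    # W3: sequence of activities that produced the file
    when: str    # W4: date, time point and duration of the activity
    why: str     # W5: objective of the experiment
    where: str   # W6: institution where the experiment was conducted
    what: str    # W7: order of the parameters in the experiment

    def unanswered(self):
        """Return the names of the W questions that are still empty."""
        return [name for name, value in asdict(self).items() if not value]


# Placeholder values for illustration; "why" is deliberately left empty.
record = ProvenanceRecord(
    who=["Researcher A (Institute X)"],
    which=["Cell line Y, lot 123, passage 4"],
    how=["cell seeding", "stimulation", "imaging"],
    when="2023-08-24, 2 h",
    why="",
    where="Institute X",
    what="parameter set 1, then parameter set 2",
)
print(record.unanswered())  # -> ['why']
print(json.dumps(asdict(record), indent=2))
```

Checking `unanswered()` before depositing a data set is one simple way to notice gaps in the provenance trail while the information is still easy to recover.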
Data reuse
Benefits and drawbacks
Making data reusable benefits researchers who publish their data, researchers who reuse data, and society.
Researchers who publish their data see an increase in their scientific reputation, citations, and collaborations (Rehwald et al., 2022; Pauls et al., 2023). They also comply with the FAIR Data Principles, avoid bias in the body of evidence (Institute of Medicine, 2015), and increase transparency and thus trust in research (Engelhardt et al., 2022; Rehwald et al., 2022; Pauls et al., 2023). Finally, by sharing their resources and perspectives, they enable other researchers to build on their work, accelerating scientific discovery (Engelhardt et al., 2022; Institute of Medicine, 2015; Rehwald et al., 2022).
Researchers who reuse data can recycle unique data by performing secondary analyses to answer new research questions, apply new methods, or both (Rehwald et al., 2022; Pauls et al., 2023). Reusing data in this way saves resources such as time, energy, and money (Engelhardt et al., 2022; Finnish Social Science Data Archive, n.d.; Rehwald et al., 2022; Pauls et al., 2023). Data reuse also increases collaboration and, over time, enables the comparison of different samples (Rehwald et al., 2022; Pauls et al., 2023). Indeed, data reuse is essential for interdisciplinary experiments and cross-cutting research approaches (Pavone, 2020).
Making data reusable can also benefit society. It reduces unnecessary experimentation (Rehwald et al., 2022), avoids duplication of data collection, and minimizes collection from hard-to-reach, vulnerable or over-researched populations (Finnish Social Science Data Archive, n.d.; Rehwald et al., 2022). It also enables replication and thus promotes reproducibility. Finally, it benefits teaching and improves the link between academia and industry (Rehwald et al., 2022).
As Sielemann et al. (2020) point out, there are also challenges, limitations, and risks associated with data reuse.
For researchers who publish their data, preparing data sets for reuse is time-consuming.
For researchers who reuse data, the risks include unknown data quality and a lack of normalization (i.e. “the same data is stored multiple times in the same database under different names/identifiers”). Comparing and integrating data sets from different sources is a further challenge (Sielemann et al., 2020).
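The redundancy problem quoted above (identical data stored under different identifiers) can sometimes be detected by comparing content rather than names. The sketch below assumes the records are available as identifier-to-content pairs, for example sequences; the function name `find_duplicates` and the example entries are hypothetical, and exact matching after simple normalisation is an assumption that will miss near-duplicates.

```python
import hashlib
from collections import defaultdict


def find_duplicates(records):
    """Group identifiers whose content is identical.

    records: mapping of identifier -> content string (e.g. a sequence).
    Returns a list of sorted identifier groups that share the same content.
    """
    by_digest = defaultdict(list)
    for identifier, content in records.items():
        # Remove whitespace and ignore case so formatting differences do not matter.
        normalised = "".join(content.split()).upper()
        digest = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
        by_digest[digest].append(identifier)
    return [sorted(ids) for ids in by_digest.values() if len(ids) > 1]


# Made-up entries: entry_A and entry_B hold the same sequence.
records = {
    "entry_A": "ATGC GT",
    "entry_B": "atgcgt",
    "entry_C": "TTGACA",
}
print(find_duplicates(records))  # -> [['entry_A', 'entry_B']]
```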
Successful cases of data reuse
Case 1: FishBase (Pavone, 2020)
Various data sources have been combined into a digital catalogue of fish, known as FishBase. The data in FishBase were processed using a new algorithm to create a new data set. This new data set was combined with other data to create AquaMaps, a tool for predicting the natural occurrence of marine species based on environmental parameters. This led to an increase in citations of FishBase (e.g. Coro et al. 2018) and to a report on EU fish stocks, the evidence for which was debated in the European Parliament in 2017. In addition, climate change predictions from AquaMaps and NASA were merged to create a climate change timeline.
Case 2: TerrestrialMetagenomeDB
TerrestrialMetagenomeDB is a public repository of curated and standardised metadata for terrestrial metagenomes.
Further cases in microbiology
Relevant licenses and terms of use
See Licenses.
Data citation
Common standards for data citation
Interdisciplinary
- DataCite 2019: Creator (PublicationYear): Title. Version. Publisher. (resourceTypeGeneral). Identifier
- FORCE 11: Author(s), Year, Data set title, Data repository or archive, Version, Global persistent identifier (preferably as a link)
- BibGuru
- DOI Citation Formatter
- How to Cite Data sets and Link to Publications
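The DataCite pattern listed above can be assembled programmatically from a data set's metadata. This is a minimal sketch: the function name `format_datacite_citation` is hypothetical, treating version and resource type as optional elements is an assumption of this sketch, and the example values (including the DOI) are made-up placeholders.

```python
def format_datacite_citation(creator, year, title, publisher, identifier,
                             version=None, resource_type=None):
    """Build a citation string in the order
    Creator (PublicationYear): Title. Version. Publisher. (resourceTypeGeneral). Identifier
    skipping the optional elements when they are not given."""
    parts = [f"{creator} ({year}): {title}."]
    if version:
        parts.append(f"{version}.")
    parts.append(f"{publisher}.")
    if resource_type:
        parts.append(f"({resource_type}).")
    parts.append(identifier)
    return " ".join(parts)


# Placeholder metadata; the DOI below does not refer to a real data set.
print(format_datacite_citation(
    creator="Doe, J.",
    year=2023,
    title="Example soil metagenome",
    publisher="Example Repository",
    identifier="https://doi.org/10.1234/placeholder",
    version="1.0",
    resource_type="Dataset",
))
```

In practice, the citation text offered by the repository itself or by a DOI citation formatter should take precedence over a hand-built string.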
For nucleic acid sequences and functional genomics
- How do I cite my ArrayExpress data sets in my publication?
- How to Cite Data in ENA
- Citing and linking to the GEO database
- How do I cite NCBI services and databases?
Code citation
Code citation allows for greater recognition of research software. Some major platforms and tools offer code citation: GitHub, GitLab, JabRef, Zenodo, and Zotero (Code Citation Was Made Possible by Research Software Engineers in Germany and the Netherlands, n.d.).
How-tos
How to make your data reusable?
- Properly document your data with metadata (Pavone, 2020).
- Use common metadata standards and terminologies (Pavone, 2020).
- Standardise your data.
- Share your raw data with an open license.
How to maximize already existing data?
See Wood-Charlson et al. (2022).
References
- CESSDA Data Management Expert Guide. (n.d.). Retrieved August 24, 2023, from https://dmeg.cessda.eu/
- Bres, E., Rudolf, D., Lindstädt, B., & Shutsko, A. (2022). Research Data Management in Medical and Biomedical Sciences.
- Sielemann, K., Hafner, A., & Pucker, B. (2020). The reuse of public datasets in the life sciences: potential risks and rewards. PeerJ. https://doi.org/10.7717/peerj.9954
- National Library of Medicine. (2022). Data Provenance. National Library of Medicine. https://www.nnlm.gov/guides/data-glossary/data-provenance#:~:text=Definition,to%20where%20it%20is%20presently
- Schröder, M., Staehlke, S., Groth, P., Nebe, J. B., Spors, S., & Krüger, F. (2022). Structure-based knowledge acquisition from electronic lab notebooks for research data provenance documentation. Journal of Biomedical Semantics, 13(1), 4.
- Rehwald, S., Leimer, S., Lindstädt, B., Shutsko, A., & Vandendorpe, J. (2022). Workshop on Research Data Management in Medical and Biomedical Sciences.
- Pauls, C., Feeken, C., Steen, E.-E., Lindstädt, B., Vandendorpe, J., & Markus, K. (2023). Workshop on Research Data Management.
- Institute of Medicine. (2015). Guiding Principles for Sharing Clinical Trial Data. In Sharing Clinical Trial Data: Maximizing Benefits, Minimizing Risk. National Academies Press (US).
- Engelhardt, C., Biernacka, K., Coffey, A., Cornet, R., Danciu, A., Demchenko, Y., Downes, S., Erdmann, C., Garbuglia, F., Germer, K., Helbig, K., Hellström, M., Hettne, K., Hibbert, D., Jetten, M., Karimova, Y., Kryger Hansen, K., Kuusniemi, M. E., Letizia, V., … Zhou, B. (2022). D7.4 How to be FAIR with your data. A teaching and training handbook for higher education institutions. https://doi.org/10.5281/ZENODO.6674301
- Finnish Social Science Data Archive. (n.d.). Why are research data managed and reused? Retrieved August 17, 2023, from https://www.fsd.tuni.fi/en/services/data-management-guidelines/why-are-research-data-managed-and-reused/
- Pavone, G. (2020). Data Reuse Stories. Some concrete cases involving several institutions and consortia in Europe. https://www.openaire.eu/blogs/data-reuse-stories-some-concrete-cases-involving-several-institutions-and-consortia-in-europe
- Code citation was made possible by research software engineers in Germany and the Netherlands. (n.d.). Retrieved August 23, 2023, from https://www.esciencecenter.nl/news/code-citation-was-made-possible-by-research-software-engineers-in-germany-and-the-netherlands/
- Wood-Charlson, E. M., Crockett, Z., Erdmann, C., Arkin, A. P., & Robinson, C. B. (2022). Ten simple rules for getting and giving credit for data. PLOS Computational Biology, 18(9), 1–11. https://doi.org/10.1371/journal.pcbi.1010476