
Investigating the Reuse of Biomedical Research Data Using the Data Citation Corpus

Cohen, Avihay ; Iarkaeva, Anastasiia ; Ivanovic, Blanka ; Nachev, Vladislav ; Bobrov, Evgeny

PDF, English - main document (1MB)
License: Creative Commons Attribution 4.0


Abstract

Considerable effort is being invested in promoting data sharing because of its multiple benefits. In addition to enhancing transparency and trust, a key benefit of data sharing is the use of data by others in subsequent research or applications, which can be defined as data reuse. Data sharing is therefore encouraged, and sometimes required, by funders, journals and other stakeholders. Research institutions are similarly interested in tracking the reuse of datasets generated by their researchers.

We analyzed the reuse of data shared by researchers from our institution, Charité – Universitätsmedizin Berlin. We examined how many and which datasets published by Charité authors were referenced in published articles. We further explored the characteristics of these cases, including the types of data identifiers that have been used, the repositories in which they were deposited and the development of references over time. Although there are many purposes of data reuse, we specifically targeted cases of references to reused data in published literature, as this is currently the only scalable way to investigate whether data is being reused.

Data reuse is rarely mentioned formally in the reference lists of articles, which makes these references difficult to identify and track. The Data Citation Corpus (DCC), which comprises more than 5 million records, is the first comprehensive approach to address this issue: it text-mines a large body of the published literature for both formal references to and in-text mentions of reused data. As part of monitoring open science practices at our institution, we have collected identifiers of open datasets. We used this information to search the DCC for references to datasets originating from our institution (defined as datasets underlying articles with at least one co-author from Charité). “Reuse” was operationalized as a reference to a dataset where the citing article and the article through which we had originally found the dataset had no author overlap.
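The operationalization of “reuse” described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the record layout, the function names `normalize_identifier` and `find_reuse`, the simple identifier normalization, and the exact-string matching of author names are all assumptions made for illustration, and the identifiers and names in the example are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Article:
    """Minimal article record; the field names are illustrative assumptions."""
    article_id: str
    authors: frozenset


def normalize_identifier(identifier):
    """Lowercase an identifier and strip common DOI prefixes (assumed scheme)."""
    cleaned = identifier.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if cleaned.startswith(prefix):
            return cleaned[len(prefix):]
    return cleaned


def find_reuse(datasets, citations, articles):
    """Map each dataset to the citing articles that share no author with its source.

    datasets:  dataset identifier -> id of the article through which it was found
    citations: iterable of (dataset identifier, citing article id) pairs
    articles:  article id -> Article
    """
    source_by_dataset = {normalize_identifier(d): a for d, a in datasets.items()}
    reuse = {}
    for raw_id, citing_id in citations:
        dataset_id = normalize_identifier(raw_id)
        source_id = source_by_dataset.get(dataset_id)
        # Skip references to unknown datasets or to articles without metadata.
        if source_id not in articles or citing_id not in articles:
            continue
        # "Reuse" here means: no author overlap between source and citing article.
        if articles[source_id].authors.isdisjoint(articles[citing_id].authors):
            reuse.setdefault(dataset_id, set()).add(citing_id)
    return {d: sorted(ids) for d, ids in reuse.items()}


# Invented example data, for illustration only.
articles = {
    "a1": Article("a1", frozenset({"A. Author", "B. Author"})),
    "a2": Article("a2", frozenset({"C. Author"})),
    "c1": Article("c1", frozenset({"D. Other"})),
    "c2": Article("c2", frozenset({"B. Author", "E. Other"})),
    "c3": Article("c3", frozenset({"F. Other"})),
}
datasets = {"GSE000001": "a1", "https://doi.org/10.1234/example": "a2"}
citations = [
    ("gse000001", "c1"),            # no shared author: counted as reuse
    ("GSE000001", "c2"),            # shares "B. Author": not counted
    ("doi:10.1234/EXAMPLE", "c3"),  # DOI written differently, same dataset
    ("GSE999999", "c1"),            # dataset not in the institutional list
]
reuse = find_reuse(datasets, citations, articles)
```

In practice, author matching on raw name strings would be fragile; persistent author identifiers, where available, would be a more robust basis for the overlap check.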

Results indicate that out of 1,268 open and restricted datasets published by Charité researchers between 2020 and 2022, 65 were referenced in the literature. These datasets were referenced by 552 articles. References were unevenly distributed, with some datasets cited frequently (> 100 times) and many others cited only once or twice. The majority of references were to datasets deposited in the Gene Expression Omnibus. Reuse of datasets published in generic repositories was extremely rare, with only one reference to a dataset shared in the Figshare repository. Our findings suggest that reuse of open biomedical data is common. The DCC can be used for detecting reuse if a set of dataset identifiers is available. However, further standardization of identifiers in the DCC is recommended. Furthermore, a standard definition of dataset reuse is critical to accurately capture data sharing and reuse processes in future analyses. Despite these challenges, detection of data reuse is possible at scale and can be used both to gauge the impact of data sharing efforts and to motivate them.

Document type: Conference Item
Place of Publication: Heidelberg
Date Deposited: 15 Apr 2025 09:48
Date: 2025
Event Dates: 12.03.2025 - 14.03.2025
Event Location: Universität Heidelberg
Event Title: E-Science-Tage 2025
Faculties / Institutes: Service facilities > Computing Centre
Collection: E-Science-Tage 2025