https://aarontay.medium.com/3-new-tools-to-try-for-literatur...
https://archive.is/Ul13s
Specific to Wikipedia:
Wikipedia Citations: Reproducible Citation Extraction from Multilingual Wikipedia [2024]
https://arxiv.org/abs/2406.19291v1
https://doi.org/10.48550/arXiv.2406.19291
> Wikipedia is an essential component of the open science ecosystem, yet it is poorly integrated with academic open science initiatives. Wikipedia Citations is a project that focuses on extracting and releasing comprehensive datasets of citations from Wikipedia. A total of 29.3 million citations were extracted from English Wikipedia in May 2020. Following this one-off research project, we designed a reproducible pipeline that can process any given Wikipedia dump in the cloud-based settings. To demonstrate its usability, we extracted 40.6 million citations in February 2023 and 44.7 million citations in February 2024. Furthermore, we equipped the pipeline with an adapted Wikipedia citation template translation module to process multilingual Wikipedia articles in 15 European languages so that they are parsed and mapped into a generic structured citation template. This paper presents our open-source software pipeline to retrieve, classify, and disambiguate citations on demand from a given Wikipedia dump.
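To make the abstract concrete, here is a rough, hypothetical sketch (not the authors' code — see the wikicite repo below for the real pipeline) of the two stages it describes: pulling citation templates out of raw wikitext, then mapping localized parameter names onto one generic schema. The regex, field names, and the German-style parameter map are all illustrative assumptions.

```python
import re

# Stage 1 (hypothetical sketch): pull {{cite ...}} / {{citation}}
# templates out of raw wikitext and read their key=value fields.
CITE_RE = re.compile(r"\{\{\s*(cite \w+|citation)\s*\|(.*?)\}\}",
                     re.IGNORECASE | re.DOTALL)

def extract_citations(wikitext):
    """Return each citation template as a dict of its key=value fields."""
    citations = []
    for match in CITE_RE.finditer(wikitext):
        template, body = match.group(1), match.group(2)
        fields = {}
        for part in body.split("|"):
            if "=" in part:
                key, _, value = part.partition("=")
                fields[key.strip().lower()] = value.strip()
        citations.append({"template": template.lower(), **fields})
    return citations

# Stage 2: translate localized parameter names into one generic schema.
# This tiny map is illustrative (German-style names); the paper's
# translation module covers templates across 15 European languages.
PARAM_MAP = {"titel": "title", "autor": "author", "jahr": "year"}

def normalize(citation):
    return {PARAM_MAP.get(key, key): value for key, value in citation.items()}

sample = ("Text.<ref>{{cite journal |last=Doe |first=J. |title=Example "
          "|journal=QSS |year=2021 |doi=10.1000/xyz}}</ref>")
print([normalize(c) for c in extract_citations(sample)])
```

Note that a flat regex like this breaks on nested templates; real dumps need a proper wikitext parser (see the MediaWiki alternative-parsers page in the bonus links).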
Prior work referenced in above abstract with some team overlap:
Wikipedia citations: A comprehensive data set of citations with identifiers extracted from English Wikipedia [2021]
https://direct.mit.edu/qss/article/2/1/1/97565/Wikipedia-cit...
https://doi.org/10.1162/qss_a_00105
Datasets:
A Comprehensive Dataset of Classified Citations with Identifiers from English Wikipedia [2024]
https://zenodo.org/records/10782978
https://doi.org/10.5281/zenodo.10782978
A Comprehensive Dataset of Classified Citations with Identifiers from Multilingual Wikipedia [2024]
https://zenodo.org/records/11210434
https://doi.org/10.5281/zenodo.11210434
Code (MIT License):
https://github.com/albatros13/wikicite
https://github.com/albatros13/wikicite/tree/multilang
Bonus links:
https://www.mediawiki.org/wiki/Alternative_parsers
https://scholarlykitchen.sspnet.org/2022/11/01/guest-post-wi...