This is the repository of the Wikipedia Citations in Wikidata grant. It's a collection of scripts that can be used to extract citations from the English Wikipedia to external bibliographic resources, and then to upload them to Wikidata.
A complete diagram with a description of all the workflow steps is available here!
Quoting the grant's description:
Our goal is to develop four software modules in Python (the codebase from now on) that can be easily reused by developers in the Wikidata community:
- extractor: a module to extract citation and bibliographic information from articles in the English Wikipedia;
- converter: a module to convert the extracted information into a CSV-based format compliant with a shareable bibliographic data model, e.g. the OpenCitations Data Model (OCDM);
- enricher: a module to reconcile bibliographic resources and people (obtained in step 2) with entities available in Wikidata via their persistent identifiers (primarily DOIs, QIDs, ORCIDs, VIAFs; then also persons, places, and organisations if time allows);
- pusher: a module to disambiguate, deduplicate, and load citation and bibliographic data into Wikidata, reusing code already developed by the Wikidata community as much as possible.
The repository folder structure reflects these same modules that constitute the entire workflow.
Each module has a README file with specific instructions on how to set up the execution environment, how to configure the module, and how to run it. Here, only a general overview of the entire process is given.
The workflow requires the scripts to be executed in the following order:
- the Extractor module takes as input a dump of the current English Wikipedia pages and outputs a parquet dataset containing the extracted citations. We suggest downloading the parquet dataset directly from here at Zenodo (the ZIP file to download is called "citations_from_wikipedia.zip").
- the Converter module takes as input the parquet dataset from the previous step and produces a set of OCDM-compliant RDF files.
- the Enricher module takes as input the RDF files from the previous step and enriches them as much as possible with external identifiers retrieved from various APIs. Once the identifiers have been added, a deduplication step is applied to each RDF file.
- the Pusher module takes as input the enriched RDF files from the previous step and produces TSV files compliant with the QuickStatements input format, enabling the user to bulk-upload the citation data to Wikidata.
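To give an idea of the kind of data the Extractor mines, the sketch below pulls a DOI out of a wikitext citation template with a simple regular expression. This is only an illustration of the identifier-extraction idea, not the project's actual parsing code, and the template text is invented:

```python
import re

# A toy wikitext citation template; the real extractor digests full dump
# pages, so this regex only illustrates the identifier-mining idea.
wikitext = ("{{cite journal |last=Turing |title=On Computable Numbers "
            "|doi=10.1112/plms/s2-42.1.230 |year=1937}}")

# Capture the value of the "doi=" parameter, stopping at the next
# pipe, closing brace, or whitespace.
match = re.search(r"\|\s*doi\s*=\s*([^|}\s]+)", wikitext)
doi = match.group(1) if match else None
print(doi)  # -> 10.1112/plms/s2-42.1.230
```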
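The QuickStatements V1 input targeted by the Pusher is a plain TSV of item / property / value triples. A minimal sketch of writing such a file is shown below; the QIDs are invented for illustration, while P2860 is Wikidata's "cites work" property:

```python
import csv
import io

# Hypothetical statements: each row links a citing item to a cited item
# via P2860 ("cites work"); the QIDs below are made up.
statements = [
    ("Q100000001", "P2860", "Q100000002"),
    ("Q100000001", "P2860", "Q100000003"),
]

# QuickStatements V1 expects tab-separated values, one statement per line.
buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
writer.writerows(statements)

tsv = buf.getvalue()
print(tsv)
```

In the real workflow the buffer would be written to a .tsv file and pasted into (or batch-submitted to) the QuickStatements tool.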
More details can be found in the README of each module; please refer to them for specific information about the inner workings of each workflow step.
Some external tools were reused, in particular:
- oc_ocdm [docs], an ORM library that makes it easy to manipulate OCDM-compliant RDF graphs through a well-defined API. It is available as a package on PyPI (click here);
- oc_graphenricher [docs], a tool that enriches a set of OCDM entities with external identifiers and then applies a deduplication step so as to remove duplicate entities that share at least one identical identifier. It is available as a package on PyPI (click here);
- meta, a tool that applies a number of preprocessing and data-cleaning techniques to a given CSV file with a compatible format, then generates and stores an OCDM-compliant RDF graph containing the same bibliographic information extracted from the CSV file. Its execution can take a long time because of the complexity of the operations performed on the given dataset;
- cite-classifications-wiki, a set of scripts intended to be run in a PySpark/Hadoop environment distributed over a cluster of machines. It can digest a full Wikipedia dump and extract citations to external bibliographic resources from it. It is currently limited to the English version of Wikipedia.
They can also be used outside this context, for purposes different from those of this project, and all of them were very helpful in the development of this workflow. More information about them can be found in their respective GitHub repositories.
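To give an intuition of the deduplication idea applied by oc_graphenricher, namely merging entities that are transitively connected through at least one shared identifier, here is a small self-contained sketch using union-find. The records and identifiers are invented, and this is not the library's actual code:

```python
# Toy bibliographic records: each maps to its set of external
# identifiers (all values below are hypothetical).
records = {
    "br/1": {"doi:10.1000/a"},
    "br/2": {"doi:10.1000/a", "pmid:111"},
    "br/3": {"pmid:111"},
    "br/4": {"doi:10.1000/b"},
}

# Union-find structure over the records, with path halving in find().
parent = {r: r for r in records}

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

# Merge records transitively connected through a shared identifier:
# the first record seen with an identifier becomes its anchor.
by_id = {}
for rec, ids in records.items():
    for i in ids:
        if i in by_id:
            union(rec, by_id[i])
        else:
            by_id[i] = rec

# Group records by their representative: each group is one deduplicated entity.
clusters = {}
for rec in records:
    clusters.setdefault(find(rec), set()).add(rec)

print(sorted(map(sorted, clusters.values())))
# -> [['br/1', 'br/2', 'br/3'], ['br/4']]
```

Here br/1, br/2, and br/3 collapse into one entity because br/2 shares a DOI with br/1 and a PMID with br/3, while br/4 stays separate.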
Tests (together with their instructions) can be found in the following sub-folders:
| Workflow step | Test sub-folder |
|---|---|
| Extractor | test |
| Converter | test |
| Enricher | test |
| Pusher | test |
Distributed under the ISC License. See LICENSE for more information.
| Project member | e-mail address |
|---|---|
| Silvio Peroni - @essepuntato | [email protected] |
| Marilena Daquino | [email protected] |
| Giovanni Colavizza | [email protected] |
| Gabriele Pisciotta | [email protected] |
| Simone Persiani | [email protected] |
Project Link: https://github.com/opencitations/wcw
This project has been developed within the context of the "Wikipedia Citations in Wikidata" grant, under the supervision of prof. Silvio Peroni.