Bioschemas Data Harvesting

Details of the harvesting of Bioschemas markup from live deployments on the Web.

The initial purpose is to track the harvesting of data for use in Project 29 at the BioHackathon-Europe 2021. The harvesting will be conducted with BMUSE and the data hosted on a server at Heriot-Watt University.

BioHackathon 2021 Harvest

We aim to harvest data from the sites on the Bioschemas live deploy page for which we have a sitemap. We will also include sites where we have a list of URLs. Full details of the datasets to be harvested and their progress can be found on the project board.
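For sites that publish a sitemap, the pages to harvest can be enumerated from the sitemap's `<loc>` entries. A minimal sketch, assuming the standard sitemaps.org XML layout (the sample document below is illustrative, not taken from any of the listed deployments):

```python
# Sketch: extract page URLs from a sitemaps.org-style sitemap.
# The sample XML is illustrative only.
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(xml_text: str) -> list[str]:
    """Return the <loc> entries of a <urlset> sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]

sample = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/entry/1</loc></url>
  <url><loc>https://example.org/entry/2</loc></url>
</urlset>"""

print(sitemap_urls(sample))
```

Sitemap indexes (a `<sitemapindex>` pointing at further sitemaps) would need an extra level of recursion, which is omitted here.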

We have loaded the harvested data into a GraphDB triplestore:

Notes about datasets included in the collection.

Data Harvested with BMUSE

  1. DisProt: 2,044 pages harvested using the dynamic scraper (v0.4.0) on 20 October 2021
  2. MobiDB: 2,083 pages harvested using the dynamic scraper (v0.4.0) on 27 October 2021
  3. Paired Omics: 78 pages harvested using the dynamic scraper (v0.5.0) on 28 October 2021
  4. BridgeDb: 2 pages harvested using the static scraper (v0.5.1) on 2 November 2021
  5. PCDDB: 1,402 pages harvested using the static scraper (v0.5.1) on 2 November 2021
  6. MassBank: 76,253 pages harvested using the static scraper (v0.5.0) on 4 November 2021; 10,326 pages did not harvest due to errors in their JSON-LD. For loading into the triplestore, the N-Quads files were merged using the command find . -name '*.nq' -exec cat {} \; > massbank.nq (quoting the glob so the shell does not expand it), as detailed here.
  7. Cosmic: 2,424 pages harvested using the static scraper (v0.5.2) on 4 November 2021
  8. Nanocommons: 3 pages harvested using the static scraper (v0.5.2) on 4 November 2021
  9. Alliance of Genome Resources: 12 pages harvested using scraper (v0.5.2) on 5 November 2021
  10. BioVersions: 3 pages harvested using the static scraper (v0.5.2) on 5 November 2021
  11. EGA: 11,834 pages harvested using scraper (v0.5.2) on 5 November 2021; 745 pages could not be harvested
  12. IFB: 87 pages harvested using scraper (v0.5.2) on 5 November 2021
  13. PDBe: 672 pages harvested using scraper (v0.5.2) on 5 November 2021
  14. Prosite: 5,859 pages harvested using scraper (v0.5.2) on 5 November 2021
  15. UniProt: 3 pages harvested using the static scraper (v0.5.2) on 5 November 2021
  16. FAIRsharing: 6,351 pages harvested using scraper (v0.5.2) on 6 November 2021
  17. COVID19 Portal: 20 pages harvested using the dynamic scraper (v0.5.2) on 7 November 2021
  18. GBIF: 68,167 pages harvested using the static scraper (v0.5.2) on 7 November 2021
  19. TeSS: 13,940 pages harvested using scraper (v0.5.2) on 7 November 2021
  20. Scholia:
    • 5,345 pages harvested out of 660k supplied URLs using dynamic scraper (v0.5.2) on 8 November 2021; 1 page did not scrape
    • 68,974 pages harvested using dynamic scraper (v0.5.2) on 10 November 2021; 21 pages did not scrape
  21. Protein Ensemble Database (PED): 187 pages harvested using the dynamic scraper (v0.5.2) on 9 November 2021
  22. Bgee: statically scraped (v0.5.2) on 9-10 November 2021
  23. COVIDmine (no longer maintained): 49,959 pages scraped using the dynamic scraper (v0.5.2) on 8 November 2021
  24. MetaNetX: statically scraped (v0.5.2) on 11 November 2021
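The per-page merge step used for MassBank above can also be done without relying on the shell. A minimal Python sketch (directory and file names are illustrative):

```python
# Sketch: concatenate per-page N-Quads (.nq) files into a single file
# for bulk loading into a triplestore, mirroring the `find ... -exec cat`
# command used for MassBank. Paths are illustrative.
from pathlib import Path

def merge_nquads(src_dir: str, out_file: str) -> int:
    """Append every *.nq file under src_dir to out_file; return the file count."""
    count = 0
    with open(out_file, "w", encoding="utf-8") as out:
        for nq in sorted(Path(src_dir).rglob("*.nq")):
            out.write(nq.read_text(encoding="utf-8"))
            count += 1
    return count
```

Unlike an unquoted `find . -name *.nq`, this does not depend on how the shell expands the glob, and the sorted order makes the merged output reproducible.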

Data Feeds and Associated Named Graph

We have started testing the loading of data dumps made available through the experimental Schema.org data feed mechanism. The following table details the feeds that have been loaded. The raw data is available here.

| Data Source | Date Generated | Date Loaded | Named Graph |
| --- | --- | --- | --- |
| bio.tools | 2021-11-09 | 2021-12-17 | http://bio.tools/comp-tools-0.6-draft/ |
| chembl-28 | 2022-01-15 | 2022-03-04 | https://www.ebi.ac.uk/chembl-28/ |

The following triples were hand-inserted to track the provenance of the data feeds. Note that the location given by pav:retrievedFrom refers to the domain of the data, and the date pav:retrievedOn is the date the data was generated. This keeps the provenance consistent with the data coming from BMUSE.

# Bio.Tools
INSERT DATA {
<http://bio.tools/comp-tools-0.6-draft/> <http://purl.org/pav/retrievedFrom> <https://bio.tools> .
<http://bio.tools/comp-tools-0.6-draft/> <http://purl.org/pav/retrievedOn> "2021-11-09T09:28:45"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
<http://bio.tools/comp-tools-0.6-draft/> a <https://schema.org/DataFeed> .
}

# ChEMBL 28
INSERT DATA {
<https://www.ebi.ac.uk/chembl-28/> <http://purl.org/pav/retrievedFrom> <https://www.ebi.ac.uk/chembl/> .
<https://www.ebi.ac.uk/chembl-28/> <http://purl.org/pav/retrievedOn> "2022-01-15T09:28:45"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
<https://www.ebi.ac.uk/chembl-28/> a <https://schema.org/DataFeed> .
}
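Hand-writing these statements for each new feed is error-prone (note the pav:retrievedOn predicate, which is easy to mistype). A small sketch that templates the same provenance pattern, shown with the bio.tools values from the text:

```python
# Sketch: build the provenance INSERT DATA statement for a data feed's
# named graph, following the hand-written pattern above. The IRIs passed
# in the example call are the bio.tools values from the text.
def feed_provenance_insert(graph: str, source: str, retrieved_on: str) -> str:
    """Return a SPARQL INSERT DATA statement recording feed provenance."""
    return (
        "INSERT DATA {\n"
        f"<{graph}> <http://purl.org/pav/retrievedFrom> <{source}> .\n"
        f"<{graph}> <http://purl.org/pav/retrievedOn> "
        f"\"{retrieved_on}\"^^<http://www.w3.org/2001/XMLSchema#dateTime> .\n"
        f"<{graph}> a <https://schema.org/DataFeed> .\n"
        "}"
    )

print(feed_provenance_insert(
    "http://bio.tools/comp-tools-0.6-draft/",
    "https://bio.tools",
    "2021-11-09T09:28:45",
))
```

The generated statement can then be submitted to the triplestore's SPARQL update endpoint; it assumes the graph and source values are already valid IRIs.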

Contributors

albangaignard, egonw
