Bioschemas Data Harvesting

Details of the harvesting of Bioschemas markup from live deployments on the Web.

The initial purpose is to track the harvesting of data for use in Project 29 at the BioHackathon-Europe 2021. The harvesting will be conducted with BMUSE and the data hosted on a server at Heriot-Watt University.

BioHackathon 2021 Harvest

We aim to harvest data from the sites on the Bioschemas live deploy page for which we have a sitemap. We will also include sites where we have a list of URLs. Full details of the datasets to be harvested and their progress can be found on the project board.
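For sites that publish a sitemap, the pages to harvest can be enumerated from the sitemap's `<loc>` entries. A minimal sketch, assuming the standard sitemaps.org XML layout (the sample document below is illustrative, not taken from any of the listed deployments):

```python
# Sketch: extract page URLs from a sitemaps.org-style sitemap.
# The sample XML is illustrative only.
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(xml_text: str) -> list[str]:
    """Return the <loc> entries of a <urlset> sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]

sample = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/entry/1</loc></url>
  <url><loc>https://example.org/entry/2</loc></url>
</urlset>"""

print(sitemap_urls(sample))
```

Sitemap indexes (a `<sitemapindex>` pointing at further sitemaps) would need an extra level of recursion, which is omitted here.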

We have loaded the harvested data into a GraphDB triplestore:

Notes about datasets included in the collection.

Data Harvested with BMUSE

  1. DisProt: 2,044 pages harvested using the dynamic scraper (v0.4.0) on 20 October 2021
  2. MobiDB: 2,083 pages harvested using the dynamic scraper (v0.4.0) on 27 October 2021
  3. Paired Omics: 78 pages harvested using the dynamic scraper (v0.5.0) on 28 October 2021
  4. BridgeDb: 2 pages harvested using the static scraper (v0.5.1) on 2 November 2021
  5. PCDDB: 1,402 pages harvested using the static scraper (v0.5.1) on 2 November 2021
  6. MassBank: 76,253 pages harvested using the static scraper (v0.5.0) on 4 November 2021; 10,326 pages did not harvest due to errors in their JSON-LD. For loading into the triplestore, the N-Quads files were merged using the command find . -name '*.nq' -exec cat {} \; > massbank.nq (quoting the glob so the shell does not expand it), as detailed here.
  7. Cosmic: 2,424 pages harvested using the static scraper (v0.5.2) on 4 November 2021
  8. Nanocommons: 3 pages harvested using the static scraper (v0.5.2) on 4 November 2021
  9. Alliance of Genome Resources: 12 pages harvested using scraper (v0.5.2) on 5 November 2021
  10. BioVersions: 3 pages harvested using the static scraper (v0.5.2) on 5 November 2021
  11. EGA: 11,834 pages harvested using scraper (v0.5.2) on 5 November 2021; 745 pages could not be harvested
  12. IFB: 87 pages harvested using scraper (v0.5.2) on 5 November 2021
  13. PDBe: 672 pages harvested using scraper (v0.5.2) on 5 November 2021
  14. Prosite: 5,859 pages harvested using scraper (v0.5.2) on 5 November 2021
  15. UniProt: 3 pages harvested using the static scraper (v0.5.2) on 5 November 2021
  16. FAIRsharing: 6,351 pages harvested using scraper (v0.5.2) on 6 November 2021
  17. COVID19 Portal: 20 pages harvested using the dynamic scraper (v0.5.2) on 7 November 2021
  18. GBIF: 68,167 pages harvested using the static scraper (v0.5.2) on 7 November 2021
  19. TeSS: 13,940 pages harvested using scraper (v0.5.2) on 7 November 2021
  20. Scholia:
    • 5,345 pages harvested out of 660k supplied URLs using dynamic scraper (v0.5.2) on 8 November 2021; 1 page did not scrape
    • 68,974 pages harvested using dynamic scraper (v0.5.2) on 10 November 2021; 21 pages did not scrape
  21. Protein Ensemble Database (PED): 187 pages harvested using the dynamic scraper (v0.5.2) on 9 November 2021
  22. Bgee: statically scraped (v0.5.2) on 9-10 November 2021
  23. COVIDmine (no longer maintained): 49,959 pages scraped using the dynamic scraper (v0.5.2) on 8 November 2021
  24. MetaNetX: statically scraped (v0.5.2) on 11 November 2021
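The per-page merge step used for MassBank above can also be done without relying on the shell. A minimal Python sketch (directory and file names are illustrative):

```python
# Sketch: concatenate per-page N-Quads (.nq) files into a single file
# for bulk loading into a triplestore, mirroring the `find ... -exec cat`
# command used for MassBank. Paths are illustrative.
from pathlib import Path

def merge_nquads(src_dir: str, out_file: str) -> int:
    """Append every *.nq file under src_dir to out_file; return the file count."""
    count = 0
    with open(out_file, "w", encoding="utf-8") as out:
        for nq in sorted(Path(src_dir).rglob("*.nq")):
            out.write(nq.read_text(encoding="utf-8"))
            count += 1
    return count
```

Unlike an unquoted `find . -name *.nq`, this does not depend on how the shell expands the glob, and the sorted order makes the merged output reproducible.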

Data Feeds and Associated Named Graph

We have started testing the loading of data dumps made available through the experimental Schema.org data feed mechanism. The following table details the feeds that have been loaded. The raw data is available here.

| Data Source | Date Generated | Date Loaded | Named Graph |
| --- | --- | --- | --- |
| bio.tools | 2021-11-09 | 2021-12-17 | http://bio.tools/comp-tools-0.6-draft/ |
| chembl-28 | 2022-01-15 | 2022-03-04 | https://www.ebi.ac.uk/chembl-28/ |

The following triples were hand-inserted to track the provenance of the data feeds. Note that the location given by pav:retrievedFrom refers to the domain of the data, and the date pav:retrievedOn is the date the data was generated. This keeps the provenance consistent with the data coming from BMUSE.

# Bio.Tools
INSERT DATA {
<http://bio.tools/comp-tools-0.6-draft/> <http://purl.org/pav/retrievedFrom> <https://bio.tools> .
<http://bio.tools/comp-tools-0.6-draft/> <http://purl.org/pav/retrievedOn> "2021-11-09T09:28:45"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
<http://bio.tools/comp-tools-0.6-draft/> a <https://schema.org/DataFeed> .
}

# ChEMBL 28
INSERT DATA {
<https://www.ebi.ac.uk/chembl-28/> <http://purl.org/pav/retrievedFrom> <https://www.ebi.ac.uk/chembl/> .
<https://www.ebi.ac.uk/chembl-28/> <http://purl.org/pav/retrievedOn> "2022-01-15T09:28:45"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
<https://www.ebi.ac.uk/chembl-28/> a <https://schema.org/DataFeed> .
}
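Hand-writing these statements for each new feed is error-prone (note the pav:retrievedOn predicate, which is easy to mistype). A small sketch that templates the same provenance pattern, shown with the bio.tools values from the text:

```python
# Sketch: build the provenance INSERT DATA statement for a data feed's
# named graph, following the hand-written pattern above. The IRIs passed
# in the example call are the bio.tools values from the text.
def feed_provenance_insert(graph: str, source: str, retrieved_on: str) -> str:
    """Return a SPARQL INSERT DATA statement recording feed provenance."""
    return (
        "INSERT DATA {\n"
        f"<{graph}> <http://purl.org/pav/retrievedFrom> <{source}> .\n"
        f"<{graph}> <http://purl.org/pav/retrievedOn> "
        f"\"{retrieved_on}\"^^<http://www.w3.org/2001/XMLSchema#dateTime> .\n"
        f"<{graph}> a <https://schema.org/DataFeed> .\n"
        "}"
    )

print(feed_provenance_insert(
    "http://bio.tools/comp-tools-0.6-draft/",
    "https://bio.tools",
    "2021-11-09T09:28:45",
))
```

The generated statement can then be submitted to the triplestore's SPARQL update endpoint; it assumes the graph and source values are already valid IRIs.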

Contributors

albangaignard, egonw
