EMBL adapter
Contains adapters for connecting EMBL content to GBIF.
This repository contains an EMBL API crawler whose output is used to produce Darwin Core Archive (DwC-A) files suitable for ingestion into GBIF. The result will replace the current EMBL dataset.
The crawler's expected use of the EMBL API is described in this working document.
The adapter is configured to run once a week at a specific time (this may change in the future). See the properties startTime and frequencyInDays in the gbif-configuration project here.
Basic steps of the adapter:
- Request data from the ENA portal API: two requests per dataset, plus one optional taxonomy request
- Store the raw data in the database
- Process the data and store the processed data in the database (performing backend deduplication)
- Clean up temporary files
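The steps above can be sketched as a small pipeline. This is purely illustrative: the real adapter is a Java application, and none of the names below come from its code.

```python
# Hypothetical sketch of one weekly adapter run; function names are illustrative.

def run_adapter(datasets, fetch, store_raw, process, cleanup):
    """For each dataset: fetch from the ENA portal API, store the raw data,
    then process it (backend deduplication); finally clean temporary files."""
    for dataset in datasets:
        raw = fetch(dataset)        # two ENA portal requests per dataset
        store_raw(dataset, raw)     # raw table in the database
        process(dataset)            # writes the processed table
    cleanup()                       # remove temporary files
```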
Requests
We get data from https://www.ebi.ac.uk/ena/portal/api. The query supports the operators AND, OR and NOT, parentheses, double-quoted values and the * wildcard.
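As an illustration, such a query can be passed to the portal's search endpoint. The endpoint and the result/query/fields/format parameters are part of the ENA Portal API; the helper function itself is just a sketch, not adapter code.

```python
from urllib.parse import urlencode

ENA_SEARCH = "https://www.ebi.ac.uk/ena/portal/api/search"

def build_search_url(query, result="sequence", fields=("accession",), fmt="tsv"):
    """Assemble an ENA portal search URL. The query string may use the
    AND/OR/NOT operators, parentheses, double quotes and the * wildcard."""
    params = {
        "result": result,
        "query": query,
        "fields": ",".join(fields),
        "format": fmt,
    }
    return ENA_SEARCH + "?" + urlencode(params)

url = build_search_url('environmental_sample=true AND NOT host="*"')
```

Note that urlencode percent-encodes the quotes and wildcards, so the query can be pasted into the URL safely.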
Requests requestUrl1 (sequence) and requestUrl2 (wgs_set) can be seen in the gbif-configuration project here.
Sequence requests
Request with result=sequence
- a dataset for eDNA: environmental_sample=True, host="" (no host)
query=(specimen_voucher="*" OR country="*") AND dataclass!="CON" AND environmental_sample=true AND NOT host="*"
- include records with environmental_sample=true
- include records with coordinates and/or specimen_voucher
- exclude records with dataclass="CON", see here
- exclude records with host
- a dataset for organism sequenced: environmental_sample=False, host="" (no host)
query=(specimen_voucher="*" OR country="*") AND dataclass!="CON" AND environmental_sample=false AND NOT host="*"
- include records with environmental_sample=false
- include records with coordinates and/or specimen_voucher
- exclude records with dataclass="CON", see here
- exclude records with host
- a dataset with hosts
query=(specimen_voucher="*" OR country="*") AND dataclass!="CON" AND host="*" AND NOT host="human*" AND NOT host="*Homo sa*"
- include records with coordinates and/or specimen_voucher
- include records with host
- exclude records with dataclass="CON", see here
- exclude records with human host
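The three queries share a common core and differ only in their environmental_sample and host clauses, which a sketch like this makes explicit. The query strings are copied verbatim from the list above; the dictionary keys are just illustrative labels.

```python
# The part shared by all three sequence queries.
CORE = '(specimen_voucher="*" OR country="*") AND dataclass!="CON"'

# Per-dataset variations (labels are illustrative, not adapter identifiers).
QUERIES = {
    "edna":     CORE + ' AND environmental_sample=true AND NOT host="*"',
    "organism": CORE + ' AND environmental_sample=false AND NOT host="*"',
    "hosts":    CORE + ' AND host="*" AND NOT host="human*" AND NOT host="*Homo sa*"',
}
```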
WGS_SET request
Request with result=wgs_set.
These requests are largely the same, with two differences:
- the sequence_md5 field is not supported; specimen_voucher is used twice to keep the number of fields the same
- the dataclass filter is not used
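Under those two rules, the wgs_set field list can be derived from the sequence one. The example field names below are illustrative, not the adapter's configured field list.

```python
def wgs_set_fields(sequence_fields):
    """sequence_md5 is not supported for wgs_set, so substitute a second
    specimen_voucher to keep the number of fields the same."""
    return ["specimen_voucher" if f == "sequence_md5" else f
            for f in sequence_fields]
```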
Taxonomy
The adapter requests taxonomy separately: it downloads a zipped archive, unzips it and stores the contents in the database. The configuration is here.
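The unzip part of that step can be sketched as follows; the database-loading part is omitted, and the function name is hypothetical rather than taken from the adapter.

```python
import io
import zipfile

def extract_taxonomy(zip_bytes, dest_dir):
    """Unzip a downloaded taxonomy archive into dest_dir and
    return the names of the extracted members."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        names = archive.namelist()
        archive.extractall(dest_dir)
    return names
```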
Database
After execution the data is stored in a PostgreSQL database. Each dataset has its own table with raw and processed data.
The database is created only once in the target environment, and the tables are cleaned up before every run.
Database creation scripts for data and taxonomy.
See gbif-configuration here and here for connection properties.
Backend deduplication
We perform several deduplication steps.
First step
Run an SQL script (local copy here) to remove some duplicates and join the data with taxonomy; based on the issue here.
Second step
Discard records that are missing both specimen_voucher and collection_date.
Third step
Keep only one record per combination of sample_accession and scientific_name and discard the rest.
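The second and third steps can be sketched on in-memory rows. In the adapter these steps run as SQL against the PostgreSQL tables; the function below is only an illustration of the logic.

```python
def deduplicate(rows):
    """rows: dicts with sample_accession, scientific_name, specimen_voucher
    and collection_date keys (values may be None)."""
    # Second step: drop records missing both specimen_voucher and collection_date.
    kept = [r for r in rows
            if r.get("specimen_voucher") or r.get("collection_date")]
    # Third step: keep only one record per (sample_accession, scientific_name).
    seen, result = set(), []
    for r in kept:
        key = (r["sample_accession"], r["scientific_name"])
        if key not in seen:
            seen.add(key)
            result.append(r)
    return result
```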
DWC archives
The adapter stores all processed data back into the database (tables with the postfix _processed), which are then used by the IPT as SQL sources.
Test datasets (UAT):
- https://www.gbif-uat.org/dataset/ee8da4a4-268b-4e91-ab5a-69a04ff58e7a
- https://www.gbif-uat.org/dataset/768eeb1f-a208-4170-9335-2968d17c7bdc
- https://www.gbif-uat.org/dataset/10628730-87d4-42f5-b593-bd438185517f
and the production ones (prod):
- https://www.gbif.org/dataset/583d91fe-bbc0-4b4a-afe1-801f88263016
- https://www.gbif.org/dataset/393b8c26-e4e0-4dd0-a218-93fc074ebf4e
- https://www.gbif.org/dataset/d8cd16ba-bb74-4420-821e-083f2bac17c2
Configuration
Remember that all configuration files are in the private gbif-configuration project!
Configuration files in the directory src/main/resources do not affect the adapter and can be used, for example, for testing (local runs).
Local run
Use the scripts start.sh and start-taxonomy.sh for local testing. Remember to provide valid logback and config files for the scripts (you may need to create the databases before running).