gbif / embl-adapter Goto Github PK

2.0 8.0 2.0 165 KB

Contains adapters for connecting EMBL content into GBIF

License: Apache License 2.0

Java 99.51% Shell 0.49%

embl-adapter's Introduction

EMBL adapter

Contains adapters for connecting EMBL content into GBIF.

This repository contains an EMBL API crawler that produces data that is later used for producing DwC-A files suitable for ingestion into GBIF. The result will replace the current EMBL dataset.

Expected use of the EMBL API by the crawler is described in this working document

The adapter is configured to run once a week at a specific time (might be changed in the future). See properties startTime and frequencyInDays in the gbif-configuration project here.

Basic steps of the adapter:

Request data from ENA portal API, two requests for each dataset + one taxonomy request (optional)
Store raw data into database
Process and store processed data into database (Perform backend deduplication)
Clean temporal files

Requests

We get data from https://www.ebi.ac.uk/ena/portal/api. Query supports the following operators and characters: AND, OR, NOT, (), """, *.

Requests requestUrl1 (sequence) and requestUrl2 (wgs_set) can be seen at gbif-configuration project here.

Sequence requests

Request with result=sequence

a dataset for eDNA: environmental_sample=True, host="" (no host)

query=(specimen_voucher="*" OR country="*") AND dataclass!="CON" AND environmental_sample=true AND NOT host="*"

include records with environmental_sample=true
include records with coordinates and\or specimen_voucher
exclude records dataclass="CON" see here
exclude records with host

a dataset for organism sequenced: environmental_sample=False, host="" (no host)

query=(specimen_voucher="*" OR country="*") AND dataclass!="CON" AND environmental_sample=false AND NOT host="*"

include records with environmental_sample=false
include records with coordinates and\or specimen_voucher
exclude records dataclass="CON" see here
exclude records with host

a dataset with hosts

query=(specimen_voucher="*" OR country="*") AND dataclass!="CON" AND host="*" AND NOT host="human*" AND NOT host="*Homo sa*"

include records with coordinates and\or specimen_voucher
include records with host
exclude records dataclass="CON" see here
exclude records with human host

WGS_SET request

Request with result=wgs_set. These requests are pretty much the same with some differences:

sequence_md5 field not supported, use specimen_voucher twice to match number of fields
do not use dataclass filter

Taxonomy

Adapter requests taxonomy separately: download a zipped archive, unzip it and store it into database. Configuration is here.

Database

The data is stored in the postgres database after execution. Each dataset has own table with raw and processed data.

Database is created only once in the target environment and tables are cleaned up before every run.

Database creation scripts for data and taxonomy.

See gbif-configuration here and here for connection properties.

Backend deduplication

We perform several deduplication steps.

First step

Perform SQL (local copy here), get rid of some duplicates and join data with taxonomy; based on issue here

Second step

Get rid of records with both missing specimen_voucher and collection_date

Third step

Keep only one record with same sample_accession and scientific_name and get rid of the rest

DWC archives

Adapter stores all processed data back into database (tables with postfix _processed) which then used by IPT as SQL sources.

Test datasets (UAT):

and production ones (prod):

Data mapping

Configuration

Remember that all configuration files are in the private gbif-configuration project!

Configuration files in the directory src/main/resources do not affect the adapter and can be used, for example, for testing (local run).

Local run

Use scripts start.sh and start-taxonomy.sh for local testing. Remember to provide valid logback and config files for the scripts (you may need to create databases before run).

embl-adapter's People

Contributors

Stargazers

Watchers

Forkers

thomasstjerne manongros

embl-adapter's Issues

Explore ENA Sample Metadata

From an email exchange with @qgroom

A sequence in ENA https://www.ebi.ac.uk/ena/browser/view/FR997887
The same sequence in GBIF https://www.gbif.org/occurrence/3349789425

A sample in ENA https://www.ebi.ac.uk/ena/browser/view/SAMEA7633093
This is not in GBIF, but note the much richer data in a sample

I notice we do have the museum record for the latter.

We should explore if GBIF could/should be using anything from the sample ENA. Storing sample_accession from the responses in dwc:materialSampleID may be part of the solution.

Set pubDate to the day the archive is generate in the EML

Currently, in the EML the <pubDate> is hard coded but it should be set to the day the dataset is created (i.e. the day the process is run).

Capture sample_accession

From discussion in the BICKIL meeting.
Consider storing sample_accession from the responses in dwc:materialSampleID

Add occurrenceRemarks (description field from EMBL)

There might be useful information in the ´description´ field from EMBL:
https://www.ebi.ac.uk/ena/portal/api/search?result=sequence&format=json&limit=100&query=%20accession=%22MW554910%22&fields=accession,location,country,identified_by,collected_by,collection_date,specimen_voucher,sequence_md5,scientific_name,tax_id,altitude,sex,description

Additions to the existing processing of the sequence data

When querying sequence data from the EMBL API, add a filter to exclude CONTIGs. E.g query=country="*"%20AND%20dataclass!="CON" to get data where country exists, but not CONTIGs.
Add sample_accession to fields in the query. see: https://github.com/gbif/embl-adapter/blob/master/DATAMAPPING.md#example-queries
You need to add a "seen before" filter when you iterate through the data. If a record has a sample_accession you should check if you have already seen the same sample_accession AND the scientific_name in a previous record. Could probably just be a HashMap (you dont need to store the records, just the combination of sample_accession AND scientific_name). If the combination was seen before, skip the the record

Additional data / New query

Use the same API queries as with result=sequence but set result=wgs_set
You have to remove sequence_md5 from fields (not present in this result format)
This data should be added to the above/existing data (replacing what was filtered out by the CON filter)

regex rules trying to map institutionCode / collectionCode when available

Very useful datasets.

Sometime ago I tried NCBI's LinkOut system to set Genbank published sequences linked back to our specimens.

Now I am interested in doing it the opposite way: search INSDC Sequences dataset and then link specimens from my institution to those sequences inside gbif context.

It would be nice if collectionCode (and/or insitutionCode) were available to filter INSDC sequences dataset.
Institutions could use a direct link to the subset of sequences attributed to their collections' specimens.

I might be wrong, but those codes seem to be always empty: only catalogNumber is provided.
So it's difficult to find sequences citing voucher specimens of a given collection/institution.

It's a shame because many times both of code and numbers are available in source data, but they are being mapped together into dwc catalogNumber field. A few examples:

These are collectionCode+catalogNumber concatenations:
- No separator: https://www.gbif.org/occurrence/3349765325
  Voucher specimen here: https://www.gbif.org/occurrence/2821299116
- ':' separator: https://www.gbif.org/occurrence/3350297248
  Voucher specimen here: https://www.gbif.org/occurrence/895066924
- ' ' separator: https://www.gbif.org/occurrence/4008248179
  Voucher specimen here: https://www.gbif.org/occurrence/895006983
I also found sometimes inverted combinations (catalogNumber before collectionCode):
- No separator: https://www.gbif.org/occurrence/3990282449
  Voucher specimen here: https://www.gbif.org/occurrence/2821301687
Also sometimes (collectionCode) goes inside parentheses: https://www.gbif.org/occurrence/3817942152
This shows a collectionCode, wrongly mapped as a catalogNumber: https://www.gbif.org/occurrence/3350565676
(but I might find out the number in publication and link back anyway)

Perhaps you could try to implement some regex rules and search for collectionCode in those cases which might be clear enough to solve?
i.e. having a list of candidate collectionCode names in upper case, try to check if they are followed or preceded by a catalogNumber.

Whenever you find a non-numeric uppercase string it might well be just a collectionCode and not a catalogNumber (see 4th example). Specially if it matches one of those candidate collectionCode names.

With candidate collectionCode names I mean using Index Herbariorum codes, or different collectionCode names published in GBIF.
EDIT: one difficulty of this task is to decide whether the codes shown in INSDC sequences voucher information should be mapped as collectionCode or institutionCode. I'd just map them to both fields because that's impossible to solve:
I have seen different specimens of a given institution and collection, and the same researcher has cited them in 3 different ways: "IC:CC:CN", "IC:CN" (no info about collection) and "CC:CN"

Anyway, thanks a lot for publishing this dataset
@abubelinha

Document the complete mapping from EMBL API to DwC-A

There is a start to this but on first glance it seems incomplete. E.g. the sequence description, the accession is listed as mapped to 3 fields which seems suspicious, and important IDs like study_accession, sample_accession and bio_material seem missing.

I suggest we document the complete specification including the extensions that will be used and also what scope the resulting datasets will be (e.g. is it a single, or multiple datasets for different types of deposits)

Extend materialised identifiers to other formats of ENA voucher IDs

For the clustering, a the triple ID (inst:coll:cat) is added as an identifier for records. In ENA, for several records the voucher ID is constructed just with instcode and catnumber (inst:cat), missing the collection code. Adding this pattern to the array of used IDs for a specimen would increase the chances of matching with an ENA record.

Datasets affected: https://www.gbif.org/dataset/d8cd16ba-bb74-4420-821e-083f2bac17c2
Example ENA record with inst:cat as identifiers: https://www.gbif.org/occurrence/3349806846
Related specimen: https://www.gbif.org/occurrence/3347309485

From https://www.insdc.org/documents/feature-table
specimen_voucher="[institution-code:[collection-code:]]specimen_id"

Currently (22 September 21) there are at least 518.274 voucherIDs for sequences in ENA based on such a schema.

Capture taxonomy from scientificName

Following up on #10 to add improvements to the adapter.

This dataset has a lot of scientificNames that contain some information of a taxonomic level, e.g. 'uncultured bacterium' which could be used to populate Kingdom = Bacteria or 'uncultured Amoebozoa' which could be used as Kingdom = Protozoa, instead of Kingdom = Incertae sedis.

I will update this issue next week if the other datasets also have the same issue with the scientificNames when I check them.

Splitting into several datasets depending on sample type

We could use the he fields “environmental_sample” and “host” fields to split the data. Users will probably want to know whether a bacteria comes from the water of a river or the guts of a fish.

We could have:

a dataset for eDNA: environmental_sample=True
a dataset for organism sequenced: environmental_sample=False & host=“”
a dataset with hosts? If so, we have to think about how to model this (these are probably questions for https://github.com/tdwg/dwc-qa if there aren't already guidelines for that):
- should the host be an occurrence as well?
- how do we map the relationship between host and sequence organism?
- Should we use the fields associatedOccurrences or associatedOrganisms for example?

Deduplicate entries before generating DwCA

How can we ensure that we dont have these duplicates (gbif/portal-feedback#2064)?
From what I can see on the issue, there are two solutions:

giving preference to Sample for sample-linked records and using set masters
cluster by prefix which is 4 chars or longer

Although I am not sure which prefix we are talking about

I dont see any sample identifier in the API call given as example: https://www.ebi.ac.uk/ena/portal/api/search?result=sequence&format=json&limit=100&query=geo_box1(-90%2C-180%2C90%2C180)&fields=accession,location,country,identified_by,collected_by,collection_date,specimen_voucher,sequence_md5,scientific_name,tax_id,altitude,sex

In this example, I suspect that the first four results refer to the same organism
For example:

Have the same organism, same location/scientific name/associated publication but one is a sequence of ghbA2 mRNA for giant hemoglobin A1 globin chain and the other one a sequence of ghbA2 mRNA for giant hemoglobin A2 globin chain.
if I add “sample_accession”, it wouldn’t help in this case (https://www.ebi.ac.uk/ena/portal/api/search?result=sequence&format=json&limit=100&query=geo_box1(-90%2C-180%2C90%2C180)&fields=accession,location,country,identified_by,collected_by,collection_date,specimen_voucher,sequence_md5,scientific_name,tax_id,altitude,sex,sample_accession) since they didn’t fill this field.

I guess we could also group the data by date, location, scientific name? In most cases, this would solve the issue.

Get large amount of data from EMBL

It seems EMBL API restricts offset parameter to ~~10M records~~ 1M records, so we can't get more data than that.
There are ~27M records sequences which has coordinates:
Request:
https://www.ebi.ac.uk/ena/portal/api/search?result=sequence&format=json&limit=100&query=country=%22*%22&fields=accession,location,country,identified_by,collected_by,collection_date,specimen_voucher,sequence_md5,scientific_name,tax_id,altitude,sex
Count:
https://www.ebi.ac.uk/ena/portal/api/count?result=sequence&query=country=%22*%22

We can probably try using API with the query parameter download set to true and download data as a file. Something like this:
https://www.ebi.ac.uk/ena/portal/api/search?result=sequence&format=tsv&query=country=%22*%22&fields=accession,location,country,identified_by,collected_by,collection_date,specimen_voucher,sequence_md5,scientific_name,tax_id,altitude,sex&download=true&[email protected]
Looks like it also requires an email. I didn't manage to get data this way because EMBL sent an email with the following link:
https://www.ebi.ac.uk/ena/portal/api/search?result=sequence&savedSearch=null
which is not working. Maybe we should provide some more parameters, but I didn't find more information about that.

Their tool for requests:
https://www.ebi.ac.uk/ena/portal/api/#/Portal%20API/searchUsingGET

Reformat values in the associatedTaxa field

Right now the field contains only the name of the host, it would be good to have the type of association included. For example, "host":"Lutra lutra" (Lutra Lutra is the name of the host in this example)

Review if Targeted Locus Study (TLS) and Transcriptome Shotgun Assembly (TSA) should be included

It looks like the current implementation omits the TLS TSA set data

From our original draft

Set records are available under the wgs_set, tsa_set and tls_set result types, for whole genome shotgun, transcriptome shotgun assembly and targeted locus experiments, respectively. For each set, a “master” sequence record captures general information and these results, rather than individual contig records, are returned to the user.

We should review if either of these should be included in the datasets

Add higher taxonomy

We already have a checklist of the NCBI Taxonomy: https://www.gbif.org/dataset/fab88965-e69d-4491-a04d-e3198b626e52
(scheduled for monthly updates, https://builds.gbif.org/job/ncbi-dwca/).
It would be great to use its content to infer higher taxonomy of scientific names for occurrences: Kingdom, Phylum, Class, Order, Family, Genus

Capture identifiers in culture_collection and/or strain fields

From email correspondence with Kessy Abarenkov:
Sometimes strain (https://www.ncbi.nlm.nih.gov/nuccore/GQ503643) or culture_collection (https://www.ncbi.nlm.nih.gov/nuccore/EU427289) field is filled, sometimes they are both filled, either containing the same (https://www.ncbi.nlm.nih.gov/nuccore/GU256745) or different identifier (https://www.ncbi.nlm.nih.gov/nuccore/KX950433).

There might be identifiers in these fields that could support clustering

First step: Get data, map to DwC, and write a small DwC archive for initial testing

This query gives the first 1000 data rows from EMBL:

https://www.ebi.ac.uk/ena/portal/api/search?result=sequence&format=json&limit=1000&query=geo_box1(-90%2C-180%2C90%2C180)&fields=accession,location,country,identified_by,collected_by,collection_date,specimen_voucher,sequence_md5,scientific_name,tax_id,altitude,sex

Fetch them, and map fields according to https://github.com/gbif/embl-adapter/blob/main/DATAMAPPING.md

You can use this EML metadata as a starting point: https://api.gbif.org/v1/dataset/ad43e954-dd79-4986-ae34-9ccdbd8bf568/document