pensoft / bicikl

License: Creative Commons Zero v1.0 Universal

Java 1.07% Scala 0.32% HTML 48.72% Jupyter Notebook 49.89%

bicikl's People

Contributors

pietrh daniel-mietchen sofiemeeus mdmtrv infovarius fnielsen nickynicolson matdillen timrobertson100 rukayaj ajsaenz-lfw ckotwn

bicikl's Issues

@teodorgeorgiev can you give commit rights to @myrmoteras and @Daniel-Mietchen ?

@teodorgeorgiev can you give commit rights to @myrmoteras and @Daniel-Mietchen ?
Thanks!

make markdown pages for project 10 Lifeblock Testnet

Currently we have direct links to GitLab on the front page, while the other teams have their own markdown pages in this repo.

To give @PietrH full access to this repository

@teodorgeorgiev could you give @PietrH (me!) full access to this repository, thank you!

Topic 3: Split othercatalognumbers for comparison between identifiers

Currently, if a record has otherCatalogNumbers=a|b|c, it is not compared with another record that has catalogNumber=a.
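A minimal Python sketch of the idea, not the clustering code itself: it splits otherCatalogNumbers on `|` and compares the resulting set with catalogNumber. The function names, the dictionary record shape, and the choice of `|` as the only delimiter are assumptions made for illustration.

```python
def candidate_identifiers(record, delimiter="|"):
    """Collect catalogNumber plus every part of otherCatalogNumbers.

    `record` is a plain dict keyed by Darwin Core term names
    (an assumed shape, chosen only for this illustration).
    """
    identifiers = set()

    catalog_number = (record.get("catalogNumber") or "").strip()
    if catalog_number:
        identifiers.add(catalog_number)

    other = record.get("otherCatalogNumbers") or ""
    for part in other.split(delimiter):
        part = part.strip()
        if part:
            identifiers.add(part)

    return identifiers


def share_identifier(record_a, record_b):
    """True when the two records have at least one identifier in common."""
    return bool(candidate_identifiers(record_a) & candidate_identifiers(record_b))


record_a = {"otherCatalogNumbers": "a|b|c"}
record_b = {"catalogNumber": "a"}
print(share_identifier(record_a, record_b))
```

With the split in place, the two example records from this issue are considered to share the identifier `a`.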

To give @sofiemeeus full access to this repository

@teodorgeorgiev could you give @sofiemeeus full access to this repository

To give @Daniel-Mietchen full access to this repository

@teodorgeorgiev could you give @Daniel-Mietchen full access to this repository

What does BFH refer to?

@myrmoteras in your hackathon topic 8 you refer to "BFH", what is this?

access to modify topic 8

How can I modify the content in topic 8?

I would like to add the following, which covers all the Linnaeus plant types, to the possible target treatments:

Target treatments
• https://tb.plazi.org/GgServer/srsStats/stats?outputFields=doc.uuid+doc.articleGbifId+bib.author+tax.name+matCit.verbatimMatCit+matCit.collectionCode+matCit.specimenCode+matCit.specimenHttpUri+matCit.accessionNumber&groupingFields=doc.uuid+doc.articleGbifId+bib.author+tax.name+matCit.verbatimMatCit+matCit.collectionCode+matCit.specimenCode+matCit.specimenHttpUri+matCit.accessionNumber&FP-bib.author=Jarvis%25&format=HTML or in JSON: https://tb.plazi.org/GgServer/srsStats/stats?outputFields=doc.uuid+doc.articleGbifId+bib.author+tax.name+matCit.verbatimMatCit+matCit.collectionCode+matCit.specimenCode+matCit.specimenHttpUri+matCit.accessionNumber&groupingFields=doc.uuid+doc.articleGbifId+bib.author+tax.name+matCit.verbatimMatCit+matCit.collectionCode+matCit.specimenCode+matCit.specimenHttpUri+matCit.accessionNumber&FP-bib.author=Jarvis%25&format=JSON
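For readability, the JSON query above can be rebuilt from its parts. This is only a sketch: it assumes that the `+` signs in the URL are form-encoded spaces between field names and that `Jarvis%25` is the encoded form of the pattern `Jarvis%`. The helper name `build_stats_url` is hypothetical, and no request is sent.

```python
from urllib.parse import urlencode

BASE_URL = "https://tb.plazi.org/GgServer/srsStats/stats"

# Field names copied from the URL in this issue.
FIELDS = [
    "doc.uuid",
    "doc.articleGbifId",
    "bib.author",
    "tax.name",
    "matCit.verbatimMatCit",
    "matCit.collectionCode",
    "matCit.specimenCode",
    "matCit.specimenHttpUri",
    "matCit.accessionNumber",
]


def build_stats_url(author_pattern, output_format="JSON"):
    """Assemble the stats URL; urlencode turns spaces into '+' and '%' into '%25'."""
    params = {
        "outputFields": " ".join(FIELDS),
        "groupingFields": " ".join(FIELDS),
        "FP-bib.author": author_pattern,
        "format": output_format,
    }
    return BASE_URL + "?" + urlencode(params)


url = build_stats_url("Jarvis%")
print(url)
```

Passing `output_format="HTML"` yields the HTML variant of the same query.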

Topic 2: Review GBIF implementation

GBIF and the team behind the EMBL INSDC data have recently built an adapter that brings the EMBL datasets into GBIF.
The adapter is here resulting in these datasets.

These records are then run through the clustering, which links 724k records to another record (not necessarily a specimen record from a museum, but often BOLD or a taxonomic treatment citing a specimen).

Because GBIF/EMBL have already gone through the process of mapping the EMBL APIs into DwC and through clustering using the catalogue number amongst other fields, it may help form some basis for exploration in this topic. The implementations are certainly not perfect, and any observations or contributions would be very welcome.

You can read a little about the collaboration here

Committer request

Can you please grant timrobertson100 committer or admin access?
If you grant me administrator access, I can then add the participants we're working with.

Thank you.

Topic 3: Improve clustering by adding an identifier with the schema instcode:catnumber

For the clustering, the triple ID (inst:coll:cat) is added as an identifier for records. In ENA, the voucher ID of several records is constructed from only instcode and catnumber (inst:cat), omitting the collection code. Adding this pattern to the array of IDs used for a specimen would increase the chances of matching an ENA record.

Datasets affected: https://www.gbif.org/dataset/d8cd16ba-bb74-4420-821e-083f2bac17c2
Example ENA record with inst:cat as identifiers: https://www.gbif.org/occurrence/3349806846
Related specimen: https://www.gbif.org/occurrence/3347309485

From https://www.insdc.org/documents/feature-table
specimen_voucher="[institution-code:[collection-code:]]specimen_id"

Currently (22 September 2021) there are at least 518,274 voucher IDs for sequences in ENA based on such a schema, which could benefit from this improvement.
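The proposal can be sketched in Python as follows. This is an illustration under assumptions, not the existing clustering implementation: the function name is hypothetical, and `INST`, `COLL` and `12345` are placeholder values rather than real codes.

```python
def voucher_identifiers(institution_code, collection_code, catalog_number):
    """Return the identifier strings to register for one specimen record.

    Emits the triple inst:coll:cat when all three parts are present and,
    as proposed in this issue, the shorter inst:cat form as well.
    """
    inst = (institution_code or "").strip()
    coll = (collection_code or "").strip()
    cat = (catalog_number or "").strip()

    identifiers = set()
    if inst and cat:
        identifiers.add(f"{inst}:{cat}")  # proposed additional pattern
        if coll:
            identifiers.add(f"{inst}:{coll}:{cat}")  # existing triple ID
    return identifiers


specimen_ids = voucher_identifiers("INST", "COLL", "12345")
ena_voucher = "INST:12345"  # placeholder voucher that lacks the collection code
print(ena_voucher in specimen_ids)
```

A specimen record that carries both forms can then match an ENA voucher whether or not the voucher includes the collection code.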

Topic 3: Explore tokenizing the recordedBy

The current algorithm does not accommodate variation in recordedBy that includes multiple collectors.
For example, recordedBy will not be considered as overlapping between a record containing recordedBy=Tim Robertson|Nicky Nicolson and another with Tim Robertson.

@nickynicolson has previous work that attempts to parse recordedBy into tokens accommodating variety in delimiters used (, | etc). This is in Python, so not easily portable to Java.

To determine if it is worth exploring this approach, we could create a new table that tokenises the recordedBy String into an array of names, and then add a SQL JOIN to create a new occurrence table containing this field (e.g. a tokenizedRecordedBy). The clustering could be modified to use this field in both the blocking and the compare stages, and a report of the impact generated.

If this identifies useful links, the best approach to incorporate this into the clustering could be explored.
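As a rough illustration of the tokenizing step, here is a minimal Python sketch. It is not @nickynicolson's parser: the function names are hypothetical, and it splits only on `|` and `;`. Commas are deliberately left alone in this sketch because they can also appear inside a single name written as "surname, given name".

```python
import re

# Delimiters assumed for this sketch; a real parser would need to cover more variants.
_DELIMITERS = re.compile(r"[|;]")


def tokenize_recorded_by(recorded_by):
    """Split a recordedBy string into a list of normalised collector names."""
    if not recorded_by:
        return []
    tokens = []
    for part in _DELIMITERS.split(recorded_by):
        name = " ".join(part.split()).lower()  # collapse whitespace, ignore case
        if name:
            tokens.append(name)
    return tokens


def collectors_overlap(recorded_by_a, recorded_by_b):
    """True when the two recordedBy values share at least one collector token."""
    return bool(set(tokenize_recorded_by(recorded_by_a))
                & set(tokenize_recorded_by(recorded_by_b)))


print(tokenize_recorded_by("Tim Robertson|Nicky Nicolson"))
print(collectors_overlap("Tim Robertson|Nicky Nicolson", "Tim Robertson"))
```

The token list corresponds to the proposed tokenizedRecordedBy field; the overlap check stands in for the comparison that the compare stage would make.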