pensoft / bicikl
License: Creative Commons Zero v1.0 Universal
@teodorgeorgiev can you give commit rights to @myrmoteras and @Daniel-Mietchen ?
Thanks!
Currently we have direct links to gitlab on the front page, while the other teams have their own markdown pages on this repo
@teodorgeorgiev could you give @PietrH (me!) full access to this repository, thank you!
Currently if someone has otherCatalogNumbers=a|b|c it won’t get compared with another record with catalogNumber=a.
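One way to picture the missing comparison is the minimal sketch below. It is not the existing clustering code: the function names are hypothetical, records are assumed to be plain dictionaries keyed by Darwin Core term, and the pipe is assumed to be the only delimiter in otherCatalogNumbers, as in the example above.

```python
def catalog_numbers(record):
    """Collect catalogNumber plus every otherCatalogNumbers entry as a set.

    Assumption: otherCatalogNumbers is pipe-delimited, as in the example a|b|c.
    """
    values = set()
    primary = (record.get("catalogNumber") or "").strip()
    if primary:
        values.add(primary)
    others = record.get("otherCatalogNumbers") or ""
    for part in others.split("|"):
        part = part.strip()
        if part:
            values.add(part)
    return values


def share_catalog_number(rec_a, rec_b):
    """True when the two records have at least one catalogue number in common."""
    return bool(catalog_numbers(rec_a) & catalog_numbers(rec_b))
```

With this, a record holding otherCatalogNumbers=a|b|c and another holding catalogNumber=a would be treated as sharing an identifier. Real data may use other delimiters, which this sketch does not handle.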
@teodorgeorgiev could you give @sofiemeeus full access to this repository
@teodorgeorgiev could you give @Daniel-Mietchen full access to this repository
@myrmoteras in your hackathon topic 8 you refer to "BFH", what is this?
How can I modify the content in topic 8?
I would like to add to possible target treatments the following covering all the Linnaeus plant types:
GBIF and the team behind the EMBL INSDC data have recently built an adapter that brings the EMBL datasets into GBIF.
The adapter is here resulting in these datasets.
These records are then run through the clustering which links 724k records to something (not necessarily a specimen record from a museum, but often BOLD or a taxonomic treatment citing a specimen).
Because GBIF/EMBL have already gone through the process of mapping the EMBL APIs into DwC and through clustering using the catalogue number amongst other fields, it may help form some basis for exploration in this topic. The implementations are certainly not perfect, and any observations or contributions would be very welcome.
You can read a little about the collaboration here
Can you please grant timrobertson100 committer or admin access?
If you grant me administrator access, I can then add the participants we're working with.
Thank you.
For the clustering, the triple ID (inst:coll:cat) is added as an identifier for records. In ENA, the voucher ID of several records is constructed from only the institution code and catalogue number (inst:cat), omitting the collection code. Adding this pattern to the array of IDs used for a specimen would increase the chances of matching an ENA record.
Datasets affected: https://www.gbif.org/dataset/d8cd16ba-bb74-4420-821e-083f2bac17c2
Example ENA record with inst:cat as identifiers: https://www.gbif.org/occurrence/3349806846
Related specimen: https://www.gbif.org/occurrence/3347309485
From https://www.insdc.org/documents/feature-table
specimen_voucher="[institution-code:[collection-code:]]specimen_id"
Currently (22 September 2021) there are at least 518,274 voucher IDs for sequences in ENA based on such a schema which could benefit from this improvement.
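The proposal above could be sketched roughly as follows. This is an illustration only, not the GBIF clustering implementation: the function name `voucher_ids` is hypothetical, and the three arguments are assumed to correspond to the institutionCode, collectionCode, and catalogNumber of a specimen record.

```python
def voucher_ids(institution_code, collection_code, catalog_number):
    """Build candidate voucher identifiers for one specimen record.

    Emits the triple form (inst:coll:cat) when a collection code is present,
    and always adds the shorter inst:cat form seen in some ENA vouchers.
    """
    inst = (institution_code or "").strip()
    coll = (collection_code or "").strip()
    cat = (catalog_number or "").strip()

    ids = []
    if inst and cat:
        if coll:
            ids.append(f"{inst}:{coll}:{cat}")
        ids.append(f"{inst}:{cat}")
    return ids
```

A specimen record would then carry both forms, so an ENA record whose specimen_voucher omits the collection code could still match. The shorter form is less specific, so it may also increase false matches between collections of the same institution.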
The current algorithm does not accommodate variation in recordedBy that includes multiple collectors. For example, recordedBy will not be considered as overlapping between a record containing recordedBy=Tim Robertson|Nicky Nicolson and another with Tim Robertson.
@nickynicolson has previous work that attempts to parse recordedBy into tokens accommodating variety in delimiters used (, | etc). This is in Python, so not easily portable to Java.
To determine if it is worth exploring this approach, we could create a new table that tokenises the recordedBy String into an array of names, and then add a SQL JOIN to create a new occurrence table containing this field (e.g. a tokenizedRecordedBy). The clustering could be modified to use this field in both the blocking and the compare stages, and a report of the impact generated.
If this identifies useful links, the best approach to incorporate this into the clustering could be explored.
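As a rough illustration of the tokenising step, the sketch below splits recordedBy into an array of names and tests two values for overlap. It is not @nickynicolson's parser, and the function names are hypothetical. The delimiter set (pipe, semicolon, "&", "and") is an assumption; the comma is deliberately not treated as a delimiter here because it is ambiguous in values such as "Robertson, Tim".

```python
import re

# Assumed delimiters: pipe, semicolon, " & " and " and " between collector names.
_DELIMITERS = re.compile(r"\s*[|;]\s*|\s+&\s+|\s+and\s+", re.IGNORECASE)


def tokenize_recorded_by(recorded_by):
    """Split a recordedBy string into a list of normalised collector names."""
    if not recorded_by:
        return []
    tokens = []
    for part in _DELIMITERS.split(recorded_by):
        # Collapse internal whitespace and ignore case for comparison.
        name = " ".join(part.split()).casefold()
        if name and name not in tokens:
            tokens.append(name)
    return tokens


def collectors_overlap(recorded_by_a, recorded_by_b):
    """True when the two recordedBy values share at least one collector name."""
    tokens_a = set(tokenize_recorded_by(recorded_by_a))
    tokens_b = set(tokenize_recorded_by(recorded_by_b))
    return bool(tokens_a & tokens_b)
```

Under these assumptions, `collectors_overlap("Tim Robertson|Nicky Nicolson", "Tim Robertson")` evaluates to true, which is the overlap the current comparison misses. Name variants such as initials or reversed order would still need a more capable parser.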