bicikl's People

Contributors

ajsaenz-lfw, ckotwn, daniel-mietchen, fnielsen, jholetschek, matdillen, mdmtrv, nickynicolson, pietrh, qgroom, rukayaj, sofiemeeus, timrobertson100

bicikl's Issues

access to modify topic 8

Topic 2: Review GBIF implementation

GBIF and the team behind the EMBL INSDC data have recently built an adapter that brings the EMBL datasets into GBIF.
The adapter is here, resulting in these datasets.

These records are then run through the clustering, which links 724k of them to at least one other record (not necessarily a specimen record from a museum, but often a BOLD record or a taxonomic treatment citing a specimen).

Because GBIF/EMBL have already gone through the process of mapping the EMBL APIs into DwC and through clustering using the catalogue number amongst other fields, it may help form some basis for exploration in this topic. The implementations are certainly not perfect, and any observations or contributions would be very welcome.

You can read a little about the collaboration here

Committer request

Can you please grant timrobertson100 committer or admin access?
If you grant me administrator access, I can then add the participants we're working with.

Thank you.

Topic 3: Improve clustering by adding an identifier with the schema instcode:catnumber

For the clustering, the triple ID (inst:coll:cat) is added as an identifier for records. In ENA, the voucher ID of several records is constructed from only the institution code and catalogue number (inst:cat), omitting the collection code. Adding this pattern to the array of IDs used for a specimen would increase the chances of matching an ENA record.

Datasets affected: https://www.gbif.org/dataset/d8cd16ba-bb74-4420-821e-083f2bac17c2
Example ENA record with inst:cat as identifiers: https://www.gbif.org/occurrence/3349806846
Related specimen: https://www.gbif.org/occurrence/3347309485

From https://www.insdc.org/documents/feature-table
specimen_voucher="[institution-code:[collection-code:]]specimen_id"

As of 22 September 2021, there are at least 518,274 voucher IDs for sequences in ENA that follow this schema and could benefit from this improvement.
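
To illustrate the proposal, here is a minimal Python sketch of generating both identifier patterns for a specimen record. The function name `candidate_identifiers` and the example values are hypothetical and are not taken from the GBIF clustering code; the sketch only shows the idea of emitting inst:cat alongside inst:coll:cat.

```python
def candidate_identifiers(institution_code, collection_code, catalogue_number):
    """Build the identifier strings to try when matching a specimen.

    Hypothetical helper, not part of the GBIF codebase. It emits the
    triple ID (inst:coll:cat) when all three parts are present, and the
    shorter inst:cat form whenever institution code and catalogue number
    are present.
    """
    inst, coll, cat = (
        (value or "").strip()
        for value in (institution_code, collection_code, catalogue_number)
    )

    identifiers = []
    if inst and coll and cat:
        identifiers.append(f"{inst}:{coll}:{cat}")
    if inst and cat:
        # Proposed addition: the pattern without the collection code.
        identifiers.append(f"{inst}:{cat}")
    return identifiers


# Example with made-up codes.
print(candidate_identifiers("INST", "COLL", "12345"))
```

With both strings in the array of IDs, a voucher written as either inst:coll:cat or inst:cat would have a chance of matching the same specimen.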

Topic 3: Explore tokenizing the recordedBy

The current algorithm does not accommodate variation in recordedBy that includes multiple collectors.
For example, recordedBy will not be considered as overlapping between a record containing recordedBy=Tim Robertson|Nicky Nicolson and another containing recordedBy=Tim Robertson.

@nickynicolson has previous work that attempts to parse recordedBy into tokens, accommodating the variety of delimiters used (",", "|", etc.). It is written in Python, so it is not easily portable to Java.

To determine if it is worth exploring this approach, we could create a new table that tokenises the recordedBy String into an array of names, and then add a SQL JOIN to create a new occurrence table containing this field (e.g. a tokenizedRecordedBy). The clustering could be modified to use this field in both the blocking and the compare stages, and a report of the impact generated.

If this identifies useful links, the best approach to incorporate this into the clustering could be explored.
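
As a rough illustration of the tokenizing idea, here is a small Python sketch. It is not @nickynicolson's parser and not the clustering implementation; the delimiter set, the function names, and the case-insensitive comparison are all assumptions made for the example.

```python
import re

# Assumed delimiter set: "|", ";", ",", "&" and the word "and".
# Splitting on "," will wrongly break names written as "Surname, Forename".
_DELIMITERS = re.compile(r"\s*[|;,&]\s*|\s+and\s+", flags=re.IGNORECASE)


def tokenize_recorded_by(recorded_by):
    """Split a recordedBy string into a list of normalised name tokens."""
    if not recorded_by:
        return []
    tokens = (token.strip() for token in _DELIMITERS.split(recorded_by))
    # casefold() makes the later comparison case-insensitive.
    return [token.casefold() for token in tokens if token]


def recorded_by_overlaps(first, second):
    """Return True if the two recordedBy values share at least one token."""
    return bool(set(tokenize_recorded_by(first)) & set(tokenize_recorded_by(second)))


print(tokenize_recorded_by("Tim Robertson|Nicky Nicolson"))
print(recorded_by_overlaps("Tim Robertson|Nicky Nicolson", "Tim Robertson"))
```

A token array like this could fill the proposed tokenizedRecordedBy field, with the overlap check standing in for the comparison used in the compare stage.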
