
operations's Introduction

operations

This repo is for issues and documentation related to coordinating identifier harmonization efforts in the biosciences. It covers problem documentation, social aspects, brainstorming, and ephemeral code.

operations's People

Contributors

graybeal, jmcmurry, micheldumontier, nataled

Forkers

graybeal nataled

operations's Issues

Review composite mapping of prefixes and namespaces

  • Ferret out any cross-registrations that are not an exact match on prefix but do match on other fields.
    • Assess the text-edit distance, or another easy-to-compute similarity measure, across all of the title fields.
    • Based on the above, for each prefix in the list, provide two pieces of data: 1) a score indicating how likely the prefix in one row is to be related to a prefix in another row (e.g. corresponding to the same dataset or to portions of that dataset, such as KEGG-disease vs. KEGG-protein); 2) the corresponding prefix to investigate.
    • Re-arrange the list so that related prefixes are clustered together for easier curation (?)
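
As a rough illustration of the scoring step above, the sketch below compares every row against every other row and reports the closest match with a score. The rows, field names, and function names are hypothetical, and `difflib.SequenceMatcher` is a stand-in similarity measure from the Python standard library, not a true edit distance.

```python
from difflib import SequenceMatcher

# Hypothetical rows; a real composite mapping would have more fields.
ROWS = [
    {"prefix": "kegg.disease", "title": "KEGG Disease"},
    {"prefix": "kegg.protein", "title": "KEGG Protein"},
    {"prefix": "hgnc.symbol", "title": "HGNC Symbol"},
]


def similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def row_score(row_a, row_b, fields=("prefix", "title")):
    """Use the best similarity found over the compared fields."""
    return max(similarity(row_a[f], row_b[f]) for f in fields)


def closest_prefixes(rows):
    """Map each prefix to (most similar other prefix, score)."""
    result = {}
    for row in rows:
        others = [r for r in rows if r is not row]
        best = max(others, key=lambda r: row_score(row, r))
        result[row["prefix"]] = (best["prefix"], row_score(row, best))
    return result


for prefix, (other, score) in closest_prefixes(ROWS).items():
    print(f"{prefix}\t{other}\t{score:.2f}")
```

Sorting the output by the matched prefix would be one simple way to place likely relatives next to each other for curation.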

Standardize the syntax of example URLs

E.g. with [example-id], as is done in GO, or with #, as is done elsewhere.
Unfortunately, we cannot simply always append the ID to the end of the URL, because sometimes more needs to be appended after it (e.g. .html).
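
A minimal sketch of why an explicit placeholder helps: the ID can be substituted anywhere in the template, including before a suffix. The `[example-id]` marker follows the GO convention mentioned above; the function name and the example.org URL are hypothetical.

```python
def expand_template(template, local_id, placeholder="[example-id]"):
    """Substitute the local ID wherever the placeholder appears."""
    if placeholder not in template:
        raise ValueError(f"template has no {placeholder} placeholder: {template}")
    return template.replace(placeholder, local_id)


# Hypothetical template where a suffix follows the ID.
print(expand_template("http://example.org/record/[example-id].html", "2674"))
```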

HGNC as use case of multiple identifier complexities

HGNC is an example collection with four co-occurring identifier complexities:

1. Ambiguity about what $id even is.

[Screenshot of the identifiers.org record for HGNC, 2016-04-22]

The identifiers.org record above captures the fact that HGNC records exist in third-party databases, but identifiers.org doesn't have a strong concept of a prefix. Consequently, it isn't possible to get to both "physical locations" of the entity using a single (equivalent) $id: in one case $id is prefixed, and in the other it is not. HGNC, mercifully, honors both forms. However:

  1. Other data providers may not be as forgiving as HGNC is.
  2. More often than not, variation in the local ID pattern is precisely what the data provider relies on in order to redirect to the right type-specific path.

A stronger notion of prefix is the simplest change that would help data integrators treat the following HTTP identifiers as equivalent, since 2674 is the invariant part of the ID.

Given the identifiers.org data model, there is no way to determine whether http://identifiers.org/hgnc/hgnc:2674 points to the same entity as http://identifiers.org/hgnc/2674. This is why I favor developing a bare-CURIE-based resolver like http://n2t.net/hgnc:2674 or, if identifiers.org is interested in doing so, http://identifiers.org/hgnc:2674.
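
To make the "invariant part" idea concrete, here is a small sketch that reduces the prefixed and unprefixed forms to a single bare CURIE. It assumes the optional prefix is `HGNC:` or `hgnc:` followed by one to five digits, as in the identifiers.org pattern for this namespace. The function name and the choice of a lowercase canonical prefix are illustrative assumptions.

```python
import re

# Optional "HGNC:" / "hgnc:" prefix followed by the invariant numeric part.
_HGNC_LOCAL_ID = re.compile(r"(?:(?:HGNC|hgnc):)?(\d{1,5})")


def to_bare_curie(identifier):
    """Return 'hgnc:<number>' for a local ID or for a URL ending in one."""
    local_id = identifier.rsplit("/", 1)[-1]  # keep only the last path segment
    match = _HGNC_LOCAL_ID.fullmatch(local_id)
    if match is None:
        raise ValueError(f"not a numeric HGNC identifier: {identifier}")
    return f"hgnc:{match.group(1)}"


for form in (
    "2674",
    "HGNC:2674",
    "http://identifiers.org/hgnc/hgnc:2674",
    "http://identifiers.org/hgnc/2674",
    "http://n2t.net/hgnc:2674",
):
    print(form, "->", to_bare_curie(form))
```

This only works because the numeric HGNC ID happens to be easy to separate from its prefix; it is not a general solution for providers whose local IDs carry type information.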

This would allow us to determine that all of these are talking about the same entity:

Authoritative sources:
Identifier resolvers:
Third-party content providers:

2. Multiple entity types (Genes and Gene families)

| Identifiers.org namespace | Regex | URI |
| --- | --- | --- |
| hgnc | ^((HGNC or hgnc):)?\d{1,5}$ | http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=$id [Example: 2674] |
| hgnc.family | ^[A-Z0-9-]+(#[A-Z0-9-]+)?$ | http://www.genenames.org/genefamilies/$id [Example: PADI] |
| hgnc.symbol | ^[A-Za-z-0-9_]+(@)?$ | http://www.genenames.org/cgi-bin/gene_symbol_report?match=$id [Example: DAPK1] |

3. Multiple identifier types (alphanumeric symbol and numeric ID)

4. Type-specific URL patterns combined with lack of deterministic typing in local ID

Consequently, you have to know what you're looking at before you can know where to resolve it. Note that the lack of deterministic typing in the local ID is not a problem unless you need type-specific URLs, as HGNC does.
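
The lack of deterministic typing can be checked directly against the three identifiers.org patterns quoted above. The sketch below assumes that "HGNC or hgnc" in the first pattern stands for the regex alternation `HGNC|hgnc`; the function name is hypothetical.

```python
import re

# Patterns as quoted above; "HGNC or hgnc" is assumed to mean (HGNC|hgnc).
PATTERNS = {
    "hgnc": re.compile(r"^((HGNC|hgnc):)?\d{1,5}$"),
    "hgnc.family": re.compile(r"^[A-Z0-9-]+(#[A-Z0-9-]+)?$"),
    "hgnc.symbol": re.compile(r"^[A-Za-z-0-9_]+(@)?$"),
}


def candidate_namespaces(local_id):
    """List every namespace whose pattern accepts the local ID."""
    return [ns for ns, pattern in PATTERNS.items() if pattern.match(local_id)]


for local_id in ("2674", "HGNC:2674", "PADI", "DAPK1"):
    print(local_id, candidate_namespaces(local_id))
```

Only the prefixed form `HGNC:2674` maps to a single namespace. The bare `2674` is accepted by all three patterns, and `PADI` and `DAPK1` are each accepted by both `hgnc.family` and `hgnc.symbol`, so the local ID alone does not determine which URL pattern to use.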


Sorry to bug you @KrisGray, you're listed on the HGNC GitHub; could you comment on whether there's a single URL that can be used across the types of IDs in HGNC (family, symbol, numeric ID), so that we can address at least number 4 on the list?

cc: @timclark, @jkunze
