mixs's Issues

Include units in the column header

Units are expected to be included in the same field as the values, according to the “Value Syntax” field of the MIxS standard. If this were actually followed, it would make processing of spreadsheets tricky. In the majority of records, this instruction is not followed, and no units are provided at all. I (Luke) would like to see units in the column header, e.g. 'latitude_deg', 'temperature_deg_c', 'phosphate_umol_per_l', 'salinity_psu'. This marries units to the datum so they persist across data processing, including in plots of values. It also avoids inappropriate combination of data from different studies reported with different units, a peril of meta-analysis. Pier points out that separate unit and datum fields would allow units to be controlled by the Unit Ontology. Both approaches could be implemented.
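
As a rough illustration of the header-based approach, here is a minimal Python sketch that pulls a unit out of each value and carries it in the column header instead. The function names, the value pattern, and the rule that a missing unit is taken to be the expected one are all assumptions made for this example, not part of MIxS.

```python
import re

# Assumed shape of a value: a number optionally followed by a free-text unit.
VALUE = re.compile(r"^\s*([-+]?\d+(?:\.\d+)?)\s*(\S.*?)?\s*$")


def split_value(raw):
    """Split '12.5 m' into (12.5, 'm'); the unit is None when absent."""
    match = VALUE.match(raw)
    if match is None:
        raise ValueError(f"not a numeric value: {raw!r}")
    number, unit = match.groups()
    return float(number), unit


def to_unit_column(name, raw_values, unit, suffix):
    """Return (header, numbers) with the unit carried by the header.

    Values that state a different unit are rejected rather than converted.
    Values with no unit are assumed (hypothetically) to be in `unit`.
    """
    numbers = []
    for raw in raw_values:
        number, found = split_value(raw)
        if found is not None and found != unit:
            raise ValueError(f"{name}: expected {unit!r}, found {found!r}")
        numbers.append(number)
    return f"{name}_{suffix}", numbers


header, values = to_unit_column("salinity", ["35 psu", "34.5"], "psu", "psu")
print(header, values)
```

Rejecting a mismatched unit, rather than silently converting it, is one way to guard against the meta-analysis problem described above. A separate unit field checked against the Unit Ontology could replace the plain string comparison.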

correction of description for subspecf_gen_lin

Minor correction of the description for subspecf_gen_lin (subspecific genetic lineage): "This should provide further information about the genetic distinctness of the sequenced organism by recording additional information e.g. serovar, serotype, biotype, ecotype, or any relevant genetic typing schemes like Group I plasmid. It can also contain alternative taxonomic information. It should contain both the lineage name, and the lineage rank, i.e. biovar:abc123". (This just removes the duplicate biovar and adds ecotype to the e.g. list.)

G10K and VGP terminology

As a community outreach exercise it might be worth making a comparison of GSC terminology to that used by the G10K project(s) which can be found here:
https://genome10k.soe.ucsc.edu/wp-content/uploads/2018/07/g10kSubmissionGuide_0.pdf
Once we have that we can approach them to suggest any additions we would like to see for compliance with GSC MIxS and hopefully encourage the consortium members to include the appropriate metadata.
In addition, the VGP are actively looking at ways to define quality of genome assemblies, which is something we have discussed including in the GSC specifications.

Renaming this organization

Hi @GenomicsStandardsConsortium/mixs-dev ,

we have to rename https://github.com/GenomicsStandardsConsortium
because it is simply wrong. The correct name of GSC is Genomic (without s) Standards Consortium.

I suggest we vote between:

  1. genomicstandardsconsortium
  2. genomic-standards-consortium
  3. gensc

Please add your vote as comments to this issue.

Once we have agreed on a new name, I will go ahead and rename the organization according to
https://help.github.com/articles/renaming-an-organization/

  • Finish voting
  • Rename this organization according to majority vote

two different definitions of 'host body product'

In the host-associated package, the definition for 'host body product' is:

Substance produced by the body, e.g. Stool, mucus, where the sample was obtained from. For foundational model of anatomy ontology (fma) or Uber-anatomy ontology (UBERON) terms, please see https://www.ebi.ac.uk/ols/ontologies/fma or https://www.ebi.ac.uk/ols/ontologies/uberon

In the other host packages it is:

Substance produced by the host's body, e.g. stool, mucus, where the sample was obtained from. For foundational model of anatomy ontology (fma) (v 4.11.0) or Uber-anatomy ontology (UBERON) (v releases/2014-06-15) terms, please see http://purl.bioontology.org/ontology/FMA or http://purl.bioontology.org/ontology/UBERON

The typographical differences make it difficult to compare the definitions programmatically.
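
To show the kind of comparison that is affected, here is a minimal Python sketch that normalizes case and whitespace before comparing two definitions, using only the standard library. The helper names are made up for this example, and the two strings are just the first sentences of the definitions quoted above.

```python
import difflib
import re


def normalize(text):
    """Lower-case and collapse whitespace so only wording differences remain."""
    return re.sub(r"\s+", " ", text).strip().lower()


def similarity(a, b):
    """Ratio in [0, 1]; 1.0 means identical after normalization."""
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()


host_associated = ("Substance produced by the body, e.g. Stool, mucus, "
                   "where the sample was obtained from.")
other_packages = ("Substance produced by the host's body, e.g. stool, mucus, "
                  "where the sample was obtained from.")

print(normalize(host_associated) == normalize(other_packages))
print(round(similarity(host_associated, other_packages), 2))
```

Normalization removes the "Stool" vs. "stool" difference, but "the body" vs. "the host's body" still differs, so an exact match fails and a similarity threshold would be needed to flag the two as the same term.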

Migration plan for annual season temp/precipitation

MIxS 4 has the terms annual_season_temp and annual_season_precpt.
In MIxS 5 these terms have been broken into the terms:
annual_temp / season_temp
annual_precpt / season_precpt

Is there a migration plan for how to move values represented using the MIxS 4 terms to the corresponding MIxS 5 terms?

Cleanup elevation vs. altitude

Altitude: altitude is a term used to identify heights of objects such as airplanes, space shuttles, rockets and atmospheric balloons, and heights of places such as atmospheric layers and clouds. It is used to measure the height of an object which is above the earth’s surface. In this context, the altitude measurement is the vertical distance between the Earth's surface above sea level and the sampled position in the air.
Elevation: elevation of the sampling site is its height above a fixed reference point, most commonly the mean sea level. Elevation is mainly used when referring to points on the Earth's surface, while altitude is used for points above the surface, such as an aircraft in flight or a spacecraft in orbit.

With these new definitions, altitude should only be present in packages air and misc environment. Other packages should only contain elevation, to avoid further confusion

Proposition to include "eventID" and "parentEvent" into MIxS.

I want to propose that the TDWG terms eventID and parentEventID be included in MIxS.

Background:
Samples in environmental and ecological studies (e.g. metagenomics of microbes) are often taken in a hierarchical experimental set-up. For example: when sequencing microbes along a depth profile of the water column in a lake, a sample hierarchy can look like this (from high to low level): scientific project > multiple lakes > multiple stations per lake > multiple depths per station. Another experimental approach that often occurs is the application of different sequencing techniques to one environmental sample (e.g. metagenome and metatranscriptome), or technical replicates are made for a single sample (e.g. sequencing a soil sample 3 times to assess the variability introduced by sampling and wet-lab procedures). In all these cases, there is a need to be able to group samples (that is: the events) at a higher level (parentEvents). Moreover, this would also help to make MIxS more interoperable with the DarwinCore EventCore format, which is necessary for multifaceted ecological and microbial studies that rely on both standards.

Proposed terms:
Label: eventID
Definition: (from TDWG http://rs.tdwg.org/dwc/terms/index.htm#eventID) An identifier for the set of information associated with an Event (something that occurs at a place and time). May be a global unique identifier or an identifier specific to the data set.

Label: parentEventID
Definition: (from http://rs.tdwg.org/dwc/terms/index.htm#parentEventID) An event identifier for the super event which is composed of one or more sub-sampling events. The value must refer to an existing eventID. If the identifier is local it must exist within the given dataset. May be a globally unique identifier or an identifier specific to the data set.
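
To make the "must refer to an existing eventID" rule concrete, here is a minimal Python sketch that checks a set of records for dangling or circular parent links. The record layout (plain dicts), the function name, and the example identifiers are assumptions made for this illustration. Only the two term names come from the proposal.

```python
def check_events(records):
    """Return a list of problems found in eventID / parentEventID links."""
    ids = {r["eventID"] for r in records}
    parents = {r["eventID"]: r.get("parentEventID") for r in records}
    problems = []

    # Every parentEventID must refer to an eventID in the same dataset.
    for event_id, parent in parents.items():
        if parent is not None and parent not in ids:
            problems.append(f"{event_id}: unknown parentEventID {parent}")

    # Walking up from any event must terminate, i.e. no circular chains.
    for event_id in parents:
        seen = set()
        current = event_id
        while current is not None and current in parents:
            if current in seen:
                problems.append(f"{event_id}: cycle in parent chain")
                break
            seen.add(current)
            current = parents[current]
    return problems


# Hypothetical hierarchy: project > lake > station > depth.
records = [
    {"eventID": "project1"},
    {"eventID": "lakeA", "parentEventID": "project1"},
    {"eventID": "lakeA_st1", "parentEventID": "lakeA"},
    {"eventID": "lakeA_st1_5m", "parentEventID": "lakeA_st1"},
]
print(check_events(records))
```

This sketch only covers identifiers that are local to one dataset. Globally unique identifiers that point outside the dataset would need a different lookup.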

IRIs (and URLs) for MIxS terms

Older MIxS terms have URLs provided by TDWG, such as https://terms.tdwg.org/wiki/mixs:alt_elev . These pages have links to GSC pages (e.g., http://gensc.org/ns/mixs/alt_elev) that don't currently resolve.

We need a more stable and maintainable solution to identifiers for MIxS terms.

One suggestion was to request PURLs via the OBO Foundry. This has a lot of benefits, but there were some concerns because MIxS is not an ontology. See the discussion at OBOFoundry/OBOFoundry.github.io#822.

Another suggestion is that we could host the URLs ourselves, of the form http://gensc.org/ns/mixs/alt_elev.

We can use this space to discuss.

sequencing machine type

OBI has agreed to add terms for any sequencing machines we need. We should supply them with a list.

Currently under MIxS seq method - Perhaps consider changing the term to “sequence machine type”?
Can the same “method” be run on many types of machines manufactured by different companies, or is it 1 machine = 1 method?

Seems like an algorithm is being equated with a machine. There are a number of companies that produce DNA sequencers. Do you include model number etc?

Also check SRA template

Choice of GPL

What is the reason for the strong copyleft license? It's not clear how this applies to non-software artefacts such as the ones in this repo. Are the excel files here considered source?

For non-code repos I usually use a CC-BY license or CC-0 waiver, and I would recommend one of these for mixs.

Decide on format for MIxS URIs - namespace

Go to this comment for solution: #233 (comment)

Although JSON does not strictly require term urls, much of what people need to do with mixs does (e.g., use mixs in linked data, use mixs terms in ontologies).

We discussed this topic at the CIG hackathon in Vienna in May.
Options include:

Use obo foundry purls
Make gensc purls
Make gensc URLs that are not purls (e.g., gensc.org/ns)
Keep namespace for terms in TDWG

Comment from @cmungall: Also https://w3id.org/

Comment from @lschriml: The last time we discussed this at the board level, there was a lot of support for:
Make gensc purls
--> Has this been discussed further on the CIG calls ?

Copy of GenomicsStandardsConsortium/mixs-ng#3

Sediment package changes

I am not sure how to interpret elevation in the context of the sediment env package (it's mandatory, so no way around it). Any useful hints?
Also, how come the total water depth is not available in that package?

Requirements for package-specific validation - can a sample be in multiple packages?

I am assuming one of the use cases for the rdf representation is a way to automatically validate samples (the exact mechanism, e.g. shex, derived json-schema etc., should be a matter for another ticket).

The first matter at hand is formally specifying the requirements for validation, in particular package specific validation:

  • what is the cardinality of the sample to package relationship? Is this 1? Or can a sample be described using multiple packages?
  • If a sample is always described by exactly one package, can fields from other packages still be mixed in?
  • If a union of packages is permitted, what are the rules for combining field properties such as mandatoriness?

An example use case: a study looking at the impact of heavy metal concentrations in soil on plant root ecosystems

Currently heavy_metals is in soil, so this would be the natural package to use

However, the study may also look at other factors such as impact of season/climate/taxon, these are all in the plant package.
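
To make the third question concrete, here is a minimal Python sketch of one candidate rule for a union of packages: when a field appears in several packages, the strictest requirement wins. This is purely illustrative and not an agreed MIxS rule. The strictness ordering and the requirement levels given to the example fields are assumptions. The source only says that heavy_metals is in soil and that season is in plant.

```python
# Assumed ordering, strictest first; this ranking is not defined by MIxS.
STRICTNESS = ["M", "C", "X", "-"]


def combine(packages):
    """Merge {field: requirement} dicts, keeping the strictest level per field."""
    merged = {}
    for fields in packages:
        for name, level in fields.items():
            current = merged.get(name, "-")
            merged[name] = min(current, level, key=STRICTNESS.index)
    return merged


# Hypothetical requirement levels for a soil + plant study.
soil = {"heavy_metals": "X", "elev": "M"}
plant = {"season": "X", "elev": "X"}
print(combine([soil, plant]))
```

Other rules are equally possible, for example letting a designated primary package override the others, which is why the cardinality question needs to be settled first.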

Definition for heavy_metals wrong

In MIxS 4 and MIxS 5 the definition for heavy_metals reads:

Heavy metals present and concentrationsany drug used by subject and the frequency of usage; can include multiple heavy metals and concentrations

Can someone fix the "concentrationsany" part?

Geo location usage guidance

The current description of the geo-location variable is

The geographical origin of the sample as defined by the country or sea name
followed by specific region name. Country or sea names should
be chosen from the INSDC country list (http://insdc.org/country.html),
or the GAZ ontology (v 1.512) (http://purl.bioontology.org/ontology/GAZ)

This is fine but... do we (GSC or DarwinCore) supply any advice/guidance on which location should be used for things that have been moved, e.g. plants originally collected in the wild but grown for many years in a botanical garden somewhere, zoo animals originally from the wild, or wild fish/coral now kept in aquariums?
For metagenome sequences I can see that the current location is appropriate, but for the genome of the sample should we use the original or the transplanted location?
Do we need a way to specify which has been given?

Include examples for MIxS terms

I have had a request from the NCBI BioSample team to include examples for each of the terms in the MIxS packages. We would like to use these for our documentation for submitters.

Evaluate CEDAR template datamodel for representing MIxS templates

Note this ticket is not about evaluating CEDAR tooling, but rather the abstract data model and associated JSON-LD / RDF / JSON Schema representation.

High level description here:

https://github.com/metadatacenter/cedar-docs/wiki/CEDAR-Template,-Element,-and-Field-Instances

e.g JSON-LD/RDF for a template with two fields here:

{
  "studyID": { "@value": "SDY2"  },
  "pi": { 
    "fullName": { "@value": "Dr. P.I." }, 
    "homePage": { "@id": "https://www.stanford.edu/people/DrPI.html" }, 
    "address": { "@value": "Stanford, CA 94305, USA" },
    "dob": { "@value": "1999-01-01" } 
  }
}

Note that fields can be constrained to be value sets from ontologies:

The paper is here: https://more.metadatacenter.org/sites/default/files/An%20Open%20Repository%20Model%20for%20Acquiring%20Knowledge%20about%20Scientific%20Experiments.pdf

The abstract data model is in fig 1:

image

More docs and guides here:

https://github.com/metadatacenter/cedar-docs/wiki/CEDAR-technical-documentation

MIxS 5 folder empty...

...is this intentional?

The README isn't really helpful - maybe a note about the status / location of MIxS 5?

Registering replicate samples

From ENA:

At ENA we have also been receiving queries on how to register/declare replicate samples of the following types.

Technical replicates: the same sample across multiple conditions, e.g. the same physical sample from the same person sequenced twice.
Since these are the same physical sample, ENA has been advising submitters to create one sample and two experiments, the first with X as the library_name and the second with X_2 as the library_name. A real example can be seen here:

  • https://www.ebi.ac.uk/ena/data/view/ERS808783

with ERX1056590 (library name: 12033) and ERX1056638 (library name: 12033_2).

Biological replicates: e.g. liver tumor from 5 different patients under the same set of conditions (e.g. treated, normal).
ENA has been advising that these be registered as separate samples, i.e. they get separate sample accessions.

We believe that in both cases, it may be useful to have a new attribute added to the existing MIxS standards to declare the replicate status. The attribute name could be ‘replicate status’ and it could have a controlled vocabulary:

  • technical replicate
  • biological replicate

Resolve duplicate terms in checklists

Many of the mixs checklists contain the same term. Sometimes those terms mean the same thing in different checklists, sometimes they don't. @wdduncan has compiled a list of where terms are duplicated at https://github.com/GenomicsStandardsConsortium/mixs-rdf/blob/master/notebooks/output/multi-package-mixs-terms-only.xlsx.

I am going to make a series of issues to discuss and resolve these duplications. Each issue will reference this one, so we can organize them.

altitude in mixs_v5 doesn't have MIXS ID

The altitude structured comment name (row 8 on the MIxS tab in mixs_v5.xlsx) doesn't have a MIXS ID assigned to it.

BTW who is managing IDs? Are they always unique? Do the IDs persist across MIxS versions?

Changes for environment terms in MIxS

Currently, MIxS requires three environment terms: biome, feature, and material. The recommendation is to use ENVO terms as the values.

At the C&I call on Jan 27, 2017, we discussed whether or not “biome” should be replaced by “environmental system” (now a top-level term in ENVO) and how to make it easier to use these terms.

With guidance from ENVO curator Pier Buttigieg, we concluded that the MIxS terms should stay as they are, but that we should provide better guidance on how to use them and allow multiple values. MIxS should refer users to the ENVO annotation guidelines.

New proposed definitions

environment (biome)
Current definition:
Biomes are defined based on factors such as plant structures, leaf types, plant spacing, and other factors like climate. Biome should be treated as the descriptor of the broad ecological context of a sample. Examples include: desert, taiga, deciduous woodland, or coral reef. EnvO (v 2013-06-14) terms can be found via the link: www.environmentontology.org/Browse-EnvO
Expected value: EnvO
Requirements (eu, ba, pl, vi, org, me, MIMARKS Survey, MIMARKS Specimen): M M M M M M M M
Value syntax: {term}

Proposed definition:
See http://www.environmentontology.org/annotation-guidelines. Include multiple biomes separated by a pipe, if appropriate. EnvO's biome class and its subclasses are intended to identify the ecosystem in which an entity of interest is embedded (i.e. the entity is a component of that system). In order for an ecosystem to qualify as a biome, ecological communities (or representatives thereof) resident in an ecosystem must have evolved adaptations to that ecosystem. Thus, biomes possess an evolutionarily consequential degree of temporal and spatial stability. Recommend subclasses of biome [ENVO:00000428].

Short definition:
Add terms that identify the ecosystem/biome from which the entity comes, multiple terms can be separated by pipes e.g. mangrove biome [ENVO:01000181] | estuarine biome [ENVO:01000020]. Recommend subclasses of biome [ENVO:00000428].

environment (feature)
Current definition:
Environmental feature level includes geographic environmental features. Compared to biome, feature is a descriptor of the more local environment. Examples include: harbor, cliff, or lake. EnvO (v 2013-06-14) terms can be found via the link: www.environmentontology.org/Browse-EnvO
Expected value: EnvO
Requirements (eu, ba, pl, vi, org, me, MIMARKS Survey, MIMARKS Specimen): M M M M M M M M
Value syntax: {term}

Proposed definition:
See http://www.environmentontology.org/annotation-guidelines. Include multiple features separated by a pipe, if appropriate. EnvO's environmental feature class and its subclasses are intended to identify environmental entities which have a strong, causal influence upon an entity of interest at the time of observation or sampling. For example, consider the observation of a camel watering at an oasis. While the camel is a component of a desert biome [ENVO:01000179], it is strongly influenced by a desert oasis [ENVO:00000156] during the observation. Other examples include natural features like cliff or lake, locations on the body like hand or colon, and man-made structures like building or car. Recommend subclasses of environmental feature [ENVO:00002297].

Short definition:
Add terms that identify environmental entities having causal influences upon the entity at time of sampling, multiple terms can be separated by pipes, e.g., shoreline [ENVO:00000486] | intertidal zone [ENVO:00000316]. Recommend subclasses of environmental feature [ENVO:00002297].

environment (material)
Current definition:
The environmental material level refers to the material that was displaced by the sample, or material in which a sample was embedded, prior to the sampling event. Environmental material terms are generally mass nouns. Examples include: air, soil, or water. EnvO (v 2013-06-14) terms can be found via the link: www.environmentontology.org/Browse-EnvO
Expected value: EnvO
Requirements (eu, ba, pl, vi, org, me, MIMARKS Survey, MIMARKS Specimen): M M M M M M M M
Value syntax: {term}

Proposed definition:
See http://www.environmentontology.org/annotation-guidelines. EnvO's environmental material class and its subclasses are intended to identify the medium or media present in an environment displaced by or in contact with a given entity. A pelagic fish swimming in the middle of the Atlantic Ocean would thus have ocean water [ENVO:00002151] as its environmental material. Similarly, an individual from the species Helicobacter pylori found in the human gastric mucosa could be annotated with mucus [ENVO:02000040] as its environmental material. Many entities will displace more than one environmental material and, ideally, all of these should be identified. Recommend subclasses of environmental material [ENVO:00010483].

Short definition:
Add terms that identify the thing (medium/media) displaced by the entity at time of sampling, multiple terms can be separated by pipes e.g. estuarine water [ENVO:01000301] | estuarine mud [ENVO:00002160]. Recommend subclasses of environmental material [ENVO:00010483].

NTR: access and benefit sharing permit

Label: access benefit sharing permit

Short name (ID): abs_certificate (or abs_cert)

Definition: Identifier that points to the signed Access and Benefit Sharing (ABS) agreement for a particular (set of) sample(s), for compliance with the Nagoya Protocol requirements. Recommended to use the ABS Clearing House unique identifier registered at https://absch.cbd.int/search/nationalRecords?schema=absPermit. For more information about ABS, see https://absch.cbd.int/help/about.

Background:
All samples collected after Oct 2014 in Nagoya signatory countries should have a signed access and benefit sharing (ABS) permit stating that they were collected in compliance with ABS agreements. The ABS clearing house (ABSCH) issues permanent identifiers, which can be searched by country or by reference at
https://absch.cbd.int/search/nationalRecords?schema=absPermit.

Adding this new term was discussed and agreed upon at the Compliance and Interoperability Group meetings. The term cannot be required, because not all countries have signed the treaty. As a non-signatory country, NCBI could not enforce it or validate it, but they could store it. Groups like GGBN could make their own checklists that do include ABSCH IDs.

No ENVO ID requested in spreadsheets

Currently, the spreadsheets that are downloadable from the gensc.org website do not clearly instruct users to supply the ID (in CURIE format), just the term label.

This is very risky as only the ID is authoritative. Can this be updated right away?

Users are submitting poor annotations right now.

A valid example would be
"air [ENVO:00002005]"

sampling agreement place holder

With the Nagoya Protocol being in place since Oct 2014, there are many samples being used for a variety of things (including sequencing) that should have an access and benefit sharing (ABS) agreement in place for that sample. While digital sequence information is mostly not included under the Nagoya Protocol (although some countries do include it, I believe), it would still be useful to be able to link these sample sequences back to the original ABS documents. Is there a place-holder term for the ID of that document that could be used? Something like "Sampling agreement"?
The value could be an agreement ID and country/government, or perhaps a link to the ABS clearing house record (if deposited) e.g.
https://absch.cbd.int/search/nationalRecords?schema=absPermit

Perhaps this is one for @jdeck88 or someone who knows about Darwin Core terms, as I suspect there is probably something already in DC, but I can't find it.

Create "support" package modules

CC @jzrapp (perhaps link to the Cryo spreadsheet you've developed?)

For parameter groups that are repeated across multiple environmental packages (e.g. sample logistics)

These support packages (or parts of them) would be sourced by the environmental packages that need them, perhaps driven by an import file that would pull in either single parameters or the whole package.

Copy of GenomicsStandardsConsortium/mixs-ng#9 originally filed by @pbuttigieg.

Check definition of tax_id in MISAG/MIMAG

I'm not sure I understand the tax_ID :
+tax_id,taxa ID,The phylogenetic marker(s) used to assign an organism name to the SAG or MAG,enumeration,[16S rRNA gene|multi-marker approach|other],,1,sequencing,C,C,C,C,C,-,-,-,M,M,C,50,

Is this meant to be a numerical value from NCBI taxonomy, or something else?

New checklists for positive and negative controls

Background:
Experimental controls have become very important in microbiome studies (metagenomics, 16S rRNA gene profiling), and in fact each study should include controls (at the minimum one to several DNA extraction controls).
There have been several cases in the microbiome field where researchers report the finding of certain organisms in certain scenarios that eventually turned out to originate from kit reagents/contamination in the lab.
Some journals indicate now specifically in the author guidelines, that the data for DNA extraction controls (and other experimental controls) need to be provided.
See e.g. https://microbiomejournal.biomedcentral.com/submission-guidelines/preparing-your-manuscript/research-article
The above journal specifies “These controls should be sequenced, and the sequence data reported in the paper and made available along with the sample sequence data in a public repository.”

An initial call was held on June 12, 2019 with members of the CIG, including representatives from NCBI and ENA. Three options were discussed:

  1. Add terms for controls to all existing checklists
  2. Create two new checklists/packages for positive and negative control types
  3. Add an attribute for positive control to existing checklists and have a separate checklist for negative controls

ENA prefers option 2. NCBI will need to discuss implementations internally before expressing a preference.

Regardless of which implementation is adopted, new terms will be needed. An initial list was drafted at the call and will be shared as a spreadsheet where people can make suggestions and comments.

definition update : sample material processing

Hi,
Currently the definition for the attribute "sample material processing" is

Any processing applied to the sample during or after retrieving the sample from environment. This field accepts OBI, for a browser of OBI (v 2013-10-25) terms please see http://purl.bioontology.org/ontology/OBI

Having just been looking through the OBI terms it seems like the ONLY appropriate term in OBI for this field would be lavage ! Perhaps the definition of the term can be changed to:

A brief description of any processing applied to the sample during or after retrieving the sample from environment, or a link to the relevant protocol(s) performed.

Documentations revision:

Over at MIxS-syntax-data-types, I suggest we update:

{term}: ontology term, consists of alphabetic characters

to

{term}: An ontology term (i.e. a class), identified by the class label and its unique ID in CURIE format (i.e. "[namespace:[numericCode]]"). For example:
soil biocrust [ENVO:01000910]
Multiple terms should be pipe separated:
anoxic water [ENVO:01000173]|eutrophic water [ENVO:00002224]
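
A value in this proposed syntax could be checked mechanically. Here is a minimal Python sketch, assuming the label is free text without brackets or pipes, the namespace is alphabetic, and the code is numeric. The regular expression and function name are illustrative and not part of any MIxS tooling.

```python
import re

# Assumed form of one term: "<label> [<NAMESPACE>:<digits>]".
TERM = re.compile(
    r"^(?P<label>[^\[\]|]+?)\s*\[(?P<prefix>[A-Za-z]+):(?P<code>\d+)\]$"
)


def parse_terms(value):
    """Split a pipe-separated value into (label, CURIE) pairs."""
    terms = []
    for part in value.split("|"):
        match = TERM.match(part.strip())
        if match is None:
            raise ValueError(f"not in 'label [PREFIX:digits]' form: {part!r}")
        terms.append((match["label"], f"{match['prefix']}:{match['code']}"))
    return terms


print(parse_terms("anoxic water [ENVO:01000173]|eutrophic water [ENVO:00002224]"))
```

A bare label such as "air" raises a ValueError here, which is the case the ID requirement is meant to catch. Checking that the label actually matches the ID would additionally need a lookup against the ontology.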

Guidelines for minting new IRIs for terms

We need to determine what conditions necessitate creating a new IRI for a mixs term.
This can be seen if you look at the difference between versions in the "value syntax" column for the mixs term "depth".

In version 4, the value syntax for depth is: {float} m (I assume this means the unit is meters)
But in version 5, value syntax for depth is: {float} {unit}

Does this mean a new IRI should be minted for depth in version 5?

cc @cmungall @ramonawalls @jdeck88 @folker @renzok

Differences between package specific spreadsheets and master spreadsheet for mixs_v5

I've noticed that the master spreadsheet and the package-specific spreadsheets are not always in sync. For example, if I open mixs_v5.xlsx and filter for the water package, and I open MIxSwater_20180621.xlsx, the MIxSwater_20180621.xlsx spreadsheet includes the lat_lon term, but the mixs_v5.xlsx spreadsheet does not.

Is this by design? Or was there a problem merging the package specific spreadsheets?
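
A check like this could be automated once the term names have been read out of both spreadsheets. Here is a minimal Python sketch of the comparison step only. Loading the .xlsx files is left out, and apart from lat_lon the term lists below are placeholders rather than the real contents of either file.

```python
def compare_terms(master_terms, package_terms):
    """Report terms present in one list but not the other."""
    master, package = set(master_terms), set(package_terms)
    return {
        "missing_from_master": sorted(package - master),
        "missing_from_package": sorted(master - package),
    }


# Placeholder term lists; only the lat_lon mismatch comes from the report above.
master_water = ["depth", "temp"]
package_water = ["lat_lon", "depth", "temp"]
print(compare_terms(master_water, package_water))
```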
