genomicsstandardsconsortium / mixs
“Minimum Information about any (X) Sequence” (MIxS) specification
Home Page: https://w3id.org/mixs
License: Creative Commons Zero v1.0 Universal
Units are expected to be included in the same field as the values, according to the “Value Syntax” field of the MIxS standard. If this were actually followed, it would make processing of spreadsheets tricky. In the majority of records, this instruction is not followed, and no units are provided at all. I (Luke) would like to see units in the column header, e.g. 'latitude_deg', 'temperature_deg_c', 'phosphate_umol_per_l', 'salinity_psu'. This marries units to the datum so they persist across data processing, including in plots of values. It also avoids inappropriate combination of data from different studies reported with different units, a peril of meta-analysis. Pier points out that separate unit and datum fields would allow units to be controlled by the Unit Ontology. Both approaches could be implemented.
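To make the two approaches concrete, here is a minimal Python sketch. It assumes a cell holds a plain decimal number optionally followed by a unit (roughly the `{float} {unit}` pattern); the function names `split_value_unit` and `header_with_unit` are hypothetical and not part of MIxS.

```python
import re

# Assumed cell shape: a plain decimal number, optional whitespace, then a free-text unit.
_VALUE_UNIT = re.compile(r"^\s*([-+]?\d+(?:\.\d+)?)\s*(.*?)\s*$")


def split_value_unit(raw):
    """Split a raw cell such as '12.5 umol/L' into (12.5, 'umol/L').

    Returns (value, None) when no unit is present and raises ValueError
    when the cell does not start with a number.
    """
    match = _VALUE_UNIT.match(raw)
    if match is None:
        raise ValueError(f"not a numeric value: {raw!r}")
    number, unit = match.groups()
    return float(number), (unit or None)


def header_with_unit(name, unit):
    """Build a column header that carries its unit, e.g. 'salinity_psu'."""
    suffix = re.sub(r"[^a-z0-9]+", "_", unit.lower()).strip("_")
    return f"{name}_{suffix}" if suffix else name
```

Splitting first and then choosing how to store the unit (in the header, or in a separate ontology-controlled field) would let both proposals coexist.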
Minor correction of the description for subspecf_gen_lin, subspecific genetic lineage: "This should provide further information about the genetic distinctness of the sequenced organism by recording additional information e.g. serovar, serotype, biotype, ecotype, or any relevant genetic typing schemes like Group I plasmid. It can also contain alternative taxonomic information. It should contain both the lineage name, and the lineage rank, i.e. biovar:abc123" (just to remove the duplicate biovar and add ecotype to the e.g. list).
As a community outreach exercise it might be worth making a comparison of GSC terminology to that used by the G10K project(s) which can be found here:
https://genome10k.soe.ucsc.edu/wp-content/uploads/2018/07/g10kSubmissionGuide_0.pdf
Once we have that we can approach them to suggest any additions we would like to see for compliance with GSC MIxS and hopefully encourage the consortium members to include the appropriate metadata.
In addition, the VGP are actively looking at ways to define quality of genome assemblies, which is something we have discussed including in the GSC specifications.
Hi @GenomicsStandardsConsortium/mixs-dev ,
we have to rename https://github.com/GenomicsStandardsConsortium
because it is simply wrong. The correct name of GSC is Genomic (without s) Standards Consortium.
I suggest to vote between:
Please add your vote as comments to this issue.
Once we have agreed on a new name, I will proceed with the rename according to
https://help.github.com/articles/renaming-an-organization/
In the host-associated package, the definition for 'host body product' is:
Substance produced by the body, e.g. Stool, mucus, where the sample was obtained from. For foundational model of anatomy ontology (fma) or Uber-anatomy ontology (UBERON) terms, please see https://www.ebi.ac.uk/ols/ontologies/fma or https://www.ebi.ac.uk/ols/ontologies/uberon
In the other host packages it is:
Substance produced by the host's body, e.g. stool, mucus, where the sample was obtained from. For foundational model of anatomy ontology (fma) (v 4.11.0) or Uber-anatomy ontology (UBERON) (v releases/2014-06-15) terms, please see http://purl.bioontology.org/ontology/FMA or http://purl.bioontology.org/ontology/UBERON
The typographical differences make it difficult to compare programmatically.
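Until the definitions are harmonized, one workaround is to normalize the text before comparing. The following is a rough sketch using only the Python standard library; the normalization rules (lowercasing, dropping URLs and "(v ...)" version notes, collapsing punctuation) are assumptions chosen for this example, not an agreed procedure.

```python
import re
from difflib import SequenceMatcher


def normalize(definition):
    """Reduce a definition to a form that ignores cosmetic differences."""
    text = definition.lower()
    text = re.sub(r"https?://\S+", "", text)   # drop URLs
    text = re.sub(r"\(v [^)]*\)", "", text)    # drop "(v 4.11.0)"-style version notes
    text = re.sub(r"[^a-z0-9]+", " ", text)    # punctuation and whitespace runs -> one space
    return text.strip()


def similarity(a, b):
    """Return a ratio between 0 and 1 comparing two normalized definitions."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()
```

A ratio close to 1 suggests two package definitions differ only cosmetically; lower values flag pairs worth reviewing by hand.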
use skos:inScheme to say which part of the template it is (environment, sequencing, study, ...)
see:
definition: Relates a resource (for example a concept) to a concept scheme in which it is included.
MIxS 4 has the terms annual_season_temp and annual_season_precpt.
In MIxS 5 these terms have been broken into the terms annual_temp / season_temp and annual_precpt / season_precpt.
Is there a migration plan for how to move values represented using the MIxS 4 terms to the corresponding MIxS 5 terms?
Altitude: altitude is a term used to identify heights of objects such as airplanes, space shuttles, rockets, and atmospheric balloons, and heights of places such as atmospheric layers and clouds. It is used to measure the height of an object which is above the Earth's surface. In this context, the altitude measurement is the vertical distance between the Earth's surface above sea level and the sampled position in the air.
Elevation: elevation of the sampling site is its height above a fixed reference point, most commonly the mean sea level. Elevation is mainly used when referring to points on the Earth's surface, while altitude is used for points above the surface, such as an aircraft in flight or a spacecraft in orbit.
With these new definitions, altitude should only be present in the air and misc environment packages. Other packages should only contain elevation, to avoid further confusion.
Is there opportunity to work with the FAANG project?
https://www.faang.org/groups?ticket=&name=metadata&email=
Their sample metadata checklist appears to largely overlap ours anyway, so it might be an opportunity to get more exposure and alignment between more standards.
I notice @cmungall is listed on the list of participants for the Metadata and Data Sharing page.
I want to propose that the TDWG terms eventID and parentEventID be included in MIxS.
Background:
Samples in environmental and ecological studies (e.g. metagenomics of microbes) are often taken in a hierarchical experimental set-up. For example, when sequencing microbes along a depth profile of the water column in a lake, a sample hierarchy can look like this (from high to low level): scientific project > multiple lakes > multiple stations per lake > multiple depths per station. Another experimental approach that often occurs is the application of different sequencing techniques to one environmental sample (e.g. metagenome and metatranscriptome), or technical replicates are made for a single sample (e.g. sequencing a soil sample 3 times to assess the variability introduced by sampling and wet-lab procedures). In all these cases, there is a need to be able to group samples (that is, the events) at a higher level (parentEvents). Moreover, this would also help to make MIxS more interoperable with the DarwinCore EventCore format, which is necessary for multifaceted ecological and microbial studies that rely on both standards.
Proposed terms:
Label: eventID
Definition: (from TDWG http://rs.tdwg.org/dwc/terms/index.htm#eventID) An identifier for the set of information associated with an Event (something that occurs at a place and time). May be a global unique identifier or an identifier specific to the data set.
Label: parentEventID
Definition: (from http://rs.tdwg.org/dwc/terms/index.htm#parentEventID) An event identifier for the super event which is composed of one or more sub-sampling events. The value must refer to an existing eventID. If the identifier is local it must exist within the given dataset. May be a globally unique identifier or an identifier specific to the data set.
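The rule that a parentEventID must refer to an existing eventID can be checked mechanically. Below is a minimal Python sketch, assuming each sample record is a dict with `eventID` and `parentEventID` keys; the record layout, the function names, and the identifiers (which follow the lake example above) are all hypothetical.

```python
def check_parent_events(records):
    """Return the eventIDs whose parentEventID does not refer to a known eventID."""
    known = {rec["eventID"] for rec in records}
    return [
        rec["eventID"]
        for rec in records
        if rec.get("parentEventID") is not None
        and rec["parentEventID"] not in known
    ]


def ancestors(records, event_id):
    """Walk up the hierarchy from event_id, returning parents from low to high level."""
    parent_of = {rec["eventID"]: rec.get("parentEventID") for rec in records}
    chain, seen = [], {event_id}
    current = parent_of.get(event_id)
    while current is not None and current not in seen:  # 'seen' guards against cycles
        chain.append(current)
        seen.add(current)
        current = parent_of.get(current)
    return chain


# Hypothetical identifiers: project > lake > station > depth.
records = [
    {"eventID": "project1", "parentEventID": None},
    {"eventID": "lakeA", "parentEventID": "project1"},
    {"eventID": "lakeA:st1", "parentEventID": "lakeA"},
    {"eventID": "lakeA:st1:5m", "parentEventID": "lakeA:st1"},
]
```

With these records, `ancestors(records, "lakeA:st1:5m")` walks from the depth sample up to the project, which is the grouping the proposal is after.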
CC @jzrapp
For example, "porosity" applies to more than just soil (sediment, sea ice). We should relabel or remove overspecified parameters on detection.
Copy of GenomicsStandardsConsortium/mixs-ng#8 originally filed by @pbuttigieg
Hello the Bioscales project is in need of the term:
Soil CO2 flux
Older MIxS terms have URLs provided by TDWG, such as https://terms.tdwg.org/wiki/mixs:alt_elev . These pages have links to GSC pages (e.g., http://gensc.org/ns/mixs/alt_elev) that don't currently resolve.
We need a more stable and maintainable solution to identifiers for MIxS terms.
One suggestion was to request PURLs via the OBO Foundry. This has a lot of benefits, but there were some concerns because MIxS is not an ontology. See the discussion at OBOFoundry/OBOFoundry.github.io#822.
Another suggestion is that we could host the URLs ourselves, of the form http://gensc.org/ns/mixs/alt_elev.
We can use this space to discuss.
Taking notes from GSC-CIG, I asked for a list of mailing lists, as this is not on the site; @ramonawalls suggested filing a ticket here.
I don't see a way to get the Dockerfile that is used to build the image up on dockerhub, which would limit any extensions to the validation framework
OBI has agreed to add terms for any sequencing machines we need. We should supply them with a list.
Currently under MIxS seq method - Perhaps consider changing the term to “sequence machine type”?
Can the same “method” be run on many types of machines manufactured by different companies, or is it 1 machine = 1 method?
Seems like an algorithm is being equated with a machine. There are a number of companies that produce DNA sequencers. Do you include model number etc?
Also check SRA template
What is the reason for the strong copyleft license? It's not clear how this applies to non-software artefacts such as the ones in this repo. Are the excel files here considered source?
For non-code repos I usually use a CC-BY license or CC-0 waiver; I would recommend this for mixs.
E.g. tot_org_carb in the soil package should be mapped to http://purl.obolibrary.org/obo/ENVO_09000008
Should this be specified outside, e.g. in SSSOM format, as separate mapping files?
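For illustration, a mapping like this could be kept as a row in a tab-separated file using SSSOM's core column names (`subject_id`, `predicate_id`, `object_id`, `mapping_justification`). In the Python sketch below, only the ENVO identifier comes from this issue: the `MIXS:tot_org_carb` CURIE, the `skos:exactMatch` predicate, and the justification value are assumptions made for the example. A full SSSOM file would also carry a metadata header, which is omitted here.

```python
import csv
import io

SSSOM_COLUMNS = ["subject_id", "predicate_id", "object_id", "mapping_justification"]


def mappings_to_tsv(mappings):
    """Serialize mapping dicts to a tab-separated string with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=SSSOM_COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(mappings)
    return buffer.getvalue()


mappings = [
    {
        "subject_id": "MIXS:tot_org_carb",      # assumed CURIE for the MIxS term
        "predicate_id": "skos:exactMatch",      # assumed; the right predicate is undecided
        "object_id": "ENVO:09000008",           # from the issue text
        "mapping_justification": "semapv:ManualMappingCuration",  # assumed
    }
]
```

Keeping mappings in separate files like this would let them be versioned and reviewed independently of the checklists themselves.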
Hello the Bioscales project is in need of the term:
Soil denitrification potential
Go to this comment for solution: #233 (comment)
Although JSON does not strictly require term urls, much of what people need to do with mixs does (e.g., use mixs in linked data, use mixs terms in ontologies).
We discussed this topic at the CIG hackathon in Vienna in May.
Options include:
Use obo foundry purls
Make gensc purls
Make gensc URLs that are not purls (e.g., gensc.org/ns)
Keep namespace for terms in TDWG
Comment from @cmungall: Also https://w3id.org/
Comment from @lschriml: The last time we discussed this at the board level, there was a lot of support for:
Make gensc purls
--> Has this been discussed further on the CIG calls ?
I am not sure how to interpret elevation in the context of the sediment env package (it's mandatory, so no way around it). Any useful hints?
Also, how come the total water depth is not available in that package?
I am assuming one of the use cases for the RDF representation is a way to automatically validate samples (the exact mechanism, e.g. ShEx, derived JSON Schema, etc., should be a matter for another ticket).
The first matter at hand is formally specifying the requirements for validation, in particular package specific validation:
An example use case: a study looking at the impact of heavy metal concentrations in soil on plant root ecosystems
Currently heavy_metals is in soil, so this would be the natural package to use
However, the study may also look at other factors such as impact of season/climate/taxon, these are all in the plant package.
In MIxS 4 and MIxS 5 the definition for heavy_metals reads:
Heavy metals present and concentrationsany drug used by subject and the frequency of usage; can include multiple heavy metals and concentrations
Can someone fix the "concentrationsany" part?
The current description of the geo-location variable is
The geographical origin of the sample as defined by the country or sea name
followed by specific region name. Country or sea names should
be chosen from the INSDC country list (http://insdc.org/country.html),
or the GAZ ontology (v 1.512) (http://purl.bioontology.org/ontology/GAZ)
This is fine, but do we (GSC or DarwinCore) supply any advice/guidance on which location should be used for things that have been moved, e.g. plants originally collected in the wild but grown for many years in a botanical garden somewhere, zoo animals originally from the wild, or wild fish/coral now kept in aquariums?
For metagenome sequences I can see that the current location is appropriate, but for the genome of the sample should we use the original or the transplanted location?
Do we need a way to specify which has been given?
At least until we are finished with milestone mixs-json-schema-v1
I have had a request from the NCBI BioSample team to include examples for each of the terms in the MIxS packages. We would like to use these for our documentation for submitters.
Note this ticket is not about evaluating CEDAR tooling, but rather the abstract data model and associated JSON-LD / RDF / JSON Schema representation.
High level description here:
https://github.com/metadatacenter/cedar-docs/wiki/CEDAR-Template,-Element,-and-Field-Instances
e.g. JSON-LD/RDF for a template with two fields here:
{
  "studyID": { "@value": "SDY2" },
  "pi": {
    "fullName": { "@value": "Dr. P.I." },
    "homePage": { "@id": "https://www.stanford.edu/people/DrPI.html" },
    "address": { "@value": "Stanford, CA 94305, USA" },
    "dob": { "@value": "1999-01-01" }
  }
}
Note that fields can be constrained to be value sets from ontologies:
The paper is here: https://more.metadatacenter.org/sites/default/files/An%20Open%20Repository%20Model%20for%20Acquiring%20Knowledge%20about%20Scientific%20Experiments.pdf
The abstract data model is in fig 1:
More docs and guides here:
https://github.com/metadatacenter/cedar-docs/wiki/CEDAR-technical-documentation
...is this intentional?
The README isn't really helpful - maybe a note about the status / location of MIxS 5?
From ENA:
At ENA we have also been receiving queries on how to register/declare replicate samples of the following types:
Technical replicates: same sample across multiple conditions, e.g. same physical sample from the same person sequenced twice
Since these are the same physical sample, ENA have been advising submitters to create one sample and two experiments, one pointing to X in the library_name of the first experiment and the other pointing to X_2 in the library_name of the second experiment. A real example can be seen here:
https://www.ebi.ac.uk/ena/data/view/ERS808783
with ERX1056590 (library name: 12033) and ERX1056638 (library name: 12033_2)
Biological replicates: liver tumor from 5 different patients under the same set of conditions (e.g. treated, normal)
ENA have been advising these to be registered as separate samples, i.e. these get separate samples accessions.
We believe that in both cases, it may be useful to have a new attribute added to existing MIxS standards to declare the replicate status. The attribute name could be ‘replicate status’ and it could have a controlled vocabulary:
o technical replicate
o biological replicate
Many of the mixs checklists contain the same term. Sometimes those terms mean the same thing in different checklists, sometimes they don't. @wdduncan has compiled a list of where terms are duplicated at https://github.com/GenomicsStandardsConsortium/mixs-rdf/blob/master/notebooks/output/multi-package-mixs-terms-only.xlsx.
I am going to make a series of issues to discuss and resolve these duplications. Each issue will reference this one, so we can organize them.
The altitude structured comment name (row 8 on the MIxS tab in mixs_v5.xlsx) doesn't have a MIXS ID assigned to it.
BTW who is managing IDs? Are they always unique? Do the IDs persist across MIxS versions?
Cross-link to #18 and EnvironmentOntology/envo#805 EnvironmentOntology/envo#804 EnvironmentOntology/envo#803 EnvironmentOntology/envo#802 EnvironmentOntology/envo#807
Over at ENVO, we are getting more requests on how to annotate in a MIxS compliant way. As MIxS 5 will change things quite a bit, we'd like to know when to expect its release so we can inform our users.
Do we have a planned release date for MIxS 5?
Currently, MIxS requires three environment terms: biome, feature, and material. The recommendation is to use ENVO terms as the values.
At the C&I call on Jan 27, 2017, we discussed whether or not “biome” should be replaced by “environmental system” (now a top-level term in ENVO) and how to make it easier to use these terms.
With guidance from ENVO curator Pier Buttigieg, we concluded that the MIxS terms should stay as they are, but to provide better guidance on how to use them, and to allow multiple values. MIxS should refer users to the ENVO annotation guidelines.
New proposed definitions
environment (biome)
Current definition:
Biomes are defined based on factors such as plant structures, leaf types, plant spacing, and other factors like climate. Biome should be treated as the descriptor of the broad ecological context of a sample. Examples include: desert, taiga, deciduous woodland, or coral reef. EnvO (v 2013-06-14) terms can be found via the link: www.environmentontology.org/Browse-EnvO
Expected value: EnvO
Requirements (eu, ba, pl, vi, org, me, MIMARKS Survey, MIMARKS Specimen): M M M M M M M M
Value syntax: {term}
Proposed definition:
See http://www.environmentontology.org/annotation-guidelines. Include multiple biomes separated by a pipe, if appropriate. EnvO's biome class and its subclasses are intended to identify the ecosystem in which an entity of interest is embedded (i.e. the entity is a component of that system). In order for an ecosystem to qualify as a biome, ecological communities (or representatives thereof) resident in an ecosystem must have evolved adaptations to that ecosystem. Thus, biomes possess an evolutionarily consequential degree of temporal and spatial stability. Recommend subclasses of biome [ENVO:00000428].
Short definition:
Add terms that identify the ecosystem/biome from which the entity comes, multiple terms can be separated by pipes e.g. mangrove biome [ENVO:01000181] | estuarine biome [ENVO:01000020]. Recommend subclasses of biome [ENVO:00000428].
environment (feature)
Current definition:
Environmental feature level includes geographic environmental features. Compared to biome, feature is a descriptor of the more local environment. Examples include: harbor, cliff, or lake. EnvO (v 2013-06-14) terms can be found via the link: www.environmentontology.org/Browse-EnvO
Expected value: EnvO
Requirements (eu, ba, pl, vi, org, me, MIMARKS Survey, MIMARKS Specimen): M M M M M M M M
Value syntax: {term}
Proposed definition:
See http://www.environmentontology.org/annotation-guidelines. Include multiple features separated by a pipe, if appropriate. EnvO's environmental feature class and its subclasses are intended to identify environmental entities which have a strong, causal influence upon an entity of interest at the time of observation or sampling. For example, consider the observation of a camel watering at an oasis. While the camel is a component of a desert biome [ENVO:01000179], it is strongly influenced by a desert oasis [ENVO:00000156] during the observation. Other examples include natural features like cliff or lake, locations on the body like hand or colon, and man-made structures like building or car. Recommend subclasses of environmental feature [ENVO:00002297].
Short definition:
Add terms that identify environmental entities having causal influences upon the entity at time of sampling, multiple terms can be separated by pipes, e.g., shoreline [ENVO:00000486] | intertidal zone [ENVO:00000316]. Recommend subclasses of environmental feature [ENVO:00002297].
environment (material)
Current definition:
The environmental material level refers to the material that was displaced by the sample, or material in which a sample was embedded, prior to the sampling event. Environmental material terms are generally mass nouns. Examples include: air, soil, or water. EnvO (v 2013-06-14) terms can be found via the link: www.environmentontology.org/Browse-EnvO
Expected value: EnvO
Requirements (eu, ba, pl, vi, org, me, MIMARKS Survey, MIMARKS Specimen): M M M M M M M M
Value syntax: {term}
Proposed definition:
See http://www.environmentontology.org/annotation-guidelines. EnvO's environmental material class and its subclasses are intended to identify the medium or media present in an environment displaced by or in contact with a given entity. A pelagic fish swimming in the middle of the Atlantic Ocean would thus have ocean water [ENVO:00002151] as its environmental material. Similarly, an individual from the species Helicobacter pylori found in the human gastric mucosa could be annotated with mucus [ENVO:02000040] as its environmental material. Many entities will displace more than one environmental material and, ideally, all of these should be identified. Recommend subclasses of environmental material [ENVO:00010483].
Short definition:
Add terms that identify the thing (medium/media) displaced by the entity at time of sampling, multiple terms can be separated by pipes e.g. estuarine water [ENVO:01000301] | estuarine mud [ENVO:00002160]. Recommend subclasses of environmental material [ENVO:00010483].
EBI have a JSON serialization for templates that is complete with JSON Schema for validation:
Label: access benefit sharing permit
Short name (ID): abs_certificate (or abs_cert)
Definition: Identifier that points to the signed Access and Benefit Sharing (ABS) agreement for a particular (set of) sample(s), for compliance with the Nagoya Protocol requirements. Recommended to use the ABS Clearing House unique identifier registered at https://absch.cbd.int/search/nationalRecords?schema=absPermit. For more information about ABS, see https://absch.cbd.int/help/about.
Background:
All samples collected after Oct 2014 in Nagoya signatory countries should have a signed access and benefit sharing (ABS) permit stating that they were collected in compliance with ABS agreements. The ABS clearing house (ABSCH) issues permanent identifiers, which can be searched by country or by reference at
https://absch.cbd.int/search/nationalRecords?schema=absPermit.
Adding this new term was discussed and agreed upon at the Compliance and Interoperability Group meetings. The term cannot be required, because not all countries have signed the treaty. As a non-signatory country, NCBI could not enforce it or validate it, but they could store it. Groups like GGBN could make their own checklists that do include ABSCH IDs.
As suggested by the Compliance and Interoperability Group (CIG) and approved by the GSC board in January 2020, the MIxS standards will now be open source, with the CC0 open source agreement (https://creativecommons.org/share-your-work/public-domain/cc0/). I am making the changes now, and adding information on how to cite.
Hello the Bioscales project is in need of the term:
Soil CH4 flux
Currently, the spreadsheets that are downloadable from the gensc.org website do not clearly instruct users to supply the ID (in CURIE format), just the term label.
This is very risky as only the ID is authoritative. Can this be updated right away?
Users are submitting poor annotations right now.
A valid example would be
"air [ENVO:00002005]"
With the Nagoya Protocol in place since Oct 2014, there are many samples being used for a variety of things (including sequencing) that should have an access and benefit sharing (ABS) agreement in place for that sample. While digital sequence information is mostly not included under the Nagoya Protocol (although some countries do include it, I believe), it would still be useful to be able to link these sample sequences back to the original ABS documents. Is there a placeholder term for the ID of that document that could be used? Something like "Sampling agreement"?
The value could be an agreement ID and country/government, or perhaps a link to the ABS clearing house record (if deposited) e.g.
https://absch.cbd.int/search/nationalRecords?schema=absPermit
Perhaps this is one for @jdeck88 or someone who knows about Darwin Core terms, as I suspect there is probably something already in DC, but I can't find it.
CC @jzrapp (perhaps link to the Cryo spreadsheet you've developed?)
For parameter groups that are repeated across multiple environmental packages (e.g. sample logistics)
These support packages (or parts of them) would be sourced by the environmental packages that need them, perhaps driven by an import file that would pull in either single parameters or the whole package.
Copy of GenomicsStandardsConsortium/mixs-ng#9 originally filed by @pbuttigieg.
I'm not sure I understand the tax_ID :
+tax_id,taxa ID,The phylogenetic marker(s) used to assign an organism name to the SAG or MAG,enumeration,[16S rRNA gene|multi-marker approach|other],,1,sequencing,C,C,C,C,C,-,-,-,M,M,C,50,
Is this meant to be a numerical value from NCBI taxonomy, or something else?
Background:
Experimental controls have become very important in microbiome studies (metagenomics, 16S rRNA gene profiling), and in fact each study should include controls (at the minimum one to several DNA extraction controls).
There have been several cases in the microbiome field where researchers report the finding of certain organisms in certain scenarios that eventually turned out to originate from kit reagents/contamination in the lab.
Some journals indicate now specifically in the author guidelines, that the data for DNA extraction controls (and other experimental controls) need to be provided.
See e.g. https://microbiomejournal.biomedcentral.com/submission-guidelines/preparing-your-manuscript/research-article
The above journal specifies “These controls should be sequenced, and the sequence data reported in the paper and made available along with the sample sequence data in a public repository.”
An initial call was held on June 12, 2019 with members of the CIG, including representatives from NCBI and ENA. Three options were discussed:
ENA prefers option 2. NCBI will need to discuss implementations internally before expressing a preference.
Regardless of which implementation is adopted, new terms will be needed. An initial list was drafted at the call and will be shared as a spreadsheet where people can make suggestions and comments.
Hi,
Currently the definition for the attribute "sample material processing" is
Any processing applied to the sample during or after retrieving the sample from environment. This field accepts OBI, for a browser of OBI (v 2013-10-25) terms please see http://purl.bioontology.org/ontology/OBI
Having just looked through the OBI terms, it seems like the ONLY appropriate term in OBI for this field would be lavage! Perhaps the definition of the term can be changed to:
A brief description of any processing applied to the sample during or after retrieving the sample from environment, or a link to the relevant protocol(s) performed.
I have seen some older efforts attempting some mappings but nothing official or blessed.
Over at MIxS-syntax-data-types, I suggest we update:
{term}: ontology term, consists of alphabetic characters
to
{term}: An ontology term (i.e. a class), identified by the class label and its unique ID in CURIE format (i.e. "[namespace:numericCode]"). For example:
soil biocrust [ENVO:01000910]
Multiple terms should be pipe separated:
anoxic water [ENVO:01000173]|eutrophic water [ENVO:00002224]
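A value written in this syntax can be checked with a small parser. This Python sketch assumes the CURIE prefix consists of letters and the local ID of digits, as in the ENVO examples above; the function name `parse_terms` is hypothetical.

```python
import re

# Assumed shape of one term: "<label> [<PREFIX>:<digits>]"
_TERM = re.compile(
    r"^\s*(?P<label>.+?)\s*\[(?P<prefix>[A-Za-z]+):(?P<local_id>\d+)\]\s*$"
)


def parse_terms(value):
    """Parse a pipe-separated {term} value into (label, CURIE) pairs.

    Raises ValueError if any part lacks a bracketed CURIE.
    """
    parsed = []
    for part in value.split("|"):
        match = _TERM.match(part)
        if match is None:
            raise ValueError(f"missing or malformed CURIE in {part.strip()!r}")
        parsed.append((match["label"], f"{match['prefix']}:{match['local_id']}"))
    return parsed
```

Rejecting label-only values such as `air` at submission time would catch the annotations where only the non-authoritative label was supplied.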
The URL for this organization is https://github.com/GenomicsStandardsConsortium, but it should be https://github.com/GenomicStandardsConsortium (Genomic not Genomics). I don't have the ability to change it. Can someone either give me access to change it or make the change?
Add hydrocarbon resources-cores/swabs to CV
We need to determine what conditions necessitate creating a new IRI for a mixs term.
This can be seen by looking at the difference between versions in the "value syntax" column for the mixs term "depth".
In version 4, the value syntax for depth is: {float} m (I assume this means the unit is meters)
But in version 5, value syntax for depth is: {float} {unit}
Does this mean a new IRI should be minted for depth in version 5?
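Whatever policy is chosen, changes like this could at least be detected automatically. Here is a minimal Python sketch, assuming each version is available as a dict mapping term names to their value syntax; only the two `depth` entries come from this issue, and the function name is hypothetical.

```python
def changed_value_syntax(old, new):
    """Return {term: (old_syntax, new_syntax)} for shared terms whose syntax differs."""
    return {
        term: (old[term], new[term])
        for term in sorted(old.keys() & new.keys())
        if old[term] != new[term]
    }


# Value syntax for "depth" as quoted above.
v4 = {"depth": "{float} m"}
v5 = {"depth": "{float} {unit}"}
```

The result only lists candidates; whether a given change warrants a new IRI remains the policy question raised here.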
I've noticed that the master spreadsheet and the package-specific spreadsheets are not always in sync. For example, if I open mixs_v5.xlsx and filter for the water package, and I open MIxSwater_20180621.xlsx, the MIxSwater_20180621.xlsx spreadsheet includes the lat_lon term, but the mixs_v5.xlsx spreadsheet does not.
Is this by design? Or was there a problem merging the package specific spreadsheets?