airr-community / airr-standards

AIRR Community Data Standards

Home Page: https://docs.airr-community.org

License: Creative Commons Attribution 4.0 International

Python 78.42% R 20.22% Dockerfile 0.42% TeX 0.94%

standard adaptive immune receptors

airr-standards's Issues

NCBI XLS BioSample consistency with tsv file

The Travis test does not check for consistency with the NCBI files. It seems there is quite a lot missing in those files compared with the tsv file, which I am guessing is by design? It would still be worthwhile to put in an automated test to make sure that if something is added, it doesn't also need to be reflected in the NCBI templates. This might mean annotating each field with whether or not it should be in the NCBI template and checking for it.
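
A minimal sketch of the check proposed above, assuming the spec TSV gains a per-field annotation saying whether the field belongs in the NCBI template. The column name `in_ncbi_template`, the sample rows, and the function name are hypothetical and not part of the current repository.

```python
import csv
import io

# Hypothetical TSV excerpt. The "in_ncbi_template" column is an assumed
# annotation, not a column of the real AIRR_Minimal_Standard_Data_Elements.tsv.
SPEC_TSV = (
    "field_name\tin_ncbi_template\n"
    "sample_name\tyes\n"
    "tissue\tyes\n"
    "junction_aa\tno\n"
)


def fields_missing_from_template(spec_tsv, template_headers):
    """Return annotated fields that the NCBI template does not contain."""
    reader = csv.DictReader(io.StringIO(spec_tsv), delimiter="\t")
    expected = [row["field_name"] for row in reader
                if row["in_ncbi_template"] == "yes"]
    present = set(template_headers)
    return [name for name in expected if name not in present]


# Column headings as they might be read from the XLS template (assumed values).
missing = fields_missing_from_template(SPEC_TSV, ["sample_name", "organism"])
print(missing)
```

A CI job could fail whenever the returned list is non-empty, which would flag a field added to the TSV but not to the template.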

Some MiAIRR data elements are "write-only" in BioProject

This issue was brought up by Christian on the mailing list...

While finishing up on the MiAIRR-to-BioProject mapping, I came across
the issue that I was unable to retrieve information on the submitter
and corresponding person. So I asked NCBI:

I am trying to use your eutils interface to retrieve BioProject
information about various related sequencing studies.
However, I have noticed that when I fetch a record (like this
one from the Kleinstein lab)

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=bioproject&id=PRJNA209947

the submitter and contact information is missing. This is also
true in the "bioproject.xml" export file on your FTP site. Am I
doing something wrong or can this information simply not be
retrieved from the database (the Submission.xsd does specify
these fields as mandatory, so they should be there)?

To which they replied:

You are not doing anything wrong as this information can not be
retrieved from the database. The submitter info and contact info
are only used during the submission process and removed from
the XML once public. Only the Submitter Organization info is
available - which is seen in the example (PRJNA209947) you provided.

Which means that in our current NCBI implementation these fields
are "write-only", which I assume was not our intention.

MiAIRR target_substrate clarification (field has been renamed to "template_class")

Again, as a result of our data curators diving into the MiAIRR standard in more detail, another clarification is requested 8-) Apologies if this is confusing given I am not an expert. Nishanth, please jump in if you can make my incoherent ramblings more succinct!!!

The questions I have to pass along are in regards to:

3 / process (nucl. acid) Target substrate String Controlled vocabulary (DNA|RNA)

One of the drivers for the question is that the field is defined as a controlled vocabulary with only two options in the example, DNA or RNA. This implies that there are two, and only two, possibilities in the controlled vocabulary. Is that the case? This causes some confusion on our end, in particular because our curators tend to gather the "DNA Type" from papers, which is whether the study uses gDNA or cDNA. Given that cDNA and gDNA are not possibilities for target_substrate and there is no place to capture that information, they have requested some clarification around this...

More specific clarification questions are:

"In what context is a 'target_substrate' a target? Is it the target for sequencing (eg. cDNA being sequenced) or for calculating viral load?" and a follow-on comment "Is 'target_substrate' for those that are meant only for sequencing or other processing as well? For example, for sequencing, cDNA or gDNA makes more sense while for calculating viral load, RNA would be the target_substrate. And for RNA-seq it is DNA that is sequenced even though the source for the cDNA is RNA. So should RNA be the target substrate for that or DNA? Should the target substrate be defined more specifically to what is actually used to prep the assay specific sample (in this case, assay specific sample being the cDNA library prepared using Illumina adapters for sequencing)?"

Comments?

A couple of thoughts/questions about AIRR_Minimal_Standard_Data_Elements.tsv

Just my 5 cents (assuming I've downloaded a large AIRR dataset and want to parse it):

1. From the table, it is unclear whether all of these columns are required or whether some of them are optional and can be left blank.
2. I think providing an optional column for donor HLA haplotype in Subject is important. We must encourage people to share this very useful info.
3. Suppose one has a dataset of tetramer-sorted cells. Should the info go into CellProcessing.cell_phenotype?
4. DataProcessing.quality_thresholds says "How sequences were removed from (4) based on base quality scores". Actually, it's not always best to remove the sequences: starting with our old MITCR paper, we've implemented re-mapping for low-quality sequences to improve clonotype count quantification.
5. Rearrangement.v_call and the other gene fields say that the input is a string. However, what should be done with multiple mappings, as sometimes it is not possible to resolve the exact segment? What would be the delimiter for this kind of situation?
6. Rearrangement.junction_aa - what should be put here for non-coding sequences (it says string)? Shouldn't the junction fields be limited to NT/AA alphabet?
7. Any guide on how to handle "exotic" rearrangement data such as incomplete IGH D-J found in lymphomas (i.e. no V, no CDR3 in common sense)?
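
To make points 5 and 6 concrete, here is a small sketch of what a parser would need if the spec answered them in one particular way. Both the comma delimiter and the nucleotide alphabet below are assumptions chosen for illustration; the issue leaves both open, and the helper names are hypothetical.

```python
import re

# Assumption: ambiguous gene calls are joined with a comma. The issue asks
# which delimiter should be used, so this is only a placeholder choice.
CALL_DELIMITER = ","

# Assumption: junction nucleotide strings are limited to A, C, G, T and N.
NT_ALPHABET = re.compile(r"[ACGTN]*")


def split_calls(v_call):
    """Split a possibly ambiguous gene call into its individual calls."""
    return [call.strip() for call in v_call.split(CALL_DELIMITER) if call.strip()]


def is_valid_junction_nt(junction):
    """Check that a junction string uses only the assumed NT alphabet."""
    return NT_ALPHABET.fullmatch(junction.upper()) is not None


print(split_calls("IGHV1-69*01, IGHV1-69*13"))
print(is_valid_junction_nt("TGTGCGAGA"), is_valid_junction_nt("TGT-CG"))
```

Without a documented delimiter and alphabet, every consumer has to guess at both, which is the interoperability concern raised in the list.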

FAIRsharing should point to GitHub

The FAIRsharing entry referenced in the manuscript should point to the versioned release of the spreadsheet and not to a Google doc.

Mapping to BioSample attributes still incomplete

Updated the ODS table in the devel branch [here]. We are still lacking a full and non-ambiguous 1:1 mapping between the MiAIRR items, the NCBI-defined attributes and the AIRR-custom keywords in the submission templates.

There is now an ensure-consistency.py and a check-consistency.py

I renamed it at some point in the history. I'm guessing there are now two because of a merge commit (instead of a rebase). Will reconcile them.

Agenda items for Dec. AIRR meeting

We have a good chunk of Sunday on the agenda for the WGs to meet. I know many of us are on multiple WGs, which makes it hard to do it all concurrently. Apparently the whole Tools and Resource WG only gets a 1/2 hour to present.

We should plan ahead of time, ya know, set a good example for the other WGs who you know will scrap it together at the last minute, or is it, we are so on top of things that we can whip it up on the fly, I tend to forget ;-D

Do we want to meet as a group? What do we want to present?

Genbank structured comments

Continuing from original airr-standards repo. This is the original prompt (from @javh).

We should also discuss whether we can/should design these fields with the intention of using them in GenBank structured comments. This is one of the things that came up when working through the AIRR minimal information GenBank submission process. It's something we can add to the submissions, if we want, but we need to have a clearly defined vocabulary to give NCBI so that they can parse it and tag the data as AIRR.

See some additional comments at the original issue: airr-community/airr-formats#51

"anatomic site" vs. "tissue"

MiAIRR currently defines an item "anatomic site" but no item "tissue" (although I'm pretty sure that we had it at some point). Obviously those elements are not interchangeable. BioSample has "tissue" as a mandatory item and we have introduced "anatomic_site" in the submission sheets. Nevertheless we are currently mapping "anatomic site" [MiAIRR] to "tissue" [BioSample] while "anatomic site" [BioSample] becomes an orphan.

Proposed solution:
(Re-)Introduce an item "tissue" into set 2 (sample)

Dependencies:
mapping table, figures

There are two of each object in definitions.yaml

Is it possible to deduplicate the objects in definitions.yaml? Is this a temporary situation?

Whole vs. partial sequences is ambiguous

The "Whole vs. partial sequences" item is typed as a boolean, but it is ambiguous what a "YES" or "NO" signifies. The field could be changed to a question, like "Whole sequences?", but then it may be confusing what "not whole" means. It may be better to make it a controlled vocabulary string (whole|partial).

"ng_template" field name contains unit

The field name ng_template, which refers to the Template amount item (line 58), contains a unit in its name (ng). I consider this a bad practice, as the unit should be part of the data/value, not of the key.

Sequences less than 200bp not accepted by GenBank for AIRR submission

Notes below about this issue are from Lori Black @ NCBI GenBank sent in an email response about some of the sequences in one of the fasta files I had submitted for the AIRR standards.

[2] Many of the sequence(s) in your file(s) are less than 200 bp.

Unfortunately, we must inform you that we have a policy not to accept sequences shorter than 200 bp. We realize that this has short-term consequences for submitters, but feel that the long-term improvements in the database will be helpful for all database users.

If you resubmit your sequence submission(s) with additional sequence, we may then be able to accept your sequence(s). Alternatively, if you would like us to delete the sequence(s) that are under 200 bp and proceed with the rest, please inform us.
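
Before submitting, the 200 bp threshold quoted above could be checked locally. The following is a minimal sketch using only the standard library; the function names and the sample records are made up for illustration, and it is not a GenBank tool.

```python
MIN_LENGTH = 200  # threshold quoted in the GenBank reply above


def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_by_length(records, min_length=MIN_LENGTH):
    """Separate records that meet the minimum length from those that do not."""
    kept, too_short = [], []
    for header, seq in records:
        (kept if len(seq) >= min_length else too_short).append((header, seq))
    return kept, too_short


# Made-up example input: one 200 bp record and one 199 bp record.
example = ">long_read\n" + "A" * 200 + "\n>short_read\n" + "C" * 199 + "\n"
kept, too_short = split_by_length(read_fasta(example))
print([h for h, _ in kept], [h for h, _ in too_short])
```

Records in `too_short` would either need additional sequence or have to be dropped from the submission, matching the two options NCBI offers.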

Add travis job to ensure consistency in specs between repositories

Remove the irrelevant files from master

Ahmad will remove the irrelevant files from master repo

Inconsistencies in YAML vs TSV

The Travis job found some inconsistencies:

https://travis-ci.org/airr-community/airr-standards/builds/285271493

cc @schristley @bussec @ahmadchan

Which Sample does MiAIRR refer to?

Is the sample described in the BioSample submission the primary sample collected from a donor (e.g., blood)? Or is it the derived/processed sample from which sequencing template was extracted (e.g., after flow sorting)?

Remove "(transferred from germline)" from airr-format docs

Transferred from airr-community/airr-formats#54

Identifiers in NCBI templates should match schema exactly

Currently in master, the AIRR_BioSample_v1.0.xls file has a column called "*sample_name" instead of "sample_name".

Ensuring MiAIRR is not NCBI specific - RE: CRWG

Hi All,

This is not urgent, but wanted to log it as it is something that we should probably address in coordination with CRWG.

I am pretty sure that we discussed this, but I can't seem to find that discussion in the issues... I think we agreed that the standard should use NCBI examples but not explicitly state that NCBI should always be used...

I raised this issue at the Common Repository Group meeting a fair while ago (airr-community/common-repo-wg#10) about wording that is too NCBI specific (their Recommendation 4:). The group agreed that this was probably not appropriate but that the wording came from the Minimal Standards working group and that the issue should be agreed by Minimal Standards and perhaps provide some alternate wording.

Current wording is: "Recommendation 4: For long-term storage, data and metadata should be deposited in the Sequence Read Archive (SRA) and GenBank, per the recommendations established by the AIRR Minimal Standards Working Group. The AIRR Working Groups should work with SRA/GenBank to customize metadata capture for AIRR data." Note the explicit mention that it should be deposited in SRA/GenBank.

I suggested changing the wording to something like this:

"Recommendation 4: For long-term storage, data and metadata should be deposited in one of the International Nucleotide Sequence Database Collaboration (INSDC) or similar archives such as SRA, Genbank, and ENA, per the recommendations established by the AIRR Minimal Standards Working Group. The AIRR Working Groups should work with the INSDC archives to coordinate the accurate gathering and storage of metadata for AIRR data."

It would be good if we could provide the CRWG some feedback on this so we can close off the issue.

"Freezing" a version of MiAIRR?

Hello All,

We are trying to use the MiAIRR spec to define the iReceptor API. The iReceptor API is used to access information from iReceptor Repositories and is used by the iReceptor Gateway to query individual repositories and federate responses across those repositories. By using MiAIRR, in principle any data repository that has MiAIRR data in it could be linked into the iReceptor network by implementing the iReceptor API.

We are currently using a number of the columns from the MiAIRR TSV file, primarily the MiAIRR field designation, the data type, the content type (to some degree), and the AIRR Format field names. Much of this is being done through use of the YAML file. Our problem is that we need something that is relatively stable (a v0.1 release) that we can develop against.

The Travis job will help to ensure the TSV and the YAML stay in sync, but the reality is that at some point we should probably freeze our master branch with a tagged release (probably linked to something around the MiAIRR manuscript) and limit changes on the development branch with considered merges back to master. Currently the master branch is changing faster than the development branch 8-)

Does it make sense to "freeze" a version of the MiAIRR spec (presumably the Master branch with a release tag), possibly linked to the MiAIRR manuscript, and move the development work into the development branch?

For us it would be better to have a frozen release ASAP so we can develop against it. We are targeting the December meeting for having the iReceptor repositories, APIs, and our Gateway implement a significant part of MiAIRR, but we need something stable to develop against...

Thoughts on moving in this direction???

Brian

Why isn't the sequence data part of 6 / data (proc. seq.)?

Isn't the sequence data needed for submission to GenBank?

Is it safe to remove asterisks from keywords in NCBI BioSample templates?

The mandatory keywords/column headings in the NCBI BioSample template are usually prefixed with an asterisk. Those were removed in 5d75d77. @ahmadchan is it safe to do this or could this break the submission process?
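
If the asterisks turn out to be required by NCBI, one alternative is to keep them in the template and strip them only when comparing against the schema. A minimal sketch of that comparison follows; the helper names and the example field lists are hypothetical.

```python
def normalize_header(name):
    """Strip the leading asterisk that marks a mandatory template column."""
    return name.strip().lstrip("*")


def schema_fields_not_in_template(template_headers, schema_fields):
    """Return schema fields with no matching column once asterisks are ignored."""
    normalized = {normalize_header(header) for header in template_headers}
    return sorted(set(schema_fields) - normalized)


# Example values only; the real template and schema contain many more fields.
print(normalize_header("*sample_name"))
print(schema_fields_not_in_template(["*sample_name", "organism"],
                                    ["sample_name", "tissue"]))
```

This keeps the identifiers comparable to the schema without changing what is submitted.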

Does it make sense to merge airr-formats and airr-standards?

Replicated at airr-community/airr-formats#44

We could publish a single set of versioned docs at docs.airr-community.org or something like that.

Thoughts?

Consistency checks need to be written for NCBI SRA XLS

Remove spaces from directory names

e.g., "NCBI Templates"

AIRR and MiAIRR logo...

Hi Ahmad,

Not sure if the Minimal Standard group is using GIT issues to track things. Hoping so 8-)

Were you the one who created the MiAIRR logo - I love using the receptor graphic in the acronym... Awesome...

I have just one suggestion. Should we use the receptor graphic as the "big I" in AIRR rather than the "little i" MiAIRR??? I think that would be a VERY cool logo for AIRR in general!

Brian

Change field name "organism" to "synthetic"

The MiAIRR item "Organism-based or synthetic" currently maps to the Formats WG field name organism. This creates ambiguity with the field name species_name. Furthermore the BioSample correlate to species_name is organism.

Proposed solution:
Rename organism to synthetic. Define it as boolean.
Rename species_name to organism.

What is our git development model?

If we're going to do most development on the development branch, could we make it the default branch when you land on the github page?

(In case it's still up for discussion, I'd vote to do development on master, though)

Harmonize data types with airr-formats

I'd propose harmonizing the data types between the two repos. In the airr-formats group, we specify the types as string, boolean, float, and integer. Any objections to a PR that would make them the same?

Merge cleanup

Tracking some issues related to merging airr-formats in airr-standards.

Change subtitle from "MiAIRR: AIRR Community Minimal Standards WG" to something more inclusive, perhaps "AIRR Community Data Standards"
Change URL to point to docs.airr-community.org
Add docs for MiAIRR to the RTD docs
Merge the testing code into a single Travis job
Convert reference library to use airr-standards spec file
Switch to Sphinx documentation gen (with Markdown support) for more flexibility

Should a version number be associated with repository submissions?

Not sure if it is necessary but worthwhile to think about. We have the situation where the minimal standards will likely undergo refinements, yet we should soon expect that studies will be submitted to repositories following MiAIRR. How to handle if incompatible changes are introduced? Should a spec version number be associated with repository submissions? This might also be important for APIs that repositories expose. How could this be implemented?

I can see how tools that perform automated submission could ensure that a version number is attached, not sure how to do that for manual submissions.
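
One possible shape for this, sketched under assumptions: each submitted record carries a spec version string, and a repository treats records as compatible when the major version matches. The key name `miairr_version`, the supported version, and the major-version rule are all hypothetical choices, not an agreed design.

```python
SUPPORTED_VERSION = "1.0.0"  # hypothetical spec version a repository supports


def parse_version(text):
    """Parse 'major.minor.patch' into a tuple of integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def is_compatible(record, supported=SUPPORTED_VERSION):
    """Accept a record only if it declares a version with the same major number."""
    version = record.get("miairr_version")  # assumed key name
    if version is None:
        return False  # e.g. a manual submission with no version attached
    return parse_version(version)[0] == parse_version(supported)[0]


print(is_compatible({"miairr_version": "1.2.0"}))
print(is_compatible({"miairr_version": "2.0.0"}))
print(is_compatible({}))
```

The last case corresponds to the open question about manual submissions: a record without a version cannot be checked at all.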

What's needed to complete for V1.1.0 software release?

While staying in an alpha version is ok, it means we do not take advantage of semantic versioning, so I'd like us to plan for features we should finish. Here's what I'd like to see:

integrate airr-formats (i.e. the current python reference library).
have the doc website generated from GitHub
NCBI/SRA XML operational and tested

Some other ideas but maybe not v1.0...

initial API for common repository
tools for performing NCBI/SRA submission, and extracting NCBI/SRA data back into AIRR objects.

Please post your thoughts.

Green box fields in BioSample?

The NCBI submission template for BioSample contains fields from the green box "#3" section. Where is this information supposed to go? For the common repo API, we want to have identifiers that essentially correspond to a sequencing library (so we can pull out all the other minstd metadata), but it's not clear whether that would correspond to a BioSample id or an SRA experiment id.

epitope and specificity information

This is an issue that Bjoern brought up on the CRWG mailing list, whether part of MiAIRR or just support in formats is worth discussion.

The alternative is to describe the exact epitope recognized as part of the MiAIRR standard. That is absolutely possible, but will require a substantial update - I thought the point of using the IEDB record was that this takes away that complexity. If the epitope should be described, minimally you will need

epitope type (linear peptide, discontinuous amino acids, non-peptidic)

epitope molecular structure (3 different fields, depending on type above)

epitope source protein

epitope source organism

There are a bunch of additional fields (e.g. is this the exact epitope / an epitope containing region / a partial epitope). But as a start you might want to err on the minimal info above.

The primary use case at that point was that people are using e.g. MHC tetramer staining to isolate epitope specific repertoires.

AIRR Common Repositories

We will work with format group and common repo to finalize the specifications.

Provenance of samples

Problem:

To indicate the provenance of samples, MiAIRR currently only defines a single
data item "commercial source". It is therefore unclear how to describe the
source of samples from other sources. BioSample has a mandatory
"biomaterial_provider" field, but according to NCBI's documentation
[https://www.ncbi.nlm.nih.gov/biosample/docs/attributes/] this is distinct
from the "biospecimen_repository", which indicates registered commercial
providers (controlled vocabulary). We have introduced "source_commercial" as
an AIRR-specific field in the BioSample submission template, which is likely
redundant with "biospecimen_repository".

Proposed solutions are either:

1. Replace "commercial source" in MiAIRR with "biomaterial provider" as the
sole and mandatory field to indicate sample provenance (free text format). All
providers will be entered into this field, irrespective of their status.
Remove "commercial source" in all downstream documents in which it would be
redundant with "biomaterial provider" and/or "biospecimen_repository".

2. Rename "commercial source" to "biospecimen repository" and fix downstream
mappings accordingly. This field is optional (applies only to commercially
obtained samples) and contains text from a controlled vocabulary (see above).
Add "biomaterial provider" as a new mandatory item to MiAIRR (free text
format).

Dependencies:
mapping table, figures

Auto generate definitions.yaml from TSV?

Hi All,

I was wondering if it might make sense to auto generate the definitions.yaml file from the MiAIRR TSV file rather than have a consistency check???

There are a couple of reasons for this:

1. They won't ever be out of sync (assuming the generation is done automatically). Probably worth having the Travis job still to ensure this and catch bugs in the conversion 8-)
2. The definitions would be more complete. Currently the definitions.yaml file is very minimal with name and type. The YAML object spec accepts descriptions and examples for fields in the definition. The TSV file has both field descriptions and examples. It would be nice if these could be added to the definitions.yaml file. I was going to do it manually but thought that is a lot of mundane work and will make it even harder to keep the two consistent. The driver for this is that our API uses the definitions.yaml file. But the API description is pretty bare bones as it has no descriptions or examples. It would be nice if the definitions.yaml file had this information.
3. When using the MiAIRR definitions in a swagger API spec like we are, we need definitions in multiple forms. For swagger API responses, we need swagger objects. For API parameters, we need a different YAML structure, essentially a list of single parameters and their types, descriptions, and examples. Similar to #2 above, it would be nice if we could automatically generate a parameters.yaml file from the TSV.
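
As a rough illustration of the generation step described in the reasons above, the sketch below reads a TSV and emits swagger-style object definitions with type, description, and example. The column names and rows are hypothetical stand-ins for the real MiAIRR TSV, and the YAML is written by hand to keep the example free of third-party dependencies.

```python
import csv
import io
import json

# Hypothetical column names and rows; the real TSV uses different headings.
TSV = (
    "object\tfield_name\tdata_type\tdescription\texample\n"
    "Subject\tage\tstring\tAge of the subject\t200 days\n"
    "Sample\ttissue\tstring\tTissue of origin\tblood\n"
)


def tsv_to_definitions(tsv_text):
    """Group TSV rows into {object: {field: {type, description, example}}}."""
    definitions = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        properties = definitions.setdefault(row["object"], {})
        properties[row["field_name"]] = {
            "type": row["data_type"],
            "description": row["description"],
            "example": row["example"],
        }
    return definitions


def to_yaml(definitions):
    """Render the definitions as YAML text, quoting scalars via JSON encoding."""
    lines = []
    for obj, properties in definitions.items():
        lines.append(f"{obj}:")
        lines.append("  type: object")
        lines.append("  properties:")
        for field, attrs in properties.items():
            lines.append(f"    {field}:")
            for key, value in attrs.items():
                lines.append(f"      {key}: {json.dumps(value)}")
    return "\n".join(lines) + "\n"


definitions = tsv_to_definitions(TSV)
yaml_text = to_yaml(definitions)
print(yaml_text)
```

A parameters.yaml file could be produced from the same `definitions` dictionary with a second rendering function, so both outputs would stay tied to the one TSV source.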

I am happy to work on the code snippets to do this... The consistency check that we have now is a good starting point.

Thoughts?

I can create a branch for this development...

Brian

Consistency checks need to be written for NCBI XML

CSV dialect change

Would it be possible to switch to tab-separated values with no quoted strings? This tends to make things simpler to parse, and causes less confusion.
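
For reference, this is roughly what the proposed dialect looks like with Python's standard `csv` module: tab as the delimiter and `csv.QUOTE_NONE` so that double quotes are treated as ordinary characters. The sample rows are made up.

```python
import csv
import io

# Reading: with QUOTE_NONE, double quotes are kept as literal characters.
text = 'field_name\tnote\nsample_name\tsaid "hello"\n'
rows = list(csv.reader(io.StringIO(text), delimiter="\t",
                       quoting=csv.QUOTE_NONE))

# Writing: fields are emitted as-is, separated by tabs, never quoted.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", quoting=csv.QUOTE_NONE,
                    lineterminator="\n")
writer.writerows([["field_name", "data_type"], ["sample_name", "string"]])
written = buffer.getvalue()

print(rows)
print(written)
```

One consequence of this dialect is that values cannot contain a literal tab unless an escape character is configured, so the spec would need to say how (or whether) tabs inside values are allowed.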

clean up development branch

The reconciliation between the master and development branch may have caused some files which were deleted and/or moved to come back. @ahmadchan can you please review the files on the development branch, and delete anything that shouldn't be there?

should there be _nt fields for rearrangement format?

In email with @williamdlees, who is modifying his tool to use the rearrangement formats, he brought up an issue about which fields he should expect:

in the file I downloaded - I can see, for example cdr1_nt but not cdr1_start or cdr1_end. On the other hand the spec at https://github.com/airr-community/airr-formats/blob/master/docs/rearrangements.md has cdr1_start and cdr1_end, but not cdr1_nt. Do I need to allow for either possibility? If so does that apply to any of the fields that have _start and _end co-ordinates in the spec?

So I do think it would be worth covering explicitly in the spec - both adding the _nt fields, and stating that implementers can provide (or expect to consume) one or the other or both
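
A consumer that has to accept either representation might look like the sketch below. It assumes 1-based, inclusive coordinates into a `sequence` field, which is an assumption for illustration and would need to be confirmed by the spec; the function name is hypothetical.

```python
def region_nt(record, region):
    """Return a region's nucleotides from either the _nt field or coordinates."""
    explicit = record.get(f"{region}_nt")
    if explicit:
        return explicit
    start = record.get(f"{region}_start")
    end = record.get(f"{region}_end")
    if start is None or end is None:
        return None  # neither representation is present
    # Assumption: coordinates are 1-based and inclusive on record["sequence"].
    return record["sequence"][start - 1:end]


with_coords = {"sequence": "AAACCCGGGTTT", "cdr1_start": 4, "cdr1_end": 6}
with_nt = {"sequence": "AAACCCGGGTTT", "cdr1_nt": "CCC"}
print(region_nt(with_coords, "cdr1"), region_nt(with_nt, "cdr1"))
```

If the spec states which of the two forms is required, this fallback logic becomes unnecessary for implementers.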

Frontiers article describing file formats

Gur Yaari is coediting an issue of Frontiers in Immunology and asked if we'd like to contribute. The formats group has expressed interest. We would need to submit an abstract by Dec 15, full paper by end of April. One of these article types:

A Type Articles: Classification, Clinical Trial, Hypothesis & Theory, Methods, Original Research, Protocols, Review, Systematic Review, Technology Report

B Type Articles: Case Report, Clinic-Pathological Conference (CPC), Conceptual Analysis, Evaluation, Mini Review, Perspective

C Type Articles: Code, Data Report, Opinion

D Type Articles: Book Review, Core Concept (Young Minds), Editorial, Field Grand Challenge, Focused Review, Frontiers Commentary, General Commentary, New Discovery (Young Minds), Specialty Grand Challenge.

Are coordinates needed for 6 / data (proc. seq.)?

When we prepare "6 / data (proc. seq.)" for submission to GenBank, it requires coordinates on the sequence for the [vdjc] features. Shouldn't those coordinates be part of MiAIRR then?

MiAIRR "Lab" clarification...

Hello All,

I am trying to get a clear understanding of what the Minimal Standard "Lab" entity relates to.

In the iReceptor world we have a concept of a lab, but it is the lab that is carrying out the study. That is the "Jamie Scott Lab" has a bunch of studies that they have done...

It appears that the lab entity in MiAIRR is focused around data collection and data deposition. Is the intent the same as the iReceptor intent and the contact info for the data collection and data deposition are contact points within the lab for key aspects of the work?

Some in our group have interpreted the data processing fields as implying that the lab information is intended to identify the lab that has done the data collection/processing (that is, it is potentially a different lab than that carrying out the study).

Can someone clarify the intent here. I suspect we need to clean up the description of these fields one way or the other 8-)

Brian

MiAIRR age_event and age clarification...

Hi All,

Our data curators are looking for some clarification around age_event and age in the "1/subject" set. Presumably, this relates to the subject age, with the age_event being something like enrollment (as given in the example). The problem is, the "age" example doesn't make a lot of sense, given that the age example is "200 days". This would imply that the age of the subject was "200 days" at "Enrollment". Is that the intent? For human subjects, that seems an odd example???

From our curators' point of view, capturing the age of a subject (or at least an age range) is quite important for the study (eg. someone who is 12 will have vastly different adaptive immune system activity than someone in their 70s), and it is also what many of the papers that our curators are processing provide (the researchers think it is important as they provide it).

Can someone clarify the intent here? Does the example need to be made more clear?

MiAIRR Zombie - CDR3 / junction amino acid sequence

Problem:
If I remember correctly, we have decided against including the amino acid
sequence of CDR3 / JUNCTION as part of the minimal standard at least twice (as
it is redundant to the DNA sequence). Nevertheless this field is still in the
current table.
Our current NCBI implementation has no mapping for this field, as there is no
clean way to annotate it in Genbank.

Proposed solution:
Drop "JUNCTION amino acid sequence" as part of MiAIRR, keep the "junction_aa"
key as field name of Formats WG (as non-minimal implementations might want to
provide this data).

Dependencies:
mapping table, figures

Please let me know what you think until the call next Tuesday.

Develop initial yaml specifications that can be used for file formats and swagger API

Is diagnosis & intervention data really n-to-1 with subject?

In an earlier meeting, it was mentioned that there could be multiple diagnosis&intervention data records associated with a subject. Furthermore, these records are independent from the sample records associated with a subject, which is an n-to-1 relationship. Is it true that diagnosis & intervention data is n-to-1 with subject?

The reason I ask is because it introduces confusion about how to "flatten" records. For example, with NCBI, the MiAIRR objects are de-normalized into a single BioSample record. If a subject has (say) 3 diagnosis&intervention records, and (say) 5 samples, how many BioSample records is that?
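
To illustrate why the question matters, the sketch below shows one possible reading: a full de-normalization that pairs every diagnosis record with every sample. This is only an interpretation for discussion, not the agreed mapping, and the field names are hypothetical.

```python
from itertools import product


def flatten(subject, diagnoses, samples):
    """Pair every diagnosis record with every sample for one subject."""
    return [{**subject, **diagnosis, **sample}
            for diagnosis, sample in product(diagnoses, samples)]


subject = {"subject_id": "S1"}
diagnoses = [{"diagnosis": f"D{i}"} for i in range(1, 4)]  # 3 records
samples = [{"sample_id": f"SAMPLE{i}"} for i in range(1, 6)]  # 5 records

flat = flatten(subject, diagnoses, samples)
print(len(flat))
```

Under this reading the example would yield 3 x 5 flattened records, whereas tying each diagnosis to specific samples would yield fewer; the spec needs to say which is intended.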
