laserson commented on July 17, 2024

Where is there a cdr1_nt field currently?

from airr-standards.

schristley commented on July 17, 2024

ChangeO can optionally produce them, though they are called *_imgt and contain the IMGT gaps. I guess ChangeO didn't provide the positional information before. William seems to indicate that IMGT/V-QUEST also provides them, but I've never used V-QUEST, so I don't know. Oddly, looking at the comparison of the different tools in our old spreadsheet, I don't see any indication of positional info provided by any of the tools. Maybe we decided at some point that the positional information was more useful than the sequence; I don't remember.

laserson commented on July 17, 2024

I would guess positional information is always more useful, no? You can extract the sequence yourself. Maybe I'm misunderstanding.
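
The point about extracting the sequence yourself can be sketched as follows. This is only an illustration, assuming the positions are 1-based, closed-interval coordinates on the query sequence; the field names `cdr1_start` and `cdr1_end` and the helper `extract_region` are assumptions made for the example, not something this thread defines.

```python
def extract_region(sequence, start, end):
    """Return the subsequence for a 1-based, closed interval [start, end].

    Returns None when either coordinate is missing, since the positional
    fields are optional.
    """
    if start is None or end is None:
        return None
    if not (1 <= start <= end <= len(sequence)):
        raise ValueError("coordinates fall outside the sequence")
    # Convert the 1-based closed interval to a Python slice.
    return sequence[start - 1:end]


# Hypothetical record with assumed field names.
row = {"sequence": "AAACCCGGGTTT", "cdr1_start": 4, "cdr1_end": 9}
cdr1_nt = extract_region(row["sequence"], row["cdr1_start"], row["cdr1_end"])
print(cdr1_nt)  # CCCGGG
```

The reverse direction (recovering positions from a region sequence) is not generally this simple, which is one argument for treating positions as the primary information.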

schristley commented on July 17, 2024

I had taken data from one of Florian's studies, ran it through the current ChangeO/AIRR toolset on VDJServer, and made the data available to William. That AIRR TSV has *_nt fields, not the positional fields mentioned in the spec, and William asked why. I'm not completely sure myself; @javh probably knows, though.

You can look at the data yourself. Go to the job output and click on one of the AIRR TSV files to download it.

Now the positional fields are optional, so they don't need to be provided, but William was confused about which fields he should expect, or whether he has to support both.

As the *_nt fields are not in the spec, maybe we should add them, then at least the field names are reserved, and tools can check for existence of one or the other.
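
A consumer-side check for "one or the other" might look like the sketch below. It assumes the positional fields are named `<region>_start` / `<region>_end` and the sequence field `<region>_nt`; the function `region_field_style` is a hypothetical helper, not part of any AIRR library.

```python
import csv
import io


def region_field_style(fieldnames, region="cdr1"):
    """Report which representation of a region a TSV header provides."""
    names = set(fieldnames)
    has_positions = {f"{region}_start", f"{region}_end"} <= names
    has_sequence = f"{region}_nt" in names
    if has_positions and has_sequence:
        return "both"
    if has_positions:
        return "positions"
    if has_sequence:
        return "sequence"
    return "none"


# Small in-memory TSV standing in for a downloaded file.
tsv = "sequence_id\tsequence\tcdr1_nt\nseq1\tAAACCCGGGTTT\tCCCGGG\n"
reader = csv.DictReader(io.StringIO(tsv), delimiter="\t")
print(region_field_style(reader.fieldnames))  # sequence
```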

schristley commented on July 17, 2024

My goal with the VDJServer testing was to produce all of the fields that we currently have in the spec, but that doesn't seem possible with the current ChangeO/AIRR toolset, though I might be making a mistake with the parameters I'm using. All of the mandatory fields are produced though.

williamdlees commented on July 17, 2024

javh commented on July 17, 2024

Yeah, this is a changeo limitation. We output the CDR/FWR sequences as an optional argument, but we don't output the positions of the CDR/FWR.

I would say adding the _nt fields to the spec as optional fields makes sense, but they don't necessarily need to be in the spec either, as these are just extra custom changeo output. I don't think we'll have any luck exhaustively defining the field set every tool will need, so changeo having some extra output isn't overly concerning to me.

It's a non-trivial change to add the CDR/FWR start/end positions to the changeo output, because I would need to add a backwards calculation from the IMGT-numbered sequences to the original query sequence and figure out any indel corrections that occurred.

It's in the plan for later, but because the start/end positions were optional, I decided it wasn't worth the hassle for the initial release.
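
To make the backwards calculation concrete, here is a deliberately simplified sketch. It is not how changeo works: it assumes IMGT gaps are written as `.`, that no indel corrections were applied, and that the ungapped region occurs exactly once in the query, so a plain string search stands in for the real coordinate mapping. The function name `locate_region` is hypothetical.

```python
def locate_region(query, region_imgt, gap_char="."):
    """Map an IMGT-gapped region back to 1-based closed query coordinates.

    Returns (start, end), or None when the region is empty, absent, or
    ambiguous. Indel corrections are NOT handled; a region altered by an
    indel correction would simply fail to match.
    """
    ungapped = region_imgt.replace(gap_char, "")
    if not ungapped:
        return None
    offset = query.find(ungapped)
    if offset < 0:
        return None
    # Refuse to guess when the region matches more than one place.
    if query.find(ungapped, offset + 1) >= 0:
        return None
    return offset + 1, offset + len(ungapped)


print(locate_region("AAACCCGGGTTT", "CCC...GGG"))  # (4, 9)
```

The cases where this returns None are the ones that make the change non-trivial: a production implementation would need to carry alignment coordinates through the gapping and correction steps rather than search for the string afterwards.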

bcorrie commented on July 17, 2024

As a follow on to Jason's comment "I don't think we'll have any luck exhaustively defining the field set every tool will need, so changeo having some extra output isn't overly concerning to me." I agree completely...

For what it is worth, iReceptor also has extra information that it stores and communicates via its API. We have somewhat arbitrarily prefixed such information (for use in APIs and TSV files etc) with ir_ to denote iReceptor extensions... For example, we have always tracked "age" in years if it is known for a subject. We denote this ir_age in our API to differentiate it from AIRR terms as MiAIRR has an age field as well. MiAIRR's age is a string and we find that studies often report age ranges. This is going to be much more of an issue in the downstream processing that you are discussing...

My point is primarily that our APIs, tools, and platforms should anticipate seeing data that they do not use or require and be forgiving of that. Any formats work we do should probably allow for having extensible fields that are not part of the "standard". I guess the only question is should that be done explicitly through a mechanism that we define or should we leave it up to the tools...
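
As a rough sketch of what "forgiving" could mean for a consumer: keep the fields it understands and carry the rest along untouched instead of rejecting the record. The `KNOWN_FIELDS` set below is a made-up subset for illustration, and `split_record` is a hypothetical helper.

```python
# Hypothetical subset of fields this consumer understands.
KNOWN_FIELDS = {"sequence_id", "sequence", "v_call"}


def split_record(record, known=KNOWN_FIELDS):
    """Separate recognized fields from extension fields without failing."""
    core = {k: v for k, v in record.items() if k in known}
    extra = {k: v for k, v in record.items() if k not in known}
    return core, extra


record = {"sequence_id": "seq1", "sequence": "AAACCC", "ir_age": "35"}
core, extra = split_record(record)
print(core)   # {'sequence_id': 'seq1', 'sequence': 'AAACCC'}
print(extra)  # {'ir_age': '35'}
```

Keeping the extras rather than dropping them lets a tool write them back out unchanged, so extensions such as the ir_ fields survive a round trip through software that does not use them.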

This might be worthy of a separate issue...

schristley commented on July 17, 2024

"I don't think we'll have any luck exhaustively defining the field set every tool will need, so changeo having some extra output isn't overly concerning to me." I would turn this around: it isn't what tools need but what tools provide. I think we should use the spec to try to exhaustively define the set of fields that tools provide. Because the space of field names is essentially a global shared resource, the spec can be used to ensure all the known fields have well-defined names and defined semantics. By "spec" I mean the rearrangement annotations, but we could also think about extending this to study metadata.

This should also serve to point out where there are semantic differences. The age field mentioned by Brian is one example. Then we can debate how to resolve.

javh commented on July 17, 2024

@schristley, we can take that approach as well. I'm fine either way, as long as we are, as you say, explicit about how each field is defined. However, that could get out of hand quickly.

Just as an example, in the case of changeo, it outputs CDR/FWR sequences that are IMGT-gapped. However, the CDR/FWR start/end positions are meant to reference the positions in the query sequence, which contain neither IMGT-gaps nor indel corrections. So just for CDR1, you might need cdr1_imgt_nt (IMGT-gapped nt), cdr1_nt (query nt), cdr1_align_nt (indel corrected CDR1 without IMGT-gaps), plus all the _aa fields.

But, maybe we can take an approach where people reserve custom fields? And if you want to add something new, but slightly different, then you need to create a new field name? Or we could go the route of setting the rule that all fields that are not defined need an appropriate prefix (custom_ was what we considered before). That has compliance problems though. Not sure.

schristley commented on July 17, 2024

I have a stub in place, though not completely implemented, for adding a namespace for fields, which I think is exactly what you describe: a prefix for the field name. Taking changeo as an example, the namespace would be passed as the first parameter. The AIRR library would check whether the fields are in the spec and, if not, prefix the new fields with changeo_. I haven't thought it completely through yet, but I feel that should only be used if the tool is somehow providing a semantically incompatible version of the data. Meaning, a downstream consumer tool says "I want changeo's specific interpretation", and it shouldn't mean "I want the AIRR field, but it happens to be produced by changeo". We don't want to use that for fields where tools agree on the semantics; otherwise we get changeo_cdr1_nt, vdjserver_cdr1_nt, mixcr_cdr1_nt, and so on.
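
The prefixing idea could look roughly like this. It is a sketch only, not the AIRR library's actual interface: `SPEC_FIELDS` is a made-up subset and `apply_namespace` is a hypothetical function.

```python
# Hypothetical subset of spec-defined field names.
SPEC_FIELDS = {"sequence_id", "sequence", "v_call"}


def apply_namespace(record, namespace, spec_fields=SPEC_FIELDS):
    """Prefix every field not defined in the spec with '<namespace>_'."""
    prefix = namespace + "_"
    out = {}
    for name, value in record.items():
        if name in spec_fields or name.startswith(prefix):
            # Spec fields and already-namespaced fields pass through.
            out[name] = value
        else:
            out[prefix + name] = value
    return out


record = {"sequence_id": "seq1", "cdr1_imgt_nt": "CCC...GGG"}
print(apply_namespace(record, "changeo"))
# {'sequence_id': 'seq1', 'changeo_cdr1_imgt_nt': 'CCC...GGG'}
```

Note that a blanket rule like this prefixes every non-spec field, including ones where tools agree on the semantics, which is exactly the proliferation the comment above warns against. A real mechanism would need a way to mark shared-but-not-yet-standard fields.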

@javh is right that this can get out of hand with the number of fields. We need to bring more tools into the fold to get an idea about the scope.

Should we make this a discussion agenda item for AIRR meeting?

williamdlees commented on July 17, 2024

bcorrie commented on July 17, 2024

I think one of the challenges is what do you mandate and what do you suggest 8-)

There seem to be several levels for any given field:

  1. Mandate the field (by naming it), mandate its type/ontology, and require that it be provided (required field).

  2. Mandate the field (by naming it) and its type/ontology, but do not require the data to be provided.

  3. Mandate the field (by naming it) and leave the type/ontology to the user.

  4. Mandate nothing; flexibility is key.

I am FAR from an expert, but my experience as a computer scientist being in this field for a few years is that we probably want to do as much of #2 and #3 as possible (to enable sharing), do #1 for the really key fields, and ensure that we support #3 to enable tool flexibility...
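
One way to picture the four levels is as per-field rules in a validator. The sketch below is purely illustrative: the rule table, the field names in it (`notes` in particular is invented), and the `validate` function are all assumptions.

```python
# Hypothetical rules, one example per level described above.
FIELD_RULES = {
    "sequence_id": {"required": True, "type": str},   # level 1
    "cdr1_start": {"required": False, "type": int},   # level 2
    "notes": {"required": False, "type": None},       # level 3
    # Level 4: any field not listed here is accepted unchecked.
}


def validate(record, rules=FIELD_RULES):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for name, rule in rules.items():
        value = record.get(name)
        if value is None:
            if rule["required"]:
                problems.append(f"missing required field: {name}")
            continue
        expected = rule["type"]
        if expected is not None and not isinstance(value, expected):
            problems.append(f"{name}: expected {expected.__name__}")
    return problems


print(validate({"sequence_id": "seq1", "cdr1_start": "4", "custom": 1}))
# ['cdr1_start: expected int']
```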

I think if you do too much of #1 you lose flexibility, and flexibility is key for these tools today as far as I can tell... In a way, I think MiAIRR goes fairly far along the path, although there are many AIRR Formats required fields beyond the MiAIRR fields.

We want to do as much of #2 and #3 as possible, otherwise as William suggests, we aren't really helping all that much 8-) At the same time, the formats and tools should be extensible and forgiving of missing data to enable new tools to be developed.

I suspect that isn't a surprise to anyone, but thought I would state it explicitly.

Brian

javh commented on July 17, 2024

Yeah, sounds like we should discuss it in a call. I like the namespace idea, on top of our existing set of mandatory and optional fields.

We don't presently have any fields for IMGT-gapped sequences in the spec. At a minimum, I think adding those as optional fields would be good (sequence_imgt, fwr1_imgt_nt, cdr1_imgt_nt, etc). Those should at least have unambiguous definitions.

javh commented on July 17, 2024

Nucleotide fields were added for CDR and FWR in #89.
