
mzlib's People

Contributors

acesnik, alexander-sol, avcarr2, bintiibrahim, dippman, elaboy, ianhirsch, kyp4, lonelu, nbollis, rmillikin, rmmiller22, stefanks, trishorts, weaversd, xrsheeran, yulingdai, zdanaceau, zrolfs


mzlib's Issues

Move calibration into mzLib

As calibration is further improved and tested (especially for top-down data), if it were part of mzLib, then these improvements could also be used in ProteoformSuite.

chemical formula in unimod standard

CHEMICAL FORMULA
Chemical formulas of modifications may be specified following the mandatory key “formula.” Formulas must use Unimod symbols (http://www.unimod.org/masses.html) and follow the Unimod composition rules (http://www.unimod.org/fields.html). The formula is displayed and entered as atoms, optionally followed by a number in parentheses. The number may be negative and, if there is no number, 1 is assumed. The atom order is not important. C, F, H, etc. are symbols for elements, not one letter codes for amino acids. Isotopes of atoms are specified by the integer preceding the atomic symbol (e.g. 13C).
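The composition rules above are simple enough to sketch as a parser. This is an illustrative, simplified Python sketch (not mzLib code) handling ASCII element symbols, optional isotope prefixes, and optional signed counts as described above:

```python
import re

# Token: optional isotope digits, an element symbol, and an optional
# signed count in parentheses; a missing count means 1 (e.g. "H(2) C(2) O",
# "13C(6) 15N(2)", "H(-1)").
TOKEN = re.compile(r"(?P<isotope>\d+)?(?P<symbol>[A-Z][a-z]?)(?:\((?P<count>-?\d+)\))?")

def parse_unimod_formula(formula):
    """Map (isotope, symbol) -> count for a space-separated Unimod composition."""
    counts = {}
    for part in formula.split():
        m = TOKEN.fullmatch(part)
        if m is None:
            raise ValueError("invalid Unimod composition token: " + part)
        key = (m.group("isotope"), m.group("symbol"))
        counts[key] = counts.get(key, 0) + int(m.group("count") or 1)
    return counts

parse_unimod_formula("13C(6) 15N(2)")  # {('13', 'C'): 6, ('15', 'N'): 2}
```

A full implementation would also validate symbols against the Unimod element table, since Unimod defines symbols beyond plain elements.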

Serializability

I would like to serialize Proteomics objects (e.g. ModificationWithMass, Protein). Could we tag them with the [Serializable] attribute?

Working with splice variant entries in UniProt XML

Figuring out splice variant entries in UniProt XML format

Goal

  • I'd like to annotate splicing and gene fusion variation within the UniProt format, so that it can be handled in search software.

  • I'd also like to enable generating splice variant entries from UniProt XMLs.

Splice variant entries

It looks like the flat file contains a bit more detail on these entries. Let's take ZNF644 for example:

FT   VAR_SEQ       1   1222       Missing (in isoform 3).
FT                                {ECO:0000303|PubMed:14702039,
FT                                ECO:0000303|PubMed:15489334}.
FT                                /FTId=VSP_015855.
FT   VAR_SEQ    1223   1229       MDLTMHS -> MLIRQNL (in isoform 3).
FT                                {ECO:0000303|PubMed:14702039,
FT                                ECO:0000303|PubMed:15489334}.
FT                                /FTId=VSP_015856.
FT   VAR_SEQ    1230   1232       ALD -> GLI (in isoform 2).
FT                                {ECO:0000303|Ref.1}.
FT                                /FTId=VSP_012158.
FT   VAR_SEQ    1233   1327       Missing (in isoform 2).
FT                                {ECO:0000303|Ref.1}.
FT                                /FTId=VSP_012159.

There are two entries describing missing regions, and there are other similar entries for describing sequence changes.
I find this a bit confusing because they're reusing the same format for two different operations.
This confusion becomes a bit more pronounced in the XML format:

<feature type="splice variant" description="In isoform 3." id="VSP_015855" evidence="17 18">
<location>
<begin position="1"/>
<end position="1222"/>
</location>
</feature>
<feature type="splice variant" description="In isoform 3." id="VSP_015856" evidence="17 18">
<original>MDLTMHS</original>
<variation>MLIRQNL</variation>
<location>
<begin position="1223"/>
<end position="1229"/>
</location>
</feature>
<feature type="splice variant" description="In isoform 2." id="VSP_012158" evidence="19">
<original>ALD</original>
<variation>GLI</variation>
<location>
<begin position="1230"/>
<end position="1232"/>
</location>
</feature>
<feature type="splice variant" description="In isoform 2." id="VSP_012159" evidence="19">
<location>
<begin position="1233"/>
<end position="1327"/>
</location>
</feature>

The "missing" detail has been omitted from these entries. However, it is easy to see that when no original and variation sequences are provided, the sequence is meant to be missing from that isoform.
It is a little strange that the description is the only way to see which isoform a splice variant describes. Hopefully these are forced to be unique and not prone to typos. Also, hopefully, they all start with "In isoform."

Let's check that out:

# in bash: sudo pip install lxml

# in python:

from lxml import etree as et
UNIPROT_NS = "http://uniprot.org/uniprot"
UP = "{" + UNIPROT_NS + "}"
u = et.parse("/mnt/e/uniprot-proteome%3AUP000005640.xml")
root = u.getroot()
splices = []
for entry in root:
    for element in entry:
        if element.tag == UP + "feature" and element.get("type") == "splice variant":
            splices.append(element)
len(splices)  # 28663
all(e.get("description").startswith("In isoform") for e in splices)  # True
len([e for e in splices if all(ee.tag != UP + "original" for ee in e)])  # 15782, number with missing sequences
len([e for e in splices if any(ee.tag == UP + "original" for ee in e)])  # 12881, number with sequences
len([e for e in splices if e.find(UP + "original") is not None and e.find(UP + "variation") is not None and len(e.find(UP + "original").text) != len(e.find(UP + "variation").text)])  # 7357, number where the original and variation lengths differ
len([e for e in splices if e.find(UP + "original") is not None and e.find(UP + "variation") is not None and len(e.find(UP + "original").text) == len(e.find(UP + "variation").text)])  # 5524, number where the original and variation lengths are the same

Yes, each splice variant feature has a description starting with "In isoform."
The sequences aren't all the same length.
About half have missing sequences.

Deconvolution

In the IMzSpectrum interface, add a deconvolution method that returns a list of masses.
Parameters could specify the maximum possible charge (useful for MS2, where precursor charges are known) and a confidence level (for intact-mass analysis we only want confidently identified masses; for MS2, plenty of low-confidence IDs that might even correspond to a single isotope peak are acceptable).
Another parameter could be the deconvolution result of a neighboring spectrum, which would increase confidence in matched masses (this is useful for intact-mass analysis, but useless for MS2).
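To make the proposal concrete, here is a hedged Python sketch (not the mzLib implementation) of the basic idea: convert each m/z peak to candidate neutral masses over the allowed charges, then keep only masses corroborated by multiple charge states:

```python
PROTON = 1.007276466879  # proton mass in Da

def candidate_masses(mz_peaks, max_charge, min_charge_states=2, tol_da=0.05):
    """Naive deconvolution: keep neutral masses supported by >= min_charge_states charges."""
    candidates = []
    for mz in mz_peaks:
        for z in range(1, max_charge + 1):
            candidates.append((mz * z - z * PROTON, z))  # neutral-mass hypothesis
    if not candidates:
        return []
    candidates.sort()
    results, group = [], [candidates[0]]
    for cand in candidates[1:]:
        if cand[0] - group[-1][0] <= tol_da:
            group.append(cand)  # same mass within tolerance
        else:
            if len({z for _, z in group}) >= min_charge_states:
                results.append(sum(m for m, _ in group) / len(group))
            group = [cand]
    if len({z for _, z in group}) >= min_charge_states:
        results.append(sum(m for m, _ in group) / len(group))
    return results
```

The confidence-level and neighboring-spectrum parameters would layer on top of this: lower the required number of corroborating charge states for MS2, and boost masses that also appear in the neighbor's result.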

Do not create separate modifications for each neutral loss

Instead, store the possible neutral loss array in a modification.
This way:

  • When writing the G-PTM-D database, we will not write something like "Phospho of S NL:0", but instead "Phospho of S". This makes much more sense, since the neutral loss is not an inherent property of the modified protein, but rather of the fragmentation method.
  • I will handle the individual neutral losses in MetaMorpheus, and allow having "Phospho of S NL:0" or "Phospho of S NL:79.996" in the psms tsv file. That is informative, because PSMs correspond to a specific choice of a neutral loss.

Do two tech-reps to test deconvolution

The same masses should have the same intensities and the same elution times. Masses that do (within some tolerance) are considered real; the others are considered wrong. Use machine learning to learn to separate the real from the wrong.
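The replicate comparison could look something like this sketch (thresholds are illustrative, not proposed defaults); each feature is a (mass, intensity, retention time) tuple:

```python
def matches(a, b, mass_tol=0.01, intensity_ratio=2.0, rt_tol=1.0):
    """True if two features agree on mass, intensity, and elution time within tolerances."""
    mass_a, inten_a, rt_a = a
    mass_b, inten_b, rt_b = b
    return (abs(mass_a - mass_b) <= mass_tol
            and max(inten_a, inten_b) / min(inten_a, inten_b) <= intensity_ratio
            and abs(rt_a - rt_b) <= rt_tol)

def label_real(rep1, rep2, **tols):
    """Label each feature in rep1 as real if any feature in rep2 matches it."""
    return [any(matches(a, b, **tols) for b in rep2) for a in rep1]
```

These labels could then serve as training data for the proposed classifier.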

More detailed error for missing elements

Loading a ModificationWithMassAndCf before loading elements produces an error that can be very difficult to diagnose. Let's make it more detailed when we have the chance.


GO Terms in ProteinDbLoader

ProteoformSuite uses GO terms, and I expect we will use them with the quantitation side of MetaMorpheus.

Could we make a couple new classes in the Proteomics namespace: GoTerm and ProteinWithGoTerms?

public class GoTerm
{
    public string id { get; set; }
    public string description { get; set; }
    public Aspect aspect { get; set; }
}

public enum Aspect
{
    molecularFunction,
    cellularComponent,
    biologicalProcess
}

public class ProteinWithGoTerms : Protein
{
    public List<GoTerm> goTerms { get; set; }

    // Constructor that accepts the goTerms list omitted here
}
As for the ProteinDbReader, I found the nested switch-cases with XmlReader confusing, so I changed it to processing XElement objects with LINQ a while back. Here is a snippet of that code for getting the GO terms from protein entries. I'm not exactly sure where we would pull out the GO information, but it would probably involve working "dbReference" into the nested switches. @stefanks would you be interested in switching over to XElement for somewhat easier tracing? (Here it is in the ProteoformSuite code.)

List<XElement> dbReferences = entry.Elements().Where(node => node.Name.LocalName == "dbReference").ToList();

// Process dbReferences to retrieve Gene Ontology terms
foreach (XElement dbReference in dbReferences)
{
    string dbReferenceType = GetAttribute(dbReference, "type");
    if (dbReferenceType == "GO")
    {
        GoTerm go = new GoTerm();
        string id = GetAttribute(dbReference, "id");
        go.id = id.Split(':')[1];

        IEnumerable<XElement> dbProperties = dbReference.Elements().Where(el => el.Name.LocalName == "property");
        foreach (XElement property in dbProperties)
        {
            string type = GetAttribute(property, "type");
            if (type == "term")
            {
                string description = GetAttribute(property, "value");
                switch (description.Split(':')[0])
                {
                    case "C":
                        go.aspect = Aspect.cellularComponent;
                        go.description = description.Split(':')[1];
                        break;
                    case "F":
                        go.aspect = Aspect.molecularFunction;
                        go.description = description.Split(':')[1];
                        break;
                    case "P":
                        go.aspect = Aspect.biologicalProcess;
                        go.description = description.Split(':')[1];
                        break;
                }
                goTerms.Add(go);
            }
        }
    }
}

UnmanagedThermoHelperLayer is no longer a part of nuget package

Problem

I wasn't able to run a test project in a new solution (running .NET Framework 4.7.1). It was complaining that it wasn't able to load ManagedThermoHelperLayer.dll.

I'm pretty sure UnmanagedThermoHelperLayer.dll isn't getting carried along. It's one of ManagedThermoHelperLayer.dll's dependencies.

Reproducing this problem

It turns out if you clean a solution right now (with mzLib v1.0.305), it doesn't delete UnmanagedThermoHelperLayer.dll, so it sneakily doesn't break. But if you delete that file manually, it isn't recreated.

I reverted back to some arbitrary earlier version (mzLib v1.0.285, I think), and UnmanagedThermoHelperLayer.dll was present in that release.

Another weird thing is that in the TestThermo project, clean is not deleting UnmanagedThermoHelperLayer.dll, but when I delete it manually and build, it's recreated.

Possible solution

It looks like the nuspec file has some weird targets at the bottom, with UnmanagedThermoHelperLayer listed under a different target than the rest of the files. I don't know what the heck mzLib.targets is, either. I think this is what needs to be changed.

Proteome Exchange Submission fail

FILE: 12-10-16_A17A_yeast_BU_fract5_rep1-Calibrated.mzML
VALIDATION MESSAGE: (1) Non-fatal XML Parsing error detected on line 13, (2)
Non-fatal XML Parsing error detected on line 48, (3) Error message:
cvc-datatype-valid.1.2.1: '12-10-16_A17A_yeast_BU_fract5_rep1-Calibrated' is
not a valid value for 'NCName'., (4) Error message: cvc-attribute.3: The value
'12-10-16_A17A_yeast_BU_fract5_rep1.raw' of attribute 'id' on element
'sourceFile' is not valid with respect to its type, 'ID'., (5) Error message:
cvc-datatype-valid.1.2.1: '12-10-16_A17A_yeast_BU_fract5_rep1.raw' is not a
valid value for 'NCName'., (6) Error message: cvc-attribute.3: The value
'12-10-16_A17A_yeast_BU_fract5_rep1-Calibrated' of attribute 'id' on element
'run' is not valid with respect to its type, 'ID'.

FILE: 12-10-16_A17A_yeast_BU_fract5_rep1-Calibrated_1mm.mzid
VALIDATION MESSAGE: (1) Fatal Error message: Content is not allowed in prolog.,
(2) FATAL XML Parsing error detected on line 1
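The mzML errors trace back to XML's ID rules: an attribute of type ID must be an NCName, and an NCName may not begin with a digit, so ids derived from file names like "12-10-16_..." are rejected. A simplified (ASCII-only) sketch of the check:

```python
import re

# Simplified NCName check: must start with a letter or underscore, and
# colons are not allowed anywhere. (The real production also permits a
# range of non-ASCII name characters.)
NCNAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*$")

def is_ncname(value):
    return NCNAME.match(value) is not None

is_ncname("12-10-16_A17A_yeast_BU_fract5_rep1-Calibrated")   # False: starts with a digit
is_ncname("x12-10-16_A17A_yeast_BU_fract5_rep1-Calibrated")  # True
```

Prefixing generated ids with a letter (or otherwise sanitizing them) when writing mzML would avoid the validator complaints. The mzid failure ("Content is not allowed in prolog") is separate and usually means stray bytes before the XML declaration.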

Thermo's new "RawFileReader"

Thermo has a new system for reading .raw files. I'd like it integrated into mzLib so we don't have to depend on people having MSFileReader v3.0 SP2 installed. That dependency is a constant thorn in our side, and if we can ensure users have what they need, it will make all of our lives easier across Proteoform Suite, MetaMorpheus, FlashLFQ, and new programs.

Method to check for ThermoMSFileReader dependency

MetaMorpheus crashes with funky nonspecific errors if ThermoMSFileReader isn't installed. Could we make a method to check for ThermoMSFileReader and return a more detailed exception and exception message if it isn't installed?
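The shape of such a check is simple: probe for the dependency up front and translate a load failure into an actionable message. A hedged Python illustration of the idea (the real check would live in mzLib's C# and target the MSFileReader libraries):

```python
import ctypes

def require_library(dll_name, install_hint):
    """Raise an informative error if a native dependency cannot be loaded."""
    try:
        ctypes.CDLL(dll_name)
    except OSError as err:
        # Translate the cryptic loader error into an actionable message.
        raise RuntimeError(dll_name + " could not be loaded. " + install_hint) from err
```

Run at startup, this turns a funky nonspecific crash into a clear "install X" message.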

deconvolution wishlist

Some output I would like to be able to access in PS:

  • specific charge state info (intensity, number of charges, monoisotopic m/z)
  • apex RT (at what retention time was proteoform most abundant)

Some options I would like for input parameters:

  • (What is intensity ratio setting?)
  • setting for min S/N ratio of peaks to consider for deconvolution
  • setting for minimum charge state to consider for deconvolution (set to 5 for deconvoluting proteins with Thermo Deconvolution)
  • setting for RT window size over which to aggregate features

Create FASTA from xml

For use in those visualization and downstream-analysis software tools that require a FASTA file as input.
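A FASTA writer is short; this is a hypothetical sketch (not the mzLib API), assuming each protein exposes an accession and a base sequence:

```python
def write_fasta(proteins, path, line_width=60):
    """Write (accession, sequence) pairs as FASTA with wrapped sequence lines."""
    with open(path, "w") as f:
        for accession, sequence in proteins:
            f.write(">" + accession + "\n")
            for i in range(0, len(sequence), line_width):
                f.write(sequence[i:i + line_width] + "\n")
```

The real version would pull entries from ProteinDbLoader and format the header line in the UniProt style (db|accession|name).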

Field comments in each object

Each object should have some sort of metadata that explains its fields. One example: "accession" in ModificationWithLocation is a Tuple<string, string>, but it's not clear what each string represents without looking at the mzLib code.

Updating readme

The ProteinDatabaseLoader example is out of date, and perhaps some other items are as well.

New Modification Type to replace multiple different modification types

We've had trouble with localization, naming collisions, and keeping track of which modifications are in use. We also don't have robust modification reading: it is currently subject to crashes from minor formatting problems, and there is currently no way to add comments to mod text files. I'd like to change all of this, so I'm creating one modification class to rule them all. Here is the first draft; I'd like to hear your comments/edits/suggestions. All complaints can be sent to /dev/null.

using System;
using System.Collections.Generic;
using System.Text;

using Chemistry;
using MassSpectrometry;

namespace Proteomics
{
    public class ModificationGeneral
    {
        public string Id { get; private set; }
        public string Accession { get; private set; }
        public string ModificationType { get; private set; }
        public string FeatureType { get; private set; }
        public List<string> Target { get; private set; }
        public List<string> Positions { get; private set; }
        public ChemicalFormula ChemicalFormula { get; private set; }
        public double? MonoisotopicMass { get; private set; }
        public Dictionary<string, string> DatabaseReference { get; private set; }
        public Dictionary<string, string> TaxonomicRange { get; private set; }
        public List<string> Keywords { get; private set; }
        public Dictionary<DissociationType, List<double>> NeutralLosses { get; private set; }
        public Dictionary<DissociationType, List<double>> DiagnosticIons { get; private set; }
        public string FileOrigin { get; private set; }

        public ModificationGeneral()
        {
            this.Id = null;
            this.Accession = null;
            this.ModificationType = null;
            this.FeatureType = null;
            this.Target = new List<string>();
            this.Positions = new List<string>();
            this.ChemicalFormula = new ChemicalFormula();
            this.MonoisotopicMass = null;
            this.DatabaseReference = new Dictionary<string, string>();
            this.TaxonomicRange = new Dictionary<string, string>();
            this.Keywords = new List<string>();
            this.NeutralLosses = new Dictionary<DissociationType, List<double>>();
            this.DiagnosticIons = new Dictionary<DissociationType, List<double>>();
            this.FileOrigin = null;
        }
    }
}

ModificationWithMass hashing seems odd

The hash codes for two ModificationWithMass objects with different IDs, different accessions, and different masses are the same.

e.g.
new ModificationWithMass("mod", new Tuple<string, string>("acc", "acc"), motif, TerminusLocalization.Any, 1, null, null, null, "type")
and
new ModificationWithMass("mod2", new Tuple<string, string>("acc2", "acc2"), motif, TerminusLocalization.Any, 10, null, null, null, "type")
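The fix is to fold the distinguishing fields into equality and hashing. A Python analogue of the intended behavior (the mzLib classes are C#, where the same idea applies to Equals and GetHashCode):

```python
class ModificationWithMass:
    """Minimal sketch: hash and equality derived from id, accession, and mass."""

    def __init__(self, mod_id, accession, monoisotopic_mass):
        self.id = mod_id
        self.accession = accession
        self.monoisotopic_mass = monoisotopic_mass

    def __eq__(self, other):
        return (self.id, self.accession, self.monoisotopic_mass) == \
               (other.id, other.accession, other.monoisotopic_mass)

    def __hash__(self):
        # Distinct (id, accession, mass) tuples now produce distinct hashes
        # (up to ordinary hash collisions).
        return hash((self.id, self.accession, self.monoisotopic_mass))
```

With this, the two example modifications above no longer share a hash code except by ordinary hash collision.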
