smith-chem-wisc / mzlib
Library for mass spectrometry projects
License: GNU Lesser General Public License v3.0
As calibration is further improved and tested (especially for top-down data), having it as part of mzLib would let those improvements be used in ProteoformSuite.
CHEMICAL FORMULA
Chemical formulas of modifications may be specified following the mandatory key “formula.” Formulas must use Unimod symbols (http://www.unimod.org/masses.html) and follow the Unimod composition rules (http://www.unimod.org/fields.html). The formula is displayed and entered as atoms, each optionally followed by a number in parentheses. The number may be negative, and if there is no number, 1 is assumed. The atom order is not important. C, F, H, etc. are symbols for elements, not one-letter codes for amino acids. Isotopes are specified by an integer preceding the atomic symbol (e.g. 13C).
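As an illustration of those rules, here is a minimal sketch of a parser for such composition strings (a hypothetical helper, not part of mzLib; it only handles the ASCII symbol-and-count form described above):

```python
import re

# Minimal sketch (not mzLib code): parse a Unimod-style composition string
# like "13C(2) C(-2) H(4) O" into a dict mapping symbol -> count.
# A missing count means 1; counts in parentheses may be negative.
TOKEN = re.compile(r'(\d*[A-Z][a-z]?)(?:\((-?\d+)\))?')

def parse_unimod_formula(formula):
    counts = {}
    for part in formula.split():
        match = TOKEN.fullmatch(part)
        if match is None:
            raise ValueError("unrecognized token: " + part)
        symbol, count = match.group(1), match.group(2)
        counts[symbol] = counts.get(symbol, 0) + (int(count) if count else 1)
    return counts
```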
I would like to serialize Proteomics objects (e.g. ModificationWithMass, Protein). Could we tag them with the [Serializable] attribute?
public GoTerm(Aspect aspect, string description, string id)
This would allow us to distinguish between PTMs pulled from different databases.
I'd like to annotate splicing and gene fusion variation within the UniProt format, so that it can be handled in search software.
I'd also like to enable generating splice variant entries from UniProt XMLs.
It looks like the flat file contains a bit more detail on these entries. Let's take ZNF644 for example:
FT VAR_SEQ 1 1222 Missing (in isoform 3).
FT {ECO:0000303|PubMed:14702039,
FT ECO:0000303|PubMed:15489334}.
FT /FTId=VSP_015855.
FT VAR_SEQ 1223 1229 MDLTMHS -> MLIRQNL (in isoform 3).
FT {ECO:0000303|PubMed:14702039,
FT ECO:0000303|PubMed:15489334}.
FT /FTId=VSP_015856.
FT VAR_SEQ 1230 1232 ALD -> GLI (in isoform 2).
FT {ECO:0000303|Ref.1}.
FT /FTId=VSP_012158.
FT VAR_SEQ 1233 1327 Missing (in isoform 2).
FT {ECO:0000303|Ref.1}.
FT /FTId=VSP_012159.
There are two entries describing missing regions, and there are other similar entries for describing sequence changes.
I find this a bit confusing because they're reusing the same format for two different operations.
This confusion becomes a bit more pronounced in the XML format:
<feature type="splice variant" description="In isoform 3." id="VSP_015855" evidence="17 18">
  <location>
    <begin position="1"/>
    <end position="1222"/>
  </location>
</feature>
<feature type="splice variant" description="In isoform 3." id="VSP_015856" evidence="17 18">
  <original>MDLTMHS</original>
  <variation>MLIRQNL</variation>
  <location>
    <begin position="1223"/>
    <end position="1229"/>
  </location>
</feature>
<feature type="splice variant" description="In isoform 2." id="VSP_012158" evidence="19">
  <original>ALD</original>
  <variation>GLI</variation>
  <location>
    <begin position="1230"/>
    <end position="1232"/>
  </location>
</feature>
<feature type="splice variant" description="In isoform 2." id="VSP_012159" evidence="19">
  <location>
    <begin position="1233"/>
    <end position="1327"/>
  </location>
</feature>
The "missing" detail has been omitted from these entries. However, when no original and variation sequences are provided, it is easy to see that they mean the sequence is missing from that isoform.
It is a little strange that the description is the only way to see what isoform the splice variant describes. Hopefully these are forced to be unique and not prone to typos. Also, hopefully, they all start with "In isoform."
Let's check that out:
# in bash: sudo pip install lxml
# in python:
from lxml import etree as et
UNIPROT_NS = "http://uniprot.org/uniprot"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"
NAMESPACE_MAP = {None: UNIPROT_NS, "xsi": XSI_NS}
UP = '{' + UNIPROT_NS + '}'
u = et.parse("/mnt/e/uniprot-proteome%3AUP000005640.xml")
root = u.getroot()
splices = []
for entry in root:
    for element in entry:
        if element.tag == UP + 'feature' and element.get('type') == "splice variant":
            splices.append(element)
len(splices)  # 28663
all(e.get('description').startswith("In isoform") for e in splices)  # True
len([e for e in splices if e.find(UP + "original") is None])  # 15782, number with missing sequences
len([e for e in splices if e.find(UP + "original") is not None])  # 12881, number with sequences
len([e for e in splices if e.find(UP + "original") is not None and e.find(UP + "variation") is not None and len(e.find(UP + "original").text) != len(e.find(UP + "variation").text)])  # 7357, number where original and variation lengths differ
len([e for e in splices if e.find(UP + "original") is not None and e.find(UP + "variation") is not None and len(e.find(UP + "original").text) == len(e.find(UP + "variation").text)])  # 5524, number where original and variation lengths are the same
Yes, each splice variant feature has a description starting with "In isoform."
The sequences aren't all the same length.
About half have missing sequences.
N-term Acetyl and Acetyllysine should fall under the same acetylations category.
In the IMzSpectrum interface, add a deconvolution method that returns a list of masses.
Parameters could specify the maximum possible charge (useful for MS2, where precursor charges are known) and a confidence level (for intact mass analysis we only want confidently identified masses; for MS2 we are fine with many low-confidence IDs that might even correspond to a single isotope peak).
Another parameter could be the deconvolution result of a neighboring spectrum, which would increase confidence in matched masses (this is useful for intact, but useless for MS2).
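A rough sketch of how those parameters might interact (a Python illustration with invented names; mzLib is C# and this is not its API):

```python
# Hypothetical illustration of the proposed parameters (names are invented):
# keep candidate masses above a confidence threshold, and boost confidence
# for masses also found in a neighboring spectrum's deconvolution result.
def deconvolute(candidates, min_confidence=0.0, neighbor_masses=(), mass_tol=0.01):
    results = []
    for mass, confidence in candidates:
        if any(abs(mass - nm) <= mass_tol for nm in neighbor_masses):
            confidence = min(1.0, confidence + 0.2)  # arbitrary boost for illustration
        if confidence >= min_confidence:
            results.append((mass, confidence))
    return results
```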
Instead, store the possible neutral loss array in a modification.
This way:
I found this is caused by a file with something like location="file://.", which cannot be parsed.
I changed it to location="file://E:/" and then it worked.
Posting in case anyone hits the same problem in the future.
The same masses should have the same intensities and the same elution times. Masses that do (within some tolerance) are considered real; the others are considered wrong. Use machine learning to learn to separate real from wrong.
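The matching rule described above might be sketched like this (hypothetical thresholds and helper, purely illustrative; a learned classifier would replace the hard cutoffs):

```python
# Illustrative sketch: a mass is labeled "real" when another run contains a
# match within a ppm tolerance whose intensity and elution time also agree.
def label_real(mass, intensity, rt, other_run, ppm_tol=10.0, max_int_ratio=2.0, rt_tol=1.0):
    for m2, i2, rt2 in other_run:
        ppm = abs(mass - m2) / mass * 1e6
        if (ppm <= ppm_tol
                and max(intensity, i2) / min(intensity, i2) <= max_int_ratio
                and abs(rt - rt2) <= rt_tol):
            return True
    return False
```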
ProteoformSuite uses GO terms, and I expect we will use them with the quantitation side of MetaMorpheus.
Could we make a couple new classes in the Proteomics namespace: GoTerm and ProteinWithGoTerms?
public class GoTerm
{
    public string id { get; set; }
    public string description { get; set; }
    public Aspect aspect { get; set; }
}

public enum Aspect
{
    molecularFunction,
    cellularComponent,
    biologicalProcess
}

public class ProteinWithGoTerms : Protein
{
    public List<GoTerm> goTerms { get; set; }
    //Constructor with the goTerms list would go here
}
As for the ProteinDbReader, I found the nested switch-cases with XmlReader confusing, so I changed it to processing XElement objects with LINQ a while back. Here is a snippet of that code for getting the GO terms from protein entries. I'm not exactly sure where we would pull out the GO information, but it would probably involve working the "dbReference" into the nested switches. @stefanks, would you be interested in switching over to XElement for somewhat easier tracing? (Here it is in the ProteoformSuite code.)
List<XElement> dbReferences = entry.Elements().Where(node => node.Name.LocalName == "dbReference").ToList();

//Process dbReferences to retrieve Gene Ontology terms
foreach (XElement dbReference in dbReferences)
{
    string dbReference_type = GetAttribute(dbReference, "type");
    if (dbReference_type == "GO")
    {
        GoTerm go = new GoTerm();
        string ID = GetAttribute(dbReference, "id");
        go.id = ID.Split(':')[1];
        IEnumerable<XElement> dbProperties = from el in dbReference.Elements() where el.Name.LocalName == "property" select el;
        foreach (XElement property in dbProperties)
        {
            string type = GetAttribute(property, "type");
            if (type == "term")
            {
                string description = GetAttribute(property, "value");
                switch (description.Split(':')[0])
                {
                    case "C":
                        go.aspect = Aspect.cellularComponent;
                        go.description = description.Split(':')[1];
                        break;
                    case "F":
                        go.aspect = Aspect.molecularFunction;
                        go.description = description.Split(':')[1];
                        break;
                    case "P":
                        go.aspect = Aspect.biologicalProcess;
                        go.description = description.Split(':')[1];
                        break;
                }
                goTerms.Add(go);
            }
        }
    }
}
I wasn't able to run a test project in a new solution (running .NET Framework 4.7.1). It was complaining that it wasn't able to load ManagedThermoHelperLayer.dll.
I'm pretty sure UnmanagedThermoHelperLayer.dll, one of ManagedThermoHelperLayer.dll's dependencies, isn't getting carried along.
It turns out that if you clean a solution right now (with mzLib v1.0.305), it doesn't delete UnmanagedThermoHelperLayer.dll, so it sneakily doesn't break. But if you delete that file manually, it isn't recreated.
I reverted back to an arbitrary earlier version (mzLib v1.0.285, I think), and UnmanagedThermoHelperLayer.dll was present in that release.
Another weird thing: in the TestThermo project, clean is not deleting UnmanagedThermoHelperLayer.dll, but when I delete it manually and build, it is recreated.
It looks like the nuspec file has some weird targets at the bottom, with UnmanagedThermoHelperLayer listed under a different target than the rest of the files. I don't know what the heck mzLib.targets is, either. I think this is what needs to be changed.
FILE: 12-10-16_A17A_yeast_BU_fract5_rep1-Calibrated.mzML
VALIDATION MESSAGE:
(1) Non-fatal XML Parsing error detected on line 13
(2) Non-fatal XML Parsing error detected on line 48
(3) Error message: cvc-datatype-valid.1.2.1: '12-10-16_A17A_yeast_BU_fract5_rep1-Calibrated' is not a valid value for 'NCName'.
(4) Error message: cvc-attribute.3: The value '12-10-16_A17A_yeast_BU_fract5_rep1.raw' of attribute 'id' on element 'sourceFile' is not valid with respect to its type, 'ID'.
(5) Error message: cvc-datatype-valid.1.2.1: '12-10-16_A17A_yeast_BU_fract5_rep1.raw' is not a valid value for 'NCName'.
(6) Error message: cvc-attribute.3: The value '12-10-16_A17A_yeast_BU_fract5_rep1-Calibrated' of attribute 'id' on element 'run' is not valid with respect to its type, 'ID'.
FILE: 12-10-16_A17A_yeast_BU_fract5_rep1-Calibrated_1mm.mzid
VALIDATION MESSAGE:
(1) Fatal Error message: Content is not allowed in prolog.
(2) FATAL XML Parsing error detected on line 1
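The NCName errors arise because xsd:ID values must be NCNames, and an NCName cannot begin with a digit, so filenames like these fail validation when used directly as ids. A minimal sanitizer sketch (hypothetical helper; an ASCII-only approximation of the NCName rules):

```python
import re

# Approximate ASCII NCName check: must start with a letter or underscore;
# no colons. Real NCName rules allow more Unicode; this is a sketch only.
NCNAME = re.compile(r'^[A-Za-z_][A-Za-z0-9_.\-]*$')

def sanitize_xml_id(value, prefix="x"):
    # Prefix non-conforming ids so validators accept them.
    return value if NCNAME.match(value) else prefix + value
```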
This way it's someone else's headache to figure out all the file inputs, and we'll have the ability to read multiple vendor formats.
Thermo has a new system for reading .raw files. I'd like it integrated into mzLib so we don't have to depend on people having MSFileReader v3.0 SP2 installed. It's a constant thorn in our side and if we can ensure users have what they need it will make all of our lives easier across Proteoform Suite, MetaMorpheus, FlashLFQ, and new programs.
MetaMorpheus crashes with funky nonspecific errors if ThermoMSFileReader isn't installed. Could we make a method to check for ThermoMSFileReader and return a more detailed exception and exception message if it isn't installed?
Some output I would like to be able to access in PS:
Some options I would like for input parameters:
for use in those visualization and downstream analysis software tools that require a FASTA file as input.
It would be helpful (faster) if this method:
int GetClosestOneBasedSpectrumNumber(double retentionTime)
were implemented with binary search instead of iterating through the scans.
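A sketch of the binary-search version (Python illustration; mzLib is C#), assuming the scans' retention times are sorted in ascending order and scan numbers are one-based:

```python
from bisect import bisect_left

# Sketch: return the one-based index of the retention time closest to rt.
def closest_one_based_scan(retention_times, rt):
    i = bisect_left(retention_times, rt)
    if i == 0:
        return 1
    if i == len(retention_times):
        return len(retention_times)
    before, after = retention_times[i - 1], retention_times[i]
    return i if rt - before <= after - rt else i + 1
```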
Each object should have some sort of metadata that explains the fields in each object. One example: "accession" in ModificationWithLocation is a Tuple<string, string>, but it's not clear unless looking at the mzLib code what each string represents.
In thermo raw files that don't have scan event = 1 for precursor scans, reading fails. See http://stackoverflow.com/questions/8971486/com-interop-how-to-use-icustommarshaler-to-call-3rd-party-component
The ProteinDatabaseLoader example is out of date, and perhaps some other items are as well.
When there are two peaks within the deconvolution tolerance, the single-spectrum deconvolution picks the one nearest to the theoretical value, instead of the one that best matches the expected intensity.
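The intended selection rule could be sketched as follows (hypothetical helper, not mzLib code):

```python
# Sketch: among candidate peaks already inside the m/z tolerance, prefer the
# one whose intensity best matches the expected isotopic intensity, rather
# than the one closest in m/z. Candidates are (mz_error, intensity) pairs.
def best_candidate(candidates, expected_intensity):
    return min(candidates, key=lambda c: abs(c[1] - expected_intensity))
```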
Need to:
Currently, we are using NetSerializer for serialization because it is fast and works with Dictionary objects.
However, it does not implement ISerializable, so we cannot write custom serializers to avoid circular references like Element <-> Isotope. I drafted a custom serialization, which you can find here.
We've had trouble with localization, naming collisions, and keeping track of which modifications are in use. We also don't have robust modification reading; it is currently prone to crashing from minor formatting problems. We also currently have no way to add "comments" to mod text files. I'd like to change all this, so I'm creating one modification class to rule them all. Here is the first draft. I'd like to hear your comments/edits/suggestions. All complaints can be sent to /dev/null.
using System;
using System.Collections.Generic;
using System.Text;
using Chemistry;
using MassSpectrometry;

namespace Proteomics
{
    public class ModificationGeneral
    {
        public string Id { get; private set; }
        public string Accession { get; private set; }
        public string ModificationType { get; private set; }
        public string FeatureType { get; private set; }
        public List<string> Target { get; private set; }
        public List<string> Positions { get; private set; }
        public ChemicalFormula ChemicalFormula { get; private set; }
        public double? MonoisotopicMass { get; private set; }
        public Dictionary<string, string> DatabaseReference { get; private set; }
        public Dictionary<string, string> TaxonomicRange { get; private set; }
        public List<string> Keywords { get; private set; }
        public Dictionary<DissociationType, List<double>> NeutralLosses { get; private set; }
        public Dictionary<DissociationType, List<double>> DiagnosticIons { get; private set; }
        public string FileOrigin { get; private set; }

        public ModificationGeneral()
        {
            this.Id = null;
            this.Accession = null;
            this.ModificationType = null;
            this.FeatureType = null;
            this.Target = new List<string>();
            this.Positions = new List<string>();
            this.ChemicalFormula = new ChemicalFormula();
            this.MonoisotopicMass = null;
            this.DatabaseReference = new Dictionary<string, string>();
            this.TaxonomicRange = new Dictionary<string, string>();
            this.Keywords = new List<string>();
            this.NeutralLosses = new Dictionary<DissociationType, List<double>>();
            this.DiagnosticIons = new Dictionary<DissociationType, List<double>>();
            this.FileOrigin = null;
        }
    }
}
By removing the switch statement
The hash codes for two ModificationWithMass objects with different IDs, different accessions, and different masses are the same.
e.g.
new ModificationWithMass("mod", new Tuple<string, string>("acc", "acc"), motif, TerminusLocalization.Any, 1, null, null, null, "type")
and
new ModificationWithMass("mod2", new Tuple<string, string>("acc2", "acc2"), motif, TerminusLocalization.Any, 10, null, null, null, "type")
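The direction of the fix, sketched in Python for illustration (mzLib is C#): the hash should combine the distinguishing fields, so that objects differing in id, accession, or mass hash differently in general.

```python
# Illustrative value-based hash combining the distinguishing fields.
def mod_hash(mod_id, accession, monoisotopic_mass):
    return hash((mod_id, accession, monoisotopic_mass))
```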
http://www.ebi.ac.uk/pride/help/archive/submission
PRIDE is the vehicle for getting new data into UniProt. PRIDE requires mzIdentML as one of the inputs. I believe PRIDE has a tool to validate the mzIdentML format; see the link above.