smith-chem-wisc / mzlib
Library for mass spectrometry projects
License: GNU Lesser General Public License v3.0
As calibration is further improved and tested (especially for top-down data), having it as part of mzLib would let those improvements be used in ProteoformSuite.
CHEMICAL FORMULA
Chemical formulas of modifications may be specified following the mandatory key “formula.” Formulas must use Unimod symbols (http://www.unimod.org/masses.html) and follow the Unimod composition rules (http://www.unimod.org/fields.html). The formula is displayed and entered as atoms, each optionally followed by a number in parentheses. The number may be negative, and if there is no number, 1 is assumed. The atom order is not important. C, F, H, etc. are symbols for elements, not one-letter codes for amino acids. Isotopes are specified by an integer preceding the atomic symbol (e.g. 13C).
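As an illustration of those rules, here is a minimal sketch of a parser for such composition strings (a hypothetical helper, not part of mzLib; it only handles the ASCII symbol-and-count form described above):

```python
import re

# Minimal sketch (not mzLib code): parse a Unimod-style composition string
# like "13C(2) C(-2) H(4) O" into a dict mapping symbol -> count.
# A missing count means 1; counts in parentheses may be negative.
TOKEN = re.compile(r'(\d*[A-Z][a-z]?)(?:\((-?\d+)\))?')

def parse_unimod_formula(formula):
    counts = {}
    for part in formula.split():
        match = TOKEN.fullmatch(part)
        if match is None:
            raise ValueError("unrecognized token: " + part)
        symbol, count = match.group(1), match.group(2)
        counts[symbol] = counts.get(symbol, 0) + (int(count) if count else 1)
    return counts
```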
I would like to serialize Proteomics objects (e.g. ModificationWithMass, Protein). Could we tag them with the [Serializable] attribute?
public GoTerm(Aspect aspect, string description, string id)
This would allow us to distinguish between PTMs pulled from different databases.
I'd like to annotate splicing and gene fusion variation within the UniProt format, so that it can be handled in search software.
I'd also like to enable generating splice variant entries from UniProt XMLs.
It looks like the flat file contains a bit more detail on these entries. Let's take ZNF644 for example:
FT VAR_SEQ 1 1222 Missing (in isoform 3).
FT {ECO:0000303|PubMed:14702039,
FT ECO:0000303|PubMed:15489334}.
FT /FTId=VSP_015855.
FT VAR_SEQ 1223 1229 MDLTMHS -> MLIRQNL (in isoform 3).
FT {ECO:0000303|PubMed:14702039,
FT ECO:0000303|PubMed:15489334}.
FT /FTId=VSP_015856.
FT VAR_SEQ 1230 1232 ALD -> GLI (in isoform 2).
FT {ECO:0000303|Ref.1}.
FT /FTId=VSP_012158.
FT VAR_SEQ 1233 1327 Missing (in isoform 2).
FT {ECO:0000303|Ref.1}.
FT /FTId=VSP_012159.
There are two entries describing missing regions, and there are other similar entries for describing sequence changes.
I find this a bit confusing because they're reusing the same format for two different operations.
This confusion becomes a bit more pronounced in the XML format:
<feature type="splice variant" description="In isoform 3." id="VSP_015855" evidence="17 18">
  <location>
    <begin position="1"/>
    <end position="1222"/>
  </location>
</feature>
<feature type="splice variant" description="In isoform 3." id="VSP_015856" evidence="17 18">
  <original>MDLTMHS</original>
  <variation>MLIRQNL</variation>
  <location>
    <begin position="1223"/>
    <end position="1229"/>
  </location>
</feature>
<feature type="splice variant" description="In isoform 2." id="VSP_012158" evidence="19">
  <original>ALD</original>
  <variation>GLI</variation>
  <location>
    <begin position="1230"/>
    <end position="1232"/>
  </location>
</feature>
<feature type="splice variant" description="In isoform 2." id="VSP_012159" evidence="19">
  <location>
    <begin position="1233"/>
    <end position="1327"/>
  </location>
</feature>
The "missing" detail has been omitted from these entries. However, when no original and variation sequences are provided, it is easy to see that they mean the sequence is missing from that isoform.
It is a little strange that the description is the only way to see what isoform the splice variant describes. Hopefully these are forced to be unique and not prone to typos. Also, hopefully, they all start with "In isoform."
Let's check that out:
# in bash: sudo pip install lxml
# in python:
from lxml import etree as et
UNIPROT_NS = "http://uniprot.org/uniprot"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"
NAMESPACE_MAP = {None: UNIPROT_NS, "xsi": XSI_NS}
UP = '{' + UNIPROT_NS + '}'
u = et.parse("/mnt/e/uniprot-proteome%3AUP000005640.xml")
root = u.getroot()
splices = []
for entry in root:
    for element in entry:
        if element.tag == UP + 'feature' and element.get('type') == "splice variant":
            splices.append(element)
len(splices)  # 28663
all(e.get('description').startswith("In isoform") for e in splices)  # True
len([e for e in splices if e.find(UP + "original") is None])  # 15782, number with missing sequences
len([e for e in splices if e.find(UP + "original") is not None])  # 12881, number with sequences
len([e for e in splices if e.find(UP + "original") is not None and e.find(UP + "variation") is not None and len(e.find(UP + "original").text) != len(e.find(UP + "variation").text)])  # 7357, number where original and variation lengths differ
len([e for e in splices if e.find(UP + "original") is not None and e.find(UP + "variation") is not None and len(e.find(UP + "original").text) == len(e.find(UP + "variation").text)])  # 5524, number where original and variation lengths are the same
Yes, each splice variant feature has a description starting with "In isoform."
The sequences aren't all the same length.
About half have missing sequences.
N-term Acetyl and Acetyllysine should fall under the same acetylations category.
In the IMzSpectrum interface, add a deconvolution method that returns a list of masses.
Parameters could specify the maximum possible charge (useful for MS2, where precursor charges are known) and a confidence level (for intact mass analysis we only want confidently identified masses; for MS2 we are fine with many low-confidence IDs that might even correspond to a single isotope peak).
Another parameter could be the deconvolution result of a neighboring spectrum, which would increase confidence in matched masses (this is useful for intact, but useless for MS2).
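A rough sketch of how those parameters might interact (a Python illustration with invented names; mzLib is C# and this is not its API):

```python
# Hypothetical illustration of the proposed parameters (names are invented):
# keep candidate masses above a confidence threshold, and boost confidence
# for masses also found in a neighboring spectrum's deconvolution result.
def deconvolute(candidates, min_confidence=0.0, neighbor_masses=(), mass_tol=0.01):
    results = []
    for mass, confidence in candidates:
        if any(abs(mass - nm) <= mass_tol for nm in neighbor_masses):
            confidence = min(1.0, confidence + 0.2)  # arbitrary boost for illustration
        if confidence >= min_confidence:
            results.append((mass, confidence))
    return results
```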
Instead, store the possible neutral loss array in a modification.
This way:
I found this is caused by a file with something like location="file://.", which cannot be parsed.
I changed it to location="file://E:/" and then it worked.
Posting in case anyone hits the same problem in the future.
The same masses should have the same intensities and the same elution times. Masses that do (within some tolerance) are considered real; the others are considered wrong. Use machine learning to learn to separate real from wrong.
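The matching rule described above might be sketched like this (hypothetical thresholds and helper, purely illustrative; a learned classifier would replace the hard cutoffs):

```python
# Illustrative sketch: a mass is labeled "real" when another run contains a
# match within a ppm tolerance whose intensity and elution time also agree.
def label_real(mass, intensity, rt, other_run, ppm_tol=10.0, max_int_ratio=2.0, rt_tol=1.0):
    for m2, i2, rt2 in other_run:
        ppm = abs(mass - m2) / mass * 1e6
        if (ppm <= ppm_tol
                and max(intensity, i2) / min(intensity, i2) <= max_int_ratio
                and abs(rt - rt2) <= rt_tol):
            return True
    return False
```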
ProteoformSuite uses GO terms, and I expect we will use them with the quantitation side of MetaMorpheus.
Could we make a couple new classes in the Proteomics namespace: GoTerm and ProteinWithGoTerms?
public class GoTerm
{
    public string id { get; set; }
    public string description { get; set; }
    public Aspect aspect { get; set; }
}

public enum Aspect
{
    molecularFunction,
    cellularComponent,
    biologicalProcess
}

public class ProteinWithGoTerms : Protein
{
    public List<GoTerm> goTerms { get; set; }
    //Constructor with the goTerms list would go here
}
As for the ProteinDbReader, I found the nested switch-cases with XmlReader confusing, so I changed it to processing XElement objects with LINQ a while back. Here is a snippet of that code for getting the GO terms from protein entries. I'm not exactly sure where we would pull out the GO information, but it would probably involve working the "dbReference" into the nested switches. @stefanks, would you be interested in switching over to XElement for somewhat easier tracing? (Here it is in the ProteoformSuite code.)
List<XElement> dbReferences = entry.Elements().Where(node => node.Name.LocalName == "dbReference").ToList();

//Process dbReferences to retrieve Gene Ontology terms
foreach (XElement dbReference in dbReferences)
{
    string dbReference_type = GetAttribute(dbReference, "type");
    if (dbReference_type == "GO")
    {
        GoTerm go = new GoTerm();
        string ID = GetAttribute(dbReference, "id");
        go.id = ID.Split(':')[1];
        IEnumerable<XElement> dbProperties = from el in dbReference.Elements() where el.Name.LocalName == "property" select el;
        foreach (XElement property in dbProperties)
        {
            string type = GetAttribute(property, "type");
            if (type == "term")
            {
                string description = GetAttribute(property, "value");
                switch (description.Split(':')[0])
                {
                    case "C":
                        go.aspect = Aspect.cellularComponent;
                        go.description = description.Split(':')[1];
                        break;
                    case "F":
                        go.aspect = Aspect.molecularFunction;
                        go.description = description.Split(':')[1];
                        break;
                    case "P":
                        go.aspect = Aspect.biologicalProcess;
                        go.description = description.Split(':')[1];
                        break;
                }
                goTerms.Add(go);
            }
        }
    }
}
I wasn't able to run a test project in a new solution (running .NET Framework 4.7.1). It was complaining that it wasn't able to load ManagedThermoHelperLayer.dll.
I'm pretty sure UnmanagedThermoHelperLayer.dll, one of ManagedThermoHelperLayer.dll's dependencies, isn't getting carried along.
It turns out that if you clean a solution right now (with mzLib v1.0.305), it doesn't delete UnmanagedThermoHelperLayer.dll, so it sneakily doesn't break. But if you delete that file manually, it isn't recreated.
I reverted back to an arbitrary earlier version (mzLib v1.0.285, I think), and UnmanagedThermoHelperLayer.dll was present in that release.
Another weird thing: in the TestThermo project, clean is not deleting UnmanagedThermoHelperLayer.dll, but when I delete it manually and build, it is recreated.
It looks like the nuspec file has some weird targets at the bottom, with UnmanagedThermoHelperLayer listed under a different target than the rest of the files. I don't know what the heck mzLib.targets is, either. I think this is what needs to be changed.
FILE: 12-10-16_A17A_yeast_BU_fract5_rep1-Calibrated.mzML
VALIDATION MESSAGE:
(1) Non-fatal XML Parsing error detected on line 13
(2) Non-fatal XML Parsing error detected on line 48
(3) Error message: cvc-datatype-valid.1.2.1: '12-10-16_A17A_yeast_BU_fract5_rep1-Calibrated' is not a valid value for 'NCName'.
(4) Error message: cvc-attribute.3: The value '12-10-16_A17A_yeast_BU_fract5_rep1.raw' of attribute 'id' on element 'sourceFile' is not valid with respect to its type, 'ID'.
(5) Error message: cvc-datatype-valid.1.2.1: '12-10-16_A17A_yeast_BU_fract5_rep1.raw' is not a valid value for 'NCName'.
(6) Error message: cvc-attribute.3: The value '12-10-16_A17A_yeast_BU_fract5_rep1-Calibrated' of attribute 'id' on element 'run' is not valid with respect to its type, 'ID'.
FILE: 12-10-16_A17A_yeast_BU_fract5_rep1-Calibrated_1mm.mzid
VALIDATION MESSAGE:
(1) Fatal Error message: Content is not allowed in prolog.
(2) FATAL XML Parsing error detected on line 1
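The NCName errors arise because xsd:ID values must be NCNames, and an NCName cannot begin with a digit, so filenames like these fail validation when used directly as ids. A minimal sanitizer sketch (hypothetical helper; an ASCII-only approximation of the NCName rules):

```python
import re

# Approximate ASCII NCName check: must start with a letter or underscore;
# no colons. Real NCName rules allow more Unicode; this is a sketch only.
NCNAME = re.compile(r'^[A-Za-z_][A-Za-z0-9_.\-]*$')

def sanitize_xml_id(value, prefix="x"):
    # Prefix non-conforming ids so validators accept them.
    return value if NCNAME.match(value) else prefix + value
```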
This way it's someone else's headache to figure out all the file inputs, and we'll have the ability to read multiple vendor formats.
Thermo has a new system for reading .raw files. I'd like it integrated into mzLib so we don't have to depend on people having MSFileReader v3.0 SP2 installed. It's a constant thorn in our side and if we can ensure users have what they need it will make all of our lives easier across Proteoform Suite, MetaMorpheus, FlashLFQ, and new programs.
MetaMorpheus crashes with funky nonspecific errors if ThermoMSFileReader isn't installed. Could we make a method to check for ThermoMSFileReader and return a more detailed exception and exception message if it isn't installed?
Some output I would like to be able to access in PS:
Some options I would like for input parameters:
for use in those visualization and downstream analysis software tools that require a FASTA file as input.
It would be helpful (faster) if this method:
int GetClosestOneBasedSpectrumNumber(double retentionTime)
were implemented with binary search instead of iterating through the scans.
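A sketch of the binary-search version (Python illustration; mzLib is C#), assuming the scans' retention times are sorted in ascending order and scan numbers are one-based:

```python
from bisect import bisect_left

# Sketch: return the one-based index of the retention time closest to rt.
def closest_one_based_scan(retention_times, rt):
    i = bisect_left(retention_times, rt)
    if i == 0:
        return 1
    if i == len(retention_times):
        return len(retention_times)
    before, after = retention_times[i - 1], retention_times[i]
    return i if rt - before <= after - rt else i + 1
```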
Each object should have some sort of metadata that explains the fields in each object. One example: "accession" in ModificationWithLocation is a Tuple<string, string>, but it's not clear unless looking at the mzLib code what each string represents.
In thermo raw files that don't have scan event = 1 for precursor scans, reading fails. See http://stackoverflow.com/questions/8971486/com-interop-how-to-use-icustommarshaler-to-call-3rd-party-component
The ProteinDatabaseLoader example is out of date, and perhaps some other items are as well.
When there are two peaks within the deconvolution tolerance, the single-spectrum deconvolution picks the one nearest to the theoretical value, instead of the one that best matches the expected intensity.
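The intended selection rule could be sketched as follows (hypothetical helper, not mzLib code):

```python
# Sketch: among candidate peaks already inside the m/z tolerance, prefer the
# one whose intensity best matches the expected isotopic intensity, rather
# than the one closest in m/z. Candidates are (mz_error, intensity) pairs.
def best_candidate(candidates, expected_intensity):
    return min(candidates, key=lambda c: abs(c[1] - expected_intensity))
```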
Need to:
Currently, we are using NetSerializer for serialization because it is fast and works with Dictionary objects.
However, it does not implement ISerializable, so we cannot write custom serializers to avoid circular references like Element <-> Isotope. I drafted a custom serialization, which you can find here.
We've had trouble with localization, naming collisions, and keeping track of which modifications are in use. We also don't have robust modification reading; it is currently prone to crashing from minor formatting problems. We also currently have no way to add "comments" to mod text files. I'd like to change all this, so I'm creating one modification class to rule them all. Here is the first draft. I'd like to hear your comments/edits/suggestions. All complaints can be sent to /dev/null.
using System;
using System.Collections.Generic;
using System.Text;
using Chemistry;
using MassSpectrometry;

namespace Proteomics
{
    public class ModificationGeneral
    {
        public string Id { get; private set; }
        public string Accession { get; private set; }
        public string ModificationType { get; private set; }
        public string FeatureType { get; private set; }
        public List<string> Target { get; private set; }
        public List<string> Positions { get; private set; }
        public ChemicalFormula ChemicalFormula { get; private set; }
        public double? MonoisotopicMass { get; private set; }
        public Dictionary<string, string> DatabaseReference { get; private set; }
        public Dictionary<string, string> TaxonomicRange { get; private set; }
        public List<string> Keywords { get; private set; }
        public Dictionary<DissociationType, List<double>> NeutralLosses { get; private set; }
        public Dictionary<DissociationType, List<double>> DiagnosticIons { get; private set; }
        public string FileOrigin { get; private set; }

        public ModificationGeneral()
        {
            this.Id = null;
            this.Accession = null;
            this.ModificationType = null;
            this.FeatureType = null;
            this.Target = new List<string>();
            this.Positions = new List<string>();
            this.ChemicalFormula = new ChemicalFormula();
            this.MonoisotopicMass = null;
            this.DatabaseReference = new Dictionary<string, string>();
            this.TaxonomicRange = new Dictionary<string, string>();
            this.Keywords = new List<string>();
            this.NeutralLosses = new Dictionary<DissociationType, List<double>>();
            this.DiagnosticIons = new Dictionary<DissociationType, List<double>>();
            this.FileOrigin = null;
        }
    }
}
By removing the switch statement
The hash codes for two ModificationWithMass objects with different IDs, different accessions, and different masses are the same.
e.g.
new ModificationWithMass("mod", new Tuple<string, string>("acc", "acc"), motif, TerminusLocalization.Any, 1, null, null, null, "type")
and
new ModificationWithMass("mod2", new Tuple<string, string>("acc2", "acc2"), motif, TerminusLocalization.Any, 10, null, null, null, "type")
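The direction of the fix, sketched in Python for illustration (mzLib is C#): the hash should combine the distinguishing fields, so that objects differing in id, accession, or mass hash differently in general.

```python
# Illustrative value-based hash combining the distinguishing fields.
def mod_hash(mod_id, accession, monoisotopic_mass):
    return hash((mod_id, accession, monoisotopic_mass))
```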
http://www.ebi.ac.uk/pride/help/archive/submission
PRIDE is the vehicle for getting new data into UniProt. PRIDE requires mzIdentML as one of the inputs. I believe PRIDE has a tool to validate the mzIdentML format; see the link above.