thomasp85 / mzid
An mzIdentML parser for R
@thomasp85 - there are currently some issues with this repo, the Bioconductor repo, and maintainer access.
Should I maintain the package? I am not sure I am keen to do this long term, and would possibly redirect users toward mzR. Also, I think I would favour using only the Bioc repo, just for simplicity.
Hi,
When I read an mzid file from OMSSA, I get the error below, but an mzid file from X!Tandem reads fine. Could you please look into the reason?
mzID("omssa.mzid")
Error in `$<-.data.frame`(`*tmp*`, "name", value = c("Carbamidomethyl", :
  replacement has 7 rows, data has 5
sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: i386-w64-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936
[2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936
[3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
[4] LC_NUMERIC=C
[5] LC_TIME=Chinese (Simplified)_People's Republic of China.936
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] mzID_1.1.0 XML_3.98-1.1
loaded via a namespace (and not attached):
[1] plyr_1.8 tools_3.0.1
Best regards!
Bo
The internal names are different for versions 1.0 and 1.1. For example, in object@psm@id we have Peptide_ref and peptide_ref respectively, and in object@database@database there is SearchDatabase_ref or searchDatabase_ref.
This leads to errors when flattening an mzID instance from v1.0.
Two suggestions to fix this, which I am happy to do:
Any preference, Thomas?
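One possible approach, given only as a sketch and not necessarily one of the two suggestions: normalise the version-specific column names once, right after parsing, so that downstream code such as flatten only ever sees the 1.1 spelling. The helper name normalise_ref_names is hypothetical, and the rule it applies (lower-casing the first letter of every column ending in "_ref") is an assumption based solely on the two name pairs mentioned above.

```r
# Hypothetical helper: map mzIdentML 1.0 column names onto their 1.1 spelling
# by lower-casing the first letter of every column name that ends in "_ref".
normalise_ref_names <- function(df) {
  nms <- names(df)
  is_ref <- grepl("_ref$", nms)
  nms[is_ref] <- paste0(tolower(substr(nms[is_ref], 1, 1)),
                        substring(nms[is_ref], 2))
  names(df) <- nms
  df
}

# Stand-in for a table parsed from a v1.0 file
v10 <- data.frame(Peptide_ref = "pep1", SearchDatabase_ref = "db1", score = 1)
names(normalise_ref_names(v10))
```

Columns that do not end in "_ref" are left untouched, so the same call is harmless on tables parsed from a v1.1 file.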
Since the preferred way of storing and handling mzIdentML files is in compressed form (see topic 7.1 in http://www.psidev.info/sites/default/files/mzIdentML1.1.0.pdf), we should add the capability to read mzid files directly in compressed form. Compression is very effective for this type of XML file: it reduces the size roughly 10-fold.
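A minimal sketch of how this could look using base R connections only. The helper name read_mzid_text is hypothetical, and the sketch assumes that a .zip archive holds the mzid file as its first entry. It returns the XML as a single string, which could then be handed to an XML parser that accepts text input.

```r
# Hypothetical helper: return the XML text of an mzid file that may be
# plain, gzip-compressed (.gz) or zip-compressed (.zip).
read_mzid_text <- function(path) {
  if (grepl("\\.zip$", path, ignore.case = TRUE)) {
    # Assumption: the first entry of the archive is the mzid file.
    inner <- utils::unzip(path, list = TRUE)$Name[1]
    con <- unz(path, inner)
  } else {
    # gzfile() also reads uncompressed files transparently.
    con <- gzfile(path)
  }
  on.exit(close(con))
  paste(readLines(con, warn = FALSE), collapse = "\n")
}

# Round trip with a tiny stand-in document
tmp <- tempfile(fileext = ".mzid.gz")
out <- gzfile(tmp, "w")
writeLines("<MzIdentML version=\"1.1.0\"/>", out)
close(out)
read_mzid_text(tmp)
```

Reading the whole file into one string is simple but costs memory for large files, so this is meant to show the detection logic rather than a final design.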
An empty mzID object produces an error on flattening. This can be an issue when dealing with a huge collection in which one object is empty.
> emptyid
An empty mzID object
> flatten(emptyid)
Error in rep(1:length(object@mapping), sapply(object@mapping, length)) :
invalid 'times' argument
> traceback()
7: `[.data.frame`(scans(object, safeNames = safeNames), rep(1:length(object@mapping),
sapply(object@mapping, length))[match(1:nrow(object@id),
unlist(object@mapping))], )
6: scans(object, safeNames = safeNames)[rep(1:length(object@mapping),
sapply(object@mapping, length))[match(1:nrow(object@id),
unlist(object@mapping))], ]
5: cbind(scans(object, safeNames = safeNames)[rep(1:length(object@mapping),
sapply(object@mapping, length))[match(1:nrow(object@id),
unlist(object@mapping))], ], id(object, safeNames = safeNames))
4: flatten(object@psm, safeNames = safeNames)
3: flatten(object@psm, safeNames = safeNames)
2: flatten(emptyid)
1: flatten(emptyid)
It seems quite straightforward to fix but I'm not sure where best to check for length (at the mzID or mzIDpsm level) and whether to return NULL or anything else.
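As a sketch of one option, the index computation from the traceback can be wrapped in a guard that returns an empty data.frame, which rbind() ignores when results from a collection are combined. The function flatten_safe and its three plain arguments are hypothetical stand-ins for the scans table, the id table and the mapping slot; this is not the package's actual method.

```r
# Hypothetical stand-in for the PSM-level flatten step.
flatten_safe <- function(scan_table, id_table, mapping) {
  # Guard: nothing to flatten, so return an empty data.frame rather than NULL.
  if (length(mapping) == 0L || nrow(id_table) == 0L) {
    return(data.frame())
  }
  # seq_along()/seq_len() never count down the way 1:length(x) does.
  scan_index <- rep(seq_along(mapping), vapply(mapping, length, integer(1)))
  row_order <- match(seq_len(nrow(id_table)), unlist(mapping))
  cbind(scan_table[scan_index[row_order], , drop = FALSE], id_table)
}

scans <- data.frame(scan = c("a", "b"), stringsAsFactors = FALSE)
ids <- data.frame(pep = c("P1", "P2", "P3"), stringsAsFactors = FALSE)
full <- flatten_safe(scans, ids, list(c(1L, 3L), 2L))
empty <- flatten_safe(scans[0, , drop = FALSE], ids[0, , drop = FALSE], list())

# The empty result does not disturb the combined table.
rbind(empty, full)
```

Checking at this level keeps the guard next to the code that fails; whether an empty data.frame or NULL is the better return value depends on how callers combine the results.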
For more memory efficient parsing of large files, we should consider rewriting the parser to use SAX rather than DOM parsing. This is a major rewrite though, so it might be a while before we get to this...
The ProtGenerics package defines generic functions that are used in several proteomics-related packages:
ls('package:ProtGenerics')
[1] "accessions" "chromatograms" "database" "intensity"
[5] "ions" "mass" "modifications" "mz"
[9] "peaks" "peptides" "proteins" "psms"
[13] "rtime" "scans" "spectra" "tic"
Generics with identical names and different signatures, such as peptides below, cause errors when mzID and, for example, MSnbase (which uses ProtGenerics::peptides) are loaded together:
> peptides
standardGeneric for "peptides" defined from package "ProtGenerics"
function (object, ...)
standardGeneric("peptides")
<environment: 0x4ad4528>
Methods may be defined for arguments: object
Use showMethods("peptides") for currently available ones.
Would it be possible to import ProtGenerics and use these generics to avoid these conflicts?
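To illustrate the idea, the sketch below attaches a method to an existing generic instead of defining a second generic with the same name. The generic is declared locally here only so the example runs on its own, using the signature printed above; in a package it would come from ProtGenerics via the NAMESPACE file. The class DemoID is hypothetical.

```r
library(methods)

# Local stand-in for ProtGenerics::peptides (same signature as shown above)
setGeneric("peptides", function(object, ...) standardGeneric("peptides"))

# Hypothetical class standing in for an mzID-like object
setClass("DemoID", slots = c(pep = "character"))

# Add a method to the shared generic rather than creating a new generic
setMethod("peptides", "DemoID", function(object, ...) object@pep)

peptides(new("DemoID", pep = c("PEPTIDE", "SEQVENCE")))
```

Because only a method is added, loading another package that uses the same generic does not mask or replace it.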
The namespace used to parse the mzid file is hard coded in the mzID constructor.
Would it be possible to (1) add support for mzIdentML 1.0 and, if yes, (2) implement a mechanism to detect the version automatically by extracting the information from the xml file?
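A rough sketch of automatic detection, assuming the schema version can be read from the namespace URI declared on the root element (for 1.1 files this is http://psidev.info/psi/pi/mzIdentML/1.1). The helper name detect_mzid_version is hypothetical, and it uses a regular expression on the first lines of the file instead of a full XML parse.

```r
# Hypothetical helper: pull the schema version out of the mzIdentML namespace
# URI, looking only at the first n lines of the file.
detect_mzid_version <- function(path, n = 20L) {
  head_txt <- paste(readLines(path, n = n, warn = FALSE), collapse = " ")
  m <- regmatches(head_txt,
                  regexpr("psidev\\.info/psi/pi/mzIdentML/[0-9.]+", head_txt))
  if (length(m) == 0L) return(NA_character_)
  # Keep only the part after the last slash, e.g. "1.1"
  sub(".*/", "", m)
}

# Tiny stand-in file
tmp <- tempfile(fileext = ".mzid")
writeLines(c('<?xml version="1.0" encoding="UTF-8"?>',
             '<MzIdentML id="demo" version="1.1.0"',
             '  xmlns="http://psidev.info/psi/pi/mzIdentML/1.1">',
             '</MzIdentML>'), tmp)
detect_mzid_version(tmp)
```

The detected version could then select the namespace passed to the parsing functions instead of the hard-coded value.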
Hi,
When the mzid file contains no PSMs, mzID shows the error below:
mzid = "2927_myrimatch.mzid"
mzid.obj <- mzID(mzid)
Error in split.default(1:nrow(id), rep(1:length(nID), nID)) :
group length is 0 but data length > 0
In addition: Warning messages:
1: In countChildren(doc, ns, path, child, simplify = FALSE) :
The specified XPATH expression is empty
2: In countChildren(doc, ns, path = paste0(.path, "/x:DataCollection/x:AnalysisData/x:SpectrumIdentificationList/x:SpectrumIdentificationResult"), :
The specified XPATH expression is empty
sessionInfo("mzID")
R version 3.0.1 (2013-05-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C
[3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915
[5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C
attached base packages:
character(0)
other attached packages:
[1] mzID_0.99.2
loaded via a namespace (and not attached):
[1] base_3.0.1 datasets_3.0.1 graphics_3.0.1 grDevices_3.0.1
[5] methods_3.0.1 plyr_1.8 stats_3.0.1 utils_3.0.1
[9] XML_3.98-1.1
Best regards!
Bo
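The message is consistent with the 1:n idiom in the failing call: when n is 0, 1:n yields c(1, 0), so the data vector has length 2 while the grouping vector is empty. A minimal sketch of a guarded version follows; the function name split_rows_by_result is hypothetical and the arguments are plain stand-ins for nrow(id) and nID.

```r
# 1:n counts down when n is 0, which reproduces the kind of failure reported.
n_rows <- 0L
failed <- inherits(try(split(1:n_rows, integer(0)), silent = TRUE), "try-error")

# Hypothetical guarded version of the grouping step
split_rows_by_result <- function(n_rows, n_per_result) {
  if (n_rows == 0L || length(n_per_result) == 0L) return(list())
  split(seq_len(n_rows), rep(seq_along(n_per_result), n_per_result))
}

split_rows_by_result(0L, integer(0))  # empty list instead of an error
split_rows_by_result(3L, c(2L, 1L))   # rows 1 and 2 in group 1, row 3 in group 2
```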
Dear Thomas,
I want to create a class union to handle mzID objects but it seems that no mzID class is exported.
library("mzID")
packageVersion("mzID")
[1] ‘1.4.1’
setClassUnion("mzIDClasses", members = c("mzID", "mzIDCollection"))
Warning messages:
1: class "mzID" is defined (with package slot ‘mzID’) but no metadata object found to revise superClass information---not exported? Making a copy in package ‘.GlobalEnv’
2: class "mzIDCollection" is defined (with package slot ‘mzID’) but no metadata object found to revise superClass information---not exported? Making a copy in package ‘.GlobalEnv’
Is there any chance to export these classes?
Background: I want to create some S4 methods that just flatten mzID objects (regardless of whether they are single objects or collections). These methods would be exactly the same for both mzID and mzIDCollection objects. I don't want to just copy and paste the method body, e.g.:
### instead of defining two methods ...
setMethod("addIdentificationData", c("MSnExp", "mzID"),
function(object, id, ...) ...)
setMethod("addIdentificationData", c("MSnExp", "mzIDCollection"),
function(object, id, ...) ...)
### ... I want to define a class union and define the method just once
setClassUnion("mzIDClasses", members = c("mzID", "mzIDCollection"))
setMethod("addIdentificationData", c("MSnExp", "mzIDClasses"),
function(object, id, ...) ...)
Best wishes,
Sebastian
CC: @lgatto
Hey,
I am currently trying to parse an mzIdentML (".mzid") file generated by PeptideShaker, but mzID is unable to parse it and returns this error:
Error in mapply(sub, paste("^\\Q", database$accession[hasRightName], "\\E", :
  zero-length inputs cannot be mixed with those of non-zero length
Calls: mzID ... new -> initialize -> initialize -> mzIDdatabase -> mapply
Execution halted
Do you have any idea what might be wrong?
Best regards,
Vivian
This will break compatibility with R < 3.0 but the representation argument is described as deprecated in the documentation.
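Assuming the change in question is replacing the representation argument of setClass() with slots, the difference looks roughly like this. The class name DemoResult and its slot are hypothetical.

```r
library(methods)

# Older style, using the representation argument:
# setClass("DemoResult", representation(scores = "numeric"))

# Same class using the slots argument (needs R >= 3.0.0)
setClass("DemoResult", slots = c(scores = "numeric"))

new("DemoResult", scores = c(0.1, 0.9))@scores
```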
This should be implemented before the next BioC release
I'll investigate this issue in further detail, but perhaps the solution will be to make sure that XML-related objects (I guess mainly "doc") are explicitly removed and gc() is explicitly called at the end of every function that parses an XML document.
library("mzID")
# download an mzIdentML file from PeptideAtlas
# the unzipped files size is ~ 36 Mb
dataset <- "c_elegans_B_1_02_21Apr10_Draco_10-03-05_msgfplus"
cel.path <- sprintf("ftp://PASS00308:[email protected]/MSGFPlus_Results/MZID_Files/%s.zip", dataset)
download.file(cel.path, sprintf("%s.zip", dataset))
unzip(sprintf("%s.zip", dataset))
x <- mzID(sprintf("%s.mzid", dataset))
#500 Mb in rsession.exe
rm(x)
gc()
# still ~500 Mb occupied by R (on Windows 7)
# on Mac OSX it is about ~250 Mb. Less, but still an issue.
# It looks like a memory leak in XML package. I'll double check to be sure.
# A similar issue has been discussed here:
# http://stackoverflow.com/questions/9220849/serious-memory-leak-when-iteratively-parsing-xml-files
# http://www.inside-r.org/packages/cran/XML/docs/matchNamespaces
# http://www.omegahat.org/RSXML/MemoryManagement.html
I would like to have a graphical display of mzID objects, but would like suggestions as to which should be the default one...
Some ideas:
please add suggestions...
I can't seem to decide whether users should be encouraged to change the content of mzID objects through setters. It kinda depends on whether mzID objects should reflect the content of a specific file or should merely reflect a collection of data.
If they were mutable should we provide an mzIdentML writer to save changes to a file in native format?
Find out whether it is feasible to support pepXML with the same class structure as for mzIdentML.
Hi,
I'm having the same problem as reported in #31, which I don't think was ever solved. I've commented there, but later realised that since it's closed that probably doesn't result in any notifications, so I'm sorry for the duplication!
At any rate: when I try to parse a .mzid file, after a short wait I get the following error. (In #31 it's said that it took about 10 hrs; in my case it's usually a few minutes.)
Error in mapply(sub, paste("^\\Q", database$accession[hasRightName], "\\E", :
  zero-length inputs cannot be mixed with those of non-zero length
I've had a look to compare my files to the example .mzid files, but as I'm relatively inexperienced with the format I'm not sure where the problem originates. Any idea what might be causing this?
Thank you!
Tessa