thomasp85 / mzid


An mzIdentML parser for R


mzid's Issues

Bioc and github repos

@thomasp85 - there are currently some issues with this repo, the Bioconductor repo, and maintainer access.

This repo and Bioc are unrelated, which makes their syncing impossible. Probably fixable with instructions here.
I am not a maintainer of the Bioc repo, hence can't push any updates there.

Should I maintain the package? I am not sure I am keen to do this long term, and would possibly redirect users toward mzR. Also, I think I would favour using only the Bioc repo, just for simplicity.

Bug with reading the mzid file from OMSSA

Hi,
When I read the mzid file from OMSSA, I got the error below, but reading the mzid file from X!Tandem worked fine. Could you please figure out the reason?

mzID("omssa.mzid")
Error in `$<-.data.frame`(`*tmp*`, "name", value = c("Carbamidomethyl", :
  replacement has 7 rows, data has 5
sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: i386-w64-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936
[2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936
[3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
[4] LC_NUMERIC=C
[5] LC_TIME=Chinese (Simplified)_People's Republic of China.936

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] mzID_1.1.0 XML_3.98-1.1

loaded via a namespace (and not attached):
[1] plyr_1.8 tools_3.0.1

Best regards!
Bo

Find solution for avoiding toLower of attribute names

Issues with mzid v1.0/v1.1

The internal names are different for versions 1.0 and 1.1. For example, in object@psm@id, we have Peptide_ref and peptide_ref respectively, or in object@database@database, there is SearchDatabase_ref or searchDatabase_ref.

This leads to errors when flattening an mzID instance from v1.0.

Two suggestions to fix this, which I am happy to do:

fix on a case by case, grep-ing the correct column name
set all the column names to lower case when creating the object and use lower case only in the rest.

Any preference, Thomas?
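
The two suggestions can be sketched on toy data frames. This is only an illustration: `findColumn` and `lowerNames` are hypothetical helpers, not mzID functions, and the data frames merely reuse the column names quoted above.

```r
# Toy stand-ins for the id tables of a v1.0 and a v1.1 file
# (column names taken from the issue text, values invented for illustration)
v10 <- data.frame(Peptide_ref = "pep_1", stringsAsFactors = FALSE)
v11 <- data.frame(peptide_ref = "pep_1", stringsAsFactors = FALSE)

# Suggestion 1 (hypothetical helper): look the column up case-insensitively
# every time it is needed
findColumn <- function(df, name) {
  grep(paste0("^", name, "$"), names(df), ignore.case = TRUE, value = TRUE)
}

# Suggestion 2 (hypothetical helper): lower-case all names once, when the
# object is created, and use lower case everywhere else
lowerNames <- function(df) {
  names(df) <- tolower(names(df))
  df
}

findColumn(v10, "peptide_ref")
identical(names(lowerNames(v10)), names(lowerNames(v11)))
```

The second variant touches the names in a single place, at the cost of no longer preserving the attribute names exactly as they appear in the file.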

Capability for reading compressed mzIdentML files

Since the preferred way of storing and handling mzIdentML files is in compressed form (see topic 7.1 in http://www.psidev.info/sites/default/files/mzIdentML1.1.0.pdf for reference), we should add a capability to read mzid files directly in compressed form. Compression is very effective for this type of XML file: it reduces the size roughly 10-fold.
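
A minimal base-R sketch of the idea, assuming gzip compression: `readCompressed` is a hypothetical helper (not part of mzID) that reads the text through a `gzfile()` connection, so no decompressed copy has to be written to disk.

```r
# Hypothetical helper: return the XML text of a gzip-compressed file
readCompressed <- function(path) {
  con <- gzfile(path, open = "rt")
  on.exit(close(con))
  paste(readLines(con, warn = FALSE), collapse = "\n")
}

# Create a tiny compressed file so the sketch is self-contained
tmp <- tempfile(fileext = ".mzid.gz")
out <- gzfile(tmp, open = "wt")
writeLines("<MzIdentML version=\"1.1.0\"/>", out)
close(out)

xmlText <- readCompressed(tmp)
xmlText
```

The returned string could then be handed to an XML parser as text. Zip archives holding several files would need a different approach, for example `unz()`.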

empty mzID object

An empty mzID object produces an error on flattening. This can be an issue when dealing with a huge collection in which one object is empty.

> emptyid
An empty mzID object
> flatten(emptyid)
Error in rep(1:length(object@mapping), sapply(object@mapping, length)) : 
  invalid 'times' argument
> traceback()
7: `[.data.frame`(scans(object, safeNames = safeNames), rep(1:length(object@mapping), 
       sapply(object@mapping, length))[match(1:nrow(object@id), 
       unlist(object@mapping))], )
6: scans(object, safeNames = safeNames)[rep(1:length(object@mapping), 
       sapply(object@mapping, length))[match(1:nrow(object@id), 
       unlist(object@mapping))], ]
5: cbind(scans(object, safeNames = safeNames)[rep(1:length(object@mapping), 
       sapply(object@mapping, length))[match(1:nrow(object@id), 
       unlist(object@mapping))], ], id(object, safeNames = safeNames))
4: flatten(object@psm, safeNames = safeNames)
3: flatten(object@psm, safeNames = safeNames)
2: flatten(emptyid)
1: flatten(emptyid)

It seems quite straightforward to fix but I'm not sure where best to check for length (at the mzID or mzIDpsm level) and whether to return NULL or anything else.
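
One possible guard, sketched on the index expression from the traceback. `scanIndex` is a hypothetical stand-in for that expression, not the actual mzID code, and returning an empty vector rather than NULL is just one of the options mentioned above.

```r
# 1:length(x) evaluates to c(1, 0) when x is empty, which is why
# rep(1:length(mapping), ...) breaks; seq_along() gives integer(0) instead.
scanIndex <- function(mapping) {
  # Explicit early return for an empty object
  if (length(mapping) == 0L) return(integer(0))
  rep(seq_along(mapping), vapply(mapping, length, integer(1)))
}

scanIndex(list())             # empty mapping, no error
scanIndex(list(1:2, 3L))      # two PSMs for scan 1, one for scan 2
```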

Rewrite parser to access file with SAX

For more memory efficient parsing of large files, we should consider rewriting the parser to use SAX rather than DOM parsing. This is a major rewrite though, so it might be a while before we get to this...
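
To give a feel for the event-driven style, here is a minimal sketch using `xmlEventParse()` from the XML package. `countElements` is a hypothetical helper and the XML string is a toy document; a real parser would need handlers that collect attributes and text rather than just count elements.

```r
library(XML)

# Hypothetical helper: count start tags with a given name without
# building a DOM tree in memory
countElements <- function(xml, element) {
  n <- 0L
  handlers <- list(
    startElement = function(name, attrs, ...) {
      if (name == element) n <<- n + 1L
    }
  )
  xmlEventParse(xml, handlers = handlers, asText = TRUE)
  n
}

toy <- paste0(
  "<MzIdentML>",
  "<SpectrumIdentificationResult id=\"r1\"/>",
  "<SpectrumIdentificationResult id=\"r2\"/>",
  "</MzIdentML>"
)
countElements(toy, "SpectrumIdentificationResult")
```

Because the handlers see one element at a time, memory use depends on what the handlers keep rather than on the size of the document.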

duplicated generics

The ProtGenerics package defines generic functions that are used in several proteomics-related packages:

ls('package:ProtGenerics')
 [1] "accessions"    "chromatograms" "database"      "intensity"    
 [5] "ions"          "mass"          "modifications" "mz"           
 [9] "peaks"         "peptides"      "proteins"      "psms"         
[13] "rtime"         "scans"         "spectra"       "tic"

Generics with identical names but different signatures, such as peptides below, cause errors when mzID and, for example, MSnbase (which uses ProtGenerics::peptides) are both loaded:

> peptides
standardGeneric for "peptides" defined from package "ProtGenerics"

function (object, ...) 
standardGeneric("peptides")
<environment: 0x4ad4528>
Methods may be defined for arguments: object
Use  showMethods("peptides")  for currently available ones.

Would it be possible to import ProtGenerics and use these generics to avoid these conflicts?
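
A self-contained sketch of what reusing a shared generic looks like. The generic is defined locally here only as a stand-in for ProtGenerics::peptides, and `ToyIdent` is a hypothetical class; in a package the generic would instead be imported, e.g. with `importFrom(ProtGenerics, peptides)` in the NAMESPACE file.

```r
library(methods)

# Stand-in for ProtGenerics::peptides (same signature as printed above),
# defined here only so the example runs without ProtGenerics installed
setGeneric("peptides", function(object, ...) standardGeneric("peptides"))

# Hypothetical class standing in for an identification object
setClass("ToyIdent", slots = c(pep = "character"))

# Only a method is added; there is no second setGeneric() call that could
# mask or conflict with the shared generic
setMethod("peptides", "ToyIdent", function(object, ...) object@pep)

peptides(new("ToyIdent", pep = c("PEPTIDEK", "SEQK")))
```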

Support for mzIdentML 1.0

The namespace used to parse the mzid file is hard coded in the mzID constructor.
Would it be possible to (1) add support for mzIdentML 1.0 and, if yes, (2) implement a mechanism to detect the version automatically by extracting the information from the xml file?
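
For point (2), one option is to read the `version` attribute of the root element before choosing a namespace. The sketch below uses the XML package on a toy string; `detectVersion` is a hypothetical helper, and it assumes the file declares its version on the root `MzIdentML` element.

```r
library(XML)

# Hypothetical helper: return the version attribute of the root element,
# or NA if the attribute is absent
detectVersion <- function(xml, asText = FALSE) {
  doc <- xmlParse(xml, asText = asText)
  xmlGetAttr(xmlRoot(doc), "version", default = NA_character_)
}

toy10 <- "<MzIdentML version=\"1.0.0\"/>"
toy11 <- "<MzIdentML version=\"1.1.0\"/>"
detectVersion(toy10, asText = TRUE)
detectVersion(toy11, asText = TRUE)
```

The constructor could then pick the namespace matching the detected version instead of using a hard-coded one.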

Error when no PSM in mzid

Hi,
When the mzid file contains no PSM, it shows the error below:

mzid = "2927_myrimatch.mzid"
mzid.obj <- mzID(mzid)
Error in split.default(1:nrow(id), rep(1:length(nID), nID)) :
group length is 0 but data length > 0
In addition: Warning messages:
1: In countChildren(doc, ns, path, child, simplify = FALSE) :
The specified XPATH expression is empty
2: In countChildren(doc, ns, path = paste0(.path, "/x:DataCollection/x:AnalysisData/x:SpectrumIdentificationList/x:SpectrumIdentificationResult"), :
The specified XPATH expression is empty

sessionInfo("mzID")
R version 3.0.1 (2013-05-16)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C
[3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915
[5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C

attached base packages:
character(0)

other attached packages:
[1] mzID_0.99.2

loaded via a namespace (and not attached):
[1] base_3.0.1 datasets_3.0.1 graphics_3.0.1 grDevices_3.0.1
[5] methods_3.0.1 plyr_1.8 stats_3.0.1 utils_3.0.1
[9] XML_3.98-1.1
Best regards!
Bo

export `mzID` and `mzIDCollection` classes

Dear Thomas,

I want to create a class union to handle mzID objects but it seems that no mzID class is exported.

library("mzID")
packageVersion("mzID")
[1] ‘1.4.1’
setClassUnion("mzIDClasses", members = c("mzID", "mzIDCollection"))
Warning messages:
1: class "mzID" is defined (with package slot ‘mzID’) but no metadata object found to revise superClass information---not exported?  Making a copy in package ‘.GlobalEnv’ 
2: class "mzIDCollection" is defined (with package slot ‘mzID’) but no metadata object found to revise superClass information---not exported?  Making a copy in package ‘.GlobalEnv’

Is there any chance to export these classes?

Background: I want to create some S4 methods that just flatten mzID objects (regardless of whether they are single objects or collections). These methods would be exactly the same for both mzID and mzIDCollection objects. I don't want to just copy and paste the method body, e.g.:

### instead of defining two methods ...
setMethod("addIdentificationData", c("MSnExp", "mzID"),
          function(object, id, ...) ...)

setMethod("addIdentificationData", c("MSnExp", "mzIDCollection"),
          function(object, id, ...) ...)

### ... I want to define a class union and define the method just once
setClassUnion("mzIDClasses", members = c("mzID", "mzIDCollection"))
setMethod("addIdentificationData", c("MSnExp", "mzIDClasses"),
          function(object, id, ...) ...)

Best wishes,

Sebastian

CC: @lgatto
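
For readers who want to try the pattern without mzID and MSnbase, here is a self-contained sketch with toy classes. All names (`ToyID`, `ToyIDCollection`, `ToyIDClasses`, `describe`) are hypothetical; on the package side, the export would presumably be an `exportClasses(mzID, mzIDCollection)` directive in the NAMESPACE file.

```r
library(methods)

# Toy stand-ins for a single object and a collection
setClass("ToyID", slots = c(n = "numeric"))
setClass("ToyIDCollection", slots = c(data = "list"))

# One virtual class covering both
setClassUnion("ToyIDClasses", members = c("ToyID", "ToyIDCollection"))

# One method definition serves both member classes
setGeneric("describe", function(id, ...) standardGeneric("describe"))
setMethod("describe", "ToyIDClasses",
          function(id, ...) paste("got a", class(id)))

describe(new("ToyID", n = 1))
describe(new("ToyIDCollection", data = list()))
```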

mzID returning mapply error

Hey,

I am currently trying to parse an mzIdentML (".mzid") file generated by PeptideShaker, but mzID is unable to parse it and returns this error:
Error in mapply(sub, paste("^\\Q", database$accession[hasRightName], "\\E", :
  zero-length inputs cannot be mixed with those of non-zero length
Calls: mzID ... new -> initialize -> initialize -> mzIDdatabase -> mapply
Execution halted

Do you have any idea what might be wrong ?

Best regards,
Vivian

Update all class definitions to use the slots argument (available as of R 3.0) rather than representation

This will break compatibility with R < 3.0 but the representation argument is described as deprecated in the documentation.

This should be implemented before the next BioC release
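
The change is mechanical, as this toy sketch suggests (`ToyOld` and `ToyNew` are hypothetical classes, not mzID definitions):

```r
library(methods)

# Old style: slots declared through representation()
setClass("ToyOld", representation(id = "character", score = "numeric"))

# New style: the slots argument of setClass()
setClass("ToyNew", slots = c(id = "character", score = "numeric"))

# Both definitions yield the same slot layout
identical(getSlots("ToyOld"), getSlots("ToyNew"))
```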

Dealing with memory leak associated with parsing XML files using XML package.

I'll investigate this issue in further detail, but perhaps the solution will be to make sure that XML-related objects (I guess mainly "doc") are explicitly removed and gc() is explicitly called at the end of every function that parses an XML document.

library("mzID")

# download an mzIdentML file from PeptideAtlas
# the unzipped files size is ~ 36 Mb
dataset <- "c_elegans_B_1_02_21Apr10_Draco_10-03-05_msgfplus"
cel.path <- sprintf("ftp://PASS00308:[email protected]/MSGFPlus_Results/MZID_Files/%s.zip", dataset)
download.file(cel.path, sprintf("%s.zip", dataset))
unzip(sprintf("%s.zip", dataset))

x <- mzID(sprintf("%s.mzid", dataset))
#500 Mb in rsession.exe

rm(x)
gc()
# still ~500 Mb occupied by R (on Windows 7)
# on Mac OSX it is about ~250 Mb. Less, but still an issue.

# It looks like a memory leak in XML package. I'll double check to be sure.
# A similar issue has been discussed here:
# http://stackoverflow.com/questions/9220849/serious-memory-leak-when-iteratively-parsing-xml-files
# http://www.inside-r.org/packages/cran/XML/docs/matchNamespaces
# http://www.omegahat.org/RSXML/MemoryManagement.html

Standard plot function for mzID objects

I would like to have a graphical display of mzID objects, but would like suggestions as to which should be the default one...

Some ideas:

Score distribution plot for decoy vs. real database (problematic as not all mzID objects are based on TDA approach and the score name varies between search engines)
Some sort of coverage vs. score scatterplot

Please add suggestions...

Should the content of mzID objects be mutable?

I can't seem to decide whether users should be encouraged to change the content of mzID objects through setters. It depends on whether mzID objects should reflect the content of a specific file or merely a collection of data.

If they were mutable should we provide an mzIdentML writer to save changes to a file in native format?

Investigate pepXML schema

Find out whether it is feasible to support pepXML with the same class structure as for mzIdentML.

mapply error: 'zero-length inputs cannot be mixed with those of non-zero length'

Hi,

I'm having the same problem as is reported in #31 , which I don't think was ever solved. I've commented there, but later realised that since it's closed that probably doesn't result in any notifications, so I'm sorry for the duplication!

At any rate: when I try to parse a .mzid file, after a short wait I get the following error. (In #31 it's said that it took about 10 hrs; in my case it's usually a few minutes.)

Error in mapply(sub, paste("^\\Q", database$accession[hasRightName], "\\E", :
  zero-length inputs cannot be mixed with those of non-zero length

I've compared my files to the example .mzid files, but as I'm relatively inexperienced with the format I'm not sure where the problem originates. Any idea what might be causing this?

Thank you!
Tessa
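
For context, the message itself comes from `mapply()` whenever one of its vectorised arguments has length zero while another does not. The sketch below reproduces it with toy inputs and shows a hypothetical guard (`stripPrefix` is not an mzID function). It illustrates the mechanism only and is not a confirmed diagnosis of which input is empty for these files.

```r
# Reproduce the message: the last argument is empty, the others are not
msg <- tryCatch(
  mapply(sub, "^sp_", "", character(0)),
  error = function(e) conditionMessage(e)
)
msg

# Hypothetical guard: skip the vectorised call when there is nothing to do
stripPrefix <- function(prefix, x) {
  if (length(prefix) == 0L || length(x) == 0L) return(x)
  unname(mapply(sub, paste0("^", prefix), "", x))
}

stripPrefix("sp_", "sp_P12345")
stripPrefix("sp_", character(0))
```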