wkumler / RaMS
R-based access to Mass-Spectrometry data
License: Other
Checklist for RaMS version 1.4 below:
Realized today that m/z group construction could be done with a 1D density-based clustering algorithm like DBSCAN or OPTICS. The perk would be that the "hard" m/z window currently used by mz_group would be relaxed and could instead be determined in a more data-driven way.
There's a paper about this exact idea: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3982975/ and they talk about reducing the computational constraints through some clever preprocessing - necessary, because the current implementation takes a long while for even just 6 files.
Quick proof-of-concept:
library(RaMS)
library(dbscan)
library(ggplot2)
library(magrittr) # for the %>% pipe
ms_filedir <- system.file("extdata", package="RaMS")
ms_files <- list.files(ms_filedir, pattern="LB.*mzML", full.names=TRUE)
msdata <- grabMSdata(ms_files)
# Cluster on the m/z dimension alone
mz_groups <- dbscan(msdata$MS1[,"mz"], eps = 0.0001, minPts = 100)
msdata$MS1$mz_group <- mz_groups$cluster
# Color points by their assigned m/z group
msdata$MS1[mz%between%c(110, 130)] %>%
  ggplot() +
  geom_point(aes(x=rt, y=mz, color=factor(mz_group)))
Hi @wkumler,
Thanks for the nice package, it is very user-friendly and fast.
I am wondering if you could include a function to grab ion injection time?
I understand that such a parameter is not universally present for all types of MS files, but it could be useful for some, e.g., files generated by Orbitraps.
Thanks again.
Dong
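For reference, a hedged sketch of where this value lives when the file records it: in mzML, ion injection time is a cvParam with PSI-MS accession MS:1000927, so it can be pulled with xml2 directly. The file path below is a placeholder and this is not RaMS API, just a proof of concept.

```r
library(xml2)
# "msfile.mzML" is a placeholder path; xml2 names the default namespace "d1"
doc <- read_xml("msfile.mzML")
# MS:1000927 is the PSI-MS accession for "ion injection time"
iit_nodes <- xml_find_all(doc, '//d1:cvParam[@accession="MS:1000927"]')
iit_vals <- as.numeric(xml_attr(iit_nodes, "value"))
```

A grabMSdata-level version would presumably attach these values per-scan alongside the existing rt column.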
At the R Cascadia conference this past weekend, Cari Gostic gave an excellent talk on the interface between tidy data and the arrow package, which handles input/output for Apache Arrow parquet files and datasets. This seems to be a direct upgrade to the tmzML document type and is at least an order of magnitude faster in both creation and retrieval.
Notes:
- `write_dataset` and `open_dataset` plus `collect` from the `dplyr` package.
- `dplyr` commands can be passed directly to an open dataset object, but computations are trickier: `mutate(samp_type=str_extract(filename, "Blk|175m|15m|DCM|Poo|Std"))` needs to be done in R.
- The `filter(mz%between%pmppm(76.039854+1.003355, 5))` computation needs to be done in R (why does `pmppm` work but not simple addition?)
- `str_detect` seems to be very slow if called before `collect`? E.g. `filter(str_detect(filename, "Smp"))`
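A minimal sketch of the round-trip being considered, assuming an already-loaded `msdata$MS1` table (the dataset path and partitioning choice are my assumptions, not a settled design):

```r
library(arrow)
library(dplyr)

# One-time conversion, analogous to the current tmzmlMaker step:
# "MS1_dataset" is a placeholder directory name
write_dataset(msdata$MS1, "MS1_dataset", partitioning = "filename")

# Later sessions: open lazily, filter on-disk, then pull into memory
ds <- open_dataset("MS1_dataset")
eic <- ds %>%
  filter(mz > 118.08, mz < 118.09) %>%
  collect()
```

The filter runs against the parquet files before `collect`, which is where the order-of-magnitude retrieval speedup comes from.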
Hi,
I was wondering if you would be interested in incorporating support for a function to extract DAD data from mzML files? I have some example files that were converted from the Thermo RAW format using the ThermoRawFileParser that encode both MS and DAD data. I figured out how to extract them using the mzR package, but I think I like your approach better since it doesn't come with so many dependencies. If you want, I could try to throw together a grabMzmlUV function and/or send over some example files.
Thanks!
While enjoying the easiness of RaMS, I have a request for a possible new feature.
mzML allows chromatograms to be stored, and when MS data acquired in MRM mode (e.g. from the Sciex .wiff format) is converted to mzML (e.g. via ProteoWizard), the run results are stored as chromatograms rather than spectra. It is possible to convert chromatograms to spectra during conversion to mzML in ProteoWizard, but repeating isolation m/z values become an issue. I was wondering if other chromatograms (besides the TIC and BPC) in mzML could be obtained as a new functionality in RaMS. The information for each chromatogram (i.e., name, precursor m/z, and isolation m/z) is present and should be returned with the chromatogram data. Yet, I do not know exactly where it is stored in the mzML.
Currently, the error below is returned when MRM data is kept as chromatograms.
Error in UseMethod("xml_find_first") :
no applicable method for 'xml_find_first' applied to an object of class "xml_missing"
By converting chromatograms to spectra, the MRM data can be obtained as normal "MS1".
You find attached an example mzML data file with MRM from Nitrosamines as chromatograms.
Example_MRM_Nitrosamines.zip
Please let me know if you have further questions regarding the request/idea.
Thank you in advance for the consideration, and keep up the good work.
Ricardo
Apparently .wiff2 files incorrectly return empty data.tables. Using grabMSdata on them causes the returned object to have MS1, MS2, BPC, and TIC data.tables of zero rows.
Given that RaMS returns data.tables and I often then use data.table functions on them, it may make sense to automatically attach the data.table functions. Of course, this is very rarely recommended and I'm not convinced that it's worth it here but wanted to document it as a potential enhancement.
#' See the package intro on GitHub at https://github.com/wkumler/RaMS and
#' explore the vignettes with \code{vignette("help", package = "RaMS")}
Forgot to include the 1.3.5 details in the NEWS document :( will fix in the next version
Found this error while trying to open direct injection DOM data in RaMS. Makes sense because the single node (accumulation across the entire time of the direct injection) is basically the entire 100MB file. Apparently it's DoS protection but hopefully the MS files are trustworthy.
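One possible workaround (untested against RaMS internals): xml2's `read_xml` accepts libxml2 parser options, and the "HUGE" option lifts the node-size limit behind this DoS protection, so an opt-in flag could pass it through for trusted files. The file path below is a placeholder.

```r
# "DI_DOM.mzML" is a placeholder path; "NOBLANKS" is read_xml's default option
doc <- xml2::read_xml("DI_DOM.mzML", options = c("HUGE", "NOBLANKS"))
```

Since the protection exists for a reason, this probably shouldn't be the default behavior.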
grabMSdata(system.file("extdata", "DDApos_2.mzML.gz", package="RaMS"), grab_what = "MS2")$MS2[premz%between%pmppm(118.0865)]
produces
rt premz fragmz int voltage filename
1: 4.182333 118.0864 51.81098 3809.649 35 DDApos_2.mzML.gz
2: 4.182333 118.0864 58.06422 10133.438 35 DDApos_2.mzML.gz
3: 4.182333 118.0864 58.06590 390179.500 35 DDApos_2.mzML.gz
4: 4.182333 118.0864 59.07371 494165.156 35 DDApos_2.mzML.gz
5: 4.182333 118.0864 59.56195 4696.181 35 DDApos_2.mzML.gz
---
584: 14.897500 118.0865 115.38483 2501.394 35 DDApos_2.mzML.gz
585: 14.897500 118.0865 118.08650 5328035.500 35 DDApos_2.mzML.gz
586: 14.897500 118.0865 118.12283 59140.699 35 DDApos_2.mzML.gz
587: 14.897500 118.0865 119.08417 9048.057 35 DDApos_2.mzML.gz
588: 14.897500 118.0865 119.08983 161270.016 35 DDApos_2.mzML.gz
but
tmzmlMaker(system.file("extdata", "DDApos_2.mzML.gz", package="RaMS"), output_filename = "~/../Desktop/test_tmz.tmzML")
grabMSdata("~/../Desktop/test_tmz.tmzML")$MS2[premz%between%pmppm(118.0865)]
produces
rt premz fragmz voltage int filename
1: 4.182333 118.0864 51.81098 1.72923e-322 3809.649 test_tmz.tmzML
2: 4.182333 118.0864 58.06422 1.72923e-322 10133.438 test_tmz.tmzML
3: 4.182333 118.0864 58.06590 1.72923e-322 390179.500 test_tmz.tmzML
4: 4.182333 118.0864 59.07371 1.72923e-322 494165.156 test_tmz.tmzML
5: 4.182333 118.0864 59.56195 1.72923e-322 4696.181 test_tmz.tmzML
---
584: 14.897500 118.0865 115.38483 1.72923e-322 2501.394 test_tmz.tmzML
585: 14.897500 118.0865 118.08650 1.72923e-322 5328035.500 test_tmz.tmzML
586: 14.897500 118.0865 118.12283 1.72923e-322 59140.699 test_tmz.tmzML
587: 14.897500 118.0865 119.08417 1.72923e-322 9048.057 test_tmz.tmzML
588: 14.897500 118.0865 119.08983 1.72923e-322 161270.016 test_tmz.tmzML
Pretty clearly an encoding error in the voltage column, but I'm not sure where exactly it's coming from; will look into it soon.
There are a couple of functions that I find myself repeatedly needing alongside RaMS that should really just be included in the package. I'm calling these "convenience" functions because I don't strictly need them, but they're convenient to have pre-written (and unit-tested), and this'll make it easier to share them. These would be much like the existing `pmppm` function that I've found very useful.
The first is one for integrating raw MS data using trapezoidal Riemann sums. This is the core step in moving from the mz/rt/int data frame to a feature/area data frame. The general consensus seems to be that trapezoidal integration is the way to go since it 1) nicely handles the uneven spacing between retention times and 2) calculates the exact area under the data points. I quibble with this a little bit and wonder whether an absolute sum would be more accurate/precise because 1) the trapezoidal rule technically underestimates the area under the curve since chromatographic peaks are mostly concave-down, 2) the detector is in fact measuring counts that could be directly summed, and 3) trapezoids can falsely inflate the signal of low-quality peaks by linearly interpolating from a single spike across to the next low data point. But I'm in the minority here, so I think the default of trapezoidal integration is still the way to go. I currently just use some code stolen from `pracma::trapz`, but I'd like to handle this more carefully. The function should just take the rt/int values, and the user should be responsible for filtering those ahead of time.
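A base-R sketch of the trapezoidal option described above (`trapz_int` is a hypothetical name, not part of RaMS):

```r
# Hypothetical helper, not the RaMS API: trapezoidal integration of an EIC.
# Assumes the caller has already filtered down to a single peak's rt/int values.
trapz_int <- function(rt, int){
  stopifnot(length(rt) == length(int), !is.unsorted(rt))
  # Sum of trapezoid areas between consecutive retention times;
  # diff(rt) handles the uneven rt spacing automatically
  sum(diff(rt) * (head(int, -1) + tail(int, -1)) / 2)
}
```

The absolute-sum alternative mentioned above would simply be `sum(int)`.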
The second one is a more heuristic function I wrote because I wanted to pull out EICs from the raw data by grouping things in m/z space. When I extract a large m/z window, I often end up with multiple clear chromatograms in m/z space that can obviously be grouped, but the only apparent way to do this is to specify repeated `ifelse` calls or a `case_when` with lots of m/z cutoffs. Instead, we can use the algorithm currently implemented in ADAP: start with the largest-intensity mass in the data, identify an m/z window around it, then group all of those points into a single mean value. We then remove those points from consideration and repeat the process until there are no points left. The tricky part was figuring out how to do this in a tidy fashion, but I think I'm happy with the implementation that returns a vector in the same order as the data with integer values corresponding to the EICs detected. This can then be passed neatly into a `mutate` statement to create a new column (usually called `mz_group`) which I can then group or color by. More advanced functions could accept the RTs as well and do some clever ROI detection, but I think I'd rather leave this one simple.
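The iterative grouping described above can be sketched in base R like so (this is a hypothetical implementation, not the RaMS one; the local `pmppm` mirrors RaMS's version, which returns a +/- ppm window around a mass):

```r
# Local stand-in for RaMS::pmppm: a (lower, upper) m/z window of +/- ppm
pmppm <- function(mass, ppm = 4) c(mass * (1 - ppm/1e6), mass * (1 + ppm/1e6))

# Hypothetical mz_group: returns integer group IDs in the same order as mz
mz_group <- function(mz, int, ppm = 5){
  group <- integer(length(mz))
  g <- 0L
  while(any(group == 0)){
    g <- g + 1L
    remaining <- which(group == 0)
    # Seed each group on the most intense remaining point
    seed <- remaining[which.max(int[remaining])]
    win <- pmppm(mz[seed], ppm)
    # Assign everything within the window to this group, then repeat
    in_win <- remaining[mz[remaining] >= win[1] & mz[remaining] <= win[2]]
    group[in_win] <- g
  }
  group
}
```

The output slots straight into `mutate(mz_group = mz_group(mz, int))` for grouping or coloring.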
The final one is a plotting option - I originally avoided including this because I wanted people to learn/use the actual `ggplot2` syntax, but my poor fingers are tired of typing the exact same `ggplot() + geom_line(aes(x=rt, y=int, group=filename))` over and over again. I think instead I'm going to add a `qplotMSdata` function that is basically just that line of code. It would also be neat to have arguments for `color=` and `facet=` because I often color/facet by additional columns too. Unfortunately, adding `ggplot2` would add a bunch of dependencies, and I'm trying to keep RaMS lightweight, so I think it'll be a mimic for now. I wonder if I can add `ggplot2` code and only run it if `ggplot2` is already loaded, or if that would cause issues with CRAN checks?
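The conditional-dependency question has a standard answer: list ggplot2 under Suggests: in DESCRIPTION and guard every use with `requireNamespace()`, which CRAN checks accept. A sketch of what that might look like (function name and signature are placeholders, not the final API):

```r
# Hypothetical qplotMSdata: ggplot2 stays in Suggests and is only
# touched if the user actually has it installed
qplotMSdata <- function(ms1_table){
  if(!requireNamespace("ggplot2", quietly = TRUE)){
    stop("qplotMSdata requires the ggplot2 package to be installed")
  }
  ggplot2::ggplot(ms1_table) +
    ggplot2::geom_line(ggplot2::aes(x = rt, y = int, group = filename))
}
```

The `color=` and `facet=` arguments could be layered on the same way, each wrapped in the `requireNamespace` guard.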
First release:
- `usethis::use_cran_comments()`
- Proofread `Title:` and `Description:`
- Check that all exported functions have `@returns` and `@examples`
- Check that `Authors@R:` includes a copyright holder (role 'cph')

Prepare for release:
- `devtools::build_readme()`
- `urlchecker::url_check()`
- `devtools::check(remote = TRUE, manual = TRUE)`
- `devtools::check_win_devel()`
- `rhub::check_for_cran()` (checked, but get PREPERROR on the Linux build due to the libxml2 library not being found)

Submit to CRAN:
- `usethis::use_version('major')` (performed manually)
- `devtools::submit_cran()`

Wait for CRAN...
- `usethis::use_github_release()`
- `usethis::use_dev_version()`
I keep getting burned by non-mzML files included in the files passed to grabMSdata, but I only find out about it after I've sunk a bunch of time into extracting all the actual mzMLs, when it finally hits one and crashes. Two solutions for later: one, check the filetype before loading any of the files, which is simple enough to do by checking the extension but might throw some false positives; or two, throw a warning that the file type isn't recognized and ignore it. Adding to the v1.4 milestone, maybe?
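A sketch of option one, checking extensions up front (the function name and the exact regex are my assumptions about which extensions RaMS handles):

```r
# Hypothetical pre-flight check: warn about and drop anything that isn't
# an mzML/mzXML file (optionally gzipped) before any parsing happens
check_filetypes <- function(files){
  known <- grepl("\\.mzx?ml(\\.gz)?$", files, ignore.case = TRUE)
  if(any(!known)){
    warning("Unrecognized file type(s), skipping: ",
            paste(files[!known], collapse = ", "))
  }
  files[known]
}
```

This is the warn-and-ignore variant of option two folded into option one; a stricter version could `stop()` instead.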
I wonder if it would be possible to add an option to `grabMSdata` that constructs the tmzMLs on the fly rather than making it a separate step. Instead of reading the mzMLs directly into memory, the option (`as_tmzML = TRUE`?) would instead convert the files to tmzML in a temporary directory, then construct and return the tmzML object. Given the (intentional) similarities between the two types, I wonder if it's possible to streamline this, because I often find myself held back by the initial tmzML construction step and end up spending more time waiting for the files to load repeatedly. This could also be enabled automatically if memory limits are approached - if the total size of the files to be loaded exceeds, say, a quarter of the system's RAM, it could throw a warning and suggest using `as_tmzML = TRUE`.
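One possible answer to the cleanup worry: build the temporary tmzMLs under `tempdir()`, which R removes automatically when the session ends normally. A sketch (all names here are assumptions):

```r
# Files under tempdir() are cleaned up by R at normal session exit, which
# sidesteps most of the on.exit() question; a crash still leaves leftovers,
# but the OS eventually clears its own temp space
tmz_dir <- file.path(tempdir(), "RaMS_tmzML")
dir.create(tmz_dir, showWarnings = FALSE)
# Hypothetical conversion loop over the input files:
# tmzml_paths <- vapply(ms_files, function(f){
#   tmzmlMaker(f, output_filename = file.path(tmz_dir, basename(f)))
# }, character(1))
```

This doesn't solve the hard-crash case, but it bounds how long orphaned files can accumulate.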
Expected issues include:
- `on.exit` seems like it requires an active function, but maybe there's an equivalent for when the R session ends overall? How to handle R crashing? This could be a major issue because mass spec files are big and could easily clog up a user's system if not cleared out regularly.
- What if an `as_tmzML` folder is requested?
- `glimpse` functionality...

I've been using RaMS for about a year and it is amazing; finally an easy, dependency-light way to read MS data quickly and manipulate it tidily! My feature request: could there be a way to label the different types of MS1 scans acquired in the same experiment? (I often acquire data like this when varying ion source parameters.)
I have *.mzML
files generated from Sciex *.wiff2
files that were acquired from a qTOF that was running multiple MS1 scan types. Sciex calls the different scan types "experiments" and in the XML (generated via ProteoWizard) these different scan types are referred to like this:
<spectrum index="0" id="sample=1 period=1 cycle=1 experiment=2" defaultArrayLength="2271">
[...]
<spectrum index="1" id="sample=1 period=1 cycle=1 experiment=4" defaultArrayLength="4300">
[...]
<spectrum index="3" id="sample=1 period=1 cycle=1 experiment=7" defaultArrayLength="3">
I'd be happy to supply an example mzML file.
I imagine one output type might be an extra column (relative to what `grab_what = c("MS1")` returns) containing the spectrum id strings like `sample=1 period=1 cycle=1 experiment=4`.
So "streaming" files no longer works - it looks like they switched over to an FTP system and no longer expose the raw URLs to the browser. The README needs updating, as well as any examples that do this (grabMSdata? others?).
grabMzxmlBPC <- function(xml_data, TIC=FALSE, rtrange, incl_polarity){
  # Find all MS1 scan nodes
  scan_nodes <- xml2::xml_find_all(
    xml_data, '//d1:scan[@msLevel="1"]'
  )
  # Parse retention times from ISO 8601 durations (e.g. "PT0.1821S")
  rt_chrs <- xml2::xml_attr(scan_nodes, "retentionTime")
  rt_vals <- as.numeric(gsub(pattern = "PT|S", replacement = "", rt_chrs))
  # Heuristic: values above 150 are assumed to be seconds, convert to minutes
  if(any(rt_vals>150)){rt_vals <- rt_vals/60}
  int_attr <- ifelse(TIC, "totIonCurrent", "basePeakIntensity")
  int_vals <- as.numeric(xml2::xml_attr(scan_nodes, int_attr))
  if(!is.null(rtrange)){
    int_vals <- int_vals[rt_vals%between%rtrange]
    rt_vals <- rt_vals[rt_vals%between%rtrange]
  }
  return(data.table(rt=rt_vals, int=int_vals))
}
This should instead use the method that was patched in another PR (I don't have the brainspace to find it right now).
Hi,
The mzML files written by OpenChrom are missing a namespace declaration that is required for RaMS to parse the files correctly. It throws an error: "Error in xml_find_first.xml_node(xml_data, paste0("//d1:", node_to_check)) : xmlXPathEval: evaluation failed".
If I manually add the namespace declaration to the file, it parses fine.
I'm not sure what the best fix would be here. I'm not too familiar with mzML, so I don't know whether these namespace declarations are "required" for the files to be valid. In any case, I am wondering if there is a way to add a check in RaMS for the namespace declaration and maybe modify the XPaths accordingly so that RaMS can read the OpenChrom files without modification. Alternatively, I guess it would be possible to add the namespace declaration to the XML after importing it into R. I'm not sure which would be the better approach here.
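One possible version of that check (hedged, untested on OpenChrom output): xml2 names a file's default namespace "d1", so its absence can be detected with `xml_ns()` and the XPaths switched to prefix-free ones. The file path below is a placeholder.

```r
# "openchrom_file.mzML" is a placeholder path
doc <- xml2::read_xml("openchrom_file.mzML")
# If no default namespace was declared, drop the "d1:" prefix from XPaths
prefix <- if("d1" %in% names(xml2::xml_ns(doc))) "d1:" else ""
first_spectrum <- xml2::xml_find_first(doc, paste0("//", prefix, "spectrum"))
```

This would let both well-formed and OpenChrom-style files pass through the same query code.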
I'm attaching an example file that was produced by OpenChrom (with the .txt extension because GitHub wouldn't let me upload mzML).
alkanes 0.25 mg-ml_2-10-2021.txt
Thanks!
Ethan
The DDA file from the Skyline folks in their DIA tutorial throws a bunch of warnings when run. Prying into these revealed that the mzXML file sometimes has an encoding of "none" and sometimes has an encoding of zlib (aka gzip).
Scan 1: none
Scan 2: zlib
Scan 3: none
Scan 4: none
Scan 5: none
Scan 6: none
Scan 7: zlib
with no clear pattern. I had always assumed that the same compression would be applied to every scan's binary data (and RaMS only reads out the first peak's value). I'll have to switch it over to per-scan encodings, which is super annoying.
<scan num="1" msLevel="1" peaksCount="17" polarity="+" scanType="Full" filterLine="FTMS + p NSI Full ms [400.00-2000.00]" retentionTime="PT0.1821S" injectionTime="PT0.1000S" lowMz="423.914" highMz="1765.56" basePeakMz="1765.56" basePeakIntensity="779.182" totIonCurrent="10071.9">
<peaks precision="32" byteOrder="network" contentType="m/z-int" compressionType="none" compressedLen="0">Q9P1BUPwxShD3HUiRAlb1EPe6yBD+KoQRAOuX0PvpUdEBuBDRAdZKkQH489EAu6lRAmIZEQfVKJEGxU5RBL2OUQj+/pEIzpVRDzChUQky15EXYj5RAlMgERrSNREDDeVRHfOj0QNYvNEfzuWRDAJVESVvzREGRwVRK8DZUQsofNE3LIIRELLow==</peaks>
</scan>
<scan num="2" msLevel="1" peaksCount="352" polarity="+" scanType="Full" filterLine="FTMS + p NSI Full ms [400.00-2000.00]" retentionTime="PT0.6401S" injectionTime="PT0.1000S" lowMz="400.251" highMz="1753.41" basePeakMz="445.117" basePeakIntensity="2.62395e+006" totIonCurrent="1.77468e+007">
<peaks precision="32" byteOrder="network" contentType="m/z-int" compressionType="zlib" compressedLen="2622">eJwN1vlfzWkfx/GcFnXqtKAmKqQI2UY0JrJ93tf1PaftFJP1niTLhAnNiDv3GLK1nTqRUBNZk5GkRbJlG057Km2kkjZLU9ZKy+2n1+P5H7xINXoi3/bDDFJN2s6uyB+TalcCN3ObQ6pT63Bk8G1SJZfhWFwMqfoOsaT9CZS7c4CtNhVR7q6xaLZwotyvRthgOYvydh5GkEJJeRFOXD7bjfISRfxobyPlpfewhXNXUr7Ehy3dkkr547awxzeaKD/gLA5tEFO+ohdHeB3lZ7Zhl8dMKsh6jpD9elQoXYqfxfepMCZBCGCnqDC1Gxs8t1Nh2nlWoN1GRSYWPNA+j4pSNFhlzU4qulKHhv3mVDyFIdBXl4r3ewhrVUoq0XDiR4sLqCQog7m1r6aSfX3cvuU6lYSlsev631PJ+VY0PE6mJxrq7KBlID0RJyEhbhI92S3iVavK6Mk+d8TlCPQkSV16s0dJTzJrEPiqk54MiLh2+xcqHe8vtMp6qPTPWNZ+JpxKk8r5nj+6qTTrLd48r6LS/jJ8WGVPZZbZzGdBKZVNmMlc3e5S2Q0R81nsQuUXtkPlfI/KU4ayCMer9HTKdpRP1aOnCR8Qtmo3PU014B/yMujp1Q8oun2WKqYlsQbJIqrwz5KJo/2oIuyrTPNXRhUJW+Fv5UkV760ETx8zqtTVkC78okOVHptx+8B3VOkvl/3p6EmVIUuFaQFrqDIsQfjQ3kWVndlC8QM9qtJezhfbelGVOEP4bvUBqtraKnP17qCqAxe49solVBXcyU1zs6gqbD4fNOccVUVZI66rkaou9gh+ojNUldWL0Ht2VNW5WEh89pmqNVW4ZexH1YPv4OFrXaoWH0CO3J+qJ3rx1Ta+VL11rzSl/5uV5TjsakrV2S047jqWqjs6Bb2fd1PNVnPhQvYFqsmah8tLa6jmZg+7clJENR0nuNGjWnq2pYzPW5xGzzrcsD5oMj1P00W6ZwfVHkoURB4nqTZDgwXsNaQXRmncyIDohZ8L+29wBb3YsgVFK5vpRWS302gbB3pxJYsVz9hIL96F4vKs+1RnoCGLuLuN6qbeQNnoLqpTrBY2nGigushjsrGNi6juUqv0bE0u1eu58v8Z7qJ6/d3S6Ws9qH7SEuHhL01Ub3cVmzbLqT70KpsyrpTqw5u5d0A41UfOF+xeiag+WcUDZFnUoPMcvtYO1KD7luHB39Sg78Ht445Rw9i9qAv+Sg2TLfBl1StqiHiJ/Iyh9PIvBVf9HUgv2+wRnFFNjWYvuYmBNzWmN+HoMRW9iqnAwwmf6NW5fI77f1DT2MlMrq5LTTHxPDl2BjUbF3CLm98auAPHpg5Qc4wnS/Jqp+bLH7npL97UMsRTmLOtl1qmbMSXO7HUEsVkmiuyqSVNgYOZN6jV0F3qWrmcWpUNQk+lH7XmVKFZ6xS1GZQJ+imrqU35Gx9+5yi9zuxn6zNy6M2xOgRNLqR3e3RYzqPj1K62AKX2SmofLmG/GY2i9t0PmW/LSmqPnY70R+b074Q0tqJ2FHVkvkHYA0vqvK7Jrfzr6f2sYPTVjKCPB9O4us08+qR1mx2fVk+f9nXjhH84fTq4G2+2jqMvp2NZYnU/dVl2o2JdHHWlnkLFqEPUlf2etwT/RN0OPnxkYCF138hFSHY9fY2QCpWmVvQ1YwhbFvmOeiXLhAXXBlGvqy/+9L9HvWHxgmXkaeqNkPCvQ65Tb/QedmRtAvWJb/CQW87UJ7Fmt9cFUt/QJ8zVyZb6Qobio3ET9YUt5dNb7ahPsQQv//OM+qKHIHyhGfWL3VhMWzT1h2ZjV1oEDcQqmG/pFRq42SuM8X4MtZG2fHW2Empe3vhHqMMgWT9PiZ+BQaH9XC3RDINU52XOsgoMqrXAif5EiLRTpY8q
N0Ek3Sn1fxgFUbAT8zAbDFFIoHDFzAGif/5li/zvQPTYjIn7cyBSyXl72Amoa09kFY2VUPdqZ6pKB6gnWOOO3nRouG7lWaG3oRGxh1XMfwCNwrnspbsXNIqiEPd9EzQlTogfroSm82dhoNEUmgo1IbRxAjTzEtjOy1ehmV8gDO3+AVriFfwF7KClW84Pnv4ILdki9CbsgpbTcs7tvtk5hf1xfz20wqaz9DsHoRVejovzMqGV54Doe0OhswDImb8DOhHX2Lt5b6BTeJIFnboNsUSJnm1rIBb24NbNh9B7GIwUr0+QaOxEx4j5kJAhriZJIBH2orPsOgyFEmHTHicYBtfzkubZMHwUwL237Ybh42x+1GsbjLQ2s4v1FjAafI+9uTQVRtyHHd++HEYHfdDtPoCh+SP5ktHWGKY7jWXnj8EwWYk0epKAYaFVUvO7+himOsqebp2NYbnpQot6N4y1I9mIthkw1kkVRn5aAWPpAra4wQ7GstPc45A/jEOD2PmvCTDR9+Pl/66EidsB7mz6DiaR3qw8eR5MilbhuO9IfFdoKU33HAZTyWTpmqvnYeoyIB0W1wzT8ABuFbMfphHagp9JFkzzd7DzNeUwLWjix5LnYLjuOjavbQDD9bpYwwQZzPXzeeZIe5i7FfHlY6JhHpnO9XQ0YV6Uzf7AZFjoX0JJfjtG6tnj47LfMUb1lf8TuxZWOhIu1bgGKyGKbTySDSuZIVdnF2EV/CtKms1gFfKWrRl8AtbOwTj9KgzjFGuRem0SxuV1SYfUpcBGV024kvoGNk5NwmHzMNiEtfAXw1phk/eQz1iWhfHiRub3Kg8T3WRS69mOmBi5RKj4/AATi8YLyht6sNVTsOjBI2CrP1PQv+sLW5d1KPRKga3bHM7GK2Cr8MNftybANnIGW9R5GFOUHdxYTQtTSky4oXgXphp0s9jvN2Cq/NsPDFGDnUQPipaNsHNpQMVuJWY6WbEyrfGYGWbB4nkpZubZsB3BnbAX6+HE+0T86GIpfJ4VhR8VU4T8gHj8WKAvhPuNhYNuH681T4SDcyUPXLYdDgoJqzibDof8t8jMmIY5xf/jTR71cDQI4dHLRsBRvoZfyDgDx4gHXOSZCkflXa6Rq4Rj4VXmmecMx+LDUH3JwVzJc6Yw8sACww6kDLfEgqi3CHH9CxDacGRUOfDoN5yRb4aQ746cB71wLnQVXB54wkUiF+aec4TLhPfY8PwNXFw9eJ98LlwifLhCZAiXQkuuu78LrhILpK/dDrl7Mte4dw3yqIssNfcj5CWJrLbtJdz17zH3xn1wN/yAv+9Zw92tB/F7f4Z7ZC8KFu7DCsUBJPZ0YEXBKubfcAkr9ZSIN+uDl1sAH2gwgFdkAK81bIZX0a+8hGVilf42PpudxSq3n1hH4iv4RM3E4ZZNWGM4GzErfoefJAaBP6zH1mIl14wUw9/gEh8tNMFfHs+nvk6CvzKevetygH+xN1vq8AFBSXLmm6SJIFUAW5d8HkGtXuz0jJPYq72GedfOw75iJyjV3LF/YwIOLZyOsNbXyLwchPAJYjRdP4VYk2c4dP0Qzn7Oxp6ENOQ+2odzz7XwbLINjiRb/B8ZS+lz</peaks>
</scan>
<scan num="3" msLevel="1" peaksCount="84" polarity="+" scanType="Full" filterLine="FTMS + p NSI Full ms [400.00-2000.00]" retentionTime="PT0.9075S" injectionTime="PT0.0023S" lowMz="407.983" highMz="1790.98" basePeakMz="445.12" basePeakIntensity="3.00796e+006" totIonCurrent="1.64408e+007">
<peaks precision="32" byteOrder="network" contentType="m/z-int" compressionType="none" compressedLen="0">Q8v920asXxtDzZYNRwcRFEPPhMhHQR1XQ8/i7kagqMtD0aWgRqjUyEPRqHNIwsWoQ9Io30fvKtBD1nIdSQBwQkPWi2pJLuT8Q9bykUexHVZD1wttSF/ySUPXccBJcLopQ9eLAkg0avND1/InR0Jy3UPYcVZJMPqXQ9jx1UfkrS5D2XEDSKTaGEPacJNHqFcdQ96Pa0o3l1ND3w94SYGzm0PfjNZHMKAUQ9+O9UlKvzZD36yESJBP00PgDvJIKL3bQ+As6kedu99D55DHSRq49kPnksFGptqKQ+ffWEamyqND6BDLSGjWckPokHtH7zsSQ+kQnkcFr3RD+LpLRw7pfEP7jcZHzBPnQ/wOBkcmXgpD/7xoR7XF/EQBil9G0CXGRAHI7UlzCfREAgjtSKzMwEQCSLdIZwmfRAKIxUeX5FxEBUw3SCa2UEQFjFZHHFy2RAWvRUbFIHpEBcwiRsxJ+kQLNWJGsVU6RAvMcEbNe0FEEMbIRvEJKUQUMsJGr9ZvRBRKKEiE0AhEFIosSExdg0QUygJH74qPRBUJ/0cuEa1EF817SMdhy0QYDX9Iioq7RBhNNUgOsMVEGI02R2ylHEQgUlpGoqg8RCbLZUfS0XBEJwtbR8oaPEQnS0dHQpPURCpOoUgT0eJEKo66R7HmVEQqzshHJfxdRCsOhUbtkfhENQntRsdXuUQ14uRGuu73RDZ3XkbuMw9ENxdNRql+CUQ4zf1GySnfRDlMkUe1PBJEOYyaR53SZUQ5zKxG3sT1RDoMYkbon1NEOvsNRqo4MkRLFOVGyFTLREtyJkbE4DFES4PARrhXdERLzbFHHdvDREwN60cPEzxElaiKRtVP1ESa4JNG6GD2RKnHHUbZld5EuR0xRuc/PkTf34JG/H+s</peaks>
</scan>
and instead of
vals <- lapply(all_peak_nodes, function(binary){
  if(!nchar(binary))return(matrix(ncol = 2, nrow = 0))
  decoded_binary <- base64enc::base64decode(binary)
  raw_binary <- as.raw(decoded_binary)
  decomp_binary <- memDecompress(raw_binary, type = file_metadata$compression)
  final_binary <- readBin(decomp_binary, what = "numeric",
                          n=length(decomp_binary)/file_metadata$precision,
                          size = file_metadata$precision,
                          endian = file_metadata$endi_enc)
  matrix(final_binary, ncol = 2, byrow = TRUE)
})
I'll have to do something like
all_peak_nodes <- xml2::xml_text(xml2::xml_find_all(xml_nodes, xpath = "d1:peaks"))
all_peak_encs <- xml2::xml_attr(xml2::xml_find_all(xml_nodes, xpath = "d1:peaks"), "compressionType")
vals <- mapply(function(binary, encoding_i){
  if(!nchar(binary))return(matrix(ncol = 2, nrow = 0))
  decoded_binary <- base64enc::base64decode(binary)
  raw_binary <- as.raw(decoded_binary)
  # Note: the "zlib" attribute value may need mapping to a type that
  # memDecompress accepts (e.g. "gzip"), and "none" scans should skip
  # decompression entirely
  decomp_binary <- memDecompress(raw_binary, type = encoding_i)
  final_binary <- readBin(decomp_binary, what = "numeric",
                          n=length(decomp_binary)/file_metadata$precision,
                          size = file_metadata$precision,
                          endian = file_metadata$endi_enc)
  matrix(final_binary, ncol = 2, byrow = TRUE)
}, all_peak_nodes, all_peak_encs, SIMPLIFY = FALSE)
Hi @wkumler, I have been looking into ways I could plot XICs for precursors and their respective fragments from DIA data, independently of the search software. Your package seems very promising, but I can see the current functions are restricted to DDA and SRM. I was wondering if you have plans to implement support for DIA, and specifically XICs from MS1 and MS2 for precursors and fragments?