
Comments (12)

stanstrup commented on July 17, 2024

Then by all means, Sir! Blob it! ☺️


stanstrup commented on July 17, 2024

I guess it could be interesting to ask "give me all compounds where this mz appears"...

Is several minutes a big deal? 8 million rows doesn't seem so bad once it is in memory.

Or did you mean simply storing it in the database in that format, but creating the long format when it is read?


jorainer commented on July 17, 2024

Two things:

  1. 8 million rows is what we have for now. I expect this to increase over time with more and more spectra being added - or for other databases than HMDB that might have more spectra.
  2. At present it is a 28-second query to retrieve all spectra. Well, that's way faster than importing the data from the files, but still - users are usually not very patient.

Or did you mean simply storing it in the database in that format, but creating the long format when it is read?

It is the storage of the data - in the end it's not that simple, because I have to serialize/deserialize each individual field when storing/retrieving it from the db. The idea is then to have the expanded version returned (e.g. as a Spectrum2List). But having the data as a BLOB prevents SQL queries of the form where mz > 23.129 and mz < 23.2 to retrieve e.g. only specific compounds. One would have to fetch the full data first and then subset, instead of letting SQLite do the subsetting.
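
To make the trade-off concrete, here is a minimal sketch in R; the table and column names (msms_spectrum_peak, msms_spectrum, mz) are assumptions for illustration, not the actual CompoundDb schema. With the long format SQLite can do the range filtering itself, whereas with BLOBs everything has to be fetched and deserialized before filtering in R.

library(RSQLite)

con <- dbConnect(SQLite(), "compounds.sqlite")

## Long format: SQLite does the subsetting, only matching rows are returned.
hits <- dbGetQuery(
    con,
    "SELECT DISTINCT compound_id FROM msms_spectrum_peak
     WHERE mz > 23.129 AND mz < 23.2")

## BLOB format: the full peak data has to be fetched and deserialized first;
## the m/z filtering then happens in R.
res <- dbGetQuery(con, "SELECT compound_id, mz FROM msms_spectrum")
mzs <- lapply(res$mz, unserialize)
keep <- vapply(mzs, function(z) any(z > 23.129 & z < 23.2), logical(1))
hits_blob <- unique(res$compound_id[keep])

dbDisconnect(con)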


stanstrup commented on July 17, 2024

Is storing it as a blob even faster?


jorainer commented on July 17, 2024

Getting the full table with m/z and intensity values as a BLOB takes ~2 seconds. But this table has only ~450,000 rows, one row per spectrum, and the BLOBs still have to be deserialized. Even so, it is faster.
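
For reference, this is essentially what the serialization of one spectrum's values into a BLOB boils down to with base R - a sketch, not the actual CompoundDb code:

## Writing: turn the numeric m/z vector of one spectrum into a raw vector that
## can be stored in a single BLOB field of the database.
mz <- c(53.0385, 81.0334, 109.0283)
mz_blob <- serialize(mz, connection = NULL)

## Reading: turn the raw vector from the BLOB back into the numeric vector.
identical(unserialize(mz_blob), mz)  # TRUE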

Coming back to your possibly interesting question: so it would be interesting to ask "give me all compounds with an MS/MS spectrum that has a peak at a certain m/z"? How frequently do you think such a query would be used?


jorainer commented on July 17, 2024

Update on the time it takes to fetch the MS/MS data from the database: I compared two functions. The first joins the two tables with the MS/MS peak data (m/z and intensity values) and the MS/MS spectrum annotations and returns the full data. The second queries the table with one row per spectrum and the m/z and intensity vectors stored as a BLOB for each spectrum; this function also deserializes the BLOBs and expands the returned data.frame. The result from both approaches is the same data.frame with 8,240,863 rows.
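
The "expand" step of the second function could look roughly like this - a simplified sketch with made-up column names, not the real .msms_spectra implementation:

## Expand a one-row-per-spectrum data.frame whose mz/intensity columns hold the
## deserialized vectors (list columns) into the long one-row-per-peak format.
expand_peaks <- function(x) {
    n <- lengths(x$mz)
    data.frame(
        spectrum_id = rep(x$spectrum_id, n),
        mz = unlist(x$mz, use.names = FALSE),
        intensity = unlist(x$intensity, use.names = FALSE)
    )
}

## toy input: two spectra with 2 and 1 peaks
x <- data.frame(spectrum_id = c("sp1", "sp2"))
x$mz <- list(c(53.04, 81.03), 95.05)
x$intensity <- list(c(100, 250), 410)
expand_peaks(x)   # 3 rows, one per peak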

Unit: seconds
                    expr      min       lq     mean   median       uq      max neval cld
    dbGetQuery(con, q_a) 27.82240 28.98545 29.61636 30.14851 30.51334 30.87817     3   b
 .msms_spectra(con, q_b) 14.27698 14.96500 15.46380 15.65302 16.05721 16.46141     3  a

If the results are supposed to be returned as a Spectrum2List, the data.frames have to be further processed. Here the second approach has the advantage that the result from the query only has to be deserialized, not expanded.

Unit: seconds
                                                    expr       min        lq      mean    median        uq       max neval cld
                    Spectrum2List(dbGetQuery(con, q_a)) 323.01007 332.65474 344.18153 342.29940 354.76726 367.23511     3   b
 Spectrum2List(.msms_spectra(con, q_b, expand = FALSE))  25.70828  26.67545  27.11244  27.64262  27.81453  27.98643     3  a

So there is no question who the winner is: saving the m/z and intensity vectors per spectrum as a BLOB outperforms storing them as individual numeric values. This is mostly because, with individual values, the huge data.frame has to be split per spectrum, while the BLOB approach returns the data in a form that can be converted almost immediately into the Spectrum2List.
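
This is the step that favours the BLOB layout: after deserialization each spectrum's vectors are already together, so the Spectrum2 objects can be built directly. A toy sketch of that kind of conversion (the actual code in the package may differ):

library(MSnbase)

## Toy per-spectrum data as it looks after deserializing the BLOBs: one list
## element per spectrum, no splitting of a huge long-format data.frame needed.
mzs  <- list(c(53.04, 81.03, 109.03), c(95.05, 123.04))
ints <- list(c(100, 250, 830), c(410, 990))

## One Spectrum2 object per spectrum; these can then be collected into the
## Spectrum2List that is returned to the user.
sps <- mapply(function(mz, int) new("Spectrum2", mz = mz, intensity = int),
              mzs, ints, SIMPLIFY = FALSE)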

Now the same performance tests for the retrieval of spectra for a single compound:

Unit: milliseconds
                                                                expr      min       lq     mean   median       uq      max neval cld
    dbGetQuery(con, paste0(q_a, " where compound_id='HMDB0000001'")) 1.347068 1.371294 1.438904 1.395520 1.484822 1.574123     3   a
 .msms_spectra(con, paste0(q_b, " where compound_id='HMDB0000001'")) 1.047436 1.060169 1.189017 1.072901 1.259807 1.446714     3   a

The BLOB approach is slightly faster, but there is no big difference. When returning a Spectrum2List instead, the BLOB approach is again faster:

Unit: milliseconds
                                                                                                    expr      min       lq     mean   median      uq      max neval cld
                         Spectrum2List(dbGetQuery(con, paste0(q_a, " where compound_id='HMDB0000001'"))) 5.184055 5.230218 5.267699 5.276380 5.30952 5.342661     3   b
      Spectrum2List(.msms_spectra(con, paste0(q_b, " where compound_id='HMDB0000001'"), expand = FALSE)) 3.461168 3.574361 3.824816 3.687553 4.00664 4.325726     3  a

Summarizing: if there is no strong reason that we need direct access to the m/z and intensity values of the individual spectra, I would go for the BLOB solution.


jorainer commented on July 17, 2024

The m/z and intensity values for an MS/MS spectrum are now stored as BLOBs in the msms_spectrum database table. The advantage is that we now have only a single table for MS/MS spectra, with each row containing the data for one spectrum.
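
For illustration, the single-table layout then boils down to something like the following; this is an assumed sketch, the actual column names and types in CompoundDb may differ:

library(RSQLite)

con <- dbConnect(SQLite(), "compounds.sqlite")

## One row per MS/MS spectrum; the peak data lives in two BLOB columns that
## hold the serialized numeric vectors.
dbExecute(con, "
    CREATE TABLE msms_spectrum (
        spectrum_id   TEXT PRIMARY KEY,
        compound_id   TEXT,
        precursor_mz  REAL,
        mz            BLOB,
        intensity     BLOB
    )")

dbDisconnect(con)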

If in the future we see the need to search for spectra based on m/z values (in SQL), we can still change back to the two-table approach. I made sure that only moderate code changes would be necessary for that.


SiggiSmara commented on July 17, 2024

As a side note, one could think about generating an m/z-range index table (e.g. +/- 10 A.U. bins with a half-width A.U. overlap) for the spectra. That would speed up fetching relevant spectra based on input m/z values without having to go fully unblobbed (is that a word?).
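
A sketch of what such an index could look like, only to illustrate the idea (bin width 20 with a half-width overlap, i.e. a step of 10, mirroring the numbers above; all names are made up and this is not an implemented CompoundDb feature):

## Each peak falls into exactly two overlapping bins, so any query window of
## width <= 10 is fully contained in a single bin.
peak_bins <- function(mz, step = 10) {
    k <- floor(mz / step)
    unique(c(k, k - 1))
}

## toy peak lists for two spectra, as deserialized from the BLOBs
spectra_mz <- list(sp1 = c(53.04, 81.03, 147.08), sp2 = c(95.05, 123.04))

## index table: one row per (spectrum_id, mz_bin) pair; this small table could
## live in the database next to the BLOB table and be queried with plain SQL.
idx <- do.call(rbind, lapply(names(spectra_mz), function(id)
    data.frame(spectrum_id = id, mz_bin = peak_bins(spectra_mz[[id]]))))

## candidate spectra that can contain a peak in [147.08 - 5, 147.08 + 5]:
## only the index table is touched, no BLOB has to be deserialized yet.
candidates <- unique(idx$spectrum_id[idx$mz_bin == floor((147.08 - 5) / 10)])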

The use case I am thinking of is similarity search/comparison, where you have a spectrum and would like to find the best n matches. For LC/MS this would almost entirely happen on the MS2 or higher level, which already reduces the possible spectra by a lot since they would have to have the same or a similar parent ion. For GC/MS, though, you would want to perform similarity searches also on MS1 level data.


jorainer commented on July 17, 2024

Thanks for the input @SiggiSmara. Once we have real use cases we can check what makes more sense: doing the search with SQL within the database, or getting all spectra and doing the comparison in R.

The advantage of the second approach would be that we don't have to reinvent the wheel. If I got it correctly, there is already spectrum comparison functionality in MSnbase that we could use here - that's why I'm exporting the spectrum data also as Spectrum2 objects (in this fancy Spectrum2List object).
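
For reference, a minimal sketch of the MSnbase comparison functionality mentioned here, on two toy Spectrum2 objects (values and the chosen similarity function are only examples):

library(MSnbase)

s1 <- new("Spectrum2", mz = c(53.04, 81.03, 109.03), intensity = c(100, 250, 830))
s2 <- new("Spectrum2", mz = c(53.04, 81.03, 123.04), intensity = c(120, 300, 400))

## normalized dot product similarity between the two spectra
compareSpectra(s1, s2, fun = "dotproduct")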


SiggiSmara commented on July 17, 2024

I agree completely that there is no need to reinvent the wheel. I was more thinking of how many spectra you theoretically have to perform such similarity calculations on, and whether a range index would help you automatically exclude spectra that are not even close to meeting the criteria on the SQL side. But as you pointed out, it will become clearer whether this is an issue once people start using the db.


michaelwitting commented on July 17, 2024


jorainer commented on July 17, 2024

Sounds great @michaelwitting 👍

