
Comments (12)

stanstrup commented on July 17, 2024

Then by all means, Sir! Blob it! ☺️


stanstrup commented on July 17, 2024

I guess it could be interesting to ask "give me all compounds where this mz appears"...

Is several minutes a big deal? 8 million rows doesn't seem so bad once it is in memory.

Or did you mean simply storing it in the database in that format, but creating the long format when it is read?


jorainer commented on July 17, 2024

Two things:

  1. 8 million rows is what we have for now. I expect this to increase over time with more and more spectra being added - or for other databases than HMDB that might have more spectra.
  2. At present it is a 28-second query to retrieve all spectra. Well, that's way faster than importing the data from the files, but still - users are usually not very patient.

Or did you mean simply storing it in the database in that format, but creating the long format when it is read?

It is the storage of the data - in the end it's not that simple, because I have to serialize/deserialize each individual field when storing/retrieving it from the db. The idea is then to have the expanded version returned (e.g. as a Spectrum2List). But having the data as a BLOB prevents SQL queries of the form where mz > 23.129 and mz < 23.2 to retrieve e.g. only specific compounds. One would have to fetch the full data first and then subset, instead of letting SQLite do the subsetting.
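
To make the trade-off concrete, here is a minimal sketch in R; the table and column names (msms_spectrum_peak, msms_spectrum, mz) are assumptions for illustration, not the actual CompoundDb schema. With the long format SQLite can do the range filtering itself, whereas with BLOBs everything has to be fetched and deserialized before filtering in R.

library(RSQLite)

con <- dbConnect(SQLite(), "compounds.sqlite")

## Long format: SQLite does the subsetting, only matching rows are returned.
hits <- dbGetQuery(
    con,
    "SELECT DISTINCT compound_id FROM msms_spectrum_peak
     WHERE mz > 23.129 AND mz < 23.2")

## BLOB format: the full peak data has to be fetched and deserialized first;
## the m/z filtering then happens in R.
res <- dbGetQuery(con, "SELECT compound_id, mz FROM msms_spectrum")
mzs <- lapply(res$mz, unserialize)
keep <- vapply(mzs, function(z) any(z > 23.129 & z < 23.2), logical(1))
hits_blob <- unique(res$compound_id[keep])

dbDisconnect(con)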


stanstrup commented on July 17, 2024

Is storing it as a blob even faster?


jorainer commented on July 17, 2024

Getting the full table with m/z and intensity values as a BLOB takes ~2 seconds. But this table has only ~450,000 rows, one row per spectrum, and the BLOBs still have to be deserialized. Even so, it is faster.
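
For reference, this is essentially what the serialization of one spectrum's values into a BLOB boils down to with base R - a sketch, not the actual CompoundDb code:

## Writing: turn the numeric m/z vector of one spectrum into a raw vector that
## can be stored in a single BLOB field of the database.
mz <- c(53.0385, 81.0334, 109.0283)
mz_blob <- serialize(mz, connection = NULL)

## Reading: turn the raw vector from the BLOB back into the numeric vector.
identical(unserialize(mz_blob), mz)  # TRUE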

Coming back to your possibly interesting question: so it would be interesting to ask "give me all compounds with an MS/MS spectrum that has a peak at a certain m/z"? How frequently do you think such a query would be used?


jorainer commented on July 17, 2024

Update on the time it takes to fetch the MS/MS data from the database: I compared two functions. The first joins the two tables with the MS/MS peak data (m/z and intensity values) and the MS/MS spectrum annotations and returns the full data. The second queries the table with one row per spectrum and the m/z and intensity vectors stored as a BLOB for each spectrum; this function also deserializes the BLOBs and expands the returned data.frame. The result from both approaches is the same data.frame with 8,240,863 rows.
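
The "expand" step of the second function could look roughly like this - a simplified sketch with made-up column names, not the real .msms_spectra implementation:

## Expand a one-row-per-spectrum data.frame whose mz/intensity columns hold the
## deserialized vectors (list columns) into the long one-row-per-peak format.
expand_peaks <- function(x) {
    n <- lengths(x$mz)
    data.frame(
        spectrum_id = rep(x$spectrum_id, n),
        mz = unlist(x$mz, use.names = FALSE),
        intensity = unlist(x$intensity, use.names = FALSE)
    )
}

## toy input: two spectra with 2 and 1 peaks
x <- data.frame(spectrum_id = c("sp1", "sp2"))
x$mz <- list(c(53.04, 81.03), 95.05)
x$intensity <- list(c(100, 250), 410)
expand_peaks(x)   # 3 rows, one per peak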

Unit: seconds
                    expr      min       lq     mean   median       uq      max neval cld
    dbGetQuery(con, q_a) 27.82240 28.98545 29.61636 30.14851 30.51334 30.87817     3   b
 .msms_spectra(con, q_b) 14.27698 14.96500 15.46380 15.65302 16.05721 16.46141     3  a

If the results are supposed to be returned as a Spectrum2List, the data.frames have to be further processed. Here the second approach has the advantage that the result from the query only has to be deserialized, not expanded.

Unit: seconds
                                                    expr       min        lq      mean    median        uq       max neval cld
                    Spectrum2List(dbGetQuery(con, q_a)) 323.01007 332.65474 344.18153 342.29940 354.76726 367.23511     3   b
 Spectrum2List(.msms_spectra(con, q_b, expand = FALSE))  25.70828  26.67545  27.11244  27.64262  27.81453  27.98643     3  a

So there is no question who the winner is: saving the m/z and intensity vectors per spectrum as a BLOB outperforms storing them as individual numeric values. This is mostly because, with individual values, the huge data.frame has to be split per spectrum, while the BLOB approach returns the data in a form that can be converted almost immediately into the Spectrum2List.
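
This is the step that favours the BLOB layout: after deserialization each spectrum's vectors are already together, so the Spectrum2 objects can be built directly. A toy sketch of that kind of conversion (the actual code in the package may differ):

library(MSnbase)

## Toy per-spectrum data as it looks after deserializing the BLOBs: one list
## element per spectrum, no splitting of a huge long-format data.frame needed.
mzs  <- list(c(53.04, 81.03, 109.03), c(95.05, 123.04))
ints <- list(c(100, 250, 830), c(410, 990))

## One Spectrum2 object per spectrum; these can then be collected into the
## Spectrum2List that is returned to the user.
sps <- mapply(function(mz, int) new("Spectrum2", mz = mz, intensity = int),
              mzs, ints, SIMPLIFY = FALSE)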

Now the same performance tests for the retrieval of spectra for a single compound:

Unit: milliseconds
                                                                expr      min       lq     mean   median       uq      max neval cld
    dbGetQuery(con, paste0(q_a, " where compound_id='HMDB0000001'")) 1.347068 1.371294 1.438904 1.395520 1.484822 1.574123     3   a
 .msms_spectra(con, paste0(q_b, " where compound_id='HMDB0000001'")) 1.047436 1.060169 1.189017 1.072901 1.259807 1.446714     3   a

The BLOB approach is slightly faster, but there is no big difference. When returning a Spectrum2List instead, the BLOB approach is again faster:

Unit: milliseconds
                                                                                                    expr      min       lq     mean   median      uq      max neval cld
                         Spectrum2List(dbGetQuery(con, paste0(q_a, " where compound_id='HMDB0000001'"))) 5.184055 5.230218 5.267699 5.276380 5.30952 5.342661     3   b
      Spectrum2List(.msms_spectra(con, paste0(q_b, " where compound_id='HMDB0000001'"), expand = FALSE)) 3.461168 3.574361 3.824816 3.687553 4.00664 4.325726     3  a

Summarizing: if there is no strong reason that we need direct access to the m/z and intensity values of the individual spectra, I would go for the BLOB solution.


jorainer commented on July 17, 2024

The m/z and intensity values for an MS/MS spectrum are now stored as BLOBs in the msms_spectrum database table. The advantage is that we now have only a single table for MS/MS spectra, with each row containing the data for one spectrum.
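
For illustration, the single-table layout then boils down to something like the following; this is an assumed sketch, the actual column names and types in CompoundDb may differ:

library(RSQLite)

con <- dbConnect(SQLite(), "compounds.sqlite")

## One row per MS/MS spectrum; the peak data lives in two BLOB columns that
## hold the serialized numeric vectors.
dbExecute(con, "
    CREATE TABLE msms_spectrum (
        spectrum_id   TEXT PRIMARY KEY,
        compound_id   TEXT,
        precursor_mz  REAL,
        mz            BLOB,
        intensity     BLOB
    )")

dbDisconnect(con)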

If in the future we see the need to search for spectra based on m/z values (in SQL), we can still change back to the two-table approach. I made sure that only moderate code changes would be necessary for that.


SiggiSmara commented on July 17, 2024

As a side note, one could think about generating an m/z-range index table (e.g. +/- 10 A.U. bins with a half-width A.U. overlap) for the spectra. That would speed up fetching relevant spectra based on input m/z values without having to go fully unblobbed (is that a word?).
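
A sketch of what such an index could look like, only to illustrate the idea (bin width 20 with a half-width overlap, i.e. a step of 10, mirroring the numbers above; all names are made up and this is not an implemented CompoundDb feature):

## Each peak falls into exactly two overlapping bins, so any query window of
## width <= 10 is fully contained in a single bin.
peak_bins <- function(mz, step = 10) {
    k <- floor(mz / step)
    unique(c(k, k - 1))
}

## toy peak lists for two spectra, as deserialized from the BLOBs
spectra_mz <- list(sp1 = c(53.04, 81.03, 147.08), sp2 = c(95.05, 123.04))

## index table: one row per (spectrum_id, mz_bin) pair; this small table could
## live in the database next to the BLOB table and be queried with plain SQL.
idx <- do.call(rbind, lapply(names(spectra_mz), function(id)
    data.frame(spectrum_id = id, mz_bin = peak_bins(spectra_mz[[id]]))))

## candidate spectra that can contain a peak in [147.08 - 5, 147.08 + 5]:
## only the index table is touched, no BLOB has to be deserialized yet.
candidates <- unique(idx$spectrum_id[idx$mz_bin == floor((147.08 - 5) / 10)])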

The use case I am thinking of is similarity search/comparison, where you have a spectrum and would like to find the best n matches. For LC/MS this would almost entirely happen on the MS2 or higher level, which already reduces the possible spectra by a lot since they would have to have the same or a similar parent ion. For GC/MS, though, you would want to perform similarity searches also on MS1 level data.


jorainer commented on July 17, 2024

Thanks for the input @SiggiSmara. Once we have real use cases we can check what makes more sense: doing the search with SQL within the database, or getting all spectra and doing the comparison in R.

The advantage of the second approach would be that we don't have to reinvent the wheel. If I got it correctly, there is already spectrum comparison functionality in MSnbase that we could use here - that's why I'm exporting the spectrum data also as Spectrum2 objects (in this fancy Spectrum2List object).
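
For reference, a minimal sketch of the MSnbase comparison functionality mentioned here, on two toy Spectrum2 objects (values and the chosen similarity function are only examples):

library(MSnbase)

s1 <- new("Spectrum2", mz = c(53.04, 81.03, 109.03), intensity = c(100, 250, 830))
s2 <- new("Spectrum2", mz = c(53.04, 81.03, 123.04), intensity = c(120, 300, 400))

## normalized dot product similarity between the two spectra
compareSpectra(s1, s2, fun = "dotproduct")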


SiggiSmara commented on July 17, 2024

I agree completely that there is no need to reinvent the wheel. I was more thinking of how many spectra you theoretically have to perform such similarity calculations on, and whether a range index would help you automatically exclude spectra that are not even close to meeting the criteria on the SQL side. But as you pointed out, it will become clearer whether this is an issue once people start using the db.


michaelwitting commented on July 17, 2024


jorainer commented on July 17, 2024

Sounds great @michaelwitting 👍

