Comments (12)
Then by all means, Sir! Blob it!
from compounddb.
I guess it could be interesting to ask "give me all compounds where this mz appears"...
Is several minutes a big deal? 8 million rows doesn't seem so bad once it is in memory.
Or did you mean simply storing it in the database in that format but make the long format as when it is read?
from compounddb.
Two things:
- 8 million rows is for now. I expect this to increase over time with more and more spectra being added - or for other databases than HMDB that might have more spectra.
- at present it is a 28 second query to retrieve all spectra. Well, that's way faster than importing it from the files, but still - users are usually not very patient.
Or did you mean simply storing it in the database in that format but make the long format as when it is read?
It it the storage of the data - in the end it's not that simple, because I have to serialize/deserialize each individual field when storing/retrieving it from the db. The idea is then to have the expanded version returned (e.g. as a Spectrum2List
). But having the data as a BLOB prevents to have SQL queries in the form of where mz > 23.129 and mz < 23.2
to retrieve only specific e.g. compounds. One would have to fetch the full data first and then subset instead of letting SQLite doing the subsetting.
from compounddb.
Is storing it as a blob even faster?
from compounddb.
getting the full table with m/z and intensity values as a blob takes ~ 2 seconds. But this has then only ~ 450.000 rows, one row per spectrum and has to be deserialized. Still it is faster.
Coming back to your possibly interesting question: so it would be interesting to pose the question "give me all compounds with a MS/MS spectrum that has a peak at a certain m/z"? How frequently you think such a query would be?
from compounddb.
Update on the timing it takes to fetch the MS/MS data from the database: I compared the two functions, the first joins the two tables with the MS/MS peak data (m/z and intensity values) and the MS/MS spectrum annotations and returns the full data. The second queries the table with one line per spectrum and the m/z and intensity value vectors stored as a BLOB
for each spectrum. This function deserializes also the blob and expands the returning data.frame
. The result from both approaches is the same data.frame
with 8240863 rows.
Unit: seconds
expr min lq mean median uq max
dbGetQuery(con, q_a) 27.82240 28.98545 29.61636 30.14851 30.51334 30.87817
.msms_spectra(con, q_b) 14.27698 14.96500 15.46380 15.65302 16.05721 16.46141
neval cld
3 b
3 a
If the results are supposed to be returned as a Spectrum2List
the data.frame
s have to be further processed. Here the second approach has the advantage that the result from the query does only have to be deserialized but not expanded.
Unit: seconds
expr min lq
Spectrum2List(dbGetQuery(con, q_a)) 323.01007 332.65474
Spectrum2List(.msms_spectra(con, q_b, expand = FALSE)) 25.70828 26.67545
mean median uq max neval cld
344.18153 342.29940 354.76726 367.23511 3 b
27.11244 27.64262 27.81453 27.98643 3 a
So no question here who's the winner. Saving the m/z and intensity vectors per spectrum as BLOB
outperforms storing them as individual numeric values. This is mostly because the huge data.frame
has to be split per spectrum for the individual-value-saving while the BLOB
way returns the data.frame
already in a form that can be almost immediately converted into the Spectrum2List
.
Now the same performance tests for the retrieval of spectra for a single compound:
Unit: milliseconds
expr min
dbGetQuery(con, paste0(q_a, " where compound_id='HMDB0000001'")) 1.347068
.msms_spectra(con, paste0(q_b, " where compound_id='HMDB0000001'")) 1.047436
lq mean median uq max neval cld
1.371294 1.438904 1.395520 1.484822 1.574123 3 a
1.060169 1.189017 1.072901 1.259807 1.446714 3 a
BLOB
is slightly faster, but no big difference. Returning a SpectrumList
instead, the BLOB
approach is again faster:
Unit: milliseconds
expr
Spectrum2List(dbGetQuery(con, paste0(q_a, " where compound_id='HMDB0000001'")))
Spectrum2List(.msms_spectra(con, paste0(q_b, " where compound_id='HMDB0000001'"), expand = FALSE))
min lq mean median uq max neval cld
5.184055 5.230218 5.267699 5.276380 5.30952 5.342661 3 b
3.461168 3.574361 3.824816 3.687553 4.00664 4.325726 3 a
Summarizing, if there is no strong reason that we need direct access to the m/z and intensity values of the individual spectra I would go for the BLOB
solution.
from compounddb.
m/z and intensity values for a MS/MS spectrum are now stored as BLOB
in the msms_spectrum
database table. Advantage is that we have now only a single table for MS/MS spectra, with each row containing the data for one spectrum.
If in future we see the need/necessity to search for spectra based on m/z values (in SQL) we can still change back to the two-table approach. I made sure that only moderate code changes are necessary for that.
from compounddb.
As a side note, one could think about generating mz-range index table (e.g. +/- 10 A.U. with a 1/2 width A.U. overlap) for the spectra that would speed up fetching relevant spectra based on the input of mz values without having to go fully unblobbed (is that a word?)
The use case I am thinking of is similarity search / comparison were you have a spectra and would like to find the best n matches. For LC/MS this would almost entirely happen on the MS2 or higher level which already would reduce the possible spectra by a lot since they would have to have the same or similar parent ion. For GC/MS though you would want to perform similarity searches also on MS1 level data.
from compounddb.
Thanks for the input @SiggiSmara . Once we have real use cases we can then check what makes more sense, do the search with SQL within the database, or get all spectra and do the comparison in R
.
The advantage of the second approach would be that we don't have to reimplement the wheel. If I got it correctly there is already spectrum comparison functionality in MSnbase
that we could use here - that's why I'm exporting the spectrum data also as Spectrum2
objects (in this fancy Spectrum2List
object).
from compounddb.
Agree completely that no need to reinvent the wheel, I was more thinking of how many spectra you theoretically have to perform such similarity calculations on and if a range index would help you automatically exclude spectra that are not even close to meeting the criteria on the SQL side. But as you pointed out it will become more clearer if this is an issue or not once people start using the db.
from compounddb.
from compounddb.
Sounds great @michaelwitting
from compounddb.
Related Issues (20)
- Create an IonDb with all theoretical adducts for a CompoundDb
- Function to combine/concatenate CompDb databases
- Add an insertCompound function HOT 1
- Availability of functions to create empty CompDB and insert and delete compounds HOT 2
- Seeking suggestions for database development (HMDB version 5) HOT 2
- Pass skipErrors to read.SDFset through compound_tbl_sdf? HOT 4
- Replace mass2mz and mz2mass with the ones from MetaboCoreUtils
- Rename table "compound" into "ms_compound"
- Implement a StandardsDb that extends CompoundDb HOT 11
- Update the MsBackendCompDb HOT 1
- insertIon method HOT 4
- non integer msLevel values HOT 3
- Add possibility to delete (inserted or existing) ions or MS/MS spectra
- Prepare for Bioconductor submission HOT 3
- Arbitrary columns in insertIon
- Transfer of CompoundDb
- Support import from MoNa MSP files HOT 1
- custom Db - spectra and ion Db questions HOT 2
- Issue with GitHub install of CompoundDb HOT 3
- mass2mz method for CompDb HOT 6
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from compounddb.