Comments (27)
From @jotsetung on October 19, 2017 17:01
Are you planning to add each resource (i.e. its data) to the package?
from compounddb.
From @stanstrup:
That was my plan, if it is feasible without violating licenses. Parsing, for example, the JSON from LipidBlast is very slow, and people would need to download a 1.6 GB file, whereas the parsed table is only 1-2 MB in RDS format.
It is not very clear to me what the license situation is. As far as I know, simple data cannot be copyrighted; a simple table from a paper, for example, should always be copyright-free. But I am not sure what applies here.
From @jotsetung on October 20, 2017 6:49
The idea is to match compounds by (adduct) m/z, right?
So you'll have some columns (like mass, id and name) that are common and have to be present in all data resources, and you might have some data resource specific columns.
In that case I would change from a data.frame approach to an S4 class approach (see also issue #6). This would also hide internals (like the actual column names) from the user.
From @stanstrup:
Right. I was hoping not to have DB-specific columns, though, to be able to easily mix and match. What do you mean by hide internals?
From @jotsetung on October 20, 2017 7:55
Example to explain the "hide the internals" point: this is the concept we followed in the AnnotationFilter and ensembldb packages:
- Define a common name for a filter or database attribute that the user is used to, such as genename.
- Define a filter that can be used to search in a (any) database for a certain gene by its name: GenenameFilter.
- Now, no matter which database the user is querying, he can always use the GenenameFilter to search for entries matching a certain gene name. The methods that access the data in a database have to translate it to the correct column name. So it does not matter whether the name of the column in the database table is gene_name, GeneName, genename etc. The user doesn't have to bother what the name of the column might be or use different column names across different databases.
An example here would be to have something like an InchiFilter that can be used to search for InChIs in the database...
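As a rough R sketch of this idea (class and function names are illustrative only, not the actual AnnotationFilter API):

```r
## Illustration of the "hide the internals" concept; all names here are
## made up, this is not the real AnnotationFilter implementation.
setClass("GenenameFilter", representation(value = "character"))
GenenameFilter <- function(value) new("GenenameFilter", value = value)

## Each database backend translates the common filter to whatever its
## own column happens to be called.
filterColumn <- function(filter, db_columns) {
    candidates <- c("gene_name", "GeneName", "genename")
    intersect(candidates, db_columns)[1]
}

## A backend whose column is "GeneName"; the user only ever sees the filter.
flt <- GenenameFilter("BCL2")
col <- filterColumn(flt, c("id", "GeneName", "tx_start"))
sprintf("SELECT * FROM gene WHERE %s = '%s'", col, flt@value)
## "SELECT * FROM gene WHERE GeneName = 'BCL2'"
```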
From @jotsetung on October 20, 2017 8:08
Regarding HMDB parsing: I did implement a simple parser to extract fields from HMDB's XML file(s):
https://github.com/jotsetung/xcmsExtensions/blob/master/R/hmdb-utils.R - use whatever you want/need.
From @stanstrup:
Yeah, I saw that just now when poking around your repo. I actually also wrote one some years ago that I was planning to add. I have a suspicion that yours is smarter though...
So should I eventually import from your package or copy the code?
From @stanstrup:
If you don't enforce column names in databases, won't it become difficult to mix them if, for example, you want data from both HMDB and LipidMaps for the annotation?
From @jotsetung on October 20, 2017 9:12
Re: code from xcmsExtensions - please copy what you need; I won't update/use that package anymore. Yours will be much better!
Re: enforcing column names - let's wait for your use case. I agree that common column names should be used.
From @jotsetung on October 20, 2017 12:43
Are you already working on the HMDB import function? Otherwise I could do that, to start getting my hands dirty...
From @stanstrup:
Nope, I am working on PubChem, so that would be great.
From @wilsontom on October 20, 2017 15:54
Hi Jan,
I parsed HMDB into a package a while ago, if it's any help. And a colleague did something similar for PubChem. Some of it may be of use to you.
Thanks
Tom
From @stanstrup:
@wilsontom Thanks! But you don't supply functions to actually generate the HMDB table? I'd like to have that in the package too so it is easy to update.
I basically have PubChem working. Trying to generate the table now. It takes a while though, since PubChem is enormous. I wonder what the final size is going to look like.
From @jotsetung on October 20, 2017 17:03
HMDB parsing is also on its way - I've just updated to use the xml2 package instead of the XML package.
From @stanstrup:
I have been trying to write something feasible to handle PubChem using SQLite intermediates. The problem is that it is enormous: supposedly 130 million structures. Holding the final table in memory requires about 60 GB by my approximations. An RDS file would be ~7.5 GB, an SQLite file ~40 GB.
So a few problems:
- Do any of the usual hosting solutions even allow such a large file?
- People cannot use it on a regular computer without loads of memory.
- Expanding it to adducts would balloon it even more.
With an SQLite file, as far as I understand, you could subset it before it is read into R. I guess that might make it useful for something.
Still, I don't know if that is feasible. And I wouldn't know where to host a 40 GB SQLite file.
Thoughts?
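The subset-before-reading idea could look like the following with DBI/RSQLite (an in-memory database stands in for the huge PubChem file; table and column names are hypothetical):

```r
library(DBI)
library(RSQLite)

## In-memory stand-in for a (huge) compound SQLite file; the "compound"
## table and its columns are made-up names, not a fixed schema.
con <- dbConnect(SQLite(), ":memory:")
dbWriteTable(con, "compound", data.frame(
    id                = c("CID1", "CID2"),
    monoisotopic_mass = c(149.051, 180.063)
))

## Subset inside the database, before anything is read into R:
hits <- dbGetQuery(
    con,
    "SELECT * FROM compound WHERE monoisotopic_mass BETWEEN 149.0 AND 149.1"
)
dbDisconnect(con)
```

Only the rows in the requested mass window ever reach R, so the 40 GB file never has to fit in memory.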
From @jotsetung on October 23, 2017 3:50
The HMDB is added (see stanstrup/PeakABro#10).
From @jotsetung on October 23, 2017 3:59
Re PubChem: for these large files a tibble-based approach is not feasible. SQL might do it - or on-the-fly access? Do they have a web API that could be queried? The approach I have in mind might also work here: define a CompoundDb S4 object and implement all of the required methods (select etc.) for it. For smaller databases these can access the internal SQLite database. We could then also implement a PubChemDb class that extends CompoundDb, whose select method could e.g. query the database online (if they provide an API) and return the results.
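A minimal S4 sketch of that design (class names follow the comment above, but the method bodies are placeholder stubs, not the package's real implementation):

```r
## Sketch: a base class backed by a local SQLite file, and a PubChem
## subclass that would instead query the web API. Bodies are stubs.
setClass("CompoundDb", representation(dbfile = "character"))
setClass("PubChemDb", contains = "CompoundDb")

setGeneric("compounds", function(x, ...) standardGeneric("compounds"))

setMethod("compounds", "CompoundDb", function(x, ...) {
    ## real implementation: DBI::dbGetQuery() on the internal SQLite db
    "querying the local SQLite database"
})

setMethod("compounds", "PubChemDb", function(x, ...) {
    ## real implementation: request against the PubChem PUG REST web API
    "querying the PubChem PUG REST API"
})

compounds(new("PubChemDb"))  # dispatches to the PubChemDb method
```

For the user both objects behave the same; only the dispatched method knows where the data actually lives.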
Regarding the adducts - I had a thought about that too: I wouldn't create all adducts for all compounds in the database but rather go the other way round and calculate adducts from the identified chromatographic peaks instead. That would be more efficient, because supposedly there are always fewer peaks to annotate than compounds in the database.
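The peak-first direction, numerically (the proton/sodium mass shifts are the standard [M+H]+ and [M+Na]+ values; the peak m/z values are invented for illustration):

```r
## Compute candidate neutral masses from observed peak m/z values and
## match them against database compound masses (peak values are made up).
peak_mz <- c(181.0707, 203.0526)
adducts <- c("[M+H]+" = 1.007276, "[M+Na]+" = 22.989221)

## one row per peak, one column per adduct
neutral <- outer(peak_mz, adducts, FUN = "-")

## match against e.g. glucose (C6H12O6, monoisotopic mass 180.0634):
## peak 1 matches as [M+H]+ and peak 2 as [M+Na]+
which(abs(neutral - 180.0634) < 0.005, arr.ind = TRUE)
```

Only as many rows as there are peaks are ever computed, instead of expanding every database compound to every adduct up front.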
From @stanstrup:
Thanks for HMDB.
Re PubChem: PubChem does have an API: https://pubchem.ncbi.nlm.nih.gov/pug_rest/PUG_REST_Tutorial.html#_Toc458584424
But it will be way too slow for this purpose. It would be thousands of compounds if you attempt to annotate a whole peaklist. I specifically wanted to get away from the whole "look up one at a time" approach, so that once you have created your annotated peaklist you can just browse around and see everything.
I suggest we change to SQLite databases in general, such that larger databases can be accommodated in the same framework.
I say we supply the function to generate the PubChem SQLite file but don't host the file itself anywhere. To me, annotating with all of PubChem is not very useful anyway: you always get too many irrelevant hits.
Re: CompoundDb: I think it makes sense to have such an object.
Do you know if it is possible to cache generated data in the installed package folder?
What would be nice is if there was:
CompoundDb <- generate_CompoundDb(dbs=c("HMDB","LipidBlast"))
--> The LipidBlast database has not been generated (initialized is a better word?) yet. Please run generate_db_lipidblast to create a cached database
generate_CompoundDb would read the included SQLite files if they exist. If generate_db_lipidblast and friends could simply add the SQLite file for the specific database to the package folder, you'd only need to generate each one once.
Re adducts: Yes you are right. That makes much more sense.
From @jotsetung on October 25, 2017 14:22
Re CompoundDb and caching - no, I don't think it's possible to cache anything in the package folder. I would keep the annotation data separate from PeakABro. What I would propose is the following: in the initial phase, provide some CompoundDb objects/SQLite databases within dedicated annotation packages (e.g. https://github.com/jotsetung/CompoundDb.Hsapiens.HMDB.4.0). In the longer run, distribute them via AnnotationHub;
check the following:
> library(AnnotationHub)
> ah <- AnnotationHub()
updating metadata: retrieving 1 resource
|======================================================================| 100%
snapshotDate(): 2017-10-24
> ## Look for a specific resource, like gene annotations from Ensembldb, in our
> ## case we could then search e.g. for "CompoundDb", "HMDB"
> query(ah, "EnsDb.Hsapiens.v90")
AnnotationHub with 1 record
# snapshotDate(): 2017-10-24
# names(): AH57757
# $dataprovider: Ensembl
# $species: Homo Sapiens
# $rdataclass: EnsDb
# $rdatadateadded: 2017-08-31
# $title: Ensembl 90 EnsDb for Homo Sapiens
# $description: Gene and protein annotations for Homo Sapiens based on Ensem...
# $taxonomyid: 9606
# $genome: GRCh38
# $sourcetype: ensembl
# $sourceurl: http://www.ensembl.org
# $sourcesize: NA
# $tags: c("EnsDb", "Ensembl", "Gene", "Transcript", "Protein",
# "Annotation", "90", "AHEnsDbs")
# retrieve record with 'object[["AH57757"]]'
> ## retrieve the resource:
> edb <- ah[["AH57757"]]
require('ensembldb')
loading from cache '/Users/jo//.AnnotationHub/64495'
This means users could fetch the resource they want from AnnotationHub, and it will be cached locally. Does that make sense?
Now, I'd also like to keep separate CompoundDb objects/databases for different resources (e.g. HMDB, LipidBlast). Reason: that way you can version the resources, respectively the packages (see e.g. https://github.com/jotsetung/CompoundDb.Hsapiens.HMDB.4.0). Different resources will never have the same release cycles - and versioning annotation resources is key to reproducible research.
This also means that you can't query multiple resources at the same time, but that shouldn't be a problem, should it?
From @stanstrup:
That sounds very reasonable. It is a bit of a learning curve for me with the S4 objects, so I hope you have patience with me while I try to wrap my head around them.
I am wondering if it could also make sense to split the database stuff from the peak browser. It is getting more comprehensive than originally envisioned.
For the last question: it would probably be nice to be able to annotate with multiple databases at the same time. The objective is the browser in the end, where you'd want a single table with all the suggested annotations.
Any idea what to do with the very big databases? The pubchem sqlite file ended up being 43GB.
From @jotsetung on October 25, 2017 14:46
Re annotation with multiple databases: one could annotate with a CompoundDb for each resource and bind_rows the results. Then you'll have the final table.
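For example (made-up result tables; bind_rows fills resource-specific columns with NA, which is why it works even when the resources carry different extra columns):

```r
library(dplyr)

## Hypothetical annotation results from two resources, each with one
## shared set of columns plus one resource-specific column.
res_hmdb <- data.frame(peak = "P1", name = "Glucose",
                       hmdb_id = "HMDB0000122")
res_lb   <- data.frame(peak = "P1", name = "PC(34:1)",
                       lipid_class = "PC")

## One combined table; .id records which resource each hit came from.
annotations <- bind_rows(HMDB = res_hmdb, LipidBlast = res_lb,
                         .id = "source")
```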
Re very big database: the only thing I can think of here is to use a central MySQL server hosted somewhere (eventually I could do that, not sure though). And here comes the power of the S4 objects: we simply define a PubChemDb object that extends CompoundDb. We would only have to implement the compounds or src_cmpdb (or the annotating function/method) accordingly. For the user it would be just like using a simple local SQLite-based CompoundDb object.
From @stanstrup:
Ah, OK. If nothing prevents bind_rows then it is all good.
EDIT: Now I understand - bind the results. Yes, that works too.
Re very big database: I guess we can put that on the back burner for now and just provide the parser. PubChem is rarely really useful for annotation anyway.
From @jotsetung on October 26, 2017 9:19
I am wondering if it could also make sense to split the database stuff from the peak browser. It is getting more comprehensive than originally envisioned.
Thinking it all over - eventually that might be a not-too-bad idea. I could focus on the database/data import stuff (with your help) and you can focus on the annotation, matching and browsing stuff.
Pros for splitting:
- keep the database and the creation of the database separate from the browser and annotator - easier to maintain.
- we would not run into the Bioconductor style <-> tidyverse coding style clash. Something I find very ugly would be e.g. create_CompoundDb, i.e. mixing CamelCase with snake_case.
- you don't have to go through my pull requests ;)
Cons:
- PeakABro will become very slim (is that a con?)
- changes in one of the two packages will have to be reflected/fixed in the other too.
@stanstrup, what do you think?
From @stanstrup:
In the end this is probably the most efficient way to do it, so go ahead if you want.
From @jotsetung on October 27, 2017 7:36
OK, I'll make a repo and add you as a collaborator.
From @stanstrup:
Do you want to move this issue and the other db-related ones using https://github-issue-mover.appspot.com?
From @jotsetung on October 27, 2017 12:30
Or we just link to this issue? Whatever you prefer.