rformassspectrometry / compounddb

Creating and using (chemical) compound databases

Home Page: https://rformassspectrometry.github.io/CompoundDb/index.html

R 99.70% TeX 0.30%

metabolomics annotation databases rstats mass-spectrometry

compounddb's Introduction

CompoundDb ... preserve compound annotations.

Installation and requirements

The package can be installed with

install.packages(c("BiocManager", "remotes"))
BiocManager::install("CompoundDb")

Creating and using (chemical) compound databases

This package provides functionality to create and use compound databases generated from (mostly publicly) available resources such as HMDB, ChEBI and PubChem.

For more information see the package homepage.

Contributions

Contributions are highly welcome and should follow the contribution guidelines. Also, please check the coding style guidelines in the RforMassSpectrometry vignette.

compounddb's Issues

Add vignette describing how to create CompDb databases

Add a vignette describing how CompDb databases can be created.

from a tbl
from one or multiple input files.
How to handle PubChem.

Support multiple SDF files

For example for pubchem.

Multithreading with pbapply would be nice.

Extract spectrum data from SDF files

Apparently there should be MS spectrum data in a SDF file - problem is ChemmineR seems to have no support for spectrum data.

Use case: find compounds with a certain MS2 peak

Retrieve compounds (and/or spectra) that have an MS2 peak with a certain m/z.

The query would be something like:

compounds(compdb, filter = MSnMzFilter(mz = 123.345, ppm = 10))

In the current database layout we cannot query on the m/z of the individual peaks, as the m/z and intensity values are stored as a blob (for performance reasons; discussed in issue #26). We thus have to:

1) get all potential MS2 spectra that could have a peak at the position
2) for each MS2 spectrum, check if any of its peaks matches the query m/z

To speed up point 1) (i.e. to not have to retrieve all MS2 spectra): add columns msms_mz_range_min and msms_mz_range_max to the spectrum table to retrieve only spectra for which the m/z range overlaps the input m/z. This also picks up the idea from @SiggiSmara to speed up the query based on m/z ranges (#26 (comment)).

For point 2): implement a hasPeak(x, mz, ppm = 10, which = c("any", "all")) method for Spectra that returns TRUE or FALSE depending on whether the Spectrum has a peak at the given m/z(s). mz can have length > 1.
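A minimal sketch of the peak check in point 2), written in base R. It works on a plain numeric vector of peak m/z values rather than on a Spectra object; the name `has_peak` and its arguments are hypothetical and only mirror the signature proposed above.

```r
# Hypothetical helper, not the package's implementation: checks whether a
# numeric vector of peak m/z values contains the query m/z value(s).
has_peak <- function(peak_mz, mz, ppm = 10, which = c("any", "all")) {
    which <- match.arg(which)
    # For each query m/z, test whether at least one peak lies within the
    # ppm tolerance around it.
    matched <- vapply(mz, function(z) {
        any(abs(peak_mz - z) <= z * ppm / 1e6)
    }, logical(1))
    if (which == "any") any(matched) else all(matched)
}

peaks <- c(56.0495, 123.3452, 201.1234)
res_single <- has_peak(peaks, mz = 123.345, ppm = 10)
res_all <- has_peak(peaks, mz = c(123.345, 300), which = "all")
res_any <- has_peak(peaks, mz = c(123.345, 300), which = "any")
```

With `which = "all"` every query m/z needs a matching peak, while `which = "any"` is satisfied by a single match.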

Add basic function to annotate m/z

Add a simple function that takes one or more m/z values and a list of possible adducts and returns the matching database entries.

Possible names for the (exported) function:

annotateAdduct
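One way such a function could look, as a base R sketch. The function name `annotate_mz`, the column names of the example table and the adduct mass offsets (approximate values for singly charged adducts) are assumptions for illustration, not the package's API.

```r
# Hypothetical sketch: match m/z values against compound masses for a set
# of singly charged adducts, given as named mass offsets (m/z = mass + offset).
annotate_mz <- function(mz, compounds, adducts, ppm = 10) {
    hits <- lapply(mz, function(z) {
        per_adduct <- lapply(names(adducts), function(a) {
            mass <- z - adducts[[a]]  # neutral mass implied by this adduct
            keep <- abs(compounds$mass - mass) <= mass * ppm / 1e6
            if (!any(keep)) return(NULL)
            data.frame(mz = z, adduct = a, compounds[keep, , drop = FALSE])
        })
        do.call(rbind, per_adduct)
    })
    res <- do.call(rbind, hits)
    rownames(res) <- NULL
    res
}

# Example data; masses are approximate monoisotopic masses.
cmps <- data.frame(compound_id = c("C1", "C2"),
                   compound_name = c("Glucose", "Caffeine"),
                   mass = c(180.06339, 194.08038))
adds <- c("[M+H]+" = 1.007276, "[M+Na]+" = 22.989218)
res <- annotate_mz(c(181.070666, 217.069598), cmps, adds)
```

Each row of the result pairs one query m/z and adduct with a matching database entry; query values without any match simply do not appear.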

Create collapse_table, expand_table functions

Create functions that allow to collapse or expand a data.frame keeping unique values in some columns and list of elements in the other.
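A base R sketch of what the two functions could do. The `by` argument and the assumption that all list columns of a row have the same length are choices made for this illustration only.

```r
# Hypothetical sketch: keep one row per unique combination of the `by`
# columns and store the values of all other columns as list elements.
collapse_table <- function(x, by) {
    other <- setdiff(colnames(x), by)
    key <- do.call(paste, c(unname(as.list(x[by])), sep = "\r"))
    res <- x[!duplicated(key), by, drop = FALSE]
    grp <- factor(key, levels = unique(key))  # keep order of first appearance
    for (col in other)
        res[[col]] <- unname(split(x[[col]], grp))
    rownames(res) <- NULL
    res
}

# Reverse operation: repeat each row once per element of its list columns.
expand_table <- function(x, by) {
    other <- setdiff(colnames(x), by)
    n <- lengths(x[[other[1]]])
    res <- x[rep(seq_len(nrow(x)), n), by, drop = FALSE]
    for (col in other)
        res[[col]] <- unlist(x[[col]])
    rownames(res) <- NULL
    res
}

df <- data.frame(compound_id = c("C1", "C1", "C2"),
                 synonym = c("a", "b", "c"))
collapsed <- collapse_table(df, by = "compound_id")
expanded <- expand_table(collapsed, by = "compound_id")
```

Here `collapsed` has one row per compound with the synonyms held in a list column, and `expanded` restores one row per synonym.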

Change field collision_energy to text

MoNa provides the collision energy as a string (issue #30) instead of a single number (like HMDB does). Thus we have to change the field from numeric to text.

Import data from Massbank

Import open data from Massbank (https://github.com/MassBank/MassBank-data).

Seems the data is in nicely structured txt files, so import should be straightforward.

Lipidmaps changes

Lipidmaps have changed their naming.
There is now NAME instead of COMMON_NAME.

In addition there is now a HMDB_ID column. That conflicts with the check for the sdf being sourced from HMDB. So I have made it check for lipidmaps first.

PR: #57

Databases to convert to tables

From @stanstrup on October 19, 2017 13:11

Functions added to package:

License situation clarified

Please suggest.

Copied from original issue: stanstrup/PeakABro#2

Import from SDF format

Create functions to import compound information from files in SDF format. This should work on files from HMDB, ChEBI and LipidMaps.

The object-oriented annotation database concept

The following idea might be quite useful for e.g. lab-specific annotations. The idea is to have main, static annotations, e.g. from HMDB or Massbank, and allow users to add lab-specific annotations (such as retention times for specific compounds) too. In the sketch below, the CompDb object would represent a static annotation resource (getting annotations from e.g. a database). An additional object XCompDb could now extend the CompDb object and thus inherit all of its methods (and along with that the data), but it could also add additional annotations, such as retention times. These could even just be provided by a data.frame or an xls sheet. The point is that such XCompDb objects would allow to provide dynamic annotations, e.g. for a specific LC-MS setup in a lab. Another possibility would be to add MS2 spectra measured in the lab to an existing CompDb that also provides MS2 spectra. The object extending CompDb would simply contain a Spectra with the MS2 spectra, and whenever spectra is called on it (or an MS2 search is performed) it joins its MS2 spectra with those from the main database.

I think this might be quite helpful to cover the rather heterogeneous and lab-specific annotations that are around. We could build on some public resources and build around them. @stanstrup @michaelwitting @sneumann what do you think?

The simple sketch for this setup:

Re-define compound table

The purpose of the compound table:

contain a unique entry for one compound
allow to group e.g. multiple MS2 spectra to a single entity.

The question however is how to define a compound. What is a compound? An entity with its unique, own InChI? Structure == compound?

For the HMDB database it was pretty straightforward as HMDB provides compound identifiers. MoNa (issue #23) and Massbank (issue #34) however are more complicated as they don't allow to unify the data.

What we should do:

For HMDB: check if each compound ID has its own InChI.
Check PubChem (ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/)
Check ChEBI

Reduce redundancy in compound table imported from MoNa

Data import from MoNa is available (issue #30), but the compound table contains a high degree of redundancy, i.e. it has one row for each MS2 spectrum.
We need to find a way to reduce this data by identifying unique compounds in that table. That has been elusive so far, because e.g. not all entries have an InChI key. Even those with an InChI key can differ in their exact mass (because of different precision of the numerical representation).

Support getting data from Wikidata

Idea: write up a vignette that shows how to populate a local database using data from Wikidata. (You can assign this one to me).

Generated package cannot be loaded

The generated package cannot be loaded.

What is CompoundDb referring to? It cannot be found. Was it supposed to use createCompDb? Then metadata is missing.

https://github.com/EuracBiomedicalResearch/CompoundDb/blob/dbcd9cf0aaf7becd57dd1ed248d5774d36c88d8d/inst/pkg-template/R/zzz.R#L11

EDIT: ah. this is supposed to be CompDb?

support gz compressed sdf files

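For PubChem. Assume the decompressed file has the same name as the gz file, without the .gz extension.

sdf_file <- sub("\\.sdf\\.gz$", ".sdf", file)  # name of the decompressed file
R.utils::gunzip(file, destname = sdf_file, remove = FALSE, skip = TRUE)
SDF <- ChemmineR::read.SDFset(sdf_file)
unlink(sdf_file)  # delete the decompressed copy again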

Support multiple synonyms fields in SDF files

See #1 (comment)

Primary and foreign keys

Wouldn't it make sense to set the compound_id in the compound table as a primary key and unique?

Then synonym could refer to the key. All other additional tables added with new features would require the foreign key reference.

I don't know if it has many practical applications, but I like the idea of having this explicit and getting a potential error if tables are added without a proper reference.

Add filters

Add/implement filters from Bioconductor's AnnotationFilter package:

MassFilter (numeric).
MassRangeFilter (numeric, defines lower and upper boundary)
CompoundIdFilter
CompoundNameFilter

This is for Bioconductor S4-based filtering - might have to think how we could expand that to tidyverse-based filtering.

Import data from MoNa

This is related to issue #23

MoNa provides data in SDF format and in a proprietary JSON format. First we have to check a) what compound information we can extract from the SDF and b) if and how we can extract the spectrum data from the SDF.

Doc lists wrong required columns

Should be compound_id and compound_name. inchi_key is also required but not mentioned in this place in the doc.

https://github.com/EuracBiomedicalResearch/CompoundDb/blob/23d7958037dd63867dfe3a8f9be2a5eaf5bc98de/R/createCompDbPackage.R#L361

Add field precursor_mz to msms_spectra

MoNa provides the precursor m/z for MS2 spectra (issue #30). Add this field to msms_spectrum.

Include resource-specific license files

Add resource-specific license files that could be added to the CompDb packages.

Import MS/MS spectrum data from various sources

Import MS/MS spectrum data from different databases:

HMDB (http://www.hmdb.ca)
MoNa (issue #30)
Massbank

Seems ChEBI, PubChem and LipidMaps don't provide spectra; can you confirm @stanstrup @michaelwitting ?

Store spectra m/z and intensity values as blob

After the first test with the full HMDB database (~450,000 spectra) it turned out that saving the spectra data in two tables msms_spectrum_peak and msms_spectrum_metadata could be overkill. Specifically, the msms_spectrum_peak table, which has one row for each m/z - intensity pair (peak), is huge (over 8 million rows) and fetching the full data takes several minutes (mostly because the two tables have to be joined).

An alternative approach would be to store each MS/MS spectrum as one row in a msms_spectrum table, with the m/z and intensity values saved as BLOB. This would however prevent searches/subsets by m/z and intensity values directly in the database (using SQL) - something that might anyway not be that interesting.
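To illustrate the one-row-per-spectrum idea, here is a small base R sketch that packs the m/z and intensity vectors of each spectrum into raw vectors. Using `serialize()` as the encoding and the helper names `encode_blob`/`decode_blob` are assumptions for illustration; the issue does not state which encoding the database would use.

```r
# Hypothetical helpers: encode a numeric vector as a raw vector (the
# "blob") and restore it again.
encode_blob <- function(x) serialize(x, connection = NULL)
decode_blob <- function(b) unserialize(b)

# One row per spectrum; peak data is held in list columns of raw vectors.
msms_spectrum <- data.frame(spectrum_id = c("sp1", "sp2"))
msms_spectrum$mz <- list(encode_blob(c(56.05, 123.35)),
                         encode_blob(c(85.03, 99.04, 150.10)))
msms_spectrum$intensity <- list(encode_blob(c(100, 20)),
                                encode_blob(c(5, 80, 100)))

# The peaks of a spectrum are only available after decoding in R.
mz_sp2 <- decode_blob(msms_spectrum$mz[[2]])
```

The table has as many rows as spectra, but a condition on individual m/z values can no longer be evaluated without decoding every blob, which is the trade-off described above.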

I'll check the performance of both approaches.

Check msPurity

Check overlaps/redundancy with the msPurity Bioconductor package: https://bioconductor.org/packages/3.9/bioc/html/msPurity.html

Modified lipidblast importer

I've modified the lipid blast importer to extract also the formula and the mass from the json. This does not require rcdk.

Setting write permission in CompDb

It would be nice to be able to set the RSQLite::SQLITE_RW flag for the database connection.

https://github.com/EuracBiomedicalResearch/CompoundDb/blob/c6358539c90adc48eb9ba639a7c3f432390046b0/R/CompDb.R#L201

That way more advanced direct hacking is easier so you don't have to make a new connection "manually".

Getter for synonyms

Is there an interface for accessing the synonyms? I could not find one apart from direct DB access.

object and x same in man page?

Aren't x and object the same kind of thing?
Should they be named the same across functions?

https://github.com/EuracBiomedicalResearch/CompoundDb/blob/389c16c6234b68308a9a18630110f450f1dc4245/man/CompDb.Rd#L46

https://github.com/EuracBiomedicalResearch/CompoundDb/blob/389c16c6234b68308a9a18630110f450f1dc4245/man/CompDb.Rd#L31-L33

Add a class to concatenate Spectrum2 objects and add arbitrary annotations

Define a class that allows to concatenate several Spectrum2 objects (similar to a list) and add arbitrary annotation information to it. Such a class can thus have two purposes:

represent multiple Spectrum2 objects with the ability to implement methods for them, e.g. msLevel(object) would then return the MS level of all objects which is more user friendly than a lapply(object, msLevel) if Spectrum2 objects are in a simple list.
Add any arbitrary annotation column to it, so that we could e.g. link Spectrum2 object(s) to the respective chromatographic peak ID.

Change to Spectra::Spectra

Instead of using MSnbase::Spectra switch to Spectra::Spectra.

Default spectra variable names

As suggested by @michaelwitting in issue #61 we should agree on a base nomenclature for compound/spectra identifiers. I am generally no big friend of camelCase in variable names (just too easy to mistype), so I'd suggest to use all lower case?

Happy for feedback, change requests and expansion of the list @michaelwitting @stanstrup @sneumann

InChI: inchi
InChIKey: inchikey
SMILES: smiles
SPLASH: splash
...

Use `Spectra` object from MSnbase instead of `Spectrum2List`

Replace Spectrum2List with Spectra.

remove -methods from alias?

Is it necessary to have the methods as aliases?

For me, as the number of functions grows, reading through the list of functions becomes more difficult if many functions are heavily duplicated with -method and -class.
So I find it more difficult to get a sense of what functions are available.

dbconn seems to need @nord.

License issues

From @stanstrup on October 19, 2017 9:1

Which databases can I include data from?
If there are ones I cannot include, they will need to be downloaded and the table generated by the user. Is there such a thing as "in-package cache"?
Which license can the package have if it includes db data?
Is license a concern at all? As far as I know data cannot be copyrighted so is there any concern at all?

~~MONA (lipidblast) is CC BY 4.0. So should be OK? http://mona.fiehnlab.ucdavis.edu/documentation/license~~
I cannot find what license lipidmaps have.
Seems hmdb require explicit permission to include http://www.hmdb.ca/downloads

The info I extract is: id, name, inchi, formula, and mass.

For the moment I force-removed the files until this is settled.

Copied from original issue: stanstrup/PeakABro#1

Filter data on MS2 mz range

This is related to issue #28, specifically the point 1) described there. Retrieve data based on the m/z range of their MSMS spectra.

Add columns msms_mz_range_min and msms_mz_range_max to the msms_spectrum database table.
Add MsmsMzRangeMinFilter and MsmsMzRangeMaxFilter filters (terrible name but we have to be compliant with the naming scheme for AnnotationFilter classes).

Use case 1: get all MSMS spectra with all of their peaks within an m/z range:

spectra(compdb, filter = ~ msms_mz_range_min > 123.4 & msms_mz_range_max < 343.4)

Use case 2: get all MSMS spectra with a peak potentially within an m/z range (that's what we need for 1) in issue #28):

spectra(compdb, filter = ~ msms_mz_range_min <= 123.5 & msms_mz_range_max >= 123.4)
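The logic of the two use cases can be tried on a plain data.frame that stands in for the msms_spectrum table. This is only a base R sketch with made-up values; the variables `lower` and `upper` are introduced here for illustration, and the second query is written as a general range overlap test.

```r
# Stand-in for the msms_spectrum table with the two proposed columns.
msms_spectrum <- data.frame(
    spectrum_id = c("sp1", "sp2", "sp3"),
    msms_mz_range_min = c(50.1, 130.2, 100.0),
    msms_mz_range_max = c(120.9, 340.0, 400.5)
)

# Use case 1: all peaks of the spectrum lie within 123.4 - 343.4.
within_range <- subset(msms_spectrum,
                       msms_mz_range_min > 123.4 & msms_mz_range_max < 343.4)

# Use case 2: the spectrum's m/z range overlaps the query window, so the
# spectrum could contain a peak in it.
lower <- 123.4
upper <- 123.5
may_contain <- subset(msms_spectrum,
                      msms_mz_range_min <= upper & msms_mz_range_max >= lower)
```

Only the spectra in `may_contain` would then have to be retrieved and checked peak by peak.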

Add new spectra to a `CompDb` database

This is related to issue rformassspectrometry/Spectra#135 . Would be nice to have an expandable database, e.g. pre-fill with MS2 spectra from a public repository and then (sequentially) add own spectra and annotations to it.

Maybe best to add functions

addSpectra <- function(x, spectra)
updateSpectra <- function(x, spectra)
removeSpectra <- function(x, spectra)

Where x is a CompDb and spectra a Spectra object. Note that the spectra also need to have a reference to the compound in the compound table - which might have to be added before.

Define spectra table

Define the spectra database table to hold the spectrum data.

Check what is provided in SDF format (spectra there?).
Check what is provided by different providers (MoNa, HMDB)...

make createCompDbPackage return package path

Wouldn't it make sense if createCompDbPackage returned the path of the package?

Then you could grab that and install directly.

https://github.com/EuracBiomedicalResearch/CompoundDb/blob/690258d9b7aca0439f6f8658507deffdeffbed93/R/createCompDbPackage.R#L614

What data to provide in a CompoundDb

Based on https://github.com/stanstrup/PeakABro I'm extracting the following information:

id (column "compound_id")
name (column "compound_name").
inchi
formula
mass

Is there anything else that might at some point be interesting? @stanstrup? @SiggiSmara?

Also, I renamed id into compound_id and name into compound_name to avoid potential column name clashes and specify for what type of entity the id and the name is. @stanstrup, that OK with you?

LipidBlast seems to be fubar

From @stanstrup on October 19, 2017 13:13

Names and structures don't match:
https://bitbucket.org/fiehnlab/mona/issues/200/lipidblast-mismatch-between-name-and

Copied from original issue: stanstrup/PeakABro#3

Add code/description how to create a CompDb from MassBank

MassBank releases their databases at regular intervals and shares the data with a rather open license, which makes them an ideal candidate for annotation databases that could be distributed via Bioconductor's AnnotationHub.

Explanation: I'm building so-called EnsDb databases for all species for each release of Ensembl. These databases are self-contained SQLite files with gene, transcript, exon and protein annotations and can be downloaded/fetched from AnnotationHub. This is very convenient for the user.

CompDb databases could be distributed in a similar fashion.

What I will try next is to define simple scripts to easily import data from the MassBank (MySQL database) into a CompDb database.

Add functionality to convert mass to m/z and vice versa

Add functionality to convert mass to m/z and back given adduct definitions such as "[M+H]+". This needs:

definition of adducts and related conversions/mass differences.
function to map from mass to m/z: mass2mz
function to map from m/z to mass: mz2mass

The input for the function should be a numeric with masses or m/z values, respectively. The output should be a list of length equal to the input vector. Each element should be a named numeric of length equal to the number of specified adducts, names being the adduct names and values the converted values.
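A base R sketch of this interface. The adduct table, its column names `mass_multi` and `mass_add`, and the (approximate) mass differences are assumptions made for this example; only the function names and the input/output format follow the description above.

```r
# Assumed adduct definitions: m/z = mass * mass_multi + mass_add, with the
# charge already folded into both values. Mass differences are approximate.
adducts <- data.frame(
    name = c("[M+H]+", "[M+Na]+", "[M-H]-", "[M+2H]2+"),
    mass_multi = c(1, 1, 1, 0.5),
    mass_add = c(1.007276, 22.989218, -1.007276, 1.007276)
)

# Convert masses to m/z: one named numeric per input mass.
mass2mz <- function(mass, adduct = adducts$name) {
    def <- adducts[match(adduct, adducts$name), ]
    lapply(mass, function(m)
        setNames(m * def$mass_multi + def$mass_add, def$name))
}

# Convert m/z back to masses: one named numeric per input m/z.
mz2mass <- function(mz, adduct = adducts$name) {
    def <- adducts[match(adduct, adducts$name), ]
    lapply(mz, function(z)
        setNames((z - def$mass_add) / def$mass_multi, def$name))
}

mzs <- mass2mz(c(180.06339, 194.08038), adduct = c("[M+H]+", "[M+Na]+"))
back <- mz2mass(mzs[[1]][["[M+H]+"]], adduct = "[M+H]+")
```

`mzs` is a list of length two (one element per input mass), each element a numeric named by adduct, and `back` recovers the first input mass.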

extract cmpDb spectra considering spectrum_id

Hi Johannes,
As you know, I am trying to fit my BD and MS2 workflow identification to your amazing tools.
In one of the steps, we search a CompDb object for spectra that match specific requirements (polarity, mass, originaldb, ...), and then we need to extract such spectra in order to compare them with our query spectrum, i.e. it is necessary to extract spectra from a CompDb object considering their 'spectrum_id' value. Something similar to

spectraSbstd <- Spectra(CompDb..100KMS2ID.1.0, filter = ~ spectrum_id %in% c(100002, 25, 68))

spectrum_id is not considered as a variable in any filter supported by CompDb. How can I address this need? Do I need to extend a new filter from the AnnotationFilter::AnnotationFilter class?

Thanks for all your support

Merge databases

Support merging of CompDb databases, either as

simple functions merging the data from various CompDb databases
define a specific MultiSourceCompDb database/object whose layout supports data from various sources.
