compounddb's Introduction

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

CompoundDb ... preserve compound annotations.

Installation and requirements

The package can be installed with

install.packages(c("BiocManager", "remotes"))
BiocManager::install("CompoundDb")

Creating and using (chemical) compound databases

This package provides functionality to create and use compound databases generated from (mostly publicly) available resources such as HMDB, ChEBI and PubChem.

For more information see the package homepage.

Contributions

Contributions are highly welcome and should follow the contribution guidelines. Also, please check the coding style guidelines in the RforMassSpectrometry vignette.

compounddb's People

Contributors

adafede, andreavicini, jmbadia, jorainer, jwokaty, nturaga, rogerginber, stanstrup

compounddb's Issues

Use case: find compounds with a certain MS2 peak

Retrieve compounds (and/or spectra) that have an MS2 peak with a certain m/z.

The query would be something like:

compounds(compdb, filter = MSnMzFilter(mz = 123.345, ppm = 10))

In the current database layout we cannot query on the m/z of the individual peaks, as the m/z and intensity values are stored as a blob (for performance reasons; discussed in issue #26). We thus have to:

  1. get all potential MS2 spectra that could have a peak at the position
  2. for each MS2 spectrum, check if any of its peaks matches the query m/z

To speed up point 1) (i.e. to avoid retrieving all MS2 spectra): add columns msms_mz_range_min and msms_mz_range_max to the spectrum table, so that only spectra whose m/z range overlaps the input m/z are retrieved. This also picks up the idea from @SiggiSmara to speed up the query based on m/z ranges (#26 (comment)).

For point 2): implement a hasPeak(x, mz, ppm = 10, which = c("any", "all")) method for Spectra that returns TRUE if the spectrum has a peak at the given m/z value(s) and FALSE otherwise. mz can have length > 1.
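A minimal sketch of the peak check in point 2), written in base R on a plain numeric vector of peak m/z values rather than on a Spectra object. The function name has_peak_sketch and its arguments are hypothetical and only illustrate the ppm matching logic; this is not the package's implementation.

```r
# Hypothetical helper: does a spectrum (given as a numeric vector of peak
# m/z values) contain peaks matching the query m/z value(s) within `ppm`?
has_peak_sketch <- function(peak_mz, mz, ppm = 10, which = c("any", "all")) {
    which <- match.arg(which)
    # For each query m/z: is there at least one peak within the tolerance?
    hit <- vapply(mz, function(z) {
        any(abs(peak_mz - z) <= z * ppm / 1e6)
    }, logical(1))
    # "any": one matching query m/z is enough; "all": every query must match.
    if (which == "any") any(hit) else all(hit)
}

peaks <- c(56.05, 123.3451, 200.1)
has_peak_sketch(peaks, mz = 123.345)
has_peak_sketch(peaks, mz = c(123.345, 300), which = "all")
```

With which = "any" a spectrum is reported as soon as one of the query m/z values is found, while which = "all" requires every query m/z to have a matching peak.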

Add basic function to annotate m/z

Add a simple function that takes one or more m/z values and a list of possible adducts and returns the matching database entries.

Possible names for the (exported) function:

  • annotateAdduct
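What such a function might look like is sketched below in base R. The function name annotate_mz_sketch, the toy compound table and the adduct mass shifts (approximate values for singly charged adducts) are assumptions made for illustration only.

```r
# Toy compound table with monoisotopic masses.
cmps <- data.frame(
    compound_id = c("C1", "C2"),
    name = c("Glucose", "Caffeine"),
    mass = c(180.06339, 194.08038)
)
# Approximate mass shifts of two singly charged adducts (assumed values).
adducts <- c("[M+H]+" = 1.007276, "[M+Na]+" = 22.989218)

# Hypothetical function: match each m/z against all compounds, for each adduct.
annotate_mz_sketch <- function(mz, compounds, adducts, ppm = 10) {
    res <- lapply(mz, function(z) {
        hits <- lapply(names(adducts), function(a) {
            # Neutral mass implied by this m/z under adduct `a`.
            mass <- z - adducts[[a]]
            idx <- which(abs(compounds$mass - mass) <= z * ppm / 1e6)
            if (length(idx))
                data.frame(mz = z, adduct = a, compounds[idx, ],
                           row.names = NULL)
        })
        do.call(rbind, hits)
    })
    do.call(rbind, res)
}

annotate_mz_sketch(c(181.07067, 217.0696), cmps, adducts)
```

The result has one row per matching combination of m/z, adduct and compound, so a single m/z value can yield several candidate annotations.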

Lipidmaps changes

Lipidmaps have changed their naming.
There is now NAME instead of COMMON_NAME.

In addition there is now a HMDB_ID column. That conflicts with the check for the sdf being sourced from HMDB. So I have made it check for lipidmaps first.

PR: #57

Databases to convert to tables

From @stanstrup on October 19, 2017 13:11

Functions added to package:

  • LipidMaps
  • LipidBlast
  • HMDB
  • MyCompoundDB
  • PhenolExplorer
  • PubChem. Too big? Not really useful?
  • ChEBI

License situation clarified

  • LipidMaps
  • LipidBlast - Confirmed CC BY. So OK with attribution.
  • HMDB
  • MyCompoundDB
  • PhenolExplorer
  • PubChem. Too big? Not really useful?
  • ChEBI

Please suggest.

Copied from original issue: stanstrup/PeakABro#2

Import from SDF format

Create functions to import compound information from files in SDF format. This should work on files from HMDB, ChEBI and LipidMaps.

The *object-oriented annotation database* concept

The following idea might be quite useful for e.g. lab-specific annotations. The idea is to have main, static annotations, e.g. from HMDB or MassBank, and to allow users to add lab-specific annotations (such as retention times for specific compounds) too.

In the sketch below, the CompDb object would represent a static annotation resource (getting annotations from e.g. a database). An additional object XCompDb could extend the CompDb object and thus inherit all of its methods (and along with that the data), but it could also add additional annotations, such as retention times. These could even just be provided by a data.frame or an xls sheet. The point is that such XCompDb objects would allow providing dynamic annotations, e.g. for a specific LC-MS setup in a lab.

Another possibility would be to add MS2 spectra measured in the lab to an existing CompDb that also provides MS2 spectra. The object extending CompDb would simply contain a Spectra with the MS2 spectra, and whenever spectra is called on it (or an MS2 search is performed) it joins its MS2 spectra with those from the main database.

I think this might be quite helpful to cover the rather heterogeneous and lab-specific annotations that are around. We could build on some public resources and build around them. @stanstrup @michaelwitting @sneumann what do you think?

The simple sketch for this setup:

[Image: CompoundDb sketch]

Re-define compound table

The purpose of the compound table:

  1. contain a unique entry for one compound
  2. allow to group e.g. multiple MS2 spectra to a single entity.

The question however is how to define a compound. What is a compound? An entity with its unique, own InChI? Structure == compound?

For the HMDB database it was pretty straightforward, as HMDB provides compound identifiers. MoNa (issue #23) and MassBank (issue #34), however, are more complicated as they don't allow unifying the data.

What we should do:

  • For HMDB: check if each compound ID has its own InChI.
  • Check PubChem (ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/)
  • Check ChEBI

Reduce redundancy in compound table imported from MoNa

Data import from MoNa is available (issue #30), but the compound table contains a high degree of redundancy, i.e. it has one row for each MS2 spectrum.
We need to find a way to reduce this data by identifying unique compounds in that table. That has been elusive so far because, e.g., not all entries have an InChI key. Even entries sharing an InChI key can differ in their exact mass (because of different precision of the numerical representation).
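One possible reduction is sketched below in base R on a toy table. It assumes that rows sharing an InChI key represent the same compound regardless of small differences in exact mass, and that rows without an InChI key are never merged; both are assumptions for illustration, and the column names are hypothetical.

```r
# Toy table with one row per MS2 spectrum, as produced by the import.
cmp <- data.frame(
    spectrum_id = 1:4,
    inchikey = c("AAA", "AAA", NA, "BBB"),
    name = c("cmp a", "cmp a", "cmp x", "cmp b"),
    exactmass = c(180.0634, 180.06339, 250.1, 194.0804)
)
key <- cmp$inchikey
# Rows without an InChI key get a unique placeholder so they are never merged.
no_key <- is.na(key)
key[no_key] <- paste0("nokey_", cmp$spectrum_id[no_key])
# Assign one compound identifier per unique key ...
cmp$compound_id <- match(key, unique(key))
# ... and keep the first row of each compound for the reduced compound table.
compounds <- cmp[!duplicated(key),
                 c("compound_id", "inchikey", "name", "exactmass")]
compounds
```

The compound_id column left in the per-spectrum table links each spectrum to its row in the reduced compound table.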

support gz compressed sdf files

For PubChem. Assume the extracted file has the same name as the gz file.

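library(R.utils)    # provides gunzip()
library(ChemmineR)  # provides read.SDFset()
sdf_file <- sub("\\.sdf\\.gz$", ".sdf", file)  # escape dots, match only the suffix
gunzip(file, destname = sdf_file, remove = FALSE, skip = TRUE)
SDF <- read.SDFset(sdf_file)
unlink(sdf_file)  # remove the temporary uncompressed file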

Primary and foreign keys

Wouldn't it make sense to set the compound_id in the compound table as a primary key and unique?

Then synonym could refer to the key. All other additional tables added with new features would require the foreign key reference.

I don't know if it has many practical applications, but I like the idea of having this explicit, with a potential error if tables are added without a proper reference.

Add filters

Add/implement filters from Bioconductor's AnnotationFilter package:

  • MassFilter (numeric).
  • MassRangeFilter (numeric, defines lower and upper boundary)
  • CompoundIdFilter
  • CompoundNameFilter

This is for Bioconductor S4-based filtering - might have to think how we could expand that to tidyverse-based filtering.

Import data from MoNa

This is related to issue #23

MoNa provides data in SDF format and in a proprietary JSON format. First we have to check a) what compound information we can extract from the SDF and b) if and how we can extract the spectrum data from the SDF.

Store spectra m/z and intensity values as blob

After the first test with the full HMDB database (~450,000 spectra), it turned out that saving the spectra data in two tables, msms_spectrum_peak and msms_spectrum_metadata, could be overkill. Specifically, the msms_spectrum_peak table, which has one row for each m/z - intensity pair (peak), is huge (over 8 million rows), and fetching the full data takes several minutes (mostly because the two tables have to be joined).

An alternative approach would be to store each MS/MS spectrum as one row in a msms_spectrum table, with the m/z and intensity values saved as BLOB. This would however prevent searches/subsets by m/z and intensity values directly in the database (using SQL) - something that might not be that interesting anyway.

I'll check the performance of both approaches.
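The one-row-per-spectrum layout can be illustrated in base R without a database. The sketch below uses serialize() as one possible encoding of the peak data into a raw vector; whether the package uses this or another encoding is not stated here, so treat the encoding and the column names as assumptions.

```r
# Peak data of two toy spectra.
mz <- list(c(56.05, 123.3451, 200.1), c(77.04, 105.03))
intensity <- list(c(10, 100, 35), c(80, 100))

# One row per spectrum; each m/z and intensity vector is serialized into a
# single raw vector, which is what would be written to a BLOB column.
msms_spectrum <- data.frame(spectrum_id = 1:2)
msms_spectrum$mz <- lapply(mz, serialize, connection = NULL)
msms_spectrum$intensity <- lapply(intensity, serialize, connection = NULL)

# Restoring the peaks of a spectrum requires decoding in R, which is why
# SQL queries on individual m/z values are not possible with this layout.
unserialize(msms_spectrum$mz[[2]])
```

The table has as many rows as there are spectra, independently of the number of peaks, which avoids the join between peak and metadata tables.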

Modified lipidblast importer

I've modified the LipidBlast importer to also extract the formula and the mass from the JSON. This does not require rcdk.

Getter for synonyms

Is there an interface for accessing the synonyms? I could not find one apart from direct DB access.

Add a class to concatenate Spectrum2 objects and add arbitrary annotations

Define a class that allows concatenating several Spectrum2 objects (similar to a list) and adding arbitrary annotation information to it. Such a class can thus have two purposes:

  1. represent multiple Spectrum2 objects with the ability to implement methods for them, e.g. msLevel(object) would then return the MS level of all objects which is more user friendly than a lapply(object, msLevel) if Spectrum2 objects are in a simple list.
  2. Add any arbitrary annotation column to it, so that we could e.g. link Spectrum2 object(s) to the respective chromatographic peak ID.

Default spectra variable names

As suggested by @michaelwitting in issue #61, we should agree on a base nomenclature for compound/spectra identifiers. I am generally not a big fan of camelCase in variable names (it is just too easy to mistype), so I'd suggest using all lower case?

Happy for feedback, change requests and expansion of the list @michaelwitting @stanstrup @sneumann

  • InChI: inchi
  • InChIKey: inchikey
  • SMILES: smiles
  • SPLASH: splash
    ...

remove -methods from alias?

Is it necessary to have the methods as aliases?

For me, as the number of functions grows, reading through the list of functions becomes more difficult when many entries are duplicated with -method and -class suffixes.
This makes it harder to get a sense of which functions are available.

dbconn seems to need @noRd.

License issues

From @stanstrup on October 19, 2017 9:1

  1. Which databases can I include data from?
  2. If there are ones I cannot include, they will need to be downloaded and the table generated by the user. Is there such a thing as an "in-package cache"?
  3. Which license can the package have if it includes db data?
  4. Is license a concern at all? As far as I know data cannot be copyrighted so is there any concern at all?

The info I extract is: id, name, inchi, formula, and mass.

For the moment I force-removed the files until this is settled.

Copied from original issue: stanstrup/PeakABro#1

Filter data on MS2 mz range

This is related to issue #28, specifically the point 1) described there. Retrieve data based on the m/z range of their MSMS spectra.

  • Add columns msms_mz_range_min and msms_mz_range_max to the msms_spectrum database table.
  • Add MsmsMzRangeMinFilter and MsmsMzRangeMaxFilter filters (terrible name but we have to be compliant with the naming scheme for AnnotationFilter classes).

Use case 1: get all MSMS spectra with all of their peaks within an m/z range:

spectra(compdb, filter = ~ msms_mz_range_min > 123.4 & msms_mz_range_max < 343.4)

Use case 2: get all MSMS spectra with a peak potentially within an m/z range (that's what we need for 1) in issue #28):

spectra(compdb, filter = ~ msms_mz_range_min <= 123.5 & msms_mz_range_max >= 123.4)
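The logic of the two use cases can be tried out on a plain data.frame that mimics the two proposed columns. This is a base R sketch with made-up values; it does not use the filter classes proposed above, and use case 2 is shown for a single query m/z rather than a range.

```r
# Toy spectrum metadata with the two proposed columns.
sp <- data.frame(
    spectrum_id = 1:3,
    msms_mz_range_min = c(50.0, 130.2, 124.0),
    msms_mz_range_max = c(400.0, 340.1, 300.0)
)
# Use case 1: all peaks of the spectrum lie within 123.4 - 343.4.
inside_range <- sp$spectrum_id[sp$msms_mz_range_min > 123.4 &
                               sp$msms_mz_range_max < 343.4]
# Use case 2: the spectrum could contain a peak at m/z 123.45, i.e. its
# m/z range covers that value. This only pre-selects candidates; the peaks
# themselves still have to be checked afterwards.
mz <- 123.45
candidates <- sp$spectrum_id[sp$msms_mz_range_min <= mz &
                             sp$msms_mz_range_max >= mz]
inside_range
candidates
```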

Add new spectra to a `CompDb` database

This is related to issue rformassspectrometry/Spectra#135. It would be nice to have an expandable database, e.g. pre-filled with MS2 spectra from a public repository, to which own spectra and annotations can then be (sequentially) added.

Maybe best to add functions

addSpectra <- function(x, spectra)
updateSpectra <- function(x, spectra)
removeSpectra <- function(x, spectra)

Where x is a CompDb and spectra a Spectra object. Note that the spectra also need to have a reference to the compound in the compound table - which might have to be added before.

Define spectra table

Define the spectra database table to hold the spectrum data.

  • Check what is provided in SDF format (spectra there?).
  • Check what is provided by different providers (MoNa, HMDB)...

Add code/description how to create a CompDb from MassBank

MassBank releases their databases at regular intervals and shares the data with a rather open license, which makes them an ideal candidate for annotation databases that could be distributed via Bioconductor's AnnotationHub.

Explanation: I'm building so-called EnsDb databases for all species for each release of Ensembl. These databases are self-contained SQLite files with gene, transcript, exon and protein annotations and can be downloaded/fetched from AnnotationHub. This is very convenient for the user.

CompDb databases could be distributed in a similar fashion.

What I will try next is to define simple scripts to easily import data from the MassBank (MySQL database) into a CompDb database.

Add functionality to convert mass to m/z and vice versa

Add functionality to convert mass to m/z and back given adduct definitions such as "[M+H]+". This needs:

  • definition of adducts and related conversions/mass differences.
  • function to map from mass to m/z: mass2mz
  • function to map from m/z to mass: mz2mass

The input for the function should be a numeric vector with masses or m/z values, respectively. The output should be a list of length equal to the input vector. Each element should be a named numeric of length equal to the number of specified adducts, with the names being the adduct names and the values the converted values.
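The described input and output could look as follows in base R. The function names mass2mz_sketch and mz2mass_sketch, the layout of the adduct table and the mass shifts (approximate values) are assumptions for illustration; the sketch uses m/z = (mass + mass_add) / charge.

```r
# Hypothetical adduct definitions: m/z = (mass + mass_add) / charge.
adducts <- data.frame(
    name = c("[M+H]+", "[M+Na]+", "[M+2H]2+"),
    mass_add = c(1.007276, 22.989218, 2 * 1.007276),
    charge = c(1, 1, 2)
)

# Convert masses to m/z; returns a list with one named numeric per mass.
mass2mz_sketch <- function(mass, adduct = adducts$name) {
    def <- adducts[match(adduct, adducts$name), ]
    lapply(mass, function(m) {
        mz <- (m + def$mass_add) / def$charge
        names(mz) <- def$name
        mz
    })
}

# Convert m/z values back to masses; same output structure.
mz2mass_sketch <- function(mz, adduct = adducts$name) {
    def <- adducts[match(adduct, adducts$name), ]
    lapply(mz, function(z) {
        mass <- z * def$charge - def$mass_add
        names(mass) <- def$name
        mass
    })
}

mass2mz_sketch(c(180.06339, 194.08038), adduct = c("[M+H]+", "[M+Na]+"))
```

Returning a list keeps the result aligned with the input vector, so the i-th element always holds the conversions for the i-th input value.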

extract cmpDb spectra considering spectrum_id

Hi Johannes,
As you know, I am trying to fit my BD and MS2 workflow identification to your amazing tools.
In one of the steps, we search a CompDb object for spectra that match specific requirements (polarity, mass, originaldb, ...), and then we need to extract those spectra in order to compare them with our query spectrum, i.e. it is necessary to extract spectra from a CompDb object based on their 'spectrum_id' value. Something similar to

spectraSbstd <- Spectra(CompDb..100KMS2ID.1.0, filter = ~ spectrum_id %in% c(100002, 25, 68))

spectrum_id is not considered as a variable in any filter supported by CompDb. How can I address this need? Do I need to extend a new filter from the AnnotationFilter::AnnotationFilter class?

Thanks for all your support

Merge databases

Support merging of CompDb databases, either as

  1. simple functions merging the data from various CompDb databases
  2. a specific MultiSourceCompDb database/object whose layout supports data from various sources.

Compound spectra

Maybe consider if it could be reasonable to hold spectra somehow (as available through MoNA for example).
