q2-qemistree

Canonically pronounced chemis-tree.

A tool to build a tree of mass-spectrometry (LC-MS/MS) features to perform chemically-informed comparison of untargeted metabolomic profiles. The manuscript describing q2-qemistree is available here.

Installation

Once QIIME 2 is installed, activate your QIIME 2 environment and install q2-qemistree following the steps below:

git clone https://github.com/biocore/q2-qemistree.git
cd q2-qemistree
pip install .
qiime dev refresh-cache

q2-qemistree uses SIRIUS, a software-framework developed for de-novo identification of metabolites. We use molecular substructures predicted by SIRIUS to build a hierarchy of the MS1 features in a dataset. For this demo, please download and unzip the latest version of SIRIUS from here.

The commands below download SIRIUS for macOS (on Linux, only the URL of the binary changes):

wget https://bio.informatik.uni-jena.de/repository/dist-release-local/de/unijena/bioinf/ms/sirius/4.9.3/sirius-4.9.3-osx64-headless.zip
unzip sirius-4.9.3-osx64-headless.zip

Note: Qemistree was initially developed against SIRIUS 4.0.1. Since SIRIUS 4.0.1 has reached its end of life, Qemistree was adapted to work with newer SIRIUS versions (>4.4.29).

Demonstration

q2-qemistree ships with the following methods:

qiime qemistree compute-fragmentation-trees
qiime qemistree rerank-molecular-formulas
qiime qemistree predict-fingerprints
qiime qemistree make-hierarchy
qiime qemistree get-classyfire-taxonomy
qiime qemistree prune-hierarchy

To generate a tree that relates the MS1 features in your experiment, we need to pre-process mass-spectrometry data (.mzXML, .mzML or .mzDATA files) using MZmine2 and produce the following inputs:

  1. An MGF file with both MS1 and MS2 information. This file will be imported into QIIME 2 as a MassSpectrometryFeatures artifact.
  2. A feature table with peak areas of MS1 ions per sample. This table will be imported from a CSV file into the BIOM format, and then into QIIME 2 as a FeatureTable[Frequency] artifact.
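
The CSV-to-BIOM step can be prepared with a small reshaping script. The sketch below is only an illustration: it assumes an MZmine2-style export with a `row ID` column and one `<sample> Peak area` column per sample, no quoted commas, and a classic BIOM TSV header (`#OTU ID`). The sample data are made up.

```shell
# Hypothetical MZmine2-style export, written to a temp file so the sketch runs on its own.
csv="$(mktemp)"
cat > "$csv" <<'EOF'
row ID,row m/z,row retention time,sampleA.mzXML Peak area,sampleB.mzXML Peak area
1,181.07,2.31,1500.5,0
2,255.23,5.87,0,320.0
EOF

# Keep the feature ID and the peak-area columns; drop m/z and retention time.
tsv="$(awk -F, '
  NR == 1 {
    for (i = 1; i <= NF; i++) {
      if ($i == "row ID") { id_col = i }
      else if ($i ~ / Peak area$/) { cols[++n] = i }
    }
    printf "#OTU ID"
    for (j = 1; j <= n; j++) {
      name = $(cols[j])
      sub(/ Peak area$/, "", name)   # strip the suffix to get the sample name
      printf "\t%s", name
    }
    printf "\n"
    next
  }
  {
    printf "%s", $id_col
    for (j = 1; j <= n; j++) { printf "\t%s", $(cols[j]) }
    printf "\n"
  }' "$csv")"

printf '%s\n' "$tsv"
rm -f "$csv"
```

The resulting tab-separated table could then be saved to a file and converted with the `biom convert` command from the biom-format package before importing it into QIIME 2.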

These input files can be obtained following peak detection in MZmine2. Here is an example MZmine2 batch file used to generate these.
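
Because the MGF file must carry both MS1 and MS2 information, a quick check before importing can save time. The following is a rough sketch, not the validation that q2-qemistree itself performs: it assumes each `BEGIN IONS`/`END IONS` block has `FEATURE_ID=` and `MSLEVEL=` lines, as in MZmine2's SIRIUS export, and uses a made-up sample file.

```shell
# Hypothetical sample MGF; replace "$mgf" with the path to your own export.
mgf="$(mktemp)"
cat > "$mgf" <<'EOF'
BEGIN IONS
FEATURE_ID=1
MSLEVEL=1
100.0 10
END IONS

BEGIN IONS
FEATURE_ID=1
MSLEVEL=2
50.0 5
END IONS

BEGIN IONS
FEATURE_ID=2
MSLEVEL=1
200.0 20
END IONS
EOF

# Record which MS levels were seen per feature, then list features
# that have an MS1 block but no MS2 block.
missing_ms2="$(awk -F= '
  /^BEGIN IONS/  { id = ""; level = "" }
  /^FEATURE_ID=/ { id = $2 }
  /^MSLEVEL=/    { level = $2 }
  /^END IONS/    { if (id != "") { ids[id] = 1; seen[id, level] = 1 } }
  END {
    for (k in ids) {
      if (((k, 1) in seen) && !((k, 2) in seen)) { print k }
    }
  }' "$mgf" | sort -n)"

echo "features missing MS2: ${missing_ms2:-none}"
rm -f "$mgf"
```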

To begin this demonstration, create a separate folder to store all the inputs and outputs:

mkdir demo-qemistree
cd demo-qemistree

Download a small feature table and MGF file using:

wget https://raw.githubusercontent.com/biocore/q2-qemistree/master/q2_qemistree/demo/feature-table.biom
wget https://raw.githubusercontent.com/biocore/q2-qemistree/master/q2_qemistree/demo/sirius.mgf

We import these files into the appropriate QIIME 2 artifact formats as follows:

qiime tools import --input-path feature-table.biom --output-path feature-table.qza --type FeatureTable[Frequency]
qiime tools import --input-path sirius.mgf --output-path sirius.mgf.qza --type MassSpectrometryFeatures

Note: If the MGF file has formatting errors (e.g., no MS1 entries are included, or an MS1 entry does not have a corresponding MS2 entry), an error message will help you troubleshoot this step before proceeding.

First, we generate fragmentation trees for molecular peaks detected using MZmine2:

qiime qemistree compute-fragmentation-trees --p-sirius-path 'sirius.app/Contents/MacOS' \
  --i-features sirius.mgf.qza \
  --p-ppm-max 15 \
  --p-profile orbitrap \
  --p-ions-considered '[M+H]+' \
  --p-java-flags "-Djava.io.tmpdir=/path-to-some-dir/ -Xms16G -Xmx64G" \
  --o-fragmentation-trees fragmentation_trees.qza

Note: /path-to-some-dir/ should be a directory where you have write permissions and sufficient storage space. We use -Xms16G and -Xmx64G as the minimum and maximum heap sizes for the Java virtual machine (JVM). If --p-java-flags is left blank, q2-qemistree will use default JVM flags.
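
One way to prepare the temporary directory and the flag string ahead of time is sketched below. The directory location is a hypothetical choice, and the heap sizes are simply the ones used in this demo; adjust both to your machine.

```shell
# Hypothetical location; pick any directory with enough free space.
java_tmp="${TMPDIR:-/tmp}/qemistree-java-tmp"
mkdir -p "$java_tmp"

# Stop early if the directory is not writable.
if [ ! -w "$java_tmp" ]; then
  echo "cannot write to $java_tmp" >&2
  exit 1
fi

# Show free space (in KiB) on the filesystem that holds the directory.
df -k "$java_tmp"

# Same flags as in the demo commands, with the real directory filled in.
java_flags="-Djava.io.tmpdir=${java_tmp}/ -Xms16G -Xmx64G"
echo "$java_flags"
```

The resulting string can be passed as `--p-java-flags "$java_flags"` in the commands of this demo.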

This generates a QIIME 2 artifact of type SiriusFolder. This contains fragmentation trees with candidate molecular formulas for each MS1 feature detected in your experiment.

Note 2: Newer SIRIUS versions have the parameter --p-ions-considered, which specifies the adducts to be considered for the MS/MS data. Here are some examples: [M+H]+, [M+K]+, [M+Na]+, [M+H-H2O]+, [M+H-H4O2]+, [M+NH4]+, [M-H]-, [M+Cl]-, [M-H2O-H]-, [M+Br]-.

You can also provide a comma-separated list. Example: '[M+H]+, [M+Na]+'.

Next, we select the top-scoring molecular formula for each feature as follows:

qiime qemistree rerank-molecular-formulas --p-sirius-path 'sirius.app/Contents/MacOS' \
  --i-features sirius.mgf.qza \
  --i-fragmentation-trees fragmentation_trees.qza \
  --p-zodiac-threshold 0.95 \
  --p-java-flags "-Djava.io.tmpdir=/path-to-some-dir/ -Xms16G -Xmx64G" \
  --o-molecular-formulas molecular_formulas.qza

This produces a QIIME 2 artifact of type ZodiacFolder with the top-ranked molecular formula for each MS1 feature. Now, we predict molecular substructures in each feature based on the molecular formulas. We use CSI:FingerID for this purpose as follows:

qiime qemistree predict-fingerprints --p-sirius-path 'sirius.app/Contents/MacOS' \
  --i-molecular-formulas molecular_formulas.qza \
  --p-ppm-max 20 \
  --p-java-flags "-Djava.io.tmpdir=/path-to-some-dir/ -Xms16G -Xmx64G" \
  --o-predicted-fingerprints fingerprints.qza

This gives us a QIIME 2 artifact of type CSIFolder that contains probabilities of molecular substructures (2936 molecular properties in total) within each feature. We use these predicted molecular substructures to generate a hierarchy of molecules as follows:

qiime qemistree make-hierarchy \
  --i-csi-results fingerprints.qza \
  --i-feature-tables feature-table.qza \
  --o-tree qemistree.qza \
  --o-feature-table feature-table-hashed.qza \
  --o-feature-data feature-data.qza

To support meta-analyses, this method can handle one or more datasets, i.e., pairs of CSI results and feature tables. To test this functionality, download a feature table and CSI fingerprint result from another experiment as follows:

wget https://raw.githubusercontent.com/biocore/q2-qemistree/master/q2_qemistree/demo/feature-table2.biom.qza
wget https://raw.githubusercontent.com/biocore/q2-qemistree/master/q2_qemistree/demo/fingerprints2.qza

Below is the q2-qemistree command to co-analyze the two datasets:

qiime qemistree make-hierarchy \
  --i-csi-results fingerprints.qza \
  --i-csi-results fingerprints2.qza \
  --i-feature-tables feature-table.qza \
  --i-feature-tables feature-table2.biom.qza \
  --o-tree merged-qemistree.qza \
  --o-feature-table merged-feature-table-hashed.qza \
  --o-feature-data merged-feature-data.qza

Qemistree also supports including structural annotations from MS/MS spectral library matches for downstream analysis, using the optional input --i-ms2-matches as follows:

qiime qemistree make-hierarchy \
  --i-csi-results fingerprints.qza \
  --i-feature-tables feature-table.qza \
  --i-ms2-matches path-to-MS2-spectral-matches.qza \
  --o-tree qemistree.qza \
  --o-feature-table feature-table-hashed.qza \
  --o-feature-data feature-data.qza

Note:

  1. The input to --i-ms2-matches can be obtained using the Feature-based molecular networking (FBMN) workflow supported in the web-based mass-spectrometry data analysis platform, GNPS. To use MS2 matches in Qemistree, please download the results of the FBMN workflow and import the TSV file in the folder clusterinfo_summary as a QIIME 2 artifact of type FeatureData[Molecules] as follows:
qiime tools import \
  --input-path path-to-MS2-spectral-matches.tsv \
  --output-path path-to-MS2-spectral-matches.qza \
  --type FeatureData[Molecules]
  2. The input CSI results, feature tables and MS2 match tables should have a one-to-one correspondence, i.e., CSI results, feature tables and MS2 match tables from all datasets should be provided in the same order.
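
Generating the repeated options from a single list of datasets is one way to keep the three inputs in the same order. This is a minimal sketch: the dataset prefixes and the file-naming scheme (`<prefix>-fingerprints.qza`, `<prefix>-feature-table.qza`, `<prefix>-ms2-matches.qza`) are hypothetical, and the command is only printed, not executed.

```shell
# Hypothetical dataset prefixes, listed once so every input follows the same order.
datasets="study1 study2"

csi_args=""
table_args=""
ms2_args=""
for d in $datasets; do
  csi_args="$csi_args --i-csi-results ${d}-fingerprints.qza"
  table_args="$table_args --i-feature-tables ${d}-feature-table.qza"
  ms2_args="$ms2_args --i-ms2-matches ${d}-ms2-matches.qza"
done

# Assemble the full command; review it, then run it by hand.
cmd="qiime qemistree make-hierarchy${csi_args}${table_args}${ms2_args}"
cmd="$cmd --o-tree merged-qemistree.qza"
cmd="$cmd --o-feature-table merged-feature-table-hashed.qza"
cmd="$cmd --o-feature-data merged-feature-data.qza"

echo "$cmd"
```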

This method generates the following:

  1. A combined feature table made by merging all the input feature tables; MS1 features without fingerprints are filtered out of this feature table. This is done because SIRIUS predicts molecular substructures for only a subset of features (typically 70-90% of all MS1 features) in an experiment, depending on factors such as sample type, the quality of MS2 spectra, and user-defined tolerances such as --p-ppm-max and --p-zodiac-threshold. This output is of type FeatureTable[Frequency].
  2. A tree relating the MS1 features in these data based on molecular substructures predicted for MS1 features. This is of type Phylogeny[Rooted]. By default, we retain all fingerprint positions (i.e., 2936 molecular properties). Adding --p-qc-properties filters these properties to keep only PubChem fingerprint positions (489 molecular properties) in the contingency table. Note: The latest release of SIRIUS uses the PubChem version downloaded on 13 August 2017.
  3. A combined feature data file that contains unique identifiers of each feature, their corresponding original feature identifier (row ID from MZmine2), parent mass (parent_mass), retention time (retention_time), CSI:FingerID structure predictions (csi_smiles), MS2 match structure predictions (ms2_smiles), and the table(s) (table_number) that each feature was detected in. This is of type FeatureData[Molecules]. (Renaming the features prevents collisions between non-unique feature identifiers from the original feature tables in meta-analyses.)

These can be used as inputs to perform chemical phylogeny-based alpha-diversity and beta-diversity analyses.
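
As an illustration, the tree and hashed feature table could be passed to QIIME 2's phylogenetic diversity methods. The sketch below assumes the q2-diversity plugin is installed; the metric choices and output file names are examples rather than recommendations from this README. By default the commands are only printed; set `DRY_RUN=0` to execute them.

```shell
# Print commands by default; set DRY_RUN=0 in the environment to run them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Alpha diversity over the chemical hierarchy (Faith's PD).
run qiime diversity alpha-phylogenetic \
  --i-table feature-table-hashed.qza \
  --i-phylogeny qemistree.qza \
  --p-metric faith_pd \
  --o-alpha-diversity faith-pd.qza

# Beta diversity over the chemical hierarchy (weighted UniFrac).
run qiime diversity beta-phylogenetic \
  --i-table feature-table-hashed.qza \
  --i-phylogeny qemistree.qza \
  --p-metric weighted_unifrac \
  --o-distance-matrix weighted-unifrac.qza
```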

Furthermore, Qemistree supports the classification of molecules into Classyfire chemical taxonomy. We generate a feature data table (also of the type FeatureData[Molecules]) which includes classification of molecules into chemical 'kingdom', 'superclass', 'class', 'subclass', and 'direct_parent'. We can run Classyfire using Qemistree as follows:

qiime qemistree get-classyfire-taxonomy \
  --i-feature-data merged-feature-data.qza \
  --o-classified-feature-data classified-merged-feature-data.qza

Qemistree will use ms2_smiles to make chemical taxonomy assignments, when MS2 matches are available for a feature. Otherwise, csi_smiles will be used. The column structure_source in classified-merged-feature-data.qza records whether taxonomic assignment was done using CSI:FingerID predictions or MS/MS library matches.
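
The precedence rule described above can be illustrated on a toy table. This sketch is not Qemistree's implementation: the sample rows, the use of empty strings for missing values, and the labels `MS2`, `CSI` and `none` are assumptions made for the example.

```shell
# Toy feature data with columns: id, csi_smiles, ms2_smiles (tab-separated).
feature_data="$(mktemp)"
printf 'id\tcsi_smiles\tms2_smiles\n' >  "$feature_data"
printf 'f1\tCCO\tCCN\n'               >> "$feature_data"
printf 'f2\tCCC\t\n'                  >> "$feature_data"
printf 'f3\t\t\n'                     >> "$feature_data"

# Prefer the MS2 library match; fall back to the CSI:FingerID prediction.
chosen="$(awk -F'\t' '
  NR == 1 { next }
  {
    if ($3 != "")      { smiles = $3; src = "MS2" }
    else if ($2 != "") { smiles = $2; src = "CSI" }
    else               { smiles = "";  src = "none" }
    printf "%s\t%s\t%s\n", $1, smiles, src
  }' "$feature_data")"

printf '%s\n' "$chosen"
rm -f "$feature_data"
```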

Lastly, Qemistree includes some utility functions that are useful to visualize and explore the molecular hierarchy generated above. Qemistree trees can be visualized using q2-empress [preprint]. The installation commands below can be run within your QIIME 2 environment:

pip uninstall --yes emperor
pip install git+https://github.com/biocore/empress.git
qiime dev refresh-cache

  1. Prune the molecular hierarchy to keep only the molecules with annotations.
qiime qemistree prune-hierarchy \
  --i-feature-data classified-merged-feature-data.qza \
  --p-column class \
  --i-tree merged-qemistree.qza \
  --o-pruned-tree merged-qemistree-class.qza

Users can choose any of the data columns (--p-column) in the classified-merged-feature-data.qza file to prune the hierarchy, for example '#featureID', 'kingdom', 'superclass', 'class', 'subclass', 'direct_parent', or 'smiles'. All features with no data in this column will be removed from the phylogeny.

  2. Generate an annotated Qemistree tree using q2-empress.
qiime empress community-plot \
    --i-tree merged-qemistree-class.qza \
    --i-feature-table feature-table-hashed.qza \
    --m-sample-metadata-file path-to-sample-metadata.tsv \
    --m-feature-metadata-file classified-merged-feature-data.qza \
    --o-visualization empress-tree.qzv

The output Empress QZV file can be opened with QIIME 2 View, where Empress can be used to interactively modify the tree visualization. Below is an example visualization from the Empress preprint. Here, the user has sample metadata columns (food sources) to compare groups of food samples; Empress lets them visualize relative metabolite prevalence as bar charts at the tips of the tree.

Empress plot

Please visit the Empress tutorial for all the currently supported tree visualization features that can be leveraged to explore the chemical diversity of your metabolomics dataset.

q2-qemistree's People

Contributors

anupriyatripathi, eldeveloper, fedarko, helenamrusso, mwang87, qiyunzhu, stephramos17

q2-qemistree's Issues

Distinguish between origin of empty tables in collation step

Currently an error is raised when the fingerprint table is empty. However, this can occur for one of two reasons: either no fingerprints were predicted and the output of CSI is legitimately empty, or CSI failed and produced no results. It would be useful to distinguish between these two cases.

module to merge collated fingerprints table

..to enable meta-analyses of studies if feature finding has been done separately. Once fingerprints are merged, one can generate a hierarchy of the combined tables. Would have to be supplemented with a module to merge feature tables as well such that tip names on the hierarchy match feature names in the tables

define input files

  • might need to get an mzXML to .mgf parser
  • define qiime types and transformers

implement tip aggregation

Aggregate tips to higher-level chemical classifications. This will help provide annotations to unknown molecule tips

update README

..to describe new visualization capabilities in qemistree plot

prune hierarchy based on feature abundance and/or prevalence

prune-hierarchy could take sample metadata and feature data as inputs and prune trees based on the prevalence of features in the various categories of a metadata column. Alternatively, pruning could be done independent of sample metadata based on the top x% most abundant features across the dataset.

save zodiac summary file

This file has predicted molecular formula per feature. This will help in validating the molecular formula prediction accuracy.

Allow users to collate SIRIUS' outputs

This will imply that we need to make a QIIME2 pipeline so that we can separate the steps that we currently have in fingerprint. Another consequence of this will be that we will be able to save the output from Sirius.

Add parameter for ionization mode

Should likely allow users to specify if they want to let sirius guess the mode, or if they want to explicitly set to positive/negative mode.
