mikemc / differential-abundance-theory

A manuscript exploring the effects of taxonomic bias on microbiome differential-abundance analysis.

Home Page: https://mikemc.github.io/differential-abundance-theory

R 0.10% TeX 0.97% Makefile 0.01% CSS 0.01% HTML 98.76% JavaScript 0.15%

bias differential-abundance-analysis manuscript measurements absolute-abundances

differential-abundance-theory's Introduction

Manuscript on taxonomic bias and differential-abundance analysis

This repository contains an in-progress manuscript on the effects of taxonomic bias in microbiome measurements on microbial differential-abundance analysis. The manuscript is structured as a bookdown article. The latest rendered version can be viewed here. Data analyses and simulations supporting this work can be seen here. This repository and the rendered manuscript are licensed under a CC BY 4.0 License. See the Zenodo record for how to cite the latest version.

differential-abundance-theory's Issues

Create figure illustrating solutions

A possible starting point is this figure, which illustrates how different absolute-abundance strategies all give biased individual estimates, but only some give biased fold-change estimates: https://twitter.com/mikemc423/status/1365300452406005763/photo/1

Create figure illustrating the fundamental problem

Normalization imposes competition, which causes the measured abundance of a focal taxon to depend on the relative abundances of all taxa. Perhaps have a panel that illustrates the experimental workflow, showing the different types of normalization that can occur (saturating extraction yield; deliberate library normalization; computational normalization to proportions). Then have another panel (or two) showing that the error in the proportion (or abundance?) of a particular taxon depends on the composition of the rest of the sample, similar to Figure 2 of https://elifesciences.org/articles/46923
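
A small numerical sketch of this dependence, assuming a multiplicative bias model in which the reads from each taxon are proportional to its true abundance times a taxon-specific efficiency. The efficiencies and compositions are made-up illustration values, and the function name is hypothetical. The focal taxon has the same true proportion in both samples, yet the error in its measured proportion changes with the background.

```python
def measured_proportions(true_props, efficiencies):
    """Proportions after multiplicative bias and renormalization to sum to 1."""
    weighted = [p * e for p, e in zip(true_props, efficiencies)]
    total = sum(weighted)  # equals the sample's mean efficiency
    return [w / total for w in weighted]


# Hypothetical efficiencies: focal taxon, high-efficiency and low-efficiency background taxa
efficiencies = [1.0, 4.0, 0.25]

# The focal taxon (index 0) is 20% of both samples; only the background differs
sample_a = [0.2, 0.7, 0.1]  # background dominated by the high-efficiency taxon
sample_b = [0.2, 0.1, 0.7]  # background dominated by the low-efficiency taxon

obs_a = measured_proportions(sample_a, efficiencies)
obs_b = measured_proportions(sample_b, efficiencies)

# Fold error in the focal taxon's proportion = its efficiency / sample mean efficiency
error_a = obs_a[0] / sample_a[0]
error_b = obs_b[0] / sample_b[0]

print(f"sample A: measured {obs_a[0]:.3f}, fold error {error_a:.2f}")
print(f"sample B: measured {obs_b[0]:.3f}, fold error {error_b:.2f}")
```

Under these assumptions the focal taxon is underestimated in sample A and overestimated in sample B, even though its own efficiency never changes.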

Add derivation of relationship between diversity and variance in mean efficiency under IID assumption

Some relevant notes below; I may also have the full derivation from GoodNotes typed up elsewhere already.

Notes from 2020-12-16 Wednesday

Intuition for why bias might be less problematic in diverse ecosystems

To get some intuition as to why bias might be less problematic in diverse ecosystems, consider assembling a community by adding random species whose efficiencies are independently drawn from a distribution with mean $\mu$ and variance $\sigma^2$. The variance of the mean efficiency of the community after adding $I$ species, conditional on the community proportions $\tilde A_i$, is $\sigma^2 \sum_{i=1}^I \tilde A_i^2 = \sigma^2 / {}^2D$, where ${}^2D = 1 / \sum_{i=1}^I \tilde A_i^2$ is the diversity of order 2, also known as the Inverse Simpson index. Thus, if species are added in such a way that the Inverse Simpson index increases without bound, and the efficiencies are IID, then the sample mean efficiency will tend to $\mu$.
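
The variance identity in this note follows in a few lines from independence. A sketch, writing $e_i$ for the efficiency of species $i$ and $\bar e$ for the sample mean efficiency (these two symbols are introduced here for illustration and may not match the manuscript's notation), and assuming the efficiencies are independent of the proportions:

```latex
\begin{align}
\bar e &= \sum_{i=1}^I \tilde A_i e_i, \qquad \sum_{i=1}^I \tilde A_i = 1, \\
\operatorname{E}[\bar e \mid \tilde A] &= \mu \sum_{i=1}^I \tilde A_i = \mu, \\
\operatorname{Var}[\bar e \mid \tilde A]
  &= \sum_{i=1}^I \tilde A_i^2 \operatorname{Var}[e_i]
   = \sigma^2 \sum_{i=1}^I \tilde A_i^2
   = \frac{\sigma^2}{{}^2D}.
\end{align}
```

The cross terms vanish in the last line because the $e_i$ are independent, so only the squared proportions remain.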

See GoodNotes file for the math. Something to keep in mind is that this result is for the variance, but the geometric variance is most relevant. (Though it should remain true that the geometric variance tends to 1 if the variance tends to 0.)
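
The variance result can also be checked by simulation. The sketch below is a minimal Monte Carlo check, not the GoodNotes derivation. It assumes gamma-distributed efficiencies (chosen only because they are positive and have a simple mean and variance) and uses hypothetical example communities. It checks the ordinary variance, not the geometric variance.

```python
import random
import statistics


def inverse_simpson(props):
    """Diversity of order 2: 1 / sum of squared proportions."""
    return 1.0 / sum(p * p for p in props)


def simulated_variance(props, shape, scale, n_reps, rng):
    """Monte Carlo variance of the abundance-weighted mean efficiency."""
    means = []
    for _ in range(n_reps):
        # IID gamma efficiencies: mean = shape * scale, variance = shape * scale**2
        effs = [rng.gammavariate(shape, scale) for _ in props]
        means.append(sum(p * e for p, e in zip(props, effs)))
    return statistics.variance(means)


rng = random.Random(2020)
shape, scale = 4.0, 0.25  # hypothetical values giving mu = 1 and sigma^2 = 0.25
sigma2 = shape * scale ** 2

# Hypothetical communities, each given as a list of proportions summing to 1
communities = {
    "2 even species": [0.5, 0.5],
    "10 even species": [0.1] * 10,
    "10 species, one dominant": [0.91] + [0.01] * 9,
    "50 even species": [0.02] * 50,
}

results = {}
for name, props in communities.items():
    predicted = sigma2 / inverse_simpson(props)
    observed = simulated_variance(props, shape, scale, 10_000, rng)
    results[name] = (predicted, observed)
    print(f"{name}: 2D = {inverse_simpson(props):.2f}, "
          f"predicted = {predicted:.4f}, simulated = {observed:.4f}")
```

The community with one dominant species is included to show that the Inverse Simpson index, not species richness, controls the variance: it has ten species but behaves like a community of little more than one.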

Create Docx output

Write caption for figure with regression example

Synthesize references and evidence on the use of host- and diet-derived reads as natural constant references

Theoretical points

Relevant theory for how these reads might be used for differential absolute-abundance analysis is in the appendix section Differential absolute abundance.
Host- and diet-derived reads can be thought of as a natural spike-in, or as a reference taxon with assumed fixed abundance.
They can be used for proportion- or ratio-based absolute-abundance (AA) inference.
Current studies seem to use them in proportion mode: they use the ratio (bacterial reads) / (host + diet reads) to infer bacterial biomass and, if the AA of individual microbes is needed, multiply this biomass estimate by the MGS proportions.
Use in ratio mode is more robust to bias under the MWC model for the purposes of inferring fold changes across samples; however, it may be important not to aggregate reads from different taxa (e.g. host and different plants) for this method to remain robust to bias; see the appendix section Multiple reference taxa.
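
A toy calculation can illustrate the ratio-mode point. The sketch below assumes a multiplicative bias model (my reading of the MWC model) in which expected reads are proportional to abundance times a taxon-specific efficiency times a sample-specific depth factor. All abundances, efficiencies, and depths are invented for illustration. It shows only two things: per-taxon ratios to a single constant reference recover the true fold changes, and the fold change in the bulk estimate (bacterial reads) / (host reads) can be distorted when the bacterial composition shifts.

```python
def expected_reads(abundances, efficiencies, depth):
    """Expected reads under multiplicative bias with a sample-specific depth factor."""
    return {t: abundances[t] * efficiencies[t] * depth for t in abundances}


# Hypothetical efficiencies; "host" plays the role of the constant reference
efficiencies = {"host": 0.5, "taxon_1": 5.0, "taxon_2": 0.2}

# True abundances: host constant, total bacterial abundance constant (100), composition shifts
sample_a = {"host": 100.0, "taxon_1": 10.0, "taxon_2": 90.0}
sample_b = {"host": 100.0, "taxon_1": 40.0, "taxon_2": 60.0}

reads_a = expected_reads(sample_a, efficiencies, depth=3.0)
reads_b = expected_reads(sample_b, efficiencies, depth=7.0)

bacteria = ["taxon_1", "taxon_2"]

# Ratio mode: per-taxon ratio to host reads, then fold change across samples
fc_ratio = {
    t: (reads_b[t] / reads_b["host"]) / (reads_a[t] / reads_a["host"])
    for t in bacteria
}
fc_true = {t: sample_b[t] / sample_a[t] for t in bacteria}

# Bulk biomass estimate: (bacterial reads) / (host reads)
biomass_a = sum(reads_a[t] for t in bacteria) / reads_a["host"]
biomass_b = sum(reads_b[t] for t in bacteria) / reads_b["host"]
biomass_fc = biomass_b / biomass_a
true_total_fc = sum(sample_b[t] for t in bacteria) / sum(sample_a[t] for t in bacteria)

for t in bacteria:
    print(f"{t}: true fold change {fc_true[t]:.3f}, ratio-mode estimate {fc_ratio[t]:.3f}")
print(f"total biomass: true fold change {true_total_fc:.3f}, estimate {biomass_fc:.3f}")
```

In this toy case the efficiencies and depth factors cancel in the per-taxon ratios, whereas the bulk estimate reports a large increase in bacterial biomass that did not occur, because the high-efficiency taxon became more abundant.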

Key empirical refs

Human or mouse gut

Plants

Arabidopsis studies by Karasov, Regalado, and Weigel et al: regalado2019comb, karasov2020ther

Other

wallace2021thed Drosophila DNA virus metagenomics

Other notes and links

Twitter discussions

Sean Gibbons and Jotham Suez
Fabian Staubach on using this method w/ Drosophila

Ensure all comments migrated from Google Docs

Fix theorem rendering

make pdf now fails with the error message below. The issue may be related to my updating to R 4.1.0.

> bookdown::render_book('.', 'bookdown::pdf_book', quiet = TRUE)
Rendering book in directory '.'
! Extra \fi.
l.1084 ...-102-105-99-105-101-110-116-115-93-\}\fi
                                                  {} 

Error: LaTeX failed to compile _main.tex. See https://yihui.org/tinytex/r/#debugging for debugging tips. See _main.log for more info.
Execution halted
make: *** [Makefile:5: pdf] Error 1

Add Hypothesis commenting instructions

Currently I'm using the Hypothesis group https://hypothes.is/groups/W6BemoNn/da-manuscript to comment directly on the version of the manuscript hosted at https://da.mikemc.cc. Anyone can add or reply to comments. My main concern is that I'm not sure what will happen to these comments once I update the associated manuscript pages. Once I figure this out, I should add instructions to the Readme and/or Preface.

Revise appendix Models section

DAA Critique Discussion

Hey Mike,

This popped up on my GitHub feed. It's so cool to see a manuscript being written in the open! I am a big fan of your "Consistent and Correctable Bias" paper from a couple years back, and appreciate your commitment to truly open science.

Like you, I've come to doubt the utility of differential abundance analysis. We recently wrote up our perspective that challenges DAA, and offer some possible (ratio-based) alternatives. I wanted to link you to it because I think there may be some synergy between our perspectives, and it could be fun to start a discussion.

Settle notation regarding spike-ins, targeted measurements, and reference taxa

Currently the Models appendix defines separate notation for the abundance of spike-in taxa and of taxa measured by targeted measurement methods. Both approaches can be seen as special cases of having estimated abundances for a set of reference taxa. For this reason the later sections analyze spike-ins using the targeted-measurement results, and thus make no real use of the notation $S$ for spike-in abundances. I should consider dropping this unnecessary notation, and perhaps using just the $T$ notation for both. In this case the text needs to be updated to make this fundamental unity clearer early on and to establish that all "targeted" results also apply to spike-ins (and so perhaps should really be considered "reference taxa" results).

Other thoughts

Certain computational normalization approaches also fit into this category.
The Zemb2020 spike-in + qPCR method (#6) is fundamentally different; it uses a spike-in to improve the bulk abundance estimate and then applies the method I describe for bulk-abundance estimation, throwing a wrinkle into my categorization scheme.

Create system for access to past manuscript versions

Ideally we'd have a system similar to Manubot's, where past HTML versions of the manuscript remain available online so that Hypothesis comments keep working. The Deep Review and Manubot CI approaches could serve as inspiration here:

https://greenelab.github.io/meta-review/
https://github.com/manubot/rootstock#continuous-integration

Different flavors of CoDa regression

Hi Mike,

This is a great resource/reading material. Thanks for making it public!

I was wondering, have you looked into linear models on clr-transformed relative abundances vs. the multinomial regression suggested by Morton et al. (2019)?

I think the former is becoming commonplace as it's more straightforward to run. But the latter has the benefit that the ranks of the coefficients are identical on both relative and absolute data. I use Justin Silverman's fido package to run it, but it would be useful if there were also a frequentist way of running it.

Thanks
Johannes

Synthesize spike-in approach of Zemb et al 2020

Zemb O, Achard CS, Hamelin J, De Almeida M, Gabinaud B, Cauquil L, Verschuren LMG, Godon J. 2020. Absolute quantitation of microbes using 16S rRNA gene metabarcoding: A rapid normalization of relative abundances by quantitative PCR targeting a 16S rRNA gene spike‐in standard. Microbiologyopen 9:1–21. doi:10.1002/mbo3.977 https://onlinelibrary.wiley.com/doi/abs/10.1002/mbo3.977

At least two interesting things to consider in this study:

They apply the synthetic DNA standard prior to DNA extraction.
They run qPCR on the synthetic standard and use this information somehow.

The earlier experiment of Tkacz et al. (2018) is also relevant.