BUScorrect R package

The BUScorrect R package implements the BUS model to adjust genomic data for batch effects when there are unknown sample subtypes.

Introduction

High-throughput experimental data are accumulating exponentially in public databases. However, mining valid scientific discoveries from these abundant resources is hampered by technical artifacts and inherent biological heterogeneity. The former are usually termed batch effects, and the latter is often modelled by subtypes.

Researchers have long been aware that samples generated on different days are not directly comparable. Samples processed at the same time are usually referred to as coming from the same batch. Even when the same biological conditions are measured, data from different batches can present very different patterns. The variation among different batches may be due to changes in laboratory conditions, preparation time, reagent lots, and experimenters [1]. The effects caused by these systematic factors are called batch effects.

Various batch-effect correction methods have been proposed for the case where the subtype of each sample is known [2,3]. Here we adopt a broad definition of subtype: a subtype is a set of samples that share the same underlying genomic profile (that is, the same biological signal) when measured with no technical artifacts. For instance, groupings such as case and control can be viewed as two subtypes. However, subtype information is usually unknown, and learning the subtype of each collected sample is often the main interest of the study, especially in personalized medicine.

The R package BUScorrect fits a Bayesian hierarchical model, the Batch-effects-correction-with-Unknown-Subtypes model (BUS), to correct batch effects in the presence of unknown subtypes [4]. BUS (a) corrects batch effects explicitly, (b) groups samples that share similar characteristics into subtypes, (c) identifies features that distinguish subtypes, and (d) has linear-order computational complexity. After batch effects are corrected with BUS, the corrected values can be used in downstream analyses as if all samples had been measured in a single batch. BUS can integrate batches measured on different platforms and allows a subtype to be present in some but not all of the batches, as long as the experimental design fulfils the conditions listed in [4].
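To make the idea of "corrected values" concrete, here is a minimal base-R sketch for a single gene. It does not use the BUScorrect package and is not its implementation. It assumes, for illustration only, that a batch shifts measurements by an additive location effect and stretches the noise by a scale effect, and it uses the true simulated parameters and subtype labels, which BUS would have to estimate. All variable names (alpha, mu, gamma, sigma, y_adj) are hypothetical.

```r
# Illustrative simulation only: one gene, two batches, two subtypes.
# Assumed form: y = alpha + mu[subtype] + gamma[batch] + sigma[batch] * noise
set.seed(1)
n_per_batch <- 200
alpha <- 5            # baseline expression level of the gene
mu    <- c(0, 2)      # subtype effects (subtype 1 is the reference)
gamma <- c(0, 1.5)    # location batch effects (batch 1 is the reference)
sigma <- c(0.5, 1.0)  # scale batch effects (noise standard deviation per batch)

batch   <- rep(1:2, each = n_per_batch)
subtype <- sample(1:2, length(batch), replace = TRUE)
y <- alpha + mu[subtype] + gamma[batch] + sigma[batch] * rnorm(length(batch))

# "Oracle" correction with the true parameters: remove the location effect
# and rescale the noise to the reference batch, keeping the biological signal.
signal <- alpha + mu[subtype]
y_adj  <- signal + (y - signal - gamma[batch]) / sigma[batch] * sigma[1]

# Gap between the batch means of subtype 1, before and after correction.
gap <- function(v) {
  abs(mean(v[batch == 2 & subtype == 1]) - mean(v[batch == 1 & subtype == 1]))
}
raw_gap <- gap(y)
adj_gap <- gap(y_adj)
print(c(raw = raw_gap, adjusted = adj_gap))
```

In this toy setting the between-batch gap for a fixed subtype shrinks after correction, which is the sense in which corrected values can be analysed as if they came from one batch. In real data neither the subtype labels nor the batch parameters are known, which is the problem the BUS model addresses.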

Installation

The development version of the BUScorrect R package is available on Bioconductor. Use the following commands to install it.

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("BUScorrect", version = "devel")

User's Guide

Please refer to the vignette for detailed instructions on each function. To open it, run

browseVignettes("BUScorrect")

Citation

Xiangyu Luo & Yingying Wei (2019) Batch Effects Correction with Unknown Subtypes, Journal of the American Statistical Association, 114:526, 581-594, DOI: 10.1080/01621459.2018.1497494

References

  1. Leek, Jeffrey T., et al. "Tackling the widespread and critical impact of batch effects in high-throughput data." Nature Reviews Genetics 11.10 (2010): 733.
  2. Johnson, W. Evan, Cheng Li, and Ariel Rabinovic. "Adjusting batch effects in microarray expression data using empirical Bayes methods." Biostatistics 8.1 (2007): 118-127.
  3. Leek, Jeffrey T., and John D. Storey. "Capturing heterogeneity in gene expression studies by surrogate variable analysis." PLoS genetics 3.9 (2007): e161.
  4. Xiangyu Luo & Yingying Wei (2019) Batch Effects Correction with Unknown Subtypes, Journal of the American Statistical Association, 114:526, 581-594, DOI: 10.1080/01621459.2018.1497494

buscorrect's Issues

Are tpm/cpm appropriate and question about extending for count data (i.e. RNA-seq)

Let me start off by saying that I loved the paper and how you tackle batch effect and subtypes at the same time.

If I missed this in the paper, please accept my apologies, but I was wondering whether there are any limitations on what kind of (non-count) data behave well in this model. I was thinking that TPM/FPKM/CPM would behave, since they're "similar" to log densities that come from microarray data. Do you see any obvious obstacles in doing this?

Are there any plans to extend the R functions to model for count data?
