prolfqua - an R package for Proteomics Label Free Quantification Services

The R package contains functions for analyzing mass spectrometry-based label-free quantification (LFQ) experiments. The package is developed at the FGCZ.

How to install prolfqua?

Requirements: a Windows, Linux, or macOS x64 platform with R 4 or higher.

We recommend installing the package from the latest release. Download prolfqua_X.Y.Z.tar.gz from the GitHub release page, and then execute:

install.packages("prolfqua_X.Y.Z.tar.gz",repos = NULL, type="source")

To install the package from GitHub without vignettes, execute in R:

install.packages('remotes')
remotes::install_github('wolski/prolfqua')

If you want to build the vignettes on your system:

install.packages('remotes')
remotes::install_gitlab("wolski/prolfquadata", host="gitlab.bfabric.org")
remotes::install_github('wolski/prolfqua', build_vignettes = TRUE)

Depending on the performance of your system, building the package with all vignettes can take up to 1 hour.

Please let us know about any installation problems or errors when using the package: https://github.com/wolski/prolfqua/issues

How to get started

See the Bioconductor 2021 Conference poster, or watch the lightning talk (8 min) at EuroBioc2020 on YouTube.

Or read the pkgdown-generated website: https://wolski.github.io/prolfqua/index.html

Detailed documentation with R code:

Example QC and sample size report

How to cite?

If you use the package in your work, please cite: https://f1000research.com/slides/9-1476

Motivation

The package for proteomics label-free quantification, prolfqua (read: prolevka), evolved from a set of scripts and functions written in the R programming language to visualize and analyze mass spectrometric data; some of them still live in R packages such as quantable, protViz or imsbInfer. To compute protein fold changes among treatment conditions, we first used t-tests or linear models, and then started to use functions implemented in the package limma to obtain moderated p-values. We also tried other packages such as MSstats, ROPECA or MSqRob, all implemented in R, with the idea of integrating the various approaches to protein fold-change estimation. Although all these packages are written in R, their model specification, input and output formats differ widely, which made our aim of using the original implementations challenging. Therefore, and also to understand the algorithms used, we attempted to reimplement those methods where possible.

When developing prolfqua, we were inspired by packages such as sf or stars, which use data in long table format together with dplyr for data transformation and ggplot2 for visualization. In the long table format each column stores a different attribute, e.g. there is only a single column with the raw intensities. In the wide table format there might be several columns with the same attribute, e.g. one raw intensity column for each recorded sample. In prolfqua the data needed for analysis is represented using a single data frame in long format and a configuration object. The configuration annotates the table, i.e. it specifies what information is in which column. The results of the statistical modelling are stored in data frames. Relying on the long table format gave us access to a large variety of useful visualizations and data preprocessing methods implemented in the R packages dplyr and ggplot2.
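As a minimal sketch of the difference between the two layouts, the base R snippet below reshapes a small wide table into long format and pairs it with a plain list that records which column holds which attribute. The column names, the toy intensities and the annotation list are assumptions made for illustration; the list is not prolfqua's configuration object.

```r
# Wide format: one intensity column per sample (toy data, hypothetical names).
wide <- data.frame(
  protein_Id = c("P1", "P2"),
  sample_A = c(10.5, 8.2),
  sample_B = c(11.0, 7.9),
  stringsAsFactors = FALSE
)

# Long format: a single intensity column plus a column naming the sample.
sample_cols <- c("sample_A", "sample_B")
long <- data.frame(
  protein_Id = rep(wide$protein_Id, times = length(sample_cols)),
  sample     = rep(sample_cols, each = nrow(wide)),
  intensity  = unlist(wide[sample_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# A hypothetical annotation: which column of the long table plays which role.
annotation <- list(
  hierarchy = "protein_Id",
  sample    = "sample",
  response  = "intensity"
)

print(long)
```

Because every attribute lives in exactly one column, the same annotation keeps working when more samples are added; only the number of rows grows.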

The use of an annotated table makes it simple to integrate new data, provided it comes as a long-formatted table. Hence, for Spectronaut or Skyline text output, all that is needed is a table annotation (see code snippet). Since MSstats-formatted input is a table in long format, prolfqua works with MSstats-formatted files. For software that writes the data in a wide table format, e.g. MaxQuant, we implemented methods that first transform the data into long format.

A further design decision that differentiates prolfqua is that it embraces and supports R's linear model formula interface and the lme4 formula interface. R's formula interface for linear models is flexible, widely used and well documented. The linear model and linear mixed model interfaces allow specifying a wide range of essential models, including parallel designs, factorial designs, repeated measurements and many more. Since prolfqua uses the R modelling infrastructure directly, we can fit all these models to proteomics data. This is not easily possible with any other package dedicated to proteomics data analysis. For instance, MSstats, although using the same modelling infrastructure, supports only a small subset of possible models. Limma, on the other hand, supports R's formula interface but not for linear mixed models. Since the ROPECA package relies on limma, it is limited to the same subset of models. MSqRob is limited to random effects models, and it is unclear how to fit these models to factorial designs, or how interactions among factors can be computed and tested.
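To illustrate what the formula interface offers, here is a small sketch that uses only base R's lm on simulated data for a single protein. The variable names, effect size and noise level are assumptions chosen for the example, and the code does not call any prolfqua function.

```r
set.seed(1)

# Simulated log2 intensities for one protein in a 2 x 2 factorial design
# with four replicates per cell (all names and values are hypothetical).
d <- expand.grid(
  treatment = c("ctrl", "drug"),
  time      = c("t0", "t1"),
  replicate = 1:4
)
d$log2_intensity <- 20 + 1 * (d$treatment == "drug") + rnorm(nrow(d), sd = 0.3)

# Parallel design: one factor.
fit_parallel <- lm(log2_intensity ~ treatment, data = d)

# Factorial design: two factors and their interaction.
fit_factorial <- lm(log2_intensity ~ treatment * time, data = d)

# A repeated-measures model would use the lme4 formula syntax, e.g.
#   log2_intensity ~ treatment + (1 | subject)
# passed to lme4::lmer (not run here, to keep the sketch free of dependencies).

print(coef(fit_factorial))

# F-test comparing the two nested models.
model_comparison <- anova(fit_parallel, fit_factorial)
print(model_comparison)
```

Changing the experimental design only changes the formula; the fitting and comparison code stays the same, which is the property the paragraph above relies on.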

The use of R's formula interface does not limit prolfqua to the output provided by the R modelling infrastructure. prolfqua also implements p-value moderation, as in the limma publication, and the computation of probabilities of differential regulation, as suggested in the ROPECA publication. Moreover, the design decision to use the R formula interface allowed us to integrate Bayesian regression models provided by the R package brms. Because of that, we can benchmark all those methods (linear models, mixed effects models, p-value moderation, ROPECA and Bayesian regression models) within the same framework, which enabled us to evaluate the practical relevance of these methods.

Last but not least, prolfqua supports the LFQ data analysis workflow, e.g. computing coefficients of variation (CV) for peptides and proteins, sample size estimation, visualization and summarization of missing data and intensity distributions, multivariate analysis of the data, etc. It also implements various protein intensity summarization and inference methods, e.g. top 3 or Tukey's median polish. Finally, ANOVA analysis or model selection using the likelihood ratio test can be performed for thousands of proteins.

To use prolfqua, knowledge of the R regression model infrastructure is an advantage. Acknowledging the complexity of the formula interface, we provide an MSstats emulator, where the model specification is generated based on the annotation file structure.
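Two of the quantities mentioned above can be sketched with base R alone: a per-peptide coefficient of variation and a protein-level summary via Tukey's median polish (stats::medpolish). The peptide matrix is toy data, and this is an illustration under those assumptions, not prolfqua's implementation.

```r
# Toy log2 intensities of three peptides from one protein in three samples
# (hypothetical values).
pep <- matrix(
  c(20.1, 20.4, 19.8,
    18.0, 18.5, 17.9,
    22.3, 22.6, 22.0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("pep1", "pep2", "pep3"), c("s1", "s2", "s3"))
)

# Coefficient of variation (in percent) per peptide, on the raw scale.
raw <- 2^pep
cv_percent <- apply(raw, 1, function(x) 100 * sd(x) / mean(x))

# Tukey's median polish decomposes the matrix into
# overall + row (peptide) effect + column (sample) effect + residual.
mp <- medpolish(pep, trace.iter = FALSE)

# Protein-level log2 estimate per sample: overall effect plus sample effect.
protein_log2 <- mp$overall + mp$col

print(round(cv_percent, 1))
print(protein_log2)
```

The median polish summary is robust to a single outlying peptide, which is one reason it is a common choice for rolling peptide intensities up to proteins.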

Running R-scripts

To generate a QC report from a bat file, first add <prolfqua_path>/win to your PATH variable. You can then generate a QC report from MaxQuant QC output by running:

lfq_MQ_SampleSizeReport.bat .\data\1296877_QC.zip
Rscript ~/__checkouts/R/prolfqua/inst/run_scripts/lfq_MQ_SampleSizeReport.R --help
Rscript ~/__checkouts/R/prolfqua/inst/run_scripts/lfq_MQ_SampleSizeReport.R ~/Downloads/1330043.zip

Related resources

Relevant background information

R packages to compute contrasts from linear models

  • emmeans obtains estimated marginal means (EMMs) for many linear, generalized linear, and mixed models.
  • lmerTest computes contrasts for lme4 models.
  • multcomp computes contrasts for linear models and adjusts p-values (multiple comparisons).

Future interesting topics or packages to look at

Sample size estimation based on FDR

What package name?

What name should we use?

https://twitter.com/WitoldE/status/1338799648149041156

  • prolfqua - PROteomics Label Free QUAntification package (read prolewka)
  • LFQService - we do proteomics LFQ services at the FGCZ.
  • nalfqua - Not Another Label Free QUAntification package (read nalewka)
  • prodea - proteomics differential expression analysis ?

Contributors

wolski, cpanse
