The pquant from bigbio

Methods to implement.

The shiny application should implement three major methods:

MSstats: The user should be able to see which parameters the MSstats analysis is using. Right now all parameters are chosen automatically by the pipeline.

pquant/R/msstats_pipeline.R

Line 20 in e055d23

data.processed <- dataProcess(raw = data,

It should probably be more configurable for the user.
Proteus: The user should be able to run the dataset with Proteus. Right now it is not possible to run this analysis.
Last option: Triqler analysis. The Triqler options are based on the triqler output. See (http://ftp.pride.ebi.ac.uk/pride/data/proteomes/proteogenomics/cell-lines/PXD005942-Sample-24/proteomics_lfq/out_triqler.tsv)

These are the three fundamental methods that we will use for downstream analysis of LFQ and TMT data. The user should be able to choose any one of them from the Shiny application. Major challenge: Triqler is a Python application, which means that Shiny will not perform the analysis but will be able to visualize the results. This is the main reason why it should be left as the last option.
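One way to make the MSstats step configurable is to collect the user's choices into a validated argument list and pass it to dataProcess. The following is a minimal sketch, not the pipeline's current code: the helper names msstats_process_args and run_data_process are hypothetical, and only a handful of dataProcess arguments (logTrans, normalization, summaryMethod, MBimpute) are exposed.

```r
# Hypothetical helper: turn user choices (for example Shiny inputs) into a
# validated argument list for MSstats::dataProcess().
msstats_process_args <- function(normalization = "equalizeMedians",
                                 summary_method = "TMP",
                                 log_base = 2,
                                 mb_impute = TRUE) {
  normalization <- match.arg(normalization,
                             c("equalizeMedians", "quantile", "none"))
  summary_method <- match.arg(summary_method, c("TMP", "linear"))
  stopifnot(log_base %in% c(2, 10),
            is.logical(mb_impute), length(mb_impute) == 1)
  list(
    logTrans = log_base,
    # MSstats expects FALSE (not a string) to switch normalization off
    normalization = if (normalization == "none") FALSE else normalization,
    summaryMethod = summary_method,
    MBimpute = mb_impute
  )
}

# Hypothetical wrapper: MSstats is only needed when this function is called.
run_data_process <- function(raw, args = msstats_process_args()) {
  do.call(MSstats::dataProcess, c(list(raw = raw), args))
}
```

In the Shiny app the arguments could come from inputs, e.g. msstats_process_args(normalization = input$normalization), where the input id is an assumption. The returned list can also be shown to the user so that the parameters in use are visible.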

Heatmap horrible

The heatmap needs to be replaced by a better-looking version; the current one looks horrible. Please replace it with the pheatmap-based heatmap described at https://davetang.org/muse/2018/05/15/making-a-heatmap-in-r-with-the-pheatmap-package/
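A possible starting point for the replacement, as a sketch only: the matrix preparation below is plain base R, and the plotting function assumes the pheatmap package is installed. The function names, the choice of the 50 most variable proteins and the row z-scoring are illustrative assumptions, not part of the current application.

```r
# Keep the most variable proteins and convert each row to z-scores.
# `intensity` is assumed to be a numeric matrix: proteins x samples.
prepare_heatmap_matrix <- function(intensity, top_n = 50) {
  m <- as.matrix(intensity)
  stopifnot(is.numeric(m), ncol(m) >= 2)
  m <- m[stats::complete.cases(m), , drop = FALSE]  # drop rows with missing values
  v <- apply(m, 1, stats::var)
  m <- m[v > 0, , drop = FALSE]                     # constant rows cannot be scaled
  v <- v[v > 0]
  keep <- utils::head(order(v, decreasing = TRUE), top_n)
  t(scale(t(m[keep, , drop = FALSE])))              # row-wise z-scores
}

# Hypothetical plotting wrapper; pheatmap is only needed when it is called.
plot_heatmap <- function(z, sample_conditions = NULL) {
  annotation <- if (is.null(sample_conditions)) {
    NA
  } else {
    data.frame(Condition = sample_conditions, row.names = colnames(z))
  }
  pheatmap::pheatmap(z,
                     annotation_col = annotation,
                     show_rownames = nrow(z) <= 60)
}
```

Doing the scaling in R before plotting keeps the Python call out of the picture entirely.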

Errors in out_msstats.csv "Condition" column

The "Condition" column of out_msstats.csv has some problems in "RPXD004682.1-organism-part", "RPMID25238572.1-cell-lines/" and "RPMID25238572.2-organism-part/".

First release of pquantr

The following issue tracks all the pending tasks before the formal release of the pquantr Shiny application:

Fix the current issues #44
Decide whether Proteus will be implemented in the package as another alternative for downstream analysis #71

New data to analyse

The first dataset is UPS1 with the Comet search engine and aggregation inference:

ups1-uniprot-comet-proteomics_lfq.zip

The second is UPS1 with multiple search engines (Comet and MSGF) and Bayesian inference:

ups1-uniprot-multiplesearch-proteomics_lfq.zip

The last run uses multiple search engines with match between runs:

uniprot-canonical-multisearch-mbr-bayesian.zip

Export results to EA file format

@enriquea @Douerww :

For every pipeline:

MSstats
Proteus
Triqler

We should output a file with the following structure:

https://github.com/bigbio/pquant/blob/dev/output/output_format/E-PROT-39-DE/E-PROT-39-analytics.tsv

In that file the columns have the following meaning:

The first column can be Gene ID or Protein ID, depending on the pipeline.
The second column is Gene Name, if available.
g4_g3.p-value: the p-value for the contrast between g4 and g3.
g4_g3.log2foldchange: the log2 fold change for the contrast between g4 and g3.

g1, g2, g3 and g4 are groups of samples from the SDRF, defined here:

https://github.com/bigbio/pquant/blob/dev/output/output_format/E-PROT-39-DE/E-PROT-39-configuration.xml

In the configuration file we have the following:

<assay_groups>
            <assay_group id="g1" label="octogenarian age bracket; Alzheimer's disease">
                <assay>4.AD_B</assay>
                <assay>4.AD_A</assay>
                <assay>4.AD_C</assay>
            </assay_group>
            <assay_group id="g2" label="octogenarian age bracket; normal">
                <assay>3.AD_B</assay>
                <assay>3.AD_C</assay>
                <assay>3.AD_A</assay>
            </assay_group>
            <assay_group id="g3" label="sexagenarian age bracket; Alzheimer's disease">
                <assay>2.AD_B</assay>
                <assay>2.AD_A</assay>
                <assay>2.AD_C</assay>
            </assay_group>
            <assay_group id="g4" label="sexagenarian age bracket; normal">
                <assay>1.AD_B</assay>
                <assay>1.AD_A</assay>
                <assay>1.AD_C</assay>
            </assay_group>
        </assay_groups>
        <contrasts>
            <contrast id="g2_g1" cttv_primary="1">
                <name>'Alzheimer's disease' vs 'normal' in 'octogenarian age bracket'</name>
                <reference_assay_group>g2</reference_assay_group>
                <test_assay_group>g1</test_assay_group>
            </contrast>
            <contrast id="g4_g3" cttv_primary="1">
                <name>'Alzheimer's disease' vs 'normal' in 'sexagenarian age bracket'</name>
                <reference_assay_group>g4</reference_assay_group>
                <test_assay_group>g3</test_assay_group>
            </contrast>
        </contrasts>

The ids g1, g2, g3 and g4 are a short way to refer to a group condition without spelling out the full condition. Contrasts are comparisons between two groups, and the contrast id determines how the corresponding columns are named in https://github.com/bigbio/pquant/blob/dev/output/output_format/E-PROT-39-DE/E-PROT-39-analytics.tsv
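The column layout described above can be produced from a long results table with one row per protein and contrast. This is a sketch in base R: the function name to_ea_analytics and the input column names (id, contrast, log2fc, pvalue) are hypothetical, and the exact header text of the first two columns ("Gene ID", "Gene Name") is an assumption that should be checked against the example analytics file.

```r
# Convert long-format results (one row per id and contrast) into the wide
# layout: id, gene name, then <contrast>.p-value and <contrast>.log2foldchange.
to_ea_analytics <- function(results, gene_names = NULL) {
  ids <- unique(as.character(results$id))
  out <- data.frame(`Gene ID` = ids, check.names = FALSE,
                    stringsAsFactors = FALSE)
  # gene_names: optional named character vector, names are ids
  out[["Gene Name"]] <- if (is.null(gene_names)) {
    NA_character_
  } else {
    unname(gene_names[ids])
  }
  for (ct in unique(as.character(results$contrast))) {
    sub <- results[as.character(results$contrast) == ct, ]
    idx <- match(ids, as.character(sub$id))  # NA where the id is missing
    out[[paste0(ct, ".p-value")]] <- sub$pvalue[idx]
    out[[paste0(ct, ".log2foldchange")]] <- sub$log2fc[idx]
  }
  out
}
```

The result can be written with write.table(out, file, sep = "\t", quote = FALSE, row.names = FALSE). For MSstats, the contrast label would have to be mapped to the g2_g1 style ids from the configuration file before calling this function.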

Implement package structure

We need to implement a package structure for the project. This will organize the following:

Organize dependencies.
Organize the structure of the data, the code and the sample data.
Organize the analysis of the project into two types: TMT and label-free.
Organize the data structure for visualization purposes.

Some code that needs to be dynamic

This code:

pquant/R/msstats_pipeline.R

Line 64 in e055d23

row.names(comparison) <- c("C1-C2", "C2-C3")

needs to be dynamic: it only works with the UPS1 dataset, and we need the application to work with all datasets.
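One way to avoid the hard-coded row names is to derive the comparison matrix from the condition labels found in the data. The sketch below is base R only; the function name build_contrast_matrix is hypothetical, and it assumes that all pairwise comparisons are wanted, which is more than the two comparisons hard-coded for UPS1.

```r
# Build a contrast matrix with one row per pair of conditions.
# Columns follow the factor level order of the conditions.
build_contrast_matrix <- function(conditions) {
  lv <- levels(factor(conditions))
  stopifnot(length(lv) >= 2)
  pairs <- utils::combn(lv, 2)  # 2 x n_pairs character matrix
  m <- matrix(0, nrow = ncol(pairs), ncol = length(lv),
              dimnames = list(paste(pairs[1, ], pairs[2, ], sep = "-"), lv))
  for (i in seq_len(ncol(pairs))) {
    m[i, pairs[1, i]] <- 1    # first condition of the pair
    m[i, pairs[2, i]] <- -1   # second condition of the pair
  }
  m
}
```

For conditions C1, C2 and C3 this yields the rows "C1-C2", "C1-C3" and "C2-C3". If only a subset of comparisons is needed, the rows can be filtered by name afterwards.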

Need some kind of progressbar when processing is happening

Douer:

We need some kind of progress bar while the MSstats/Proteus processing is happening; otherwise the user doesn't know what is going on and may keep clicking around the interface without realizing that a process is running.
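A minimal sketch of how this could look with Shiny's built-in withProgress and setProgress. The step runner is plain R so it can be tested without a Shiny session; the function name run_steps, the button id run_analysis and the placeholder steps are all assumptions for illustration.

```r
# Run named steps in order, feeding each step the previous result and
# reporting progress through a callback before each step starts.
run_steps <- function(steps, value = NULL,
                      on_step = function(fraction, label) NULL) {
  n <- length(steps)
  for (i in seq_len(n)) {
    on_step((i - 1) / n, names(steps)[i])
    value <- steps[[i]](value)
  }
  on_step(1, "done")
  value
}

# Hypothetical server function; shiny is only needed when the app runs.
server <- function(input, output, session) {
  shiny::observeEvent(input$run_analysis, {
    shiny::withProgress(message = "Running analysis", value = 0, {
      run_steps(
        list(
          "Processing data"  = function(x) { Sys.sleep(1); x },  # placeholder
          "Comparing groups" = function(x) { Sys.sleep(1); x }   # placeholder
        ),
        on_step = function(fraction, label) {
          shiny::setProgress(fraction, detail = label)
        }
      )
    })
  })
}
```

The placeholder steps would be replaced by the actual MSstats or Proteus calls.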

Proteus implementation for TMT and LabelFree

Currently, the main method in the application for downstream analysis is MSstats. However, Proteus, a limma-based package for TMT and LFQ, can also be used in pquantr. Before implementing it in the package, the following tasks should be done:

Benchmark Proteus and MSstats for TMT and LFQ data. Explore the differences.
If Proteus offers better results, implement Proteus as an alternative downstream analysis package.
Release a new version of the tool with both packages.

Even if Proteus is not implemented after the benchmark, it would be great, @Douerww, to keep track of that benchmark for future decisions about the project and also for publication purposes.

Need an example when multiple variables are used in the configuration file

@ypriverol

Hi Yasset, could you find example project data files where multiple variables are used in the configuration file?

It looks like we need an example project that clearly shows us how to create the configuration file from the .sdrf file.

Something like this configuration file:

https://github.com/bigbio/pquant/blob/dev/output/output_format/E-PROT-39-DE/E-PROT-39-configuration.xml

Error loading the Proteus method

Proteus method failed with the following error:

Running triqler

I have run Triqler on the UPS1 dataset using the parameters suggested by @MatthewThe:

What happens if the minimum number of samples in which a peptide must be quantified is set to 2 (the default)? The example that @MatthewThe uses sets it to 6.
What happens if we use 0.5 as fold_change_eval? We want to load all the quantified proteins into a system and let that system move queries across values.
@MatthewThe, in order to compare between methods we have decided to use the following format (https://github.com/bigbio/pquant/blob/main/output/output_format/E-PROT-39-DE/E-PROT-39-analytics.tsv). I have seen that the output of Triqler is per comparison. I want to double-check with you, for each file, which columns correspond to g4_g3.p-value and g4_g3.log2foldchange.


Improvements and Bugs to be fixed in pquantr

@Douerww Some major improvements we should make:

Can we remove the heatmap generation using Python and use R instead? I think we should remove the Python call from the code.
I guess the plots should be placed one after another instead of side by side.
I think we should try to avoid this:

pquant/pquant/shiny-app/app.R

Line 264 in 7de5946

setwd("../data/")

Why, if the user provides a file, do we need to change to the data directory? Shouldn't it be possible to read from whatever location the user gives? One solution would be to copy the file to the same location where the application is stored.
When the user selects a protein in the protein table, the panels in the protein plots should change accordingly. I think that was the approach of the Proteus Shiny app.
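For the setwd point, one option is to read directly from the path that Shiny provides for an uploaded file, so the working directory never changes. A minimal sketch, assuming the upload comes from a fileInput (whose value is a data frame with name and datapath columns) and that the file is a CSV such as out_msstats.csv; the function name read_uploaded_table is hypothetical.

```r
# Read an uploaded table from its temporary path, without calling setwd().
# `upload` mimics the data frame returned by Shiny's fileInput().
read_uploaded_table <- function(upload) {
  stopifnot(!is.null(upload), file.exists(upload$datapath))
  utils::read.csv(upload$datapath, stringsAsFactors = FALSE)
}
```

In the server function this would be called as read_uploaded_table(input$csvFile), where the input id csvFile is an assumption.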
