syksy / curatedpcadata



Bioconductor R-package: Curated Prostate Cancer Data

License: Creative Commons Attribution 4.0 International

R 94.80% TeX 5.20%

cancer cancer-data cancer-genomics cancer-research r r-package

curatedpcadata's Introduction

curatedPCaData

Overview

curatedPCaData is a collection of publicly available and annotated data resources concerning prostate cancer.

Citation

If you use curatedPCaData, please consider adding the following citation:

@article {Laajala2023.01.17.524403,
    author = {Laajala, Teemu D and Sreekanth, Varsha and Soupir, Alex and Creed, Jordan and Halkola, Anni S and Calboli, Federico CF and Singaravelu, Kalaimathy and Orman, Michael and Colin-Leitzinger, Christelle and Gerke, Travis and Fidley, Brooke L. and Tyekucheva, Svitlana and Costello, James C},
    title = {A harmonized resource of integrated prostate cancer clinical, -omic, and signature features},
    year = {2023},
    doi = {10.1038/s41597-023-02335-4},
    URL = {https://www.nature.com/articles/s41597-023-02335-4},
    journal = {Scientific Data}
}

Installation

Bioconductor installation

In order to install the package from Bioconductor, make sure BiocManager is installed and then call the function to install curatedPCaData:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("curatedPCaData")

GitHub installation

A download link to the latest pre-built curatedPCaData tarball is available on the right-hand side of the GitHub page under Releases.

You can also install curatedPCaData from GitHub inside R with:

# install.packages("devtools")
devtools::install_github("Syksy/curatedPCaData")

To build the package tarball from a cloned git repo, run the following in terminal / command prompt while in the root of the project:

R CMD build curatedPCaData

It is then possible to install the self-built tarball:

R CMD INSTALL curatedPCaData_x.y.z.tar.gz

Note that building the package locally will require dependencies to be present for the R installation.

Usage

Vignettes

curatedPCaData ships with basic vignettes (see vignette()) that demonstrate the package’s generic use: data retrieval and basic processing in R. The overview vignette is intended to give a comprehensive first look at the package’s contents and to display its basic functionality as an ExperimentHub resource.

A sister package, curatedPCaWorkflow (https://github.com/Syksy/curatedPCaWorkflow), provides multiple specialized vignettes that delve deeper into analysis and further processing of the data. This workflow package reproduces the results presented in Laajala et al. and provides insight and examples for those looking to make further use of the multi-omics data provided in curatedPCaData.

Downloading data

The function getPCa is the primary means of extracting data from a cohort. It will automatically create a MultiAssayExperiment-object of the study:

library(curatedPCaData)

mae_tcga <- getPCa("tcga")

class(mae_tcga)
## [1] "MultiAssayExperiment"
## attr(,"package")
## [1] "MultiAssayExperiment"
names(mae_tcga)
##  [1] "cna.gistic"   "gex.rsem.log" "mut"          "cibersort"    "xcell"       
##  [6] "epic"         "quantiseq"    "mcp"          "estimate"     "scores"

Brief examples

A simple example of using the curated datasets and the ’omics data therein:

mae_taylor <- getPCa("taylor")
mae_sun <- getPCa("sun")

mae_tcga
## A MultiAssayExperiment object of 10 listed
##  experiments with user-defined names and respective classes.
##  Containing an ExperimentList class object of length 10:
##  [1] cna.gistic: matrix with 23151 rows and 492 columns
##  [2] gex.rsem.log: matrix with 19658 rows and 461 columns
##  [3] mut: RaggedExperiment with 30897 rows and 495 columns
##  [4] cibersort: matrix with 22 rows and 461 columns
##  [5] xcell: matrix with 39 rows and 461 columns
##  [6] epic: matrix with 8 rows and 461 columns
##  [7] quantiseq: matrix with 11 rows and 461 columns
##  [8] mcp: matrix with 11 rows and 461 columns
##  [9] estimate: matrix with 4 rows and 461 columns
##  [10] scores: matrix with 4 rows and 461 columns
## Functionality:
##  experiments() - obtain the ExperimentList instance
##  colData() - the primary/phenotype DataFrame
##  sampleMap() - the sample coordination DataFrame
##  `$`, `[`, `[[` - extract colData columns, subset, or experiment
##  *Format() - convert into a long or wide DataFrame
##  assays() - convert ExperimentList to a SimpleList of matrices
##  exportClass() - save data to flat files

mae_tcga[["gex.rsem.log"]][1:4, 1:4]
##          TCGA.G9.6348.01 TCGA.CH.5766.01 TCGA.EJ.A65G.01 TCGA.EJ.5527.01
## A1BG              4.3733          6.0244          7.4927          3.7801
## A1BG-AS1          4.5576          6.3326          6.7861          4.5912
## A1CF              0.4008          0.7574          0.0000          0.0000
## A2M              14.3952         12.8331         12.5017         14.2289

mae_tcga[["cna.gistic"]][1:4, 1:4]
##       TCGA.2A.A8VL.01 TCGA.2A.A8VO.01 TCGA.2A.A8VT.01 TCGA.2A.A8VV.01
## A1BG                0               0               0               0
## A1CF                0               0              -1               0
## A2M                 0               0              -1               0
## A2ML1               0               0              -1               0

colData(mae_tcga)[1:3, 1:5]
## DataFrame with 3 rows and 5 columns
##                  study_name   patient_id     sample_name        alt_sample_name
##                 <character>  <character>     <character>            <character>
## TCGA.2A.A8VL.01        TCGA TCGA.2A.A8VL TCGA.2A.A8VL.01 F9F392D3-E3C0-4CF2-A..
## TCGA.2A.A8VO.01        TCGA TCGA.2A.A8VO TCGA.2A.A8VO.01 0BD35529-3416-42DD-A..
## TCGA.2A.A8VT.01        TCGA TCGA.2A.A8VT TCGA.2A.A8VT.01 BFECF807-0658-417B-9..
##                 overall_survival_status
##                               <integer>
## TCGA.2A.A8VL.01                       0
## TCGA.2A.A8VO.01                       0
## TCGA.2A.A8VT.01                       0

mae_taylor
## A MultiAssayExperiment object of 11 listed
##  experiments with user-defined names and respective classes.
##  Containing an ExperimentList class object of length 11:
##  [1] cna.gistic: matrix with 17832 rows and 194 columns
##  [2] cna.logr: matrix with 18062 rows and 218 columns
##  [3] gex.rma: matrix with 17410 rows and 179 columns
##  [4] mut: RaggedExperiment with 90 rows and 43 columns
##  [5] cibersort: matrix with 22 rows and 179 columns
##  [6] xcell: matrix with 39 rows and 179 columns
##  [7] epic: matrix with 8 rows and 179 columns
##  [8] quantiseq: matrix with 11 rows and 179 columns
##  [9] mcp: matrix with 11 rows and 179 columns
##  [10] estimate: matrix with 4 rows and 179 columns
##  [11] scores: matrix with 4 rows and 179 columns
## Functionality:
##  experiments() - obtain the ExperimentList instance
##  colData() - the primary/phenotype DataFrame
##  sampleMap() - the sample coordination DataFrame
##  `$`, `[`, `[[` - extract colData columns, subset, or experiment
##  *Format() - convert into a long or wide DataFrame
##  assays() - convert ExperimentList to a SimpleList of matrices
##  exportClass() - save data to flat files

mae_sun
## A MultiAssayExperiment object of 8 listed
##  experiments with user-defined names and respective classes.
##  Containing an ExperimentList class object of length 8:
##  [1] gex.rma: matrix with 12784 rows and 79 columns
##  [2] cibersort: matrix with 22 rows and 79 columns
##  [3] xcell: matrix with 39 rows and 79 columns
##  [4] epic: matrix with 8 rows and 79 columns
##  [5] quantiseq: matrix with 11 rows and 79 columns
##  [6] estimate: matrix with 4 rows and 79 columns
##  [7] scores: matrix with 4 rows and 79 columns
##  [8] mcp: matrix with 11 rows and 79 columns
## Functionality:
##  experiments() - obtain the ExperimentList instance
##  colData() - the primary/phenotype DataFrame
##  sampleMap() - the sample coordination DataFrame
##  `$`, `[`, `[[` - extract colData columns, subset, or experiment
##  *Format() - convert into a long or wide DataFrame
##  assays() - convert ExperimentList to a SimpleList of matrices
##  exportClass() - save data to flat files

For further details on the provided datasets and extra parameters for handling data extraction, please consult the overview-vignette.


curatedpcadata's Issues

To do list

Clinical/pheno data for 3 example data sets (Sun, TCGA, Taylor)
Gene expression data for 3 example data sets (Sun, TCGA, Taylor)
CNA for 2 example data sets (TCGA and Taylor)
Align gene names across all data sets (for example using the same HUGO gene annotations for all datasets)
Wrappers for creating different kind of output (MAE, tibble)
Run xCell on 3 example data sets
Create function for genetic scores (Decipher, Polaris, Oncotype)
Create function for traditional risk scores (D'Amico, NICE, ...)
Fix "storing paths of more than 100 bytes is not portable" warnings with package building due to long file names/paths in Jim's legacy curation scripts

Use the generic identifiers PCA#### / PAN#### in Taylor et al.

Right now we've tried to have the mapping with GSM ids in Taylor et al.
For user convenience, we'll revert back to using the more generalized identifiers also in curatedPCaData-package.

Bioconductor review

Dear all,
we've received Bioconductor review requests at: Bioconductor/Contributions#3047 (comment)

Breaking it down (and to be updated in respect to what's been addressed):

curatedPCaData Bioconductor/Contributions#3047

Please separate the analysis functionality from the infrastructure. The
package should only deliver the data to the user and demonstrate how to
obtain the data from ExperimentHub.

OK; As discussed in the submission thread for Bioconductor at Bioconductor/Contributions#3047 (comment) , these downstream vignettes are to be moved to a new workflow package called curatedPCaWorkflow at: https://github.com/Syksy/curatedPCaWorkflow . Work on the proposed improvements will continue there. Please see answers below for specific actions taken toward this package-wide adjustment.

The package has extensive efforts to work with the metadata but these
can be avoided by strategically using metadata from the ExperimentHub
resources.

Multiple steps have been taken to address this and are further elaborated in answers below; briefly:

Functionality that previously existed inside the main overview vignette has been moved to /R/getpcasummaries.R, and these functions are generalizable to any requested variable, are not dataset-specific, and contain e.g. standard documentation and unit testing.
Functionality that was deemed to be more toward downstream analyses has been moved away from the package, in particular functions contained previously at /R/wrappers.R.
The new functions utilize strategically both the internally provided metadata.csv (for example function getPCaStudies) as well as the clinical metadata accessible via colData (for example functions getPCaSummaryTable, getPCaSummarySurv, getPCaSummarySamples , and getPCaSummaryStudies).

DESCRIPTION

Looks good.

The BiocGenerics Suggests: seems out of place.

OK; Addressed in 7ba59f5 ; in addition, the package dependencies have been made more lean as the functions at /R/wrappers.R are no longer present, removing the import need for e.g. dplyr and stringr.

The CC license is not good for software consider using a different license.

OK; CC-BY 4.0 should still be fine based on discussion at Bioconductor/Contributions#3047 (comment)

NAMESPACE

Avoid importing testthat and the expect_s4_class function. To test for
S4 class, use the name of the class, e.g., is(X, "KnownS4Class")

OK; This has now been adjusted to using just native R tests in: 4c1a481 and the tests themselves were subsequently extended to cover the new functions at 3235223 and ed9d0b8 .
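As a minimal sketch of the kind of native test the reviewer suggests, the snippet below checks an S4 class with is() and stopifnot() instead of testthat. The class ToyExperiment is hypothetical and only stands in for a real class such as MultiAssayExperiment; it is not part of the package.

```r
library(methods)

# Hypothetical S4 class used purely for illustration
setClass("ToyExperiment", representation(values = "numeric"))

obj <- new("ToyExperiment", values = c(1, 2, 3))

# Native class check; no testthat import is needed
stopifnot(is(obj, "ToyExperiment"))

is_toy <- is(obj, "ToyExperiment")
not_toy <- is(1:3, "ToyExperiment")
```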

vignettes

Consider restricting the text to 80 column width. It makes it easier to
review and maintain.

OK; this formatting issue has been addressed in bb76b01 . Furthermore, the rest of the code has also been double-checked for formatting (multiples of 4 spacebar indents, no tabs, char width 80), of which vast majority are fixed in e.g. 019a304 . The 12 lines (~0% of all code) longer than 80 chars and 11 lines (~0% of all code) that violate 4*spacebar multiples are special cases such as long URLs or automatically generated package Rd TeX-like code, respectively.

The demos in the vignettes should show the user how to obtain the data,
it does not necessarily need to show the user how to analyze it. The latter
would be more appropriate in a separate workflow package.

OK; Notable effort has been put into revising the overview vignette so that it exemplifies this effectively with generalizable code, while the rest of the vignettes, which had an analysis focus, have now been shifted to the separate curatedPCaWorkflow-package. See comments below.

Remove the analysis vignette as it is out of the scope of the package.

OK; Done in ee1c94b by moving to use curatedPCaWorkflow instead.

Consider using MatchedAssayExperiment instead of intersecting colnames
and "sample_names" in the colData (Decipher vs BCR section).

OK; This pointer has been addressed in the PR by @ACSoupir at #45 . However, as the analyses vignette has now been incorporated into the workflow package instead, this fix was not merged into curatedPCaData. This fix will instead be incorporated into the curatedPCaWorkflow.

Consider using $ to access colData columns, e.g., mae_ren$sample_type

OK; After the other fixes both into the overview vignette as well as the generalized functions provided now in /R/getpcasummaries.R, all calls of type mae_obj[,"variable"] are now instead formatted as mae_obj$variable.

Consider using TCGAutils::TCGAprimaryTumors for subsetting for primary
tumor samples with TCGA data.

OK, at least partially; this revision request was not implemented as suggested here per se. I explored both the TCGAutils (and the curatedTCGAdata) for this purpose. However, I ended up with the conclusion that this would've been best implemented as part of the data curation step in curatedPCaData, rather than implemented here when we're just offering the "window" to the data. In all of our datasets, the sample_type field in colData was extracted from the original source and represents the sample types (primary, normal, metastatic, ...). In this respect, I feel it would be a bit much to bring in a dependency just for TCGA, when this information is already available in the colData. However, to address the original purpose of subsetting to certain sample types, I have now modified the main getPCa function to allow subsetting the creation of the MAE object via MultiAssayExperiment::subsetByColData which will utilize the sample_type available in our data : ba50efa . I hope this adjustment fulfills the requested improvement to the package - if not, please let me know and I will revise accordingly to the best of my ability.

The Oncoprint example looks complicated for a new user. If you are working
with TCGA data consider using TCGAutils::oncoPrintTCGA, otherwise consider
adding a helper function.

OK; The oncoprint related functions and wrappers have been now moved to the new curatedPCaWorkflow package (e.g. commit 62be8a4 ). In hindsight, they are somewhat downstream analyses already, and therefore are better suited for this workflow-package. Thus, the oncoprint and their related functions are no longer part of the curatedPCaData-package.

Reduce the amount of code in the overview.Rmd vignette. Pre-processing
with eval(parse(...)) is usually not the way to go. It seems that more
needs to be done at the dataset and package level to avoid the juggling
of metadata in the vignette. Add helper functions in the package to extract
and create a single row for each of your summaries of interests, for example,
gleasons(mae_object) would return one row as given by the pre-processing
step in the vignette.

OK; This is a fair point, and indeed should have been the design from the start. Notable effort has now gone into moving the functionality previously coded inside the overview vignette into functions exported by the package itself, accompanied by suitable examples, testing, and documentation. The functions and key commits are enumerated below; the majority now reside in R/getpcasummaries.R:

getPCaSummaryTable : 6d1aaaa
getPCaSummarySurv : 6d1aaaa
getPCaSummarySamples : e412937
getPCaSummaryStudies : b6bffd1 and f8ef61c
getPCaStudies : 71c73c6

With the above functions now exported by the package and used by the vignette, the revisions address the above-mentioned issues and clean up the vignette (Rmd commits mainly in 2f1b02a , 90a37f3 , 7abab8c ).

Consider using *.bib reference file rather than hard-coding references in
the vignette.

OK; The citations have now been revised as requested in 1940b50 and dff5dee .

Use single backticks to enclose code nouns, e.g. colData and
MultiAssayExperiment

OK; These mentioned triple backticks have been corrected to single backticks in 0b46706 .

R

Use a more descriptive argument than slots in getPCa

OK; The name of the argument has now been changed to assays in order to be more in line with the naming conventions in MultiAssayExperiment via commit 633d17e .

Consider refactoring the logic in wrapperSortonco. It may be safer and
more robust to apply a function recursively instead.

OK; This function has now been moved to curatedPCaWorkflow instead, as it relates to oncoprints ( 62be8a4 ).

tests

Consider increasing the coverage:

covr::package_coverage()
curatedPCaData Coverage: 24.14%
R/wrappers.R: 0.00%
R/getpca.R: 76.92%

After adjustments such as moving the wrappers focused on e.g. oncoprints out of the package ( 62be8a4 ), adding new functions for replacing the ones previously defined inside vignette (e.g. 6d1aaaa , e412937 , b6bffd1 , 71c73c6), and extending the testing coverage ( 3235223 ), the output for this coverage testing is as follows:

covr::package_coverage()
curatedPCaData Coverage: 89.60%
R/getpca.R: 79.61%
R/getpcasummaries.R: 94.87%

Thus, key functionality is tested, with the parts not covered related to marginal cases like download timeout.

script data mismatch

Curation script for GSE25136 does not match clinical data for GSE25136. Most likely an error in the naming process.

Gene ID aliases alter between datasets, especially older annotation ones

For example, the gene called LASP1 in the rest of the datasets is called 'Lasp-1' in Sun et al. and Wallace et al.
For example, NFIB is called NFIB in the rest of the datasets but 'CTF' in Sun et al. and Wallace et al.

For example, when Alex runs ESTIMATE for Wallace et al. it finds roughly 1000 gene IDs, and for Weiner et al. roughly 10k. The data matrix row counts are 12k in Wallace et al. and 17k in Weiner et al., so this is a gene ID mapping issue, not a genome/transcriptome coverage issue.

Need to update everything to the latest symbols in best possible way.
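One way this could look, as a rough sketch only: map legacy aliases to current symbols through a lookup table and leave everything else untouched. The table alias_to_symbol and the helper harmonize_symbols are hypothetical; a real table would have to come from an up-to-date annotation source, and ambiguous aliases would need extra care.

```r
# Hypothetical alias table: names are legacy aliases, values are current symbols
alias_to_symbol <- c("Lasp-1" = "LASP1", "CTF" = "NFIB")

# Replace known aliases, keep all other identifiers as they are
harmonize_symbols <- function(ids, alias_map) {
    hit <- ids %in% names(alias_map)
    ids[hit] <- unname(alias_map[ids[hit]])
    ids
}

# Toy expression matrix with legacy row names
gex <- matrix(1:6, nrow = 3,
    dimnames = list(c("Lasp-1", "CTF", "A2M"), c("S1", "S2")))
rownames(gex) <- harmonize_symbols(rownames(gex), alias_to_symbol)
```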

Splitting apart OSF portion of TCGA

The OSF portion of TCGA data currently breaks the consensus of how each MAE dataset is structured. We should consider whether the OSF data should be kept as a separate MAE entity, which would harmonize the presentation of each dataset. Further, the added value of OSF over the conventional TCGA dataset should be clearly reported, to avoid e.g. duplication of samples if a user uses both datasets.

Benchmarking description

While reading through, I noticed that https://github.com/Syksy/curatedPCaData/blob/master/data-raw/benchmarking.R says that OSF is still included, though I think it was removed? I don't know if I can edit the text, but it is a minor change.

Row names

Weiner et al

epic missing row names (are numbers from somewhere?)
- other studies seem fine

Wang et al identifiers mismatch in derived variables

For Wang et al., in creating the new MAE objects with derived variables our scripts fail due to lack of GSM-code overlap. Double-checking of ID correctness ought to be done.

Package fails to build

I merged my branch with the master and tried checking whether the package builds fine, and it does not. This is the error. Should look into the description file.

CNA and fusion sample size different in TCGA

The CNA data (n=481) and fusion status data (n=495) have different sample sizes. This makes certain operations hard to perform, such as subsetting CNA profiles by fusion status:

TCGA <- curatedPCaData::mae_tcga
TCGA.CNA <- as.data.frame(TCGA@ExperimentList@listData$cna.gistic)
TCGA.CNA.fusion.pos <- TCGA.CNA[,(TCGA@colData@listData$ERG_fusion_GEX == 1)]
TCGA.CNA.fusion.neg <- TCGA.CNA[,(TCGA@colData@listData$ERG_fusion_GEX == 0)]

Tested v.0.7.6
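A possible workaround, sketched here on toy data rather than on the real MAE object, is to restrict both the CNA matrix and the clinical table to their shared samples before subsetting. The objects cna and clin are stand-ins; only the column name ERG_fusion_GEX is taken from the report above.

```r
# Toy stand-ins: the CNA matrix has fewer samples than the clinical table
cna <- matrix(c(0, -1, 1, 0, 0, 2), nrow = 2,
    dimnames = list(c("A1BG", "A2M"), c("S1", "S2", "S3")))
clin <- data.frame(ERG_fusion_GEX = c(1, 0, 1, 0),
    row.names = c("S1", "S2", "S3", "S4"))

# Restrict both objects to the shared samples, in the same order
shared <- intersect(colnames(cna), rownames(clin))
cna <- cna[, shared, drop = FALSE]
clin <- clin[shared, , drop = FALSE]

# The logical index now has the same length as the number of CNA columns
cna_fusion_pos <- cna[, clin$ERG_fusion_GEX == 1, drop = FALSE]
cna_fusion_neg <- cna[, clin$ERG_fusion_GEX == 0, drop = FALSE]
```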

CNA calls

We should have the data in a more ready-to-use format. GISTIC calls (-2,-1,0,1,2) are much more usable than log ratios. Can we have a conversion or a way to turn the "raw" log ratios into discrete calls? I know we talked about keeping the data as raw and reproducible as possible, but these data are just not as usable.
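To make the request concrete, here is a rough sketch of thresholding log ratios into five discrete levels. This is plain cut-off binning, not the GISTIC algorithm, and the break points are arbitrary illustrative values that would need to be chosen per platform; discretize_cna is a hypothetical helper.

```r
# Bin log ratios into -2..2 using fixed, purely illustrative cut-offs
discretize_cna <- function(logr, breaks = c(-Inf, -1, -0.3, 0.3, 1, Inf)) {
    # cut() returns bin codes 1..5; shift them to -2..2
    calls <- cut(as.vector(logr), breaks = breaks, labels = FALSE) - 3L
    if (is.matrix(logr)) {
        calls <- matrix(calls, nrow = nrow(logr), dimnames = dimnames(logr))
    }
    calls
}

# Toy log-ratio matrix (genes x samples)
logr <- matrix(c(-1.4, -0.5, 0.05, 0.6, 1.8, 0), nrow = 2,
    dimnames = list(c("A1BG", "A2M"), c("S1", "S2", "S3")))
calls <- discretize_cna(logr)
```

Keeping the raw log ratios alongside the derived calls would preserve reproducibility while still offering the more usable format.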

Purity Estimates

Here is a reference to an older paper that did some benchmarking with TCGA for ABSOLUTE, ESTIMATE, and LUMP: https://www.nature.com/articles/ncomms9971

We can use this to help define methods and performance

colData does not work for MAE Barwick

Using colData to access the clinical data results in an error for the Barwick MAE; this needs to be fixed.

Fusion status

ERG fusion status missing from primary datasets: TCGA, Barbieri, and Baca. (Package version = 0.7.3)

I manually added the fusion status for TCGA (Excel document attached). These annotations were found in the supplementary section of the publication page (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4695400/), but seemed to be missing from the cBio link.

Also - NA values are present in the Taylor fusion status data...I'm guessing these samples lacked fusion status data in the original publication.

TCGA sample annotations.xlsx

ENSG ids in Friedrich

Part of the gex row names in Friedrich are not gene names but ENSG####### identifiers:

> head(grep("ENSG00", rownames(mae_friedrich[["gex"]]), value=TRUE))
[1] "ENSG00000083622" "ENSG00000115934" "ENSG00000121388" "ENSG00000124593" "ENSG00000124835" "ENSG00000132832"
> length(grep("ENSG00", rownames(mae_friedrich[["gex"]]), value=TRUE))
[1] 7241

These need to be homogenized to HUGO symbols throughout.

Abida et al focus on polyA / TCGA focus on TPM normalized GEX (OSF)

Leave out the capture and focus just on the polyA GEX-matrices in Abida et al.
In TCGA use the TPM normalized extractable from the OSF data rather than cBioPortal's median normalized data

No need to include both

Creation of sufficiently elaborate metadata for each MAE

We ought to add metadata fields for each MAE object, which can be easily accessed in the package and give the user a brief summary of each study and its main aims; sort of an abstract or a synopsis in a structured format.

metadata-slot as in: https://bioconductor.org/packages/devel/bioc/vignettes/MultiAssayExperiment/inst/doc/QuickStartMultiAssay.html#metadata

Studies to be filled:

Open issues

Which fields to have inside the MAE as metadata
Which fields to have inside the Rd (roxygenized) documentation instead

Risk score benchmarking

Alex and Svitlana to work on risk score benchmarking and comparison

Double-checking ranks of newly normalized data

Datasets that have been processed from raw data should be double-checked by cross-referencing the values provided e.g. by cBioPortal; for example, compare RMA-normalized Taylor et al. samples with the same samples from the MSKCC dataset in cBio, or inside GEO the matrices produced from RMA normalization against the pre-normalized data. The values should correlate; otherwise there has been some systematic error in annotating sample names or in methodology.
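A check along these lines could be sketched as a per-sample Spearman correlation over shared genes. The matrices own and ref below are simulated stand-ins for a newly normalized matrix and a reference matrix, and the 0.9 cut-off is an arbitrary illustrative threshold.

```r
set.seed(1)

# Toy stand-ins: the same 50 genes x 4 samples from two processing pipelines
own <- matrix(rnorm(200, mean = 8), nrow = 50,
    dimnames = list(paste0("gene", 1:50), paste0("S", 1:4)))
ref <- own + matrix(rnorm(200, sd = 0.1), nrow = 50)

# Compare only genes and samples present in both matrices
shared_genes <- intersect(rownames(own), rownames(ref))
shared_samples <- intersect(colnames(own), colnames(ref))

# Per-sample rank correlation between the two versions
rho <- vapply(shared_samples, function(s) {
    cor(own[shared_genes, s], ref[shared_genes, s], method = "spearman")
}, numeric(1))

# Samples with low rank agreement hint at mislabeling or a pipeline problem
suspicious <- names(rho)[rho < 0.9]
```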

missing template information

In template_prad.csv:

row 26 should have 6 columns, instead of 5
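Malformed rows like this can be located with count.fields from the utils package. The sketch below runs on a small temporary CSV rather than on template_prad.csv itself, and it assumes the header row carries the correct number of fields.

```r
# Write a toy CSV in which one row is short by a field
tmp <- tempfile(fileext = ".csv")
writeLines(c("a,b,c", "1,2,3", "4,5", "6,7,8"), tmp)

# Number of fields on each line of the file
n_fields <- count.fields(tmp, sep = ",", quote = "\"")

# Assume the header row is correct and flag lines that deviate from it
expected <- n_fields[1]
bad_rows <- which(n_fields != expected)
```

Here bad_rows holds line numbers in the file, counting the header as line 1.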

duplicate gene id in TCGA GEX

I found a HGNC gene symbol overlap in the TCGA GEX data frame.

tcga.gex <- curatedPCaData::mae_tcga[["gex"]]
table(duplicated(row.names(tcga.gex)))

FALSE  TRUE
19956     2

row.names(tcga.gex)[duplicated(row.names(tcga.gex))]
[1] "RCAN1" "RCAN2"
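How duplicates should be resolved is an open design choice. As one possible policy (an assumption, not what the curation scripts do), the sketch below keeps, for each duplicated symbol, the row with the highest mean expression, using a toy matrix.

```r
# Toy matrix with a duplicated gene symbol
gex <- matrix(c(1, 5, 2, 3, 7, 4), nrow = 3,
    dimnames = list(c("RCAN1", "RCAN1", "A2M"), c("S1", "S2")))

# For each symbol, keep the row index with the highest mean expression
row_means <- rowMeans(gex)
keep <- unlist(lapply(split(seq_len(nrow(gex)), rownames(gex)), function(idx) {
    idx[which.max(row_means[idx])]
}))

# Preserve the original row order among the retained rows
gex_unique <- gex[sort(keep), , drop = FALSE]
```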

Include normal samples in Taylor et al.

Right now our data matrix does not include the normal samples present in
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE21032

(prefix 'Prostate normal PAN#### exon')

They are needed, for example, by DeMixT and ISOpure, so they should be included. Right now they are not included by the processing script.

R Studio and/or OS specificity issues with line endings

Current R CMD build fails due to:

checking for file 'curatedPCaData/DESCRIPTION' ... OK
preparing 'curatedPCaData':
checking DESCRIPTION meta-information ... OK
installing the package to build vignettes
-----------------------------------
installing source package 'curatedPCaData' ...
** using staged installation
** R
Error in parse(outFile) :
C:/Users/teemu/AppData/Local/Temp/Rtmp8Yueja/Rbuild79d8f7627c8/curatedPCaData/R/pipelines.R:18:1: unexpected symbol
17: #' @export
18: Generate_GEX_Sun
^
ERROR: unable to collate and parse R files for package 'curatedPCaData'

Looks like there are invisible line-ending character issues when collaborating between people with and without R Studio, as well as between Windows and Mac OS line endings, based on this:

https://stackoverflow.com/questions/56471638/error-unable-to-collate-and-parse-r-files-for-package-upload-to-cran

The code syntax is seemingly all right, and I tried to save as UTF-8 manually. Need to double-check where the issues may rise (probably incompatible text line endings between Daniel's Windows Textpad and Jordan's Mac text editor).
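To narrow down which files are affected, one could scan the R sources for carriage-return bytes. The sketch below is an assumption about how such a check might look; it uses temporary toy files in place of the package's R/ directory, and has_crlf is a hypothetical helper.

```r
# TRUE if the file contains any carriage-return byte (0x0d)
has_crlf <- function(path) {
    bytes <- readBin(path, what = "raw", n = file.info(path)$size)
    any(bytes == as.raw(0x0d))
}

# Toy files standing in for the package's R/ directory
dir_r <- tempfile("R_")
dir.create(dir_r)
writeBin(charToRaw("f <- function() 1\r\n"), file.path(dir_r, "windows.R"))
writeBin(charToRaw("g <- function() 2\n"), file.path(dir_r, "unix.R"))

# List the R files that carry Windows-style or old Mac-style line endings
r_files <- list.files(dir_r, pattern = "\\.[Rr]$", full.names = TRUE)
crlf_files <- basename(r_files)[vapply(r_files, has_crlf, logical(1))]
```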

List of methods/packages for downstream processing and appending to corresponding fields in pData

Methods with R-packages in GitHub/CRAN/BioConductor

xCell - cell types enrichment analysis
xCell is a webtool that performs cell type enrichment analysis from gene expression data for 64 immune and stroma cell types.
Data type required: RNA expression
Output: Tumor purity / cell composition
Notes: Perhaps better used via wrapper 'immunedeconv'

TCGA
Taylor et al.
Sun et al.

DeMix
Deconvolution models for mixed transcriptomes from heterogeneous tumor samples with two or three components using expression data from RNAseq or microarray platforms.
Data type required: RNA expression
Output: Tumor purity / cell composition
Notes: -

TCGA
Taylor et al.
Sun et al.

MCPcounter
Estimating the population abundance of tissue-infiltrating immune and stromal cell populations using gene expression
Data type required: RNA expression
Output: Cell populations
Notes: Perhaps better used via wrapper 'immunedeconv'

EPIC
Package implementing EPIC method to estimate the proportion of immune, stromal, endothelial and cancer or other cells from bulk gene expression data.
Data type required: Bulk RNA expression
Output:
Notes: Perhaps better used via wrapper 'immunedeconv'

deconstructSigs (?)
deconstructSigs aims to determine the contribution of known mutational processes to a tumor sample.
Data type required: SNP or similar mutation data
Output: Known mutational processes
Notes: COSMIC v3.1 mutational signature database has to be incorporated manually to deconstructSigs.

TCGA

ABSOLUTE (?)
ABSOLUTE can estimate purity/ploidy, and from that compute absolute copy-number and mutation multiplicities.
Data type required: HAPSEG file or a segmentation file
Output: Tumor purity
Notes:

TCGA

AR activity score methods
Notes: multiple possibilities, e.g. Hieronymus et al.

immunedeconv
an R package for unified access to computational methods for estimating immune cell fractions from bulk RNA sequencing data.
Data type required: Method-dependent GEX
Output: Immune decomposition
Notes: Wrapper package for methods quantiseq, timer, cibersort, cibersort_abs, mcp_counter, xcell, epic

TCGA
Taylor et al.
Sun et al.

Methodology outside R

pVACtools / NetMHCPan (?)
Neoantigen load
Data type required: MAF
Output: Neoantigen load
Notes: Written in Python. Set up in Jim's cluster.

TCGA

CIBERSORT
Estimation of the abundances of member cell types in a mixed cell population, using gene expression data.
Data type required: RNA expression
Output:
Notes: Behind registration wall

quanTIseq
Quantifying tumor-infiltrating immune cells from RNA sequencing data
Data type required: FASTQ of RNA-seq (?)
Output: Immune cell decomposition
Notes: Docker-image

Cibersort results for Weiner et al. missing

The immune deconvolution results for Weiner et al. are still missing, presumably due to memory issues in running over all the N=838 samples.

Xenabrowser mapping not up to date

We've been using the xenabrowser's old mapping file; we should be using the latest biomaRt annotated hugo symbols, taking into account the potential aliases while maintaining roughly the correct gene coordinates (for example, a gene symbol might be a legacy name for two different genes in two different chromosomes - the gene shouldn't jump between chromosomes, and roughly stay the same coordinates even after hg19 -> hg38 and other adjustments).

cBioPortal hg19 to hg38 liftOver

We should always be lifting over information from the old human genome build (hg19) to hg38; i.e. coordinates, gene symbols, etc. Make sure to check that the matrices use the proper aliases from the latest human genome build.

Sample IDs in Clinical data - Taylor et al.

The clinical data file has sample IDs which aren't uniform with the naming convention for Taylor et al. The clinical data script needs to be re-run.

Weiner et al. & newly generated GEX for CIBERSORTx

Some of the gene expression matrices have been updated, and since CIBERSORTx is run externally these ought to be re-run; further, data for Weiner et al. is missing for this immune deconvolution method.

GEX and CNA objects

Currently we report only CNA and GEX in the MAE object, but we should have more specific information on what those CNA and GEX tables actually are. That is, CNA can be log ratio or gistic calls. GEX can be raw, normalized, z-score transformed. We should add that into the name. Instead of "gex" we should have "gex_zscore" or "gex_affy" or something similar.

Taylor gex data

When I try to read in the cel files cels <- oligo::read.celfiles(affy::list.celfiles()) in the Taylor data I get the following error:

Loading required package: pd.huex.1.0.st.v2
Loading required package: Biostrings
Loading required package: S4Vectors
Failed with error:  ‘Package ‘S4Vectors’ version 0.26.0 cannot be unloaded:
 Error in unloadNamespace(package) : namespace ‘S4Vectors’ is imported by ‘GenomicRanges’, ‘AnnotationDbi’, ‘XVector’, ‘IRanges’, ‘SummarizedExperiment’, ‘DelayedArray’, ‘curatedPCaData’, ‘oligoClasses’, ‘GenomeInfoDb’, ‘Biostrings’ so cannot be unloaded
’
Attempting to obtain 'pd.huex.1.0.st.v2' from BioConductor website.
Checking to see if your internet connection works...
installing the source package ‘pd.huex.1.0.st.v2’

trying URL 'https://bioconductor.org/packages/3.11/data/annotation/src/contrib/pd.huex.1.0.st.v2_3.14.1.tar.gz'
Content type 'application/x-gzip' length 314073786 bytes (299.5 MB)
==================================================
downloaded 299.5 MB

* installing *source* package ‘pd.huex.1.0.st.v2’ ...
** using staged installation
** R
** data
** inst
** byte-compile and prepare package for lazy loading
No methods found in package ‘RSQLite’ for request: ‘dbListFields’ when loading ‘oligo’
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded from temporary location
No methods found in package ‘RSQLite’ for request: ‘dbListFields’ when loading ‘oligo’
** testing if installed package can be loaded from final location
No methods found in package ‘RSQLite’ for request: ‘dbListFields’ when loading ‘oligo’
** testing if installed package keeps a record of temporary installation path
* DONE (pd.huex.1.0.st.v2)

The downloaded source packages are in
	‘/private/var/folders/n9/cgxppk7n7x37jnqcqmbjtmtm001c9k/T/RtmpMVXIXx/downloaded_packages’
Loading required package: pd.huex.1.0.st.v2
Loading required package: Biostrings
Loading required package: S4Vectors
Failed with error:  ‘Package ‘S4Vectors’ version 0.26.0 cannot be unloaded:
 Error in unloadNamespace(package) : namespace ‘S4Vectors’ is imported by ‘GenomicRanges’, ‘AnnotationDbi’, ‘XVector’, ‘IRanges’, ‘SummarizedExperiment’, ‘DelayedArray’, ‘curatedPCaData’, ‘oligoClasses’, ‘GenomeInfoDb’, ‘Biostrings’ so cannot be unloaded
’
There was a problem during download or installation.
Package 'pd.huex.1.0.st.v2' cannot be loaded. Please, try again.
Error in oligo::read.celfiles(affy::list.celfiles()) : 
  The annotation package, pd.huex.1.0.st.v2, could not be loaded.

@Syksy have you run into this issue running in a fresh session? I don't know if you have something loaded in the background that made this line work.

cels <- oligo::read.celfiles(affy::list.celfiles(), pkgname="pd.huex.1.0.st.v2") does not fix the error either.

meeting notes

07-09-2020

for taylor et al. use cbioportal normalization and our own pipeline
create/update vignettes with information on the source of data for each study
create formatting for other clinical variables (ex: var_name = var_value | var_name2 = var_value2 | var_name3 = var_value3)
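A minimal sketch of the proposed formatting, assuming variable names and values contain neither " | " nor " = ". The helpers collapse_vars and parse_vars are hypothetical and only illustrate a round trip between a named vector and the single-string format.

```r
# Collapse named extra clinical variables into one string
collapse_vars <- function(vars) {
    paste(paste(names(vars), vars, sep = " = "), collapse = " | ")
}

# Parse the string back into a named character vector
parse_vars <- function(x) {
    parts <- strsplit(strsplit(x, " | ", fixed = TRUE)[[1]], " = ", fixed = TRUE)
    values <- vapply(parts, `[`, character(1), 2)
    names(values) <- vapply(parts, `[`, character(1), 1)
    values
}

other <- collapse_vars(c(smoker = "yes", site = "A"))
parsed <- parse_vars(other)
```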