Code Monkey home page Code Monkey logo

ectmb's Introduction

ecTMB: estimation and classification of TMB

ecTMB is a powerful and flexible statistical framework for TMB estimation and classification. It uses an explicit background mutation mdoel for more robust and consistent TMB prediction. The backgournd mutation model takes account of unknown as well as known mutational heterogeneous factors, including tri-nucleotide context, sample mutational burden, gene expression level and replication timing by utilization of a Bayesian framework. The discovery of three TMB-based subtypes, including one novel subtype TMB-extreme, enable ecTMB to classify samples to biological and clinically relavent TMB subtypes.

Table of Contents

Dependency
Installation
Example Usage
License

Dependency

ecTMB has been sucessfully tested on Intel(R) Xeon(R) CPU E5-2680 v4 Machine with 28 cores.

ecTMB import following R packages: ggplot2, limma, reshape2, dplyr, R6, MASS, GenomicRanges, data.table, parallel, mixtools

ecTMB also depends on bedtools 2.27.1 and R = 3.5.1

You can install these packages using anaconda/miniconda :

conda install bedtools=2.27.1 r=3.5.1

Then you can export the conda paths as:

export PATH="/PATH/TO/CONDA/bin:$PATH"
export LD_LIBRARY_PATH="/PATH/TO/CONDA/lib:$LD_LIBRARY_PATH"

Installation

install.packages("devtools")
library(devtools);
devtools::install_github("bioinform/ecTMB");

Download Example and Reference Data

#Example file download from URL: https://www.dropbox.com/s/knpgl73samhdtvg/ecTMB_data.tar.gz?dl=1
URL = "https://github.com/bioinform/ecTMB/releases/download/v0.1.0/ecTMB_data.tar.gz"
download.file(URL,destfile = "ecTMB.example.tar.gz")
untar("./ecTMB.example.tar.gz")

URL_ref = "https://api.gdc.cancer.gov/data/254f697d-310d-4d7d-a27b-27fbf767a834"
download.file(URL_ref,destfile = "GRCh38.d1.vd1.fa.tar.gz")
untar("./GRCh38.d1.vd1.fa.tar.gz")

Example Usage

  • Load ecTMB package and genome annotation reference files
library(ecTMB)
load("./ecTMB_data/example/UCEC.rda")
extdataDir             = "./ecTMB_data/references"
exomef                 = file.path(extdataDir, "exome_hg38_vep.Rdata" )  #### hg38 exome file
covarf                 = file.path(extdataDir,"gene.covar.txt")   ### gene properties
mutContextf            = file.path(extdataDir,"mutation_context_96.txt" )  ### 96 mutation contexts
TST170_panel           = file.path(extdataDir,"TST170_DNA_targets_hg38.bed" )  ### 96 mutation contexts
ref                    = file.path("./","GRCh38.d1.vd1.fa" )

  • Set random 70% as training and rest as test set
set.seed(1002200)
SampleID_all   = UCEC_cli$sample
SampleID_train = sample(SampleID_all, size = round(2 * length(SampleID_all)/3), replace = F)
SampleID_test  = SampleID_all[!SampleID_all %in% SampleID_train]
  • Generate train and test data object
## mutations which are inconsistent with reference annotation files will be removed.
## train data
trainData      = UCEC_mafs[UCEC_mafs$Tumor_Sample_Barcode %in% as.character(SampleID_train),]
trainset       = readData(trainData, exomef, covarf, mutContextf, ref)

## test data for panel TST 170
sample         = data.frame(SampleID = SampleID_test, BED = TST170_panel, stringsAsFactors = FALSE)
testData       = UCEC_mafs[UCEC_mafs$Tumor_Sample_Barcode %in% as.character(SampleID_test),]
testset_panel  = readData(testData, exomef, covarf, mutContextf, ref, samplef = sample)
testset_WES    = readData(testData, exomef, covarf, mutContextf, ref)  ## to calculate WES-TMB for test samples
  • Background mutation model training

NOTE

This step takes up to ~12 mins when 24 parallel processes are used. You can skip and use the pre-loaded parameters defined from training data set.


MRtriProb_train= getBgMRtri(trainset)
trainedModel   = fit_model(trainset, MRtriProb_train, cores = 24)
  • Predict TMB for TST170 panel
## process time less than 1s. 
TMBs          =  pred_TMB(testset_panel, WES = testset_WES, cores = 1,
                        params = trainedModel, mut.nonsil = T, gid_nonsil_p = trainset$get_nonsil_passengers(0.95))
                        
## plot the prediction.    
library(dplyr)
library(ggplot2)

TMBs %>% melt(id.vars = c("sample","WES_TMB")) %>% 
  ggplot( aes(x = WES_TMB, y = value,  color = factor(variable, levels = c("ecTMB_panel_TMB",  "count_panel_TMB")), 
              group = factor(variable))) + 
  geom_point() +
  geom_abline(slope = 1, intercept = 0) + 
  scale_x_continuous(trans='log2') +
  scale_y_continuous(trans='log2') +
  theme_bw() +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
  theme(legend.title=element_blank()) +
  labs(x = "TMB defined by WES", y = sprintf("Predicted TMB from panel: TST170"))
  • Classify sample to 3 subtypes
Subtypes      = assignClass(TMBs$ecTMB_panel_TMB, prior = GMM_params)

License

ecTMB is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

ectmb's People

Contributors

lijingya avatar marghoob avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

ectmb's Issues

Bug in fit_model

Hello,

I found a bug in the fit_model function, which only reveals itself if one tries to use some other name for the trainset variable.

Here:

ecTMB/R/stats.R

Lines 412 to 417 in abf9b37

# count number of mutation per patient for offset calculation
if(bs.type == "all"){
mutPerP = CalTMB(trainset, sampleN = as.character(trainset$samples$SampleID))
}else if(bs.type == "nonsil"){
mutPerP = CalTMB(trainset, sampleN = as.character(trainset$samples$SampleID), type = "nonsil")
}

Instead of trainset in the CalTMB function calls (both the first argument and the sampleN), the argument should be Data, so that it refers to the argument specified in the fit_model function call.

It will be great if you fix it in the next release!

Best wishes,
Aleksandra

Bug in procMaf.R

Hi!

I found a bug in procMaf.R line 253 (maf_dnp_converter function):

a= matrix(0, nrow= nrow(mutab.onp)*umaxchar[k],ncol=ncol(mutab.onp) )

With my dataset, I got an error while performing readData, which traced back to this bug.

> trainset       = readData(trainData, exomef, covarf, mutContextf, ref)
Error in matrix(0, nrow = nrow(mutab.onp) * umaxchar[k], ncol = ncol(mutab.onp)) : 
  invalid 'nrow' value (too large or NA)

It appears that within the loop, the mutab.onp variable sometimes only contains one row, which means that it is treated as a vector and hence nrow(mutab.onp) returns NULL. It looks like the previous line, 252, was supposed to correct that, but it is formulated very strangely - length(mutab.onp) will depend on the number of accessory columns and will not necessarily be 8 when there is one row only. A fix along the lines of

if ( is.null(nrow(mutab.onp) ){mutab.onp = t(as.matrix(mutab.onp))}

is suggested.

Best regards,
Aleksandra

Apply ecTMB in our own data

Dear sir,
I cannot find a way to put my own data in the TMB model.
Is the input a maf file from Mutect2?
Do you have any suggestion?

ecTMB much higher in WES compared to Panel

Dear developer, thank you for your great effort. I'm facing a problem that I'm not able to find a solution. How is it possible that when I apply your algorithm to my own data I found a TMB's value higher (for example 200 vs 10) in total WES than in panel? (COnsidering that the experiment is made with Agilent Panel, so the entire WES is not full sequenced?)

Error when following Example Usage

Hi,

I was trying out this package, and came across what appears to be an error while running the Example Usage (using the example data).

> trainData = UCEC_mafs[UCEC_mafs$Tumor_Sample_Barcode %in% as.character(SampleID_train),]
> trainset = readData(trainData, exomef, covarf, mutContextf, ref)
Number of inconsistant annotation mutation: 0 out of total 0 mutation 
Bed file is not specified for samples. exomeGene for whole exome will be generated.
Total gene analyzed: 0

And then subsequently get this message:

> sample = data.frame(SampleID = SampleID_test, BED = TST170_panel, stringsAsFactors = FALSE)
> testData = UCEC_mafs[UCEC_mafs$Tumor_Sample_Barcode %in% as.character(SampleID_test),]
> testset_panel = readData(testData, exomef, covarf, mutContextf, ref, samplef = sample)
Number of inconsistant annotation mutation: 0 out of total 0 mutation 
Total 0 out of 176 samples with at least one mutation detected.
Only the ones with at least one mutation will be used.
Bed file is specified for samples
	All samples have the same bed regions
Error in `$<-.data.frame`(`*tmp*`, "Chromosome", value = "chr") : 
  replacement has 1 row, data has 0

Any idea what might be going on?

Thanks :)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.