caravagnalab / mobster

Model-based subclonal deconvolution from bulk sequencing.

Home Page: https://caravagnalab.github.io/mobster/

License: GNU General Public License v3.0

R 5.41% HTML 94.59%

subclone-identification mixture-model model-based-clustering beta-distribution pareto-distributions

mobster's Introduction

mobster

mobster is a package that implements a model-based approach for subclonal deconvolution of cancer genome sequencing data (Caravagna et al.; PMID: 32879509).

The package integrates evolutionary theory (i.e., population genetics) and machine learning to analyze bulk sequencing data (e.g., whole-genome) from cancer samples. This analysis relates to clustering; we approach it via a maximum-likelihood formulation of Dirichlet mixture models, and use bootstrap routines to assess the confidence of the parameters. The package implements S3 objects to visualize the data and the fits.

Citation

If you use mobster, please cite:

G. Caravagna, T. Heide, M.J. Williams, L. Zapata, D. Nichol, K. Chkhaidze, W. Cross, G.D. Cresswell, B. Werner, A. Acar, L. Chesler, C.P. Barnes, G. Sanguinetti, T.A. Graham, A. Sottoriva. Subclonal reconstruction of tumors by using machine learning and population genetics. Nature Genetics 52, 898–907 (2020).

Help and support

Installation

You can install the released version of mobster from GitHub with:

# install.packages("devtools")
devtools::install_github("caravagnalab/mobster")

Copyright and contacts

mobster's Issues

No Function: squareplot

Hi, I carefully checked the functions in mobster, but I don't see any information about "squareplot". I look forward to your update.

Best wishes,

Sunny.

bug in call to easypar

Running the following code from the example gives me an error

library(mobster)

dataset = random_dataset(
  seed = 123, 
  Beta_variance_scaling = 100    # variance ~ U[0, 1]/Beta_variance_scaling
  )

fit = mobster_fit(
  dataset$data,    
  auto_setup = "FAST",
  parallel = F
  )

Loaded input data, n = 5000.
❯ n = 5000. Mixture with k = 1,2 Beta(s). Pareto tail: TRUE and FALSE. Output clusters with
π > 0.02 and n > 10.
! mobster automatic setup FAST for the analysis.
❯ Scoring (without parallel) 2 x 2 x 2 = 8 models by reICL.

[easypar] run 1 - Error: Can't merge the outer name `init.value` with a vector of length > 1.
Please supply a `.name_spec` specification.

[easypar] run 2 - Error: Can't merge the outer name `init.value` with a vector of length > 1.
Please supply a `.name_spec` specification.

[easypar] run 3 - Error: Can't merge the outer name `init.value` with a vector of length > 1.
Please supply a `.name_spec` specification.

[easypar] run 4 - Error: Can't merge the outer name `init.value` with a vector of length > 1.
Please supply a `.name_spec` specification.

[easypar] run 5 - Error: Can't merge the outer name `init.value` with a vector of length > 1.
Please supply a `.name_spec` specification.

[easypar] run 6 - Error: Can't merge the outer name `init.value` with a vector of length > 1.
Please supply a `.name_spec` specification.

[easypar] run 7 - Error: Can't merge the outer name `init.value` with a vector of length > 1.
Please supply a `.name_spec` specification.

[easypar] run 8 - Error: Can't merge the outer name `init.value` with a vector of length > 1.
Please supply a `.name_spec` specification.

[easypar] 8/8 computations returned errors and will be removed.
Error in mobster_fit(dataset$data, auto_setup = "FAST", parallel = F) : 
  All task returned errors, no fit available, raising this error to interrupt the computation....

Input specifications

I can't seem to find what the "d" and "popsize" are indicating in the input examples.
Could someone clarify?

Special fits with 1 component

Implement a direct MLE/MM fit without EM.

Typo?

Hi,

This line in selection2clonenested() throws an error saying that the variable time is missing. Is it a typo, and should it be time1?

mobster/R/evodynamics.R

Line 149 in b824fe0

x3 <- log(2) * (time_end - time)

Best regards,
Paweł

Portability to R 4.0

Require the new R version due in April 2020.

fit error

Error in -tests$likelihood : invalid argument to unary operator

How about WES data?

Hi, I was wondering whether it is feasible to deal with exome-sequencing data using mobster?

Improve speed of bootstrap

Rewrite the bootstrap routines to work on offline data, so that they can be submitted as an array job to the cluster using easypar. That would be much faster than the current implementation.

Wrapping support for neutralitytestr

@marcjwilliams1 Check out commit 7396fea3308aee9921220b05e4b6f7e6267cd93e. I wrapped a call to the neutralitytestr package.

The call is wrapped from a general MOBSTER object (k regions), which contains MOBSTER fits in the $fit.MOBSTER field. The VAF is already adjusted by MOBSTER, and mutations are assumed to be diploid. The integration range is selected via custom upper and lower quantiles. The test is run on tail mutations.

The function neutralitytest is run on each sample that has a fitted tail. The final mutation rate M is a linear combination of the per-sample mutation rates, weighted by the normalized tail size. The idea is that if one sample has a large tail (900 mutations) and another has a very small one (100 mutations), we want to give more weight (90%) to the mutation rate estimated from the larger tail.
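The weighting described above can be sketched in a few lines of base R. This is only an illustration of the arithmetic, not the code from the commit: the names tail_size and mu_sample and the per-sample rates are made up for the example.

```r
# Hypothetical per-sample values, chosen only for illustration
tail_size <- c(sample_A = 900, sample_B = 100)  # number of tail mutations
mu_sample <- c(sample_A = 10, sample_B = 20)    # mutation rate estimated per sample

# Normalize the tail sizes so that the weights sum to 1
w <- tail_size / sum(tail_size)

# Final rate: linear combination of the per-sample rates, weighted by tail size
mu_final <- sum(w * mu_sample)

print(w)         # sample_A gets weight 0.9, sample_B gets weight 0.1
print(mu_final)  # 0.9 * 10 + 0.1 * 20 = 11
```

With these numbers the larger tail contributes 90% of the final estimate, matching the 900 versus 100 example.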

On master R>3.6.0

Update DESCRIPTION file to reflect this after commit 72056b6

Model selection failure when sparse low frequency mutations present

Hi Giulio,

I've been running lots of examples with mobster (very cool stuff), but I noticed a consistent pattern of model selection failure (odd Beta distribution fits) when sparse and dispersed low-frequency mutations are present in the VAF spectrum. See the example plot below (left):

Both fits were run with the default settings (not auto_setup = FAST).

I think a straightforward solution is to trim the neutral tail a bit by removing mutations below ~2-5% VAF (see the right plot: trimming mutations below 5% VAF fixes this issue), but I am wondering if there is an automated solution for this within the package. Maybe I overlooked something in the documentation.
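The manual trimming mentioned here could look like the following base R sketch. It runs on synthetic VAF values rather than real data, and the name vaf_cutoff is an assumption made for the example; it is not a mobster parameter.

```r
set.seed(1)

# Synthetic VAF values standing in for real data (assumption for illustration):
# a sparse low-frequency group plus a clonal peak around 0.5
muts <- data.frame(VAF = c(runif(50, 0.005, 0.05), rbeta(200, 50, 50)))

# Hypothetical cutoff; the post suggests something between about 0.02 and 0.05
vaf_cutoff <- 0.05

# Keep only mutations at or above the cutoff
trimmed <- muts[muts$VAF >= vaf_cutoff, , drop = FALSE]

cat("kept", nrow(trimmed), "of", nrow(muts), "mutations\n")
```

The trimmed table would then be passed to the fit in place of the original one. Note that a hard cutoff also removes genuine tail mutations, so the choice of threshold is a trade-off.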

Error in vignette during binomial_noise branch build

Error when building the binomial_noise branch from the repo in RStudio:

Quitting from lines 31-35 [unnamed-chunk-3] (a4_popgen.Rmd)
Error: processing vignette 'a4_popgen.Rmd' failed with diagnostics:
Bad MOBSTER input (list of fits).
--- failed re-building ‘a4_popgen.Rmd’

This can be worked around by setting 'vignettes = F' in build().

Is it possible to deal with ctDNA data?

Hi,
I find mobster very useful for subclone reconstruction analysis and would like to use it for my ctDNA research.
Could you please let me know if mobster can deal with WES ctDNA data?
Thank you!

Strelka2 and Mutect2 inputs

Hi,

I would like to ask some questions about the inputs.
I have Mutect2 (both unfiltered and filtered with GATK FilterMutectCalls) and Strelka2 calls.
My questions are:

1. Which are the correct arguments for the DP_column and NV_column parameters of the load_vcf() function, for Mutect2 and Strelka2 VCF files?
2. Is it possible to use Mutect2 and Strelka2 VCF files directly as input for load_vcf(), or is it necessary to manipulate the files first? In that case, how should I do it? Are there any "helper" functions in the package manual that I missed?
3. Is it possible, or recommended, to use just the "PASS" calls?

Sorry for the silly questions, I just want to be 100% sure of what I'm doing.
Thank you!
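As background for the first question, the sketch below shows how read counts are commonly derived from the AD field of a Mutect2 record, which stores reference and alternate read counts as "ref,alt". This is a general convention, not an answer confirmed by the mobster authors: it says nothing about what load_vcf() accepts, the AD strings are invented, and taking the depth as ref + alt is an assumption. Strelka2 somatic calls encode allele counts differently and are not covered here.

```r
# Hypothetical AD strings as they appear in a Mutect2 sample column: "ref,alt"
ad <- c("30,10", "45,5", "12,12")

# Split each string on the comma and stack the counts into an integer matrix
# (one row per variant, column 1 = reference reads, column 2 = alternate reads)
counts <- do.call(rbind, lapply(strsplit(ad, ",", fixed = TRUE), as.integer))

calls <- data.frame(
  NV = counts[, 2],     # reads supporting the alternate allele
  DP = rowSums(counts)  # depth taken as ref + alt (an assumption)
)
calls$VAF <- calls$NV / calls$DP

print(calls)
```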

Models without tails

Check that these can be fit appropriately.

Missing line in DESCRIPTION file

Error in `(function (command = NULL, args = character(), error_on_status = TRUE, …`:
! System command 'R' failed

Exit status: 1
stdout & stderr:

Type .Last.error to see the more details.
Warning messages:
1: In readLines(f, n) :
incomplete final line found on '/Users/madeleine.dale/Library/R/arm64/4.3/library/mobster/DESCRIPTION'
2: In readLines(file) :
incomplete final line found on '/Users/madeleine.dale/Library/R/arm64/4.3/library/mobster/DESCRIPTION'
3: In readLines(f, n) :
incomplete final line found on '/Users/madeleine.dale/Library/R/arm64/4.3/library/mobster/DESCRIPTION'
4: In readLines(file) :
incomplete final line found on '/Users/madeleine.dale/Library/R/arm64/4.3/library/mobster/DESCRIPTION'

Error: Error in mobster:::check_input(x, K, samples, init, tail, epsilon, maxIter, :

Hi,
I came across the following error. Do you think it is because I recently updated some packages? Could you please help me fix this?

Thanks,

library(mobster)
library(tidyr)
library(dplyr)
example_data = Clusters(mobster::fit_example$best)
drivers_rows = c(2239, 3246, 3800)
example_data$is_driver = FALSE
example_data$driver_label = NA
example_data$is_driver[drivers_rows] = TRUE
example_data$driver_label[drivers_rows] = c("DR1", "DR2", "DR3")
# Fit and print the data
fit = mobster_fit(example_data, auto_setup = 'FAST')

 [ MOBSTER fit ] 

Error in mobster:::check_input(x, K, samples, init, tail, epsilon, maxIter, : There are some reserved names in the input data that cannot be used, please remove or rename columns: cluster, Tail, C1, C2
Traceback:

1. mobster_fit(example_data, auto_setup = "FAST")
2. mobster:::check_input(x, K, samples, init, tail, epsilon, maxIter, 
 .     fit.type, seed, model.selection, trace)
3. stop("There are some reserved names in the input data that cannot be used, please remove or rename columns: ", 
 .     paste0(fixed_names, collapse = ", "))
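The error message names the columns that must be removed or renamed before refitting. A minimal base R sketch of that clean-up follows; it uses a small mock table instead of the real Clusters() output, and the variable names reserved and clean_data are made up for the example.

```r
# Mock table mimicking the columns listed in the error: VAF plus columns
# left over from an earlier fit, and the driver annotations
example_data <- data.frame(
  VAF = c(0.45, 0.16, 0.06),
  cluster = c("C1", "C2", "Tail"),
  Tail = c(0.01, 0.22, 1.00),
  C1 = c(0.99, 0.00, 0.00),
  C2 = c(0.00, 0.78, 0.00),
  is_driver = c(TRUE, FALSE, FALSE),
  driver_label = c("DR1", NA, NA)
)

# Names reported as reserved in the error message above
reserved <- c("cluster", "Tail", "C1", "C2")

# Drop them, keeping everything else (VAF and the driver annotations)
clean_data <- example_data[, !(names(example_data) %in% reserved), drop = FALSE]

print(names(clean_data))
```

The list of reserved names is taken from this particular error message; a fit with more Beta components would presumably report additional names, so it is safer to read them from the message than to hard-code them.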

Plot updates

"guides(<scale> = FALSE) is deprecated. Please use guides(<scale> = "none") instead. "

Here are all the paths that needed updating:

mobster/R/plot_boostrap_coclustering.R
mobster/R/plot_mixing_proportions.R
mobster/R/S3_methods_plot.R
mobster/R/plot_gofit.R
mobster/R/plot_fit_scores.R
mobster/R/plot_boostrap_Beta.R
mobster/R/plot_boostrap_tail.R
mobster/R/plot_model_selection.R

I think I found all of them.

Website

After moving the repo to caravagnalab, we need to create the GitHub page again, and ideally also set up a redirect from the old URL to the new one:

https://caravagn.github.io/mobster

https://caravagnalab.github.io/mobster/

Missing required dependencies

It seems the package is currently missing the following CRAN packages

wesanderson
reshape2

plus my github package

ctree

Can we add these to the package dependencies so they get installed automatically? Should we make the change in development and mirror it to master as well?

Walk-through Example?

Hi,

I would like to try MOBSTER, but can't see any examples of how to input data, formatting, or what output will look like.

Is there a 'walk-through' with some example data that I can follow?

Thanks,

Bruce.

Can't subset columns that don't exist

In plot_latent_variables, a MOBSTER fit is passed as the main argument, and it contains a tibble like

# A tibble: 3 x 7
     VAF cluster   Tail       C1       C2 is_driver driver_label
   <dbl> <chr>    <dbl>    <dbl>    <dbl> <lgl>     <chr>       
1 0.448  C1      0.0125 9.88e- 1 8.08e-21 TRUE      DR1         
2 0.159  C2      0.225  2.35e-34 7.75e- 1 TRUE      DR2         
3 0.0629 Tail    1.00   1.91e-82 4.02e- 5 TRUE      DR3

After passing the object to the function Clusters, we get

# A tibble: 5,000 x 10
     VAF cluster Tail...3 C1...4   C2...5 is_driver driver_label Tail...8 C1...9
   <dbl> <chr>      <dbl>  <dbl>    <dbl> <lgl>     <chr>           <dbl>  <dbl>
 1 0.497 C1       0.00736  0.993 5.22e-27 FALSE     NA            0.00736  0.993
 2 0.490 C1       0.00669  0.993 4.42e-26 FALSE     NA            0.00669  0.993
 3 0.470 C1       0.00705  0.993 1.31e-23 FALSE     NA            0.00705  0.993
 4 0.517 C1       0.0130   0.987 1.83e-29 FALSE     NA            0.0130   0.987
 5 0.506 C1       0.00903  0.991 3.86e-28 FALSE     NA            0.00903  0.991
 6 0.440 C1       0.0179   0.982 9.68e-20 FALSE     NA            0.0179   0.982
 7 0.428 C1       0.0347   0.965 3.88e-18 FALSE     NA            0.0347   0.965
 8 0.523 C1       0.0164   0.984 3.97e-30 FALSE     NA            0.0164   0.984
 9 0.482 C1       0.00648  0.994 3.87e-25 FALSE     NA            0.00648  0.994
10 0.499 C1       0.00759  0.992 3.20e-27 FALSE     NA            0.00759  0.992
# … with 4,990 more rows, and 1 more variable: C2...10 <dbl>

This breaks the execution of

clusters_names = names(x$pi)
assignments %>% select(clusters_names)

where assignments is the result of the Clusters function.
The two tibbles have different column names, so the select fails with an error.

The problem can be reproduced by running the second stage of the vignette "2. Plotting fits".
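Suffixes such as Tail...3 are what tibble name repair produces when two columns share a name, which suggests that the latent-variable columns are bound to data that already contains them. The base R sketch below shows one way to avoid the clash by dropping the stale copies before binding. It is an illustration on mock tables, not the fix adopted in mobster; input, latent and clash are hypothetical names.

```r
# Mock input data that already carries columns from an earlier fit
input <- data.frame(VAF = c(0.50, 0.16), Tail = c(0.01, 0.22), C1 = c(0.99, 0.00))

# Mock latent-variable table that would be appended to the data
latent <- data.frame(Tail = c(0.02, 0.30), C1 = c(0.98, 0.00), C2 = c(0.00, 0.70))

# Columns present in both tables would collide when bound side by side
clash <- intersect(names(input), names(latent))

# Drop the stale copies first, so the original names survive the binding
assignments <- cbind(input[, setdiff(names(input), clash), drop = FALSE], latent)

# Selecting by cluster name now works because each name occurs exactly once
clusters_names <- c("Tail", "C1", "C2")
print(assignments[, clusters_names])
```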

remove mutations with VAF = 0?

I sometimes get the following error; it seems to go away if I remove mutations with VAF = 0.0. I think at the moment they get set to 1e-9. Maybe we should just remove them? Or do you think it's something else?

Error in if (.stoppingCriterion(i, prevNLL, fit$NLL, prevpi, fit$pi, fit.type,  : 
  missing value where TRUE/FALSE needed
In addition: Warning message:
In .dbpmm.EM(x, K = tests[r, "K"], init = init, tail = tests[r,  :
  Possible singularity in one Beta component a/b --> Inf.
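The filtering proposed in this issue can be sketched in base R as follows. The first line illustrates why an exact zero is problematic for a Beta likelihood in general; whether that is the actual cause of the error above is not established here. The table muts is made-up input.

```r
# A Beta density with shape parameters above 1 is zero at x = 0,
# so its log-likelihood contribution is -Inf
print(dbeta(0, shape1 = 2, shape2 = 5, log = TRUE))

# Hypothetical input containing boundary values
muts <- data.frame(VAF = c(0, 0.03, 0.25, 0.5, 1))

# Keep only mutations strictly inside the open interval (0, 1)
muts_ok <- muts[muts$VAF > 0 & muts$VAF < 1, , drop = FALSE]

print(muts_ok$VAF)
```

Removing boundary values changes the number of mutations entering the fit, so it may be worth reporting how many were dropped.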

Error when running examples

At the moment, R CMD check fails when running the examples. Here is the error:

  Running examples in ‘mobster-Ex.R’ failed
  The error most likely occurred in:
  
  > ### Name: get_clone_trees
  > ### Title: Return clone trees from the fit.
  > ### Aliases: get_clone_trees

...

  > trees = get_clone_trees(x)
  Error in get_clone_trees(x) : 
    Your data should have driver events annotated, cannot use 'ctree' otherwise.
  Execution halted