Code Monkey home page Code Monkey logo

quantumclone's Introduction

Github Status :

Build Status AppVeyor Build Status Coverage Status

CRAN Status and statistics :

CRAN version CRAN downloads weekly CRAN total

QuantumClone and QuantumCat

R package also available on CRAN Maintainer: Paul Deveau (quantumclone.package at gmail.com)

Clonal Reconstruction from High-Throughput Sequencing data

This Readme is divided in two parts:

1. Installation instructions

2. Usage (Advanced)

Installation instructions

The full package is available and is maintained on CRAN.

The package can be installed using:

install.packages("QuantumClone")

or from the GitHub repository using devtools:

devtools::install_github("DeveauP/QuantumClone")

Usage

QuantumClone is looking for clones in your samples assuming that there is an evolutionary logic between samples, so you should use data from the same patient for one analysis (either different timepoints, or spatially separated samples, or biological replicates).

QuantumClone requires few informations in the input file:

  • The columns in the file MUST be separated by tabulations
  • Line 1 should be the column titles (Sample | Chr | Start | Alt | Depth ). An additional argument is required if you do not have a FREEC profile associated to your files: the Genotype.
  • The first column needs to be the name of your sample
  • The Chr column contains the chromosome of variant (e.g. "chr2")
  • Start is the position of the variant
  • Alt is the number of reads supporting the variant
  • Depth is the depth of coverage at the position of the variant (number of reads mapped at this position)

Any additional column will not be taken into account for the analysis

You should have something similar to this:

SampleName Chr Start Depth Alt Genotype
Timepoint_1 1 1 149 67 AB
Timepoint_1 4 2 162 2 AB
Timepoint_1 4 3 132 5 AB
Timepoint_1 4 4 57 1 AB
Timepoint_1 4 5 93 0 AB
Timepoint_1 4 6 95 0 AB

While the input file can be as large as you want, the computation time will grow with the number of variants to be studied. In order to keep computation time reasonable ( < 1h ), a reasonable set of mutation is between 100 to 1000 variants.

  • FREEC files: list of files corresponding to your samples. It is required if you do not have a Genotype column in your analysis. You should use the "Sample_ratio.txt" file, not the "Sample_ratio_normal.txt"
  • Contamination: fraction of normal cells estimated to contaminate your samples. Needs to be separated by commas (example: 0.1, 0.2)
  • Clone range: how many clones should be looked for in the samples? "2:5" means 2 to 5, whereas "2,5" means 2 and 5.
  • Save plot: Do you want to save 2D plots?
  • Save data: Do you want to keep probabilities and estimated copy numbers in a file?

(a) alt tag (b) alt tag

Usage

The QuantumClone package is divided in two:

All of this is detailed in the vignette that can be accessed with:

vignette("Use_case",package = "QuantumClone")

Clonal reconstruction

One_step_clustering() has several parameters required (some have default configuration):

One_step_clustering(SNV_list, FREEC_list = NULL, contamination, nclone_range = 2:5, clone_priors = NULL, prior_weight = NULL, Initializations = 1, preclustering = "FLASH", simulated = FALSE, epsilon = NULL, save_plot = TRUE, ncores = 1, restrict.to.AB = FALSE, output_directory = NULL, model.selection = "BIC", optim = "default", keep.all.models = FALSE, force.single.copy = FALSE)

  • SNV_list: list of dataframes. See previous section for description.
  • FREEC_list: list of outputs from FREEC (in the same order as the SNV list). See here for added information.
  • contamination: Numeric vector giving the fraction of normal cells in each sample. Is linked to the cellularity by contamination = 1 - Cellularity
  • nclone_range: number of clones to look for in the samples
  • clone_priors: list of vectors giving the position of the clones in each samples (if know from previous analysis)
  • prior_weight : fraction of variants belonging to a clone (if known from previous analysis)
  • Initializations : number of iterations to run per condition. The output will take the maximal maximum likelihood on all iterations.
  • preclustering : the method to be used for EM initialization
  • simulated : is the data generated by QuantumCat? It does not change the parameters, but will attribute shapes to different chromosomes in the plots. (see QuantumCat for more information)
  • epsilon : stop condition for the EM. If left null, will be estimated from the average depth of sequencing of the data.
  • save_plot : save the 2D plots in a folder with the patient name/output_directory.
  • ncores: number of CPUs on which to distribute calculations (used if high number of variants)
  • restrict.to.AB : should the clustering be done only on AB regions?
  • output_directory : directory in which the plots will be saved (if NULL, will create a directory with the patient name)

Plots

** 2D plot **

plot_QC_out(QClone_Output,Sample_names=NULL, simulated = FALSE,sample_selected = 1:2)

  • QClone_Output: output from One_step_clustering or QuantumClone functions
  • Sample_names: vector with names of the samples (in the same order as the input list). Only used for the names of the axes. If NULL is replaced by numbers.
  • simulated: Used to display the original cluster from QuantumCat data
  • sample_selected: samples to be used for the plot (if the number of samples to plot is smaller than the the input)

** Evolution plots **

evolution_plot(QC_out,Sample_names=NULL)

Clonal simulation

This part is about generating data to test clonal reconstruction algorithms. Its core is the QuantumCat function. It will generate data for a single cancer that can be sequenced multiple times (either spatially separated or different timepoints). It thus assumes that there is an evolutionary history between samples. The "Chr" columns stores the information of the clonal attribution.

QuantumCat(number_of_clones, number_of_mutations, ploidy = 2, depth = 100, number_of_samples = 2, Random_clones = F, contamination = NULL)

  • number_of_clones : How many clones should exist in total. For example, 5 clones in 2 samples can be distributed in the following way: 1 specific of sample 1, 1 specific of sample 2 and 3 shared between sample 1 and 2.
  • number_of_mutations : How many variants should be used for the clustering. Some algorithms reported that an increase in the number of variants decreased the clustering quality, which does not seem to be the case here. It affects the computing time however.
  • ploidy : if numeric, it will generate a Poisson distribution with mean the ploidy. Accepted inputs can be "disomic", "AB", "AAB", "A", etc.
  • depth : what is the sequencing depth? Depth of a variant will be generated according to a negative binomial distribution, which characteristics have been generated by fitting to our data from whole genome sequencing.
  • number_of_samples = How many samples should be generated?
  • Random_clones: if the number of clones should be generated randomly (sampled from 2:5)
  • contamination: estimation of the contamination by normal cells

For multiple testings, and calculation of the Normalized Mutual Information (NMI), see Multitest() and statistics_on_Multitest()

Suggestions for variant filtering

We suggest to run QuantumClone on somatic variants only detected, for example, using VarScan2. We suggest to apply the following filters to select variants for clonal reconstruction:

  • minimal depth of coverage: 50x
  • minimal percentage of reads supporting the mutation (in at least one sample per patient): 10%
  • require variants to be located in regions of high local mappability (based on the 100 bp mappability track), and outside of repeats and duplicated genomic regions. The latter can be assessed using the UCSC repeat masker, simple repeat, and segmental duplication regions.
  • delete variants that created a stretch of four or more identical nucleotides
  • require variants to be located in regions where the genotype evaluated by Control-FREEC was available.
  • filter out variants corresponding to polymorphisms present in more than 1% of the population (snp138, 1000Genomes, esp6500) except if it is a known cancer related variant (COSMIC database for coding and non-coding mutations)
  • when working with cancer samples from several patients, you can consider removing variants annotated as germline in un-matched normal (blood) samples

Any other variants can be mapped on the clonal structure using the QuantumClone function Probability.to.belong.to.clone().

Acknowledgments

Many thanks to the contributors of this work: my supervisors, Elodie for the features improvement and Linux debugging and more generally to the U830 & U900 people. This work had been funded by the Ministere de l'Enseignement Supérieur de la Recherche (AMX grant).

quantumclone's People

Contributors

deveaup avatar valeu avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

quantumclone's Issues

Information about individual mutations in clonal tree

Hi,

I am new to Clonal evolution(so please pardon me if my doubt is naive). I have tried Clonefinder, MixPhy, and QuantumClone. With Clonefinder, and MixPhy I was able to get individual SNV's mutation for each node of the tree (like AATTAAA) but with QuantumClone I am getting cellularity at the cluster level. Is this by design or am I missing something?

Genotype is not available at position # ########

Dear Paul
I have another question about the One_step_clustering() function.
I ran Control-FREEC on one sample and tried QuantumClone.
Here is what my input SNV file looks like:

head(SNV_file)

  SampleName Chr   Start Depth Alt
1 sample1   1  909419    16   3
2 sample1   1  986894    20   2
3 sample1   1 1262354    95   3
4 sample1   1 1563476    73   2
5 sample1   1 1585590   117   2
6 sample1   1 1654193   247   3

And what the output of Control-FREEC looks like for this sample:

head(CNV_file)

Chromosome  Start    Ratio MedianRatio CopyNumber BAF estimatedBAF Genotype UncertaintyOfGT              Gene
1          1  65535 0.726426    0.508169          1  -1            1        A              -1   1: 65534- 65600
2          1  65857  1.12502    0.508169          1  -1            1        A              -1   1: 65856- 65948
3          1  69091 0.851791    0.508169          1   1            1        A              -1   1: 69090- 70008
4          1 721407 0.524401    0.508169          1  -1            1        A              -1 1: 721406- 721494
5          1 721556 0.876531    0.508169          1  -1            1        A              -1 1: 721555- 721781
6          1 752751  1.17044    0.508169          1   1            1        A              -1 1: 752750- 753582

When I tried the following commands:

SNV_liste = list(SNV_file)
FREEC_liste = list(CNV_file)
contamination = 30
One_step_clustering(SNV_list = SNV_liste, FREEC_list = FREEC_liste, contamination, nclone_range = 2:5, clone_priors = NULL, prior_weight = NULL, Initializations = 1, preclustering = "FLASH", simulated = FALSE, epsilon = NULL, save_plot = TRUE, ncores = 1, restrict.to.AB = FALSE, output_directory = NULL, model.selection = "BIC", optim = "default", keep.all.models = FALSE, force.single.copy = FALSE)

I got:

Checking that SNV_list and FREEC_list have the same number of samples...
Passed
epsilon set to: 0.00566125684264668
Checking all possibilities for sample1
Genotype is not available at position 1 909419
Genotype is not available at position 1 986894
Genotype is not available at position 1 1262354
[...]
Genotype is not available at position 21 47832900
Error in lis[[1]] : subscript out of bounds
In addition: Warning message:
In Patient_schrodinger_cellularities(SNV_list = SNV_list, FREEC_list = FREEC_list,  :
  4931 mutations excluded due to missing genotype or normalization issues

Could you please explain me what I am not doing right?
I guess I could add a genotype column in my snv file but I was hoping I could directly use the output file sample1_ratio.txt from Control-FREEC.
Thank you in advance

Kind Regards
-Charlotte

How to get the "Genotype" for input files?

Dear author,

Sorry for this question. I have noticed that in the paper you published, it is said that

In order to validate QuantumClone on rearranged or hyper-diploid genomes, we simulated variants in loci of genotype AB, AAB, AABB and in a nearly diploid genome, where all possible genotypes can be observed (Fig. 2).

So I guess that the genotype is describing the copy numbers and haplotypes of somatic mutations.

  1. But what is the meaning of "A" and "B"?
  2. Are "A" and "B" having relationship with major and minor allele frequencies?
  3. A site with 4 copies might be "AAAB", "AABB", "ABBB", "AAAA", and "BBBB"?
  4. Some software can give sites with 1 copy. They should be "A" or "B"?
  5. How about sites with 0 copy?
  6. Can you suggest some softwares to generate the "Genotype" atrributes?

Thanks a lot.
Best regards.

Error: package or namespace load failed for ‘QuantumClone’

Dear Paul,
I would like to use QuantumClone but I'm experiencing troubles with making it work.
Indeed when I do:
> install.packages("QuantumClone")
There is no error but once I try to load it using
> library(QuantumClone)
Then I got this error
Error: package or namespace load failed for ‘QuantumClone’ in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]): there is no package called ‘robustbase’
I've tried to download robustbase independently but the error is persisting.
Coud you please help me with this problem?
Thank you in advance,
Kind regards, -Charlotte

NB: I'm using R version 3.4.4

Error in result$cluster[Order] : invalid subscript type 'list'

Hello,

Here is my SNV_list file :

    SampleName Chr Start Depth Alt Genotype
1   FFPE   1     1    21   2       AA
2  FFPE   1     2    70   7       AB
3   FFPE   1     3    23   2       AB
4   FFPE   1     4    22   2       AB
5   FFPE   1     5    17   2       AB
6   FFPE   1     6    68  11       AB
7   FFPE   1     7    78  21       AB
8   FFPE   1     8    77   9      AAB
9   FFPE   1     9   129  21       AB
10  FFPE   1    10    38   8       AB


> One_step_clustering(SNV_list, FREEC_list = NULL, contamination, nclone_range = 2:5, clone_priors = NULL, prior_weight = NULL, Initializations = 1, preclustering = "FLASH", simulated = FALSE, epsilon = NULL, save_plot = TRUE, ncores = 1, restrict.to.AB = FALSE, output_directory = NULL , model.selection = "BIC", optim = "default", keep.all.models = FALSE, force.single.copy = FALSE)

Starting clustering...
Keeping most likely state for each variant...

And here is the error :

Error in result$cluster[Order] : invalid subscript type 'list'
In addition: There were 45 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: In spare[is.infinite(spare)] <- sign(is.infinite(spare)) *  ... :
  number of items to replace is not a multiple of replacement length
2: In spare[is.infinite(spare)] <- sign(is.infinite(spare)) *  ... :
  number of items to replace is not a multiple of replacement length
3: In spare[is.infinite(spare)] <- sign(is.infinite(spare)) *  ... :
  number of items to replace is not a multiple of replacement length
4: In spare[is.infinite(spare)] <- sign(is.infinite(spare)) *  ... :
  number of items to replace is not a multiple of replacement length
5: In spare[is.infinite(spare)] <- sign(is.infinite(spare)) *  ... :
  number of items to replace is not a multiple of replacement length

What should I do to fix the problem?
Thank you

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.