Github Status :

CRAN Status and statistics :

QuantumClone and QuantumCat

R package also available on CRAN Maintainer: Paul Deveau (quantumclone.package at gmail.com)

Clonal Reconstruction from High-Throughput Sequencing data

This Readme is divided in two parts:

1. Installation instructions

2. Usage (Advanced)

Installation instructions

The full package is available and is maintained on CRAN.

The package can be installed using:

install.packages("QuantumClone")

or from the GitHub repository using devtools:

devtools::install_github("DeveauP/QuantumClone")

Usage

QuantumClone is looking for clones in your samples assuming that there is an evolutionary logic between samples, so you should use data from the same patient for one analysis (either different timepoints, or spatially separated samples, or biological replicates).

QuantumClone requires few informations in the input file:

The columns in the file MUST be separated by tabulations
Line 1 should be the column titles (Sample | Chr | Start | Alt | Depth ). An additional argument is required if you do not have a FREEC profile associated to your files: the Genotype.
The first column needs to be the name of your sample
The Chr column contains the chromosome of variant (e.g. "chr2")
Start is the position of the variant
Alt is the number of reads supporting the variant
Depth is the depth of coverage at the position of the variant (number of reads mapped at this position)

Any additional column will not be taken into account for the analysis

You should have something similar to this:

SampleName	Chr	Start	Depth	Alt	Genotype
Timepoint_1	1	1	149	67	AB
Timepoint_1	4	2	162	2	AB
Timepoint_1	4	3	132	5	AB
Timepoint_1	4	4	57	1	AB
Timepoint_1	4	5	93	0	AB
Timepoint_1	4	6	95	0	AB

While the input file can be as large as you want, the computation time will grow with the number of variants to be studied. In order to keep computation time reasonable ( < 1h ), a reasonable set of mutation is between 100 to 1000 variants.

FREEC files: list of files corresponding to your samples. It is required if you do not have a Genotype column in your analysis. You should use the "Sample_ratio.txt" file, not the "Sample_ratio_normal.txt"
Contamination: fraction of normal cells estimated to contaminate your samples. Needs to be separated by commas (example: 0.1, 0.2)
Clone range: how many clones should be looked for in the samples? "2:5" means 2 to 5, whereas "2,5" means 2 and 5.
Save plot: Do you want to save 2D plots?
Save data: Do you want to keep probabilities and estimated copy numbers in a file?

(a) (b)

Usage

The QuantumClone package is divided in two:

The clonal reconstruction: QuantumClone / One_step_clustering functions
Plots
The clonal simulation: QuantumCat (not included in the GUI)
VariantFiltering: Suggestion for variant filtering prior the QuantumClone analysis

All of this is detailed in the vignette that can be accessed with:

vignette("Use_case",package = "QuantumClone")

Clonal reconstruction

One_step_clustering() has several parameters required (some have default configuration):

One_step_clustering(SNV_list, FREEC_list = NULL, contamination, nclone_range = 2:5, clone_priors = NULL, prior_weight = NULL, Initializations = 1, preclustering = "FLASH", simulated = FALSE, epsilon = NULL, save_plot = TRUE, ncores = 1, restrict.to.AB = FALSE, output_directory = NULL, model.selection = "BIC", optim = "default", keep.all.models = FALSE, force.single.copy = FALSE)

SNV_list: list of dataframes. See previous section for description.
FREEC_list: list of outputs from FREEC (in the same order as the SNV list). See here for added information.
contamination: Numeric vector giving the fraction of normal cells in each sample. Is linked to the cellularity by contamination = 1 - Cellularity
nclone_range: number of clones to look for in the samples
clone_priors: list of vectors giving the position of the clones in each samples (if know from previous analysis)
prior_weight : fraction of variants belonging to a clone (if known from previous analysis)
Initializations : number of iterations to run per condition. The output will take the maximal maximum likelihood on all iterations.
preclustering : the method to be used for EM initialization
simulated : is the data generated by QuantumCat? It does not change the parameters, but will attribute shapes to different chromosomes in the plots. (see QuantumCat for more information)
epsilon : stop condition for the EM. If left null, will be estimated from the average depth of sequencing of the data.
save_plot : save the 2D plots in a folder with the patient name/output_directory.
ncores: number of CPUs on which to distribute calculations (used if high number of variants)
restrict.to.AB : should the clustering be done only on AB regions?
output_directory : directory in which the plots will be saved (if NULL, will create a directory with the patient name)

Plots

** 2D plot **

plot_QC_out(QClone_Output,Sample_names=NULL, simulated = FALSE,sample_selected = 1:2)

QClone_Output: output from One_step_clustering or QuantumClone functions
Sample_names: vector with names of the samples (in the same order as the input list). Only used for the names of the axes. If NULL is replaced by numbers.
simulated: Used to display the original cluster from QuantumCat data
sample_selected: samples to be used for the plot (if the number of samples to plot is smaller than the the input)

** Evolution plots **

evolution_plot(QC_out,Sample_names=NULL)

Clonal simulation

This part is about generating data to test clonal reconstruction algorithms. Its core is the QuantumCat function. It will generate data for a single cancer that can be sequenced multiple times (either spatially separated or different timepoints). It thus assumes that there is an evolutionary history between samples. The "Chr" columns stores the information of the clonal attribution.

QuantumCat(number_of_clones, number_of_mutations, ploidy = 2, depth = 100, number_of_samples = 2, Random_clones = F, contamination = NULL)

number_of_clones : How many clones should exist in total. For example, 5 clones in 2 samples can be distributed in the following way: 1 specific of sample 1, 1 specific of sample 2 and 3 shared between sample 1 and 2.
number_of_mutations : How many variants should be used for the clustering. Some algorithms reported that an increase in the number of variants decreased the clustering quality, which does not seem to be the case here. It affects the computing time however.
ploidy : if numeric, it will generate a Poisson distribution with mean the ploidy. Accepted inputs can be "disomic", "AB", "AAB", "A", etc.
depth : what is the sequencing depth? Depth of a variant will be generated according to a negative binomial distribution, which characteristics have been generated by fitting to our data from whole genome sequencing.
number_of_samples = How many samples should be generated?
Random_clones: if the number of clones should be generated randomly (sampled from 2:5)
contamination: estimation of the contamination by normal cells

For multiple testings, and calculation of the Normalized Mutual Information (NMI), see Multitest() and statistics_on_Multitest()

Suggestions for variant filtering

We suggest to run QuantumClone on somatic variants only detected, for example, using VarScan2. We suggest to apply the following filters to select variants for clonal reconstruction:

minimal depth of coverage: 50x
minimal percentage of reads supporting the mutation (in at least one sample per patient): 10%
require variants to be located in regions of high local mappability (based on the 100 bp mappability track), and outside of repeats and duplicated genomic regions. The latter can be assessed using the UCSC repeat masker, simple repeat, and segmental duplication regions.
delete variants that created a stretch of four or more identical nucleotides
require variants to be located in regions where the genotype evaluated by Control-FREEC was available.
filter out variants corresponding to polymorphisms present in more than 1% of the population (snp138, 1000Genomes, esp6500) except if it is a known cancer related variant (COSMIC database for coding and non-coding mutations)
when working with cancer samples from several patients, you can consider removing variants annotated as germline in un-matched normal (blood) samples

Any other variants can be mapped on the clonal structure using the QuantumClone function Probability.to.belong.to.clone().

Acknowledgments

Many thanks to the contributors of this work: my supervisors, Elodie for the features improvement and Linux debugging and more generally to the U830 & U900 people. This work had been funded by the Ministere de l'Enseignement Supérieur de la Recherche (AMX grant).

Genotype is not available at position # ########

Dear Paul
I have another question about the One_step_clustering() function.
I ran Control-FREEC on one sample and tried QuantumClone.
Here is what my input SNV file looks like:

head(SNV_file)

  SampleName Chr   Start Depth Alt
1 sample1   1  909419    16   3
2 sample1   1  986894    20   2
3 sample1   1 1262354    95   3
4 sample1   1 1563476    73   2
5 sample1   1 1585590   117   2
6 sample1   1 1654193   247   3

And what the output of Control-FREEC looks like for this sample:

head(CNV_file)

Chromosome  Start    Ratio MedianRatio CopyNumber BAF estimatedBAF Genotype UncertaintyOfGT              Gene
1          1  65535 0.726426    0.508169          1  -1            1        A              -1   1: 65534- 65600
2          1  65857  1.12502    0.508169          1  -1            1        A              -1   1: 65856- 65948
3          1  69091 0.851791    0.508169          1   1            1        A              -1   1: 69090- 70008
4          1 721407 0.524401    0.508169          1  -1            1        A              -1 1: 721406- 721494
5          1 721556 0.876531    0.508169          1  -1            1        A              -1 1: 721555- 721781
6          1 752751  1.17044    0.508169          1   1            1        A              -1 1: 752750- 753582

When I tried the following commands:

SNV_liste = list(SNV_file)
FREEC_liste = list(CNV_file)
contamination = 30
One_step_clustering(SNV_list = SNV_liste, FREEC_list = FREEC_liste, contamination, nclone_range = 2:5, clone_priors = NULL, prior_weight = NULL, Initializations = 1, preclustering = "FLASH", simulated = FALSE, epsilon = NULL, save_plot = TRUE, ncores = 1, restrict.to.AB = FALSE, output_directory = NULL, model.selection = "BIC", optim = "default", keep.all.models = FALSE, force.single.copy = FALSE)

I got:

Checking that SNV_list and FREEC_list have the same number of samples...
Passed
epsilon set to: 0.00566125684264668
Checking all possibilities for sample1
Genotype is not available at position 1 909419
Genotype is not available at position 1 986894
Genotype is not available at position 1 1262354
[...]
Genotype is not available at position 21 47832900
Error in lis[[1]] : subscript out of bounds
In addition: Warning message:
In Patient_schrodinger_cellularities(SNV_list = SNV_list, FREEC_list = FREEC_list,  :
  4931 mutations excluded due to missing genotype or normalization issues

Could you please explain me what I am not doing right?
I guess I could add a genotype column in my snv file but I was hoping I could directly use the output file sample1_ratio.txt from Control-FREEC.
Thank you in advance

Kind Regards
-Charlotte

deveaup / quantumclone Goto Github PK

quantumclone's Introduction

QuantumClone and QuantumCat

Clonal Reconstruction from High-Throughput Sequencing data

Installation instructions

Usage

Usage

Clonal reconstruction

Plots

Clonal simulation

Suggestions for variant filtering

Acknowledgments

quantumclone's People

Contributors

Stargazers

Watchers

Forkers

quantumclone's Issues

Recommend Projects

Recommend Topics

Recommend Org