andrej-fischer / clonehd Goto Github PK

High-definition reconstruction of clonal composition from next-generation sequencing data

License: GNU General Public License v3.0

Shell 1.34% C++ 96.70% C 1.96%

clonehd's Introduction

How to get cloneHD and filterHD?

The current stable release, as well as pre-compiled executable binaries for Mac OS X and GNU Linux (64bit), can be found here. The cloneHD software is undergoing rapid development. Watch/Star this repo to receive updates.

Run a test with simulated data

After downloading cloneHD from the release site, you can test both filterHD and cloneHD by running

$ sh run-example.sh

where you can see a typical workflow of analysing read depth and BAF data with a matched normal. All command line arguments are explained below.

Compilation

For Mac OS X and GNU Linux (64bit), pre-compiled binaries are available here. To compile cloneHD yourself, you need the GNU scientific library (GSL) v1.15 or later. Change the paths in the Makefile to point to your local GSL installation (if non-standard). Then type

$ make

in the src directory. The executables will be in build. For debugging with gdb, use make -f Makefile.debug.

Report bugs

To report bugs, use the issue interface of github.

Full documentation

The full documentation can be found in the /docs/ subfolder. Click below.

What are cloneHD and filterHD for?

cloneHD is a software for reconstructing the subclonal structure of a population from short-read sequencing data. Read depth data, B-allele count data and somatic nucleotide variant (SNV) data can be used for the inference. cloneHD can estimate the number of subclonal populations, their fractions in the sample, their individual total copy number profiles, their B-allele status and all the SNV genotypes with high resolution.

filterHD is a general purpose probabilistic filtering algorithm for one-dimensional discrete data, similar in spirit to a Kalman filter. It is a continuous state space Hidden Markov model with Poisson or Binomial emissions and a jump-diffusion propagator. It can be used for scale-free smoothing, fuzzy data segmentation and data filtering.

Visualization of the cloneHD output for the simulated data set. From top to bottom: (i) The bias corrected read depth data and the cloneHD prediction (red). (ii) The BAF (B-allele frequency), reflected at 0.5 and the cloneHD prediction (red). (iii) The total copy number posterior. (iv) The real total copy number profile. (v) The minor copy number posterior. (vi) The real minor copy number profile. (vii) The observed SNV frequencies, corrected for local ploidy, and per genotype (SNVs are assigned ramdomly according to the cloneHD SNV posterior). (All plots are created with Wolfram Mathematica.)

Tips and tricks

The read depth input files for filterHD and cloneHD can be generated from a bam-file with samtools. Given a bed file of non-overlapping 1kb windows, e.g.
```
  human-genome.1kb-grid.bed :

  1	9000	10000
  1	10000	11000
  ...
```
$ samtools bedcov human-genome.1kb-grid.bed sample.bam > read-depth.sample.txt
```
  read-depth.sample.txt :

  1	9000	10000	12009
  1	10000	11000	213557
  ...
```
$ awk '{print $1,$3,int(0.5+$4/1000.0),1}' read-depth.sample.txt > read-depth.sample.cloneHD.txt
```
  read-depth.sample.cloneHD.txt :

  1 10000 12 1
  1 11000 214 1
  ...
```
For 10 kb windows, you would use $ awk '{print $1,$3,int(0.5+$4/10000.0),10} etc. For windows smaller than 1kb, the observations might not be approximately independent.
All input files are assumed to be sorted by chromosome and genomic coordinate. With Unix, this can be achieved with sort -k1n,1 -k2n,2 file.txt > sorted-file.txt.
Pre-filtering of data is very important. If filterHD predicts many more jumps than you would expect, it might be necessary to pre-filter the data, removing variable regions, outliers or very short segments (use programs pre-filter and filterHD).
Make sure that the bias field for the tumor CNA data is meaningful. If a matched normal sample was sequenced with the same pipeline, its read depth profile, as predicted by filterHD, can be used as a bias field for the tumor CNA data. Follow the logic of the example data given here.
If the matched-normal sample was sequenced at lower coverage than the tumor sample, it might be necessary to run filterHD with a higher-than-optimal diffusion constant (set with --sigma [double]) to obtain a more faithful bias field. Otherwise, the filterHD solution is too stiff and you loose bias detail.
filterHD can sometimes run into local optima. In this case, it might be useful to set initial values for the parameters via --jumpi [double] etc.
By default, cloneHD runs with mass-gauging enabled. This seems wasteful, but is actually quite useful because you can see some alternative explanations during the course of the analysis.
Don't put too much weight on the BIC criterion. It was calibrated using simulated data. For real data, it should be supplemented with common sense and biological knowledge. Use --force [int] to use a fixed number of subclones and --max-tcn [int] to set the maximum possible total copy number.
If high copy numbers are expected only in a few chromosomes, you can increase performance by using the --max-tcn [file] option to specify per-chromosome upper limits.
For exome sequencing data, the read depth bias can be enormous. The filterHD estimate of the bias field might not be very useful, especially in segmenting the tumor data. Use rather, if available, the jumps seen in the BAF data for both CNA and BAF data (give the BAF jumps file to both --cna-jumps and --baf-jumps).

#How to cite

The cloneHD and filterHD software is free under the GNU General Public License v3. If you use this software in your work, please cite the accompanying publication:

Andrej Fischer, Ignacio Vazquez-Garcia, Christopher J.R. Illingworth and Ville Mustonen. High-definition reconstruction of subclonal composition in cancer. Cell Reports (2014), http://dx.doi.org/10.1016/j.celrep.2014.04.055

clonehd's People

Contributors

Stargazers

Watchers

Forkers

ivazquez sohrabsa jstjohn q-kim callaghanmt zhangtongyikai nfergu jml96 afc523 alxmertz

clonehd's Issues

set fault on cloneHD snv mode with single-variant chroms

I have exome-only variants, so have some chromosomes with only 1 somatic variant. I recompiled with your latest bug fixes and still appear to be getting the seg fault. When I remove the single-variant chromosomes, no seg fault. Your help is appreciated, and we look forward to using this on more samples.

Inconsistencies between CNA and BAF files

From the following I would expect the copy number between 122701000 and 122702000 to be 0, since 122702000 is the end of the first segment for the predicted region on the 3rd line. In other words, the 3rd line represents a region from 122701000 to 135534000.

tumour.cna.subclone-2.txt

10 42027000  80660 122686000 0.000 0.000 0.000 1.000 0.000
10 122687000     15 122701000 0.000 0.000 0.000 1.000 0.000
10 122702000  12833 135534000 1.000 0.000 0.000 0.000 0.000
10 135535000      1 135535000 1.000 0.000 0.000 0.000 0.000

However from the BAF file, some SNPs within the 122701000-122702000 segment have copy numbers 1 or 2 which is incompatible with total copies 0.

tumour.baf.subclone-2.txt

10 122701679      2 122701957 0.080 0.420 0.420 0.080 0.000

Maximum total copy number

It seems to me that I need to set a limit for maximum total copy number in order to reduce computation time. The documentation says that if it is not specified, normal copy number is used. That makes me confused about what maximum copy number stands for. Is it the copy number for the matched normal? If it were the maximum copy number for the tumor sample, why would normal copy be the default limit?

Also, I am able to set the limit separately for each chromosome, which is great. However, for that, do I need to know the subclonal composition of my samples beforehand? I actually have multiple samples, so I was hoping to set maximum copy numbers per chromosome per sample.

clones [file]

Dose anyone know where clones [file] come from?

Provide a general makefile

The provided makefiles have hardcoded paths to library files, and specify compilers that may not be available on every system. A makefile that doesnt hardcode such paths, and allows users to specify such options using environment variables would be helpful.

filterHD problem

Great job.
But Sth wrong with filterHD. When I try to use filterHD to get the bias estimation, it says
Initial range is 1.010e+00 < x < 3.132e+03
eval jump sigma -llh
ERROR
Aborted (core dumped)
And no more information about this error. Then I think maybe sth wrong with my input. So I tested it with head -n. The funny thing is that, the program will prompt error at some line. But those lines looks good great than 0. Even if I removed all the 0s or change 0 to 1. The program still crashed. But when I split the segments into tiny segments and do the analysis, it sometimes works. So I really don't what's the problem.

Regards,

John

Release builds for v1.17.8 both OSX

Neither of the provided builds work on linux, both work on osx.

Location of test data?

run-example.sh uses a test data set which seems like it should be included in the repo, but isn't there. Any ideas?

cloneHD error

Trying to run cloneHD using my files generated from filterHD step. Running into a error without much explanation:

./build/cloneHD --cna tumor.exome.cloneHD.noY --chr normal.exome.cloneHD.noY --baf tumor.baf --pre output/tumor --bias normal.cna.posterior-1.txt --seed 123 --trials 2 --nmax 3 --force --max-tcn 4 --cna-jumps tumor.baf.jumps.txt --min-jump 0.01 --restarts 10 --mass-gauging 1

cloneHD: probabilistic inference of sub-clonality using...

CNA data in tumor.exome.cloneHD.noY: 193717 sites in 23 chr across 1 samples
BAF data in tumor.baf: 31724 sites in 23 chr across 1 samples

Aborted (core dumped)

Where to go from here?

Docs unclear wrt CNA file spec

It is unclear from the description of CNA files whether the input read counts are of those overlapping the given position, overlapping the segment ending at the given position, or contained within the segment ending at the given position.

simulation data from article

Does anyone know where is the simulation data in the article not the test simulation?

how to run CloneHD for WES data

Hi,

I have WES matched tumor-normal data and I am looking to run CloneHD. I read that this tool will also work for WES data and wondering where I can get more information on this?

Thanks

Segmentation Fault

Running the example test set I run into a seg fault:

./build/filterHD --data ./test/data//normal.cna.txt --mode 3 --pre ./test/results//normal.cna

filterHD: Fitting jump-diffusion model to data at 20001 loci in 1 segment(s) in 1 sample(s) with a poisson emission model:

Filtering sample 1 of 1:
Initial range is 3.801e+01 < x < 8.500e+01
eval jump sigma rnd -llh
10 1.71212e-06 7.74107e-04 3.80764e-05 6.9367034070e+04
20 3.49225e-08 7.03853e-04 1.87274e-04 6.9362282092e+04
30 1.98473e-09 7.61654e-04 3.09235e-04 6.9361873946e+04
40 1.07903e-09 7.62688e-04 7.27041e-05 6.9361621633e+04
Adapted range to 5.493e+01 < x < 6.526e+01
10 1.32967e-09 7.25762e-04 2.95731e-05 6.9359987522e+04
20 1.05477e-09 7.33552e-04 1.30669e-05 6.9359948984e+04
30 1.03791e-09 7.37987e-04 9.48939e-07 6.9359941037e+04
Segmentation fault (core dumped)

provide description of baf.subclone-1.txt file

A description of the baf.subclone-1.txt files would be helpful. I assume they are similar to cna.subclone-2.txt, with the caveat that it is showing the uncertainty in which allele has x copies as a 50/50 weighting on the two copies.

#chr first-locus nloci last-locus 0 1 2 3 4
20  61098  56814 62962891 0.500 0.500 0.000 0.000 0.000
21 9427957  29809 48102613 0.500 0.500 0.000 0.000 0.000
22 16052239  31219 51229855 0.500 0.500 0.000 0.000 0.000
1  52238  13224 15436352 0.500 0.000 0.500 0.000 0.000
1 15454068     19 15463776 0.500 0.000 0.500 0.000 0.000
1 15467250     50 15506001 0.500 0.000 0.500 0.000 0.000
1 15512968     30 15536491 0.500 0.000 0.500 0.000 0.000
1 15539582     56 15560827 0.500 0.000 0.500 0.000 0.000
1 15562922      3 15564121 0.500 0.000 0.500 0.000 0.000

Normal copy number error

Trying to do a basic run using run_example.sh as a template on whole exome data. Got as far as running cloneHD but was met with this error:

"Normal copy number for some BAF data chromosomes is not 2."

Code inspection shows some sort of calculation for normal copy number from baf data? Not really sure how to proceed from here.

Bug with different number of chromosomes in SNV, mean_tcn and avail_cn files

There is a bug when running cloneHD in SNV mode with --snv ${snv}, --avail-cn ${avail_cn} and --mean-tcn ${mean_tcn}. If the number of chromosomes in ${avail_cn} is greater than the number of chromosomes in ${snv}, cloneHD exits with a segmentation fault. @vmustonen has tested this for get_avail_cn in cloneHD-functions.cpp, but it may also affect get_mean_tcn.

The problem only seems to happen if the last chr in ${avail_cn} does not have any SNV hits. One can skip intermediate chrs without a problem so this seems to be a boundary case that has not occurred. This corner case is very rare with whole-genome data, but it is likely to cause problem for exome data.

We should build a safe guard, such that the chr count in ${avail_cn} should only run to max(SNVs chr). A solution would be to set max(SNVs chr)== max(avail-cn chr).

simulated data availability for testing

The readme refers to running a test on simulated data but this data does not seem to be available.

bug 001

seg fault with very sparse data (one data point in a chr)

bug with --jumps [file] when there are no jumps

This bug appears when there are chr with no jumps. The bug is caused in Emission:coarse_grain_jumps().

cnv normalization error

Regards,

John

FilterHD error

filterHD is returning an error when I attempt to run the bias field analysis. I'm using the run-example.sh on TCGA derived data. Below the terminal output and code segment is pasted:
Code:
`

The tumor read depth is now analysed with the bias field from the matched normal. The diffusion constant is set to zero. If left free, it should converge to a very small value. The jump rate could be slightly higher. The LLH should be higher than for the run above indicating the presence of the bias field. Now we are interested in the jumps.

cmd="$filterHD --data $tumorCNA --mode 3 --pre ${results}/tumor.cna.bias --bias $bias --jumps 1"
echo $cmd
$cmd
echo

`
Terminal:
filterHD: Fitting jump-diffusion model to data at 129 loci in 23 segment(s) in 1 sample(s) with a poisson emission model:
ERROR 1 in get_bias()
24 2792295 1.70e+02 9.81e+01 0.00e+00

./build/filterHD --data ./cloneHDdata/tumor.baf.txt --mode 1 --pre ./cloneHDresults/tumor.baf --jumps 1 --reflect 1 --dist 1

Questions about the --cna input files

Why does depth files of the test data in the run-example  is not following the cloneHD command line arguments  with read depth per 1 kb?

how to create the baf input file

i noticed that the fourth columns of the baf input file are the same number in the right position of third columns of the tumor.cna input file. Could anybody tell me how to create the baf input file? Please, i have got a lot of pressure from my boss on this issue.