etal / cnvkit Goto Github PK
View Code? Open in Web Editor NEWCopy number variant detection from targeted DNA sequencing
Home Page: http://cnvkit.readthedocs.org
License: Other
Copy number variant detection from targeted DNA sequencing
Home Page: http://cnvkit.readthedocs.org
License: Other
In the new VCF export code, you were writing:
elif ncopies < ploidy:
svtype = "DEL"
svlen *= -1
formats = "GT:GQ"
genotype = "0/1:%d" % row["probes"]
Does that imply that deletions reported by cnvkit
will always be called as heterozygous, i.e. no homozygous-alt deletions (1/1
) can be identified and/or reported?
Hi Eric,
It appears to me that call -m threshold
always treats sex chromosomes as though they were diploid. Here's a small use case which results took me completely by surprise. I had 15 samples, all male; I built a reference and run all samples against it using these standard commands using CNVKit 0.7.0:
./cnvkit.py reference --male-reference -o reference.cnn -f hg19.fa samples/*.cnn
./cnvkit.py fix sample1.targetcoverage.cnn sample1.antitargetcoverage.cnn reference.cnn \
-o sample1.cnr
./cnvkit.py segment sample1.cnr -o sample1.cns
./cnvkit.py call --male-reference -m threshold -t=-1.1,-0.4,0.3,0.7 -o sample1.call.cns \
sample1.cns
./cnvkit.py export bed --male-reference sample1.call.cns -o sample1.bed
./cnvkit.py export bed --male-reference sample1.call.cns --show-neutral -o sample1.all.bed
For all of the samples I noticed an apparent miscalculation of copy numbers in sex chromosomes. The table below reflects the average log2 values for reference (column 2) and across all of the samples (columns 3-5), absolute copy numbers (column 6), as well as whether those copy numbers are considered neutral by export bed
command (column 7). Since all numbers in columns 2-4 were very close to integers, I rounded them for the sake of readability:
chr | reference.cnn | sample.cnn | cnr, cns | call.cns | abs CN | CN neutral? |
---|---|---|---|---|---|---|
autosomes | 0 | 0 | 0 | 0 | 2 | yes |
chrX | −1 | −1 | 0 | +1 | 2 | yes |
chrY | −1 | −1 | 0 | +1 | 2 | yes |
Note that chrX and chrY went from 0 in sample.cns
to +1 in sample.call.cns
, which was then reinterpreted as 2 absolute copies by export bed
command (and, interestingly enough, still considered neutral, although for sex chromosomes that would be an alteration). After inspecting the code, I discovered that this happens in cnvlib.call.absolute_threshold()
: a cycle assigns absolute copy numbers without looking at whether the segment in question comes from autosome or sex chromosome. In this case, it assigns both chrX and chrY an absolute copy number of 2.
I realize that this method is supposed to be really simple, since it only applies fixed thresholds, but intuitively I still expected it to somehow handle autosomes and allosomes differently. If this is indeed by design, and not a bug, then I believe it would be helpful to describe this detail of call -m threshold
in the documentation.
Also, a quick related question: since the threshold method is not usable for sex chromosomes, it is OK if I just always use call -m clonal -purity 1
for germline samples? It seems to handle this situation well so far.
Cheers
Kirill
In order for
conda install cnvkit
to work, one need to need to add the channel where cnvkit is hosted, i.e.
conda config --add channels https://conda.binstar.org/bcbio
I recommend adding this to README.rst
, because (at least to me) it took a fair bit of trial and error to get it working.
(Also, it seems like Anaconda installs python as well, so it's not clear to me whether you have to have Python 2.7 installed prior to installing Anaconda + cnvkit.)
Extract heterozygous allele frequencies from the tumor sample VCF, and segment those values to generate a .cns file. CBS, haar wavelet and fused lasso are all fine algorithms for this.
In the "export" command, implement converting segmented CNVs to VCF format.
Create CommonWL "tool" interfaces for each CNVkit sub-command. Try to use these to improve integration with Galaxy and bcbio-nextgen.
Maybe wrap "batch" as a workflow instead of a tool -- not sure how this works.
Spec: https://github.com/common-workflow-language/common-workflow-language
Examples: https://github.com/common-workflow-language/workflows
Hi,
I am testing out your program using 1000G version of hg19 and I am receiving the following error for cvnkit 0.6.1.
Number of target and antitarget files: 1, 1
Loading germline.targetcoverage.cnn
/lb/project/mugqic/analyste_dev/software/python/Python-2.7.8/lib/python2.7/site-packages/pandas/io/parsers.py:1170: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
data = self._reader.read(nrows)
Calculating GC and RepeatMasker content in /lb/project/mugqic/analyste_dev/genomes/Homo_sapiens/hg1k_v37/fasta/hg1k_v37.fasta ...
Traceback (most recent call last):
File "/home/reveleig/github/cnvkit/cnvkit.py", line 9, in
args.func(args)
File "/sb/home/reveleig/github/cnvkit/cnvlib/commands.py", line 498, in _cmd_reference
args.male_reference)
File "/sb/home/reveleig/github/cnvkit/cnvlib/commands.py", line 518, in do_reference
male_reference)
File "/sb/home/reveleig/github/cnvkit/cnvlib/reference.py", line 50, in combine_probes
gc, rmask = get_fasta_stats(cnarr1, fa_fname)
File "/sb/home/reveleig/github/cnvkit/cnvlib/reference.py", line 189, in get_fasta_stats
fa_coords)]
File "/sb/home/reveleig/github/cnvkit/cnvlib/ngfrills/faidx.py", line 71, in fasta_extract_regions
+ "file " + str(fa_fname))
ValueError: Sequence ID '1' is not in FASTA file /lb/project/mugqic/analyste_dev/genomes/Homo_sapiens/hg1k_v37/fasta/hg1k_v37.fasta
Thanks for any help you can provide.
Best,
Rob E.
The current tool description is out of date, and also doesn't pull in its dependencies automatically.
line:
Line 125 in 2d43983
samtools bedcov
doesn't return a "name" fields as far as I know. Where does this name field come from?
By default, samtools creates the BAM index file for "sample.bam" as "sample.bam.bai", so that's the file CNVkit looks for when ensuring a BAM file has been indexed.
Apparently, samtools/pysam/htslib will also accept "sample.bai" as a BAM index?
Verify this, and implement it in CNVkit/ngfrills.
Instead of printing all status messages to standard error, configure and use the logging
module. Stratify the status messages into debugging, info, warning, and maybe error levels. Add options to the CLI commands to choose the logging level / verbosity.
Then, run the unit test script with all of these messages turned off.
Just a suggestion, but the scatter command does not seem to be able to resolve gene names, if overlapping genes are given as comma separated lists.
Hi Eric,
PSCBS R package version 0.45.0 was released yesterday (2015-09-11) and apparently it breaks probe segmentation in CNVKit. Perhaps it introduces some backwards incompatible syntax changes, or maybe it's just a regression in PSCBS itself—I unfortunately don't have enough time at the moment to investigate further into this. Judging from the error message, it has something to do with namespace conflicts. I will for now stick with using version 0.44.0, which works fine, but since 0.45.0 is now installed by default, it might be inconvenient and confusing to some users. An error trace from running make
in test
directory is below.
Please let me know if you need more information on this, and thank you for your work on CNVKit!
Yours
Kirill
python ../cnvkit.py segment build/p2-5_5.cnr -o build/p2-5_5.cns
Traceback (most recent call last):
File "../cnvkit.py", line 9, in <module>
args.func(args)
File "/mnt/tmp/cnvkit/cnvlib/commands.py", line 625, in _cmd_segment
args.rlibpath)
File "/mnt/tmp/cnvkit/cnvlib/segmentation/__init__.py", line 45, in do_segmentation
seg_out = ngfrills.call_quiet('Rscript', script_fname)
File "/mnt/tmp/cnvkit/cnvlib/ngfrills/__init__.py", line 41, in call_quiet
% (' '.join(args), err))
RuntimeError: Subprocess command failed:
$ Rscript /tmp/tmp6dlQOh
PSCBS v0.45.0 (2015-09-11) successfully loaded. See ?PSCBS for help.
Attaching package: ‘PSCBS’
The following objects are masked from ‘package:base’:
append, load
Loading probe coverages into a data frame
Pre-processing the probe data for segmentation
Segmenting the probe data
Error: !hasWeights || !is.null(fit$data$w) is not TRUE
Execution halted
Hi there,
Thanks for cnvkit, what a great tool!
I've come across this error:
Relative log2 coverage of X chromosome: -0.820871 (assuming male)
Traceback (most recent call last):
File "/group/ngs/src/bcbio-nextgen/latest-devel/rhel6-x64/anaconda/bin/cnvkit.py", line 9, in <module>
args.func(args)
File "/group/ngs/src/bcbio-nextgen/latest-devel/rhel6-x64/anaconda/lib/python2.7/site-packages/cnvlib/commands.py", line 685, in _cmd_diagram
args.output, args.male_reference)
File "/group/ngs/src/bcbio-nextgen/latest-devel/rhel6-x64/anaconda/lib/python2.7/site-packages/cnvlib/diagram.py", line 48, in create_diagram
probe_rows = probe_rows.squash_genes()
File "/group/ngs/src/bcbio-nextgen/latest-devel/rhel6-x64/anaconda/lib/python2.7/site-packages/cnvlib/cnarray.py", line 532, in squash_genes
outrows.append(squash_rows(name, subarr))
File "/group/ngs/src/bcbio-nextgen/latest-devel/rhel6-x64/anaconda/lib/python2.7/site-packages/cnvlib/cnarray.py", line 513, in squash_rows
chrom = core.check_unique(rows['chromosome'], 'chromosome')
File "/group/ngs/src/bcbio-nextgen/latest-devel/rhel6-x64/anaconda/lib/python2.7/site-packages/cnvlib/core.py", line 154, in check_unique
% (title, ' '.join(map(str, sorted(its)))))
AssertionError: Inconsistent chromosome keys: chr11_gl000202_random chr17_ctg5_hap1 chr17_gl000203_random
I think the error is caused by the gene column containing a dot ( .
) as these contigs have no annotated genes.
cnvkit.py export vcf -h
# Works fine
cnvkit.py export bed -h
# See traceback below
Traceback (most recent call last):
File "~/opt/bin/cnvkit.py", line 10, in <module>
args = commands.parse_args()
File "~/anaconda2/lib/python2.7/site-packages/cnvlib/commands.py", line 1738, in parse_args
return AP.parse_args(args=args)
File "~/anaconda2/lib/python2.7/argparse.py", line 1701, in parse_args
args, argv = self.parse_known_args(args, namespace)
File "~/anaconda2/lib/python2.7/argparse.py", line 1733, in parse_known_args
namespace, args = self._parse_known_args(args, namespace)
File "~/anaconda2/lib/python2.7/argparse.py", line 1921, in _parse_known_args
positionals_end_index = consume_positionals(start_index)
File "~/anaconda2/lib/python2.7/argparse.py", line 1898, in consume_positionals
take_action(action, args)
File "~/anaconda2/lib/python2.7/argparse.py", line 1807, in take_action
action(self, namespace, argument_values, option_string)
File "~/anaconda2/lib/python2.7/argparse.py", line 1096, in __call__
subnamespace, arg_strings = parser.parse_known_args(arg_strings, None)
File "~/anaconda2/lib/python2.7/argparse.py", line 1733, in parse_known_args
namespace, args = self._parse_known_args(args, namespace)
File "~/anaconda2/lib/python2.7/argparse.py", line 1921, in _parse_known_args
positionals_end_index = consume_positionals(start_index)
File "~/anaconda2/lib/python2.7/argparse.py", line 1898, in consume_positionals
take_action(action, args)
File "~/anaconda2/lib/python2.7/argparse.py", line 1807, in take_action
action(self, namespace, argument_values, option_string)
File "~/anaconda2/lib/python2.7/argparse.py", line 1096, in __call__
subnamespace, arg_strings = parser.parse_known_args(arg_strings, None)
File "~/anaconda2/lib/python2.7/argparse.py", line 1733, in parse_known_args
namespace, args = self._parse_known_args(args, namespace)
File "~/anaconda2/lib/python2.7/argparse.py", line 1939, in _parse_known_args
start_index = consume_optional(start_index)
File "~/anaconda2/lib/python2.7/argparse.py", line 1879, in consume_optional
take_action(action, args, option_string)
File "~/anaconda2/lib/python2.7/argparse.py", line 1807, in take_action
action(self, namespace, argument_values, option_string)
File "~/anaconda2/lib/python2.7/argparse.py", line 996, in __call__
parser.print_help()
File "~/anaconda2/lib/python2.7/argparse.py", line 2340, in print_help
self._print_message(self.format_help(), file)
File "~/anaconda2/lib/python2.7/argparse.py", line 2314, in format_help
return formatter.format_help()
File "~/anaconda2/lib/python2.7/argparse.py", line 281, in format_help
help = self._root_section.format_help()
File "~/anaconda2/lib/python2.7/argparse.py", line 211, in format_help
func(*args)
File "~/anaconda2/lib/python2.7/argparse.py", line 211, in format_help
func(*args)
File "~/anaconda2/lib/python2.7/argparse.py", line 517, in _format_action
help_text = self._expand_help(action)
File "~/anaconda2/lib/python2.7/argparse.py", line 603, in _expand_help
return self._get_help_string(action) % params
TypeError: %d format: a number is required, not str
BICseq appears to perform well in benchmarks on WGS data.
The reference implementation is in matlab, but the paper describes how it works reasonably clearly. Just reimplement it.
The algorithm assumes a paired normal is available, but in CNVkit we'll use the reference-based bin weights instead to estimate reasonable normal-sample counts.
I am trying to study the tumor subclonal populations using the algorithm, phyloWGS, which takes somatic mutation variant allele frequency (ssm_file) and copy number variation (cnv_file) as inputs (https://github.com/morrislab/phylowgs). I was wondering if you can export the results of CNVkit to the input of phyloWGS (cnv file).
It needs the following information.
I understand that somatic mutation information will be analyzed separately and provided.
Thank you,
Jungmin
The segmentation algorithms that use R packages currently run Rscript in a subprocess using a filled-in R script template as input.
segmentation/__init__.py
cbs
, flasso
and haar
methods but can't be consolidated.I'm not sure how rpy2 would complicate CNVkit installation on OS X, or using a nonstandard R library path.
I just ran cnvkit.py batch
outputting to a subdirectory, but noticed that all files but the "copy number reference profile *.cnn file was still outputted to the working directory. I can see how this design/behavior makes sense. However, may I suggest that it's documented, e.g. "(Ignores -o/--output-dir option)":
cnvkit.py batch --help
...
--output-reference OUTPUT_REFERENCE
Output filename for the new reference file being
created. (Ignores -o/--output-dir option)
...
The README.rst
says:
If this pipeline completes successfully (it should take less than a minute), you've installed CNVkit correctly.
Is this an outdated benchmark, because on a decent Linux machine, this took ~10 minutes for me.
Hey,
Is there a way to get inline plots(the log2 coverage ones) in ipython notebook ? I tried %matplotlib inline, but it doesnt seem to solve it.
Thank You
The heatmap plot needs a legend. This should be a color gradient, red-to-white-to-blue, with tick marks at +1/0/-1, in a small vertically-oriented box on the right side of the plot.
In a transposed heatmap (#35), the box should be at the bottom oriented horizontally, with red on the left.
Consider recreating the same in Reportlab for the 'diagram' plot.
CODEX is a copy number developed specifically for detecting copy number from exome sequencing, and includes a segmentation method that explicitly models read depth (unlike the other segmentation methods currently available in CNVkit). See the paper and Bioconductor package.
Just wrap the R package.
See this SeqAnswers post:
http://seqanswers.com/forums/showthread.php?p=168492#post168492
The batch
command runs the pipeline by either building a new reference or using an existing one, and some of the options for building a new reference do not make sense in the latter case. Using those options should consistently raise an error that explains the conflict.
Hi Etal,
Thank you for developing CNVkit. It's brilliant!
I performed CNVkit on a small gene panel sequencing data, the result is very good.
However, I noticed that there are two chromosomes missing in the plot. These two chromosomes are missing in target or antitarget bed files too. Then I found the reason is I don't have any target designed for sequencing on these two chromosomes.
Since off-target reads should be all over the genome, I am wondering if this is a bug?
Looking forward to your reply.
Thanks ~
Hannah
Create a Docker container for CNVkit.
It should probably be based on one of the Bioconductor containers to provide the dependencies for CBS: http://bioconductor.org/help/docker/
Hi @etal, wanted to pick your brain on how to implement some kind of "confidence intervals" around the copy number estimates. Ideally there could be an interval around the point estimate that would reflect depth of sequencing and smoothness of the copy number signal around the interval (could be e.g. MAD http://en.wikipedia.org/wiki/Median_absolute_deviation with a provision for depth).
For example, a log2 copy number of 1 surely should be more meaningful in 20x coverage data compared to 1x coverage data, right?
I'm more than happy to implement something and help but wanted to hear your thoughts first.
Revisit CNVkit's reimplementation of the HaarSeg algorithm. It's not quite right.
Hi,
In short: could you check please, that following the conda instructions for the bcbio channel actually give the latest version? Or add some extra information on how to choose which version it will install?
Details: I'm afraid I'm submitting this issue for one of my users, I'm not experienced with conda
and I don't know exactly what happened, but...
Judging from his .bash_history
, he ran the advertised
conda config --add channels https://conda.binstar.org/bcbio
conda install cnvkit
and he got cnvkit-0.2.5
. Then he ran into various problems which have long been fixed. I solved it by removing the installed copy, downloading the 0.7.3 zipfile and building that with setup.py
.
I tried to ask conda what it would install, and (after I un-mangled the pasted terminal transcript) this is what it said
(cnvkit_env)mibxxxxxxi:1112_CNVkit hl11$ conda search cnvkit
Fetching package metadata: ......
cnvkit . 0.2.5 np19py27_0 https://conda.binstar.org/bcbio/osx-64/
0.3.1a np18py27_0 https://conda.binstar.org/bcbio/osx-64/
0.3.2 np18py27_0 https://conda.binstar.org/bcbio/osx-64/
0.3.2 np19py27_1 https://conda.binstar.org/bcbio/osx-64/
0.3.5 np19py27_0 https://conda.binstar.org/bcbio/osx-64/
0.3.5 np19py27_1 https://conda.binstar.org/bcbio/osx-64/
0.3.5 np19py27_2 https://conda.binstar.org/bcbio/osx-64/
0.3.5 np19py27_3 https://conda.binstar.org/bcbio/osx-64/
0.3.5 np19py27_4 https://conda.binstar.org/bcbio/osx-64/
0.4.0 np19py27_0 https://conda.binstar.org/bcbio/osx-64/
0.4.0 np19py27_1 https://conda.binstar.org/bcbio/osx-64/
0.4.1 np19py27_0 https://conda.binstar.org/bcbio/osx-64/
0.4.1 np19py27_1 https://conda.binstar.org/bcbio/osx-64/
0.4.1 np19py27_2 https://conda.binstar.org/bcbio/osx-64/
0.4.1 np19py27_3 https://conda.binstar.org/bcbio/osx-64/
0.4.1 np19py27_4 https://conda.binstar.org/bcbio/osx-64/
0.5.0 np19py27_0 https://conda.binstar.org/bcbio/osx-64/
0.5.1 np19py27_0 https://conda.binstar.org/bcbio/osx-64/
0.6.1 np19py27_0 https://conda.binstar.org/bcbio/osx-64/
0.6.2a np19py27_0 https://conda.binstar.org/bcbio/osx-64/
0.6.2a np19py27_1 https://conda.binstar.org/bcbio/osx-64/
0.6.2a np19py27_2 https://conda.binstar.org/bcbio/osx-64/
0.7.0.dev0 np19py27_0 https://conda.binstar.org/bcbio/osx-64/
0.7.0 np19py27_2 https://conda.binstar.org/bcbio/osx-64/
0.7.1.dev0 np19py27_0 https://conda.binstar.org/bcbio/osx-64/
0.7.1 np19py27_0 https://conda.binstar.org/bcbio/osx-64/
0.7.2 py27_0 https://conda.binstar.org/bcbio/osx-64/
0.7.3dev0 py27_0 https://conda.binstar.org/bcbio/osx-64/
Thanks,
Matthew (for Henry )
The call -m clonal
command can rescale log2 ratios to compensate for normal-cell contamination in a sample. That can be useful outside of calling integer copy numbers, so let's pull it out into a separate rescale
command with similar options.
This feature may also be useful for analysis of tumor subclones/heterogeneity.
Keep the rescaling functionality in call -m clonal
as well for now.
The LOH command is not working for me, it doesn't seem to like my VarScan VCF file. Any ideas?
Thanks
cnvkit.py scatter Sample.cnr -s Sample.cns -v VarScan.snp.Germline.vcf
Traceback (most recent call last):
File "/usr/local/bin/cnvkit.py", line 4, in <module>
__import__('pkg_resources').run_script('CNVkit==0.3.5-dev', 'cnvkit.py')
File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 534, in run_script
if req in processed:
File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 1445, in run_script
File "/usr/local/lib/python2.7/dist-packages/CNVkit-0.3.5_dev-py2.7.egg/EGG-INFO/scripts/cnvkit.py", line 9, in <module>
File "build/bdist.linux-x86_64/egg/cnvlib/commands.py", line 714, in _cmd_scatter
File "build/bdist.linux-x86_64/egg/cnvlib/commands.py", line 734, in do_scatter
File "build/bdist.linux-x86_64/egg/cnvlib/ngfrills/vcf.py", line 13, in load_vcf
# 'format': list of keys (strings)
File "build/bdist.linux-x86_64/egg/cnvlib/ngfrills/vcf.py", line 33, in filter_vcf_lines
# When instantiated, a reference can be passed to the VCF class. This may be any class that supports a
KeyError: 'AF'
For tumor-only VCFs, or paired T/N VCFs with lingering noise, it would be nice to take another VCF of known SNPs from 1000 Genomes, dbSNP 129, ESP 6500, etc. and further filter the variants in the given sample.
See also: #7
The loh
plot used to try to identify whole chromosomes with loss of heterozygosity (LOH) using an ill-conceived Mann-Whitney test. This would fail if multiple chromosomes had LOH (skewing the background distribution), or if the LOH region was much smaller than a chromosome (i.e. covering too few SNPs).
This issue has two parts:
scatter
command, group variants in the given VCF by segment. (Mostly done.)See also: #34
Treat the x-axis units as bin indices instead of genomic coordinates or basepairs, so all bins are shown with equal size. Most useful/relevant for heatmap
, but also consider scatter
and maybe diagram
.
I've installed cnvkit using pip, at first run it raises a SyntaxError:
$ cnvkit.py -h
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.4/bin/cnvkit.py", line 5, in <module>
from cnvlib import commands
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/cnvlib/__init__.py", line 1, in <module>
from .cnarray import CopyNumArray as _CNA
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/cnvlib/cnarray.py", line 11, in <module>
from . import core, metrics, ngfrills, smoothing
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/cnvlib/core.py", line 9, in <module>
from .ngfrills import echo, safe_write
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/cnvlib/ngfrills.py", line 407
except OSError, exc:
^
SyntaxError: invalid syntax
After correction of entries in python3 syntax (i.e. OSError, exc
becomes OSError as exc
) I have
$ cnvkit.py
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.4/bin/cnvkit.py", line 9, in <module>
args.func(args)
AttributeError: 'Namespace' object has no attribute 'func'
The functions to plot loss-of-heterozygosity (LOH) take allele frequencies from a given VCF file. LOH is only visible in germline SNPs that are also present in the tumor sample. Currently, all SNVs in the VCF are plotted, so this works best if the user had the foresight to separately filter the tumor variants for germline SNPs first.
VCFs from MuTect, FreeBayes and other variant callers will list the SNPs/SNVs in a matched tumor-normal pair together; this makes it feasible for CNVkit to parse this VCF and extract only the allele frequencies of SNPs for plotting.
The tumor sample name can be specified with -i
now. Next:
-n
to specify the normal sample name in the same VCF.-n
is unspecified and there are 2 or more samples, look for "NORMAL" (case-insensitive); if -i
is unspecified, look for "TUMOR".The coverage
command currently median-centers and log2-transforms the read depths in each genomic bin before writing the *.targetcoverage.cnn
and *.antitargetcoverage.cnn
files. This makes the reference
command a little more memory-efficient because the binned values across different samples are already comparable and can be processed line-by-line.
But downstream algorithms may want the original read depths to estimate reliability, etc. (see #28). Let's make reference
work a little harder and do not normalize/median-center the coverage
read depths before writing the .cnn files.
Continue to log2-transform the values, so that all CNVkit files can still be parsed the same way. To prevent a math domain error, use a tiny preconfigured value like -20 (params.NULL_LOG2_COVERAGE
) to represent 0 reads in a bin.
Hi Eric,
I'm running a whole genome analysis without normals and am getting the following error:
Splitting: . 26381
Wrote ./2_2015-04-02_bcbio_cnvkit-sort-callable-callableblocks-merged-annotated.target.bed with 4410794 target intervals
Detected file format: BED
Detected file format: BED
Detected file format: BED
Wrote ./2_2015-04-02_bcbio_cnvkit-sort-callable-callableblocks-merged-annotated.antitarget.bed with 1 background intervals
Building a flat reference...
Detected file format: BED
Traceback (most recent call last):
File "/group/ngs/src/bcbio-nextgen/0.8.8/rhel6-x64/anaconda/bin/cnvkit.py", line 9, in <module>
args.func(args)
File "/group/ngs/src/bcbio-nextgen/0.8.8/rhel6-x64/anaconda/lib/python2.7/site-packages/cnvlib/commands.py", line 80, in _cmd_batch
args.processes, args.count_reads)
File "/group/ngs/src/bcbio-nextgen/0.8.8/rhel6-x64/anaconda/lib/python2.7/site-packages/cnvlib/commands.py", line 148, in batch_make_reference
male_reference)
File "/group/ngs/src/bcbio-nextgen/0.8.8/rhel6-x64/anaconda/lib/python2.7/site-packages/cnvlib/commands.py", line 509, in do_reference_flat
ref_probes.extend(reference.bed2probes(antitarget_list))
File "/group/ngs/src/bcbio-nextgen/0.8.8/rhel6-x64/anaconda/lib/python2.7/site-packages/cnvlib/reference.py", line 15, in bed2probes
for chrom, start, end, name in ngfrills.parse_regions(bed_fname)]
File "/group/ngs/src/bcbio-nextgen/0.8.8/rhel6-x64/anaconda/lib/python2.7/site-packages/cnvlib/ngfrills/regions.py", line 62, in parse_regions
}[fmt]
KeyError: None
The command is roughly:
cnvkit.py batch 2_2015-04-02_bcbio_cnvkit-sort.bam -n -f /ngs/reference_data/genomes/Hsapiens/hg19/seq/hg19.fa --targets 2_2015-04-02_bcbio_cnvkit-sort-callable-callableblocks-merged-annotated.bed --access 2_2015-04-02_bcbio_cnvkit-sort-callable-callableblocks-merged-annotated.bed -d ./ --split -p 1 --output-reference ./flat_background.cnn --antitarget-avg-size 1000 --antitarget-min-size 75 --target-avg-size 1000 --rlibpath /group/ngs/src/bcbio-nextgen/0.8.8/rhel6-x64/lib/R/site-library
The antitarget bed file that gets produced is empty (as expected). The target bed file has 4 columns.
Any ideas?
Reported here:
https://www.biostars.org/p/160256/
The R-based segmenters add gene names from the input .cnr as a post-processing step. The 'haar' method should do the same thing.
The current "horizontal" layout shows samples in rows, chromosomes in columns. A transposed or "vertical" layout would show samples in columns, chromosomes in rows. (Sometimes this makes a better figure.)
Consider the flags "-t/--transpose" or "-v/--vertical".
Hi,
I don't fully understand why I obtained a pattern of integers (x
) and half-integers (x + 0.5
) when using call
function. I expected to get only integers.
My test:
Assuming 100% tumor purity across my samples, I calculated the -threshold as follow (R code
):
> c("No.Copies"=log2(0.1/2),"one.Copy"=log2(1/2),
"neutral.Diploid"=log2(2/2),"three.copies"=log2(3/2),
"four.copies"=log2(4/2))
# No.Copies one.Copy neutral.Diploid three.copies four.copies
# -4.3219281 -1.0000000 0.0000000 0.5849625 1.0000000
So I run the cnvkit.py call
function using the following parameters:
$cnvkit.py call ${sampleName}".cns" -m threshold -t=-4.3,-1,0,0.6,1 -o ${sampleName}".absoluteCN.cns"
Treating sample gender as female
Calling copy number with thresholds: -4.3 => 0, -1 => 1, 0 => 2, 0.6 => 3, 1 => 4
Please, notice the output said: Calling copy number with thresholds: -4.3 => 0, -1 => 1, 0 => 2, 0.6 => 3, 1 => 4
, where I guess the first number of the pair is the log2 and the second one is the absolute copy number.
Then, if we retrieve all unique estimated copy-numbers from the output ${sampleName}".absoluteCN.cns"
file, and revert them from the log2 transformation (using bash
):
$tail -n +2 ${sampleName}".absoluteCN.cns" | cut -f 5 | sort -n -u | awk '{print "log2(CN)="$1"; then CN="2**$1;}'
log2(CN)=-10.9658; then CN=0.000499995
log2(CN)=-1; then CN=0.5
log2(CN)=0; then CN=1
log2(CN)=0.584963; then CN=1.5
log2(CN)=1; then CN=2
log2(CN)=1.32193; then CN=2.5
log2(CN)=1.58496; then CN=2.99999
log2(CN)=2; then CN=4
log2(CN)=2.16993; then CN=4.50002
log2(CN)=3.39232; then CN=10.5
I wonder why there are half-integers (i.e. 0.5, 1.5, 2.5, 4.5 and 10.5) in the output. Are these copy-numbers defined in the context of a haploid genome or in the whole genome (diploid)?
Thanks in advance,
Javier
My version:
$cnvkit.py version
0.6.1-dev
So the documentation on batch (http://cnvkit.readthedocs.org/en/latest/pipeline.html) suggests that batch runs cnvkit targets
as one of its steps. However, as far as I can tell, targets
takes a BED as input and outputs a new BED---while batch
does not output a new BED file for the targets. This has implications when you pass the --annotate
flag, as no annotation is actually done via batch
as far as I can tell.
Let me know if I'm mis-reading something here.
See the THetA2 v.0.7 release notes.
In the segmentation output file (.cns), each segment could potentially cover zero or many genes, but for simplicity CNVkit has only written "G" or "L" in that column (for gain/loss).
That's too simple. The genes in each segment region should be extracted from the corresponding .cnr file and listed, comma-separated, in the "gene" column of the .cns file.
These strings are then available for export (to BED, etc.) without pulling in the .cnr file again.
The segmetrics
output currently uses the "gene" column to print segment-level stats, but these stats can instead be written as additional columns with the stat name as the column header.
The batch
command when run on multiple CPU cores in parallel (-p 0
or -p
>1) fails to create all of the expected output files, but doesn't indicate why.
This could be more than one underlying issue:
Workaround: Run the segment
command on the .cnr file to see the error that caused segmentation and therefore batch
to originally fail/abort. Fix the error if you can, or report the issue here and wait for a timely fix. Then if segment
can be run successfully, parallel batch
probably can, too. (Or just finish the job manually by running segment
on each sample separately.)
-p
> 1, the batch
pipeline produces binned coverages and a reference (.cnn files), but fails to create a .cnr file and/or .cns file.Workaround: Run the same command with -p 1
instead, and the pipeline will successfully run on a single core.
Diagnosis: There's an issue either in CNVkit's use of the multiprocessing library, or the multiprocessing library in Python 2.7 itself has an issue synchronizing on some platforms. It works on Ubuntu 14.04 and later; I'm interested to hear where it doesn't work.
Hi Eric,
I was trying to rerun a past analysis with a newer version of cnvkit and am running into a few problems:
subprocess.CalledProcessError: Command '/apps/bcbio-nextgen/latest-devel/rhel6-x64/anaconda/bin/cnvkit.py reference -f /ngs/reference_data/genomes/Hsapiens/hg19/seq/hg19.fa -o /ngs/oncology/analysis/translation/TS_0028_CRUK_StratMed_Phase2_matrix_validation/Birmingham_test/Birmingham/work/structural/HDC140P/cnvkit/raw/tx/tmpSxtH5v/flat_background.cnn -t /ngs/oncology/analysis/translation/TS_0028_CRUK_StratMed_Phase2_matrix_validation/Birmingham_test/Birmingham/work/structural/HDC140P/cnvkit/raw/NexteraRapidCapture-CRUK_SMP2-1-2-merged-annotated.target.bed -a /ngs/oncology/analysis/translation/TS_0028_CRUK_StratMed_Phase2_matrix_validation/Birmingham_test/Birmingham/work/structural/HDC140P/cnvkit/raw/NexteraRapidCapture-CRUK_SMP2-1-2-merged-annotated.antitarget.bed
Detected file format: BED
Traceback (most recent call last):
File "/apps/bcbio-nextgen/latest-devel/rhel6-x64/anaconda/bin/cnvkit.py", line 9, in <module>
args.func(args)
File "/apps/bcbio-nextgen/latest-devel/rhel6-x64/anaconda/lib/python2.7/site-packages/cnvlib/commands.py", line 463, in _cmd_reference
args.fasta, args.male_reference)
File "/apps/bcbio-nextgen/latest-devel/rhel6-x64/anaconda/lib/python2.7/site-packages/cnvlib/commands.py", line 515, in do_reference_flat
ref_probes.concat(reference.bed2probes(antitarget_list))
File "/apps/bcbio-nextgen/latest-devel/rhel6-x64/anaconda/lib/python2.7/site-packages/cnvlib/reference.py", line 15, in bed2probes
regions = RA.read(bed_fname)
File "/apps/bcbio-nextgen/latest-devel/rhel6-x64/anaconda/lib/python2.7/site-packages/cnvlib/rary.py", line 42, in read
return cls([])
File "/apps/bcbio-nextgen/latest-devel/rhel6-x64/anaconda/lib/python2.7/site-packages/cnvlib/rary.py", line 19, in __init__
gary.GenomicArray.__init__(self, data_table, meta_dict)
File "/apps/bcbio-nextgen/latest-devel/rhel6-x64/anaconda/lib/python2.7/site-packages/cnvlib/gary.py", line 22, in __init__
if not all(c in data_table.columns for c in self._required_columns):
File "/apps/bcbio-nextgen/latest-devel/rhel6-x64/anaconda/lib/python2.7/site-packages/cnvlib/gary.py", line 22, in <genexpr>
if not all(c in data_table.columns for c in self._required_columns):
AttributeError: 'list' object has no attribute 'columns'
' returned non-zero exit status 1
Any idea where this might be coming from? The antitarget file is empty as this is data from a targeted panel
Hi again,
If a BAM file is present in both "Normal" and "Tumor" collections supplied to cnvkit batch
command, bin coverages for it (sample.[anti]targetcoverage.cnn
) will be calculated twice, slowing down the entire process twofold.
It is true that "Normal" and "Tumor" sets normally do not overlap at all. It is often useful, however, to run all present (presumably normal) samples against themselves, for example to construct a reference for later use and to identify noisy / otherwise unsuitable samples in the process.
Yours
Kirill
You can install both Bioconductor and CRAN package using biocLite()
, so it's enough to do:
source("http://bioconductor.org/biocLite.R")
biocLite(c("PSCBS", "matrixStats", "cghFLasso"))
Further, since PSCBS imports matrixStats, the latter is automatically installed, so it's even enough to do:
source("http://bioconductor.org/biocLite.R")
biocLite(c("PSCBS", "cghFLasso"))
Hi,
I was using an old version of CNVkit (March 2015) very well. I just updated to the last version (August 2015), so I installed Pandas (new dependency). However when I tried out the test, I got the following errors:
python ../cnvkit.py scatter -s build/p2-9_2.cns build/p2-9_2.cnr -c 'chr9:150000-45000000' -o p2-9_2-chr9p-scatter.pdf
Traceback (most recent call last):
File "../cnvkit.py", line 9, in <module>
args.func(args)
File "/local/jperales/TMP/cnvkit/cnvlib/commands.py", line 800, in _cmd_scatter
args.y_max)
File "/local/jperales/TMP/cnvkit/cnvlib/commands.py", line 867, in do_scatter
start, end)[chrom]
File "/local/jperales/TMP/cnvkit/cnvlib/plots.py", line 437, in gene_coords_by_range
for row in probes.in_range(chrom, start, end):
File "/local/jperales/TMP/cnvkit/cnvlib/gary.py", line 248, in in_range
table = table[table.start.searchsorted(start):]
File "/usr/lib/python2.7/dist-packages/pandas/core/generic.py", line 1815, in __getattr__
(type(self).__name__, name))
AttributeError: 'Series' object has no attribute 'searchsorted'
[...]
python ../cnvkit.py scatter -s build/p2-9_2.cns build/p2-9_2.cnr -g SMARCA2,PTPRD -w 4e6 -o p2-9_2-SMARCA2-PTPRD-scatter.pdf
Traceback (most recent call last):
File "../cnvkit.py", line 9, in <module>
args.func(args)
File "/local/jperales/TMP/cnvkit/cnvlib/commands.py", line 800, in _cmd_scatter
args.y_max)
File "/local/jperales/TMP/cnvkit/cnvlib/commands.py", line 891, in do_scatter
sel_probes = pset_cvg.in_range(chrom, *window_coords)
File "/local/jperales/TMP/cnvkit/cnvlib/gary.py", line 255, in in_range
table = table[:table.end.searchsorted(end, 'right')]
File "/usr/lib/python2.7/dist-packages/pandas/core/generic.py", line 1815, in __getattr__
(type(self).__name__, name))
AttributeError: 'Series' object has no attribute 'searchsorted'
make: *** [p2-9_2-SMARCA2-PTPRD-scatter.pdf] Error 1
Here are my module versions (taken from $lib.__version__
):
Module | Version |
---|---|
Python | 2.7.6 |
Biopython | 1.65 |
Reportlab | (unknown) |
matplotlib | 1.3.1 |
Numpy | 1.8.2 |
SciPy | 0.13.3 |
Pandas | 0.13.1 |
pysam | 0.8.2.1 |
PyVCF | (unknown) |
uname -srvmo
Linux 3.13.0-49-generic #83-Ubuntu SMP Fri Apr 10 20:11:33 UTC 2015 x86_64 GNU/Linux
Thanks in advance,
Javier Perales
Is it possible to get the version of CNVkit installed? I think it would be useful for reproducibility and giving feedback. I've tried:
$ cnvkit.py --version
usage: cnvkit.py [-h]
{batch,target,antitarget,coverage,reference,fix,segment,cbs,diagram,scatter,loh,heatmap,breaks,gainloss,gender,metrics,import-picard,import-seg,import-theta,export}
...
cnvkit.py: error: too few arguments
$ cnvkit.py -v
usage: cnvkit.py [-h]
{batch,target,antitarget,coverage,reference,fix,segment,cbs,diagram,scatter,loh,heatmap,breaks,gainloss,gender,metrics,import-picard,import-seg,import-theta,export}
...
cnvkit.py: error: too few arguments
$ cnvkit.py batch --version
usage: cnvkit.py [-h]
{batch,target,antitarget,coverage,reference,fix,segment,cbs,diagram,scatter,loh,heatmap,breaks,gainloss,gender,metrics,import-picard,import-seg,import-theta,export}
...
cnvkit.py: error: unrecognized arguments: --version
$ cnvkit.py batch -v
usage: cnvkit.py [-h]
{batch,target,antitarget,coverage,reference,fix,segment,cbs,diagram,scatter,loh,heatmap,breaks,gainloss,gender,metrics,import-picard,import-seg,import-theta,export}
...
cnvkit.py: error: unrecognized arguments: -v
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.