etal / cnvkit Goto Github PK

View Code? Open in Web Editor NEW

524.0 524.0 163.0 111.08 MB

Copy number variant detection from targeted DNA sequencing

Home Page: http://cnvkit.readthedocs.org

License: Other

Python 98.27% Makefile 1.05% Shell 0.17% Dockerfile 0.09% WDL 0.42%

cnvkit's People

Contributors

Stargazers

Watchers

Forkers

honglongwu roryk zengfengbo nordhuang kyleabeauchamp theboocock simexin aleksandrabliz pcingola xtmgah mpschr patchper giladmishne brentp chapmanb raonyguimaraes niknafs myronpeto snashraf vbrilhante winterli1993 jwcasement tiehan sjmentch cococou weizhiting binfyun wortjohn sunbymoon nygcdlin cez026 kellyquek lovemun kmurphy902 hcrat fatcc zitsen jfcchow onouyek mdshw5 jimmylijing cplaisier sb43 swapnilblues xuwei684 panzhaodong123 ewail rptashkin color4 sunhuaibo jeremy9959 washedgram fbrundu jun-lizst leovam mirror66 ewamarek xiaoqiwang19 linhxxx honphy zorrodong ziyy milkcookie xigyou tamadezhenshiyuncai mingcheninsz amz965 tezla93 zhouyangyu liushuqing506 junjun3039 javrodriguez danielanach xiaohuaniu0032 lisayg davidswang matthieurouland sridhar0605 vymao limbus-medtec realfeidong xuexiaohua-bio myrainy adamp83 derrickreynolds masterscott shengxinbaixiaosheng tskir dadongz dowing johnegarza fjptwenger venuu99 mr-c yangzixu standardgalactic qihaowei89 sssimonyang amora197 tetedange13

cnvkit's Issues

VCF export and genotype call of deletions

In the new VCF export code, you were writing:

 elif ncopies < ploidy:
            svtype = "DEL"
            svlen *= -1
            formats = "GT:GQ"
            genotype = "0/1:%d" % row["probes"]

Does that imply that deletions reported by cnvkit will always be called as heterozygous, i.e. no homozygous-alt deletions (1/1) can be identified and/or reported?

Threshold calling does not take ploidy of sex chromosomes into account

Hi Eric,

It appears to me that call -m threshold always treats sex chromosomes as though they were diploid. Here's a small use case which results took me completely by surprise. I had 15 samples, all male; I built a reference and run all samples against it using these standard commands using CNVKit 0.7.0:

./cnvkit.py reference --male-reference -o reference.cnn -f hg19.fa samples/*.cnn
./cnvkit.py fix sample1.targetcoverage.cnn sample1.antitargetcoverage.cnn reference.cnn \
-o sample1.cnr
./cnvkit.py segment sample1.cnr -o sample1.cns
./cnvkit.py call --male-reference -m threshold -t=-1.1,-0.4,0.3,0.7 -o sample1.call.cns \
sample1.cns
./cnvkit.py export bed --male-reference sample1.call.cns -o sample1.bed
./cnvkit.py export bed --male-reference sample1.call.cns --show-neutral -o sample1.all.bed

For all of the samples I noticed an apparent miscalculation of copy numbers in sex chromosomes. The table below reflects the average log2 values for reference (column 2) and across all of the samples (columns 3-5), absolute copy numbers (column 6), as well as whether those copy numbers are considered neutral by export bed command (column 7). Since all numbers in columns 2-4 were very close to integers, I rounded them for the sake of readability:

chr	reference.cnn	sample.cnn	call.cns	abs CN	CN neutral?
autosomes	0	0	0	2	yes
chrX	−1	−1	+1	2	yes
chrY	−1	−1	+1	2	yes

Note that chrX and chrY went from 0 in sample.cns to +1 in sample.call.cns, which was then reinterpreted as 2 absolute copies by export bed command (and, interestingly enough, still considered neutral, although for sex chromosomes that would be an alteration). After inspecting the code, I discovered that this happens in cnvlib.call.absolute_threshold(): a cycle assigns absolute copy numbers without looking at whether the segment in question comes from autosome or sex chromosome. In this case, it assigns both chrX and chrY an absolute copy number of 2.

I realize that this method is supposed to be really simple, since it only applies fixed thresholds, but intuitively I still expected it to somehow handle autosomes and allosomes differently. If this is indeed by design, and not a bug, then I believe it would be helpful to describe this detail of call -m threshold in the documentation.

Also, a quick related question: since the threshold method is not usable for sex chromosomes, it is OK if I just always use call -m clonal -purity 1 for germline samples? It seems to handle this situation well so far.

Cheers
Kirill

DOCS: Anaconda and adding the bcbio channel

In order for

conda install cnvkit

to work, one need to need to add the channel where cnvkit is hosted, i.e.

conda config --add channels https://conda.binstar.org/bcbio

I recommend adding this to README.rst, because (at least to me) it took a fair bit of trial and error to get it working.

(Also, it seems like Anaconda installs python as well, so it's not clear to me whether you have to have Python 2.7 installed prior to installing Anaconda + cnvkit.)

Segment allele frequencies directly

Extract heterozygous allele frequencies from the tumor sample VCF, and segment those values to generate a .cns file. CBS, haar wavelet and fused lasso are all fine algorithms for this.

Export to VCF

In the "export" command, implement converting segmented CNVs to VCF format.

See: http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/VCF%20%28Variant%20Call%20Format%29%20version%204.0/encoding-structural-variants

Common Workflow Language tool descriptions

Create CommonWL "tool" interfaces for each CNVkit sub-command. Try to use these to improve integration with Galaxy and bcbio-nextgen.

Maybe wrap "batch" as a workflow instead of a tool -- not sure how this works.

Spec: https://github.com/common-workflow-language/common-workflow-language
Examples: https://github.com/common-workflow-language/workflows

Issues generating ref.cnn from hg1k_v37

Hi,

I am testing out your program using 1000G version of hg19 and I am receiving the following error for cvnkit 0.6.1.

Number of target and antitarget files: 1, 1
Loading germline.targetcoverage.cnn
/lb/project/mugqic/analyste_dev/software/python/Python-2.7.8/lib/python2.7/site-packages/pandas/io/parsers.py:1170: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
data = self._reader.read(nrows)
Calculating GC and RepeatMasker content in /lb/project/mugqic/analyste_dev/genomes/Homo_sapiens/hg1k_v37/fasta/hg1k_v37.fasta ...
Traceback (most recent call last):
File "/home/reveleig/github/cnvkit/cnvkit.py", line 9, in
args.func(args)
File "/sb/home/reveleig/github/cnvkit/cnvlib/commands.py", line 498, in _cmd_reference
args.male_reference)
File "/sb/home/reveleig/github/cnvkit/cnvlib/commands.py", line 518, in do_reference
male_reference)
File "/sb/home/reveleig/github/cnvkit/cnvlib/reference.py", line 50, in combine_probes
gc, rmask = get_fasta_stats(cnarr1, fa_fname)
File "/sb/home/reveleig/github/cnvkit/cnvlib/reference.py", line 189, in get_fasta_stats
fa_coords)]
File "/sb/home/reveleig/github/cnvkit/cnvlib/ngfrills/faidx.py", line 71, in fasta_extract_regions
+ "file " + str(fa_fname))
ValueError: Sequence ID '1' is not in FASTA file /lb/project/mugqic/analyste_dev/genomes/Homo_sapiens/hg1k_v37/fasta/hg1k_v37.fasta

Thanks for any help you can provide.

Best,
Rob E.

Update Galaxy tool

The current tool description is out of date, and also doesn't pull in its dependencies automatically.

Missing name field in bedcov

line:

cnvkit/cnvlib/coverage.py

Line 125 in 2d43983

chrom, start_s, end_s, name, basecount_s = line.split('\t')

samtools bedcov doesn't return a "name" fields as far as I know. Where does this name field come from?

Accept & look for "sample.bai" as an index for "sample.bam"

By default, samtools creates the BAM index file for "sample.bam" as "sample.bam.bai", so that's the file CNVkit looks for when ensuring a BAM file has been indexed.

Apparently, samtools/pysam/htslib will also accept "sample.bai" as a BAM index?

Verify this, and implement it in CNVkit/ngfrills.

Use the logging module for status messages

Instead of printing all status messages to standard error, configure and use the logging module. Stratify the status messages into debugging, info, warning, and maybe error levels. Add options to the CLI commands to choose the logging level / verbosity.

Then, run the unit test script with all of these messages turned off.

Feature suggestion: accept comma separated lists of overlapping genes for plots

Just a suggestion, but the scatter command does not seem to be able to resolve gene names, if overlapping genes are given as comma separated lists.

PSCBS 0.45.0 breaks probe segmentation

Hi Eric,

PSCBS R package version 0.45.0 was released yesterday (2015-09-11) and apparently it breaks probe segmentation in CNVKit. Perhaps it introduces some backwards incompatible syntax changes, or maybe it's just a regression in PSCBS itself—I unfortunately don't have enough time at the moment to investigate further into this. Judging from the error message, it has something to do with namespace conflicts. I will for now stick with using version 0.44.0, which works fine, but since 0.45.0 is now installed by default, it might be inconvenient and confusing to some users. An error trace from running make in test directory is below.

Please let me know if you need more information on this, and thank you for your work on CNVKit!

Yours
Kirill

python ../cnvkit.py segment build/p2-5_5.cnr -o build/p2-5_5.cns
Traceback (most recent call last):
  File "../cnvkit.py", line 9, in <module>
    args.func(args)
  File "/mnt/tmp/cnvkit/cnvlib/commands.py", line 625, in _cmd_segment
    args.rlibpath)
  File "/mnt/tmp/cnvkit/cnvlib/segmentation/__init__.py", line 45, in do_segmentation
    seg_out = ngfrills.call_quiet('Rscript', script_fname)
  File "/mnt/tmp/cnvkit/cnvlib/ngfrills/__init__.py", line 41, in call_quiet
    % (' '.join(args), err))
RuntimeError: Subprocess command failed:
$ Rscript /tmp/tmp6dlQOh

PSCBS v0.45.0 (2015-09-11) successfully loaded. See ?PSCBS for help.

Attaching package: ‘PSCBS’

The following objects are masked from ‘package:base’:

    append, load

Loading probe coverages into a data frame
Pre-processing the probe data for segmentation
Segmenting the probe data
Error: !hasWeights || !is.null(fit$data$w) is not TRUE
Execution halted

AssertionError in "diagram" with WGS-level gene density

Hi there,
Thanks for cnvkit, what a great tool!
I've come across this error:

Relative log2 coverage of X chromosome: -0.820871 (assuming male)
Traceback (most recent call last):
 File "/group/ngs/src/bcbio-nextgen/latest-devel/rhel6-x64/anaconda/bin/cnvkit.py", line 9, in <module>
   args.func(args)
 File "/group/ngs/src/bcbio-nextgen/latest-devel/rhel6-x64/anaconda/lib/python2.7/site-packages/cnvlib/commands.py", line 685, in _cmd_diagram
   args.output, args.male_reference)
 File "/group/ngs/src/bcbio-nextgen/latest-devel/rhel6-x64/anaconda/lib/python2.7/site-packages/cnvlib/diagram.py", line 48, in create_diagram
   probe_rows = probe_rows.squash_genes()
 File "/group/ngs/src/bcbio-nextgen/latest-devel/rhel6-x64/anaconda/lib/python2.7/site-packages/cnvlib/cnarray.py", line 532, in squash_genes
   outrows.append(squash_rows(name, subarr))
 File "/group/ngs/src/bcbio-nextgen/latest-devel/rhel6-x64/anaconda/lib/python2.7/site-packages/cnvlib/cnarray.py", line 513, in squash_rows
   chrom = core.check_unique(rows['chromosome'], 'chromosome')
 File "/group/ngs/src/bcbio-nextgen/latest-devel/rhel6-x64/anaconda/lib/python2.7/site-packages/cnvlib/core.py", line 154, in check_unique
   % (title, ' '.join(map(str, sorted(its)))))
AssertionError: Inconsistent chromosome keys: chr11_gl000202_random chr17_ctg5_hap1 chr17_gl000203_random

I think the error is caused by the gene column containing a dot ( . ) as these contigs have no annotated genes.

Can't run help for export bed

cnvkit.py export vcf -h
# Works fine

cnvkit.py export bed -h
# See traceback below


Traceback (most recent call last):
  File "~/opt/bin/cnvkit.py", line 10, in <module>
    args = commands.parse_args()
  File "~/anaconda2/lib/python2.7/site-packages/cnvlib/commands.py", line 1738, in parse_args
    return AP.parse_args(args=args)
  File "~/anaconda2/lib/python2.7/argparse.py", line 1701, in parse_args
    args, argv = self.parse_known_args(args, namespace)
  File "~/anaconda2/lib/python2.7/argparse.py", line 1733, in parse_known_args
    namespace, args = self._parse_known_args(args, namespace)
  File "~/anaconda2/lib/python2.7/argparse.py", line 1921, in _parse_known_args
    positionals_end_index = consume_positionals(start_index)
  File "~/anaconda2/lib/python2.7/argparse.py", line 1898, in consume_positionals
    take_action(action, args)
  File "~/anaconda2/lib/python2.7/argparse.py", line 1807, in take_action
    action(self, namespace, argument_values, option_string)
  File "~/anaconda2/lib/python2.7/argparse.py", line 1096, in __call__
    subnamespace, arg_strings = parser.parse_known_args(arg_strings, None)
  File "~/anaconda2/lib/python2.7/argparse.py", line 1733, in parse_known_args
    namespace, args = self._parse_known_args(args, namespace)
  File "~/anaconda2/lib/python2.7/argparse.py", line 1921, in _parse_known_args
    positionals_end_index = consume_positionals(start_index)
  File "~/anaconda2/lib/python2.7/argparse.py", line 1898, in consume_positionals
    take_action(action, args)
  File "~/anaconda2/lib/python2.7/argparse.py", line 1807, in take_action
    action(self, namespace, argument_values, option_string)
  File "~/anaconda2/lib/python2.7/argparse.py", line 1096, in __call__
    subnamespace, arg_strings = parser.parse_known_args(arg_strings, None)
  File "~/anaconda2/lib/python2.7/argparse.py", line 1733, in parse_known_args
    namespace, args = self._parse_known_args(args, namespace)
  File "~/anaconda2/lib/python2.7/argparse.py", line 1939, in _parse_known_args
    start_index = consume_optional(start_index)
  File "~/anaconda2/lib/python2.7/argparse.py", line 1879, in consume_optional
    take_action(action, args, option_string)
  File "~/anaconda2/lib/python2.7/argparse.py", line 1807, in take_action
    action(self, namespace, argument_values, option_string)
  File "~/anaconda2/lib/python2.7/argparse.py", line 996, in __call__
    parser.print_help()
  File "~/anaconda2/lib/python2.7/argparse.py", line 2340, in print_help
    self._print_message(self.format_help(), file)
  File "~/anaconda2/lib/python2.7/argparse.py", line 2314, in format_help
    return formatter.format_help()
  File "~/anaconda2/lib/python2.7/argparse.py", line 281, in format_help
    help = self._root_section.format_help()
  File "~/anaconda2/lib/python2.7/argparse.py", line 211, in format_help
    func(*args)
  File "~/anaconda2/lib/python2.7/argparse.py", line 211, in format_help
    func(*args)
  File "~/anaconda2/lib/python2.7/argparse.py", line 517, in _format_action
    help_text = self._expand_help(action)
  File "~/anaconda2/lib/python2.7/argparse.py", line 603, in _expand_help
    return self._get_help_string(action) % params
TypeError: %d format: a number is required, not str

Add BICseq segment-filtering method

BICseq appears to perform well in benchmarks on WGS data.

The reference implementation is in matlab, but the paper describes how it works reasonably clearly. Just reimplement it.

The algorithm assumes a paired normal is available, but in CNVkit we'll use the reference-based bin weights instead to estimate reasonable normal-sample counts.

phyloWGS compatibility

I am trying to study the tumor subclonal populations using the algorithm, phyloWGS, which takes somatic mutation variant allele frequency (ssm_file) and copy number variation (cnv_file) as inputs (https://github.com/morrislab/phylowgs). I was wondering if you can export the results of CNVkit to the input of phyloWGS (cnv file).
It needs the following information.

cnv: identifier for each CNV. Identifiers must start at c0 and increment, so the first data row will have c0, the second row c1, and so forth.
a: number of reference reads covering the CNV.
d: total number of reads covering the CNV. This will be affected by factors such as total copy number at the locus, sequencing depth, and the size of the chromosomal region spanned by the CNV.
ssms: SSMs that overlap with this CNV. Each entry is a comma-separated triplet consisting of SSM ID, maternal copy number, and paternal copy number. These triplets are separated by semicolons.

I understand that somatic mutation information will be analyzed separately and provided.

Thank you,
Jungmin

Use rpy2 instead of an Rscript subprocess for segmentation

The segmentation algorithms that use R packages currently run Rscript in a subprocess using a filled-in R script template as input.

It's weird to have R scripts stored in Python strings in segmentation/__init__.py
The R scripts complicate refactoring; some of the code is redundant across the cbs, flasso and haar methods but can't be consolidated.

I'm not sure how rpy2 would complicate CNVkit installation on OS X, or using a nonstandard R library path.

cnvkit.py batch: Document that --output-reference ignored --output-dir [SUGGESTION]

I just ran cnvkit.py batch outputting to a subdirectory, but noticed that all files but the "copy number reference profile *.cnn file was still outputted to the working directory. I can see how this design/behavior makes sense. However, may I suggest that it's documented, e.g. "(Ignores -o/--output-dir option)":

cnvkit.py batch --help
...
  --output-reference OUTPUT_REFERENCE
                        Output filename for the new reference file being
                        created. (Ignores -o/--output-dir option)
...

DOCS: Optimistic/outdated that `make test` "should take less than a minute"?

The README.rst says:

If this pipeline completes successfully (it should take less than a minute), you've installed CNVkit correctly.

Is this an outdated benchmark, because on a decent Linux machine, this took ~10 minutes for me.

Inline Plots in ipython notebook

Hey,
Is there a way to get inline plots(the log2 coverage ones) in ipython notebook ? I tried %matplotlib inline, but it doesnt seem to solve it.

Thank You

heatmap: Add a scale/legend with a color gradient

The heatmap plot needs a legend. This should be a color gradient, red-to-white-to-blue, with tick marks at +1/0/-1, in a small vertically-oriented box on the right side of the plot.

In a transposed heatmap (#35), the box should be at the bottom oriented horizontally, with red on the left.

Consider recreating the same in Reportlab for the 'diagram' plot.

Add CODEX segmentation method

CODEX is a copy number developed specifically for detecting copy number from exome sequencing, and includes a segmentation method that explicitly models read depth (unlike the other segmentation methods currently available in CNVkit). See the paper and Bioconductor package.

Just wrap the R package.

batch: Make reference vs. de novo options mutually exclusive

See this SeqAnswers post:
http://seqanswers.com/forums/showthread.php?p=168492#post168492

The batch command runs the pipeline by either building a new reference or using an existing one, and some of the options for building a new reference do not make sense in the latter case. Using those options should consistently raise an error that explains the conflict.

antitarget function

Hi Etal,

Thank you for developing CNVkit. It's brilliant!

I performed CNVkit on a small gene panel sequencing data, the result is very good.

However, I noticed that there are two chromosomes missing in the plot. These two chromosomes are missing in target or antitarget bed files too. Then I found the reason is I don't have any target designed for sequencing on these two chromosomes.

Since off-target reads should be all over the genome, I am wondering if this is a bug?

Looking forward to your reply.
Thanks ~

Hannah

Docker

Create a Docker container for CNVkit.

It should probably be based on one of the Bioconductor containers to provide the dependencies for CBS: http://bioconductor.org/help/docker/

Depth based confidence intervals around copy number estimates

Hi @etal, wanted to pick your brain on how to implement some kind of "confidence intervals" around the copy number estimates. Ideally there could be an interval around the point estimate that would reflect depth of sequencing and smoothness of the copy number signal around the interval (could be e.g. MAD http://en.wikipedia.org/wiki/Median_absolute_deviation with a provision for depth).

For example, a log2 copy number of 1 surely should be more meaningful in 20x coverage data compared to 1x coverage data, right?

I'm more than happy to implement something and help but wanted to hear your thoughts first.

'haar' segmentation method is buggy

Revisit CNVkit's reimplementation of the HaarSeg algorithm. It's not quite right.

bcbio channel installing old cnvkit by default?

Hi,

In short: could you check please, that following the conda instructions for the bcbio channel actually give the latest version? Or add some extra information on how to choose which version it will install?

Details: I'm afraid I'm submitting this issue for one of my users, I'm not experienced with conda and I don't know exactly what happened, but...

Judging from his .bash_history, he ran the advertised

conda config --add channels https://conda.binstar.org/bcbio
conda install cnvkit

and he got cnvkit-0.2.5. Then he ran into various problems which have long been fixed. I solved it by removing the installed copy, downloading the 0.7.3 zipfile and building that with setup.py.

I tried to ask conda what it would install, and (after I un-mangled the pasted terminal transcript) this is what it said

(cnvkit_env)mibxxxxxxi:1112_CNVkit hl11$ conda search cnvkit
Fetching package metadata: ......
cnvkit                    .  0.2.5                np19py27_0    https://conda.binstar.org/bcbio/osx-64/
                             0.3.1a               np18py27_0    https://conda.binstar.org/bcbio/osx-64/
                             0.3.2                np18py27_0    https://conda.binstar.org/bcbio/osx-64/
                             0.3.2                np19py27_1    https://conda.binstar.org/bcbio/osx-64/
                             0.3.5                np19py27_0    https://conda.binstar.org/bcbio/osx-64/
                             0.3.5                np19py27_1    https://conda.binstar.org/bcbio/osx-64/
                             0.3.5                np19py27_2    https://conda.binstar.org/bcbio/osx-64/
                             0.3.5                np19py27_3    https://conda.binstar.org/bcbio/osx-64/
                             0.3.5                np19py27_4    https://conda.binstar.org/bcbio/osx-64/
                             0.4.0                np19py27_0    https://conda.binstar.org/bcbio/osx-64/
                             0.4.0                np19py27_1    https://conda.binstar.org/bcbio/osx-64/
                             0.4.1                np19py27_0    https://conda.binstar.org/bcbio/osx-64/
                             0.4.1                np19py27_1    https://conda.binstar.org/bcbio/osx-64/
                             0.4.1                np19py27_2    https://conda.binstar.org/bcbio/osx-64/
                             0.4.1                np19py27_3    https://conda.binstar.org/bcbio/osx-64/
                             0.4.1                np19py27_4    https://conda.binstar.org/bcbio/osx-64/
                             0.5.0                np19py27_0    https://conda.binstar.org/bcbio/osx-64/
                             0.5.1                np19py27_0    https://conda.binstar.org/bcbio/osx-64/
                             0.6.1                np19py27_0    https://conda.binstar.org/bcbio/osx-64/
                             0.6.2a               np19py27_0    https://conda.binstar.org/bcbio/osx-64/
                             0.6.2a               np19py27_1    https://conda.binstar.org/bcbio/osx-64/
                             0.6.2a               np19py27_2    https://conda.binstar.org/bcbio/osx-64/
                             0.7.0.dev0           np19py27_0    https://conda.binstar.org/bcbio/osx-64/
                             0.7.0                np19py27_2    https://conda.binstar.org/bcbio/osx-64/
                             0.7.1.dev0           np19py27_0    https://conda.binstar.org/bcbio/osx-64/
                             0.7.1                np19py27_0    https://conda.binstar.org/bcbio/osx-64/
                             0.7.2                    py27_0    https://conda.binstar.org/bcbio/osx-64/
                             0.7.3dev0                py27_0    https://conda.binstar.org/bcbio/osx-64/

Thanks,
Matthew (for Henry )

Add "rescale" command

The call -m clonal command can rescale log2 ratios to compensate for normal-cell contamination in a sample. That can be useful outside of calling integer copy numbers, so let's pull it out into a separate rescale command with similar options.

This feature may also be useful for analysis of tumor subclones/heterogeneity.

Keep the rescaling functionality in call -m clonal as well for now.

LOH VCF file

The LOH command is not working for me, it doesn't seem to like my VarScan VCF file. Any ideas?

Thanks

cnvkit.py scatter Sample.cnr -s Sample.cns -v VarScan.snp.Germline.vcf
Traceback (most recent call last):
  File "/usr/local/bin/cnvkit.py", line 4, in <module>
    __import__('pkg_resources').run_script('CNVkit==0.3.5-dev', 'cnvkit.py')
  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 534, in run_script
    if req in processed:
  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 1445, in run_script

  File "/usr/local/lib/python2.7/dist-packages/CNVkit-0.3.5_dev-py2.7.egg/EGG-INFO/scripts/cnvkit.py", line 9, in <module>

  File "build/bdist.linux-x86_64/egg/cnvlib/commands.py", line 714, in _cmd_scatter

  File "build/bdist.linux-x86_64/egg/cnvlib/commands.py", line 734, in do_scatter

  File "build/bdist.linux-x86_64/egg/cnvlib/ngfrills/vcf.py", line 13, in load_vcf
    # 'format': list of keys (strings)
  File "build/bdist.linux-x86_64/egg/cnvlib/ngfrills/vcf.py", line 33, in filter_vcf_lines
    # When instantiated, a reference can be passed to the VCF class.  This may be any class that supports a
KeyError: 'AF'

Filter a sample VCF versus another VCF of known SNPs

For tumor-only VCFs, or paired T/N VCFs with lingering noise, it would be nice to take another VCF of known SNPs from 1000 Genomes, dbSNP 129, ESP 6500, etc. and further filter the variants in the given sample.

scatter: Identify and highlight LOH segments in plot

The loh plot used to try to identify whole chromosomes with loss of heterozygosity (LOH) using an ill-conceived Mann-Whitney test. This would fail if multiple chromosomes had LOH (skewing the background distribution), or if the LOH region was much smaller than a chromosome (i.e. covering too few SNPs).

This issue has two parts:

If segments are provided to the scatter command, group variants in the given VCF by segment. (Mostly done.)
Within a segment (or whole chromosome, if segments are not given), use a statistical test to determine if SNP allele frequencies in the region differ from those with neutral-frequency. See PSCBS paper and R package since @HenrikBengtsson already put the necessary thought into this.

Plots: Optionally use bins instead of basepairs as the x-axis increment

Treat the x-axis units as bin indices instead of genomic coordinates or basepairs, so all bins are shown with equal size. Most useful/relevant for heatmap, but also consider scatter and maybe diagram.

Not working in python 3?

I've installed cnvkit using pip, at first run it raises a SyntaxError:

$ cnvkit.py -h
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.4/bin/cnvkit.py", line 5, in <module>
    from cnvlib import commands
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/cnvlib/__init__.py", line 1, in <module>
    from .cnarray import CopyNumArray as _CNA
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/cnvlib/cnarray.py", line 11, in <module>
    from . import core, metrics, ngfrills, smoothing
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/cnvlib/core.py", line 9, in <module>
    from .ngfrills import echo, safe_write
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/cnvlib/ngfrills.py", line 407
    except OSError, exc:
                  ^
SyntaxError: invalid syntax

After correction of entries in python3 syntax (i.e. OSError, exc becomes OSError as exc) I have

$ cnvkit.py 
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.4/bin/cnvkit.py", line 9, in <module>
    args.func(args)
AttributeError: 'Namespace' object has no attribute 'func'

LOH: Read germline SNPs from paired tumor-normal VCF

The functions to plot loss-of-heterozygosity (LOH) take allele frequencies from a given VCF file. LOH is only visible in germline SNPs that are also present in the tumor sample. Currently, all SNVs in the VCF are plotted, so this works best if the user had the foresight to separately filter the tumor variants for germline SNPs first.

VCFs from MuTect, FreeBayes and other variant callers will list the SNPs/SNVs in a matched tumor-normal pair together; this makes it feasible for CNVkit to parse this VCF and extract only the allele frequencies of SNPs for plotting.

The tumor sample name can be specified with -i now. Next:

Add -n to specify the normal sample name in the same VCF.
If -n is unspecified and there are 2 or more samples, look for "NORMAL" (case-insensitive); if -iis unspecified, look for "TUMOR".
Fall back to current behavior if none of this magic works.

Keep raw, un-normalized read counts in (anti)targetcoverage.cnn files

The coverage command currently median-centers and log2-transforms the read depths in each genomic bin before writing the *.targetcoverage.cnn and *.antitargetcoverage.cnn files. This makes the reference command a little more memory-efficient because the binned values across different samples are already comparable and can be processed line-by-line.

But downstream algorithms may want the original read depths to estimate reliability, etc. (see #28). Let's make reference work a little harder and do not normalize/median-center the coverage read depths before writing the .cnn files.

Continue to log2-transform the values, so that all CNVkit files can still be parsed the same way. To prevent a math domain error, use a tiny preconfigured value like -20 (params.NULL_LOG2_COVERAGE) to represent 0 reads in a bin.

KeyError: None in regions.py, }[fmt]

Hi Eric,
I'm running a whole genome analysis without normals and am getting the following error:

Splitting: .                26381
Wrote ./2_2015-04-02_bcbio_cnvkit-sort-callable-callableblocks-merged-annotated.target.bed with 4410794 target intervals
Detected file format: BED
Detected file format: BED
Detected file format: BED
Wrote ./2_2015-04-02_bcbio_cnvkit-sort-callable-callableblocks-merged-annotated.antitarget.bed with 1 background intervals
Building a flat reference...
Detected file format: BED
Traceback (most recent call last):
  File "/group/ngs/src/bcbio-nextgen/0.8.8/rhel6-x64/anaconda/bin/cnvkit.py", line 9, in <module>
    args.func(args)
  File "/group/ngs/src/bcbio-nextgen/0.8.8/rhel6-x64/anaconda/lib/python2.7/site-packages/cnvlib/commands.py", line 80, in _cmd_batch
    args.processes, args.count_reads)
  File "/group/ngs/src/bcbio-nextgen/0.8.8/rhel6-x64/anaconda/lib/python2.7/site-packages/cnvlib/commands.py", line 148, in batch_make_reference
    male_reference)
  File "/group/ngs/src/bcbio-nextgen/0.8.8/rhel6-x64/anaconda/lib/python2.7/site-packages/cnvlib/commands.py", line 509, in do_reference_flat
    ref_probes.extend(reference.bed2probes(antitarget_list))
  File "/group/ngs/src/bcbio-nextgen/0.8.8/rhel6-x64/anaconda/lib/python2.7/site-packages/cnvlib/reference.py", line 15, in bed2probes
    for chrom, start, end, name in ngfrills.parse_regions(bed_fname)]
  File "/group/ngs/src/bcbio-nextgen/0.8.8/rhel6-x64/anaconda/lib/python2.7/site-packages/cnvlib/ngfrills/regions.py", line 62, in parse_regions
    }[fmt]
KeyError: None

The command is roughly:
cnvkit.py batch 2_2015-04-02_bcbio_cnvkit-sort.bam -n -f /ngs/reference_data/genomes/Hsapiens/hg19/seq/hg19.fa --targets 2_2015-04-02_bcbio_cnvkit-sort-callable-callableblocks-merged-annotated.bed --access 2_2015-04-02_bcbio_cnvkit-sort-callable-callableblocks-merged-annotated.bed -d ./ --split -p 1 --output-reference ./flat_background.cnn --antitarget-avg-size 1000 --antitarget-min-size 75 --target-avg-size 1000 --rlibpath /group/ngs/src/bcbio-nextgen/0.8.8/rhel6-x64/lib/R/site-library

The antitarget bed file that gets produced is empty (as expected). The target bed file has 4 columns.

Any ideas?

'haar' segmentation output doesn't include gene names

Reported here:
https://www.biostars.org/p/160256/

The R-based segmenters add gene names from the input .cnr as a post-processing step. The 'haar' method should do the same thing.

heatmap: Option to transpose the plot

The current "horizontal" layout shows samples in rows, chromosomes in columns. A transposed or "vertical" layout would show samples in columns, chromosomes in rows. (Sometimes this makes a better figure.)

Consider the flags "-t/--transpose" or "-v/--vertical".

Call function: How to interpret log2(absolute copy number integers of a segment)

Hi,
I don't fully understand why I obtained a pattern of integers (x) and half-integers (x + 0.5) when using call function. I expected to get only integers.

My test:
Assuming 100% tumor purity across my samples, I calculated the -threshold as follow (R code):

> c("No.Copies"=log2(0.1/2),"one.Copy"=log2(1/2),
"neutral.Diploid"=log2(2/2),"three.copies"=log2(3/2),
"four.copies"=log2(4/2))
#      No.Copies        one.Copy neutral.Diploid    three.copies     four.copies 
#     -4.3219281      -1.0000000       0.0000000       0.5849625       1.0000000

So I run the cnvkit.py call function using the following parameters:

$cnvkit.py call ${sampleName}".cns" -m threshold -t=-4.3,-1,0,0.6,1 -o ${sampleName}".absoluteCN.cns"
Treating sample gender as female
Calling copy number with thresholds: -4.3 => 0, -1 => 1, 0 => 2, 0.6 => 3, 1 => 4

Please, notice the output said: Calling copy number with thresholds: -4.3 => 0, -1 => 1, 0 => 2, 0.6 => 3, 1 => 4, where I guess the first number of the pair is the log2 and the second one is the absolute copy number.

Then, if we retrieve all unique estimated copy-numbers from the output ${sampleName}".absoluteCN.cns" file, and revert them from the log2 transformation (using bash):

$tail -n +2  ${sampleName}".absoluteCN.cns" | cut -f 5 | sort -n -u | awk '{print "log2(CN)="$1"; then CN="2**$1;}'
log2(CN)=-10.9658; then CN=0.000499995
log2(CN)=-1; then CN=0.5
log2(CN)=0; then CN=1
log2(CN)=0.584963; then CN=1.5
log2(CN)=1; then CN=2
log2(CN)=1.32193; then CN=2.5
log2(CN)=1.58496; then CN=2.99999
log2(CN)=2; then CN=4
log2(CN)=2.16993; then CN=4.50002
log2(CN)=3.39232; then CN=10.5

I wonder why there are half-integers (i.e. 0.5, 1.5, 2.5, 4.5 and 10.5) in the output. Are these copy-numbers defined in the context of a haploid genome or in the whole genome (diploid)?

Thanks in advance,
Javier

My version:

$cnvkit.py version
0.6.1-dev

Inconsistency in Docs?

So the documentation on batch (http://cnvkit.readthedocs.org/en/latest/pipeline.html) suggests that batch runs cnvkit targets as one of its steps. However, as far as I can tell, targets takes a BED as input and outputs a new BED---while batch does not output a new BED file for the targets. This has implications when you pass the --annotate flag, as no annotation is actually done via batch as far as I can tell.

Let me know if I'm mis-reading something here.

Update "export theta" to use THetA2 features

See the THetA2 v.0.7 release notes.

List each segment's covered genes in the .cns file

In the segmentation output file (.cns), each segment could potentially cover zero or many genes, but for simplicity CNVkit has only written "G" or "L" in that column (for gain/loss).

That's too simple. The genes in each segment region should be extracted from the corresponding .cnr file and listed, comma-separated, in the "gene" column of the .cns file.

These strings are then available for export (to BED, etc.) without pulling in the .cnr file again.

The segmetrics output currently uses the "gene" column to print segment-level stats, but these stats can instead be written as additional columns with the stat name as the column header.

Parallel "batch" sometimes fails to generate output files (.cnn/.cns)

The batch command when run on multiple CPU cores in parallel (-p 0 or -p >1) fails to create all of the expected output files, but doesn't indicate why.

This could be more than one underlying issue:

If segmentation crashes under certain conditions, it's a silent error and the command simply completes without creating the .cns file. But the .cnr file has been successfully produced by this point.

Workaround: Run the segment command on the .cnr file to see the error that caused segmentation and therefore batch to originally fail/abort. Fix the error if you can, or report the issue here and wait for a timely fix. Then if segment can be run successfully, parallel batch probably can, too. (Or just finish the job manually by running segment on each sample separately.)

Reported in bcbio/bcbio-nextgen#915: With -p > 1, the batch pipeline produces binned coverages and a reference (.cnn files), but fails to create a .cnr file and/or .cns file.

Workaround: Run the same command with -p 1 instead, and the pipeline will successfully run on a single core.

Diagnosis: There's an issue either in CNVkit's use of the multiprocessing library, or the multiprocessing library in Python 2.7 itself has an issue synchronizing on some platforms. It works on Ubuntu 14.04 and later; I'm interested to hear where it doesn't work.

Issue in cnvkit 'list' object has no attribute 'columns'

Hi Eric,
I was trying to rerun a past analysis with a newer version of cnvkit and am running into a few problems:

subprocess.CalledProcessError: Command '/apps/bcbio-nextgen/latest-devel/rhel6-x64/anaconda/bin/cnvkit.py reference -f /ngs/reference_data/genomes/Hsapiens/hg19/seq/hg19.fa -o /ngs/oncology/analysis/translation/TS_0028_CRUK_StratMed_Phase2_matrix_validation/Birmingham_test/Birmingham/work/structural/HDC140P/cnvkit/raw/tx/tmpSxtH5v/flat_background.cnn -t /ngs/oncology/analysis/translation/TS_0028_CRUK_StratMed_Phase2_matrix_validation/Birmingham_test/Birmingham/work/structural/HDC140P/cnvkit/raw/NexteraRapidCapture-CRUK_SMP2-1-2-merged-annotated.target.bed -a /ngs/oncology/analysis/translation/TS_0028_CRUK_StratMed_Phase2_matrix_validation/Birmingham_test/Birmingham/work/structural/HDC140P/cnvkit/raw/NexteraRapidCapture-CRUK_SMP2-1-2-merged-annotated.antitarget.bed
Detected file format: BED
Traceback (most recent call last):
  File "/apps/bcbio-nextgen/latest-devel/rhel6-x64/anaconda/bin/cnvkit.py", line 9, in <module>
    args.func(args)
  File "/apps/bcbio-nextgen/latest-devel/rhel6-x64/anaconda/lib/python2.7/site-packages/cnvlib/commands.py", line 463, in _cmd_reference
    args.fasta, args.male_reference)
  File "/apps/bcbio-nextgen/latest-devel/rhel6-x64/anaconda/lib/python2.7/site-packages/cnvlib/commands.py", line 515, in do_reference_flat
    ref_probes.concat(reference.bed2probes(antitarget_list))
  File "/apps/bcbio-nextgen/latest-devel/rhel6-x64/anaconda/lib/python2.7/site-packages/cnvlib/reference.py", line 15, in bed2probes
    regions = RA.read(bed_fname)
  File "/apps/bcbio-nextgen/latest-devel/rhel6-x64/anaconda/lib/python2.7/site-packages/cnvlib/rary.py", line 42, in read
    return cls([])
  File "/apps/bcbio-nextgen/latest-devel/rhel6-x64/anaconda/lib/python2.7/site-packages/cnvlib/rary.py", line 19, in __init__
    gary.GenomicArray.__init__(self, data_table, meta_dict)
  File "/apps/bcbio-nextgen/latest-devel/rhel6-x64/anaconda/lib/python2.7/site-packages/cnvlib/gary.py", line 22, in __init__
    if not all(c in data_table.columns for c in self._required_columns):
  File "/apps/bcbio-nextgen/latest-devel/rhel6-x64/anaconda/lib/python2.7/site-packages/cnvlib/gary.py", line 22, in <genexpr>
    if not all(c in data_table.columns for c in self._required_columns):
AttributeError: 'list' object has no attribute 'columns'
' returned non-zero exit status 1

Any idea where this might be coming from? The antitarget file is empty as this is data from a targeted panel

batch: Allow overlapping sets of tumor and normal samples as input

Hi again,

If a BAM file is present in both "Normal" and "Tumor" collections supplied to cnvkit batch command, bin coverages for it (sample.[anti]targetcoverage.cnn) will be calculated twice, slowing down the entire process twofold.

It is true that "Normal" and "Tumor" sets normally do not overlap at all. It is often useful, however, to run all present (presumably normal) samples against themselves, for example to construct a reference for later use and to identify noisy / otherwise unsuitable samples in the process.

Yours
Kirill

SUGGESTION: Simpler R package instructions

You can install both Bioconductor and CRAN package using biocLite(), so it's enough to do:

source("http://bioconductor.org/biocLite.R")
biocLite(c("PSCBS", "matrixStats", "cghFLasso"))

Further, since PSCBS imports matrixStats, the latter is automatically installed, so it's even enough to do:

source("http://bioconductor.org/biocLite.R")
biocLite(c("PSCBS", "cghFLasso"))

test has crashed, Pandas-related issue

Hi,
I was using an old version of CNVkit (March 2015) very well. I just updated to the last version (August 2015), so I installed Pandas (new dependency). However when I tried out the test, I got the following errors:

python ../cnvkit.py scatter -s build/p2-9_2.cns build/p2-9_2.cnr -c 'chr9:150000-45000000' -o p2-9_2-chr9p-scatter.pdf
Traceback (most recent call last):
  File "../cnvkit.py", line 9, in <module>
    args.func(args)
  File "/local/jperales/TMP/cnvkit/cnvlib/commands.py", line 800, in _cmd_scatter
    args.y_max)
  File "/local/jperales/TMP/cnvkit/cnvlib/commands.py", line 867, in do_scatter
    start, end)[chrom]
  File "/local/jperales/TMP/cnvkit/cnvlib/plots.py", line 437, in gene_coords_by_range
    for row in probes.in_range(chrom, start, end):
  File "/local/jperales/TMP/cnvkit/cnvlib/gary.py", line 248, in in_range
    table = table[table.start.searchsorted(start):]
  File "/usr/lib/python2.7/dist-packages/pandas/core/generic.py", line 1815, in __getattr__
    (type(self).__name__, name))
AttributeError: 'Series' object has no attribute 'searchsorted'

[...]

python ../cnvkit.py scatter -s build/p2-9_2.cns build/p2-9_2.cnr -g SMARCA2,PTPRD -w 4e6 -o p2-9_2-SMARCA2-PTPRD-scatter.pdf
Traceback (most recent call last):
  File "../cnvkit.py", line 9, in <module>
    args.func(args)
  File "/local/jperales/TMP/cnvkit/cnvlib/commands.py", line 800, in _cmd_scatter
    args.y_max)
  File "/local/jperales/TMP/cnvkit/cnvlib/commands.py", line 891, in do_scatter
    sel_probes = pset_cvg.in_range(chrom, *window_coords)
  File "/local/jperales/TMP/cnvkit/cnvlib/gary.py", line 255, in in_range
    table = table[:table.end.searchsorted(end, 'right')]
  File "/usr/lib/python2.7/dist-packages/pandas/core/generic.py", line 1815, in __getattr__
    (type(self).__name__, name))
AttributeError: 'Series' object has no attribute 'searchsorted'
make: *** [p2-9_2-SMARCA2-PTPRD-scatter.pdf] Error 1

Here are my module versions (taken from $lib.__version__):

Module	Version
Python	2.7.6
Biopython	1.65
Reportlab	(unknown)
matplotlib	1.3.1
Numpy	1.8.2
SciPy	0.13.3
Pandas	0.13.1
pysam	0.8.2.1
PyVCF	(unknown)

uname -srvmo
Linux 3.13.0-49-generic #83-Ubuntu SMP Fri Apr 10 20:11:33 UTC 2015 x86_64 GNU/Linux

Thanks in advance,
Javier Perales

cnvkit.py: How to get version?

Is it possible to get the version of CNVkit installed? I think it would be useful for reproducibility and giving feedback. I've tried:

$ cnvkit.py --version
usage: cnvkit.py [-h]
                 {batch,target,antitarget,coverage,reference,fix,segment,cbs,diagram,scatter,loh,heatmap,breaks,gainloss,gender,metrics,import-picard,import-seg,import-theta,export}
                 ...
cnvkit.py: error: too few arguments

$ cnvkit.py -v
usage: cnvkit.py [-h]
                 {batch,target,antitarget,coverage,reference,fix,segment,cbs,diagram,scatter,loh,heatmap,breaks,gainloss,gender,metrics,import-picard,import-seg,import-theta,export}
                 ...
cnvkit.py: error: too few arguments

$ cnvkit.py batch --version
usage: cnvkit.py [-h]
                 {batch,target,antitarget,coverage,reference,fix,segment,cbs,diagram,scatter,loh,heatmap,breaks,gainloss,gender,metrics,import-picard,import-seg,import-theta,export}
                 ...
cnvkit.py: error: unrecognized arguments: --version

$ cnvkit.py batch -v
usage: cnvkit.py [-h]
             {batch,target,antitarget,coverage,reference,fix,segment,cbs,diagram,scatter,loh,heatmap,breaks,gainloss,gender,metrics,import-picard,import-seg,import-theta,export}
                 ...
cnvkit.py: error: unrecognized arguments: -v