polyactis / accucopy

Accucopy is a computational method that infers Allele-Specific Copy Number alterations from low-coverage low-purity tumor sequencing data.

Home Page: https://www.yfish.org/software/Accucopy

License: GNU General Public License v3.0

CMake 0.35% Makefile 0.55% Rust 8.88% Dockerfile 0.14% Shell 1.82% R 0.43% Python 17.42% Perl 0.26% C++ 70.15%
copy-number cancer-genomics somatic-variants gaussian-mixture-models expectation-maximization maximum-likelihood-estimation bayesian-information-criterion auto-correlation

accucopy's Introduction

Introduction

Accucopy is a CNA-calling method that extends our previous Accurity model to predict both total (TCN) and allele-specific copy numbers (ASCN) for the tumor genome. Accucopy adopts a tiered Gaussian mixture model coupled with an innovative autocorrelation-guided EM algorithm to find the optimal solution quickly. The Accucopy model utilizes information from both total sequencing coverage and allelic sequencing coverage. Through benchmarks on both simulated and real sequencing samples, we demonstrate that Accucopy is more accurate than existing methods.

Accucopy's main strength is in handling low coverage and/or low tumor-purity samples.

Publication

X Fan, G Luo, YS Huang# (2021) BMC Bioinformatics. Accucopy: Accurate and Fast Inference of Allele-specific Copy Number Alterations from Low-coverage Low-purity Tumor Sequencing Data.

License

The license follows our institute's policy: you can use the program for free as long as you use Accucopy strictly for non-profit research purposes. If you plan to use Accucopy for commercial purposes, a license is required; please contact [email protected] to obtain one.

The full-text of the license is included in the software package.

Get our software

News

  • 2023/6: Fixed the ploidy (must be within 1-4) bug.
  • 2022/3: Commandline argument to allow user to choose which period (TRE histogram) to use.
  • 2020/3: Can handle non-human genomes.
  • 2019/10/22 First release.

Register to receive updates

Please register here to receive updates and the download link (a standalone Accucopy package without dependencies). If you have trouble installing packages described below, use the docker image instead.

Docker image

NOTE Due to the difficulty of running our binary software directly (e.g. no root access to install required libraries, or incompatible libraries), we have made a docker image available on Docker Hub. It contains the latest development version of our software and all dependent libraries. Accucopy inside the image is usually newer than what is downloadable from this website.

  1. Install docker before you do anything below.
  2. Download the ref genome package.
  3. To run it on an HPC cluster, Singularity might be a better fit than docker.
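If your cluster provides Singularity (or Apptainer), a minimal, untested sketch of pulling and entering the same image looks like this (the /home/mydata path is an assumption, mirroring the docker example below):

# Pull the Docker Hub image and convert it to a local SIF file.
singularity pull accucopy.sif docker://polyactis/accucopy

# Open a shell inside the container, binding /home/mydata to /mnt.
singularity shell --bind /home/mydata:/mnt accucopy.sif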

An example docker session:

yh@cichlet:~$ docker pull polyactis/accucopy
Using default tag: latest         
latest: Pulling from polyactis/accucopy
...                           
fd6992ef54e0: Pull complete
Digest: sha256:a6f72af3114ba903f26b60265e10e6f13b8d943d25e740ab0a715d1a99000188
Status: Downloaded newer image for polyactis/accucopy:latest
yh@cichlet:~$ docker images
REPOSITORY                  TAG                 IMAGE ID            CREATED             SIZE
polyactis/accucopy          latest              a11fdb62c5d4        5 months ago        1.04GB

# Get inside the image, without mounting. Useful to just check what's inside the image.
yh@cichlet:~$ docker run -i -t polyactis/accucopy /bin/bash

# Download the reference genome folder (links on this page) into /home/mydata (or any folder)
# Put your bam files into /home/mydata
# Mount /home/mydata to /mnt inside the image
# Get inside the docker image.
yh@cichlet:~$ docker run -i -t -v /home/mydata:/mnt polyactis/accucopy /bin/bash

root@cc7807445e40:/$ cd /usr/local/Accucopy/
/usr/local/Accucopy
root@cc7807445e40:/usr/local/Accucopy$ ls
GADA         maestre    main.py             plot_autocor_diff.py                  plot_snp_maf_peak.py
LICENSE      configure  plot.tre.autocor.R  plot_coverage_after_normalization.py  plot_tre.py
__init__.py  infer      plotCPandMCP.py     plot_snp_maf_exp.py

root@cc7807445e40:/usr/local/Accucopy$ ./main.py
usage: main.py [-h] [-v] -c CONFIGURE_FILEPATH -t TUMOR_BAM -n NORMAL_BAM -o
               OUTPUT_DIR [--snp_output_dir SNP_OUTPUT_DIR] [--clean CLEAN]
               [--segment_stddev_divider SEGMENT_STDDEV_DIVIDER]
               [--snp_coverage_min SNP_COVERAGE_MIN]
               [--snp_coverage_var_vs_mean_ratio SNP_COVERAGE_VAR_VS_MEAN_RATIO]
               [--max_no_of_peaks_for_logL MAX_NO_OF_PEAKS_FOR_LOGL]
               [--nCores NCORES] [-s STEP] [-l LAM] [-d DEBUG] [--auto AUTO]
main.py: error: argument -c/--configure_filepath is required
# modify file "configure" to reflect paths of input data and relevant binaries
root@cc7807445e40:/usr/local/Accucopy$ cat configure 
read_length     101
window_size     500
reference_folder_path   /mnt/hs37d5
samtools_path   /usr/local/bin/samtools
caller_path     /usr/local/strelka
binary_folder   /usr/local/Accucopy

root@cc7807445e40:/usr/local/Accucopy$ ls -l /usr/local/bin/
total 11640
-rwxrwxr-x  1 root root 4436160 Jul  7  2018 samtools*
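Alternatively, the whole workflow can be launched non-interactively in a single docker command. This is only a sketch; the file names under /mnt are assumptions and presume /home/mydata is mounted as shown above:

# Run main.py directly inside the container (hypothetical input/output names).
docker run -v /home/mydata:/mnt polyactis/accucopy \
    /usr/local/Accucopy/main.py -c /mnt/configure \
    -t /mnt/sample_1_cancer.bam -n /mnt/sample_1_normal.bam -o /mnt/sample1_output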

Install Accucopy and all its dependencies

Prerequisites

  • A computer with at least 32GB of memory (recommend 64GB).
  • Strelka2. A variant caller that is used to call SNPs.
  • Python
  • matplotlib
  • numpy
  • pandas
  • Pyflow
  • samtools
  • libbz2-1.0 (a high-quality block-sorting file compressor library; install it via "apt install libbz2-1.0" on Debian/Ubuntu)
  • If your OS (e.g. CentOS) has this library installed but Accucopy still fails to load it, you can symlink the installed library file to "libbz2.so.1.0" (see the sketch after this list).
  • libgsl2
  • liblzma5 (XZ-format compression library)
  • libssl1.0.0
  • libboost-program-options1.58.0
  • libboost-iostreams1.58.0
  • libhdf5-dev
  • (Only for building from source) pkg-config: used by the Rust compiler to find library paths, e.g. "pkg-config --libs --cflags openssl".
  • (Optional) R packages ggplot2, grid, scales. Only needed if you obtain a development version of Accucopy; required to make one R plot.
    • The R plot is NOT a must-have; one Python plot has similar content.
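For the libbz2 symlink workaround mentioned above, a hedged sketch (the /usr/lib64/libbz2.so.1 path is an assumption for a CentOS-like system; locate the actual library file on your system first):

ls /usr/lib64/libbz2.so*                                      # find the installed libbz2 file
sudo ln -s /usr/lib64/libbz2.so.1 /usr/lib64/libbz2.so.1.0    # create the name Accucopy expects
sudo ldconfig                                                 # refresh the dynamic linker cache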

Running Accucopy requires a project-specific configure file (details below). Adjust it according to your OS environment.

Install pyflow and other Python packages

git clone https://github.com/Illumina/pyflow.git pyflow
cd pyflow/pyflow
python setup.py build install

Other Python packages can be installed through the Python package manager ("pip install ...") or the Ubuntu package system (dpkg/apt-get).
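For example, one way to install the remaining Python dependencies with pip (pin versions if your environment requires it):

pip install matplotlib numpy pandas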

Register to download the Accucopy binary package and receive update emails

Please register here to receive an email that contains a download link. After the download finishes, unpack the package:

tar -xvzf Accucopy.tar.gz

The Accucopy package contains a few binary executables and R/Python scripts. All binary executables were compiled for the Linux platform (Ubuntu 18 tested). The package also contains a sample configure file. Set the full path of the Accucopy folder as accucopy_path in the configure file (described below).

NOTE

  1. If you are having difficulty in getting Accucopy to work, please use the docker image instead.
  2. This binary package is older than the docker release.

Compile source code (for advanced users)

Instead of downloading the binary package, you can also choose to compile the source code. Be forewarned: you may run into problems (missing packages, wrong paths, etc.) when compiling the C++ portion on non-Ubuntu platforms. Compiling the Rust portion is relatively easy.

Compiling Accucopy requires the "lib..." packages mentioned above plus their corresponding development packages (for example, libbz2-dev). In addition, it requires an installation of Rust, https://www.rust-lang.org/. We have compiled successfully on Ubuntu 16.04 and 18.04.

cd src_o

# to get a debug version (recommended)
make debug

# to get a release version
make release

The differences between the debug and release versions:

  • The debug version will contain a Rust binary that can print a stack trace in case an error happens. It is only slightly slower than the release version Rust binary.
  • The debug version will output more diagnostic plots.

NOTE

  1. The public source code on github is older than the development version in the docker image. We advise users to use the latest version that is encapsulated in the docker image.

Download a reference genome

Accucopy requires a reference genome folder to run. We provide two different versions of human reference genomes, hs37d5 and hs38d1.

We recommend that users re-align reads against one of our pre-packaged human genomes in order to minimize any unexpected errors. However, if your reference genome is not human or differs slightly (e.g. a different hs38 variant) from our pre-packaged ones (and you do not want to re-align), you can make a new reference genome folder by following the instructions below.

The snp file inside the ref genome zip file contains coordinates of common (allele frequency >10%) SNPs from the 1000 Genomes Project. The chromosome coordinates are denoted as "chr1", not "1". We advise users to align reads against the genome file included in the package when re-generating their bam files, in order to minimize wrong alignments and, more importantly, to match the coordinates of the 1000 Genomes SNP file.

  • hs38d1.7z (714MB, NCBI hs38 is equivalent to UCSC hg38)
  • hs37d5.7z (718MB, NCBI hs37 is equivalent to UCSC hg19)

We use the 7z compressor. Run 7z x hs38d1.7z to extract all files.
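If the 7z command is not available, one way to obtain it on Debian/Ubuntu and extract the package (shown for hs38d1; use hs37d5.7z analogously):

sudo apt install p7zip-full    # provides the 7z command
7z x hs38d1.7z                 # extracts the reference genome folder into the current directory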

Make your own reference genome package

Accucopy can also work on non-human genomes. Here are the instructions to make a custom reference genome folder. This folder should contain these files:

  1. genome.fa: the fasta file of the reference genome.
  2. genome.fa.fai: the index file of genome.fa by "samtools faidx".
  3. genome.dict: the chromosome:length dictionary file generated by Picard CreateSequenceDictionary.
  4. snp_sites.gz: the common SNP file in the bed format, 3 columns:
chr1    14598   14599
chr1    14603   14604
chr1    14929   14930
  5. snp_sites.gz.tbi: the index file of snp_sites.gz generated by tabix.

Make sure the chromosome coordinates are denoted as "chr1", not "1". Non-numerically-labelled chromosomes (e.g. X, Y) are ignored by our software.
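A sketch of the commands for building such a folder, assuming samtools, Picard, bgzip and tabix are on your PATH; the input file name snp_sites.bed is hypothetical, and your Picard installation may require the "java -jar picard.jar ..." form instead:

samtools faidx genome.fa                                     # creates genome.fa.fai
picard CreateSequenceDictionary R=genome.fa O=genome.dict    # chromosome:length dictionary
bgzip -c snp_sites.bed > snp_sites.gz                        # compress the 3-column SNP bed file
tabix -p bed snp_sites.gz                                    # creates snp_sites.gz.tbi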

Run Accucopy

Input bam files

For example, suppose you have a pair of matched tumor and normal samples:

  • sample_1_cancer.bam
  • sample_1_cancer.bam.bai
  • sample_1_normal.bam
  • sample_1_normal.bam.bai

.bam.bai files (bam index) are not required. Accucopy will call samtools to generate them if they are missing.
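If you prefer to build the indexes yourself beforehand, for example:

samtools index sample_1_cancer.bam    # creates sample_1_cancer.bam.bai
samtools index sample_1_normal.bam    # creates sample_1_normal.bam.bai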

Setup the configure file (latest format, as in the docker version)

Copy the sample configure file (tab-delimited) from the Accucopy package into your project folder and modify it accordingly. An example looks like this:

read_length     101
window_size     500
reference_folder_path   /mnt/hs38d1
samtools_path   /usr/local/bin/samtools
caller_path    /usr/local/strelka
accucopy_path   /usr/local/Accucopy
  • read_length: read length in base pairs.
  • window_size: the window size in base pairs for segmentation. The segmentation program (GADA) first calculates the number of reads in each window and then performs segmentation over the genome. A small window size often leads to a large number of small segments. The recommended window size is 500bp.
  • reference_folder_path: the path of the genome folder. Two human versions are downloadable from this site. You can make a custom one.
  • samtools_path: the path of the samtools binary
  • caller_path: the path of the 3rd-party variant calling program. We use Strelka2. This is the path of the folder that contains all Strelka2 code/executables, i.e. /usr/local/strelka.
  • accucopy_path: the path of the Accucopy software.
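Before launching a run, a quick sanity check that the configured paths exist can save time. A sketch using the example paths above (adjust to your own configure file):

ls /mnt/hs38d1/genome.fa /usr/local/bin/samtools /usr/local/Accucopy/main.py
ls -d /usr/local/strelka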

Example commandlines

Accucopy consists of several binary executables. To make everything easy, we have written a Python program, main.py (inside the "Accucopy" folder), which wraps all binary executables in a workflow.

./main.py -h gives you an explanation of all the arguments:

yh@hello:~/Accucopy$ ./main.py  -h
usage: main.py [-h] -c CONFIGURE_FILEPATH -t TUMOR_BAM -n NORMAL_BAM -o
               OUTPUT_DIR [--snp_output_dir SNP_OUTPUT_DIR] [--clean CLEAN]
               [--segment_stddev_divider SEGMENT_STDDEV_DIVIDER]
               [--snp_coverage_min SNP_COVERAGE_MIN]
               [--snp_coverage_var_vs_mean_ratio SNP_COVERAGE_VAR_VS_MEAN_RATIO]
               [--max_no_of_peaks_for_logL MAX_NO_OF_PEAKS_FOR_LOGL]
               [--nCores NCORES] [-s STEP] [-l LAM] [-d DEBUG] [--auto AUTO]

optional arguments:
  -h, --help            show this help message and exit
  -c CONFIGURE_FILEPATH, --configure_filepath CONFIGURE_FILEPATH
                        the path to the configure file.
  -t TUMOR_BAM, --tumor_bam TUMOR_BAM
                        the path to the tumor bam file. If the bam is not
                        indexed, an index file will be generated
  -n NORMAL_BAM, --normal_bam NORMAL_BAM
                        the path to the normal bam file. If the bam is not
                        indexed, an index file will be generated
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        the output directory path.
  --snp_output_dir SNP_OUTPUT_DIR
                        the directory to hold the SNP calling output. Default
                        is the same folder as the bam file.
  --clean CLEAN         whether to remove the existing output folders and
                        files? 0 No, 1 Yes. Default is 0.
  --segment_stddev_divider SEGMENT_STDDEV_DIVIDER
                        A factor that reduces the segment noise level. The
                        default value is recommended. Default is 20.
  --snp_coverage_min SNP_COVERAGE_MIN
                        the minimum SNP coverage in adjusting the expected SNP
                        MAF. Default is 2.
  --snp_coverage_var_vs_mean_ratio SNP_COVERAGE_VAR_VS_MEAN_RATIO
                        Instead of using the observed SNP coverage variance
                        (not consistent), use coverage_mean X this-parameter
                        as the variance for the negative binomial model which
                        is used in adjusting the expected SNP MAF. Default is
                        10.
  --max_no_of_peaks_for_logL MAX_NO_OF_PEAKS_FOR_LOGL
                        the maximum number of peaks used in the log likelihood
                        calculation. The final logL is average over the number
                        of peaks used. Default is 3
  --nCores NCORES       the max number of CPUs to use in parallel. Increase
                        the number if you have many cores. Default is 2.
  -s STEP, --step STEP  0: start from the very begining (Default).
                        1: obtain the read positions and the major allele fractions.
                        2: normalization. 3: segmentation. 4: infer purity and
                        ploidy only.
  -l LAM, --lam LAM     lambda for the segmentation algorithm. Default is 4.
  -d DEBUG, --debug DEBUG
                        Set debug value. Default is 0, which means no debug
                        output. Anything >0 enables several plots being made.
  --auto AUTO           The integer-valued argument that decides which method
                        to use to detect the period in the read-count ratio
                        histogram. 0: the simple auto-correlation method. 1: a
                        GADA-based algorithm (recommended). Default is 1.

In the debug mode (-d 1), Accucopy will produce several intermediate plots, offering insights into how well it is handling the input data.

Run Accucopy from scratch given two input bam files, use 30 cores, output to folder sample1_output, and enable debug mode:

./main.py -c configure_file --nCores 30 -t sample_1_cancer.bam -n sample_1_normal.bam -o sample1_output -d 1

Resume Accucopy from step 2 and change the SNP output folder (the default was the bam file folder):

./main.py -c configure_file --nCores 20 -t sample_1_cancer.bam -n sample_1_normal.bam -o sample1_output --snp_output_dir sample1_output -d 1 --step 2

Overwrite all previous output:

./main.py -c configure_file -t sample_1_cancer.bam -n sample_1_normal.bam -o sample1_output -d 1 --clean 1
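To process several tumor/normal pairs, a simple batch loop is one option; the sample names and file layout below are hypothetical and should be adapted to your project:

for s in sample_1 sample_2 sample_3; do
    ./main.py -c configure_file --nCores 30 \
        -t ${s}_cancer.bam -n ${s}_normal.bam \
        -o ${s}_output -d 1
done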

Simulated data for testing

We generated in silico tumor and matching-normal WGS data using an EAGLE-based workflow. EAGLE is software developed by Illumina to mimic their own high-throughput DNA sequencers, and the simulated reads bear characteristics that are close to real sequencing reads. We introduced twenty-one somatic copy number alterations (SCNAs), with length varying from 5MB to 135MB and copy number from 0 to 8, affecting about 28% of the genome, to each simulated tumor genome. The entire genome of its matching normal sample is of copy number two. A total of 1.8 million heterozygous single-nucleotide loci were introduced to each normal and its matching tumor sample. For each coverage setting, we first generated a pure tumor sample (purity=1.0) and its matching normal sample. We then generated nine different impure tumor samples (purity from 0.1 to 0.9) by mixing data of the pure tumor sample with its matching normal data. The mixing proportion determines the tumor sample's true purity.

Due to space constraints on our public server, we can only provide simulation data of tumor samples that contain only one subclone. If you need simulated data with more subclones or HCC1187 mixed data, please contact us.

Here is the HTTP directory listing of one simulated normal sample (coverage=5X) and nine simulated tumor samples (coverage=2X), with purity from 0.1 to 0.9, for researchers to test their own methods.

The true CNA profile of all tumor samples is identical (that's why it's called singleclone, no other subclones) and can be downloaded from https://www.yfish.org/data/singleclone_2x/CNA_truth.tsv. Here is what each column means.

  • Column cp is the total copy number of the designated region.
  • Column major_allele_cp is the copy number of the major allele. The copy number of the minor allele = cp - major_allele_cp.
  • Please ignore the IsClonal column.
  • Regions not included in the file are of copy number two.
chr     start   end     cp      major_allele_cp IsClonal
chr1    200000000       240000000       3       2       T
chr2    20000000        25010000        4       3       T
chr2    40000000        45000000        1       1       T
chr3    50000000        100000000       8       7       T
chr4    40000000        80000000        1       1       T
chr10   30000000        45000000        2       2       T
chr20   3000000 45000000        3       3       T
chr15   20000000        40000000        7       6       T
chr17   10000000        60000000        5       4       T
chr8    10000000        145000000       3       2       T
chr1    135000000       185000000       5       3       T
chr2    80000000        130000000       4       2       T
chr5    30000000        75000000        6       4       T
chr6    50000000        90000000        6       3       T
chr7    90000000        130000000       7       5       T
chr12   25000000        65000000        7       4       T
chr11   80000000        130000000       4       4       T
chr12   85000000        120000000       5       5       T
chr13   25000000        65000000        6       6       T
chr14   30000000        45000000        7       7       T
chr15   50000000        90000000        8       8       T

Accucopy output

These are the output files that matter.

  • infer.out.tsv
    • A summary output that contains purity and ploidy estimates, and some other statistical measures. Probably the most important file to a user.
purity  ploidy
0.66735 2.0612
logL    period  best_no_of_copy_nos_bf_1st_peak first_peak_int
9.9811e+06      327     2       980
no_of_segments  no_of_segments_used     no_of_snps      no_of_snps_used
539     539     1333539 933576
  • infer.out.details.tsv
    • This contains lots of internal model output, useful for developers.
  • cnv.output.tsv
    • This contains preliminary copy number alteration predictions.
    • The important columns are chr, start, end.
    • "cp" is the predicted copy number.
    • "copy_no_float" is the raw copy number outputted by our program, which will be converted to an integer (the "cp" column) if our model deems it a clonal (shared by all cancer cells) CNV. Some "cp" will stay as "float" because our model thinks they are subclonal (some cancer cells in one CNV state, some cancer cells in another).
chr     start      end        cp      major_allele_cp copy_no_float   oneSegment.stddev       maf_mean        maf_stddev      maf_expected   cumu_start   cumu_end
5       8215001 8363001       2       1       1.76147 0.00987896      0.622568        0.0685194       0.632948        895215001       895363001
3       16591001        16751001       2       1       1.76758 0.00891515      0.62856 0.063834        0.632948        511591001       511751001
...
  • plot.cnv.png

    • A genome-wide CNV plot.
  • plot.tre.jpg

    • A period plot. Check if the model fits data well.
  • plot.tre.autocor.jpg

    • A plot for developers.
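As a small post-processing sketch (assuming the cnv.output.tsv column layout shown above, with cp in the 4th column), you can keep only the segments whose predicted copy number deviates from the normal value of 2:

awk 'NR==1 || $4 != 2' cnv.output.tsv > cnv.altered.tsv    # keep the header plus altered segments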

A clean-data example

All major results are stored in the output directory. File sample_1_infer/infer.out.tsv contains the purity and ploidy estimates. Here is an example (viewing it in Excel is a lot nicer):

purity        ploidy       purity_naive     ploidy_naive    rc_ratio_of_cp_2      rc_ratio_of_cp_2_corrected      segment_stddev_divider         snp_maf_stddev_divider  snp_coverage_min   snp_coverage_var_vs_mean_ratio    period_discover_run_type
0.7246      2.2428      0.72078    2.2282      924  919.13      10     20     2       10     1
logL period       best_no_of_copy_nos_bf_1st_peak first_peak_int
1.5204e+07      333  1       585
no_of_segments       no_of_segments_used      no_of_snps       no_of_snps_used
697  697  1517851  1062619
  • In the output above, the column ‘purity’ is the final purity estimate, and ‘purity_naive’ is the pre-adjusted estimate, which can be ignored. ‘logL’ is the maximum likelihood of the hierarchical Gaussian mixture model. ‘period’ is 1000 × the period of the tumor-read-enrichment (TRE) histogram (=333 in this case), which is detected by autocorrelation. 'no_of_snps' and ‘no_of_segments’ are the results of step 3 and step 4, respectively. The other columns are values of commandline arguments.
  • There are other important output files, such as all_segments.tsv.gz and het_snp.tsv.gz, which are the output of step 4 and step 3, respectively. If the sample is abnormal, we usually see an unreasonable number of segments and SNPs in these two files.
  • Besides the text output, Accucopy will produce some graphic output. One of the most important plots is plot.tre.png, available only in debug mode (-d 1):

TRE histogram of a clean sample

  • TRE stands for Tumor Read Enrichment. You can think of it as a normalized version of the read count ratio between the tumor and normal samples for one chromosome window. More details can be found in our paper. The Y axis in the two panels is the window count. The lower panel is in the log scale. A clean TRE histogram leads to a confident purity estimate.

  • In this clean-data example, the tumor read enrichment (TRE) histogram displays a beautiful periodic pattern. That means we can confidently infer the period (=0.333) from the TRE data, and the ensuing maximum likelihood estimates will be more robust. The CNV estimates in plot.cnv.png also demonstrate a clean copy number variation (CNV) profile.

  • plot.cnv.png:

CNV results of a clean sample

This is the estimated CNV profile for the example. The top plot is the estimated absolute copy number for each segment. For a normal sample, the absolute copy number should be 2 throughout the genome. The lower plot shows the major allele copy number for each segment.

Cases where Accucopy fails to infer purity and ploidy:

  1. The cancer genome contains too few somatic copy number alterations.
  2. The noise level is too high, or the noise level is moderate but the sample purity is very low (<0.05).

A noisy-data example

Occasionally, a user will encounter extremely noisy data. The user should learn to identify noisy data from the plots and should NOT use the estimates made by Accucopy in that case. Here is a noisy-data example.

Content of infer.out.tsv for a noisy-data example. The high number of segments is a red flag.

purity        ploidy       purity_naive     ploidy_naive    rc_ratio_of_cp_2      rc_ratio_of_cp_2_corrected         segment_stddev_divider   snp_maf_stddev_divider  snp_coverage_min   snp_coverage_var_vs_mean_ratio         period_discover_run_type
0.90938    1.9375      0.91675    1.9551      1021 1029.3      10     20     2       10     2
logL period       best_no_of_copy_nos_bf_1st_peak first_peak_int
4.3681e+06      468  1       555
no_of_segments       no_of_segments_used      no_of_snps       no_of_snps_used
19909       19909       1559676  1092048

Its tumor-read-enrichment (TRE) histogram (plot.tre.png) has one big, unclean peak (its landscape looks as if it was cut through by a lousy jigsaw), which makes it really difficult to estimate its period accurately. The period estimate (0.468; the 468 in the 2nd cell of the 4th line is 1000 × the period) is probably far from the truth. All ensuing maximum likelihood estimates are questionable. The estimated CNV profile further confirms the great amount of noise in this data.

TRE histogram of a noisy sample

  • plot.cnv.png:

CNV results of a noisy sample

Feedback

If you encounter any issues, please email [email protected] or file an issue at https://github.com/polyactis/Accucopy/issues (so that everyone can learn).

FAQ

"At genomic regions of the first type, all cancer subclones have the same integral copy number....we call these regions clonal." Why is the copy number integral?

Integer copy number estimates, like 1, 2, 3, are easy to understand. But some regions are of fractional (2.3, 3.5) copy numbers because:

  1. The sequencing data is a mixture of thousands or even millions of cells, which is called bulk sequencing (not single-cell sequencing).
  2. A tumor is usually not homogeneous, so some cancer cells differ from others in terms of copy number. For example, in one region, 50% of cells are of copy number 2 and the other 50% are of copy number 3. Then the region as a whole will appear to have copy number 2.5. These regions are called subclonal.

"How does Accucopy deal with multiple subclones(>2)?"

Accucopy does not estimate the number (or fractions) of cancer subclones. Accucopy tells the user whether a region is clonal or subclonal, and gives the corresponding copy number estimates.

Please note. "subclone" and "subclonal" are referring to different things.

  • "subclone" or "clone" refers to a lineage of cancer cells during the cancer cell evolution process.
  • "subclonal" or "clonal" is referring to mutations. "Subclonal" mutations are the ones that lead to a type of cancer subclones on the evolutionary branch, and thus these are the mutations that not shared across all cancer cells in the tumor sample. "Clonal" mutations are the ones that are shared across all cancer cells.

accucopy's People

Contributors

fanxinping, polyactis


accucopy's Issues

Empty Output

Dear author,
I am running Accucopy on a batch of low-coverage WGS data. All raw data were mapped with BWA (GRCh37, provided in document chapter 3.7) and duplicates were marked with Picard.
Some of the bams have ideal results, while the others have empty output in infer.out.tsv and infer.out.details.tsv.

the last 20 lines of infer.status.txt are shown below,
_segment_stddev_divider=20 _snp_maf_stddev_divider=20 _snp_covearge_min=2 _snp_coverage_var_vs_mean_ratio=10 _no_of_peaks_for_logL=3
Reading SNPs from /gpfs/share/home/1601111669/WGS_15/accucopy_result//MLPS_ZJW_C/het_snp.tsv.gz ... 22 chromosomes, 423742 SNPs, 570146 lines.
Reading in segments from /gpfs/share/home/1601111669/WGS_15/accucopy_result//MLPS_ZJW_C/all_segments.tsv.gz ...
Outputting segment ratio data to /gpfs/share/home/1601111669/WGS_15/accucopy_result//MLPS_ZJW_C/rc_ratio_window_count_smoothed.tsv...Done.
Outputting segment ratio data to /gpfs/share/home/1601111669/WGS_15/accucopy_result//MLPS_ZJW_C/rc_ratio_no_of_windows_by_chr.tsv...Done.
32 segments. 32 segments used. 423717 SNPs used.
Calculating auto correlation ...Done.
Outputting SNP logORs by segments to /gpfs/share/home/1601111669/WGS_15/accucopy_result//MLPS_ZJW_C/snp_logOR_by_segment.tsv ... 32 segments.
Calculating auto correlation shift-1 difference ... #mean is: 0, sigma is: 0.0010521 Done.
Inferring candidate periods through GADA, run_type=1, left_x=-0.000266546, right_x=0.000266546 ... Initiating GADA instance ...GADA done
Found 0 candidate periods.
ERROR: No candidate period discovered.

Could you please kindly suggest why GADA found 0 candidate periods and whether there's something I can do?

Best Regards,
Junyi

Segmentation Fault

Hi! I am running into a segmentation fault. Why might this be happening?

Here is the infer.status.txt file:

Reading in genome coverage from "/mnt/CT432.bam" ...
New chromosome chr1, length=248956422, window size=500, no_of_windows=497913.
59661040 reads so far for "/mnt/CT432.bam". Chromosome chr1 contains 0 valid fragments.
New chromosome chr10, length=133797422, window size=500, no_of_windows=267595.
81042310 reads so far for "/mnt/CT432.bam". Chromosome chr10 contains 0 valid fragments.
New chromosome chr11, length=135086622, window size=500, no_of_windows=270174.
110417527 reads so far for "/mnt/CT432.bam". Chromosome chr11 contains 0 valid fragments.
New chromosome chr12, length=133275309, window size=500, no_of_windows=266551.
139816736 reads so far for "/mnt/CT432.bam". Chromosome chr12 contains 0 valid fragments.
New chromosome chr13, length=114364328, window size=500, no_of_windows=228729.
161439829 reads so far for "/mnt/CT432.bam". Chromosome chr13 contains 0 valid fragments.
New chromosome chr14, length=107043718, window size=500, no_of_windows=214088.
181255065 reads so far for "/mnt/CT432.bam". Chromosome chr14 contains 0 valid fragments.
New chromosome chr15, length=101991189, window size=500, no_of_windows=203983.
199363069 reads so far for "/mnt/CT432.bam". Chromosome chr15 contains 0 valid fragments.
New chromosome chr16, length=90338345, window size=500, no_of_windows=180677.
218150818 reads so far for "/mnt/CT432.bam". Chromosome chr16 contains 0 valid fragments.
New chromosome chr17, length=83257441, window size=500, no_of_windows=166515.
235637677 reads so far for "/mnt/CT432.bam". Chromosome chr17 contains 0 valid fragments.
New chromosome chr18, length=80373285, window size=500, no_of_windows=160747.
253134738 reads so far for "/mnt/CT432.bam". Chromosome chr18 contains 0 valid fragments.
New chromosome chr19, length=58617616, window size=500, no_of_windows=117236.
265860905 reads so far for "/mnt/CT432.bam". Chromosome chr19 contains 0 valid fragments.
New chromosome chr2, length=242193529, window size=500, no_of_windows=484388.
319870334 reads so far for "/mnt/CT432.bam". Chromosome chr2 contains 0 valid fragments.
New chromosome chr20, length=64444167, window size=500, no_of_windows=128889.
333674604 reads so far for "/mnt/CT432.bam". Chromosome chr20 contains 0 valid fragments.
New chromosome chr21, length=46709983, window size=500, no_of_windows=93420.
343433771 reads so far for "/mnt/CT432.bam". Chromosome chr21 contains 0 valid fragments.
New chromosome chr22, length=50818468, window size=500, no_of_windows=101637.
351723849 reads so far for "/mnt/CT432.bam". Chromosome chr22 contains 0 valid fragments.
New chromosome chr3, length=198295559, window size=500, no_of_windows=396592.
396184917 reads so far for "/mnt/CT432.bam". Chromosome chr3 contains 0 valid fragments.
New chromosome chr4, length=190214555, window size=500, no_of_windows=380430.
441143709 reads so far for "/mnt/CT432.bam". Chromosome chr4 contains 0 valid fragments.
New chromosome chr5, length=181538259, window size=500, no_of_windows=363077.
482509996 reads so far for "/mnt/CT432.bam". Chromosome chr5 contains 0 valid fragments.
New chromosome chr6, length=170805979, window size=500, no_of_windows=341612.
518332470 reads so far for "/mnt/CT432.bam". Chromosome chr6 contains 0 valid fragments.
New chromosome chr7, length=159345973, window size=500, no_of_windows=318692.
562631614 reads so far for "/mnt/CT432.bam". Chromosome chr7 contains 0 valid fragments.
New chromosome chr8, length=145138636, window size=500, no_of_windows=290278.
594804392 reads so far for "/mnt/CT432.bam". Chromosome chr8 contains 0 valid fragments.
New chromosome chr9, length=138394717, window size=500, no_of_windows=276790.
653668784 reads so far for "/mnt/CT432.bam". Chromosome chr9 contains 0 valid fragments.
Reading and smoothing of coverage from "/mnt/CT432.bam" is Done. 22 unique chromosomes, 653668784 reads.
Genome wide mean coverage is 0
Reading in genome coverage from "/mnt/CT434.bam" ...
New chromosome chr1, length=248956422, window size=500, no_of_windows=497913.
21543443 reads so far for "/mnt/CT434.bam". Chromosome chr1 contains 0 valid fragments.
New chromosome chr10, length=133797422, window size=500, no_of_windows=267595.
33519753 reads so far for "/mnt/CT434.bam". Chromosome chr10 contains 0 valid fragments.
New chromosome chr11, length=135086622, window size=500, no_of_windows=270174.
45224913 reads so far for "/mnt/CT434.bam". Chromosome chr11 contains 0 valid fragments.
New chromosome chr12, length=133275309, window size=500, no_of_windows=266551.
57193479 reads so far for "/mnt/CT434.bam". Chromosome chr12 contains 0 valid fragments.
New chromosome chr13, length=114364328, window size=500, no_of_windows=228729.
65967040 reads so far for "/mnt/CT434.bam". Chromosome chr13 contains 0 valid fragments.
New chromosome chr14, length=107043718, window size=500, no_of_windows=214088.
73940039 reads so far for "/mnt/CT434.bam". Chromosome chr14 contains 0 valid fragments.
New chromosome chr15, length=101991189, window size=500, no_of_windows=203983.
81214655 reads so far for "/mnt/CT434.bam". Chromosome chr15 contains 0 valid fragments.
New chromosome chr16, length=90338345, window size=500, no_of_windows=180677.
88145461 reads so far for "/mnt/CT434.bam". Chromosome chr16 contains 0 valid fragments.
New chromosome chr17, length=83257441, window size=500, no_of_windows=166515.
95287511 reads so far for "/mnt/CT434.bam". Chromosome chr17 contains 0 valid fragments.
New chromosome chr18, length=80373285, window size=500, no_of_windows=160747.
102166467 reads so far for "/mnt/CT434.bam". Chromosome chr18 contains 0 valid fragments.
New chromosome chr19, length=58617616, window size=500, no_of_windows=117236.
106899655 reads so far for "/mnt/CT434.bam". Chromosome chr19 contains 0 valid fragments.
New chromosome chr2, length=242193529, window size=500, no_of_windows=484388.
128578844 reads so far for "/mnt/CT434.bam". Chromosome chr2 contains 0 valid fragments.
New chromosome chr20, length=64444167, window size=500, no_of_windows=128889.
133963433 reads so far for "/mnt/CT434.bam". Chromosome chr20 contains 0 valid fragments.
New chromosome chr21, length=46709983, window size=500, no_of_windows=93420.
137937298 reads so far for "/mnt/CT434.bam". Chromosome chr21 contains 0 valid fragments.
New chromosome chr22, length=50818468, window size=500, no_of_windows=101637.
141126722 reads so far for "/mnt/CT434.bam". Chromosome chr22 contains 0 valid fragments.
New chromosome chr3, length=198295559, window size=500, no_of_windows=396592.
159234832 reads so far for "/mnt/CT434.bam". Chromosome chr3 contains 0 valid fragments.
New chromosome chr4, length=190214555, window size=500, no_of_windows=380430.
177053732 reads so far for "/mnt/CT434.bam". Chromosome chr4 contains 0 valid fragments.
New chromosome chr5, length=181538259, window size=500, no_of_windows=363077.
193446228 reads so far for "/mnt/CT434.bam". Chromosome chr5 contains 0 valid fragments.
New chromosome chr6, length=170805979, window size=500, no_of_windows=341612.
209476810 reads so far for "/mnt/CT434.bam". Chromosome chr6 contains 0 valid fragments.
New chromosome chr7, length=159345973, window size=500, no_of_windows=318692.
223320566 reads so far for "/mnt/CT434.bam". Chromosome chr7 contains 0 valid fragments.
New chromosome chr8, length=145138636, window size=500, no_of_windows=290278.
236709378 reads so far for "/mnt/CT434.bam". Chromosome chr8 contains 0 valid fragments.
New chromosome chr9, length=138394717, window size=500, no_of_windows=276790.
263020395 reads so far for "/mnt/CT434.bam". Chromosome chr9 contains 0 valid fragments.
Reading and smoothing of coverage from "/mnt/CT434.bam" is Done. 22 unique chromosomes, 263020395 reads.
Genome wide mean coverage is 0
Outputting normalized coverage ratio of chr1 ... Output done.
Outputting normalized coverage ratio of chr2 ... Output done.
Outputting normalized coverage ratio of chr3 ... Output done.
Outputting normalized coverage ratio of chr4 ... Output done.
Outputting normalized coverage ratio of chr5 ... Output done.
Outputting normalized coverage ratio of chr6 ... Output done.
Outputting normalized coverage ratio of chr7 ... Output done.
Outputting normalized coverage ratio of chr8 ... Output done.
Outputting normalized coverage ratio of chr9 ... Output done.
Outputting normalized coverage ratio of chr10 ... Output done.
Outputting normalized coverage ratio of chr11 ... Output done.
Outputting normalized coverage ratio of chr12 ... Output done.
Outputting normalized coverage ratio of chr13 ... Output done.
Outputting normalized coverage ratio of chr14 ... Output done.
Outputting normalized coverage ratio of chr15 ... Output done.
Outputting normalized coverage ratio of chr16 ... Output done.
Outputting normalized coverage ratio of chr17 ... Output done.
Outputting normalized coverage ratio of chr18 ... Output done.
Outputting normalized coverage ratio of chr19 ... Output done.
Outputting normalized coverage ratio of chr20 ... Output done.
Outputting normalized coverage ratio of chr21 ... Output done.
Outputting normalized coverage ratio of chr22 ... Output done.
program name is /usr/local/Accucopy/GADA.
Reading data from /mnt/output_CT432_CT434/chr14.ratio.w500.csv.gz ... program name is /usr/local/Accucopy/GADA.
Reading data from /mnt/output_CT432_CT434/chr22.ratio.w500.csv.gz ... 0 data points for chromosome chr14.
Running SBLandBE ...
0 data points for chromosome chr22.
Running SBLandBE ...

Using whole exome sequencing data

Hi,

Great tool and documentation! I've read on your documentation page that the tool might work with whole exome data. Could you please provide more information on that? Is it possible and is it recommended?

Thank you a lot in advance!

probalica

unable to infer tumour purity

Hi there,
Hope you're well! I am running accucopy on multiple paired tumour-normal WGS samples. It has worked perfectly on most of them, except for one sample where the infer_out.txt file is empty. I have checked the infer.status.txt file and the last lines show this error: ### Best period from likelihood: 0
best_purity: -1
best_ploidy: -1
Q: -1
logL: 0
best_no_of_copy_nos_bf_1st_peak: 0
first_peak_int: 0
ERROR: logL 0<=0 or best_purity -1 <=0!

Please could you let me know what this means? I am reading it as it is returning the purity to be 0%, but this would be impossible.

Thanks very much

[ERROR] [normalize] Error Message

Hi @polyactis,

First, thank you for this great tool.

I aligned my reads to the reference genome included in hs38d1 provided by Accucopy and sorted the bam files with duplicates marked. However, I am having the error as shown below when trying to run the main.py file.

Please let me know if you have any suggestions! Thank you very much!

[2022-04-28T20:45:05.792488] [1d46eeca3e55] [8_1] [WorkflowRunner] Adding command task 'segment_chr4' to master workflow
  [2022-04-28T20:45:05.792738] [1d46eeca3e55] [8_1] [WorkflowRunner] Adding command task 'segment_chr5' to master workflow
  [2022-04-28T20:45:05.792866] [1d46eeca3e55] [8_1] [WorkflowRunner] Adding command task 'segment_chr6' to master workflow
  [2022-04-28T20:45:05.792992] [1d46eeca3e55] [8_1] [WorkflowRunner] Adding command task 'segment_chr7' to master workflow
  [2022-04-28T20:45:05.793113] [1d46eeca3e55] [8_1] [WorkflowRunner] Adding command task 'segment_chr8' to master workflow
  [2022-04-28T20:45:05.793232] [1d46eeca3e55] [8_1] [WorkflowRunner] Adding command task 'segment_chr9' to master workflow
  [2022-04-28T20:45:05.793362] [1d46eeca3e55] [8_1] [WorkflowRunner] Adding command task 'segment_chr10' to master workflow
  [2022-04-28T20:45:05.793481] [1d46eeca3e55] [8_1] [WorkflowRunner] Adding command task 'segment_chr11' to master workflow
  [2022-04-28T20:45:05.793596] [1d46eeca3e55] [8_1] [WorkflowRunner] Adding command task 'segment_chr12' to master workflow
  [2022-04-28T20:45:05.794122] [1d46eeca3e55] [8_1] [WorkflowRunner] Adding command task 'segment_chr13' to master workflow
  [2022-04-28T20:45:05.794601] [1d46eeca3e55] [8_1] [WorkflowRunner] Adding command task 'segment_chr14' to master workflow
  [2022-04-28T20:45:05.795005] [1d46eeca3e55] [8_1] [WorkflowRunner] Adding command task 'segment_chr15' to master workflow
  [2022-04-28T20:45:05.795131] [1d46eeca3e55] [8_1] [WorkflowRunner] Adding command task 'segment_chr16' to master workflow
  [2022-04-28T20:45:05.795248] [1d46eeca3e55] [8_1] [WorkflowRunner] Adding command task 'segment_chr17' to master workflow
  [2022-04-28T20:45:05.795368] [1d46eeca3e55] [8_1] [WorkflowRunner] Adding command task 'segment_chr18' to master workflow
  [2022-04-28T20:45:05.795483] [1d46eeca3e55] [8_1] [WorkflowRunner] Adding command task 'segment_chr19' to master workflow
  [2022-04-28T20:45:05.795597] [1d46eeca3e55] [8_1] [WorkflowRunner] Adding command task 'segment_chr20' to master workflow
  [2022-04-28T20:45:05.795710] [1d46eeca3e55] [8_1] [WorkflowRunner] Adding command task 'segment_chr21' to master workflow
  [2022-04-28T20:45:05.795840] [1d46eeca3e55] [8_1] [WorkflowRunner] Adding command task 'segment_chr22' to master workflow
  [2022-04-28T20:45:05.795961] [1d46eeca3e55] [8_1] [WorkflowRunner] Adding command task 'reduce_all_segments' to master workflow
  [2022-04-28T20:45:05.796443] [1d46eeca3e55] [8_1] [WorkflowRunner] Adding command task 'rm_individual_seg_files' to master workflow
  Last step time span: 0:00:00.007164
  step 5: Infer tumor purity and ploidy.
  	start time: 2022-04-29 04:45:05.796882
  [2022-04-28T20:45:05.797097] [1d46eeca3e55] [8_1] [WorkflowRunner] Adding command task 'infer' to master workflow
  Last step time span: 0:00:00.000715
  step 6: Make plots.
  	start time: 2022-04-29 04:45:05.797597
  [2022-04-28T20:45:05.797682] [1d46eeca3e55] [8_1] [WorkflowRunner] Adding command task 'plot_cnv' to master workflow
  Last step time span: 0:00:00.000146
  End time: 2022-04-29 04:45:05.797743
  [2022-04-28T20:45:05.797779] [1d46eeca3e55] [8_1] [TaskRunner:masterWorkflow] Finished task specification for master workflow
  [2022-04-28T20:45:59.922904] [1d46eeca3e55] [8_1] [TaskManager] Completed command task: 'indexTumorBam' launched from master workflow
  [2022-04-28T20:46:00.087241] [1d46eeca3e55] [8_1] [TaskManager] Completed command task: 'indexNormalBam' launched from master workflow
  [2022-04-28T20:46:00.087753] [1d46eeca3e55] [8_1] [TaskManager] Launching command task: 'strelka_prepare' from master workflow
  [2022-04-28T20:46:00.088045] [1d46eeca3e55] [8_1] [TaskManager] Launching command task: 'normalize' from master workflow
  [2022-04-28T20:46:00.090603] [1d46eeca3e55] [8_1] [TaskRunner:strelka_prepare] Task initiated on local node
  [2022-04-28T20:46:00.091298] [1d46eeca3e55] [8_1] [TaskRunner:normalize] Task initiated on local node
  [2022-04-28T20:46:00.151922] [1d46eeca3e55] [8_1] [TaskManager] [ERROR] Failed to complete command task: 'normalize' launched from master workflow, error code: 101, command: 
  [2022-04-28T20:46:00.151960] [1d46eeca3e55] [8_1] [TaskManager] [ERROR] [normalize] Error Message:
  [2022-04-28T20:46:00.151972] [1d46eeca3e55] [8_1] [TaskManager] [ERROR] [normalize] Last 0 stderr lines from task (of 0 total lines):
  [2022-04-28T20:46:00.151983] [1d46eeca3e55] [8_1] [TaskManager] [ERROR] Shutting down task submission. Waiting for remaining tasks to complete.
  [2022-04-28T20:46:00.366062] [1d46eeca3e55] [8_1] [TaskManager] Completed command task: 'strelka_prepare' launched from master workflow
  [2022-04-28T20:46:08.952897] [1d46eeca3e55] [8_1] [WorkflowRunner] [ERROR] Worklow terminated due to the following task errors:
  [2022-04-28T20:46:08.952995] [1d46eeca3e55] [8_1] [WorkflowRunner] [ERROR] Failed to complete command task: 'normalize' launched from master workflow, error code: 101, command: 
  [2022-04-28T20:46:08.953010] [1d46eeca3e55] [8_1] [WorkflowRunner] [ERROR] [normalize] Error Message:
  [2022-04-28T20:46:08.953019] [1d46eeca3e55] [8_1] [WorkflowRunner] [ERROR] [normalize] Last 0 stderr lines from task (of 0 total lines):

unmatched chromosome name get error

Dear professor,
thanks for such an accurate piece of software.

When I use it, it raises errors like the following, maybe caused by chrM and chrMT.

what is more, since your genome.dict has many patch chromosome names, like

SN:GL000207.1
SN:GL000226.1
SN:GL000229.1
SN:GL000231.1
But when users input a bam, they may have used a different version of the genome; even for hg19, the patch chromosomes seem to differ. So why is the input a bam and not a fastq? Can you give me some suggestions?

[Screenshots of the genome.dict contents and the error messages were attached to the original issue.]

Failed to complete command task: 'rm_individual_seg_files' launched from master workflow,

Dear author,

I built the singularity environment from the Docker image recently. I got the error mentioned in the title when I ran it. The detailed error information is as follows:
[2020-11-17T17:07:15.734560] [node127.cm.cluster] [68840_1] [WorkflowRunner] [ERROR] Worklow terminated due to the following task errors:
[2020-11-17T17:07:15.735768] [node127.cm.cluster] [68840_1] [WorkflowRunner] [ERROR] Failed to complete command task: 'rm_individual_seg_files' launched from master workflow, error code: 1, command: 'rm'
[2020-11-17T17:07:15.736237] [node127.cm.cluster] [68840_1] [WorkflowRunner] [ERROR] [rm_individual_seg_files] Error Message:
[2020-11-17T17:07:15.736848] [node127.cm.cluster] [68840_1] [WorkflowRunner] [ERROR] [rm_individual_seg_files] Last 2 stderr lines from task (of 2 total lines):
[2020-11-17T17:07:15.736848] [node127.cm.cluster] [68840_1] [WorkflowRunner] [ERROR] [2020-11-17T15:50:23.975587] [node127.cm.cluster] [68840_1] [rm_individual_seg_files] rm: missing operand
[2020-11-17T17:07:15.736848] [node127.cm.cluster] [68840_1] [WorkflowRunner] [ERROR] [2020-11-17T15:50:23.975877] [node127.cm.cluster] [68840_1] [rm_individual_seg_files] Try 'rm --help' for more information.

Do you have any clue why this error happened? Can you please help me to solve it?

Additional information: I was running the program on a pair of WGS samples from canine tumour and normal tissue. The required reference files were properly made, except the snp_sites.gz file. However, I removed the --callRegions option of the Strelka SNP-calling step from main.py, so the program can still work without the snp_sites file.

Regards,
Yun

maestre: libbz2.so.1.0 not found

maestre does not find the libbz2.so.1.0 and fails with the usual lib not found error message:

/opt/accucopy/maestre: error while loading shared libraries: libbz2.so.1.0: cannot open shared object file: No such file or directory

Even though I installed libbz2 for the current conda environment as well as on the system:

conda install -c conda-forge bzip2
yum install bzip2-libs

Licensing issues

Hi @polyactis,

we considered to package Accucopy for Bioconda to make it easily obtainable for end users.
(When or if this might happen depends on demand, required effort, and chiefly on the outcome of this issue.)

I noticed a few licensing issues which would prevent a distribution, though:

Unrelated to our interest in redistributing Accucopy via Bioconda, I'd like to ask you to review the license conformance of your software with its dependencies. (Considering the nature of the GPL, this would probably mean you would have to relicense Accucopy under the GPL 3.0 and as such have to drop the academic/non-commercial and other restrictions.)
I'm not a lawyer and as such cannot offer further help with this issue, but just wanted to make you aware of it.

Cheers,
Marcel

(cc @dlaehnemann)

Stuck at reduce_all_segments

Hi,
I am trying to get results for purity and ploidy using your pipeline.
But somehow the tool is stuck for 2 days at this step:

[2020-12-18T09:03:35.706978] [713099f5307d] [135_1] [WorkflowRunner] [StatusUpdate] ===== MainFlow StatusUpdate =====
[2020-12-18T09:03:35.707082] [713099f5307d] [135_1] [WorkflowRunner] [StatusUpdate] Workflow specification is complete?: True
[2020-12-18T09:03:35.707112] [713099f5307d] [135_1] [WorkflowRunner] [StatusUpdate] Task status (waiting/queued/running/complete/error): 3/0/1/6/0
[2020-12-18T09:03:35.707137] [713099f5307d] [135_1] [WorkflowRunner] [StatusUpdate] Longest ongoing queued task time (hrs): 0.0000
[2020-12-18T09:03:35.707163] [713099f5307d] [135_1] [WorkflowRunner] [StatusUpdate] Longest ongoing queued task name: ''
[2020-12-18T09:03:35.707189] [713099f5307d] [135_1] [WorkflowRunner] [StatusUpdate] Longest ongoing running task time (hrs): 46.0008
[2020-12-18T09:03:35.707215] [713099f5307d] [135_1] [WorkflowRunner] [StatusUpdate] Longest ongoing running task name: 'reduce_all_segments'

I called it like this:
'./main.py -c configure -t /mnt/Tumor.bam -n /mnt/Normal.bam -o /mnt/accucopy_output/ --nCores 30'
Any suggestions?

What would be the fastest way to get just the purity/ploidy results, based on either FASTQ, BAM or copy-number files?
Thanks!

The meaning of minus cp

Hi,

I ran Accucopy with WGS data and got negative copy numbers and "na" for major_allele_cp, as shown below.
What is the meaning of this number?

chr start end cp major_allele_cp copy_no_float cumu_start cumu_end
7 142652501 142797500 14 10 14.0741 1374656804 1374801803
7 38299501 38337500 17 12 17 1270303804 1270341803
14 22177501 22352000 21 16 20.9444 2213584811 2213759310
17 22999501 26574500 -6.66667 NA -6.66667 2513780063 2517355062
2 92732501 93811000 -6.61111 NA -6.61111 341688923 342767422
7 63674501 63726000 -5.16667 NA -5.16667 1295678804 1295730303
14 18605501 19241000 1.25926 NA 1.25926 2210012811 2210648310
15 17009001 20739000 1.44444 NA 1.44444 2315460029 2319190028

Thank you very much,
Apiwat

unable to open file: name = '/data/modif_genome_test1/model_selection_log/model_selection.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0

Hi,

I'm getting the following error while trying to run accucopy with various samples :

" 5_1] [plot_model_select] IOError: Unable to open file (unable to open file: name = '/data/modif_genome_test1/model_selection_l
og/model_selection.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0) "


Below is the TRE ratio histogram image for reference (attached to the original issue).

I would really appreciate it if you could kindly help me troubleshoot the error.

Thanks,
Sreejita

Exception in thread TaskFileWriter-Thread:

Dear author,

Thanks for developing this software. I re-aligned my bam files with the provided reference genome, then ran Accucopy docker image with Singularity following the instructions, but I always get the following errors:

Exception in thread TaskFileWriter-Thread:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/local/lib/python2.7/dist-packages/pyflow/pyflow.py", line 1650, in run
self._writeIfSet()
File "/usr/local/lib/python2.7/dist-packages/pyflow/pyflow.py", line 1660, in _writeIfSet
self.writeFunc()
File "/usr/local/lib/python2.7/dist-packages/pyflow/pyflow.py", line 537, in wrapped
return f(self, *args, **kw)
File "/usr/local/lib/python2.7/dist-packages/pyflow/pyflow.py", line 2717, in writeTaskInfo
fp = open(self.taskInfoFile, "a")
IOError: [Errno 2] No such file or directory: 'output/pyflow.data/state/pyflow_tasks_info.txt'
Exception in thread TaskFileWriter-Thread:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/local/lib/python2.7/dist-packages/pyflow/pyflow.py", line 1650, in run
self._writeIfSet()
File "/usr/local/lib/python2.7/dist-packages/pyflow/pyflow.py", line 1660, in _writeIfSet
self.writeFunc()
File "/usr/local/lib/python2.7/dist-packages/pyflow/pyflow.py", line 537, in wrapped
return f(self, *args, **kw)
File "/usr/local/lib/python2.7/dist-packages/pyflow/pyflow.py", line 2618, in writeTaskStatus
tmpFp = open(tmpFile, "w")
IOError: [Errno 2] No such file or directory: 'output/pyflow.data/state/pyflow_tasks_runstate.txt.update.incomplete'

I might have missed something. Any suggestions? Thanks.

Jack

Bug at "--period" option

Hi,
I tried the new "--period" option and found a small bug in the main.py script at line 540.
It says the "period" parameter is not defined.
To fix this, I just removed lines 540-543, and the script works!

if (ap.period < 0):
    msg = (f"Argument period is non-negative integer-value, but you have "
           f"specified {ap.period}. O will be used instead.")
    sys.stderr.write(msg)

Thank you for your nice tool.
Apiwat

CNV.output.tsv not generated

Hello, I am working with several samples; all but one sample output the cnv.output.tsv. I have run accucopy on this single sample many times, but there is no CNV output. Accucopy finishes with no error; the job completes but the cnv.output file is not written. What are the reasons this would happen?

Changing window size

Hi,

I tested the accucopy docker with a dataset of 100 samples, but no breakpoints were found for 80 of those. So I tried to use a different window size, but it seems that somewhere the window size is hard-coded, as it tries to find the file 'chr22.ratio.w500.csv.gz' in the next step no matter what the actual window size is. Of course, this file does not exist, as for a window size of 250 it is actually 'chr22.ratio.w250.csv.gz', and the computation fails. Is it possible to fix this? Is there anything else I could try to get a tumor purity value for those 80 samples?

Cheers,
Nicole

Suggestion on input bams?

Hi @polyactis,
First of all, thanks for your great work of Accucopy. It really helps a lot in my low-coverage WGS project.
And I wonder which kind of bam files you would like to suggest as input bams? Such as raw bams after bwa, or sorted bams, or sorted bams with picard-MarkDuplicate procedure?

Thanks,
Junyi

caller_path

Hi
Thanks for this excellent tool. caller_path is the path of the 3rd-party variant calling program; in the demo, Strelka2 was used. Could GATK be used as the caller?

Lacking python3 packages in docker

I've run into some errors with the docker image (built in Singularity). The plotting steps require matplotlib. However, plot_cnv errors out when importing matplotlib. Interestingly, plotCPandMCP.py is written in Python 3. I suspect the docker image only installed matplotlib for Python 2.

log:

[15_1] [TaskRunner:plot_cnv] Task initiated on local node
[15_1] [TaskManager] [ERROR] Failed to complete command task: 'plot_cnv' launched from master workflow, error code: 1, command: '/usr/local/Accucopy/plotCPandMCP.py -i /.../accucopy/results/cnv.output.tsv -r /mnt/refData/genome.dict --no_of_autosomes 22 -o /.../accucopy/results/plot.cnv.png'
[15_1] [TaskManager] [ERROR] [plot_cnv] Error Message:
[15_1] [TaskManager] [ERROR] [plot_cnv] Last 27 stderr lines from task (of 27 total lines):
[15_1] [TaskManager] [ERROR] [plot_cnv] Error processing line 1 of /usr/local/lib/python3.6/dist-packages/matplotlib-3.3.4-py3.6-nspkg.pth:
[15_1] [TaskManager] [ERROR] [] [] [15_1] [plot_cnv]
[15_1] [TaskManager] [ERROR] [] [] [15_1] [plot_cnv]   Traceback (most recent call last):
[15_1] [TaskManager] [ERROR] [] [] [15_1] [plot_cnv]     File "/usr/lib/python3.6/site.py", line 174, in addpackage
[15_1] [TaskManager] [ERROR] [] [] [15_1] [plot_cnv]     exec(line)
[15_1] [TaskManager] [ERROR] [] [] [15_1] [plot_cnv]     File "<string>", line 1, in <module>
[15_1] [TaskManager] [ERROR] [] [] [15_1] [plot_cnv]     File "<frozen importlib._bootstrap>", line 568, in module_from_spec
[15_1] [TaskManager] [ERROR] [] [] [15_1] [plot_cnv]   AttributeError: 'NoneType' object has no attribute 'loader'
[15_1] [TaskManager] [ERROR] [] [] [15_1] [plot_cnv]
[15_1] [TaskManager] [ERROR] [] [] [15_1] [plot_cnv] Remainder of file ignored
[15_1] [TaskManager] [ERROR] [] [] [15_1] [plot_cnv] Matplotlib created a temporary config/cache directory at /tmp/matplotlib-pql0p325 because the default path (/root/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
[15_1] [TaskManager] [ERROR] [] [] [15_1] [plot_cnv] Traceback (most recent call last):
[15_1] [TaskManager] [ERROR] [] [] [15_1] [plot_cnv]   File "/usr/local/Accucopy/plotCPandMCP.py", line 7, in <module>
[15_1] [TaskManager] [ERROR] [] [] [15_1] [plot_cnv]     from matplotlib import pyplot as plt
[15_1] [TaskManager] [ERROR] [] [] [15_1] [plot_cnv]   File "/usr/local/lib/python3.6/dist-packages/matplotlib/pyplot.py", line 36, in <module>
[15_1] [TaskManager] [ERROR] [] [] [15_1] [plot_cnv]     import matplotlib.colorbar
[15_1] [TaskManager] [ERROR] [] [] [15_1] [plot_cnv]   File "/usr/local/lib/python3.6/dist-packages/matplotlib/colorbar.py", line 44, in <module>
[15_1] [TaskManager] [ERROR] [] [] [15_1] [plot_cnv]     import matplotlib.contour as contour
[15_1] [TaskManager] [ERROR] [] [] [15_1] [plot_cnv]   File "/usr/local/lib/python3.6/dist-packages/matplotlib/contour.py", line 21, in <module>
[15_1] [TaskManager] [ERROR] [] [] [15_1] [plot_cnv]     import matplotlib.texmanager as texmanager
[15_1] [TaskManager] [ERROR] [] [] [15_1] [plot_cnv]   File "<frozen importlib._bootstrap>", line 971, in _find_and_load
[15_1] [TaskManager] [ERROR] [] [] [15_1] [plot_cnv]   File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
[15_1] [TaskManager] [ERROR] [] [] [15_1] [plot_cnv]   File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
[15_1] [TaskManager] [ERROR] [] [] [15_1] [plot_cnv]   File "<frozen importlib._bootstrap_external>", line 674, in exec_module
[15_1] [TaskManager] [ERROR] [] [] [15_1] [plot_cnv]   File "<frozen importlib._bootstrap_external>", line 779, in get_code
[15_1] [TaskManager] [ERROR] [] [] [15_1] [plot_cnv]   File "<frozen importlib._bootstrap_external>", line 487, in _compile_bytecode
[15_1] [TaskManager] [ERROR] [] [] [15_1] [plot_cnv] EOFError: marshal data too short
[15_1] [TaskManager] [ERROR] Shutting down task submission. Waiting for remaining tasks to complete.


In the Singularity container I couldn't import matplotlib, pandas or numpy for Python 3 (but could for Python 2).

Run on long reads data?

Hi, I want to use Accucopy on long-read data, such as PacBio or Nanopore. So I want to ask whether your model fits long-read data or not. If yes, how can I set parameters like read_length and window_size?

thanks.
