bioinform / neusomatic

NeuSomatic: Deep convolutional neural networks for accurate somatic mutation detection

License: Other

Shell 11.61% CMake 0.41% C++ 21.78% Python 65.78% Dockerfile 0.41%

convolutional-neural-networks deep-learning genomics somatic-variants

neusomatic's Issues

run_test.sh fails with "No such file 'work_standalone/work_tumor_without_q/work.0/count.bed'"

Hi,
I tried to run run_test.sh, but it fails with the following error:
ERROR 2018-12-06 16:04:59,978 scan_alignments No such file 'work_standalone/work_tumor_without_q/work.0/count.bed'
Traceback (most recent call last):
File "/mnt/nfs/gigantor/ifs/DCEG/Resources/Tools/neusomatic/neusomatic/neusomatic/python/scan_alignments.py", line 120, in scan_alignments
outputs = pool.map_async(run_scan_alignments, map_args).get()
File "/DCEG/Resources/Tools/Anaconda/Anaconda2-5.3.0/lib/python2.7/multiprocessing/pool.py", line 572, in get
raise self._value
IOError: No such file 'work_standalone/work_tumor_without_q/work.0/count.bed'
Traceback (most recent call last):
File "/DCEG/Resources/Tools/neusomatic/neusomatic/neusomatic/python/preprocess.py", line 359, in <module>
args.scan_alignments_binary)
File "/DCEG/Resources/Tools/neusomatic/neusomatic/neusomatic/python/preprocess.py", line 208, in preprocess
calc_qual=False, dbsnp_regions=[])
File "/DCEG/Resources/Tools/neusomatic/neusomatic/neusomatic/python/preprocess.py", line 50, in process_split_region
calc_qual=calc_qual)
File "/mnt/nfs/gigantor/ifs/DCEG/Resources/Tools/neusomatic/neusomatic/neusomatic/python/scan_alignments.py", line 126, in scan_alignments
raise Exception
Exception

How can I fix this?
Thanks,

undefined symbol: _ZN3c107Warning4warnENS_14SourceLocationENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE

I ran neusomatic train.py on a remote server, but got the following errors:

Traceback (most recent call last):
File "neusomatic-master/neusomatic/python/train.py", line 18, in <module>
from torchvision import transforms
File "/public/home//.conda/envs//lib/python3.6/site-packages/torchvision/__init__.py", line 1, in <module>
from torchvision import models
File "/public/home//.conda/envs//lib/python3.6/site-packages/torchvision/models/__init__.py", line 11, in <module>
from . import detection
File "/public/home//.conda/envs//lib/python3.6/site-packages/torchvision/models/detection/__init__.py", line 1, in <module>
from .faster_rcnn import *
File "/public/home//.conda/envs//lib/python3.6/site-packages/torchvision/models/detection/faster_rcnn.py", line 7, in <module>
from torchvision.ops import misc as misc_nn_ops
File "/public/home//.conda/envs//lib/python3.6/site-packages/torchvision/ops/__init__.py", line 1, in <module>
from .boxes import nms, box_iou
File "/public/home//.conda/envs//lib/python3.6/site-packages/torchvision/ops/boxes.py", line 2, in <module>
from torchvision import _C
ImportError: /public/home//.conda/envs//lib/python3.6/site-packages/torchvision/_C.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN3c107Warning4warnENS_14SourceLocationENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE

I tried several times to install g++ on the remote server without success, so I did not run ./build.sh (which requires cmake 3.13.2 and g++ 5.4.0) there and cannot run the quick test. Instead, I ran ./build.sh and preprocess.py on my local machine.

/bin/scan_alignments Not Found

Hi, I got "OSError: File not found: ../bin/scan_alignments" when I ran run_test.sh.

Is this file missing from the source code?

How to produce synthetic training data with different tumor purity and normal contamination ratio?

Hi,
Wonderful work! I think this is one of the most impressive works in the field of somatic mutation calling, and I'm very interested in it. I'd like to reproduce some of the evaluation results in your paper, but I'm having trouble producing synthetic training data with different tumor purity and normal contamination ratios. This link does not seem to describe how to produce training data with different tumor purity. I hope you can help me solve this problem.

Best,
Zekun

Zero coverage SNVs and their relation to deletions

We have set up Neusomatic using the pre-built 'general purpose' training network models and believe we have it running properly. Most of the somatic mutations called can be confirmed visually through IGV.

However, we found that a very small proportion of the called variants (~1%) have no rare allele at all (AO=0), despite passing QC. The variant is not seen in IGV, nor is it present in any of the VCFs created before the Ensemble step. For each of these false SNVs, I found a corresponding deletion (~20-40 nt in length) called pre-ensemble with a starting coordinate 1 nt upstream of the SNV (e.g. if the false variant was called chrX:123456C>A, a deletion would have been called at position chrX:123455). I evaluated 7 examples, and in all 7 cases this corresponding deletion was present.

Is this a known bug, and if so, do you know how it could be avoided in the future? Thank you.
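
Until the cause is known, such calls can at least be flagged after the fact. The following is a minimal Python sketch, not part of NeuSomatic, that drops records whose sample column reports AO=0. It assumes a tab-separated VCF with a single sample column whose FORMAT includes AO, as in the GT:DP:RO:AO:AF layout posted in another issue on this page; the function names are hypothetical.

```python
def sample_value(vcf_line, key):
    """Return the value of a FORMAT key for the first sample, or None."""
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 10:
        return None
    keys = fields[8].split(":")
    values = fields[9].split(":")
    return dict(zip(keys, values)).get(key)


def has_alt_support(vcf_line):
    """True unless the record explicitly reports zero reads for every ALT."""
    ao = sample_value(vcf_line, "AO")
    if ao is None:
        return True  # no AO field to judge on; keep the record
    # AO may hold a comma-separated list when a record has several ALTs.
    counts = [int(x) for x in ao.split(",") if x not in (".", "")]
    return not counts or any(c > 0 for c in counts)


def drop_unsupported(lines):
    """Yield header lines and records that have at least one ALT read."""
    for line in lines:
        if line.startswith("#") or has_alt_support(line):
            yield line
```

This only hides the symptom; the link to the upstream deletion described above would still need to be examined by the maintainers.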

RuntimeError: Expected object of type torch.DoubleTensor but found type torch.FloatTensor for argument #2 'weight'

Hi, I am running the quick test and got the following error:

INFO 2018-10-23 18:40:01,715 main Namespace(dbsnp_to_filter=None, del_merge_min_af=0, del_min_af=0.05, ensemble_tsv=None, good_ao=10, ins_merge_min_af=0, ins_min_af=0.05, long_read=False, matrix_base_pad=7, matrix_width=32, max_dp=40000, merge_r=0.5, min_ao=1, min_dp=5, min_ev_frac_per_col=0.06, min_mapq=10, mode='call', normal_bam='../normal.bam', num_threads=1, reference='Homo_sapiens.GRCh37.75.dna.chromosome.22.fa', region_bed='../region.bed', restart=False, scan_alignments_binary='../../neusomatic/bin/scan_alignments', scan_maf=0.05, scan_window_size=2000, skip_without_qual=False, snp_min_af=0.05, snp_min_ao=10.0, snp_min_bq=20.0, truth_vcf=None, tsv_batch_size=50000, tumor_bam='../tumor.bam', work='work_standalone')
INFO 2018-10-23 18:40:01,716 preprocess ----------------------Preprocessing------------------------
INFO 2018-10-23 18:40:01,716 preprocess Scan tumor bam (first without quality scores).
INFO 2018-10-23 18:40:01,716 process_split_region Scan bam.
INFO 2018-10-23 18:40:01,716 scan_alignments -------------------Scan Alignment BAM----------------------
INFO 2018-10-23 18:40:01,728 split_region ------------------------Split region-----------------------
INFO 2018-10-23 18:40:01,742 split_region Total length: 40516
INFO 2018-10-23 18:40:01,746 split_region Split 0: 40516
INFO 2018-10-23 18:40:01,746 split_region Total splitted length: 40516
INFO 2018-10-23 18:40:01,758 run_scan_alignments (PoolWorker-1) Running command: ['../../neusomatic/bin/scan_alignments', '--ref', 'Homo_sapiens.GRCh37.75.dna.chromosome.22.fa', '-b', '../tumor.bam', '-L', 'work_standalone/work_tumor_without_q/region_0.bed', '--out_vcf_file', 'work_standalone/work_tumor_without_q/work.0/candidates.vcf', '--out_count_file', 'work_standalone/work_tumor_without_q/work.0/count.bed', '--window_size', '2000', '--min_af', '0.05', '--min_mapq', '10', '--max_depth', '40000', '--num_thread', '1']
INFO 2018-10-23 18:40:02,894 process_split_region Filter candidates.
INFO 2018-10-23 18:40:02,898 filter_candidates (PoolWorker-2) ---------------------Filter Candidates---------------------
INFO 2018-10-23 18:40:02,937 preprocess Scan tumor bam (and extracting quality scores).
INFO 2018-10-23 18:40:02,937 process_split_region Scan bam.
INFO 2018-10-23 18:40:02,938 scan_alignments -------------------Scan Alignment BAM----------------------
INFO 2018-10-23 18:40:02,938 scan_alignments split_regions to be used (will ignore region_bed): work_standalone/work_tumor_without_q/candidates_region_0.bed
INFO 2018-10-23 18:40:02,941 run_scan_alignments (PoolWorker-3) Running command: ['../../neusomatic/bin/scan_alignments', '--ref', 'Homo_sapiens.GRCh37.75.dna.chromosome.22.fa', '-b', '../tumor.bam', '-L', 'work_standalone/work_tumor_without_q/candidates_region_0.bed', '--out_vcf_file', 'work_standalone/work_tumor/work.0/candidates.vcf', '--out_count_file', 'work_standalone/work_tumor/work.0/count.bed', '--window_size', '2000', '--min_af', '0.05', '--min_mapq', '10', '--max_depth', '40000', '--num_thread', '1', '--calculate_qual_stat']
INFO 2018-10-23 18:40:03,184 process_split_region Filter candidates.
INFO 2018-10-23 18:40:03,190 filter_candidates (PoolWorker-4) ---------------------Filter Candidates---------------------
INFO 2018-10-23 18:40:03,202 preprocess Scan normal bam (and extracting quality scores).
INFO 2018-10-23 18:40:03,202 process_split_region Scan bam.
INFO 2018-10-23 18:40:03,202 scan_alignments -------------------Scan Alignment BAM----------------------
INFO 2018-10-23 18:40:03,203 scan_alignments split_regions to be used (will ignore region_bed): work_standalone/work_tumor_without_q/candidates_region_0.bed
INFO 2018-10-23 18:40:03,211 run_scan_alignments (PoolWorker-5) Running command: ['../../neusomatic/bin/scan_alignments', '--ref', 'Homo_sapiens.GRCh37.75.dna.chromosome.22.fa', '-b', '../normal.bam', '-L', 'work_standalone/work_tumor_without_q/candidates_region_0.bed', '--out_vcf_file', 'work_standalone/work_normal/work.0/candidates.vcf', '--out_count_file', 'work_standalone/work_normal/work.0/count.bed', '--window_size', '2000', '--min_af', '0.2', '--min_mapq', '10', '--max_depth', '40000', '--num_thread', '1', '--calculate_qual_stat']
INFO 2018-10-23 18:40:03,395 preprocess Generate dataset.
INFO 2018-10-23 18:40:03,395 preprocess Dataset for region work_standalone/work_tumor_without_q/candidates_region_0.bed
INFO 2018-10-23 18:40:03,395 generate_dataset ---------------------Generate Dataset----------------------
INFO 2018-10-23 18:40:03,404 generate_dataset len_candids: 80
INFO 2018-10-23 18:40:03,405 split_region ------------------------Split region-----------------------
INFO 2018-10-23 18:40:03,419 split_region Total length: 2456
INFO 2018-10-23 18:40:03,421 split_region Split 0: 2456
INFO 2018-10-23 18:40:03,421 split_region Total splitted length: 2456
INFO 2018-10-23 18:40:03,425 find_records (PoolWorker-6) Start find_records for worker 0
INFO 2018-10-23 18:40:03,459 find_records (PoolWorker-6) N_none: 80
INFO 2018-10-23 18:40:03,460 generate_dataset Write 1/1 split to work_standalone/dataset/work.0/candidates_0.tsv for cnts (0..80)/80
INFO 2018-10-23 18:40:03,693 generate_dataset Generating dataset is Done.
INFO 2018-10-23 18:40:03,693 preprocess Preprocessing is Done.
INFO 2018-10-23 18:40:04,114 main use_cuda: False
INFO 2018-10-23 18:40:04,114 call_neusomatic -----------------Call Somatic Mutations--------------------
INFO 2018-10-23 18:40:04,124 call_neusomatic Load pretrained model from checkpoint ../../neusomatic/models/NeuSomatic_v0.1.0_standalone_Dream3_70purity.pth
INFO 2018-10-23 18:40:04,126 call_neusomatic tag: NeuSomatic_v0.1.0_standalone_Dream3_70purity
INFO 2018-10-23 18:40:04,128 call_neusomatic Run for candidate files: ['work_standalone/dataset/work.0/candidates_0.tsv']
INFO 2018-10-23 18:40:04,128 dataloader (1024, 1048576)
INFO 2018-10-23 18:40:04,128 dataloader (524288, 1048576)
INFO 2018-10-23 18:40:04,128 dataloader Len's of tsv files in this batch: [80]
INFO 2018-10-23 18:40:04,130 extract_info_tsv (MainProcess) Loaded 80 candidates for work_standalone/dataset/work.0/candidates_0.tsv
INFO 2018-10-23 18:40:04,131 call_neusomatic N_dataset: 80
ERROR 2018-10-23 18:40:04,154 main Traceback (most recent call last):
File "../../neusomatic/python/call.py", line 529, in <module>
use_cuda)
File "../../neusomatic/python/call.py", line 466, in call_neusomatic
net, vartype_classes, call_loader, out_dir, model_tag, use_cuda)
File "../../neusomatic/python/call.py", line 63, in call_variants
outputs, _ = net(matrices)
File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 357, in __call__
result = self.forward(*input, **kwargs)
File "/home/jingmeng/Desktop/neusomatic/neusomatic/python/network.py", line 67, in forward
x = self.pool1(F.relu(self.bn1(self.conv1(x))))
File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 357, in __call__
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/conv.py", line 282, in forward
self.padding, self.dilation, self.groups)
File "/usr/local/lib/python2.7/dist-packages/torch/nn/functional.py", line 90, in conv2d
return f(input, weight, bias)
RuntimeError: Expected object of type torch.DoubleTensor but found type torch.FloatTensor for argument #2 'weight'

ERROR 2018-10-23 18:40:04,155 main Aborting!
ERROR 2018-10-23 18:40:04,155 main call.py failure on arguments: Namespace(batch_size=100, candidates_tsv=['work_standalone/dataset/work.0/candidates_0.tsv'], checkpoint='../../neusomatic/models/NeuSomatic_v0.1.0_standalone_Dream3_70purity.pth', ensemble=False, lowqual_threshold=0.4, max_load_candidates=1000000, num_threads=1, out='work_standalone', pass_threshold=0.7, reference='Homo_sapiens.GRCh37.75.dna.chromosome.22.fa')
Traceback (most recent call last):
File "../../neusomatic/python/call.py", line 534, in <module>
raise e
RuntimeError: Expected object of type torch.DoubleTensor but found type torch.FloatTensor for argument #2 'weight'

How can I fix this problem? Thank you!

/bin/scan_alignment not found

Hi. Congratulations on the tool. I installed it and am currently trying to do some testing, but I ran into a problem. I am not an IT expert, so I would be grateful for some additional help.
I am running it on Databricks and also on a Mac, and the same problem appears on both.

The command line:
%sh
/databricks/python/bin/python /local_disk0/neu/neusomatic-master/neusomatic/python/preprocess.py \
  --mode call \
  --reference /dbfs/FileStore/tables/ucsc.hg19.chr20.unittest.fasta \
  --region_bed /dbfs/FileStore/tables/test_nist_b37_chr20_100kbp_at_10mb-f8b09.bed \
  --tumor_bam /dbfs/FileStore/tables/NA12878_S1_chr20_10_10p1mb-41bf7.bam \
  --normal_bam /dbfs/FileStore/tables/NA12878_S1_chr20_10_10p1mb-41bf7.bam \
  --work work_train \
  --scan_alignments_binary ../bin/scan_alignments

Error:
INFO 2018-09-06 11:25:01,741 main Namespace(dbsnp_to_filter=None, del_merge_min_af=0, del_min_af=0.05, ensemble_tsv=None, good_ao=10, ins_merge_min_af=0, ins_min_af=0.05, long_read=False, matrix_base_pad=7, matrix_width=32, merge_r=0.5, min_ao=1, min_dp=5, min_ev_frac_per_col=0.06, min_mapq=1, mode='call', normal_bam='/dbfs/FileStore/tables/NA12878_S1_chr20_10_10p1mb-41bf7.bam', num_threads=1, reference='/dbfs/FileStore/tables/ucsc.hg19.chr20.unittest.fasta', region_bed='/dbfs/FileStore/tables/test_nist_b37_chr20_100kbp_at_10mb-f8b09.bed', restart=False, scan_alignments_binary='../bin/scan_alignments', scan_maf=0.01, scan_window_size=2000, skip_without_qual=False, snp_min_af=0.05, snp_min_ao=3, snp_min_bq=10, truth_vcf=None, tsv_batch_size=50000, tumor_bam='/dbfs/FileStore/tables/NA12878_S1_chr20_10_10p1mb-41bf7.bam', work='work_train')
INFO 2018-09-06 11:25:01,743 main -----------------------------------------------------------
INFO 2018-09-06 11:25:01,743 main Postprocessing
INFO 2018-09-06 11:25:01,743 main -----------------------------------------------------------
INFO 2018-09-06 11:25:01,743 main Scan tumor bam (first without quality scores).
INFO 2018-09-06 11:25:01,744 main Scan bam.
INFO 2018-09-06 11:25:01,744 scan_alignments -----------------------------------------------------------
INFO 2018-09-06 11:25:01,744 scan_alignments Scan Alignment BAM
INFO 2018-09-06 11:25:01,744 scan_alignments -----------------------------------------------------------
INFO 2018-09-06 11:25:02,151 utils Running command: ['../bin/scan_alignments', '--ref', '/dbfs/FileStore/tables/ucsc.hg19.chr20.unittest.fasta', '-b', '/dbfs/FileStore/tables/NA12878_S1_chr20_10_10p1mb-41bf7.bam', '-L', 'work_train/work_tumor_without_q/region_0.bed', '--out_vcf_file', 'work_train/work_tumor_without_q/work.0/candidates.vcf', '--out_count_file', 'work_train/work_tumor_without_q/work.0/count.bed', '--window_size', '2000', '--min_af', '0.01', '--min_mapq', '1', '--num_thread', '1']
ERROR 2018-09-06 11:25:02,155 scan_alignments [Errno 2] No such file or directory
Traceback (most recent call last):
File "/local_disk0/neu/neusomatic-master/neusomatic/python/scan_alignments.py", line 120, in scan_alignments
outputs = pool.map_async(run_scan_alignments, map_args).get()
File "/usr/lib/python2.7/multiprocessing/pool.py", line 567, in get
raise self._value
OSError: [Errno 2] No such file or directory
Traceback (most recent call last):
File "/local_disk0/neu/neusomatic-master/neusomatic/python/preprocess.py", line 359, in <module>
args.scan_alignments_binary)
File "/local_disk0/neu/neusomatic-master/neusomatic/python/preprocess.py", line 208, in preprocess
calc_qual=False, dbsnp_regions=[])
File "/local_disk0/neu/neusomatic-master/neusomatic/python/preprocess.py", line 50, in process_split_region
calc_qual=calc_qual)
File "/local_disk0/neu/neusomatic-master/neusomatic/python/scan_alignments.py", line 126, in scan_alignments
raise Exception
Exception

I looked for scan_alignments in the bin folder, but the folder is empty.
Thank you very much for any help.
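
A note on the missing binary: the log shows that preprocess.py was given the relative path ../bin/scan_alignments, and a relative path is resolved against the directory the command is run from. An empty bin folder may also mean that the C++ code has not been built yet (another issue on this page mentions ./build.sh). Below is a minimal Python sketch, not part of NeuSomatic, for checking what a given path resolves to before starting a run; check_binary is a hypothetical helper name.

```python
import os


def check_binary(path):
    """Return (absolute_path, problem) for a binary given on the command line.

    `problem` is None when the file exists and is executable.
    """
    # Relative paths are resolved against the current working directory,
    # not against the location of the script that receives them.
    abs_path = os.path.abspath(path)
    if not os.path.isfile(abs_path):
        return abs_path, "missing (has the C++ code been built?)"
    if not os.access(abs_path, os.X_OK):
        return abs_path, "present but not executable"
    return abs_path, None


if __name__ == "__main__":
    resolved, problem = check_binary("../bin/scan_alignments")
    print(resolved, "->", problem or "ok")
```

Passing the absolute path printed here to --scan_alignments_binary removes the dependence on the working directory.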

support for singularity?

Hi there!

From what I can gather, the prepare_callers-scripts.sh command and associated scripts here will only work with Docker. Do I have that right? Are there any plans to develop a similar pipeline using Singularity?

Looks like the somaticseq project has already created a similar variant so perhaps development would not be too complicated.

Thanks for considering this request!

The meaning of the input parameters

Hi there. I am not sure about some of the parameters I pass to preprocess.py, call.py and postprocess.py, and I could not find any documentation describing them. I am new to this field; could you kindly provide a brief description of each parameter? For example, does *_ao mean the number of ALT alleles observed? Thanks.

Kind regards,
Tianyu

Invalid VCF format [postprocess.py]

Hiya,

this may be just a minor bug but it throws off tools in downstream analysis (e.g. GATK, vcfR).
When postprocess.py generates the final VCF after calling, it outputs a format that is not exactly standard-compliant:

##fileformat=VCFv4.2
##NeuSomatic Version=0.2.0
##FORMAT=<ID=SCORE,Number=1,Type=Float,Description="Prediction probability score">
##FILTER=<ID=PASS,Description="Accept as a higher confidence somatic mutation calls with probability score value at least 0.7">
##FILTER=<ID=LowQual,Description="Less confident somatic mutation calls with probability score value at least 0.4">
##FILTER=<ID=REJECT,Description="Rejected as a confident somatic mutation with probability score value below 0.4">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth in the tumor">
##FORMAT=<ID=RO,Number=1,Type=Integer,Description="Reference allele observation count in the tumor">
##FORMAT=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observation count in the tumor">
##FORMAT=<ID=AF,Number=1,Type=Float,Description="Allele fractions of alternate alleles in the tumor">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SAMPLE
chr1    96593   .       C       T       39.9993 PASS    SCORE=0.9999;DP=48;RO=22;AO=26;AF=0.5417;       GT:DP:RO:AO:AF  0/1:48:22:26:0.5417

I had problems with the following elements:

INFO fields are not declared in the header
trailing semicolon in INFO column
not exactly an error, but why do DP, RO, AO, AF appear both in INFO and FORMAT?

Bonus question: What's the difference between QUAL and SCORE metrics?

Cheers.
-- Harry
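
As a stopgap for the first two points, the VCF can be tidied after the fact. The following is a minimal Python sketch, not part of postprocess.py. It assumes a tab-separated VCF, and it assumes that each undeclared INFO key can reuse the Number, Type and Description of the FORMAT declaration with the same ID, which holds for the header shown above but may not match the definitions the authors intend. tidy_vcf is a hypothetical name.

```python
import re

FORMAT_DECL = re.compile(r"^##FORMAT=<ID=([^,>]+),(.*)>\s*$")
INFO_DECL = re.compile(r"^##INFO=<ID=([^,>]+)")
# Fallback for INFO keys that have no FORMAT twin (an assumption).
FALLBACK = 'Number=.,Type=String,Description="Undocumented"'


def tidy_vcf(lines):
    """Return VCF lines with INFO keys declared and trailing ';' removed."""
    lines = [line.rstrip("\n") for line in lines]
    format_decls = {}     # ID -> remainder of its FORMAT declaration
    declared_info = set()
    used_info = []        # INFO keys seen in records, in first-seen order
    for line in lines:
        if line.startswith("##"):
            m = FORMAT_DECL.match(line)
            if m:
                format_decls[m.group(1)] = m.group(2)
            m = INFO_DECL.match(line)
            if m:
                declared_info.add(m.group(1))
        elif line and not line.startswith("#"):
            fields = line.split("\t")
            if len(fields) >= 8:
                for item in fields[7].split(";"):
                    key = item.split("=")[0]
                    if key and key != "." and key not in used_info:
                        used_info.append(key)

    out = []
    for line in lines:
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # Missing declarations go just before the column header line.
            for key in used_info:
                if key not in declared_info:
                    rest = format_decls.get(key, FALLBACK)
                    out.append("##INFO=<ID=%s,%s>" % (key, rest))
            out.append(line)
        elif line.startswith("#") or len(fields) < 8:
            out.append(line)
        else:
            fields[7] = fields[7].rstrip(";") or "."
            out.append("\t".join(fields))
    return out
```

Whether DP, RO, AO and AF should appear in both INFO and FORMAT at all is a design question this sketch does not settle.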

adapt for ultra-deep sequencing data

Hi,

I am trying to adapt your model to my deep targeted sequencing data (> 100,000X depth) for somatic mutation calling. I noticed that the default read depth is set to 100 in preprocess.py.
https://github.com/bioinform/neusomatic/blob/master/neusomatic/python/preprocess.py#L243

I wonder whether this depth could be modified to a larger number for our deep sequencing data and whether our data can be encoded into layers for the architecture of the CNN model you built.

Thanks a lot!

Best,
Weiwei

ensemble.tsv error

Hello.
When I try to make the .tsv file for ensemble mode using SomaticSeq.Wrapper.sh, I encounter the error below.
"""
This SomaticSeq.Wrapper.sh script is being obsolesced.
It is only here to maintain compatibility with previous versions.
You should look into run_parallel.py or somaticseq/run_somaticseq.py in the future.
This script simply runs the somaticseq/run_somaticseq.py script.

2018-10-18 10:05:36,629 - SomaticSeq - INFO - SomaticSeq Input Arguments: output_directory=/hdd8tb/sgan/29-O, genome_reference=/hdd8tb/sgan/ucsc.hg19.fasta, truth_snv=None, truth_indel=None, classifier_snv=None, classifier_indel=None, pass_threshold=0.5, lowqual_threshold=0.1, homozygous_threshold=0.85, heterozygous_threshold=0.01, minimum_mapping_quality=1, minimum_base_quality=5, minimum_num_callers=0.5, dbsnp_vcf=dbsnp.GRCh38.vcf, cosmic_vcf=None, inclusion_region=region.bed, exclusion_region=None, threads=1, keep_intermediates=False, somaticseq_train=False, tumor_bam_file=/hdd8tb/sgan/29-O.recal.bam, normal_bam_file=/hdd8tb/sgan/29-N.recal.bam, tumor_sample=TUMOR, normal_sample=NORMAL, mutect_vcf=None, indelocator_vcf=None, mutect2_vcf=MuTect2.vcf, varscan_snv=VarScan2.snp.vcf, varscan_indel=VarScan2.indel.vcf, jsm_vcf=None, somaticsniper_vcf=SomaticSniper.vcf, vardict_vcf=VarDict.vcf, muse_vcf=MuSE.vcf, lofreq_snv=None, lofreq_indel=None, scalpel_vcf=None, strelka_snv=somatic.snvs.vcf.gz, strelka_indel=somatic.indels.vcf.gz, tnscope_vcf=None, platypus_vcf=None, which=paired
sh: 1: intersectBed: not found
Traceback (most recent call last):
File "/home/sgan/software/somaticseq/somaticseq/run_somaticseq.py", line 361, in <module>
keep_intermediates = runParameters['keep_intermediates'] )
File "/home/sgan/software/somaticseq/somaticseq/run_somaticseq.py", line 58, in runPaired
outSnv, outIndel, intermediateVcfs, tempFiles = combineCallers.combinePaired(outdir=outdir, ref=ref, tbam=tbam, nbam=nbam, inclusion=inclusion, exclusion=exclusion, mutect=mutect, indelocator=indelocator, mutect2=mutect2, varscan_snv=varscan_snv, varscan_indel=varscan_indel, jsm=jsm, sniper=sniper, vardict=vardict, muse=muse, lofreq_snv=lofreq_snv, lofreq_indel=lofreq_indel, scalpel=scalpel, strelka_snv=strelka_snv, strelka_indel=strelka_indel, tnscope=tnscope, platypus=platypus, keep_intermediates=True)
File "/home/sgan/software/somaticseq/somaticseq/../somaticseq/combine_callers.py", line 237, in combinePaired
mod_mutect2.convert(mutect2_in, snv_mutect_out, indel_mutect_out, False)
File "/home/sgan/software/somaticseq/somaticseq/../vcfModifier/modify_MuTect2.py", line 69, in convert
normal_index = header.index(normal_name) - 9
UnboundLocalError: local variable 'normal_name' referenced before assignment
"""
I use Python 3.6.3.

How can I fix this error? Did I do anything wrong?

failed at the last step

The example sample ran well, but when run on a pair of real datasets, NeuSomatic failed at the last step with this error:

NeuSomatic stand-alone FAILED: Files ../example/work_standalone/NeuSomatic_standalone.vcf and ../NeuSomatic_standalone.vcf Are Different!

NeuSomatic ensemble FAILED: Files ../example/work_ensemble/NeuSomatic_ensemble.vcf and ../NeuSomatic_ensemble.vcf Are Different!

Please advise what could be wrong.

split_bed.py errors

I'm having issues working with the split_bed source code.
raise NotImplementedError(help_str)
NotImplementedError: "sortBed" does not appear to be installed or on the path, so this method is disabled. Please install a more recent version of BEDTools and re-import to use this method.
I installed a recent version of BEDTools, but the problem persists. Any idea what the issue might be?
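
The error message says the executable is not "on the path", so one thing worth checking is whether the Python process itself can see the BEDTools executables, since its PATH can differ from the one in an interactive shell. Below is a minimal sketch for Python 3 using only the standard library; find_tools is a hypothetical helper, and the tool names are the ones that appear in the error messages on this page.

```python
import os
import shutil


def find_tools(names, path=None):
    """Map each executable name to its resolved location, or None."""
    return {name: shutil.which(name, path=path) for name in names}


if __name__ == "__main__":
    tools = find_tools(["bedtools", "sortBed", "intersectBed"])
    for name, location in tools.items():
        print("%-13s %s" % (name, location or "NOT FOUND on PATH"))
    # The PATH that this Python process actually searches.
    print("PATH =", os.environ.get("PATH", ""))
```

If a tool is reported as not found here but works in the terminal, the two environments are using different PATH values.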

How to get the validation tsv?

Hi,
I have the truth_vcf. I want to train on the data with the truth VCF using 5-fold cross-validation, and I generated a dataset consisting of many files with 'process.py'. However, I don't know how to get the validation tsv. Could you help me?
Thanks,
Jiangyuan
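
One possible approach, sketched here under assumptions rather than taken from the NeuSomatic documentation, is to split the generated candidate TSV files into folds and use one fold at a time for validation. The train.py log elsewhere on this page shows both a candidates_tsv and a validation_candidates_tsv argument, so each pair of lists could presumably be passed to those two options. k_fold_splits is a hypothetical helper.

```python
def k_fold_splits(tsv_files, k=5):
    """Yield (train_files, validation_files) for each of k folds."""
    files = sorted(tsv_files)
    if k < 2 or k > len(files):
        raise ValueError("need 2 <= k <= number of files")
    # Fold i takes every k-th file, starting at offset i.
    folds = [files[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [f for j, fold in enumerate(folds) if j != i for f in fold]
        yield train, validation


if __name__ == "__main__":
    # Illustrative file names only; real ones come from the dataset folder.
    names = ["candidates_%d.tsv" % n for n in range(10)]
    for fold, (train, val) in enumerate(k_fold_splits(names)):
        print("fold", fold, "train:", len(train), "validation:", val)
```

Splitting by file is coarser than splitting by candidate, so folds may be unbalanced when the files differ a lot in size.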

ValueError: Image must be a numpy array

Hi,

I am trying to run NeuSomatic with CUDA, I have all the right versions of libraries installed, C++ binaries created, but when I run run_test.sh I get the following error:

ERROR 2019-04-04 12:03:19,757 main Traceback (most recent call last):
File "/code/neusomatic/0.2.0/neusomatic/python/call.py", line 590, in <module>
use_cuda)
File "/code/neusomatic/0.2.0/neusomatic/python/call.py", line 518, in call_neusomatic
net, vartype_classes, call_loader, out_dir, model_tag, use_cuda)
File "/code/neusomatic/0.2.0/neusomatic/python/call.py", line 92, in call_variants
imwrite(file_name, non_transformed_matrices[i, :, :, 0:3])
File "/code/Anaconda/3/2018/lib/python3.7/site-packages/imageio/core/functions.py", line 255, in imwrite
raise ValueError("Image must be a numpy array.")
ValueError: Image must be a numpy array.

ERROR 2019-04-04 12:03:19,757 main Aborting!

ERROR 2019-04-04 12:03:19,757 main call.py failure on arguments: Namespace(batch_size=100, candidates_tsv=['work_standalone/dataset/work.0/candidates_0.tsv'], checkpoint='/code/neusomatic/0.2.0/neusomatic/models/NeuSomatic_v0.1.0_standalone_Dream3_70purity.pth', ensemble=False, lowqual_threshold=0.4, max_load_candidates=100000, num_threads=1, out='work_standalone', pass_threshold=0.7, reference='Homo_sapiens.GRCh37.75.dna.chromosome.22.fa')
Traceback (most recent call last):
File "/code/neusomatic/0.2.0/neusomatic/python/call.py", line 595, in <module>
raise e
File "/code/neusomatic/0.2.0/neusomatic/python/call.py", line 590, in <module>
use_cuda)
File "/code/neusomatic/0.2.0/neusomatic/python/call.py", line 518, in call_neusomatic
net, vartype_classes, call_loader, out_dir, model_tag, use_cuda)
File "/code/neusomatic/0.2.0/neusomatic/python/call.py", line 92, in call_variants
imwrite(file_name, non_transformed_matrices[i, :, :, 0:3])
File "/code/Anaconda/3/2018/lib/python3.7/site-packages/imageio/core/functions.py", line 255, in imwrite
raise ValueError("Image must be a numpy array.")
ValueError: Image must be a numpy array.

Supplementary materials for NeuSomatic paper

Hello! Where can I find the supplementary materials for the NeuSomatic paper? They do not seem to be part of the PDF here: https://doi.org/10.1101/39380

scan_alignment fails

Hi. I am running DREAM challenge dataset 3 and scan_alignments fails. What could be the issue?
Thank you.

input:
%sh
export PYTHONPATH="/databricks/python/local/lib/python2.7/site-packages:$PYTHONPATH" # this is just a hack, don't use this code.
cd /local_disk0/neu2/neusomatic-master/test/example
#Stand-alone NeuSomatic test
/usr/bin/python ../../neusomatic/python/preprocess.py \
  --mode call \
  --reference /local_disk0/neu2/neusomatic-master/dream_dataset/Homo_sapiens.GRCh38.dna.toplevel.fa \
  --region_bed /local_disk0/neu2/neusomatic-master/dream_dataset/Homo_sapiens.GRCh38.dna.toplevel.region.bed \
  --tumor_bam /dbfs/mnt/shenli9/neu-vendi/synthetic.challenge.set3.tumor.bam \
  --normal_bam /dbfs/mnt/shenli9/neu-vendi/synthetic.challenge.set3.normal.bam \
  --work work_standalone \
  --scan_maf 0.05 \
  --min_mapq 10 \
  --snp_min_af 0.05 \
  --snp_min_bq 20 \
  --snp_min_ao 10 \
  --ins_min_af 0.05 \
  --del_min_af 0.05 \
  --num_threads 1 \
  --scan_alignments_binary ../../neusomatic/bin/scan_alignments

output:
ERROR 2018-09-21 11:28:03,988 run_scan_alignments (PoolWorker-1) Command '['../../neusomatic/bin/scan_alignments', '--ref', '/local_disk0/neu2/neusomatic-master/dream_dataset/Homo_sapiens.GRCh38.dna.toplevel.fa', '-b', '/dbfs/mnt/shenli9/neu-vendi/synthetic.challenge.set3.tumor.bam', '-L', 'work_standalone/work_tumor_without_q/region_3731.bed', '--out_vcf_file', 'work_standalone/work_tumor_without_q/work.3731/candidates.vcf', '--out_count_file', 'work_standalone/work_tumor_without_q/work.3731/count.bed', '--window_size', '2000', '--min_af', '0.05', '--min_mapq', '10', '--num_thread', '1']' returned non-zero exit status -11
ERROR 2018-09-21 11:28:03,989 run_scan_alignments (PoolWorker-1) Please check error log at work_standalone/work_tumor_without_q/work.3731/scan.err
ERROR 2018-09-21 11:28:03,989 main Traceback (most recent call last):
File "../../neusomatic/python/preprocess.py", line 389, in <module>
args.scan_alignments_binary)
File "../../neusomatic/python/preprocess.py", line 234, in preprocess
calc_qual=False, dbsnp_regions=[])
File "../../neusomatic/python/preprocess.py", line 50, in process_split_region
calc_qual=calc_qual)
File "/local_disk0/neu2/neusomatic-master/neusomatic/python/scan_alignments.py", line 136, in scan_alignments
raise Exception("scan_alignments failed!")
Exception: scan_alignments failed!

ERROR 2018-09-21 11:28:03,989 main Aborting!
ERROR 2018-09-21 11:28:03,990 main preprocess.py failure on arguments: Namespace(dbsnp_to_filter=None, del_merge_min_af=0, del_min_af=0.05, ensemble_tsv=None, good_ao=10, ins_merge_min_af=0, ins_min_af=0.05, long_read=False, matrix_base_pad=7, matrix_width=32, merge_r=0.5, min_ao=1, min_dp=5, min_ev_frac_per_col=0.06, min_mapq=10, mode='call', normal_bam='/dbfs/mnt/shenli9/neu-vendi/synthetic.challenge.set3.normal.bam', num_threads=1, reference='/local_disk0/neu2/neusomatic-master/dream_dataset/Homo_sapiens.GRCh38.dna.toplevel.fa', region_bed='/local_disk0/neu2/neusomatic-master/dream_dataset/Homo_sapiens.GRCh38.dna.toplevel.region.bed', restart=False, scan_alignments_binary='../../neusomatic/bin/scan_alignments', scan_maf=0.05, scan_window_size=2000, skip_without_qual=False, snp_min_af=0.05, snp_min_ao=10.0, snp_min_bq=20.0, truth_vcf=None, tsv_batch_size=50000, tumor_bam='/dbfs/mnt/shenli9/neu-vendi/synthetic.challenge.set3.tumor.bam', work='work_standalone')
Traceback (most recent call last):
File "../../neusomatic/python/preprocess.py", line 395, in <module>
raise e
Exception: scan_alignments failed!

log.docx
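
A note on reading the log: Python's subprocess module reports a child process that was killed by signal N as return code -N, so "non-zero exit status -11" means the scan_alignments binary was terminated by signal 11, which on Linux is SIGSEGV (a segmentation fault), rather than exiting with an error of its own. The small sketch below, for Python 3.5 or later, decodes such return codes; describe_returncode is a hypothetical helper.

```python
import signal


def describe_returncode(returncode):
    """Explain a subprocess return code in words."""
    if returncode == 0:
        return "success"
    if returncode < 0:
        # On POSIX, a negative value -N means "killed by signal N".
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = "unknown signal"
        return "killed by signal %d (%s)" % (-returncode, name)
    return "exited with status %d" % returncode


print(describe_returncode(-11))
```

The scan.err file named in the log, together with the region file for the failing split (region_3731.bed), is the natural place to look next.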

How can I train several times on different NT samples, and then call variants?

Hi,
I want to train several times using different NT samples. I'm not sure whether the train_work can be saved in one file and then used to call variants.
Thanks in advance.
Best wishes.

How to deal with multiallelic variants?

Hi, @msahraeian

I am wondering how you deal with multiallelic variants. If a site in the genome has 3 or more alternate alleles that meet the conditions (e.g. min_af), should I keep all of these alleles or only the major allele?

Sincerely
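
To make the two options in this question concrete, here is a small Python sketch. It is only an illustration and does not describe how NeuSomatic treats such sites. The names alt_counts and depth are hypothetical inputs, and min_af is the threshold named in the question.

```python
def alleles_passing(alt_counts, depth, min_af=0.05):
    """Return [(allele, count, af)] for ALT alleles with af >= min_af,
    ordered from most to least supported."""
    if depth <= 0:
        return []
    passing = []
    for allele, count in alt_counts.items():
        af = count / float(depth)
        if af >= min_af:
            passing.append((allele, count, af))
    # Highest count first; ties are broken alphabetically by allele.
    return sorted(passing, key=lambda t: (-t[1], t[0]))


if __name__ == "__main__":
    # Made-up counts for one site covered by 100 reads.
    kept = alleles_passing({"A": 12, "G": 7, "T": 1}, depth=100)
    print("all passing alleles:", [a for a, _, _ in kept])
    print("major allele only:", kept[0][0] if kept else None)
```

Keeping every element of the returned list corresponds to "keep all alleles"; keeping only the first corresponds to "keep the major allele".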

CUDA example

How do I call train.py on a workstation with CUDA enabled?
I guess some of the parameters, such as num_threads, will not be relevant.

Could you provide an example?

training - input, target not matching

An error: RuntimeError: input and target shapes do not match: input [24 x 1], target [24] at /pytorch/aten/src/THCUNN/generic/SmoothL1Criterion.cu:12 was already reported by other people. For example:
https://discuss.pytorch.org/t/runtimeerror-multi-target-not-supported-newbie/10216/3

I am using pytorch version 0.4.1.

Input:
%sh
export PYTHONPATH="/databricks/python/local/lib/python2.7/site-packages:$PYTHONPATH" # this is just a hack, don't use this code.
cd /local_disk0/neu2/neusomatic-master/test/example

/usr/bin/python ../../neusomatic/python/preprocess.py \
  --mode train \
  --reference Homo_sapiens.GRCh37.75.dna.chromosome.22.fa \
  --region_bed ../region.bed \
  --tumor_bam ../tumor.bam \
  --normal_bam ../normal.bam \
  --work work_train \
  --truth_vcf ../NeuSomatic_ensemble.vcf \
  --min_mapq 10 \
  --num_threads 1 \
  --scan_alignments_binary ../../neusomatic/bin/scan_alignments

%sh
export PYTHONPATH="/databricks/python/local/lib/python2.7/site-packages:$PYTHONPATH"
cd /local_disk0/neu2/neusomatic-master/test/example

/usr/bin/python ../../neusomatic/python/train.py \
    --candidates_tsv work_train/dataset/*/candidates*.tsv \
    --out work_train \
    --num_threads 10 \
    --batch_size 100

Output:
INFO 2018-09-15 16:43:43,998 main Namespace(batch_size=100, boost_none=10, candidates_tsv=['work_train/dataset/work.0/candidates_0.tsv'], checkpoint=None, coverage_thr=100, ensemble=False, lr=0.01, lr_drop_epochs=400, lr_drop_ratio=0.1, max_epochs=1000, max_load_candidates=1000000, momentum=0.9, none_count_scale=2, num_threads=10, out='work_train', validation_candidates_tsv=[])
INFO 2018-09-15 16:43:44,018 main use_cuda: True
INFO 2018-09-15 16:43:44,018 main -----------------------------------------------------------
INFO 2018-09-15 16:43:44,018 main Train NeuSomatic Network
INFO 2018-09-15 16:43:44,018 main -----------------------------------------------------------
INFO 2018-09-15 16:43:46,327 main tag: neusomatic_18-09-15-16-43-46
INFO 2018-09-15 16:43:46,328 dataloader [211]
INFO 2018-09-15 16:43:46,478 dataloader Loaded 211 candidates for work_train/dataset/work.0/candidates_0.tsv
INFO 2018-09-15 16:43:46,484 main Non-somatic candidates: 203
INFO 2018-09-15 16:43:46,484 main Somatic candidates: 8
INFO 2018-09-15 16:43:46,484 main Non-somatic considered in each epoch: 16
INFO 2018-09-15 16:43:46,484 main #Train cadidates: 211
INFO 2018-09-15 16:43:46,484 main count type classes: [('DEL', 2), ('INS', 1), ('NONE', 16), ('SNP', 8)]
INFO 2018-09-15 16:43:46,485 main weight type classes: [('DEL', 0.23148148148148148), ('INS', 0.24074074074074076), ('NONE', 0.10185185185185186), ('SNP', 0.17592592592592593)]
INFO 2018-09-15 16:43:46,485 main weight length classes: [(0, 16), (1, 8), (2, 2), (3, 1)]
INFO 2018-09-15 16:43:46,485 main weight length classes: [(0, 0.10185185185185186), (1, 0.17592592592592593), (2, 0.23148148148148148), (3, 0.24074074074074076)]
INFO 2018-09-15 16:43:46,485 main weights_type:[0.23148148 0.24074074 1.01851852 0.17592593], weights_length:[1.01851852 0.17592593 0.23148148 0.24074074]
INFO 2018-09-15 16:43:46,487 main Number of candidater per epoch: 24
Traceback (most recent call last):
File "../../neusomatic/python/train.py", line 420, in
args.max_load_candidates, args.coverage_thr, use_cuda)
File "../../neusomatic/python/train.py", line 326, in train_neusomatic
) + 1 * criterion_crossentropy2(outputs_len, var_len_labels)
File "/databricks/python/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 477, in call
result = self.forward(*input, **kwargs)
File "/databricks/python/local/lib/python2.7/site-packages/torch/nn/modules/loss.py", line 735, in forward
return F.smooth_l1_loss(input, target, reduction=self.reduction)
File "/databricks/python/local/lib/python2.7/site-packages/torch/nn/functional.py", line 1687, in smooth_l1_loss
return torch._C._nn.smooth_l1_loss(input, target, reduction)
RuntimeError: input and target shapes do not match: input [24 x 1], target [24] at /pytorch/aten/src/THCUNN/generic/SmoothL1Criterion.cu:12

Thank you.
Best,
vendi

Model training

Should I train a model every time I run a new project? Don’t you have a single model for Illumina reads, like DeepVariant?

scan_alignments failed! when using "run_test.sh"

Hi,

I tried to run 'run_test.sh' to check if neusomatic works properly in my system. After installing the tool in my system and running the example code "run_test.sh", I've got the following error message. Would you please look into it? Thanks.

{0}: ./run_test.sh
INFO 2018-10-08 11:46:31,584 main Namespace(dbsnp_to_filter=None, del_merge_min_af=0, del_min_af=0.05, ensemble_tsv=None, good_ao=10, ins_merge_min_af=0, ins_min_af=0.05, long_read=False, matrix_base_pad=7, matrix_width=32, merge_r=0.5, min_ao=1, min_dp=5, min_ev_frac_per_col=0.06, min_mapq=10, mode='call', normal_bam='../normal.bam', num_threads=1, reference='Homo_sapiens.GRCh37.75.dna.chromosome.22.fa', region_bed='../region.bed', restart=False, scan_alignments_binary='/site/ne/app/x86_64/neusomatic/v0.1.1/neusomatic/bin/scan_alignments', scan_maf=0.05, scan_window_size=2000, skip_without_qual=False, snp_min_af=0.05, snp_min_ao=10.0, snp_min_bq=20.0, truth_vcf=None, tsv_batch_size=50000, tumor_bam='../tumor.bam', work='work_standalone')
INFO 2018-10-08 11:46:31,584 preprocess ----------------------Preprocessing------------------------
INFO 2018-10-08 11:46:31,587 preprocess Scan tumor bam (first without quality scores).
INFO 2018-10-08 11:46:31,589 process_split_region Scan bam.
INFO 2018-10-08 11:46:31,590 scan_alignments -------------------Scan Alignment BAM----------------------
INFO 2018-10-08 11:46:31,612 split_region ------------------------Split region-----------------------
INFO 2018-10-08 11:46:31,629 split_region Total length: 40516
INFO 2018-10-08 11:46:31,639 split_region Split 0: 40516
INFO 2018-10-08 11:46:31,639 split_region Total splitted length: 40516
ERROR 2018-10-08 11:46:31,656 run_scan_alignments (PoolWorker-1) Traceback (most recent call last):
File "/site/ne/app/x86_64/neusomatic/v0.1.1/python/scan_alignments.py", line 32, in run_scan_alignments
raise IOError("File not found: {}".format(scan_alignments_binary))
IOError: File not found: /site/ne/app/x86_64/neusomatic/v0.1.1/neusomatic/bin/scan_alignments

ERROR 2018-10-08 11:46:31,656 run_scan_alignments (PoolWorker-1) File not found: /site/ne/app/x86_64/neusomatic/v0.1.1/neusomatic/bin/scan_alignments
ERROR 2018-10-08 11:46:31,658 main Traceback (most recent call last):
File "/site/ne/app/x86_64/neusomatic/v0.1.1/python/preprocess.py", line 389, in
args.scan_alignments_binary)
File "/site/ne/app/x86_64/neusomatic/v0.1.1/python/preprocess.py", line 234, in preprocess
calc_qual=False, dbsnp_regions=[])
File "/site/ne/app/x86_64/neusomatic/v0.1.1/python/preprocess.py", line 50, in process_split_region
calc_qual=calc_qual)
File "/site/ne/app/x86_64/neusomatic/v0.1.1/python/scan_alignments.py", line 136, in scan_alignments
raise Exception("scan_alignments failed!")
Exception: scan_alignments failed!

ERROR 2018-10-08 11:46:31,658 main Aborting!
ERROR 2018-10-08 11:46:31,658 main preprocess.py failure on arguments: Namespace(dbsnp_to_filter=None, del_merge_min_af=0, del_min_af=0.05, ensemble_tsv=None, good_ao=10, ins_merge_min_af=0, ins_min_af=0.05, long_read=False, matrix_base_pad=7, matrix_width=32, merge_r=0.5, min_ao=1, min_dp=5, min_ev_frac_per_col=0.06, min_mapq=10, mode='call', normal_bam='../normal.bam', num_threads=1, reference='Homo_sapiens.GRCh37.75.dna.chromosome.22.fa', region_bed='../region.bed', restart=False, scan_alignments_binary='/site/ne/app/x86_64/neusomatic/v0.1.1/neusomatic/bin/scan_alignments', scan_maf=0.05, scan_window_size=2000, skip_without_qual=False, snp_min_af=0.05, snp_min_ao=10.0, snp_min_bq=20.0, truth_vcf=None, tsv_batch_size=50000, tumor_bam='../tumor.bam', work='work_standalone')
Traceback (most recent call last):
File "/site/ne/app/x86_64/neusomatic/v0.1.1/python/preprocess.py", line 395, in
raise e
Exception: scan_alignments failed!

Test fail

Hi,
I just cloned the repository and tried running the test. I am getting this error:

INFO 2020-05-13 20:24:16,791 __main__             Namespace(dbsnp_to_filter=None, del_merge_min_af=0, del_min_af=0.05, ensemble_tsv=None, filter_duplicate=False, first_do_without_qual=False, good_ao=10, ins_merge_min_af=0, ins_min_af=0.05, long_read=False, matrix_base_pad=7, matrix_width=32, max_dp=100000, merge_r=0.5, min_ao=1, min_dp=5, min_ev_frac_per_col=0.06, min_mapq=10, mode='call', normal_bam='/n/data1/hms/dbmi/park/victor/software/neusomatic/test/normal.bam', num_threads=1, reference='Homo_sapiens.GRCh37.75.dna.chromosome.22.fa', region_bed='/n/data1/hms/dbmi/park/victor/software/neusomatic/test/region.bed', restart=False, scan_alignments_binary='/n/data1/hms/dbmi/park/victor/software/neusomatic/neusomatic/bin/scan_alignments', scan_maf=0.05, scan_window_size=2000, snp_min_af=0.05, snp_min_ao=10.0, snp_min_bq=20.0, truth_vcf=None, tsv_batch_size=50000, tumor_bam='/n/data1/hms/dbmi/park/victor/software/neusomatic/test/tumor.bam', work='work_standalone')
INFO 2020-05-13 20:24:16,792 preprocess           ----------------------Preprocessing------------------------
INFO 2020-05-13 20:24:16,798 preprocess           Scan tumor bam (and extracting quality scores).
INFO 2020-05-13 20:24:16,799 process_split_region Scan bam.
INFO 2020-05-13 20:24:16,799 scan_alignments      -------------------Scan Alignment BAM----------------------
INFO 2020-05-13 20:24:16,833 split_region         ------------------------Split region-----------------------
INFO 2020-05-13 20:24:16,869 split_region         Total length: 33514
INFO 2020-05-13 20:24:16,882 split_region         Split 0: 33514
INFO 2020-05-13 20:24:16,882 split_region         Total splitted length: 33514
ERROR 2020-05-13 20:24:16,898 run_scan_alignments (ForkPoolWorker-1) Traceback (most recent call last):
  File "/n/data1/hms/dbmi/park/victor/software/neusomatic/neusomatic/python/scan_alignments.py", line 40, in run_scan_alignments
    raise IOError("File not found: {}".format(scan_alignments_binary))
OSError: File not found: /n/data1/hms/dbmi/park/victor/software/neusomatic/neusomatic/bin/scan_alignments

ERROR 2020-05-13 20:24:16,898 run_scan_alignments (ForkPoolWorker-1) File not found: /n/data1/hms/dbmi/park/victor/software/neusomatic/neusomatic/bin/scan_alignments
ERROR 2020-05-13 20:24:16,899 __main__             Traceback (most recent call last):
  File "/n/data1/hms/dbmi/park/victor/software/neusomatic/neusomatic/python/preprocess.py", line 435, in <module>
    args.scan_alignments_binary)
  File "/n/data1/hms/dbmi/park/victor/software/neusomatic/neusomatic/python/preprocess.py", line 291, in preprocess
    dbsnp_regions=dbsnp_regions_q)
  File "/n/data1/hms/dbmi/park/victor/software/neusomatic/neusomatic/python/preprocess.py", line 54, in process_split_region
    calc_qual=calc_qual)
  File "/n/data1/hms/dbmi/park/victor/software/neusomatic/neusomatic/python/scan_alignments.py", line 147, in scan_alignments
    raise Exception("scan_alignments failed!")
Exception: scan_alignments failed!

ERROR 2020-05-13 20:24:16,900 __main__             Aborting!
ERROR 2020-05-13 20:24:16,900 __main__             preprocess.py failure on arguments: Namespace(dbsnp_to_filter=None, del_merge_min_af=0, del_min_af=0.05, ensemble_tsv=None, filter_duplicate=False, first_do_without_qual=False, good_ao=10, ins_merge_min_af=0, ins_min_af=0.05, long_read=False, matrix_base_pad=7, matrix_width=32, max_dp=100000, merge_r=0.5, min_ao=1, min_dp=5, min_ev_frac_per_col=0.06, min_mapq=10, mode='call', normal_bam='/n/data1/hms/dbmi/park/victor/software/neusomatic/test/normal.bam', num_threads=1, reference='Homo_sapiens.GRCh37.75.dna.chromosome.22.fa', region_bed='/n/data1/hms/dbmi/park/victor/software/neusomatic/test/region.bed', restart=False, scan_alignments_binary='/n/data1/hms/dbmi/park/victor/software/neusomatic/neusomatic/bin/scan_alignments', scan_maf=0.05, scan_window_size=2000, snp_min_af=0.05, snp_min_ao=10.0, snp_min_bq=20.0, truth_vcf=None, tsv_batch_size=50000, tumor_bam='/n/data1/hms/dbmi/park/victor/software/neusomatic/test/tumor.bam', work='work_standalone')
Traceback (most recent call last):
  File "/n/data1/hms/dbmi/park/victor/software/neusomatic/neusomatic/python/preprocess.py", line 441, in <module>
    raise e
  File "/n/data1/hms/dbmi/park/victor/software/neusomatic/neusomatic/python/preprocess.py", line 435, in <module>
    args.scan_alignments_binary)
  File "/n/data1/hms/dbmi/park/victor/software/neusomatic/neusomatic/python/preprocess.py", line 291, in preprocess
    dbsnp_regions=dbsnp_regions_q)
  File "/n/data1/hms/dbmi/park/victor/software/neusomatic/neusomatic/python/preprocess.py", line 54, in process_split_region
    calc_qual=calc_qual)
  File "/n/data1/hms/dbmi/park/victor/software/neusomatic/neusomatic/python/scan_alignments.py", line 147, in scan_alignments
    raise Exception("scan_alignments failed!")
Exception: scan_alignments failed!

Do you know why this is?
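
Both logs above stop at "File not found: .../neusomatic/bin/scan_alignments", so preprocess.py never reached the BAM scan: the path given to --scan_alignments_binary does not exist on disk. A likely cause (an assumption, since neither post shows the build step) is that the C++ binary was not compiled before run_test.sh was started. Below is a minimal sketch of a pre-flight check; the helper name check_scan_alignments_binary is hypothetical and not part of NeuSomatic.

```python
import os


def check_scan_alignments_binary(path):
    """Classify the state of the binary passed to --scan_alignments_binary.

    Hypothetical helper for illustration only.
    Returns "missing", "not executable", or "ok".
    """
    # A directory or a non-existent path both count as missing.
    if not os.path.isfile(path):
        return "missing"
    # The file exists but the current user may not be allowed to run it.
    if not os.access(path, os.X_OK):
        return "not executable"
    return "ok"


if __name__ == "__main__":
    # Path relative to the repository root, as it appears in the logs above.
    print(check_scan_alignments_binary("neusomatic/bin/scan_alignments"))
```

If the result is "missing", it is worth checking whether the build step completed without errors before rerunning the test.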

Bed file from my bam file

Hi, I want to run variant calling on my own BAM files, and I have two questions.

1. How do I make a BED file from my BAM file? Do you have any recommendation (e.g. bedtools bamtobed)? Which BAM file (ref / normal / tumor) should I use?
2. If I use a different BED file for calling, do I have to train the model again with that BED file?

Thanks!
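
The region BED lists intervals on the reference, so one option is to derive it from the reference index (.fai) instead of from a BAM. This is only a sketch under that assumption, not the maintainers' recommendation, and it does not answer the retraining question. The function name fai_to_bed_lines is hypothetical. The .fai columns are name, length, offset, line bases, and line width; BED intervals are 0-based and half-open. In the example row, the offset and line-width values are made up.

```python
def fai_to_bed_lines(fai_text):
    """Turn the text of a FASTA index (.fai) into whole-contig BED lines.

    Hypothetical helper: each contig becomes one interval [0, length).
    """
    bed_lines = []
    for row in fai_text.splitlines():
        if not row.strip():
            continue  # skip blank lines
        fields = row.split("\t")
        name, length = fields[0], int(fields[1])
        bed_lines.append("{}\t0\t{}".format(name, length))
    return bed_lines


if __name__ == "__main__":
    # One illustrative .fai row for chromosome 22 of GRCh37.
    example_fai = "22\t51304566\t4\t60\t61\n"
    print("\n".join(fai_to_bed_lines(example_fai)))
```

Whole-contig regions can be very large, so restricting the BED to the regions actually of interest (for example, capture targets) is probably more practical.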

AssertionError on fast_file.fetch()

I'm trying to run the following command:

preprocess.py \
    --mode train \
    --reference ${refGenome} \
    --region_bed ${bed} \
    --tumor_bam ${input_dir}/syntheticTumor.bam \
    --normal_bam ${input_dir}/syntheticNormal.bam \
    --work ${input_dir}/work_train_2 \
    --truth_vcf ${input_dir}/synthetic_snvs.vcf \
    --min_mapq 10 \
    --num_threads 20 \
    --scan_alignments_binary ${NEUSOMATIC_SCAN_ALIGNMENTS}

And it is giving me the error messages below. I'm assuming this is because the data that I'm using are not yielding any results and the fasta files are not being created? But maybe it is something totally different. Could someone comment?

It also might be useful to know that this is running using this container from DockerHub and I'm running it with Singularity.

Any help is appreciated. Thanks!

[...snip]
INFO 2020-02-04 14:03:29,122 find_records (ForkPoolWorker-59) Start find_records for worker 18
INFO 2020-02-04 14:03:29,125 find_records (ForkPoolWorker-60) Start find_records for worker 19
ERROR 2020-02-04 14:03:29,780 find_records (ForkPoolWorker-45) Traceback (most recent call last):
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 1055, in find_records
    mt2, eqs2 = push_lr(fasta_file, mt, 2)
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 728, in push_lr
    assert(fasta_file.fetch((c), p - 1, p - 1 + len(r)).upper() == r)
AssertionError

ERROR 2020-02-04 14:03:29,780 find_records (ForkPoolWorker-45) 
ERROR 2020-02-04 14:03:29,788 find_records (ForkPoolWorker-47) Traceback (most recent call last):
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 1055, in find_records
    mt2, eqs2 = push_lr(fasta_file, mt, 2)
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 728, in push_lr
    assert(fasta_file.fetch((c), p - 1, p - 1 + len(r)).upper() == r)
AssertionError

ERROR 2020-02-04 14:03:29,788 find_records (ForkPoolWorker-47) 
ERROR 2020-02-04 14:03:29,790 find_records (ForkPoolWorker-49) Traceback (most recent call last):
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 1055, in find_records
    mt2, eqs2 = push_lr(fasta_file, mt, 2)
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 728, in push_lr
    assert(fasta_file.fetch((c), p - 1, p - 1 + len(r)).upper() == r)
AssertionError

ERROR 2020-02-04 14:03:29,790 find_records (ForkPoolWorker-49) 
ERROR 2020-02-04 14:03:29,790 find_records (ForkPoolWorker-60) Traceback (most recent call last):
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 1055, in find_records
    mt2, eqs2 = push_lr(fasta_file, mt, 2)
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 728, in push_lr
    assert(fasta_file.fetch((c), p - 1, p - 1 + len(r)).upper() == r)
AssertionError

ERROR 2020-02-04 14:03:29,790 find_records (ForkPoolWorker-60) 
ERROR 2020-02-04 14:03:29,796 find_records (ForkPoolWorker-55) Traceback (most recent call last):
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 1055, in find_records
    mt2, eqs2 = push_lr(fasta_file, mt, 2)
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 728, in push_lr
    assert(fasta_file.fetch((c), p - 1, p - 1 + len(r)).upper() == r)
AssertionError

ERROR 2020-02-04 14:03:29,796 find_records (ForkPoolWorker-55) 
ERROR 2020-02-04 14:03:29,796 find_records (ForkPoolWorker-42) Traceback (most recent call last):
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 1055, in find_records
    mt2, eqs2 = push_lr(fasta_file, mt, 2)
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 728, in push_lr
    assert(fasta_file.fetch((c), p - 1, p - 1 + len(r)).upper() == r)
AssertionError

ERROR 2020-02-04 14:03:29,797 find_records (ForkPoolWorker-42) 
ERROR 2020-02-04 14:03:29,798 find_records (ForkPoolWorker-41) Traceback (most recent call last):
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 1055, in find_records
    mt2, eqs2 = push_lr(fasta_file, mt, 2)
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 728, in push_lr
    assert(fasta_file.fetch((c), p - 1, p - 1 + len(r)).upper() == r)
AssertionError

ERROR 2020-02-04 14:03:29,799 find_records (ForkPoolWorker-41) 
ERROR 2020-02-04 14:03:29,803 find_records (ForkPoolWorker-51) Traceback (most recent call last):
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 1055, in find_records
    mt2, eqs2 = push_lr(fasta_file, mt, 2)
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 728, in push_lr
    assert(fasta_file.fetch((c), p - 1, p - 1 + len(r)).upper() == r)
AssertionError

ERROR 2020-02-04 14:03:29,803 find_records (ForkPoolWorker-51) 
ERROR 2020-02-04 14:03:29,805 find_records (ForkPoolWorker-57) Traceback (most recent call last):
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 1055, in find_records
    mt2, eqs2 = push_lr(fasta_file, mt, 2)
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 728, in push_lr
    assert(fasta_file.fetch((c), p - 1, p - 1 + len(r)).upper() == r)
AssertionError

ERROR 2020-02-04 14:03:29,805 find_records (ForkPoolWorker-57) 
ERROR 2020-02-04 14:03:29,808 find_records (ForkPoolWorker-58) Traceback (most recent call last):
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 1055, in find_records
    mt2, eqs2 = push_lr(fasta_file, mt, 2)
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 728, in push_lr
    assert(fasta_file.fetch((c), p - 1, p - 1 + len(r)).upper() == r)
AssertionError

ERROR 2020-02-04 14:03:29,809 find_records (ForkPoolWorker-58) 
ERROR 2020-02-04 14:03:29,809 find_records (ForkPoolWorker-48) Traceback (most recent call last):
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 1055, in find_records
    mt2, eqs2 = push_lr(fasta_file, mt, 2)
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 728, in push_lr
    assert(fasta_file.fetch((c), p - 1, p - 1 + len(r)).upper() == r)
AssertionError

ERROR 2020-02-04 14:03:29,809 find_records (ForkPoolWorker-56) Traceback (most recent call last):
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 1055, in find_records
    mt2, eqs2 = push_lr(fasta_file, mt, 2)
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 728, in push_lr
    assert(fasta_file.fetch((c), p - 1, p - 1 + len(r)).upper() == r)
AssertionError

ERROR 2020-02-04 14:03:29,810 find_records (ForkPoolWorker-56) 
ERROR 2020-02-04 14:03:29,810 find_records (ForkPoolWorker-48) 
ERROR 2020-02-04 14:03:29,812 find_records (ForkPoolWorker-46) Traceback (most recent call last):
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 1055, in find_records
    mt2, eqs2 = push_lr(fasta_file, mt, 2)
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 728, in push_lr
    assert(fasta_file.fetch((c), p - 1, p - 1 + len(r)).upper() == r)
AssertionError

ERROR 2020-02-04 14:03:29,813 find_records (ForkPoolWorker-46) 
INFO 2020-02-04 14:03:29,845 find_records (ForkPoolWorker-53) N_none: 263 
INFO 2020-02-04 14:03:29,845 find_records (ForkPoolWorker-54) N_none: 239 
INFO 2020-02-04 14:03:29,846 find_records (ForkPoolWorker-50) N_none: 272 
INFO 2020-02-04 14:03:29,847 find_records (ForkPoolWorker-43) N_none: 250 
ERROR 2020-02-04 14:03:29,855 find_records (ForkPoolWorker-44) Traceback (most recent call last):
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 1055, in find_records
    mt2, eqs2 = push_lr(fasta_file, mt, 2)
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 728, in push_lr
    assert(fasta_file.fetch((c), p - 1, p - 1 + len(r)).upper() == r)
AssertionError

ERROR 2020-02-04 14:03:29,855 find_records (ForkPoolWorker-44) 
ERROR 2020-02-04 14:03:29,855 find_records (ForkPoolWorker-59) Traceback (most recent call last):
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 1055, in find_records
    mt2, eqs2 = push_lr(fasta_file, mt, 2)
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 728, in push_lr
    assert(fasta_file.fetch((c), p - 1, p - 1 + len(r)).upper() == r)
AssertionError

ERROR 2020-02-04 14:03:29,856 find_records (ForkPoolWorker-59) 
ERROR 2020-02-04 14:03:29,929 find_records (ForkPoolWorker-52) Traceback (most recent call last):
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 1055, in find_records
    mt2, eqs2 = push_lr(fasta_file, mt, 2)
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 728, in push_lr
    assert(fasta_file.fetch((c), p - 1, p - 1 + len(r)).upper() == r)
AssertionError

ERROR 2020-02-04 14:03:29,930 find_records (ForkPoolWorker-52) 
ERROR 2020-02-04 14:03:29,931 __main__             Traceback (most recent call last):
  File "/opt/neusomatic/neusomatic/python//preprocess.py", line 435, in <module>
    args.scan_alignments_binary)
  File "/opt/neusomatic/neusomatic/python//preprocess.py", line 335, in preprocess
    ensemble_beds[i] if ensemble_tsv else None, tsv_batch_size)
  File "/opt/neusomatic/neusomatic/python//preprocess.py", line 129, in generate_dataset_region
    tsv_batch_size)
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 1461, in generate_dataset
    raise Exception("find_records failed!")
Exception: find_records failed!

ERROR 2020-02-04 14:03:29,931 __main__             Aborting!
ERROR 2020-02-04 14:03:29,931 __main__             preprocess.py failure on arguments: Namespace(dbsnp_to_filter=None, del_merge_min_af=0, del_min_af=0.05, ensemble_tsv=None, filter_duplicate=False, first_do_without_qual=False, good_ao=10, ins_merge_min_af=0, ins_min_af=0.05, long_read=False, matrix_base_pad=7, matrix_width=32, max_dp=100000, merge_r=0.5, min_ao=1, min_dp=5, min_ev_frac_per_col=0.06, min_mapq=1, mode='train', normal_bam='/data/godlovedc/slurm-job/hapmap_output_multi_mda_snv/syntheticNormal.bam', num_threads=20, reference='/data/godlovedc/slurm-job/hg38.fa', region_bed='/data/godlovedc/slurm-job/broad_MDA_mocha_overlap_cds.bed', restart=False, scan_alignments_binary='/opt/neusomatic/neusomatic/bin/scan_alignments', scan_maf=0.01, scan_window_size=2000, snp_min_af=0.05, snp_min_ao=3, snp_min_bq=10, truth_vcf='/data/godlovedc/slurm-job/hapmap_output_multi_mda_snv/synthetic_snvs.vcf', tsv_batch_size=50000, tumor_bam='/data/godlovedc/slurm-job/hapmap_output_multi_mda_snv/syntheticTumor.bam', work='/data/godlovedc/slurm-job/hapmap_output_multi_mda_snv/work_train_2')
Traceback (most recent call last):
  File "/opt/neusomatic/neusomatic/python//preprocess.py", line 441, in <module>
    raise e
  File "/opt/neusomatic/neusomatic/python//preprocess.py", line 435, in <module>
    args.scan_alignments_binary)
  File "/opt/neusomatic/neusomatic/python//preprocess.py", line 335, in preprocess
    ensemble_beds[i] if ensemble_tsv else None, tsv_batch_size)
  File "/opt/neusomatic/neusomatic/python//preprocess.py", line 129, in generate_dataset_region
    tsv_batch_size)
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 1461, in generate_dataset
    raise Exception("find_records failed!")
Exception: find_records failed!
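
The failing line, assert(fasta_file.fetch((c), p - 1, p - 1 + len(r)).upper() == r), compares the reference bases at a 1-based position p with an allele string r. Reading r as the REF allele of a truth VCF record is an assumption based on the traceback alone. Under that assumption, the assertion fails when the truth VCF and the reference FASTA disagree, for example when they come from different genome builds or when REF is written in lowercase. The sketch below reproduces the same comparison on in-memory data so that offending records can be listed; find_ref_mismatches and its inputs are hypothetical.

```python
def find_ref_mismatches(records, sequences):
    """List records whose REF allele differs from the reference sequence.

    records:   iterable of (chrom, pos, ref), pos is 1-based as in VCF.
    sequences: dict mapping chrom name to its sequence string.
    Returns a list of (chrom, pos, ref, observed); observed is None when
    the contig is absent from the reference.
    """
    mismatches = []
    for chrom, pos, ref in records:
        seq = sequences.get(chrom)
        if seq is None:
            observed = None
        else:
            # Same slice and upper-casing as the assertion in the traceback.
            observed = seq[pos - 1:pos - 1 + len(ref)].upper()
        if observed != ref:
            mismatches.append((chrom, pos, ref, observed))
    return mismatches


if __name__ == "__main__":
    reference = {"chr1": "acgtACGT"}
    truth = [("chr1", 1, "A"), ("chr1", 2, "G"), ("chr1", 3, "g"), ("chr2", 1, "A")]
    for item in find_ref_mismatches(truth, reference):
        print(item)
```

With real data, the sequences would come from the reference FASTA and the records from the truth VCF. An empty result would point away from a REF/reference disagreement as the cause.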

Fail in installing step

Hi, friend

When I install the software, I get the following error:

Scanning dependencies of target seqan
[  5%] Creating directories for 'seqan'
[ 10%] Performing download step (git clone) for 'seqan'
Cloning into 'seqan'...
error: RPC failed; result=18, HTTP code = 200
fatal: The remote end hung up unexpectedly
fatal: early EOF
fatal: index-pack failed
Cloning into 'seqan'...
fatal: unable to access 'https://github.com/seqan/seqan.git/': Encountered end of file
Cloning into 'seqan'...
error: RPC failed; result=18, HTTP code = 200
fatal: The remote end hung up unexpectedly
fatal: early EOF
fatal: index-pack failed
-- Had to git clone more than once:
      3 times.
CMake Error at /../neusomatic/third_party/seqan/tmp/seqan-gitclone.cmake:66 (message):
Failed to clone repository: 'https://github.com/seqan/seqan.git'


make[2]: *** [/../neusomatic/third_party/seqan/src/seqan-stamp/seqan-download] Error 1
make[1]: *** [CMakeFiles/seqan.dir/all] Error 2
make: *** [all] Error 2

Does this mean that my cluster cannot connect to the seqan GitHub repository? I have already installed the dependent software and Python packages.

Thanks

Multi-patient training

I would like to run a large scale training task over thousands of labelled patient BAMs. Is this currently supported with neusomatic in any way, or will I have to write some custom code to recombine the generated training data?
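
The training log in the "training - input, target not matching" issue shows --candidates_tsv being parsed into a list (candidates_tsv=['work_train/dataset/work.0/candidates_0.tsv']), which suggests that train.py can take several TSV files at once. Assuming each patient is preprocessed into its own work directory with that same dataset/work.N/candidates_N.tsv layout, the files could be gathered as sketched below. Both function names are hypothetical, and whether this scales to thousands of patients is not confirmed anywhere in this thread.

```python
import glob
import os


def collect_candidate_tsvs(work_dirs):
    """Gather candidate TSVs from several per-patient work directories.

    Assumes the layout <work>/dataset/<split>/candidates*.tsv seen in the log.
    """
    paths = []
    for work_dir in work_dirs:
        pattern = os.path.join(work_dir, "dataset", "*", "candidates*.tsv")
        paths.extend(sorted(glob.glob(pattern)))
    return paths


def build_train_command(candidate_tsvs, out_dir):
    """Build an argument list using only the flags shown in the issues."""
    return (["python", "train.py", "--candidates_tsv"]
            + list(candidate_tsvs)
            + ["--out", out_dir])


if __name__ == "__main__":
    tsvs = collect_candidate_tsvs(["work_patient1", "work_patient2"])
    print(" ".join(build_train_command(tsvs, "work_train_all")))
```

The same log also shows max_load_candidates=1000000, so for very large cohorts it may be worth checking how train.py treats candidate counts above that value.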

Train.py Runtime Error

I am very new to NeuSomatic and get a RuntimeError from train.py. Any help is appreciated.

CMD: train.py --candidates_tsv dataset/*/candidates*.tsv --out out --num_threads 10 --batch_size 100
OUTPUT:
INFO 2020-01-31 07:34:34,124 train_neusomatic PyTorch Version: 1.1.0
INFO 2020-01-31 07:34:34,124 train_neusomatic Torchvision Version: 0.3.0
INFO 2020-01-31 07:34:34,165 train_neusomatic GPU training!
INFO 2020-01-31 07:34:38,699 train_neusomatic tag: neusomatic_20-01-31-07-34-38
.
.
. run fine to this point

INFO 2020-01-31 07:34:45,115 train_neusomatic Number of candidater per epoch: 86022
ERROR 2020-01-31 07:34:46,082 main Traceback (most recent call last):
File "/opt/neusomatic/neusomatic/python//train.py", line 570, in use_cuda)
File "/opt/neusomatic/neusomatic/python//train.py", line 430, in train_neusomatic outputs, _ = net(inputs)
File "/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, **kwargs)
File "/opt/neusomatic/neusomatic/python/network.py", line 67, in forward x = self.pool1(F.relu(self.bn1(self.conv1(x))))
File "/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, **kwargs)
File "/miniconda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 338, in forward
self.padding, self.dilation, self.groups)
RuntimeError: Given groups=1, weight of size 64 26 1 3, expected input[100, 119, 5, 32] to have 26 channels, but got 119 channels instead

scan_alignments using up all remaining disk space (>100GB) and fails

Hi,

I am trying to run preprocess.py in call mode on a whole-genome sample. The target regions .bed file I used is resources/hg19.bed. The work_call directory keeps growing until it fills the storage device (> 100GB). Is this expected, and how much space is the scan_alignments step expected to consume? Thank you.

Exception: scan_alignments failed!

terminate called after throwing an instance of 'std::runtime_error'

what(): cannot find index file for GRCh38.fa
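
The message says that the FASTA index for GRCh38.fa could not be found. By convention the index sits next to the FASTA as <fasta>.fai and is created with "samtools faidx <fasta>". That scan_alignments looks for exactly this file name is an assumption. A minimal sketch of a check to run before preprocessing follows; missing_fasta_index is a hypothetical helper.

```python
import os


def missing_fasta_index(fasta_path):
    """Return the expected .fai path if it is absent, otherwise None.

    Hypothetical helper; assumes the index is named <fasta_path>.fai.
    """
    index_path = fasta_path + ".fai"
    if os.path.isfile(index_path):
        return None
    return index_path


if __name__ == "__main__":
    missing = missing_fasta_index("GRCh38.fa")
    if missing is not None:
        print("Index not found: {} (try: samtools faidx GRCh38.fa)".format(missing))
```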

Error when running build.sh

Error log:

error: 'coverage' function with trailing return type has 'decltype(auto)' as its type rather than plain 'auto'
 41 | inline decltype (auto) coverage(const std::vector<Interval>& invs) -> std::vector<typename Interval::Depth> {

error: 'coverage' was not declared in this scope
 298 |     std::vector<typename Interval::Depth> cov = coverage<Interval, true> (invs_);

error: expected primary-expression before ',' token
 298 |     std::vector<typename Interval::Depth> cov = coverage<Interval, true> (invs_);

error: 'coverage' was not declared in this scope
 341 |     std::vector<typename Interval::Depth> cov = coverage<Interval, true> (invs_);

error: expected primary-expression before ',' token
 341 |     std::vector<typename Interval::Depth> cov = coverage<Interval, true> (invs_);

How to access CGC files?

Hi and thank you very much for the wonderful work you provided with NeuSomatic and SEQC-II.
I would like to use SEQC-II for benchmarking, without having to realign or run the costly pipelines. I noticed that all the files I am interested in are present on this page.
However, when I click a link and log in with my eRA account, I get an error page.

Does this mean I need special permission or that the links are dead?
Thank you.

Ensemble method error in preprocess.py & generate_dataset.py

Hello!

I just had a quick question based on the error below, which I am getting when I try to use the ensemble mode of neusomatic. (I have had a good experience using the stand-alone method so first off I just want to say thanks for the great tool!)

Error:
line 1350, in extract_ensemble: s = ensemble_data[:, np.array(i_s)]
IndexError: arrays used as indices must be of integer (or boolean) type

Some background: I am currently running a test on one sample for which I have matched tumor and normal BAM files, in addition to mutect2, strelka2 (indels/SNV), and vardict VCF files. I generated the ensemble_ann.tsv as described in the documentation, but when I provide it to the preprocess.py script, this error is thrown.

The line of this error is from the generate_dataset.py script

neusomatic/neusomatic/python/generate_dataset.py

Line 1350 in d2dd889

s = ensemble_data[:, np.array(i_s)]

My question: could someone explain where the variable "i_s" is defined and what it holds?

I noticed earlier on in this function, that on the lines 1267 & 1268, float values are being appended to "ensemble_data" so I'm not sure if that has anything to do with the above error. I do think that understanding the "i_s" variable more could help me narrow down the source of my error, so if anyone has any suggestion or other opinions, they would be greatly appreciated!

neusomatic/neusomatic/python/generate_dataset.py

Line 1267 in d2dd889

ensemble_data.append(list(map(lambda x: float(

Thank you!
Stephanie

Regarding the necessity of normal.bam file

I am currently working on a tumor bulk RNA-seq data set in order to identify somatic mutations. I have the BAM file for the tumor data set but I do not have the BAM file for a normal sample. Is there any way to use NeuSomatic without a normal BAM file?

truth somatic variant .vcf file

Hi,
Where can I get the truth somatic variant .vcf file?

Sincerely, Yueyang.

RuntimeError: The size of tensor a (26) must match the size of tensor b (3) at non-singleton dimension 0 (call.py)

I'm getting this RuntimeError when I run call.py in call mode. I am using python3.7, torch==1.0.1 and torchvision==0.2.2 as listed in the README file.

Here is the traceback:

Traceback (most recent call last):
File "call.py", line 595, in <module>
raise e
File "call.py", line 590, in <module>
use_cuda)
File "call.py", line 518, in call_neusomatic
net, vartype_classes, call_loader, out_dir, model_tag, use_cuda)
File "call.py", line 67, in call_variants
for data in loader_:
File "/home/kiran/neusomaticenv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 637, in __next__
return self._process_next_batch(batch)
File "/home/kiran/neusomaticenv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 658, in _process_next_batch
raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
File "/home/kiran/neusomaticenv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
samples = collate_fn([dataset[i] for i in batch_indices])
File "/home/kiran/neusomaticenv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 138, in <listcomp>
samples = collate_fn([dataset[i] for i in batch_indices])
File "/home/kiran/neusomatic/neusomatic/python/dataloader.py", line 393, in __getitem__
matrix = self.transform(matrix)
File "/home/kiran/neusomaticenv/lib/python3.7/site-packages/torchvision/transforms/transforms.py", line 60, in __call__
img = t(img)
File "/home/kiran/neusomaticenv/lib/python3.7/site-packages/torchvision/transforms/transforms.py", line 163, in __call__
return F.normalize(tensor, self.mean, self.std, self.inplace)
File "/home/kiran/neusomaticenv/lib/python3.7/site-packages/torchvision/transforms/functional.py", line 208, in normalize
tensor.sub(mean[:, None, None]).div(std[:, None, None])
RuntimeError: The size of tensor a (26) must match the size of tensor b (3) at non-singleton dimension 0
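
The last frames of the traceback show the normalization step combining the input tensor with `mean[:, None, None]`, and the message says the tensor has 26 entries along dimension 0 while the mean has 3. The sketch below is not NeuSomatic or torchvision code; it is a minimal pure-Python restatement of the usual broadcasting rule, with a hypothetical helper `broadcast_shape` and made-up matrix height and width, to show why a 3-element mean cannot be applied to a 26-channel matrix. What produced the 3-element mean in this run is not visible from the traceback.

```python
def broadcast_shape(shape_a, shape_b):
    """Return the broadcast shape of two tensor shapes, or raise ValueError.

    Follows the usual NumPy/PyTorch rule: shapes are aligned from the right,
    and two sizes are compatible when they are equal or one of them is 1.
    """
    ndim = max(len(shape_a), len(shape_b))
    # Left-pad the shorter shape with 1s so both shapes have the same rank.
    a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)
    out = []
    for dim, (x, y) in enumerate(zip(a, b)):
        if x != y and x != 1 and y != 1:
            raise ValueError(
                f"size {x} must match size {y} at non-singleton dimension {dim}")
        out.append(y if x == 1 else x)
    return tuple(out)


# Hypothetical shapes: a 26-channel matrix (height 5 and width 32 are
# illustrative only) and a per-channel mean reshaped by mean[:, None, None].
matrix_shape = (26, 5, 32)

# A mean with one value per channel broadcasts cleanly.
print(broadcast_shape(matrix_shape, (26, 1, 1)))

# A mean with only 3 values fails at dimension 0, as in the traceback.
try:
    broadcast_shape(matrix_shape, (3, 1, 1))
except ValueError as err:
    print("mismatch:", err)
```

Under this reading, the number of per-channel means and standard deviations passed to the transform would need to equal the number of channels in the matrix.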

OSError: [Errno 12] Cannot allocate memory

ERROR 2019-11-06 16:21:08,570 main Aborting!
ERROR 2019-11-06 16:21:08,598 main preprocess.py failure on arguments: Namespace(dbsnp_to_filter=None, del_merge_min_af=0, del_min_af=0.05, ensemble_tsv=None, filter_duplicate=False, first_do_without_qual=False, good_ao=10, ins_merge_min_af=0, ins_min_af=0.05, long_read=False, matrix_base_pad=7, matrix_width=32, max_dp=100000, merge_r=0.5, min_ao=1, min_dp=5, min_ev_frac_per_col=0.06, min_mapq=10, mode='call', normal_bam='/home//Desktop/BAM/NA12878_HiSeq1_normal.bam', num_threads=1, reference='/media//TOSHIBA/hg38/hg38-new/hg38.fa', region_bed='/home//Desktop/BAM/NA12878_HiSeq1_benchmark.bed', restart=False, scan_alignments_binary='/home//Downloads/neusomatic-master/neusomatic/bin/scan_alignments', scan_maf=0.01, scan_window_size=2000, snp_min_af=0.05, snp_min_ao=3, snp_min_bq=10, truth_vcf=None, tsv_batch_size=50000, tumor_bam='/home//Desktop/BAM/NA12878_25_snv_indel_sorted.bam', work='/media//JingMeng/BAM/25_work_call')
Traceback (most recent call last):
File "/home/**/.local/lib/python3.6/site-packages/pybedtools/helpers.py", line 407, in call_bedtools
bufsize=BUFSIZE)
File "/usr/lib/python3.6/subprocess.py", line 729, in __init__
restore_signals, start_new_session)
File "/usr/lib/python3.6/subprocess.py", line 1295, in _execute_child
restore_signals, start_new_session, preexec_fn)
OSError: [Errno 12] Cannot allocate memory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home//Downloads/neusomatic-master/neusomatic/python/preprocess.py", line 441, in <module>
raise e
File "/home//Downloads/neusomatic-master/neusomatic/python/preprocess.py", line 435, in <module>
args.scan_alignments_binary)
File "/home//Downloads/neusomatic-master/neusomatic/python/preprocess.py", line 335, in preprocess
ensemble_beds[i] if ensemble_tsv else None, tsv_batch_size)
File "/home//Downloads/neusomatic-master/neusomatic/python/preprocess.py", line 129, in generate_dataset_region
tsv_batch_size)
File "/home//Downloads/neusomatic-master/neusomatic/python/generate_dataset.py", line 1433, in generate_dataset
tumor_pred_vcf_file).intersect(region_bed_file, u=True))
File "/home//.local/lib/python3.6/site-packages/pybedtools/bedtool.py", line 840, in decorated
result = method(self, *args, **kwargs)
File "/home//.local/lib/python3.6/site-packages/pybedtools/bedtool.py", line 345, in wrapped
decode_output=decode_output,
File "/home/**/.local/lib/python3.6/site-packages/pybedtools/helpers.py", line 458, in call_bedtools
print('\n\t' + '\n\t'.join(problems[err.errno]))
KeyError: 12

While NeuSomatic is running, it uses only about 1.0% of memory (my machine has 8 GB of RAM). I really do not understand why it gave me this error: OSError: [Errno 12] Cannot allocate memory. Thank you!

It takes several days to preprocess BAM files for testing

Hi Sayed, I started preprocessing the whole-genome BAM files (tumor and matched normal) for testing on my local computer (--num_threads 1) four days ago, and it still has not finished. Why does it take so long to preprocess whole-genome BAM files?
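
Independent of the cause here, one generic way to reduce wall-clock time for a run like this is to split the regions passed via `--region_bed` into several balanced chunks and preprocess them as separate jobs, or to raise `--num_threads` on a multi-core machine. The sketch below only illustrates the splitting step. The helper `split_regions` and the toy regions are hypothetical, and the sketch does not assume anything about how NeuSomatic itself partitions work.

```python
import heapq


def split_regions(regions, n_chunks):
    """Greedily split (chrom, start, end) regions into n_chunks groups.

    Regions are assigned longest-first to whichever chunk currently covers
    the fewest bases, which keeps the chunks roughly balanced.
    """
    chunks = [[] for _ in range(n_chunks)]
    # Min-heap of (bases assigned so far, chunk index).
    heap = [(0, index) for index in range(n_chunks)]
    heapq.heapify(heap)
    by_length = sorted(regions, key=lambda r: r[2] - r[1], reverse=True)
    for chrom, start, end in by_length:
        total, index = heapq.heappop(heap)
        chunks[index].append((chrom, start, end))
        heapq.heappush(heap, (total + (end - start), index))
    return chunks


# Toy regions for illustration only (not real chromosome lengths).
regions = [("chr1", 0, 100), ("chr2", 0, 60), ("chr3", 0, 40)]
for index, chunk in enumerate(split_regions(regions, 2)):
    covered = sum(end - start for _, start, end in chunk)
    print(index, covered, chunk)
```

Each chunk could then be written to its own BED file and given to a separate preprocessing job.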

Run preprocess.py on Colab

Hello, I'm new to the field of bioinformatics, and my thesis will be about NeuSomatic and the work it can do.
The first step I need to accomplish is to preprocess the datasets that are available here.
Could you guide me on how to run it on Google Colab? I'd truly appreciate it.

Ensemble mode issue

Hi. I tried to run NeuSomatic in ensemble mode.
Since there was no 'SomaticSeq.Wrapper.sh' at 'https://github.com/bioinform/somaticseq/blob/master/SomaticSeq.Wrapper.sh',
I ran 'somaticseq_parallel.py' using the recommended command from 'https://github.com/bioinform/somaticseq/' and got 'Ensemble.sSNV.tsv'.

However, when I ran 'preprocess.py' of NeuSomatic in ensemble mode using 'Ensemble.sSNV.tsv', I got this exception:

extract_ensemble The following features are missing from ensemble file: ['nBAM_Z_Ranksums_EndPos', 'tBAM_Z_Ranksums_MQ', 'nBAM_Z_Ranksums_MQ', 'nBAM_Z_Ranksums_BQ', 'tBAM_Z_Ranksums_EndPos', 'tBAM_Z_Ranksums_BQ']

File "preprocess.py", line 435, in <module>
args.scan_alignments_binary)
File "preprocess.py", line 241, in preprocess
ensemble_bed = extract_ensemble(work, ensemble_tsv)
File "neusomatic/python/generate_dataset.py", line 1296, in extract_ensemble
raise Exception

It seems the issue is related to 'Ensemble.sSNV.tsv': the file does not contain the features listed above, such as 'nBAM_Z_Ranksums_EndPos'.

Did I do something wrong in the process? How can I get an ensemble SNV file with all the required features, so that I can run NeuSomatic in ensemble mode without this issue?
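
To check an ensemble file before running preprocess.py, a small script can compare its header against the feature names reported in the exception. This is only a sketch: the `REQUIRED` list is copied from the error message above and is not the complete set of features NeuSomatic expects, and the helper `missing_features` and the toy header are hypothetical.

```python
import csv
import io

# Feature names taken from the exception message above. This is only the
# subset reported as missing, not the full list of expected features.
REQUIRED = [
    "nBAM_Z_Ranksums_EndPos",
    "tBAM_Z_Ranksums_MQ",
    "nBAM_Z_Ranksums_MQ",
    "nBAM_Z_Ranksums_BQ",
    "tBAM_Z_Ranksums_EndPos",
    "tBAM_Z_Ranksums_BQ",
]


def missing_features(tsv_handle, required):
    """Return the required column names absent from the TSV header row."""
    header = next(csv.reader(tsv_handle, delimiter="\t"))
    present = set(header)
    return [name for name in required if name not in present]


# Toy TSV with a made-up header; replace with open("Ensemble.sSNV.tsv").
example = io.StringIO(
    "CHROM\tPOS\tREF\tALT\ttBAM_Z_Ranksums_MQ\n"
    "1\t100\tA\tG\t0.5\n"
)
print(missing_features(example, REQUIRED))
```

An empty list would mean that all six columns named in the exception are present in the header.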

Thanks,
Ahn

which validate script you use?

Hi,
I wonder which validation script you use.
I evaluated different validation methods, such as hap.py/som.py and a direct comparison of [chrom-pos-alt-ref], and the two approaches gave different results.
So I am curious which validation script you used to compare the detection results against the truth file and compute recall/precision.
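
For reference, the direct comparison described above can be sketched as an exact match on (chrom, pos, ref, alt) keys. This is not the validation script used by the NeuSomatic authors, which the post does not identify. The helper names and toy records are hypothetical. Tools such as hap.py and som.py may compare or normalize variant representations differently, which is one possible reason the numbers disagree.

```python
def variant_keys(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys from VCF data lines."""
    keys = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alt = line.split()[:5]
        keys.add((chrom, int(pos), ref, alt))
    return keys


def precision_recall(truth, calls):
    """Exact-match precision and recall between two sets of variant keys."""
    true_positives = len(truth & calls)
    precision = true_positives / len(calls) if calls else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Toy records for illustration only.
truth_lines = [
    "#CHROM POS ID REF ALT QUAL FILTER INFO",
    "chr1 100 . G A 100 PASS SOMATIC",
    "chr1 200 . AT A 100 PASS SOMATIC",
]
call_lines = [
    "chr1 100 . G A 50 PASS .",
    "chr1 300 . C T 50 PASS .",
]
precision, recall = precision_recall(
    variant_keys(truth_lines), variant_keys(call_lines))
print(f"precision={precision:.2f} recall={recall:.2f}")
```

With exact matching, an indel written at a different position or with different REF/ALT padding counts as both a false positive and a false negative, so results can be lower than those from a representation-aware tool.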

Thanks!

Error: unable to open file or unable to determine types for file synthetic.vcf

Hi,

I am trying to run NeuSomatic in ensemble mode, but I got stuck after the SomaticSeq.Wrapper.sh step, at the "preprocess.py --mode train" step.
I get the following error:
Error: unable to open file or unable to determine types for file synthetic.vcf

The synthetic.vcf is generated by processing the Ensemble.s*.tsv files produced by SomaticSeq.Wrapper.sh, following the details in your repository (as I understand them, of course):
cat <(cat Ensemble.s*.tsv |grep CHROM|head -1) <(cat Ensemble.s*.tsv |grep -v CHROM) | sed "s/nan/0/g" > ensemble_ann1.tsv

python preprocess.py --mode train --reference $GENFILE.fa --region_bed $INTERVALFILE --tumor_bam $syntheticTumor.bam --normal_bam $syntheticNormal.bam --work WORK --truth_vcf synthetic.vcf --ensemble_tsv ensemble_ann.tsv --min_mapq 10 --num_threads 20 --scan_alignments_binary $HOME/bin/neusomatic/neusomatic/bin/scan_alignments

A few lines from the synthetic.vcf are below
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SPIKEIN
chr1 1503492 . G A 100 PASS SOMATIC;VAF=0.2;DPR=9.66666666667 GT 0/1
chr1 3752754 . G A 100 PASS SOMATIC;VAF=0.307692307692;DPR=90.0 GT 0/1
chr1 3763621 . C A 100 PASS SOMATIC;VAF=0.222222222222;DPR=17.6666666667 GT 0/1
chr1 6152482 . T A 100 PASS SOMATIC;VAF=0.127868852459;DPR=304.666666667 GT 0/1
chr1 6199629 . G C 100 PASS SOMATIC;VAF=0.21978021978;DPR=181.333333333 GT 0/1

Can you please help me understand my mistake?
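
One thing that can be checked quickly is whether the VCF lines are really tab-delimited, since tools that auto-detect the file type generally rely on tab-separated columns. Whether that is the cause here is an assumption: the spaces in the excerpt above may simply be an artifact of pasting. The helper `check_vcf_delimiters` below is hypothetical and only counts tab-separated fields per line.

```python
def check_vcf_delimiters(lines, min_columns=8):
    """Return (line_number, tab_field_count) for lines with too few fields.

    Meta-information lines starting with '##' are skipped. The '#CHROM'
    header line and all data lines are expected to have at least
    min_columns tab-separated fields.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue
        n_fields = len(line.split("\t"))
        if n_fields < min_columns:
            problems.append((number, n_fields))
    return problems


# One space-separated record (as it appears when pasted) and the same
# record with tabs.
spaced = "chr1 1503492 . G A 100 PASS SOMATIC;VAF=0.2 GT 0/1"
tabbed = "\t".join(spaced.split(" "))
print(check_vcf_delimiters([spaced]))  # reports line 1 with a single field
print(check_vcf_delimiters([tabbed]))  # reports nothing
```

To check the real file, the same function can be called on `open("synthetic.vcf")`.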

Thanks
Gianfilippo

Getting compilation error with latest codebase on Mac OS with gcc/g++ 9

In file included from /Users/siakhnin/tools/ont/neusomatic/neusomatic/include/bedio.hpp:8,
                 from /Users/siakhnin/tools/ont/neusomatic/neusomatic/cpp/scan_alignments.cpp:31:
/Users/siakhnin/tools/ont/neusomatic/neusomatic/include/Interval.hpp: At global scope:
/Users/siakhnin/tools/ont/neusomatic/neusomatic/include/Interval.hpp:41:8: error: 'coverage' function with trailing return type has 'decltype(auto)' as its type rather than plain 'auto'
   41 | inline decltype (auto) coverage(const std::vector<Interval>& invs) -> std::vector<typename Interval::Depth> {
      |        ^~~~~~~~
/Users/siakhnin/tools/ont/neusomatic/neusomatic/include/Interval.hpp: In member function 'std::vector<T> neusomatic::bio::IRanges<Interval, half_open>::reduce() const':
/Users/siakhnin/tools/ont/neusomatic/neusomatic/include/Interval.hpp:298:49: error: 'coverage' was not declared in this scope
  298 |     std::vector<typename Interval::Depth> cov = coverage<Interval, true> (invs_);
      |                                                 ^~~~~~~~
/Users/siakhnin/tools/ont/neusomatic/neusomatic/include/Interval.hpp:298:66: error: expected primary-expression before ',' token
  298 |     std::vector<typename Interval::Depth> cov = coverage<Interval, true> (invs_);
      |                                                                  ^
/Users/siakhnin/tools/ont/neusomatic/neusomatic/include/Interval.hpp: In member function 'std::vector<T> neusomatic::bio::IRanges<Interval, half_open>::disjoint() const':
/Users/siakhnin/tools/ont/neusomatic/neusomatic/include/Interval.hpp:341:49: error: 'coverage' was not declared in this scope
  341 |     std::vector<typename Interval::Depth> cov = coverage<Interval, true> (invs_);
      |                                                 ^~~~~~~~
/Users/siakhnin/tools/ont/neusomatic/neusomatic/include/Interval.hpp:341:66: error: expected primary-expression before ',' token
  341 |     std::vector<typename Interval::Depth> cov = coverage<Interval, true> (invs_);
      |                        

...                                          ^
make[2]: *** [cpp/CMakeFiles/scan_alignments.dir/scan_alignments.cpp.o] Error 1
make[1]: *** [cpp/CMakeFiles/scan_alignments.dir/all] Error 2
make: *** [all] Error 2

RuntimeError: cuDNN error: CUDNN_STATUS_MAPPING_ERROR

Hi,
When I run NeuSomatic, I get this error:

Traceback (most recent call last):
  File "/home/user_home/haoz/vctools/neusomatic/neusomatic/python/call.py", line 610, in <module>
    raise e
  File "/home/user_home/haoz/vctools/neusomatic/neusomatic/python/call.py", line 605, in <module>
    use_cuda)
  File "/home/user_home/haoz/vctools/neusomatic/neusomatic/python/call.py", line 530, in call_neusomatic
    net, vartype_classes, call_loader, out_dir, model_tag, use_cuda)
  File "/home/user_home/haoz/vctools/neusomatic/neusomatic/python/call.py", line 77, in call_variants
    outputs, _ = net(matrices)
  File "/home/user_home/haoz/miniconda3/envs/neusomatic-gpu/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user_home/haoz/miniconda3/envs/neusomatic-gpu/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/user_home/haoz/miniconda3/envs/neusomatic-gpu/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/user_home/haoz/miniconda3/envs/neusomatic-gpu/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
    raise output
  File "/home/user_home/haoz/miniconda3/envs/neusomatic-gpu/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
    output = module(*input, **kwargs)
  File "/home/user_home/haoz/miniconda3/envs/neusomatic-gpu/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user_home/haoz/vctools/neusomatic/neusomatic/python/network.py", line 67, in forward
    x = self.pool1(F.relu(self.bn1(self.conv1(x))))
  File "/home/user_home/haoz/miniconda3/envs/neusomatic-gpu/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user_home/haoz/miniconda3/envs/neusomatic-gpu/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 338, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_MAPPING_ERROR

I installed NeuSomatic following the instructions in the README.
PyTorch version: pytorch=1.1.0 torchvision=0.3.0 cudatoolkit=9.0

Do you have any suggestions?
Thanks!

Seqlib building problem

neusomatic_9_19.log

training using bamsurgeon on multiple samples

Hi,

I generated a training set comprising multiple samples using the method suggested in the README (using somaticseq's docker pipeline for bamsurgeon): https://github.com/bioinform/neusomatic#creating-training-data

After training NeuSomatic on this dataset, I do not get the expected performance on an external test sample that is independent of the training set, and I am not sure where the issue is. I have previously evaluated NeuSomatic's output without issues.

One issue I suspect is with the VCF generated by bamsurgeon. While I can view the alignments at the mutated sites in the generated in silico normal/tumor samples, bamsurgeon reports a DPR (depth) of 0 for most mutations in the VCF. Below is a sample of the synthetic VCF generated by bamsurgeon. Could this be an issue for NeuSomatic, i.e. does it rely on the DPR reported in the VCF when generating training data? The in silico samples look fine otherwise, and the NeuSomatic training ran smoothly for 1000 epochs.

Thank you.

##fileformat=VCFv4.1
##phasing=none
##INDIVIDUAL=TRUTH
##SAMPLE=<ID=TRUTH,Description="bamsurgeon spike-in",Individual=TRUTH>
##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS for imprecise variants">
##INFO=<ID=IMPRECISE,Number=0,Type=Flag,Description="Imprecise structural variation">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">
##INFO=<ID=SOMATIC,Number=0,Type=Flag,Description="Somatic mutation in primary">
##INFO=<ID=VAF,Number=1,Type=Float,Description="Variant Allele Frequency">
##INFO=<ID=DPR,Number=1,Type=Float,Description="Avg Depth in Region (+/- 1bp)">
##INFO=<ID=MATEID,Number=1,Type=String,Description="Breakend mate">
##ALT=<ID=INV,Description="Inversion">
##ALT=<ID=DUP,Description="Duplication">
##ALT=<ID=DEL,Description="Deletion">
##ALT=<ID=INS,Description="Insertion">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SPIKEIN
1 139455 . A G 100 PASS SOMATIC;VAF=0.288888888889;DPR=0.0 GT 0/1
1 148410 . C A 100 PASS SOMATIC;VAF=0.123595505618;DPR=0.0 GT 0/1
1 148416 . T G 100 PASS SOMATIC;VAF=0.239583333333;DPR=0.0 GT 0/1
1 822932 . G C 100 PASS SOMATIC;VAF=0.0903225806452;DPR=0.0 GT 0/1
21 10569712 . AT A 100 PASS SOMATIC GT 0/1
21 10569714 . TGA T 100 PASS SOMATIC GT 0/1
21 10703018 . C CCCT 100 PASS SOMATIC GT 0/1
21 10706729 . G GA 100 PASS SOMATIC GT 0/1
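
To put a number on "most mutations", the INFO column of the truth VCF can be tallied by DPR value. The sketch below makes no claim about whether NeuSomatic reads DPR at all. The helpers `parse_info` and `dpr_summary` are hypothetical, and the sample lines are taken from the records quoted in this listing.

```python
def parse_info(info):
    """Parse a VCF INFO string into a dict; flags map to True."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def dpr_summary(vcf_lines):
    """Count records whose DPR is zero, non-zero, or absent."""
    counts = {"zero": 0, "nonzero": 0, "missing": 0}
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        info = parse_info(line.split()[7])  # INFO is the eighth column
        if "DPR" not in info:
            counts["missing"] += 1
        elif float(info["DPR"]) == 0.0:
            counts["zero"] += 1
        else:
            counts["nonzero"] += 1
    return counts


sample = [
    "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SPIKEIN",
    "1 139455 . A G 100 PASS SOMATIC;VAF=0.288888888889;DPR=0.0 GT 0/1",
    "21 10569712 . AT A 100 PASS SOMATIC GT 0/1",
    "chr1 3752754 . G A 100 PASS SOMATIC;VAF=0.307692307692;DPR=90.0 GT 0/1",
]
print(dpr_summary(sample))
```

In the excerpt above, the SNV records carry DPR=0.0 while the indel records have no DPR field at all, so the two cases are counted separately.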

testing pretrained model

Hello, I tested my own data and the dream_challenge data with all the pre-trained models that you provided. The results show that the recall is very low, less than 50%, and I do not know the reason. Can you provide me with the test data so I can verify the models? Thank you.

preprocess unable to find candidates for training

I'm trying to train a network using a set of exome data from the Dream Challenge. I created a BED file as suggested in issue 16 and followed your recommendation for distributed data processing on a cluster. Unfortunately, no candidates are found during preprocessing. Are there specific requirements for the truth VCF file? I tried adding VarType information with
java -jar SnpSift.jar varType truth_initial.vcf > truth_final.vcf
but it didn't help.

I attached the output from the main job and the output of one of the 10 sub-region-jobs.

job.txt

sub-job.txt

Edit: The problem may be corrupted bam files. I'll check this and close the issue if that caused the problem.