bioinform / somaticseq Goto Github PK
View Code? Open in Web Editor NEWAn ensemble approach to accurately detect somatic mutations using SomaticSeq
Home Page: http://bioinform.github.io/somaticseq/
License: BSD 2-Clause "Simplified" License
An ensemble approach to accurately detect somatic mutations using SomaticSeq
Home Page: http://bioinform.github.io/somaticseq/
License: BSD 2-Clause "Simplified" License
I am trying to run $SOMATICSEQ_HOME/utilities/singularities/submit_callers_multiThreads.sh
analogously to the docker method described at https://github.com/bioinform/somaticseq/tree/master/utilities/dockered_pipelines, but it doesn't get past SomaticSniper, which fails in the end with these messages (I redacted the full path name for the issue here):
bash: /mnt//path-to/RESULTS/1/SomaticSniper.vcf: No such file or directory
Error: Unable to open file /mnt//path-to/RESULTS/1/1.bed. Exiting.
My RESULTS folder doesn't have a 1
directory; the files sought by the program are at the top-level:
$ ls RESULTS/
10.bed 12.bed 14.bed 16.bed 2.bed 4.bed 6.bed 8.bed genome.bed SomaticSniper.vcf
11.bed 13.bed 15.bed 1.bed 3.bed 5.bed 7.bed 9.bed logs
Is the Training dataset for somaticseq available somewhere ?
I am running the different callers in order to have the vcf files for each. I was wondering if there is any differences in the arguments that you specify in the methods of your article.
Thanks!
Hi,
I believe there is a missing "split" option in the call to submit_JointSNVMix2.sh within the submit_callers_multiThreads.sh script (probably also in the submit_callers_singleThread.sh).
Best
Hi,
I am modifying your pipeline to get it to work without Docker or Singularity, because of the other issue I reported.
Right now, things seems to be (possibly) ok, until addsnv.py, which generates a very large number of the ERRORS and WARNINGS. Below is an example:
Start at 2019/11/06 16:15:23
ERROR 2019-11-06 16:15:28.826929 encountered error in mutation spikein: ['chr1_76895_76895_0.402845353397_None']
WARN 2019-11-06 16:15:28.952514 haplo_chr1_876440_876441 could not pileup for region: chr1:876404
ERROR 2019-11-06 16:15:33.110163 encountered error in mutation spikein: ['chr1_959852_959852_0.105696809142_None']
ERROR 2019-11-06 16:15:33.162897 encountered error in mutation spikein: ['chr1_1031390_1031390_0.251373895866_None']
WARN 2019-11-06 16:15:33.371460 haplo_chr1_1554292_1554293 could not pileup for region: chr1:1554243
WARN 2019-11-06 16:15:34.074624 haplo_chr1_1554292_1554293 could not pileup for region: chr1:1554259
I do get an "unsorted.snvs.added.bam" at the end, but in the next step, makevcf.py, I seem to be missing the file "addsnv_logs_unsorted.snvs.added.bam".
I was unable to identify the script that should create it.
What do you think ?
Thanks
Gianfilippo
I tried to run somaticseq to train the model based on truth VCF files. However, after running SSeq_merged.vcf2tsv.py step, I found the last column (TrueVariant_or_False) of the Ensemble.sSNV.tsv are incorrect. Some true variants are tagged as "0". I traced back and found the truth VCF file I used ordered chromosome in the the default ordering (1,10, 11, ..., 2, MT,X) rather than natural ordering (1,2, ..., 10, 11, ..., MT,X). Is this the reason that produced last column in error? If yes, what's the ordering requirement for a truth VCF?
Thanks,
Can I train a model using all the b17 files and then predict somatic variants using hg19 files? Same question about the opposite way? Thanks,
Hi,
is there a way to run the "in silico somatic mutation spike in pipeline" without docker ?
Do you have a version for singularity ?
Thanks
Hi,
is the IndelRealigner really needed ?
I am asking since GATK4 does no longer use it.
Does any of the other callers used by somaticseq need it ?
Thanks
Gianfilippo
Hi, a prediction call to somaticseq_parallel.py (20 threads) failed on a sample with the error below. Can you please help ?
Traceback (most recent call last):
File "/ycga-gpfs/apps/hpc/software/Python/3.7.0-foss-2018b/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/ycga-gpfs/apps/hpc/software/Python/3.7.0-foss-2018b/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "/gpfs/ycga/$HOME/.local/lib/python3.7/site-packages/SomaticSeq-3.3.0-py3.7.egg/EGG-INFO/scripts/somaticseq_parallel.py", line 31, in runPaired_by_region
run_somaticseq.runPaired(outdir_i, ref, tbam, nbam, tumor_name, normal_name, truth_snv, truth_indel, classifier_snv, classifier_indel, pass_threshold, lowqual_threshold, hom_threshold, het_threshold, dbsnp, cosmic, inclusion, exclusion, mutect, indelocator, mutect2, varscan_snv, varscan_indel, jsm, sniper, vardict, muse, lofreq_snv, lofreq_indel, scalpel, strelka_snv, strelka_indel, tnscope, platypus, min_mq, min_bq, min_caller, somaticseq_train, ensembleOutPrefix, consensusOutPrefix, classifiedOutPrefix, algo, keep_intermediates)
File "$HOME/.local/lib/python3.7/site-packages/SomaticSeq-3.3.0-py3.7.egg/somaticseq/run_somaticseq.py", line 88, in runPaired
tsv2vcf.tsv2vcf(classifiedSnvTsv, classifiedSnvVcf, snvCallers, pass_score=pass_threshold, lowqual_score=lowqual_threshold, hom_threshold=hom_threshold, het_threshold=het_threshold, single_mode=False, paired_mode=True, normal_sample_name=normal_name, tumor_sample_name=tumor_name, print_reject=True, phred_scaled=True)
File "$HOME/.local/lib/python3.7/site-packages/SomaticSeq-3.3.0-py3.7.egg/somaticseq/SSeq_tsv2vcf.py", line 110, in tsv2vcf
with open(tsv_fn) as tsv, open(vcf_fn, 'w') as vcf:
FileNotFoundError: [Errno 2] No such file or directory: '$HOME/test/Sample/18/SSeq.Classified.sSNV.tsv'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "$HOME/.local/bin/somaticseq_parallel.py", line 4, in
import('pkg_resources').run_script('SomaticSeq==3.3.0', 'somaticseq_parallel.py')
File "/ycga-gpfs/apps/hpc/software/Python/3.7.0-foss-2018b/lib/python3.7/site-packages/pkg_resources/init.py", line 658, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/ycga-gpfs/apps/hpc/software/Python/3.7.0-foss-2018b/lib/python3.7/site-packages/pkg_resources/init.py", line 1438, in run_script
exec(code, namespace, namespace)
File "/gpfs/ycga/$HOME/.local/lib/python3.7/site-packages/SomaticSeq-3.3.0-py3.7.egg/EGG-INFO/scripts/somaticseq_parallel.py", line 113, in
subdirs = pool.map(runPaired_by_region_i, bed_splitted)
File "/ycga-gpfs/apps/hpc/software/Python/3.7.0-foss-2018b/lib/python3.7/multiprocessing/pool.py", line 268, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/ycga-gpfs/apps/hpc/software/Python/3.7.0-foss-2018b/lib/python3.7/multiprocessing/pool.py", line 657, in get
raise self._value
FileNotFoundError: [Errno 2] No such file or directory: '$HOME/test/Sample/18/SSeq.Classified.sSNV.tsv'
slurmstepd: error: Detected 3 oom-kill event(s) in step 9794252.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
Hi,
is there a way to include vcfs generated by other callers, like freebayes ?
Thanks
Gianfilippo
Hi,
I am at the point where I can generate the training data (although I am not sure on how to set Nsnvs, Nindels, Nsvs with WES).
I also got most of the callers to work (with the reported exception) with 10 threads
The somaticseq_parallel.py also successfully completed the run and generated the Ensemble.sINDEL.tsv and Ensemble.sSNV.tsv within each subDIR corresponding to the 10 threads.
As far as I have understood, now I am supposed to combine the individual Ensemble.sINDEL.tsv and Ensemble.sSNV.tsv and use the resulting overall Ensemble.sINDEL.tsv and Ensemble.sSNV.tsv files for training.
Can you please let me know how I can combine the individual Ensemble.sINDEL.tsv and Ensemble.sSNV.tsv ?
Thanks
On http://bioinform.github.io/somaticseq/data.html
Hi again,
I am having another issue with a libcurl.so.4 import error.
Start at 2019/10/27 15:20:45
Traceback (most recent call last):
File "/usr/local/bamsurgeon/scripts/sortedBamSplit.py", line 4, in
import pysam, random, argparse, sys
File "/softwarepath/Python/2.7.13-foss-2016b/lib/python2.7/site-packages/pysam-0.10.0-py2.7-linux-x86_64.egg/pysam/init.py", line 5, in
from pysam.libchtslib import *
ImportError: libcurl.so.4: cannot open shared object file: No such file or directory
This happens at the following command in the (bamsurgeon) file your script generates
singularity exec --bind /:/mnt docker://lethalfang/bamsurgeon:1.1-3
/usr/local/bamsurgeon/scripts/sortedBamSplit.py
--bam /mnt//home/Test/Sample_G1700T_012/1/TNMerged.bam
--proportion 0.5
--downsample 1
--pick1 /mnt//home/Test/Sample_G1700T_012/1/Designated.Normal.bam
--pick2 /mnt//home/Test/Sample_G1700T_012/1/Designated.Tumor.bam
--seed 0
It seems it is trying to run pycurl from a Python 2.7, while everything else runs from python 3.x
Do you think this is some env variable that I have to set or some sort of inconsistency in the docker file ?
Unfortunately I am not very familiar with either docker or singularity
Thanks
Hi!
First of all, thank you for this awesome tool!
We are using Somaticseq for target panel sequencing data, in consensus mode, using only the tumor sample. For the calculation of sensitivity, we observe the results using the horizon 200 control (somatic variants of FFPE tissue), we have good results, however, the Strelka2 (one of the software used with SomaticSeq) does not identify the SNVs, it is only identifying the Indels.
According to Strelka's recommendations, we added the parameter --targeted, however the results were the same.
We are using the image of the docker for Strelka2.
Is there a difference between the docker and the installed versions of Strelka?
If you can give me any recommendation, I would be very grateful.
Thank you
The horizon 200 (FFPE control), strelka have only 0 (It is not identifying control variants)
Hi, I tried to run SomaticSeq.Wrapper.sh to combine four VCF files.
I got following error:
Traceback (most recent call last):
File "~/modify_VJSD.py", line 269, in
raise Exception('No sample information here.')
Exception: No sample information here.
I checked that one VCF file called from lofreq has no sample column. Is this the reason? But some other samples are running well even with the lofreq VCFs without sample column.
Hello,
I am using somaticseq for carrying out ensembl calling on my tumour-normal paired exome data. I am also using Strelka as one of my somatic callers and don't see it as one of the options. Is there any development going on to incorporate strelka into the mix?
If I were to try and modify the code myself, are there any points to keep in mind so as to not break any functionality?
thanks
Arun
Hi,
what is the difference between ada_model_builder_ntChange.R and ada_model_builder.R.
The name of the .RData model file of my first sample has the "ada: string in it, while the new samples show "ntChange". Can you please explain ?
Thanks
Trying to convert the SomaticSeq output to a VCF format and getting the following
python3.4 somaticseq-2.2.1/SSeq_tsv2vcf.py -tsv Sample1_trained.tsv -vcf Sample-trained.vcf -pass 0.7 -low 0.1 -tools VarScan2 SomaticSniper MuSE -all -phred
File "somaticseq-2.2.1/SSeq_tsv2vcf.py", line 234, in <module>
n_MQ0 = tsv_item[nBAM_MQ0] if tsv_item[nBAM_MQ0] != 'nan' else '.'
NameError: name 'nBAM_MQ0' is not defined
Hi!
Does somaticseq support GATK4 Mutect2 as well, as opposed to GATK3 MuTect2 which is mentioned in the documentation?
Thanks!
try to use BamsimMutiThread.sh to create own synthetic tumor/normal pair, but out system use slurm not sge. I substituted "--action qsub" with "--action qsub". Is this correct?
The process seems to run, but return a lot of errors as "ERROR 2019-12-11 22:12:34.719984 encountered error in mutation spikein: ['chr9_104
14396_10414396_0.479213988448_None']
Traceback (most recent call last):
File "/usr/local/bamsurgeon/bin/addsnv.py", line 156, in makemut
mutfail, hasSNP, maxfrac, outreads, mutreads, mutmates = mutation.mutate(args, log, b
amfile, bammate, chrom, min(mutpos_list), max(mutpos_list)+1, mutpos_list, avoid=avoid, m
utid_list=mutid_list, is_snv=True, mutbase_list=mutbase_list, reffile=reffile)
File "build/bdist.linux-x86_64/egg/bamsurgeon/mutation.py", line 239, in mutate
assert maxfrac is not None, "Error: could not pile up over region: %s" % region
AssertionError: Error: could not pile up over region: haplo_chr9_10414396_10414397
"
and finally hang at "haplo_chr9_110468931_110468931 aligning addsnv.t
mp/haplo_chr9_110468931_110468931.tmpbam.1067bfd6-fe56-412d-8b73-035e6bc6bcc1.fastq with
bwa mem"
Hi,
I got the following error when running SomaticSeq train with the following command:
singularity exec /groups/zuber/zubarchive/USERS/tobias/.singularity/somaticseq:3.0.0.img \
/opt/somaticseq/somaticseq/run_somaticseq.py \
--somaticseq-train \
--output-directory /groups/obenauf/NGS_data/GIAB/tumor-normal/somaticSeq/train/model \
--genome-reference /groups/obenauf/Software/hg38/Homo_sapiens_assembly38.fasta \
--inclusion-region /groups/obenauf/NGS_data/GIAB/tumor-normal/benchmark/truth/na12878-na24385-somatic-hg38-truth-regions.bed \
\
\
--dbsnp /groups/zuber/zubarchive/USERS/tobias/hg38/GATK/Homo_sapiens_assembly38.dbsnp138.vcf \
\
\
--truth-snv /groups/obenauf/NGS_data/GIAB/tumor-normal/somaticSeq/train/truth/na12878-na24385-somatic-hg38-truth.snp.recode.vcf.gz \
--truth-indel /groups/obenauf/NGS_data/GIAB/tumor-normal/somaticSeq/train/truth/na12878-na24385-somatic-hg38-truth.indel.recode.vcf.gz \
\
paired \
--tumor-bam-file /groups/obenauf/NGS_data/GIAB/tumor-normal/aln/NA12878_NA24385_mixture.aligned.sorted.dedup.bam \
--normal-bam-file /groups/obenauf/NGS_data/GIAB/tumor-normal/aln/NA24385_normal.aligned.sorted.dedup.bam \
\
\
--mutect2-vcf /groups/obenauf/NGS_data/GIAB/tumor-normal/somaticSeq/train/trainingInput/MuTect2.concat.vcf \
\
\
\
--somaticsniper-vcf /groups/obenauf/NGS_data/GIAB/tumor-normal/somaticSeq/train/trainingInput/SomaticSniper.concat.vcf \
--vardict-vcf /groups/obenauf/NGS_data/GIAB/tumor-normal/somaticSeq/train/trainingInput/VarDict.concat.vcf \
\
--lofreq-snv /groups/obenauf/NGS_data/GIAB/tumor-normal/somaticSeq/train/trainingInput/LoFreq.snvs.concat.vcf.gz \
--lofreq-indel /groups/obenauf/NGS_data/GIAB/tumor-normal/somaticSeq/train/trainingInput/LoFreq.indels.concat.vcf.gz \
--scalpel-vcf /groups/obenauf/NGS_data/GIAB/tumor-normal/somaticSeq/train/trainingInput/Scalpel.concat.vcf \
--strelka-snv /groups/obenauf/NGS_data/GIAB/tumor-normal/somaticSeq/train/trainingInput/Strelka.snvs.concat.vcf.gz \
--strelka-indel /groups/obenauf/NGS_data/GIAB/tumor-normal/somaticSeq/train/trainingInput/Strelka.indels.concat.vcf.gz > sseq_train.log 2> sseq_train.err
This is the error.
2018-11-04 09:49:27,786 - SomaticSeq - INFO - SomaticSeq Input Arguments: output_directory=/groups/obenauf/NGS_data/GIAB/tumor-normal/somaticSeq/train/model, genome_reference=/groups/obenauf/Software/hg38/Homo_sapiens_assembly38.fasta, truth_snv=/groups/obenauf/NGS_data/GIAB/tumor-normal/somaticSeq/train/truth/na12878-na24385-somatic-hg38-truth.snp.recode.vcf.gz, truth_indel=/groups/obenauf/NGS_data/GIAB/tumor-normal/somaticSeq/train/truth/na12878-na24385-somatic-hg38-truth.indel.recode.vcf.gz, classifier_snv=None, classifier_indel=None, pass_threshold=0.5, lowqual_threshold=0.1, homozygous_threshold=0.85, heterozygous_threshold=0.01, minimum_mapping_quality=1, minimum_base_quality=5, minimum_num_callers=0.5, dbsnp_vcf=/groups/zuber/zubarchive/USERS/tobias/hg38/GATK/Homo_sapiens_assembly38.dbsnp138.vcf, cosmic_vcf=None, inclusion_region=/groups/obenauf/NGS_data/GIAB/tumor-normal/benchmark/truth/na12878-na24385-somatic-hg38-truth-regions.bed, exclusion_region=None, threads=1, keep_intermediates=False, somaticseq_train=True, tumor_bam_file=/groups/obenauf/NGS_data/GIAB/tumor-normal/aln/NA12878_NA24385_mixture.aligned.sorted.dedup.bam, normal_bam_file=/groups/obenauf/NGS_data/GIAB/tumor-normal/aln/NA24385_normal.aligned.sorted.dedup.bam, tumor_sample=TUMOR, normal_sample=NORMAL, mutect_vcf=None, indelocator_vcf=None, mutect2_vcf=/groups/obenauf/NGS_data/GIAB/tumor-normal/somaticSeq/train/trainingInput/MuTect2.concat.vcf, varscan_snv=None, varscan_indel=None, jsm_vcf=None, somaticsniper_vcf=/groups/obenauf/NGS_data/GIAB/tumor-normal/somaticSeq/train/trainingInput/SomaticSniper.concat.vcf, vardict_vcf=/groups/obenauf/NGS_data/GIAB/tumor-normal/somaticSeq/train/trainingInput/VarDict.concat.vcf, muse_vcf=None, lofreq_snv=/groups/obenauf/NGS_data/GIAB/tumor-normal/somaticSeq/train/trainingInput/LoFreq.snvs.concat.vcf.gz, lofreq_indel=/groups/obenauf/NGS_data/GIAB/tumor-normal/somaticSeq/train/trainingInput/LoFreq.indels.concat.vcf.gz, scalpel_vcf=/groups/obenauf/NGS_data/GIAB/tumor-normal/somaticSeq/train/trainingInput/Scalpel.concat.vcf, strelka_snv=/groups/obenauf/NGS_data/GIAB/tumor-normal/somaticSeq/train/trainingInput/Strelka.snvs.concat.vcf.gz, strelka_indel=/groups/obenauf/NGS_data/GIAB/tumor-normal/somaticSeq/train/trainingInput/Strelka.indels.concat.vcf.gz, tnscope_vcf=None, which=paired
2018-11-04 09:53:55,647 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
/usr/local/lib/python3.6/dist-packages/scipy/stats/stats.py:4971: RuntimeWarning: invalid value encountered in double_scalars
z = (s - expected) / np.sqrt(n1*n2*(n1+n2+1)/12.0)
Traceback (most recent call last):
File "/opt/somaticseq/somaticseq/run_somaticseq.py", line 357, in <module>
keep_intermediates = runParameters['keep_intermediates'] )
File "/opt/somaticseq/somaticseq/run_somaticseq.py", line 71, in runPaired
somatic_vcf2tsv.vcf2tsv(is_vcf=outSnv, nbam_fn=nbam, tbam_fn=tbam, truth=truth_snv, cosmic=cosmic, dbsnp=dbsnp, mutect=mutect_infile, varscan=varscan_snv, jsm=jsm, sniper=sniper, vardict=intermediateVcfs['VarDict']['snv'], muse=muse, lofreq=lofreq_snv, scalpel=None, strelka=strelka_snv, tnscope=intermediateVcfs['TNscope']['snv'], dedup=True, min_mq=min_mq, min_bq=min_bq, min_caller=min_caller, ref_fa=ref, p_scale=None, outfile=ensembleSnv)
File "/opt/somaticseq/somaticseq/../somaticseq/somatic_vcf2tsv.py", line 400, in vcf2tsv
if lofreq: got_lofreq, lofreq_variants, lofreq_line = genome.find_vcf_at_coordinate(my_coordinate, lofreq_line, lofreq, chrom_seq)
File "/opt/somaticseq/somaticseq/../genomicFileHandler/genomic_file_handlers.py", line 563, in find_vcf_at_coordinate
latest_vcf_run = catchup_multilines(my_coordinate, latest_vcf_line, vcf_file_handle, chrom_seq)
File "/opt/somaticseq/somaticseq/../genomicFileHandler/genomic_file_handlers.py", line 514, in catchup_multilines
raise Exception('{} does not seem to be properly sorted'.format(filehandle_j.name) )
Exception: /groups/obenauf/NGS_data/GIAB/tumor-normal/somaticSeq/train/trainingInput/LoFreq.snvs.concat.vcf.gz does not seem to be properly sorted
I checked and there are variants in the LoFreq.snvs.concat.vcf.gz
vcf file and it appears to be sorted otherwise indexing with tabix would not work right?
Any ideas?
Hi,
I have to report also an issue with VarScan2. The run did not complete.
It did create the normal and tumor pileup files, then it fails
The error is
^[[31mFATAL: ^[[0m Unable to handle docker:///:/mnt uri: failed to get SHA of docker:///:/mnt: unable to parse image name docker:///:/mnt: invalid reference format
Can you please look into it ?
Thanks
I am confused with regards to how one should train SomaticSeq, and I had a few questions about that and the cluster:
somaticseq_parallel.py
with truth vcf files to generate the classifier scripts. However, I also noticed that there is another section that talks about creating synthetic BAM files in order to train the classifier, a method that does not use somaticseq_parallel.py
. Which should we use, and which is more effective?SomaticSeq issues with regex versions and getting the same error. Which version of regex is specifically required by SomaticSeq ?
2016.03.26, regex-2016.4.8 and 2016.06.05
Traceback (most recent call last):
File "/programs/ngsprograms/somaticseq_latest/somaticseq/modify_VJSD.py", line 15, in <module>
import regex as reg
File "/usr/local/lib/python3.4/site-packages/regex.py", line 683, in <module>
_pattern_type = type(_compile("", 0, {}))
File "/usr/local/lib/python3.4/site-packages/regex.py", line 545, in _compile
parsed = parsed.optimise(info, reverse)
TypeError: optimise() takes 2 positional arguments but 3 were given
Would appreciate help
Hi,
using somaticseq, i analysis WES data.
i ran al l caller somaticseq support, and carried out merge.script.
but it makes some issue.
the output 'Ensemble.sSNV.tsv ' does not match with vardict output.
for example
VarDict.vcf have 'chr1 | 3712588 | . | G | A'
but in Ensemble.sSNV.ts if_VarDict is '0'
was the variant filtered by script?
Best Regards
Jeongmin
Hi,
using your docker environment and prediction script, i analysis my WES data.
but about a week ago, docker often stops working while printing following message.
docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:319: getting the final child's pid from pipe caused \"EOF\"": unknown.
i try to search solution of this error, but i could not find anything to solve it.
so do you have some ideas of this?
Best Regards
Jeongmin
Hi,
I am getting the error below when I run somaticseq in training mode. Can you please let me know what am I doing wrong ? Thanks
Loading required package: ada
Loading required package: rpart
Remove Consistent_Mates
Remove Inconsistent_Mates
[1] "Fitting model..."
[1] "Seed = 21860"
Call:
ada(model_formula, data = train_data, iter = boosting_iters,
control = rpart.control(cp = -1, maxdepth = 16, minsplit = 0,
xval = 0))
Loss: exponential Method: discrete Iteration: 500
Final Confusion Matrix for Data:
Final Prediction
True value 0 1
0 5762 0
1 0 1
Train Error: 0
Out-Of-Bag Error: 0 iteration= 10
Additional Estimates of number of iterations:
train.err1 train.kap1
2 2
WARNING: ignoring environment value of R_HOME
Loading required package: ada
Loading required package: rpart
Error: In training mode, there must be both true positives and false positives in the call set.
Execution halted
2020-03-24 21:18:29,861 - SomaticSeq - INFO - Removed: Sample_Test/1.th.input.bed
2020-03-24 21:18:29,863 - SomaticSeq - INFO - Removed sub-directory: Sample_Test/1
Hi Dr. Fang,
in the manual, you mentioned that,
If both MuTect2 and MuTect/Indelocator VCF les are provided, the script is written such
that it will use MuTect2 and ignore MuTect.
since SNV calls of mutect and mutect2 are very different (http://gatkforums.broadinstitute.org/gatk/discussion/6794/difference-between-mutect-1-1-7-and-mutect2), it makes sense to join them together.
Best,
Hongen
Hi!
First of all, thank you for this awesome tool!
I would like to know if I could use SomaticSeq for target panel sequencing data. Concretely, I am managing Ion Torrent panel data. We have ~50 patients with mostly all the detected variants validated which, I guess, is a good starting point for the training step.
However, I am not sure if the tool would perform good in target rather than WGS or WES.
Any ideas?
Thanks!
I'm trying to use your singularity pipeline in singularities/bamSimulator to inject mutation into a bam of the human genome (using BamSimulator_multiThreads.sh). The creation of the output directories and scripts goes fine, as does splitting the bam file into multiple regions, But when the script gets to actually creating and adding SNVs, I get a long set of errors that take the form:
ERROR 2019-12-20 14:50:57.380525 encountered error in mutation spikein: ['1_787825_787825_0.549717808335_None']
Traceback (most recent call last):
File "/usr/local/bamsurgeon/bin/addsnv.py", line 275, in makemut
aligners.remap_bam(args.aligner, tmpoutbamname, args.refFasta, alignopts, mutid=hapstr, paired=(not args.single), picardjar=args.picardjar, insane=args.i
nsane)
File "/usr/local/lib/python2.7/dist-packages/bamsurgeon-1.0-py2.7.egg/bamsurgeon/aligners.py", line 67, in remap_bam
remap_bwamem_bam(bamfn, threads, fastaref, picardjar, mutid=mutid, paired=paired, insane=insane)
File "/usr/local/lib/python2.7/dist-packages/bamsurgeon-1.0-py2.7.egg/bamsurgeon/aligners.py", line 211, in remap_bwamem_bam
raise ValueError("ERROR\t" + now() + "\t" + mutid + "\tbam readcount < fastq readcount, alignment sanity check failed!\n")
ValueError: ERROR 2019-12-20 14:50:57.380429 haplo_1_787825_787825 bam readcount < fastq readcount, alignment sanity check failed!
Then at the very end, There is a message that says "no successful mutations", and then samtools complains about not being able to open "unsorted.snvs.added.bam" and the script bails out. This happened for all the threads. Do you have any suggestions about where to start trying to debug this?
Thanks
Hi Litai,
I used SSeq_merged.vcf2tsv.py to merge indel calls by four callers, CGA VarScan2 VarDict LoFreq, and then used SSeq_tsv2vcf.py to change TSV file to VCF file. In the output VCF file, I found some indel calls with KEY "NUM_TOOLS=0". I thought all the INDEL calls in the output file should be called by one caller. Would you please help me?
1 28796 . T TATCCCATCCCACCCCATCCCATTCCATCCCATCCTGTATTCCATCCCCATCCCATTCCAATCCCATCCTTATTCCATCCCCATCCCATCCCATCCCATCCCATCCC 0.0000 REJECT MVDL=0,0,0,0;NUM_TOOLS=0 GT:DP4:CD4:refMQ:altMQ:refBQ:altBQ:refNM:altNM:fetSB:fetCD:zMQ:zBQ:VAF 0/0:3,0,0,0:3,0,0,0:53.3333:.:36.3333:.:1.33333:.:1.00:1.00:.:.:0 0/0:2,0,0,0:2,0,0,0:60:.:35.5:.:2:.:1.00:1.00:.:.:0
Best,
Hongen
Hi Li,
Would you kindly make the sSNV.Classifier.RData and sINDEL.Classifier.RData files you produced for the SomaticSeq publication available?
Thanks!
Hi,
I got another error at the point where the submit_SomaticSeq.sh gets executed (the singularity version).
The error is
/var/spool/slurmd/job9700596/slurm_script: line 14: docker: command not found
I checked the submit_SomaticSeq.sh code and it appears that at line 270 you have the following
echo "docker pull lethalfang/somaticseq:${VERSION}" >> $out_script
the 'docker' is not recognized as command, of course.
I replaced it with the following
echo "singularity pull docker://lethalfang/somaticseq:${VERSION}" >> $out_script
but I get the following error
FATAL: Failed to initialize runtime engine: engine "" is not found
Clearly, what I did is not working. Can you please look into it ?
Thanks
Hi,
I tried to run somaticseq (v3.2.1) with the following command line
~/.pyenv/versions/3.6.1/bin/python ${SomaticSeq_home}/somaticseq_parallel.py paired \
-outdir ${vcf_output_path}/SomaticSeq_new \
--genome-reference /site/ne/home/wings/ref_data/reference_genome/hg19/hg19_decoy/hg19.decoy.fa \
-nbam ${bam_n} \
-tbam ${bam_t} \
--threads 8 \
--mutect2-vcf ${tumour_id}.mutect2.pass.vcf \
--vardict-vcf ${tumour_id}.vardict.pass.vcf \
--strelka-snv ${tumour_id}.strelka2.snvs.pass.vcf \
--strelka-indel ${tumour_id}.strelka2.indels.pass.vcf
At the end, I've got the following error - somaticseq_parallel.py: error: the following arguments are required: -ref/--genome-reference
If you have suggestions, please let me know.
Thank you,
Joon
Error and change logs:
Traceback (most recent call last):
File "/home/huangwt/Codes/somaticseq/modify_VJSD.py", line 116, in
with genome.open_textfile(right_files[0]) as vcf:
File "/home/huangwt/Codes/somaticseq/genomic_file_handlers.py", line 224, in open_textfile
return gzip.open(file_name, 'rt')
File "/mnt/software/src/Python-3.2.3/Lib/gzip.py", line 46, in open
return GzipFile(filename, mode, compresslevel)
File "/mnt/software/src/Python-3.2.3/Lib/gzip.py", line 156, in init
raise IOError("Mode " + mode + " not supported")
IOError: Mode rt not supported
#Changed:
#File "/home/huangwt/Codes/somaticseq/genomic_file_handlers.py", line 224, in open_textfile
#return gzip.open(file_name, 'rt') to return gzip.open(file_name, 'r')
Traceback (most recent call last):
File "/home/huangwt/Codes/somaticseq/modify_VJSD.py", line 126, in
while line_i.startswith('#'):
TypeError: startswith first arg must be bytes or a tuple of bytes, not str
#Changed:
#File "/home/huangwt/Codes/somaticseq/modify_VJSD.py", line 126, in
#while line_i.startswith('#'): to while line_i.startswith(b'#'):
Traceback (most recent call last):
File "/home/huangwt/Codes/somaticseq/modify_VJSD.py", line 128, in
if re.match(r'##fileformat=', line_i):
File "/mnt/software/src/Python-3.2.3/Lib/re.py", line 153, in match
return _compile(pattern, flags).match(string)
TypeError: can't use a string pattern on a bytes-like object
Currently, part of the pipeline requires GATK4 and another (the wrapper) uses GATK3. To avoid needing two GATK installations to run the whole pipeline, it'd be best to update the wrapper to use GATK4.
this was discussed in #50. This ticket is for tracking purposes.
Hi Li,
Thanks for maintaining SomaticSeq with an excellent documentation and inline comments. I have just read SomaticSeq manuscript and am thinking of using it for the dog genome mutation data (matched tumor-normal wgs+wes data) I would much appreciate your comments on following:
Since top 18 or 20 features are largely based on individual callers and sequence quality, I like to give a try using default somaticseq features and model on canine mutation calls from gatk4 mutect2, varscan2 and lofreq. Based on --somaticseq-train
option, I can train a model based on canine data (n=60-80), though truthset is limited (disease specific mutations, ~100-200) and not at par with one used in DREAM challenge.
majority-vote consensus mode
instead of training or prediction mode, how much trade-off in precision/recall I should expect? My dataset approximates to setting C (~pure Normal, contaminated tumor) of Fig 3 in Fang et al. 2015.Does somaticseq have an option to run in tumor-only mode or preferably supplying panel of normal vcf instead of matched normal bam file?
Thanks,
Samir
I know this question has been asked before, but it doesn't look like it was resolved and I am getting a similar issue:
2020-06-02 23:02:32,540 - SomaticSeq - INFO - /home/vym1/.conda/envs/somaticseq/lib/python3.6/site-packages/somaticseq/../r_scripts/xgboost_model_builder_ntChange.R /path/to/Ensemble.sINDEL.tsv
Loading required package: lattice
Loading required package: ggplot2
Error: In training mode, there must be both true positives and false positives in the call set.
I only inputted SNV vcf files and a truth vcf file, but the error seems to be related to the INDEL tsv. I am not interested in INDELs at the moment. Is it still possible to proceed?
Dear maintainers,
Thank you for this nice program. Can you release a version (using for example git tag) so SomaticSeq can be used reproducibly? We can then refer to the version in papers etc.
Also, I was planning to package this program into Bioconda. This requires a version. It will make it easier for people to install your package and easier to integrate it in to our pipelines here at the institute.
So if you could please release a version, I will make sure it is packaged. Thanks in advance!
Hi,
I just tried to run SomaticSeq.Wrapper.sh for a Dream dataset. I got an error as followings. Could you let me know what the reason?
"Traceback (most recent call last):
File "/home/wangm6/resource/somaticseq-2.2.2/SSeq_merged.vcf2tsv.py", line 434, in
if args.cosmic_vcf: got_cosmic, cosmic_variants, cosmic_line = genome.find_vcf_at_coordinate(my_coordinate, cosmic_line, cosmic, chrom_seq)
File "/mnt/nfs/gigantor/ifs/DCEG/Home/wangm6/resource/somaticseq-2.2.2/genomic_file_handlers.py", line 617, in find_vcf_at_coordinate
latest_vcf_run = catchup_multilines(my_coordinate, latest_vcf_line, vcf_file_handle, chrom_seq)
File "/mnt/nfs/gigantor/ifs/DCEG/Home/wangm6/resource/somaticseq-2.2.2/genomic_file_handlers.py", line 574, in catchup_multilines
is_behind = whoisbehind( coordinate_i, coordinate_j, chrom_sequence )
File "/mnt/nfs/gigantor/ifs/DCEG/Home/wangm6/resource/somaticseq-2.2.2/genomic_file_handlers.py", line 378, in whoisbehind
chrom1_position = chrom_sequence[chrom1]
KeyError: '23'"
Dear author,
Could you add some examples with input data?
thank you
Yulu
Hi Litai,
score_Somatic.Variants.py calculate a oncoscore for SNV and Indel calls. Since version 2.1.2, score_Somatic.Variants.py has been removed. Do you have any plan to incorporate it into the newer version of SomaticSEQ? Or this score is not useful anymore?
Best regards,
Hongen
Hi,
I am running submit_callers_multiThreads.sh under Singularity using 20 threads.
I am running with various callers available, including VarDict. All the callers seem to be completing. I vardict script fails with the following error:
Start at 2020/07/03 22:02:34
FATAL: Unable to handle docker://lethalfang/somaticseq:3.5.1 uri: failed to get SHA of docker://lethalfang/somaticseq:3.5.1: Error reading manifest 3.5.1 in docker.io/lethalfang/somaticseq: manifest unknown: manifest unknown
This is because it is trying to run
somaticseq/utilities/split_mergedBed.py
Can you help me with the error ?
Also, why did you use split_mergedBed.py before VarDict ? this does not seem to be used for the other callers
Thanks
Hello,
Can you please upload the "fastq2bam_pipeline_singleThread.sh" script?
It is a requirement for running " somaticseq/utilities/dockered_pipelines/alignments/fastq2bam_pipeline.sh".
Thanks!
Hello,
While evaluating somaticseq for majority voting for Indels, I ran into an issue. Here's an example of how an Indel called in at least one caller is missed due to variant normalization with GATK CombineVariants.
The calls from two callers at the position are:
strelka.indel.vcf:1 76786670 . TTA T . QSI_ref
strelka.indel.vcf:1 76786670 . T TTA . PASS
mutect.indel.vcf:1 76786670 . TTA T . alt_allele_in_normal;germline_risk;str_contraction
The resulting entry in the merged VCF produced by GATK CombineVariants is:
CombineVariants_MVJSD.indel.vcf:1 76786670 . TTA T,TTATA . PASS
CombineVariants has normalized the variants here to create a multi-allelic record in the merged VCF where one is an insertion and one is a deletion.
As I understand, in the next step to get Ensemble.sIndel.tsv from this merged VCF, each allele in multiallelic ALT is treated separately and has its own row in the TSV file if the number of tools with a PASS call for that variant is > mincaller
(hardcoded as 0.5 in SomaticSeq.Wrapper.sh), the record is printed to Ensemble.sIndel.tsv and otherwise it's excluded completely.
In this case, the two variants it looks for after splitting multiallelic ALT are:
1:76786670:TTA>T
1:76786670:TTA>TTATA
Now this is where the problem occurs. The first variant is not PASS in any of the callers so it's not printed. The second is a PASS in Strelka but it is represented as 1:76786670:T>TTA
in Strelka VCF. somaticseq looks for the normalized indel in the original VCF and does not find the record so the variant is not printed from this point on.
Do you have any suggestions on how to work around this?
Thanks,
Prachi
Hi I was looking in your somatiseq. I have a quite naive question what do you suggest to use as a true snp and Indel vcf for model training. Let's say I have some new patient their mutation status is unknown. How can I generate the true set in a practical way?
Thanks a lot for your suggestions!
Hi,
Is it possible to update somaticseq on anaconda? current version there is 2.8.1
https://anaconda.org/bioconda/somaticseq
Thanks!
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.