bioinform / somaticseq Goto Github PK

View Code? Open in Web Editor NEW

190.0 12.0 53.0 67.16 MB

An ensemble approach to accurately detect somatic mutations using SomaticSeq

Home Page: http://bioinform.github.io/somaticseq/

License: BSD 2-Clause "Simplified" License

R 1.42% Shell 24.24% Python 73.23% Perl 0.22% Dockerfile 0.88%

somatic-variants cancer-genomics

somaticseq's People

Contributors

Stargazers

Watchers

Forkers

fw1121 dshulgin2015 rdbcasillas lozybean bgistone xflicsu genomicsnx hongenxu tyler5huang ruixiangliu scchess zzygyx9119 msahraeian xtmgah inambioinfo lethalfang congrongssh iranmdl simexin wangdi2014 wenluo711 vamst sunbymoon miro1942 whiteorchid zacshi ccgenomics pythseq shahirb tao-deng flywind2 zorrodong guoyuh cgpu evanburawa bacemdatascience gianfilippo wenliangz xuexiaohua-bio zprh shulp2211 zpeng1989 schaudge iserf dnousome nickier0510 gerde buihoangnam1988 duanshumeng yqyuhao xiaoyulei0406 hidaisyyu beratna-steve

somaticseq's Issues

submitSomaticSniper: unable to open file /mnt/<path-to-RESULTS>/1/1.bed

I am trying to run $SOMATICSEQ_HOME/utilities/singularities/submit_callers_multiThreads.sh analogously to the docker method described at https://github.com/bioinform/somaticseq/tree/master/utilities/dockered_pipelines, but it doesn't get past SomaticSniper, which fails in the end with these messages (I redacted the full path name for the issue here):

bash: /mnt//path-to/RESULTS/1/SomaticSniper.vcf: No such file or directory
Error: Unable to open file /mnt//path-to/RESULTS/1/1.bed. Exiting.

My RESULTS folder doesn't have a 1 directory; the files sought by the program are at the top-level:

$ ls RESULTS/
10.bed  12.bed  14.bed  16.bed  2.bed  4.bed  6.bed  8.bed  genome.bed  SomaticSniper.vcf
11.bed  13.bed  15.bed  1.bed   3.bed  5.bed  7.bed  9.bed  logs

Training Dataset

Is the Training dataset for somaticseq available somewhere ?

Arguments for each caller

I am running the different callers in order to have the vcf files for each. I was wondering if there is any differences in the arguments that you specify in the methods of your article.

Thanks!

submit_callers_multiThreads.sh missing "split" in call to submit_JointSNVMix2.sh

Hi,
I believe there is a missing "split" option in the call to submit_JointSNVMix2.sh within the submit_callers_multiThreads.sh script (probably also in the submit_callers_singleThread.sh).
Best

default processing run_somaticseq.py

addsnv.py ERROR/WARNINGS: encountered error in mutation spikein; could not pileup for region

Hi,
I am modifying your pipeline to get it to work without Docker or Singularity, because of the other issue I reported.

Right now, things seems to be (possibly) ok, until addsnv.py, which generates a very large number of the ERRORS and WARNINGS. Below is an example:
Start at 2019/11/06 16:15:23

ERROR 2019-11-06 16:15:28.826929 encountered error in mutation spikein: ['chr1_76895_76895_0.402845353397_None']

WARN 2019-11-06 16:15:28.952514 haplo_chr1_876440_876441 could not pileup for region: chr1:876404

ERROR 2019-11-06 16:15:33.110163 encountered error in mutation spikein: ['chr1_959852_959852_0.105696809142_None']

ERROR 2019-11-06 16:15:33.162897 encountered error in mutation spikein: ['chr1_1031390_1031390_0.251373895866_None']

WARN 2019-11-06 16:15:33.371460 haplo_chr1_1554292_1554293 could not pileup for region: chr1:1554243
WARN 2019-11-06 16:15:34.074624 haplo_chr1_1554292_1554293 could not pileup for region: chr1:1554259

I do get an "unsorted.snvs.added.bam" at the end, but in the next step, makevcf.py, I seem to be missing the file "addsnv_logs_unsorted.snvs.added.bam".
I was unable to identify the script that should create it.

What do you think ?

Thanks
Gianfilippo

The format of Truth VCF files

I tried to run somaticseq to train the model based on truth VCF files. However, after running SSeq_merged.vcf2tsv.py step, I found the last column (TrueVariant_or_False) of the Ensemble.sSNV.tsv are incorrect. Some true variants are tagged as "0". I traced back and found the truth VCF file I used ordered chromosome in the the default ordering (1,10, 11, ..., 2, MT,X) rather than natural ordering (1,2, ..., 10, 11, ..., MT,X). Is this the reason that produced last column in error? If yes, what's the ordering requirement for a truth VCF?
Thanks,

hg19 and b37

Can I train a model using all the b17 files and then predict somatic variants using hg19 files? Same question about the opposite way? Thanks,

somatic mutation spike in pipeline without docker

Hi,

is there a way to run the "in silico somatic mutation spike in pipeline" without docker ?

Do you have a version for singularity ?

Thanks

RealignerTargetCreator and IndelRealigner steps

Hi,

is the IndelRealigner really needed ?

I am asking since GATK4 does no longer use it.

Does any of the other callers used by somaticseq need it ?

Thanks
Gianfilippo

failed prediciton

Hi, a prediction call to somaticseq_parallel.py (20 threads) failed on a sample with the error below. Can you please help ?

Traceback (most recent call last):
File "/ycga-gpfs/apps/hpc/software/Python/3.7.0-foss-2018b/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/ycga-gpfs/apps/hpc/software/Python/3.7.0-foss-2018b/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "/gpfs/ycga/$HOME/.local/lib/python3.7/site-packages/SomaticSeq-3.3.0-py3.7.egg/EGG-INFO/scripts/somaticseq_parallel.py", line 31, in runPaired_by_region
run_somaticseq.runPaired(outdir_i, ref, tbam, nbam, tumor_name, normal_name, truth_snv, truth_indel, classifier_snv, classifier_indel, pass_threshold, lowqual_threshold, hom_threshold, het_threshold, dbsnp, cosmic, inclusion, exclusion, mutect, indelocator, mutect2, varscan_snv, varscan_indel, jsm, sniper, vardict, muse, lofreq_snv, lofreq_indel, scalpel, strelka_snv, strelka_indel, tnscope, platypus, min_mq, min_bq, min_caller, somaticseq_train, ensembleOutPrefix, consensusOutPrefix, classifiedOutPrefix, algo, keep_intermediates)
File "$HOME/.local/lib/python3.7/site-packages/SomaticSeq-3.3.0-py3.7.egg/somaticseq/run_somaticseq.py", line 88, in runPaired
tsv2vcf.tsv2vcf(classifiedSnvTsv, classifiedSnvVcf, snvCallers, pass_score=pass_threshold, lowqual_score=lowqual_threshold, hom_threshold=hom_threshold, het_threshold=het_threshold, single_mode=False, paired_mode=True, normal_sample_name=normal_name, tumor_sample_name=tumor_name, print_reject=True, phred_scaled=True)
File "$HOME/.local/lib/python3.7/site-packages/SomaticSeq-3.3.0-py3.7.egg/somaticseq/SSeq_tsv2vcf.py", line 110, in tsv2vcf
with open(tsv_fn) as tsv, open(vcf_fn, 'w') as vcf:
FileNotFoundError: [Errno 2] No such file or directory: '$HOME/test/Sample/18/SSeq.Classified.sSNV.tsv'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "$HOME/.local/bin/somaticseq_parallel.py", line 4, in
import('pkg_resources').run_script('SomaticSeq==3.3.0', 'somaticseq_parallel.py')
File "/ycga-gpfs/apps/hpc/software/Python/3.7.0-foss-2018b/lib/python3.7/site-packages/pkg_resources/init.py", line 658, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/ycga-gpfs/apps/hpc/software/Python/3.7.0-foss-2018b/lib/python3.7/site-packages/pkg_resources/init.py", line 1438, in run_script
exec(code, namespace, namespace)
File "/gpfs/ycga/$HOME/.local/lib/python3.7/site-packages/SomaticSeq-3.3.0-py3.7.egg/EGG-INFO/scripts/somaticseq_parallel.py", line 113, in
subdirs = pool.map(runPaired_by_region_i, bed_splitted)
File "/ycga-gpfs/apps/hpc/software/Python/3.7.0-foss-2018b/lib/python3.7/multiprocessing/pool.py", line 268, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/ycga-gpfs/apps/hpc/software/Python/3.7.0-foss-2018b/lib/python3.7/multiprocessing/pool.py", line 657, in get
raise self._value
FileNotFoundError: [Errno 2] No such file or directory: '$HOME/test/Sample/18/SSeq.Classified.sSNV.tsv'
slurmstepd: error: Detected 3 oom-kill event(s) in step 9794252.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

freebayes or other callers

Hi,

is there a way to include vcfs generated by other callers, like freebayes ?

Thanks
Gianfilippo

multi-thread run clarification

Hi,

I am at the point where I can generate the training data (although I am not sure on how to set Nsnvs, Nindels, Nsvs with WES).

I also got most of the callers to work (with the reported exception) with 10 threads

The somaticseq_parallel.py also successfully completed the run and generated the Ensemble.sINDEL.tsv and Ensemble.sSNV.tsv within each subDIR corresponding to the 10 threads.

As far as I have understood, now I am supposed to combine the individual Ensemble.sINDEL.tsv and Ensemble.sSNV.tsv and use the resulting overall Ensemble.sINDEL.tsv and Ensemble.sSNV.tsv files for training.

Can you please let me know how I can combine the individual Ensemble.sINDEL.tsv and Ensemble.sSNV.tsv ?

Thanks

updates to data download page

On http://bioinform.github.io/somaticseq/data.html

Linked instructions to GeneTorrent now say to use Bionimbus instead. It might be better to point directly to those new instructions.
SomaticSpike: "We will include the link to the spiked BAM files when they are done uploading". Maybe an update is needed here.
Analysis Files - both links are broken.

ImportError: libcurl.so.4: cannot open shared object file: No such file or directory

Hi again,

I am having another issue with a libcurl.so.4 import error.
Start at 2019/10/27 15:20:45
Traceback (most recent call last):
File "/usr/local/bamsurgeon/scripts/sortedBamSplit.py", line 4, in
import pysam, random, argparse, sys
File "/softwarepath/Python/2.7.13-foss-2016b/lib/python2.7/site-packages/pysam-0.10.0-py2.7-linux-x86_64.egg/pysam/init.py", line 5, in
from pysam.libchtslib import *
ImportError: libcurl.so.4: cannot open shared object file: No such file or directory

This happens at the following command in the (bamsurgeon) file your script generates
singularity exec --bind /:/mnt docker://lethalfang/bamsurgeon:1.1-3
/usr/local/bamsurgeon/scripts/sortedBamSplit.py
--bam /mnt//home/Test/Sample_G1700T_012/1/TNMerged.bam
--proportion 0.5
--downsample 1
--pick1 /mnt//home/Test/Sample_G1700T_012/1/Designated.Normal.bam
--pick2 /mnt//home/Test/Sample_G1700T_012/1/Designated.Tumor.bam
--seed 0

It seems it is trying to run pycurl from a Python 2.7, while everything else runs from python 3.x

Do you think this is some env variable that I have to set or some sort of inconsistency in the docker file ?
Unfortunately I am not very familiar with either docker or singularity

Thanks

Strelka parameters on targeted amplicon panels, single mode (only tumor sample). Using the docker file for run.

Hi!
First of all, thank you for this awesome tool!

We are using Somaticseq for target panel sequencing data, in consensus mode, using only the tumor sample. For the calculation of sensitivity, we observe the results using the horizon 200 control (somatic variants of FFPE tissue), we have good results, however, the Strelka2 (one of the software used with SomaticSeq) does not identify the SNVs, it is only identifying the Indels.
According to Strelka's recommendations, we added the parameter --targeted, however the results were the same.

We are using the image of the docker for Strelka2.
Is there a difference between the docker and the installed versions of Strelka?

If you can give me any recommendation, I would be very grateful.

Thank you

The horizon 200 (FFPE control), strelka have only 0 (It is not identifying control variants)

No sample information error

Hi, I tried to run SomaticSeq.Wrapper.sh to combine four VCF files.
I got following error:
Traceback (most recent call last):
File "~/modify_VJSD.py", line 269, in
raise Exception('No sample information here.')
Exception: No sample information here.
I checked that one VCF file called from lofreq has no sample column. Is this the reason? But some other samples are running well even with the lofreq VCFs without sample column.

adding Strelka

Hello,
I am using somaticseq for carrying out ensembl calling on my tumour-normal paired exome data. I am also using Strelka as one of my somatic callers and don't see it as one of the options. Is there any development going on to incorporate strelka into the mix?

If I were to try and modify the code myself, are there any points to keep in mind so as to not break any functionality?

thanks
Arun

clarification: ada_model_builder_ntChange.R vs ada_model_builder.R

Hi,

what is the difference between ada_model_builder_ntChange.R and ada_model_builder.R.

The name of the .RData model file of my first sample has the "ada: string in it, while the new samples show "ntChange". Can you please explain ?

Thanks

Could you provide documentation on the 70 features extracted by somaticseq? Trying to better understand SomaticSeq's model

TSV to VCF conversion

Trying to convert the SomaticSeq output to a VCF format and getting the following

python3.4 somaticseq-2.2.1/SSeq_tsv2vcf.py -tsv Sample1_trained.tsv -vcf Sample-trained.vcf -pass 0.7 -low 0.1 -tools VarScan2 SomaticSniper MuSE -all -phred

File "somaticseq-2.2.1/SSeq_tsv2vcf.py", line 234, in <module>
    n_MQ0     = tsv_item[nBAM_MQ0]             if tsv_item[nBAM_MQ0]             != 'nan' else '.'
NameError: name 'nBAM_MQ0' is not defined

GATK4 Mutect2

Hi!

Does somaticseq support GATK4 Mutect2 as well, as opposed to GATK3 MuTect2 which is mentioned in the documentation?

Thanks!

slurm vs sge

try to use BamsimMutiThread.sh to create own synthetic tumor/normal pair, but out system use slurm not sge. I substituted "--action qsub" with "--action qsub". Is this correct?
The process seems to run, but return a lot of errors as "ERROR 2019-12-11 22:12:34.719984 encountered error in mutation spikein: ['chr9_104
14396_10414396_0.479213988448_None']
Traceback (most recent call last):
File "/usr/local/bamsurgeon/bin/addsnv.py", line 156, in makemut
mutfail, hasSNP, maxfrac, outreads, mutreads, mutmates = mutation.mutate(args, log, b
amfile, bammate, chrom, min(mutpos_list), max(mutpos_list)+1, mutpos_list, avoid=avoid, m
utid_list=mutid_list, is_snv=True, mutbase_list=mutbase_list, reffile=reffile)
File "build/bdist.linux-x86_64/egg/bamsurgeon/mutation.py", line 239, in mutate
assert maxfrac is not None, "Error: could not pile up over region: %s" % region
AssertionError: Error: could not pile up over region: haplo_chr9_10414396_10414397
"
and finally hang at "haplo_chr9_110468931_110468931 aligning addsnv.t
mp/haplo_chr9_110468931_110468931.tmpbam.1067bfd6-fe56-412d-8b73-035e6bc6bcc1.fastq with
bwa mem"

Exception: LoFreq.snvs.concat.vcf.gz does not seem to be properly sorted

Hi,

I got the following error when running SomaticSeq train with the following command:

singularity exec /groups/zuber/zubarchive/USERS/tobias/.singularity/somaticseq:3.0.0.img \
/opt/somaticseq/somaticseq/run_somaticseq.py \
--somaticseq-train \
--output-directory /groups/obenauf/NGS_data/GIAB/tumor-normal/somaticSeq/train/model \
--genome-reference /groups/obenauf/Software/hg38/Homo_sapiens_assembly38.fasta \
--inclusion-region /groups/obenauf/NGS_data/GIAB/tumor-normal/benchmark/truth/na12878-na24385-somatic-hg38-truth-regions.bed \
 \
 \
--dbsnp /groups/zuber/zubarchive/USERS/tobias/hg38/GATK/Homo_sapiens_assembly38.dbsnp138.vcf \
 \
 \
--truth-snv /groups/obenauf/NGS_data/GIAB/tumor-normal/somaticSeq/train/truth/na12878-na24385-somatic-hg38-truth.snp.recode.vcf.gz \
--truth-indel /groups/obenauf/NGS_data/GIAB/tumor-normal/somaticSeq/train/truth/na12878-na24385-somatic-hg38-truth.indel.recode.vcf.gz \
 \
paired \
--tumor-bam-file /groups/obenauf/NGS_data/GIAB/tumor-normal/aln/NA12878_NA24385_mixture.aligned.sorted.dedup.bam \
--normal-bam-file  /groups/obenauf/NGS_data/GIAB/tumor-normal/aln/NA24385_normal.aligned.sorted.dedup.bam \
 \
 \
--mutect2-vcf /groups/obenauf/NGS_data/GIAB/tumor-normal/somaticSeq/train/trainingInput/MuTect2.concat.vcf \
 \
 \
 \
--somaticsniper-vcf /groups/obenauf/NGS_data/GIAB/tumor-normal/somaticSeq/train/trainingInput/SomaticSniper.concat.vcf \
--vardict-vcf /groups/obenauf/NGS_data/GIAB/tumor-normal/somaticSeq/train/trainingInput/VarDict.concat.vcf \
 \
--lofreq-snv /groups/obenauf/NGS_data/GIAB/tumor-normal/somaticSeq/train/trainingInput/LoFreq.snvs.concat.vcf.gz \
--lofreq-indel /groups/obenauf/NGS_data/GIAB/tumor-normal/somaticSeq/train/trainingInput/LoFreq.indels.concat.vcf.gz \
--scalpel-vcf /groups/obenauf/NGS_data/GIAB/tumor-normal/somaticSeq/train/trainingInput/Scalpel.concat.vcf \
--strelka-snv /groups/obenauf/NGS_data/GIAB/tumor-normal/somaticSeq/train/trainingInput/Strelka.snvs.concat.vcf.gz \
--strelka-indel /groups/obenauf/NGS_data/GIAB/tumor-normal/somaticSeq/train/trainingInput/Strelka.indels.concat.vcf.gz > sseq_train.log 2> sseq_train.err

This is the error.

2018-11-04 09:49:27,786 - SomaticSeq - INFO - SomaticSeq Input Arguments: output_directory=/groups/obenauf/NGS_data/GIAB/tumor-normal/somaticSeq/train/model, genome_reference=/groups/obenauf/Software/hg38/Homo_sapiens_assembly38.fasta, truth_snv=/groups/obenauf/NGS_data/GIAB/tumor-normal/somaticSeq/train/truth/na12878-na24385-somatic-hg38-truth.snp.recode.vcf.gz, truth_indel=/groups/obenauf/NGS_data/GIAB/tumor-normal/somaticSeq/train/truth/na12878-na24385-somatic-hg38-truth.indel.recode.vcf.gz, classifier_snv=None, classifier_indel=None, pass_threshold=0.5, lowqual_threshold=0.1, homozygous_threshold=0.85, heterozygous_threshold=0.01, minimum_mapping_quality=1, minimum_base_quality=5, minimum_num_callers=0.5, dbsnp_vcf=/groups/zuber/zubarchive/USERS/tobias/hg38/GATK/Homo_sapiens_assembly38.dbsnp138.vcf, cosmic_vcf=None, inclusion_region=/groups/obenauf/NGS_data/GIAB/tumor-normal/benchmark/truth/na12878-na24385-somatic-hg38-truth-regions.bed, exclusion_region=None, threads=1, keep_intermediates=False, somaticseq_train=True, tumor_bam_file=/groups/obenauf/NGS_data/GIAB/tumor-normal/aln/NA12878_NA24385_mixture.aligned.sorted.dedup.bam, normal_bam_file=/groups/obenauf/NGS_data/GIAB/tumor-normal/aln/NA24385_normal.aligned.sorted.dedup.bam, tumor_sample=TUMOR, normal_sample=NORMAL, mutect_vcf=None, indelocator_vcf=None, mutect2_vcf=/groups/obenauf/NGS_data/GIAB/tumor-normal/somaticSeq/train/trainingInput/MuTect2.concat.vcf, varscan_snv=None, varscan_indel=None, jsm_vcf=None, somaticsniper_vcf=/groups/obenauf/NGS_data/GIAB/tumor-normal/somaticSeq/train/trainingInput/SomaticSniper.concat.vcf, vardict_vcf=/groups/obenauf/NGS_data/GIAB/tumor-normal/somaticSeq/train/trainingInput/VarDict.concat.vcf, muse_vcf=None, lofreq_snv=/groups/obenauf/NGS_data/GIAB/tumor-normal/somaticSeq/train/trainingInput/LoFreq.snvs.concat.vcf.gz, lofreq_indel=/groups/obenauf/NGS_data/GIAB/tumor-normal/somaticSeq/train/trainingInput/LoFreq.indels.concat.vcf.gz, scalpel_vcf=/groups/obenauf/NGS_data/GIAB/tumor-normal/somaticSeq/train/trainingInput/Scalpel.concat.vcf, strelka_snv=/groups/obenauf/NGS_data/GIAB/tumor-normal/somaticSeq/train/trainingInput/Strelka.snvs.concat.vcf.gz, strelka_indel=/groups/obenauf/NGS_data/GIAB/tumor-normal/somaticSeq/train/trainingInput/Strelka.indels.concat.vcf.gz, tnscope_vcf=None, which=paired
2018-11-04 09:53:55,647 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
/usr/local/lib/python3.6/dist-packages/scipy/stats/stats.py:4971: RuntimeWarning: invalid value encountered in double_scalars
  z = (s - expected) / np.sqrt(n1*n2*(n1+n2+1)/12.0)
Traceback (most recent call last):
  File "/opt/somaticseq/somaticseq/run_somaticseq.py", line 357, in <module>
    keep_intermediates = runParameters['keep_intermediates'] )
  File "/opt/somaticseq/somaticseq/run_somaticseq.py", line 71, in runPaired
    somatic_vcf2tsv.vcf2tsv(is_vcf=outSnv, nbam_fn=nbam, tbam_fn=tbam, truth=truth_snv, cosmic=cosmic, dbsnp=dbsnp, mutect=mutect_infile, varscan=varscan_snv, jsm=jsm, sniper=sniper, vardict=intermediateVcfs['VarDict']['snv'], muse=muse, lofreq=lofreq_snv, scalpel=None, strelka=strelka_snv, tnscope=intermediateVcfs['TNscope']['snv'], dedup=True, min_mq=min_mq, min_bq=min_bq, min_caller=min_caller, ref_fa=ref, p_scale=None, outfile=ensembleSnv)
  File "/opt/somaticseq/somaticseq/../somaticseq/somatic_vcf2tsv.py", line 400, in vcf2tsv
    if lofreq:  got_lofreq,  lofreq_variants,  lofreq_line  = genome.find_vcf_at_coordinate(my_coordinate, lofreq_line,  lofreq,  chrom_seq)
  File "/opt/somaticseq/somaticseq/../genomicFileHandler/genomic_file_handlers.py", line 563, in find_vcf_at_coordinate
    latest_vcf_run  = catchup_multilines(my_coordinate, latest_vcf_line, vcf_file_handle, chrom_seq)
  File "/opt/somaticseq/somaticseq/../genomicFileHandler/genomic_file_handlers.py", line 514, in catchup_multilines
    raise Exception('{} does not seem to be properly sorted'.format(filehandle_j.name) )
Exception: /groups/obenauf/NGS_data/GIAB/tumor-normal/somaticSeq/train/trainingInput/LoFreq.snvs.concat.vcf.gz does not seem to be properly sorted

I checked and there are variants in the LoFreq.snvs.concat.vcf.gz vcf file and it appears to be sorted otherwise indexing with tabix would not work right?

Any ideas?

dockered VarScan2 issue

Hi,

I have to report also an issue with VarScan2. The run did not complete.

It did create the normal and tumor pileup files, then it fails

The error is
^[[31mFATAL: ^[[0m Unable to handle docker:///:/mnt uri: failed to get SHA of docker:///:/mnt: unable to parse image name docker:///:/mnt: invalid reference format

Can you please look into it ?

Thanks

Some clarifications?

I am confused with regards to how one should train SomaticSeq, and I had a few questions about that and the cluster:

I noticed in the guide that there was an option to use somaticseq_parallel.py with truth vcf files to generate the classifier scripts. However, I also noticed that there is another section that talks about creating synthetic BAM files in order to train the classifier, a method that does not use somaticseq_parallel.py. Which should we use, and which is more effective?
If we decide to go with the synthetic BAM file approach, are we just using a two identical BAM files or are we using two independent duplicates (ie. different BAM files but the same sample)? Should we use the tumor or normal?
How coverage dependent is the classifier? I have not tried it, but does training at a lower coverage significantly hinder classification at higher coverages?
If we use SLURM instead of SGE, is there a way this could still work?

SomaticSeq - regex error

SomaticSeq issues with regex versions and getting the same error. Which version of regex is specifically required by SomaticSeq ?
2016.03.26, regex-2016.4.8 and 2016.06.05

Traceback (most recent call last):
  File "/programs/ngsprograms/somaticseq_latest/somaticseq/modify_VJSD.py", line 15, in <module>
    import regex as reg
  File "/usr/local/lib/python3.4/site-packages/regex.py", line 683, in <module>
    _pattern_type = type(_compile("", 0, {}))
  File "/usr/local/lib/python3.4/site-packages/regex.py", line 545, in _compile
    parsed = parsed.optimise(info, reverse)
TypeError: optimise() takes 2 positional arguments but 3 were given

Would appreciate help

Ensemble.sSNV.tsv does not have some vardict variant.

Hi,

using somaticseq, i analysis WES data.
i ran al l caller somaticseq support, and carried out merge.script.

but it makes some issue.
the output 'Ensemble.sSNV.tsv ' does not match with vardict output.

for example

VarDict.vcf have 'chr1 | 3712588 | . | G | A'

but in Ensemble.sSNV.ts if_VarDict is '0'

was the variant filtered by script?

Best Regards

Jeongmin

docker error - docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:319: getting the final child's pid from pipe caused \"EOF\"": unknown.

Hi,
using your docker environment and prediction script, i analysis my WES data.
but about a week ago, docker often stops working while printing following message.

docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:319: getting the final child's pid from pipe caused \"EOF\"": unknown.

i try to search solution of this error, but i could not find anything to solve it.

so do you have some ideas of this?

Best Regards

Jeongmin

Error: In training mode, there must be both true positives and false positives in the call set.

Hi,
I am getting the error below when I run somaticseq in training mode. Can you please let me know what am I doing wrong ? Thanks

Loading required package: ada
Loading required package: rpart
Remove Consistent_Mates
Remove Inconsistent_Mates
[1] "Fitting model..."
[1] "Seed = 21860"
Call:
ada(model_formula, data = train_data, iter = boosting_iters,
control = rpart.control(cp = -1, maxdepth = 16, minsplit = 0,
xval = 0))

Loss: exponential Method: discrete Iteration: 500

Final Confusion Matrix for Data:
Final Prediction
True value 0 1
0 5762 0
1 0 1

Train Error: 0

Out-Of-Bag Error: 0 iteration= 10

Additional Estimates of number of iterations:

train.err1 train.kap1
2 2

WARNING: ignoring environment value of R_HOME
Loading required package: ada
Loading required package: rpart
Error: In training mode, there must be both true positives and false positives in the call set.
Execution halted
2020-03-24 21:18:29,861 - SomaticSeq - INFO - Removed: Sample_Test/1.th.input.bed
2020-03-24 21:18:29,863 - SomaticSeq - INFO - Removed sub-directory: Sample_Test/1

is it possible to support both mutect and mutect2

Hi Dr. Fang,

in the manual, you mentioned that,

If both MuTect2 and MuTect/Indelocator VCF les are provided, the script is written such
that it will use MuTect2 and ignore MuTect.

since SNV calls of mutect and mutect2 are very different (http://gatkforums.broadinstitute.org/gatk/discussion/6794/difference-between-mutect-1-1-7-and-mutect2), it makes sense to join them together.

Best,
Hongen

SomaticSeq for targeted panel data

Hi!
First of all, thank you for this awesome tool!
I would like to know if I could use SomaticSeq for target panel sequencing data. Concretely, I am managing Ion Torrent panel data. We have ~50 patients with mostly all the detected variants validated which, I guess, is a good starting point for the training step.
However, I am not sure if the tool would perform good in target rather than WGS or WES.

Any ideas?
Thanks!

No succesful mutations?

I'm trying to use your singularity pipeline in singularities/bamSimulator to inject mutation into a bam of the human genome (using BamSimulator_multiThreads.sh). The creation of the output directories and scripts goes fine, as does splitting the bam file into multiple regions, But when the script gets to actually creating and adding SNVs, I get a long set of errors that take the form:

ERROR   2019-12-20 14:50:57.380525      encountered error in mutation spikein: ['1_787825_787825_0.549717808335_None']
Traceback (most recent call last):
  File "/usr/local/bamsurgeon/bin/addsnv.py", line 275, in makemut
    aligners.remap_bam(args.aligner, tmpoutbamname, args.refFasta, alignopts, mutid=hapstr, paired=(not args.single), picardjar=args.picardjar, insane=args.i
nsane)
  File "/usr/local/lib/python2.7/dist-packages/bamsurgeon-1.0-py2.7.egg/bamsurgeon/aligners.py", line 67, in remap_bam
    remap_bwamem_bam(bamfn, threads, fastaref, picardjar, mutid=mutid, paired=paired, insane=insane)
  File "/usr/local/lib/python2.7/dist-packages/bamsurgeon-1.0-py2.7.egg/bamsurgeon/aligners.py", line 211, in remap_bwamem_bam
    raise ValueError("ERROR\t" + now() + "\t" + mutid + "\tbam readcount < fastq readcount, alignment sanity check failed!\n")
ValueError: ERROR       2019-12-20 14:50:57.380429      haplo_1_787825_787825   bam readcount < fastq readcount, alignment sanity check failed!

Then at the very end, There is a message that says "no successful mutations", and then samtools complains about not being able to open "unsorted.snvs.added.bam" and the script bails out. This happened for all the threads. Do you have any suggestions about where to start trying to debug this?

Thanks

question about indel calls by "NUM_TOOLS=0"

Hi Litai,

I used SSeq_merged.vcf2tsv.py to merge indel calls by four callers, CGA VarScan2 VarDict LoFreq, and then used SSeq_tsv2vcf.py to change TSV file to VCF file. In the output VCF file, I found some indel calls with KEY "NUM_TOOLS=0". I thought all the INDEL calls in the output file should be called by one caller. Would you please help me?

1 28796 . T TATCCCATCCCACCCCATCCCATTCCATCCCATCCTGTATTCCATCCCCATCCCATTCCAATCCCATCCTTATTCCATCCCCATCCCATCCCATCCCATCCCATCCC 0.0000 REJECT MVDL=0,0,0,0;NUM_TOOLS=0 GT:DP4:CD4:refMQ:altMQ:refBQ:altBQ:refNM:altNM:fetSB:fetCD:zMQ:zBQ:VAF 0/0:3,0,0,0:3,0,0,0:53.3333:.:36.3333:.:1.33333:.:1.00:1.00:.:.:0 0/0:2,0,0,0:2,0,0,0:60:.:35.5:.:2:.:1.00:1.00:.:.:0

Best,
Hongen

publication classifiers

Hi Li,

Would you kindly make the sSNV.Classifier.RData and sINDEL.Classifier.RData files you produced for the SomaticSeq publication available?

Thanks!

singularity submit_SomaticSeq.sh code error ?

Hi,

I got another error at the point where the submit_SomaticSeq.sh gets executed (the singularity version).

The error is
/var/spool/slurmd/job9700596/slurm_script: line 14: docker: command not found

I checked the submit_SomaticSeq.sh code and it appears that at line 270 you have the following
echo "docker pull lethalfang/somaticseq:${VERSION}" >> $out_script

the 'docker' is not recognized as command, of course.

I replaced it with the following
echo "singularity pull docker://lethalfang/somaticseq:${VERSION}" >> $out_script

but I get the following error
FATAL: Failed to initialize runtime engine: engine "" is not found

Clearly, what I did is not working. Can you please look into it ?

Thanks

Error with somaticseq_parallel.py

Hi,

I tried to run somaticseq (v3.2.1) with the following command line

~/.pyenv/versions/3.6.1/bin/python ${SomaticSeq_home}/somaticseq_parallel.py paired \
-outdir ${vcf_output_path}/SomaticSeq_new \
--genome-reference /site/ne/home/wings/ref_data/reference_genome/hg19/hg19_decoy/hg19.decoy.fa \
-nbam ${bam_n} \
-tbam ${bam_t} \
--threads 8 \
--mutect2-vcf ${tumour_id}.mutect2.pass.vcf \
--vardict-vcf ${tumour_id}.vardict.pass.vcf \
--strelka-snv ${tumour_id}.strelka2.snvs.pass.vcf \
--strelka-indel ${tumour_id}.strelka2.indels.pass.vcf

At the end, I've got the following error - somaticseq_parallel.py: error: the following arguments are required: -ref/--genome-reference

If you have suggestions, please let me know.

Thank you,
Joon

Issue with Varscan modify

Error and change logs:

Traceback (most recent call last):
File "/home/huangwt/Codes/somaticseq/modify_VJSD.py", line 116, in
with genome.open_textfile(right_files[0]) as vcf:
File "/home/huangwt/Codes/somaticseq/genomic_file_handlers.py", line 224, in open_textfile
return gzip.open(file_name, 'rt')
File "/mnt/software/src/Python-3.2.3/Lib/gzip.py", line 46, in open
return GzipFile(filename, mode, compresslevel)
File "/mnt/software/src/Python-3.2.3/Lib/gzip.py", line 156, in init
raise IOError("Mode " + mode + " not supported")
IOError: Mode rt not supported

#Changed:
#File "/home/huangwt/Codes/somaticseq/genomic_file_handlers.py", line 224, in open_textfile
#return gzip.open(file_name, 'rt') to return gzip.open(file_name, 'r')

Traceback (most recent call last):
File "/home/huangwt/Codes/somaticseq/modify_VJSD.py", line 126, in
while line_i.startswith('#'):
TypeError: startswith first arg must be bytes or a tuple of bytes, not str

#Changed:
#File "/home/huangwt/Codes/somaticseq/modify_VJSD.py", line 126, in
#while line_i.startswith('#'): to while line_i.startswith(b'#'):

Traceback (most recent call last):
File "/home/huangwt/Codes/somaticseq/modify_VJSD.py", line 128, in
if re.match(r'##fileformat=', line_i):
File "/mnt/software/src/Python-3.2.3/Lib/re.py", line 153, in match
return _compile(pattern, flags).match(string)
TypeError: can't use a string pattern on a bytes-like object

SomaticSeq.Wrapper.sh should support GATK4

Currently, part of the pipeline requires GATK4 and another (the wrapper) uses GATK3. To avoid needing two GATK installations to run the whole pipeline, it'd be best to update the wrapper to use GATK4.

this was discussed in #50. This ticket is for tracking purposes.

Using somaticseq for somatic calls in non-human species and tumor-only mode

Hi Li,

Thanks for maintaining SomaticSeq with an excellent documentation and inline comments. I have just read SomaticSeq manuscript and am thinking of using it for the dog genome mutation data (matched tumor-normal wgs+wes data) I would much appreciate your comments on following:

Since top 18 or 20 features are largely based on individual callers and sequence quality, I like to give a try using default somaticseq features and model on canine mutation calls from gatk4 mutect2, varscan2 and lofreq. Based on --somaticseq-train option, I can train a model based on canine data (n=60-80), though truthset is limited (disease specific mutations, ~100-200) and not at par with one used in DREAM challenge.
- What would be a better approach to run somaticseq in this context?
- If I default to majority-vote consensus mode instead of training or prediction mode, how much trade-off in precision/recall I should expect? My dataset approximates to setting C (~pure Normal, contaminated tumor) of Fig 3 in Fang et al. 2015.
Does somaticseq have an option to run in tumor-only mode or preferably supplying panel of normal vcf instead of matched normal bam file?

Thanks,
Samir

Error: In training mode, there must be both true positives and false positives in the call set

I know this question has been asked before, but it doesn't look like it was resolved and I am getting a similar issue:

2020-06-02 23:02:32,540 - SomaticSeq - INFO - /home/vym1/.conda/envs/somaticseq/lib/python3.6/site-packages/somaticseq/../r_scripts/xgboost_model_builder_ntChange.R /path/to/Ensemble.sINDEL.tsv
Loading required package: lattice
Loading required package: ggplot2
Error: In training mode, there must be both true positives and false positives in the call set.

I only inputted SNV vcf files and a truth vcf file, but the error seems to be related to the INDEL tsv. I am not interested in INDELs at the moment. Is it still possible to proceed?

Please release a version

Dear maintainers,

Thank you for this nice program. Can you release a version (using for example git tag) so SomaticSeq can be used reproducibly? We can then refer to the version in papers etc.

Also, I was planning to package this program into Bioconda. This requires a version. It will make it easier for people to install your package and easier to integrate it in to our pipelines here at the institute.

So if you could please release a version, I will make sure it is packaged. Thanks in advance!

KeyError: '23'

Hi,
I just tried to run SomaticSeq.Wrapper.sh for a Dream dataset. I got an error as followings. Could you let me know what the reason?
"Traceback (most recent call last):
File "/home/wangm6/resource/somaticseq-2.2.2/SSeq_merged.vcf2tsv.py", line 434, in
if args.cosmic_vcf: got_cosmic, cosmic_variants, cosmic_line = genome.find_vcf_at_coordinate(my_coordinate, cosmic_line, cosmic, chrom_seq)
File "/mnt/nfs/gigantor/ifs/DCEG/Home/wangm6/resource/somaticseq-2.2.2/genomic_file_handlers.py", line 617, in find_vcf_at_coordinate
latest_vcf_run = catchup_multilines(my_coordinate, latest_vcf_line, vcf_file_handle, chrom_seq)
File "/mnt/nfs/gigantor/ifs/DCEG/Home/wangm6/resource/somaticseq-2.2.2/genomic_file_handlers.py", line 574, in catchup_multilines
is_behind = whoisbehind( coordinate_i, coordinate_j, chrom_sequence )
File "/mnt/nfs/gigantor/ifs/DCEG/Home/wangm6/resource/somaticseq-2.2.2/genomic_file_handlers.py", line 378, in whoisbehind
chrom1_position = chrom_sequence[chrom1]
KeyError: '23'"

request for examples

Dear author,

Could you add some examples with input data?

thank you

Yulu

Is oncoscore available after the removal of score_Somatic.Variants.py in version newer than 2.1.2

Hi Litai,

score_Somatic.Variants.py calculate a oncoscore for SNV and Indel calls. Since version 2.1.2, score_Somatic.Variants.py has been removed. Do you have any plan to incorporate it into the newer version of SomaticSEQ? Or this score is not useful anymore?

Best regards,
Hongen

dockered vardict issue

Hi,

I am running submit_callers_multiThreads.sh under Singularity using 20 threads.

I am running with various callers available, including VarDict. All the callers seem to be completing. I vardict script fails with the following error:

Start at 2020/07/03 22:02:34
FATAL: Unable to handle docker://lethalfang/somaticseq:3.5.1 uri: failed to get SHA of docker://lethalfang/somaticseq:3.5.1: Error reading manifest 3.5.1 in docker.io/lethalfang/somaticseq: manifest unknown: manifest unknown

This is because it is trying to run
somaticseq/utilities/split_mergedBed.py

Can you help me with the error ?

Also, why did you use split_mergedBed.py before VarDict ? this does not seem to be used for the other callers

Thanks

Fastq2bam

Hello,
Can you please upload the "fastq2bam_pipeline_singleThread.sh" script?
It is a requirement for running " somaticseq/utilities/dockered_pipelines/alignments/fastq2bam_pipeline.sh".
Thanks!

Normalized multiallelic indels in CombineVariants_MVJSD.indel.vcf excluded from Ensemble.tsv

Hello,

While evaluating somaticseq for majority voting for Indels, I ran into an issue. Here's an example of how an Indel called in at least one caller is missed due to variant normalization with GATK CombineVariants.

The calls from two callers at the position are:

strelka.indel.vcf:1	76786670	.	TTA	T	.	QSI_ref
strelka.indel.vcf:1	76786670	.	T	TTA	.	PASS
mutect.indel.vcf:1	76786670	.	TTA	T	. alt_allele_in_normal;germline_risk;str_contraction

The resulting entry in the merged VCF produced by GATK CombineVariants is:
CombineVariants_MVJSD.indel.vcf:1 76786670 . TTA T,TTATA . PASS

CombineVariants has normalized the variants here to create a multi-allelic record in the merged VCF where one is an insertion and one is a deletion.

As I understand, in the next step to get Ensemble.sIndel.tsv from this merged VCF, each allele in multiallelic ALT is treated separately and has its own row in the TSV file if the number of tools with a PASS call for that variant is > mincaller (hardcoded as 0.5 in SomaticSeq.Wrapper.sh), the record is printed to Ensemble.sIndel.tsv and otherwise it's excluded completely.

In this case, the two variants it looks for after splitting multiallelic ALT are:

1:76786670:TTA>T 
1:76786670:TTA>TTATA

Now this is where the problem occurs. The first variant is not PASS in any of the callers so it's not printed. The second is a PASS in Strelka but it is represented as 1:76786670:T>TTA in Strelka VCF. somaticseq looks for the normalized indel in the original VCF and does not find the record so the variant is not printed from this point on.

Do you have any suggestions on how to work around this?

Thanks,
Prachi

Question: true set?

Hi I was looking in your somatiseq. I have a quite naive question what do you suggest to use as a true snp and Indel vcf for model training. Let's say I have some new patient their mutation status is unknown. How can I generate the true set in a practical way?
Thanks a lot for your suggestions!

update somaticseq on anaconda

Hi,

Is it possible to update somaticseq on anaconda? current version there is 2.8.1

https://anaconda.org/bioconda/somaticseq

Thanks!

bioinform / somaticseq Goto Github PK

somaticseq's People

Contributors

Stargazers

Watchers

Forkers

somaticseq's Issues

Recommend Projects

Recommend Topics

Recommend Org