Comments (23)
SomaticSeq generates separate models for SNVs and indels, so you shouldn't combine them.
If you enable the -train
flag in somaticseq_parallel.py
(before the paired
option), it'll create the model automatically. If not, you can just run commands: ada_model_builder_ntChange.R Ensemble.sSNV.tsv
and/or xgboost_model_builder_ntChange.R Ensemble.sSNV.tsv
to build SNV models based on AdaBoost and/or XGBoost, respectively. Same things for indels.
SNV and indel models are separate because true positive/false positive distributions for many features are different for SNVs and indels.
from somaticseq.
thanks. Sorry about the confusion, but I am referring to how to combine the Ensemble.sSNV.tsv (or Ensemble.sINDEL.tsv) from each thread into one
from somaticseq.
Ah, in this case, these files should be combined automatically in the output directory as Ensemble.sSNV.tsv and Ensemble.sINDEL.tsv.
But if for whatever reason they aren't there, or if you want to combine such files from multiple samples (which we do all the time):
cat */Ensemble.sSNV.tsv | awk 'NR==1 || $0 !~ /^CHROM/' > combined.ensemble.sSNV.tsv
.
from somaticseq.
I get the Ensemble.sSNV.tsv and Ensemble.sINDEL.tsv in each of the 10 subDIR, so I have 10 Ensemble.sSNV.tsv and 10 Ensemble.sINDEL.tsv. Is theref a specific argument I should pass to get only 1 Ensemble.sSNV.tsv and 1 Ensemble.sINDEL.tsv in a multi-thread run ?
Otherwise, I will just use the cat/awk command you just gave me. Thanks!
from somaticseq.
If somaticseq_parallel.py
runs to completion, each of those 10 sub-directories should be deleted, and only the combined files remain. If not, what are the last lines in the log file?
By the way, is bedtools
installed and in your PATH?
from somaticseq.
I just combined the thread specific files into the overall Ensemble.sSNV.tsv and Ensemble.sINDEL.tsv for my sample.
I have to admit that I am a bit confused now and I realize this is because of the Python library vs Docker/Singularity version of it.
In the manual, at this point I should use the makeSomaticScripts.py, but I cannot see how to input the Ensemble.sSNV.tsv and Ensemble.sINDEL.tsv for training. This script runs directly the individual callers and the rest of the pipeline.
If I use the somaticseq_parallel.py (or run_somaticseq.py) again, I cannot see how to input the Ensemble.sSNV.tsv and Ensemble.sINDEL.tsv for training. This script requires teh path of each individual caller VCF, not the TSV ensemble files.
What am I missing ?
I feel the paragraph "5.3.2 SomaticSeq Training" is a bit confusing for me
from somaticseq.
I am running the submit_callers_multiThreads.sh script on 10 threads, with the argument "--somaticseq-action echo". After the callers are done, I run the generated somaticseq .CMD in each of the 10 subDIRs. This .CMD runs somaticseq_parallel.py (of course you know all of this, aready).
The last few lines of the log file are below
2020-07-15 14:38:06,883 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
/usr/local/lib/python3.6/dist-packages/scipy/stats/stats.py:5700: RuntimeWarning: divide by zero encountered in double_scalars
z = (bigu - meanrank) / sd
2020-07-15 14:38:56,194 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
Done at 2020/07/15 14:39:07
from somaticseq.
Ah, if you've run submit_callers_multiThreads.sh
, then the Ensemble.*.tsv files are not automatically combined. There is a script to combine everything in your outdir/logs directory.
from somaticseq.
For SomaticSeq, the only core command is somaticseq_parallel.py
. This is the command you may run after you have run individual somatic callers (e.g., MuTect2, etc.) using whatever pipelines/workflows you've already installed. The required inputs are the BAM files, the VCF files from (at least one of) those callers you have run. The output files are Ensemble..tsv, Consensus..vcf. If you have supplied a classifier, then SSeq.Classified.*.vcf will be out as well, using the classifiers (.RData files) to score each variant in the .tsv files. If you invoke training, you require the "truth" VCF files, and the models (.RData) will be in the output as well. That's all there is for SomaticSeq algorithm.
Now, to make things easier to get people started, we have created some add-ons, like makeSomaticScripts.py
which creates command line scripts for the individual callers that we have incorporated into SomaticSeq, by calling docker images of those tools, etc. Previously, submit_callers_multiThreads.sh
served that purpose. Those are just add on scripts. They may help, especially for those new to NGS, but they are not the core somaticseq algorithms. The ones for singularities were not kept up to date because we haven't used singularities internally.
from somaticseq.
All I get is the following:
under $outDIR/logs : the somaticsniper .cmd script
under $outDIR/1/logs: the .cmd script for the other callers
...
under $outDIR/N/logs: the .cmd script for the other callers
under $outDIR/1/SomaticSeq/logs: the somaticseq .cmd script
...
under $outDIR/N/SomaticSeq/logs: the somaticseq .cmd script
I run the somaticseq .cmd scripts after the callers jobs complete. The thread specific somaticseq .cmd script generates Consensus and Ensemble files in each thread subDIR. I do not get any script to combine the thread specific VCFs or TSVs.
Am I not passing some argument ?
In the manual, it is clear that somaticseq_parallel.py is the core. But i needed the callers and bamsurgeon scripts.
I cannot run Docker, so I use the Singularity based scripts.
Nevertheless, I assume, at this point, that I can run the caller scripts on multiple threads and then, for each caller, combine the VCFs from the multiple subDIR/threads into one. Then I can use somaticseq_parallel.py and the various VCFs as input. This should be ok. I got confused with the paragraph "5.3.2 SomaticSeq Training". I will let you know. Thanks
from somaticseq.
Right. In that case, just combine those files manually. Make sure to run somaticSniper as well because somaticseq.cmd
will be expecting it.
from somaticseq.
from somaticseq.
Hi, the somaticseq_parallel.py in training mode generates an error:
Strelka.somatic.indels.vcf does not seem to be properly sorted. No errors associated to the VCFs from the other callers.
I make all the input VCFs using vcf-concat on the VCFs from the individual threads.
If I remove the Strelka.somatic.indels.vcf it seems to work and on its way to produce the RData files.
Should I concatenate the VCFs in some other way ? or sort in some way ?
from somaticseq.
Try sorting the vcf file: vcfsorter.pl genome_reference.dict Strelka.somatic.indels.vcf > Strelka.somatic.indels.sorted.vcf
.
from somaticseq.
It seems to be working.I will run the vcfsorter.pl on every VCF before training to avoid issues. Thanks
from somaticseq.
Seems like the vcf files from each numbered sub-directories were not merged in numeric order. It probably should've been vcf-concat 1/caller.vcf 2/caller.vcf 3/caller.vcf .... 10/caller.vcf > merged.caller.vcf
.
from somaticseq.
Hi, it completed and generated the following stats for the model
Final Confusion Matrix for Data:
Final Prediction
True value 0 1
0 2316 0
1 0 326
Train Error: 0
Out-Of-Bag Error: 0 iteration= 126
Additional Estimates of number of iterations:
train.err1 train.kap1
75 75
You are right about the last comment. i used a $(ls */caller.vcf) to get the file list and 10 is probably before 1. Anyway, sorting worked. Just strange that it did not generate an error for the other callers. I will use the sorting on all then VCFs
from somaticseq.
Cool.
I guess you can try cross validation on Ensemble.sSNV.tsv and/or Ensemble.sINDEL.tsv, i.e., /PATH/TO/somaticseq/r_scripts/ada_cross_validation.R Ensemble.sSNV.tsv
. It'll just randomly split the data 50-50 into training and testing, and do it 10 times.
from somaticseq.
Hi, I restarted the training after sorting all the VCFs, just in case. i got the following;
Loss: exponential Method: discrete Iteration: 500
Final Confusion Matrix for Data:
Final Prediction
True value 0 1
0 2482 0
1 0 345
Train Error: 0
Out-Of-Bag Error: 0 iteration= 131
Additional Estimates of number of iterations:
train.err1 train.kap1
47 47
And the last pass of cross-validation
10 th_try
PASS_Calls = 1073
REJECT_Calls = 4490
PASS_TruePositives = 1048
PASS_FalsePositives = 25
Sensitivity = 0.942446
Precision = 0.9767008
F1 = 0.9592677
I think everything looks good now. I will run the prediction. I do not expect any problem, but i will keep this open, just in case. Otherwise I will close the issue.
As usual, may thanks for your help and patience! Be safe!
from somaticseq.
Hi, I started the prediction mode yesterday and it is still running. I am also having now an issue with training, as I get the following error message:
Exception: $PATH/synthetic_snvs.vcf does not seem to be properly sorted
I use the automatically generate merge script to generate the final synthetic_snvs.vcf.
from somaticseq.
I had to sort both the synthetic_snvs.vcf and the synthetic_indels.leftAlign.sorted.vcf. I submitted a multi-thread (20) job in prediciton mode and it is stuck (see below). The job has generated, so far, 20 core dumps. Any suggestions ?
Loading required package: ada
Loading required package: rpart
Loading required package: ada
Loading required package: rpart
from somaticseq.
I did another run on a single thread and got again a core dump at the same point. Unfortunately no error message.
I did a gdb of the core file, and a backtrace, but it is not giving me the name of any functon.
Interrupting the multi-thread job at the same point, when it is stuck after dumping the core files, shows the following
File "somaticseq_parallel.py", line 116, in
subdirs = pool.map(runPaired_by_region_i, bed_splitted)
from somaticseq.
I did a quick run with the example you provide, first in training, than i modified it for prediction and it does not crash.
It is a memory allocation problem. The .RData model files are quite large and one needs to allocate enough memory.
It works now. Thanks
from somaticseq.
Related Issues (20)
- Special setting for b37? HOT 14
- Question about simulating somatic mutations HOT 7
- Pretrained Classifier HOT 3
- Docker issue with latest version HOT 1
- SEQC2: Some high confidence SNVs and INDELs in VCF are outside of regions defined by High-Confidence_Regions_v1.2.bed HOT 2
- Somaticseq makeSomaticScripts.py running and output issues HOT 8
- Slow RNA variant calling HOT 8
- Question for the paper on establishing the reference call set HOT 3
- Where are the 10x Genomics single-cell copy number variation (CNV) analysis results? HOT 7
- Ground Truths required for training HOT 1
- somaticseq failing for same command it had previously successfully run HOT 11
- Applying internal filters to outputs before running SomaticSeq HOT 1
- Dockerized alignment workflow does not work with multiple input files HOT 5
- Error when running makeSomaticScripts with multiple threads HOT 3
- Output allele of the normal sample HOT 2
- UnboundLocalError: cannot access local variable 'normal_name' where it is not associated with a value HOT 10
- how to obtain all variants where the "FILTER" column is not labeled as "PASS" HOT 1
- Are multi-nucleotide and complex variants ignored? HOT 2
- Error when running FFPE training data from SEQ2C HOT 5
- AI consensus calling error on WGS samples HOT 9
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from somaticseq.