Code Monkey home page Code Monkey logo

Comments (23)

litaifang avatar litaifang commented on June 12, 2024

SomaticSeq generates separate models for SNVs and indels, so you shouldn't combine them.
If you enable the -train flag in somaticseq_parallel.py (before the paired option), it'll create the model automatically. If not, you can just run commands: ada_model_builder_ntChange.R Ensemble.sSNV.tsv and/or xgboost_model_builder_ntChange.R Ensemble.sSNV.tsv to build SNV models based on AdaBoost and/or XGBoost, respectively. Same things for indels.
SNV and indel models are separate because true positive/false positive distributions for many features are different for SNVs and indels.

from somaticseq.

gianfilippo avatar gianfilippo commented on June 12, 2024

thanks. Sorry about the confusion, but I am referring to how to combine the Ensemble.sSNV.tsv (or Ensemble.sINDEL.tsv) from each thread into one

from somaticseq.

litaifang avatar litaifang commented on June 12, 2024

Ah, in this case, these files should be combined automatically in the output directory as Ensemble.sSNV.tsv and Ensemble.sINDEL.tsv.
But if for whatever reason they aren't there, or if you want to combine such files from multiple samples (which we do all the time):
cat */Ensemble.sSNV.tsv | awk 'NR==1 || $0 !~ /^CHROM/' > combined.ensemble.sSNV.tsv.

from somaticseq.

gianfilippo avatar gianfilippo commented on June 12, 2024

I get the Ensemble.sSNV.tsv and Ensemble.sINDEL.tsv in each of the 10 subDIR, so I have 10 Ensemble.sSNV.tsv and 10 Ensemble.sINDEL.tsv. Is theref a specific argument I should pass to get only 1 Ensemble.sSNV.tsv and 1 Ensemble.sINDEL.tsv in a multi-thread run ?
Otherwise, I will just use the cat/awk command you just gave me. Thanks!

from somaticseq.

litaifang avatar litaifang commented on June 12, 2024

If somaticseq_parallel.py runs to completion, each of those 10 sub-directories should be deleted, and only the combined files remain. If not, what are the last lines in the log file?
By the way, is bedtools installed and in your PATH?

from somaticseq.

gianfilippo avatar gianfilippo commented on June 12, 2024

I just combined the thread specific files into the overall Ensemble.sSNV.tsv and Ensemble.sINDEL.tsv for my sample.
I have to admit that I am a bit confused now and I realize this is because of the Python library vs Docker/Singularity version of it.
In the manual, at this point I should use the makeSomaticScripts.py, but I cannot see how to input the Ensemble.sSNV.tsv and Ensemble.sINDEL.tsv for training. This script runs directly the individual callers and the rest of the pipeline.
If I use the somaticseq_parallel.py (or run_somaticseq.py) again, I cannot see how to input the Ensemble.sSNV.tsv and Ensemble.sINDEL.tsv for training. This script requires teh path of each individual caller VCF, not the TSV ensemble files.
What am I missing ?

I feel the paragraph "5.3.2 SomaticSeq Training" is a bit confusing for me

from somaticseq.

gianfilippo avatar gianfilippo commented on June 12, 2024

I am running the submit_callers_multiThreads.sh script on 10 threads, with the argument "--somaticseq-action echo". After the callers are done, I run the generated somaticseq .CMD in each of the 10 subDIRs. This .CMD runs somaticseq_parallel.py (of course you know all of this, aready).
The last few lines of the log file are below

2020-07-15 14:38:06,883 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
/usr/local/lib/python3.6/dist-packages/scipy/stats/stats.py:5700: RuntimeWarning: divide by zero encountered in double_scalars
z = (bigu - meanrank) / sd
2020-07-15 14:38:56,194 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
Done at 2020/07/15 14:39:07

from somaticseq.

litaifang avatar litaifang commented on June 12, 2024

Ah, if you've run submit_callers_multiThreads.sh, then the Ensemble.*.tsv files are not automatically combined. There is a script to combine everything in your outdir/logs directory.

from somaticseq.

litaifang avatar litaifang commented on June 12, 2024

For SomaticSeq, the only core command is somaticseq_parallel.py. This is the command you may run after you have run individual somatic callers (e.g., MuTect2, etc.) using whatever pipelines/workflows you've already installed. The required inputs are the BAM files, the VCF files from (at least one of) those callers you have run. The output files are Ensemble..tsv, Consensus..vcf. If you have supplied a classifier, then SSeq.Classified.*.vcf will be out as well, using the classifiers (.RData files) to score each variant in the .tsv files. If you invoke training, you require the "truth" VCF files, and the models (.RData) will be in the output as well. That's all there is for SomaticSeq algorithm.

Now, to make things easier to get people started, we have created some add-ons, like makeSomaticScripts.py which creates command line scripts for the individual callers that we have incorporated into SomaticSeq, by calling docker images of those tools, etc. Previously, submit_callers_multiThreads.sh served that purpose. Those are just add on scripts. They may help, especially for those new to NGS, but they are not the core somaticseq algorithms. The ones for singularities were not kept up to date because we haven't used singularities internally.

from somaticseq.

gianfilippo avatar gianfilippo commented on June 12, 2024

All I get is the following:
under $outDIR/logs : the somaticsniper .cmd script
under $outDIR/1/logs: the .cmd script for the other callers
...
under $outDIR/N/logs: the .cmd script for the other callers
under $outDIR/1/SomaticSeq/logs: the somaticseq .cmd script
...
under $outDIR/N/SomaticSeq/logs: the somaticseq .cmd script

I run the somaticseq .cmd scripts after the callers jobs complete. The thread specific somaticseq .cmd script generates Consensus and Ensemble files in each thread subDIR. I do not get any script to combine the thread specific VCFs or TSVs.
Am I not passing some argument ?

In the manual, it is clear that somaticseq_parallel.py is the core. But i needed the callers and bamsurgeon scripts.
I cannot run Docker, so I use the Singularity based scripts.
Nevertheless, I assume, at this point, that I can run the caller scripts on multiple threads and then, for each caller, combine the VCFs from the multiple subDIR/threads into one. Then I can use somaticseq_parallel.py and the various VCFs as input. This should be ok. I got confused with the paragraph "5.3.2 SomaticSeq Training". I will let you know. Thanks

from somaticseq.

litaifang avatar litaifang commented on June 12, 2024

Right. In that case, just combine those files manually. Make sure to run somaticSniper as well because somaticseq.cmd will be expecting it.

from somaticseq.

gianfilippo avatar gianfilippo commented on June 12, 2024

from somaticseq.

gianfilippo avatar gianfilippo commented on June 12, 2024

Hi, the somaticseq_parallel.py in training mode generates an error:
Strelka.somatic.indels.vcf does not seem to be properly sorted. No errors associated to the VCFs from the other callers.
I make all the input VCFs using vcf-concat on the VCFs from the individual threads.
If I remove the Strelka.somatic.indels.vcf it seems to work and on its way to produce the RData files.
Should I concatenate the VCFs in some other way ? or sort in some way ?

from somaticseq.

litaifang avatar litaifang commented on June 12, 2024

Try sorting the vcf file: vcfsorter.pl genome_reference.dict Strelka.somatic.indels.vcf > Strelka.somatic.indels.sorted.vcf.

from somaticseq.

gianfilippo avatar gianfilippo commented on June 12, 2024

It seems to be working.I will run the vcfsorter.pl on every VCF before training to avoid issues. Thanks

from somaticseq.

litaifang avatar litaifang commented on June 12, 2024

Seems like the vcf files from each numbered sub-directories were not merged in numeric order. It probably should've been vcf-concat 1/caller.vcf 2/caller.vcf 3/caller.vcf .... 10/caller.vcf > merged.caller.vcf.

from somaticseq.

gianfilippo avatar gianfilippo commented on June 12, 2024

Hi, it completed and generated the following stats for the model
Final Confusion Matrix for Data:
Final Prediction
True value 0 1
0 2316 0
1 0 326
Train Error: 0
Out-Of-Bag Error: 0 iteration= 126
Additional Estimates of number of iterations:
train.err1 train.kap1
75 75

You are right about the last comment. i used a $(ls */caller.vcf) to get the file list and 10 is probably before 1. Anyway, sorting worked. Just strange that it did not generate an error for the other callers. I will use the sorting on all then VCFs

from somaticseq.

litaifang avatar litaifang commented on June 12, 2024

Cool.
I guess you can try cross validation on Ensemble.sSNV.tsv and/or Ensemble.sINDEL.tsv, i.e., /PATH/TO/somaticseq/r_scripts/ada_cross_validation.R Ensemble.sSNV.tsv. It'll just randomly split the data 50-50 into training and testing, and do it 10 times.

from somaticseq.

gianfilippo avatar gianfilippo commented on June 12, 2024

Hi, I restarted the training after sorting all the VCFs, just in case. i got the following;
Loss: exponential Method: discrete Iteration: 500
Final Confusion Matrix for Data:
Final Prediction
True value 0 1
0 2482 0
1 0 345
Train Error: 0
Out-Of-Bag Error: 0 iteration= 131
Additional Estimates of number of iterations:
train.err1 train.kap1
47 47

And the last pass of cross-validation
10 th_try
PASS_Calls = 1073
REJECT_Calls = 4490
PASS_TruePositives = 1048
PASS_FalsePositives = 25
Sensitivity = 0.942446
Precision = 0.9767008
F1 = 0.9592677

I think everything looks good now. I will run the prediction. I do not expect any problem, but i will keep this open, just in case. Otherwise I will close the issue.
As usual, may thanks for your help and patience! Be safe!

from somaticseq.

gianfilippo avatar gianfilippo commented on June 12, 2024

Hi, I started the prediction mode yesterday and it is still running. I am also having now an issue with training, as I get the following error message:
Exception: $PATH/synthetic_snvs.vcf does not seem to be properly sorted
I use the automatically generate merge script to generate the final synthetic_snvs.vcf.

from somaticseq.

gianfilippo avatar gianfilippo commented on June 12, 2024

I had to sort both the synthetic_snvs.vcf and the synthetic_indels.leftAlign.sorted.vcf. I submitted a multi-thread (20) job in prediciton mode and it is stuck (see below). The job has generated, so far, 20 core dumps. Any suggestions ?
Loading required package: ada
Loading required package: rpart
Loading required package: ada
Loading required package: rpart

from somaticseq.

gianfilippo avatar gianfilippo commented on June 12, 2024

I did another run on a single thread and got again a core dump at the same point. Unfortunately no error message.
I did a gdb of the core file, and a backtrace, but it is not giving me the name of any functon.
Interrupting the multi-thread job at the same point, when it is stuck after dumping the core files, shows the following
File "somaticseq_parallel.py", line 116, in
subdirs = pool.map(runPaired_by_region_i, bed_splitted)

from somaticseq.

gianfilippo avatar gianfilippo commented on June 12, 2024

I did a quick run with the example you provide, first in training, than i modified it for prediction and it does not crash.
It is a memory allocation problem. The .RData model files are quite large and one needs to allocate enough memory.
It works now. Thanks

from somaticseq.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.