I am confused with regards to how one should train SomaticSeq, and I had a few questio

Some clarifications? about somaticseq HOT 8 CLOSED

vymao commented on June 12, 2024

Some clarifications?

from somaticseq.

Comments (8)

litaifang commented on June 12, 2024

For somaticseq_parallel.py to generate classifiers, you need training data with ground truth, along with the output of the callers run on that data. The section about creating synthetic BAM files is about creating that training data.
You don't use two identical BAM files. You use two sequencing replicates of the same "normal" sample, so that replicate-to-replicate variability gives you false positive calls, which mimic the false positives in real tumor-normal sequencing data sets.
Coverage has some effect. In our experience, a coverage difference of a couple of fold is OK, but too large a difference may be problematic.
somaticseq_parallel.py is just a command. You can run it in any workflow management system you like. The scripts that create synthetic BAMs just produce a bunch of bash scripts. They mainly serve as templates. They can be submitted as SGE scripts, but they are not core SomaticSeq algorithms. If the SGE and SLURM scripts need to differ, you'll need to make those modifications yourself.
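
As a rough illustration of the coverage point above, the check below compares the mean depths of two replicates. It is only a sketch: the function names and the `MAX_FOLD` cutoff are hypothetical, since the comment says "a couple of fold" without giving an exact limit, and the mean depths are assumed to have been measured separately with a tool of your choice.

```python
# Hypothetical cutoff: the thread only says "a couple of fold" is OK,
# so this number is an assumption, not a SomaticSeq recommendation.
MAX_FOLD = 3.0


def coverage_fold_difference(depth_a: float, depth_b: float) -> float:
    """Return the fold difference (always >= 1.0) between two mean depths."""
    if depth_a <= 0 or depth_b <= 0:
        raise ValueError("mean depths must be positive")
    return max(depth_a, depth_b) / min(depth_a, depth_b)


def coverage_is_comparable(depth_a: float, depth_b: float,
                           max_fold: float = MAX_FOLD) -> bool:
    """True when the two depths differ by no more than max_fold."""
    return coverage_fold_difference(depth_a, depth_b) <= max_fold
```

For example, `coverage_is_comparable(30, 60)` passes under this assumed cutoff, while `coverage_is_comparable(30, 300)` does not.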

vymao commented on June 12, 2024

Thank you. A few follow up questions:

With regards to creating the synthetic BAM files, I noticed that the way you create the classifiers is through makeSomaticScripts.py paired ... /. This is different from somaticseq_parallel.py; what is the difference?
Do you provide any pretrained models?
There are more somatic callers in the documentation than listed as used in the paper. For example, I noticed that LoFreq, Scalpel, and Strelka2 are not mentioned in the paper. Should we be using these, and how does the performance compare if we do not? Which callers should we definitely be using?
According to methods 6.2.2 and 6.2.3, it seems like one does not necessarily need two sequencing replicates. Is this correct? If not, could you point me to a sample that has two sequencing replicates? I'm having trouble finding some.

litaifang commented on June 12, 2024

makeSomaticScripts.py is not the core SomaticSeq algorithm. It's a tool to make scripts that call the individual somatic mutation callers that we have incorporated in Docker. It's just there for people's convenience if needed. somaticseq_parallel.py is the core SomaticSeq algorithm.
We don't have pre-trained models at this moment. A model needs to be trained with all the callers run with exactly the same parameters as they will be run for prediction. In the future we may provide training data (manuscript still under review).
We have updated and improved our algorithm since the paper. Strelka2 is good to run for the most part. Scalpel should only be run in WES or targeted sequencing: it is very slow. I'd also run VarDict, MuSE, and MuTect2.
Here is an example of a heavily sequenced sample set at different institutions and platforms: https://sites.google.com/view/seqc2/home/sequencing. You'll need to be able to download SRA data sets. The pre-print describing the data: https://doi.org/10.1101/625624
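
One way to act on the "same parameters for training and prediction" requirement is to record each caller's version and arguments in both runs and compare them. The sketch below assumes you keep such a record yourself as a plain dictionary; the layout, the function name, and the placeholder values are all hypothetical, and none of this is part of SomaticSeq.

```python
def config_mismatches(training: dict, prediction: dict) -> dict:
    """Return {caller: (training_cfg, prediction_cfg)} for every caller whose
    recorded configuration differs, or which is present on one side only."""
    mismatches = {}
    for caller in sorted(set(training) | set(prediction)):
        train_cfg = training.get(caller)
        predict_cfg = prediction.get(caller)
        if train_cfg != predict_cfg:
            mismatches[caller] = (train_cfg, predict_cfg)
    return mismatches


# Placeholder records: the version strings and arguments are made up
# for illustration and do not correspond to real caller releases.
training_run = {
    "MuTect2": {"version": "vA", "args": ["--example-flag"]},
    "Strelka2": {"version": "vB", "args": []},
}
prediction_run = {
    "MuTect2": {"version": "vA", "args": ["--example-flag"]},
    "Strelka2": {"version": "vC", "args": []},
}

for caller, (train_cfg, predict_cfg) in config_mismatches(
        training_run, prediction_run).items():
    print(f"{caller}: trained with {train_cfg}, predicting with {predict_cfg}")
```

With the placeholder records above, only Strelka2 is reported, because its recorded version differs between the two runs.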

vymao commented on June 12, 2024

Ok, so:

Does makeSomaticScripts.py require Docker to run? Instead of using Docker, could we just run each of the callers on the synthetic BAMs directly, and then use somaticseq_parallel.py to train the classifier?
According to methods 6.2.2 and 6.2.3, it seems like one does not necessarily need two sequencing replicates. Is this correct? For these methods it seems one could use a single normal BAM or a tumor/normal pair.

litaifang commented on June 12, 2024

Yes. It's best to run those callers with your own workflows, to keep them updated and to apply the best parameters for your own data sets.
It's best to have sequencing replicates. If you don't, there are other ways to create them. 6.2.2 randomly splits one higher-depth sequencing data set into two replicates. 6.2.3 combines the tumor and normal into a single BAM file, and then randomly splits it into two replicates. We haven't done extensive benchmarking to determine which way is better. Both work pretty well, most of the time.
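
The random split described above can be sketched in a few lines. This is not SomaticSeq's own splitting script, only an illustration of the idea under some assumptions: reads are represented by their names alone, the function names are hypothetical, and reading or writing actual BAM files is left out. Hashing the read name (rather than drawing a random number per record) keeps both mates of a pair in the same replicate, since mates share a name.

```python
import hashlib


def assign_replicate(read_name: str, seed: str = "0") -> int:
    """Return 0 or 1 for a read name.

    The choice depends only on the seed and the read name, so both mates
    of a pair (which share a name) always go to the same replicate.
    """
    digest = hashlib.md5(f"{seed}:{read_name}".encode()).digest()
    return digest[0] & 1


def split_reads(read_names, seed: str = "0"):
    """Partition read names into two pseudo-replicates of roughly equal size."""
    replicates = ([], [])
    for name in read_names:
        replicates[assign_replicate(name, seed)].append(name)
    return replicates


# Made-up read names, purely for illustration.
names = [f"read{i}" for i in range(1000)]
rep_a, rep_b = split_reads(names)
print(len(rep_a), len(rep_b))
```

For the 6.2.3 style of split, the same function would be applied to the read names of the merged tumor-plus-normal file. Changing `seed` gives a different, but still reproducible, partition.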

vymao commented on June 12, 2024

Unless I misunderstood:

Couldn't I just run somaticseq_parallel.py on a single WGS tumor/normal sample listed at https://sites.google.com/view/seqc2/home/sequencing? If the truth set is already online, I wouldn't need to create synthetic BAMs, correct?
For running methods 6.2.2 and 6.2.3, is Docker required? Is it not possible to run this without Docker?
Also, if we are running in tumor-only mode, we do not need a training set, correct? We can just run somaticseq_parallel.py without inputting a matched normal, and it should default to consensus mode, right?
On this resource (https://sites.google.com/view/seqc2/home/sequencing), what is the "FFPE on Illumina"? I did not see this referenced in the manuscript.

vymao commented on June 12, 2024

Just following up. Would you know if I am correct?

litaifang commented on June 12, 2024

somaticseq_parallel.py requires input from at least one somatic mutation caller. We do have those VCF files (ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/seqc/Somatic_Mutation_WG/analysis/cancer_reference_samples_supporting_files/somaticMutationCalling/), but it's better to run the tool(s) yourself, because it's hard to make sure the versions and settings are identical when you download a VCF. Your own workflow may have different settings, giving different results, so a model trained on one may not be the best for another.
Docker is never required. I've included some workflow scripts that use Docker. Those are simply for convenience. They are 3rd party workflows.
In tumor-only mode, the options are different. You can still do training, but training in single-sample mode is not as well validated, so I don't know how good it can be. I don't know if it is really possible to distinguish somatic from germline variants using tumor samples only, so machine learning may overfit. It's an option though, if you have a large, well-labeled data set that you think machine learning could build a model from for future use.
FFPE results (among others) are described in another preprint: https://doi.org/10.1101/626440
