Comments (8)
- For somaticseq_parallel.py to generate classifiers, you need training data with ground truth, on which the callers have been run. The section about creating synthetic BAM files is about creating that training data.
- You don't use two identical BAM files. You use two sequencing replicates of the same "normal" sample, so that the replicate-to-replicate variability gives you false positive calls, which mimic the false positives in a real tumor-normal sequencing data set.
- Coverage has some effect. Generally speaking, in our experience, a difference of a couple of fold in coverage is OK, but too much difference may be problematic.
- somaticseq_parallel.py is just a command. You can run it in any workflow management system you like. The scripts that create synthetic BAMs just produce a bunch of bash scripts. They mainly serve as templates. They can be submitted as SGE scripts, but they are not core SomaticSeq algorithms. If the scripts need to differ between SGE and SLURM, you'll need to make those modifications yourself.
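As a rough illustration of the coverage point above, the fold difference between two samples can be checked before training. This is only a sketch: the function name and the MAX_FOLD cutoff of 3.0 are hypothetical choices for illustration, not values recommended by SomaticSeq.

```python
# Hypothetical cutoff for "a couple of fold"; an assumption, not a SomaticSeq setting.
MAX_FOLD = 3.0


def coverage_fold_difference(depth_a: float, depth_b: float) -> float:
    """Return how many fold the deeper sample exceeds the shallower one."""
    if depth_a <= 0 or depth_b <= 0:
        raise ValueError("mean depths must be positive")
    return max(depth_a, depth_b) / min(depth_a, depth_b)


def coverage_is_comparable(depth_a: float, depth_b: float) -> bool:
    """True when the two mean depths are within the assumed cutoff."""
    return coverage_fold_difference(depth_a, depth_b) <= MAX_FOLD


if __name__ == "__main__":
    # Example mean depths (made-up numbers for illustration).
    print(coverage_fold_difference(30.0, 60.0))
    print(coverage_is_comparable(25.0, 100.0))
```

The mean depths themselves would come from whatever coverage tool your own workflow already uses.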
from somaticseq.
Thank you. A few follow up questions:
- With regards to creating the synthetic BAM files, I noticed that the way you create the classifiers is through makeSomaticScripts.py paired ... /. This is different than somaticseq_parallel.py; what is the difference?
- Do you provide any pretrained models?
- There are more somatic callers in the documentation than listed as used in the paper. For example, I noticed that LoFreq, Scalpel, and Strelka2 are not mentioned in the paper. Should we be using these, and how does the performance compare if we do not? Which callers should we definitely be using?
- According to methods 6.2.2 and 6.2.3, it seems like one does not necessarily need two sequencing replicates. Is this correct? If not, could you point me to a sample that has two sequencing replicates? I'm having trouble finding some.
from somaticseq.
- makeSomaticScripts.py is not the core SomaticSeq algorithm. It's a tool to make scripts that call the individual somatic mutation callers that we have incorporated in Docker. It's just there for people's convenience if needed. somaticseq_parallel.py is the core SomaticSeq algorithm.
- We don't have pre-trained models at this moment. The models need to be trained such that all the callers are run with exactly the same parameters as they are used for prediction. In the future we may provide training data (manuscript still under review).
- We have updated and improved our algorithm since the paper. Strelka2 is good to run for the most part. Scalpel should only be run in WES or targeted sequencing: it is very slow. I'd also run VarDict, MuSE, and MuTect2.
- Here is an example of a heavily sequenced sample set at different institutions and platforms: https://sites.google.com/view/seqc2/home/sequencing. You'll need to be able to download SRA data sets. The pre-print describing the data: https://doi.org/10.1101/625624
from somaticseq.
Ok, so:
- makeSomaticScripts.py requires Docker to run? Instead of using Docker, could we just run each of the callers on the synthetic BAMs ourselves, and then use somaticseq_parallel.py to train the classifier?
- According to methods 6.2.2 and 6.2.3, it seems like one does not necessarily need two sequencing replicates. Is this correct? For these methods it seems one could use a single normal BAM or a tumor/normal pair.
from somaticseq.
- Yes. It's best to run those callers with your own workflows, so you can keep them updated and apply the best parameters for your own data sets.
- It's best to have sequencing replicates. If you don't, there are other ways to create them. 6.2.2 randomly splits one higher-depth sequencing data set into two replicates. 6.2.3 combines the tumor and normal into a single BAM file, and then randomly splits it in two to serve as replicates. We haven't done extensive benchmarking to determine which way is better. Both work pretty well most of the time.
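To make the random-split idea concrete, here is a minimal sketch of assigning reads to two pseudo-replicates. It is not the SomaticSeq implementation: the function names and the seed string are hypothetical, and it works on plain read names rather than on a BAM file. Hashing the read name, instead of drawing a random number per record, is a design choice made here so that both mates of a pair always land in the same replicate.

```python
import hashlib


def assign_replicate(read_name: str, seed: str = "split-1") -> int:
    """Return 0 or 1 for a read name; identical names always get the same answer."""
    digest = hashlib.sha256(f"{seed}:{read_name}".encode()).digest()
    return digest[0] % 2


def split_reads(read_names, seed: str = "split-1"):
    """Partition read names into two lists, one per pseudo-replicate."""
    replicates = ([], [])
    for name in read_names:
        replicates[assign_replicate(name, seed)].append(name)
    return replicates


if __name__ == "__main__":
    # Made-up read names standing in for the records of one deep BAM (as in 6.2.2).
    names = [f"read{i}" for i in range(10000)]
    rep_a, rep_b = split_reads(names)
    print(len(rep_a), len(rep_b))
```

For the 6.2.3 variant, the same function would be applied to the read names of the merged tumor and normal data. In practice the assignment would be applied while iterating over the BAM records with a BAM library of your choice.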
from somaticseq.
Unless I misunderstood:
- Couldn't I just run somaticseq_parallel.py on a single WGS tumor/normal sample listed at https://sites.google.com/view/seqc2/home/sequencing? If the truth set is already online, I wouldn't need to create synthetic BAMs, correct?
- For running methods 6.2.2 and 6.2.3, is Docker required? Is it not possible to run this without Docker?
- Also, if we are running in tumor-only mode, we do not need a training set, correct? We can just run somaticseq_parallel.py without inputting a matched normal, and it should default to consensus mode, right?
- On this resource (https://sites.google.com/view/seqc2/home/sequencing), what is the "FFPE on Illumina"? I did not see this referenced in the manuscript.
from somaticseq.
Just following up. Would you know if I am correct?
from somaticseq.
- somaticseq_parallel.py requires input from at least one somatic mutation caller. We do have those VCF files (ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/seqc/Somatic_Mutation_WG/analysis/cancer_reference_samples_supporting_files/somaticMutationCalling/), but it's better that you run the tool(s) yourself, because it's hard to make sure the versions and settings are identical when you download a VCF. Your own workflow may have different settings, giving different results, so a model trained on one may not be the best for another.
- Docker is never required. I've included some workflow scripts that use Docker. Those are simply for convenience. They are 3rd party workflows.
- In tumor-only mode, the options are different. You can still do training, but training in single-sample mode is not as well validated, so I don't know how good it can be. I don't know if it is really possible to distinguish somatic from germline variants using tumor samples only, so machine learning may overfit. It's an option, though, if you have a large, well-labeled data set from which you think machine learning can build a model for future use.
- FFPE results (among others) are described in another preprint: https://doi.org/10.1101/626440
from somaticseq.