
Comments (8)

litaifang commented on June 12, 2024
  1. For somaticseq_parallel.py to generate classifiers, you need training data with ground truth, and you need to have run the callers on it. The section about creating synthetic BAM files is about creating that training data.
  2. You don't use two identical BAM files. You use two sequencing replicates of the same "normal" sample, so the replicate-to-replicate variability gives you false positive calls that mimic the false positives in real tumor-normal sequencing data sets.
  3. Coverage has some effect. Generally speaking, in our experience, a difference of a couple of fold in coverage is OK, but too much difference may be problematic.
  4. somaticseq_parallel.py is just a command. You can run it in any workflow management system you like. The scripts that create synthetic BAMs just produce a bunch of bash scripts, which mainly serve as templates. They can be submitted as SGE scripts, but they are not core SomaticSeq algorithms. If SGE and SLURM scripts need to differ, you'll need to make those modifications yourself.
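To make the coverage point in item 3 concrete, here is a minimal sketch of a fold-difference check between two mean depths. The function names and the cut-off of 2.0 are assumptions chosen for illustration; the thread only says "a couple of fold" and gives no exact threshold.

```python
# Hypothetical cut-off: the thread says only "a couple of fold", so 2.0 is an assumption.
MAX_FOLD = 2.0


def coverage_fold_difference(depth_a: float, depth_b: float) -> float:
    """Return how many times deeper the higher-coverage sample is."""
    if depth_a <= 0 or depth_b <= 0:
        raise ValueError("mean depths must be positive")
    return max(depth_a, depth_b) / min(depth_a, depth_b)


def coverage_is_comparable(depth_a: float, depth_b: float,
                           max_fold: float = MAX_FOLD) -> bool:
    """True when the two mean depths are within max_fold of each other."""
    return coverage_fold_difference(depth_a, depth_b) <= max_fold
```

The mean depths themselves would come from whatever coverage tool your workflow already uses.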

from somaticseq.

vymao commented on June 12, 2024

Thank you. A few follow-up questions:

  1. With regard to creating the synthetic BAM files, I noticed that the way you create the classifiers is through makeSomaticScripts.py paired ... /. This is different from somaticseq_parallel.py; what is the difference?
  2. Do you provide any pretrained models?
  3. There are more somatic callers in the documentation than listed as used in the paper. For example, I noticed that LoFreq, Scalpel, and Strelka2 are not mentioned in the paper. Should we be using these, and how does the performance compare if we do not? Which callers should we definitely be using?
  4. According to methods 6.2.2 and 6.2.3, it seems like one does not necessarily need two sequencing replicates. Is this correct? If not, could you point me to a sample that has two sequencing replicates? I'm having trouble finding some.


litaifang commented on June 12, 2024
  1. makeSomaticScripts.py is not the core SomaticSeq algorithm. It's a tool that makes scripts that call the individual somatic mutation callers we have incorporated in Docker. It's just there for people's convenience if needed. somaticseq_parallel.py is the core SomaticSeq algorithm.
  2. We don't have pre-trained models at this moment. A model needs to be trained with all the callers run with exactly the same parameters as they are used for prediction. In the future we may provide training data (manuscript still under review).
  3. We have updated and improved our algorithm since the paper. Strelka2 is good to run for the most part. Scalpel should only be run on WES or targeted sequencing because it is very slow. I'd also run VarDict, MuSE, and MuTect2.
  4. Here is an example of a heavily sequenced sample set at different institutions and platforms: https://sites.google.com/view/seqc2/home/sequencing. You'll need to be able to download SRA data sets. The pre-print describing the data: https://doi.org/10.1101/625624
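The requirement in item 2, that callers run with identical settings at training and prediction time, can be checked mechanically. The sketch below is not part of SomaticSeq. It assumes you record each caller's version and parameters yourself in a plain dictionary, and the function name and dictionary layout are hypothetical.

```python
def settings_mismatches(training: dict, prediction: dict) -> dict:
    """Return {caller: (training_settings, prediction_settings)} for every
    caller whose recorded settings differ between the two runs.

    A caller that is missing from one side is reported with None on that side.
    """
    mismatches = {}
    for caller in sorted(set(training) | set(prediction)):
        trained_with = training.get(caller)
        predicting_with = prediction.get(caller)
        if trained_with != predicting_with:
            mismatches[caller] = (trained_with, predicting_with)
    return mismatches


# Placeholder values only; they are not real version numbers or flags.
training_run = {
    "MuTect2": {"version": "version-A", "params": "default"},
    "Strelka2": {"version": "version-B", "params": "default"},
}
prediction_run = {
    "MuTect2": {"version": "version-A", "params": "default"},
    "Strelka2": {"version": "version-C", "params": "default"},
}
problems = settings_mismatches(training_run, prediction_run)
```

An empty result means the recorded settings match. Anything else points at a caller that should be rerun before an existing model is reused.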


vymao commented on June 12, 2024

Ok, so:

  1. Does makeSomaticScripts.py require Docker to run? Instead of using Docker, could we just run each of the callers on the synthetic BAMs ourselves, and then use somaticseq_parallel.py to train the classifier?
  2. According to methods 6.2.2 and 6.2.3, it seems like one does not necessarily need two sequencing replicates. Is this correct? For these methods it seems one could use a single normal BAM or a tumor/normal pair.


litaifang commented on June 12, 2024
  1. Yes. It is best to run those callers in your own workflows, to keep them updated and to apply the best parameters for your own data sets.
  2. It's best to have sequencing replicates. If you don't, there are other ways to create them. 6.2.2 randomly splits one higher-depth sequencing data set into two replicates. 6.2.3 combines the tumor and normal into a single BAM file, and then randomly splits it into two replicates. We haven't done extensive benchmarking to determine which way is better. Both work pretty well, most of the time.
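The random split described in item 2 can be illustrated on read names alone. This is a sketch of the general idea and not the implementation used by the SomaticSeq scripts. It assumes reads are assigned by hashing the read name, so that both mates of a pair (which share a name) land in the same replicate. The function names and the `seed` parameter are hypothetical.

```python
import hashlib


def assign_replicate(read_name: str, seed: str = "0") -> int:
    """Deterministically map a read name to replicate 0 or 1.

    Hashing the name (rather than drawing a random number per record)
    keeps paired-end mates together, because mates share a read name.
    """
    digest = hashlib.sha256(f"{seed}:{read_name}".encode()).digest()
    return digest[0] % 2


def split_reads(read_names, seed: str = "0"):
    """Split an iterable of read names into two lists, one per replicate."""
    replicates = ([], [])
    for name in read_names:
        replicates[assign_replicate(name, seed)].append(name)
    return replicates


# For 6.2.3 the input would be the read names of the merged tumor and normal data.
names = [f"read{i}" for i in range(1000)]
rep_a, rep_b = split_reads(names)
```

In a real workflow the same assignment would be applied while streaming alignment records from the BAM file with the BAM library of your choice. Changing `seed` gives a different but reproducible split.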


vymao commented on June 12, 2024

Unless I misunderstood:

  1. Couldn't I just run somaticseq_parallel.py on a single WGS tumor/normal sample listed at https://sites.google.com/view/seqc2/home/sequencing? If the truth set is already online, I wouldn't need to create synthetic BAMs, correct?
  2. For running methods 6.2.2 and 6.2.3, is Docker required? Is it not possible to run this without Docker?
  3. Also, if we are running in tumor-only mode, we do not need a training set, correct? We can just run somaticseq_parallel.py without inputting a matched normal, and it should default to consensus mode, right?
  4. On this resource (https://sites.google.com/view/seqc2/home/sequencing), what is the "FFPE on Illumina"? I did not see this referenced in the manuscript.


vymao commented on June 12, 2024

Just following up. Would you know if I am correct?


litaifang commented on June 12, 2024
  1. somaticseq_parallel.py requires input from at least one somatic mutation caller. We do have those VCF files (ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/seqc/Somatic_Mutation_WG/analysis/cancer_reference_samples_supporting_files/somaticMutationCalling/), but it's better to run the tool(s) yourself, because it's hard to make sure the versions and settings are identical when you download a VCF. Your own workflow may have different settings, giving different results, so a model trained on one may not be the best for another.
  2. Docker is never required. I've included some workflow scripts that use Docker. Those are simply for convenience. They are 3rd party workflows.
  3. In tumor-only mode, the options are different. You can still do training, but training in single-sample mode is not as well validated, so I don't know how good it can be. I don't know if it is really possible to distinguish somatic and germline variants from tumor samples only, so machine learning may overfit. It's an option though, if you have a large, well-labeled data set that you think machine learning can build a model from for future use.
  4. FFPE results (among others) are described in another preprint: https://doi.org/10.1101/626440

