Code Monkey home page Code Monkey logo

Comments (6)

umahsn avatar umahsn commented on August 25, 2024

Hi Tanveer,

Thank you for your interest in NanoCaller.

For HG001 (NA12878), we used aligned CCS reads from here: (https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/PacBio_SequelII_CCS_11kb/HG001_GRCh38/) so we didn't do any alignment ourselves. For HG002-HG004, we obtained PacBio CCS FASTQ files from precisionFDA (now available at https://data.nist.gov/pdr/od/id/mds2-2336). We aligned these reads with minimap2 without any duplicate removal. Generally, for long-reads sequencing, you do not need to do any duplicate removal, regardless of PCR-free or not.

You can consult the supplementary file of precisionFDA challenge preprint to see what different teams did for alignment and any other pre-processing steps. Since we did not sequence these genomes, we do not if anything else was done before the FASTQ files were released publicly, so it would be better to consult with the sequencing team about that.

Just FYI, we used the following minimap2 command for CCS reads, with settings used in several aligned files provided by GIAB (example):
minimap2 -a -x map-pb -k 19 -O 5,56 -E 4,1 -B 5 -z 400,50 -r 2k --eqx --secondary=no

Let me know if there are any further questions.

from nanocaller.

tahashmi avatar tahashmi commented on August 25, 2024

Thank you so much Umair for your quick and detailed reply. It's clear now.

Would you like to mention a short flow of tools in ensemble mode (like mentioned here) for your precisionFDA challenge submission? I want to know if I can use PySaprk dataframes for SAM data to integrate such ensemble mode variant calling workflow. Thanks.

from nanocaller.

umahsn avatar umahsn commented on August 25, 2024

For our precisionFDA submission we had a somewhat elementary ensemble. We ran NanoCaller, Clair and Medaka separately on BAM files. After variant calling we took variant quality scores (QUAL field) of predictions made by each of the three variant callers and scaled them so that were all on a uniform scale. The reason for this is most of Clair's quality score were mostly in 500-1000, Medaka's score were mostly in 0-25, and NanoCaller's quality scores were mostly in 30-400 range.

We performed a weighted vote for each variant call, instead of majority voting. More precisely, we looked at each variant call in the union of three tools, issued the same call in final output with the following quality score:

QUAL=(NanoCaller QUAL score for the variant +Clair QUAL score for the variant +Medaka QUAL score for the variant)/3

If a tool did not make the variant call in union set we gave a quality score of 0 for that variant caller. So if there was a variant call made by NanoCaller with QUAL=240, but Medaka and Clair did not make this variant call, then the ensemble will issue the same variant call with QUAL=(214+0+0)/3=80.

So instead of removing calls with low number of votes, we just gave it a lower score. The reason for this is that there are several variant calls that are uniquely called by only one or two methods, so it would not seem prudent to discard such calls. This allows us maximum recall, but with quality score filtering we can make the callset more stringent to remove low vote supporting calls.

from nanocaller.

tahashmi avatar tahashmi commented on August 25, 2024

Thanks Umair for this detailed answer. Is it possible to contact you on your email for a possible collaboration? Thanks.

from nanocaller.

umahsn avatar umahsn commented on August 25, 2024

Hi Tanveer, yes that would be great. Could you give me your email so I can copy the rest of my team in the email as well?

from nanocaller.

tahashmi avatar tahashmi commented on August 25, 2024

Sure Umair. This is my email id: [email protected]

from nanocaller.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    πŸ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. πŸ“ŠπŸ“ˆπŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❀️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.