Hi Umair, I hope you are doing well! I am working on cluster sca

Sure Umair. This is my email id: t.ahmad@tudelft.n

info required on duplicate reads about nanocaller HOT 6 CLOSED

tahashmi commented on August 25, 2024

info required on duplicate reads

from nanocaller.

Comments (6)

umahsn commented on August 25, 2024

Hi Tanveer,

Thank you for your interest in NanoCaller.

For HG001 (NA12878), we used aligned CCS reads from here: (https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/PacBio_SequelII_CCS_11kb/HG001_GRCh38/) so we didn't do any alignment ourselves. For HG002-HG004, we obtained PacBio CCS FASTQ files from precisionFDA (now available at https://data.nist.gov/pdr/od/id/mds2-2336). We aligned these reads with minimap2 without any duplicate removal. Generally, for long-reads sequencing, you do not need to do any duplicate removal, regardless of PCR-free or not.

You can consult the supplementary file of precisionFDA challenge preprint to see what different teams did for alignment and any other pre-processing steps. Since we did not sequence these genomes, we do not if anything else was done before the FASTQ files were released publicly, so it would be better to consult with the sequencing team about that.

Just FYI, we used the following minimap2 command for CCS reads, with settings used in several aligned files provided by GIAB (example):
minimap2 -a -x map-pb -k 19 -O 5,56 -E 4,1 -B 5 -z 400,50 -r 2k --eqx --secondary=no

Let me know if there are any further questions.

from nanocaller.

tahashmi commented on August 25, 2024

Thank you so much Umair for your quick and detailed reply. It's clear now.

Would you like to mention a short flow of tools in ensemble mode (like mentioned here) for your precisionFDA challenge submission? I want to know if I can use PySaprk dataframes for SAM data to integrate such ensemble mode variant calling workflow. Thanks.

from nanocaller.

umahsn commented on August 25, 2024

For our precisionFDA submission we had a somewhat elementary ensemble. We ran NanoCaller, Clair and Medaka separately on BAM files. After variant calling we took variant quality scores (QUAL field) of predictions made by each of the three variant callers and scaled them so that were all on a uniform scale. The reason for this is most of Clair's quality score were mostly in 500-1000, Medaka's score were mostly in 0-25, and NanoCaller's quality scores were mostly in 30-400 range.

We performed a weighted vote for each variant call, instead of majority voting. More precisely, we looked at each variant call in the union of three tools, issued the same call in final output with the following quality score:

QUAL=(NanoCaller QUAL score for the variant +Clair QUAL score for the variant +Medaka QUAL score for the variant)/3

If a tool did not make the variant call in union set we gave a quality score of 0 for that variant caller. So if there was a variant call made by NanoCaller with QUAL=240, but Medaka and Clair did not make this variant call, then the ensemble will issue the same variant call with QUAL=(214+0+0)/3=80.

So instead of removing calls with low number of votes, we just gave it a lower score. The reason for this is that there are several variant calls that are uniquely called by only one or two methods, so it would not seem prudent to discard such calls. This allows us maximum recall, but with quality score filtering we can make the callset more stringent to remove low vote supporting calls.

from nanocaller.

tahashmi commented on August 25, 2024

Thanks Umair for this detailed answer. Is it possible to contact you on your email for a possible collaboration? Thanks.

from nanocaller.

umahsn commented on August 25, 2024

Hi Tanveer, yes that would be great. Could you give me your email so I can copy the rest of my team in the email as well?

from nanocaller.

tahashmi commented on August 25, 2024

Sure Umair. This is my email id: [email protected]

from nanocaller.

info required on duplicate reads about nanocaller HOT 6 CLOSED

Comments (6)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent