
Comments (11)

umahsn commented on August 25, 2024

Hi, we recently made some changes to run the code faster, which may produce excessive SNP calls from the indel calling subroutine, and we are working on fixing that issue. Can you please tell us if you used the file 'PREFIX.final.vcf.gz' for evaluation? If so, could you instead run the evaluation using 'PREFIX.snps.vcf.gz' or 'PREFIX.snps.phased.vcf.gz' and tell us the results?
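
For example (just a sketch; the benchmark, SDF and output directory names below are placeholders for whatever you used in your evaluation), you could point the evaluation at the SNP-only file instead of the final file:

rtg vcfeval -b GIAB_benchmark.vcf.gz -c PREFIX.snps.vcf.gz -t reference_sdf -o vcfeval_snps_only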


umahsn commented on August 25, 2024

If you could also tell us the commit number of the NanoCaller code you used, that would help us understand the situation better. The release v0.1.1 (https://github.com/WGLab/NanoCaller/releases/tag/v0.1.1) should not have the problem I mentioned in the comment above.
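
If you installed NanoCaller by cloning the repository, something like this would print the commit you are running (assuming the clone directory is named NanoCaller):

git -C NanoCaller rev-parse HEAD    # or, for the short form: git -C NanoCaller rev-parse --short HEAD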


illarionovaanastasia commented on August 25, 2024

Thank you for the suggestions! I cloned the repository about four weeks ago, so the commit should be 72c5bc0. Yes, I used 'PREFIX.final.vcf.gz' for evaluation. I have checked 'PREFIX.snps.vcf.gz' and 'PREFIX.snps.phased.vcf.gz'; the F1 score and specificity are 0.63 and 0.52, respectively.


umahsn commented on August 25, 2024

Thank you for bringing this to our attention. Can you please tell us whether you evaluated SNP calls within the high-confidence regions for HG001? If you could share the commands you used for minimap2, NanoCaller and RTG vcfeval, that would help us understand the issue better. The default settings of NanoCaller should work well on Nanopore reads without any changes. I recently ran NanoCaller with default options (except for a minimum coverage threshold of 4 and exclusion of centromere regions, described below) on this HG001 nanopore rel6 BAM file, and got 0.9774 precision, 0.9434 recall and 0.9553 F1-score inside the high-confidence regions. Without restricting to the high-confidence regions, I got 0.7522 precision, 0.8973 recall and 0.8184 F1-score.

Also, we have just added options that allow users to run NanoCaller with BED files specifying the regions they want to include in or exclude from variant calling. For instance, the '--exclude_bed' option lets users supply their own BED file of regions to exclude, or skip centromere and telomere regions via built-in options such as 'hg38'. This can massively reduce false positive calls.
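
As a sketch (the BAM, prefix and BED file names below are placeholders), the same option can point at a user-supplied exclusion BED instead of the built-in 'hg38' regions:

python NanoCaller.py -bam sample.bam -seq ont -model NanoCaller1 -vcf chr1 -chrom chr1 -ref GRCh38.fa -prefix sample.chr1 -cpu 16 --exclude_bed my_exclude_regions.bed -mincov 4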

For reference, this is the command I used for chr1:
python NanoCaller.py -bam NA12878-minion-ul_GRCh38.bam -seq ont -model NanoCaller1 -vcf chr1 -chrom chr1 -ref GRCh38.fa -prefix HG001.chr1 -cpu 16 --exclude_bed hg38 -mincov 4


illarionovaanastasia commented on August 25, 2024

Sure! Precision, recall and F1 score are 0.49, 0.86 and 0.62, respectively, for final.vcf.gz, and 0.92, 0.81 and 0.86 for snps.vcf.gz inside the HG001 high-confidence regions. The commands I used:

# Minimap2
minimap2 -t 20 -ax map-ont hg38.fa rel6_na12878_hq7.fastq.gz | samtools sort -@ 20 -o rel6_na12878.bam
# Nanocaller
python NanoCaller.py -bam rel6_na12878.bam -mode both -seq ont -model NanoCaller1 -vcf rel6 -ref hg38.fa -wgs_contigs_type all -prefix rel6 -cpu 40 > rel6_log
# vcfeval
rtg vcfeval -b HG001_GRCh38_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz -c rel6.final.filtered.vcf.gz -t hg38_sdf -o vcfeval_rel6_hg38

For the previous evaluation I did not include chrY, chrM or unplaced scaffolds.
What is your expectation regarding the excessive number of false positives if only telomeres and centromeres are filtered out? Also, release v0.1.1 does not have the --exclude_bed feature yet; would you then recommend using the latest release?


umahsn commented on August 25, 2024

Just wanted to confirm one thing regarding evaluation in the high-confidence regions: in the RTG vcfeval command above, no evaluation BED file is specified with the '-e' flag. I am assuming you filtered the NanoCaller calls before running RTG vcfeval. However, the GIAB benchmark dataset you used contains variant calls outside the high-confidence regions, which can lead to low recall. You could also filter the benchmark calls with the GIAB high-confidence region file before using RTG vcfeval. In my experience, though, the evaluation is more accurate when you give RTG vcfeval the unfiltered VCF files from both the benchmark and the variant caller, and pass the evaluation-region BED file to vcfeval instead.
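
If you do want to restrict the benchmark calls to the high-confidence regions yourself, a rough sketch would be the following (file names are placeholders; the benchmark VCF needs a tabix index for -R to work):

# Restrict the GIAB benchmark calls to the high-confidence regions
bcftools view -R highconf_regions.bed giab_benchmark.vcf.gz -O z -o giab_benchmark.highconf.vcf.gz
tabix -p vcf giab_benchmark.highconf.vcf.gz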

Excluding centromere and telomere regions is mostly for speeding up variant calling, because variant calling inside those regions is highly unreliable and produces far too many variants (roughly 150,000 of the 460,000 SNP calls NanoCaller makes on chr1 come from the 4 Mb centromere region if you do not exclude centromeres). In fact, the high-confidence regions published by GIAB do not include centromere and telomere regions, so if you used the high-confidence regions for evaluation, the false positives cannot have come from those regions. We have just released v0.2.0, which includes these new features, and we would recommend this version.

Could you please point us to the source of the FASTQ file you have used, if it is publicly available? Also, would you be able to share the VCF files you obtained from NanoCaller with us, either here or via email? That would really help us figure out where the false positives are coming from.


umahsn commented on August 25, 2024

Another important thing: I would also suggest using the '-f QUAL' option with vcfeval, because this gives precision/recall/F1 at various quality score thresholds on the variant calls. Alternatively, you can filter the NanoCaller variant calls with bcftools beforehand:

bcftools view rel6.final.filtered.vcf.gz -i 'QUAL > NUM' | bgziptabix rel6.final.filtered.quality_filtered.vcf.gz

and use this filtered VCF file in vcfeval, where NUM is a quality threshold (try 110). I would recommend the earlier approach of using vcfeval with the -f flag. If you do use vcfeval with the -f flag, the file vcfeval_rel6_hg38/snp_roc.tsv will contain performance at various quality score thresholds, which you can then summarize with the following commands:

grep 'f_measure' vcfeval_rel6_hg38/snp_roc.tsv
tail -n +8 vcfeval_rel6_hg38/snp_roc.tsv | awk -v max=0 '{if($8>max){want=$0; max=$8}}END{print want}'
tail -n +8 vcfeval_rel6_hg38/snp_roc.tsv | awk -v max=0 '{if($8>=max){want=$0; max=$8}}END{print want}'
tail -1 vcfeval_rel6_hg38/snp_roc.tsv

This will show you the performance for the two quality score thresholds that give the highest F1-score, followed by the performance when no quality score filtering is applied. The first line is the threshold that gives the highest F1-score with higher precision, and the second is the threshold that gives the highest F1-score with higher recall; different precision/recall combinations can give the same highest F1-score. I believe this could be the reason for the low performance you observed.

Therefore, I would recommend the following RTG vcfeval command, which does not filter the NanoCaller calls by high-confidence regions but instead supplies the high-confidence region BED file to vcfeval:

rtg vcfeval -b HG001_GRCh38_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz -c rel6.snps.vcf.gz -t hg38_sdf -o vcfeval_rel6_hg38 -Z -f QUAL -e https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/latest/GRCh38/HG001_GRCh38_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_nosomaticdel_noCENorHET7.bed

NanoCaller SNP calls are very accurate even without filtering by quality score, but the accuracy can be slightly increased if the calls are filtered by quality score; this table from the supplementary materials of our bioRxiv paper shows the best scores we obtained when we tested NanoCaller on Nanopore datasets.



illarionovaanastasia commented on August 25, 2024

OK, thank you very much for all your recommendations! The FASTQ file I used is the publicly available NA12878 release 6 ONT dataset. I filtered the reads at a quality score of 7 with NanoPlot. Please check the VCF files here.


umahsn commented on August 25, 2024

I ran RTG vcfeval on the two files you provided, using the GIAB high-confidence regions BED file for HG001 chr1-X with the -e and -f QUAL options, and here is the performance I get for SNPs:

rel6_34.nanocaller.snps.vcf.gz

score     true_pos_baseline  false_pos  true_pos_call  false_neg  precision  sensitivity  f_measure
139.927   2875514            99623      2875644        167275.3   0.9665     0.945        0.9557
125.262   2884401            109336     2884531        158387.8   0.9635     0.9479       0.9557
30.103    2930678            241744     2930802        112111     0.9238     0.9632       0.9431

rel6_34.final.vcf.gz

score     true_pos_baseline  false_pos  true_pos_call  false_neg  precision  sensitivity  f_measure
146.239   2871930            111499     2872731        170858.9   0.9626     0.9438       0.9532
130.824   2881798            122340     2882605        160991     0.9593     0.9471       0.9532
2.23      2931811            275323     2932639        110978     0.9142     0.9635       0.9382

I have shared the RTG vcfeval results with you via Google Drive.


illarionovaanastasia commented on August 25, 2024

Thank you! How safe is it to run the tool on an in-house genome with a read coverage of 25-30X?


umahsn commented on August 25, 2024

I believe it should be fine to use NanoCaller on an in-house genome. In our testing on ONT reads of the HX1 genome sequenced by our lab, NanoCaller achieved just as high a recall for SNPs (~94%) as other methods, when compared against variant calls made by GATK on Illumina reads of HX1. The methods show different precision, which can be tricky to estimate correctly in the absence of a rigorous benchmark variant set, but the high recall means we do not miss true SNPs. I believe that should hold for your genome as well, especially if it was sequenced with the latest flowcells and basecaller.

