
Comments (11)

umahsn commented on August 25, 2024

Hi, we recently made some changes to run the code faster, which may produce excessive SNP calls from the indel calling subroutine, and we are working on fixing that issue. Can you please tell us if you used the file 'PREFIX.final.vcf.gz' for evaluation? If so, could you instead run the evaluation using 'PREFIX.snps.vcf.gz' or 'PREFIX.snps.phased.vcf.gz' and tell us the results?
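
For example (just a sketch; the benchmark, SDF and output directory names below are placeholders for whatever you used in your evaluation), you could point the evaluation at the SNP-only file instead of the final file:

rtg vcfeval -b GIAB_benchmark.vcf.gz -c PREFIX.snps.vcf.gz -t reference_sdf -o vcfeval_snps_only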


umahsn commented on August 25, 2024

If you could also tell us the commit number of the NanoCaller code you used, that would help us understand the situation better. The release v0.1.1 (https://github.com/WGLab/NanoCaller/releases/tag/v0.1.1) should not have the problem I mentioned in the comment above.
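
If you installed NanoCaller by cloning the repository, something like this would print the commit you are running (assuming the clone directory is named NanoCaller):

git -C NanoCaller rev-parse HEAD    # or, for the short form: git -C NanoCaller rev-parse --short HEAD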


illarionovaanastasia commented on August 25, 2024

Thank you for the suggestions! I cloned the repository about four weeks ago, so the commit should be 72c5bc0. Yes, I used 'PREFIX.final.vcf.gz' for evaluation. I have checked 'PREFIX.snps.vcf.gz' and 'PREFIX.snps.phased.vcf.gz'; the F1 score and specificity are 0.63 and 0.52, respectively.


umahsn commented on August 25, 2024

Thank you for bringing this to our attention. Can you please tell us whether you evaluated SNP calls within the high-confidence regions for HG001? If you could share the commands you used for minimap2, NanoCaller and RTG vcfeval, that would help us understand the issue better. The default settings of NanoCaller should work well on Nanopore reads without any changes. I recently ran NanoCaller with default options (except for a minimum coverage threshold of 4 and exclusion of centromere regions, described below) on this HG001 nanopore rel6 BAM file, and got 0.9774 precision, 0.9434 recall and 0.9553 F1-score inside the high-confidence regions. Without restricting to the high-confidence regions, I got 0.7522 precision, 0.8973 recall and 0.8184 F1-score.

Also, we have just added options that allow users to run NanoCaller with BED files specifying the regions they want to include in or exclude from variant calling. For instance, the '--exclude_bed' option lets users supply their own BED file of regions to exclude, or skip centromere and telomere regions via built-in options such as 'hg38'. This can massively reduce false positive calls.
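
As a sketch (the BAM, prefix and BED file names below are placeholders), the same option can point at a user-supplied exclusion BED instead of the built-in 'hg38' regions:

python NanoCaller.py -bam sample.bam -seq ont -model NanoCaller1 -vcf chr1 -chrom chr1 -ref GRCh38.fa -prefix sample.chr1 -cpu 16 --exclude_bed my_exclude_regions.bed -mincov 4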

For reference, this is the command I used for chr1:
python NanoCaller.py -bam NA12878-minion-ul_GRCh38.bam -seq ont -model NanoCaller1 -vcf chr1 -chrom chr1 -ref GRCh38.fa -prefix HG001.chr1 -cpu 16 --exclude_bed hg38 -mincov 4


illarionovaanastasia commented on August 25, 2024

Sure! Precision, recall and F1 score are 0.49, 0.86 and 0.62, respectively, for final.vcf.gz, and 0.92, 0.81 and 0.86 for snps.vcf.gz inside the HG001 high-confidence regions. The commands I used:

# Minimap2
minimap2 -t 20 -ax map-ont hg38.fa rel6_na12878_hq7.fastq.gz | samtools sort -@ 20 -o rel6_na12878.bam
# Nanocaller
python NanoCaller.py -bam rel6_na12878.bam -mode both -seq ont -model NanoCaller1 -vcf rel6 -ref hg38.fa -wgs_contigs_type all -prefix rel6 -cpu 40 > rel6_log
# vcfeval
rtg vcfeval -b HG001_GRCh38_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz -c rel6.final.filtered.vcf.gz -t hg38_sdf -o vcfeval_rel6_hg38

For the previous evaluation I did not include chrY, chrM or unplaced scaffolds.
What is your expectation regarding the excessive number of false positives if only telomeres and centromeres are filtered out? Also, release v0.1.1 does not have the --exclude_bed feature yet; would you then recommend using the latest release?


umahsn commented on August 25, 2024

Just wanted to confirm one thing regarding evaluation in the high-confidence regions: in the RTG vcfeval command above, no evaluation BED file is specified with the '-e' flag. I am assuming you filtered the NanoCaller calls before running RTG vcfeval. However, the GIAB benchmark dataset you used contains variant calls outside the high-confidence regions, which can lead to low recall. You could also filter the benchmark calls with the GIAB high-confidence region file before using RTG vcfeval. In my experience, though, the evaluation is more accurate when you give RTG vcfeval the unfiltered VCF files from both the benchmark and the variant caller, and pass the evaluation-region BED file to vcfeval instead.
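
If you do want to restrict the benchmark calls to the high-confidence regions yourself, a rough sketch would be the following (file names are placeholders; the benchmark VCF needs a tabix index for -R to work):

# Restrict the GIAB benchmark calls to the high-confidence regions
bcftools view -R highconf_regions.bed giab_benchmark.vcf.gz -O z -o giab_benchmark.highconf.vcf.gz
tabix -p vcf giab_benchmark.highconf.vcf.gz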

Excluding centromere and telomere regions is mostly for speeding up variant calling, because variant calling inside those regions is highly unreliable and produces far too many variants (roughly 150,000 of the 460,000 SNP calls NanoCaller makes on chr1 come from the 4 Mb centromere region if you do not exclude centromeres). In fact, the high-confidence regions published by GIAB do not include centromere and telomere regions, so if you used the high-confidence regions for evaluation, the false positives cannot have come from those regions. We have just released v0.2.0, which includes these new features, and we would recommend this version.

Could you please point us to the source of the FASTQ file you have used, if it is publicly available? Also, would you be able to share the VCF files you obtained from NanoCaller with us, either here or via email? That would really help us figure out where the false positives are coming from.


umahsn commented on August 25, 2024

Another important thing: I would also suggest using the '-f QUAL' option with vcfeval, because this gives precision/recall/F1 at various quality score thresholds on the variant calls. Alternatively, you can filter the NanoCaller variant calls with bcftools beforehand:

bcftools view rel6.final.filtered.vcf.gz -i 'QUAL > NUM' | bgziptabix rel6.final.filtered.quality_filtered.vcf.gz

and use this filtered VCF file in vcfeval, where NUM is a quality threshold (try 110). I would recommend the earlier approach of using vcfeval with the -f flag. If you do use vcfeval with the -f flag, the file vcfeval_rel6_hg38/snp_roc.tsv will contain performance at various quality score thresholds, which you can then summarize with the following commands:

grep 'f_measure' vcfeval_rel6_hg38/snp_roc.tsv
tail -n +8 vcfeval_rel6_hg38/snp_roc.tsv | awk -v max=0 '{if($8>max){want=$0; max=$8}}END{print want}'
tail -n +8 vcfeval_rel6_hg38/snp_roc.tsv | awk -v max=0 '{if($8>=max){want=$0; max=$8}}END{print want}'
tail -1 vcfeval_rel6_hg38/snp_roc.tsv

This will show you the performance for the two quality score thresholds that give the highest F1-score, followed by the performance when no quality score filtering is applied. The first line is the threshold that gives the highest F1-score with higher precision, and the second is the threshold that gives the highest F1-score with higher recall; different precision/recall combinations can give the same highest F1-score. I believe this could be the reason for the low performance you observed.

Therefore, I would recommend the following RTG vcfeval command, which does not filter the NanoCaller calls by high-confidence regions but instead supplies the high-confidence region BED file to vcfeval:

rtg vcfeval -b HG001_GRCh38_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz -c rel6.snps.vcf.gz -t hg38_sdf -o vcfeval_rel6_hg38 -Z -f QUAL -e https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/latest/GRCh38/HG001_GRCh38_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_nosomaticdel_noCENorHET7.bed

NanoCaller SNP calls are very accurate even without filtering by quality score, but the accuracy can be slightly increased if the calls are filtered by quality score; this table from the supplementary materials of our bioRxiv paper shows the best scores we obtained when we tested NanoCaller on Nanopore datasets.



illarionovaanastasia commented on August 25, 2024

OK, thank you very much for all your recommendations! The FASTQ file I used is the publicly available NA12878 release 6 ONT dataset. I filtered the reads at a quality score of 7 with NanoPlot. Please check the VCF files here.


umahsn commented on August 25, 2024

I ran RTG vcfeval on the two files you provided, using the GIAB high-confidence regions BED file for HG001 chr1-X with the -e and -f QUAL options, and here is the performance I get for SNPs:

rel6_34.nanocaller.snps.vcf.gz

score     true_pos_baseline  false_pos  true_pos_call  false_neg  precision  sensitivity  f_measure
139.927   2875514            99623      2875644        167275.3   0.9665     0.945        0.9557
125.262   2884401            109336     2884531        158387.8   0.9635     0.9479       0.9557
30.103    2930678            241744     2930802        112111     0.9238     0.9632       0.9431

rel6_34.final.vcf.gz

score     true_pos_baseline  false_pos  true_pos_call  false_neg  precision  sensitivity  f_measure
146.239   2871930            111499     2872731        170858.9   0.9626     0.9438       0.9532
130.824   2881798            122340     2882605        160991     0.9593     0.9471       0.9532
2.23      2931811            275323     2932639        110978     0.9142     0.9635       0.9382

I have shared the RTG vcfeval results with you via Google Drive.


illarionovaanastasia commented on August 25, 2024

Thank you! How safe is it to run the tool on an in-house genome with a read coverage of 25-30X?


umahsn commented on August 25, 2024

I believe it should be fine to use NanoCaller on an in-house genome. In our testing on ONT reads of the HX1 genome sequenced by our lab, NanoCaller achieved just as high a recall for SNPs (~94%) as other methods, when compared against variant calls made by GATK on Illumina reads of HX1. The methods show different precision, which can be tricky to estimate correctly in the absence of a rigorous benchmark variant set, but the high recall means we do not miss true SNPs. I believe that should hold for your genome as well, especially if it was sequenced with the latest flowcells and basecaller.

