In human genetics, an important task is validating mutations that have been found using exome or whole genome sequencing with a PCR spanning the abnormal region. This task is not difficult, but surprisingly laborious when done manually.
This workflow automates the design of PCR and qPCR primers. As input, it takes either:
- an HGVS-formatted mutation, e.g.
NM_000546.6:c.215C>G
(for PCR) - a gene name and exon number, e.g.
NM_007294.4:14
(for qPCR)
The workflow will:
- validate HGVS syntax
- for mutations in coding regions try to span the exon
- constrain primers to desired parameters
- construct a second backup pair of primers that does not overlap the first pair
- avoid placing primers across known common variants
- validate all primers to not produce unspecific amplicons
conda install -y -n primer -c bioconda nextflow && conda activate primer
git clone https://github.com/phiweger/primer && cd primer
mkdir data
# ^ add data in there, see below
# And some more dependencies
conda install -y -c bioconda blast
Then set the absolute (!) path to the data in nextflow.config
. If you use other reference genomes/ annotations/ variant databases, you also have to adjust the filenames in config.json
.
primer4 -i order.csv -d /path/to/primer4/data -p /path/to/primer4/config.json
# DEPRECATED
cat input.csv
# name,method,variant
# ABCA4_v1,PCR,NM_000350.3:c.4234C>T
# ABCA4_v2,PCR,NM_000350.3:c.4773+3A>G
nextflow run workflow/main.nf --input input.csv --results designs
You need to get three things, which need to correspond to one another:
- reference genome
- SNPs
- annotation
Below, we'll use the files corresponding to the human genome hg19
(v13, NCBI, last access 2021-02-03).
# 1. Reference genome and index for random access:
# - GRCh37_latest_genomic.fna.gz
# - GRCh37_latest_genomic.fna.gz.fai
wget https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/GRCh37_latest_genomic.fna.gz
gunzip GRCh37_latest_genomic.fna.gz
samtools faidx GRCh37_latest_genomic.fna
# Blast index
makeblastdb -in GRCh37_latest_genomic.fna -dbtype nucl
# Adding sequences from FASTA; added 297 sequences in 39.731 seconds.
# 2. Variants and index:
# - GRCh37_latest_dbSNP_all.vcf.gz
# - GRCh37_latest_dbSNP_all.vcf.gz.tbi
wget https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/GRCh37_latest_dbSNP_all.vcf.gz
gunip GRCh37_latest_dbSNP_all.vcf.gz
bgzip -c GRCh37_latest_dbSNP_all.vcf > GRCh37_latest_dbSNP_all.vcf.gz
tabix -p vcf GRCh37_latest_dbSNP_all.vcf.gz
# https://genetools.org/SNPCheck/snpcheck.htm
DB=ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf
gunzip ${DB}.gz
bgzip -c $DB > ${DB}.gz
tabix -p vcf ${DB}.gz
DB=ESP6500SI-V2-SSA137.GRCh38-liftover.snps_indels.vcf
tar -xzvf ${DB}.tar.gz
cat *.vcf > $DB
bgzip -c $DB > ${DB}.gz
tabix -p vcf ${DB}.gz
# 3. Annotation database:
# - hg19-p13_annotation.db
Get it from OSF (project ID 7csav); here is how we created it:
wget https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/GRCh37_latest_genomic.gff.gz
gunzip GRCh37_latest_genomic.gff.gz
# switch to Python
import gffutils
# http://daler.github.io/gffutils/
# https://daler.github.io/gffutils/database-import.html
fp = 'GRCh37_latest_genomic.gff'
db = gffutils.create_db(
fp,
dbfn='hg19-p13_annotation.db',
force=True,
keep_order=True,
merge_strategy='merge',
sort_attribute_values=True)
# Takes a couple of hours on a regular laptop, then use like
db = gffutils.FeatureDB('hg19-p13_annotation.db', keep_order=True)
db['exon-NR_024540.1-3']
For future reference, below are documented some peculiarities:
gffutils.FeatureDB
will not accept symlinks; this means we cannot pass the database in anextflow
channel.- There are two files with parameters, one for
nextflow
(nextflow.config
) and one to guide the primer design (config.json
). There are more elegant solutions, but meh.