Pipeline for wireworm genome assembly, RADseq genotyping, & genome skimming

Analyses implemented in the following manuscript:

Andrews KR, Gerritsen A, Rashed A, Crowder DW, Rondon SI, van Herk WG, Vernon R, Wanner KW, Wilson CM, New DD, Fagnan MW, Hohenlohe PA, Hunter SS 2020. Wireworm (Coleoptera: Elateridae) genomic analysis reveals putative cryptic species, population structure, and adaptation to pest control. Communications Biology 3. doi: 10.1038/s42003-020-01169-9

Requirements

MaSuRCA v3.2.4
BUSCO v3.0.2
HTStream
STACKS v2.3e
BWA v0.7.17
GATK v4.1.1.0
VCFTools v0.1.15
BLAT
MAFFT v.7.429
Python v.2.7
ARC

Genome assembly

1. Assembly of Limonius californicus genome from Illumina, SLR, and PacBio reads

Install Masurca:

cd ~/MaSuRCA-hybrid-assembly
mkdir masurca
cd masurca
wget ftp://ftp.genome.umd.edu/pub/MaSuRCA/latest/MaSuRCA-3.2.4.tar.gz
tar xvf MaSuRCA-3.2.4.tar.gz
./install.sh

PacBio data was converted from fastq to fasta format and combined with Illumina SLR (synthetic long read) data and provided to MaSuRCA as a single file:

PACBIO=./pacbio+SLR_for_Masurca/combined.fa

Setup MaSuRCA assembly

cd ~/MaSuRCA-hybrid-assembly

mkdir 01-MaSuRCA-assembly
cd 01-MaSuRCA-assembly
cp ../masurca/MaSuRCA-3.2.4/sr_config_example.txt sr_config.txt

#Edit sr_config.txt to add Illumina and long read data. See MaSuRCA/sr_config.txt for final MaSuRCA config file.


../masurca/MaSuRCA-3.2.4/bin/masurca sr_config.txt

2. Assessment of genome completeness

Installation of Busco is beyond the scope of this document, but the exact version of Busco used can be obtained by doing the following:

git clone https://gitlab.com/ezlab/busco.git
cd busco
git checkout 3927d240f8b5bf8232066d51fce4ac7554bf29a6

The database used was downloaded with the following command:

wget http://busco.ezlab.org/datasets/insecta_odb9.tar.gz

Run busco

cd ~/MaSuRCA-hybrid-assembly
module load python
module load ncbi-blast
module load hmmer
module load R/3.5.0

mkdir 02-Busco

export AUGUSTUS_CONFIG_PATH=~/opt/src/augustus/augustus-3.3.1/config

run_BUSCO.py -i ../01-MaSuRCA-assembly/CA.mr.41.15.17.0.029/final.genome.scf.fasta \
  -c 1 -m geno -l ~/databases/busco/insecta_odb9 -o busco-test

generate_plot.py -wd ./

RADseq genotyping

Raw fastq.gz files are located in ./00-RawData

1. Trim adapters using HTStream

mkdir 01-hts_AdapterTrimmed

Example command:

hts_AdapterTrimmer -m 50 -1 ./00-RawData/CAGATCAT_S178_L008_R1_001.fastq.gz -2 ./00-RawData/CAGATCAT_S178_L008_R2_001.fastq.gz -fgp ./01-hts_AdapterTrimmed/CAGATCAT -L ./01-hts_AdapterTrimmed/CAGATCAT_log

2. Clean & demultiplex by barcode using STACKS

mkdir 02-process_radtags

Example command:

process_radtags -r -c -q -P -1 ./01-hts_AdapterTrimmed/CAGATCAT_R1.fastq.gz -2 ./01-hts_AdapterTrimmed/CAGATCAT_R2.fastq.gz -b ./stacks_files/Plate1_barcodes.txt -e sbfI -i gzfastq -o ./02-process_radtags --bestrad

3. Remove PCR duplicates using HTStream

mkdir 03-superdedup

Example command:

hts_SuperDeduper -fgp ./03-superdedup/ID021_dedup   -L ./03-superdedup/ID021.log  -1 ./02-process_radtags/ID021.1.fq.gz -2 ./02-process_radtags/ID021.2.fq.gz

4. Map to reference genome using BWA

mkdir 04-Mapped
python 04-SetupMapping.py
sh 04-mapping-commands.sh

Example command:

bwa mem -t 5 -R '@RG\tID:bwa\tSM:WA049_dedup\tPL:ILLUMINA' ./idx/idx ./03-superdedup/ID021_dedup_R1.fastq.gz ./03-superdedup/ID021_dedup_R2.fastq.gz  2>./04-Mapped/ID021_dedup.log | samtools view -bS - | samtools sort - > ./04-Mapped/ID021_dedup.bam

5. Genotype calling using GATK

mkdir 05-gatk
sh 05-gatk.sh

6. Filter SNPs using VCFTools

mkdir 06-VCFTools
cd 06-VCFTools

Remove indels

vcftools --vcf ../05-gatk/mapped.vcf --out mapped_snps --remove-indels --recode

Remove SNPs with depth <5

vcftools --vcf mapped_snps.recode.vcf --out mapped_snps_D5 --minDP 5 --recode

Remove SNPs with genotype quality <15

vcftools --vcf mapped_snps_D5.recode.vcf --out mapped_snps_D5_GQ15 --minGQ 15 --recode

Remove SNPs with >20% missing data

vcftools --vcf mapped_snps_D5_GQ15.recode.vcf --out mapped_snps_D5_GQ15_mis80 --max-missing 0.8 --recode

Remove individuals with >20% missing SNPs

vcftools --vcf mapped_snps_D5_GQ15_mis80.recode.vcf --out mapped_snps_D5_GQ15_mis80_misind80 --remove individuals_to_remove_01 --recode

Remove SNPs with max mean depth >1.5 standard deviations above mean depth of all SNPs

vcftools --vcf mapped_snps_D5_GQ15_mis80_misind80.recode.vcf --out mapped_snps_D5_GQ15_mis80_misind80_maxD --max-meanDP 19.36 --recode

Remove SNPs with mean depth <8

vcftools --vcf mapped_snps_D5_GQ15_mis80_misind80_maxD.recode.vcf --out mapped_snps_D5_GQ15_mis80_misind80_maxD_minD --min-meanDP 8 --recode

Remove singletons

vcftools --vcf mapped_snps_D5_GQ15_mis80_misind80_maxD_minD.recode.vcf --out mapped_snps_D5_GQ15_mis80_misind80_maxD_minD --singletons

awk '{ print $1, $2}'  OFS='\t' mapped_snps_D5_GQ15_mis80_misind80_maxD_minD.singletons | awk '{ if (NR!=1) {print} }' OFS='\t' > mapped_snps_D5_GQ15_mis80_misind80_maxD_minD_singletonlist

vcftools --vcf mapped_snps_D5_GQ15_mis80_misind80_maxD_minD.recode.vcf --out mapped_snps_D5_GQ15_mis80_misind80_maxD_minD_xsing --exclude-positions mapped_snps_D5_GQ15_mis80_misind80_maxD_minD_singletonlist --recode

Remove monomorphic SNPs (MAF<0.00001)

vcftools --vcf mapped_snps_D5_GQ15_mis80_misind80_maxD_minD_xsing.recode.vcf --out mapped_snps_D5_GQ15_mis80_misind80_maxD_minD_xsing_xmono --maf 0.00001 --recode

Thin SNPs <1000bp apart

vcftools --vcf sub01_snps_D5_GQ15_mis80_misind80_maxD_xsing_xmono.recode.vcf --out sub01_snps_D5_GQ15_mis80_misind80_maxD_xsing_xmono_thin1000 --thin 1000 --recode

Genome skimming

Scripts and configuration files for this step are included in this GitHub repository in the 16s_and_CO1_assembly folder.

0. Read cleaning

HTStream was used to clean the raw shotgun reads prior to de novo assembly. Reads were cleaned to remove PCR duplicates, screen for PhiX, trim Illumina adapters, trim reads containing N's, screen against a collection of known adapter sequences, and finally to quality trim the reads. The 01-clean.py script was used to set up the cleaning pipeline. SPAdes assembler was used to do a preliminary assembly of the cleaned reads. The contigs resulting from this assembly were not included in this publication.

Raw shotgun sequence data (fastq.gz files) were stored in ./00-RawData and cleaned reads are written to ./01-Cleaned/.

python 01-clean.py
sh 01-cleaning_commands.sh
sh 01-assembly_commands.sh

1. Iterative mapping & de novo assembly with ARC

A set of 16S and COI reference sequences for Limonius californicus (GenBank KT852377.1) and other wireworm species (manually identified from SPAdes assembly) were collected and stored in combined.fa and used to initiate read recruitment and assembly using ARC with configuration supplied by the included ARC_config.txt.

mkdir 02-16sCO1-ARC-assemblies
cd 02-16sCO1-ARC-assemblies
python ../02-Setup-ARC.py
ARC > log.out

2. Extract and orient ARC contigs that blat with a good score to the original targets, create initial alignment with mafft. This process was set up using the included 03-setup_CO1_extract-ARC.py and 03-setup_16s_extract-ARC.py scripts, and relies on the blat_and_extract.py script.

COI:

mkdir 03-CO1-contigs-ARC
python 03-setup_CO1_extract-ARC.py
sh 03-extract_CO1-ARC.sh
cat ./03-CO1-contigs-ARC/*.fasta > ./03-CO1-contigs-ARC/all_combined.fasta
mafft --auto --thread 50 ./03-CO1-contigs-ARC/all_combined.fasta > ./03-CO1-contigs-ARC/all_combined_aligned.fasta

16S:

mkdir 03-16s-contigs-ARC
python 03-setup_16S_extract-ARC.py
sh 03-extract_16S-ARC.sh
cat ./03-16s-contigs-ARC/*.fasta > ./03-16s-contigs-ARC/all_combined.fasta
mafft --auto --thread 50 ./03-16s-contigs-ARC/all_combined.fasta > ./03-16s-contigs-ARC/all_combined_aligned.fasta

kimandrews / wireworm_popgen Goto Github PK

wireworm_popgen's Introduction

Pipeline for wireworm genome assembly, RADseq genotyping, & genome skimming

Requirements

Genome assembly

1. Assembly of Limonius californicus genome from Illumina, SLR, and PacBio reads

Install Masurca:

PacBio data was converted from fastq to fasta format and combined with Illumina SLR (synthetic long read) data and provided to MaSuRCA as a single file:

Setup MaSuRCA assembly

2. Assessment of genome completeness

Run busco

RADseq genotyping

1. Trim adapters using HTStream

2. Clean & demultiplex by barcode using STACKS

3. Remove PCR duplicates using HTStream

4. Map to reference genome using BWA

5. Genotype calling using GATK

6. Filter SNPs using VCFTools

Genome skimming

0. Read cleaning

1. Iterative mapping & de novo assembly with ARC

2. Extract and orient ARC contigs that blat with a good score to the original targets, create initial alignment with mafft. This process was set up using the included 03-setup_CO1_extract-ARC.py and 03-setup_16s_extract-ARC.py scripts, and relies on the blat_and_extract.py script.

Further analysis of the resulting contigs was carried out using other software as described in the manuscript.

wireworm_popgen's People

Contributors

Stargazers

Watchers

Recommend Projects

Recommend Topics

Recommend Org