Code Monkey home page Code Monkey logo

id16s's Introduction

ID16S – 16S amplicon identification

DOI

Table of contents

Introduction

The ID16S pipeline reconstructs the composition of bacterial species from a multifasta file of 16S amplicon sequences. Inferences are derived from BLAST homology searches against the NCBI 16S Microbial database. Normalization based on rRNA copy variation is performed by utilizing a database produced from all completely assembled bacterial genome annotations within RefSeq. For a good overview of rRNA copy variation, please see this excellent review by Lavrinienko et al.

The ID16S pipeline was tested on Nanopore 1D reads obtained with the 16S Barcoding Kit (SQK-RAB204). Identification accuracy parallels that of the Nanopore sequencing reads. This pipeline should work on all 16S datasets but full length 16S amplicons are preferable.

Note that for large datasets, BLAST homology searches will take a while to complete, even with the megablast algorithm. People interested in faster tools should look at Kraken2. The later is based on kmers and is much faster than BLAST approaches but produces a lower recall with Nanopore reads due to their lower acccuracy. For an excellent comparison of the recall rate from nanopore reads with BLAST and Kraken2, see this paper by Pearman et al.

Dependencies

Getting started

Installing ID16S

The ID16S pipeline can be downloaded with Git:

git clone --recursive https://github.com/PombertLab/ID16S.git

To run the ID16S pipeline, its environment variables must be setup:

ID16S/setup_ID16S.pl \
  -d /desired/database/path \
  -c /desired/configuration/file

source $CONFIG_FILE

Downloading databases

A total of 3 NCBI datasets are required for the ID16S pipeline:

  1. The NCBI 16S Microbial database - 16S_ribosomal_RNA.tar.gz
  2. The NCBI Taxonomy database - taxdb.tar.gz
  3. The NCBI Taxonomy dumps - taxdump.tar.gz

The ID16S pipeline comes with a precompiled 16S rRNA database (updated March 2022), though an updated version can be created.

The required databases can be downloaded as follows:

ID16S/download_ID16S_dbs.pl -d

If an updated 16S rRNA database is desired, the addition of the -c (--create) flag can be used:

ID16S/download_ID16S_dbs.pl -d -c

Running ID16S via run_ID16S.pl

To run ID16S, provide run_ID16S.pl with the desired FASTA/Q files:

ID16S/run_ID16S.pl \
  -fa /path/to/FASTA/files
  -fq /path/to/FASTQ/files

Options for run_ID16S.pl are:

GENERAL OPTIONS
-fa (--fasta)		FASTA files to run
-fq (--fastq)		FASTQ files to convert then run
-hd (--headcrop)	Remove the first X nucleotides from 5' end of FASTQ sequences
			## Useful for Nanopore data
-m (--min_length)	Minimum read length to keep from FASTQ files [Default: 1000]

ADVANCED OPTIONS
# ID16S SETTINGS
-o (--outdir)		Output directory [Default = \$ID16S_HOME]
-d (--db)		Path to 16IDS_DB download [Default = \$ID16S_DB]

# BLAST OPTIONS
-k (--tasks)		megablast, dc-megablast, blastn [default = megablast]
-t (--threads)		CPUs to use [default = 10]
-cu (--culling)		Culling limit [default = 10]
-h (--hits)		Number of hits to return [Default = 1]
-pe (--p_evalue)	Preliminary e-value cutoff for BLAST results [Default = 1e-05]

# OUTPUT OPTIONS
-r (--ranks)		Output files by taxonomic ranks [Default: species genus family order class]
-fe (--f_evalue)	Final e-value cutoff for BLAST results [Default = 1e-75]
-co (--concat)		Concatenate all results into a single file [Default: off]

The non-normalized results can be found in the NonNormalized directory, and will look similar to:

==> EXAMPLE/NonNormalized/sample_1.fasta.megablast.class <==
class	TaxID	Number	Percent (total = 848)
Mollicutes	31969	848	100.00%

==> EXAMPLE/NonNormalized/sample_1.fasta.megablast.family <==
family	TaxID	Number	Percent (total = 848)
Mycoplasmataceae	2092	848	100.00%

==> EXAMPLE/NonNormalized/sample_1.fasta.megablast.genus <==
genus	TaxID	Number	Percent (total = 848)
Mycoplasmopsis	2767358	838	98.82%
Mycoplasma	2093	10	1.18%

==> EXAMPLE/NonNormalized/sample_1.fasta.megablast.order <==
order	TaxID	Number	Percent (total = 848)
Mycoplasmatales	2085	848	100.00%

==> EXAMPLE/NonNormalized/sample_1.fasta.megablast.species <==
species	TaxID	Number	Percent (total = 848)
Mycoplasmopsis fermentans	2115	677	79.83%
Mycoplasmopsis arginini	2094	134	15.80%
Mycoplasmopsis caviae	55603	22	2.59%
Mycoplasma canadense	29554	4	0.47%
Mycoplasmopsis hyopharyngis	29558	2	0.24%
Mycoplasma gateae	35769	2	0.24%
Mycoplasma neophronis	872983	1	0.12%
Mycoplasmopsis adleri	51362	1	0.12%
Mycoplasma auris	51363	1	0.12%

The normalized results can be found in the Normalized directory, and will look similar to:

==> EXAMPLE/Normalized/sample_1_fasta_megablast_class_Normalized_Microbiome_Composition.tsv <==
###Organism Name	TaxID	Taxo Level	Non-normalized % of sample	Normalized % of sample	Delta
Mollicutes	31969	class	100.00	100.00	0.00

==> EXAMPLE/Normalized/sample_1_fasta_megablast_family_Normalized_Microbiome_Composition.tsv <==
###Organism Name	TaxID	Taxo Level	Non-normalized % of sample	Normalized % of sample	Delta
Mycoplasmataceae	2092	family	100.00	100.00	0.00

==> EXAMPLE/Normalized/sample_1_fasta_megablast_genus_Normalized_Microbiome_Composition.tsv <==
###Organism Name	TaxID	Taxo Level	Non-normalized % of sample	Normalized % of sample	Delta
Mycoplasmopsis	2767358	genus	98.82	98.82	0.00
Mycoplasma	2093	genus	1.18	1.18	0.00

==> EXAMPLE/Normalized/sample_1_fasta_megablast_order_Normalized_Microbiome_Composition.tsv <==
###Organism Name	TaxID	Taxo Level	Non-normalized % of sample	Normalized % of sample	Delta
Mycoplasmatales	2085	order	100.00	100.00	0.00

==> EXAMPLE/Normalized/sample_1_fasta_megablast_species_Normalized_Microbiome_Composition.tsv <==
###Organism Name	TaxID	Taxo Level	Non-normalized % of sample	Normalized % of sample	Delta
Mycoplasmopsis fermentans	2115	species	79.83	88.79	8.96
Mycoplasmopsis arginini	2094	genus	15.80	8.79	-7.01
Mycoplasmopsis caviae	55603	genus	2.59	1.44	-1.15
Mycoplasma canadense	29554	genus	0.47	0.26	-0.21
Mycoplasma gateae	35769	genus	0.24	0.13	-0.11
Mycoplasmopsis hyopharyngis	29558	genus	0.24	0.13	-0.11
Mycoplasma neophronis	872983	genus	0.12	0.07	-0.05
Mycoplasmopsis mucosicanis	458208	genus	0.12	0.07	-0.05
Mycoplasma auris	51363	genus	0.12	0.07	-0.05

Running ID16S step-wise

Key steps The ID16S pipeline consists of a few simple steps:

  1. Convert FASTQ files to FASTA format with fastq2fasta.pl
  2. Perform homology searches against the Microbial 16S database with megablast.pl
  3. Summarize the taxonomic composition of the datasets with taxid_dist.pl

Example

We can use the FASTQ files located in the Example/ folder to test the installation of the pipeline. To convert the FASTQ files to FASTA format with fastq2fasta.pl, simply type:

fastq2fasta.pl \
   -f Example/*.fastq.gz \
   -h 50 \
   -m 1000 \
   -o FASTA \
   -v

Options for fastq2fasta.pl are:

-f (--fastq)		FASTQ files to convert
-o (--outdir)		Output directory [Default: ./]
-h (--headcrop)		Remove the first X nucleotides from 5' ## Useful for nanopore data
-m (--min_length)	Minimum length (in nt) of reads to keep [Default: 1000]
-v (--verbose)		Adds verbosity

To perform BLAST homology searches against the NCBI 16S ribosomal RNA database, we can use megablast.pl. This script will generate BLAST outputs with the following format: -outfmt '6 qseqid sseqid pident length bitscore evalue staxids sskingdoms sscinames sblastnames'. This format is required for taxid_dist.pl.

To perform BLAST searches with megablast.pl using 10 threads (-t 10), type:

megablast.pl \
   -k megablast \
   -q FASTA/*.fasta \
   -d NCBI_16S/16S_ribosomal_RNA \
   -e 1e-05 \
   -c 10 \
   -t 10 \
   -o MEGABLAST \
   -v

Options for megablast.pl are:

-k (--task)	megablast, dc-megablast, blastn [default = megablast]
-q (--query)	fasta file(s) to be queried
-d (--db)	NCBI nucleotide database to query [default = 16S_ribosomal_RNA]
-e (--evalue)	1e-05, 1e-10 or other [default = 1e-05]
-c (--culling)	culling limit [default = 10]
-t (--threads)	CPUs to use [default = 10]
-o (--outdir)	Output directory [Default: ./]
-x (--taxids)	Restrict search to taxids from file ## one taxid per line
-n (--ntaxids)	Exclude from search taxids from file ## one taxid per line
-v (--verbose)	Adds verbosity

To reconstruct the composition of the datasets from the BLAST homology searches, we can use taxid_dist.pl. This script will generate output files per requested taxonomic rank, up to the superkingdom rank.

To use taxid_dist.pl, type:

taxid_dist.pl \
   -n TaxDump/nodes.dmp \
   -a TaxDump/names.dmp \
   -b MEGABLAST/*.megablast \
   -e 1e-75 \
   -h 1 \
   -o SUMMARY \
   -r species genus family order class phylum \
   -v

Options for taxid_dist.pl are:

-n (--nodes)	NCBI nodes.dmp file 
-a (--names)	NCBI names.dmp
-b (--blast)	NCBI blast output file(s) in outfmt 6 format
-e (--evalue)	evalue cutoff [Default: 1e-75]
-h (--hits)	Number of BLAST hits to keep; top N hits [Default: 1]
-o (--outdir)	Output directory [Default: ./]
-r (--ranks)	Output files by taxonomic ranks [Default: species genus family order class]
		# Possible taxonomic rank options are:
		# subspecies strain species genus family order class phylum superkingdom 'no rank'
-v (--verbose)	Adds verbosity

The output of taxid_dist.pl should look like:

head -n 5 SUMMARY/*

==> SUMMARY/sample_1.fasta.megablast.class <==
class	TaxID	Number	Percent (total = 848)
Mollicutes	31969	848	100.00%

==> SUMMARY/sample_1.fasta.megablast.family <==
family	TaxID	Number	Percent (total = 848)
Mycoplasmataceae	2092	848	100.00%

==> SUMMARY/sample_1.fasta.megablast.genus <==
genus	TaxID	Number	Percent (total = 848)
Mycoplasmopsis	2767358	837	98.70%
Mycoplasma	2093	11	1.30%

==> SUMMARY/sample_1.fasta.megablast.order <==
order	TaxID	Number	Percent (total = 848)
Mycoplasmatales	2085	848	100.00%

==> SUMMARY/sample_1.fasta.megablast.phylum <==
phylum	TaxID	Number	Percent (total = 848)
Tenericutes	544448	848	100.00%

==> SUMMARY/sample_1.fasta.megablast.species <==
species	TaxID	Number	Percent (total = 848)
Mycoplasma fermentans	2115	677	79.83%
Mycoplasma arginini	2094	134	15.80%
Mycoplasma caviae	55603	22	2.59%
Mycoplasma canadense	29554	5	0.59%

==> SUMMARY/sample_2.fasta.megablast.class <==
class	TaxID	Number	Percent (total = 809)
Bacilli	91061	809	100.00%

==> SUMMARY/sample_2.fasta.megablast.family <==
family	TaxID	Number	Percent (total = 809)
Staphylococcaceae	90964	799	98.76%
Bacillaceae	186817	7	0.87%
Listeriaceae	186820	1	0.12%
Enterococcaceae	81852	1	0.12%

==> SUMMARY/sample_2.fasta.megablast.genus <==
genus	TaxID	Number	Percent (total = 809)
Staphylococcus	1279	796	98.39%
Bacillus	1386	3	0.37%
Oceanobacillus	182709	1	0.12%
Halobacillus	45667	1	0.12%

==> SUMMARY/sample_2.fasta.megablast.order <==
order	TaxID	Number	Percent (total = 809)
Bacillales	1385	807	99.75%
Lactobacillales	186826	2	0.25%

==> SUMMARY/sample_2.fasta.megablast.phylum <==
phylum	TaxID	Number	Percent (total = 809)
Firmicutes	1239	809	100.00%

==> SUMMARY/sample_2.fasta.megablast.species <==
species	TaxID	Number	Percent (total = 809)
Staphylococcus kloosii	29384	696	86.03%
Staphylococcus casei	201828	24	2.97%
Staphylococcus hominis	1290	20	2.47%
Staphylococcus succinus	61015	15	1.85%

To normalize the composition by number of rRNA copies, we can use get_organism_statistics.pl. This script will produce a file containing the rRNA-copy-normalized composition using the corresponding database for the taxonomic rank of the provided file.

The output will look similar to:

==> EXAMPLE2/Normalized/sample_2_fasta_megablast_class_Normalized_Microbiome_Composition.tsv <==
###Organism Name	TaxID	Taxo Level	Non-normalized % of sample	Normalized % of sample	Delta
Bacilli	91061	class	100.00	100.00	0.00

==> EXAMPLE2/Normalized/sample_2_fasta_megablast_family_Normalized_Microbiome_Composition.tsv <==
###Organism Name	TaxID	Taxo Level	Non-normalized % of sample	Normalized % of sample	Delta
Staphylococcaceae	90964	family	98.76	98.86	0.10
Bacillaceae	186817	family	0.87	0.61	-0.26
Aerococcaceae	186827	family	0.12	0.22	0.10
Listeriaceae	186820	family	0.12	0.17	0.05
Enterococcaceae	81852	family	0.12	0.14	0.02

==> EXAMPLE2/Normalized/sample_2_fasta_megablast_genus_Normalized_Microbiome_Composition.tsv <==
###Organism Name	TaxID	Taxo Level	Non-normalized % of sample	Normalized % of sample	Delta
Staphylococcus	1279	genus	98.39	98.51	0.12
Nosocomiicoccus	489909	genus	0.12	0.22	0.10
Salinicoccus	45669	genus	0.12	0.22	0.10
Bacillus	1386	genus	0.25	0.17	-0.08
Listeria	1637	genus	0.12	0.17	0.05
Halobacillus	45667	genus	0.12	0.12	0.00
Mammaliicoccus	2803850	genus	0.12	0.12	0.00
Alkalihalophilus	2893060	genus	0.12	0.11	-0.01
Vagococcus	2737	genus	0.12	0.11	-0.01

==> EXAMPLE2/Normalized/sample_2_fasta_megablast_order_Normalized_Microbiome_Composition.tsv <==
###Organism Name	TaxID	Taxo Level	Non-normalized % of sample	Normalized % of sample	Delta
Bacillales	1385	order	99.75	99.38	-0.37
Lactobacillales	186826	order	0.25	0.62	0.37

==> EXAMPLE2/Normalized/sample_2_fasta_megablast_species_Normalized_Microbiome_Composition.tsv <==
###Organism Name	TaxID	Taxo Level	Non-normalized % of sample	Normalized % of sample	Delta
Staphylococcus kloosii	29384	genus	85.91	80.61	-5.30
Staphylococcus hominis	1290	species	2.60	8.52	5.92
Staphylococcus casei	201828	genus	2.97	2.78	-0.19
Staphylococcus succinus	61015	genus	1.98	1.86	-0.12
Staphylococcus croceilyticus	319942	genus	0.74	0.70	-0.04
Staphylococcus ureilyticus	94138	genus	0.62	0.58	-0.04
Staphylococcus gallinarum	1293	genus	0.37	0.35	-0.02
Staphylococcus equorum	246432	genus	0.37	0.35	-0.02
Staphylococcus carnosus	1281	species	0.25	0.32	0.07

References

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990 Oct 5;215(3):403-10. PMID: 2231712 DOI: 10.1016/S0022-2836(05)80360-2.

Wood DE, Lu J, Langmead B. Improved metagenomic analysis with Kraken 2. Genome Biol. 2019 Nov 28;20(1):257. PMID: 31779668 PMCID: PMC6883579 DOI: 10.1186/s13059-019-1891-0.

Pearman WS, Freed NE, Silander OK Testing the advantages and disadvantages of short- and long- read eukaryotic metagenomics using simulated reads BMC Bioinformatics. 2020 May 29;21(1):220. PMID: 32471343 PMCID: PMC7257156 DOI: 10.1186/s12859-020-3528-4.

Lavrinienko A, Jernfors T, Koskimäki JJ, Pirttilä AM, Watts PC. Does Intraspecific Variation in rDNA Copy Number Affect Analysis of Microbial Communities? Trends Miccrobiol. 2021 Jan;29(1):19-27. PMID: 32593503 DOI: 10.1016/j.tim.2020.05.019.

id16s's People

Contributors

alexanderthomasjulian avatar pombert-jf avatar

Stargazers

 avatar

Watchers

 avatar  avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.