tseemann / ekidna Goto Github PK

View Code? Open in Web Editor NEW

25.0 6.0 3.0 4.2 MB

Assembly based core genome SNP alignments for bacteria

License: GNU General Public License v3.0

Perl 100.00%

ekidna's Introduction

⚠️ This software is still in early development

ekidna

Assembly based core genome SNP alignments

Introduction

In an ideal world, to determine a core genome amongst a set of genomes, we would perform a "multiple whole genome alignment" and extract the conserved sites (mono- and poly- morphic). Software like Mauve can do this, but it does not scale to more than 10s of genomes, due to the exponential computational need.

Instead, we usually choose a reference genome and align isolate genomes sequentially to the reference. The "genomes" could be already assembled genomes (contigs in FASTA) or raw sequencing data (reads in FASTQ). Tools like ParSNP and Roary can achive this using assemblies. Many SNP calling pipelines will combine SNPs into a core genome alignment. My SNP pipeline Snippy will accept both assemblies and reads, but internally shreds the assemblies into fake reads rather than use the contigs natively.

One of Heng Li's past experiments was fermikit which did "rough" de novo assemblies and then aligned them to the reference, and called SNPs. One of the advantages of this is improved calling of indels.

Another experiment

Here we present ekidna which will accept either reads or contigs. Reads will be assembled in a fast and conservative method into contigs. A reference will be chosen from the contig sets, and the remainder contigs will be directly aligned to the reference using minimap2 and variants called using paftools. VCF files will then be combined into a core genome alignment suitable for building phylogenies for population analysis.

Synposis

% ekidna -t -o outdir *.fna *_R1.fastq.gz
<snip>

% figtree outdir/ekidna.nwk
<admire the SNP resolution population structure>

Installation

Conda

Install Conda or Miniconda:

conda install -c bioconda ekidna  # COMING ONE DAY

Homebrew

Install HomeBrew (Mac OS X) or LinuxBrew (Linux).

brew install brewsci/bio/ekidna  # COMING ONE DAY

Source

This will install the latest version direct from Github. You'll need to add the Ekidna bin directory to your $PATH, and also ensure all the dependencies are installed.

cd $HOME
git clone https://github.com/tseemann/ekidna.git
$HOME/ekidna/bin/ekidna --help

Interface

USAGE
  ekidna [options] -o <outdir> <SAMPLE1 SAMPLE2 SAMPLE3 ...>
SAMPLES
  Contigs    contigs.{fna,gff,gbk}[.gz] (assembled genomes)
  Reads      R1.{fq,fastq}[.gz] (only want R1)
OPTIONS
  -h         Print this help
  -v         Print version and exit
  -q         No output while running, only errors
  -k         Keep intermediate files
  -o OUTDIR  Output folder [mandatory]
  -p PREFIX  Prefix for output files [ekidna]
  -j CPUS    Number of CPU threads to use [1]
  -m MINLEN  Minimum alignment size to consider [500]
  -a ASMCMD  Assember command [skesa ...]
  -t         Also build tree

Input files

Assembled genomes or contigs

FASTA, Genbank, EMBL, GFF ; optionally compressed with gzip, bzip2, zip

Read sequences

FASTQ ; optionally compressed with gzip
These will be assembled rapidly and roughly into contigs
Only one FASTQ file is accepted ; suggest _R1 if you have paired reads

Usage

% cd test

% ls
NC_018594.fna.gz  NC_021004.fna.gz  NC_021006.fna.gz  NC_021028.fna.gz
NC_021003.fna.gz  NC_021005.fna.gz  NC_021026.fna.gz

% ekidna -o outdir *.fna.gz
<snip>

% ls outdir
ekidna.aln  ekidna.fna  ekidna.full.aln  ekidna.log  ekidna.vcf

% bcftools stats outdir/ekidna.vcf | grep ^SN
SN      0       number of samples:      6
SN      0       number of records:      34157
SN      0       number of no-ALTs:      0
SN      0       number of SNPs: 34157
SN      0       number of MNPs: 0
SN      0       number of indels:       0
SN      0       number of others:       0
SN      0       number of multiallelic sites:   275
SN      0       number of multiallelic SNP sites:       275

% ekidna -t -o outdir_with_tree *.fna.gz
<snip>

% ls outdir_with_tree
ekidna.aln  ekidna.fna  ekidna.full.aln  ekidna.log  ekidna.nwk  ekidna.vcf

% nw_indent outdir_with_tree/ekidna.nwk
(
  1:0.0053451412,
  (
    2:0.0000468773,
    3:0.0000132048
  )100:0.0047928014,
  (
    (
      (
        4:0.0000018784,
        6:0.0000041298
      )100:0.0000329960,
      7:0.0000872871
    )100:0.0018254115,
    5:0.0014513321
  )100:0.0036613244
);

Output files

File	Contents	Format
`.log`	log file of all the message output of the pipeline commands	ASCII text
`.vcf`	multisample VCF file of SNPs found	VCF
`.fna`	reference genome chosen from largest of input genomes	FASTA
`.aln`	FASTA alignment of core genome SNPs	FASTA (aligned)
`.full.aln`	FASTA alignment of genomes relative to the `.fna` reference	FASTA (aligned)
`.nwk`	Tree built from `.full.aln` using `iqtree` GTR+G4 model	Newick

Dependencies

perl >= 5.18
minimap2 + paftools.js >= 2.13
samtools >= 1.9
bcftools >= 1.9
any2fasta >= 0.4
seqtk >= 1.2
snp-sites >= 2.0
bedtools >= 2.0

Etymology

The name Ekidna is named for the native Australian "spiny ant-eater" called an echidna. The are coated with coarse hair and spines, and like the platypus, are egg-laying mammals (monotreme). In other words, weird but cool.

License

Ekidna is free software, released under the GPL 3.0.

Issues

Please submit suggestions and bug reports to the Issue Tracker

Author

Torsten Seemann

ekidna's People

Contributors

Stargazers

Watchers

Forkers

vikash84 biologicalscientist xiangyang1984

ekidna's Issues

Fix multisample VCF for unaligned positions

Currently the ekidna.vcf is made by merging all the variant VCF files.
But these are ignorant of unaligned regions.
Need to somehow put . in the positions for samples with no consensus?

no SNP found

For the record, I ran Ekidna on a bunch of already-assembled bacterial genomes, and the output was:

[..]
[ekidna] Deriving core SNP alignment: ekidna.aln
[ekidna] Running: (snp-sites -c -o ekidna.aln ekidna.full.aln) 2>&1 | tee -a '/..]/ekidna.log'
Warning: No SNPs were detected so there is nothing to output.
[..]

With the following output files:

$ ll output/
total 13643341
drwxr-sr-x 2 rchikhi cifs- 120 Jul  3 10:15 .
drwxr-sr-x 3 rchikhi cifs- 45 Jul  2 12:21 ..
-rw-r--r-- 1 rchikhi cifs- 3179826 Jul  3 03:09 ekidna.fna
-rw-r--r-- 1 rchikhi cifs- 12375917205 Jul  3 07:52 ekidna.full.aln
-rw-r--r-- 1 rchikhi cifs- 853 Jul  3 03:09 ekidna.genome
-rw-r--r-- 1 rchikhi cifs- 8058765 Jul  3 08:20 ekidna.log

Suggestion - allow a reference to be set

Hi - really liking the tool! One suggested improvement would be to have an option to control which of the sequences gets set as your reference rather than defaulting to the largest sequence. If there is a reason this is a bad idea please let me know. I had a look at the code and from what I could tell the determination of the reference is using the snippet below. The $biggest_size variable looked to be only used to decide the $ref_index so it should be possible to select a reference sequence as the $ref_idx if provided. I wasn't sure if any of the other tools required the reference to be the largest sequence though.

  if ($size > $biggest_size) {
    $ref_idx = $N;
    $biggest_size = $size;
  }

The main reason for this suggestion is for when there is a reference sequence that has better annotations/ phenotypic data so understanding what has changed in relation to that sequence is useful.

Output names are just sequential integers

Need to map sample names to input files somehow.

Could use filename, but if it is just SAMPLE/contigs.fa will need folder name too

maybe accept -i sample_list.tab which has <filename> \t <label> and use labels (like ARIBA).

.vcf file only retained if -k option selected.

Not clear if this is intentional but if running ekidna without the -k option the .vcf file is removed from the output. In the readme it seems to indicate that this output should be kept.

readme example
% ls outdir
ekidna.aln ekidna.fna ekidna.full.aln ekidna.log ekidna.vcf

my output
$ ls no-k
test.aln test.fna test.full.aln test.genome test.log

also produces a .genome file which appears to be just an accession code and genome size.

Paftools -> Bcftools vcf.gz conversion fails

Hi!

Long time fan of your work here!

I have about 16 mitochondrial genome that I needed to find variants with, so I have been trying to use ekidna for it :)

However, I have been stuck in the loop for the conversion of the sample.paf -> sample.vcf.gz either in the ekidna pipeline or checking it manually.

While I'm attaching the log here, I'm also sharing the command that I used (going through the log and ekidna)

paftools call -l 500 -L 500 -s 3 -f ../15.fa 3.paf 2> /dev/null | bcftools view --types snps,mnps

The /dev/null redirection output:

349081 reference bases covered by exactly one contig
5 substitutions; ts/tv = 0.000
0 1bp deletions
1 1bp insertions
0 2bp deletions
0 2bp insertions
0 [3,50) deletions
0 [3,50) insertions
0 >=50 deletions
0 >=50 insertions

hits me up with this error:
Failed to open -: unknown file type

Following are the versions of the various tools in case that plays a role:

bcftools: 1.8
seqtk: 1.3-r106
skesa: v.2.2
iqtree: 1.6.7 (multi-core version)
bedtools: v2.27.1
v8: 3.16.4
snp-sites: 2.4.1

ekidna, any2fasta were obtained from your repos on March 11th.

I'll have to check the commit of minimap2 though.

Put gaps in the uncovered regions

I think paftools bedcov can do this.
Otherwise bedtools ?

Use CIGAR lines to find micro deletions relative to the reference

We currently use the unaligned regions and set them to gaps in the aligement (no coverage).

But the aligned regions could have small INS and DEL in them. Those DEL should be masked off in the respective sample.

Add "-i FOFN" option

some extra dependencies..

.. are not mentioned in the readme:

skesa
iqtree

also a paftools binary is required but it seems the standard name is paftools.js

Cheers!
Rayan

Make tree optional -t

Also add iqtree to docs

Paftools and k8 dependency names causing issues

The dependencies paftools and k8 can have the names paftools.js and k8-linux or k8-Darwin (macOSX) when downloaded. Unless these files are renamed to paftools and k8 then ekidna can't find and use them. Would it be possible to have the paftools.js and k8 variants recognised or mention the issue in the check dependencies error message?

STOCKHOLM 1.0

id1 AGTCGCGTCTTCTAT
id2 TGTCGCGTTTCTATA
id3 GTCGCGTCTTCTATG

Add TRAVIS tests

Support FASTQ files

Rapid skesa assembly.