Code Monkey home page Code Monkey logo

ekidna's Introduction

Echidna cartoon Build Status License: GPL v3 Don't judge me

โš ๏ธ This software is still in early development

ekidna

Assembly based core genome SNP alignments

Introduction

In an ideal world, to determine a core genome amongst a set of genomes, we would perform a "multiple whole genome alignment" and extract the conserved sites (mono- and poly- morphic). Software like Mauve can do this, but it does not scale to more than 10s of genomes, due to the exponential computational need.

Instead, we usually choose a reference genome and align isolate genomes sequentially to the reference. The "genomes" could be already assembled genomes (contigs in FASTA) or raw sequencing data (reads in FASTQ). Tools like ParSNP and Roary can achive this using assemblies. Many SNP calling pipelines will combine SNPs into a core genome alignment. My SNP pipeline Snippy will accept both assemblies and reads, but internally shreds the assemblies into fake reads rather than use the contigs natively.

One of Heng Li's past experiments was fermikit which did "rough" de novo assemblies and then aligned them to the reference, and called SNPs. One of the advantages of this is improved calling of indels.

Another experiment

Here we present ekidna which will accept either reads or contigs. Reads will be assembled in a fast and conservative method into contigs. A reference will be chosen from the contig sets, and the remainder contigs will be directly aligned to the reference using minimap2 and variants called using paftools. VCF files will then be combined into a core genome alignment suitable for building phylogenies for population analysis.

Synposis

% ekidna -t -o outdir *.fna *_R1.fastq.gz
<snip>

% figtree outdir/ekidna.nwk
<admire the SNP resolution population structure>

Installation

Conda

Install Conda or Miniconda:

conda install -c bioconda ekidna  # COMING ONE DAY

Homebrew

Install HomeBrew (Mac OS X) or LinuxBrew (Linux).

brew install brewsci/bio/ekidna  # COMING ONE DAY

Source

This will install the latest version direct from Github. You'll need to add the Ekidna bin directory to your $PATH, and also ensure all the dependencies are installed.

cd $HOME
git clone https://github.com/tseemann/ekidna.git
$HOME/ekidna/bin/ekidna --help

Interface

USAGE
  ekidna [options] -o <outdir> <SAMPLE1 SAMPLE2 SAMPLE3 ...>
SAMPLES
  Contigs    contigs.{fna,gff,gbk}[.gz] (assembled genomes)
  Reads      R1.{fq,fastq}[.gz] (only want R1)
OPTIONS
  -h         Print this help
  -v         Print version and exit
  -q         No output while running, only errors
  -k         Keep intermediate files
  -o OUTDIR  Output folder [mandatory]
  -p PREFIX  Prefix for output files [ekidna]
  -j CPUS    Number of CPU threads to use [1]
  -m MINLEN  Minimum alignment size to consider [500]
  -a ASMCMD  Assember command [skesa ...]
  -t         Also build tree

Input files

Assembled genomes or contigs

  • FASTA, Genbank, EMBL, GFF ; optionally compressed with gzip, bzip2, zip

Read sequences

  • FASTQ ; optionally compressed with gzip
  • These will be assembled rapidly and roughly into contigs
  • Only one FASTQ file is accepted ; suggest _R1 if you have paired reads

Usage

% cd test

% ls
NC_018594.fna.gz  NC_021004.fna.gz  NC_021006.fna.gz  NC_021028.fna.gz
NC_021003.fna.gz  NC_021005.fna.gz  NC_021026.fna.gz

% ekidna -o outdir *.fna.gz
<snip>

% ls outdir
ekidna.aln  ekidna.fna  ekidna.full.aln  ekidna.log  ekidna.vcf

% bcftools stats outdir/ekidna.vcf | grep ^SN
SN      0       number of samples:      6
SN      0       number of records:      34157
SN      0       number of no-ALTs:      0
SN      0       number of SNPs: 34157
SN      0       number of MNPs: 0
SN      0       number of indels:       0
SN      0       number of others:       0
SN      0       number of multiallelic sites:   275
SN      0       number of multiallelic SNP sites:       275

% ekidna -t -o outdir_with_tree *.fna.gz
<snip>

% ls outdir_with_tree
ekidna.aln  ekidna.fna  ekidna.full.aln  ekidna.log  ekidna.nwk  ekidna.vcf

% nw_indent outdir_with_tree/ekidna.nwk
(
  1:0.0053451412,
  (
    2:0.0000468773,
    3:0.0000132048
  )100:0.0047928014,
  (
    (
      (
        4:0.0000018784,
        6:0.0000041298
      )100:0.0000329960,
      7:0.0000872871
    )100:0.0018254115,
    5:0.0014513321
  )100:0.0036613244
);

Output files

File Contents Format
.log log file of all the message output of the pipeline commands ASCII text
.vcf multisample VCF file of SNPs found VCF
.fna reference genome chosen from largest of input genomes FASTA
.aln FASTA alignment of core genome SNPs FASTA (aligned)
.full.aln FASTA alignment of genomes relative to the .fna reference FASTA (aligned)
.nwk Tree built from .full.aln using iqtree GTR+G4 model Newick

Dependencies

  • perl >= 5.18
  • minimap2 + paftools.js >= 2.13
  • samtools >= 1.9
  • bcftools >= 1.9
  • any2fasta >= 0.4
  • seqtk >= 1.2
  • snp-sites >= 2.0
  • bedtools >= 2.0

Etymology

The name Ekidna is named for the native Australian "spiny ant-eater" called an echidna. The are coated with coarse hair and spines, and like the platypus, are egg-laying mammals (monotreme). In other words, weird but cool.

License

Ekidna is free software, released under the GPL 3.0.

Issues

Please submit suggestions and bug reports to the Issue Tracker

Author

Torsten Seemann

ekidna's People

Contributors

tseemann avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar

ekidna's Issues

Fix multisample VCF for unaligned positions

Currently the ekidna.vcf is made by merging all the variant VCF files.
But these are ignorant of unaligned regions.
Need to somehow put . in the positions for samples with no consensus?

no SNP found

For the record, I ran Ekidna on a bunch of already-assembled bacterial genomes, and the output was:

[..]
[ekidna] Deriving core SNP alignment: ekidna.aln
[ekidna] Running: (snp-sites -c -o ekidna.aln ekidna.full.aln) 2>&1 | tee -a '/..]/ekidna.log'
Warning: No SNPs were detected so there is nothing to output.
[..]

With the following output files:

$ ll output/
total 13643341
drwxr-sr-x 2 rchikhi cifs- 120 Jul  3 10:15 .
drwxr-sr-x 3 rchikhi cifs- 45 Jul  2 12:21 ..
-rw-r--r-- 1 rchikhi cifs- 3179826 Jul  3 03:09 ekidna.fna
-rw-r--r-- 1 rchikhi cifs- 12375917205 Jul  3 07:52 ekidna.full.aln
-rw-r--r-- 1 rchikhi cifs- 853 Jul  3 03:09 ekidna.genome
-rw-r--r-- 1 rchikhi cifs- 8058765 Jul  3 08:20 ekidna.log

Suggestion - allow a reference to be set

Hi - really liking the tool! One suggested improvement would be to have an option to control which of the sequences gets set as your reference rather than defaulting to the largest sequence. If there is a reason this is a bad idea please let me know. I had a look at the code and from what I could tell the determination of the reference is using the snippet below. The $biggest_size variable looked to be only used to decide the $ref_index so it should be possible to select a reference sequence as the $ref_idx if provided. I wasn't sure if any of the other tools required the reference to be the largest sequence though.

  if ($size > $biggest_size) {
    $ref_idx = $N;
    $biggest_size = $size;
  }

The main reason for this suggestion is for when there is a reference sequence that has better annotations/ phenotypic data so understanding what has changed in relation to that sequence is useful.

Output names are just sequential integers

Need to map sample names to input files somehow.

Could use filename, but if it is just SAMPLE/contigs.fa will need folder name too

maybe accept -i sample_list.tab which has <filename> \t <label> and use labels (like ARIBA).

.vcf file only retained if -k option selected.

Not clear if this is intentional but if running ekidna without the -k option the .vcf file is removed from the output. In the readme it seems to indicate that this output should be kept.

readme example
% ls outdir
ekidna.aln ekidna.fna ekidna.full.aln ekidna.log ekidna.vcf

my output
$ ls no-k
test.aln test.fna test.full.aln test.genome test.log

also produces a .genome file which appears to be just an accession code and genome size.

Paftools -> Bcftools vcf.gz conversion fails

Hi!

Long time fan of your work here!

I have about 16 mitochondrial genome that I needed to find variants with, so I have been trying to use ekidna for it :)

However, I have been stuck in the loop for the conversion of the sample.paf -> sample.vcf.gz either in the ekidna pipeline or checking it manually.

While I'm attaching the log here, I'm also sharing the command that I used (going through the log and ekidna)

paftools call -l 500 -L 500 -s 3 -f ../15.fa 3.paf 2> /dev/null | bcftools view --types snps,mnps

The /dev/null redirection output:

349081 reference bases covered by exactly one contig
5 substitutions; ts/tv = 0.000
0 1bp deletions
1 1bp insertions
0 2bp deletions
0 2bp insertions
0 [3,50) deletions
0 [3,50) insertions
0 >=50 deletions
0 >=50 insertions

hits me up with this error:
Failed to open -: unknown file type

Following are the versions of the various tools in case that plays a role:

bcftools: 1.8
seqtk: 1.3-r106
skesa: v.2.2
iqtree: 1.6.7 (multi-core version)
bedtools: v2.27.1
v8: 3.16.4
snp-sites: 2.4.1

ekidna, any2fasta were obtained from your repos on March 11th.

I'll have to check the commit of minimap2 though.

some extra dependencies..

.. are not mentioned in the readme:

  • skesa
  • iqtree

also a paftools binary is required but it seems the standard name is paftools.js

Cheers!
Rayan

Paftools and k8 dependency names causing issues

The dependencies paftools and k8 can have the names paftools.js and k8-linux or k8-Darwin (macOSX) when downloaded. Unless these files are renamed to paftools and k8 then ekidna can't find and use them. Would it be possible to have the paftools.js and k8 variants recognised or mention the issue in the check dependencies error message?

faster infile processing

I specified 32 CPUs with -j 32 but it looks like any2fasta calls were done sequentially. Would you consider parallelizing that step or making an option to skip it?

For 1445 FNA files without lowercase or ambiguous sites, it took 30 min to save new FA files.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.