
coati's Introduction


COATi - Codon-Aware Multiple Sequence Alignments

A statistical pairwise and multiple (in development) sequence aligner that is robust to artifacts, incorporates codon models, and supports complex indels.


Installation

Download

Source code for the most recent beta versions is available at https://github.com/CartwrightLab/coati/archive/master.tar.gz

Dependencies

Compiling

tar -xvzf coati*.tar.gz
meson setup builddir --buildtype=release
meson compile -C builddir

Global Install (requires root access)

tar -xvzf coati*.tar.gz
meson setup builddir --buildtype=release
meson compile -C builddir
meson install -C builddir

alignpair

Pairwise alignment of nucleotide sequences via composition of finite-state transducers (FSTs) or using Gotoh's algorithm. Substitution models available:

  • tri-mg94: Muse and Gaut codon model using finite-state transducers.
  • tri-ecm: empirical codon model using FSTs.
  • dna: DNA model obtained by marginalizing Muse and Gaut's codon model, using FSTs.
  • mar-mg94: marginal Muse and Gaut using Gotoh's algorithm.
  • mar-ecm: marginal empirical codon model using Gotoh's algorithm.

The indel model allows gaps of any length at any position. When insertions and deletions are contiguous, insertions always precede deletions, which eliminates equivalent alignments.
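For readers unfamiliar with Gotoh's algorithm, the following is a minimal Python sketch of affine-gap global alignment scoring with three dynamic programming matrices. It is not COATi's implementation: the function name, the simple match/mismatch scores, and the default gap values are all assumptions chosen for illustration, and the sketch does not use codon models or enforce the insertion-before-deletion ordering described above.

```python
NEG_INF = float("-inf")


def gotoh_score(a, b, match=1.0, mismatch=-1.0, gap_open=-2.0, gap_extend=-0.5):
    """Best global alignment score with affine gaps.

    A gap of length k scores gap_open + (k - 1) * gap_extend.
    All scores here are illustrative, not COATi's model parameters.
    """
    n, m = len(a), len(b)
    # M: alignment ends with a[i-1] paired to b[j-1]
    # X: alignment ends with a[i-1] paired to a gap
    # Y: alignment ends with b[j-1] paired to a gap
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + s
            X[i][j] = max(M[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend,
                          Y[i - 1][j] + gap_open)
            Y[i][j] = max(M[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend,
                          X[i][j - 1] + gap_open)
    return max(M[n][m], X[n][m], Y[n][m])
```

Because the gap state is tracked separately from the match state, extending an existing gap is cheaper than opening a new one, which is what lets a single long indel outscore several short ones.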

coati alignpair - pairwise alignment of nucleotide sequences

Usage: coati alignpair [OPTIONS] input

Positionals:
  input TEXT REQUIRED              Input file (FASTA/PHYLIP/JSON accepted)

Options:
  -h,--help                        Print this help message and exit
  -s,--score                       Score input alignment and exit
  -o,--output TEXT                 Alignment output file


Model parameters:
  -m,--model TEXT Excludes: --sub  Substitution model (dna tri-mg tri-ecm mar-mg mar-ecm)
  -t,--time FLOAT:POSITIVE         Evolutionary time/branch length
  -g,--gap-open FLOAT:POSITIVE     Gap opening score
  -e,--gap-extend FLOAT:POSITIVE   Gap extension score
  -k,--gap-len UINT                Gap unit length


Advanced options:
  --sub TEXT Excludes: --model     File with branch lengths and codon subst matrix
  -r,--ref TEXT Excludes: --rev-ref
                                   Name of reference sequence (default: 1st seq)
  -v,--rev-ref Excludes: --ref     Use 2nd seq as reference (default: 1st seq)
  -w,--omega FLOAT:POSITIVE        Nonsynonymous-synonymous bias
  -p,--pi FLOAT x 4                Nucleotide frequencies (A C G T)
  -x,--sigma FLOAT x 6             GTR sigma parameters (AC AG AT CG CT GT)
  -b,--base-error FLOAT:POSITIVE   Base calling error rate

Sample runs

# Align file example-001.fasta with marginal model (default) and output in JSON format (default)
coati alignpair sampledata/example-001.fasta

# Align file example-002.fasta with m-ecm model and output in fasta format
coati alignpair sampledata/example-002.fasta -m m-ecm -o example-002.fasta

# Align file example-003.fasta with ecm model, PHY output, and save alignment weight to w.out
coati alignpair sampledata/example-003.fasta -m ecm -l w.out

# Align file example-003.fasta with m-coati model, specifying a branch length of 0.0133 and underlying nucleotide frequencies A:0.3, C:0.2, G:0.2, T:0.3
coati alignpair sampledata/example-003.fasta -t 0.0133 -p 0.3 0.2 0.2 0.3

sample

Align a pair of sequences and sample alignments from Gotoh's dynamic programming matrices (a sampled alignment is not necessarily the best alignment). Output is always in JSON format (see Input output syntax) and provides a weight and log weight for each pairwise alignment sampled.

coati sample - align two sequences and sample alignments

Usage: coati sample [OPTIONS] input

Positionals:
  input TEXT REQUIRED              Input file (FASTA/PHYLIP/JSON accepted)

Options:
  -h,--help                        Print this help message and exit
  -o,--output TEXT                 Alignment output file
  -n,--sample-size UINT            Sample size
  -s,--seed TEXT ...               Space separated list of seed(s) used for sampling


Model parameters:
  -t,--time FLOAT:POSITIVE         Evolutionary time/branch length
  -m,--model TEXT Excludes: --sub  Substitution model (coati ecm dna m-coati m-ecm)
  -g,--gap-open FLOAT:POSITIVE     Gap opening score
  -e,--gap-extend FLOAT:POSITIVE   Gap extension score
  -k,--gap-len UINT                Gap unit length


Advanced options:
  --sub TEXT Excludes: --model     File with branch lengths and codon subst matrix
  -w,--omega FLOAT:POSITIVE        Nonsynonymous-synonymous bias
  -p,--pi FLOAT x 4                Nucleotide frequencies (A C G T)
  -x,--sigma FLOAT x 6             GTR sigma parameters (AC AG AT CG CT GT)

Sample runs

# Sample 100 alignments from example-003.fasta and save to `ex3_100.json`.
coati sample sampledata/example-003.fasta -n 100 -o ex3_100.json

# Sample 50 alignments from example-003.fasta specifying seed
coati sample sampledata/example-003.fasta -n 50 -s random42

genseed

Generate a random seed. Usage:

coati genseed

format

Convert between formats, and extract and/or reorder sequences. Accepted formats are FASTA, PHYLIP, and JSON. Output can be piped to other commands.

Additionally, coati format can adjust sequences aligned with coati alignpair or coati msa for use in downstream analyses while maintaining our model assumption that the reference is always in frame.

coati format - convert between formats, extract and/or reorder sequences

Usage: coati format [OPTIONS] input

Positionals:
  input TEXT REQUIRED              Input file (FASTA/PHYLIP/JSON accepted)

Options:
  -h,--help                        Print this help message and exit
  -o,--output TEXT                 Alignment output file
  -p,--preserve-phase              Preserve phase
  -c,--padding TEXT Needs: --preserve-phase
                                   Padding char to format preserve phase
  -s,--cut-seqs TEXT ... Excludes: --cut-pos
                                   Name of sequences to extract
  -x,--cut-pos UINT ... Excludes: --cut-seqs
                                   Position of sequences to extract (1 based)

Sample runs

# Extract first two sequences and align them with alignpair
coati format sampledata/example-msa-001.fasta -x 1 2 | coati alignpair json:-

# Extract sequences C B D (in that order) and convert to PHYLIP format
coati format sampledata/example-msa-002.fasta -s C B D -o ex-msa-2.phy

# Align then preserve phase using '?' character
coati alignpair sampledata/example-003.fasta | coati format json:- -p -c '?'

msa

Multiple sequence alignment (still in development).

coati msa - multiple sequence alignment of nucleotide sequences

Usage: coati msa [OPTIONS] input tree reference

Positionals:
  input TEXT REQUIRED              Input file (FASTA/PHYLIP/JSON accepted)
  tree TEXT:FILE REQUIRED          Newick phylogenetic tree
  reference TEXT REQUIRED          Name of reference sequence

Options:
  -h,--help                        Print this help message and exit
  -o,--output TEXT                 Alignment output file


Model parameters:
  -m,--model TEXT                  Substitution model (mar-mg94 mar-ecm)
  -g,--gap-open FLOAT:POSITIVE     Gap opening score
  -e,--gap-extend FLOAT:POSITIVE   Gap extension score
  -k,--gap-len UINT                Gap unit length


Advanced options:
  -w,--omega FLOAT:POSITIVE        Nonsynonymous-synonymous bias
  -p,--pi FLOAT x 4                Nucleotide frequencies (A C G T)
  -x,--sigma FLOAT x 6             GTR sigma parameters (AC AG AT CG CT GT)

Input output syntax

Syntax for input and output files is [format:]filename.extension, where format is optional (indicated by []). Format must be one of "fa" or "fasta" (FASTA format), "phy" (PHYLIP format), or "json" (JSON format). If no format is specified, the file extension must be one of the above. Alternatively, input can be of the form format:- or format:, in which case COATi reads from standard input (std::cin). This allows the different COATi commands to work in a pipeline.
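As an illustration of the rule above, here is a small Python sketch that splits such a specification into a format and a path. The function name parse_io_spec and the FORMATS table are hypothetical helpers written for this example; COATi's own parser may resolve edge cases differently.

```python
import os

# Accepted format tags and the file format each one denotes (from the text above).
FORMATS = {"fa": "fasta", "fasta": "fasta", "phy": "phylip", "json": "json"}


def parse_io_spec(spec):
    """Split '[format:]filename.extension' into (format, path).

    A path of None means the data comes from a stream ('format:-' or 'format:').
    """
    prefix, sep, rest = spec.partition(":")
    if sep and prefix.lower() in FORMATS:
        # Explicit format prefix; an empty name or '-' selects the stream.
        path = None if rest in ("", "-") else rest
        return FORMATS[prefix.lower()], path
    # No recognized prefix: fall back to the file extension.
    ext = os.path.splitext(spec)[1].lstrip(".").lower()
    if ext not in FORMATS:
        raise ValueError(f"cannot determine format of {spec!r}")
    return FORMATS[ext], spec
```

With this sketch, "json:-" resolves to JSON read from a stream, "phy:aln.txt" to a PHYLIP file named aln.txt, and "ex-msa-2.phy" to a PHYLIP file identified by its extension.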

JSON output

The JSON format for input and output (except for sample) is as follows:

{
  "data": {
    "names": ["A","B","C","D","E"],
    "seqs":["TCA--TCG","TCA-GTCG","T-A--TCG","TCAC-TCG","TCA--TC-"]
  }
}
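A short Python sketch of reading this layout into a name-to-sequence mapping may help when post-processing alignments. The helper names load_sequences and is_aligned are assumptions made for this example and are not part of COATi.

```python
import json


def load_sequences(text):
    """Return {name: sequence} from the 'data' object shown above."""
    data = json.loads(text)["data"]
    names, seqs = data["names"], data["seqs"]
    if len(names) != len(seqs):
        raise ValueError("'names' and 'seqs' must have the same length")
    return dict(zip(names, seqs))


def is_aligned(seqs_by_name):
    """True when every sequence has the same length (gaps included)."""
    return len({len(s) for s in seqs_by_name.values()}) <= 1


example = json.dumps({
    "data": {
        "names": ["A", "B", "C", "D", "E"],
        "seqs": ["TCA--TCG", "TCA-GTCG", "T-A--TCG", "TCAC-TCG", "TCA--TC-"],
    }
})
records = load_sequences(example)
```

The names and sequences are parallel arrays, so the order of "names" determines which sequence each name refers to.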

The output from sample is always in the following JSON format:

[
  {
    "aln": {
      "1": "CTCTGGATAGTG",
      "2": "CT----ATAGTG"
    },
    "weight": 0.994785,
    "log_weight": -0.00522821
  },
  {
    "aln": {
      "1": "CTCTGGATAGTG",
      "2": "CT----ATAGTG"
    },
    "weight": 0.994785,
    "log_weight": -0.00522821
  },
  {
    "aln": {
      "1": "CTCTGGATAGTG",
      "2": "CT----ATAGTG"
    },
    "weight": 0.994785,
    "log_weight": -0.00522821
  }
]
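Since the same alignment can be sampled more than once, as in the output above, it can be useful to tally distinct alignments. The following Python sketch does that; the function name tally_samples is hypothetical, and the keys "1" and "2" are taken from the example rather than from a formal specification.

```python
import json
from collections import Counter


def tally_samples(text):
    """Count how often each distinct pairwise alignment was sampled.

    Returns a list of ((seq1, seq2), count), most frequent first.
    """
    samples = json.loads(text)
    counts = Counter((s["aln"]["1"], s["aln"]["2"]) for s in samples)
    return counts.most_common()


# Three identical samples, mirroring the example output above.
example = json.dumps([
    {
        "aln": {"1": "CTCTGGATAGTG", "2": "CT----ATAGTG"},
        "weight": 0.994785,
        "log_weight": -0.00522821,
    }
] * 3)
summary = tally_samples(example)
```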

PHYLIP

PHYLIP format limits names to 10 characters and uses interleaved format:

3 110
A         CCTACTCACTAGGAACAATTCTAAGGTTAT--------------------
B         CCTACTCACGACGAACAAGTTTAAGGTTATATGACACTTCTCCGATTGTC
VeryLongNaCCTACTCACTAGGAACAATTCTAAGGTTAT--------------------

-----GTTCCACTTCACCAAGTGTCGCATGTCGTCCACAGGTGTTCATTG----GCTCAA
GCAAGG-------------------------CGTCTACAGGTGTTCATTGTTACG----A
-----GTTCCACTTCACCAAGTGTCGCATGTCGTCCACAGGTGTTCATTGTTACG----A
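To make the layout concrete, here is a minimal Python sketch of a reader for interleaved PHYLIP with a fixed 10-character name field. It is an illustration only: parse_interleaved_phylip is a hypothetical helper, it assumes every block lists the sequences in the same order, and it uses a small synthetic input rather than the example above.

```python
def parse_interleaved_phylip(text):
    """Parse interleaved PHYLIP into {name: sequence}.

    Assumes names occupy exactly the first 10 columns of the first block.
    """
    lines = text.splitlines()
    ntax, nchar = (int(x) for x in lines[0].split())
    rows = [ln for ln in lines[1:] if ln.strip()]  # drop blank separators

    names, seqs = [], []
    for ln in rows[:ntax]:
        names.append(ln[:10].strip())
        seqs.append(ln[10:].replace(" ", ""))
    # Later blocks carry sequence data only, in the same order as the first.
    for k, ln in enumerate(rows[ntax:]):
        seqs[k % ntax] += ln.replace(" ", "")

    if any(len(s) != nchar for s in seqs):
        raise ValueError("sequence length does not match the header")
    return dict(zip(names, seqs))


# Synthetic input: two sequences of length 12 split into two blocks of 6.
example = "\n".join([
    "2 12",
    "A".ljust(10) + "ACGTAC",
    "LongNameXX" + "ACG-AC",  # a 10-character name leaves no space before the data
    "",
    "GTACGT",
    "GT--GT",
])
parsed = parse_interleaved_phylip(example)
```

A name longer than 10 characters is cut to fit the field, so the sequence data starts immediately after it, as with "VeryLongNa" in the example above.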


coati's Issues

JSON Format

I think we need more consistency between the JSON formats produced by alignpair and sample.

I think this should be our core alignment format:

{
  "aln": {
    "1": "CTCTGGATAGTG",
    "2": "CT----ATAGTG"
  },
  "score": 0.994785,
  "log_score": -0.00522821
}

With sample outputting an array of these.

I chose to use score instead of weight here under the assumption that alignpair will get used more often than sample. But we should use the same name in both outputs even if they are slightly different.

Docker Image

Hey there 👋

I took the liberty of making a Docker image of this tool since I needed one. The image is on DockerHub, and the Dockerfile is below for anyone who wants it. It's pretty straightforward, but it ensures that the final image contains only the compiled binary.

FROM debian:stable as build

RUN apt-get update

RUN apt-get install -y git gcc meson ninja-build wget

WORKDIR /usr/local/bin/

RUN wget https://boostorg.jfrog.io/artifactory/main/release/1.82.0/source/boost_1_82_0.tar.gz

RUN tar -xvzf boost_1_82_0.tar.gz 

ARG BOOST_ROOT=/usr/local/bin/boost_1_82_0/

WORKDIR /usr/local/bin/coati

RUN git clone https://github.com/CartwrightLab/coati.git .

RUN meson setup builddir --buildtype=release

RUN meson compile -C builddir

ARG DESTDIR=/usr/local/bin/coati_build/

RUN meson install -C builddir

FROM debian:stable as dist

COPY --from=build /usr/local/bin/coati_build/usr/local/bin/coati /usr/local/bin/coati

ENV PATH="${PATH}:/usr/local/bin/"

Error when generating src/version.h

Both meson compile and meson test generate the following message:

[1/3] Generating src/version.h with a custom command
fatal: No names found, cannot describe anything.

Is this a git-describe message?

msa example?

Hi, can you post an example for using the msa? I'm getting hung up on how to specify the reference sequence. I have the sequence in a file called ref.fa and I'm getting an ERROR: Node ref.fa not found. with this command: coati msa -m mar-ecm -o global_tps global_tps.fa global_tps.tre ref.fa

Thanks!
