ancseq

Version 1.2.1

What is ancseq?

Ancestral sequence reconstruction is a technique for inferring ancestral states from a multiple sequence alignment. ancseq is a wrapper tool that reconstructs ancestral sequences using IQ-TREE. For details of each step, see the "Workflow in ancseq" section.

Citation

  • Under preparation...

Installation

Dependencies

Software

Python (>=3.5) libraries

Installation using conda

You can install ancseq together with its dependencies using conda.

git clone https://github.com/YuSugihara/ancseq.git
cd ancseq
conda env create -f ancseq.yml
conda activate ancseq
pip install .

Usage

$ ancseq -h
usage: ancseq -s <ALIGNED_FASTA> -m <MODE> -o <OUT_DIR> [-t <INT>]

ancseq version 1.2.1

options:
  -h, --help         show this help message and exit
  -s , --seq         Sequence alignment in FASTA format.
  -m , --mode        Sequence type. [DNA/AA/CODON]
  -o , --out         Output directory. The given name must not exist.
  -t , --threads     Number of threads. [8]
  -b , --bootstrap   Replicate for bootstrap. [1000]
  --max-report       Maximum number of ambiguous sites to report at the same position. [5]
  --min-prob         Minimum probability of being reported as an ambiguous site. [0.05]
  --min-gap-prob     Minimum probability of replacing the ancestral state with a gap. [0.5]
  --fast             Use -fast option in IQ-TREE [FALSE]
  --model            Specify substitution model for IQ-TREE. IQ-TREE searches the best substitution
                     model using ModelFinder in default [MFP]
  --outgroup         Specify outgroup for IQ-TREE. [None]
  --stop-codon-prob  Stop calculation of codon probabilities in DNA mode [FALSE]
  --asr-only         Skip building tree and reconstruct ancestral states only [FALSE]
  -v, --version      show program's version number and exit

We recommend specifying the outgroup to avoid misinterpreting the ancestral states of nodes. See Example 4 for details.
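The `--max-report` and `--min-prob` options in the help text can be read as a simple filter over the state probabilities at one site. The sketch below is one interpretation of those option descriptions, not ancseq's actual code; the function name `report_states` and the dictionary input are hypothetical.

```python
def report_states(probs, max_report=5, min_prob=0.05):
    """Return up to max_report (state, probability) pairs with
    probability >= min_prob, ordered from most to least probable."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept = [(state, p) for state, p in ranked if p >= min_prob]
    return kept[:max_report]


# Made-up probabilities for a single alignment site.
site = {"A": 0.70, "C": 0.20, "G": 0.07, "T": 0.03}
print(report_states(site))  # T is dropped because 0.03 < min_prob
```

Whether ancseq compares with `>=` or `>` at the threshold is an assumption here.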

Example 1 : Running ancseq for nucleotide sequence alignment

ancseq -s test_nuc.fasta \
       -m DNA \
       -o out_dir

-s : Nucleotide sequence alignment in fasta format.

-m : Sequence type.

-o : Name of the output directory. A directory with this name must not already exist.

Example 2 : Running ancseq for amino acid sequence alignment

ancseq -s test_aa.fasta \
       -m AA \
       -o out_dir

-s : Amino acid sequence alignment in fasta format.

-m : Sequence type.

-o : Name of the output directory. A directory with this name must not already exist.

Example 3 : Running ancseq for codon sequence alignment

!!!WARNING!!! IQ-TREE implements codon substitution models. However, building a phylogenetic tree with them might take too long, depending on the input alignment. In that case, we recommend running ancseq in DNA mode. ancseq can calculate the probability of each codon in DNA mode.

ancseq -s test_codon.fasta \
       -m CODON \
       -o out_dir

-s : Codon sequence alignment in fasta format.

-m : Sequence type.

-o : Name of the output directory. A directory with this name must not already exist.

Example 4 : Running ancseq with an outgroup specified

You can reconstruct ancestral states without specifying an outgroup. However, the ancestral states of a node may then be misinterpreted when you visualize the tree, so we recommend specifying the outgroup. Note that IQ-TREE converts a rooted tree to an unrooted tree by default.

ancseq -s test_nuc.fasta \
       -m DNA \
       --outgroup seq_id \
       -o out_dir

-s : Nucleotide sequence alignment in fasta format.

-m : Sequence type.

--outgroup : Sequence ID of outgroup.

-o : Name of the output directory. A directory with this name must not already exist.

Example 5 : Running ancseq with the --fast option

ancseq -s test_nuc.fasta \
       -m DNA \
       -o out_dir \
       --fast

-s : Nucleotide sequence alignment in fasta format.

-m : Sequence type.

-o : Name of the output directory. A directory with this name must not already exist.

--fast : Use -fast option in IQ-TREE.

Outputs

The contents of OUT_DIR look like this:

├── 00_tree
│  ├── 00_iqtree.err
│  ├── 00_iqtree.out
│  ├── test_nuc.fasta
│  ├── test_nuc.fasta.bionj
│  ├── test_nuc.fasta.ckp.gz
│  ├── test_nuc.fasta.contree
│  ├── test_nuc.fasta.iqtree
│  ├── test_nuc.fasta.log
│  ├── test_nuc.fasta.mldist
│  ├── test_nuc.fasta.model.gz
│  ├── test_nuc.fasta.splits.nex
│  └── test_nuc.fasta.treefile
├── 10_asr
│  ├── 10_iqtree.err
│  ├── 10_iqtree.out
│  ├── test_nuc.fasta
│  ├── test_nuc.fasta.ckp.gz
│  ├── test_nuc.fasta.iqtree
│  ├── test_nuc.fasta.log
│  ├── test_nuc.fasta.state.gz
│  └── test_nuc.fasta.treefile
├── 20_indels
│  ├── 20_iqtree.err
│  ├── 20_iqtree.out
│  ├── test_nuc.fasta.binary
│  ├── test_nuc.fasta.binary.ckp.gz
│  ├── test_nuc.fasta.binary.iqtree
│  ├── test_nuc.fasta.binary.log
│  ├── test_nuc.fasta.binary.state.gz
│  └── test_nuc.fasta.binary.treefile
└── 30_result
   ├── ancestral_state_result.treefile
   ├── ancestral_state_result.fasta
   ├── ancestral_state_result_with_gap.fasta
   ├── ancestral_state_result.sort.tsv
   ├── ancestral_state_result.tsv.gz
   └── ancestral_state_result.codon_prob.tsv.gz
  • The phylogenetic tree reconstructed by IQ-TREE can be found in 00_tree.
  • The results of the ancestral sequence reconstruction can be found in 30_result.
    • ancestral_state_result.treefile: Phylogenetic tree with the node labels.
    • ancestral_state_result.fasta: FASTA file of the ancestral sequences without gaps.
    • ancestral_state_result_with_gap.fasta: FASTA file of the ancestral sequences with gaps.
    • ancestral_state_result.sort.tsv: Probabilities of the ancestral states.
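The FASTA outputs in 30_result can be loaded with a few lines of Python for downstream checks. This is a minimal sketch with a hypothetical helper `read_fasta`; the demo file and its node labels are made up and stand in for ancestral_state_result_with_gap.fasta.

```python
from pathlib import Path
import tempfile


def read_fasta(path):
    """Parse a FASTA file into a dict of {sequence ID: sequence}."""
    records = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Use the first word of the header as the sequence ID.
                name = line[1:].split()[0]
                records[name] = []
            elif name is not None:
                records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


# Demo on a temporary file with made-up content.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.fasta"
    demo.write_text(">Node1\nATG-CA\n>Node2\nATGGCA\n")
    seqs = read_fasta(demo)

for name, seq in seqs.items():
    # Print the ID, the aligned length, and the number of gap characters.
    print(name, len(seq), seq.count("-"))
```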

Workflow in ancseq

IQ-TREE command 1

iqtree -s ${INPUT_FASTA} \
       -st ${SEQ_TYPE} \
       -T ${NUM_THREADS} \
       -B ${NUM_BOOTSTRAP} \
       -m MFP \
       1> /OUT_DIR/00_tree/00_iqtree.out \
       2> /OUT_DIR/00_tree/00_iqtree.err

-s : Sequence alignment in fasta format.

-st : Sequence type.

-T : Number of threads.

-B : Replicates for ultrafast bootstrap.

-m MFP : Extended model selection followed by tree inference.

If you specify the --fast option in ancseq, IQ-TREE command 1 changes as follows.

iqtree -s ${INPUT_FASTA} \
       -st ${SEQ_TYPE} \
       -T ${NUM_THREADS} \
       --alrt ${NUM_BOOTSTRAP} \
       -m MFP \
       --fast \
       1> /OUT_DIR/00_tree/00_iqtree.out \
       2> /OUT_DIR/00_tree/00_iqtree.err

-s : Sequence alignment in fasta format.

-st : Sequence type.

-T : Number of threads.

--alrt : Replicates for SH approximate likelihood ratio test.

-m MFP : Extended model selection followed by tree inference.

--fast : Fast search to resemble FastTree.
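The two variants of IQ-TREE command 1 differ only in the support option and the extra `--fast` flag. The sketch below assembles the argument list for either case; `build_tree_command` is a hypothetical helper written for illustration and is not taken from the ancseq source.

```python
def build_tree_command(fasta, seq_type, threads=8, bootstrap=1000,
                       model="MFP", fast=False):
    """Assemble the argument list for IQ-TREE command 1 as described above."""
    cmd = ["iqtree", "-s", fasta, "-st", seq_type, "-T", str(threads)]
    if fast:
        # --fast variant: SH-aLRT replicates instead of ultrafast bootstrap.
        cmd += ["--alrt", str(bootstrap), "-m", model, "--fast"]
    else:
        cmd += ["-B", str(bootstrap), "-m", model]
    return cmd


print(" ".join(build_tree_command("test_nuc.fasta", "DNA")))
print(" ".join(build_tree_command("test_nuc.fasta", "DNA", fast=True)))
```

A list like this could be passed to `subprocess.run(cmd, stdout=out_file, stderr=err_file, check=True)` to reproduce the redirection to 00_iqtree.out and 00_iqtree.err.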

IQ-TREE command 2

iqtree -asr \
       -s ${INPUT_FASTA} \
       -te /OUT_DIR/00_tree/${INPUT_FASTA}.treefile \
       -st ${SEQ_TYPE} \
       -T ${NUM_THREADS} \
       -m ${MODEL} \
       -o ${OUTGROUP} \
       -keep_empty_seq \
       1> /OUT_DIR/10_asr/10_iqtree.out \
       2> /OUT_DIR/10_asr/10_iqtree.err

-asr : Ancestral state reconstruction by empirical Bayes.

-s : Sequence alignment in fasta format.

-te : Tree file.

-st : Sequence type.

-T : Number of threads.

-m : Model name.

-o : Sequence ID of the outgroup if you specify the outgroup with --outgroup.

-keep_empty_seq : Keep empty sequences in the alignment.
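IQ-TREE command 2 writes the posterior state probabilities to 10_asr/test_nuc.fasta.state.gz. The following sketch parses such a table, assuming a tab-separated layout with `#` comment lines and the columns `Node`, `Site`, `State`, and one `p_<state>` column per state; the demo data are made up, and `read_state_table` is a hypothetical helper.

```python
import csv
import io


def read_state_table(handle):
    """Yield (node, site, state, {state: probability}) for each table row."""
    # Skip the comment lines that precede the header row.
    rows = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(rows, delimiter="\t")
    for row in reader:
        probs = {key[2:]: float(value)
                 for key, value in row.items() if key.startswith("p_")}
        yield row["Node"], int(row["Site"]), row["State"], probs


# Made-up demo table in the assumed layout.
demo = (
    "# Ancestral state reconstruction (demo data)\n"
    "Node\tSite\tState\tp_A\tp_C\tp_G\tp_T\n"
    "Node1\t1\tA\t0.90\t0.05\t0.03\t0.02\n"
    "Node1\t2\tT\t0.10\t0.10\t0.10\t0.70\n"
)
records = list(read_state_table(io.StringIO(demo)))
print(records[0])
```

For the real gzip-compressed output, the handle could be opened with `gzip.open(path, "rt")` instead of `io.StringIO`.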

IQ-TREE command 3

iqtree -asr \
       -s ${INPUT_FASTA}.binary \
       -te /OUT_DIR/00_tree/${INPUT_FASTA}.treefile \
       -st BIN \
       -T ${NUM_THREADS} \
       -blfix \
       -m JC2 \
       -o ${OUTGROUP} \
       -keep_empty_seq \
       1> /OUT_DIR/20_indels/20_iqtree.out \
       2> /OUT_DIR/20_indels/20_iqtree.err

-asr : Ancestral state reconstruction by empirical Bayes.

-s : Sequence alignment in fasta format.

-te : Tree file.

-st BIN : Binary sequence type.

-T : Number of threads.

-blfix : Fix branch lengths of tree passed via -t or -te.

-m JC2 : Jukes-Cantor type binary model.

-o : Sequence ID of the outgroup if you specify the outgroup with --outgroup.

-keep_empty_seq : Keep empty sequences in the alignment.
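IQ-TREE command 3 reconstructs indels from a binary version of the alignment, and `--min-gap-prob` controls when an ancestral state is replaced with a gap. The sketch below illustrates that idea under stated assumptions: gaps are coded as `0` and residues as `1`, and a state is replaced when its gap probability is at least the threshold. Both helpers are hypothetical, and ancseq's actual encoding may differ.

```python
def to_binary(seq, gap_chars="-"):
    """Encode a sequence as presence/absence: '1' for a residue, '0' for a gap."""
    return "".join("0" if char in gap_chars else "1" for char in seq)


def apply_gaps(ancestral_seq, gap_probs, min_gap_prob=0.5):
    """Replace a state with '-' where its gap probability reaches min_gap_prob."""
    return "".join(
        "-" if prob >= min_gap_prob else char
        for char, prob in zip(ancestral_seq, gap_probs)
    )


print(to_binary("ATG--CA"))  # 1110011
print(apply_gaps("ATGGCA", [0.0, 0.1, 0.9, 0.6, 0.4, 0.0]))  # AT--CA
```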

References

  1. Aadland K, Pugh C, Kolaczkowski B. 2019. High-Throughput Reconstruction of Ancestral Protein Sequence, Structure, and Molecular Function. In: Sikosek T ed. Computational Methods in Protein Evolution. Methods in Molecular Biology. New York, NY: Springer, 135–170. DOI: 10.1007/978-1-4939-8736-8_8.

  2. VanAntwerp J, Finneran P, Dolgikh B, Woldring D. 2022. Ancestral Sequence Reconstruction and Alternate Amino Acid States Guide Protein Library Design for Directed Evolution. In: Traxlmayr MW ed. Yeast Surface Display. Methods in Molecular Biology. New York, NY: Springer US, 75–86. DOI: 10.1007/978-1-0716-2285-8_4.

How does ancseq calculate codon probabilities in DNA mode?

IQ-TREE implements codon substitution models. However, building a phylogenetic tree with them might take too long, depending on the input alignment. Therefore, we implemented a function that calculates the probability of each codon by simply multiplying the probabilities of its three nucleotides, which treats the three positions as independent. For example, the probability of methionine at the $j$-th position, $P_{j, M}$, is the product of the probability of A at the 1st nucleotide of the $j$-th codon, $p_{j_{1}, A}$, that of T at the 2nd nucleotide, $p_{j_{2}, T}$, and that of G at the 3rd nucleotide, $p_{j_{3}, G}$, as shown below.

$P_{j, M} = p_{j_{1}, A} \times p_{j_{2}, T} \times p_{j_{3}, G}$
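The same product can be computed for all 64 codons at once. This is a minimal sketch of the multiplication described above, not ancseq's own code; the function name `codon_probabilities` and the example probabilities are made up for illustration.

```python
from itertools import product


def codon_probabilities(p1, p2, p3):
    """Multiply per-position nucleotide probabilities into codon probabilities.

    p1, p2 and p3 map each nucleotide to its probability at the 1st, 2nd
    and 3rd position of one codon.
    """
    return {
        a + b + c: p1[a] * p2[b] * p3[c]
        for a, b, c in product("ACGT", repeat=3)
    }


# Made-up nucleotide probabilities for the three positions of codon j.
p1 = {"A": 0.9, "C": 0.05, "G": 0.03, "T": 0.02}
p2 = {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7}
p3 = {"A": 0.05, "C": 0.05, "G": 0.8, "T": 0.1}

probs = codon_probabilities(p1, p2, p3)
# Methionine is encoded only by ATG, so P(M) = 0.9 * 0.7 * 0.8.
print(round(probs["ATG"], 3))
```

For amino acids encoded by several codons, the amino acid probability would be the sum over its synonymous codons.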
