Code Monkey home page Code Monkey logo

asa's Introduction

Adaptive Sequence Alignment (ASA)

Software

The analysis flow logic is implemented in aligner-asa program. The program requires Bowtie2 and Samtools utilities as well as RDP II database. The software is designed to run under Linux operating system.

MBARC26_public

The pre-processed MBARC-26 dataset is available in the NCBI SRA archive with the accession number SRX1836716. It can be downloaded using the SRA toolkit

ASA tool installation

docker pull elolab/asa

Once download completes, installation can be checked:

docker run -it --rm elolab/asa /opt/asa/aligner-asa.py --help

ASA tool example run

The ASA container can be run with following command. In this example the computer has /data -folder, which is visible also to the container. The output is written to /data/out/ folder, which is required to exist.


docker run -it --rm -v /data:/data elolab/asa /opt/asa/aligner-asa.py \
-i /data/R1.fastq.gz,R2.fastq.gz \
-t /data/pro_rdp_11_5_T_I_L1200_taxonomy.txt \
--db_fasta /data/pro_rdp_11_5_T_I_L1200_seqs.fa \
--top_refs 2 \
--rounds 2 \
--best_match_only \
--candlist_bt2_score_min 0.99 \
--top_refs_file top_refs-example.txt \
--aligner_threads 6 \
--embed_in_n \
-o /data/out/consensus.txt

Output

First, the tool aligns the reads to the RDP database in order to discover the best aligning references. The result can be found in top_refs-example.txt

Next, the tool chooses a number of the best matching references (specified by --top_refs) and starts iteratively building each sequence. The maximum amount of iterations can be set by the --rounds parameter

Creating the RDP reference database

First, the RDP database was split into subsets on family level in order to decrease computational load.

Next, one tree per family was generated using clustalw and fasttree. Finally, the minimum distances between genera were calculated using the phylotool.py script. It takes a directory of trees and creates a pickled object (min_family_dists.pickle) containing the cross-genera cutoff values.

Consensus-reference mapping

The consensus-reference tree is generated using the /opt/asa/create_con_ref_tree.R script. Modify the paths in this file to point to the correct locations

/opt/asa/phylotool.py processes the consensus-reference tree and reports the species mapping. Furthermore, a note is made if the mapping is within the cross-genera cutoff. If the cutoff is exceeded, then the mapping is not too confident, since it is possibly mapped to the wrong genera.

Appendix

Creating a new reference database files

ASA can use any fasta file as reference file. However, if there is taxonomy associated with fasta sequences, there are two files: 1) fasta file and 2) related taxonomy file.

Fasta file example:

>0
AGAGTTTGATCATGGCTCAGGACGA
>1
TATTTCGACGGGTTCCG
>2
TAAAGAGCTCG

Corresponding taxonomy file example:

0[TAB]r__Root;d__Bacteria;p__Actinobacteria;c__Actinobacteria;b__Acidimicrobidae;o__Acidimicrobiales;u__Acidimicrobineae;f__Acidimicrobiaceae;g__Acidimicrobium;s__Acidimicrobium_ferrooxidans_(T)_ICP
1[TAB]r__Root;d__Bacteria;p__Actinobacteria;c__Actinobacteria;b__Acidimicrobidae;o__Acidimicrobiales;u__Acidimicrobineae;f__Acidimicrobiaceae;g__Ferrimicrobium;s__Ferrimicrobium_acidiphilum_(T)_T23
2[TAB]r__Root;d__Bacteria;p__Actinobacteria;c__Actinobacteria;b__Acidimicrobidae;o__Acidimicrobiales;u__Acidimicrobineae;f__Acidimicrobiaceae;g__Ferrithrix;s__Ferrithrix_thermotolerans_(T)_Y005

Where [TAB] is the tabulator character.

Example, RDP database filtering

Download and unzip Bacterial 16S sequences (as Genbank and FASTA) from RDP. The files are: current_Bacteria_unaligned.fa.gz and current_Bacteria_unaligned.gb.gz. Note. RDP II project website seems to be available currently.

cd ..

wget \
https://rdp.cme.msu.edu/download/current_Bacteria_unaligned.gb.gz \
--no-check-certificate

wget \
https://rdp.cme.msu.edu/download/current_Bacteria_unaligned.fa.gz \
--no-check-certificate

Next, convert and filter the data to the format required by ASA:

gunzip current_Bacteria_unaligned.gb.gz current_Bacteria_unaligned.fa.gz

./filter_database.py \
--output_gb rdp_11_5_T_I_L1200.gb \
-f rdp_11_5_T_I_L1200.fasta \
--rdp_gb current_Bacteria_unaligned.gb \
--rdp_fasta current_Bacteria_unaligned.fa \
-t rdp_11_5_T_I_L1200_taxonomy.txt \
-s rdp_11_5_T_I_L1200_seqs.fa \
--filter_min_seqlen 1200 \
--filter_db_typestrain \
--filter_db_isolates

SILVA database

The repository includes a script (import_silva.py) to facilitate the loading of the SILVA (v. 138.1) database (SILVA_138.1_SSURef_NR99_tax_silva.fasta) and the creation of silva_138.1.fasta. This allows for the utilization of the SILVA database for conducting ASA and manually identifying sample taxonomy from phylogenetic trees.

asa's People

Contributors

sampie avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.