
FastScreen

FastScreen is a Compi pipeline to identify datasets that likely show evidence for positive selection and thus should be the subject of detailed, time-consuming analyses [1]. A Docker image is available for this pipeline in this Docker Hub repository.

Using the FastScreen image in Linux

To use the FastScreen image, first create a directory named compi_fss_working_dir/input in your local file system. compi_fss_working_dir is the working directory of the pipeline, where the output results and intermediate files will be created. The input FASTA files to be analyzed must be placed in the compi_fss_working_dir/input directory.

Note that FastScreen requires each FASTA file to have at least 4 sequences. Otherwise, the pipeline will not start its execution and a list of the files with fewer than 4 sequences is created.
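As a quick pre-flight check for the 4-sequence requirement, the input directory can be scanned before launching the pipeline. The following is a minimal sketch, not part of FastScreen itself: the function names count_fasta_sequences and list_small_fasta_files are hypothetical, and it assumes that every regular file in the input directory is a FASTA file whose records start with a '>' header line.

```shell
#!/usr/bin/env bash

# Count the sequences in a FASTA file by counting header lines (lines starting with '>').
# grep -c exits with status 1 when the count is 0, so "|| true" keeps the function usable under "set -e".
count_fasta_sequences() {
    grep -c '^>' "$1" || true
}

# Print every file in the given directory that has fewer than 4 sequences.
list_small_fasta_files() {
    local input_dir="$1"
    local f n
    for f in "$input_dir"/*; do
        [ -f "$f" ] || continue
        n=$(count_fasta_sequences "$f")
        if [ "$n" -lt 4 ]; then
            echo "$f ($n sequences)"
        fi
    done
}

# Example usage (adapt the path first):
#   WORKING_DIR=/path/to/compi_fss_working_dir
#   mkdir -p "${WORKING_DIR}/input"
#   list_small_fasta_files "${WORKING_DIR}/input"
```

If the last command prints nothing, every input file has at least 4 sequences.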

Test data

The sample data is available here. Download the FASTA files and put them inside the directory compi_fss_working_dir/input in your local file system. Note that the input folder must keep that name, as the pipeline looks for the FASTA files there.

Then, you should adapt and run the following commands:

WORKING_DIR=/path/to/compi_fss_working_dir

docker run -v ${WORKING_DIR}:/working_dir --rm pegi3s/pss-fs --logs /working_dir/logs

In these commands, you should replace:

  • /path/to/compi_fss_working_dir with the actual path in your file system.

Extra

To re-run the pipeline in the same working directory, run the following command first in order to clean it:

docker run -v ${WORKING_DIR}:/working_dir --entrypoint clean_working_dir pegi3s/pss-fs /working_dir/

Or, alternatively, delete every folder manually:

sudo rm -rf ${WORKING_DIR}/ali ${WORKING_DIR}/renamed_seqs ${WORKING_DIR}/logs ${WORKING_DIR}/tree ${WORKING_DIR}/FUBAR_files ${WORKING_DIR}/FUBAR_results ${WORKING_DIR}/short_list ${WORKING_DIR}/to_be_reevaluated_by_codeML ${WORKING_DIR}/codeML_random_list ${WORKING_DIR}/codeML_results ${WORKING_DIR}/tree.codeML ${WORKING_DIR}/codeML_short_list ${WORKING_DIR}/negative_list ${WORKING_DIR}/files_requiring_attention ${WORKING_DIR}/FUBAR_short_list ${WORKING_DIR}/renamed_seqs_mappings

For Developers

Pipeline implementation

The pipeline.xml analyzes each FASTA file in the input_dir directory in parallel (using bound foreach tasks) and writes the results to the specified working_dir. For each input FASTA file, ClustalOmega and FastTree are run first to produce the alignment and tree that FUBAR then uses to look for evidence for positive selection. If such evidence is found, the name of the file is added to the short_list file. If it is not found, the file is analyzed using CodeML. The tasks related to the execution of CodeML can be skipped by passing the parameter skip_code_ml.

Note that CodeML can only handle datasets where the product of the number of sequences and the number of ungapped codons stays below a limit of around 90 000 [1]. When this limit is exceeded, a random sample of sequences is taken from the initial dataset (in the codeml-check-limit task). In these cases, the largest number of sequences that fits within the limit, minus one, is used.
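To get a feeling for whether a dataset is likely to hit this limit, the product can be estimated from an aligned FASTA file. The sketch below is an illustration only and is not the pipeline's codeml-check-limit implementation: the function name codeml_size_product is hypothetical, and it assumes an in-frame codon alignment where all sequences have the same length and gaps are written as '-'. A codon column is counted as ungapped only when no sequence has a gap in it.

```shell
#!/usr/bin/env bash

# Print "<sequences> <ungapped codons> <product>" for an aligned FASTA file.
codeml_size_product() {
    awk '
        /^>/ { n++; next }                             # header line: start a new sequence
        n > 0 { gsub(/[ \t\r]/, ""); seq[n] = seq[n] $0 }  # join wrapped sequence lines
        END {
            if (n == 0) { print "0 0 0"; exit }
            codons = int(length(seq[1]) / 3)
            ungapped = 0
            for (c = 0; c < codons; c++) {
                ok = 1
                for (i = 1; i <= n; i++) {
                    codon = substr(seq[i], 3 * c + 1, 3)
                    # a short or gapped codon makes the whole column gapped
                    if (length(codon) < 3 || index(codon, "-") > 0) { ok = 0; break }
                }
                ungapped += ok
            }
            print n, ungapped, n * ungapped
        }
    ' "$1"
}

# Example usage (the file name is a placeholder):
#   read -r nseq ncodons product < <(codeml_size_product alignment.fasta)
#   [ "$product" -gt 90000 ] && echo "likely above the CodeML limit"
```

Because the limit is described as approximate, the comparison against 90 000 should be read as a rough indicator rather than an exact threshold.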

The main output is the short_list file, which contains the names of the FASTA files where evidence for positive selection was found.

Apart from the short_list file, six other output files are produced:

  1. FUBAR_short_list: contains the names of the files where evidence for positive selection has been found by FUBAR.
  2. to_be_reevaluated_by_codeML: contains the names of the files that were re-evaluated by CodeML.
  3. codeML_random_list: contains the names of the files from which a random sequence sample was taken because they were too large to be analysed by CodeML.
  4. codeML_short_list: contains the names of the files where positively selected sites (PSS) were detected by CodeML model M2a.
  5. negative_list: contains the names of the files where no evidence for positive selection was found by either FUBAR or CodeML.
  6. files_requiring_attention: contains the names of the files that could not be processed without error (usually because they have in-frame stop codons that were introduced during the nucleotide alignment step).
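After a run, the list files described above can be summarized in one pass. This is a minimal sketch under two assumptions that the text does not state explicitly: that the list files sit directly inside the working directory, and that each one holds one file name per line. The function name summarize_fastscreen_lists is hypothetical.

```shell
#!/usr/bin/env bash

# Print, for each FastScreen list file, how many non-empty lines it contains,
# or "missing" if the file does not exist in the working directory.
summarize_fastscreen_lists() {
    local working_dir="$1"
    local name
    for name in short_list FUBAR_short_list to_be_reevaluated_by_codeML \
                codeML_random_list codeML_short_list negative_list \
                files_requiring_attention; do
        if [ -f "$working_dir/$name" ]; then
            # grep -c . counts the non-empty lines of the file
            printf '%s\t%s\n' "$name" "$(grep -c . "$working_dir/$name")"
        else
            printf '%s\t%s\n' "$name" "missing"
        fi
    done
}

# Example usage:
#   summarize_fastscreen_lists "${WORKING_DIR}"
```

A list reported as missing may simply mean that no file fell into that category, for example when the CodeML tasks were skipped.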

Building the Docker image

To build the Docker image, compi-dk is required. Once you have it installed, simply run compi-dk build from the project directory to build the Docker image. The image will be created with the name specified in the compi.project file (i.e. pegi3s/pss-fs:latest). This file also specifies the version of compi that goes into the Docker image.

References

  • H. López-Fernández; C. P. Vieira; P. Ferreira; P. Gouveia; F. Fdez-Riverola; M. Reboiro-Jato; J. Vieira (2021) On the identification of clinically relevant bacterial amino acid changes at the whole genome level using Auto-PSS-Genome. Interdisciplinary Sciences: Computational Life Sciences. Volume 13, pp. 334–343.
  • H. López-Fernández; P. Duque; N. Vázquez; F. Fdez-Riverola; M. Reboiro-Jato; C. P. Vieira; J. Vieira (2019) Inferring Positive Selection in Large Viral Datasets. 13th International Conference on Practical Applications of Computational Biology & Bioinformatics: PACBB 2019. Ávila, Spain. 26 - June
