Pseudogene pipeline

Scripts and wrapper file for running the Shiu Lab's pseudogene pipeline.

Overview

Associated publications

  • Zou C, Lehti-Shiu MD, Thibaud-Nissen F, Prakash Th, Buell CR, and Shiu SH (2009) Evolutionary and Expression Signatures of Pseudogenes in Arabidopsis thaliana and Rice. Plant Physiol. 151:3-15.
  • Campbell M, Law MY, Holt C, Stein J, Moghe G, Hufnagel Du, Lei J, Achawanantakun R, Jiao D, Lawrence C, Ware D, Shiu SH, Childs K, Sun Y, Jiang N, Yandell M (2014) MAKER-P: a tool-kit for the rapid creation, management, and quality control of plant genome annotations. Plant Physiol. 164(2):513-24.
  • Lloyd JP, Tsai ZTY, Sowers RPu, Panchy NL, Shiu SH (2018) A model-based approach for identifying functional intergenic transcribed regions and non-coding RNAs. Mol. Biol. Evol. 35(6):1422-1436.

Requirements

Usage

Cut-to-the-chase

The pipeline is run using:

python _scripts/PseudogenePipeline.py [parameter_file]

To invoke the help function, run:

python _scripts/PseudogenePipeline.py

About the parameter file

This text file specifies how the pipeline should run. An example can be found in the _example_files folder.

Test run

Two test datasets are provided so you can check whether there are any issues.

  1. _test25.tgz: This compressed file contains a test dataset of 25 proteins that takes ~1 minute to run. To use:
tar xvzf _test25.tgz
cd _test25

Then make sure the test_parameter_file in the folder is modified to specify:

  • The location and name of tfasty program.
  • The location of RepeatMasker.
  • The location of PseudogenePipeline's _scripts folder.
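Before starting a run, it can help to confirm that the three locations above actually exist. The following is a minimal sketch, not part of the pipeline: the function name `check_paths` and the example paths are hypothetical, and it does not read the parameter file itself, since its format is not described here.

```python
import os
import shutil


def check_paths(paths):
    """Return a list of problems for a {label: path} mapping (empty if all are found)."""
    problems = []
    for label, path in paths.items():
        # Accept either an existing file/folder or a command found on PATH.
        if not (os.path.exists(path) or shutil.which(path)):
            problems.append(f"{label}: not found at {path}")
    return problems


if __name__ == "__main__":
    # Hypothetical locations; replace with the values used in your parameter file.
    settings = {
        "tfasty": "/path/to/tfasty",
        "RepeatMasker": "/path/to/RepeatMasker",
        "_scripts": "../_scripts",
    }
    for problem in check_paths(settings):
        print(problem)
```

An empty result means every listed location was found; any printed line points at a setting to correct before running the pipeline.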

Below we assume that the _test25 folder is inside the cloned PseudogenePipeline folder and that you are in the _test25 folder. Run the pipeline:

python ../_scripts/PseudogenePipeline.py test_parameter_file 

The _expected_results folder contains the output you should see.
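One way to check a test run is to compare the files you produced with the expected ones. The sketch below assumes that the files in _expected_results have the same names as the files written to your results folder, which is not stated here; `compare_to_expected` is a hypothetical helper, not part of the pipeline.

```python
import filecmp
import os


def compare_to_expected(results_dir, expected_dir):
    """Compare each file in expected_dir with the same-named file in results_dir."""
    report = {"match": [], "differ": [], "missing": []}
    for name in sorted(os.listdir(expected_dir)):
        expected = os.path.join(expected_dir, name)
        produced = os.path.join(results_dir, name)
        if not os.path.isfile(expected):
            continue  # skip subfolders
        if not os.path.isfile(produced):
            report["missing"].append(name)
        elif filecmp.cmp(expected, produced, shallow=False):
            # shallow=False compares file contents, not just size and timestamps.
            report["match"].append(name)
        else:
            report["differ"].append(name)
    return report
```

For example, `compare_to_expected("_results", "_expected_results")` returns three lists of file names. Files that differ are not necessarily a problem (tool versions can change details), but missing files usually indicate that a step failed.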

  2. _test27206.tgz: A more realistic test case with a larger Arabidopsis thaliana dataset that takes ~20-30 min to run.

Output

The output of the pipeline is separated into the following subfolders:

  • _intermediate: Intermediate files used to generate the final results. These may be removed after a successful run. However, if you want a list of the pseudogenes generated before high-confidence filtering and/or RepeatMasker filtering, you will find it here.
  • _logs: Log files generated by the run.
  • _results: Output files for the final list of pseudogenes following high-confidence and RepeatMasker filtering:
    1. hiConf.RMfilt/hiConf.RMfilt.cdnm - position information for pseudogenes
    2. fa.hiConf.RMfilt/fa.hiConf.RMfilt.cdnm - sequence information for pseudogenes
    3. hiConf.RMfilt.cdnm.gff - gff file with pseudogene annotations
    4. pseudogene_evidence_cod - information on the numbers of stop codons, stop codons near gaps, frameshifts, and frameshifts near gaps. Because the gaps can simply be alignments between coding sequence and introns, any predicted stops and frameshifts near them are of lower confidence.
  • NOTE: cdnm versions of the output use simplified pseudogene names.
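As a quick sanity check on the GFF output, the records can be tallied per sequence. This is a minimal sketch that assumes the file follows the standard nine-column, tab-delimited GFF layout; the feature types and attributes the pipeline writes are not described here, so the code only counts records. `count_gff_records` is a hypothetical helper.

```python
from collections import Counter


def count_gff_records(gff_path):
    """Count GFF records per sequence ID (column 1), skipping comments and blanks."""
    counts = Counter()
    with open(gff_path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line or line.startswith("#"):
                continue
            fields = line.split("\t")
            if len(fields) < 9:
                continue  # not a standard nine-column GFF record
            counts[fields[0]] += 1
    return counts
```

Calling it on the hiConf.RMfilt.cdnm.gff file in _results gives a mapping from sequence ID to the number of annotation records on that sequence.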

Version info

v2.0.0

  • Converted to Python 3.
  • Some changes to make the code PEP 8 compliant.
  • Added license.

v1.0.0
