homologene's Introduction

A small pipeline for parsing Homologene

Parsing the full Homologene build and building its XML document tree in Python takes too much memory. We therefore built a small pipeline to:

  1. Simplify the Homologene XML document tree so that it contains only the species of interest.
  2. Extract one-to-one ortholog mappings between two species of interest from the simplified Homologene XML.

Usage

Packages and conda environment:

Run conda env create -f env.yml to install necessary packages.

Snakemake

We have included a Snakemake pipeline to extract 1-1 ortholog mappings between a specified pair of species. The pipeline is configured through a configuration file. Run snakemake all <configfile> to produce ortholog mappings with one of the provided config files in configs/.

The provided config files are:

  • sc-sp.yml: For S. cerevisiae vs. S. pombe ortholog lists (using locus tags)
  • human-mouse.yml: For human vs. mouse ortholog lists (using RefSeq IDs)

Scripts

simplify_homologene.py extracts and outputs only those XML elements of the Homologene build that correspond to the provided species taxonomy IDs.

The following elements are removed from the XML:

  • From HG-Entry elements:
    • HG-Entry_commentaries
    • HG-Entry_cr-date
    • HG-Entry_up-date
  • From HG-Gene elements:
    • HG-Gene_domains
    • HG-Gene_location

Command line parameters:

  • --input: Path to the Homologene XML
  • --output: Destination for output file
  • --tax_ids: Space separated list of taxonomy IDs for species of interest.
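The filtering and stripping described above can be sketched as follows. This is a minimal illustration and not the code in simplify_homologene.py: the helper names are hypothetical, the HG-Gene_taxid element name and the sample document are assumptions made for the example, and only the removed element names come from the list above. It streams entries with iterparse so that the full tree is never held at once.

```python
import io
import xml.etree.ElementTree as ET

# Child elements listed above as removed, keyed by their parent tag.
DROPPED = {
    "HG-Entry": ("HG-Entry_commentaries", "HG-Entry_cr-date", "HG-Entry_up-date"),
    "HG-Gene": ("HG-Gene_domains", "HG-Gene_location"),
}


def strip_children(element):
    """Remove the listed children from one HG-Entry or HG-Gene element in place."""
    for tag in DROPPED.get(element.tag, ()):
        for child in element.findall(tag):
            element.remove(child)


def simplify_entry(entry, tax_ids):
    """Drop genes from other species; return True if any gene remains."""
    strip_children(entry)
    kept = 0
    # Snapshot the descendants so removals do not disturb the traversal.
    for parent in list(entry.iter()):
        for gene in parent.findall("HG-Gene"):
            # ASSUMPTION: each gene stores its taxonomy ID in HG-Gene_taxid.
            if gene.findtext("HG-Gene_taxid", default="").strip() in tax_ids:
                strip_children(gene)
                kept += 1
            else:
                parent.remove(gene)
    return kept > 0


def iter_simplified(source, tax_ids):
    """Yield simplified HG-Entry elements as XML strings, one at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "HG-Entry":
            if simplify_entry(elem, tax_ids):
                yield ET.tostring(elem, encoding="unicode")
            elem.clear()  # release the entry's children once it is handled


# Made-up two-entry document, used only to exercise the sketch.
SAMPLE = b"""<HG-EntrySet><HG-EntrySet_entries>
<HG-Entry><HG-Entry_hg-id>1</HG-Entry_hg-id><HG-Entry_cr-date>x</HG-Entry_cr-date>
<HG-Entry_genes>
<HG-Gene><HG-Gene_taxid>9606</HG-Gene_taxid><HG-Gene_domains>d</HG-Gene_domains></HG-Gene>
<HG-Gene><HG-Gene_taxid>7227</HG-Gene_taxid></HG-Gene>
</HG-Entry_genes></HG-Entry>
<HG-Entry><HG-Entry_hg-id>2</HG-Entry_hg-id><HG-Entry_genes>
<HG-Gene><HG-Gene_taxid>7227</HG-Gene_taxid></HG-Gene>
</HG-Entry_genes></HG-Entry>
</HG-EntrySet_entries></HG-EntrySet>"""

simplified = list(iter_simplified(io.BytesIO(SAMPLE), {"9606", "10090"}))
```

Handling one HG-Entry at a time and clearing it afterwards keeps memory use roughly proportional to a single entry rather than to the whole build.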

extract_homologs.py extracts 1-1 ortholog mappings from a Homologene XML document and saves them to disk as a TSV. The 1-1 ortholog mappings are generated by selecting pairs with reciprocal best bitscores. (We recommend running this only on simplified Homologene XML files, since building the XML document tree in memory for all species in Homologene takes an enormous amount of memory.)

Command line parameters:

  • --input: Path to the Homologene XML
  • --output: Destination for output file
  • --tax_ids: Space separated list of two taxonomy IDs for the species of interest
  • --use_refseq_id: An optional flag. When this flag is set, RefSeq IDs are used; otherwise locus tags are used. (Note: if locus tags are not available for all genes of the species of interest, this flag must be used.)
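As an illustration of selecting pairs with reciprocal best bitscores, here is a minimal sketch. It is not taken from extract_homologs.py: it assumes the pairwise scores have already been read into (gene_a, gene_b, bitscore) tuples, the function name and gene identifiers are made up, and ties are resolved in favor of the first pair seen, which may differ from what the script does.

```python
import csv
import io


def reciprocal_best_pairs(scores):
    """Return (gene_a, gene_b) pairs that are each other's best-scoring partner.

    scores: iterable of (gene_a, gene_b, bitscore), with gene_a from the first
    species and gene_b from the second.
    """
    best_for_a = {}  # gene_a -> (best gene_b, bitscore)
    best_for_b = {}  # gene_b -> (best gene_a, bitscore)
    for gene_a, gene_b, bitscore in scores:
        if gene_a not in best_for_a or bitscore > best_for_a[gene_a][1]:
            best_for_a[gene_a] = (gene_b, bitscore)
        if gene_b not in best_for_b or bitscore > best_for_b[gene_b][1]:
            best_for_b[gene_b] = (gene_a, bitscore)
    # Keep a pair only when the best hit points back at the same gene.
    return sorted(
        (gene_a, gene_b)
        for gene_a, (gene_b, _score) in best_for_a.items()
        if best_for_b[gene_b][0] == gene_a
    )


# Hypothetical identifiers and scores, used only to exercise the sketch.
scores = [
    ("sc_gene1", "sp_geneA", 250.0),
    ("sc_gene1", "sp_geneB", 180.0),
    ("sc_gene2", "sp_geneB", 300.0),
    ("sc_gene3", "sp_geneB", 120.0),
]
pairs = reciprocal_best_pairs(scores)

# Write the mapping as tab-separated text, one pair per line.
buffer = io.StringIO()
csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(pairs)
tsv = buffer.getvalue()
```

In this example sc_gene3 is dropped: its best hit is sp_geneB, but sp_geneB scores higher against sc_gene2, so the relationship is not reciprocal.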
