ensembl-genomio

Pipelines to turn basic genomic data into Ensembl cores and back

This is a multilanguage (Perl, Python) repo providing eHive pipelines and various scripts (see below) to prepare genomic data and load it as an Ensembl core database, or to dump such core databases as file bundles.

Bundles themselves consist of genomic data in various formats (e.g. fasta, gff3, json) and should follow the corresponding specification.

Installation and configuration

This repo

Prerequisites

Pipelines are intended to be run inside the Ensembl production environment. Please make sure you have all the proper credentials, keys, etc. set up.

Get repo and install

Clone:

git clone git@github.com:Ensembl/ensembl-genomio.git

Install the python part (of the pipelines) and test it:

pip install ./ensembl-genomio

# test
python -c 'import ensembl.brc4.runnable.read_json'
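The one-line import test above can also be done from a script without running the module's top-level code. The following is a minimal sketch using only the standard library; the helper name `is_importable` is hypothetical and not part of this repo.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if the module can be located on the current sys.path.

    The target module itself is not executed, but find_spec does import
    its parent packages in order to resolve a dotted name.
    """
    try:
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package of a dotted name is missing.
        return False


if __name__ == "__main__":
    # Module name taken from the install test in this README.
    name = "ensembl.brc4.runnable.read_json"
    status = "found" if is_importable(name) else "NOT found"
    print(f"{name}: {status}")
```

Unlike `python -c 'import ...'`, this check only locates the module, so it will not catch errors raised while the module is actually loading.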

Update your Perl environment (if you need to):

export PERL5LIB=$(pwd)/ensembl-genomio/lib/perl:$PERL5LIB
export PATH=$(pwd)/ensembl-genomio/scripts:$PATH

Optional installation

If you need an "editable" install of the Python package, use the '-e' option:

pip install -e ./ensembl-genomio

To install additional dependencies (e.g. [doc] or [dev]), append the [<tag>] string to the package path. For example:

pip install -e ./ensembl-genomio[dev]

For the list of tags see [project.optional-dependencies] in pyproject.toml.
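To list the available tags without opening the file by hand, the table can be read programmatically. This is a minimal sketch, assuming Python 3.11+ (for the standard-library `tomllib`; older interpreters would need the third-party `tomli` package). The helper name `list_extras` and the sample TOML are hypothetical and do not reflect this repo's actual pyproject.toml.

```python
try:
    import tomllib  # standard library from Python 3.11
except ImportError:  # assumption: the tomli backport is installed on older Pythons
    import tomli as tomllib


def list_extras(pyproject_text: str) -> dict:
    """Map each optional-dependency tag to its list of requirements."""
    data = tomllib.loads(pyproject_text)
    # A missing [project] or [project.optional-dependencies] table yields {}.
    return data.get("project", {}).get("optional-dependencies", {})


if __name__ == "__main__":
    # Hypothetical content, for illustration only.
    sample = """
[project]
name = "demo"

[project.optional-dependencies]
dev = ["pytest", "black"]
doc = ["sphinx"]
"""
    for tag, requirements in list_extras(sample).items():
        print(f"[{tag}]: {', '.join(requirements)}")
```

For the real file, pass in the text of `ensembl-genomio/pyproject.toml` instead of the sample string.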

Additional steps to use automated generation of the documentation (part of it)

Install the Python part with the [doc] or [dev] tag, change into the repo directory, and run the doc build script:

git clone git@github.com:Ensembl/ensembl-genomio.git
pip install -e ./ensembl-genomio[doc]

cd ./ensembl-genomio

# build docs
./scripts/setup/docs/build_sphinx_docs.sh

Pipelines

Initialising and running eHive-based pipelines

Pipelines are derived from Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf, or from Bio::EnsEMBL::Hive::PipeConfig::EnsemblGeneric_conf, or from Bio::EnsEMBL::EGPipeline::PipeConfig::EGGeneric_conf (see documentation).

The same Perl class prefix is used for every pipeline: Bio::EnsEMBL::Pipeline::PipeConfig:: (as in the example command and the pipeline list below).

N.B. Don't forget to specify the -reg_file option for the beekeeper.pl -url $url -reg_file $REG_FILE -loop command.

init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::BRC4_genome_loader_conf \
    $($CMD details script) \
    -hive_force_init 1 \
    -queue_name $SPECIFIC_QUEUE_NAME \
    -registry $REG_FILE \
    -pipeline_tag "_${PIPELINE_RUN_TAG}" \
    -ensembl_root_dir ${ENSEMBL_ROOT_DIR} \
    -dbsrv_url $($CMD details url) \
    -proddb_url "$($PROD_SERVER details url)""$PROD_DBNAME" \
    -taxonomy_url "$($PROD_SERVER details url)""$TAXONOMY_DBNAME" \
    -release ${RELEASE_VERSION} \
    -data_dir ${INPUT_DIR}/manifests_dir/ \
    -pipeline_dir $OUT_DIR/loader_run \
    ${OTHER_OPTIONS} \
    2> $OUT_DIR/init.stderr \
    1> $OUT_DIR/init.stdout

SYNC_CMD=$(cat $OUT_DIR/init.stdout | grep -- -sync'$' | perl -pe 's/^\s*//; s/"//g')
# should get something like
#   beekeeper.pl -url $url -sync

LOOP_CMD=$(cat $OUT_DIR/init.stdout | grep -- -loop | perl -pe 's/^\s*//; s/\s*#.*$//; s/"//g')
# should get something like
#   beekeeper.pl -url $url -reg_file $REG_FILE -loop

$SYNC_CMD 2> $OUT_DIR/sync.stderr 1> $OUT_DIR/sync.stdout
$LOOP_CMD 2> $OUT_DIR/loop.stderr 1> $OUT_DIR/loop.stdout
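The grep/perl extraction of the sync and loop commands above can also be written in Python, which may be easier to maintain when the pipeline is driven from a script. This is a minimal sketch that mirrors the same filtering rules; the function name `extract_beekeeper_commands` is hypothetical, and the sample text only approximates what init_pipeline.pl prints.

```python
import re


def extract_beekeeper_commands(init_stdout: str) -> tuple:
    """Return (sync_cmd, loop_cmd) parsed from init_pipeline.pl stdout.

    Mirrors the shell pipeline: the sync command is a line ending in
    '-sync'; the loop command is a line containing '-loop', with any
    trailing '# comment' removed. Leading whitespace and double quotes
    are stripped from both.
    """
    sync_cmd = None
    loop_cmd = None
    for line in init_stdout.splitlines():
        if sync_cmd is None and line.endswith("-sync"):
            sync_cmd = line.lstrip().replace('"', "")
        if loop_cmd is None and "-loop" in line:
            without_comment = re.sub(r"\s*#.*$", "", line.lstrip())
            loop_cmd = without_comment.replace('"', "")
    if sync_cmd is None or loop_cmd is None:
        raise ValueError("could not find both -sync and -loop commands")
    return sync_cmd, loop_cmd


if __name__ == "__main__":
    # Hypothetical output, for illustration only.
    sample = (
        '\tbeekeeper.pl -url "mysql://host:3306/db" -sync\n'
        '\tbeekeeper.pl -url "mysql://host:3306/db" -reg_file reg.pm -loop  # run workers\n'
    )
    sync_cmd, loop_cmd = extract_beekeeper_commands(sample)
    print(sync_cmd)
    print(loop_cmd)
```

One deliberate difference from the shell version: this sketch keeps only the first matching line for each command and raises an error if either is missing, whereas the grep pipeline keeps every matching line and silently yields an empty string when nothing matches.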

List of the pipelines

| Pipeline name | Description | Document | Comment | Module |
| --- | --- | --- | --- | --- |
| BRC4_genome_loader | Creates an Ensembl core database from a set of flat files, or adds ad-hoc (i.e. organelle) sequences to an existing core | BRC4_genome_loader | | Bio::EnsEMBL::Pipeline::PipeConfig::BRC4_genome_loader_conf |
| BRC4_genome_dumper | | | | |
| BRC4_genome_prepare | | | | |
| BRC4_addition_prepare | | | | |
| BRC4_genome_compare | | | | |
| LoadGFF3 | | | | |
| LoadGFF3Batch | | | | |

Scripts

Various docs

See docs

TODO

Tests, tests, tests...

Acknowledgements

Some of this code and documentation is inherited from the EnsemblGenomes and other Ensembl projects. We appreciate the effort and time spent by developers of the EnsemblGenomes and Ensembl projects.

Thank you!
