Code Monkey home page Code Monkey logo

helixer's Introduction

Python CI

Helixer

Gene calling with Deep Neural Networks.

Disclaimer

This software is undergoing active testing and development.

Goal

Setup and train models for ab initio prediction of gene structure. That is, to perform "gene calling" and identify which base pairs in a genome belong to the UTR/CDS/Intron of genes. Train one model for a wide variety of genomes.

Web tool

For light usage (inference on one to a few genomes), you may want to check out the Helixer web tool: https://plabipd.de/helixer_main.html and skip the installation overhead.

Install

GPU requirements

For realistically sized datasets, a GPU will be necessary for acceptable performance.

The example below and all provided models should run on an nvidia GPU with 11GB Memory (e.g. GTX 1080 Ti)

The diver for the GPU must also be installed. During development we have used

  • nvidia-driver-495
  • nvidia-driver-510
  • nvidia-driver-525

and many in between.

via Docker / Singularity (recommended)

See https://github.com/gglyptodon/helixer-docker

Additionally, please see notes on usage, which will differ slightly from the example below.

Manual installation

Please see full installation instructions

contributors & team members

Please additionally see dev installation instructions

Example

Training and Evaluation

If the provided models don't work for your needs, information on training and evaluating the models can be found in the docs folder, as well as notes on experimental ways to fine-tune the network for target species including a hack to include RNAseq data in the input.

Inference (gene calling)

If you want to use Helixer to annotate a genome with a provided model, start here.

Using trained models

Acquire models

The best models for all lineages are best downloaded by running.

fetch_helixer_models.py

If desired, the --lineage can be specified, or --all released models can be fetched.

The available lineages are land_plant, vertebrate, invertebrate, and fungi.

Info on the downloaded models (and any new releases) can be found here: https://uni-duesseldorf.sciebo.de/s/lQTB7HYISW71Wi0

Note: to use a non-default model, set --model-filepath <path/to/model.h5>', to override the lineage default for Helixer.py.

Run on target genome
# download an example chromosome
wget ftp://ftp.ensemblgenomes.org/pub/plants/release-47/fasta/arabidopsis_lyrata/dna/Arabidopsis_lyrata.v.1.0.dna.chromosome.8.fa.gz
gunzip Arabidopsis_lyrata.v.1.0.dna.chromosome.8.fa.gz
# run all Helixer components from fa to gff3
Helixer.py --lineage land_plant --fasta-path Arabidopsis_lyrata.v.1.0.dna.chromosome.8.fa  \
  --species Arabidopsis_lyrata --gff-output-path Arabidopsis_lyrata_chromosome8_helixer.gff3

The above runs three main steps: conversion of sequence to numerical matrices in preparation (fasta2h5.py), prediction of base-wise probabilities with the Deep Learning based model (helixer/prediction/HybridModel.py), post-processing into primary gene models (helixer_post_bin). See respective help functions for additional usage information, if necessary.

Run on target genomes, 3-step method
# example broken into individual steps
fasta2h5.py --species Arabidopsis_lyrata --h5-output-path Arabidopsis_lyrata.h5 --fasta-path Arabidopsis_lyrata.v.1.0.dna.chromosome.8.fa
# the exact location ($HOME/.local/share/) of the model comes from appdirs
# the model was downloaded when fetch_helixer_models.py was called above
# this example code is for _linux_ and will need to be modified for other OSs
HybridModel.py --load-model-path $HOME/.local/share/Helixer/models/land_plant/land_plant_v0.3_a_0080.h5 \
     --test-data Arabidopsis_lyrata.h5 --overlap --val-test-batch-size 32 -v
helixer_post_bin Arabidopsis_lyrata.h5 predictions.h5 100 0.1 0.8 60 Arabidopsis_lyrata_chromosome8_helixer.gff3

Output: The main output of the above commands is the gff3 file (Arabidopsis_lyrata_chromosome8_helixer.gff3) which contains the predicted genic structure (where the exons, introns, and coding regions are for every predicted gene in the genome). You can find more about the format here. You can readily derive other files, such as a fasta file of the proteome or the transcriptome, using a standard parser, for instance gffread.

What Parameters Matter?

Most parameters from Helixer.py have been set to a reasonable default; but nevertheless there are a couple where the best setting is genome dependent.

--lineage or --model-filepath

It is of course critical to choose a model appropriate for your phylogenetic range / trained on species that generalize well to your target species. When in doubt selection via --lineage is recommended, as this will use the best available model for that lineage.

--subsequence-length and overlapping parameters

From v0.3.1 onwards these parameters are set to reasonable defaults when --lineage is used, but --subsequence-length will still need to be specified when using --model-filepath, while the overlapping parameters can be derived automatically.

Subsequence length controls how much of the genome the Neural Network can see at once, and should ideally be comfortably longer than the typical gene.

For genomes with large genes (i.e. there are frequently > 20kbp genomic loci), --subsequence-length should be increased This is particularly common for vertebrates and invertebrates but can also happen in plants. For efficiency, the overlap parameters should increase as well. It might then be necessary to decrease --batch-size if the GPU runs out of memory.

However, these should definitely not be higher than the N50, or even the N90 of the genome. Nor so high a reasonable batch size cannot be used.

General recommendations
  • fungi, leave as is (--subsequence-length 21384 --overlap-offset 10692 --overlap-core-length 16038)
  • plants, leave as is, or try up to --subsequence-length 106920 --overlap-offset 53460 --overlap-core-length 80190
  • invertebrates, set to --subsequence-length 213840 --overlap-offset 106920 --overlap-core-length 160380
  • vertebrates, set to --subsequence-length 213840 --overlap-offset 106920 --overlap-core-length 160380
--peak-threshold affects the precision <-> recall balance

In particular, increasing the peak threshold from the default of 0.8 has been reported to increase the precision of predictions, with very minimal reduction in e.g. BUSCO scores. Values such as 0.9, 0.95 and 0.975 are very reasonable to try.

Citation

Full Applicable Tool

Felix Holst, Anthony Bolger, Christopher Günther, Janina Maß, Sebastian Triesch, Felicitas Kindel, Niklas Kiel, Nima Saadat, Oliver Ebenhöh, Björn Usadel, Rainer Schwacke, Marie Bolger, Andreas P.M. Weber, Alisandra K. Denton. Helixer—de novo Prediction of Primary Eukaryotic Gene Models Combining Deep Learning and a Hidden Markov Model. bioRxiv 2023.02.06.527280; doi: https://doi.org/10.1101/2023.02.06.527280

Original Development and Description of Deep Neural Network for basewise predictions

Felix Stiehler, Marvin Steinborn, Stephan Scholz, Daniela Dey, Andreas P M Weber, Alisandra K Denton. Helixer: Cross-species gene annotation of large eukaryotic genomes using deep learning. Bioinformatics, btaa1044, https://doi.org/10.1093/bioinformatics/btaa1044

helixer's People

Contributors

alisandra avatar bjoernusadel avatar chrisguenther91 avatar felicitas215 avatar frogsicle avatar gglyptodon avatar soi avatar tonybolger avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

helixer's Issues

Somtimes predicted categories don't sum to one (overlap + ensembled)

For instance papio_anubis, chunk 479

@soi explained this well in the original (GenuFF repo) issue:
"I believe this is due to the way overlapping is done. Could this chunk be the last one in a chromosome? I vaguely remember that such things happened, but that it also had no effect on the argmax outcome (hopefully).

Basically there is nothing to overlap with at the end and the initial array to which the sequences get copied to for the final averaging is initialized with zeros, so the average shrinks but not the argmax."

Install things

Update instructions or install to allow easy usage of GeenuFF (e.g. get import2geenuff.py to path or also getting GeenuFF repo)

Compatible GeenuFF version as dependency

Test (and probably bugfix) phase encoding where we don't start with phase 0

i.e.

  • when there was a partial model in the gff and it started with phase != 0
  • during super chunking / splitting into write_by
    • (currently test_super_chunking4write does catch this, but potentially should be done a bit more formally)

(and sorry if I'm missing something, if it's already there and I missed it)

why does coverage still have -1s?

two parts to this,

  1. know why this is happening

  2. if unavoidable (e.g. data is really missing), then make sure it's handled as ambiguous data and not lack of coverage throughout; while if fixable fix

Possibility to Export Unannotated sequences

Completely unannotated sequences (GeenuFF coordinates) have been filtered out to avoid the ambiguity between e.g. a contig on which annotation was never performed and a contig that in fact had no genes; but this needs to be parameterized / only occur for train & eval. Could also perhaps be marked and not filtered (h5s even already have an 'is_annotated' field, but I guess I didn't finish).

This will be important to actually predict on an unannotated genome...

Don't export non-coding genes to h5

preferably parameterized as we have thus far excluded them bc they aren't consistently included... but would be interesting / maybe even biologically logical to be able to include if we ever had the data.

btw, we haven't been doing this yet because GeenuFF has not been importing them, so this mod needs to be done in concert with weberlab-hhu/GeenuFF#392

Extend validation of data

Do genomes include...

  • splicing during start/stop codons
  • alternative splice sites
  • any dinucleotide bias at transcription start site
  • ...

Data read efficiency

Where we read from many different h5 datasets (not just X, y, but also transitions, coverage, spliced coverage, coverage score, someday phase...) we can bottle neck (particularly for smaller networks) on the data read in.

Check if the H5 files can be restructured to make for less random reads / keep data from different datasets but same index in a way it's easy to get all relevant data by index. Then pro/con and decide if it's worth the effort and repercussions to change...

Test using zarr as data format

One last performance test that should not be terribly hard to implement. As hdf5 is not threadsafe and there seems to be a way to prefetch samples with our keras sequences and use_multiprocessing=True (although there are warnings)

Cleanup HelixerModel code

There is so much stuff in there, that we do not need. E.g. canary genomes, gene length, profiling

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.