aparent's Introduction

APARENT - APA Regression Net

This repository contains the code for training and running APARENT, a deep neural network that can predict human 3' UTR Alternative Polyadenylation (APA), annotate genetic variants based on their impact on APA regulation, and engineer new polyadenylation signals according to target isoform abundances or cleavage profiles.

APARENT was described in Bogard et al., Cell 2019.

The model was trained on >3.5 million randomized 3' UTR poly-A signals expressed from minigene reporters in HEK293 cells.

Forward-engineering of new poly-A signals is done using the included SeqProp (Stochastic Sequence Backpropagation) software, which implements a gradient-based input optimization algorithm and uses APARENT as the predictor.
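
For intuition, the sketch below illustrates the general idea of gradient-based sequence optimization in plain NumPy. It is a conceptual illustration only, not the SeqProp API: predict_and_grad is a hypothetical helper standing in for a framework-specific call that evaluates APARENT on a relaxed (softmax) sequence and returns the predicted isoform score together with its gradient.

import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def optimize_pas(predict_and_grad, seq_len=205, n_steps=200, lr=0.5):
    #Relaxed sequence representation: one row of base logits per position
    logits = 0.01 * np.random.randn(seq_len, 4)
    for _ in range(n_steps):
        soft_seq = softmax(logits)
        #Hypothetical helper: isoform score and its gradient w.r.t. soft_seq
        score, grad = predict_and_grad(soft_seq)
        #Straight-through-style update: apply the gradient directly to the logits
        logits += lr * grad
    #Discretize the optimized relaxed sequence into a concrete poly-A signal
    return ''.join('ACGT'[i] for i in softmax(logits).argmax(axis=1))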

Further down on this page you will find links to IPython Notebooks containing all of the analyses performed in the paper, as well as a link to the repository containing all of the processed data used by the notebooks.

Contact jlinder2 (at) cs.washington.edu for any questions about the model or data.

Web Prediction Tool

We have hosted a publicly accessible web application where users can predict APA isoform abundance and variant effects with APARENT and visualize the results.

The web prediction tool is located at https://apa.cs.washington.edu.

Installation

APARENT can be installed by cloning or forking the GitHub repository:

git clone https://github.com/johli/aparent.git
cd aparent
python setup.py install

APARENT requires the following packages to be installed:

  • Python >= 3.6
  • Tensorflow >= 1.13.1
  • Keras >= 2.2.4
  • Scipy >= 1.2.1
  • Numpy >= 1.16.2
  • Isolearn >= 0.2.0 (github)
  • [Optional] Pandas >= 0.24.2
  • [Optional] Matplotlib >= 3.1.1
  • [Optional] SeqProp >= 0.1 (github)

Usage

APARENT is built as a Keras Model and can therefore be executed with simple Keras function calls. See the example usage notebooks below for a tutorial on how to use the model for APA and variant effect prediction.

This simple example illustrates how to predict the isoform abundance and cleavage profile of an input APA event:

import keras
from keras.models import Sequential, Model, load_model
from aparent.predictor import *

#Load APADB-tuned APARENT model and input encoder
apadb_model = load_model('../saved_models/aparent_apadb_fitted_large_lessdropout_no_sampleweights.h5')
apadb_encoder = get_apadb_encoder()

#Example APA sites (gene = PSMC6)

#Proximal and Distal PAS Sequences
seq_prox = 'AGATAGTGGTATAAGAAAGCATTTCTTATGACTTATTTTGTATCATTTGTTTTCCTCATCTAAAAAGTTGAATAAAATCTGTTTGATTCAGTTCTCCTACATATATATTCTTGTCTTTTCTGAGTATATTTACTGTGGTCCTTTAGGTTCTTTAGCAAGTAAACTATTTGATAACCCAGATGGATTGTGGATTTTTGAATATTAT'
seq_dist = 'TGGATTGTGGATTTTTGAATATTATTTTAAAATAGTACACATACTTAATGTTCATAAGATCATCTTCTTAAATAAAACATGGATGTGTGGGTATGTCTGTACTCCTCCTTTCAGAAAGTGTTTACATATTCTTCATCTACTGTGATTAAGCTCATTGTTGGTTAATTGAAAATATACATGCACATCCATAACTTTTTAAAGAGTA'

#Site Distance
site_distance = 180

#Proximal and Distal cut intervals within each sequence defining the isoforms
prox_cut_start, prox_cut_end = 80, 105
dist_cut_start, dist_cut_end = 80, 105

#Predict with APADB-tuned APARENT model
iso_pred, cut_prox, cut_dist = apadb_model.predict(x=apadb_encoder([seq_prox], [seq_dist], [prox_cut_start], [prox_cut_end], [dist_cut_start], [dist_cut_end], [site_distance]))

print("Predicted proximal vs. distal isoform % (APADB) = " + str(iso_pred[0, 0]))

APARENT Example Usage Notebooks

The example notebooks below illustrate how to use the APARENT Keras models to predict APA given a proximal and distal site, to predict APA variant effects, and to detect polyA peaks. These are the two model versions we recommend using:

saved_models/aparent_large_lessdropout_all_libs_no_sampleweights.h5

The base version of APARENT. Given an input sequence, it predicts the (non-normalized) isoform abundance and cleavage distribution. It is non-normalized in the sense that predictions are not scaled with respect to a particular distal site, but rather reflect the average distal bias of the training MPRA data. The main use of this model is to predict the effect of variants by calculating the odds ratio between variant and wildtype isoform predictions, as sketched below.
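
For example, a variant effect can be computed as the log odds ratio between the variant and wildtype isoform predictions. The sketch below assumes that aparent.predictor provides an input encoder for the base model (here called get_aparent_encoder, taking a list of 205-nt sequences; treat the exact encoder call as an assumption) and that the first model output holds the isoform proportion, as in the APADB example above:

import numpy as np
from keras.models import load_model
from aparent.predictor import *

#Load the base APARENT model
aparent_model = load_model('../saved_models/aparent_large_lessdropout_all_libs_no_sampleweights.h5')
aparent_encoder = get_aparent_encoder()  #assumed encoder for the base model (takes a list of 205-nt sequences)

def logit(p):
    return np.log(p / (1.0 - p))

def variant_log_odds_ratio(seq_wt, seq_var):
    #Predict (non-normalized) isoform proportions for the wildtype and variant sequences
    iso_wt, _ = aparent_model.predict(x=aparent_encoder([seq_wt]))
    iso_var, _ = aparent_model.predict(x=aparent_encoder([seq_var]))
    #Variant effect = difference in log odds (i.e. the log odds ratio)
    return logit(iso_var[0, 0]) - logit(iso_wt[0, 0])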

saved_models/aparent_apadb_fitted_large_lessdropout_no_sampleweights.h5

A siamese APARENT network model that expects both the proximal and distal sequences as input. APARENT scores each site independently; the scores are then weighted and combined with the log site distance, where the combination weights have been fitted on the pooled-tissue APADB data (a schematic illustration is given below).
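
Schematically, the combination can be thought of as a weighted sum of the two site scores and the log site distance passed through a sigmoid. The snippet below is only an illustration of that description; the weights shown are hypothetical placeholders, not the actual APADB-fitted coefficients:

import numpy as np

def combine_site_scores(score_prox, score_dist, site_distance, w_prox=1.0, w_dist=-1.0, w_site=0.5, b=0.0):
    #Hypothetical weights stand in for the APADB-fitted combination coefficients
    combined_logit = w_prox * score_prox + w_dist * score_dist + w_site * np.log(site_distance) + b
    #Predicted proximal isoform fraction
    return 1.0 / (1.0 + np.exp(-combined_logit))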

Notebook 1: APA Isoform & Cleavage Prediction
Notebook 2: APA Variant Effect Prediction
Notebook 3: PolyA Peak Detection

Note: These model versions are not the ones evaluated in the paper; they have been trained on all MPRA libraries (no libraries have been held out) in order to make the best possible APA predictor.

Legacy Model & Code Availability

The Legacy Model is the version evaluated in the paper, which we provide here for reproducibility. The model architecture itself has not changed since the Legacy version, but the newest version has been trained on all MPRA libraries. The Legacy models (base version and APADB-fitted version) are located at saved_models/legacy_models/.

The Legacy model was originally built and trained using Theano. Theano is no longer actively developed, so we have ported the original model to Keras. The original Theano training code can be found in the repository below:

Legacy Code Repository

Data Availability

The raw sequencing data for the 3' UTR MPRA libraries are found at GEO accession GSE113849.

The Legacy Data is the version of the processed data analyzed in the paper, which we provide here for reproducibility. The newest version of the data has been re-processed with the following additional improvements:

  1. Exact cleavage positions have been mapped for the Alien1 Random MPRA Sublibrary.
  2. A 20 nt random barcode upstream of the USE in the Alien1 Sublibrary has been included in the sequence.

Processed Data Repository
Processed Data Repository (legacy)

Note: The "Processed Data Repository" also includes the Legacy data, but the data has been re-formatted such that it is easier to work with in Keras.

Analysis

The following collection of IPython Notebooks contains all of the analyses performed in the paper. To aid reproducibility, we have used the Legacy APARENT model and Legacy Data in all of the notebooks.

Random MPRA Linear Model Notebooks

Log Odds Ratio Analysis of hexamers in the Random MPRA libraries and Linear Logistic Hexamer Regression.
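
For reference, the log odds ratio statistic referred to here has the usual form; the snippet below is a generic sketch only (the notebooks define the exact counting scheme and any pseudo-counts):

import numpy as np

def hexamer_log_odds_ratio(prox_with, dist_with, prox_without, dist_without, pseudo=1.0):
    #Proximal vs. distal read counts for sequences with / without a given hexamer
    odds_with = (prox_with + pseudo) / (dist_with + pseudo)
    odds_without = (prox_without + pseudo) / (dist_without + pseudo)
    return np.log(odds_with / odds_without)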

Notebook 1a: Isoform Log Odds Ratio Analysis (Alien1 Library)
Notebook 1b: Isoform Log Odds Ratio Analysis (Alien2 Library)
Notebook 2: Cleavage Log Odds Ratio Analysis (Alien1 Library)
Notebook 3a: Hexamer Logistic Regression (Combined Library)
Notebook 3b: Hexamer Logistic Regression (TOMM5 Library only)
Notebook 3c: Hexamer Logistic Regression (Alien1 Library only)
Notebook 3d: Hexamer Logistic Regression (Alien2 Library only)

Random MPRA Neural Network Notebooks

Evaluation of APARENT on the Random MPRA libraries, and Convolutional Layer 1 & 2 visualizations.

Notebook 1: MPRA Prediction Evaluation
Notebook 2a: Conv Layer 1 and 2 Analysis (Alien1 Library)
Notebook 2b: Conv Layer 1 and 2 Analysis (Alien2 Library)
Notebook 3: CSE Hexamer Filter (Conv Layer 1)
Notebook 4: Cleavage Motifs (Conv Layer 1)

SeqProp APA Engineering Notebooks

Engineering of PAS sequences according to target isoform and cleavage objectives (and DeepDream).

Notebook 1: Target Isoform Sequence Optimization
Notebook 2: Target Cleavage Sequence Optimization
Notebook 3: Dense Layer Sequence Visualization (DeepDream-Style)

Designed MPRA Analysis Notebooks

Analysis of the Designed MPRA library, including Forward-engineering, Native PAS prediction, and Variant analysis.

Notebook 0a: Basic MPRA Library Statistics
Notebook 0b: MPRA LoFi vs. HiFi Replicates

Notebook 1a: SeqProp Target Isoforms (Summary)
Notebook 1b: SeqProp Target Isoforms (Detailed)

Notebook 2a: SeqProp Target Cut (Summary)
Notebook 2b: SeqProp Target Cut (Detailed)

Notebook 3: Human Wildtype APA Prediction

Notebook 4a: Human Variant Analysis (Summary)
Notebook 4b: Disease-Implicated Variants/UTRs (Detailed)
Notebook 4c: Cleavage-Altering Variants (Detailed)

Notebook 5a: Complex Functional Variants (Summary)
Notebook 5b: Complex Functional Variants (Canonical CSE)
Notebook 5c: Complex Functional Variants (Cryptic CSE)
Notebook 5d: Complex Functional Variants (CFIm25)
Notebook 5e: Complex Functional Variants (CstF)
Notebook 5f: Complex Functional Variants (Folding)

Notebook Bonus: TGTA Motif Saturation Mutagenesis

Native APA Analysis Notebooks

Analysis of native human APA (APADB and Leslie APA Atlas), including cell-type specific APA prediction evaluation.

Data sources: (APADB | Leslie)

Notebook 0: Basic Data Statistics
Notebook 1: Differential Usage Analysis
Notebook 2: Cleavage Site Prediction
Notebook 3: APA Isoform Prediction
Notebook 4: APA Isoform Prediction (Cross-Validation)

aparent's Issues

How to use APARENT to study pPAS and dPAS

Hello!
APARENT is very good software! I hope to use APARENT to predict the sequences near the pPAS and dPAS in my study, but I have encountered the following questions:

  1. Should I use APARENT or APARENT2?
  2. For all the genes I want to study, I have identified a pPAS and a dPAS. Should I use Notebook 1: APA Isoform & Cleavage Prediction? If so, how should I set site_distance, prox_cut_start, and prox_cut_end (and their dist_ equivalents)? What do "Non-normalized proximal sum-cut logit", "Non-normalized distal sum-cut logit" and "Predicted proximal vs. distal isoform % (APADB)" mean? As you said in #1, I have used 100 nt upstream of the poly-A site (proximal and distal) + 205 nt as the sequence.
  3. Should I use Notebook 2: APA Variant Effect Prediction? If so, how do I get the seq? Do the parameters need to be adjusted?
  4. Can I use APARENT to study other aspects of APA? I already have the pPAS and dPAS sites for my genes.

Best,
Yang

Potential PolyA Detector scoring issue

Hi there,

I cloned the APARENT repo to run the polyA detector (score_polya_peaks) locally. I used the "Notebook 3: PolyA Peak Detection" Jupyter notebook as a guide to get started. However, running the code on the example data, I got different results from the peak_iso_scores function, specifically for the last score. The notebook suggests I should detect 3 peaks with the following logit scores:

Peak PAS scores (log odds) = [2.413, 0.965, 1.688]

The results I get are:

Peak PAS scores (log odds) = [2.413, 0.965, -10]
with the last score being -10 vs. 1.688 in the Jupyter notebook.

Investigating further with different sequences, I found that I always get -10 as my last peak score. E.g., if I use a sequence with one peak, the score is -10. If I use a sequence with 6 peaks, the last score is again -10. This is true even for a sequence that previously contained a high-scoring peak.

I then checked this against the web tool (https://apa.cs.washington.edu/detect), which gives the same results, always a score of -10 for the last detected peak. This can be reproduced using the example sequence in the Jupyter notebook, so this issue is not due to my particular system setup.

I don't believe this is the intended behaviour. Let me know if you want me to post my system details.

How to ensure the number of test set samples?

Thanks for your wonderful work!

I would like to use the original training and test sets, keeping the same sequences as in the original paper. I ran the notebook https://nbviewer.org/github/johli/aparent/blob/master/analysis/evaluate_aparent_random_mpra_legacy.ipynb and found that every library had the same number of test samples. This seems different from the split shown in Figure 1C of the paper.

Thanks for your time!
Best regards!

Original paper: [screenshot]
evaluate_aparent_random_mpra_legacy.ipynb: [screenshot]

I cannot rerun the notebook for polyA peak detection

Hello Dear Developer,

I am trying to run your notebook for predicting polyA sites: https://nbviewer.jupyter.org/github/johli/aparent/blob/master/examples/aparent_example_pas_detection.ipynb, but I got the following error. Could you help with this error?

Thanks,
Haibo

peak_ixs, polya_profile = find_polya_peaks(
... aparent_model,
... aparent_encoder,
... seq,
... sequence_stride=5,
... conv_smoothing=True,
... peak_min_height=0.01,
... peak_min_distance=50,
... peak_prominence=(0.01, None)
... )
Traceback (most recent call last):
File "", line 9, in
File "/home/hl84w/work/mccb/bin/miniconda2/envs/aprent/lib/python3.6/site-packages/aparent/predictor/aparent_predictor.py", line 131, in find_polya_peaks
_, cut_pred = aparent_model.predict(x=aparent_encoder([seq_slice]))
File "/home/hl84w/work/mccb/bin/miniconda2/envs/aprent/lib/python3.6/site-packages/keras/engine/training.py", line 1149, in predict
x, _, _ = self._standardize_user_data(x)
File "/home/hl84w/work/mccb/bin/miniconda2/envs/aprent/lib/python3.6/site-packages/keras/engine/training.py", line 751, in _standardize_user_data
exception_prefix='input')
File "/home/hl84w/work/mccb/bin/miniconda2/envs/aprent/lib/python3.6/site-packages/keras/engine/training_utils.py", line 102, in standardize_input_data
str(len(data)) + ' arrays: ' + str(data)[:200] + '...')
ValueError: Error when checking model input: the list of Numpy arrays that you are passing to your model is not the size the model expected. Expected to see 9 array(s), but instead got the following list of 3 arrays: [array([[[[1.],
[0.],
[0.],
[0.]],

    [[0.],
     [0.],
     [1.],
     [0.]],

    [[1.],
     [0.],
     [0.],
     [0.]],

    [[0.]...

How to get PAS sequence?

Hi, I'd like to annotate transcripts with differential APA.
What exactly do I have to provide as input PAS sequence?

The input to APARENT is a 205 bp long sequence.
Which part of the transcript sequence do I therefore need to select?

Assuming the CSE is at most 50 bp upstream of the poly-A cut, i.e. the UTR end:
it should be OK to use 130 bp upstream to 75 bp downstream of the UTR end, shouldn't it?
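
For concreteness, here is a minimal sketch of the slicing I mean (assuming 0-based coordinates and that cut_pos marks the UTR end; the exact offsets may need adjusting):

def pas_window(transcript_seq, cut_pos, upstream=130, downstream=75):
    #130 nt upstream + 75 nt downstream of the UTR end = 205 nt model input
    assert upstream + downstream == 205
    return transcript_seq[cut_pos - upstream : cut_pos + downstream]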

Why does the data folder have no data files?

Hi, I have been studying your paper recently, but when I tried to reproduce your code, I found that there is no data in the project. I have downloaded the dataset from the website you provide in the README, but I don't know how to use it.

Kipoi APARENT

Hi, I am currently running Kipoi APARENT.

The provided example VCF file contains 15,077 variants, but the output only has 416 predictions for delta_logit_distal_prop and delta_logit_proximal_prop. I was expecting 15,077 predictions. Do you know why I only get 416 predictions?
