

KLEAT


Cleavage site prediction via de novo assembly.

Note: this is a reimplementation from scratch of the following paper (PMID: 25592595),

  • Birol I, Raymond A, Chiu R, Nip KM, Jackman SD, Kreitzman M, et al. KLEAT: cleavage site analysis of transcriptomes. Pac Symp Biocomput. World Scientific. 2015;:347–58.

KLEAT works by

  1. Collect polyA evidence (suffix, bridge, link, and blank) and the clvs it supports, per contig
  2. Aggregate polyA evidence per clv, identified by (seqname, strand, clv)
  3. Find the closest annotated clv for each predicted clv and calculate the distance between them.

Suffix was called tail in the original paper, but that name collides with the tail of a bridge contig, which could cause confusion, so it is renamed to suffix in the code base.

Blank indicates a contig without any direct indication of polyA. It can still support cleavage site prediction: if a transcript is expressed, a cleavage site likely exists nearby.
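The aggregation step (2) can be sketched in plain Python. Note that KLEAT itself relies on pandas and its real record schema differs; the field names and values below are purely illustrative:

```python
from collections import defaultdict

# Each record is one piece of polyA evidence from one contig;
# fields and values are illustrative, not KLEAT's actual schema.
records = [
    {'seqname': 'chr1', 'strand': '+', 'clv': 1000, 'evidence': 'suffix'},
    {'seqname': 'chr1', 'strand': '+', 'clv': 1000, 'evidence': 'bridge'},
    {'seqname': 'chr1', 'strand': '-', 'clv': 2500, 'evidence': 'link'},
]

# Step 2: aggregate polyA evidence per clv, keyed by (seqname, strand, clv)
aggregated = defaultdict(list)
for rec in records:
    key = (rec['seqname'], rec['strand'], rec['clv'])
    aggregated[key].append(rec['evidence'])
```

Here the two records for `('chr1', '+', 1000)` collapse into a single candidate cleavage site with two pieces of supporting evidence.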

Install

KLEAT (>3.0.0) supports Python 3 only (>=3.4). Key dependencies include pysam, pandas, and scikit-learn.

First, it's recommended to create a virtual environment, using either conda

conda create -p venv-kleat python=3
source activate venv-kleat/

or pip + virtualenv

pip install virtualenv # skip this step if virtualenv is available already
virtualenv venv-kleat
. venv-kleat/bin/activate

Then install kleat with pip

pip install git+https://github.com/zyxue/kleat.git#egg=kleat

Features

  • Suffix (previously known as tail), bridge, link, blank
  • Search for the PAS hexamer on the contig
  • Hardclipped regions are considered, too, and are well tested
  • Process all chromosomes in parallel, and parallelize other steps as much as possible (e.g. aggregating polyA evidence per clv)

Usage

  • Inputs to --contigs-to-genome and --reads-to-contigs should both be sorted and indexed with samtools.

Notes on result interpretation

ctg_hex_pos, the position of the PAS hexamer on the contig, is not interpretable in reference genome terms because there may be insertions and deletions, but ctg_hex_dist, the distance between the contig PAS hexamer and the clv, is interpretable.

We decided not to remove this column, leaving it as a sanity check and an indication of indels. Indels can also be inferred from the difference between ctg_hex_dist and ref_hex_dist, if any.

ctg_hex_pos may be especially useful when ref_hex is not found: it provides a rough estimate of the hexamer's location, which helps to quickly locate the PAS hexamer on the contig in a genome browser (e.g. IGV).
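The indel inference mentioned above amounts to a simple difference. `infer_indel_size` is a hypothetical helper for illustration, not part of KLEAT's API:

```python
def infer_indel_size(ctg_hex_dist, ref_hex_dist):
    """Net indel size implied by the two hexamer-to-clv distances.

    Positive: net insertion in the contig relative to the reference;
    negative: net deletion; zero: no net indel between hexamer and clv.
    """
    return ctg_hex_dist - ref_hex_dist
```

For example, ctg_hex_dist=21 with ref_hex_dist=18 would suggest a net 3 bp insertion in the contig between the hexamer and the clv.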

Development

virtualenv venv
. venv/bin/activate
pip install -r requirements_dev.txt
python setup.py develop
kleat --help

usage: kleat [-h] -c CONTIGS_TO_GENOME -r READS_TO_CONTIGS -f REFERENCE_GENOME
             -a KARBOR_CLV_ANNOTATION [-o OUTPUT] [-m OUTPUT_FORMAT]
             [-p NUM_CPUS] [--keep-pre-aggregation-tmp-file]
             [--bridge-skip-check-size BRIDGE_SKIP_CHECK_SIZE]
             [--cluster-first-then-aggregate]
             [--cluster-cutoff CLUSTER_CUTOFF]

KLEAT: cleavage site detection via de novo assembly

optional arguments:
  -h, --help            show this help message and exit
  -c CONTIGS_TO_GENOME, --contigs-to-genome CONTIGS_TO_GENOME
                        input contig-to-genome alignment BAM file
  -r READS_TO_CONTIGS, --reads-to-contigs READS_TO_CONTIGS
                        input read-to-contig alignment BAM file
  -f REFERENCE_GENOME, --reference-genome REFERENCE_GENOME
                        reference genome FASTA file, if provided, KLEAT will
                        search polyadenylation signal (PAS) hexamer in both
                        contig and reference genome, which is useful for
                        checking mutations that may affect the PAS hexamer. Note
                        this fasta file needs to be consistent with the one
                        used for generating the read-to-contig BAM alignments
  -a KARBOR_CLV_ANNOTATION, --karbor-clv-annotation KARBOR_CLV_ANNOTATION
                        the annotated clv pickle formatted for karbor, with
                        (seqname, strand, clv, gene_ids, gene_names) columns;
                        this file is processed from a GTF annotation file
  -o OUTPUT, --output OUTPUT
                        output file; if not specified, the prefix "output" is
                        used, and the extension depends on the value of
                        --output-format, e.g. output.csv, output.pickle, etc.
  -m OUTPUT_FORMAT, --output-format OUTPUT_FORMAT
                        also supports tsv and pickle (python)
  -p NUM_CPUS, --num-cpus NUM_CPUS
                        parallelize the step of aggregating polyA evidence for
                        each (seqname, strand, clv)
  --keep-pre-aggregation-tmp-file
                        specify this if you would like to keep the tmp file
                        before aggregating polyA evidence per cleavage site,
                        mostly for debugging purposes
  --bridge-skip-check-size BRIDGE_SKIP_CHECK_SIZE
                        the size beyond which the clv is predicted to be on
                        the next matched region; otherwise, the clv is
                        predicted to be at the edge of the previous match.
                        This argument was added because inconsistent break
                        points are observed between the read (softclip in the
                        r2c alignment) and the contig (boundary between
                        BAM_CMATCH and BAM_CREF_SKIP)
  --cluster-first-then-aggregate
                        the default approach is aggregate_polya_evidence ->
                        filter -> cluster; if this argument is specified, the
                        order becomes cluster -> aggregate_polya_evidence ->
                        filter. The idea is that by clustering first,
                        neighbouring clvs would yield a stronger polyA
                        evidence signal, but preliminary results show that it
                        does not make a big difference. Also note that if the
                        data is noisy, clustering before filtering further
                        decreases clv resolution, since single-linkage
                        clustering would merge noisy clvs into the clvs with
                        real signal
  --cluster-cutoff CLUSTER_CUTOFF
                        the cutoff for single-linkage clustering

To uninstall

python setup.py develop --uninstall

Debug instruction

To debug a particular contig, you could insert a filter (followed by a pdb breakpoint) such as below

@@ -32,6 +32,11 @@ def collect_polya_evidence(c2g_bam, r2c_bam, ref_fa, csvwriter, bridge_skip_chec
         if contig.is_unmapped:
             continue

+        if contig.query_name == "<your contig name>" and contig.reference_name == "chrX":
+            pass
+        else:
+            continue
+
         ascs = []           # already supported clvs
         rec = process_suffix(
             contig, r2c_bam, ref_fa, csvwriter)

Zero-based index

Every index is 0-based, including in ASCII visualizations such as

Symbols:
--: ref_skip
//: hardclip at right
\\: hardclip at left
__: deletion
┬ : insertion
 └: softclip at left
 ┘: softclip at right

Abbreviation:
 cc: ctg_clv, clv in contig coordinate
 rc: ref_clv, clv in reference coordinate

icb: init_clv_beg, initialized beginning index in contig coordinate (for - strand clv)
irb: init_ref_beg, initialized beginning index in reference coordinate (for - strand clv)

ice: init_clv_end, initialized end index in contig coordinate (for + strand clv)
ire: init_ref_end, initialized end index in reference coordinate (for + strand clv)

 TTT
   └AT
 89012 <- one coord (0-based)
   1   <- ten coord

which is different from the display in IGV, which is 1-based (although IGV's underlying system is still 0-based).
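Converting between the two conventions is just a matter of adding one; this trivial helper (hypothetical, for illustration only) makes the relation explicit:

```python
def to_igv_coord(zero_based_pos):
    """Convert a 0-based position (as used in KLEAT's code and ASCII
    diagrams) to the 1-based position that IGV displays."""
    return zero_based_pos + 1
```

For example, the position drawn as 10 in the diagram above shows up as 11 in IGV.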

Some key concepts in the code:

  • ctg_clv: clv in contig coordinate including clipped regions and indels
  • gnm_clv (aka ref_clv): clv in genome coordinate
  • gnm_offset: ctg_clv converted to a genome-coordinate offset, with proper handling of skips, clips, and indels, so that gnm_offset is addable to the genome coordinate directly.
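Such a gnm_offset can be sketched as a walk along the cigar (BAM op codes: 0=M, 1=I, 2=D, 3=N, 4=S, 5=H). This is a simplified illustration under stated assumptions (clips and insertions map to the nearest aligned reference position; clips extending past the alignment edges are not handled specially), not KLEAT's actual implementation:

```python
def ctg_clv_to_gnm_offset(cigartuples, ctg_clv):
    """Convert a contig coordinate (including clips and indels) into an
    offset in reference bases from the alignment start.

    cigartuples: list of (op, length) pairs with BAM op codes
    (0=M, 1=I, 2=D, 3=N, 4=S, 5=H), as in pysam's AlignedSegment.
    """
    ctg_pos = 0      # position walked along the contig so far
    gnm_offset = 0   # reference bases consumed so far
    for op, length in cigartuples:
        consumes_ctg = op in (0, 1, 4, 5)  # M, I, S, H advance the contig coord
        consumes_ref = op in (0, 2, 3)     # M, D, N advance the genome offset
        if consumes_ctg:
            if ctg_pos + length > ctg_clv:
                # the clv falls inside this op
                if consumes_ref:  # only M maps 1:1 onto the reference
                    gnm_offset += ctg_clv - ctg_pos
                return gnm_offset
            ctg_pos += length
        if consumes_ref:
            gnm_offset += length
    return gnm_offset
```

For a + strand contig, the genome coordinate would then be roughly `contig.reference_start + gnm_offset`.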

Credits

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.


Issues

Infer ref_clv from chimeric contigs

Given a chimeric contig, when a bridge read is aligned to the contig, the contig is in one piece; but when the contig is aligned to the genome, it becomes two pieces.

Problem: when looping through the alignment of the first piece, how do we infer the genome coordinate of a clv that is actually aligned to the second piece?

Summing up num_suffix_reads is buggy

Currently benchmarking the cluster-first approach, and realized that summing up num_suffix_reads is buggy: a single suffix read, under the current definition, can support multiple neighbouring cleavage sites before clustering.

Potential solution:

  1. more rigorous definition of suffix read, maybe similar to that of bridge read.
  2. take the max of num_suffix_reads instead of summing them up.
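Illustratively, suppose one suffix read supports three neighbouring clvs before clustering, so each clv records it once; option 2 above then amounts to:

```python
# Hypothetical numbers: the same suffix read counted at each of three
# neighbouring clvs that end up in one cluster.
num_suffix_reads_per_clv = [1, 1, 1]

summed = sum(num_suffix_reads_per_clv)  # triple-counts the read
maxed = max(num_suffix_reads_per_clv)   # the proposed fix (option 2)
```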

Think through hardclip vs softclip

hardclip is distinct from softclip in that it's related to chimeric contigs; see if it's indeed necessary to consider it separately from softclip (likely it is).

Parallelize the initial loop over contigs

Main difficulty:

  File "stringsource", line 2, in pysam.libcalignedsegment.AlignedSegment.__reduce_cython__
TypeError: self._delegate cannot be converted to a Python object for pickling

ctg_hex_pos and ctg_hex_dist are buggy when the contig has indels

This is because the coordinate information is currently lost when the sequence is extracted from the contig.

Currently, this search function is used to search for the hexamer in the contig; it only takes the extracted sequence into account, with cigar information missing.

def search(strand, clv, seq, window=50):

Maybe it's easier to define the search window (e.g. 50 bp) w.r.t. the reference; the actual sequence length w.r.t. the contig could be a few bp more or less than 50 bp.
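For reference, a minimal sketch of what such a `search` might look like. The hexamer list is truncated (AATAAA and ATTAAA are the two strongest PAS signals; real lists contain around 16), and, as this issue points out, it works purely on the extracted sequence with cigar information missing:

```python
COMPLEMENT = str.maketrans('ACGT', 'TGCA')

def search(strand, clv, seq, window=50):
    """Search within `window` bp upstream of clv for a PAS hexamer.

    Returns (hexamer, index within the searched region) or None.
    Sketch only: coordinates are in extracted-sequence space, so the
    indel bug described above applies here too.
    """
    hexamers = ['AATAAA', 'ATTAAA']  # truncated list of PAS hexamers
    if strand == '+':
        region = seq[max(0, clv - window):clv]
    else:
        # on the - strand the signal lies downstream, on the reverse complement
        region = seq[clv:clv + window].translate(COMPLEMENT)[::-1]
    for hexamer in hexamers:
        idx = region.rfind(hexamer)  # occurrence closest to the clv
        if idx >= 0:
            return hexamer, idx
    return None
```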
