TrackCluster

Trackcluster is an isoform calling and quantification pipeline for long RNA/cDNA reads.

Hint: the ongoing development can be found in the "dev" branch.

Overview

A pipeline for reference-based isoform identification and quantification using long reads. It was designed to build a valid transcriptome from long, noisy reads alone. An indicator of an intact 5' end, e.g., the spliced leader in nematode mRNA, can be very helpful to the pipeline.

The main input/output format of this pipeline is "bigg", i.e. the UCSC "bigGenePred" format.
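As a rough illustration of what a bigg line carries, the sketch below parses the BED12 core of one bigGenePred line and derives the exon intervals. The helper names (`BiggRecord`, `parse_bigg_line`, `exon_intervals`) and the example values are made up for illustration; they are not part of trackcluster's API.

```python
from typing import List, NamedTuple, Tuple


class BiggRecord(NamedTuple):
    """The BED12 core of a bigGenePred line (hypothetical helper type)."""
    chrom: str
    start: int
    end: int
    name: str
    strand: str
    block_sizes: List[int]
    block_starts: List[int]


def parse_bigg_line(line: str) -> BiggRecord:
    """Parse the first 12 tab-separated columns of one bigGenePred line."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 12:
        raise ValueError("expected at least 12 tab-separated columns")
    # blockSizes and chromStarts are comma-separated and may end with a comma
    sizes = [int(x) for x in cols[10].split(",") if x]
    starts = [int(x) for x in cols[11].split(",") if x]
    return BiggRecord(cols[0], int(cols[1]), int(cols[2]), cols[3], cols[5], sizes, starts)


def exon_intervals(rec: BiggRecord) -> List[Tuple[int, int]]:
    """Return 0-based, half-open exon intervals in chromosome coordinates."""
    return [(rec.start + s, rec.start + s + n)
            for s, n in zip(rec.block_starts, rec.block_sizes)]


# A made-up 20-column line: 12 BED columns plus the 8 bigGenePred extras.
example = "\t".join([
    "chrI", "1000", "2000", "read1", "0", "+", "1000", "2000", "0",
    "2", "100,200,", "0,800,",
    "gene1", "none", "none", "-1,-1,", "nanopore_read", "gene1", "gene1", "none",
])
record = parse_bigg_line(example)
print(exon_intervals(record))  # [(1000, 1100), (1800, 2000)]
```

Block starts are relative to the record start, so each exon is shifted by `start` to get chromosome coordinates.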

Requirements

  1. Developed on Python 3.9 and tested on Python 3.6 and above (or 2.7.10+); should work with most Python 3 versions
  2. samtools V2.0+ , bedtools V2.24+ and minimap2 V2.24+ in your $PATH
# install the external bins with conda
conda install -c bioconda samtools
conda install -c bioconda bedtools
conda install -c bioconda minimap2
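
To confirm that the three external tools are visible on `$PATH` after installation, a minimal sketch like the one below could be used. It only checks that each executable can be found and does not verify versions. The function name `find_tools` is hypothetical; the pipeline's own `trackrun.py test --install` (see Walkthrough) remains the intended check.

```python
import shutil
from typing import Dict, Iterable, Optional

# Tool names taken from the Requirements list above.
REQUIRED_TOOLS = ("samtools", "bedtools", "minimap2")


def find_tools(names: Iterable[str] = REQUIRED_TOOLS) -> Dict[str, Optional[str]]:
    """Map each tool name to its resolved path, or None if it is not on $PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for tool, path in find_tools().items():
        print(f"{tool}: {path or 'NOT FOUND'}")
```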

Installation

# use pip from pypi
pip install trackcluster
# or pip from source code for the latest version
git clone https://github.com/Runsheng/trackcluster.git
pip install ./trackcluster

Recommendations

  1. UCSC Kent source tree (for generating binary track), used only in bigg2b.py

Scripts

All scripts can be run directly from shell after pip installation.

  • trackrun.py: the main script for running trackcluster
  • bam2bigg.py: convert mapped reads from a bam file to bigg track format
  • gff2bigg.py: convert the isoform annotation in gff3 to bigg format
  • bigg2b.py: convert the bigg track into binary format for better loading in IGV/UCSC
  • biggmutant.py: change the value of one column in a bigglist

Walkthrough

# test if all dependencies are installed
trackrun.py test --install

# prepare the reference annotation bed file from gff file
# tested on Ensembl, WormBase and Araport gff
gff2bigg.py -i ensemblxxxx.gff3 -o ref.bed 
# the full WormBase gff contains too much information; extract only the lines whose source is WormBase
cat c_elegans.PRJNA13758.WS266.annotations.gff3 |grep WormBase > ws266.gff
gff2bigg.py -i ws266.gff -o ref.bed
# the ref.bed can be sorted to speed up the analysis
bedtools sort -i ref.bed > refs.bed # refs.bed contains the sorted, known transcripts from the gff annotation

# generate the read track from minimap2 bam file
bam2bigg.py -b group1.bam -o group1.bed
bam2bigg.py -b group2.bam -o group2.bed

# merge the bed file and sort
cat group1.bed group2.bed > read.bed
bedtools sort -i read.bed > reads.bed

# Examples for running commands:
trackrun.py clusterj -s reads.bed -r refs.bed -t 40 # run in junction mode, will generate the isoform.bed
trackrun.py count -s reads.bed -r refs.bed -i isoform.bed # generate the csv file for isoform expression
# alternative for cluster
trackrun.py cluster -s reads.bed -r refs.bed -t 40 # run in exon/intron intersection mode, slower, will generate the isoform.bed

# the post analysis could include the classification of novel isoforms
trackrun.py desc --isoform isoform.bed --reference ref.bed # generate the description for each novel isoform
# this part can be run directly on reads, to count the frequency of splicing events in reads, like intron_retention
trackrun.py addgene -r ref.bed -s reads.bed # will generate reads_gene.bed
trackrun.py desc --isoform reads_gene.bed --reference ref.bed # will generate reads_desc.txt and reads_class12.txt
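
For scripted runs, the clustering and counting steps above could be wrapped as in the following sketch. The subcommands and flags are copied from the walkthrough; the wrapper functions `walkthrough_commands` and `run_all` are hypothetical and default to a dry run that only prints the commands.

```python
import shlex
import subprocess
from typing import List


def walkthrough_commands(reads: str, refs: str, threads: int = 40) -> List[List[str]]:
    """Build the clusterj and count commands as written in the walkthrough."""
    return [
        ["trackrun.py", "clusterj", "-s", reads, "-r", refs, "-t", str(threads)],
        ["trackrun.py", "count", "-s", reads, "-r", refs, "-i", "isoform.bed"],
    ]


def run_all(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing step


run_all(walkthrough_commands("reads.bed", "refs.bed"))
```

Passing `dry_run=False` executes the steps in order and raises `subprocess.CalledProcessError` if one of them exits with a non-zero status.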

Citation

If you use trackcluster in your work, please cite our paper.

Li, R., Ren, X., Ding, Q., Bi, Y., Xie, D. and Zhao, Z., 2020. Direct full-length RNA sequencing reveals unexpected transcriptome complexity during Caenorhabditis elegans development. Genome Research, 30(2), pp.287-298.

trackcluster's Issues

“from trackcluster import convert” wrong

Hi. I downloaded and unzipped "trackcluster-master.zip", then ran "python setup.py install". However, I got the following error when I ran bam2bigg.py:
"ImportError: No module named trackcluster"
My command is "python ${command_dir}/script/bam2bigg.py -b ${bam_dir}/${i}_minimap2_sorted.bam -o ${i}.bed".
Then I used "trackcluster-dev.zip" and installed the "trackcluster" package, but another error arose: "ImportError: cannot import name convert". I am sure that both the "trackcluster" package and "convert" can be imported successfully in Python.
Could you please help me with this problem? Thank you in advance!
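
One possible cause of this kind of error is that the script is run by a different Python interpreter than the one the package was installed into. This has not been confirmed for this report. A minimal diagnostic sketch, using a hypothetical helper `module_origin`, prints which interpreter is running and where it would load `trackcluster` from:

```python
import importlib.util
import sys
from typing import Optional


def module_origin(name: str) -> Optional[str]:
    """Return the file a top-level module would be loaded from, or None if it is not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


print("interpreter:", sys.executable)
print("trackcluster:", module_origin("trackcluster"))
```

Running it with the same `python` that launches bam2bigg.py shows whether that interpreter can see the installed package at all.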

How can I use only full-length reads in the first round of clustering?

You provide two parameters, ratio and ratio short, for computing the similarity score. However, I cannot find where your program automatically identifies full-length reads and uses them for the first round of clustering.
Therefore, I want to ask: how can I cluster using only the full-length reads first?

Error occurred when using gff2bigg

Hi Runsheng,

We encountered an error when converting the gff file. It seems the gff file must be sorted and must also include the rows that document gene information; otherwise, the conversion fails. Here is the code where the error happens.

            if gff_type == "gene":
                for attribute in attributes.split(";"):
                    if indicator in attribute: # the key "sequence_name" only for wormbase gff
                        kk = attribute.split("=")[1]
                        gene_d[kk] = []
                        gene_d[kk].append(line)

            else:
                for attribute in attributes.split(";"):
                    if "ID" in attribute:
                        parent = attribute.split("=")[1]
                        if kk in parent:
                            gene_d[kk].append(line)
                            break
                    if "Parent" in attribute:
                        parent = attribute.split("=")[1]
                        if kk in parent:
                            gene_d[kk].append(line)
                            break

Leo
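
The snippet above relies on `kk` having been set by an earlier "gene" row, which is why an unsorted file, or one without gene rows, fails. As an illustration only, and not the project's actual fix, a two-pass grouping that does not depend on line order could look like the sketch below. It assumes standard GFF3 `ID`/`Parent` attributes rather than the WormBase-specific indicator key used above, and the names `parse_attributes` and `group_by_gene` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def parse_attributes(field: str) -> Dict[str, str]:
    """Split a GFF3 attribute column ('ID=x;Parent=y') into a dict."""
    pairs = (item.split("=", 1) for item in field.strip().split(";") if "=" in item)
    return {key.strip(): value.strip() for key, value in pairs}


def group_by_gene(lines: List[str]) -> Dict[str, List[str]]:
    """Group GFF3 lines under their top-level gene ID, regardless of line order."""
    records = []
    parent_of: Dict[str, str] = {}
    genes = set()
    # Pass 1: record every ID -> Parent link and note which IDs are genes.
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        attrs = parse_attributes(cols[8])
        feature_id = attrs.get("ID")
        parent = attrs.get("Parent", "").split(",")[0]  # first parent only
        if cols[2] == "gene" and feature_id:
            genes.add(feature_id)
        elif feature_id and parent:
            parent_of[feature_id] = parent
        records.append((line, cols[2], feature_id, parent))

    # Pass 2: walk each feature up its Parent chain until a gene is reached.
    grouped: Dict[str, List[str]] = defaultdict(list)
    for line, feature_type, feature_id, parent in records:
        node = feature_id if feature_type == "gene" else parent
        seen = set()
        while node and node not in genes and node not in seen:
            seen.add(node)
            node = parent_of.get(node, "")
        if node in genes:
            grouped[node].append(line)
    return dict(grouped)


# Deliberately unsorted toy input: exon first, gene last.
demo = [
    "chrI\tsrc\texon\t100\t200\t.\t+\t.\tID=e1;Parent=t1",
    "chrI\tsrc\tmRNA\t100\t500\t.\t+\t.\tID=t1;Parent=g1",
    "chrI\tsrc\tgene\t100\t500\t.\t+\t.\tID=g1",
]
print({gene: len(members) for gene, members in group_by_gene(demo).items()})  # {'g1': 3}
```

Lines within each group keep their input order, so a caller that needs the gene row first would still have to sort each group. Features whose Parent chain never reaches a gene row are dropped rather than raising an error.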
