trackcluster's Introduction

TrackCluster

Trackcluster is an isoform calling and quantification pipeline for long RNA/cDNA reads.

Overview
Requirements
Scripts
Walkthrough

Hint: the ongoing development can be found in the "dev" branch.

Overview

A pipeline for reference-based isoform identification and quantification using long reads. This pipeline was designed to use only long and nosisy reads to make a valid transcriptome. An indicator for the intact 5' could be very helpful to the pipeline, i.e, the splicing leader in the mRNA of nematodes.

The major input/output for this pipeline is "bigg"--"bigGenePred" format.

Requirements

developed on python 3.9, tested on python 3.6 and above (or 2.7.10+), should work with most of the py3 versions
samtools V2.0+ , bedtools V2.24+ and minimap2 V2.24+ in your $PATH

# install the external bins with conda
conda install -c bioconda samtools
conda install -c bioconda bedtools
conda install -c bioconda minimap2

Installation

# use pip from pypi
pip install trackcluster
# or pip from source code for the latest version
git clone https://github.com/Runsheng/trackcluster.git
pip install ./trackcluster

Recommendations

UCSC Kent source tree (for generating binary track), used only in bigg2b.py

Scripts

All scripts can be run directly from shell after pip installation.

trackrun.py: the main script for trackcluser run
bam2bigg.py: convert the mapped read from the bam file, to bigg track format
gff2bigg.py: convert the isoform annotation in gff3 to bigg format
bigg2b.py: convert the bigg track into binary format for better loading in IGV/UCSC
biggmutant.py: change the value of one column in a bigglist

Walkthrough

# test if all dependencies are installed
trackrun.py test --install

# prepare the reference annotation bed file from gff file
# tested on Ensembl, WormBase and Arapost gff
gff2bigg.py -i ensemblxxxx.gff3 -o ref.bed 
# WormBase full gff contains too many information, need to extract the lines from WormBase only
cat c_elegans.PRJNA13758.WS266.annotations.gff3 |grep WormBase > ws266.gff
gff2bigg.py -i ws266.gff -o ref.bed
# the ref.bed can be sorted to speed up the analysis
bedtools sort -i ref.bed > refs.bed # refs.bed contains the sorted, know transcripts from gff annotation

# generate the read track from minimap2 bam file
bam2bigg.py -b group1.bam -o group1.bed
bam2bigg.py -b group2.bam -o group2.bed

# merge the bed file and sort
cat group1.bed group2.bed > read.bed
bedtools sort -i read.bed > reads.bed

# Examples for running commands:
trackrun.py clusterj -s reads.bed -r refs.bed -t 40 # run in junction mode, will generate the isoform.bed
trackrun.py count -s reads.bed -r refs.bed -i isoform.bed # generate the csv file for isoform expression
# alternative for cluster
trackrun.py cluster -s reads.bed -r refs.bed -t 40 # run in exon/intron intersection mode， slower, will generate the isoform.bed

# the post analysis could include the classification of novel isoforms
trackrun.py desc --isoform isoform.bed --reference ref.bed # generate the description for each novel isoform
# this part can be run directly on reads, to count the frequency of splicing events in reads, like intron_retention
trackrun.py addgene -r ref.bed -s reads.bed # will generate reads_gene.bed
trackrun.py desc --isoform reads_gene.bed --reference ref.bed # will generated reads_desc.txt and reads_class12.txt

Citation

Please kindly cite our paper for using trackcluster in your work.

Li, R., Ren, X., Ding, Q., Bi, Y., Xie, D. and Zhao, Z., 2020. Direct full-length RNA sequencing reveals unexpected transcriptome complexity during Caenorhabditis elegans development. Genome research, 30(2), pp.287-298.

trackcluster's People

Contributors

Stargazers

Watchers

trackcluster's Issues

“from trackcluster import convert” wrong

Hi. I downloaded and unzipped the "trackcluster-master.zip", then "python setup.py install" was run. However, I got an error stated as below when I ran bam2bigg.py:
"ImportError: No module named trackcluster"
My comand is "python ${command_dir}/script/bam2bigg.py -b ${bam_dir}/${i}_minimap2_sorted.bam -o ${i}.bed".
Then I used "trackcluster-dev.zip" and installed "trackcluster" package, but another error "ImportError: cannot import name convert" arised. I am sure that both the package "trackcluster" and "convert" can be imported successfully in python.
Could you please help me with this problem? Thank you in advance!

How to realise to use full-length reads in the first round of clustering?

You provide two parameters, ratio and ratio short when computing the similarity score. But I do not find that your programs can automatically identify full-length reads and use those reads to finish the first round of clustering.
Therefore, I want to ask how to just use full-length reads to cluster them in advance?

error occured when using gff2bigg

Hi Runsheng,

We encountered an error when converting the gff file. It seems the gff file should be sorted and also include rows to document gene information. otherwise, it will be failed. Here is the code when the error happens.

            if gff_type == "gene":
                for attribute in attributes.split(";"):
                    if indicator in attribute: # the key "sequence_name" only for wormbase gff
                        kk = attribute.split("=")[1]
                        gene_d[kk] = []
                        gene_d[kk].append(line)

            else:
                for attribute in attributes.split(";"):
                    if "ID" in attribute:
                        parent = attribute.split("=")[1]
                        if kk in parent:
                            gene_d[kk].append(line)
                            break
                    if "Parent" in attribute:
                        parent = attribute.split("=")[1]
                        if kk in parent:
                            gene_d[kk].append(line)
                            break

Leo

runsheng / trackcluster Goto Github PK