riboraptor's Introduction

riboraptor: a pipeline for analysing ribosome profiling data

Python package to analyse ribosome profiling data. Most of the functionality has been ported to ribotricer.

Installation

Setting up conda

  1. Install conda; the easiest way is via the Miniconda package. The Python 3 version is recommended.
  2. Set up channels. It is important to add them in this order:
conda config --add channels r
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda

Installing dependencies

We will create a separate conda environment for running riboraptor. The environment name is also riboraptor. If you already have a conda environment named riboraptor, you can delete it by running:

source deactivate riboraptor && conda env remove -n riboraptor

We will now install all the dependencies:

conda create --name riboraptor python=3 gcc matplotlib numpy pandas pybedtools \
pyBigWig pyfaidx pysam scipy seaborn statsmodels six click click-help-colors htseq biopython bx-python \
h5py joblib trackhub pytest snakemake sra-tools star fastqc trim-galore ucsc-bedgraphtobigwig ucsc-bedsort \
ucsc-bigwigmerge bamtools pysradb && source activate riboraptor

We also have two additional dependencies for processing and downloading SRA datasets:

  1. aspera connect: enables .fasp downloads from SRA

     Linux download link: https://download.asperasoft.com/download/sw/connect/3.7.4/aspera-connect-3.7.4.147727-linux-64.tar.gz

  2. SRAdb: for fetching all experiments of an SRA project with the associated metadata

Since there is currently a bug in bioconductor-sradb, we will install it from GitHub:

git clone https://github.com/seandavi/SRAdb
cd SRAdb

Start R and install SRAdb within R using devtools. Make sure your riboraptor environment is already activated (source activate riboraptor):

library(devtools)
devtools::install(".")

And finally, we need two metadata files for processing SRA records:

mkdir riboraptor-data && cd riboraptor-data
wget -c http://starbuck1.s3.amazonaws.com/sradb/GEOmetadb.sqlite.gz && gunzip GEOmetadb.sqlite.gz
wget -c https://starbuck1.s3.amazonaws.com/sradb/SRAmetadb.sqlite.gz && gunzip SRAmetadb.sqlite.gz
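
As a quick sanity check that the downloads are intact, you can open each database and list its tables (a minimal Python sketch; the exact table names depend on the GEOmetadb/SRAmetadb schema versions):

import sqlite3

# Open each metadata database and list its tables; a corrupt or truncated
# download will typically fail here or show no tables.
for db in ["GEOmetadb.sqlite", "SRAmetadb.sqlite"]:
    conn = sqlite3.connect(db)
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    print(db, "->", len(tables), "tables")
    conn.close()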

Installing riboraptor

source activate riboraptor
git clone https://github.com/saketkc/riboraptor.git
cd riboraptor
python setup.py install --single-version-externally-managed --record=record.txt

We will assume you have the following directory structure for the rest of our analysis:

some_root_directory
├── riboraptor
│   ├── snakemake
│   └── setup.py
├── riboraptor-data
│   ├── GEOmetadb.sqlite
│   └── SRAmetadb.sqlite

Using riboraptor

Usage mode 1: use riboraptor as a Snakemake-based workflow

See example workflow

Usage mode 2: use riboraptor as a standalone toolkit

See: https://riboraptor.readthedocs.io/en/latest/

Usage mode 3: ribopod - database

In progress: http://ribopod.usc.edu/

Features

See: https://riboraptor.readthedocs.io/en/latest/cmd-manual.html

riboraptor's Issues

snakemake progress out of sync

  • riboraptor version: 0.3.0
    snakemake reports >100 percent progress on a re-run; possibly an upstream bug triggered by our use of dynamic outputs for certain rules.

[Tue Sep 11 17:13:45 2018]
Finished job 596.
1083 of 1119 steps (97%) done
Submitted job 340 with external jobid 'Submitted batch job 1493698'.
[Tue Sep 11 17:17:07 2018]
Finished job 340.
1084 of 1119 steps (97%) done
Dynamically updating jobs
[Tue Sep 11 17:32:01 2018]
Finished job 51.
1085 of 36 steps (3014%) done

[Tue Sep 11 17:32:01 2018]
rule aggregate_multiqc_plots:
    input: multiqc_report/multiqc_plots/png/mqc_featureCounts_assignment_plot_1.png, multiqc_report/multiqc_plots/png/mqc_featureCounts_assignment_plot_1_pc.png, multiqc_report/multiqc_plots/png/mqc_star_alignment_plot_1.png, multiqc_report/multiqc_plots/png/mqc_star_alignment_plot_1_pc.png, multiqc_report/multiqc_plots/png/mqc_cutadapt_plot_Counts.png, multiqc_report/multiqc_plots/png/mqc_cutadapt_plot_Obs_Exp.png, multiqc_report/multiqc_plots/png/mqc_fastqc_sequence_counts_plot_1.png, multiqc_report/multiqc_plots/png/mqc_fastqc_sequence_counts_plot_1_pc.png, multiqc_report/multiqc_plots/png/mqc_fastqc_per_base_sequence_quality_plot_1.png, multiqc_report/multiqc_plots/png/mqc_fastqc_per_sequence_quality_scores_plot_1.png, multiqc_report/multiqc_plots/png/mqc_fastqc_per_sequence_gc_content_plot_Percentages.png, multiqc_report/multiqc_plots/png/mqc_fastqc_per_sequence_gc_content_plot_Counts.png, multiqc_report/multiqc_plots/png/mqc_fastqc_per_base_n_content_plot_1.png, multiqc_report/multiqc_plots/png/mqc_fastqc_sequence_duplication_levels_plot_1.png, multiqc_report/multiqc_plots/png/mqc_fastqc_overrepresented_sequencesi_plot_1.png, multiqc_report/multiqc_plots/png/mqc_fastqc_adapter_content_plot_1.png, featureCounts/fcounts.tsv
    output: multiqc_report/aggregated_report.html
    jobid: 1522
    benchmark: benchmarks/aggregate_multiqc_plots/multiqc.txt

Submitted job 1522 with external jobid 'Submitted batch job 1493969'.
[Tue Sep 11 17:34:52 2018]
Finished job 1522.
1086 of 36 steps (3017%) done

is riboraptor download-sra working?

The readme says GEOmetadb and SRAmetadb should go in riboraptor-data, but in the code the default location for them seems to be the current path '.', and when I tried to use this command it didn't work. @saketkc can you clarify where we should put those two files? (A possible workaround is sketched below.)
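
Until the defaults are clarified, one defensive workaround is to resolve the database path explicitly before invoking the download step (a hypothetical helper, not riboraptor code; the search locations mirror the README's suggested layout):

import os

def find_metadb(filename="SRAmetadb.sqlite",
                extra_dirs=("riboraptor-data", "../riboraptor-data")):
    """Look for a metadata sqlite file in the CWD first (the code default),
    then in the riboraptor-data locations suggested by the README."""
    candidates = [filename] + [os.path.join(d, filename) for d in extra_dirs]
    for path in candidates:
        if os.path.isfile(path):
            return os.path.abspath(path)
    raise FileNotFoundError("%s not found in: %s" % (filename, ", ".join(candidates)))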

infer-protocol fails on transcriptome-based bams

I think there is no need to create an index for the BAM file in infer_protocol anyway, since we are not really jumping around; a sketch of index-free iteration follows the traceback below.

(riboraptor) hpc3378@[/staging/as/wenzhenl/re-ribo-analysis/SRP010679_human/mapped/srr_tx_bams]$ riboraptor infer-protocol --bam SRR403883.bam --refseq /home/cmb-panasas2/wenzhenl/github/riboraptor/tests/data/hg38_v24_refseq.bed12
[E::hts_idx_push] Chromosome blocks not continuous
Traceback (most recent call last):
  File "/home/cmb-panasas2/wenzhenl/miniconda3/envs/riboraptor/bin/riboraptor", line 11, in <module>
    load_entry_point('riboraptor==0.2.2', 'console_scripts', 'riboraptor')()
  File "/home/cmb-panasas2/wenzhenl/miniconda3/envs/riboraptor/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/home/cmb-panasas2/wenzhenl/miniconda3/envs/riboraptor/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/home/cmb-panasas2/wenzhenl/miniconda3/envs/riboraptor/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/cmb-panasas2/wenzhenl/miniconda3/envs/riboraptor/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/cmb-panasas2/wenzhenl/miniconda3/envs/riboraptor/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/home/cmb-panasas2/wenzhenl/miniconda3/envs/riboraptor/lib/python3.6/site-packages/riboraptor/cli.py", line 459, in infer_protocol_cmd
    bam, refseq, n_reads)
  File "/home/cmb-panasas2/wenzhenl/miniconda3/envs/riboraptor/lib/python3.6/site-packages/riboraptor/infer_protocol.py", line 36, in infer_protocol
    _create_bam_index(bam)
  File "/home/cmb-panasas2/wenzhenl/miniconda3/envs/riboraptor/lib/python3.6/site-packages/riboraptor/count.py", line 49, in _create_bam_index
    pysam.index(bam)
  File "/home/cmb-panasas2/wenzhenl/miniconda3/envs/riboraptor/lib/python3.6/site-packages/pysam/utils.py", line 75, in __call__
    stderr))
pysam.utils.SamtoolsError: 'samtools returned with error 1: stdout=, stderr=samtools index: failed to create index for "SRR403883.bam"\n'
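
Since protocol inference only needs to stream the first few thousand reads, the index could likely be skipped entirely. A minimal sketch of index-free iteration with pysam (the sampling size and tallying are illustrative, not riboraptor's actual implementation):

import pysam

def sample_read_strands(bam_path, n_reads=10000):
    """Stream reads in file order (no .bai index needed) and tally strands."""
    counts = {"+": 0, "-": 0}
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        # until_eof=True iterates sequentially, so no index is required.
        for i, read in enumerate(bam.fetch(until_eof=True)):
            if i >= n_reads:
                break
            if read.is_unmapped:
                continue
            counts["-" if read.is_reverse else "+"] += 1
    return counts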

adaptor trimming

It seems --illumina forces use of the fixed Illumina universal adapter. For project SRP031501 this option does not change the adapter, because the auto-detected one is the same as the universal one. However, the STAR log reports only about 20k input reads, far fewer than the several million in the input. I am not sure whether this is caused by the merge preprocessing step; one way to narrow it down is sketched below.
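
To narrow down where the reads disappear, it may help to count reads in each intermediate FASTQ (raw, merged, trimmed) and compare against STAR's reported input. A minimal sketch (the file paths are placeholders):

import gzip

def count_fastq_reads(path):
    """Count reads in a gzipped FASTQ: one record is four lines."""
    with gzip.open(path, "rt") as fh:
        return sum(1 for _ in fh) // 4

# Placeholder paths for one SRP031501 sample; substitute real SRR IDs.
for stage in ["sratofastq/SRRXXXXXXX.fastq.gz",
              "preprocessed/SRRXXXXXXX_trimmed.fq.gz"]:
    print(stage, count_fastq_reads(stage))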

Fix strand-aware counting and p-site correction based on reads' length

As we know, there is a problem with the current counting based on bigWigs, and it may be even worse because we do not really adjust for P-site shifting. Currently we shift at the end based on the highest peak around the start codon, but the correct way seems to be to adjust each read based on its length. So maybe we should calculate coverage directly from the BAM files, where the read length information is available, and drop the use of bigWig files entirely? A sketch of per-read P-site assignment follows.
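
A minimal sketch of per-read P-site assignment directly from the BAM, with offsets keyed by read length (the offset values are illustrative defaults, not calibrated ones, and this is not riboraptor's current code):

import collections
import pysam

# Illustrative 5'-end -> P-site offsets per read length; real values should
# be calibrated per sample, e.g. from the metagene peak at the start codon.
PSITE_OFFSETS = {28: 12, 29: 12, 30: 13, 31: 13}

def psite_coverage(bam_path):
    """Strand-aware P-site coverage computed directly from a BAM file."""
    coverage = collections.Counter()
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(until_eof=True):
            if read.is_unmapped:
                continue
            offset = PSITE_OFFSETS.get(read.query_length)
            if offset is None:
                continue
            if read.is_reverse:
                # The 5' end of a reverse read is its rightmost aligned base.
                pos = read.reference_end - 1 - offset
                strand = "-"
            else:
                pos = read.reference_start + offset
                strand = "+"
            coverage[(read.reference_name, pos, strand)] += 1
    return coverage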

Some typos in the HTML manual

  • riboraptor version:
  • Python version:
  • Operating System:

Description

Describe what you were trying to get done.
Tell us what happened, what went wrong, and what you expected to happen.

What I Did

Paste the command(s) you ran and the output.
If there was a crash, please include the traceback here.

Erroneous trimming

Lately I have come across cases where trim_galore fails to detect the right adapter sequence. This skews the downstream fragment length distribution, besides affecting the mapping statistics (since we use a stringent cutoff of at most 2 mismatches).

Making a note here so that we have a strategy to handle this. Probably using --illumina will work for all cases; a rough automated check is sketched below.
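
As such a check (a sketch; the report path pattern matches the logs below, and the 50% threshold is an arbitrary assumption), we could parse the trimming report and flag samples where adapter detection looks implausibly low:

import re

def adapter_fraction(report_path):
    """Extract the 'Reads with adapters' percentage from a trimming report."""
    with open(report_path) as fh:
        match = re.search(r"Reads with adapters:\s+[\d,]+\s+\(([\d.]+)%\)",
                          fh.read())
    return float(match.group(1)) if match else None

frac = adapter_fraction("preprocessed/SRR648667.fastq.gz_trimming_report.txt")
if frac is not None and frac < 50.0:  # arbitrary cutoff
    print("Suspiciously low adapter content (%.1f%%); retry with --illumina" % frac)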

Example:

Using default mode

trim_galore -o preprocessed -q 5 sratofastq/SRR648667.fastq.gz  

Results in only 1.6% of reads with adapters:

==========================
Input filename: sratofastq/SRR648667.fastq.gz
Trimming mode: single-end
Trim Galore version: 0.5.0
Cutadapt version: 1.17
Quality Phred score cutoff: 5
Quality encoding type selected: ASCII+33
Adapter sequence: 'TGGAATTCTCGG' (Illumina small RNA adapter; auto-detected)
         AGATCGGAAGAGCACACGTCT
                   TCAGGGAGACTAC
                 CTGTAGGCACCATCAAT
Maximum trimming error rate: 0.1 (default)
Minimum required adapter overlap (stringency): 1 bp
Minimum required sequence length before a sequence gets removed: 18 bp
Output file will be GZIP compressed


This is cutadapt 1.17 with Python 3.5.5
Command line parameters: -f fastq -e 0.1 -q 5 -O 1 -a TGGAATTCTCGG sratofastq/SRR648667.fastq.gz
Running on 1 core
Trimming 1 adapter with at most 10.0% errors in single-end mode ...
Finished in 2095.57 s (42 us/read; 1.44 M reads/minute).

=== Summary ===

Total reads processed:              50,184,481
Reads with adapters:                   780,055 (1.6%)
Reads written (passing filters):    50,184,481 (100.0%)

Total basepairs processed: 1,806,641,316 bp
Quality-trimmed:              30,426,701 bp (1.7%)
Total written (filtered):  1,775,227,008 bp (98.3%)

=== Adapter 1 ===

Sequence: TGGAATTCTCGG; Type: regular 3'; Length: 12; Trimmed: 780055 times.

No. of allowed errors:
0-9 bp: 0; 10-12 bp: 1

Bases preceding removed adapters:
  A: 57.9%
  C: 15.1%
  G: 14.1%
  T: 10.9%
  none/other: 2.0%

Forcing --illumina

(Though Ribo-seq library preparation is often done using a small RNA library prep kit)

trim_galore --illumina -o preprocessed_illumina -q 5 sratofastq/SRR648667.fastq.gz

Results in 95% of reads with adapters (with a warning to add an extra base "A" to the adapter):

No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)

Path to Cutadapt set as: 'cutadapt' (default)
1.17
Cutadapt seems to be working fine (tested command 'cutadapt --version')
Writing report to 'preprocessed_illumina/SRR648667.fastq.gz_trimming_report.txt'

SUMMARISING RUN PARAMETERS
==========================
Input filename: sratofastq/SRR648667.fastq.gz
Trimming mode: single-end
Trim Galore version: 0.5.0
Cutadapt version: 1.17
Quality Phred score cutoff: 5
Quality encoding type selected: ASCII+33
Adapter sequence: 'AGATCGGAAGAGC' (Illumina TruSeq, Sanger iPCR; user defined)
Maximum trimming error rate: 0.1 (default)
Minimum required adapter overlap (stringency): 1 bp
Minimum required sequence length before a sequence gets removed: 20 bp
Output file(s) will be GZIP compressed

Writing final adapter and quality trimmed output to SRR648667_trimmed.fq.gz


  >>> Now performing quality (cutoff 5) and adapter trimming in a single pass for the adapter sequence: 'AGATCGGAAGAGC' from file sratofastq/SRR648667.fastq.gz <<<

10000000 sequences processed
20000000 sequences processed
30000000 sequences processed
40000000 sequences processed
50000000 sequences processed
This is cutadapt 1.17 with Python 3.5.5
Command line parameters: -f fastq -e 0.1 -q 5 -O 1 -a AGATCGGAAGAGC sratofastq/SRR648667.fastq.gz
Running on 1 core
Trimming 1 adapter with at most 10.0% errors in single-end mode ...
Finished in 1953.05 s (39 us/read; 1.54 M reads/minute).

=== Summary ===

Total reads processed:              50,184,481
Reads with adapters:                47,884,855 (95.4%)
Reads written (passing filters):    50,184,481 (100.0%)

Total basepairs processed: 1,806,641,316 bp
Quality-trimmed:              30,426,701 bp (1.7%)
Total written (filtered):  1,728,026,898 bp (95.6%)

=== Adapter 1 ===

Sequence: AGATCGGAAGAGC; Type: regular 3'; Length: 13; Trimmed: 47884855 times.

No. of allowed errors:
0-9 bp: 0; 10-13 bp: 1

Bases preceding removed adapters:
  A: 98.2%
  C: 0.6%
  G: 0.3%
  T: 0.9%
  none/other: 0.0%
WARNING:
    The adapter is preceded by "A" extremely often.
    The provided adapter sequence may be incomplete.
    To fix the problem, add "A" to the beginning of the adapter sequence.

Periodicity for each coverage

Instead of implementing the full function for detecting translating ORFs, a simple version could calculate the periodicity and a p-value for the coverage profile (assuming we calculate coverage correctly after fixing the strand and P-site issues). Since we currently have coverage profiles for the 5'UTR, CDS and 3'UTR, this naive version would let us predict translating 5'UTRs and 3'UTRs first. A sketch follows.
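
One way to score three-nucleotide periodicity on a coverage vector is the Fourier power at period 3 with a permutation p-value (a minimal sketch, not riboraptor's implementation):

import numpy as np

def periodicity_pval(coverage, period=3, n_perm=1000, seed=42):
    """Empirical p-value for power at 1/period vs. shuffled profiles."""
    coverage = np.asarray(coverage, dtype=float)

    def power(x):
        freqs = np.fft.rfftfreq(len(x))
        spectrum = np.abs(np.fft.rfft(x - x.mean())) ** 2
        return spectrum[np.argmin(np.abs(freqs - 1.0 / period))]

    observed = power(coverage)
    rng = np.random.RandomState(seed)
    null = [power(rng.permutation(coverage)) for _ in range(n_perm)]
    return (1.0 + sum(p >= observed for p in null)) / (1.0 + n_perm)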

riboraptor version 0.2.2-0.4.0: no config file found

  • riboraptor version: 0.2.2-0.4.0
  • Python version: 3.6.7
  • Operating System: CENTOS7
  • gcc (GCC) 8.3.0

Description

bash submitall.sh SRP000637

I did not find SRP000637.py in the configs directory, so I created a config file, but it still needs FRAGMENT_LENGTHS, ORIENTATIONS and STRANDS information, and I am not sure the following is OK:

FRAGMENT_LENGTHS=29
ORIENTATIONS='+'
STRANDS='+'

RAWDATA_DIR = '/home/wangk/proj/SHZS20191006025/data'
OUT_DIR = '/home/wangk/proj/SHZS20191006025/data/result'
GENOME_FASTA = '/usr/local/.db/genome/ucsc/mouse/mm10.fa'
CHROM_SIZES = '/usr/local/.db/genome/ucsc/mouse/mm10.sizes'
STAR_INDEX = '/usr/local/.db/genome/ucsc/mouse/star'
GTF = '/usr/local/.db/genome/ucsc/mouse/mm10.gtf'

INTRON_BED = '/usr/local/.db/riboraptor/annotation/mm10/intron.bed'
CDS_BED = '/usr/local/.db/riboraptor/annotation/mm10/cds.bed'
UTR5_BED = '/usr/local/.db/riboraptor/annotation/mm10/utr5.bed'
UTR3_BED = '/usr/local/.db/riboraptor/annotation/mm10/utr3.bed'

Besides, submitall.sh needs sbatch/Slurm installed on the server to submit to a cluster, but my jobs run only on a single node (we have only one server). Is it possible to run jobs without Slurm? (See the sketch at the end of this issue.)

What I Did

I commented out the --cluster option in submitall.sh:

#!/bin/bash
snakemake --snakefile Snakefile \
    --config config_path=configs/$1.py \
    --js $PWD/jobscript.sh \
    --printshellcmds \
    --cluster-config $PWD/cluster.yaml \
    --jobname "{rulename}.{jobid}.$1" \
    --keep-going \
    --stats $PWD/stats/$1.riboraptor.stats \
    --rerun-incomplete \
    -j 200
#--cluster 'sbatch --partition={cluster.partition} --ntasks={cluster.cores} --mem={cluster.mem} --time={cluster.time} -o {cluster.logout} -e {cluster.logerror}'

The result:

Provided cores: 200
Rules claiming more threads will be scaled down.
Unlimited resources: mem_mb
Job counts:
count jobs
1 aggregate_multiqc_plots
1 all
1 featurecounts
1 multiqc
1 qc_report_summarized
5

rule featurecounts:
output: featureCounts/fcounts.tsv
jobid: 1
benchmark: benchmarks/featurecounts/featurecounts.txt
threads: 16

rule qc_report_summarized:
output: reports/riboraptor_report_summarized.html
jobid: 4
benchmark: benchmarks/qc_report_summarized/benchmark.txt

/usr/local/.prog/anaconda/envs/riboseq/bin/python /home/wangk/proj/SHZS20191006025/snakemake/wrappers/.snakemake.3z0qubw9.qc_report_summarizer_wrapper.py
/usr/local/.prog/anaconda/envs/riboseq/bin/python /home/wangk/proj/SHZS20191006025/snakemake/wrappers/.snakemake.vjafordk.featurecounts_wrapper.py
Traceback (most recent call last):
  File "/home/wangk/proj/SHZS20191006025/snakemake/wrappers/.snakemake.3z0qubw9.qc_report_summarizer_wrapper.py", line 48, in <module>
    snakemake.input["metagene"],
  File "/usr/local/.prog/anaconda/envs/riboseq/lib/python3.6/site-packages/snakemake/io.py", line 817, in __getitem__
    return getattr(self, key)
AttributeError: 'InputFiles' object has no attribute 'metagene'
Error in job qc_report_summarized while creating output file reports/riboraptor_report_summarized.html.
RuleException:
CalledProcessError in line 535 of /home/wangk/proj/SHZS20191006025/snakemake/Snakefile:
Command '/usr/local/.prog/anaconda/envs/riboseq/bin/python /home/wangk/proj/SHZS20191006025/snakemake/wrappers/.snakemake.3z0qubw9.qc_report_summarizer_wrapper.py' returned non-zero exit status 1.
  File "/home/wangk/proj/SHZS20191006025/snakemake/Snakefile", line 535, in __rule_qc_report_summarized
  File "/usr/local/.prog/anaconda/envs/riboseq/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Job failed, going on with independent jobs.
Traceback (most recent call last):
  File "/home/wangk/proj/SHZS20191006025/snakemake/wrappers/.snakemake.vjafordk.featurecounts_wrapper.py", line 12, in <module>
    for hdf, bam in zip(snakemake.input["hdfs"], snakemake.input["bams"]):
  File "/usr/local/.prog/anaconda/envs/riboseq/lib/python3.6/site-packages/snakemake/io.py", line 817, in __getitem__
    return getattr(self, key)
AttributeError: 'InputFiles' object has no attribute 'hdfs'
Error in job featurecounts while creating output file featureCounts/fcounts.tsv.
RuleException:
CalledProcessError in line 455 of /home/wangk/proj/SHZS20191006025/snakemake/Snakefile:
Command '/usr/local/.prog/anaconda/envs/riboseq/bin/python /home/wangk/proj/SHZS20191006025/snakemake/wrappers/.snakemake.vjafordk.featurecounts_wrapper.py' returned non-zero exit status 1.
  File "/home/wangk/proj/SHZS20191006025/snakemake/Snakefile", line 455, in __rule_featurecounts
  File "/usr/local/.prog/anaconda/envs/riboseq/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Job failed, going on with independent jobs.
Exiting because a job execution failed. Look above for error message
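
Regarding running without Slurm: dropping --cluster, as above, should make Snakemake execute jobs locally; the 'InputFiles' AttributeError shown here looks unrelated to cluster vs. local execution. For reference, a sketch of the same invocation through Snakemake's Python API, assuming Snakemake 5.x (whose snakemake.snakemake() entry point accepts these arguments):

import snakemake

# Local (non-cluster) execution of the workflow; Snakemake 5.x API.
# Use a core count matching your single node instead of the cluster's -j 200.
success = snakemake.snakemake(
    snakefile="Snakefile",
    config={"config_path": "configs/SRP000637.py"},
    cores=16,               # assumed node size
    printshellcmds=True,
    keepgoing=True,
    force_incomplete=True,  # equivalent of --rerun-incomplete
)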

is ignore_tx_version needed?

I just noticed that ignore_tx_version was added back to gene_coverage and export_metagene_coverage, and I don't think this is necessary. All the gene names come from BED files, so they stay consistent as long as the same BED file is used across functions. Adding ignore_tx_version is unnecessary and can cause subtle problems.
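
For reference, a minimal illustration of what an ignore_tx_version option typically does (a hypothetical helper, not the actual riboraptor code): strip the trailing .version so IDs from differently versioned annotations collapse to one key.

def strip_tx_version(tx_id):
    """'NM_000014.5' -> 'NM_000014'; IDs without a version pass through."""
    return tx_id.rsplit(".", 1)[0] if "." in tx_id else tx_id

assert strip_tx_version("NM_000014.5") == "NM_000014"
assert strip_tx_version("NM_000014") == "NM_000014"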

Remove TODO.md and use issue instead?

Now that the master branch is protected, a pull request is overkill for trivial changes such as updating the TODO. Maybe we should just remove TODO.md and use issues for planning new features?
