
riboflow's Introduction

RiboFlow

RiboFlow is a Nextflow-based pipeline for processing ribosome profiling data. As output, it generates ribo files that can be analyzed using RiboR or RiboPy. RiboFlow belongs to a software ecosystem designed to work with ribosome profiling data.

Installation

Requirements

First, follow the instructions on the Nextflow website and install Nextflow.

Docker Option

Install Docker. Here is a tutorial for Ubuntu.

All remaining dependencies come in the Docker image hakanozadam/riboflow. This image is automatically pulled by RiboFlow when run with Docker (see test runs below).

Conda Option

This option has been tested on Linux systems only.

Install Conda.

All other dependencies can be installed using the environment file, environment.yaml, in this repository.

git clone https://github.com/ribosomeprofiling/riboflow.git
conda env create -f riboflow/environment.yaml

The above command will create a conda environment called ribo and install dependencies in it. To start using RiboFlow, you need to activate the ribo environment.

conda activate ribo

Test Run

For fresh installations, before running RiboFlow on actual data, it is recommended to do a test run.

Run Using Docker

# Clone this repository in a new folder and change your working directory to the RiboFlow folder.
mkdir rf_test_run && cd rf_test_run
git clone https://github.com/ribosomeprofiling/riboflow.git
cd riboflow

# Obtain a copy of the sample data in the working directory.
git clone https://github.com/ribosomeprofiling/rf_sample_data.git
nextflow RiboFlow.groovy -params-file project.yaml -profile docker_local

Note that we provided the argument -profile docker_local to Nextflow to indicate that RiboFlow will be run via Docker containers. In other words, the steps of RiboFlow will be executed inside Docker containers by Nextflow. Hence, no locally installed software (other than Java and Nextflow) is needed by RiboFlow.

Run Using Conda Environment

With the Conda option, the steps of RiboFlow are run locally, so the dependencies must be installed first. This can easily be done via conda. The default profile directs RiboFlow to run locally, so we can simply skip the -profile argument. Also note that the conda environment has to be activated before running RiboFlow.

Before running the commands below, make sure that you have created the conda environment, called ribo, using the instructions above.

# List the environments to make sure that ribo environment exists
conda env list

# Activate the ribo environment
conda activate ribo

# Get RiboFlow repository
mkdir rf_test_run && cd rf_test_run
git clone https://github.com/ribosomeprofiling/riboflow.git
cd riboflow

# Obtain a copy of the sample data in the working directory.
git clone https://github.com/ribosomeprofiling/rf_sample_data.git

# Finally run RiboFlow
nextflow RiboFlow.groovy -params-file project.yaml

Output

The pipeline run may take several minutes. When it finishes, the resulting files are in the ./output folder.

Mapping statistics are compiled in a csv file called stats.csv

ls output/stats/stats.csv

Ribosome occupancy data is in a single ribo file called all.ribo.

ls output/ribo/all.ribo

You can use RiboR or RiboPy to work with ribo files.
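After a test run, it can be handy to confirm from a script that the two files named above were actually produced. The following is a minimal sketch, assuming the default ./output folder of the test run; the helper name missing_outputs is hypothetical and not part of RiboFlow.

```python
from pathlib import Path

# Paths relative to the output folder, as listed in this README.
EXPECTED_OUTPUTS = ("stats/stats.csv", "ribo/all.ribo")


def missing_outputs(output_dir="output"):
    """Return the expected output files that do not exist under output_dir."""
    root = Path(output_dir)
    return [rel for rel in EXPECTED_OUTPUTS if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_outputs()
    if missing:
        print("Missing outputs:", ", ".join(missing))
    else:
        print("All expected outputs are present.")
```

An empty list means both files exist. This only checks for presence; it does not inspect the contents of the ribo file.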

RiboFlow on Your Data

For running RiboFlow on actual data, files must be organized and a parameters file must be prepared. You can examine the sample run above to see an example.

  1. Organize your data. The following files are required for RiboFlow

    • Ribosome profiling sequencing data: in gzipped fastq files
    • Transcriptome Reference: Bowtie2 index files
    • Filter Reference: Bowtie2 index files (typically for rRNA sequences)
    • Annotation: A bed file defining CDS, UTR5 and UTR3 regions.
    • Transcript Lengths: A two column tsv file containing transcript lengths
  2. Prepare a custom project.yaml file. You can use the sample file project.yaml, provided in this repository, as template.

  3. In project.yaml, provide RiboFlow parameters such as clip_arguments, alignment arguments etc. You can simply modify the arguments in the sample file project.yaml in this repository.

  4. You can adjust the hardware and computing environment settings in Nextflow configuration file(s). For Docker option, see configs/docker_local.config. If you are not using Docker, see configs/local.config.

  5. RNA-Seq data is optional for RiboFlow. So, if you do NOT have RNA-Seq data, in the project file, set

do_rnaseq: false

If you have RNA-Seq data to be paired with ribosome profiling data, see the Advanced Features below.

  6. Metadata is optional for RiboFlow. If you do NOT have metadata, in the project file, set

do_metadata: false

If you have metadata, see Advanced Features below.

  7. Run RiboFlow using the new parameters file project.yaml.

Using Docker:

nextflow RiboFlow.groovy -params-file project.yaml -profile docker_local

Without Docker:

nextflow RiboFlow.groovy -params-file project.yaml
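Step 1 above asks for Bowtie2 index files for both the transcriptome and the filter reference. Before launching a long run, one could check that an index prefix is complete. The sketch below relies on the standard Bowtie2 index file suffixes (.1.bt2 through .rev.2.bt2, or .bt2l for large indexes); the function name has_bowtie2_index is hypothetical and not part of RiboFlow.

```python
from pathlib import Path

# The six parts that make up a Bowtie2 index for a given prefix.
BT2_PARTS = ("1", "2", "3", "4", "rev.1", "rev.2")


def has_bowtie2_index(prefix):
    """True if all six index files exist for prefix (small or large index)."""
    for ext in ("bt2", "bt2l"):
        if all(Path(f"{prefix}.{part}.{ext}").is_file() for part in BT2_PARTS):
            return True
    return False
```

For example, has_bowtie2_index("reference/transcriptome") would look for reference/transcriptome.1.bt2 and its five siblings; the path here is only a placeholder for the prefix used in your own project.yaml.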

Working with Unique Molecular Identifiers

Unique Molecular Identifiers (UMIs) can be ligated to either side of the molecules, and they allow molecules to be labeled uniquely. This way, UMIs can be used to deduplicate mapped reads for more accurate quantification.

If there are UMIs in your ribosome profiling data, RiboFlow can trim them and deduplicate reads based on UMIs.

RiboFlow extracts UMIs and stores them in the Fastq headers and uses the UMIs in deduplication (instead of position based read collapsing). For this purpose RiboFlow uses umi_tools.
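To illustrate the difference between the two deduplication strategies mentioned above, here is a deliberately simplified sketch. It is not RiboFlow's or umi_tools' implementation: the Read fields and both function names are made up for illustration, and UMIs are compared by exact match only, whereas umi_tools also accounts for sequencing errors in UMIs.

```python
from typing import Iterable, List, NamedTuple


class Read(NamedTuple):
    """A mapped read; field names are illustrative, not RiboFlow's."""
    transcript: str
    start: int
    end: int
    umi: str


def _keep_first(reads: Iterable[Read], key) -> List[Read]:
    """Keep the first read seen for each distinct key(read)."""
    seen = set()
    kept = []
    for read in reads:
        k = key(read)
        if k not in seen:
            seen.add(k)
            kept.append(read)
    return kept


def collapse_by_position(reads):
    """Position-based collapsing: one read per mapped interval."""
    return _keep_first(reads, lambda r: (r.transcript, r.start, r.end))


def collapse_by_umi(reads):
    """UMI-aware collapsing: one read per (transcript, start, UMI)."""
    return _keep_first(reads, lambda r: (r.transcript, r.start, r.umi))


reads = [
    Read("T1", 100, 130, "AACCGGTTAACC"),
    Read("T1", 100, 130, "TTGGCCAATTGG"),  # same position, different molecule
    Read("T1", 100, 130, "AACCGGTTAACC"),  # PCR duplicate of the first read
]
print(len(collapse_by_position(reads)))  # 1
print(len(collapse_by_umi(reads)))       # 2
```

Position-based collapsing merges two distinct molecules that happen to map to the same place, while the UMI-aware version removes only the true duplicate.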

Project File

Here we explain the parts of the project file that are needed to use the UMI feature of RiboFlow.

Also, we provide a working example of project file in this repository: project_umi.yaml.

The following parameter must be set:

dedup_method: "umi_tools"

Also, users must set the following two parameters: umi_tools_extract_arguments and umi_tools_dedup_arguments.

For example:

umi_tools_extract_arguments: "-p \"^(?P<umi_1>.{12})(?P<discard_1>.{4}).+$\" --extract-method=regex"
umi_tools_dedup_arguments:   "--read-length"

The above example takes the first 12 nucleotides from the 5' end, discards the 4 nucleotides downstream and writes the 12 nt UMI sequence to the header. The second parameter tells umi_tools to use read lengths IN ADDITION to UMI sequences when collapsing reads. Note that these two arguments are passed directly to umi_tools, so users are encouraged to familiarize themselves with umi_tools.
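To see what the extraction pattern does to a single read, the same regular expression can be tried with Python's built-in re module, which accepts the named-group syntax used here. This is only a sketch of the splitting step: split_read is a hypothetical helper, the example read is invented, and umi_tools itself additionally rewrites the read header and the quality string.

```python
import re

# Same pattern as in umi_tools_extract_arguments above.
UMI_PATTERN = re.compile(r"^(?P<umi_1>.{12})(?P<discard_1>.{4}).+$")


def split_read(sequence):
    """Return (umi, remaining_sequence), or None if the read is too short."""
    match = UMI_PATTERN.match(sequence)
    if match is None:
        return None
    # Everything after the discarded nucleotides is kept as the read.
    remaining = sequence[match.end("discard_1"):]
    return match.group("umi_1"), remaining


# 12 nt UMI + 4 nt to discard + the footprint itself (made-up sequence).
read = "ACGTACGTACGT" + "NNNN" + "GGGTTTCCCAAA"
print(split_read(read))  # ('ACGTACGTACGT', 'GGGTTTCCCAAA')
```

Reads shorter than 17 nucleotides do not match the pattern, because at least one nucleotide must remain after the UMI and the discarded bases.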

Test Run with UMIs

We provide a mini dataset, with two samples, to try RiboFlow with sequencing reads having UMIs. In this sample dataset, the first 12 nucleotides on the 5' end of the reads are UMIs. Four nucleotides following the UMIs need to be discarded. On the 3' end of the reads, there are adapters having the sequence AAAAAAAAAACAAAAAAAAAA. The parameters of this sample run are provided in the file project_umi.yaml. Below are the steps to process this data.

# List the environments to make sure that ribo environment exists
conda env list

# Activate the ribo environment
conda activate ribo

# Get RiboFlow repository
mkdir rf_test_run && cd rf_test_run
git clone https://github.com/ribosomeprofiling/riboflow.git
cd riboflow

# Obtain a copy of the sample data in the working directory.
git clone https://github.com/ribosomeprofiling/rf_sample_data.git

# Finally run RiboFlow
nextflow RiboFlow.groovy -params-file project_umi.yaml

# At the end of the run
# check the ribo file
ribopy info output_umi/ribo/all.ribo

UMI support for RNA-Seq

In the current version, UMIs are supported for ribosome profiling data only. So RNA-Seq libraries can either be used without deduplication or the reads can be collapsed based on position.

A Note on References

RiboFlow is designed to work with transcriptomic references. RiboFlow does NOT work with genomic references. Users need to provide a transcriptome reference and annotation to run this software. There is a curated set of RiboFlow references, which users can download and use, in this GitHub repository.

Advanced Features

RNA-Seq Data

If you have RNA-Seq data that you want to pair with ribosome profiling experiments, provide the paths of the RNA-Seq (gzipped) fastq files in the configuration file in input -> metadata. See the file project.yaml in this repository for an example. Note that the names used in defining RNA-Seq files must match the names used in defining ribosome profiling data. Also set the do_rnaseq flag to true in the project file:

do_rnaseq: true

Transcript abundance data will be stored in the output ribo file.

Metadata

If you have metadata files for the ribosome profiling experiments, provide the paths of the metadata files (in yaml format) in the configuration file in input -> metadata. See the file project.yaml in this repository for an example. Note that the names used in defining metadata files must match the names used in defining ribosome profiling data. Also set the do_metadata flag to true in the project file:

do_metadata: true

Metadata will be stored in the output ribo file.

Citing

H. Ozadam, M. Geng, C. Cenik. RiboFlow, RiboR and RiboPy: an ecosystem for analyzing ribosome profiling data at read length resolution. Bioinformatics 36 (9), 2929-2931 (2020).

@article{ozadam2020riboflow,
  title={RiboFlow, RiboR and RiboPy: an ecosystem for analyzing ribosome profiling data at read length resolution},
  author={Ozadam, Hakan and Geng, Michael and Cenik, Can},
  journal={Bioinformatics},
  volume={36},
  number={9},
  pages={2929--2931},
  year={2020},
  publisher={Oxford University Press}
}

riboflow's People

Contributors

cancenik, hakanozadam, lucacozzuto

riboflow's Issues

.ribo file not created when do_metadata = false and deduplicate = false

Running RiboFlow v0.0.1 in conda environment.

I encountered an issue where a ribo file is not created without metadata, only when combined with deduplicate = false.

If do_rnaseq = true, failure to create the ribo file leads to the error I recently posted:
ribosomeprofiling/ribopy#15

See attached for the following configs to reproduce the error with the test data.
project_no_metadata_nodedup_norna.yaml.zip

Given the support for deduplication in v0.0.1, I will see if this error still occurs with v0.0.0.

Docker Settings: Memory Requirements

Mention the following on README or FAQ:

RiboFlow won't run in 2GB of memory. We recommend providing at least 8GB of memory.

Ribo file not created when deduplication parameter is set to none or false

Hello,

When I changed the dedup_method parameter in the project_umi.yaml file to "none" or the deduplicate parameter in the project.yaml file to "false", Riboflow runs without error, but creates no ribo folder, all.ribo file, or experiments folder in the output directory. I ran an older version of Riboflow and the missing files are there.

Thank you!

Mapping quality cut-off for RNA-Seq data

There is no way to specify a mapping quality cutoff for RNA-Seq data. However, such a cut-off can be specified for Ribosome Profiling Data.

Adding an argument like follows
rnaseq_mapping_quality_cutoff: 2
and adding "-q" argument in the filtering step will add this feature.
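As an illustration of what such a cutoff does, the sketch below filters SAM-formatted lines by the MAPQ field (the fifth column), keeping alignments whose MAPQ is at least the cutoff. This mirrors the behavior of a "-q" style threshold, but it is a hypothetical stand-alone example, not the filtering step of RiboFlow; the function name filter_by_mapq is made up, and header lines are passed through unchanged for convenience.

```python
def filter_by_mapq(sam_lines, cutoff):
    """Yield header lines and alignments whose MAPQ is at least cutoff."""
    for line in sam_lines:
        if line.startswith("@"):
            # SAM header lines carry no MAPQ; keep them as they are.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if int(fields[4]) >= cutoff:  # column 5 of a SAM record is MAPQ
            yield line
```

With the proposed value rnaseq_mapping_quality_cutoff: 2, alignments with MAPQ 0 or 1 would be dropped and all others kept.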

Mouse annotation

Hello,

I would like to run the pipeline on some mouse Ribo-seq data, and I was wondering if you might have a mouse annotation file, such as this one for human: appris_human_24_01_2019_actual_regions.bed ?
Alternatively, could you point me to how I could generate such a file?

Thanks a lot!

Best,
Ivan

Omitting dedup from the pipeline

Hello,

Would it be possible to omit the dedup step from the pipeline? I would like to see what my final result would look like without this filtering step.

Thanks a lot!

RiboFlow Compatibility with Nextflow DSL2

When I was first installing RiboFlow, I was getting error messages when using the latest version of Nextflow, which implements DSL 2 and has deprecated DSL 1. I had to switch to Nextflow version 20.10.0 in order to run the program.

Samtools idxstats

Check and remove "-@" in samtools idxstats. It is likely to cause an error.

RiboFlow process getting stuck on macOS

When running RiboFlow using Docker + Nextflow, the process seems to be getting stuck at the stage "creating the ribo file GSM1606107.ribo...". The docker statistics are below:

[Image: screenshot of Docker statistics, not preserved]

Notably, there seems to be almost no CPU usage.

This is being run with the latest versions of Docker and Nextflow as of December 4th, 2021. The OS is macOS Monterey version 12.0.1.

Error executing process > 'put_rnaseq_into_ribo'

Hello,

I tried running the pipeline on RiboSeq+RNAseq data, and it ran until almost the end when it failed:

Error executing process > 'put_rnaseq_into_ribo (1)'

Caused by:
  Process `put_rnaseq_into_ribo (1)` terminated with an error exit status (1)

Command executed:

  ribopy rnaseq set -n KO -a KO.merged.pre_dedup.bed -f bed --force KO.ribo

Command exit status:
  1

Command output:
  (empty)

Command error:
  + ribopy rnaseq set -n KO -a KO.merged.pre_dedup.bed -f bed --force KO.ribo
  Traceback (most recent call last):
    File "/miniconda3/bin/ribopy", line 10, in <module>
      sys.exit(cli())
    File "/miniconda3/lib/python3.6/site-packages/click/core.py", line 764, in __call__
      return self.main(*args, **kwargs)
    File "/miniconda3/lib/python3.6/site-packages/click/core.py", line 717, in main
      rv = self.invoke(ctx)
    File "/miniconda3/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
      return _process_result(sub_ctx.command.invoke(su[9d/b03848] Submitted process > put_rnaseq_into_ribo (2)
/click/core.py", line 1137, in invoke
      return _process_result(sub_ctx.command.invoke(sub_ctx))
    File "/miniconda3/lib/python3.6/site-packages/click/core.py", line 956, in invoke
      return ctx.invoke(self.callback, **ctx.params)
    File "/miniconda3/lib/python3.6/site-packages/click/core.py", line 555, in invoke
      return callback(*args, **kwargs)
    File "/miniconda3/lib/python3.6/site-packages/ribopy/cli/rnaseq.py", line 54, in set
      force         = force)
    File "/miniconda3/lib/python3.6/site-packages/ribopy/core/verify.py", line 104, in cli_func_wrapper
      return func(*args, **kwargs)
    File "/miniconda3/lib/python3.6/site-packages/ribopy/rnaseq.py", line 229, in set_rnaseq_wrapper
      with h5py.File(ribo_file, "r+") as ribo_handle:
    File "/miniconda3/lib/python3.6/site-packages/h5py/_hl/files.py", line 394, in __init__
      swmr=swmr)
    File "/miniconda3/lib/python3.6/site-packages/h5py/_hl/files.py", line 172, in make_fid
      fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)
    File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
    File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
    File "h5py/h5f.pyx", line 85, in h5py.h5f.open
  OSError: Unable to open file (unable to lock file, errno = 11, error message = 'Resource temporarily unavailable')

Work dir:
  /nfs/users2/enovoa/imilenkovic/software/riboflow/riboflow/work/45/f654e3edc0b937986d14d450917fcf

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

Pipeline RiboFlow completed!
Started at  2021-03-24T00:07:38.048+01:00
Finished at 2021-03-24T01:25:25.332+01:00
Time elapsed: 1h 17m 47s
Execution status: failed
WARN: Killing pending tasks (1)

Do you maybe have an idea what went wrong? Thank you!

Running Riboflow on a transcriptome without CDS, 3' or 5' UTR defined?

My purpose for using Riboseq is to help define what the CDS, 3' or 5' UTR are for the RNA-Seq and Ribo-Seq data I have. I like the ribo data structure developed and the implementation of riboflow, but the criteria to require the UTR and CDS may be a deal-breaker for me.

I have to ask, is there a way to deactivate that criteria?

Errors running example data on umi_devel branch

Hi @hakanozadam ,

I tried doing a fresh install of Riboflow umi_devel on Mozart but ran into errors running the example data. I am running via conda.

Looks like one issue is FASTQC files are not found for the process transcriptome_aligned_individual_fastqc:

nextflow RiboFlow.groovy -params-file project.yaml
N E X T F L O W  ~  version 19.04.1
Launching `RiboFlow.groovy` [marvelous_easley] - revision: 1eb5b20f2c
[warm up] executor > local
[skipping] Stored process > clip (4)
[skipping] Stored process > clip (3)
[skipping] Stored process > clip (2)
[skipping] Stored process > clip (1)
[skipping] Stored process > extract_umi_via_umi_tools (1)
[skipping] Stored process > extract_umi_via_umi_tools (3)
[skipping] Stored process > extract_umi_via_umi_tools (2)
[skipping] Stored process > extract_umi_via_umi_tools (4)
[skipping] Stored process > filter (1)
[skipping] Stored process > filter (2)
[skipping] Stored process > filter (3)
[skipping] Stored process > filter (4)
[skipping] Stored process > transcriptome_alignment (2)
[skipping] Stored process > transcriptome_alignment (1)
[skipping] Stored process > transcriptome_alignment (3)
[skipping] Stored process > transcriptome_alignment (4)
[skipping] Stored process > merge_transcriptome_alignment (1)
[skipping] Stored process > merge_transcriptome_alignment (2)
executor >  local (14)
executor >  local (15)
[47/2cad5f] process > clipped_fastqc                            [100%] 4 of 4 ✔
executor >  local (17)
[47/2cad5f] process > clipped_fastqc                            [100%] 4 of 4 ✔
[c5/fbec5c] process > raw_fastqc                                [100%] 4 of 4 ✔
executor >  local (18)
[47/2cad5f] process > clipped_fastqc                            [100%] 4 of 4 ✔
[c5/fbec5c] process > raw_fastqc                                [100%] 4 of 4 ✔
executor >  local (19)
[47/2cad5f] process > clipped_fastqc                            [100%] 4 of 4 ✔
[c5/fbec5c] process > raw_fastqc                                [100%] 4 of 4 ✔
[2f/590158] process > transcriptome_unaligned_individual_fastqc [ 50%] 2 of 4
executor >  local (21)
[47/2cad5f] process > clipped_fastqc                            [100%] 4 of 4 ✔
[c5/fbec5c] process > raw_fastqc                                [100%] 4 of 4 ✔
[53/15cc5f] process > transcriptome_unaligned_individual_fastqc [ 75%] 3 of 4
[47/c9d300] process > write_fastq_correspondence                [100%] 1 of 1 ✔
[7b/40c88b] process > transcriptome_aligned_individual_fastqc   [ 50%] 4 of 8, failed: 4
[92/0631c4] NOTE: Missing output file(s) `GSM1606108.1.transcriptome.aligned_fastqc.html` expected by process `transcriptome_aligned_individual_fastqc (1)` -- Execution is retried (1)
[a4/308e18] NOTE: Missing output file(s) `GSM1606107.1.transcriptome.aligned_fastqc.html` expected by process `transcriptome_aligned_individual_fastqc (2)` -- Execution is retried (1)
executor >  local (21)
[47/2cad5f] process > clipped_fastqc                            [100%] 4 of 4 ✔
[c5/fbec5c] process > raw_fastqc                                [100%] 4 of 4 ✔
[cd/3bddd9] process > transcriptome_unaligned_individual_fastqc [100%] 4 of 4, failed: 1
[47/c9d300] process > write_fastq_correspondence                [100%] 1 of 1 ✔
[7b/40c88b] process > transcriptome_aligned_individual_fastqc   [100%] 8 of 8, failed: 8
[92/0631c4] NOTE: Missing output file(s) `GSM1606108.1.transcriptome.aligned_fastqc.html` expected by process `transcriptome_aligned_individual_fastqc (1)` -- Execution is retried (1)
[a4/308e18] NOTE: Missing output file(s) `GSM1606107.1.transcriptome.aligned_fastqc.html` expected by process `transcriptome_aligned_individual_fastqc (2)` -- Execution is retried (1)
[78/401fec] NOTE: Missing output file(s) `GSM1606108.2.transcriptome.aligned_fastqc.html` expected by process `transcriptome_aligned_individual_fastqc (3)` -- Execution executor >  local (21)
[47/2cad5f] process > clipped_fastqc                            [100%] 4 of 4 ✔
[c5/fbec5c] process > raw_fastqc                                [100%] 4 of 4 ✔
[cd/3bddd9] process > transcriptome_unaligned_individual_fastqc [100%] 4 of 4, failed: 1
[47/c9d300] process > write_fastq_correspondence                [100%] 1 of 1 ✔
[7b/40c88b] process > transcriptome_aligned_individual_fastqc   [100%] 8 of 8, failed: 8
[92/0631c4] NOTE: Missing output file(s) `GSM1606108.1.transcriptome.aligned_fastqc.html` expected by process `transcriptome_aligned_individual_fastqc (1)` -- Execution is retried (1)
[a4/308e18] NOTE: Missing output file(s) `GSM1606107.1.transcriptome.aligned_fastqc.html` expected by process `transcriptome_aligned_individual_fastqc (2)` -- Execution is retried (1)
[78/401fec] NOTE: Missing output file(s) `GSM1606108.2.transcriptome.aligned_fastqc.html` expected by process `transcriptome_aligned_individual_fastqc (3)` -- Execution is retried (1)
[e5/a6ed71] NOTE: Missing output file(s) `GSM1606107.2.transcriptome.aligned_fastqc.html` expected by process `transcriptome_aligned_individual_fastqc (4)` -- Execution is retried (1)
WARN: Killing pending tasks (4)
ERROR ~ Error executing process > 'transcriptome_aligned_individual_fastqc (1)'

Caused by:
  Missing output file(s) `GSM1606108.1.transcriptome.aligned_fastqc.html` expected by process `transcriptome_aligned_individual_fastqc (1)`

Command executed:

  if [ ! -f GSM1606108.1.transcriptome.aligned.fastq.gz ]; then
     ln -s GSM1606108.1.aligned.transcriptome_alignment.fastq.gz GSM1606108.1.transcriptome.aligned.fastq.gz
  fi
  fastqc GSM1606108.1.transcriptome.aligned.fastq.gz --outdir=$PWD -t 1

Command exit status:
  0

Command output:
  Analysis complete for GSM1606108.1.transcriptome.aligned.fastq.gz

Command error:
  + '[' '!' -f GSM1606108.1.transcriptome.aligned.fastq.gz ']'
  + ln -s GSM1606108.1.aligned.transcriptome_alignment.fastq.gz GSM1606108.1.transcriptome.aligned.fastq.gz
  + fastqc GSM1606108.1.transcriptome.aligned.fastq.gz --outdir=/home/ihoskins/riboflow_umi/riboflow/work/29/b062e85fb941e26442063b70a3d1b8 -t 1
  Started analysis of GSM1606108.1.transcriptome.aligned.fastq.gz
  Failed to process file GSM1606108.1.transcriptome.aligned.fastq.gz
  java.lang.ArrayIndexOutOfBoundsException: -1
  p: wheat uk.ac.babraham.FastQC.Modules.SequenceLengthDistribution.calculateDistribution(SequenceLengthDistribution.java:101)
  	at uk.ac.babraham.FastQC.Modules.SequenceLengthDistribution.raisesError(SequenceLengthDistribution.java:190)
  - Checat uk.ac.babraham.FastQC.Report.HTMLReportArchive.startDocument(HTMLReportArchive.java:336)
  	at uk.ac.babraham.FastQC.Report.HTMLReportArchive.<init>(HTMLReportArchive.java:84)
  	at uk.ac.babraham.FastQC.Analysis.OfflineRunner.analysisComplete(OfflineRunner.java:178)
  	at uk.ac.babraham.FastQC.Analysis.AnalysisRunner.run(AnalysisRunner.java:110)
  	at java.lang.Thread.run(Thread.java:750)

Work dir:
  /home/ihoskins/riboflow_umi/riboflow/work/29/b062e85fb941e26442063b70a3d1b8

Tip: when you have fixed the problem you can continue the execution appending to the nextflow command line the option `-resume`

 -- Check '.nextflow.log' file for details

Then, if I set do_fastqc = False, I run into an error during deduplication:

nextflow RiboFlow.groovy -params-file project.yaml
N E X T F L O W  ~  version 19.04.1
Launching `RiboFlow.groovy` [insane_watson] - revision: 1eb5b20f2c
[warm up] executor > local
[skipping] Stored process > clip (4)
[skipping] Stored process > clip (3)
[skipping] Stored process > clip (1)
[skipping] Stored process > clip (2)
[skipping] Stored process > extract_umi_via_umi_tools (1)
[skipping] Stored process > extract_umi_via_umi_tools (2)
[skipping] Stored process > extract_umi_via_umi_tools (3)
[skipping] Stored process > extract_umi_via_umi_tools (4)
[skipping] Stored process > filter (2)
[skipping] Stored process > filter (1)
[skipping] Stored process > filter (3)
[skipping] Stored process > filter (4)
[skipping] Stored process > transcriptome_alignment (2)
[skipping] Stored process > transcriptome_alignment (1)
[skipping] Stored process > transcriptome_alignment (3)
[skipping] Stored process > transcriptome_alignment (4)
[skipping] Stored process > merge_transcriptome_alignment (1)
[skipping] Stored process > merge_transcriptome_alignment (2)
executor >  local (19)
[01/51ddee] process > write_fastq_correspondence  [100%] 1 of 1 ✔
[f8/7325fa] process > quality_filter              [100%] 4 of 4 ✔
[b1/af98e0] process > bam_to_bed                  [100%] 4 of 4 ✔
[8d/2ca045] process > merge_bam_post_qpass        [100%] 2 of 2 ✔
[9a/070bfd] process > add_sample_index_col_to_bed [100%] 4 of 4 ✔
[46/1950df] process > deduplicate_umi_tools       [ 50%] 2 of 4, failed: 2
executor >  local (19)
[01/51ddee] process > write_fastq_correspondence  [100%] 1 of 1 ✔
[f8/7325fa] process > quality_filter              [100%] 4 of 4 ✔
[b1/af98e0] process > bam_to_bed                  [100%] 4 of 4 ✔
[8d/2ca045] process > merge_bam_post_qpass        [100%] 2 of 2 ✔
[9a/070bfd] process > add_sample_index_col_to_bed [100%] 4 of 4 ✔
[44/491ce9] process > deduplicate_umi_tools       [100%] 4 of 4, failed: 4
[da/f6a586] NOTE: Process `deduplicate_umi_tools (2)` terminated with an error exit status (1) -- Execution is retried (1)
executor >  local (19)
[01/51ddee] process > write_fastq_correspondence  [100%] 1 of 1 ✔
[f8/7325fa] process > quality_filter              [100%] 4 of 4 ✔
[b1/af98e0] process > bam_to_bed                  [100%] 4 of 4 ✔
[8d/2ca045] process > merge_bam_post_qpass        [100%] 2 of 2 ✔
[9a/070bfd] process > add_sample_index_col_to_bed [100%] 4 of 4 ✔
[44/491ce9] process > deduplicate_umi_tools       [100%] 4 of 4, failed: 4
[da/f6a586] NOTE: Process `deduplicate_umi_tools (2)` terminated with an error exit status (1) -- Execution is retried (1)
[57/252c60] NOTE: Process `deduplicate_umi_tools (1)` terminated with an error exit status (1) -- Execution is retried (1)
WARN: Killing pending tasks (1)
ERROR ~ Error executing process > 'deduplicate_umi_tools (1)'

Caused by:
  Process `deduplicate_umi_tools (1)` terminated with an error exit status (1)

Command executed:

  umi_tools dedup --read-length               -I GSM1606107.merged.pre_dedup.bam --output-stats=GSM1606107.dedup.stats -S GSM1606107.dedup.bam -L GSM1606107.dedup.log
  
  bamToBed -i GSM1606107.dedup.bam > GSM1606107.dedup.bed

Command exit status:
  1

Command output:
  (empty)

Command error:
  + umi_tools dedup --read-length -I GSM1606107.merged.pre_dedup.bam --output-stats=GSM1606107.dedup.stats -S GSM1606107.dedup.bam -L GSM1606107.dedup.log
  Traceback (most recent call last):
    File "/home/ihoskins/miniconda3/envs/ribo_umi/bin/umi_tools", line 11, in <module>
      sys.exit(main())
    File "/home/ihoskins/miniconda3/envs/ribo_umi/lib/python3.6/site-packages/umi_tools/umi_tools.py", line 66, in main
      module.main(sys.argv)
    File "/home/ihoskins/miniconda3/envs/ribo_umi/lib/python3.6/site-packages/umi_tools/dedup.py", line 312, in main
      barcode_getter=bundle_iterator.barcode_getter)
    File "/home/ihoskins/miniconda3/envs/ribo_umi/lib/python3.6/site-packages/umi_tools/umi_methods.py", line 187, in __init__
      self.fill()
    File "/home/ihoskins/miniconda3/envs/ribo_umi/lib/python3.6/site-packages/umi_tools/umi_methods.py", line 220, in fill
      self.refill_random()
    File "/home/ihoskins/miniconda3/envs/ribo_umi/lib/python3.6/site-packages/umi_tools/umi_methods.py", line 192, in refill_random
      list(self.umis.keys()), self.random_fill_size, p=self.prob)
    File "mtrand.pyx", line 908, in numpy.random.mtrand.RandomState.choice
  ValueError: 'a' cannot be empty unless no samples are taken

Work dir:
  /home/ihoskins/riboflow_umi/riboflow/work/46/1950dfa4a153305c3aca44b9d5a765

Tip: when you have fixed the problem you can continue the execution appending to the nextflow command line the option `-resume`

 -- Check '.nextflow.log' file for details

Docker image needs an update

Docker image is running on the previous version of ribopy.
We need to make a Docker image having the current version of the ribopy.

Optional Adapter Trimming

Make adapter trimming optional with a flag such as

trim_adapter: True/False

Some sequencing data might already be trimmed. In the current version, one way to get around this problem is to provide only a quality threshold in the cutadapt parameters.

Typo in rna-seq arguments

There is a typo in the RNA-Seq argument of the bowtie2 reference:

bt2_argumments

Fix this in the next issue and keep it backward compatible.

transcriptome question

I want to use RiboFlow for a species other than human/mouse, so I will need to provide my own transcriptome. Does the transcriptome need to only have 1 isoform per gene or can RiboFlow handle multiple isoforms per gene?

Thanks!

Suggestions for project.yaml file

Here are a few thoughts about the project.yaml file to make it more intuitive:

  1. The default range of read lengths in the RiboFlow parameter file could be 16-40 instead of 28-32, as most human data commonly falls in that range.

  2. Provide a few sentences about the clipping arguments for adapter filtering along the lines of:
    "You might want to alter the adapter sequence for your data."

  3. Fastq Files:
    "To process your own experiment files, specify their names and provide their locations."

Simpler Output

For an average user, the output only needs to have

  1. "all.ribo"
  2. stats folder containing ONLY the "stats.csv" file
  3. fastqc results if do_fastqc flag is set

Every other file can go to the "intermediates" folder.

Bug Error When Attempting to Run Test Data

RiboFlow looks like a potentially powerful tool, so I tried downloading it to see if I could run it. It looks like there is a file missing, but I'm unsure which one. I have attached the output log.txt of the Nextflow run.

Documentation Needed on libtbb

In Ubuntu 20.04 (and maybe in other distributions) we see this error when running bowtie2.

error while loading shared libraries: libtbb.so.2: cannot open shared object file: No such file or directory

Following the solution here, we figured out that installing libtbb-dev solves this issue. On Ubuntu 20.04, running the following does the installation:

sudo apt-get install libtbb-dev

No need for a transcript lengths file

Transcript lengths can be inferred from annotation or transcriptome reference. Hence a separate transcript length file is redundant. We can add this functionality as a separate command to rftools and incorporate this in the next version of RiboFlow.
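As a rough illustration of this idea, transcript lengths can be derived from a transcriptome FASTA with a few lines of code. The sketch below is not the proposed rftools command: the function names are hypothetical, the transcript name is assumed to be the first whitespace-delimited token of each header, and the output column order (name, then length) is an assumption about the two-column tsv format.

```python
def transcript_lengths(fasta_lines):
    """Return {transcript_name: length} in order of appearance."""
    lengths = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the name is the first token of the header line.
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def to_tsv(lengths):
    """Format lengths as two tab-separated columns: name, length."""
    return "".join(f"{name}\t{length}\n" for name, length in lengths.items())
```

Usage could look like `with open("transcriptome.fa") as handle: print(to_tsv(transcript_lengths(handle)), end="")`, where the file name is a placeholder.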

Formatting of FASTQ paths in config is critical

Not really an issue, just a quirk of setting up the config yaml file properly.

The following formatting does not work and leads to strange mappings in work/tmp/correspondence.txt and consequently an error similar to this (if do_check_file_existence : true):

ERROR ~ assert this_file.exists()
       |         |
       |         false
       /scratch/users/ihoskins/TEC/s

Incorrect formatting:

fastq:
  AAVS1_KO_rep1: /scratch/users/ihoskins/TEC/TEC_1_R1.fq.gz

Correct formatting:

fastq:
  AAVS1_KO_rep1:
    - /scratch/users/ihoskins/TEC/TEC_1_R1.fq.gz

Conversely, the newline and hyphen are not required when specifying the reference files or metadata yaml files.
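The difference between the two forms is that a YAML parser reads the first as a plain string and the second as a list with one element. A small check along these lines could catch the problem before a run. This is a sketch, not RiboFlow code: non_list_fastq_entries is a hypothetical helper, and the dict literals stand in for what a YAML parser (for example PyYAML's yaml.safe_load) would return for the fastq section.

```python
def non_list_fastq_entries(fastq_section):
    """Return sample names whose value is not a list of paths."""
    return [
        name
        for name, value in fastq_section.items()
        if not isinstance(value, list)
    ]


# What a parser returns for the incorrect and the correct form, respectively.
incorrect = {"AAVS1_KO_rep1": "/scratch/users/ihoskins/TEC/TEC_1_R1.fq.gz"}
correct = {"AAVS1_KO_rep1": ["/scratch/users/ihoskins/TEC/TEC_1_R1.fq.gz"]}

print(non_list_fastq_entries(incorrect))  # ['AAVS1_KO_rep1']
print(non_list_fastq_entries(correct))    # []
```

Any sample name reported by the check should be rewritten in the hyphenated list form shown above.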

Paired END data set

Hi, is the RiboFlow pipeline suitable for paired-end data sets?

Cheers,
Ranj

.ribo file does not contain counts when deduplicate = false

General File Information:
             info                 
   format version                1
        reference        appris-v1
  min read length               28
  max read length               32
        left span               35
       right span               10
 transcript count            19822
     has.metadata             TRUE
  metagene radius               50
        has.alias            FALSE

Dataset Information:
  experiment total.reads    coverage     rna.seq    metadata
  GSM1606107           0        TRUE        TRUE        TRUE
  GSM1606108           0        TRUE        TRUE        TRUE

I am wondering if this is somehow related to #28

Test config file is attached
project_nodedup.yaml.zip
