
rail's Introduction

Rail-RNA is software for RNA-seq analysis. It was used to generate outputs for the recount2 project and is deprecated as of January 2021. Please use monorail instead to generate RNA-seq outputs that can be compared with recount3, the successor to recount2.

the website.

the latest stable release. Read the

especially the

Ask questions in the repo's

Join the chat at https://gitter.im/nellore/rail .

Get interested

Rail-RNA's distinguishing features are

  • Scalability. Built on MapReduce, the software scales to analyze hundreds of RNA-seq samples at the same time.
  • Reduced redundancy. The software identifies and eliminates redundant alignment work, so that for a fixed computer cluster size, the end-to-end analysis time per sample decreases as the number of samples increases.
  • Integrative analysis. The software borrows strength across replicates to achieve more accurate splice junction detection, especially in genomic regions with low coverage.
  • Mode agnosticism. The software integrates its own parallel abstraction layer that allows it to be run in various distributed computing environments, including the Amazon Web Services (AWS) Elastic MapReduce (EMR) service, or any distributed environment supported by IPython, including clusters using batch schedulers like PBS or SGE, Message Passing Interface (MPI), or any cluster with a shared filesystem and mutual SSH access. Alternately, Rail-RNA can be run on a single multi-core computer, without the aid of a batch system or MapReduce implementation.
  • Inexpensive cloud implementation. An EMR run on more than about 100 samples costs about $1 per sample with spot instances.
  • Secure analysis of dbGaP-protected data on EMR. See this guide for information on setup.

Outputs currently include

  • Alignment BAMs with only primary alignments by default (for more, use --bowtie2-args "-k <N>", where <N> is the maximum number of alignments to report per read)
  • Genome coverage bigWigs
  • TopHat-like indel and splice junction BEDs

and will likely expand in future versions.

Read our paper for more details. Methods explained there correspond to Rail-RNA 0.1.9.

Get set up

Start with a recent (>= 2009) OS X or Linux box. For a no-fuss install, enter

(INSTALLER=/var/tmp/$(cat /dev/urandom | env LC_CTYPE=C tr -cd 'a-f0-9' | head -c 32);
curl http://verve.webfactional.com/rail -o $INSTALLER; python2 $INSTALLER -m || true;
rm -f $INSTALLER)

at a Bash prompt. For a more customizable install, download install_rail-rna-0.2.4b, change to the directory containing it, and make the installer executable with

chmod +x install_rail-rna-0.2.4b

Now run

sudo ./install_rail-rna-0.2.4b

to install for all users or

./install_rail-rna-0.2.4b

to install for just you. Refer to these detailed installation instructions from the docs for more information. If the executable doesn't work, you may need Python. You'll also need Bowtie 1 and 2 indexes of the appropriate genome assembly if you will be running Rail-RNA in either its single-computer (local) or IPython Parallel (parallel) modes. The easiest way to get these is by downloading an Illumina iGenome. If running Rail-RNA on EMR (elastic mode) and aligning to hg19, the assembly can be specified at the command line with the -a parameter.

Get started

Rail-RNA takes as input a Myrna-style manifest file, which describes a set of input FASTQs that may be on the local filesystem in local and parallel modes; or on the web or Amazon Simple Storage Service (S3) in local, parallel, and elastic modes. Each line takes one of the following two forms.

  1. (for a set of unpaired input reads) <FASTQ URL>(tab)<optional MD5>(tab)<sample label>
  2. (for a set of paired input reads) <FASTQ URL 1>(tab)<optional MD5 1>(tab)<FASTQ URL 2>(tab)<optional MD5 2>(tab)<sample label>
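
The two line forms above can be told apart by counting tab-separated fields. The following is a minimal sketch of that idea, not Rail-RNA's own parser: the names ManifestEntry and parse_manifest_line are hypothetical, and the MD5 fields are kept verbatim without being checked.

```python
from typing import NamedTuple, Optional, Tuple


class ManifestEntry(NamedTuple):
    """One manifest line: one FASTQ URL (unpaired) or two (paired)."""
    urls: Tuple[str, ...]
    md5s: Tuple[str, ...]
    label: str


def parse_manifest_line(line: str) -> Optional[ManifestEntry]:
    """Parse one tab-separated manifest line; return None for a blank line."""
    stripped = line.rstrip("\r\n")
    if not stripped.strip():
        return None
    fields = stripped.split("\t")
    if len(fields) == 3:
        # Unpaired form: URL, optional MD5, sample label.
        url, md5, label = fields
        return ManifestEntry((url,), (md5,), label)
    if len(fields) == 5:
        # Paired form: URL 1, MD5 1, URL 2, MD5 2, sample label.
        url1, md5_1, url2, md5_2, label = fields
        return ManifestEntry((url1, url2), (md5_1, md5_2), label)
    raise ValueError(
        "expected 3 or 5 tab-separated fields, got %d" % len(fields)
    )
```

Running every line of a manifest through a check like this before launching a job catches stray spaces used in place of tabs, which show up as a wrong field count.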

Find some RNA-seq data, create a manifest file, run

rail-rna

and follow the instructions; or check the docs for help getting started.

To use Rail-RNA in elastic mode, you'll need an account with AWS. For an introduction to cloud computing with AWS, refer to this excellent tutorial by the Griffith Lab at Wash U.

Disclaimer

Renting AWS resources costs money, regardless of whether your run ultimately succeeds or fails. In some cases, Rail-RNA or its documentation may be partially to blame for a failed run. While we are happy to review bug reports, we do not accept responsibility for financial damage caused by these errors. Rail-RNA is provided "as is" with no warranty.

Licenses

MIT except for the directory src/hadoop/relevant-elephant, which contains Apache-licensed code adapted from Twitter's Elephant Bird project.

Contributors

This product was developed primarily at Johns Hopkins University.

rail's People

Contributors

andrewejaffe, benlangmead, christopherwilks, gitter-badger, jeenalee, lcolladotor, mortonjt, nellore

rail's Issues

missing isofrags.tar.gz in parallel mode (v0.2.3b)

Not sure if this is a bug or specific for my use case.

When running rail in parallel mode using ipcluster with Slurm I get a RuntimeError that isofrags.tar.gz does not exist. If I restart from that point everything finishes cleanly.

If I run rail in parallel on a single node with ipcluster (i.e. local instead of slurm) everything runs cleanly.

I am guessing it has something to do with using slurm. Probably not your problem, only bring it up because there is a mention of this in a commit log on the parallel branch. Please let me know if you have a known fix or a suggestion what might be going on.

Thanks
Justin

error when specifying -r parameter...

I recently tried a local rail-RNA run on one of our compute nodes here. I specified "-r 1000000" for the sort-memory-cap and got the following error:

Loading...
Detected Bowtie 1 v1.1.1, Bowtie 2 v2.2.5, and SAMTools v1.2.1.
Launching Dooplicity runner with PyPy...
ningal.cluster.lifesci.dundee.ac.uk > tail -f rail-rna.e1288159
Detected Bowtie 1 v1.1.1, Bowtie 2 v2.2.5, and SAMTools v1.2.1.
Launching Dooplicity runner with PyPy...
usage: emr_simulator.py [-h] [-m MEMCAP] [-p NUM_PROCESSES] [-t MAX_ATTEMPTS]
[-s SEPARATOR] [-k] [--keep-last-output]
[--gzip-outputs] [--gzip-level GZIP_LEVEL] [--ipy]
[--ipcontroller-json IPCONTROLLER_JSON]
[--ipy-profile IPY_PROFILE] [--scratch SCRATCH]
[--common COMMON] [--sort SORT] [-b BRANDING]
[-j JSON_CONFIG] [-f] [-l LOG]
emr_simulator.py: error: argument -m/--memcap: invalid int value: '1000000000.0'

It looks like the command line input is being read as a float when an int is expected. Removing the -r parameter removes the issue.
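
The failure is easy to reproduce outside Rail-RNA: Python's int() refuses a string that is formatted like a float, and that is the conversion an int-typed argparse option applies. The sketch below shows the failure and one possible way to render the value before handing it to an int-typed option. The helper name memcap_argument is hypothetical, and where Rail-RNA actually builds this argument is not shown in the log.

```python
def memcap_argument(value):
    """Render a memory cap as an integer string for an int-typed option."""
    number = float(value)
    if number != int(number):
        raise ValueError("memory cap must be a whole number, got %r" % (value,))
    return str(int(number))


# The failure mode from the log: int() rejects a float-formatted string.
try:
    int("1000000000.0")
except ValueError as error:
    print("int() refused it:", error)

# Converting through float first yields a string that int() accepts.
print(memcap_argument("1000000000.0"))
```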

Calculate AUC for each sample

Hi,

To be consistent with recount, it'll be great if Rail-RNA includes in counts.tsv.gz the AUC for each sample.

Best,
Leo
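
A possible shape for that computation, taking AUC to mean the area under a sample's coverage curve (the sum of per-base coverage over the genome). This is a sketch under that assumption. The function name coverage_auc and the (start, end, coverage) interval layout with an exclusive end are illustrative and not taken from Rail-RNA or recount.

```python
def coverage_auc(intervals):
    """Sum coverage over all bases.

    Each interval is (start, end, coverage) with an exclusive end, so an
    interval contributes (end - start) * coverage to the total.
    """
    return sum((end - start) * coverage for start, end, coverage in intervals)


# Two runs of constant coverage: 100 bases at depth 2, then 50 bases at depth 1.
example = [(0, 100, 2), (100, 150, 1)]
print(coverage_auc(example))
```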

partition pipeline so high-utilization steps can use different instance type from low-utilization steps

If there is a series of Rail-RNA steps (alignment-related ones, probably) that achieve good CPU utilization and load balance, then high-cpu instances are cost effective and well suited. If a series of steps is not like this (poor load balance, mostly I/O), then high-cpu instances are probably a waste of money, as most CPUs are idle.

If we partition the pipeline into stretches that either do or don't have these properties, then we could launch separate clusters (with different instance types) for those stretches. This could reduce costs relative to a pipeline that runs end-to-end on a big high-cpu cluster.

Downside: we pay the cost of bootstrapping multiple times for a given dataset. But we might also be able to simplify the bootstrapping for any given cluster, since a given cluster is running only a portion of the overall pipeline. E.g. if there's no alignment involved then you don't have to download and install the index. If samtools isn't involved, you don't have to install samtools.

Thanks to elasticity, there's no reason this would have to come at the expense of overall throughput.

make_it_rail requires [email protected]'s password

I am trying to run make_it_rail.sh per the manual (http://docs.rail.bio/installation/) but it needs [email protected]'s password:

$ sh make_it_rail.sh
  adding: __main__.py (deflated 89%)
  adding: cloudformation/ (stored 0%)
  adding: cloudformation/dbgap.template (deflated 80%)
  (snip)
  adding: rna/utils/tempdel.py (deflated 55%)
  adding: version.py (deflated 20%)
Installer created at /Users/langmead/git/rail/releases/install_rail-rna-devel .
Copying to webfactional... .
[email protected]'s password:

Spurious error about existing bucket

Scenario: I have a bucket blah and my EMR job is set to output to a subdirectory of blah. When I try to run Rail starting from json:

cat hg19.json | python src/dooplicity/emr_runner.py --region us-east-1

I get an error:

RuntimeError: Bucket `blah` already exists on S3. Change affected output directories in job flow and try again. The more distinctive the name chosen, the less likely it's already a bucket on S3. It may be easier to create a bucket first using the web interface and use its name + a subdirectory as the output directory.

I don't know why the bucket already existing should pose a problem here.

Normalize mean consistent with `recount`

Hi,

To be consistent with recount, it'll be great if Rail-RNA uses the AUC as provided in #47 for normalizing the samples before calculating the mean bigwig. In recount we are normalizing to a 40 million 100 bp SE reads library using the AUC.

Best,
Leo
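
As a rough sketch of the arithmetic, assume the target library's AUC is simply 40 million reads times 100 bp and that each sample's coverage is multiplied by target AUC over sample AUC before averaging. The names below are hypothetical, and recount's exact procedure may differ in detail.

```python
TARGET_READS = 40000000   # 40 million single-end reads
READ_LENGTH = 100         # 100 bp per read


def auc_scale_factor(sample_auc, target_reads=TARGET_READS,
                     read_length=READ_LENGTH):
    """Factor that rescales a sample's coverage to the target library size."""
    if sample_auc <= 0:
        raise ValueError("sample AUC must be positive")
    return (target_reads * read_length) / float(sample_auc)


def mean_normalized_coverage(coverages, aucs):
    """Mean across samples of AUC-normalized coverage at one position."""
    if not coverages or len(coverages) != len(aucs):
        raise ValueError("need one AUC per coverage value")
    scaled = [cov * auc_scale_factor(auc) for cov, auc in zip(coverages, aucs)]
    return sum(scaled) / len(scaled)
```

Under these assumptions, a sample whose AUC is half the target gets its coverage doubled, so deeply and shallowly sequenced samples contribute comparably to the mean.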

Optionally skip "Copying Rail-RNA and bootstraps to S3..."

Sometimes I am running the tool in elastic mode just to get a json file (with --json) and not to actually start a job. In that case, I'd like to suppress the uploading of the software to S3, which can be slow over a slow connection.

use ncbi/ngs library to slurp in SRA reads

This Python module seems to be a reasonably easy-to-use and portable way to slurp a stream of reads from the SRA: https://github.com/ncbi/ngs/tree/master/ngs-python.

It depends on a library that they will dynamically download and install in ~/.ncbi upon first use of the LibManager module:

https://github.com/ncbi/ngs/blob/master/ngs-python/ngs/LibManager.py

There's a many-second delay when this happens. Overall, I don't see anything terribly wrong with that way of doing things.

It's written by US govt employees and as such isn't copyrighted. I'm pretty sure, but not 100% positive, that makes it OK to package in an open source project.

Final note: I thought that DNANexus mirrored the SRA, but it seems they only mirror the SRA metadata. So this library is only for accessing reads at NCBI, and there is no mirror that I know of.

add counters for debugging

"ideas for counters we'd like to have (a) how many reads/readlets align, (b) number of introns called, (c) number of introns & nucleotides that go into intron index"

Python subprocess OSError

This might be a problem with just my machine, but I just did a fresh install of Ubuntu onto a new SSD so it's possible others are experiencing this as well.

I tried to install Rail-RNA using the downloaded installer as well as the command line script and run into this error with both. It seems to occur right after all the dependencies are downloaded and unpacked:

Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/var/tmp/e5618b82c6dd032102ed3aa1d2c2e502/__main__.py", line 90, in <module>
  File "/var/tmp/e5618b82c6dd032102ed3aa1d2c2e502/rna/driver/rna_installer.py", line 416, in install
  File "/usr/lib/python2.7/subprocess.py", line 212, in check_output
    process = Popen(stdout=PIPE, *popenargs, **kwargs)
  File "/usr/lib/python2.7/subprocess.py", line 390, in __init__
    errread, errwrite)
  File "/usr/lib/python2.7/subprocess.py", line 1025, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory

I've tried to do some Googling on this error and it looks like a potential Python 2 problem. I've tried reinstalling the most recent Python 2.7 which hasn't helped. Any ideas on what I might be doing wrong?

Thank you!

David
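
For context, when Python's subprocess module raises OSError with Errno 2, it usually means the program it was asked to launch could not be found on the PATH. The traceback does not say which program the installer tried to run, so the sketch below only shows a generic way to check a list of candidates. It uses Python 3's shutil.which, and the tool names in the example call are guesses for illustration, not a list taken from the installer.

```python
import shutil


def missing_executables(names):
    """Return the subset of names that cannot be found on the PATH."""
    return [name for name in names if shutil.which(name) is None]


# Hypothetical candidates; substitute whatever the installer is expected to call.
print(missing_executables(["make", "gcc", "unzip"]))
```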

Suppress intermediary bw option

In coverage_bigwigs, hundreds of per-chromosome bw files are generated. Please provide an option to report only the final bw file.

add precision/recall tests to line.py

Add specification of expected precision and recall on dataset as command-line parameters of line.py and return failure if they don't agree with test output; this way, changes to the underlying alignment algorithms must be "approved" by editing .travis.yml.
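
One way such a check could look, as a sketch only: compare measured precision and recall against the expected values within a tolerance and turn any disagreement into a nonzero exit code that a CI system would treat as failure. The function name, the dictionary layout, and the tolerance parameter are assumptions for illustration. Nothing here reflects how line.py is actually organized.

```python
def accuracy_exit_code(measured, expected, tolerance=1e-3):
    """Return 0 if precision and recall match expectations, else 1.

    Both arguments are dicts with "precision" and "recall" keys.
    """
    failures = [
        name for name in ("precision", "recall")
        if abs(measured[name] - expected[name]) > tolerance
    ]
    for name in failures:
        print("%s: expected %.4f, measured %.4f"
              % (name, expected[name], measured[name]))
    return 1 if failures else 0
```

A test script could pass the return value to sys.exit, so that a change in alignment accuracy fails the build until the expected values in the CI configuration are updated on purpose.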

Bowtie Indices relative paths not supported

While trying to run the Drosophila example from the Tutorial, I've noticed that providing the relative paths to the bowtie indices throws a runtime error. I'm calling the following command from an empty directory railtests, and the Drosophila_melanogaster directory containing the bowtie indices is one level above railtests:

[root@ip-10-5-163-113 railtests]# rail-rna go local -x ../Drosophila_melanogaster/UCSC/dm3/Sequence/BowtieIndex/genome ../Drosophila_melanogaster/UCSC/dm3/Sequence/Bowtie2Index/genome -m https://raw.githubusercontent.com/nellore/rail/master/ex/dm3_example.manifest

The runtime error which can be found in /rail-rna_logs/precoverage/dp.reduce.log/0.0.log:

Traceback (most recent call last):
  File "app_main.py", line 75, in run_toplevel
  File "/home/ec2-user/raildotbio/rail-rna/rna/steps/coverage_pre.py", line 130, in <module>
    os.path.expandvars(args.bowtie_idx)
  File "/home/ec2-user/raildotbio/rail-rna/rna/utils/bowtie_index.py", line 27, in __init__
    raise RuntimeError('No Bowtie index files with prefix "%s"' % idx_prefix)
RuntimeError: No Bowtie index files with prefix "../Drosophila_melanogaster/UCSC/dm3/Sequence/BowtieIndex/genome"

While the issue can be fixed simply by providing the full paths to the bowtie indices, I feel it would be best to support relative paths in the future, or to note this requirement in the documentation for now.
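
A sketch of the kind of normalization that would make relative prefixes safe: resolve the prefix against the directory where the command was launched, once, before any step that may run from a different working directory sees it. The function name absolute_index_prefix is hypothetical, and where such a call would belong in Rail-RNA is not something the traceback settles.

```python
import os


def absolute_index_prefix(prefix, start_dir=None):
    """Return an absolute, normalized version of a Bowtie index prefix."""
    expanded = os.path.expanduser(os.path.expandvars(prefix))
    if start_dir is None:
        start_dir = os.getcwd()
    # os.path.join keeps `expanded` unchanged when it is already absolute.
    return os.path.normpath(os.path.join(start_dir, expanded))


print(absolute_index_prefix(
    "../Drosophila_melanogaster/UCSC/dm3/Sequence/BowtieIndex/genome"
))
```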

Allow --tag specification

Since people may do multiple large, expensive runs with Rail, like we do, it's nice to be able to tag the EMR cluster so that the various projects can be distinguished on the bill.

Local mode, manifest file (paired-end samples) throws errors

Hi,

I believe that my manifest file is correct. That is, I have:

<local path pair1><tab><0><tab><local path pair2><tab><0><tab><sample name>

But when I run the rail prep local step, I get errors like this:

Rail-RNA v0.1.9c
Loading...
Traceback (most recent call last):
  File "/jhpce/shared/community/compiler/gcc/4.4.7/python/2.7.9/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/jhpce/shared/community/compiler/gcc/4.4.7/python/2.7.9/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/bst/student/lcollado/raildotbio/rail-rna/__main__.py", line 725, in <module>
    scratch=args.scratch
  File "/home/bst/student/lcollado/raildotbio/rail-rna/rna/driver/rna_config.py", line 4610, in __init__
    raise_runtime_error(base)
  File "/home/bst/student/lcollado/raildotbio/rail-rna/rna/driver/rna_config.py", line 1189, in raise_runtime_error
    ) if len(bases.errors) > 1 else bases.errors[0]
RuntimeError: 1) The following line from the manifest file /dcs01/ajaffe/Brain/derRuns/derSupplement/simulation/rail/rail-manifest.txt has an invalid sample label:
/dcs01/ajaffe/Brain/derRuns/derSupplement/simulation/simulated_reads/sample_01_1.fasta.gz   0   /dcs01/ajaffe/Brain/derRuns/derSupplement/simulation/simulated_reads/sample_01_2.fasta.gz   0   sample1G1R1
A valid sample label takes the following form:
<Group ID>-<BioRep ID>-<TechRep ID>
2) The following line from the manifest file /dcs01/ajaffe/Brain/derRuns/derSupplement/simulation/rail/rail-manifest.txt has an invalid sample label:
/dcs01/ajaffe/Brain/derRuns/derSupplement/simulation/simulated_reads/sample_02_1.fasta.gz   0   /dcs01/ajaffe/Brain/derRuns/derSupplement/simulation/simulated_reads/sample_02_2.fasta.gz   0   sample2G1R1
A valid sample label takes the following form:
<Group ID>-<BioRep ID>-<TechRep ID>

From the error I can see that rail is thinking that I'm in single-end mode instead of paired-end. Also, I don't think that my manifest file is different from https://github.com/nellore/rail/blob/master/eval/GEUVADIS_28_local.manifest so I'm not sure what's wrong.
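
Going only by the error text above, v0.1.9c expects the last field to look like <Group ID>-<BioRep ID>-<TechRep ID>. A quick pre-flight check along those lines could look like the sketch below. It assumes each ID is a non-empty run of characters with no hyphens or whitespace, which is a guess, since the message does not spell out the allowed characters.

```python
import re

# Assumption: three non-empty IDs separated by single hyphens.
LABEL_PATTERN = re.compile(r"[^-\s]+-[^-\s]+-[^-\s]+")


def is_valid_sample_label(label):
    """Check a label against the <Group ID>-<BioRep ID>-<TechRep ID> form."""
    return LABEL_PATTERN.fullmatch(label) is not None


print(is_valid_sample_label("sample1G1R1"))    # label from the failing manifest
print(is_valid_sample_label("sample1-G1-R1"))  # hypothetical hyphenated variant
```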

Details

manifest file which gets created by this script
script for local rail currently failing at rail prep local step.

I'm using v0.1.9c which is the latest non-devel version. Should I switch to the devel one?

ideas for major feature additions in 0.3.0

  • Resolve multimapping reads in a way that makes the most sense for downstream analyses. (How? Push coverage distribution toward more uniformity like MMR?)
  • Add option to run job flow with Spark on EMR
  • Permit automated sample clustering for collective analysis and report clusters?
  • Add more normalization methods
  • Play even nicer with derfinder; compute region matrix?
  • Investigate --bind-to-socket in MPI to constrain processor usage on cluster
  • Code robustness to losing and gaining ipengines in emr_simulator.py

--no-dependencies install option errors out

python rail/releases/install_rail-rna-devel --no-dependencies
Rail-RNA vdevel Installer
Rail-RNA can be installed for all users or just the current user.
* Install for all users? [y/n]: n
Traceback (most recent call last):
  File "/cm/shared/apps/python/2.7.9/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/cm/shared/apps/python/2.7.9/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "rail/releases/install_rail-rna-devel/__main__.py", line 90, in <module>
  File "/home-1/[email protected]/rail/releases/install_rail-rna-devel/rna/driver/rna_installer.py", line 381, in install
  File "/cm/shared/apps/python/2.7.9/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "rail/releases/install_rail-rna-devel/dooplicity/tools.py", line 63, in cd
OSError: [Errno 2] No such file or directory: '/tmp/tmpxPjmQ4/samtools-1.2'
