
eggo's People

Contributors

fnothaft, laserson, ryan-williams, tomwhite

eggo's Issues

Luigi assumes `hadoop` is on the `PATH`

Luigi assumes that `hadoop` is on the `PATH` where it is run. However, on a cluster provisioned by the Spark EC2 scripts, the `hadoop` binary is located in the `/root/ephemeral-hdfs/bin` directory, which is not on the `PATH`. It may be complicated to tell Luigi where to find the Hadoop binary, so we may need to add some config on our side that adds it to the `PATH`.
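
One possible workaround, sketched below, is to prepend the Hadoop bin directory to the `PATH` from our own code before Luigi shells out to `hadoop`. This is a minimal sketch: the default directory shown is the Spark EC2 layout mentioned above, and the helper name is hypothetical.

    # Sketch only: prepend the Hadoop bin directory to the PATH that Luigi's
    # subprocesses will see. The default below is the Spark EC2 layout noted
    # above; in practice the value would come from eggo config.
    import os

    def ensure_hadoop_on_path(hadoop_home='/root/ephemeral-hdfs'):
        hadoop_bin = os.path.join(hadoop_home, 'bin')
        path = os.environ.get('PATH', '')
        if hadoop_bin not in path.split(os.pathsep):
            os.environ['PATH'] = hadoop_bin + os.pathsep + path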

Publish eggo spec

Publish a document specifying the design and scope of the eggo project.

Add add'l config option for HDFS-specific location?

The eggo environment always assumes the availability of HDFS. Currently, `DownloadDatasetHadoopTask` sets as a prerequisite a `PrepareHadoopDownloadTask` whose `hdfs_path` gets whatever is in `ToastConfig().dfs_tmp_data_url()`. However, this could be an `s3n:` URL. Today this works because the resulting `HdfsTarget` object with the S3 URL still uses the `hadoop` CLI to check whether the path exists, and the `hadoop` CLI can process `s3n` URLs. But this feels potentially brittle.
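
One way to make this explicit, assuming the hypothetical HDFS-specific config key the issue title suggests, could look roughly like the sketch below; the section and option names are illustrative placeholders, not actual eggo config keys.

    # Sketch: prefer a dedicated HDFS-specific tmp location if one is
    # configured, falling back to the generic dfs_tmp_data_url. The
    # section/option names here are illustrative placeholders.
    def hdfs_tmp_data_url(eggo_config):
        if eggo_config.has_option('dfs', 'hdfs_tmp_data_url'):
            return eggo_config.get('dfs', 'hdfs_tmp_data_url')
        return eggo_config.get('dfs', 'dfs_tmp_data_url')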

Add file system abstraction

In supporting other target file systems (#42), things get more complicated: we have to handle multiple types of URIs and basically end up writing lots of switch statements. It'd be great to have a Hadoop-style file system interface for which we can supply multiple implementations that do the right thing. Then we'd have a global config that decides whether we're targeting S3, HDFS, or local.
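
A minimal sketch of the kind of interface meant here; none of these classes exist in eggo today, and an HDFS or S3 implementation would presumably wrap the corresponding Luigi clients.

    # Illustrative sketch of a Hadoop-style file system abstraction. A global
    # config entry would decide which implementation to instantiate.
    import abc
    import os


    class FileSystem(object):
        __metaclass__ = abc.ABCMeta

        @abc.abstractmethod
        def exists(self, path):
            pass

        @abc.abstractmethod
        def mkdirs(self, path):
            pass


    class LocalFileSystem(FileSystem):
        def exists(self, path):
            return os.path.exists(path)

        def mkdirs(self, path):
            if not os.path.isdir(path):
                os.makedirs(path)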

Deal properly with merging and partitioning data.

For example, the 1000 Genomes VCF data is organized into individual files by chromosome, which causes the files to differ substantially in size. My ideal would be to merge everything into a single file, or to split the genome into equal-sized bins based on locus and sort the data into them.
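
For the binning approach, here is a minimal sketch of mapping a locus to a fixed-width bin; the 10 Mb bin width is an arbitrary illustrative choice, not a value eggo uses.

    # Sketch: assign each record to a fixed-width genomic bin so that output
    # partitions are roughly equal in size regardless of chromosome length.
    # The 10 Mb bin width is arbitrary, chosen only for illustration.
    BIN_WIDTH = 10 * 1000 * 1000

    def locus_bin(contig, position, bin_width=BIN_WIDTH):
        """Return a (contig, bin_index) partition key for a record."""
        return (contig, position // bin_width)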

Setup config files where relevant

There are too many hard-coded values; we need to set up a config file. It's probably most practical to use the Python `ConfigParser` lib and INI files.
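
A minimal sketch of loading such a config with the standard library; the file name and keys are placeholders.

    # Sketch: load an INI-style eggo config with the stdlib ConfigParser
    # (Python 2 module name). The file path and keys are placeholders.
    from ConfigParser import SafeConfigParser

    def load_eggo_config(path='eggo.cfg'):
        parser = SafeConfigParser()
        with open(path) as fp:
            parser.readfp(fp)
        return parser

    # e.g.: eggo_config = load_eggo_config(); eggo_config.get('aws', 'region')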

Support local testing

It should be possible to develop and test Eggo locally without running on a cluster.

Support arbitrary Hadoop filesystem for output

It would be useful to be able to run Eggo to write outputs to HDFS or the local filesystem (for testing). Currently, only S3 is supported.

This might be a bit more involved than simply changing URLs, since Luigi has different target types for different filesystems, e.g. `S3FlagTarget` vs. `HdfsTarget` vs. `LocalTarget`. Also, the use of `aws cp` should be replaced by `hadoop fs -put`.
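
A rough sketch of picking the Luigi target type from the output URL scheme; the target classes are the ones named above, but the factory itself is illustrative and the exact Luigi import paths may differ by version.

    # Sketch: choose a Luigi target based on the scheme of the output URL.
    # Import paths are for the Luigi of this era and may differ by version.
    from urlparse import urlparse

    import luigi
    from luigi.hdfs import HdfsTarget
    from luigi.s3 import S3FlagTarget


    def output_target(url):
        scheme = urlparse(url).scheme
        if scheme in ('s3', 's3n'):
            return S3FlagTarget(url)
        elif scheme == 'hdfs':
            return HdfsTarget(url)
        else:
            return luigi.LocalTarget(url)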

Add config shortcuts to config.py

There are plenty of instances of, e.g., `eggo_config.get('worker_env', 'hadoop_home')`, which is very verbose. Since these values should not change after `config.py` is loaded, it'd be great to provide shortcuts for the most common ones.
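
A minimal sketch of what such shortcuts could look like; the parser setup stands in for whatever `config.py` actually builds, and only the 'worker_env'/'hadoop_home' key comes from the issue text.

    # Sketch: resolve common config values once, right after the config is
    # loaded, so call sites can use short names instead of verbose get() calls.
    from ConfigParser import SafeConfigParser

    eggo_config = SafeConfigParser()
    eggo_config.read('eggo.cfg')  # placeholder; config.py supplies the real path

    def _shortcut(section, option, default=None):
        if eggo_config.has_option(section, option):
            return eggo_config.get(section, option)
        return default

    HADOOP_HOME = _shortcut('worker_env', 'hadoop_home')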

Use Hadoop to run source dataset downloads in parallel

`DownloadDatasetParallelTask` in `dag.py` handles the parallelization of the source downloads itself. It would be preferable to let Hadoop do this, for a couple of reasons: 1) it removes the SSH dependency (allowing Eggo to run on a cluster that wasn't provisioned with Fabric), and 2) fault tolerance is handled by Hadoop rather than by Eggo code.

The simplest way is probably to write a streaming script that uses `NLineInputFormat` so each mapper can download a single source file.
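
A minimal sketch of such a mapper, assuming one source URL per input line; curl and the /tmp destination are arbitrary illustrative choices.

    # Sketch of a streaming mapper fed by NLineInputFormat: each mapper gets a
    # single input line containing one source URL, downloads it locally, and
    # pushes it to the distributed file system. Error handling is omitted.
    import subprocess
    import sys

    for line in sys.stdin:
        # streaming passes "offset<TAB>line"; the URL is the last field
        url = line.strip().split('\t')[-1]
        if not url:
            continue
        local_name = url.rstrip('/').split('/')[-1]
        subprocess.check_call(['curl', '-sSL', '-o', local_name, url])
        subprocess.check_call(['hadoop', 'fs', '-put', local_name, '/tmp/' + local_name])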

VCF2ADAMTask failing because of error with Hadoop-BAM's VCFInputFormat

Here is an example stack trace:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, ip-10-138-100-70.ec2.internal): java.lang.IllegalArgumentException: unknown VCF format, cannot create RecordReader: hdfs://ec2-54-157-172-152.compute-1.amazonaws.com:9000/tmp/tmp_eggo_2015-05-09T00-51-50_AODY.vcf/6a2df5dfbf260f9a0ac9c5270fe27c97.https___github.com_bigdatagenomics_eggo_raw_master_test_resources_chr22.small.vcf
    at org.seqdoop.hadoop_bam.VCFInputFormat.createRecordReader(VCFInputFormat.java:178)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
    at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
    at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:56)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

I'm guessing we're running into the silenced IOException here:
https://github.com/HadoopGenomics/Hadoop-BAM/blob/master/src/main/java/org/seqdoop/hadoop_bam/VCFInputFormat.java#L145

@tomwhite any idea what's going on here? If you want to reproduce, you should be able to just run my latest branch (laserson/EGGO-62):

eggo provision
eggo deploy_config
eggo setup_master
eggo setup_slaves
eggo toast:config=$EGGO_HOME/test/registry/test-genotypes.json

Pull in a small sample of 1000G low coverage WGS data

I was going to pull down 50 low-coverage WGS files for testing with avocado:

HG00126
HG00173
HG00258
HG00369
HG00583
HG00641
HG00664
HG01070
HG01079
HG01346
HG01495
HG01629
NA10836
NA11829
NA12234
NA12375
NA17978
NA18489
NA18546
NA18549
NA18633
NA18702
NA18773
NA18789
NA18989
NA19070
NA19107
NA19138
NA19153
NA19173
NA19175
NA19180
NA19198
NA19240
NA19564
NA19568
NA19729
NA19913
NA20288
NA20364
NA20761
NA20772
NA20864
NA20886
NA21297
NA21344
NA21369
NA21371
NA21494
NA21636

These samples give us a population distribution of:

   3 ASW
   4 CEU
   5 CHB
   2 CHD
   2 CHS
   2 CLM
   2 FIN
   2 GBR
   2 GIH
   1 IBS
   4 JPT
   6 MKK
   1 MXL
   3 PUR
   2 TSI
   9 YRI

Hadoop streaming downloader is breaking because the mapper can't get the AWS credentials

We are using Hadoop streaming to manage the parallel downloads of the datasets to S3. In the process, we instantiate a Luigi `S3Client` object, which needs the AWS credentials. As written, we try to get these values from the standard environment variables. However, these environment variables are not available to the streaming mapper.

One solution is to set `-cmdenv` vars in the invocation of the streaming job. However, it appears that Luigi does not support setting these options in `HadoopJobRunner`. There are probably many workarounds, but the best would be to submit a Luigi patch that lets you set environment variables.

Thoughts, @tomwhite? Is there a better way to get the AWS credentials to the mapper?
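
For reference, a rough sketch of the `-cmdenv` approach on the job-launch side; `-cmdenv` is a standard Hadoop streaming option, but the jar path and argument wiring below are placeholders, and Luigi's `HadoopJobRunner` would still need the patch mentioned above to do this itself.

    # Sketch: launch the streaming job with -cmdenv so the mapper's environment
    # contains the AWS credentials. -cmdenv is a standard Hadoop streaming
    # option; the jar path and other arguments here are placeholders.
    import os
    import subprocess

    def run_streaming_download(mapper, input_path, output_path,
                               streaming_jar='/path/to/hadoop-streaming.jar'):
        cmd = [
            'hadoop', 'jar', streaming_jar,
            '-cmdenv', 'AWS_ACCESS_KEY_ID=' + os.environ['AWS_ACCESS_KEY_ID'],
            '-cmdenv', 'AWS_SECRET_ACCESS_KEY=' + os.environ['AWS_SECRET_ACCESS_KEY'],
            '-input', input_path,
            '-output', output_path,
            '-mapper', mapper,
        ]
        subprocess.check_call(cmd)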

Eliminate use of as many env variables as possible

(Not sure if this is what I want yet.) But it seems redundant to have another file of env variables that must get sourced before running anything on the workers. In principle, the eggo config should contain all the necessary information, so we should be able to set the minimal env vars needed for each command in situ.

Consider using Cloudera Director for cluster provisioning

The Spark EC2 scripts are of variable quality depending on the instance types and options that are set. Also, they mainly just set up Spark, whereas we may want some of the other tools in the Hadoop stack (e.g., @tomwhite's partitioning tool, which uses Crunch/Hadoop 2.x). Cloudera Director may make it more reliable to set up the whole stack, and may also make it easier to support alternate clouds.

Tagging datasets with reference genomes

I pulled down the 1000G genotypes for an analysis today and lost an hour to the typical fun times that come from mixing a GRCh37 dataset with an hg19 dataset (specifically, the 1 vs. chr1 naming mismatch). Perhaps we should "tag" input datasets with the specific reference they're aligned against, where relevant? Going further, would it make sense to ensure that all datasets in eggo are aligned against a specific reference genome? I wouldn't want to be as restrictive as saying everything must be lifted over to hg19, but I think it might make sense to ensure that everything is aligned against a GRCh genome build instead of a UCSC build.
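
A minimal sketch of the kind of check such tagging would enable; the 'reference' field name and the naming heuristic are purely illustrative.

    # Sketch: if each registry entry carried a 'reference' tag, a cheap check
    # could catch GRCh37 vs. hg19 mix-ups before an analysis starts. The field
    # name and the chr-prefix heuristic are illustrative only.
    def contig_style(contig_name):
        return 'ucsc' if contig_name.startswith('chr') else 'grch'

    def check_same_reference(dataset_a, dataset_b):
        ref_a = dataset_a.get('reference')
        ref_b = dataset_b.get('reference')
        if ref_a != ref_b:
            raise ValueError('datasets aligned to different references: '
                             '%r vs %r' % (ref_a, ref_b))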
