
eggo's People

Contributors

fnothaft, laserson, ryan-williams, tomwhite

eggo's Issues

Luigi assumes `hadoop` is on the `PATH`

Luigi assumes that `hadoop` is on the `PATH` where it is run. However, on a cluster provisioned by the Spark EC2 scripts, the `hadoop` binary is located in the `/root/ephemeral-hdfs/bin` directory, which is not on the `PATH`. It may be complicated to tell Luigi where to find the Hadoop binary, so we may need to add some config on our side that adds it to the `PATH`.
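
One possible workaround, sketched below, is to prepend the Hadoop bin directory to the `PATH` from our own code before Luigi shells out to `hadoop`. This is a minimal sketch: the default directory shown is the Spark EC2 layout mentioned above, and the helper name is hypothetical.

    # Sketch only: prepend the Hadoop bin directory to the PATH that Luigi's
    # subprocesses will see. The default below is the Spark EC2 layout noted
    # above; in practice the value would come from eggo config.
    import os

    def ensure_hadoop_on_path(hadoop_home='/root/ephemeral-hdfs'):
        hadoop_bin = os.path.join(hadoop_home, 'bin')
        path = os.environ.get('PATH', '')
        if hadoop_bin not in path.split(os.pathsep):
            os.environ['PATH'] = hadoop_bin + os.pathsep + path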

Publish eggo spec

Publish a document specifying the design and scope of the eggo project.

Add add'l config option for HDFS-specific location?

The eggo environment always assumes the availability of HDFS. Currently, `DownloadDatasetHadoopTask` sets as a prerequisite a `PrepareHadoopDownloadTask` whose `hdfs_path` gets whatever is in `ToastConfig().dfs_tmp_data_url()`. However, this could be an `s3n:` URL. Today this works because the resulting `HdfsTarget` object with the S3 URL still uses the `hadoop` CLI to check whether the path exists, and the `hadoop` CLI can process `s3n` URLs. But this feels potentially brittle.
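
One way to make this explicit, assuming the hypothetical HDFS-specific config key the issue title suggests, could look roughly like the sketch below; the section and option names are illustrative placeholders, not actual eggo config keys.

    # Sketch: prefer a dedicated HDFS-specific tmp location if one is
    # configured, falling back to the generic dfs_tmp_data_url. The
    # section/option names here are illustrative placeholders.
    def hdfs_tmp_data_url(eggo_config):
        if eggo_config.has_option('dfs', 'hdfs_tmp_data_url'):
            return eggo_config.get('dfs', 'hdfs_tmp_data_url')
        return eggo_config.get('dfs', 'dfs_tmp_data_url')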

Add file system abstraction

In supporting other target file systems (#42), things get more complicated: we have to handle multiple types of URIs and basically end up writing lots of switch statements. It'd be great to have a Hadoop-style file system interface for which we can supply multiple implementations that do the right thing. Then we'd have a global config that decides whether we're targeting S3, HDFS, or local.
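
A minimal sketch of the kind of interface meant here; none of these classes exist in eggo today, and an HDFS or S3 implementation would presumably wrap the corresponding Luigi clients.

    # Illustrative sketch of a Hadoop-style file system abstraction. A global
    # config entry would decide which implementation to instantiate.
    import abc
    import os


    class FileSystem(object):
        __metaclass__ = abc.ABCMeta

        @abc.abstractmethod
        def exists(self, path):
            pass

        @abc.abstractmethod
        def mkdirs(self, path):
            pass


    class LocalFileSystem(FileSystem):
        def exists(self, path):
            return os.path.exists(path)

        def mkdirs(self, path):
            if not os.path.isdir(path):
                os.makedirs(path)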

Deal properly with merging and partitioning data.

For example, the 1000 Genomes VCF data is organized into individual files by chromosome, which causes the files to differ substantially in size. My ideal would be to merge everything into a single file, or to split the genome into equal-sized bins based on locus and sort the data into them.
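
For the binning approach, here is a minimal sketch of mapping a locus to a fixed-width bin; the 10 Mb bin width is an arbitrary illustrative choice, not a value eggo uses.

    # Sketch: assign each record to a fixed-width genomic bin so that output
    # partitions are roughly equal in size regardless of chromosome length.
    # The 10 Mb bin width is arbitrary, chosen only for illustration.
    BIN_WIDTH = 10 * 1000 * 1000

    def locus_bin(contig, position, bin_width=BIN_WIDTH):
        """Return a (contig, bin_index) partition key for a record."""
        return (contig, position // bin_width)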

Setup config files where relevant

There are too many hard-coded values; we need to set up a config file. It's probably most practical to use the Python `ConfigParser` lib and INI files.
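
A minimal sketch of loading such a config with the standard library; the file name and keys are placeholders.

    # Sketch: load an INI-style eggo config with the stdlib ConfigParser
    # (Python 2 module name). The file path and keys are placeholders.
    from ConfigParser import SafeConfigParser

    def load_eggo_config(path='eggo.cfg'):
        parser = SafeConfigParser()
        with open(path) as fp:
            parser.readfp(fp)
        return parser

    # e.g.: eggo_config = load_eggo_config(); eggo_config.get('aws', 'region')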

Support local testing

It should be possible to develop and test Eggo locally without running on a cluster.

Support arbitrary Hadoop filesystem for output

It would be useful to be able to run Eggo to write outputs to HDFS or the local filesystem (for testing). Currently, only S3 is supported.

This might be a bit more involved than simply changing URLs, since Luigi has different target types for different filesystems, e.g. `S3FlagTarget` vs. `HdfsTarget` vs. `LocalTarget`. Also, the use of `aws cp` should be replaced by `hadoop fs -put`.
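
A rough sketch of picking the Luigi target type from the output URL scheme; the target classes are the ones named above, but the factory itself is illustrative and the exact Luigi import paths may differ by version.

    # Sketch: choose a Luigi target based on the scheme of the output URL.
    # Import paths are for the Luigi of this era and may differ by version.
    from urlparse import urlparse

    import luigi
    from luigi.hdfs import HdfsTarget
    from luigi.s3 import S3FlagTarget


    def output_target(url):
        scheme = urlparse(url).scheme
        if scheme in ('s3', 's3n'):
            return S3FlagTarget(url)
        elif scheme == 'hdfs':
            return HdfsTarget(url)
        else:
            return luigi.LocalTarget(url)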

Add config shortcuts to config.py

There are plenty of instances of, e.g., `eggo_config.get('worker_env', 'hadoop_home')`, which is very verbose. Since these values should not change after `config.py` is loaded, it'd be great to provide shortcuts for the most common ones.
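
A minimal sketch of what such shortcuts could look like; the parser setup stands in for whatever `config.py` actually builds, and only the 'worker_env'/'hadoop_home' key comes from the issue text.

    # Sketch: resolve common config values once, right after the config is
    # loaded, so call sites can use short names instead of verbose get() calls.
    from ConfigParser import SafeConfigParser

    eggo_config = SafeConfigParser()
    eggo_config.read('eggo.cfg')  # placeholder; config.py supplies the real path

    def _shortcut(section, option, default=None):
        if eggo_config.has_option(section, option):
            return eggo_config.get(section, option)
        return default

    HADOOP_HOME = _shortcut('worker_env', 'hadoop_home')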

Use Hadoop to run source dataset downloads in parallel

`DownloadDatasetParallelTask` in `dag.py` handles the parallelization of the source downloads itself. It would be preferable to let Hadoop do this, for a couple of reasons: 1) it removes the SSH dependency (allowing Eggo to run on a cluster that wasn't provisioned with Fabric), and 2) fault tolerance is handled by Hadoop rather than by Eggo code.

The simplest way is probably to write a streaming script that uses `NLineInputFormat` so each mapper can download a single source file.
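
A minimal sketch of such a mapper, assuming one source URL per input line; curl and the /tmp destination are arbitrary illustrative choices.

    # Sketch of a streaming mapper fed by NLineInputFormat: each mapper gets a
    # single input line containing one source URL, downloads it locally, and
    # pushes it to the distributed file system. Error handling is omitted.
    import subprocess
    import sys

    for line in sys.stdin:
        # streaming passes "offset<TAB>line"; the URL is the last field
        url = line.strip().split('\t')[-1]
        if not url:
            continue
        local_name = url.rstrip('/').split('/')[-1]
        subprocess.check_call(['curl', '-sSL', '-o', local_name, url])
        subprocess.check_call(['hadoop', 'fs', '-put', local_name, '/tmp/' + local_name])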

VCF2ADAMTask failing because of error with Hadoop-BAM's VCFInputFormat

Here is an example stack trace:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, ip-10-138-100-70.ec2.internal): java.lang.IllegalArgumentException: unknown VCF format, cannot create RecordReader: hdfs://ec2-54-157-172-152.compute-1.amazonaws.com:9000/tmp/tmp_eggo_2015-05-09T00-51-50_AODY.vcf/6a2df5dfbf260f9a0ac9c5270fe27c97.https___github.com_bigdatagenomics_eggo_raw_master_test_resources_chr22.small.vcf
    at org.seqdoop.hadoop_bam.VCFInputFormat.createRecordReader(VCFInputFormat.java:178)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
    at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
    at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:56)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

I'm guessing we're running into the silenced IOException here:
https://github.com/HadoopGenomics/Hadoop-BAM/blob/master/src/main/java/org/seqdoop/hadoop_bam/VCFInputFormat.java#L145

@tomwhite any idea what's going on here? If you want to reproduce, you should be able to just run my latest branch (laserson/EGGO-62):

eggo provision
eggo deploy_config
eggo setup_master
eggo setup_slaves
eggo toast:config=$EGGO_HOME/test/registry/test-genotypes.json

Pull in a small sample of 1000G low coverage WGS data

I was going to pull down 50 low-coverage WGS files for testing with avocado:

HG00126
HG00173
HG00258
HG00369
HG00583
HG00641
HG00664
HG01070
HG01079
HG01346
HG01495
HG01629
NA10836
NA11829
NA12234
NA12375
NA17978
NA18489
NA18546
NA18549
NA18633
NA18702
NA18773
NA18789
NA18989
NA19070
NA19107
NA19138
NA19153
NA19173
NA19175
NA19180
NA19198
NA19240
NA19564
NA19568
NA19729
NA19913
NA20288
NA20364
NA20761
NA20772
NA20864
NA20886
NA21297
NA21344
NA21369
NA21371
NA21494
NA21636

These samples give us a population distribution of:

   3 ASW
   4 CEU
   5 CHB
   2 CHD
   2 CHS
   2 CLM
   2 FIN
   2 GBR
   2 GIH
   1 IBS
   4 JPT
   6 MKK
   1 MXL
   3 PUR
   2 TSI
   9 YRI

Hadoop streaming downloader is breaking because the mapper can't get the AWS credentials

We are using Hadoop streaming to manage the parallel downloads of the datasets to S3. In the process, we instantiate a Luigi `S3Client` object, which needs the AWS credentials. As written, we try to get these values from the standard environment variables. However, these environment variables are not available to the streaming mapper.

One solution is to set `-cmdenv` vars in the invocation of the streaming job. However, it appears that Luigi does not support setting these options in `HadoopJobRunner`. There are probably many workarounds, but the best would be to submit a Luigi patch that lets you set environment variables.

Thoughts, @tomwhite? Is there a better way to get the AWS credentials to the mapper?
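
For reference, a rough sketch of the `-cmdenv` approach on the job-launch side; `-cmdenv` is a standard Hadoop streaming option, but the jar path and argument wiring below are placeholders, and Luigi's `HadoopJobRunner` would still need the patch mentioned above to do this itself.

    # Sketch: launch the streaming job with -cmdenv so the mapper's environment
    # contains the AWS credentials. -cmdenv is a standard Hadoop streaming
    # option; the jar path and other arguments here are placeholders.
    import os
    import subprocess

    def run_streaming_download(mapper, input_path, output_path,
                               streaming_jar='/path/to/hadoop-streaming.jar'):
        cmd = [
            'hadoop', 'jar', streaming_jar,
            '-cmdenv', 'AWS_ACCESS_KEY_ID=' + os.environ['AWS_ACCESS_KEY_ID'],
            '-cmdenv', 'AWS_SECRET_ACCESS_KEY=' + os.environ['AWS_SECRET_ACCESS_KEY'],
            '-input', input_path,
            '-output', output_path,
            '-mapper', mapper,
        ]
        subprocess.check_call(cmd)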

Eliminate use of as many env variables as possible

(Not sure if this is what I want yet.) But it seems redundant to have another file of env variables that must get sourced before running anything on the workers. In principle, the eggo config should contain all the necessary information, so we should be able to set the minimal env vars needed for each command in situ.

Consider using Cloudera Director for cluster provisioning

The Spark EC2 scripts are of variable quality depending on the instance types and options that are set. Also, they mainly just set up Spark, whereas we may want some of the other tools in the Hadoop stack (e.g., @tomwhite's partitioning tool, which uses Crunch/Hadoop 2.x). Cloudera Director may make it more reliable to set up the whole stack, and may also make it easier to support alternate clouds.

Tagging datasets with reference genomes

I pulled down the 1000G genotypes for an analysis today and lost an hour to the typical fun times that come from mixing a GRCh37 dataset with an hg19 dataset (specifically, the 1 vs. chr1 naming mismatch). Perhaps we should "tag" input datasets with the specific reference they're aligned against, where relevant? Going further, would it make sense to ensure that all datasets in eggo are aligned against a specific reference genome? I wouldn't want to be as restrictive as saying everything must be lifted over to hg19, but I think it might make sense to ensure that everything is aligned against a GRCh genome build instead of a UCSC build.
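
A minimal sketch of the kind of check such tagging would enable; the 'reference' field name and the naming heuristic are purely illustrative.

    # Sketch: if each registry entry carried a 'reference' tag, a cheap check
    # could catch GRCh37 vs. hg19 mix-ups before an analysis starts. The field
    # name and the chr-prefix heuristic are illustrative only.
    def contig_style(contig_name):
        return 'ucsc' if contig_name.startswith('chr') else 'grch'

    def check_same_reference(dataset_a, dataset_b):
        ref_a = dataset_a.get('reference')
        ref_b = dataset_b.get('reference')
        if ref_a != ref_b:
            raise ValueError('datasets aligned to different references: '
                             '%r vs %r' % (ref_a, ref_b))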
