bigdatagenomics / eggo
Ready-to-go Parquet-formatted public 'omics datasets
License: Apache License 2.0
We are using Hadoop Streaming to manage the parallel downloads of the datasets to S3. In the process, we instantiate a Luigi S3Client object, which needs the AWS credentials. As written, we try to get these values from the standard environment variables. However, these environment variables are not available to the streaming mapper.
One solution is to set -cmdenv vars in the invocation of the streaming job. However, it appears that Luigi does not support setting these options in HadoopJobRunner. There are probably many workarounds, but the best would be to submit a Luigi patch that lets you set environment vars.
Thoughts, @tomwhite? Is there a better way to get the AWS credentials to the mapper?
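One possible workaround (a rough sketch, not current eggo behavior) is to stop relying on the mapper's environment and pass the credentials to S3Client explicitly, e.g. reading them from the Luigi config that ships with the job; the [s3] section and option names below are assumptions.

import luigi.configuration
from luigi.s3 import S3Client  # moved to luigi.contrib.s3 in later Luigi releases

def make_s3_client():
    # Read the credentials from the Luigi config file instead of the mapper's
    # environment variables; the section/option names are assumptions.
    config = luigi.configuration.get_config()
    access_key = config.get('s3', 'aws_access_key_id')
    secret_key = config.get('s3', 'aws_secret_access_key')
    return S3Client(aws_access_key_id=access_key,
                    aws_secret_access_key=secret_key)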
This should be an implementation detail. The user shouldn't really need to specify this in the eggo config.
"conf" is too overloaded.
We need to define a minimal API for implementing a new execution context (e.g., spark-ec2 scripts, Director, local, etc.).
This ensures that Luigi can get back the logs from the Hadoop tasks.
It would be useful to be able to run Eggo to write outputs to HDFS or the local filesystem (for testing). Currently, only S3 is supported.
This might be a bit more involved than simply changing URLs, since Luigi has different types of target for different filesystems, e.g. S3FlagTarget vs. HdfsTarget vs. LocalTarget. Also, the use of aws cp should be replaced by hadoop fs -put.
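A rough sketch of what the target selection could look like, picking a Luigi target class from the URL scheme (the helper name is made up, and the exact class locations vary across Luigi versions):

try:
    from urlparse import urlparse      # Python 2
except ImportError:
    from urllib.parse import urlparse  # Python 3

from luigi import LocalTarget
from luigi.hdfs import HdfsTarget   # luigi.contrib.hdfs in later Luigi releases
from luigi.s3 import S3FlagTarget   # luigi.contrib.s3 in later Luigi releases

def target_for(url):
    # Map a URL to the appropriate Luigi target type based on its scheme.
    scheme = urlparse(url).scheme
    if scheme in ('s3', 's3n'):
        return S3FlagTarget(url)
    elif scheme == 'hdfs':
        return HdfsTarget(url)
    elif scheme in ('', 'file'):
        return LocalTarget(urlparse(url).path)
    raise ValueError('unsupported URL scheme in %s' % url)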
The eggo environment always assumes the availability of HDFS. Currently, DownloadDatasetHadoopTask sets as a prerequisite a PrepareHadoopDownloadTask whose hdfs_path gets whatever is in ToastConfig().dfs_tmp_data_url(). However, this could be an s3n: URL. Currently, this works because the resulting HdfsTarget object with the S3 URL still uses the hadoop CLI command to check whether the path exists, and the hadoop CLI can process s3n URLs. But this feels potentially brittle.
Luigi assumes that hadoop is on the PATH where it is run. However, in the case of a cluster provisioned by the Spark EC2 scripts, the hadoop binary is located in the /root/ephemeral-hdfs/bin directory, which is not on the PATH. It may be complex to specify to Luigi where to get the Hadoop binary, so we may need to add some kind of config on our side that adds it to the PATH.
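A minimal sketch of that config-side fix, assuming the Hadoop location is available from the existing eggo config (the helper itself is hypothetical):

import os

def add_hadoop_to_path(eggo_config):
    # Prepend the Hadoop bin directory from the eggo config to PATH so that
    # Luigi's calls to the hadoop CLI resolve, e.g. /root/ephemeral-hdfs/bin.
    hadoop_home = eggo_config.get('worker_env', 'hadoop_home')
    hadoop_bin = os.path.join(hadoop_home, 'bin')
    if hadoop_bin not in os.environ.get('PATH', '').split(os.pathsep):
        os.environ['PATH'] = hadoop_bin + os.pathsep + os.environ.get('PATH', '')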
This doesn't matter too much for provisioned clusters that will anyway be blown away, but for local use/testing, it'd be great not to pollute the default system python with the packages/versions that eggo installs.
Combine provision, setup_master, and setup_slaves.
(Not sure if this is what I want yet.) But it seems redundant to have another file of env variables that must be sourced prior to running anything on the workers. In principle, the eggo config should have all the necessary information, so we should be able to set the minimal env vars necessary for each command in situ.
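For example (a sketch only, assuming Fabric 1.x and using the existing eggo config key names), the env could be set per command with Fabric's shell_env context manager rather than a sourced file:

from fabric.api import run, shell_env

def run_with_worker_env(eggo_config, cmd):
    # Set only the env vars this particular command needs, derived from the
    # eggo config, instead of sourcing a separate env file on the worker.
    with shell_env(HADOOP_HOME=eggo_config.get('worker_env', 'hadoop_home')):
        run(cmd)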
Follow on to #44
I assumed that -mkdir -p existed for the Hadoop CLI, but apparently not. However, -put will create directories if necessary. But this is still not good enough, because we are depending on the dir existing prior to the copy. To fix, we'll run a separate -mkdir command up front; it will fail if the dir already exists, but that's ok.
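Something like this minimal sketch:

import subprocess

def ensure_hdfs_dir(path):
    # Run 'hadoop fs -mkdir' up front; a non-zero exit most likely means the
    # directory already exists, which is fine for our purposes.
    subprocess.call(['hadoop', 'fs', '-mkdir', path])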
It should be possible to develop and test Eggo locally without running on a cluster.
EGGO_HOME should instead simply be derived from the root of the package hierarchy that is installed when eggo is installed.
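For example (a sketch; the constant name is made up):

import os
import eggo

# Root of the installed eggo package, usable in place of an EGGO_HOME env var.
EGGO_PACKAGE_ROOT = os.path.dirname(os.path.abspath(eggo.__file__))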
For example, the 1000 Genomes VCF data is organized into individual files by chromosome. But this obviously causes the files to differ substantially in size. My ideal would be to merge them into a single file, or to split the genome into equal-sized bins based on locus and sort the data into them.
It should be possible to point Eggo to run on any existing cluster. This is likely just an extension of local testing mode in terms of mechanics.
I pulled down the 1000G genotypes for an analysis today, and lost an hour on the typical fun times that come from mixing a GRCh37 dataset with an hg19 dataset (specifically, the 1 vs. chr1 naming mismatch). Perhaps we should "tag" input datasets with the specific reference they're aligned against, where relevant? Even further, would it make sense to ensure that all datasets in eggo are aligned against a specific reference genome? I wouldn't want to be as restrictive as saying everything must be lifted on to hg19, but I think it might make sense to ensure that everything is aligned against a GRCh genome build instead of a UCSC build.
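For illustration only (the field name and values are hypothetical, not the current registry schema), a tagged registry entry might carry something like:

dataset_entry = {
    'name': '1000genomes_genotypes',  # illustrative name
    'sources': ['...'],               # source URLs elided
    'reference': 'GRCh37',            # vs. 'hg19'; makes the 1-vs-chr1 convention explicit
}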
Publish a document specifying the design and scope of the eggo project.
In supporting other target file systems (#42), things become more complicated: we have to handle multiple types of URIs, which basically means writing lots of switch statements. It'd be great to have a Hadoop-style file system interface for which we can supply multiple implementations that do the right thing. Then we'd have a global config that decides whether we're targeting S3, HDFS, or local.
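A rough sketch of what that interface could look like (class and method names are assumptions, not existing eggo code):

import os
import shutil
import subprocess
try:
    from urlparse import urlparse      # Python 2
except ImportError:
    from urllib.parse import urlparse  # Python 3

class FileSystem(object):
    """Minimal Hadoop-style filesystem interface."""
    def exists(self, path):
        raise NotImplementedError
    def mkdir(self, path):
        raise NotImplementedError
    def put(self, local_path, dest_path):
        raise NotImplementedError

class LocalFileSystem(FileSystem):
    def exists(self, path):
        return os.path.exists(path)
    def mkdir(self, path):
        if not os.path.isdir(path):
            os.makedirs(path)
    def put(self, local_path, dest_path):
        shutil.copy(local_path, dest_path)

class HadoopFileSystem(FileSystem):
    """Covers hdfs:, s3n:, etc. by shelling out to the hadoop CLI."""
    def exists(self, path):
        return subprocess.call(['hadoop', 'fs', '-test', '-e', path]) == 0
    def mkdir(self, path):
        subprocess.check_call(['hadoop', 'fs', '-mkdir', path])
    def put(self, local_path, dest_path):
        subprocess.check_call(['hadoop', 'fs', '-put', local_path, dest_path])

def filesystem_for(url):
    # The global config could override this scheme-based dispatch.
    scheme = urlparse(url).scheme
    if scheme in ('', 'file'):
        return LocalFileSystem()
    return HadoopFileSystem()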
They break because they attempt a git clone when the repo may already be cloned. Should just emit a warning, ideally.
There are plenty of instances of, e.g., eggo_config.get('worker_env', 'hadoop_home'), which is very verbose. Since these values should not change after loading config.py, it'd be great to provide shortcuts to the most common ones.
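For example (a sketch; the module path and constant names are assumptions):

import os
from eggo.config import eggo_config  # assumes the already-loaded config object

# Defined once at import time, since these values don't change after loading.
HADOOP_HOME = eggo_config.get('worker_env', 'hadoop_home')
HADOOP_BIN = os.path.join(HADOOP_HOME, 'bin')

Call sites could then just do: from eggo.config import HADOOP_HOME.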
There are some problems with spark-ec2 in specifying that input directories should be explored recursively for input files. Instead, we should just eliminate the dependence on this feature.
e.g., spark-ec2 scripts, director cluster, or local machine
DownloadDatasetParallelTask in dag.py handles the parallelization of the source downloads itself. It would be preferable to let Hadoop do this for a couple of reasons: 1) it removes the SSH dependency (this will allow Eggo to run on a non-fabric provisioned cluster), and 2) fault tolerance is handled by Hadoop rather than Eggo code.
The simplest way is probably to write a streaming script that uses NLineInputFormat so each mapper can download a single source file.
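A sketch of what that streaming mapper could look like, assuming each input line carries a source URL and a destination URL (the exact line format and tool choices are assumptions):

#!/usr/bin/env python
import subprocess
import sys
import tempfile

def main():
    # With NLineInputFormat, each mapper receives a small number of lines
    # (typically one); streaming prepends the byte-offset key, so take the
    # last two tab-separated fields as the source and destination URLs.
    for line in sys.stdin:
        fields = line.rstrip('\n').split('\t')
        source_url, dest_url = fields[-2], fields[-1]
        with tempfile.NamedTemporaryFile() as tmp:
            # Download the source file locally...
            subprocess.check_call(['curl', '-sSL', '-o', tmp.name, source_url])
            # ...then hand it to the hadoop CLI, which can write to hdfs:/s3n: URLs.
            subprocess.check_call(['hadoop', 'fs', '-put', tmp.name, dest_url])
        sys.stdout.write('%s\t%s\n' % (source_url, dest_url))

if __name__ == '__main__':
    main()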
This is called locuspart in the spec.
The Spark EC2 scripts have variable quality depending on instance types and options that are set. Also, they mainly just set up Spark, whereas we may want some of the other tools in the Hadoop stack (e.g., @tomwhite's partitioning tool that uses Crunch/Hadoop 2.x). Cloudera Director may make it more reliable to set up the whole stack, and also may more easily support using alternate clouds as well.
The deadlock is the same as the one described here: https://groups.google.com/forum/#!topic/presto-users/Kouge3Jx3Ks
This is occurring because the Hadoop/Spark installation on the EC2 cluster ships an old version of jets3t (0.6); the issue is fixed in later versions. One workaround is to set the following in core-site.xml:
<property>
<name>fs.s3n.impl.disable.cache</name>
<value>true</value>
</property>
In some places in the code, we use "conf" instead of "config". I propose we standardize on "config" whenever possible. An exception will be the .conf file extension for the Director HOCON files.
This would delete the raw and derived files.
Here is an example stack trace:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, ip-10-138-100-70.ec2.internal): java.lang.IllegalArgumentException: unknown VCF format, cannot create RecordReader: hdfs://ec2-54-157-172-152.compute-1.amazonaws.com:9000/tmp/tmp_eggo_2015-05-09T00-51-50_AODY.vcf/6a2df5dfbf260f9a0ac9c5270fe27c97.https___github.com_bigdatagenomics_eggo_raw_master_test_resources_chr22.small.vcf
at org.seqdoop.hadoop_bam.VCFInputFormat.createRecordReader(VCFInputFormat.java:178)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
I'm guessing we're running into the silenced IOException here: https://github.com/HadoopGenomics/Hadoop-BAM/blob/master/src/main/java/org/seqdoop/hadoop_bam/VCFInputFormat.java#L145
@tomwhite any idea what's going on here? If you want to reproduce, you should be able to just run my latest branch: laserson/EGGO-62
eggo provision
eggo deploy_config
eggo setup_master
eggo setup_slaves
eggo toast:config=$EGGO_HOME/test/registry/test-genotypes.json
There are too many hard-coded values. We need to set up a config file. It's probably most practical to use the Python ConfigParser lib and INI files.
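A minimal sketch of that approach, with illustrative section/option names:

# eggo.cfg might look something like (illustrative):
#
#   [worker_env]
#   hadoop_home = /root/ephemeral-hdfs
#
import os
try:
    from ConfigParser import SafeConfigParser as ConfigParser  # Python 2
except ImportError:
    from configparser import ConfigParser                      # Python 3

eggo_config = ConfigParser()
eggo_config.read(os.environ.get('EGGO_CONFIG', 'eggo.cfg'))  # config path is a guess
hadoop_home = eggo_config.get('worker_env', 'hadoop_home')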
See if Luigi can correctly order/handle DAGs that spit out interdependent nodes in the requires fn. This will make the tasks more reusable.
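A tiny self-contained example of the pattern in question (task names are illustrative): a requires() that returns tasks that also depend on each other.

import luigi

class Download(luigi.Task):
    def output(self):
        return luigi.LocalTarget('/tmp/eggo_example_raw.txt')
    def run(self):
        with self.output().open('w') as f:
            f.write('raw\n')

class Convert(luigi.Task):
    def requires(self):
        return Download()
    def output(self):
        return luigi.LocalTarget('/tmp/eggo_example_converted.txt')
    def run(self):
        with self.input().open() as inp:
            data = inp.read()
        with self.output().open('w') as out:
            out.write(data.upper())

class Toast(luigi.Task):
    def requires(self):
        # Interdependent nodes emitted from a single requires():
        # Convert itself depends on Download.
        return [Download(), Convert()]
    def output(self):
        return luigi.LocalTarget('/tmp/eggo_example_done.txt')
    def run(self):
        with self.output().open('w') as f:
            f.write('done\n')

if __name__ == '__main__':
    luigi.build([Toast()], local_scheduler=True)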
Once we have bona fide support for multiple execution contexts, we'll probably include a few pre-configured config files. We should make a Jenkins job that tests that all the files contain the same set of configurable options, to ensure they don't start diverging.
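A sketch of such a check (the file locations and extension are guesses):

import glob
import sys
try:
    from ConfigParser import SafeConfigParser as ConfigParser  # Python 2
except ImportError:
    from configparser import ConfigParser                      # Python 3

def options_in(path):
    # Return the set of (section, option) pairs defined in one config file.
    config = ConfigParser()
    config.read(path)
    return set((section, option)
               for section in config.sections()
               for option in config.options(section))

def main(paths):
    baseline = options_in(paths[0])
    for path in paths[1:]:
        if options_in(path) != baseline:
            sys.exit('%s diverges from %s' % (path, paths[0]))

if __name__ == '__main__':
    paths = sorted(glob.glob('conf/*.cfg'))  # location is a guess
    if paths:
        main(paths)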
Make sure there aren't any changes in the tools we depend on that will pull the rug out from under us.
I was going to pull down 50 low coverage WGS files, for testing with avocado. I was going to pull down:
HG00126
HG00173
HG00258
HG00369
HG00583
HG00641
HG00664
HG01070
HG01079
HG01346
HG01495
HG01629
NA10836
NA11829
NA12234
NA12375
NA17978
NA18489
NA18546
NA18549
NA18633
NA18702
NA18773
NA18789
NA18989
NA19070
NA19107
NA19138
NA19153
NA19173
NA19175
NA19180
NA19198
NA19240
NA19564
NA19568
NA19729
NA19913
NA20288
NA20364
NA20761
NA20772
NA20864
NA20886
NA21297
NA21344
NA21369
NA21371
NA21494
NA21636
These samples give us a population distribution of:
3 ASW
4 CEU
5 CHB
2 CHD
2 CHS
2 CLM
2 FIN
2 GBR
2 GIH
1 IBS
4 JPT
6 MKK
1 MXL
3 PUR
2 TSI
9 YRI