Comments (1)
Update:
We were able to resolve this issue using a combination of three strategies.
- We enabled autoscaling in our cluster.
- We changed the Spark partition defaults on our cluster to split data into 8,000 partitions. We had read that this number could be set to 3x the number of vCPUs in the cluster. Because we are using autoscaling, the number of vCPUs in use is not predetermined, so we started with 1x the number of vCPUs across the maximum number of secondary workers. Our maximum is set to 1,000 n1-highmem-8 machines with 8 vCPUs each, so 8 x 1,000 = 8,000. After speaking with Google, we verified that we could have used 3x the maximum number of vCPUs to increase parallelism. With a maximum of 10 workers and 1,000 secondary workers, all n1-highmem-8 nodes, we could have increased our partition count to 3 x 8 x 1,010 = 24,240.
- The Hail team had informed us: "You might try adding block_size=2048 to your King invocation. That will reduce the memory requirements on the workers to ~1/4 of the default which should give ample room for the analysis." Because of this, we set block_size=2048 in king. After looking through the king source code, we were able to determine that the default block size is 4096.
A sample cluster declaration using autoscaling, with default shuffle partitions and parallelism set to 8000, is below.
hailctl dataproc start cluster \
--vep GRCh38 \
--autoscaling-policy=MVP_autoscaling_policy \
--requester-pays-allow-annotation-db \
--packages gnomad \
--requester-pays-allow-buckets gnomad-public-requester-pays \
--secondary-worker-type=non-preemptible \
--master-machine-type=n1-highmem-8 \
--worker-machine-type=n1-highmem-8 \
--worker-boot-disk-size=1000 \
--preemptible-worker-boot-disk-size=1000 \
--properties=dataproc:dataproc.logging.stackdriver.enable=true,dataproc:dataproc.monitoring.stackdriver.enable=true,spark:spark.sql.shuffle.partitions=8000,spark:spark.default.parallelism=8000
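For anyone checking the numbers, here is a minimal Python sketch of the arithmetic behind the two tuning values discussed in this comment. The function names are hypothetical and used only for illustration. The memory ratio assumes that per-block memory scales with the square of block_size (a square block of block_size x block_size entries), which is consistent with the "~1/4 of the default" figure quoted from the Hail team but is not taken from the king source.

```python
# Cluster figures taken from the comment: n1-highmem-8 nodes have 8 vCPUs each.
VCPUS_PER_NODE = 8
MAX_PRIMARY_WORKERS = 10
MAX_SECONDARY_WORKERS = 1000


def partition_count(nodes: int, vcpus_per_node: int = VCPUS_PER_NODE,
                    multiplier: int = 1) -> int:
    """Partitions = multiplier x total vCPUs across the given nodes."""
    return multiplier * nodes * vcpus_per_node


def block_memory_ratio(block_size: int, default_block_size: int = 4096) -> float:
    """Relative per-block memory, ASSUMING memory scales with block_size ** 2."""
    return (block_size / default_block_size) ** 2


# 1x the vCPUs of the maximum number of secondary workers (the value we used).
conservative = partition_count(MAX_SECONDARY_WORKERS)

# 3x the vCPUs of all workers at maximum scale (the value we could have used).
aggressive = partition_count(MAX_PRIMARY_WORKERS + MAX_SECONDARY_WORKERS,
                             multiplier=3)

print(conservative)              # 8000
print(aggressive)                # 24240
print(block_memory_ratio(2048))  # 0.25
```

The first two results correspond to the values that would be passed to spark.sql.shuffle.partitions and spark.default.parallelism; the last shows why halving the block size from 4096 to 2048 gives roughly a quarter of the memory per block under the stated assumption.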