
Comments (1)

jsarro13 commented on June 2, 2024:

Update:

We were able to resolve this issue using a combination of three strategies.

  1. We enabled autoscaling in our cluster.

  2. We changed the Spark partition defaults on our cluster to split data into 8,000 partitions. We had read that this number could be set to 3x the number of vCPUs in the cluster. Because we are using autoscaling, the number of vCPUs in use is not predetermined, so we started with 1x the maximum number of secondary workers in our cluster. Our maximum is set to 1,000 n1-highmem-8 machines, each with 8 vCPUs, so 8 x 1,000 = 8,000. After speaking with Google, we verified that we could have used 3x the maximum number of vCPUs to increase parallelism: with a maximum of 10 workers and 1,000 secondary workers, all n1-highmem-8 nodes, we could have increased our partition count to 3 x (10 + 1,000) x 8 = 24,240 (see the sketch after the sample cluster declaration below).

  3. The Hail team had informed us that "You might try adding block_size=2048 to your King invocation. That will reduce the memory requirements on the workers to ~1/4 of the default which should give ample room for the analysis." Based on this, we changed the block size in king to block_size=2048 (see the example invocation after the cluster declaration below). After looking through the king source code, we determined that the default block size is 4096.

A sample cluster declaration using autoscaling, with default shuffle partitions and parallelism set to 8000, is below:
hailctl dataproc start cluster \
    --vep GRCh38 \
    --autoscaling-policy=MVP_autoscaling_policy \
    --requester-pays-allow-annotation-db \
    --packages gnomad \
    --requester-pays-allow-buckets gnomad-public-requester-pays \
    --secondary-worker-type=non-preemptible \
    --master-machine-type=n1-highmem-8 \
    --worker-machine-type=n1-highmem-8 \
    --worker-boot-disk-size=1000 \
    --preemptible-worker-boot-disk-size=1000 \
    --properties=dataproc:dataproc.logging.stackdriver.enable=true,dataproc:dataproc.monitoring.stackdriver.enable=true,spark:spark.sql.shuffle.partitions=8000,spark:spark.default.parallelism=8000
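
For reference, here is a minimal sketch of the partition arithmetic from strategy 2, plus the same Spark properties passed through hl.init instead of at cluster creation. The worker counts and machine shape are the ones from our cluster; the hl.init call is only an illustration, since we actually set these properties via the hailctl --properties flag above.

    import hail as hl

    # Cluster shape from our setup (adjust if yours differs):
    # up to 10 primary workers + 1,000 secondary workers, all n1-highmem-8 (8 vCPUs each).
    max_primary_workers = 10
    max_secondary_workers = 1000
    vcpus_per_node = 8

    # 1x the maximum secondary-worker vCPU count -- the value we actually used.
    partitions_1x = max_secondary_workers * vcpus_per_node                              # 8 * 1,000 = 8,000
    # 3x the maximum total vCPU count -- the upper bound Google suggested.
    partitions_3x = 3 * (max_primary_workers + max_secondary_workers) * vcpus_per_node  # 3 * 1,010 * 8 = 24,240

    # Sketch only: the same properties can be supplied when initializing Hail
    # rather than through the hailctl flags shown above.
    hl.init(spark_conf={
        'spark.sql.shuffle.partitions': '8000',
        'spark.default.parallelism': '8000',
    })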
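
And a hedged sketch of the king call with the reduced block size from strategy 3. The MatrixTable path and field name are placeholders rather than details from our pipeline; the only value taken from the discussion above is block_size=2048 (the king default is 4096).

    import hail as hl

    hl.init()

    # Placeholder dataset path -- substitute your own MatrixTable.
    mt = hl.read_matrix_table('gs://my-bucket/my_dataset.mt')

    # block_size=2048 reduces worker memory requirements to ~1/4 of the
    # default block_size=4096, per the Hail team's suggestion above.
    kinship = hl.king(mt.GT, block_size=2048)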

