
Comments (1)

jsarro13 commented on June 2, 2024:

Update:

We were able to resolve this issue using a combination of three strategies.

  1. We enabled autoscaling in our cluster.

  2. We changed the Spark partition defaults on our cluster to split data into 8,000 partitions. We had read that this number could be set to 3x the number of vCPUs in the cluster. Because we are using autoscaling, the number of vCPUs in use is not predetermined, so we started with 1x the maximum number of secondary workers in our cluster. Our maximum is set to 1,000 n1-highmem-8 machines, each with 8 vCPUs, so 8 x 1,000 = 8,000. After speaking with Google, we verified that we could have used 3x the maximum number of vCPUs to increase parallelism: with a maximum of 10 workers and 1,000 secondary workers, all n1-highmem-8 nodes, we could have increased our partition count to 3 x (10 + 1,000) x 8 = 24,240 (see the sketch after the sample cluster declaration below).

  3. The Hail team had informed us that "You might try adding block_size=2048 to your King invocation. That will reduce the memory requirements on the workers to ~1/4 of the default which should give ample room for the analysis." Based on this, we changed the block size in king to block_size=2048 (see the example invocation after the cluster declaration below). After looking through the king source code, we determined that the default block size is 4096.

A sample cluster declaration using autoscaling, with default shuffle partitions and parallelism set to 8000, is below:
hailctl dataproc start cluster \
    --vep GRCh38 \
    --autoscaling-policy=MVP_autoscaling_policy \
    --requester-pays-allow-annotation-db \
    --packages gnomad \
    --requester-pays-allow-buckets gnomad-public-requester-pays \
    --secondary-worker-type=non-preemptible \
    --master-machine-type=n1-highmem-8 \
    --worker-machine-type=n1-highmem-8 \
    --worker-boot-disk-size=1000 \
    --preemptible-worker-boot-disk-size=1000 \
    --properties=dataproc:dataproc.logging.stackdriver.enable=true,dataproc:dataproc.monitoring.stackdriver.enable=true,spark:spark.sql.shuffle.partitions=8000,spark:spark.default.parallelism=8000
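
For reference, here is a minimal sketch of the partition arithmetic from strategy 2, plus the same Spark properties passed through hl.init instead of at cluster creation. The worker counts and machine shape are the ones from our cluster; the hl.init call is only an illustration, since we actually set these properties via the hailctl --properties flag above.

    import hail as hl

    # Cluster shape from our setup (adjust if yours differs):
    # up to 10 primary workers + 1,000 secondary workers, all n1-highmem-8 (8 vCPUs each).
    max_primary_workers = 10
    max_secondary_workers = 1000
    vcpus_per_node = 8

    # 1x the maximum secondary-worker vCPU count -- the value we actually used.
    partitions_1x = max_secondary_workers * vcpus_per_node                              # 8 * 1,000 = 8,000
    # 3x the maximum total vCPU count -- the upper bound Google suggested.
    partitions_3x = 3 * (max_primary_workers + max_secondary_workers) * vcpus_per_node  # 3 * 1,010 * 8 = 24,240

    # Sketch only: the same properties can be supplied when initializing Hail
    # rather than through the hailctl flags shown above.
    hl.init(spark_conf={
        'spark.sql.shuffle.partitions': '8000',
        'spark.default.parallelism': '8000',
    })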
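
And a hedged sketch of the king call with the reduced block size from strategy 3. The MatrixTable path and field name are placeholders rather than details from our pipeline; the only value taken from the discussion above is block_size=2048 (the king default is 4096).

    import hail as hl

    hl.init()

    # Placeholder dataset path -- substitute your own MatrixTable.
    mt = hl.read_matrix_table('gs://my-bucket/my_dataset.mt')

    # block_size=2048 reduces worker memory requirements to ~1/4 of the
    # default block_size=4096, per the Hail team's suggestion above.
    kinship = hl.king(mt.GT, block_size=2048)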

