I've been doing some application profiling in the background while working on other th

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Is it possible straightforward to get p4est</code

Write a better LoadBalancer about ibamr HOT 11 OPEN

drwells commented on June 12, 2024

Write a better LoadBalancer

from ibamr.

Comments (11)

boyceg commented on June 12, 2024

Would it be easier/better to adopt a simpler grid generation algorithm --- e.g. a tree-structure with fixed-sized patches?

from ibamr.

drwells commented on June 12, 2024

That may be more efficacious, but I think just redoing the load balancer will be easier. If we use a different grid generation algorithm we'll have to redo a load balancer and also the gridding algorithm instead of just the load balancer. Furthermore, being able to chop boxes in more arbitrary ways might be advantageous for IB load balancing.

from ibamr.

drwells commented on June 12, 2024

This has come up a couple of times on slack in different places so I'll summarize what I know here.

I've been running the nozzle benchmark on my machine to try and address some IFED scalability issues - at this point we have more parallelization problems with SAMRAI then IFED, which is good since, fundamentally, we can fix a lot of these by just writing a better LoadBalancer.

A few highlights from callgrind:

some basic stuff which is very easy to patch (#1562, #1563, #1564) and marginally improves performance (each is a few percent of the overall solver) (the 'recompute overlapping boxes on demand' patch actually makes my benchmark run twice as fast)
SAMRAI's communication uses extra buffers on both ends - we can get another marginal improvement in communication cost by writing directly into PatchData from MPI's buffer.
we spend roughly half our time when doing a MatMult communicating ghost data: that's good motivation to get rid of some more intermediate copies in SAMRAI and reduce the total number of ghost cells by writing a better load balancer. Reducing copies is a marginal improvement compared to writing a really good load balancer. Maybe we can collaborate with the AMREX people on this? I haven't found a ton of load-balancing algorithms for patch-based FDM in the literature. I should look more.
Nodal IFED is now fast enough that we spend a lot more time communicating Cartesian-grid data than actually doing IFED :) The global FE scatter isn't that bad (and there isn't much more we can do here - see drwells/fiddle@d24a478) but doing a global redistribution of Cartesian grid data is about 10x as expensive as doing IFED stuff. Scattering the accumulated forces back is also quite slow (maybe 3x the cost of doing the actual spreading). This implies that we need to get rid of SecondaryHierarchy and have a single global partitioning of the Cartesian-grid data which load balances both the total number of cells/processor and number of IB points/processor.

Update 1: some relevant discussion on AMR in the Parthenon paper (AFAICT they use trees, not patches)

https://arxiv.org/pdf/2202.12309.pdf: "PARTHENON provides infrastructure for AMR with both cell- and face-centered data. The size of these arrays must be the same on all MeshBlocks, and moreover the overall domain must contain an integer number of MeshBlocks in each dimension. However, the number and size of individual MeshBlocks tiling the computational domain is arbitrary."

Update 2: I spent a little bit playing around with a possible new load-balancing algorithm for patches. It does a better job than the default LoadBalancer because it alternates between chopping and balancing. The algorithm is something like:

Calculate a weight for each patch (e.g., number of cells)
use a bin packing algorithm (e.g., Karmarkar-Karp) to balance patches between bins
pair the biggest and smallest bin quartiles together. The bigger bin donates something (either a patch or a subset of a patch) to the smaller bin to move it close to the optimal workload.
Repeat 6-7 times.

My rough Python implementation of this algorithm converges nicely: the difference between the biggest and smallest bins roughly halves at each step and we create a constant (n_procs / 4) number of new patches at most. It's just a simulacrum (I approximate bad chops by adding some randomness) but I get far better results than SAMRAI's load balancer (60 total patches instead of 300, maximum difference of 100 instead of 3000)

Update 3: We may have found a capable undergraduate student to work on this this summer.

Another thing we should examine, if we end up using integer programming, is penalizing using more ghost regions. Formulations of bin packing as an integer program typically involve minimizing some quantity r. We could penalize the existence of ghost regions by just adding up the number of ghost regions present with certain bin packing operations. It might not help a lot but it could keep adjacent patches preferentially together (and then we can just merge them in a postprocessing step).

from ibamr.

knepley commented on June 12, 2024

Can the load balance problem be phrased as just graph partitioning with a balance constraint (as PArMetis supports)? (edited by @drwells to remove extra email text)

from ibamr.

drwells commented on June 12, 2024

@knepley We can probably interpret this load-balancing problem as a weighted graph-balancing problem. A fundamental difference between this and a more normal situation (e.g., FEM) is that we can chop patches. For example, on a single processor SAMRAI generates

for exactly the same grid as the one in my first picture. Just partitioning this isn't going to work since there are only 16 patches: we need a way to chop them up in a nice way too.

from ibamr.

boyceg commented on June 12, 2024

In some sense, we don't have to use SAMR as long as the levels are properly nested. Would it make sense to use p4est?

from ibamr.

knepley commented on June 12, 2024

Also, by checking the refinement criteria you input, you can force p4est to deliver blocks if you want, and the 2:1 balance is also optional.

from ibamr.

boyceg commented on June 12, 2024

Is it ~~possible~~ straightforward to get p4est to generate what SAMR folks call "properly nested" grids?

from ibamr.

drwells commented on June 12, 2024

Would it make sense to use p4est?

Maybe! That's essentially what Parthenon is doing (and a lot of other libraries). Notably, AMREX does not (they still use SAMR).

Is it straightforward to get p4est to generate what SAMR folks call "properly nested" grids?

I believe this is the default - e.g., everything in deal.II is properly nested.

from ibamr.

boyceg commented on June 12, 2024

Seems pretty tempting to try to use p4est here...

from ibamr.

knepley commented on June 12, 2024

I have been using it for a while now. All the plasma stuff I am doing with Mark Adams uses p4est. You can convert seamlessly to a Plex (and pretty seamlessly back). We have some built-in stuff to define Vec objects that define the refinement/coarsening (see VecTagger and DMAdaptLabel/DMAdaptMetric).

from ibamr.

Write a better LoadBalancer about ibamr HOT 11 OPEN

Comments (11)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent