LouisJenkinsCS commented on June 2, 2024

Will do! I'll set it up later on; for now I'll keep the number of elements smaller!

I'm just so pumped that the original queue (which I'd given up on) actually scales, and it's all thanks to your suggestion, Engin! Thanks, man!

e-kayrakli commented on June 2, 2024

Quick idea that I want to mention before I forget: these benchmarks should have some parameter to control how intense the computation on the data is. That is, do tasks enqueue/dequeue something and move on quickly, or do they do a lot of computation before enqueueing/dequeueing the next item? It'd be nice to control this easily.

You can devise something artificial: do N transcendental operations before the next queue operation, etc.

LouisJenkinsCS commented on June 2, 2024

@e-kayrakli I didn't see your response, sorry about that.

That'd be an interesting approach: having no computation shows the raw performance of each operation, while adding some shows the queue's usage under a more practical (even if artificial) workload. For example, in the current model you'd see a lot more dequeue operations spinning if tasks took an item and discarded it without making use of it, compared to if it were actually utilized.

The four benchmarks above would definitely be able to support such a configuration...

e-kayrakli commented on June 2, 2024

Another thing would be to add some jitter to the amount of computation and the number of queue elements produced/consumed per task, to see how the queue's behavior changes under irregular applications (e.g., graph processing).

e-kayrakli commented on June 2, 2024

Another benchmark that I think we can use is n-queens:

https://en.wikipedia.org/wiki/Eight_queens_puzzle

I think having a distributed queue with balanced load can form a good backbone for solving this problem with good scalability. There must be an nqueens implementation in the master repo; even if not, it is not too difficult to implement.

A sketch of the algorithm could look like this:

Assume a type type Board = N*uint(8); where b[i] = 0 denotes that row i is empty, and b[i] = n with 1 <= n <= N denotes a queen placed in column n of row i.

You create a queue of Boards and initially enqueue N boards where b[0] = j for 1 <= j <= N. Then tasks will roughly do:

while true {
  var myBoard = boardQueue.dequeue();
  const firstEmptyRow = findFirstEmptyRow(myBoard);
  if firstEmptyRow == N+1 {  // no empty row: the board is complete
    return;
  }
  for j in 1..N {
    if canPlaceQueen(myBoard, firstEmptyRow, j) {
      var newBoard = myBoard;          // copy the board, then place the queen
      newBoard[firstEmptyRow] = j;
      boardQueue.enqueue(newBoard);
    }
  }
}

Assuming that the queue is strictly FIFO, in the end it will contain all the solutions, and the master thread can just print out the number of elements in the queue.

e-kayrakli commented on June 2, 2024

Also, for the raw benchmark:

I think the way it is is useful, but we should have a variation where enqueues and dequeues are interleaved. We can use a config const ratio: real to denote the enqueue/dequeue ratio: for every enqueue, a task dequeues ratio elements (or vice versa), or use ratio as an upper limit and pick a random number to add some variation to the load. A minimal sketch follows.
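
Something like the sketch below, assuming a queue object with the enqueue/dequeue interface used elsewhere in this thread; an integer ratio and deterministic interleaving keep the example simple, and a randomized variant could draw from a random stream instead:

config const nOps = 100_000;
config const ratio = 2;  // enqueues per dequeue (illustrative knob)

// Out of every (ratio + 1) operations, `ratio` are enqueues and 1 is a dequeue.
proc mixedWorkload(queue) {
  for i in 1..nOps {
    if i % (ratio + 1) != 0 then
      queue.enqueue(i);
    else
      queue.dequeue();
  }
}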

LouisJenkinsCS commented on June 2, 2024

@e-kayrakli

For the NQueens problem, does it have to be strictly FIFO? Would it be a huge deal if it wasn't? In the case of finding solutions, would it serve as a proper proof-of-correctness test to ensure the queue does indeed work? Of course, if elements got dropped randomly the non-determinism would cause an issue, and a segmentation fault cropping up would be an obvious problem, but would this be enough? Perhaps I can also test with the Memory module to ensure it does not leak memory all over the place by the end, and examine the number of communications between all locales as well.

One thing I'm curious about as well is where we would introduce locality. Do we have a 1:1 mapping of locales to boards at the beginning, so there is no race condition where, if all the boards are taken and new boards haven't been added yet, a locale sees the queue as empty and gives up early? What if we have fewer boards than locales; what happens then? That kind of got me for breadth-first search: if we enqueue the root node, do we wait until we add its neighbors? And if there are fewer neighbors than locales, how do the other locales know not to give up? Etc.

e-kayrakli commented on June 2, 2024

Not necessarily. We can also have different variations: strict FIFO as I described, or loose FIFO with atomic counters to accumulate the total number of solutions, etc.

Testing all the benchmarks with the memory tracking module seems like a good idea. Note that some leaks might be due to Chapel's implementation and not yours.

I think for these benchmarks you can enqueue the starting data before you start your timing. And you can do it however you wish: add everything on a single locale and let the work stealing handle the rest, or distribute it manually at the start.

I guess I was a bit too optimistic to think that the single-queue approach would work. Normally, you can solve this issue by using two data structures: one that everyone consumes from in the current step, and another that everyone produces into during this step and will consume from in the next. And do a barrier between the steps. I think this is the least-synchronized way of solving that issue; a sketch follows.
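
A minimal sketch of that double-buffering scheme, under some assumptions: a hypothetical Queue class with the enqueue/dequeue interface from this thread, and an expand() helper producing a board's children. The coforall join itself serves as the barrier between steps:

var current = new Queue(Board);  // everyone consumes from this one
var next    = new Queue(Board);  // everyone produces into this one

while !current.isEmpty() {
  coforall loc in Locales do on loc {
    while true {
      var (exists, board) = current.dequeue();
      if !exists then break;          // this step's work is drained
      for child in expand(board) do   // produce next step's boards
        next.enqueue(child);
    }
  }
  // All tasks have joined: swap the queues' roles for the next step.
  current <=> next;
}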

This brings up the issue that neither of these benchmarks requires strict FIFO-ness. But if we were to have different queues with different FIFO guarantees, I think we could test them all with these benchmarks regardless.

e-kayrakli commented on June 2, 2024

Another thing that I was thinking about (I vaguely remember discussing this, but I may be wrong):

We should have a naively wrapped queue based on the List module, and compare against that. The reasoning: such an implementation is the status quo if a user wants queue logic but doesn't want to invest too much in optimizing it. So we should be able to say our queue gets an X-fold performance improvement over the only alternative you have in the language.
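
For reference, such a baseline might look like the sketch below: a single list living on one locale, guarded by a sync var used as a lock. This is hedged on the current List module API (pushBack/getAndRemove); method names may differ across Chapel versions:

use List;

class NaiveQueue {
  type eltType;
  var lock$: sync bool = true;   // full = unlocked
  var data: list(eltType);

  proc enqueue(x: eltType) {
    const locked = lock$.readFE();  // acquire: leave the sync var empty
    data.pushBack(x);
    lock$.writeEF(true);            // release: fill it again
  }

  proc dequeue(): (bool, eltType) {
    const locked = lock$.readFE();
    var ret: (bool, eltType);
    if data.size > 0 then
      ret = (true, data.getAndRemove(0));
    lock$.writeEF(true);
    return ret;
  }
}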

e-kayrakli commented on June 2, 2024

Playing around more with the CCLock benchmark: tests should be able to run with minimal configuration. In this specific case, I had to build Chapel with GMP support. In the future, let's make tests as self-contained as possible.

LouisJenkinsCS commented on June 2, 2024

We should have a naively wrapped queue based on the List module, and compare against that.

Hm... in terms of the actual List module, would we have just one locale working with the list, or multiple? I'm assuming one, since otherwise it would require all of them migrating to one solitary list, especially if, as you said, you construct a new queue for each step.

Normally, you can solve this issue by using two data structures: one that everyone consumes from in the current step, and another that everyone produces into during this step and will consume from in the next.

Speaking of steps, why is it done in terms of steps? Why not just use the same queue and add to its tail? Wouldn't that allow more parallelism/concurrency, in that later steps could be processed in tandem with the current one? I'm assuming you create a new tuple based on the old one, modify it, and add it; this is entirely pure, so it shouldn't be an issue if it uses the same queue, right? The only issue is knowing when to stop, but I'm sure that could be handled in some way. If we have to set up and tear down our multi-locale queue, we end up incurring massive communication costs for construction and destruction; unless the number of boards per step were sufficiently massive, it would significantly impact benchmark results.

Playing around more with the CCLock benchmark

First, do you get results similar to mine? CCLock gets a lot more performance than sync does, but it varies per run. That's for very short critical sections too, where the overhead of the abstraction is massive compared to the logic being protected.

I had to build Chapel with GMP support.

Apologies, I had no idea that it required GMP; normally I just 'make all', which would explain why I never noticed. Although I'm not really sure how to tell which modules require external features and which do not.

LouisJenkinsCS commented on June 2, 2024

Speaking of CCLock, I was testing it on swan (Haswell), and it seems that sync must make some interesting NUMA-aware optimizations. Do you know if it does anything like that? It's interesting: on my machine CCLock exceeds sync (presumably qthreads' syncvar_t implementation, which has so many optimizations that it's a wonder in and of itself that anything beats it) by 1.5-2x for extremely small critical sections. However, on Haswell it actually loses, with sync winning by ~1.1x. Overall it's a very simple benchmark which doesn't even begin to show CCLock's true power, but it's still interesting.

e-kayrakli commented on June 2, 2024

would we have just one locale working with the list, or multiple?

Multiple, like a real-world distributed application. The queue data, though, will be stored on one locale.

Speaking of steps, why is it done in terms of steps? Why not just use the same queue and add to its tail? Wouldn't that allow more parallelism/concurrency, in that later steps could be processed in tandem with the current one? I'm assuming you create a new tuple based on the old one, modify it, and add it; this is entirely pure, so it shouldn't be an issue if it uses the same queue, right? The only issue is knowing when to stop, but I'm sure that could be handled in some way. If we have to set up and tear down our multi-locale queue, we end up incurring massive communication costs for construction and destruction; unless the number of boards per step were sufficiently massive, it would significantly impact benchmark results.

No tear-down or build-up necessary; you'll just have two queues and swap them after each step. Of course you can somehow make the tasks stop, but we need to keep the overhead of that operation minimal so that we can see the queue performance without too much noise. Besides, I don't want you to spend too much time figuring that out.

First, do you get results similar to mine? CCLock gets a lot more performance than sync does, but it varies per run. That's for very short critical sections too, where the overhead of the abstraction is massive compared to the logic being protected.

I haven't done any --fast testing. I was just trying to understand the issues we have discussed better.

Speaking of CCLock, I was testing it on swan (Haswell), and it seems that sync must make some interesting NUMA-aware optimizations. Do you know if it does anything like that? It's interesting: on my machine CCLock exceeds sync (presumably qthreads' syncvar_t implementation, which has so many optimizations that it's a wonder in and of itself that anything beats it) by 1.5-2x for extremely small critical sections. However, on Haswell it actually loses, with sync winning by ~1.1x. Overall it's a very simple benchmark which doesn't even begin to show CCLock's true power, but it's still interesting.

I don't really understand what is beating what. But that reminds me: I was wondering if we should leave the choice of which synchronization to use to the end user. What do you think?

LouisJenkinsCS commented on June 2, 2024

I don't really understand what is beating what. But that reminds me: I was wondering if we should leave the choice of which synchronization to use to the end user. What do you think?

This does sound reasonable. Considering that performance seems to differ based on architecture, why not? CC-Synch and sync vars are the two types I can think of right now. I would say "and a lock-free one", but then I'd have to deal with hazard pointers and/or MCAS, and I never realized how difficult that would actually be. Can't believe I thought I'd be able to 'finish' everything in the first month, bleh!

Quick idea that I want to mention before I forget: these benchmarks should have some parameter to control how intense the computation on the data is. That is, do tasks enqueue/dequeue something and move on quickly, or do they do a lot of computation before enqueueing/dequeueing the next item? It'd be nice to control this easily.

Revisiting this idea... it definitely seems the most viable thing to try out, aside from implementing NQueens (I'm still not satisfied with either enqueue or dequeue as-is). There's still some communication going on that I need to pinpoint and eliminate in both queues (the strictly FIFO one that was recently resurrected, and the MPMC work-stealing one). The question here is: how does this reflect in the benchmark? The same applies to...

Another thing would be to add some jitter to the amount of computation and the number of queue elements produced/consumed per task, to see how the queue's behavior changes under irregular applications (e.g., graph processing).

Again, how would this affect the results? If all queues had jitter at the same locations, then perhaps it would actually work out, but I'm wondering how something like this (along with computation intensity) could be implemented.

LouisJenkinsCS commented on June 2, 2024

[benchmark plot]

Incoming benchmark: I finally got enqueue to work and scale. Of course it's to be expected, but it took a lot more work than expected.

e-kayrakli commented on June 2, 2024

Again, how would this affect the results?

I guess my point with jitter is to have the same amount of data go in and out of the queue, yet randomize the times when enqueues/dequeues occur in order to reduce contention. The higher the jitter, the less the contention.

But maybe I am too optimistic to think that we can really control that...

how something like this (along with computation intensity) could be implemented.

For intensity: you can have a config const compPerDequeue = 10, which would mean that for each element you dequeue, you perform 10 computations, where a computation could be calculating the sin of the value or something.

For jitter: you can have a config const compJitter = 5, which would mean that instead of exactly 10 operations per dequeue you do N operations such that 5 <= N <= 15, using a random number generator. A rough sketch follows.
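
Roughly like this, hedged on the current Random module API (randomStream and its bounded next(min, max)); sin is just a stand-in transcendental:

use Random;

config const compPerDequeue = 10;  // computations per dequeued element
config const compJitter = 5;       // +/- range around compPerDequeue

// After each dequeue, burn N transcendental operations, with N drawn
// uniformly from [compPerDequeue - compJitter, compPerDequeue + compJitter].
proc simulateWork(x: real, ref rng: randomStream(int)): real {
  const n = rng.next(compPerDequeue - compJitter,
                     compPerDequeue + compJitter);
  var acc = x;
  for 1..n do acc = sin(acc) + 1.0;
  return acc;
}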

LouisJenkinsCS commented on June 2, 2024

Your jitter-and-computation idea to reduce contention is... genius! I think the strictly FIFO queue actually is scaling! Although I can't be sure: to run the benchmark I need to repeatedly type "Yes" interactively, and when I try echo "y" | ./main.exe it doesn't show me the output and feels like it's not doing anything. The only way to tell is by tomorrow morning.

From what I have seen, the baseline was about 500K ops/sec at 1 locale, but ~700K ops/sec at 32 locales. Even for 1 locale, the contention on one variable is too large to ignore. However, this is the same thing I observed before, which was dispelled when I had all tasks migrate to the owning locale and use CCLock for synchronization on the counters... sigh.

e-kayrakli commented on June 2, 2024

I need to repeatedly type "Yes" interactively

I am not sure what this means, but if I understand correctly, it may be solved by not using the Chapel launchers and instead writing your own PBS script.

LouisJenkinsCS commented on June 2, 2024

Unfortunately my connection got reset while it was running overnight, so I had to rerun it. It seems that it does work; it just took a really long time. Some good news, however: the FIFO queue scales! I'm kind of curious what would happen if I increased the number of computations and the jitter, but the bottleneck definitely is contention. Might have to add a FIFO test in #10.

[benchmark plot]

My original idea actually worked!

The changes I made reduce extra communication outside of the head/tail of the queue (by passing around pointers to each local queue rather than taking an additional hit from dereferencing the instance field) and make use of network atomics! BooYah!

e-kayrakli commented on June 2, 2024

Unfortunately my connection got reset while it was running overnight, so I had to rerun it.

Google "GNU Screen" or "tmux" (I cannot guarantee the latter is installed on the system you are using).

LouisJenkinsCS commented on June 2, 2024

I've made a very curious observation... I see the double-edged sword of the CC-Synch algorithm. While it reduces contention and provides great performance under high contention, it performs abysmally under the irregularities introduced by jitter and per-operation computation! Because the short computations keep requests from piling up, the combiner thread can rarely perform multiple operations at once, and we lose any and all scalability. Under high contention multiple operations get done very rapidly, which also explains the extra performance of the FIFO queue (with a lot of remote tasks being spawned per locale, the CCQueue is in its true element).

I'll have to redouble my enthusiasm for letting the user specify which lock to use.

With some jitter and computation:

[benchmark plot]

Without jitter and computation:

[benchmark plot]

e-kayrakli commented on June 2, 2024

How much of this performance difference is due to the actual computational load that you are introducing?

Your comparison axis (the same synchronization with vs. without computation) may be unfair. A fairer comparison would be sync var vs. CCLock under computational load/jitter.

LouisJenkinsCS commented on June 2, 2024

How much of this performance difference is due to the actual computational load that you are introducing?

Everything: literally the same parameters except jitter and computation.

Your comparison axis (the same synchronization with vs. without computation) may be unfair. A fairer comparison would be sync var vs. CCLock under computational load/jitter.

I'll agree to this. I'll try to get to it either later today or tomorrow; the benchmark takes a while and I'll need to revise some logic here and there. I'm currently running the FIFO queue without computations and jitter to test.

Also, wouldn't a comparison with and without a workload be fair? I thought it helped represent how the queue works under an actual workload.

LouisJenkinsCS commented on June 2, 2024

Pardon me if I don't have it plotted, but it's exactly what I saw before! The same thing that disheartened me when I decided it would never work! Below is the FIFO queue (note: globally FIFO) without jitter and computation:

NumLocales    Op/Sec
1             1.84709e+06
2             1.95306e+05
4             2.24346e+05
8             2.72278e+05
16            3.11618e+05
32            2.8594e+05
44            3.09985e+05

e-kayrakli commented on June 2, 2024

Also, wouldn't a comparison with and without a workload be fair? I thought it helped represent how the queue works under an actual workload.

What I was trying to get at is: since you are introducing new computation and measuring its time too, saying that CCQueue performance dropped from 1M to 100K ops/sec under computational load is not fair to CCQueue, because the overall time you are measuring includes computation on top of the synchronization.

LouisJenkinsCS commented on June 2, 2024

It's interesting. I thought it would never scale due to communication... but communication seems to be involved in literally everything (i.e., lack of privatization), so I suppose it makes some sense why it would scale with network atomics! Note this is without migration! YAHOO!

Also you're right, more testing is needed before I make assumptions on CCQueue.

LouisJenkinsCS commented on June 2, 2024

Hm... trying to rationalize this behavior for both queues... I might need to sit on it for a while. The most interesting property of the globally strict FIFO queue is that dequeues will not affect the performance of enqueues. Dequeue uses the global tail (which needs to be in its own cache line), and locally it only contends for the tailLock. Very interesting... For the MPMC queue it's the same way (except for work stealing, which needs to add in bulk).

LouisJenkinsCS commented on June 2, 2024

Okay, so after adding dequeue to the benchmark, I see that the FIFO queue won't scale, not if I migrate to another locale, nor if I have to use multiple network atomics... After seeing enqueue scale, I refuse to just abandon it, I just can't! I get results like these:

Enqueue:  5.12851e+05 , Dequeue:  7.26564e+05
Launching with # 2
Enqueue:  58193.8 , Dequeue:  14894.5
Launching with # 4
Enqueue:  1.08921e+05 , Dequeue:  30462.1
Launching with # 8
Enqueue:  2.44922e+05 , Dequeue:  26018.3

The core of the problem is that dequeue must not let head exceed tail, and if another task or locale has advanced head, it cannot advance it again without re-checking, so we end up with 3 network calls (one with extremely high contention) per iteration of the loop, of which only one task can proceed. There has to be a way; do you have any ideas, Engin? I was thinking of doing a fetchAdd on head and rewinding if it turned out to have exceeded the tail, but this leads to race conditions that violate everything the queue has going for it... I'm thinking of going back to holding both head and tail within one word, which will increase overall contention at the global scope. This means enqueue performance will suffer (it's no longer a simple fetchAdd but a read followed by a compare-and-swap, which will introduce a lot of contention and may mean it no longer scales...)

I'm a bit lost here, because again, FIFO scalability... unheard of! I can't let something this novel and interesting just slip by!

LouisJenkinsCS commented on June 2, 2024

I'm thinking of performing some compiler __primitive black magic here: encode head in the lower 32 bits and tail in the upper 32 bits, then use __primitive("cast", ...) to cast to an atomic uint(32) so that enqueues only require a fetchAdd... The issue is that I don't know whether the compiler allows atomics to overlay the same storage; for all I know it could cause undefined behavior on some systems.

Edit:

I need to know whether an atomic compareAndSwap and fetchAdd contend for ownership of the same cache line: if I fetchAdd the lower 32 bits, then any contending compareAndSwap on the entire 64 bits must fail, and vice versa. If this holds, perhaps it can scale!
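
For what it's worth, a portable variant of this idea avoids the __primitive cast entirely by packing both counters into one atomic uint(64): enqueue advances the tail half with a single fetchAdd, while dequeue does a compareAndSwap on the whole word, so a concurrent fetchAdd forces the CAS to fail, which is exactly the interaction in question. A hedged sketch, ignoring 32-bit wraparound:

var state: atomic uint(64);  // head in the low 32 bits, tail in the high 32

inline proc headOf(w: uint(64)): uint(64) { return w & 0xFFFFFFFF; }
inline proc tailOf(w: uint(64)): uint(64) { return w >> 32; }

// Enqueue side: claim the next tail slot with one network atomic.
proc claimTailSlot(): uint(64) {
  const old = state.fetchAdd(1:uint(64) << 32);
  return tailOf(old);
}

// Dequeue side: only advance head while it is strictly behind tail.
proc claimHeadSlot(): (bool, uint(64)) {
  var old = state.read();
  while headOf(old) < tailOf(old) {
    if state.compareAndSwap(old, old + 1) then  // bump head (low half)
      return (true, headOf(old));
    old = state.read();
  }
  return (false, 0:uint(64));  // queue observed empty
}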

LouisJenkinsCS commented on June 2, 2024

I'd hate to give up, but for now I'll have to (sigh) wait until after evaluations to test it. The performance is so bad that it takes literally 30 minutes for what should take about 5-10 seconds; it's not even possible to show in the benchmark... I've tried everything (except what I mentioned above, which is too much effort for now).

LouisJenkinsCS commented on June 2, 2024

Okay, I got it! Finally, dequeue doesn't sink to abysmal levels. It will not scale, because it has to migrate to the original locale and I introduced a protecting sync var, but there is a lot less contention! So enqueue scales and dequeue does not, but it is strictly FIFO! I'll consider it a successful proof of concept and move on!

e-kayrakli commented on June 2, 2024

Not that I am able to follow everything here, but:

I need to know whether an atomic compareAndSwap and fetchAdd contend for ownership of the same cache line: if I fetchAdd the lower 32 bits, then any contending compareAndSwap on the entire 64 bits must fail, and vice versa. If this holds, perhaps it can scale!

This sounds extremely dangerous to me!

But good to know that you are happy with the performance. Such detailed optimization ideas have to wait until the very end of the project, I think.

LouisJenkinsCS commented on June 2, 2024

Fair enough. Speaking of which, I think I should make another issue with potential ideas so I can revisit them later.

e-kayrakli commented on June 2, 2024

Sounds good

LouisJenkinsCS commented on June 2, 2024

@e-kayrakli

I think I found a decent way to prevent each locale from stopping before it has fully finished processing, and to know when to stop. It also eliminates communication for all but the locale that finishes, and it is concurrency-safe, so if multiple solutions are found it stops early.

After looking up the problem, I see that we need to stop at the first solution, as the number of steps required grows by orders of magnitude extremely quickly! So without further ado, here's a simple solution based on yours:

use CyclicDist;

var foundDom = {1..N};
var foundDmap = foundDom dmapped Cyclic(startIdx=foundDom.low);
var foundDist: [foundDmap] bool;

coforall loc in Locales {
  on loc {
    // Spin on our locally-owned flag: no communication cost to read it,
    // and it's not a class instance field, so no cost on access.
    // (ourIdx is this locale's index into foundDist.)
    while !foundDist[ourIdx] {
      var (exists, myBoard) = boardQueue.dequeue();
      // Spin: we haven't found a solution yet...
      if !exists {
        continue;
      }

      const firstEmptyRow = findFirstEmptyRow(myBoard);
      if firstEmptyRow == N+1 {  // no empty row: solution found
        // We are done: alert everyone that it is done (including ourselves...)
        coforall idx in foundDist.domain {
          foundDist[idx] = true;
        }
        break;
      }
      forall j in 1..N {
        if canPlaceQueen(myBoard, firstEmptyRow, j) {
          var newBoard = myBoard;
          newBoard[firstEmptyRow] = j;
          boardQueue.enqueue(newBoard);
        }
      }
    }
  }
}

So really, looking at it, this ensures parallel computation not only across tasks (checking all potential columns) but across locales (they all pull a board from the global queue and produce multiple boards). Is this logic correct? Unfortunately, this can't be tested with the MPMC queue until work stealing is implemented, but that can be done this weekend! Can't wait!!! I predict that with work stealing, at first you'll see a ton of steals while the queue is empty, but it will balance out very quickly. Since one board produces multiple boards' worth of work, and stealing likely takes some percentage from all locales, every locale will be fed very quickly; if one ever runs short, the multiple boards added on others can be stolen. I predict very good results!

e-kayrakli commented on June 2, 2024

After looking up the problem, I see that we need to stop at the first solution

Do we, really? I thought you stop when you have found the total number of possible solutions to the problem...

Assuming that we are only interested in one solution:

  • What guarantee is there that this spinning logic works in the presence of hardware/software caches? Granted, dequeue might have some fences keeping memory consistent, but there is nothing at this level of the code that guarantees it. Am I wrong?

  • You can think of the same problem within a locale as well. Consider this slightly different code (which I think is a much better way of creating parallelism):

coforall l in Locales do on l {
  coforall taskId in 0..#here.maxTaskPar {

  }
}

Multiple tasks on the same locale would be spinning on the same non-atomic flag. Can you guarantee, by looking at this code, that they will read it from main memory and not from their caches? Similarly, for the task that writes true to that flag: can you guarantee that the write reaches main memory rather than staying in its cache?

  • Assuming no change in the rest of the code:
            coforall idx in foundDist.domain {
              foundDist[idx] = true;
            }

A couple of things here in terms of Chapel idioms and parallelism in general. In such simple loops you should do

            coforall found in foundDist {
              found = true;
            }

to avoid indexed array accesses. This is true for any kind of loop, but especially important for performance if you have multi-dimensional arrays.

Assuming that foundDist is a sufficiently large array, you are better off doing:

foundDist = true;

and counting on Chapel's operator promotion. These kinds of coforalls can blow up really badly if the array is large.

Now, you probably did it this way because you knew that foundDist is not going to be that big, since the N-queens problem gets incredibly large when you hit double digits for N. But then, why are you parallelizing a loop that will have 15-16 iterations at most? Such small loops are better off sequential, for two reasons: (1) Chapel still doesn't keep tasks alive or bulk-create tasks, so creating 15 tasks here is huge overhead compared to setting 15 bools; (2) even if Chapel did a better job there, you'd still pay for task management/loop scheduling instead of just setting 15 bools. (You can replace my rough estimate of 15-16 with 50-100 and my argument still holds: setting that many booleans should cost nothing at all compared to creating that many tasks.)

There is a slightly relevant comment from Elliot in one of the issues I have in the main Chapel repo with some pointers if you are interested: chapel-lang/chapel#6391 (comment)

EDIT: If you are unfamiliar with the Stream benchmark, when I was saying "small vector sizes" in that issue, I am talking about (more or less) adding two vectors of size 2**16, which I define as "small".

e-kayrakli commented on June 2, 2024

On a second read, I understand that your motivation was probably to parallelize the communication. But I am still very doubtful that that coforall is better than a for.

LouisJenkinsCS commented on June 2, 2024

Do we, really? I thought you stop when you have found the total number of possible solutions to the problem...

Oh? My fault; I've only ever done this once, and that was a long time ago. I suppose it makes sense that we would need to find all solutions. In that case, I'm assuming the number of potential solutions is known, and we can use integers instead of booleans.

What guarantee is there that this spinning logic works in the presence of hardware/software caches? Granted, dequeue might have some fences keeping memory consistent, but there is nothing at this level of the code that guarantees it. Am I wrong?

Truth be told, when I wrote the solution I treated it as pseudocode, but you're right: we should use atomics here (especially since an integer can be used). In this case, however, I'd think we can keep the ordering relaxed, because the loop iteration just needs to ensure it doesn't read a stale cached value, correct?

A couple of things here in terms of Chapel idioms and parallelism in general.

Now that I've changed it to use an atomic int, would this potentially work?

for found in foundDist do found.add(1);

This should eliminate the need to spawn more tasks than necessary, right? I assume that, due to ref intents, the element is written by reference.

This would change the loop iteration to be...

while foundDist[ourIdx].read() < nSolutions { ... }

This should be fair, correct? Presumably no locale can ever get the same solution twice (???). If they can, then this definitely gets complicated... I wonder if it's possible to determine the number of duplicates... Yeah, I didn't realize it was this complicated, my bad. Perhaps I can have the task that found a solution add it to a set, and then, if the solution is unique, increment the count in all of foundDist? That way communication still happens only sparsely, when a solution is found.

There is a slightly relevant comment from Elliot in one of the issues I have in the main Chapel repo with some pointers if you are interested: chapel-lang/chapel#6391 (comment)

Oh wow... definitely saving this for later; it's always good to know more about managing tasks efficiently, especially when creating distributed parallel data structures. I suppose I should look through more issues tagged with labels like "Performance".

LouisJenkinsCS commented on June 2, 2024

I also see why you did

coforall l in Locales do on l {
  coforall taskId in 0..#here.maxTaskPar {

  }
}

I didn't realize it was modifying the current board; I thought it was creating a new one each time, doing one square at a time. I see now. Yeah, this would ensure all processors per node are being used.

Never mind that; I forgot that it enqueues a new copy of the board on each iteration:

if canPlaceQueen(myBoard, firstEmptyRow, j) {
  var newBoard = myBoard;
  newBoard[firstEmptyRow] = j;
  boardQueue.enqueue(newBoard);
}

LouisJenkinsCS commented on June 2, 2024

Okay, I managed to implement it. It's inefficient, and I won't lie, I kind of cheated, but hey, why not! It finds the first solution to the problem, and I verified it works for that much at least; I just need to modify it a bit (tomorrow) to find all solutions and I'll be good to go for the benchmark.

Added bonus: it is very inefficient, so it should do well at simulating computationally expensive workloads!

LouisJenkinsCS commented on June 2, 2024

Boom! Done! I didn't realize before what was meant by "total vs. unique" solutions, but now I see that they aren't exactly duplicate boards, just isomorphic ones. So I find all solutions, and BAM, done! Time to test it on multiple data structures.

LouisJenkinsCS commented on June 2, 2024

@e-kayrakli

OH MY GOD! ENGIN! I THINK I FOUND IT! I think I can make FIFO scale... After the (ahem) race condition was resolved, I found that I can reduce the communication bottleneck on the head of the queue to the point that throughput does increase! Mind you, the performance itself is abysmal, but the point is that it scales...

[benchmark plot]

I know, the performance is bad... but the point is it scales, even with massive communication costs. Currently the algorithm goes like this:

// Reduces communication cost penalty to the global synchronization.
var _tail = tail.read();
// Assumption: sync variables let you do a read + acquire in one communication
var _head = head$; // Full -> Empty
// Proceed based on what we read for tail at the time we queried it.
if _head < _tail then head$ = _head + 1; // Empty -> Full
// Writeback if empty...
else head$ = _head; // Empty -> Full

As can be seen, even with communication this actually scales. That's with each concurrent task on a single locale making these communication calls and contending with the others. But what if they... worked together?

This borrows from the local CC-Synch algorithm: what if the combiner thread acquires the global lock and then serves N active requests (each taking the next index)? This way, if we have M requests on a single locale, communication is reduced to M/N, significantly speeding up performance. Note: it is still FIFO order, just a more biased one (biased towards the locale that owns the global lock). A rough sketch of the combiner's critical section follows.
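
Roughly like this, reusing the head$/tail counters from the snippet above (assumed to hold int indices) and eliding the CC-Synch bookkeeping that gathers and answers local requests; names here are illustrative:

config const maxBatch = 16;

// One remote read plus one remote acquire/release pair serves up to
// maxBatch local dequeue requests, instead of one pair per request.
proc combinerClaim(nRequests: int): (int, int) {
  const want = min(nRequests, maxBatch);
  const _tail = tail.read();     // one remote read
  const _head = head$;           // Full -> Empty: one remote acquire
  const avail = max(0, min(want, _tail - _head));
  head$ = _head + avail;         // Empty -> Full: release, claiming the batch
  // The combiner now owns indices _head..#avail and can satisfy that many
  // waiting local tasks without further communication.
  return (avail, _head);
}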

The rationale is that dequeue scales even with an excessively large amount of communication. The scalability shown is a pittance of what it could be, and I believe it is worth investigating... The only issue is priority: do you think I should explore this now or defer it until later?

Opinion Appreciated.

Edit 2:

Also, to demonstrate why this would yield better results, look at it this way: it peaks at 32 locales with 24 processors per node, potentially 768 tasks contending for the same global lock, and it still scales! Imagine reducing that contention down to 32, with the added benefit of grabbing the counters locally. Looking at 'FIFO - Enqueue', we know that the one-communication-per-operation model can and will scale.

Edit 3:

Perhaps, if this did scale, I could change Enqueue to use a similar algorithm!

Edit 4:

I realized as well that this would allow it to scale even under large amounts of contention! CC-Synch is proven to beat normal mutual exclusion under high contention (although I'll test thoroughly later), and with communication costs being what they are, it's likely there will be enough concurrent tasks on a single node to keep it busy enough to shine! Meaning that if I did this for enqueue, it wouldn't need nComputations > 0 && nJitter > 0 to shine; it would shine in all cases!

Edit 5:

I left enqueue using a single network atomic operation, but dequeue now uses the CC-Synch algorithm. The performance does increase under high contention...

Edit 6:

Never mind...

[benchmark plot]

It also gets worse when I add computation and jitter... it's unfortunate. I've tried almost everything.

Edit 7:

Okay, I'm done for now... There's nothing else I can do; I've tried everything for this... This is the best I can do in terms of scalability, so consider it checked off for now (ironically, just like the shape the graph makes).

[benchmark plot]

I manually inserted a horizontal line indicating the baseline. It shows that the queue does exceed the 1-locale baseline, but barely, and for the majority of configurations (fewer than 32 locales using it in parallel) it is unusable.

Edit 8:

I'm losing track of these edits... too many to keep up with. I'm thinking of attempting the memory pooling you mentioned, especially for the CC-Synch algorithm (the paper uses it as well). Not sure where it sits in terms of priority...

LouisJenkinsCS commented on June 2, 2024

@e-kayrakli

Another idea! I was thinking that maybe my microbenchmark (which tests raw operations per second) should be more dynamic. Instead of a static workload, it would keep the runtime of each run down (currently it varies widely, from 5 minutes to an hour, cutting into development time). I remember when I worked in Go on my concurrent map, I used their benchmark framework and it was smooth as butter: it would rerun the benchmark over and over, increasing or decreasing the number of operations depending on how long they took to complete, and estimate based on that.

The code I could transcribe to Chapel can be seen here. What do you think? Also, perhaps if it's good enough, it could be another official addition to Chapel we can plan! A rough sketch of the idea follows.
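
Something along these lines, as a hedged sketch using the Time module's stopwatch and taking any zero-argument benchmark body as op:

use Time;

config const targetSeconds = 1.0;

// Go-style adaptive benchmarking: grow the operation count until one run
// takes at least targetSeconds, then report operations per second.
proc benchmark(op): real {
  var n = 1;
  var t: stopwatch;
  while true {
    t.clear();
    t.start();
    for 1..n do op();
    t.stop();
    const elapsed = t.elapsed();
    if elapsed >= targetSeconds then
      return n / elapsed;
    // Scale n toward the target, capping growth at 100x per round.
    const guess = (n * targetSeconds / max(elapsed, 1e-9)): int;
    n = min(n * 100, max(n + 1, guess));
  }
}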

e-kayrakli commented on June 2, 2024

I'm losing track of these edits

Yes. Please repeat anything of importance.

Instead of a static workload, it would keep the runtime of each run down

I don't understand what is new here. Can't you do that with your script, i.e., submit multiple PBS jobs with varying sizes (or whatever the config variables are)?

LouisJenkinsCS commented on June 2, 2024

Summary: I spent the majority of my time trying to refine the FIFO queue, and cannot get it to scale in the least, even under varying conditions. The above shows the results of multiple different experiments...

I don't understand what is new here. Can't you do that with your script, i.e., submit multiple PBS jobs with varying sizes (or whatever the config variables are)?

The difference is that it decides, at runtime, how much should be attempted. It saves the overhead of having to manually decide the number of elements. In essence, what the benchmark framework did was start small, then scale its way up (or down) depending on the estimated time, until the right level was reached. Can this be done statically via the script? Yes. Would the dynamic approach save time? Undoubtedly. It's basically a way to automate without relying on a script, and it's reusable for any other benchmarks/problems.

e-kayrakli commented on June 2, 2024

The difference is that it decides, at runtime, how much should be attempted. It saves the overhead of having to manually decide the number of elements. In essence, what the benchmark framework did was start small, then scale its way up (or down) depending on the estimated time, until the right level was reached. Can this be done statically via the script? Yes. Would the dynamic approach save time? Undoubtedly. It's basically a way to automate without relying on a script, and it's reusable for any other benchmarks/problems.

I think this would be a nice addition. I don't think this is how you should spend your time.
