
louisjenkinscs / distributed-data-structures

15 stars, 5 watchers, 2 forks, 56.09 MB

[GSoC] Distributed Data Structures - Collections Framework for Chapel language

Home Page: https://summerofcode.withgoogle.com/archive/2017/projects/6530769430249472/

License: BSD 3-Clause "New" or "Revised" License

Languages: Chapel 98.87%, Makefile 1.13%
Topics: chapel-language, partitioned-global-address-space, distributed-computing, data-structures, distributed-algorithms, google-summer-of-code

distributed-data-structures's People

Contributors

louisjenkinscs


distributed-data-structures's Issues

Interface design

An issue to keep track of API discussion. (A rough sketch combining these signatures appears after the lists below.)

Standard must-haves

  • proc Queue(type eltType)

    • an optional size argument?
    • some param/const flags?
    • an optional "targetLocales"?
  • proc enqueue(elt: eltType)

    • possibly a vararg implementation
  • proc dequeue(): (bool, eltType)

  • proc clear()

  • proc length

  • proc size (equivalent to length)

  • proc isEmpty

Some brainstorming:

Utilities:

  • iter these()

    • Most likely all four iterators: sequential, leader, follower, standalone
  • proc copy() -- or clone, something for deep-copy

Handling asynchrony:

  • proc sync()
  • proc asyncEnqueue(elt: eltType)
  • proc asyncDequeue(): (bool, eltType)
    • Maybe you could allow both synchronous and asynchronous operations on
      the queue? Then you'd need something like these. If the queue
      doesn't allow asynchrony due to some configuration parameters
      (likely params), you can emit a compiler warning in these functions
      and fall back to regular enqueue/dequeue.

Bulk operations:

  • proc enqueueBulk(elt: [] eltType)

  • proc dequeueBulk(numElems): (bool, [] eltType)

  • proc concat(otherQueue)

    • Essentially the same thing as enqueueBulk; you can pick one or have
      both signatures available
  • proc +=(q: Queue, elts)

  • proc +=(q: Queue, otherQueue)

    • Such operators are supported in sparse and associative domains

Bounded queue:

  • proc capacity
  • proc requestCapacity(newCap)
  • proc isFull
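
A rough, non-distributed sketch of how these signatures could hang together, just to make the discussion concrete. This is illustrative only: the backing storage (a simple resizable array), the growth strategy, and the field names are assumptions, not the proposed implementation.

// Illustrative, non-distributed sketch of the proposed interface only.
class Queue {
  type eltType;
  var dom = {0..-1};             // empty domain; grows as elements are enqueued
  var elems: [dom] eltType;
  var headIdx = 0;               // index of the next element to dequeue

  proc enqueue(elt: eltType) {
    dom = {dom.low..dom.high+1};
    elems[dom.high] = elt;
  }

  proc dequeue(): (bool, eltType) {
    if isEmpty {
      var default: eltType;
      return (false, default);   // signal failure rather than blocking
    }
    const elt = elems[headIdx];
    headIdx += 1;
    return (true, elt);
  }

  proc size: int {
    return dom.high - headIdx + 1;
  }

  proc isEmpty: bool {
    return size == 0;
  }

  proc clear() {
    headIdx = 0;
    dom = {0..-1};
  }
}

// Example usage:
var q = new Queue(int);
q.enqueue(1);
var (ok, val) = q.dequeue();   // (true, 1)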

Evaluation Checklist

@e-kayrakli

The other threads got kind of cluttered, so I wanted to make a new one with, in particular, a checklist. I'm also getting side-tracked a lot, so this is to refocus myself. It actually isn't that much! The data structures can easily be made in about 30 minutes, the micro benchmark is pretty easy (although for NQueens I'll need to ensure it's correct and properly demonstrates performance, etc.), and work depletion is as easy as testing dequeue during the micro benchmark. Boom! It's smooth sailing... so long as more compiler issues don't arise (they seem to crop up about 4-5 times a week now).

Proof-Of-Concept(s)

Currently, I have two of them: the original (strictly FIFO) one and the new work-stealing (MPMC) one. I know the strictly FIFO one already works just fine; it doesn't scale due to communication, but it may as well be included. The MPMC one currently lacks work stealing (I know you said not to worry about it, but I definitely need to do it before I can call it a proper Proof-Of-Concept). While FIFO doesn't have to scale, MPMC does.

  • Satisfactory

Data Structures

I'm wondering if the 'distributed vs non-distributed' versions can just compare to baseline at 1 locale, since no communication is performed then.

  • Distributed Data Structures

    • FIFO
      • Potential race condition
      • Enqueue scales, but Dequeue does not
        • Planned revision to allow both to scale under all conditions (including contention)
    • MPMC
      • Work stealing is in a workable Proof-Of-Concept state but not viable for usage
        • Requires some low-level C logic for manageable performance and semantics
      • The local queue logic is embedded into the data structure, and uses a sync two-locked queue
        • Need to create separate paths: One for Sync, One for CC-Synch
  • Local Data Structures

    • Sync Two-Locked Queue
    • CCQueue
    • Sync List (Non-Distributed)

Benchmarks

This has been discussed in issues #1 and #2 but in general I want to create the actual checklist here so I can keep everything in one place.

  • Micro benchmark (Raw Enqueue + Dequeue)
    • Enqueue
    • Dequeue
    • Computational Intensity
      • Needs to be checked/confirmed
    • Artificial Irregularities
      • Needs to be checked/confirmed
  • NQueens

Proof-Of-Correctness

  • Work Depletion
    • Implemented in the main method of DistributedQueue.chpl
  • FIFO Test
    • Since the original DistributedFIFOQueue scales, I need to ensure it is actually FIFO!

Repository State

As stated in issue #6, I need to organize it more.

Script

Besides a basic Makefile, this was mentioned here: #2

  • Run All or Subset of Micro-Benchmarks
    • Differs from a simple makefile that builds all chpl files together and specifies --main-module
  • Properly sets up environment variables
  • Logs SHA hash and Chapel version
    • Likely just chpl --version concatenated with git rev-parse HEAD
  • Log Performance Metrics
    • How do I do this? What am I logging that the benchmark isn't? I think I already am
  • Plotting
    • Currently I'm doing this.
  • Versatility
    • Adding new benchmarks
      • Planned to just change --main-module to run the benchmark and compile everything together. This may not be possible though.

Q&A

@e-kayrakli

In terms of the local keyword, would the checks it adds cause any significant decline in overall throughput/scalability? That is, does it perform a check only when you enter the block, or on each memory access? It feels as if there is some invisible bottleneck in my queue that I cannot find... I've confirmed there isn't any excess communication (the CommDiagnostics output confirms this: everything is near 0 except for a small amount for initialization, <50). There must be something. I thought local was supposed to enable some optimizations?
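
For context, a minimal sketch of the kind of local block in question (illustrative only, not the actual queue code):

// Everything inside the 'local' block is asserted to touch only this locale's
// memory; the compiler can then drop wide-pointer/communication handling, and
// the runtime halts (with checks enabled) if the assertion is ever violated.
var counter: atomic int;

proc doLocalWork(n: int) {
  local {
    for i in 1..n do
      counter.add(1);
  }
}

doLocalWork(1000);
writeln(counter.read());   // 1000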

Repo organization suggestions

Let's keep track of housekeeping tasks here:

  • Use Makefiles for benchmarks
  • Add .gitignores to eliminate binaries, result files etc.

Ideas

Ideas

Segmented Flat Array - Custom Domain

Use the segmented idea from FlatObjectPool to allow concurrent operations during resizing.
This can possibly work with distributions as well as layouts.

FCHLock - Localization

Localize the data when we acquire combiner status, perform the combined operations on the local copy, then write it back later.

MPMC List - Per Task Blocks

Have here.maxTaskPar blocks of data.
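
A rough sketch of the idea, assuming a hypothetical Block record (names and sizes are purely illustrative):

// Hypothetical sketch: one block of storage per task slot on this locale,
// so concurrent tasks mostly touch disjoint blocks.
record Block {
  var dom = {0..-1};
  var data: [dom] int;
}

var perTaskBlocks: [0..#here.maxTaskPar] Block;

coforall tid in 0..#here.maxTaskPar {
  ref myBlock = perTaskBlocks[tid];
  myBlock.dom = {0..#8};   // each task sizes and fills only its own block
  myBlock.data = tid;
}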

Global Queue using GDT

Idea: Enqueue 'promises' as normal, but this time using an exchange operation.

// Inside 'add' (enqueue): link the new node in with a single wait-free exchange.
var node = new Node();
// Atomically swing the tail to the new node; 'prev' is the previous tail.
var prev = tail.exchange(node);
if prev == nil then head.write(node);   // queue was empty: new node is also the head
else prev.next = node;                  // otherwise, append after the old tail

Of course, using GDT for global objects!

Active Request Bias

Have separate active request lists so that a locale serves only its own tasks. It is possible to define some fairness or starvation-freedom guarantees by simply giving the combiner thread access to another locale's requests.

FIFO Extension

Use the above algorithm for obtaining the head index. This eliminates contention and ensures good system throughput (aside from communication cost, which is necessary anyway). The number of contending compare-and-swap operations drops from O(numLocales) to 0, with the drawback of requiring communication (which is unavoidable).

Benchmark Framework Tool

Idea: Convert this from Go to Chapel so that benchmarks and tests can be run with more ease.

Queue

Global wait free queue + FCHLock

Idea: Wait-freedom guarantees that we have a bounded number of remote operations, which ensures forward progress and (eventual) scalability. By using global atomics, we can make use of a bounded global queue that is proven to scale very well. Since the queue is only SPMC (without MCAS), we can use the Flat Combining strategy to ensure scalability even with a single producer. Consumers use the normal wait-free dequeue solution.

Overall this will ensure global wait-freedom! The queue itself is wait-free, and because the operations to be combined are wait-free, the wait for a task to have its operation completed is bounded, assuming that scheduling inside FCHLock is adequate. Overall, this can guarantee scalability for bounded, globally FIFO queues.

Round Robin + Work Stealing

Idea: Reduce the amount of work stealing by performing a round-robin enqueue like FIFO does.

Rationale: If the queue has more consumers than producers, it won't need to steal work too often. This might increase the overall throughput of the queue.

Drawback: Enqueue does scale, but it is orders of magnitude slower than a local enqueue. Good only if enqueue is not a high priority.

Optimization

FIFO

Allow Enqueues and Dequeues to skip indexes...

Idea: Have enqueue and dequeue solely rely on fetchAdd, then have them update another set of global arrays using fetchAdd for enqueue or fetchSub for dequeue. If during an enqueue, a fetchAdd yields a negative, skip it. If during a dequeue, a fetchSub yields a negative, skip it.

Rationale: Allows enqueues to be unhindered with only one additional wait-free atomic operation. Allows dequeue to rely on two wait-free atomic operations.

Drawback: A large number of dequeues means an equally large number of additional enqueues will have to retry, but the scheme is fair and should scale.
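
A hedged sketch of the skip logic described above; the slot-counter array, the helper names, and the exact skip thresholds are assumptions made for illustration, not the actual queue code:

// 'slotState[i]' tracks the balance of enqueues and dequeues aimed at slot i:
// enqueue does fetchAdd, dequeue does fetchSub, and a negative result means
// the slot should be skipped and the caller retries on another index.
config const numSlots = 1024;

var enqIdx, deqIdx: atomic int;            // positions claimed purely via fetchAdd
var slotState: [0..#numSlots] atomic int;

proc tryEnqueue(elt: int, ref data: [] int): bool {
  const i = enqIdx.fetchAdd(1) % numSlots;
  if slotState[i].fetchAdd(1) < 0 then     // a pending dequeue already claimed this slot
    return false;                          // skip; caller retries with a new index
  data[i] = elt;
  return true;
}

proc tryDequeue(const ref data: [] int): (bool, int) {
  const i = deqIdx.fetchAdd(1) % numSlots;
  if slotState[i].fetchSub(1) <= 0 then    // counter goes negative: slot was empty, skip
    return (false, 0);
  return (true, data[i]);
}

// Example usage:
var buf: [0..#numSlots] int;
const ok = tryEnqueue(42, buf);
writeln(tryDequeue(buf));   // (true, 42)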

Abstractions

QueueSet

Idea: Have a set of queues that are linked together and allow work stealing. This is the MPMC queue, but in a more reusable form that allows multiple queues, even multiple local queues, all linked together.

RCU With Descriptors

Idea: Have descriptors into a locale's address space, which adds one more level of indirection.

Rationale: Allows atomic updates of variables, especially with RCU for record types. A compare-and-swap becomes a single operation on descriptors, and reads and writes likewise operate only on descriptors. RCU can read the descriptor, copy and modify what it refers to under a new descriptor, and then swap the pointer over to the new one (with reference counting for the old one).
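
A speculative sketch of the indirection, assuming a per-locale descriptor table; the table layout, the reclamation strategy, and all names are placeholders rather than a real design:

// A descriptor is an index into a table of records; readers, writers, and
// compare-and-swap all touch only the small descriptor.
record Data {
  var x, y: int;
}

var table: [0..#1024] Data;       // per-locale backing storage (no reclamation here)
var nextFree: atomic int;
var current: atomic int;          // the "pointer": a descriptor into 'table'

proc rcuUpdate() {
  const oldDesc = current.read();            // read the current descriptor
  var copy = table[oldDesc];                 // copy the record it refers to
  copy.x += 1;                               // modify the private copy
  const newDesc = nextFree.fetchAdd(1) + 1;  // claim a fresh descriptor
  table[newDesc] = copy;
  // Publish by swinging the descriptor; a real implementation would
  // reference-count (or otherwise defer-reclaim) the old entry.
  const ok = current.compareExchangeStrong(oldDesc, newDesc);
}

rcuUpdate();
writeln(table[current.read()]);   // (x = 1, y = 0)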

Track Chapel Bugs

This is a meta-issue to keep a TODO list of Chapel implementation bugs we found/reported:

  • Compiling a module that uses BigInteger but is never itself used: #3 (comment)

Issue now reported: chapel-lang/chapel#6536

  • Missing error message when returning an inner function that uses outer-local variables: #3 (comment)

Issue: chapel-lang/chapel#6538

  • Returning by-ref values from overridden methods and not capturing them at call site: #9 (comment)

Issue: chapel-lang/chapel#6542

  • Class instances passed with explicit ref intent cause incorrect types during code generation.

Issue: chapel-lang/chapel#6642 -- only addresses the problem partially
SHA: 205021e
Notes: Run make FCHQueue

  • --cache-remote fails assertion with FIFO queue

SHA: 51dd86e
Notes: Run with make Benchmark EXTRAFLAGS='--cache-remote -sisFIFO=true' then bin/main.exe -nl 2 --nElements=1000
Output:

ATP Stack walkback for Rank 1 starting:
  [empty]@0xffffffffffffffff
  [email protected]:2299
  [email protected]:814
  [email protected]:122
  [email protected]:126
  chpl_gen_comm_get$$CFE_id_569285a2_9b0c5384@chpl-comm-compiler-macros.h:54
  [email protected]:2836
  [email protected]:2297
  [email protected]:1434
  [email protected]:1011
  [email protected]:988
ATP Stack walkback for Rank 1 done
Process died with signal 11: 'Segmentation fault'
View application merged backtrace tree with: stat-view atpMergedBT.dot
You may need to: module load stat
  • Returning arrays results in an internal/syntax error

TIO

Recommended reading:
https://github.com/chapel-lang/chapel/blob/master/doc/rst/developer/bestPractices/TestSystem.rst

The whole document is important, as you'll be developing correctness tests that can be used in Chapel's testing infrastructure. The sub-header "Futures", some way down in the document, is particularly useful for reporting bugs (you create an issue and ideally create a PR adding a "future" to the test system so the issue is tracked in nightly testing).

First Class Functions and Type System

One thing I'm going to concede here is that cclock can't be improved at this point in time. The object class is the only way to do this, and any other approach produces annoying and frustrating (internal) compiler errors or syntax errors that lead to paradoxes (like the chicken and the egg, except with types), where if you don't know the type at initialization time, or if the type can change, it just won't work. In this way, only class instances work, which requires wrapper objects. I spent far too long attempting to improve it; I'll try again later.

Minimizing Communications

@e-kayrakli (Mentioning so you are aware of a new thread)

Lack of privatization has a dire impact not only on performance, but on development outright. I thought you said distributions are automatically privatized, but I suppose I misheard you: the field itself is not privatized and incurs a communication cost, which was responsible for performance slowing to a crawl (to the point that I honestly thought I had a deadlock, until I added a println showing it was working, just very slowly). It turns out that config variables inside of a module cannot be accessed without a communication cost either. To fix this I have to expose the internal per-locale queue, because accessing it through the field each time is just not viable. Privatization is the only way to make this viable and convenient. In fact, I'm not even sure I have eliminated everything.

Addition of Priority Queue

One day I had a need for a priority queue implementation. I thought there would be a module like Collections, and here it is, but there is no priority queue implementation here. So I was hoping for a distributed priority queue implementation, which would be handy in many areas of computer science: for example, implementations of Dijkstra's, Prim's, or Kruskal's algorithms all need a priority queue.

Efficient benchmarking strategy

Opened this issue just to capture some ideas as we discuss them.

My random "in an ideal world" thoughts on this are as follows:

  • Whatever script you develop is most likely going to be used only by you for the rest of the project, although you may reuse parts of it later. Therefore, you need to find a sweet spot between the time you spend on this script and the time it will save you. You can start small and add features as you go.
  • A script to run all (at least) microbenchmarks is a must. I think we should set a data size (or workload size or whatever term makes sense) and stick with it so as to avoid comparing apples and oranges.
  • It'd be nice if this script could log performance metrics along with the commit SHAs of your queue and of Chapel, so that we can revert if we notice a performance degradation. This implies that you should use master and not the release. When we see a performance degradation, we can do an ad hoc test with the release to see whether the issue is due to what you changed or to some regression on master.
  • Plots help understand the results but definitely not a must. If you want to do that at most you can do very basic pyplots, which should do the job and shouldn't take a lot of time.
  • Writing your script in a somewhat flexible fashion can help when/if we add new benchmarks.
  • In general, when I do stuff like this, I add support (if necessary) for running it locally and use it for correctness testing before performance testing. It helps save time by pruning segfaults/races etc. before running on an actual machine.

Global Atomic Operations on Multi-word Data

@e-kayrakli
@mppf

This is of importance to both the project and to Chapel as a whole (or so I believe).
I believe that this early prototype of the Global Descriptor Table (GDT), formerly
referred to as GlobalAtomicObject, demonstrates the potential power of global atomics.
There are a few key things here that correct misconceptions I've had, and been told,
since the beginning...

Communication Is NOT Bad

Communication, even if it is per-operation, is not bad, but it's not good either.
The true bottleneck is contention. I've tested time and time again, and each time,
an algorithm containing some contention-causing operation that can fail an unbounded
number of times, and hence issue an unbounded number of communications, has performed
horribly. For example, an operation like the following is bad; it will cause an
unbounded number of communications:

// Retry loop: every failed attempt re-reads x and y remotely and issues another
// remote CAS, so the number of communications is unbounded.
while true {
  var _x = x.read();
  var _y = y.read();
  if _x < _y && x.compareExchangeStrong(_x, _x + 1) {
    break;
  }
}

Reading x and y is relatively expensive, and the CAS not only causes remote contention,
it also forces both x and y to be read again on every retry, so performance drops.

Now, algorithms which are guaranteed to succeed are also the ones that scale very well;
that is, only wait-free algorithms see real scalability here. Code that uses
fetchAdd and exchange works wonderfully, which gives me hope that a global data
structure with very loose guarantees is possible and useful.
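
For contrast, a minimal illustration of the kind of operation that does scale: exactly one wait-free network atomic per call, with no retry path (illustrative only, not code from the repository):

// Each call issues a single remote fetchAdd; there is no failure/retry loop,
// so the number of communications per operation is bounded.
var counter: atomic int;          // conceptually lives on one locale

proc nextIndex(): int {
  return counter.fetchAdd(1);     // wait-free: always succeeds in one round trip
}

coforall loc in Locales do on loc {
  coforall tid in 1..here.maxTaskPar {
    const idx = nextIndex();      // a task-unique, globally ordered index
  }
}
writeln(counter.read());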

Global Descriptor Table

Next is the new and novel idea of the GDT, the Global Descriptor Table. A 64-bit word
is used instead of a 128-bit wide pointer, allowing us to take advantage of the benefits
of network atomics. In essence, we encode the locale number in the upper 32 bits and
the actual index in the lower 32 bits. Currently an array is used, but in principle
(perhaps with runtime support, if not already available) it's possible that a descriptor
could be used directly like a normal pointer. Currently there is a large amount of
overhead in needing to keep a bitmap of usable memory, and the table cannot be resized
without synchronizing all accesses to it (as Chapel domains and arrays cannot be resized
in a lock-free way).
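
A minimal sketch of the encoding just described; the helper names are illustrative, and the real GDT additionally has to manage the backing table:

// Pack a (locale id, index) pair into a single 64-bit word so it can be the
// target of one network atomic, and unpack it again on use.
proc encodeDescriptor(localeId: int, idx: int): uint(64) {
  return ((localeId: uint(64)) << 32) | ((idx: uint(64)) & 0xFFFFFFFF);
}

proc decodeDescriptor(descr: uint(64)): (int, int) {
  const localeId = (descr >> 32): int;
  const idx = (descr & 0xFFFFFFFF): int;
  return (localeId, idx);
}

const d = encodeDescriptor(3, 42);
writeln(decodeDescriptor(d));   // (3, 42)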

Currently, it has been tested with simple exchange operations on class instances,
performed remotely across all locales, versus needing to acquire a sync variable to do
the same. As stated above, a compareExchangeStrong kills performance, but that has
nothing to do with the GDT and everything to do with network atomic contention, so the
test is kept simple. It works and it scales. The graph below shows the time to complete
the same number of operations (100,000 in this case), averaged over 4 trials (discarding
the first warm-up trial): while sync grows at a near-exponential rate, GDT remains linear.

[Figure: runtime of sync vs. GDT for 100,000 operations, averaged over 4 trials]

Now, to get a better look at it, here are the same results in Operations per Second.

[Figure: the same results, shown as operations per second]

Implications

I believe this is some very valuable information, which is why I am including Michael as
well. Not only is the implementation very immature (there are so many optimizations that
could be made that there's no telling how much more this can scale), it also surpasses
the only other way to perform atomics on global class instances. As well, this opens up
some very unconventional means of concurrency, the first being DSMLock (WIP), which is
based on DSM-Synch (Distributed Shared Memory Combined Synchronization) from the
publication that introduced CC-Synch (Cache-Coherent Combined Synchronization). Since I
can confirm that this scales, I can possibly even make DSMLock scale in such a way that
global lock-based algorithms can scale (not just wait-free ones). Extremely exciting!

Edit:

If this is extended into the runtime, I can imagine having the entire global address space chunked up into 4GB (2^32-byte) zones, with 2^8 zones on each of 2^24 locales. With 256 zones of 4GB, that's 1TB of address space per locale, across 16M+ locales.

Benchmarks

Must create benchmarks for the following scenarios (a skeleton driver sketch follows the list):

Single-Producer Single-Consumer,
Single-Producer Multiple-Consumer,
Multiple-Producer Single-Consumer,
Multiple-Producer Multiple-Consumer
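
A hedged skeleton of one way to drive all four scenarios from a single program. The atomic counter is only a stand-in for the queue under test, the operation counts are placeholders, and the Timer usage assumes the standard Time module:

use Time;

config const opsPerTask = 100000;
var fakeQueue: atomic int;        // placeholder for the actual queue under test

proc runScenario(name: string, numProducers: int, numConsumers: int) {
  var timer: Timer;
  timer.start();
  cobegin {
    coforall p in 1..numProducers do
      for 1..opsPerTask do fakeQueue.add(1);   // stands in for enqueue
    coforall c in 1..numConsumers do
      for 1..opsPerTask do fakeQueue.sub(1);   // stands in for dequeue
  }
  timer.stop();
  writeln(name, ": ", timer.elapsed(), " seconds");
}

runScenario("SPSC", 1, 1);
runScenario("SPMC", 1, here.maxTaskPar);
runScenario("MPSC", here.maxTaskPar, 1);
runScenario("MPMC", here.maxTaskPar, here.maxTaskPar);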
