Code Monkey home page Code Monkey logo

commbench's Introduction

CommBench

CommBench is a portable benchmarking tool for HPC networks involving multi-GPU nodes. The tool integrates MPI, NCCL, and IPC capabilities, provides an API for users to compose desired communication patterns, takes measurements, and offers ports for benchmarking on Nvidia, AMD, and Intel GPUs.

For questions and support, please send an email to [email protected]

API

CommBench is a runtime tool for implementing custom microbenchmarks. It offers a C++ API for programming a desired pattern using a composition of point-to-point communications. When programming a microbenchmark, CommBench's API functions must be hit by all processes as if the program is sequential.

Communicator

The benchmarking pattern is registered into a persistent communicator. The data type must be provided at compile time with the template parameter T. Communication library for the implementation must be specified at this stage because the communiator builds specific data structures accordingly. Current options are: MPI, NCCL, and IPC.

template <typename T>
CommBench::Comm<T> Comm(CommBench::Library);

Pattern Composition

CommBench relies on point-to-point communications. The API offers a single function add for registering point-to-point communications that can be used as the building block for the desired pattern.

For quick tests, add_lazy allocates communication buffers buffers internally.

void CommBench::Comm<T>::add_lazy(size_t count, int sendid, int recvid);

The rigorous function requires the pointers to the send and recieve buffers as well as the offset to the data. For reliability, the pointers must point to the head of the buffer, as returned by virtual memory allocation. The rest of the arguments are the number of elements (their type is templatized) and the MPI ranks of the sender and reciever processes in the global communicator (i.e., MPI_COMM_WORLD).

void CommBench::Comm<T>::add(T *sendbuf, size_t sendoffset, T *recvbuf, size_t recvoffset, size_t count, int sendid, int recvid);

For seeing the benchmarking pattern as a sparse communication matrix, one can call the report() function.

void CommBench::Comm<T>::report();

Synchronization

Synchronization across the GPUs is made by start() and wait() functions. The former launches the registered communications all at once using nonblocking API of the chosen library. The latter blocks the program until the communication buffers are safe to be reused.

Among all GPUs, only those who are involved in the communications are effected by the wait() call. Others move on executing the program. For example, in a point-to-point communication, the sending process returns from wait() when the send buffer is safe to be reused. Likewise, the recieving process returns from wait() when the recieve buffer is safe to be reused. The GPUs that are not involved return from both start() and wait() functions immediately. In sparse communications, each GPU can be sender and receiver, and the wait() function blocks a process until all associated communications are completed.

void CommBench::Comm<T>::start();
void CommBench::Comm<T>::wait();

The communication time can be measured with minimal overhead using the synchronization functions as below.

MPI_Barrier(MPI_COMM_WORLD);
double time = MPI_Wtime();
comm.start();
comm.wait();
time = MPI_Wtime() - time;
MPI_Allreduce(MPI_IN_PLACE, &time, 1, MPI_DOUBLE, MPI_MAX, comm_mpi);

Measurement

It is tedious to take accurate measurements, mainly because it has to be repeated several times to find the peak performance. We provide a measurement functions that executes the communications multiple times and reports the statistics.

void CommBench::Comm<T>::measure(int warmup, int numiter);

For "warming up", communications are executed warmup times. Then the measurement is taken over numiter times, where the latency in each round is recorded for calculating the statistics.

Multi-Step Benchmarks

For benchmarking multiple steps of communication patterns where each step depends on the previous, CommBench provides the following function:

void CommBench::measure(std::vector<Comm<T>> sequence, int warmup, int numiter, size_t count);

In this case, the sequence of communications are given in a vector, e.g., sequence = {comm_1, comm_2, comm_3}. CommBench internally figures out the data dependencies across steps and executes them asynchronously while preserving the dependencies across point-to-point functions.

Striping

As an example, the above shows striping of point-to-point communications across nodes. The asynchronous execution of this pattern finds opportunites to overlap communications within and across nodes using all GPUs, and utilizes the overall hierarchical network (intra-node, extra-node) efficiently towards measuring the peak bandwidth across nodes. See examples/striping for an implementation with CommBench. The measurement will report the end-to-end latency ($t$) and throughput ($d/t$), where $d$ is the data movement across nodes and calculated based on count and the size of data type T.

For questions and support, please send an email to [email protected]

Collaborators: Simon Garcia de Gonzalo (Sandia), Yu Li (Illinois), Elliot Slaughter (SLAC), Chris Zimmer (ORNL), Bin Ren (W&M), Tekin Bicer (ANL), Wen-mei Hwu (Nvidia), Bill Gropp (Illinois), Alex Aiken (Stanford).

commbench's People

Contributors

merthidayetoglu avatar yuli821 avatar markstock avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.