ComScribe

ComScribe is a tool that identifies communication among all GPU-GPU and CPU-GPU pairs in a single-node multi-GPU system.

Installation

You can directly execute the install.sh script:

./install.sh

OR

You can install it manually.

You will need the following programs:

  • Python: ComScribe is a Python script. It uses several packages listed in requirements.txt, which you can install via the command:
pip3 install -r requirements.txt
  • nvprof: ComScribe parses the output of NVIDIA's profiler nvprof, a lightweight command-line profiler available since CUDA 5.

  • NCCL [Optional]: ComScribe modifies the NCCL library to profile collective communication primitives. If your application does not use any collective operations, you can skip this step. Otherwise, build the bundled NCCL:

cd nccl && make -j src.build

No further installation is required.
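As noted above, ComScribe works by parsing nvprof's trace output. The sketch below illustrates that idea in miniature; the CSV column layout and field names are simplified assumptions for illustration, not ComScribe's actual parser:

```python
import csv
import io

# Illustrative sketch: extract memcpy records from an nvprof-style CSV
# GPU trace. The columns mimic `nvprof --print-gpu-trace --csv` output
# but are simplified assumptions, not ComScribe's real input format.
SAMPLE_TRACE = """\
"Start","Duration","Size","SrcMemType","DstMemType","Device","Name"
"1.0","0.5","1048576","Pageable","Device","GPU 0","[CUDA memcpy HtoD]"
"2.0","0.3","524288","Device","Device","GPU 1","[CUDA memcpy PtoP]"
"3.0","0.1","0","","","GPU 0","kernel_foo(float*)"
"""

def parse_memcpy_records(trace_text):
    """Return (size_bytes, name) tuples for memcpy rows only."""
    records = []
    for row in csv.DictReader(io.StringIO(trace_text)):
        # Keep only memory-copy events; kernel launches are skipped.
        if "[CUDA memcpy" in row["Name"]:
            records.append((int(row["Size"]), row["Name"]))
    return records

records = parse_memcpy_records(SAMPLE_TRACE)
```

Filtering on the bracketed memcpy event names is what lets host-to-device, device-to-device, and peer-to-peer traffic be told apart in a trace like this.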

Usage

P2P Communication Profiling

To obtain the communication matrices of your application (app):

python3 comscribe.py -g <num_gpus> -s log|linear -i <cmd_to_run>

-g tells the tool how many GPUs will be used. Note that if the application being run also requires such a parameter, it must be specified explicitly in the command itself (see -i below).

-s sets the scale of the output figures: log for logarithmic or linear for linear.

-i takes the input command as a string such as: -i './app --foo 20 --bar 5'

The communication matrix for a given communication type is generated only if that type is detected, e.g. if there are no Unified Memory transfers, there will be no output regarding Unified Memory. For each detected communication type, the generated figures are saved as PDF files in the script's directory.
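Conceptually, each communication matrix is a GPU-by-GPU accumulator of transfer sizes and counts. A minimal sketch of that accumulation (the record format is an assumption for illustration, not ComScribe's internal structure):

```python
# Minimal sketch of accumulating per-pair communication matrices.
# Each record is (src_gpu, dst_gpu, size_bytes); this record format is
# an assumption for illustration, not ComScribe's internal structure.
def build_comm_matrices(records, num_gpus):
    """Return (bytes_matrix, count_matrix), each num_gpus x num_gpus."""
    bytes_m = [[0] * num_gpus for _ in range(num_gpus)]
    count_m = [[0] * num_gpus for _ in range(num_gpus)]
    for src, dst, size in records:
        bytes_m[src][dst] += size   # total volume per GPU pair
        count_m[src][dst] += 1      # number of transfers per GPU pair
    return bytes_m, count_m

# Two 1 MiB transfers from GPU 0 to GPU 1, one 1 KiB transfer back.
transfers = [(0, 1, 1 << 20), (0, 1, 1 << 20), (1, 0, 1 << 10)]
bytes_m, count_m = build_comm_matrices(transfers, 2)
```

The two matrices correspond to the "bytes transferred" and "number of transfers" figures shown in the examples below.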

Collective Communication Profiling

python3 comscribe.py -n -c <collective_type> -g <num_gpus> -s log|linear -i <cmd_to_run>

-n enables the profiling of collectives.

-c selects the collective to be profiled (if not specified, all collectives are profiled by default). Options: broadcast, reduce, allgather, allreduce, reducescatter
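The flag interface described in the two sections above could be modeled with argparse as follows; this is a hypothetical sketch mirroring the README's description, not ComScribe's actual source:

```python
import argparse

# Hypothetical sketch of ComScribe's command-line interface as described
# in this README; flag semantics mirror the text, not the real script.
def make_parser():
    p = argparse.ArgumentParser(prog="comscribe.py")
    p.add_argument("-g", type=int, required=True, help="number of GPUs")
    p.add_argument("-s", choices=["log", "linear"], help="figure scale")
    p.add_argument("-i", help="command to run, quoted as one string")
    p.add_argument("-n", action="store_true", help="profile collectives")
    p.add_argument("-c", choices=["broadcast", "reduce", "allgather",
                                  "allreduce", "reducescatter"],
                   help="collective to profile (default: all)")
    return p

args = make_parser().parse_args(
    ["-n", "-c", "allreduce", "-g", "4", "-s", "log", "-i", "./app"])
```

Passing the target command as a single quoted string to -i is what lets the profiled application keep its own flags separate from ComScribe's.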

Benchmarks

We have used our tool on an NVIDIA DGX-2 system with up to 16 V100 GPUs, using CUDA v10.0.130, for the following benchmarks:

  • NVIDIA Monte Carlo Simulation of 2D Ising-GPU | GitHub

  • NVIDIA Multi-GPU Jacobi Solver | GitHub

  • Comm|Scope | Paper | GitHub

    • Full-Duplex | GitHub
    • Full-Duplex with Unified Memory | GitHub
    • Half-Duplex with peer access | GitHub
    • Half-Duplex without peer access | GitHub
    • Zero-copy Memory (both Read and Write benchmarks) | GitHub

    Note: To run a Comm|Scope benchmark with a fixed number of iterations, e.g. 100, replace its registration in the benchmark's source code with:

    benchmark::RegisterBenchmark(...)->SMALL_ARGS()->Iterations(100);
    
  • MGBench | GitHub

  • Eidetic 3D LSTM | Paper | GitHub

  • Transformer | Paper | GitHub

Example: Comm|Scope Zero-copy Memory Read Half-Duplex Micro-benchmark

python3 comscribe.py -g 4 -i './scope --benchmark_filter="Comm_ZeroCopy_GPUToGPU_Read.*18.*" -n 0' -s log

This command gives the bar chart for zero-copy memory transfers.

Example: Comm|Scope Unified Memory Full Duplex Micro-benchmark

python3 comscribe.py -g 4 -i './scope --benchmark_filter="Comm_Demand_Duplex_GPUGPU.*18.*"' -s linear

This command gives two matrices: bytes transferred (left) and number of transfers made (right).

Example: MGBench Full Duplex Micro-benchmark

python3 comscribe.py -g 4 -i './fullduplex' -s linear

This command gives two matrices: bytes transferred (left) and number of transfers made (right).

Example: NVIDIA Monte Carlo Simulation of 2D Ising-GPU

python3 comscribe.py -g 4 -i './cuIsing -y 32768 -x 65536 -n 128 -p 16 -d 4 -t 1.5' -s log

This command gives two matrices: bytes transferred (left) and number of transfers made (right).

Publications

  • Akhtar, P., Tezcan, E., Qararyah, F.M., Unat, D. (2021). ComScribe: Identifying Intra-node GPU Communication. In: Wolf, F., Gao, W. (eds) Benchmarking, Measuring, and Optimizing. Bench 2020. Lecture Notes in Computer Science(), vol 12614. Springer, Cham. https://doi.org/10.1007/978-3-030-71058-3_10

  • Soytürk, M.A., Akhtar, P., Tezcan, E., Unat, D. (2022). Monitoring Collective Communication Among GPUs. In: , et al. Euro-Par 2021: Parallel Processing Workshops. Euro-Par 2021. Lecture Notes in Computer Science, vol 13098. Springer, Cham. https://doi.org/10.1007/978-3-031-06156-1_4

Contributors

erhant, palwisha-18, mabdullahsoyturk
