# GPUPowerTest

/**
 * Copyright 1993-2015 NVIDIA Corporation.  All rights reserved.
 *
 * Please refer to the NVIDIA end user license agreement (EULA) associated
 * with this source code for terms and conditions that govern your use of
 * this software. Any use, reproduction, disclosure, or distribution of
 * this software and related documentation outside the terms of the EULA
 * is strictly prohibited.
 *
 */

/*
GPUPowerTest is derived from the CUDA samples, from which the above notice is
taken, and is covered by that EULA.
*/
$ ./gpt -h
Usage: build/gpt [-u <power load up duration> (in float seconds) default 1.0]
        [-d <power load down duration> (in float seconds) default 1.0]
        [-t <load test duration> (in int seconds) default 60]
        [-c <spin N CPU cores per GPU> (in int) default 0]
        [-i <GPU number> (in int) default 0]
        [-L <reduce to minimum power on down phase> (boolean) default is medium power]
        [-D <random power dropouts: 0 - 1, 1 - 4 sec, every 6 sec> (boolean) default is not]
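As a rough illustration of what the -u/-d duty cycle means (a hedged sketch only, not the gpt source; the kernel and variable names are made up for this example), a load phase can be approximated by repeatedly launching a busy kernel for the "up" duration and then idling for the "down" duration:

```cuda
#include <chrono>
#include <thread>
#include <cuda_runtime.h>

// Hypothetical busy kernel: burns FP cycles on global memory to draw power
__global__ void busy_kernel(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 10000; ++k)
            buf[i] = buf[i] * 1.000001f + 0.5f;
}

int main() {
    const double up_s = 1.0, down_s = 1.0;   // analogous to -u / -d
    const int    total_s = 60;               // analogous to -t
    const int    n = 1 << 20;
    float *buf;
    cudaMalloc(&buf, n * sizeof(float));

    auto start = std::chrono::steady_clock::now();
    while (std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count() < total_s) {
        // load-up phase: keep the GPU busy for up_s seconds
        auto phase = std::chrono::steady_clock::now();
        while (std::chrono::duration<double>(std::chrono::steady_clock::now() - phase).count() < up_s)
            busy_kernel<<<(n + 255) / 256, 256>>>(buf, n);
        cudaDeviceSynchronize();
        // load-down phase: idle the GPU for down_s seconds
        std::this_thread::sleep_for(std::chrono::duration<double>(down_s));
    }
    cudaFree(buf);
    return 0;
}
```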


# Multi-GPU tests are implemented with OpenMPI; the MPI rank corresponds to the GPU ID.
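A minimal sketch of that rank-to-GPU mapping (illustrative only, not the gpt source): each MPI rank selects the CUDA device whose ID matches its rank.

```cuda
#include <mpi.h>
#include <stdio.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, ndev = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaGetDeviceCount(&ndev);

    // Each MPI rank binds to "its" GPU; with -np equal to the GPU count this is a 1:1 mapping
    cudaSetDevice(rank % ndev);
    printf("rank %d -> GPU %d (of %d)\n", rank, rank % ndev, ndev);

    MPI_Finalize();
    return 0;
}
```

Compile with nvcc plus the MPI include and library paths for your installation, then run it under mpirun as in the SOP below.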

# High-Level GPUPowerTest (gpt) SOP

1. Make sure OpenMPI >= v4.1.3 is installed and healthy
1.1 In particular, the library libmpi.so.40 must be found (default location /usr/local/lib)
1.2 Verify with "mpirun --allow-run-as-root -np 8 -H localhost:8 date"


2. Make sure CUDA >= 11.0, the driver, and (if applicable) nvidia-fabricmanager are installed
2.1 In particular, the library libcublasLt.so.11 must be found (default location /usr/local/cuda/targets/x86_64-linux/lib)
2.2 On multi-GPU NVLink systems
2.2.1 systemctl status nvidia-fabricmanager # will check the status
2.2.2 systemctl start nvidia-fabricmanager  # will start the Fabric Manager
2.3 Ensure GPUs are visible with nvidia-smi (see the device-count sketch after this SOP)


3. Ensure the needed libraries are found (your paths may vary; the example below is based on the defaults)
3.1 export LD_LIBRARY_PATH=/usr/local/cuda/targets/x86_64-linux/lib:/usr/local/lib
3.2 Check with "ldd ./gpt"

4. Run "./gpt -h" to check the options, then execute under mpirun
4.1 mpirun -np N [ -H 127.0.0.1:M,127.0.0.1:P ] ./gpt ...
    where N is the total number of GPUs to engage and M + P = N; the N GPUs can be spread over
    multiple similar MPI-enabled servers, some number of GPUs each, as long as the total equals N
4.2 example: "mpirun --allow-run-as-root -np 8 -H localhost:8 ./gpt -t 14400 -d 0.1 -u 35 -D"
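
The device-count sketch referenced in step 2.3 (a hypothetical cross-check alongside nvidia-smi, not part of gpt):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaError_t err = cudaGetDeviceCount(&n);
    if (err != cudaSuccess) {
        // A failure here usually means a driver/runtime mismatch or no visible GPUs
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int i = 0; i < n; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("GPU %d: %s\n", i, prop.name);
    }
    return 0;
}
```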

# e.g.

$ mpirun --allow-run-as-root -np 8 -H localhost:8  ./gpt -t 3000  -u .001 -d .001 -D

$ mpirun --allow-run-as-root -np 8 -H localhost:8  ./gpt -t 3000  -u .001 -d .001 -L

$ mpirun --allow-run-as-root -np 8 -H localhost:8  ./gpt -t 300  -u 10 -d 1 



# For a machine with CMT disabled, use these MPI options
$ mpirun -np 8 -x LD_LIBRARY_PATH --map-by ppr:2:numa:pe=11:overload-allowed ./gpt -u 5 -d .5 -t 600

# For NVLink load

Also included is an NVLink load generator, nvlbm.cu. It currently needs to be compiled separately,
e.g. "nvcc -o nvlbm nvlbm.cu -lpthread", or built with the bld_nvlbm.sh script.
The nvlbm binary can be run standalone to generate NVLink traffic,
or simultaneously with gpt to generate additional load and power consumption.
By default, nvlbm generates traffic between all installed GPUs. The only option is the
"-t" flag, which specifies the run duration in seconds, e.g.

nvlbm -t 300

to run for 5 minutes.
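
As a rough illustration of the kind of GPU-to-GPU traffic nvlbm produces (a hedged sketch, not the nvlbm source; it assumes at least two peer-capable GPUs, e.g. connected by NVLink):

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256 << 20;          // 256 MiB per transfer
    void *buf0 = NULL, *buf1 = NULL;

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);        // let device 0 reach device 1
    cudaMalloc(&buf0, bytes);

    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);        // let device 1 reach device 0
    cudaMalloc(&buf1, bytes);

    cudaSetDevice(0);
    for (int i = 0; i < 1000; ++i) {
        cudaMemcpyPeerAsync(buf1, 1, buf0, 0, bytes, 0);   // GPU 0 -> GPU 1
        cudaMemcpyPeerAsync(buf0, 0, buf1, 1, bytes, 0);   // GPU 1 -> GPU 0
    }
    cudaDeviceSynchronize();

    cudaSetDevice(0); cudaFree(buf0);
    cudaSetDevice(1); cudaFree(buf1);
    return 0;
}
```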

With DCGM installed, NVLink traffic, along with power, can be monitored easily:

dcgmi dmon -e 100,155,1002,1011,1012

The above will sample every second. See dcgmi dmon --list for a complete list of
events that can be captured. Use the -d (delay) flag to change the query interval.
The value is in milliseconds, default 1000ms (1 second). 
See dcgmi dmon --help for more.
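
For example, using the same field IDs as above but a 500 ms query interval (the interval value is just an illustration):

dcgmi dmon -e 100,155,1002,1011,1012 -d 500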
