
ngc_multinode_perf's Introduction

Performance Tests for Multinode NGC-Ready Certification

Prerequisites

  • numactl
  • jq
  • iperf3 version >= 3.5
  • sysstat
  • perftest version >= 4.5-0.11 (if running RDMA tests)
  • nvidia-utils (if running RDMA test with CUDA)
  • mlnx-tools (or at least the common_irq_affinity.sh and set_irq_affinity_cpulist.sh scripts from it). This package is available in the Mellanox mlnx-tools GitHub repository, and is also installed as part of OFED.

Access requirements

In order to run this, you will need passwordless root access to all the involved servers and DPUs. This can be achieved in several ways:

First, generate a passwordless SSH key (e.g. using ssh-keygen) and copy it to all the entities involved (e.g. using ssh-copy-id).

  • If running as root: no further action is required.
  • If running as a non-root user:
    • Make sure that your non-root user name is the same on the server, the client, and both the DPUs.
    • Make sure that your non-root user is able to use sudo (is in an administrative group, such as sudo or wheel, or is mentioned directly in the sudoers file).
    • Make sure that your non-root user can use sudo without a password. For example, this configuration in the sudoers file will do, assuming your user is a member of a group named sudo: %sudo ALL=(ALL:ALL) NOPASSWD: ALL. Alternatively, for a more granular approach, the following line allows the non-root user to run only the needed binaries without a password:
      %sudo ALL=(ALL) ALL, NOPASSWD: /usr/bin/bash,/usr/sbin/ip,/opt/mellanox/iproute2/sbin/ip,/usr/bin/mlxprivhost,/usr/bin/mst,/usr/bin/systemctl,/usr/sbin/ethtool,/usr/sbin/set_irq_affinity_cpulist.sh,/usr/bin/tee,/usr/bin/numactl,/usr/bin/awk,/usr/bin/taskset,/usr/bin/setpci,/usr/bin/rm -f /tmp/*
      

RDMA test

Automatically detects the device's local NUMA node and runs write/read/send bidirectional tests. The pass criterion is 90% of the port link speed.

Usage:

./ngc_rdma_test.sh <client hostname/ip> <client ib device>[,<client ib device2>] \
    <server hostname/ip> <server ib device>[,<server ib device2>] [--use_cuda] \
    [--qp=<num of QPs, default: total 4>] \
    [--all_connection_types | --conn=<list of connection types>] \
    [--tests=<list of ib perftests>] \
    [--bw_message_size_list=<list of message sizes>] \
    [--lat_message_size_list=<list of message sizes>] \
    [--server_cuda=<cuda_device>] \
    [--client_cuda=<cuda_device>] \
    [--unidir] \
    [--ipsec <list of DPU clients> <list of PFs associated to list of DPU clients> \
    <list of DPU servers> <list of PFs associated to list of DPU servers>]
  • If running with CUDA:
    • The nvidia-peermem driver should be loaded.
    • Perftest should be built with CUDA support.

RDMA Wrapper

Automatically detects HCAs on the host(s) and runs ngc_rdma_test.sh for each device.

Usage:

./ngc_rdma_wrapper.sh <client hostname/ip> <server hostname/ip> \
    [--with_cuda, default: without cuda] \
    [--cuda_only] \
    [--write] \
    [--read] \
    [--vm] \
    [--aff <file>] \
    [--pairs <file>]

./ngc_internal_lb_rdma_wrapper.sh <hostname/ip> \
    [--with_cuda, default: without cuda] \
    [--cuda_only] \
    [--write] \
    [--read] \
    [--vm] \
    [--aff <file>]

TCP test

Automatically detects the device's local NUMA node, disables the IRQ balancer, increases the MTU to the maximum, and runs iperf3 on the closest NUMA nodes. The reported aggregated throughput is in Gb/s.

Usage:

./ngc_tcp_test.sh <client hostname/ip> <client ib device> <server hostname/ip> \
    <server ib device> [--duplex=<"HALF" (default) or "FULL">] \
    [--change_mtu=<"CHANGE" (default) or "DONT_CHANGE">] \
    [--duration=<in seconds, default: 120>] \
    [--ipsec <list of DPU clients> <list of PFs associated to list of DPU clients> \
    <list of DPU servers> <list of PFs associated to list of DPU servers>]
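The aggregated figure can be reproduced from iperf3's JSON output (iperf3 run with --json). A minimal sketch using jq (already a prerequisite of this suite) and the standard end.sum_received.bits_per_second field; the two results below are hypothetical per-process runs of 98.5 Gb/s each:

```shell
# Sum per-process iperf3 JSON results (bits/s) into an aggregate in Gb/s.
printf '%s\n' \
    '{"end":{"sum_received":{"bits_per_second":98500000000}}}' \
    '{"end":{"sum_received":{"bits_per_second":98500000000}}}' \
    | jq -s '([.[].end.sum_received.bits_per_second] | add) / 1e9'
# prints 197
```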

IPsec full offload test

  • This test currently supports single port only.

Configures IPsec full offload on both the client and the server DPU, and then runs a TCP test.

Usage:

./ngc_ipsec_full_offload_tcp_test.sh <client hostname/ip> <client ib device> \
    <server hostname/ip> <server ib device> <client bluefield hostname/ip> \
    <server bluefield hostname/ip> [--mtu=<mtu size>] \
    [--duration=<in seconds, default: 120>]

IPsec crypto offload test

  • Relevant for new HCAs (ConnectX-6 DX and above).

Configures IPsec crypto offload on both client and server, runs a TCP test, and removes the IPsec configuration.

Usage:

./ngc_ipsec_crypto_offload_tcp_test.sh <client hostname/ip> <client ib device> \
    <server hostname/ip> <server ib device> <number of tunnels>
  • The number of tunnels should not exceed the number of IPs configured on the NICs.

Download the latest stable version

Besides cloning and checking out the latest stable release, you can also use the following helper script:

curl -Lfs https://raw.githubusercontent.com/Mellanox/ngc_multinode_perf/main/helpers/dl_nmp.sh | bash

And to download the latest 'experimental' (rc) version:

curl -Lfs https://raw.githubusercontent.com/Mellanox/ngc_multinode_perf/main/helpers/dl_nmp.sh | bash -s -- rc

Tuning instructions and HW/FW requirements

Item                  Description
HCA firmware version  Latest_GA
MLNX_OFED version     Latest_GA
Eth switch ports      Set MTU to 9216.
                      Enable PFC and ECN using the single "Do ROCE" command.
IB switch OpenSM      Change IPoIB MTU to 4K:
                          [standalone: master] > en
                          [standalone: master] # conf t
                          [standalone: master] (config) # ib partition Default mtu 4K force

AMD CPUs (EPYC 7002 and 7003 series):
BIOS settings         CPU Power Management → Maximum Performance
                      Memory Frequency → Maximum Performance
                      Alg. Performance Boost Disable (ApbDis) → Enabled
                      ApbDis Fixed Socket P-State → P0
                      NUMA Nodes Per Socket → 2
                      L3 cache as NUMA Domain → Enabled
                      x2APIC Mode → Enabled
                      PCIe ACS → Disabled
                      Preferred IO → Disabled
                      Enhanced Preferred IO → Enabled
Boot grub settings    iommu=pt numa_balancing=disable processor.max_cstate=0

Intel CPUs (Xeon Gold and Platinum):
BIOS settings         Out of the box
Boot grub settings    intel_idle.max_cstate=0 processor.max_cstate=0 intel_pstate=disable

NIC PCIe settings     For each NIC PCIe function, change PCI MaxReadReq to 4096B:
                      run "setpci -s $PCI_FUNCTION 68.w"; it returns 4 hex digits ABCD;
                      then run "setpci -s $PCI_FUNCTION 68.w=5BCD".
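The MaxReadReq step above only replaces the top hex digit of config-space word 0x68 with 5 (the encoding for 4096B), keeping the rest intact. A sketch of that substitution; new_maxreadreq_word is a helper name invented for this example:

```shell
# Keep the lower three hex digits of the current 68.w value and force the
# top digit to 5, which encodes a MaxReadReq of 4096B.
new_maxreadreq_word() {
    printf '5%s\n' "${1:1}"
}

new_maxreadreq_word 2936    # prints 5936
# Apply with: setpci -s $PCI_FUNCTION 68.w="$(new_maxreadreq_word "$current")"
```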

ngc_multinode_perf's People

Contributors

amirancel, blochl, dorkovachi, igor-ivanov, nshamshoum, ravidrothfeld, razgavrieli, roizilberzwaig


ngc_multinode_perf's Issues

Some message sizes are not compatible with some tests

Now there is a default message size of 65536 for BW tests, and 2 for latency tests. This fails some tests, as for example, UC requires a message size not larger than the MTU.

I suggest having only one message size array (so that a user who knows what they are doing can set it manually), and for the 'usual' cases leaving the default handling to perftest itself, which already knows which default is best for which test. ngc_multinode_perf does not need to duplicate this logic, and this way we avoid having (at least!) 3 different arrays: for BW (non-UC), BW (UC), and latency tests.
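The UC constraint described here can be sketched as a simple predicate (msg_size_ok is an invented name, not part of the repository):

```shell
# Succeed iff the message size is legal for the connection type:
# UC payloads must fit within the MTU; other types are unconstrained here.
msg_size_ok() {
    [ "$1" != "UC" ] || [ "$2" -le "$3" ]
}

msg_size_ok UC 65536 4096 && echo pass || echo fail    # fail: default BW size > MTU
msg_size_ok RC 65536 4096 && echo pass || echo fail    # pass
```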

Use log() instead of echo, and fatal() instead of 'echo "..."; exit 1' in ngc_rdma_wrapper.sh

In ngc_multinode_perf, ALL the messages, besides the final results, should be handled by the log() function. This is in order to be controlled from a single location. Currently, in ngc_rdma_wrapper.sh there are some places where this is not so.

Let's use the log() function throughout instead of echo, as in the future it is planned to redirect the log output to the system log using this function.
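A minimal sketch of the requested pattern; the function bodies below are assumptions for illustration, not the repository's actual implementation:

```shell
# Every message goes through log(), so redirecting output (e.g. to syslog
# via logger) later becomes a one-line change inside log() itself.
log() { printf '%s\n' "$*"; }
fatal() { log "FATAL: $*"; exit 1; }

log "INFO: starting wrapper"
# fatal "no devices found"    # would log the message and exit with status 1
```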

Release tag request

Hello,
I'm with Azure's AI/HPC engineering organization and we are currently evaluating this tool to replace our current perf-test loopback checks. We would like to integrate this into our VM health check test suite.

Would it be possible to have tagged releases for ngc_multinode_perf? We like to manage our dependencies by release tag versions.

Allow the user to set the pass criterion

Currently, the pass criterion is 90% of max. Sometimes less than that is enough for a pass (e.g. on systems with less compute power), and sometimes more (even more than max!) is required, e.g. in an HBN scenario, where two ports are linked together with an internal bridge on the DPU, something the host has no knowledge of. To this end, the user should have an option to set the pass criterion explicitly, while the default stays at 90% of max.
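A sketch of what a user-settable criterion could look like, keeping today's 90% as the default (pass_threshold is a hypothetical helper, not existing code):

```shell
# Threshold = max bandwidth * ratio; ratio defaults to 0.9 and may exceed
# 1.0 for cases like HBN, where the required rate is above the nominal max.
pass_threshold() {
    awk -v max="$1" -v ratio="${2:-0.9}" 'BEGIN { printf "%.0f\n", max * ratio }'
}

pass_threshold 400        # prints 360 -- current default behavior
pass_threshold 400 1.1    # prints 440 -- explicit, stricter criterion
```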

Some options for TCP test are not documented in README

The changes from commits 622e450 and 9342a27 (the max_proc, disable_ro, allow_core_zero, and neighbor_levels options) appear in the help section of ngc_tcp_test.sh, but are not documented in README.md. We refer all the users to README.md for reference, so all the options should be documented there.

ngc_rdma_test.sh failed on IB_WRITE_BW stage

ngc_rdma_test.sh clx-host-109 mlx5_3,mlx5_4 clx-host-108 mlx5_3,mlx5_4

INFO: Each device can use up to 28 cores (may include core 0)
INFO: Each device can use up to 28 cores (may include core 0)
INFO: run ib_write_bw server on clx-host-108: sudo taskset -c 28 ib_write_bw -d mlx5_3 -s 65536 -D 30 -p 10000 -F --report_gbit -b -q 2 --output=bandwidth
INFO: run ib_write_bw server on clx-host-108: sudo taskset -c 84 ib_write_bw -d mlx5_4 -s 65536 -D 30 -p 10001 -F --report_gbit -b -q 2 --output=bandwidth
INFO: run ib_write_bw client on clx-host-109: sudo taskset -c 28 ib_write_bw -d mlx5_3 -D 30 clx-host-108 -s 65536 -p 10000 -F --report_gbit -b -q 2 --out_json --out_json_file=/tmp/perftest_mlx5_3.json &
INFO: run ib_write_bw client on clx-host-109: sudo taskset -c 84 ib_write_bw -d mlx5_4 -D 30 clx-host-108 -s 65536 -p 10001 -F --report_gbit -b -q 2 --out_json --out_json_file=/tmp/perftest_mlx5_4.json &
WARNING: BW peak won't be measured in this run.

                RDMA_Write Bidirectional BW Test

Dual-port : OFF Device : mlx5_3
Number of qps : 2 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
TX depth : 128
CQ Moderation : 1
Mtu : 4096[B]
Link type : Ethernet
GID index : 3
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet

local address: LID 0000 QPN 0x01c2 PSN 0x50e20e RKey 0x0060bd VAddr 0x0075252bfaa000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:101:02
local address: LID 0000 QPN 0x01c3 PSN 0x383de0 RKey 0x0060bd VAddr 0x0075252bfba000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:101:02
remote address: LID 0000 QPN 0x02a3 PSN 0x6a0372 RKey 0x0060bd VAddr 0x00701124590000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:101:01
remote address: LID 0000 QPN 0x02a4 PSN 0x1b4d74 RKey 0x0060bd VAddr 0x007011245a0000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:101:01

#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
Completion with error at client
Failed status 12: wr_id 1 syndrom 0x81
scnt=256, ccnt=0
Failed to complete run_iter_bw function successfully
Completion with error at client
Failed status 12: wr_id 1 syndrom 0x81
scnt=256, ccnt=0
Failed to complete run_iter_bw function successfully
WARNING: BW peak won't be measured in this run.

                RDMA_Write Bidirectional BW Test

Dual-port : OFF Device : mlx5_4
Number of qps : 2 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
TX depth : 128
CQ Moderation : 1
Mtu : 4096[B]
Link type : Ethernet
GID index : 3
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet

local address: LID 0000 QPN 0x02c2 PSN 0x497b66 RKey 0x0420bd VAddr 0x0077839e8d9000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:102:02
local address: LID 0000 QPN 0x02c3 PSN 0xf69858 RKey 0x0420bd VAddr 0x0077839e8e9000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:102:02
remote address: LID 0000 QPN 0x03a3 PSN 0x7a7406 RKey 0x0420bd VAddr 0x007ff8e41df000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:102:01
remote address: LID 0000 QPN 0x03a4 PSN 0xb03478 RKey 0x0420bd VAddr 0x007ff8e41ef000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:102:01

#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
Completion with error at client
Failed status 12: wr_id 1 syndrom 0x81
scnt=256, ccnt=0

Failed to complete run_iter_bw function successfully
Completion with error at client
Failed status 12: wr_id 1 syndrom 0x81
scnt=256, ccnt=0

Failed to complete run_iter_bw function successfully
awk: fatal: cannot open file `/tmp/perftest_mlx5_3.json' for reading: No such file or directory
Device mlx5_3 reached Gb/s (max possible: 400 Gb/s)
Device mlx5_3 didn't reach pass bw rate of 360 Gb/s
awk: fatal: cannot open file `/tmp/perftest_mlx5_4.json' for reading: No such file or directory
Device mlx5_4 reached Gb/s (max possible: 400 Gb/s)
Device mlx5_4 didn't reach pass bw rate of 360 Gb/s
ib_write_bw - Failed for devices: mlx5_3 mlx5_4 <-> mlx5_3 mlx5_4

ngc_tcp_test.sh failed with error "ntuple filter settings"

I got this error when running ngc_tcp_test.sh.

./ngc_tcp_test.sh clx-host-109 mlx5_3,mlx5_4 clx-host-108 mlx5_3,mlx5_4 --duplex=FULL --disable_ro
WARN: apply WA for Sapphire system - please apply the following tuning in BIOS - Socket Configuration > IIO Configuration > Socket# Configuration > PE# Restore RO Write Perf > Enabled instead of using this WA
INFO: Found 1 IPs associated with the server net device 'ens5f0np0'.
INFO: Found 1 IPs associated with the server net device 'ens5f1np1'.
INFO: Found 1 IPs associated with the client net device 'ens5f0np0'.
INFO: Found 1 IPs associated with the client net device 'ens5f1np1'.
INFO: Each device can use up to 28 cores (may include core 0)
INFO: Each device can use up to 28 cores (may include core 0)
INFO: Number of cores per device to be used is 28, if duplex then half of them will act as servers and half as clients.
28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111
INFO: Running Full duplex.
netlink error: requested channel counts are too low for existing ntuple filter settings
netlink error: Invalid argument
Error in function main, on line 136.

Cleaning ntuple rules with commands:

for i in {0..100}; do ethtool -N ens5f0np0 delete $i; done
for i in {0..100}; do ethtool -N ens5f1np1 delete $i; done

will fix this error.
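A slightly more defensive variant of the same workaround, which tolerates rule IDs that do not exist (the device names are the ones from this report; adjust to your setup):

```shell
# Delete ntuple rules 0..100 on both ports, ignoring "rule doesn't exist" errors.
for dev in ens5f0np0 ens5f1np1; do
    for i in $(seq 0 100); do
        ethtool -N "$dev" delete "$i" 2>/dev/null || true
    done
done
```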

ngc_rdma_test.sh may mislead users in Dual-port scenarios

MCX653106A-HDAT_RDMA_Test_dual_unidir.log

ngc_rdma_test.sh may mislead the user about their results (relevant for dual ports or more).

For example:
If we have a ConnectX-6 DX 200GbE dual-port NIC residing on a Gen5 system, we are limited by the NIC's capabilities to reach the maximum BW.
Using ngc_rdma_test.sh, we will get ~100 Gb/s for each port.

The script assumes that each port of the card can reach 200 Gb/s and thus reports to the user that the test failed, although it reached the maximum possible BW.

Not found "/usr/sbin/set_irq_affinity_cpulist.sh" shell script

Hi,
After installing all the required prerequisites on Ubuntu 22.04, I receive the following error when executing ngc_tcp_test.sh:

cannot access '/usr/sbin/set_irq_affinity_cpulist.sh': No such file or directory

Which component should be installed from Ubuntu inbox packages to fix this issue?

Thanks
