
azhpc-diagnostics's Introduction

https://aka.ms/hpcdiag redirects to this repo.

OS versions/builds tested against each VM size family (in the original README, each entry below carries a CI build-status badge):

| OS Version / Build (Linux) | ND | NC | HBv2 | HB | HC | H |
|---|---|---|---|---|---|---|
| Ubuntu 18.04 | NDv2 | NCv2 | HBv2 | | HC44 | |
| Ubuntu 16.04 | NDv1 | NCv1 | | | | H |
| CentOS 8.1 | NDv2 | | HBv2 | HB | | |
| CentOS 7.8 | | NCv3 | | | HC | |
| CentOS 7.7 | NDv2 | | | HB | | |
| CentOS 7.6 | NDv1 | NCv2 | HBv2 | | | |
| CentOS 7.4 | | NCv1 | | | HC | H |
| RHEL 8.2 | | NCv3 | | | HC | |
| RHEL 8.1 | | NCv2 | | | | H |
| RHEL 7.8 | NDv2 | | HBv2 | | | |
| RHEL 7.7 | | NCv2 | | HB | | |
| RHEL 7.6 | NDv2 | | | | HC | |
| RHEL 7.5 | | NCv1 | | | | H |
| RHEL 7.4 | NDv1 | | HBv2 | | | |

Overview

This repo holds a script that, when run on an Azure VM, gathers a variety of diagnostic information for diagnosing common HPC, Infiniband, and GPU problems. It runs a suite of diagnostic tools ranging from built-in Linux tools like lscpu to vendor-specific CLIs like nvidia-smi. The resulting information is packaged into a tarball that can be shared with support engineers to speed up the troubleshooting process.

If you are reading this, you are likely troubleshooting problems on an Azure HPC VM. In that case, we suggest that you contact support (if you have not already) and run this tool on your VM so that you can provide its output to support engineers when prompted.

If you have special privacy requirements concerning logs leaving your VM, make sure to open up the tarball and redact any sensitive information before re-tarring it and handing it off to support engineers.
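For example, one way to do that redaction is sketched below; the tarball name is a placeholder for the actual {vm-id}.{timestamp}.tar.gz that the tool produces.

```bash
# Illustrative only: unpack, review, and re-create the tarball before sharing.
mkdir redacted
tar -xzf vm-id.timestamp.tar.gz -C redacted
# ... edit or delete any files under redacted/ that contain sensitive information ...
tar -czf vm-id.timestamp.redacted.tar.gz -C redacted .
```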

Warning

This tool is meant for diagnosing inactive systems. It runs benchmarks that stress various system devices such as memory, GPU, and Infiniband. It will degrade the performance of, or otherwise interfere with, other active processes that use these resources. It is not advised to use this tool on systems where other jobs are currently running.

To stop the tool while it is running, interrupt the process (e.g., with Ctrl-C); this forces it to reset system state and terminate.

Install and Run

After cloning this repo, no further installation is required. To run the script, run the following command, replacing {repo-root} with the name of this repo's directory on your VM:

sudo bash {repo-root}/Linux/src/gather_azhpc_vm_diagnostics.sh
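For example, assuming the repository is cloned into the current directory from its GitHub location (URL assumed from the aka.ms/hpcdiag redirect):

```bash
git clone https://github.com/Azure/azhpc-diagnostics.git
sudo bash azhpc-diagnostics/Linux/src/gather_azhpc_vm_diagnostics.sh
```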

PerfInsights for Linux Integration

Alternatively, a version of this tool is included in PerfInsights for Linux under the HPC scenario. Running this scenario directly from the Azure Portal is not supported at this time, so PerfInsights must be downloaded and run from the command line; the results of this tool are then included in the generated report.

Usage

This section describes the output of the script and the configuration options available.

Options

| Option (Short) | Option (Long) | Parameters | Description | Example | Example Description |
|---|---|---|---|---|---|
| -d | --dir | Directory name | Specify custom output location | --dir=. | Put the tarball in the current directory |
| -V | --version | | Display version information and exit | --version | Outputs 0.0.1 |
| -h | --help | | Display help text | -h | Outputs the help message |
| -v | --verbose | | Verbose output | --verbose | Enables more verbose terminal output |
| | --gpu-level | 1 (default), 2, or 3 | GPU diagnostics run-level | --gpu-level=3 | Sets dcgmi run-level to 3 |
| | --mem-level | 0 (default) or 1 | Memory diagnostics run-level | --mem-level=1 | Enables the stream benchmark test |
| | --no-update | | Disables auto-update | --no-update | Refrains from checking for updates to the script |
| | --offline | | Prevents internet access | --offline | Skips the stream benchmark and lsvmbus (if not installed) |
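For example, a run that writes the tarball to /tmp and enables the most thorough GPU and memory diagnostics might look like this (a sketch; combine options as needed, with {repo-root} as above):

```bash
sudo bash {repo-root}/Linux/src/gather_azhpc_vm_diagnostics.sh \
    --dir=/tmp --gpu-level=3 --mem-level=1 --no-update
```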

Tarball Structure

Note that not all of these files will be generated on every run. What appears below is the union of all files that could be generated, which depends on the script parameters and VM size:

{vm-id}.{timestamp}.tar.gz
|-- transcript.log (logs for the tool itself)
|-- hpcdiag.err (stderr output from the run, including set -x trace)
|-- VM
|   -- dmesg.log
|   -- waagent.log
|   -- lspci.txt
|   -- lsvmbus.log
|   -- ipconfig.txt
|   -- sysctl.txt
|   -- uname.txt
|   -- dmidecode.txt
|   -- lsmod.txt
|   -- journald.log|syslog|messages
|   -- services
|   -- selinux
|   -- hyperv/kvp_pool*.txt
|-- CPU
|   -- lscpu.txt
|   -- ulimit
|   -- zone_reclaim_mode
|-- Memory
|   -- stream.txt
|-- Infiniband
|   -- ib-vmext.log
|   -- ibstat.out
|   -- ibstatus.out
|   -- ibv_devinfo.out
|   -- pkeys/*
|   -- ethtool.out (ENDURE)
|   -- rate (ENDURE)
|   -- state (ENDURE)
|   -- phys_state (ENDURE)
|-- Nvidia
    -- nvidia-bug-report.log.gz
    -- nvidia-installer.log
    -- nvidia-vmext.log
    -- nvidia-smi.out
    -- nvidia-smi-q.out
    -- nvidia-smi-nvlink.out
    -- nvidia-debugdump.zip (only Nvidia can read)
    -- dcgm-diag-2.log
    -- dcgm-diag-3.log
    -- nvvs.log
    -- stats_*.json
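To see what a given run actually produced, you can list the archive without extracting it (a sketch; the filename is a placeholder for the actual {vm-id}.{timestamp}.tar.gz):

```bash
tar -tzf vm-id.timestamp.tar.gz   # list everything the run collected
tar -xzf vm-id.timestamp.tar.gz   # extract; transcript.log is a good place to start reading
```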

Diagnostic Tools Table

| Tool | Command | Output File(s) | Description | EULA |
|---|---|---|---|---|
| dmesg | dmesg | VM/dmesg.log | Dump of kernel ring buffer | |
| rsyslog | cp syslog\|messages | VM/syslog\|messages | Dump of system log | |
| journald | journalctl | VM/journald.log | Dump of system log | |
| Azure IMDS | curl http://169.254.169.254/metadata/... | transcript.log | VM metadata (ID, region, OS image, etc.) | |
| Azure VM Agent | cp /var/log/waagent.log | VM/waagent.log | Logs from the Azure VM Agent | |
| lspci | lspci | VM/lspci.txt | Info on installed PCI devices | |
| lsvmbus | lsvmbus | VM/lsvmbus.log | Displays devices attached to the Hyper-V VMBus | |
| Hyper-V KVP | custom-made | VM/hyperv/kvp_pool*.txt | Exposes certain Windows Registry data from the Azure host | |
| ipconfig | ipconfig | VM/ipconfig.txt | Checking TCP/IP configuration | |
| sysctl | sysctl | VM/sysctl.txt | Checking kernel parameters | |
| uname | uname | VM/uname.txt | Checking system information | |
| systemd | systemctl | VM/services | Checking for certain active services (tuning only) | |
| selinux | cp /etc/sysconfig/selinux | VM/selinux | Checking for selinux activity (tuning only) | |
| ulimit | cp /etc/security/limits.conf | Memory/ulimit | Checking for default user resource limits (tuning only) | |
| - | cp /proc/sys/vm/zone_reclaim_mode | Memory/zone_reclaim_mode | Checking NUMA memory reclamation policy (tuning only) | |
| dmidecode | dmidecode | VM/dmidecode.txt | DMI table dump (info on hardware components) | |
| lsmod | lsmod | VM/lsmod.txt | List of active kernel modules | |
| lscpu | lscpu | CPU/lscpu.txt | Information about the system CPU architecture | |
| stream | stream_zen_double | Memory/stream.txt | The stream benchmark suite (AMD only) | Stream License |
| ibstat | ibstat | Infiniband/ibstat.out | Mellanox OFED command for checking Infiniband status | MOFED End-User Agreement |
| ibstatus | ibstatus | Infiniband/ibstatus.out | Lightweight Mellanox OFED command for checking Infiniband status | MOFED End-User Agreement |
| ibv_devinfo | ibv_devinfo | Infiniband/ibv_devinfo.out | Mellanox OFED command for checking Infiniband device info | MOFED End-User Agreement |
| Partition Key | cp /sys/class/infiniband/.../pkeys/... | Infiniband/.../pkeys/... | Checks the configured Infiniband partition keys | |
| Infiniband Driver Extension Logs | cp /var/log/azure/ib-vmext-status | Infiniband/ib-vmext-status | Logs from the Infiniband Driver Extension | |
| ethtool | ethtool eth1 | Infiniband/ethtool.out | Status of IB interface on ENDURE VMs | |
| sysfs | cp /sys/class/infiniband/... | Infiniband/rate, state, phys_state | Status of IB interface on ENDURE VMs | |
| NVIDIA Bug Report | nvidia-bug-report.sh | Nvidia/nvidia-bug-report.log.gz | A script that Nvidia has customers run when reporting hardware problems | CUDA EULA, GRID EULA |
| NVIDIA System Management Interface | nvidia-smi | Nvidia/nvidia-smi.out, Nvidia/nvidia-smi-q.out, Nvidia/nvidia-smi-nvlink.out | Checks GPU health and configuration | CUDA EULA, GRID EULA |
| NVIDIA Debug Dump | nvidia-debugdump | Nvidia/nvidia-debugdump.zip | Generates a binary blob for use with Nvidia internal engineering tools | CUDA EULA, GRID EULA |
| NVIDIA Data Center GPU Manager | dcgmi | Nvidia/dcgm-diag-2.log, Nvidia/dcgm-diag-3.log, Nvidia/nvvs.log, Nvidia/stats_*.json | Health monitoring for GPUs in cluster environments | DCGM EULA |
| GPU Driver Extension Logs | cp /var/log/azure/nvidia-vmext-status | Nvidia/nvidia-vmext-status | Logs from the GPU Driver Extension | |

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

azhpc-diagnostics's People

Contributors

anshuljainansja, jithinjosepkl, microsoftopensource, sakshamgupta006, tlcyr4


azhpc-diagnostics's Issues

Errors running hpcdiag on NDv4(40GB A100)

sudo ${HPCDIAG_EXE_DIR}/gather_azhpc_vm_diagnostics.sh -d . --gpu-level=3 --mem-level=1

etc
Running Nvidia GPU Diagnostics
Querying Nvidia GPU Info, writing to {output}/Nvidia/nvidia-smi-q.out
Running plain nvidia-smi, writing to {output}/Nvidia/nvidia-smi.out
Dumping Nvidia GPU internal state to {output}/Nvidia/nvidia-debugdump.zip
ERROR: GetCaptureBufferSize failed, Unknown Error, bufSize: 0x0
ERROR: internal_getDumpBuffer failed, return code: 0x3e7
ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
ERROR: GetCaptureBufferSize failed, Unknown Error, bufSize: 0x0
ERROR: internal_getDumpBuffer failed, return code: 0x3e7
ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
ERROR: GetCaptureBufferSize failed, Unknown Error, bufSize: 0x0
ERROR: internal_getDumpBuffer failed, return code: 0x3e7
ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
ERROR: GetCaptureBufferSize failed, Unknown Error, bufSize: 0x0
ERROR: internal_getDumpBuffer failed, return code: 0x3e7
ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
ERROR: GetCaptureBufferSize failed, Unknown Error, bufSize: 0x0
ERROR: internal_getDumpBuffer failed, return code: 0x3e7
ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
ERROR: GetCaptureBufferSize failed, Unknown Error, bufSize: 0x0
ERROR: internal_getDumpBuffer failed, return code: 0x3e7
ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
ERROR: GetCaptureBufferSize failed, Unknown Error, bufSize: 0x0
ERROR: internal_getDumpBuffer failed, return code: 0x3e7
ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
ERROR: GetCaptureBufferSize failed, Unknown Error, bufSize: 0x0
ERROR: internal_getDumpBuffer failed, return code: 0x3e7
ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
Temporarily starting nv-hostengine

etc

Checking for common issues
Checking for GPUs with corrupted infoROM
Checking for GPUs with row remapping failures
Checking for GPUs that don't appear in nvidia-smi
Checking PCIe speed training for GPUs
Could not find pkey 0 for device mlx5_ib0
Could not find pkey 1 for device mlx5_ib0
Could not find pkey 0 for device mlx5_ib1
Could not find pkey 1 for device mlx5_ib1
Could not find pkey 0 for device mlx5_ib2
Could not find pkey 1 for device mlx5_ib2
Could not find pkey 0 for device mlx5_ib3
Could not find pkey 1 for device mlx5_ib3
Could not find pkey 0 for device mlx5_ib4
Could not find pkey 1 for device mlx5_ib4
Could not find pkey 0 for device mlx5_ib5
Could not find pkey 1 for device mlx5_ib5
Could not find pkey 0 for device mlx5_ib6
Could not find pkey 1 for device mlx5_ib6
Could not find pkey 0 for device mlx5_ib7
Could not find pkey 1 for device mlx5_ib7

Are these errors normal on NDv4?
I ran this on multiple NDv4 VMs and got the same errors shown above.

Package Manager Logs

It would be useful for the tool to collect package manager logs such as dpkg.log or yum.log.

lsmod log

It would be useful to collect lsmod output.

Give instructions for GPU pages pending retirement

If we see in nvidia-smi or another GPU check that there are currently pages with ECC errors that are pending retirement, we can print a message to let the user know that this is fixable with a reboot and doesn't require hardware servicing.
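As a rough illustration of the kind of check being proposed (not the tool's actual implementation), nvidia-smi can report pages pending retirement directly:

```bash
# Non-zero values here indicate pages that will be retired on the next reboot.
nvidia-smi --query-gpu=index,retired_pages.pending --format=csv
```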

Print warning for GPU power usage check failures

When the nvidia-smi output shows ERR! as the power usage, like this:

+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  On   | 00008ED0:00:00.0 Off |                    0 |
| N/A   29C    P0   ERR! / 250W |    593MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

the tool should point it out and explain that it's typically a transient issue.
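A minimal detection sketch (not the tool's actual check) could simply scan the saved nvidia-smi output for the ERR! marker:

```bash
# Hypothetical check against the collected output file.
if grep -q 'ERR!' Nvidia/nvidia-smi.out; then
    echo "Warning: nvidia-smi reported ERR! (e.g. for power draw); this is usually transient."
fi
```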

Robust System Log Collection

Currently, the tool takes /var/log/syslog and dmesg, which captures the system logs we want on Ubuntu. However, RHEL-based distros tend to put logs into /var/log/messages instead of /var/log/syslog, and newer distros (including Ubuntu) are switching over to logging using journald. The tool should be intelligent enough to gather the logs we need in any of these cases.
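A sketch of the kind of fallback logic this implies (illustrative, not the script's current code):

```bash
# Prefer journald where available, otherwise fall back to the distro's flat log file.
if command -v journalctl >/dev/null 2>&1 && journalctl --no-pager -n1 >/dev/null 2>&1; then
    journalctl --no-pager > VM/journald.log
elif [ -f /var/log/syslog ]; then
    cp /var/log/syslog VM/syslog        # Debian/Ubuntu
elif [ -f /var/log/messages ]; then
    cp /var/log/messages VM/messages    # RHEL/CentOS
fi
```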

Check for inactive NVLinks

There should be a check for GPUs experiencing this type of issue:

nvidia-smi nvlink -s
....
GPU 3: Tesla V100-SXM2-32GB (UUID: ...)
         Link 0: 25.781 GB/s
         Link 1: 25.781 GB/s
         Link 2: 25.781 GB/s
         Link 3: <inactive>
         Link 4: 25.781 GB/s
         Link 5: <inactive>
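A minimal detection sketch for this case (illustrative, not the tool's actual check):

```bash
# Count links reported as <inactive> in the NVLink status output.
inactive=$(nvidia-smi nvlink -s | grep -c '<inactive>')
if [ "$inactive" -gt 0 ]; then
    echo "Warning: $inactive NVLink(s) reported as <inactive>"
fi
```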

Detecting Known Mellanox Firmware Issue

There is a known issue affecting VMs with ConnectX-5 cards that prevents IB devices from being found and needs to be fixed with a firmware update.

The tool should be able to detect that particular issue and alert its user to contact support for assistance.

Offline Mode

A flag to set at runtime to instruct the script to skip any activities that require Internet access.

Intended for use in sensitive or air-gapped environments.

Update support matrix please

Please, can you update the support matrix?

  • is ubuntu 20.04 supported?
  • and the NCv3?
  • is alma hpc 8.5 supported?
  • are the HBv3 supported?
  • H series are retired

thank you!

Include Version Number in output

We need a way of verifying which version of the tool was run to produce a given output.

This is important for determining whether oddities in the output are because of the system it was run on or defects in an outdated version of the script (or current defects).

Auto-Update

The tool should be able to attempt to query this repo for any updates and apply them before running.

This is to avoid usage of out-of-date versions of the tool in scenarios where it has been installed onto a machine for a long time.

PKEY path wrong

I was investigating an incident involving IB & PKEYs. In the collected data files, the information in /$node/Infiniband/$device/pkey0 and pkey1 is missing:

No pkey found

I checked other files; the IB drivers are installed correctly, so it's not likely that these two files are simply missing from the VM.

Then I checked the diagnostics.sh and figured that the file path might be wrong.

The correct path for pkey (in my case) is:

/sys/class/infiniband/mlx5_0/ports/1/pkeys/0

The script is looking for:

Line 254 if [ -f "$device/ports/pkeys/0" ]; then
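A sketch of a corrected lookup that includes the port number (illustrative; $device is the variable the script already uses, while $OUTPUT_DIR is a hypothetical destination introduced only for this example):

```bash
# Iterate over every port of the device rather than assuming "$device/ports/pkeys/0".
for port in "$device"/ports/*; do
    if [ -f "$port/pkeys/0" ]; then
        cp "$port/pkeys/0" "$OUTPUT_DIR/pkeys/0"
    fi
done
```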

Nvidia Bug Report

The output of the tool should include all the necessary files to submit a bug report to Nvidia.

[Feature requests] Add PKEYs out for IB, AN, SSH

If not already present, output of PKEYS on the VM:
cat /sys/class/infiniband/mlx5_0/ports/1/pkeys/*

Is there a direct test to check if Accelerated Networking is enabled vs not? Not sure if this is in the VM metadata. Though indirectly, this is discernible through lspci, ibv_devinfo etc.

Additionally (unrelated to this ask), how can we add info if the VM is publicly accessible (over SSH, public IP, port 22 open)? This may be relevant to know if we can have direct access to VM (with customer consent) or will need to coordinate VM access with customer.
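As one possible (unconfirmed) heuristic for the Accelerated Networking question: an AN interface shows up under /sys/class/infiniband as an mlx5 device whose port link layer is Ethernet rather than InfiniBand, as the ibstatus output in a later issue also illustrates. A sketch:

```bash
# Heuristic only: report mlx5 devices whose port link layer is Ethernet (likely AN interfaces).
for dev in /sys/class/infiniband/*; do
    for port in "$dev"/ports/*; do
        if [ "$(cat "$port/link_layer" 2>/dev/null)" = "Ethernet" ]; then
            echo "Possible Accelerated Networking interface: $(basename "$dev")"
        fi
    done
done
```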

More Reliable Checks for Drivers

The current implementation of checking for GPU/IB drivers relies on checking to see if CLI tools that come installed with the drivers are functioning (e.g. nvidia-smi, ibstat).

Notably, this relies on these binaries being on the PATH, which can be a source of confusion in containerized environments, but in general it's not as direct a method as it could be.

For instance, checking for known kernel modules via lsmod could give us a clearer picture of what drivers have been successfully installed.
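A minimal sketch of the kind of module-based check being suggested (illustrative):

```bash
# Check for the driver kernel modules directly instead of relying on CLI tools being on PATH.
lsmod | grep -q '^nvidia '  && echo "NVIDIA GPU driver module loaded"
lsmod | grep -q '^mlx5_ib ' && echo "Mellanox InfiniBand driver module loaded"
```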

CycleCloud Logs

To support CycleCloud deployments, we need to collect any logs that we can that are related to CycleCloud and whichever scheduler is in use.

Some sample CycleCloud Logs:

  • /opt/cycle/jetpack/logs/chef-client.log
  • /opt/cycle/jetpack/system/chef/cache/chef-stacktrace.out
  • /opt/cycle/jetpack/logs/cluster-init/{PROJECT_NAME}

Timeouts for DCGM

The script should recover from dcgmi hanging by enforcing an appropriate timeout on each run-level and recovering gracefully.
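A minimal sketch of such a guard, using coreutils timeout (the 30-minute budget is a placeholder, not a committed value):

```bash
# Kill the diagnostic if it exceeds an arbitrary 30-minute budget and keep going.
if ! timeout 30m dcgmi diag -r 3 > Nvidia/dcgm-diag-3.log; then
    echo "dcgmi diag timed out or failed; continuing with the remaining checks" >&2
fi
```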

ACTION REQUIRED: Microsoft needs this private repository to complete compliance info

There are open compliance tasks that need to be reviewed for your azhpc-diagnostics repo.

Action required: 4 compliance tasks

To bring this repository to the standard required for 2021, we require administrators of this and all Microsoft GitHub repositories to complete a small set of tasks within the next 60 days. This is critical work to ensure the compliance and security of your Azure GitHub organization.

Please take a few minutes to complete the tasks at: https://repos.opensource.microsoft.com/orgs/Azure/repos/azhpc-diagnostics/compliance

  • The GitHub AE (GitHub inside Microsoft) migration survey has not been completed for this private repository
  • No Service Tree mapping has been set for this repo. If this team does not use Service Tree, they can also opt-out of providing Service Tree data in the Compliance tab.
  • No repository maintainers are set. The Open Source Maintainers are the decision-makers and actionable owners of the repository, irrespective of administrator permission grants on GitHub.
  • Classification of the repository as production/non-production is missing in the Compliance tab.

You can close this work item once you have completed the compliance tasks, or it will automatically close within a day of taking action.

If you no longer need this repository, it might be quickest to delete the repo, too.

GitHub inside Microsoft program information

More information about GitHub inside Microsoft and the new GitHub AE product can be found at https://aka.ms/gim or by contacting [email protected]

FYI: current admins at Microsoft include @JonShelley, @lostern, @jithinjosepkl, @souvik-de, @AnnaDaly, @edwardsp, @rluanma, @vermagit, @tlcyr4, @sakshamgupta006, @anshuljainansja, @alfpark

Output File Size Cap

There should be a way to limit the output file size.

In some cases there are lots of system logs, leading to an overall output size of over 70MB. To avoid creating a huge payload to transfer around in cases like this and even more extreme ones, we need a method for controlling the overall output size.
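One possible approach (a sketch, not a committed design) is to truncate the largest raw logs before archiving:

```bash
# Keep only the last 10 MiB of very large logs (threshold is a placeholder).
MAX_BYTES=$((10 * 1024 * 1024))
for f in VM/syslog VM/messages VM/journald.log; do
    if [ -f "$f" ] && [ "$(stat -c%s "$f")" -gt "$MAX_BYTES" ]; then
        tail -c "$MAX_BYTES" "$f" > "$f.tmp" && mv "$f.tmp" "$f"
    fi
done
```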

Non-SRIOV IB Diagnostics

The tool does not currently support collecting Infiniband diagnostics on any non-SRIOV enabled SKUs.

Frontpage for Outputs

There should be a file (general.log is currently the closest thing to it) that holds the most general info or highlights about the system.

Support for running azhpc-diagnostics in kubernetes (and AKS)

As a user of N-series VMs and H-series VMs in AKS, I want the ability to run the azhpc-diagnostics tool in a kubernetes environment, so that I can help the Azure support team gather the diagnostic data they need to root cause and mitigate platform issues encountered on these VMs.

Azure users are sometimes asked by Azure support to run the azhpc-diagnostics tool to gather data for them when an error condition arises that could be due to the platform. However, it's not clear how to run this tool within a kubernetes environment. This issue is a request to support running this tool in kubernetes (and thereby AKS).

Getting IB Link Info When ib_umad isn't loaded

The ibstat command relies on the ib_umad kernel module being loaded to give full output.

In the event that it's not loaded, the ibstatus command can be used instead to still get some of the information that ibstat provides, in particular, link info.

We'd like to be able to adapt to missing ib_umad, which is not necessarily loaded when not running on the official Azure HPC marketplace images, and still obtain useful info such as link info and firmware version.

Both of these commands come from the Mellanox InfiniBand Fabric Utilities.
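A sketch of the fallback being described (illustrative, not the script's current behavior):

```bash
# Use ibstat when ib_umad is loaded; otherwise fall back to ibstatus for link info.
if lsmod | grep -q '^ib_umad '; then
    ibstat > Infiniband/ibstat.out
else
    ibstatus > Infiniband/ibstatus.out
fi
```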

ibstat fails when Accelerated Networking is enabled

When Accelerated Networking is enabled, ibstat, run with no parameters, can fail due to mistaking the Ethernet interface for an IB one. When this happens, we don't get any information from ibstat:

~$ ls /sys/class/infiniband/mlx5_
mlx5_an0/  mlx5_ib0/  mlx5_ib1/  mlx5_ib2/  mlx5_ib3/  mlx5_ib4/  mlx5_ib5/  mlx5_ib6/  mlx5_ib7/
~$ ibstat
ibpanic: [8371] main: stat of IB device 'mlx5_an0' failed: No such file or directory
~$ ibstat mlx5_ib0
CA 'mlx5_ib0'
        CA type: MT4124
        Number of ports: 1
        Firmware version: 20.28.4000
        Hardware version: 0
        Node GUID: 0x00155dfffe34072b
        System image GUID: 0x0c42a10300a58684
        Port 1:
                State: Initializing
                Physical state: LinkUp
                Rate: 200
                Base lid: 65535
                LMC: 0
                SM lid: 361
                Capability mask: 0x2651ec48
                Port GUID: 0x00155dfffd34072b
                Link layer: InfiniBand

This is fixable by using the ibstatus command, also from the Mellanox Fabric Utilities, which does not fail when an AN interface is present. We can even use it to determine which interfaces are actually InfiniBand ones, which would allow us to get the full output of ibstat by running it against one interface at a time.

~$ ibstatus
Infiniband device 'mlx5_an0' port 1 status:
default gid: unknown
base lid: 0x0
sm lid: 0x0
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 40 Gb/sec (4X QDR)
link_layer: Ethernet

Infiniband device 'mlx5_ib0' port 1 status:
default gid: fe80:0000:0000:0000:0015:5dff:fd34:072b
base lid: 0xffff
sm lid: 0x169
state: 2: INIT
phys state: 5: LinkUp
rate: 200 Gb/sec (4X HDR)
link_layer: InfiniBand

...
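A sketch of the per-device approach described above (illustrative; this reads the sysfs link_layer files rather than parsing ibstatus output, but the intent is the same):

```bash
# Run ibstat only against devices whose link layer is InfiniBand, skipping AN interfaces.
for dev in /sys/class/infiniband/*; do
    name=$(basename "$dev")
    if grep -qx 'InfiniBand' "$dev"/ports/*/link_layer 2>/dev/null; then
        ibstat "$name" >> Infiniband/ibstat.out
    fi
done
```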

Collect HyperV KVPs

Hyper-V KVP data contains several useful pieces of information, and it would be good to collect it through the diagnostic script.
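For reference, the Linux KVP daemon typically keeps its pool files under /var/lib/hyperv (path assumed; the records are fixed-width binary, so this is only a rough dump for inspection):

```bash
# Rough, human-readable dump of the Hyper-V key/value pools; path and format are assumptions.
for pool in /var/lib/hyperv/.kvp_pool_*; do
    echo "== $pool =="
    strings "$pool"
done
```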
