
azhpc-diagnostics's Introduction

https://aka.ms/hpcdiag redirects to this repo.

OS versions/builds tested against each VM size family (in the original README, each entry below carries a CI build-status badge):

| OS Version / Build (Linux) | ND | NC | HBv2 | HB | HC | H |
|---|---|---|---|---|---|---|
| Ubuntu 18.04 | NDv2 | NCv2 | HBv2 | | HC44 | |
| Ubuntu 16.04 | NDv1 | NCv1 | | | | H |
| CentOS 8.1 | NDv2 | | HBv2 | HB | | |
| CentOS 7.8 | | NCv3 | | | HC | |
| CentOS 7.7 | NDv2 | | | HB | | |
| CentOS 7.6 | NDv1 | NCv2 | HBv2 | | | |
| CentOS 7.4 | | NCv1 | | | HC | H |
| RHEL 8.2 | | NCv3 | | | HC | |
| RHEL 8.1 | | NCv2 | | | | H |
| RHEL 7.8 | NDv2 | | HBv2 | | | |
| RHEL 7.7 | | NCv2 | | HB | | |
| RHEL 7.6 | NDv2 | | | | HC | |
| RHEL 7.5 | | NCv1 | | | | H |
| RHEL 7.4 | NDv1 | | HBv2 | | | |

Overview

This repo holds a script that, when run on an Azure VM, gathers a variety of diagnostic information for diagnosing common HPC, Infiniband, and GPU problems. It runs a suite of diagnostic tools ranging from built-in Linux tools like lscpu to vendor-specific CLIs like nvidia-smi. The resulting information is packaged into a tarball that can be shared with support engineers to speed up the troubleshooting process.

If you are reading this, you are likely troubleshooting problems on an Azure HPC VM. In that case, we suggest that you contact support (if you have not already) and run this tool on your VM so that you can provide its output to support engineers when prompted.

If you have special privacy requirements concerning logs leaving your VM, make sure to open up the tarball and redact any sensitive information before re-tarring it and handing it off to support engineers.
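For example, one way to do that redaction is sketched below; the tarball name is a placeholder for the actual {vm-id}.{timestamp}.tar.gz that the tool produces.

```bash
# Illustrative only: unpack, review, and re-create the tarball before sharing.
mkdir redacted
tar -xzf vm-id.timestamp.tar.gz -C redacted
# ... edit or delete any files under redacted/ that contain sensitive information ...
tar -czf vm-id.timestamp.redacted.tar.gz -C redacted .
```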

Warning

This tool is meant for diagnosing inactive systems. It runs benchmarks that stress various system devices such as memory, GPU, and Infiniband. It will degrade the performance of, or otherwise interfere with, other active processes that use these resources. It is not advised to use this tool on systems where other jobs are currently running.

To stop the tool while it is running, interrupt the process (e.g., with Ctrl-C); this forces it to reset system state and terminate.

Install and Run

After cloning this repo, no further installation is required. To run the script, run the following command, replacing {repo-root} with the name of this repo's directory on your VM:

sudo bash {repo-root}/Linux/src/gather_azhpc_vm_diagnostics.sh
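For example, assuming the repository is cloned into the current directory from its GitHub location (URL assumed from the aka.ms/hpcdiag redirect):

```bash
git clone https://github.com/Azure/azhpc-diagnostics.git
sudo bash azhpc-diagnostics/Linux/src/gather_azhpc_vm_diagnostics.sh
```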

PerfInsights for Linux Integration

Alternatively, a version of this tool is included in PerfInsights for Linux under the HPC scenario. Running this scenario directly from the Azure Portal is not supported at this time, so PerfInsights must be downloaded and run from the command line; the results of this tool are then included in the generated report.

Usage

This section describes the output of the script and the configuration options available.

Options

| Option (Short) | Option (Long) | Parameters | Description | Example | Example Description |
|---|---|---|---|---|---|
| -d | --dir | Directory name | Specify custom output location | --dir=. | Put the tarball in the current directory |
| -V | --version | | Display version information and exit | --version | Outputs 0.0.1 |
| -h | --help | | Display help text | -h | Outputs the help message |
| -v | --verbose | | Verbose output | --verbose | Enables more verbose terminal output |
| | --gpu-level | 1 (default), 2, or 3 | GPU diagnostics run-level | --gpu-level=3 | Sets dcgmi run-level to 3 |
| | --mem-level | 0 (default) or 1 | Memory diagnostics run-level | --mem-level=1 | Enables the stream benchmark test |
| | --no-update | | Disables auto-update | --no-update | Refrains from checking for updates to the script |
| | --offline | | Prevents internet access | --offline | Skips the stream benchmark and lsvmbus (if not installed) |
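For example, a run that writes the tarball to /tmp and enables the most thorough GPU and memory diagnostics might look like this (a sketch; combine options as needed, with {repo-root} as above):

```bash
sudo bash {repo-root}/Linux/src/gather_azhpc_vm_diagnostics.sh \
    --dir=/tmp --gpu-level=3 --mem-level=1 --no-update
```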

Tarball Structure

Note that not all of these files will be generated on every run. What appears below is the union of all files that could be generated, which depends on the script parameters and VM size:

{vm-id}.{timestamp}.tar.gz
|-- transcript.log (logs for the tool itself)
|-- hpcdiag.err (stderr output from the run, including set -x trace)
|-- VM
|   -- dmesg.log
|   -- waagent.log
|   -- lspci.txt
|   -- lsvmbus.log
|   -- ipconfig.txt
|   -- sysctl.txt
|   -- uname.txt
|   -- dmidecode.txt
|   -- lsmod.txt
|   -- journald.log|syslog|messages
|   -- services
|   -- selinux
|   -- hyperv/kvp_pool*.txt
|-- CPU
|   -- lscpu.txt
|   -- ulimit
|   -- zone_reclaim_mode
|-- Memory
|   -- stream.txt
|-- Infiniband
|   -- ib-vmext.log
|   -- ibstat.out
|   -- ibstatus.out
|   -- ibv_devinfo.out
|   -- pkeys/*
|   -- ethtool.out (ENDURE)
|   -- rate (ENDURE)
|   -- state (ENDURE)
|   -- phys_state (ENDURE)
|-- Nvidia
    -- nvidia-bug-report.log.gz
    -- nvidia-installer.log
    -- nvidia-vmext.log
    -- nvidia-smi.out
    -- nvidia-smi-q.out
    -- nvidia-smi-nvlink.out
    -- nvidia-debugdump.zip (only Nvidia can read)
    -- dcgm-diag-2.log
    -- dcgm-diag-3.log
    -- nvvs.log
    -- stats_*.json
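To see what a given run actually produced, you can list the archive without extracting it (a sketch; the filename is a placeholder for the actual {vm-id}.{timestamp}.tar.gz):

```bash
tar -tzf vm-id.timestamp.tar.gz   # list everything the run collected
tar -xzf vm-id.timestamp.tar.gz   # extract; transcript.log is a good place to start reading
```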

Diagnostic Tools Table

| Tool | Command | Output File(s) | Description | EULA |
|---|---|---|---|---|
| dmesg | dmesg | VM/dmesg.log | Dump of kernel ring buffer | |
| rsyslog | cp syslog\|messages | VM/syslog\|messages | Dump of system log | |
| journald | journalctl | VM/journald.log | Dump of system log | |
| Azure IMDS | curl http://169.254.169.254/metadata/... | transcript.log | VM metadata (ID, region, OS image, etc.) | |
| Azure VM Agent | cp /var/log/waagent.log | VM/waagent.log | Logs from the Azure VM Agent | |
| lspci | lspci | VM/lspci.txt | Info on installed PCI devices | |
| lsvmbus | lsvmbus | VM/lsvmbus.log | Displays devices attached to the Hyper-V VMBus | |
| Hyper-V KVP | custom-made | VM/hyperv/kvp_pool*.txt | Exposes certain Windows Registry data from the Azure host | |
| ipconfig | ipconfig | VM/ipconfig.txt | Checking TCP/IP configuration | |
| sysctl | sysctl | VM/sysctl.txt | Checking kernel parameters | |
| uname | uname | VM/uname.txt | Checking system information | |
| systemd | systemctl | VM/services | Checking for certain active services (tuning only) | |
| selinux | cp /etc/sysconfig/selinux | VM/selinux | Checking for selinux activity (tuning only) | |
| ulimit | cp /etc/security/limits.conf | Memory/ulimit | Checking for default user resource limits (tuning only) | |
| - | cp /proc/sys/vm/zone_reclaim_mode | Memory/zone_reclaim_mode | Checking NUMA memory reclamation policy (tuning only) | |
| dmidecode | dmidecode | VM/dmidecode.txt | DMI table dump (info on hardware components) | |
| lsmod | lsmod | VM/lsmod.txt | List of active kernel modules | |
| lscpu | lscpu | CPU/lscpu.txt | Information about the system CPU architecture | |
| stream | stream_zen_double | Memory/stream.txt | The stream benchmark suite (AMD only) | Stream License |
| ibstat | ibstat | Infiniband/ibstat.out | Mellanox OFED command for checking Infiniband status | MOFED End-User Agreement |
| ibstatus | ibstatus | Infiniband/ibstatus.out | Lightweight Mellanox OFED command for checking Infiniband status | MOFED End-User Agreement |
| ibv_devinfo | ibv_devinfo | Infiniband/ibv_devinfo.out | Mellanox OFED command for checking Infiniband device info | MOFED End-User Agreement |
| Partition Key | cp /sys/class/infiniband/.../pkeys/... | Infiniband/.../pkeys/... | Checks the configured Infiniband partition keys | |
| Infiniband Driver Extension Logs | cp /var/log/azure/ib-vmext-status | Infiniband/ib-vmext-status | Logs from the Infiniband Driver Extension | |
| ethtool | ethtool eth1 | Infiniband/ethtool.out | Status of IB interface on ENDURE VMs | |
| sysfs | cp /sys/class/infiniband/... | Infiniband/rate, state, phys_state | Status of IB interface on ENDURE VMs | |
| NVIDIA Bug Report | nvidia-bug-report.sh | Nvidia/nvidia-bug-report.log.gz | A script that Nvidia has customers run when reporting hardware problems | CUDA EULA, GRID EULA |
| NVIDIA System Management Interface | nvidia-smi | Nvidia/nvidia-smi.out, Nvidia/nvidia-smi-q.out, Nvidia/nvidia-smi-nvlink.out | Checks GPU health and configuration | CUDA EULA, GRID EULA |
| NVIDIA Debug Dump | nvidia-debugdump | Nvidia/nvidia-debugdump.zip | Generates a binary blob for use with Nvidia internal engineering tools | CUDA EULA, GRID EULA |
| NVIDIA Data Center GPU Manager | dcgmi | Nvidia/dcgm-diag-2.log, Nvidia/dcgm-diag-3.log, Nvidia/nvvs.log, Nvidia/stats_*.json | Health monitoring for GPUs in cluster environments | DCGM EULA |
| GPU Driver Extension Logs | cp /var/log/azure/nvidia-vmext-status | Nvidia/nvidia-vmext-status | Logs from the GPU Driver Extension | |

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

azhpc-diagnostics's People

Contributors

anshuljainansja, jithinjosepkl, microsoftopensource, sakshamgupta006, tlcyr4


azhpc-diagnostics's Issues

Errors running hpcdiag on NDv4(40GB A100)

sudo ${HPCDIAG_EXE_DIR}/gather_azhpc_vm_diagnostics.sh -d . --gpu-level=3 --mem-level=1

etc
Running Nvidia GPU Diagnostics
Querying Nvidia GPU Info, writing to {output}/Nvidia/nvidia-smi-q.out
Running plain nvidia-smi, writing to {output}/Nvidia/nvidia-smi.out
Dumping Nvidia GPU internal state to {output}/Nvidia/nvidia-debugdump.zip
ERROR: GetCaptureBufferSize failed, Unknown Error, bufSize: 0x0
ERROR: internal_getDumpBuffer failed, return code: 0x3e7
ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
ERROR: GetCaptureBufferSize failed, Unknown Error, bufSize: 0x0
ERROR: internal_getDumpBuffer failed, return code: 0x3e7
ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
ERROR: GetCaptureBufferSize failed, Unknown Error, bufSize: 0x0
ERROR: internal_getDumpBuffer failed, return code: 0x3e7
ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
ERROR: GetCaptureBufferSize failed, Unknown Error, bufSize: 0x0
ERROR: internal_getDumpBuffer failed, return code: 0x3e7
ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
ERROR: GetCaptureBufferSize failed, Unknown Error, bufSize: 0x0
ERROR: internal_getDumpBuffer failed, return code: 0x3e7
ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
ERROR: GetCaptureBufferSize failed, Unknown Error, bufSize: 0x0
ERROR: internal_getDumpBuffer failed, return code: 0x3e7
ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
ERROR: GetCaptureBufferSize failed, Unknown Error, bufSize: 0x0
ERROR: internal_getDumpBuffer failed, return code: 0x3e7
ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
ERROR: GetCaptureBufferSize failed, Unknown Error, bufSize: 0x0
ERROR: internal_getDumpBuffer failed, return code: 0x3e7
ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
Temporarily starting nv-hostengine

etc

Checking for common issues
Checking for GPUs with corrupted infoROM
Checking for GPUs with row remapping failures
Checking for GPUs that don't appear in nvidia-smi
Checking PCIe speed training for GPUs
Could not find pkey 0 for device mlx5_ib0
Could not find pkey 1 for device mlx5_ib0
Could not find pkey 0 for device mlx5_ib1
Could not find pkey 1 for device mlx5_ib1
Could not find pkey 0 for device mlx5_ib2
Could not find pkey 1 for device mlx5_ib2
Could not find pkey 0 for device mlx5_ib3
Could not find pkey 1 for device mlx5_ib3
Could not find pkey 0 for device mlx5_ib4
Could not find pkey 1 for device mlx5_ib4
Could not find pkey 0 for device mlx5_ib5
Could not find pkey 1 for device mlx5_ib5
Could not find pkey 0 for device mlx5_ib6
Could not find pkey 1 for device mlx5_ib6
Could not find pkey 0 for device mlx5_ib7
Could not find pkey 1 for device mlx5_ib7

Are these errors normal on NDv4?
I ran this on multiple NDv4 VMs and got the same errors shown above.

Package Manager Logs

It would be useful for the tool to collect package manager logs such as dpkg.log or yum.log.

lsmod log

It would be useful to collect lsmod output.

Give instructions for GPU pages pending retirement

If we see in nvidia-smi or another GPU check that there are currently pages with ECC errors that are pending retirement, we can print a message to let the user know that this is fixable with a reboot and doesn't require hardware servicing.
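As a rough illustration of the kind of check being proposed (not the tool's actual implementation), nvidia-smi can report pages pending retirement directly:

```bash
# Non-zero values here indicate pages that will be retired on the next reboot.
nvidia-smi --query-gpu=index,retired_pages.pending --format=csv
```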

Print warning for GPU power usage check failures

When the nvidia-smi output shows ERR! as the power usage, like this:

+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  On   | 00008ED0:00:00.0 Off |                    0 |
| N/A   29C    P0   ERR! / 250W |    593MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

the tool should point it out and explain that it's typically a transient issue.
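A minimal detection sketch (not the tool's actual check) could simply scan the saved nvidia-smi output for the ERR! marker:

```bash
# Hypothetical check against the collected output file.
if grep -q 'ERR!' Nvidia/nvidia-smi.out; then
    echo "Warning: nvidia-smi reported ERR! (e.g. for power draw); this is usually transient."
fi
```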

Robust System Log Collection

Currently, the tool takes /var/log/syslog and dmesg, which captures the system logs we want on Ubuntu. However, RHEL-based distros tend to put logs into /var/log/messages instead of /var/log/syslog, and newer distros (including Ubuntu) are switching over to logging using journald. The tool should be intelligent enough to gather the logs we need in any of these cases.
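A sketch of the kind of fallback logic this implies (illustrative, not the script's current code):

```bash
# Prefer journald where available, otherwise fall back to the distro's flat log file.
if command -v journalctl >/dev/null 2>&1 && journalctl --no-pager -n1 >/dev/null 2>&1; then
    journalctl --no-pager > VM/journald.log
elif [ -f /var/log/syslog ]; then
    cp /var/log/syslog VM/syslog        # Debian/Ubuntu
elif [ -f /var/log/messages ]; then
    cp /var/log/messages VM/messages    # RHEL/CentOS
fi
```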

Check for inactive NVLinks

There should be a check for GPUs experiencing this type of issue:

nvidia-smi nvlink -s
....
GPU 3: Tesla V100-SXM2-32GB (UUID: ...)
         Link 0: 25.781 GB/s
         Link 1: 25.781 GB/s
         Link 2: 25.781 GB/s
         Link 3: <inactive>
         Link 4: 25.781 GB/s
         Link 5: <inactive>
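A minimal detection sketch for this case (illustrative, not the tool's actual check):

```bash
# Count links reported as <inactive> in the NVLink status output.
inactive=$(nvidia-smi nvlink -s | grep -c '<inactive>')
if [ "$inactive" -gt 0 ]; then
    echo "Warning: $inactive NVLink(s) reported as <inactive>"
fi
```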

Detecting Known Mellanox Firmware Issue

There is a known issue affecting VMs with ConnectX-5 cards that prevents IB devices from being found and needs to be fixed with a firmware update.

The tool should be able to detect that particular issue and alert its user to contact support for assistance.

Offline Mode

A flag to set at runtime to instruct the script to skip any activities that require Internet access.

Intended for use in sensitive or air-gapped environments.

Update support matrix please

Please, can you update the support matrix?

  • is ubuntu 20.04 supported?
  • and the NCv3?
  • is alma hpc 8.5 supported?
  • are the HBv3 supported?
  • H series are retired

thank you!

Include Version Number in output

We need a way of verifying which version of the tool was run to produce a given output.

This is important for determining whether oddities in the output are because of the system it was run on or defects in an outdated version of the script (or current defects).

Auto-Update

The tool should be able to attempt to query this repo for any updates and apply them before running.

This is to avoid usage of out-of-date versions of the tool in scenarios where it has been installed onto a machine for a long time.

PKEY path wrong

I was investigating an incident involving IB & PKEYs. In the collected data files, the information in /$node/Infiniband/$device/pkey0 and pkey1 is missing:

No pkey found

I checked other files; the IB drivers are installed correctly, so it's not likely that these two files are simply missing from the VM.

Then I checked the diagnostics.sh and figured that the file path might be wrong.

The correct path for pkey (in my case) is:

/sys/class/infiniband/mlx5_0/ports/1/pkeys/0

The script is looking for:

Line 254 if [ -f "$device/ports/pkeys/0" ]; then
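A sketch of a corrected lookup that includes the port number (illustrative; $device is the variable the script already uses, while $OUTPUT_DIR is a hypothetical destination introduced only for this example):

```bash
# Iterate over every port of the device rather than assuming "$device/ports/pkeys/0".
for port in "$device"/ports/*; do
    if [ -f "$port/pkeys/0" ]; then
        cp "$port/pkeys/0" "$OUTPUT_DIR/pkeys/0"
    fi
done
```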

Nvidia Bug Report

The output of the tool should include all the necessary files to submit a bug report to Nvidia.

[Feature requests] Add PKEYs out for IB, AN, SSH

If not already present, output of PKEYS on the VM:
cat /sys/class/infiniband/mlx5_0/ports/1/pkeys/*

Is there a direct test to check if Accelerated Networking is enabled vs not? Not sure if this is in the VM metadata. Though indirectly, this is discernible through lspci, ibv_devinfo etc.

Additionally (unrelated to this ask), how can we add info if the VM is publicly accessible (over SSH, public IP, port 22 open)? This may be relevant to know if we can have direct access to VM (with customer consent) or will need to coordinate VM access with customer.
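As one possible (unconfirmed) heuristic for the Accelerated Networking question: an AN interface shows up under /sys/class/infiniband as an mlx5 device whose port link layer is Ethernet rather than InfiniBand, as the ibstatus output in a later issue also illustrates. A sketch:

```bash
# Heuristic only: report mlx5 devices whose port link layer is Ethernet (likely AN interfaces).
for dev in /sys/class/infiniband/*; do
    for port in "$dev"/ports/*; do
        if [ "$(cat "$port/link_layer" 2>/dev/null)" = "Ethernet" ]; then
            echo "Possible Accelerated Networking interface: $(basename "$dev")"
        fi
    done
done
```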

More Reliable Checks for Drivers

The current implementation of checking for GPU/IB drivers relies on checking to see if CLI tools that come installed with the drivers are functioning (e.g. nvidia-smi, ibstat).

Notably, this relies on these binaries being on the PATH, which can be a source of confusion in containerized environments, but in general it's not as direct a method as it could be.

For instance, checking for known kernel modules via lsmod could give us a clearer picture of what drivers have been successfully installed.
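A minimal sketch of the kind of module-based check being suggested (illustrative):

```bash
# Check for the driver kernel modules directly instead of relying on CLI tools being on PATH.
lsmod | grep -q '^nvidia '  && echo "NVIDIA GPU driver module loaded"
lsmod | grep -q '^mlx5_ib ' && echo "Mellanox InfiniBand driver module loaded"
```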

CycleCloud Logs

To support CycleCloud deployments, we need to collect any logs that we can that are related to CycleCloud and whichever scheduler is in use.

Some sample CycleCloud Logs:

  • /opt/cycle/jetpack/logs/chef-client.log
  • /opt/cycle/jetpack/system/chef/cache/chef-stacktrace.out
  • /opt/cycle/jetpack/logs/cluster-init/{PROJECT_NAME}

Timeouts for DCGM

The script should recover from dcgmi hanging by enforcing an appropriate timeout on each run-level and recovering gracefully.
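A minimal sketch of such a guard, using coreutils timeout (the 30-minute budget is a placeholder, not a committed value):

```bash
# Kill the diagnostic if it exceeds an arbitrary 30-minute budget and keep going.
if ! timeout 30m dcgmi diag -r 3 > Nvidia/dcgm-diag-3.log; then
    echo "dcgmi diag timed out or failed; continuing with the remaining checks" >&2
fi
```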

ACTION REQUIRED: Microsoft needs this private repository to complete compliance info

There are open compliance tasks that need to be reviewed for your azhpc-diagnostics repo.

Action required: 4 compliance tasks

To bring this repository to the standard required for 2021, we require administrators of this and all Microsoft GitHub repositories to complete a small set of tasks within the next 60 days. This is critical work to ensure the compliance and security of your Azure GitHub organization.

Please take a few minutes to complete the tasks at: https://repos.opensource.microsoft.com/orgs/Azure/repos/azhpc-diagnostics/compliance

  • The GitHub AE (GitHub inside Microsoft) migration survey has not been completed for this private repository
  • No Service Tree mapping has been set for this repo. If this team does not use Service Tree, they can also opt-out of providing Service Tree data in the Compliance tab.
  • No repository maintainers are set. The Open Source Maintainers are the decision-makers and actionable owners of the repository, irrespective of administrator permission grants on GitHub.
  • Classification of the repository as production/non-production is missing in the Compliance tab.

You can close this work item once you have completed the compliance tasks, or it will automatically close within a day of taking action.

If you no longer need this repository, it might be quickest to delete the repo, too.

GitHub inside Microsoft program information

More information about GitHub inside Microsoft and the new GitHub AE product can be found at https://aka.ms/gim or by contacting [email protected]

FYI: current admins at Microsoft include @JonShelley, @lostern, @jithinjosepkl, @souvik-de, @AnnaDaly, @edwardsp, @rluanma, @vermagit, @tlcyr4, @sakshamgupta006, @anshuljainansja, @alfpark

Output File Size Cap

There should be a way to limit the output file size.

In some cases there are lots of system logs, leading to an overall output size of over 70MB. To avoid creating a huge payload to transfer around in cases like this and even more extreme ones, we need a method for controlling the overall output size.
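One possible approach (a sketch, not a committed design) is to truncate the largest raw logs before archiving:

```bash
# Keep only the last 10 MiB of very large logs (threshold is a placeholder).
MAX_BYTES=$((10 * 1024 * 1024))
for f in VM/syslog VM/messages VM/journald.log; do
    if [ -f "$f" ] && [ "$(stat -c%s "$f")" -gt "$MAX_BYTES" ]; then
        tail -c "$MAX_BYTES" "$f" > "$f.tmp" && mv "$f.tmp" "$f"
    fi
done
```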

Non-SRIOV IB Diagnostics

The tool does not currently support collecting Infiniband diagnostics on any non-SRIOV enabled SKUs.

Frontpage for Outputs

There should be a file (general.log is currently the closest thing to it) that holds the most general info or highlights about the system.

Support for running azhpc-diagnostics in kubernetes (and AKS)

As a user of N-series VMs and H-series VMs in AKS, I want the ability to run the azhpc-diagnostics tool in a kubernetes environment, so that I can help the Azure support team gather the diagnostic data they need to root cause and mitigate platform issues encountered on these VMs.

Azure users are sometimes asked by Azure support to run the azhpc-diagnostics tool to gather data for them when an error condition arises that could be due to the platform. However, it's not clear how to run this tool within a kubernetes environment. This issue is a request to support running this tool in kubernetes (and thereby AKS).

Getting IB Link Info When ib_umad isn't loaded

The ibstat command relies on the ib_umad kernel module being loaded to give full output.

In the event that it's not loaded, the ibstatus command can be used instead to still get some of the information that ibstat provides, in particular, link info.

We'd like to be able to adapt to missing ib_umad, which is not necessarily loaded when not running on the official Azure HPC marketplace images, and still obtain useful info such as link info and firmware version.

Both of these commands come from the Mellanox InfiniBand Fabric Utilities.
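A sketch of the fallback being described (illustrative, not the script's current behavior):

```bash
# Use ibstat when ib_umad is loaded; otherwise fall back to ibstatus for link info.
if lsmod | grep -q '^ib_umad '; then
    ibstat > Infiniband/ibstat.out
else
    ibstatus > Infiniband/ibstatus.out
fi
```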

ibstat fails when Accelerated Networking is enabled

When Accelerated Networking is enabled, ibstat, run with no parameters, can fail due to mistaking the Ethernet interface for an IB one. When this happens, we don't get any information from ibstat:

~$ ls /sys/class/infiniband/mlx5_
mlx5_an0/  mlx5_ib0/  mlx5_ib1/  mlx5_ib2/  mlx5_ib3/  mlx5_ib4/  mlx5_ib5/  mlx5_ib6/  mlx5_ib7/
~$ ibstat
ibpanic: [8371] main: stat of IB device 'mlx5_an0' failed: No such file or directory
~$ ibstat mlx5_ib0
CA 'mlx5_ib0'
        CA type: MT4124
        Number of ports: 1
        Firmware version: 20.28.4000
        Hardware version: 0
        Node GUID: 0x00155dfffe34072b
        System image GUID: 0x0c42a10300a58684
        Port 1:
                State: Initializing
                Physical state: LinkUp
                Rate: 200
                Base lid: 65535
                LMC: 0
                SM lid: 361
                Capability mask: 0x2651ec48
                Port GUID: 0x00155dfffd34072b
                Link layer: InfiniBand

This is fixable by using the ibstatus command, also from the Mellanox Fabric Utilities, which does not fail when an AN interface is present. We can even use it to determine which interfaces are actually InfiniBand ones, which would allow us to get the full output of ibstat by running it against one interface at a time.

~$ ibstatus
Infiniband device 'mlx5_an0' port 1 status:
default gid: unknown
base lid: 0x0
sm lid: 0x0
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 40 Gb/sec (4X QDR)
link_layer: Ethernet

Infiniband device 'mlx5_ib0' port 1 status:
default gid: fe80:0000:0000:0000:0015:5dff:fd34:072b
base lid: 0xffff
sm lid: 0x169
state: 2: INIT
phys state: 5: LinkUp
rate: 200 Gb/sec (4X HDR)
link_layer: InfiniBand

...
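A sketch of the per-device approach described above (illustrative; this reads the sysfs link_layer files rather than parsing ibstatus output, but the intent is the same):

```bash
# Run ibstat only against devices whose link layer is InfiniBand, skipping AN interfaces.
for dev in /sys/class/infiniband/*; do
    name=$(basename "$dev")
    if grep -qx 'InfiniBand' "$dev"/ports/*/link_layer 2>/dev/null; then
        ibstat "$name" >> Infiniband/ibstat.out
    fi
done
```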

Collect HyperV KVPs

Hyper-V KVP data contains several useful pieces of information, and it would be good to collect it through the diagnostic script.
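For reference, the Linux KVP daemon typically keeps its pool files under /var/lib/hyperv (path assumed; the records are fixed-width binary, so this is only a rough dump for inspection):

```bash
# Rough, human-readable dump of the Hyper-V key/value pools; path and format are assumptions.
for pool in /var/lib/hyperv/.kvp_pool_*; do
    echo "== $pool =="
    strings "$pool"
done
```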
