azure / azhpc-diagnostics

Scripts that run on Azure VMs and gather a variety of diagnostic information to debug common issues with VMs, GPUs, and InfiniBand.

License: MIT License

Shell 97.33% PowerShell 2.67%

azhpc-diagnostics's Issues

Include Version Number in output

We need a way of verifying which version of the tool was run to produce a given output.

This is important for determining whether oddities in the output are because of the system it was run on or defects in an outdated version of the script (or current defects).

Non-SRIOV IB Diagnostics

The tool does not currently support collecting Infiniband diagnostics on any non-SRIOV enabled SKUs.

Timeouts for DCGM

The script should recover from dcgmi hanging by enforcing an appropriate timeout on each run level and continuing gracefully when a run times out.
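
One possible shape for this, as a minimal sketch rather than the tool's actual implementation: wrap each call in coreutils `timeout` and report a hang instead of blocking. The helper name `run_with_timeout` and the time limit in the usage comment are assumptions made for illustration.

```shell
# Run a command with a time limit; report a hang instead of blocking forever.
# Usage: run_with_timeout <seconds> <command> [args...]
# Example (the 600 s limit is an arbitrary, illustrative value):
#   run_with_timeout 600 dcgmi diag -r 3 || echo "dcgmi run level 3 did not complete"
run_with_timeout() {
    local limit="$1"; shift
    local rc=0
    # Send TERM at the limit, then KILL 5 seconds later if the process ignores it.
    timeout --kill-after=5 "$limit" "$@" || rc=$?
    # timeout exits with 124 on expiry, or 137 if it had to send KILL.
    if [ "$rc" -eq 124 ] || [ "$rc" -eq 137 ]; then
        echo "Timed out after ${limit}s: $*" >&2
    fi
    return "$rc"
}
```

Because the helper returns the exit status, the caller can log the failure for one run level and still move on to the remaining checks.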

lsmod log

It would be useful to collect 'lsmod' output.

Check for inactive NVLinks

There should be a check for GPUs experiencing this type of issue:

nvidia-smi nvlink -s
....
GPU 3: Tesla V100-SXM2-32GB (UUID: ...)
         Link 0: 25.781 GB/s
         Link 1: 25.781 GB/s
         Link 2: 25.781 GB/s
         Link 3: <inactive>
         Link 4: 25.781 GB/s
         Link 5: <inactive>
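
A check along these lines could parse that output directly. This is a sketch that assumes the output layout shown above; the function name `find_inactive_nvlinks` is hypothetical.

```shell
# Read `nvidia-smi nvlink -s` output on stdin and print one line per inactive link.
find_inactive_nvlinks() {
    awk '
        # Remember which GPU the following Link lines belong to.
        /^GPU [0-9]+:/ { gpu = $2; sub(/:$/, "", gpu) }
        /Link [0-9]+: <inactive>/ {
            link = $2; sub(/:$/, "", link)
            printf "GPU %s link %s is inactive\n", gpu, link
        }
    '
}
```

It would be used as `nvidia-smi nvlink -s | find_inactive_nvlinks`; empty output means no inactive links were seen.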

PKEY path wrong

I was investigating an incident involving IB and PKEYs. In the collected data files, the information in /$node/Infiniband/$device/pkey0 and pkey1 is missing:

No pkey found

I checked the other files. The IB drivers are installed correctly, so it's unlikely that these 2 files are missing from the VM.

Then I checked the diagnostics.sh and figured that the file path might be wrong.

The correct path for pkey (in my case) is:

/sys/class/infiniband/mlx5_0/ports/1/pkeys/0

The script is looking for:

Line 254 if [ -f "$device/ports/pkeys/0" ]; then
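
A possible fix, sketched under the assumption that pkeys always sit under `ports/<port>/pkeys/<index>` as in the path above: glob over the port directory instead of hard-coding the path. The function name `list_pkeys` is hypothetical and is not taken from the script.

```shell
# Print every pkey 0/1 found under <device>/ports/<port>/pkeys/.
# $1 is a device directory such as /sys/class/infiniband/mlx5_0.
list_pkeys() {
    local device="$1" pkey_file found=1
    for pkey_file in "$device"/ports/*/pkeys/[01]; do
        # An unmatched glob stays literal, so skip anything that is not a file.
        [ -f "$pkey_file" ] || continue
        echo "${pkey_file#"$device"/}: $(cat "$pkey_file")"
        found=0
    done
    [ "$found" -eq 0 ] || echo "No pkey found for $(basename "$device")" >&2
    return "$found"
}
```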

Collect HyperV KVPs

Hyper-V KVP data contains several useful pieces of information, and it would be good to collect it through the diagnostic scripts.

Frontpage for Outputs

There should be a file (general.log is currently the closest thing to it) that holds the most general info or highlights about the system.

Errors running hpcdiag on NDv4(40GB A100)

sudo ${HPCDIAG_EXE_DIR}/gather_azhpc_vm_diagnostics.sh -d . --gpu-level=3 --mem-level=1

etc
Running Nvidia GPU Diagnostics
Querying Nvidia GPU Info, writing to {output}/Nvidia/nvidia-smi-q.out
Running plain nvidia-smi, writing to {output}/Nvidia/nvidia-smi.out
Dumping Nvidia GPU internal state to {output}/Nvidia/nvidia-debugdump.zip
ERROR: GetCaptureBufferSize failed, Unknown Error, bufSize: 0x0
ERROR: internal_getDumpBuffer failed, return code: 0x3e7
ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
ERROR: GetCaptureBufferSize failed, Unknown Error, bufSize: 0x0
ERROR: internal_getDumpBuffer failed, return code: 0x3e7
ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
ERROR: GetCaptureBufferSize failed, Unknown Error, bufSize: 0x0
ERROR: internal_getDumpBuffer failed, return code: 0x3e7
ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
ERROR: GetCaptureBufferSize failed, Unknown Error, bufSize: 0x0
ERROR: internal_getDumpBuffer failed, return code: 0x3e7
ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
ERROR: GetCaptureBufferSize failed, Unknown Error, bufSize: 0x0
ERROR: internal_getDumpBuffer failed, return code: 0x3e7
ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
ERROR: GetCaptureBufferSize failed, Unknown Error, bufSize: 0x0
ERROR: internal_getDumpBuffer failed, return code: 0x3e7
ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
ERROR: GetCaptureBufferSize failed, Unknown Error, bufSize: 0x0
ERROR: internal_getDumpBuffer failed, return code: 0x3e7
ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
ERROR: GetCaptureBufferSize failed, Unknown Error, bufSize: 0x0
ERROR: internal_getDumpBuffer failed, return code: 0x3e7
ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
Temporarily starting nv-hostengine

etc

Checking for common issues
Checking for GPUs with corrupted infoROM
Checking for GPUs with row remapping failures
Checking for GPUs that don't appear in nvidia-smi
Checking PCIe speed training for GPUs
Could not find pkey 0 for device mlx5_ib0
Could not find pkey 1 for device mlx5_ib0
Could not find pkey 0 for device mlx5_ib1
Could not find pkey 1 for device mlx5_ib1
Could not find pkey 0 for device mlx5_ib2
Could not find pkey 1 for device mlx5_ib2
Could not find pkey 0 for device mlx5_ib3
Could not find pkey 1 for device mlx5_ib3
Could not find pkey 0 for device mlx5_ib4
Could not find pkey 1 for device mlx5_ib4
Could not find pkey 0 for device mlx5_ib5
Could not find pkey 1 for device mlx5_ib5
Could not find pkey 0 for device mlx5_ib6
Could not find pkey 1 for device mlx5_ib6
Could not find pkey 0 for device mlx5_ib7
Could not find pkey 1 for device mlx5_ib7

Are these errors normal on NDv4?
I ran it on multiple NDv4 VMs and got the same errors as above.

It fails to correctly identify SKU for standard_nc64as_t4_v3

The script fails to correctly identify standard_nc64as_t4_v3 as an Nvidia GPU SKU and therefore won't gather GPU data.

https://github.com/Azure/azhpc-diagnostics/blob/main/Linux/src/gather_azhpc_vm_diagnostics.sh#L190

This line should include 64 vCPU size as well.
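
For illustration only, a pattern that accepts the 64 vCPU size alongside the other NCasT4_v3 sizes might look like the sketch below. The exact expression at the linked line is not reproduced here, so both the function name and the regex are assumptions.

```shell
# Hypothetical helper: succeed if $1 names an NCasT4_v3 size (4, 8, 16 or 64 vCPUs).
is_nc_t4_v3_size() {
    local clean
    # Normalize case so Standard_NC64as_T4_v3 and standard_nc64as_t4_v3 both match.
    clean="$(echo "$1" | tr '[:upper:]' '[:lower:]')"
    [[ "$clean" =~ ^standard_nc(4|8|16|64)as_t4_v3$ ]]
}
```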

Missing line for NDH100v5 support when collecting Nvidia GPU details

One line is missing at line 196:
[[ "$clean" =~ ^standard_nd96i?sr(_h100)?_v5$ ]] ||

More Reliable Checks for Drivers

The current implementation of checking for GPU/IB drivers relies on checking to see if CLI tools that come installed with the drivers are functioning (e.g. nvidia-smi, ibstat).

Notably, this relies on these binaries being on the PATH, which can be a source of confusion for containerized environments, but in general, it's not as direct of a method as it could be.

For instance, checking for known kernel modules via lsmod could give us a clearer picture of what drivers have been successfully installed.
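
A minimal sketch of such a check, reading `lsmod` output on stdin so it can be exercised without the drivers present. The function name `module_loaded` is hypothetical; the module names in the usage note (`nvidia`, `mlx5_ib`) are the usual ones but should be treated as assumptions for any given image.

```shell
# Read lsmod output on stdin; succeed if the module named in $1 is loaded.
module_loaded() {
    # NR > 1 skips the "Module Size Used by" header; match the name column exactly.
    awk -v mod="$1" 'NR > 1 && $1 == mod { found = 1 } END { exit !found }'
}
```

Typical use would be `lsmod | module_loaded nvidia` or `lsmod | module_loaded mlx5_ib`, which does not depend on nvidia-smi or ibstat being on the PATH.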

Give instructions for GPU pages pending retirement

If we see in nvidia-smi or another GPU check that there are currently pages with ECC errors that are pending retirement, we can print a message to let the user know that this is fixable with a reboot and doesn't require hardware servicing.

CycleCloud Logs

To support CycleCloud deployments, we need to collect any logs that we can that are related to CycleCloud and whichever scheduler is in use.

Some sample CycleCloud Logs:

/opt/cycle/jetpack/logs/chef-client.log
/opt/cycle/jetpack/system/chef/cache/chef-stacktrace.out
/opt/cycle/jetpack/logs/cluster-init/{PROJECT_NAME}

Offline Mode

A flag to set at runtime to instruct the script to skip any activities that require Internet access.

Intended for use in sensitive or air-gapped environments.
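
As a rough sketch of how this could work: the flag name `--offline` and both helper names below are assumptions, not existing options of the tool.

```shell
OFFLINE=false

# Scan the script arguments for the (hypothetical) --offline flag.
parse_offline_flag() {
    local arg
    for arg in "$@"; do
        case "$arg" in
            --offline) OFFLINE=true ;;
        esac
    done
}

# Run a command that needs Internet access, or skip it in offline mode.
needs_network() {
    if [ "$OFFLINE" = true ]; then
        echo "Skipping (offline mode): $*" >&2
        return 0
    fi
    "$@"
}
```

Any step that reaches out to the network would then be prefixed with `needs_network`, leaving the rest of the script unchanged.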

Collect IB health check for older gen. SKUs

We need to add IB health check coverage for older-generation RDMA SKUs (e.g., H16r).

Update support matrix please

Please, can you update the support matrix?

Is Ubuntu 20.04 supported?
And the NCv3?
Is Alma HPC 8.5 supported?
Is the HBv3 supported?
The H series is retired.

thank you!

Package Manager Logs

It would be useful for the tool to collect package manager logs such as dpkg.log or yum.log.

Check for missing PCI devices

Compare the set of devices showing up on the pci bus to the expected set for the current VM size

Auto-Update

The tool should be able to attempt to query this repo for any updates and apply them before running.

This is to avoid usage of out-of-date versions of the tool in scenarios where it has been installed onto a machine for a long time.

Detecting Known Mellanox Firmware Issue

There is a known issue affecting VMs with ConnectX-5 cards that prevents IB devices from being found and needs to be fixed with a firmware update.

The tool should be able to detect that particular issue and alert its user to contact support for assistance.

collect full ibv_devinfo logs

Please use "ibv_devinfo -v" to collect all the attributes of the IB device(s)

[Feature requests] Add PKEYs out for IB, AN, SSH

If not already present, output of PKEYS on the VM:
cat /sys/class/infiniband/mlx5_0/ports/1/pkeys/*

Is there a direct test to check if Accelerated Networking is enabled vs not? Not sure if this is in the VM metadata. Though indirectly, this is discernible through lspci, ibv_devinfo etc.

Additionally (unrelated to this ask), how can we add info if the VM is publicly accessible (over SSH, public IP, port 22 open)? This may be relevant to know if we can have direct access to VM (with customer consent) or will need to coordinate VM access with customer.

Robust System Log Collection

Currently, the tool takes /var/log/syslog and dmesg, which captures the system logs we want on Ubuntu. However, RHEL-based distros tend to put logs into /var/log/messages instead of /var/log/syslog, and newer distros (including Ubuntu) are switching over to logging using journald. The tool should be intelligent enough to gather the logs we need in any of these cases.
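
One way to approach this, sketched with a hypothetical function name: probe for each source and collect whichever ones exist. The optional root argument exists only so the logic can be tried against a scratch directory.

```shell
# Print the system log sources available under root $1 (default /).
detect_syslog_sources() {
    local root="${1:-/}" f
    for f in var/log/syslog var/log/messages; do
        if [ -f "${root%/}/$f" ]; then
            echo "file:/$f"
        fi
    done
    # journald is detected by the presence of its CLI on this machine.
    if command -v journalctl >/dev/null 2>&1; then
        echo "journald"
    fi
    return 0
}
```

For the journald case, something like `journalctl --no-pager -b` could then dump the current boot's log into the output directory.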

Output File Size Cap

There should be a way to limit the output file size.

In some cases there are lots of system logs, leading to an overall output size of over 70MB. To avoid creating a huge payload to transfer around in cases like this and even more extreme ones, we need a method for controlling the overall output size.
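
A simple per-file cap is one option, shown here as a sketch; the helper name `copy_capped` is hypothetical. It keeps the tail of each log on the assumption that the most recent entries matter most.

```shell
# Copy $1 to $2, keeping only the last $3 bytes if the source is larger.
copy_capped() {
    local src="$1" dest="$2" max_bytes="$3" size
    # Arithmetic expansion strips any padding some wc implementations add.
    size=$(( $(wc -c < "$src") ))
    if [ "$size" -gt "$max_bytes" ]; then
        tail -c "$max_bytes" "$src" > "$dest"
    else
        cp "$src" "$dest"
    fi
}
```

A cap on the overall archive would still need a budget shared across files; this only bounds each individual log.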

Does not support NDv5(A100) (No GPU diagnostics are collected)

Does not support NDv5(H100), i.e. no GPU (nvidia-smi) diagnostics are collected.

Getting IB Link Info When ib_umad isn't loaded

The ibstat command relies on the ib_umad kernel module to be loaded to give full output.

In the event that it's not loaded, the ibstatus command can be used instead to still get some of the information that ibstat provides, in particular, link info.

We'd like to be able to adapt to missing ib_umad, which is not necessarily loaded when not running on the official Azure HPC marketplace images, and still obtain useful info such as link info and firmware version.

Both of these commands are from the Mellanox InfiniBand Fabric Utilities
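
The fallback could be chosen from `lsmod` output, as in this sketch; the function name `choose_ib_link_tool` is an assumption.

```shell
# Read lsmod output on stdin and print which link-info command to run.
choose_ib_link_tool() {
    # ibstat needs ib_umad for full output; ibstatus works without it.
    if awk '$1 == "ib_umad" { found = 1 } END { exit !found }'; then
        echo ibstat
    else
        echo ibstatus
    fi
}
```

Usage would be along the lines of `tool="$(lsmod | choose_ib_link_tool)"` followed by `"$tool" > ib-link-info.out`, where the output file name is illustrative.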

Checking for Nouveau Drivers on Compute SKUs

The open source Nouveau GPU drivers are not supported on the compute GPU sizes, so it is worth flagging any time someone has them installed.

Support for running azhpc-diagnostics in kubernetes (and AKS)

As a user of N-series VMs and H-series VMs in AKS, I want the ability to run the azhpc-diagnostics tool in a kubernetes environment, so that I can help the Azure support team gather the diagnostic data they need to root cause and mitigate platform issues encountered on these VMs.

Azure users are sometimes asked by Azure support to run the azhpc-diagnostics tool to gather data for them when an error condition arises that could be due to the platform. However, it's not clear how to run this tool within a kubernetes environment. This issue is a request to support running this tool in kubernetes (and thereby AKS).

ibstat fails when Accelerated Networking is enabled

When Accelerated Networking is enabled, ibstat, run with no parameters, can fail due to mistaking the Ethernet interface for an IB one. When this happens, we don't get any information from ibstat:

~$ ls /sys/class/infiniband/mlx5_
mlx5_an0/ mlx5_ib0/ mlx5_ib1/ mlx5_ib2/ mlx5_ib3/ mlx5_ib4/ mlx5_ib5/ mlx5_ib6/ mlx5_ib7/
~$ ibstat
ibpanic: [8371] main: stat of IB device 'mlx5_an0' failed: No such file or directory
~$ ibstat mlx5_ib0
CA 'mlx5_ib0'
CA type: MT4124
Number of ports: 1
Firmware version: 20.28.4000
Hardware version: 0
Node GUID: 0x00155dfffe34072b
System image GUID: 0x0c42a10300a58684
Port 1:
State: Initializing
Physical state: LinkUp
Rate: 200
Base lid: 65535
LMC: 0
SM lid: 361
Capability mask: 0x2651ec48
Port GUID: 0x00155dfffd34072b
Link layer: InfiniBand

This is fixable by using the ibstatus command, also from the Mellanox Fabric Utilities, which does not fail when an AN interface is present. We can even use it to determine which interfaces are actually InfiniBand ones, which would allow us to get the full output of ibstat by running it against one interface at a time.

`~$ ibstatus
Infiniband device 'mlx5_an0' port 1 status:
default gid: unknown
base lid: 0x0
sm lid: 0x0
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 40 Gb/sec (4X QDR)
link_layer: Ethernet

Infiniband device 'mlx5_ib0' port 1 status:
default gid: fe80:0000:0000:0000:0015:5dff:fd34:072b
base lid: 0xffff
sm lid: 0x169
state: 2: INIT
phys state: 5: LinkUp
rate: 200 Gb/sec (4X HDR)
link_layer: InfiniBand

...`
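
The filtering step described above could be sketched as follows, assuming the ibstatus layout shown in this issue; the function name `list_ib_devices` is hypothetical.

```shell
# Read ibstatus output on stdin; print devices whose link_layer is InfiniBand.
list_ib_devices() {
    awk '
        # Header line: Infiniband device <quoted name> port N status:
        # \047 is the single-quote character, stripped from the device name.
        /^Infiniband device/ { dev = $3; gsub(/\047/, "", dev) }
        /link_layer:/ && $2 == "InfiniBand" { print dev }
    ' | sort -u
}
```

Running `ibstatus | list_ib_devices | while read -r dev; do ibstat "$dev"; done` would then collect full ibstat output for each InfiniBand device while skipping the AN interface.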

azhpcdiag Tool Lacks GPU Diagnostics Support for Latest N Series SKUs (NDH100v4, NDA100v5): Timeline for Updates Needed

The azhpcdiag tool currently does not collect GPU diagnostics information for the latest N series SKUs, such as NDH100v4 and NDA100v5. Is there a timeline for updating the scripts? This tool is crucial for collecting logs for support tickets and analysis.

Highlight identifying details for down IB links

With ibstat outputs like

...
Port 1:
State: Down
Physical state: Polling
...

we should highlight this issue in output and include the vmbus ID of the affected port.
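
Detecting the condition from ibstat output might look like the sketch below; the function name `report_down_ports` is hypothetical. Looking up the vmbus ID for the affected port is not covered here.

```shell
# Read ibstat output on stdin; print a warning for every port whose State is Down.
report_down_ports() {
    awk '
        # \047 is the single-quote character around the CA name.
        /^CA \047/ { ca = $2; gsub(/\047/, "", ca) }
        /^[[:space:]]*Port [0-9]+:/ { port = $2; sub(/:$/, "", port) }
        # Anchored so that "Physical state:" lines are not matched.
        /^[[:space:]]*State: Down/ { printf "WARNING: %s port %s is Down\n", ca, port }
    '
}
```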

Nvidia Bug Report

The output of the tool should include all the necessary files to submit a bug report to Nvidia.

ACTION REQUIRED: Microsoft needs this private repository to complete compliance info

There are open compliance tasks that need to be reviewed for your azhpc-diagnostics repo.

Action required: 4 compliance tasks

To bring this repository to the standard required for 2021, we require administrators of this and all Microsoft GitHub repositories to complete a small set of tasks within the next 60 days. This is critical work to ensure the compliance and security of your Azure GitHub organization.

Please take a few minutes to complete the tasks at: https://repos.opensource.microsoft.com/orgs/Azure/repos/azhpc-diagnostics/compliance

The GitHub AE (GitHub inside Microsoft) migration survey has not been completed for this private repository
No Service Tree mapping has been set for this repo. If this team does not use Service Tree, they can also opt-out of providing Service Tree data in the Compliance tab.
No repository maintainers are set. The Open Source Maintainers are the decision-makers and actionable owners of the repository, irrespective of administrator permission grants on GitHub.
Classification of the repository as production/non-production is missing in the Compliance tab.

You can close this work item once you have completed the compliance tasks, or it will automatically close within a day of taking action.

If you no longer need this repository, it might be quickest to delete the repo, too.

GitHub inside Microsoft program information

More information about GitHub inside Microsoft and the new GitHub AE product can be found at https://aka.ms/gim or by contacting [email protected]

FYI: current admins at Microsoft include @JonShelley, @lostern, @jithinjosepkl, @souvik-de, @AnnaDaly, @edwardsp, @rluanma, @vermagit, @tlcyr4, @sakshamgupta006, @anshuljainansja, @alfpark

Print warning for GPU power usage check failures

When nvidia-smi output has ERR! as its power usage like this

+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  On   | 00008ED0:00:00.0 Off |                    0 |
| N/A   29C    P0   ERR! / 250W |    593MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

the tool should point it out to explain that it's just a transient issue.
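
A minimal sketch of that check, assuming the table layout shown above; the function name `warn_power_err` and the warning text are illustrative.

```shell
# Read plain nvidia-smi output on stdin; warn if any GPU shows ERR! for power usage.
warn_power_err() {
    # Matches the "ERR! / 250W" cell of the usage/cap column.
    if grep -q 'ERR! */ *[0-9]*W'; then
        echo "WARNING: nvidia-smi reported ERR! for GPU power usage"
        return 0
    fi
    return 1
}
```

It would be fed the already-collected file, for example `warn_power_err < nvidia-smi.out`.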
