ibm / cast Goto Github PK

CAST can enhance the system management of cluster-wide resources. It consists of the open source tools: cluster system management (CSM) and burst buffer.

License: Eclipse Public License 1.0

CMake 1.86% C++ 50.00% C 28.87% Shell 7.90% Perl 2.08% RobotFramework 0.90% Python 4.61% Awk 0.07% Ruby 0.10% JavaScript 0.27% PLpgSQL 3.19% Cuda 0.08% Makefile 0.03% BitBake 0.02% HTML 0.01% Dockerfile 0.02% Less 0.01%

cast's Introduction

CAST

Cluster Administration and Storage Tools, or CAST, can enhance the system management of cluster-wide resources. It consists of the open source tools: Cluster System Management (CSM), Burst Buffer and Function Shipping.

CSM provides an integrated view of your large cluster and includes discovery and management of system resources, database integration, support of job launches, diagnostics, RAS events and actions, and health check.

Burst Buffer is a cost-effective mechanism that can improve I/O performance for a large class of high-performance computing applications without requirement of intermediary hardware. Burst Buffer provides a fast storage tier between compute nodes and the traditional parallel file system, allowing overlapping jobs to stage-in and stage-out data for a variety of checkpoint/restart, scratch volume, and extended memory I/O workloads.

Function Shipping is a file I/O forwarding layer for Linux that aims to provide low-jitter access to remote parallel file system while retaining common POSIX semantics.

For more details on CAST please view:

How to Contribute

If this sounds like a project you would like to help develop, please print, fill out and sign one of the following contributor agreements:

Corporate Contributor License Agreement: cast-project-cla-corporate.pdf
Individual Contributor License Agreement: cast-project-cla-individual.pdf

For further details, please view the CONTRIBUTING.md

How CAST is Licensed

CAST is licensed under the Eclipse Public License 1.0. For more details please view the LICENSE file for details.

cast's People

Contributors

Stargazers

Watchers

cast's Issues

Integrate gpfs related tests into Diagnostic

Integrate gpfsperf and nsdperf into Diagnostic.

CSM_FVT: Test cases for allocation create in diagnostics

#114
#112

CSM_FVT: hostname validation tests

We need a test case that checks for correct handling of configured hostnames in csmd config files.
See pull request #119

valid hostnames: should cause the daemon to try to connect to that host and handle any connection errors accordingly.
invalid hostnames (currently with spaces or '_') should cause a warning in the log and disable the connection attempts
NONE or empty hostnames should cause an info level msg in the log and disable the connection attempts

csm_allocation_delete is not capturing the correct information on failure

csm_allocation_delete is not capturing the correct information on faulire:

# /opt/ibm/csm/bin/csm_allocation_query_details -a 28
---
allocation_id:                  28
primary_job_id:                 373
secondary_job_id:               0
num_nodes:                      18
compute_nodes:
  f0n07:
   power_cap:                   3050
   power_shifting_ratio:        100
  f0n08:
   power_cap:                   3050
   power_shifting_ratio:        100
  f0n09:
   power_cap:                   3050
   power_shifting_ratio:        100
  f0n11:
.....

csmdb=# select * from csm_allocation_node_history where allocation_id = '28' and node_name = 'f0n07';
        history_time        | allocation_id | node_name |  state  | shared | energy | gpfs_read | gpfs_write |   i
b_tx    |   ib_rx    | power_cap | power_shifting_ratio | power_cap_hit | gpu_usage | gpu_energy | cpu_usage | mem
ory_usage_max | archive_history_time
----------------------------+---------------+-----------+---------+--------+--------+-----------+------------+----
--------+------------+-----------+----------------------+---------------+-----------+------------+-----------+----
--------------+----------------------
 2018-05-08 13:29:59.218207 |            28 | f0n07     | running | f      |      0 |  17137664 |       8192 |  39
6811784 |  401237665 |      3050 |                  100 |             0 |        -1 |            |           |
              |
 2018-05-08 13:30:32.356493 |            28 | f0n07     | running | f      |      0 | -17137664 |      -8192 | -39
6811784 | -401237665 |      3050 |                  100 |             0 |        -1 |            |         0 |
            0 |
 2018-05-08 13:30:32.356493 |            28 | f0n07     | failed  | f      |      0 | -17137664 |      -8192 | -39
6811784 | -401237665 |      3050 |                  100 |             0 |        -1 |            |         0 |
            0 |
(3 rows)

csmdb=#

Display the correct version when running Diagnostic.

Get the correct version from the build.

cast_1.1.0 Branch Creation

@IBM/cast_all

Hi,

A new branch named cast_1.1.0 will be created on Friday 5/11 at 7:00 am EDT. Please have all needed code checked in by then.

Thanks

Develop new Diagnostic tests

Develop the remaining new Diagnostic tests.

clang build error(s)

running ./scripts/configure.pl --parallel --clang and build I get the following build error

[ 67%] Building CXX object csmd/src/daemon/src/CMakeFiles/csmd_lib.dir/thread_pool.cc.o
In file included from /u/plundgr/CAST/csmd/src/daemon/src/thread_pool.cc:17:
In file included from /u/plundgr/CAST/csmd/src/daemon/src/csmi_request_handler/csmi_base.h:43:
In file included from /u/plundgr/CAST/csmd/src/daemon/include/csm_daemon_network_manager.h:38:
In file included from /u/plundgr/CAST/csmd/include/csm_daemon_config.h:47:
/u/plundgr/CAST/csmd/src/daemon/include/bds_info.h:62:36: error: comparison of constant 9223372036854775807 with expression of type 'int' is always false
      [-Werror,-Wtautological-constant-out-of-range-compare]
    if(( errno == ERANGE ) || ( pn == INTMAX_MAX ) || ( pn <= 0 ) || ( pn > 65535 ))
                                ~~ ^  ~~~~~~~~~~
1 error generated.
make[2]: *** [csmd/src/daemon/src/CMakeFiles/csmd_lib.dir/thread_pool.cc.o] Error 1
make[1]: *** [csmd/src/daemon/src/CMakeFiles/csmd_lib.dir/all] Error 2
make: *** [all] Error 2
*** CONFIGURE FAILED  (cmd= make  install   rc=512)

Regression/Unit tests for CSM serialization engine

Work from pull request #8 reinforced that CSM needs a unit test for the x-macro serialization engine. The tests should perform the following operations:

Success case (fixed data)
Success case (randomized data)
Failure case (buffer length mismatch)

Additional tests should be added as deemed necessary.

CSM_FVT Test Case: Prolog/Epilog failures

Need to add some test cases to regression. Goal is to manipulate prolog and epilog scripts to force various csm_allocation_create and csm_allocation_delete errors and time outs. Generally, checks will be to observe expected failure behaviors, clean (as possible) exits. More discussion to be had regarding exact errors to force, checks to make.

rpm command is able to install 2 versions of ibm-csm-hcdiag simultaneously

Noticed this while working on install scripts for FVT

Currently installed and running version 1.1.1-163

[root@c650f03p09 rpms]# rpm -qa | grep ibm-
ibm-csm-core-1.1.1-163.ppc64le
ibm-csm-hcdiag-1.1.1-163.noarch
ibm-flightlog-1.1.1-163.ppc64le
ibm-csm-api-1.1.1-163.ppc64le

I can rpm -i a newer version of hcdiag rpm, and both versions are listed as installed

[root@c650f03p09 rpms]# rpm -ivh ibm-csm-hcdiag-1.1.1-188.noarch.rpm 
Preparing...                          ################################# [100%]
Updating / installing...
   1:ibm-csm-hcdiag-1.1.1-188         ################################# [100%]
[root@c650f03p09 rpms]# rpm -qa | grep ibm-csm
ibm-csm-hcdiag-1.1.1-188.noarch
ibm-csm-core-1.1.1-163.ppc64le
ibm-csm-hcdiag-1.1.1-163.noarch
ibm-csm-api-1.1.1-163.ppc64le

NOTE: This does get blocked if I try to install an older version of the rpm
The compute daemon does stay up and running in this case

I'm guessing there may be a bug in the building/packaging process for the hcdiag rpm, or some functionality is missing. If I try this with any other rpms I get blocked with some output similar to the following example:

[root@c650f03p09 rpms]# rpm -ivh ibm-csm-core-1.1.1-188.ppc64le.rpm 
Preparing...                          ################################# [100%]
	file /opt/ibm/csm/bin/csm_smt from install of ibm-csm-core-1.1.1-188.ppc64le conflicts with file from package ibm-csm-core-1.1.1-163.ppc64le

CSM_FVT: jsrun_cmd basic bucket fails to execute test script, delete output file

csm_jsrun_cmd is not executing correctly.

2 problems to address here

Regression missed this bug due to a failure to properly clean up output file
fix the bug

The output file that regression has been checking for was observed to have been created on the regression machines on 6/11, so it seems likely that the break occurred on or shortly after that date.

Logging: csm_allocation_step_begin and csm_allocation_step_end don't log an end message

The csm_allocation_step_begin and csm_allocation_step_end APIs don't print an end message. It smells like there might be a state machine error for the final state, but I need to dig a little more.

build error in CSMIMcast.h

I was having trouble building on my branch, so I decided to build off of master first. It looks like there's a type mismatch in the current master. I don't know enough about the code base yet to say for certain this is the problem since I haven't chased the call path all the way yet (which is unfortunately really hard in the API due to the serialization stuff).

/g/g0/bertsch2/src/build/CAST/csmd/src/daemon/src/csmi_request_handler/csmi_mcast/CSMIMcast.h:94:34: error: comparison between signed and unsigned integer expressions [-Werror=sign-compare]
             for( uint32_t i=0; i < _Data->num_nodes; ++i)

Not clear if it is intentional or not, but in different structures num_nodes is sometimes defined signed, and sometimes unsigned. csmi_allocation_step_t is signed, while the others are all unsigned.

[bertsch2@butte5 CAST]$ grep -r 'int32_t num_nodes' *
csmd/src/daemon/src/csmi_request_handler/CSMIAllocationStepQueryDetails.cc:            int32_t num_nodes = output->steps[i]->num_nodes;
csmd/src/daemon/src/csmi_request_handler/CSMIJSRUNCMD.cc:        uint32_t num_nodes =  strtol(fields->data[1], nullptr, 10);
csmd/src/daemon/src/csmi_request_handler/csmi_mcast/CSMIMcastBB.h:    uint32_t num_nodes;
csmd/src/daemon/src/csmi_request_handler/csmi_mcast/CSMIMcastJSRUN.h:    uint32_t num_nodes;
csmi/include/csmi_type_wm.h:    uint32_t num_nodes; /**< Number of nodes, size of @ref compute_nodes. */
csmi/include/csmi_type_wm.h:    uint32_t num_nodes; /**< Number of nodes in the step, size of @ref compute_nodes. */
csmi/include/csmi_type_wm.h:    uint32_t num_nodes; /**< Number of nodes in allocation, size of @ref node_accounting. */
csmi/include/csmi_type_wm.h:    int32_t num_nodes; /**< Number of nodes, size of @ref compute_nodes.*/
csmi/src/wm/include/csmi_wm_type_internal.h:    uint32_t num_nodes; /**< Number of nodes, size of @ref compute_nodes. */
csmi/src/wm/include/csmi_wm_type_internal.h:    int32_t num_nodes; /**< Number of nodes, size of @ref compute_nodes.*/
[bertsch2@butte5 CAST]$

Feature Request: set a string value to NULL

Request: Reset a field in the CSM Database back to NULL. (aka blank)

Objective: CSM update APIs should have the ability to set values to NULL in the CSM database.

Notes:

obviously not all CSM APIs should be able to set values in database to NULL
only APIs with the tag of "update"?
and even then, not all update APIs should let you set a value to NULL in the database.
the important thing here is the feature, once the ability to implement the feature is abailable, then individual APIs can incorporate it, for specific attributes.

ToDo:

test with a single API - node_attributes_update
test a single field - comment
print the keyword on the back end logs (prove we got there)
update the database
pass a NULL value into a CSM API to set the value in the database to be NULL.
support multiple fields on that API
support multiple APIs
improve re-usability of this new feature.

Feature branch associated with this issue: https://github.com/NickyDaB/CAST/tree/feature-new/NULL_Database_fields_CSMAPIS

csmd utility daemon SEGFAULT on NetworkManager

Utility daemon SEGFAULT on NetworkManager

[UTL]2018-05-17 11:21:38.121519       csmd::info     | CSMD: UTILITY; Build: 1.1.0; Config file: /etc/ibm/csm/csm_utility.cfg

[UTL]2018-05-20 21:57:45.552520     csmapi::info     | CSM_CMD_node_resources_query_all[1999289472]; Client Recv; PID: 85848; UID:0; GID:0
[UTL]2018-05-20 21:57:45.552602     csmnet::info     | Deleting known client address: tmp/csmi0.85848.20971
[UTL]2018-05-20 21:57:45.552620     csmnet::warning  | Found disconnecting endpoint with pending Accept() of client tmp/csmi0.85848.20971
[UTL]2018-05-20 21:57:45.552689     csmnet::info     | Sync:Active:Endpoint removal: 192.168.75.124:9816  remaining=1
[UTL]2018-05-20 21:57:45.553052       csmd::info     | DAEMONSTATE: DISCONNECTEP: 192.168.75.124:9816
[UTL]2018-05-20 21:57:45.553073       csmd::info     | DAEMONSTATE: UpdateEPStatus(): Status of connection to: 192.168.75.124:9816 change to RUNMODE_DISCONNECTED
[UTL]2018-05-20 21:57:45.553091       csmd::info     | CONN-HDLNG: Removing listening endpoint type:CSM_NETWORK_TYPE_LOCAL addr=/run/csmd.sock
[UTL]2018-05-20 21:57:45.553110     csmnet::info     | ConnectionHandling::TakeDependentDown:Endpoint removal: /run/csmd.sock  remaining=0
[UTL]2018-05-20 21:57:45.553214       csmd::info     | CONN-HDLNG: Transition from RUNMODE_READY_RUNNING to: RUNMODE_DISCONNECTED Reason: ERROR
[UTL]2018-05-20 21:57:45.554662       csmd::critical | Caught segfault when accessing data at address: 0x78
[UTL]2018-05-20 21:57:45.554693       csmd::critical | Instruction Address from where the crash happened: 0x109ad08c
[UTL]2018-05-20 21:57:45.554709       csmd::critical | SEGFAULT: thread_id = 7fff8174f040 handler_name = NetworkManager

Improved Error Messages for CSM struct version mismatch

CSM needs to improve the error message/reporting when mismatched version ids are found. Today no warning is provided to the API user which can be extremely confusing.

Implement a warning (or should it be info level) log for API version mismatches.

gpu-health needs to run for devices on socket 0 and 1

Fix the gpu-health script to run it twice, one per socket.

dgemm-gpu is not working

File: /tmp/137987/dgemm_gpu-errors-5-0629.132217.log
Error, invalid argument: -E

Usage: jsrun [OPTION...] []
Try jsrun --help' or jsrun --usage' for more information.

================================================================

Results:

The Spectrum MPI /opt/ibm/spectrum_mpi/ is used.
Using /tmp/137987/ as directory for the logs

GPU: 0

GPU: 1

GPU: 2

GPU: 3

GPU: 4

GPU: 5

GPU: 0,1,2,3,4,5

dgemm-gpu.sh test FAIL, rc=2

CSM_FVT: Test Case for creating allocations with an active "staging-out" allocation on the same node

2 Test Cases here:

Create an allocation at staging-in with active staging-out allocation(s) on the node
Create an allocation at running with active staging-out allocation(s) on the node

Manual Run
Test Case Design & Approval
Integrate in to regression

Add launch_node_name flag to csm_allocation_create command line

Add launch_node_name flag to csm_allocation_create command line.

Diagnostic invoking csm_allocation_create/delete

Diagnostic issues the command csm_allocation_create passing the status "running",
then at the end of the run it issue csm_allocation_delete. The csm_allocation_delete is failing with rc=25.
Then I Invoked the command manually, it says:

c699mgt00: > /opt/ibm/csm/bin/csm_allocation_delete -a 14222
[csmapi][warning]       /home/ppsbld/workspace/PUBLIC_CAST_V1.1.1_ppc64LE_RH7.5_ProdBuild/csmi/src/common/src/csmi_common_utils.c-147: the Error Flag Set
[csmapi][error] csmi_sendrecv_cmd failed: 25 - csm_allocation_delete[838226367]; Database Error Message: ERROR:  Detected a multicast operation in progress for allocation, rejecting delete.

This message comes from the db function fn_csm_allocation_delete_start, when checking for valid transition.

The database says:

c699mgt00:/home/diagadmin/log > psql -U postgres -d csmdb -c "select  state  from csm_allocation  where allocation_id=14222"
   state
------------
 to-running
(1 row)

Why the csm_allocation_create is not setting the state to "running"?

CSM_FVT Test Case: csm_allocation_delete captures correct information on failure

Issue #31
Pull Request #34

Steps for Test Case:

Force csm_allocation_delete failure
Verify csm_allocation_query_details output
Verify csmdb data

Progress:

Test Case Design
Manual run
Approval
Integrate in to regression

CSM Utility Daemon is unable to reconnect to CSM Master Daemon

The problem was originally detected via a failure of an LSF daemon to communicate with the local CSM Utility daemon. Attempts to manually run /opt/ibm/csm/bin/csm_node_resources_query_all (or any other CSM command line program) on the utility node in question fail with the following message:

$ /opt/ibm/csm/bin/csm_node_resources_query_all

[csmapi][error] csmlib: Failed to connect to CSM_SSOCKET=/run/csmd.sock; retried 9 times.
csm_net_unix_Connect: No such file or directory
[csmapi][error] csm_net_unix_Connect() failed: /run/csmd.sock
[csmapi][error] /home/ppsbld/workspace/PUBLIC_CAST_V1.1.1_ppc64LE_RH7.5_ProdBuild/csmi/src/wm/cmd/node_resources_query_all.c-166:
[csmapi][error]   csm_init_lib rc= 2, Initialization failed. Success is required to be able to communicate between library and daemon. Are the daemons running?

The message above is considered a normal response when the local CSM daemon is unable to establish a connection with the CSM Master or Aggregator daemon. When there is a daemon-to-daemon communication failure, CSM intentionally stops listening for local client connections until the daemon-to-daemon communication is restored.

Through examination of the CSM Utility and Master daemon logs, the following behavior is observed:

The Utility daemon is started ~10 minutes before the problem occurs.
The Utility daemon is able to successfully connect to the Master daemon and process several message over several minutes.
An unknown error occurs, causing the Utility daemon to detect the connection with the Master daemon is down via a failed heartbeat.
The Utility daemon closes the connection to the Master daemon.
The Utility daemon stops accepting local client requests, as designed. Local client API calls all fail with the message above.
The Utiltity daemon then enters a retry loop, where it attempts to reconnect to the Master daemon, ~ every 50 seconds. However, it is never able to successfully reconnect, despite retrying for over 2 hours.
Finally, a restart of just the Utility daemon allows it to successfully reconnect to the Master and normal operation is restored.

During each retry cycle, the following pattern is observed:

The Utility daemon creates a new connection to the Master daemon and sends a version message.
The Master daemon accepts the connection and processes the version message.
Somewhere between the two daemons, the response from the version message from the Master to Utility daemon is lost, causing the Utility daemon to detect the connection as down and close it.

Open questions:
1.) What was the initial error that triggered this behavior?
2.) What is causing the connection retry handshake to fail, despite repeated attempts to bring the connection back?

The installed version of CSM is:

ibm-csm-core-1.1.1-163
Version: 1.1.1; Build: fc7ceab25b; Date: Tue May 22 14:51:00 2018 -0400;

Add chk-memory test as part of Diagnostic.

Rename test_memsize, adding the number/size of banks check.

CSM_FVT Test Case: csm_allocation_delete with failing conditions

Test case for issue #69

Steps for test case:

Create an allocation
Create and LV for that allocation
Delete the allocation multiple times

# /opt/ibm/csm/bin/csm_allocation_delete -a 632
[csmapi][warning]       /home/ppsbld/workspace/PUBLIC_CAST_V1.1.0_ppc64LE_RH7.5_ProdBuild/csmi/src/common/src/csmi_common_utils.c-147: the Error Flag Set
[csmapi][error] csmi_sendrecv_cmd failed: 25 - csm_allocation_delete[270273068]; Database Error Message: ERROR:  error_handling_test: 23503/update or delete on table "csm_allocation_node" violates foreign key constraint "csm_lv_allocation_id_fkey" on table "csm_lv"

# /opt/ibm/csm/bin/csm_allocation_delete FAILED: returned: 25, errcode: 25 errmsg: csm_allocation_delete[270273068]; Database Error Message: ERROR:  error_handling_test: 23503/update or delete on table "csm_allocation_node" violates foreign key constraint "csm_lv_allocation_id_fkey" on table "csm_lv"

#

Diagnostic chk-led is not working

Invoked chk-led from the management node, with target c699c032 (known to be faulty) and it PASS.

Preparing to run chk-led.                                                         
chk-led started on 1 node(s) at 2018-06-11 18:08:53.802291. It might take up to 10s.                                                                                
.                                                                                 
chk-led ended on 1 node(s) at 2018-06-11 18:08:56.825385, rc= 0, elapsed time: 0:00:03.023094
chk-led PASS on node c699mgt00, serial number: 6856ECA.

=============================== Results summary  ===============================

18:08:53 =======================================================================

chk-led PASS on 2 node(s):

c699mgt00,c699c032


================================================================================

Health Check Diagnostics ended, exit code 0.

rvitals shows:

c699mgt00:/ > sudo /opt/xcat/bin/rvitals c699c032 leds
c699c032: LEDs Fan0: Off
c699c032: LEDs Fan1: Off
c699c032: LEDs Fan2: Off
c699c032: LEDs Fan3: Off
c699c032: LEDs Front Fault: On
c699c032: LEDs Front Identify: Off
c699c032: LEDs Front Power: On
c699c032: LEDs Rear Fault: On
c699c032: LEDs Rear Identify: Off
c699c032: LEDs Rear Power: On

Improve dcgm-diag test with with single/double precision diagnostic option

(Steven Roberts) We need to allow the following run to Diagnostic dcgm-diag:

Single Precision
dcgmi diag -r 3 -j --statspath /root/sPrecision --debugLogFile /root/dcgm_debug.log -v -d 5 &>> dcgm_console.log

Double Precision
dcgmi diag -r diagnostic --parameters Diagnostic.use_doubles=True --statspath /root/dPrecision --debugLogFile /root/dcgm_debug_doubleprecision.log -v -d 5 &>> dcgm_console_doubleprecision.log

CSM_FVT Test Case: csm_ctrl_cmd reset option

Test case for pull request #60

Steps for test case:

get a compute node to use the configured secondary as the primary connection
- either by restarting the primary
- or by starting the secondary first, then start compute; then start primary
run health-check on master/utility to confirm that primary node is linked as secondary:

/csm_infrastructure_health_check -v
       ...
    AGGREGATOR: primaryhost (bounced=0; version=1.1.0)
       ...
	Primary Nodes:
		Active: 1
			COMPUTE: computehost (bounced=0; version=1.1.0; link=SECONDARY)

run ctrl-cmd on compute: csm_ctrl_cmd --agg.reset
run health-check again to confirm link status matches configured status:

csm_infrastructure_health_check -v
       ...
    AGGREGATOR: primaryhost (bounced=0; version=1.1.0)
       ...
	Primary Nodes:
		Active: 1
			COMPUTE: ichonsocks (bounced=0; version=1.1.0; link=PRIMARY)

Progress:

Test Case Design
Manual run
Approval
Integrate to regression

hcdiag version file created by hcdiag/src/framework/CMakeLists.txt is untracked in git

In commit dd1faed

Since the version.py file is created directly from that CMakeLists.txt file when running ./scripts/configure.pl during the build process, it is untracked by git.

So currently, if a user attempts to build CAST they wind up with a dirty git repo
after running build:

-bash-4.1$ git status
# On branch master
# Untracked files:
#   (use "git add <file>..." to include in what will be committed)
#
#	hcdiag/src/framework/version.py

easy enough to handle in the interim but needs to be addressed going forward

Two possible suggested solutions:

Create the file directly (i.e. not in the CMakeLists.txt file) and add it to git
Add it to the .gitignore

However, maybe there is another way to handle this

CSM_FVT Test Case: Additional task migration for Cgroups

Modify cgroup check test cases in regression to check additional task files described in pull request #64

Manual Run
Test Case Development
Integration

Issue with YAML output in CMD Line Interfaces of CSM APIs

Description: There is an issue in the YAML output of the CMD Line Interfaces for CSM APIs. If the field contains a colon followed by a space, it is no longer valid YAML.

Example:

Invalid: comment: diags: failed dcgm
Valid: comment: "diags: failed dcgm"

Here the field is "comment" and the value is "diags: failed dcgm". The YAML gets confused with the regular print of "comment: diags: failed dcgm".

Possible Solution:

Wrap all text values with double quotes in the YAML CMD Line output? But then do we have problems with values containing quotes? Do we have problems with values containing quotes independent of this wrapping solution?
CAST includes boost. boost::property_tree can export to a variety of formats. XML, JSON, etc. Maybe the CSM CMD lines can utilize this feature.

Issue Source: This issue was brought up as part of a customer defect. Problem Report number: 37947 and on the private git. But that was a very direct issue and was resolved as to make the customer happy. This is the true cause of that specific issue and needs to eventually be resolved in the more general case, which is why this issue has been opened.

`build package` fails when trying to build RPM packages

Helping to validate the build process on this project before moving the code.... I ran build package to attempt to create the RPM packages, and I see the following failure:

CPackRPM:Debug: You may consult rpmbuild logs in:
CPackRPM:Debug:    - /root/bluecoral/work/rpms/_CPack_Packages/Linux/RPM/rpmbuildibm-export_layout.err
CPackRPM:Debug: *** + umask 022
+ cd /root/bluecoral/work/rpms/_CPack_Packages/Linux/RPM/BUILD
+ mv /root/bluecoral/work/rpms/_CPack_Packages/Linux/RPM/ibm-1.0.0-Linux/export_layout /root/bluecoral/work/rpms/_CPack_Packages/Linux/RPM/tmpBBroot
+ exit 0
+ umask 022
+ cd /root/bluecoral/work/rpms/_CPack_Packages/Linux/RPM/BUILD
+ '[' /root/bluecoral/work/rpms/_CPack_Packages/Linux/RPM/ibm-1.0.0-Linux/export_layout '!=' / ']'
+ rm -rf /root/bluecoral/work/rpms/_CPack_Packages/Linux/RPM/ibm-1.0.0-Linux/export_layout
++ dirname /root/bluecoral/work/rpms/_CPack_Packages/Linux/RPM/ibm-1.0.0-Linux/export_layout
+ mkdir -p /root/bluecoral/work/rpms/_CPack_Packages/Linux/RPM/ibm-1.0.0-Linux
+ mkdir /root/bluecoral/work/rpms/_CPack_Packages/Linux/RPM/ibm-1.0.0-Linux/export_layout
+ '[' -e /root/bluecoral/work/rpms/_CPack_Packages/Linux/RPM/ibm-1.0.0-Linux/export_layout ']'
+ rm -rf /root/bluecoral/work/rpms/_CPack_Packages/Linux/RPM/ibm-1.0.0-Linux/export_layout
+ mv /root/bluecoral/work/rpms/_CPack_Packages/Linux/RPM/tmpBBroot /root/bluecoral/work/rpms/_CPack_Packages/Linux/RPM/ibm-1.0.0-Linux/export_layout
+ /bin/true
error: Bad syntax: %defattr(,,,)
    Bad syntax: %defattr(,,,)
 ***
CPackRPM:Debug:    - /root/bluecoral/work/rpms/_CPack_Packages/Linux/RPM/rpmbuildibm-export_layout.out
CPackRPM:Debug: *** Building target platforms: ppc64le
Building for target ppc64le
Executing(%prep): /bin/sh -e /var/tmp/rpm-tmp.lQphSk
Executing(%install): /bin/sh -e /var/tmp/rpm-tmp.gNF0Pt
Processing files: ibm-export_layout-1.0.0-9474.ppc64le


RPM build errors:
 ***
...
CPack Error: Problem copying the package: /root/bluecoral/work/rpms/_CPack_Packages/Linux/RPM/ibm-1.0.0-Linux-export_layout.rpm to /root/bluecoral/work/rpms/ibm-1.0.0-Linux-export_layout.rpm
CPack Error: Error when generating package: ibm
make: *** [package] Error 1
*** CONFIGURE FAILED  (cmd= make package partial_install   rc=512)

Attaching full logs. This is trying to build dev branch of bluecoral, pulled about 30 min ago.

build.package.log

Retrieve the machine model from /proc/cpuinfo instead of /proc/device-tree/model

In some easier versions of the node motherboard, /proc/device-tree/model returns all 0s.
/proc/cpuinfo always returns the correct model.

hxeewm_pass2 test fails, mdt file is not created by default

hxeewm_pass2 requires the mdt file, mdt.tpd which it used to be created by default.
There is a new procedure to create this mdt file.

CSM_FVT Test Case: csm_allocation_create using launch_node_name flag

Issue #38
Pull Request #40

build expects to write to /opt/ibm

The build system expects to be able to write to /opt. We really should be installing to some alternate path and then building packages from there, not trying to install in the system default location. This enables two really important capabilities.

Ability to build as non-privileged user (someone who can't write to /opt).
Ability to build new packages without overwriting an existing install on the same node.

CMake Error at cmake_install.cmake:36 (file):
  file INSTALL cannot copy file
  "/g/g0/bertsch2/src/build/CAST/csmconf/csm_aggregator.cfg" to
  "/opt/ibm/bin/csm_aggregator.cfg".

libcsmpam.so inappropriately uses 'session' to perform 'account' function

libcsmpam.so is currently performing account function in the session part of the module. There are three issues to fix by separating these two functions.

This can cause a pam stack to initiate other session functions inappropriately when account should not have succeeded in the first place.
This module inappropriately prints information related to authentication during session (printing is not allowed during authentication) and this can break protocols such as rsh or scp that rely on connecting stdout/stdin to a program running on the remote host.
It is desirable to be able to re-order the account pam stack separately from the session pam stack (which in libcsmpam is used to set up cgroups).

We are currently trying to work the CAST corporate contributor agreement through our legal department. If we are successful we can probably submit a pull request to fix these issues.

csm_ras.log file has incorrect timestamp for events created by Diagnostic

Diagnostic creates a ras event using csm_ras_event_create common without passing the timestamp.
When I query the database using csm_ras_event_query, the timestamp is displayed correctly.
But in the csm_ras.log files shows :

{"time_stamp":"now","msg_id":"hcdiag.test.fail","location_name":"c699c192","raw_data":"","kvcsv":"test=chk-gpfs-mount,node=c699c192,sn=785600A,rc=1,file=\/shared\/logs\/node_health\/180606114724663493\/chk-gpfs-mount\/c699c192-2018-06-06-11_47_27.output","test":"chk-gpfs-mount","node":"c699c192","sn":"785600A","rc":"1","file":"\/shared\/logs\/node_health\/180606114724663493\/chk-gpfs-mount\/c699c192-2018-06-06-11_47_27.output","ctxid":"418","severity":"WARNING","message":"chk-gpfs-mount FAIL on node c699c192 serial number: 785600A (details in \/shared\/logs\/node_health\/180606114724663493\/chk-gpfs-mount\/c699c192-2018-06-06-11_47_27.output).","control_action":"NONE","description":"NONE","enabled":"1","set_state":"","visible_to_users":"1","threshold_count":"1","threshold_period":"5","count":"1"}

csm_allocation_delete fails with rc=25 while allocation in to-running state

As of CAST 1.0, the new to-running state appears to be causing us problems in synchronization with LSF. If a user starts an interactive job and Ctrl-C's out of it while the jobs is in to-running, LSF will issue a csm_allocation_delete. This call will return rc=25, but LSF interprets that as "delete has already happened" and never tries again. This leaves an orphan allocation.

Ideally, the csm_allocation_delete would be queued and processed as soon as the prologue finishes.

CSM is reporting incorrect information from the job

CSM seems to show incorrect values for several fields. This is using:
ibm-csm-hcdiag-1.0.0-9460.noarch
ibm-csm-api-1.0.0-9460.ppc64le
ibm-csm-core-1.0.0-9460.ppc64le

$ /opt/ibm/csm/bin/csm_allocation_query_details -a 2272
---
allocation_id:                  2272
primary_job_id:                 19121
secondary_job_id:               0
num_nodes:                      2
compute_nodes:
  h41n01:
  h41n02:
num_transitions:                2
state_transitions:
  - history_time:            2018-05-03 17:06:46.655256
    state:                   running
  - history_time:            2018-05-03 17:06:30.105715
    state:                   to-running
ssd_file_system_name:
launch_node_name:               login1
user_flags:
system_flags:
ssd_min:                        0
ssd_max:                        0
num_processors:                 0
num_gpus:                       0
projected_memory:               0
state:                          running
type:                           jsm-cgroup-step
job_type:                       batch
user_name:                      vgv
user_id:                        11685
user_group_id:                  22401
**INCORRECT: user_script:                    1525381589.19121**
begin_time:                     2018-05-03 17:06:30.105715
account:
comment:
**INCORRECT: job_name:                       stf006accept**
job_submit_time:                2018-05-03 17:06:29
queue:                          batch
requeue:
time_limit:                     10800
wc_key:
isolated_cores:                 1
num_steps:      1

Prior failed jobs causing failover problems

When LVs are left on one or more CNs due to failures, subsequent failover attempts to a new bbServer attempts to register those LVs and restart any associated transfer definitions for those failed LVs. Not sure of the exposure right now, but this is definitely a problem for spanning handles where not all contributors will be restarted.

Variable that is not initialized should be tested in chk-os

And a more user friendly message should be displayed. For example: node not found in the clustconf.yaml file.

Running chk-os.sh on c699wrk01, machine type  8335-GTC.
 cat /etc/os-release | grep PRETTY_NAME | cut -d= -f2 2>/tmp/ko_OFX6Uzf/stderr
 uname -r 2>/tmp/ko_OFX6Uzf/stderr
 Use of uninitialized value $exp_os in string ne at ./chk-os.pm line 92.
 Use of uninitialized value $exp_os in concatenation (.) or string at ./chk-os.pm line 93.
 Use of uninitialized value $exp_kernel in concatenation (.) or string at ./chk-os.pm line 93.
 (ERROR) c699wrk01: nodeCfg->{os}->{pretty_name} failed
 (ERROR) c699wrk01: nodeCfg->{kernel}->{release} failed
 (ERROR) c699wrk01: invalid version, release: Red Hat Enterprise Linux Server 7.5 (Maipo),4.14.0-49.8.1.el7a.ppc64le; expected: ,
 chk-os test FAIL, rc=1
 Remote_command_rc = 1

csm_allocation_delete state change on a failure condition

csm_allocation_delete is not changing the state on a faulire condition.

# /opt/ibm/csm/bin/csm_allocation_query_details -a 632
---
allocation_id:                  632
primary_job_id:                 1874
secondary_job_id:               0
num_nodes:                      1
compute_nodes:
  f0n17:
num_transitions:                8
state_transitions:
  - history_time:            2018-05-22 09:54:43.871944
    state:                   deleting-mcast
  - history_time:            2018-05-22 09:46:40.173818
    state:                   deleting-mcast
  - history_time:            2018-05-21 19:19:30.43866
    state:                   deleting-mcast
  - history_time:            2018-05-21 19:17:30.026285
    state:                   deleting-mcast
  - history_time:            2018-05-21 19:15:29.470352
    state:                   deleting-mcast
  - history_time:            2018-05-21 19:14:32.313438
    state:                   running
  - history_time:            2018-05-21 19:14:01.987771
    state:                   to-running
  - history_time:            2018-05-21 19:13:39.327739
    state:                   staging-in
ssd_file_system_name:
launch_node_name:               f0n16
user_flags:
system_flags:
ssd_min:                        0
ssd_max:                        0
num_processors:                 0
num_gpus:                       0
projected_memory:               0
state:                          deleting-mcast
type:                           jsm
job_type:                       batch
user_name:                      uno
user_id:                        1060
user_group_id:                  1
user_script:                    /gpfst/jake/myexec.sh
begin_time:                     2018-05-21 19:13:39.327739
account:
comment:
job_name:                       default
job_submit_time:                2018-05-21 19:13:35
queue:                          csm_bb
requeue:
time_limit:                     0
wc_key:
isolated_cores:                 0
num_steps:      3
steps:
 - step_id:         3
   num_nodes:       1
   compute_nodes:
    - f0n17
 - step_id:         2
   end_time:        2018-05-21 19:15:16.583802
   num_nodes:       1
   compute_nodes:
    - f0n17
 - step_id:         1
   end_time:        2018-05-21 19:14:32.823289
   num_nodes:       1
   compute_nodes:
    - f0n17
...
#

User ulimit do not propagate when CSM fork() and exec() user process

Look like user ulimit do not propagate when CSM does fork and exec of user process.

We found an example for LimitMEMLOCK=infinity. The documented workaround is:

As root, on every compute node: Edit /usr/lib/systemd/system/csmd-compute.service and add this line:
LimitMEMLOCK=infinity
As root, on every compute node:
systemctl daemon-reload
systemctl restart csmd-compute
On the launch node:
$ /opt/ibm/csm/bin/csm_node_attributes_update -s “IN_SERVICE” -n

CSM_FVT Test Case: csm_allocation_update state, private test

csm_allocation_update_state needs a regression case for private allocations.

The test should have a flow something like this:

Create an allocation as a non-root user (A) in the staging-in state.
Attempt to transition the state from staging-in to running as another non-root user (B), the API should reject.
Attempt to transition the state from staging-in to running as the proper user (A), API should accept.
csm_allocation_query_details should have a staging-in, to-running and running state.
Attempt to transition the state from running to staging-out as another non-root user (B), the API should reject.
Attempt to transition the state from staging-in to running as the proper user (A), API should accept.
csm_allocation_query_details should have a staging-in, to-running, running, to-staging-out and staging-out state.
Delete the allocation.

epilog failures at LLNL

In the nvidia epilog, llnl getting lots of failures.

[root@sierra1133:epilog.d]# /admin/scripts/test_sierra_node
FAILED: sierra1133 Failed during cudaGetDeviceCount: initialization error

so on a compute node on sierra, say sierra1, there is a file /admin/scripts/test_sierra_node.cu

The script "test_sierra_node" is checking the state of the gpus on the system....what they find is that this code will fail for some nodes, after a job has finished....it means that no other jobs can use the GPUs unless the node is rebooted.

customer don't have a reproducer, but the hope is that somehow nvidia could determine why the node is stuck....we have plenty of example nodes for them to check on.

epilog failures that we're seeing on sierra

CSM maintenance

compute: "GPU 0 failed cudaMalloc 1": 1: sierra3061
compute: "IB CQE completion error 12": 1: sierra1626
compute: "SLOW GPU": 1: sierra1539
compute: "epilog: FAILED: sierra1133 Failed during cudaGetDeviceCount: initialization error": 1: sierra1133
compute: "epilog: FAILED: sierra1301 Failed during cudaGetDeviceCount: initialization error": 1: sierra1301
compute: "epilog: FAILED: sierra1310 Failed during cudaGetDeviceCount: initialization error": 1: sierra1310
compute: "epilog: FAILED: sierra1840 Failed during cudaGetDeviceCount: initialization error": 1: sierra1840
compute: "epilog: FAILED: sierra1942 Failed during cudaGetDeviceCount: initialization error": 1: sierra1942
compute: "epilog: FAILED: sierra2075 Failed during cudaGetDeviceCount: initialization error": 1: sierra2075
compute: "epilog: FAILED: sierra2551 Failed during cudaGetDeviceCount: initialization error": 1: sierra2551
compute: "epilog: FAILED: sierra2789 GPU 0 failed cudaMalloc 1: out of memory": 1: sierra2789
compute: "epilog: FAILED: sierra3053 GPU 0 failed cudaMalloc 1: out of memory": 1: sierra3053
compute: "epilog: FAILED: sierra3518 Failed during cudaGetDeviceCount: initialization error": 1: sierra3518
compute: "epilog: FAILED: sierra3832 GPU 3 failed cudaMalloc 1: out of memory": 1: sierra3832
compute: "epilog: FAILED: sierra4151 Failed during cudaGetDeviceCount: initialization error": 1: sierra4151
compute: "epilog: FAILED: sierra824 Failed during cudaGetDeviceCount: initialization error": 1: sierra824
compute: "epilog: TEST_SIERRA_NODE failed with empty output": 1: sierra3797
compute: "foxd: IB 2 missing":

CSM_FVT: Test Case for large mgs protocol

We found out that the large-msg protocol for the direction from client to daemon was broken. We need to add a test case that causes a client to send a large msg to a daemon.
Pull request #95 fixes the protocol.

The simplest option is to use the error_inject tool, however, it's only in the build tree and not included in the RPM right now:

error_inject -m <msg_size>

whith msg_size to cover 3 cases:

m < 128k
128k < m < 2x128k
m > 2x128k
Reason for the 2x category is that the case that made us find the issue was in that range and it caused the operation to successfully complete on the daemon side but then followed by disconnects and error logging. A case with > 2x would have failed completely.

Another option could be through the IB switch API which sends the content of a file to the daemon for insertion into the CSMDB. The size of that file can be controlled to create a set of test cases. I think @NickyDaB should have more info about how to create a test case from that.

Change the way Diagnostic handles allocation_id

allocation_id is a csm_diag_run table's column.

Current implementation:

allocation_id: only when Diagnostic creates an allocation (issue csm_allocation_create): run in Management mode, with allocation (--noallocation argument is not used)
0: when Diagnostic does not create an allocation: in Management Mode when --noallocation argument is passed, or in Node Mode (lsf job, prolog/epilog)

The new implementation:

allocation_id: if Diagnostic run in an allocation, either created by Diagnostic or by LSF
0: if Diagnostic run in Management mode with --noallocation argument

CSM Version Mismatch message not being displayed in compute log

using latest daily build for master/aggregator/utility

ibm-csm-core-1.1.0-53.ppc64le

and using an old build for compute

ibm-csm-core-0.5.0-9237_prega.ppc64le.rpm

The regression bucket makes 3 checks for this test case

Master daemon is still running (systemctl)
Compute daemon is not active (systemctl)
Compute daemon is disconnected and VERSION_MISMATCH message is logged (compute log)

Checks 2 and 3 are failing. The observed behavior is that the compute daemon stays active, and the compute log is as follows:

[COMPUTE]2018-05-04 08:54:33.326326       csmd::info     | CSMD: COMPUTE; Build: 302ee8da6b; Config file: /etc/ibm/csm/csm_compute.cfg
[COMPUTE]2018-05-04 08:54:33.327506       csmd::info     | Configuration: DaemonID: 29409735 using hostname: c650f03p07 PID: 120199
[COMPUTE]2018-05-04 08:54:33.328412       csmd::info     | Configuration: thread_pool_size = 4
[COMPUTE]2018-05-04 08:54:33.328629       csmd::info     | Configuration: Reading API permissions from ACL file: /etc/ibm/csm/csm_api.acl
[COMPUTE]2018-05-04 08:54:33.329219       csmd::warning  | Configuration: API=csm_jsrun_cmd NOT DEFINED!
[COMPUTE]2018-05-04 08:54:33.329325       csmd::info     | Configuration: Reading API configuration file: /etc/ibm/csm/csm_api.cfg
[COMPUTE]2018-05-04 08:54:33.329552       csmd::warning  | Configuration: API=csm_jsrun_cmd NOT DEFINED!
[COMPUTE]2018-05-04 08:54:33.333860       csmd::info     | Configuration: Found 1 buckets.
[COMPUTE]2018-05-04 08:54:33.334317       csmd::info     | CONN-HDLNG: Transition from RUNMODE_STARTED to: RUNMODE_CONFIGURED Reason: TRANSITION
[COMPUTE]2018-05-04 08:54:33.334556       csmd::info     | CONN-HDLNG: Setting up compute-aggregator connections. SingleMode=false
[COMPUTE]2018-05-04 08:54:33.334581       csmd::info     | DaemonState: Setting up 10.7.0.222:9800 as aggregator[0]
[COMPUTE]2018-05-04 08:54:33.334605       csmd::info     | DaemonState: Setting up 10.7.3.13:9800 as aggregator[1]
[COMPUTE]2018-05-04 08:54:33.334737       csmd::info     | NETMGR: Starting Network manager thread (3fff90e0f060)
[COMPUTE]2018-05-04 08:54:33.334801       csmd::info     | CONN-HDLNG: Transition from RUNMODE_CONFIGURED to: RUNMODE_DISCONNECTED Reason: TRANSITION
[COMPUTE]2018-05-04 08:54:33.334867     csmnet::info     | Connecting socket: 9 to 10.7.0.222:9800
[COMPUTE]2018-05-04 08:54:33.335076       csmd::info     | DaemonState::UpdateEPStatus(): Status of connection to: 10.7.0.222:9800 change to RUNMODE_DISCONNECTED
[COMPUTE]2018-05-04 08:54:33.335497     csmnet::info     | MultiEndpoint::Recv: shutdown or disconnect of 10.7.0.222:9800
[COMPUTE]2018-05-04 08:54:33.335531     csmnet::info     | Recv:HUP:Endpoint removal: 10.7.0.222:9800  remaining=0
[COMPUTE]2018-05-04 08:54:33.335586       csmd::info     | Updated aggregator priority: primary=10.7.3.13:9800 secondary=10.7.0.222:9800
[COMPUTE]2018-05-04 08:54:33.335608       csmd::info     | DISCONNECTEP: 10.7.0.222:9800
[COMPUTE]2018-05-04 08:54:33.385222     csmnet::info     | Connecting socket: 9 to 10.7.0.222:9800
[COMPUTE]2018-05-04 08:54:33.385786     csmnet::info     | MultiEndpoint::Recv: shutdown or disconnect of 10.7.0.222:9800

So the daemon is moving in to the disconnected state, but I don't see a version mismatch message. There a couple of things to discuss here:

Are the checks for this test case valid? Or is the observed behavior the correct behavior?
Is the "old" rpm being used for the compute node in this test case not a good version to be using?
And finally... is there a bug in the code suppressing the VERSION_MISMATCH message?