vpenso / prometheus-slurm-exporter
Prometheus exporter for performance metrics from Slurm.
License: GNU General Public License v3.0
sinfo -h -N -O "NodeList,AllocMem,Memory,CPUsState,StateLong"
compute-1 0 7772 0/2/0/2 idle
compute-1 0 7772 0/2/0/2 idle
computegpu-1 0 31356 0/8/0/8 idle
computegpu-1 0 31356 0/8/0/8 idle
computemgpu-v0 515694 0/128/0/128 idle
prometheus-slurm-exporter --listen-address=myip:port
INFO[0000] Starting Server: ip:port source="main.go:59"
INFO[0000] GPUs Accounting: false source="main.go:60"
panic: runtime error: index out of range [4] with length 4
goroutine 26 [running]:
main.ParseNodeMetrics(0xc000030600, 0x25e, 0x600, 0x1)
/opt/slurm_exporter/node.go:56 +0x6d6
main.NodeGetMetrics(0x6)
/opt/slurm_exporter/node.go:40 +0x2a
main.(*NodeCollector).Collect(0xc0000ac000, 0xc000070d80)
/opt/slurm_exporter/node.go:128 +0x37
github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
/opt/slurm_exporter/go/modules/pkg/mod/github.com/prometheus/[email protected]/prometheus/registry.go:443 +0x12b
created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
/opt/slurm_exporter/go/modules/pkg/mod/github.com/prometheus/[email protected]/prometheus/registry.go:535 +0xe4d
I'm thinking it would be better (and more in line with Prometheus metric/label conventions) to have job status and node states as labels instead of separate metrics, since we are measuring the same thing.
From the Prometheus documentation:
Use labels to differentiate the characteristics of the thing that is being measured:
api_http_requests_total - differentiate request types: type="create|update|delete"
api_request_duration_seconds - differentiate request stages: stage="extract|transform|load"
It would make it easier to show totals as well (right now we can't easily show totals, because we don't know all the metric names up front - in my case, failed/error metrics are not present, because none of my nodes are in that state yet).
Thinking about it, it would be good to set default values for metrics to 0 in case the metric doesn't have a value / doesn't exist. From the Prometheus page:
Avoid missing metrics
Time series that are not present until something happens are difficult to deal with, as the usual simple operations are no longer sufficient to correctly handle them. To avoid this, export 0 (or NaN, if 0 would be misleading) for any time series you know may exist in advance.
Most Prometheus client libraries (including Go, Java, and Python) will automatically export a 0 for you for metrics with no labels.
It would also be good to have an 'up' metric, something like slurm_up, with a value of 0 if the scrape of any of the Slurm commands is unsuccessful (see the Prometheus documentation). In that case, one can set an alert: if slurm_up == 0, alert('Slurm is not responding').
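For what it's worth, a minimal sketch of all three suggestions with client_golang (the metric and state names here are illustrative, not the exporter's actual ones):

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// One gauge with a "state" label instead of one metric per state.
	nodeStates = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{Name: "slurm_nodes", Help: "Nodes by Slurm state"},
		[]string{"state"},
	)
	slurmUp = prometheus.NewGauge(
		prometheus.GaugeOpts{Name: "slurm_up", Help: "1 if the Slurm commands respond, 0 otherwise"},
	)
)

func main() {
	prometheus.MustRegister(nodeStates, slurmUp)
	// Export 0 in advance for every state known to exist, so the series
	// are present even before any node enters that state.
	for _, s := range []string{"alloc", "comp", "down", "drain", "err", "fail", "idle", "maint", "mix", "resv"} {
		nodeStates.WithLabelValues(s).Set(0)
	}
	slurmUp.Set(1) // set to 0 whenever a sinfo/squeue/sdiag call fails
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}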
Nothing critical, I just thought I would let you know.
Thanks for the great exporter!
Hello, I have followed your steps to build the CentOS executable and that went fine. As a new user of Prometheus I'm unsure from your instructions what to do to have a working configuration.
I have created a file prometheus.yml in my prometheus-slurm-exporter directory. Is there anything else that needs to be in that file other than what you have in your readme?
Maybe you could post an example of a basic working config.
This server is also the slurm master node of a test cluster I run.
Mine currently looks like this:
#
# SLURM resource manager:
#
- job_name: 'my_slurm_exporter'
  scrape_interval: 30s
  scrape_timeout: 30s
  static_configs:
    - targets: ['localhost:8080']
$~ promtool check config prometheus.yml
Checking prometheus.yml
FAILED: parsing YAML file prometheus.yml: yaml: unmarshal errors:
line 4: cannot unmarshal !!seq into config.plain
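For reference, that unmarshal error usually means the job list is not nested under a top-level scrape_configs: key. A minimal complete prometheus.yml, assuming the exporter listens on localhost:8080, might look like this:

global:
  scrape_interval: 30s

scrape_configs:
  #
  # SLURM resource manager:
  #
  - job_name: 'my_slurm_exporter'
    scrape_interval: 30s
    scrape_timeout: 30s
    static_configs:
      - targets: ['localhost:8080']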
When I run the executable
$~ prometheus-slurm-exporter
INFO[0000] Starting Server: :8080 source="main.go:42"
When I access http://hostname:8080/graphs in the browser I get 404 page not found
If I access http://hostname:8080/metrics it is updating with the correct scheduler info.
In my Grafana server datasource settings, when I add a new datasource I get an HTTP Error.
Many thanks,
Brendan
CentOS 8.1.1911
slurm 20.11.7
When trying to retrieve the metrics URL:
$ wget http://localhost:8080/metrics
--2022-06-20 14:27:18-- http://localhost:8080/metrics
Resolving localhost (localhost)... ::1, 127.0.0.1
Connecting to localhost (localhost)|::1|:8080... connected.
HTTP request sent, awaiting response... No data received.
Retrying.
--2022-06-20 14:27:19-- (try: 2) http://localhost:8080/metrics
Connecting to localhost (localhost)|::1|:8080... failed: Connection refused.
Connecting to localhost (localhost)|127.0.0.1|:8080... failed: Connection refused.
Server shows:
./prometheus-slurm-exporter -gpus-acct
INFO[0000] Starting Server: :8080 source="main.go:59"
INFO[0000] GPUs Accounting: true source="main.go:60"
2022/06/20 14:26:58 exit status 127
Running strace on slurm exporter:
$ strace prometheus-slurm-exporter
execve("/usr/bin/prometheus-slurm-exporter", ["prometheus-slurm-exporter"], 0x7ffe15f9f370 /* 40 vars */) = 0
brk(NULL) = 0x17fd000
arch_prctl(0x3001 /* ARCH_??? */, 0x7ffe7c3257a0) = -1 EINVAL (Invalid argument)
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=53799, ...}) = 0
mmap(NULL, 53799, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7febb614e000
close(3) = 0
openat(AT_FDCWD, "/lib64/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0000o\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=754552, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7febb614c000
mmap(NULL, 2225344, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7febb5d14000
mprotect(0x7febb5d2f000, 2093056, PROT_NONE) = 0
mmap(0x7febb5f2e000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1a000) = 0x7febb5f2e000
mmap(0x7febb5f30000, 13504, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7febb5f30000
close(3) = 0
openat(AT_FDCWD, "/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\2009\2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=5993088, ...}) = 0
mmap(NULL, 3942432, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7febb5951000
mprotect(0x7febb5b0a000, 2097152, PROT_NONE) = 0
mmap(0x7febb5d0a000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1b9000) = 0x7febb5d0a000
mmap(0x7febb5d10000, 14368, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7febb5d10000
close(3) = 0
openat(AT_FDCWD, "/lib64/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\220\20\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=52328, ...}) = 0
mmap(NULL, 2109744, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7febb574d000
mprotect(0x7febb5750000, 2093056, PROT_NONE) = 0
mmap(0x7febb594f000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2000) = 0x7febb594f000
close(3) = 0
mmap(NULL, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7febb6149000
arch_prctl(ARCH_SET_FS, 0x7febb6149740) = 0
mprotect(0x7febb5d0a000, 16384, PROT_READ) = 0
mprotect(0x7febb594f000, 4096, PROT_READ) = 0
mprotect(0x7febb5f2e000, 4096, PROT_READ) = 0
mprotect(0x7febb615c000, 4096, PROT_READ) = 0
munmap(0x7febb614e000, 53799) = 0
set_tid_address(0x7febb6149a10) = 26536
set_robust_list(0x7febb6149a20, 24) = 0
rt_sigaction(SIGRTMIN, {sa_handler=0x7febb5d1a9a0, sa_mask=[], sa_flags=SA_RESTORER|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_1, {sa_handler=0x7febb5d1aa30, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
brk(NULL) = 0x17fd000
brk(0x181e000) = 0x181e000
sched_getaffinity(0, 8192, [0, 1, 2, 3]) = 16
openat(AT_FDCWD, "/sys/kernel/mm/transparent_hugepage/hpage_pmd_size", O_RDONLY) = 3
read(3, "2097152\n", 20) = 8
close(3) = 0
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7febb6109000
mmap(NULL, 131072, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7febb60e9000
mmap(NULL, 1048576, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7febb5fe9000
mmap(NULL, 8388608, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7febb4f4d000
mmap(NULL, 67108864, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7febb0f4d000
mmap(NULL, 536870912, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7feb90f4d000
mmap(0xc000000000, 67108864, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xc000000000
mmap(NULL, 33554432, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7feb8ef4d000
mmap(NULL, 2165776, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7feb8ed3c000
mmap(0xc000000000, 4194304, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xc000000000
mmap(0x7febb60e9000, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7febb60e9000
mmap(0x7febb6069000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7febb6069000
mmap(0x7febb5353000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7febb5353000
mmap(0x7febb2f7d000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7febb2f7d000
mmap(0x7feba10cd000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7feba10cd000
mmap(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7feb8ec3c000
mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7febb5fd9000
mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7febb5fc9000
rt_sigprocmask(SIG_SETMASK, NULL, [], 8) = 0
sigaltstack(NULL, {ss_sp=NULL, ss_flags=SS_DISABLE, ss_size=0}) = 0
sigaltstack({ss_sp=0xc000002000, ss_flags=0, ss_size=32768}, NULL) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
gettid() = 26536
rt_sigaction(SIGHUP, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGHUP, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGINT, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGINT, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGQUIT, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGQUIT, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGILL, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGILL, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGTRAP, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGTRAP, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGABRT, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGABRT, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGBUS, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGBUS, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGFPE, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGFPE, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGUSR1, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGUSR1, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGSEGV, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGSEGV, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGUSR2, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGUSR2, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGPIPE, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGPIPE, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGALRM, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGALRM, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGTERM, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGTERM, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGSTKFLT, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGSTKFLT, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGCHLD, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGCHLD, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGURG, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGURG, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGXCPU, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGXCPU, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGXFSZ, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGXFSZ, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGVTALRM, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGVTALRM, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGPROF, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGPROF, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGWINCH, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGWINCH, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGIO, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGIO, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGPWR, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGPWR, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGSYS, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGSYS, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRTMIN, NULL, {sa_handler=0x7febb5d1a9a0, sa_mask=[], sa_flags=SA_RESTORER|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, 8) = 0
rt_sigaction(SIGRTMIN, NULL, {sa_handler=0x7febb5d1a9a0, sa_mask=[], sa_flags=SA_RESTORER|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, 8) = 0
rt_sigaction(SIGRTMIN, {sa_handler=0x7febb5d1a9a0, sa_mask=[], sa_flags=SA_RESTORER|SA_ONSTACK|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_1, NULL, {sa_handler=0x7febb5d1aa30, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, 8) = 0
rt_sigaction(SIGRT_1, NULL, {sa_handler=0x7febb5d1aa30, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, 8) = 0
rt_sigaction(SIGRT_1, {sa_handler=0x7febb5d1aa30, sa_mask=[], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_2, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_3, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_3, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_4, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_4, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_5, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_5, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_6, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_6, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_7, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_7, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_8, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_8, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_9, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_9, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_10, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_10, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_11, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_11, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_12, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_12, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_13, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_13, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_14, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_14, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_15, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_15, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_16, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_16, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_17, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_17, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_18, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_18, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_19, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_19, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_20, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_20, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_21, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_21, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_22, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_22, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_23, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_23, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_24, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_24, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_25, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_25, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_26, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_26, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_27, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_27, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_28, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_28, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_29, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_29, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_30, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_30, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_31, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_31, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_32, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_32, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
mmap(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7feb8e43b000
mprotect(0x7feb8e43c000, 8388608, PROT_READ|PROT_WRITE) = 0
clone(child_stack=0x7feb8ec3afb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7feb8ec3b9d0, tls=0x7feb8ec3b700, child_tidptr=0x7feb8ec3b9d0) = 26537
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=26536, si_uid=0} ---
rt_sigreturn({mask=[]}) = 11422176
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=26536, si_uid=0} ---
rt_sigreturn({mask=[]}) = 11422176
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=26536, si_uid=0} ---
rt_sigreturn({mask=[]}) = 11422176
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=26536, si_uid=0} ---
rt_sigreturn({mask=[]}) = 11422176
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=26536, si_uid=0} ---
rt_sigreturn({mask=[]}) = 11422176
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=26536, si_uid=0} ---
rt_sigreturn({mask=[]}) = 11422176
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=26536, si_uid=0} ---
rt_sigreturn({mask=[]}) = 712
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
mmap(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7feb8dc3a000
mprotect(0x7feb8dc3b000, 8388608, PROT_READ|PROT_WRITE) = 0
clone(child_stack=0x7feb8e439fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7feb8e43a9d0, tls=0x7feb8e43a700, child_tidptr=0x7feb8e43a9d0) = 26538
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
futex(0xc000046950, FUTEX_WAKE_PRIVATE, 1) = 1
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
mmap(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7feb8cc38000
mprotect(0x7feb8cc39000, 8388608, PROT_READ|PROT_WRITE) = 0
clone(child_stack=0x7feb8d437fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7feb8d4389d0, tls=0x7feb8d438700, child_tidptr=0x7feb8d4389d0) = 26540
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
mmap(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7feb8c437000
mprotect(0x7feb8c438000, 8388608, PROT_READ|PROT_WRITE) = 0
clone(child_stack=0x7feb8cc36fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7feb8cc379d0, tls=0x7feb8cc37700, child_tidptr=0x7feb8cc379d0) = 26541
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
fcntl(0, F_GETFL) = 0x402 (flags O_RDWR|O_APPEND)
futex(0xc000046d50, FUTEX_WAKE_PRIVATE, 1) = 1
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7feb8c3f7000
fcntl(1, F_GETFL) = 0x402 (flags O_RDWR|O_APPEND)
fcntl(2, F_GETFL) = 0x402 (flags O_RDWR|O_APPEND)
futex(0x7febb59500e8, FUTEX_WAKE_PRIVATE, 2147483647) = 0
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=53799, ...}) = 0
mmap(NULL, 53799, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7febb614e000
close(3) = 0
openat(AT_FDCWD, "/lib64/libcrypto.so.10", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0\322\6\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=2779880, ...}) = 0
mmap(NULL, 4598856, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7feb77b9d000
mprotect(0x7feb77dd4000, 2097152, PROT_NONE) = 0
mmap(0x7feb77fd4000, 163840, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x237000) = 0x7feb77fd4000
mmap(0x7feb77ffc000, 15432, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7feb77ffc000
close(3) = 0
openat(AT_FDCWD, "/lib64/libz.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0'\0\0\0\0\0\0"..., 832) = 832
lseek(3, 88296, SEEK_SET) = 88296
read(3, "\4\0\0\0\20\0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0", 32) = 32
fstat(3, {st_mode=S_IFREG|0755, st_size=101032, ...}) = 0
lseek(3, 88296, SEEK_SET) = 88296
read(3, "\4\0\0\0\20\0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0", 32) = 32
mmap(NULL, 2187272, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7feb8c1e0000
mprotect(0x7feb8c1f6000, 2093056, PROT_NONE) = 0
mmap(0x7feb8c3f5000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x15000) = 0x7feb8c3f5000
mmap(0x7feb8c3f6000, 8, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7feb8c3f6000
close(3) = 0
mprotect(0x7feb8c3f5000, 4096, PROT_READ) = 0
mprotect(0x7feb77fd4000, 114688, PROT_READ) = 0
openat(AT_FDCWD, "/etc/pki/tls/legacy-settings", O_RDONLY) = -1 ENOENT (No such file or directory)
access("/etc/system-fips", F_OK) = -1 ENOENT (No such file or directory)
munmap(0x7febb614e000, 53799) = 0
getpid() = 26536
newfstatat(AT_FDCWD, "/proc", {st_mode=S_IFDIR|0555, st_size=0, ...}, 0) = 0
futex(0xc000080150, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000046d50, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000080150, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000046d50, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000046d50, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000046d50, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000046950, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000080150, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000046950, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000080150, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000046950, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000080150, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000046950, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000080150, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000046950, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000080150, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xc000046950, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000080150, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000046950, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000080150, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000046950, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000046950, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000046d50, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000046d50, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000080150, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000046950, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000046d50, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000046d50, FUTEX_WAKE_PRIVATE, 1) = 1
ioctl(2, TCGETS, {B38400 opost isig icanon echo ...}) = 0
write(2, "\33[36mINFO\33[0m[0000] Starting Ser"..., 95INFO[0000] Starting Server: :8080 source="main.go:59"
) = 95
write(2, "\33[36mINFO\33[0m[0000] GPUs Account"..., 95INFO[0000] GPUs Accounting: false source="main.go:60"
) = 95
futex(0xc000046d50, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000046d50, FUTEX_WAKE_PRIVATE, 1) = 1
socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_TCP) = 3
close(3) = 0
socket(AF_INET6, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_TCP) = 3
setsockopt(3, SOL_IPV6, IPV6_V6ONLY, [1], 4) = 0
bind(3, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = 0
socket(AF_INET6, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_TCP) = 4
setsockopt(4, SOL_IPV6, IPV6_V6ONLY, [0], 4) = 0
bind(4, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::ffff:127.0.0.1", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = 0
close(4) = 0
close(3) = 0
socket(AF_INET6, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 3
setsockopt(3, SOL_IPV6, IPV6_V6ONLY, [0], 4) = 0
openat(AT_FDCWD, "/proc/sys/net/core/somaxconn", O_RDONLY|O_CLOEXEC) = 4
epoll_create1(EPOLL_CLOEXEC) = 5
pipe2([6, 7], O_NONBLOCK|O_CLOEXEC) = 0
epoll_ctl(5, EPOLL_CTL_ADD, 6, {EPOLLIN, {u32=11977192, u64=11977192}}) = 0
epoll_ctl(5, EPOLL_CTL_ADD, 4, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=2353060264, u64=140649647102376}}) = 0
fcntl(4, F_GETFL) = 0x8000 (flags O_RDONLY|O_LARGEFILE)
fcntl(4, F_SETFL, O_RDONLY|O_NONBLOCK|O_LARGEFILE) = 0
read(4, "128\n", 65536) = 4
read(4, "", 65532) = 0
epoll_ctl(5, EPOLL_CTL_DEL, 4, 0xc00019fa14) = 0
close(4) = 0
setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(3, {sa_family=AF_INET6, sin6_port=htons(8080), inet_pton(AF_INET6, "::", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = 0
listen(3, 128) = 0
epoll_ctl(5, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=2353060264, u64=140649647102376}}) = 0
getsockname(3, {sa_family=AF_INET6, sin6_port=htons(8080), inet_pton(AF_INET6, "::", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, [112->28]) = 0
accept4(3, 0xc00019fb00, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
epoll_pwait(5, [], 128, 0, NULL, 2) = 0
epoll_pwait(5, [{EPOLLIN, {u32=2353060264, u64=140649647102376}}], 128, -1, NULL, 0) = 1
futex(0xb3dfb8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xb3deb8, FUTEX_WAKE_PRIVATE, 1) = 1
accept4(3, {sa_family=AF_INET6, sin6_port=htons(34412), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, [112->28], SOCK_CLOEXEC|SOCK_NONBLOCK) = 4
epoll_ctl(5, EPOLL_CTL_ADD, 4, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=2353060032, u64=140649647102144}}) = 0
getsockname(4, {sa_family=AF_INET6, sin6_port=htons(8080), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, [112->28]) = 0
setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
setsockopt(4, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
setsockopt(4, SOL_TCP, TCP_KEEPINTVL, [15], 4) = 0
setsockopt(4, SOL_TCP, TCP_KEEPIDLE, [15], 4) = 0
futex(0xc000046d50, FUTEX_WAKE_PRIVATE, 1) = 1
accept4(3, 0xc00019fb00, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
futex(0xb3d1d0, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
nanosleep({tv_sec=0, tv_nsec=3000}, NULL) = 0
futex(0xb3dfb8, FUTEX_WAKE_PRIVATE, 1) = 1
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=26536, si_uid=0} ---
rt_sigreturn({mask=[]}) = 1
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=26536, si_uid=0} ---
rt_sigreturn({mask=[]}) = 1
futex(0xb3deb8, FUTEX_WAKE_PRIVATE, 1) = 1
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=26536, si_uid=0} ---
rt_sigreturn({mask=[]}) = 1
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=26536, si_uid=0} ---
rt_sigreturn({mask=[]}) = 1
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=26536, si_uid=0} ---
rt_sigreturn({mask=[]}) = 1
futex(0xb3dfa8, FUTEX_WAKE_PRIVATE, 1) = 1
newfstatat(AT_FDCWD, "/share/apps/slurm/bin/sinfo", {st_mode=S_IFREG|0755, st_size=502752, ...}, 0) = 0
newfstatat(AT_FDCWD, "/share/apps/slurm/bin/squeue", {st_mode=S_IFREG|0755, st_size=642824, ...}, 0) = 0
pipe2([8, 12], O_CLOEXEC) = 0
epoll_ctl(5, EPOLL_CTL_ADD, 8, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=2353059104, u64=140649647101216}}) = 0
fcntl(8, F_GETFL) = 0 (flags O_RDONLY)
fcntl(8, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
epoll_ctl(5, EPOLL_CTL_ADD, 12, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=2353058408, u64=140649647100520}}) = 0
fcntl(12, F_GETFL) = 0x1 (flags O_WRONLY)
fcntl(12, F_SETFL, O_WRONLY|O_NONBLOCK) = 0
openat(AT_FDCWD, "/dev/null", O_RDONLY|O_CLOEXEC) = 14
epoll_ctl(5, EPOLL_CTL_ADD, 14, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=2353057480, u64=140649647099592}}) = -1 EPERM (Operation not permitted)
openat(AT_FDCWD, "/dev/null", O_WRONLY|O_CLOEXEC) = 17
epoll_ctl(5, EPOLL_CTL_ADD, 17, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=2353056784, u64=140649647098896}}) = -1 EPERM (Operation not permitted)
fcntl(12, F_GETFL) = 0x801 (flags O_WRONLY|O_NONBLOCK)
fcntl(12, F_SETFL, O_WRONLY) = 0
pipe2([10, 11], O_CLOEXEC) = 0
getpid() = 26536
rt_sigprocmask(SIG_SETMASK, NULL, [], 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[], NULL, 8) = 0
clone(child_stack=NULL, flags=CLONE_VM|CLONE_VFORK|SIGCHLD) = 26556
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
close(11) = 0
read(10, "", 8) = 0
close(10) = 0
epoll_ctl(5, EPOLL_CTL_DEL, 12, 0xc000224afc) = 0
close(12) = 0
close(14) = 0
close(17) = 0
read(8, 0xc000252000, 512) = -1 EAGAIN (Resource temporarily unavailable)
futex(0xb3d1d0, FUTEX_WAIT_PRIVATE, 0, NULL2022/06/20 14:46:03 exit status 127
) = ?
+++ exited with 1 +++
Hello, I built slurm-exporter and am running it in the foreground in a terminal per DEVELOPMENT.md. Added the node to my prometheus.yml and restarted Prometheus. RHEL 7.9 node.
./prometheus-slurm-exporter --listen-address="0.0.0.0:9500" -gpus-acct
INFO[0000] Starting Server: 0.0.0.0:9500 source="main.go:59"
INFO[0000] GPUs Accounting: true source="main.go:60"
However, in a 2nd terminal window on my node:
curl http://localhost:9500/metrics
curl: (7) Failed connect to localhost:9500; Connection refused
And ps shows:
ps -ef | grep prometheus-slurm-exporter
gbeyer3 238618 225323 0 15:26 pts/1 00:00:00 grep --color=auto prometheus-slurm-exporter
So it seems not to be running, though it shows as running in the foreground in the 1st terminal.
Can someone suggest a solution?
Thanks
First of all, great project! This works great.
Found one easily fixable issue. If the node names are too long, the memoryAlloc field value merges into the node name value:
INFO[0000] Starting Server: 0.0.0.0:9090 source="main.go:59"
INFO[0000] GPUs Accounting: false source="main.go:60"
INFO[0049] sinfo fields: [production-slurm-compute-10 8192 0/4/0/4 idle] source="node.go:55"
panic: runtime error: index out of range [4] with length 4
goroutine 39 [running]:
main.ParseNodeMetrics({0xc0001b9000, 0x594, 0xc0000adde0?})
/home/jyost/development/git/prometheus-slurm-exporter/node.go:58 +0x5a5
main.NodeGetMetrics()
/home/jyost/development/git/prometheus-slurm-exporter/node.go:41 +0x1e
main.(*NodeCollector).Collect(0xc00007f740, 0xc0000adf60?)
/home/jyost/development/git/prometheus-slurm-exporter/node.go:131 +0x3e
github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
/home/jyost/development/git/prometheus-slurm-exporter/go/modules/pkg/mod/github.com/prometheus/[email protected]/prometheus/registry.go:443 +0xfb
created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
/home/jyost/development/git/prometheus-slurm-exporter/go/modules/pkg/mod/github.com/prometheus/[email protected]/prometheus/registry.go:535 +0xb0b
The node name is production-slurm-compute-1 and the 0 got appended to the end of the node name. I fixed this in my fork and will submit a PR soon.
PATCH:
cmd := exec.Command("/usr/bin/squeue", "-h", "-o %A,%T,%r", "--states=all")
ISSUE:
Does not track COMPLETED jobs.
man squeue (slurm >= 17.11.12)
...
-t <state_list>, --states=<state_list>
Specify the states of jobs to view. Accepts a comma separated list of state names or "all". If
"all" is specified then jobs of all states will be reported. If no state is specified then pending,
running, and completing jobs are reported. See the JOB STATE CODES section below for a list
of valid states. Both extended and compact forms are valid. Note the <state_list> supplied is
case insensitive ("pd" and "PD" are equivalent).
...
The number of pending jobs isn't quite correct, as job 12345_[1-10] is 10 pending jobs, but only shows up as one. Using squeue -r would list each array task on its own line. Thoughts?
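If that reads right, the patched command above could simply grow an -r flag (untested sketch):
cmd := exec.Command("/usr/bin/squeue", "-h", "-r", "-o %A,%T,%r", "--states=all")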
Hi, Thank you for building this cool project.
I have to apologize in advance. I don't know the G in Go. I'd just like to use this project without learning Go.
I've had a minor hiccup with the installation process as documented by DEVELOPMENT.md.
go test -v *.go
=== RUN TestCPUsMetrics
cpus_test.go:31: &{alloc:5725 idle:877 other:34 total:6636}
--- PASS: TestCPUsMetrics (0.00s)
=== RUN TestCPUssGetMetrics
cpus_test.go:35: &{alloc:76 idle:84 other:40 total:200}
--- PASS: TestCPUssGetMetrics (0.01s)
=== RUN TestNodesMetrics
nodes_test.go:31: &{alloc:0 comp:0 down:0 drain:0 err:0 fail:0 idle:0 maint:0 mix:0 resv:0}
--- PASS: TestNodesMetrics (0.03s)
=== RUN TestNodesGetMetrics
nodes_test.go:35: &{alloc:0 comp:0 down:1 drain:0 err:0 fail:0 idle:1 maint:0 mix:3 resv:0}
--- PASS: TestNodesGetMetrics (0.01s)
=== RUN TestNodeMetrics
node_test.go:48: map[a048:0xc000258300 a049:0xc0002583c0 a050:0xc000258440 a051:0xc0002584c0 a052:0xc000258540 b001:0xc000258640 b002:0xc000258e00 b003:0xc000258f00]
--- PASS: TestNodeMetrics (0.00s)
=== RUN TestParseQueueMetrics
queue_test.go:31: &{pending:4 pending_dep:0 running:28 suspended:1 cancelled:1 completing:2 completed:1 configuring:1 failed:1 timeout:1 preempted:1 node_fail:1}
--- PASS: TestParseQueueMetrics (0.00s)
=== RUN TestQueueGetMetrics
queue_test.go:35: &{pending:69 pending_dep:0 running:12 suspended:0 cancelled:0 completing:0 completed:0 configuring:0 failed:0 timeout:0 preempted:0 node_fail:0}
--- PASS: TestQueueGetMetrics (0.01s)
=== RUN TestSchedulerMetrics
scheduler_test.go:31: &{threads:3 queue_size:0 dbd_queue_size:0 last_cycle:97209 mean_cycle:74593 cycle_per_minute:63 backfill_last_cycle:1.94289e+06 backfill_mean_cycle:1.96082e+06 backfill_depth_mean:29324 total_backfilled_jobs_since_start:111544 total_backfilled_jobs_since_cycle:793 total_backfilled_heterogeneous:10}
--- PASS: TestSchedulerMetrics (0.01s)
=== RUN TestSchedulerGetMetrics
scheduler_test.go:35: &{threads:3 queue_size:0 dbd_queue_size:0 last_cycle:63 mean_cycle:625 cycle_per_minute:1 backfill_last_cycle:0 backfill_mean_cycle:625 backfill_depth_mean:0 total_backfilled_jobs_since_start:6249 total_backfilled_jobs_since_cycle:0 total_backfilled_heterogeneous:0}
--- PASS: TestSchedulerGetMetrics (0.03s)
PASS
ok command-line-arguments 0.125s
$ go build -o bin/prometheus-slurm-exporter {main,accounts,cpus,gpus,partitions,nodes,queue,scheduler,sshare,users}.go
# command-line-arguments
./main.go:31:26: undefined: NewNodeCollector
$go version
go version go1.15 linux/amd64
Does anyone have a clue on how I can fix this?
Thanks in advance.
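Not a definitive answer, but one likely cause, judging by the stack traces elsewhere in this tracker: the brace list names nodes.go but not node.go, and node.go is where the node collector lives in recent checkouts. Building the package as a whole avoids maintaining the file list:
go build -o bin/prometheus-slurm-exporter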
Hi,
Thank you for your tool! It is super useful.
Is it possible to obtain the 'jobid' and the 'nodelist'? I used the Slurm dashboard on Grafana but I do not see this data.
Could you help me on that?
If node names are over 20 characters long, the output of sinfo -h -N -O "NodeList,AllocMem,Memory,CPUsState,StateLong" (used at node.go:85) looks like this:
cpu-always-on-st-t30 1 0/2/0/2 idle
cpu-spot-dy-c52xlar0 1 0/8/0/8 idle~
cpu-spot-dy-c52xlar0 1 0/8/0/8 idle~
You can see that node name and memory are not separated by whitespace.
This results in a crash with the following output:
prometheus-slurm-exporter[5783]: panic: runtime error: index out of range [4] with length 4
prometheus-slurm-exporter[5783]: goroutine 9 [running]:
prometheus-slurm-exporter[5783]: main.ParseNodeMetrics(0xc00016e000, 0x5eb, 0xe00, 0xc0000b10d8)
prometheus-slurm-exporter[5783]: #011/home/ubuntu/aws-parallelcluster-monitoring/prometheus-slurm-exporter/node.go:56 +0x6cf
prometheus-slurm-exporter[5783]: main.NodeGetMetrics(0x8b7f20)
prometheus-slurm-exporter[5783]: #011/home/ubuntu/aws-parallelcluster-monitoring/prometheus-slurm-exporter/node.go:40 +0x2a
prometheus-slurm-exporter[5783]: main.(*NodeCollector).Collect(0xc00007a000, 0xc0000b1080)
prometheus-slurm-exporter[5783]: #011/home/ubuntu/aws-parallelcluster-monitoring/prometheus-slurm-exporter/node.go:128 +0x37
prometheus-slurm-exporter[5783]: github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
prometheus-slurm-exporter[5783]: #011/home/ubuntu/aws-parallelcluster-monitoring/prometheus-slurm-exporter/go/modules/pkg/mod/github.com/prometheus/[email protected]/prometheus/registry.go:443 +0x1a2
prometheus-slurm-exporter[5783]: created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
prometheus-slurm-exporter[5783]: #011/home/ubuntu/aws-parallelcluster-monitoring/prometheus-slurm-exporter/go/modules/pkg/mod/github.com/prometheus/[email protected]/prometheus/registry.go:535 +0xe8e
systemd[1]: slurm_exporter.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
systemd[1]: slurm_exporter.service: Failed with result 'exit-code'.
It expects 5 fields separated by whitespace, but finds only 4, which results in out-of-bounds array access and a panic.
A possible fix is to change sinfo -h -N -O "NodeList,AllocMem,Memory,CPUsState,StateLong" to sinfo -h -N -O "NodeList: ,AllocMem: ,Memory: ,CPUsState: ,StateLong: ", explicitly telling Slurm to append a space after each value.
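An alternative with the same effect would be to give each field an explicit width larger than any node name, since the -O field types accept a :size suffix (a width of 50 is chosen arbitrarily here):
sinfo -h -N -O "NodeList:50,AllocMem:20,Memory:20,CPUsState:20,StateLong:20"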
We are running Slurm 17.11.2, and the exporter is not reporting reserved nodes correctly because of a different output of the sinfo command. The exporter is using the %T field, which prints the extended form. It should be %t. From the Slurm manual:
%t
State of nodes, compact form
%T
State of nodes, extended form
Currently the squeue metrics parser will only detect Pending jobs, and additionally whether they have a Dependency. Slurm has other states it can put here, such as DependencyNeverSatisfied or JobArrayTaskLimit.
Hi
would it be possible to attach queue info to the jobs? It would be nice to plot the job state graph filtered by queue.
Best
Justin
Why is the default port 8080 when there is an allocated port 9341?
Hi!
I'm willing to contribute a minimal CI/CD workflow for the project using GitHub actions (just building and running tests).
Is this something you'd be interested in having? Let me know and I'll prepare a PR.
Hello, I'm getting the same error as issue #56:
panic: runtime error: index out of range [4] with length 4
goroutine 26 [running]:
main.ParseNodeMetrics(0xc000140000, 0x25e, 0x600, 0x87080d)
/storage/home/hhiveman1/gbeyer3/prometheus-slurm-exporter/node.go:56 +0x6d6
main.NodeGetMetrics(0x1)
/storage/home/hhiveman1/gbeyer3/prometheus-slurm-exporter/node.go:40 +0x2a
main.(*NodeCollector).Collect(0xc000020030, 0xc000072660)
/storage/home/hhiveman1/gbeyer3/prometheus-slurm-exporter/node.go:128 +0x37
github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
/storage/home/hhiveman1/gbeyer3/go/pkg/mod/github.com/prometheus/[email protected]/prometheus/registry.go:443 +0x12b
created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
/storage/home/hhiveman1/gbeyer3/go/pkg/mod/github.com/prometheus/[email protected]/prometheus/registry.go:535 +0xe4d
Answering the questions you asked that poster:
I am using the latest version of the slurm-exporter (ver 0.20), which I cloned directly from your repo and built using the instructions in DEVELOPMENT.md.
slurm version 22.05.0
sinfo -h -N -O "NodeList,AllocMem,Memory,CPUsState,StateLong"
atl1-1-02-018-35 0 515741 0/64/0/64 idle
atl1-1-03-002-35 0 191856 0/24/0/24 idle
compile-coda 0 191881 0/24/0/24 idle
compute-dev-slurm-1-0 3770 0/4/0/4 idle
compute-dev-slurm-1-0 3770 0/4/0/4 idle
compute-dev-slurm-2-0 3770 0/4/0/4 idle
Could we get GPU stats added to the node usage metrics, similar to the CPU stats that have been added in 18.0?
Hi,
I am testing the latest version and the GPU info seems to not be accurate; how can I start debugging?
# TYPE slurm_gpus_alloc gauge
slurm_gpus_alloc 21
# HELP slurm_gpus_idle Idle GPUs
# TYPE slurm_gpus_idle gauge
slurm_gpus_idle -21
# HELP slurm_gpus_total Total GPUs
# TYPE slurm_gpus_total gauge
slurm_gpus_total 0
# HELP slurm_gpus_utilization Total GPU utilization
# TYPE slurm_gpus_utilization gauge
slurm_gpus_utilization +Inf
Cheers.
Hi,
We are counting the number of GPUs that a certain account is using on running jobs with something like this:
gpucount=`( squeue -p gpu -h -A $group -o "%t %b" | grep ^R | cut -f2 -d' ' | sed -e 's/gpu://g' | tr '\n' '+'; echo 0 ) | bc || echo 0`
Is this something you could easily provide to the exporter? If not, maybe I can add it but I would need some guidance.
Cheers.
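Not a commitment to the exporter's design, but a self-contained Go sketch of the same count, assuming %b prints a GRES string such as "gpu:2" or "gpu:tesla:4":

package main

import (
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

// allocatedGPUs sums the GPU counts of running jobs for one account,
// mirroring the shell pipeline above.
func allocatedGPUs(partition, account string) float64 {
	out, err := exec.Command("squeue", "-p", partition, "-h", "-A", account, "-o", "%t %b").Output()
	if err != nil {
		return 0
	}
	var total float64
	for _, line := range strings.Split(string(out), "\n") {
		fields := strings.Fields(line)
		if len(fields) != 2 || fields[0] != "R" {
			continue // count running jobs only, as the grep ^R does
		}
		parts := strings.Split(fields[1], ":")
		if n, err := strconv.ParseFloat(parts[len(parts)-1], 64); err == nil {
			total += n
		}
	}
	return total
}

func main() {
	fmt.Println(allocatedGPUs("gpu", "mygroup"))
}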
Hi,
I have installed prometheus-slurm-exporter as a service - I have Grafana and Prometheus running as services also. I have configured the Prometheus data source correctly in Grafana (tested and working) and added the recommended configuration to prometheus.yml. Unfortunately the graphs in the dashboards report "No data points". Do I need to make any other configurations in order for this to work? Have you seen this type of behavior - any hints would be very much welcomed.
Thank you
Hi all,
I use Docker on my dev stack, and I think it would be very useful to have an official Docker image of the Slurm exporter.
Regards,
Hi
slurm-exporter can't scrape job status from Slurm 22.05.5
slurm_queue_cancelled 0
slurm_queue_completed 0
slurm_queue_completing 0
slurm_queue_configuring 0
slurm_queue_failed 0
slurm_queue_node_fail 0
slurm_queue_pending 0
slurm_queue_pending_dependency 0
slurm_queue_preempted 0
slurm_queue_running 0
slurm_queue_suspended 0
Any ideas?
Hey,
We're using this to monitor a small Slurm cluster, and it's very useful, thanks! We're facing an issue, however, after recently upgrading to 0.17.
In ParseAllocatedGPUs(), sacct is executed to get some data. We don't use Slurm accounting, so the subprocess exits with code 1 to show failure. Execute() receives the non-zero code and considers this fatal, killing the entire exporter.
I'm happy to attempt a fix myself, but do you have any suggestions for a good logic flow in this case?
Perhaps something like an optional argument to Execute() that designates "allowable" exit codes, meaning blank data is returned and execution continues (a sketch of that idea follows).
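A self-contained sketch of that idea; this is not the exporter's actual Execute() signature:

package main

import (
	"log"
	"os/exec"
)

// ExecuteAllowing runs a command; exit codes listed in allowed yield empty
// output instead of killing the process, everything else stays fatal.
func ExecuteAllowing(command string, args []string, allowed ...int) []byte {
	out, err := exec.Command(command, args...).Output()
	if err != nil {
		if ee, ok := err.(*exec.ExitError); ok {
			for _, code := range allowed {
				if ee.ExitCode() == code {
					return nil // tolerated failure: behave as "no data"
				}
			}
		}
		log.Fatal(err)
	}
	return out
}

func main() {
	// e.g. tolerate sacct exiting with 1 on clusters without accounting
	_ = ExecuteAllowing("sacct", []string{"-n", "-X"}, 1)
}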
Hi,
We have a nested account arrangement, and those accounts aren't properly being reported on.
I dug into the code, and the command is:
$ sshare -n -P -o account,fairshare
root|0.500000
top_1|0.999998
nested_1_1|0.999998
nested_1_2|1.000000
nested_1_2_1|1.000000
top_2|0.481723
nested_2_1|0.858038
nested_2_2|0.961831
However when I get the metrics, I only get root, top_1 and top_2.
'root' isn't useful. top accounts are useful as an aggregate, but I'd also like to see the nested accounts.
Ideally, we would have "slurm_account_fairshare" as it is, and also offer "slurm_subaccount_fairshare" so that I could graph both.
Looks like ParseFairShareMetrics() is the culprit, throwing away anything that starts with more than one space.
if !strings.HasPrefix(line, " ") {
I can see the argument for doing it, hence my proposal to gather two sets of metrics.
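A rough sketch of what gathering both sets could look like, assuming sshare indents one extra leading space per nesting level (which is what the HasPrefix check keys on); names and layout here are hypothetical:

package sketch

import (
	"strconv"
	"strings"
)

// SplitFairShare splits "sshare -n -P -o account,fairshare" output into
// top-level and nested accounts by counting leading spaces. The two maps
// could feed slurm_account_fairshare and slurm_subaccount_fairshare.
func SplitFairShare(out string) (accounts, subaccounts map[string]float64) {
	accounts = map[string]float64{}
	subaccounts = map[string]float64{}
	for _, line := range strings.Split(out, "\n") {
		depth := len(line) - len(strings.TrimLeft(line, " "))
		fields := strings.Split(strings.TrimSpace(line), "|")
		if len(fields) < 2 {
			continue
		}
		fair, err := strconv.ParseFloat(fields[1], 64)
		if err != nil {
			continue
		}
		if depth <= 1 { // root and top-level accounts
			accounts[fields[0]] = fair
		} else { // nested accounts
			subaccounts[fields[0]] = fair
		}
	}
	return accounts, subaccounts
}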
Hi,
Is it possible to collect memory requested per job and chart graphs similar to CPU?
I've followed the DEVELOPMENT.md for installation, but when trying to use the command curl http://localhost:9103/metrics after ./bin/prometheus-slurm-exporter --listen-address="0.0.0.0:9103", there is no output and the curl command waits until I kill the exporter.
The output from the exporter is only:
INFO[0000] Starting Server: 0.0.0.0:9103 source="main.go:48"
Any idea why? I use CentOS 8.2 with Prometheus 2.22.0 and Slurm 20.02.5.
Good day
Ubuntu 16.04 LTS
I've compiled and run the prometheus slurm exporter as per the guide on a VM that is connected to the Slurm cluster, so it is able to get information from sinfo, squeue, etc.
As per the guide for Debian, it says to install jessie-backports; that part doesn't work (unable to locate such a package).
My issue is: after I've compiled and run the bin file, once I open the URL (wget or curl) localhost:8080/metrics it gives an ERROR 404 page.
--2019-06-10 11:44:02-- http://localhost:8080/metrics
Resolving localhost (localhost)... 127.0.0.1
Connecting to localhost (localhost)|127.0.0.1|:8080... connected.
HTTP request sent, awaiting response... No data received.
Retrying.
Eventually the process stops with this:
INFO[0000] Starting Server: :8080 source="main.go:43"
2019/06/10 11:36:00 fork/exec /usr/bin/sdiag: no such file or directory
And after that I cannot connect at all.
Any steps to troubleshoot this? Is there also a way to change the ports?
In order to package your tool, could you update your dependencies? The declared versions lag behind the latest releases:
name | declared | latest
---|---|---
client_golang | 1.2.1 | 1.15.0
prometheus-common | 0.7.0 | 0.42.0
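Assuming the repo already uses Go modules, the bump itself would be (versions taken from the table above):
go get github.com/prometheus/[email protected]
go get github.com/prometheus/[email protected]
go mod tidy
Note that recent versions of prometheus/common no longer ship the log subpackage the exporter imports (see the build errors further down this page), so the upgrade likely needs code changes as well.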
Not sure if it's me doing something wrong or if it's somewhere in the code, but when I put a node in drain state it does not show up at all; it just disappears from the node count altogether.
Please let me know what I need to do to troubleshoot this.
I'm probably missing something really obvious, but following the instructions I hit this on Rocky Linux 8.5:
[root@dev-control slurm-exporter]# go version
go version go1.15.15 linux/amd64
[root@dev-control slurm-exporter]# make
mkdir -p /tmp/slurm-exporter/bin
Build main.go nodes.go queue.go scheduler.go to bin/prometheus-slurm-exporter
main.go:22:3: cannot find package "github.com/prometheus/client_golang/prometheus" in any of:
/usr/local/go/src/github.com/prometheus/client_golang/prometheus (from $GOROOT)
/tmp/slurm-exporter/src/github.com/prometheus/client_golang/prometheus (from $GOPATH)
/usr/share/gocode/src/github.com/prometheus/client_golang/prometheus
main.go:23:3: cannot find package "github.com/prometheus/client_golang/prometheus/promhttp" in any of:
/usr/local/go/src/github.com/prometheus/client_golang/prometheus/promhttp (from $GOROOT)
/tmp/slurm-exporter/src/github.com/prometheus/client_golang/prometheus/promhttp (from $GOPATH)
/usr/share/gocode/src/github.com/prometheus/client_golang/prometheus/promhttp
main.go:21:3: cannot find package "github.com/prometheus/common/log" in any of:
/usr/local/go/src/github.com/prometheus/common/log (from $GOROOT)
/tmp/slurm-exporter/src/github.com/prometheus/common/log (from $GOPATH)
/usr/share/gocode/src/github.com/prometheus/common/log
make: *** [Makefile:11: build] Error 1
I did start off with go 1.16 (as that has a package available on EPEL for RHEL clones), but I hit the above (or similar) and discovered that 1.16 changes Go module handling, so I deleted it and used 1.15 as per the instructions.
Never used Go, so I'm a bit stumped here; any help appreciated.
Our Slurm installation has many nodes, which makes the call to /metrics take >5 minutes. I poked around in the code and found an inefficiency here: https://github.com/vpenso/prometheus-slurm-exporter/blob/master/nodes.go#L60-L123, where you make a query per node to find the aggregate node statistics. Slurm has a summary feature which returns this information instantly.
I will submit a PR for review if you think this improvement should be upstreamed.
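The summary feature referred to is presumably sinfo's --summarize flag, which returns the allocated/idle/other/total node counts per partition in a single call:
sinfo -s -h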
Hi
While using your exporter and the SLURM Grafana dashboard, I noticed that these metrics are not exposed:
"expr": "slurm_account_cpus_running"
"expr": "slurm_account_jobs_pending"
"expr": "slurm_account_jobs_running"
"expr": "slurm_partition_cpus_allocated"
"expr": "slurm_partition_jobs_pending"
"expr": "slurm_user_cpus_running"
"expr": "slurm_user_jobs_pending"
"expr": "slurm_user_jobs_running"
I guess I am missing something in enabling some metrics when starting the exporter, but cannot find which ones...
Could you help please?
In fact it seems that the metrics are exported only if their value is >0... but why is that? 0 is data, but a missing metric shows as N/A in the dashboard, making it look as if the dashboard has an issue!
Thanks!
I've tried the CentOS build instructions on Scientific Linux 7.8 and failed:
$ make test
$GOPATH/go.mod exists but should not
make: *** [test] Error 1
This is with golang 1.13.6 from the EPEL repository.
Hi,
I get the following error when I try to build:
./main.go:34:26: undefined: NewPartitionsCollector
Can you please let me know how to resolve this issue?
I've been testing the prometheus exporter, and overnight it crashed with this message:
panic: runtime error: index out of range
goroutine 38867 [running]:
panic(0x8ccec0, 0xc82000a0f0)
/usr/lib/go-1.6/src/runtime/panic.go:481 +0x3e6
main.SplitColonValueToFloat(0xc8202613c1, 0x11, 0x0)
/home/test/build/src/github.com/vpenso/prometheus-slurm-exporter/scheduler.go:59 +0xc5
main.ParseSchedulerMetrics(0xc8201d6000, 0x1015, 0x1e00, 0xc8202bec00)
/home/test/build/src/github.com/vpenso/prometheus-slurm-exporter/scheduler.go:72 +0x1c3
main.SchedulerGetMetrics(0x4307fd)
/home/test/build/src/github.com/vpenso/prometheus-slurm-exporter/scheduler.go:81 +0x48
main.(*SchedulerCollector).Collect(0xc82005b000, 0xc8201ff380)
/home/test/build/src/github.com/vpenso/prometheus-slurm-exporter/scheduler.go:117 +0x1c
github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func2(0xc8200fc1f0, 0xc8201ff380, 0x7f3fa171eae0, 0xc82005b000)
/home/test/build/src/github.com/vpenso/prometheus-slurm-exporter/src/github.com/prometheus/client_golang/prometheus/registry.go:433 +0x58
created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
/home/test/build/src/github.com/vpenso/prometheus-slurm-exporter/src/github.com/prometheus/client_golang/prometheus/registry.go:434 +0x360
I am running slurm 16.05.10 on Enterprise Linux 7.3, but the exporter was built using Go 1.6 on Ubuntu.
Good day
I previously used this exporter on ubuntu 16 and an older version of slurm. It worked correctly.
However I'm running Ubuntu 18 LTS with Slurm 20 and get a "404" error when I run the exporter.
Jul 22 10:43:05 slurm-login systemd[1]: Started slurm exporter for prometheus.
Jul 22 10:43:05 slurm-login prometheus-slurm-exporter[4706]: time="2021-07-22T10:43:05+02:00" level=info msg="Starting Server: :9341" source="main.go:59"
Jul 22 10:43:05 slurm-login prometheus-slurm-exporter[4706]: time="2021-07-22T10:43:05+02:00" level=info msg="GPUs Accounting: true" source="main.go:60"
root@slurm-login:/opt/prometheus-slurm-exporter-0.19# curl localhost:9341
404 page not found
I'm able to run items such as squeue, sinfo, etc. from anywhere on the box and they work correctly.
Any ideas?
I see that the last commit to main was in March of 2022. I also see a lot of outstanding PRs. Does this mean the repo is not maintained anymore? Is there a dependable fork to rely on?
Hello, I attempted building the exporter as per DEVELOPMENT.md, which threw a bunch of errors. Please see the attached file of console errors. Any suggestions as to what has gone wrong and how to resolve it?
slurm_exporter_errors.txt
RHEL 7.9
go version go1.15.14 linux/amd64
Hi
Can somebody please merge #37, as this would be a great feature to add!
Hi!
I am seeing the following problem
[root@ip-10-3-5-236 prometheus-slurm-exporter]# curl http://localhost:8080/metrics
panic: runtime error: index out of range [4] with length 4
goroutine 66 [running]:
main.ParseNodeMetrics(0xc0006a0000, 0x328ca, 0x3fe00, 0x0)
/root/prometheus-slurm-exporter/node.go:56 +0x6cf
main.NodeGetMetrics(0x0)
/root/prometheus-slurm-exporter/node.go:40 +0x2a
main.(*NodeCollector).Collect(0xc0001986c0, 0xc0000a8540)
/root/prometheus-slurm-exporter/node.go:128 +0x37
github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
/root/prometheus-slurm-exporter/go/modules/pkg/mod/github.com/prometheus/[email protected]/prometheus/registry.go:443 +0x1a2
created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
/root/prometheus-slurm-exporter/go/modules/pkg/mod/github.com/prometheus/[email protected]/prometheus/registry.go:535 +0xe8e
curl: (52) Empty reply from server
[1]+ Exit 2 /usr/bin/prometheus-slurm-exporter
[root@ip-10-3-5-236 prometheus-slurm-exporter]#
Any suggestions?
How difficult would it be to get user and account info into the exporter? Use cases would be things like pie charts of jobs per account and/or user for the cluster.
The latest prometheus-slurm-exporter runs for a few seconds before terminating with a fatal error:
prometheus-slurm-exporter/bin/prometheus-slurm-exporter
INFO[0000] Starting Server: :8080 source="main.go:48"
FATA[0004] exit status 1 source="gpus.go:101"
I'm running slurm-20.11.3-1, and a rebuild picked up the new gpus.go module. Digging into it a bit, it appears the AllocGRES option to sacct is treated as fatal, which causes the Execute() routine to terminate:
sh-4.4$ sacct -a -X --format=Allocgres --state=RUNNING --noheader --parsable2
sacct: fatal: AllocGRES is deprecated, please use AllocTRES
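The message suggests its own fix; the equivalent query would be the following, although the exporter would need to parse the result differently, since AllocTRES lists all trackable resources rather than just GRES:
sacct -a -X --format=AllocTRES --state=RUNNING --noheader --parsable2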
This exporter is fantastic, and we're hoping to get a bit more out of it. I've been looking at the code for node status, and I'd really like to track our drain reasons. I think this would help us spot trends.
Where you are gathering the sinfo output (https://github.com/vpenso/prometheus-slurm-exporter/blob/master/nodes.go#L113), could you add in %E and grab the reason? What would need to accompany that change to print it out properly?
Thanks
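For what it's worth, the reason string is exposed by sinfo as %E in -o format (or the Reason field in -O format), so the query might become something like this untested sketch:
sinfo -h -N -o "%N %t %E"
On the exporter side the reason would probably fit best as a label on a per-node gauge, since it is free text.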
Hi
It would be great if the backfill stats from sdiag were exported as well.
Best
Justin
When I try to run curl http://localhost:8080/metrics on the latest build of the exporter, I see the following error message. Is there a fix for this?
panic: runtime error: index out of range [4] with length 4
goroutine 12 [running]:
main.ParseNodeMetrics(0xc0003c6000, 0x1f9, 0x600, 0x0)
/opt/prometheus-slurm-exporter/node.go:56 +0x6d6
main.NodeGetMetrics(0x0)
/opt/prometheus-slurm-exporter/node.go:40 +0x2a
main.(*NodeCollector).Collect(0xc0000ab710, 0xc0001a2660)
/opt/prometheus-slurm-exporter/node.go:128 +0x37
github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
/root/go/pkg/mod/github.com/prometheus/[email protected]/prometheus/registry.go:443 +0x12b
created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
/root/go/pkg/mod/github.com/prometheus/[email protected]/prometheus/registry.go:454 +0x5ce
Can someone please point me to where exactly the exporter must be installed? Login node? Controller node? Worker node? Or on all worker nodes?
Regards
Deric
It would be great if a count of GPUs (and other GRES) could be provided in the metrics :)