
prometheus-slurm-exporter's People

Contributors

ana, bedroge, cread, erimar77, jamesbeedy, jbd, jiyub, joerihermans, lahwaacz, mtds, mtpdt, vpenso

prometheus-slurm-exporter's Issues

panic: runtime error: index out of range [4] with length 4 when running slurm-exporter (HEAD)

sinfo -h -N -O "NodeList,AllocMem,Memory,CPUsState,StateLong"
compute-1 0 7772 0/2/0/2 idle
compute-1 0 7772 0/2/0/2 idle
computegpu-1 0 31356 0/8/0/8 idle
computegpu-1 0 31356 0/8/0/8 idle
computemgpu-v0 515694 0/128/0/128 idle

prometheus-slurm-exporter --listen-address=myip:port
INFO[0000] Starting Server: ip:port source="main.go:59"
INFO[0000] GPUs Accounting: false source="main.go:60"
panic: runtime error: index out of range [4] with length 4

goroutine 26 [running]:
main.ParseNodeMetrics(0xc000030600, 0x25e, 0x600, 0x1)
/opt/slurm_exporter/node.go:56 +0x6d6
main.NodeGetMetrics(0x6)
/opt/slurm_exporter/node.go:40 +0x2a
main.(*NodeCollector).Collect(0xc0000ac000, 0xc000070d80)
/opt/slurm_exporter/node.go:128 +0x37
github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
/opt/slurm_exporter/go/modules/pkg/mod/github.com/prometheus/[email protected]/prometheus/registry.go:443 +0x12b
created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
/opt/slurm_exporter/go/modules/pkg/mod/github.com/prometheus/[email protected]/prometheus/registry.go:535 +0xe4d

Queue and nodes status as labels

I'm thinking it would be better (and more in line with Prometheus metrics/labels conventions) to have job status and node states as labels instead of separate metrics, since we are measuring the same thing.

From the Prometheus page:

Use labels to differentiate the characteristics of the thing that is being measured:

api_http_requests_total - differentiate request types: type="create|update|delete"
api_request_duration_seconds - differentiate request stages: stage="extract|transform|load"

It would make it easier to show totals as well (right now we can't easily show totals, because we don't know all the metric names in advance - in my case, failed/error metrics are not present, because none of my nodes are in that state yet).

Thinking about it, it would be good to set default values for metrics to 0 in case the metric doesn't have a value / doesn't exist. From the Prometheus page:

Avoid missing metrics

Time series that are not present until something happens are difficult to deal with, as the usual simple operations are no longer sufficient to correctly handle them. To avoid this, export 0 (or NaN, if 0 would be misleading) for any time series you know may exist in advance.

Most Prometheus client libraries (including Go, Java, and Python) will automatically export a 0 for you for metrics with no labels.

It would also be good to have an 'up' metric, something like slurm_up, with a value of 0 if scraping any of the Slurm commands fails (see the Prometheus documentation). In that case, one can set an alert: if slurm_up == 0; alert('Slurm is not responding').
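
A minimal Go sketch of both ideas with client_golang (illustrative names, not the exporter's actual code):

package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// One gauge with a "state" label instead of one metric per job state.
	queueJobs = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{Name: "slurm_queue_jobs", Help: "Jobs in the queue by state"},
		[]string{"state"},
	)
	// slurm_up: 1 if scraping the Slurm commands succeeded, 0 otherwise.
	slurmUp = prometheus.NewGauge(
		prometheus.GaugeOpts{Name: "slurm_up", Help: "Whether the last scrape of Slurm commands succeeded"},
	)
)

func main() {
	prometheus.MustRegister(queueJobs, slurmUp)
	queueJobs.WithLabelValues("running").Set(0) // series exists even before any job runs
	slurmUp.Set(1)                              // set to 0 when squeue/sinfo/sdiag fail
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}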

Nothing critical, I just thought I would let you know.

Thanks for the great exporter!

Configuration for the SLURM exporter

Hello, I have followed your steps to build the CentOS executable and that went fine. As a new user of Prometheus, I'm unsure from your instructions what to do to have a working configuration.

I have created a file prometheus.yml in my prometheus-slurm-exporter directory. Is there anything else that needs to be in that file other than what you have in your readme?
Maybe you could post an example of a basic working config.
This server is also the slurm master node of a test cluster I run.

Mine currently looks like this:

#
# SLURM resource manager:
#
  - job_name: 'my_slurm_exporter'

    scrape_interval:  30s
    scrape_timeout:   30s
    static_configs:
    - targets: ['localhost:8080']

$~ promtool check config prometheus.yml
Checking prometheus.yml
  FAILED: parsing YAML file prometheus.yml: yaml: unmarshal errors:
  line 4: cannot unmarshal !!seq into config.plain
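
For reference, the unmarshal error at line 4 is what promtool reports when the scrape job sits at the top level of the file: in a Prometheus server configuration the job list must be nested under a scrape_configs key. A minimal complete file reusing the job above might look like this (a sketch, not an official example; note that prometheus.yml configures the Prometheus server, not the exporter):

global:
  scrape_interval: 30s

scrape_configs:
  #
  # SLURM resource manager:
  #
  - job_name: 'my_slurm_exporter'
    scrape_interval: 30s
    scrape_timeout: 30s
    static_configs:
      - targets: ['localhost:8080']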

When I run the executable

$~ prometheus-slurm-exporter
INFO[0000] Starting Server: :8080                        source="main.go:42"

When I access http://hostname:8080/graphs in the browser I get 404 page not found
If I access http://hostname:8080/metrics it is updating with the correct scheduler info.
In my Grafana server datasource settings, when I add a new datasource I get an HTTP error.

Many thanks,
Brendan

Crashes during HTTP request

CentOS 8.1.1911
slurm 20.11.7

When trying to retrieve the metrics url:

 $ wget http://localhost:8080/metrics
--2022-06-20 14:27:18--  http://localhost:8080/metrics
Resolving localhost (localhost)... ::1, 127.0.0.1
Connecting to localhost (localhost)|::1|:8080... connected.
HTTP request sent, awaiting response... No data received.
Retrying.

--2022-06-20 14:27:19--  (try: 2)  http://localhost:8080/metrics
Connecting to localhost (localhost)|::1|:8080... failed: Connection refused.
Connecting to localhost (localhost)|127.0.0.1|:8080... failed: Connection refused.

Server shows:

./prometheus-slurm-exporter -gpus-acct
INFO[0000] Starting Server: :8080                        source="main.go:59"
INFO[0000] GPUs Accounting: true                         source="main.go:60"
2022/06/20 14:26:58 exit status 127

Running strace on slurm exporter:

$ strace prometheus-slurm-exporter
execve("/usr/bin/prometheus-slurm-exporter", ["prometheus-slurm-exporter"], 0x7ffe15f9f370 /* 40 vars */) = 0
brk(NULL)                               = 0x17fd000
arch_prctl(0x3001 /* ARCH_??? */, 0x7ffe7c3257a0) = -1 EINVAL (Invalid argument)
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=53799, ...}) = 0
mmap(NULL, 53799, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7febb614e000
close(3)                                = 0
openat(AT_FDCWD, "/lib64/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0000o\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=754552, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7febb614c000
mmap(NULL, 2225344, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7febb5d14000
mprotect(0x7febb5d2f000, 2093056, PROT_NONE) = 0
mmap(0x7febb5f2e000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1a000) = 0x7febb5f2e000
mmap(0x7febb5f30000, 13504, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7febb5f30000
close(3)                                = 0
openat(AT_FDCWD, "/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\2009\2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=5993088, ...}) = 0
mmap(NULL, 3942432, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7febb5951000
mprotect(0x7febb5b0a000, 2097152, PROT_NONE) = 0
mmap(0x7febb5d0a000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1b9000) = 0x7febb5d0a000
mmap(0x7febb5d10000, 14368, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7febb5d10000
close(3)                                = 0
openat(AT_FDCWD, "/lib64/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\220\20\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=52328, ...}) = 0
mmap(NULL, 2109744, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7febb574d000
mprotect(0x7febb5750000, 2093056, PROT_NONE) = 0
mmap(0x7febb594f000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2000) = 0x7febb594f000
close(3)                                = 0
mmap(NULL, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7febb6149000
arch_prctl(ARCH_SET_FS, 0x7febb6149740) = 0
mprotect(0x7febb5d0a000, 16384, PROT_READ) = 0
mprotect(0x7febb594f000, 4096, PROT_READ) = 0
mprotect(0x7febb5f2e000, 4096, PROT_READ) = 0
mprotect(0x7febb615c000, 4096, PROT_READ) = 0
munmap(0x7febb614e000, 53799)           = 0
set_tid_address(0x7febb6149a10)         = 26536
set_robust_list(0x7febb6149a20, 24)     = 0
rt_sigaction(SIGRTMIN, {sa_handler=0x7febb5d1a9a0, sa_mask=[], sa_flags=SA_RESTORER|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_1, {sa_handler=0x7febb5d1aa30, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
brk(NULL)                               = 0x17fd000
brk(0x181e000)                          = 0x181e000
sched_getaffinity(0, 8192, [0, 1, 2, 3]) = 16
openat(AT_FDCWD, "/sys/kernel/mm/transparent_hugepage/hpage_pmd_size", O_RDONLY) = 3
read(3, "2097152\n", 20)                = 8
close(3)                                = 0
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7febb6109000
mmap(NULL, 131072, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7febb60e9000
mmap(NULL, 1048576, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7febb5fe9000
mmap(NULL, 8388608, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7febb4f4d000
mmap(NULL, 67108864, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7febb0f4d000
mmap(NULL, 536870912, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7feb90f4d000
mmap(0xc000000000, 67108864, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xc000000000
mmap(NULL, 33554432, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7feb8ef4d000
mmap(NULL, 2165776, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7feb8ed3c000
mmap(0xc000000000, 4194304, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xc000000000
mmap(0x7febb60e9000, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7febb60e9000
mmap(0x7febb6069000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7febb6069000
mmap(0x7febb5353000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7febb5353000
mmap(0x7febb2f7d000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7febb2f7d000
mmap(0x7feba10cd000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7feba10cd000
mmap(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7feb8ec3c000
mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7febb5fd9000
mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7febb5fc9000
rt_sigprocmask(SIG_SETMASK, NULL, [], 8) = 0
sigaltstack(NULL, {ss_sp=NULL, ss_flags=SS_DISABLE, ss_size=0}) = 0
sigaltstack({ss_sp=0xc000002000, ss_flags=0, ss_size=32768}, NULL) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
gettid()                                = 26536
rt_sigaction(SIGHUP, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGHUP, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGINT, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGINT, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGQUIT, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGQUIT, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGILL, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGILL, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGTRAP, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGTRAP, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGABRT, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGABRT, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGBUS, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGBUS, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGFPE, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGFPE, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGUSR1, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGUSR1, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGSEGV, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGSEGV, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGUSR2, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGUSR2, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGPIPE, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGPIPE, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGALRM, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGALRM, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGTERM, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGTERM, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGSTKFLT, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGSTKFLT, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGCHLD, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGCHLD, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGURG, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGURG, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGXCPU, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGXCPU, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGXFSZ, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGXFSZ, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGVTALRM, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGVTALRM, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGPROF, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGPROF, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGWINCH, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGWINCH, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGIO, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGIO, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGPWR, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGPWR, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGSYS, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGSYS, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRTMIN, NULL, {sa_handler=0x7febb5d1a9a0, sa_mask=[], sa_flags=SA_RESTORER|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, 8) = 0
rt_sigaction(SIGRTMIN, NULL, {sa_handler=0x7febb5d1a9a0, sa_mask=[], sa_flags=SA_RESTORER|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, 8) = 0
rt_sigaction(SIGRTMIN, {sa_handler=0x7febb5d1a9a0, sa_mask=[], sa_flags=SA_RESTORER|SA_ONSTACK|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_1, NULL, {sa_handler=0x7febb5d1aa30, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, 8) = 0
rt_sigaction(SIGRT_1, NULL, {sa_handler=0x7febb5d1aa30, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, 8) = 0
rt_sigaction(SIGRT_1, {sa_handler=0x7febb5d1aa30, sa_mask=[], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_2, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_3, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_3, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_4, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_4, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_5, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_5, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_6, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_6, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_7, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_7, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_8, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_8, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_9, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_9, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_10, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_10, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_11, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_11, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_12, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_12, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_13, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_13, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_14, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_14, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_15, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_15, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_16, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_16, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_17, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_17, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_18, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_18, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_19, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_19, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_20, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_20, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_21, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_21, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_22, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_22, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_23, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_23, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_24, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_24, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_25, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_25, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_26, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_26, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_27, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_27, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_28, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_28, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_29, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_29, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_30, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_30, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_31, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_31, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigaction(SIGRT_32, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGRT_32, {sa_handler=0x46c2e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7febb5d26dc0}, NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
mmap(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7feb8e43b000
mprotect(0x7feb8e43c000, 8388608, PROT_READ|PROT_WRITE) = 0
clone(child_stack=0x7feb8ec3afb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7feb8ec3b9d0, tls=0x7feb8ec3b700, child_tidptr=0x7feb8ec3b9d0) = 26537
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=26536, si_uid=0} ---
rt_sigreturn({mask=[]})                 = 11422176
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=26536, si_uid=0} ---
rt_sigreturn({mask=[]})                 = 11422176
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=26536, si_uid=0} ---
rt_sigreturn({mask=[]})                 = 11422176
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=26536, si_uid=0} ---
rt_sigreturn({mask=[]})                 = 11422176
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=26536, si_uid=0} ---
rt_sigreturn({mask=[]})                 = 11422176
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=26536, si_uid=0} ---
rt_sigreturn({mask=[]})                 = 11422176
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=26536, si_uid=0} ---
rt_sigreturn({mask=[]})                 = 712
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
mmap(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7feb8dc3a000
mprotect(0x7feb8dc3b000, 8388608, PROT_READ|PROT_WRITE) = 0
clone(child_stack=0x7feb8e439fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7feb8e43a9d0, tls=0x7feb8e43a700, child_tidptr=0x7feb8e43a9d0) = 26538
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
futex(0xc000046950, FUTEX_WAKE_PRIVATE, 1) = 1
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
mmap(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7feb8cc38000
mprotect(0x7feb8cc39000, 8388608, PROT_READ|PROT_WRITE) = 0
clone(child_stack=0x7feb8d437fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7feb8d4389d0, tls=0x7feb8d438700, child_tidptr=0x7feb8d4389d0) = 26540
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
mmap(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7feb8c437000
mprotect(0x7feb8c438000, 8388608, PROT_READ|PROT_WRITE) = 0
clone(child_stack=0x7feb8cc36fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7feb8cc379d0, tls=0x7feb8cc37700, child_tidptr=0x7feb8cc379d0) = 26541
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
fcntl(0, F_GETFL)                       = 0x402 (flags O_RDWR|O_APPEND)
futex(0xc000046d50, FUTEX_WAKE_PRIVATE, 1) = 1
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7feb8c3f7000
fcntl(1, F_GETFL)                       = 0x402 (flags O_RDWR|O_APPEND)
fcntl(2, F_GETFL)                       = 0x402 (flags O_RDWR|O_APPEND)
futex(0x7febb59500e8, FUTEX_WAKE_PRIVATE, 2147483647) = 0
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=53799, ...}) = 0
mmap(NULL, 53799, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7febb614e000
close(3)                                = 0
openat(AT_FDCWD, "/lib64/libcrypto.so.10", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0\322\6\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=2779880, ...}) = 0
mmap(NULL, 4598856, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7feb77b9d000
mprotect(0x7feb77dd4000, 2097152, PROT_NONE) = 0
mmap(0x7feb77fd4000, 163840, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x237000) = 0x7feb77fd4000
mmap(0x7feb77ffc000, 15432, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7feb77ffc000
close(3)                                = 0
openat(AT_FDCWD, "/lib64/libz.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0'\0\0\0\0\0\0"..., 832) = 832
lseek(3, 88296, SEEK_SET)               = 88296
read(3, "\4\0\0\0\20\0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0", 32) = 32
fstat(3, {st_mode=S_IFREG|0755, st_size=101032, ...}) = 0
lseek(3, 88296, SEEK_SET)               = 88296
read(3, "\4\0\0\0\20\0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0", 32) = 32
mmap(NULL, 2187272, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7feb8c1e0000
mprotect(0x7feb8c1f6000, 2093056, PROT_NONE) = 0
mmap(0x7feb8c3f5000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x15000) = 0x7feb8c3f5000
mmap(0x7feb8c3f6000, 8, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7feb8c3f6000
close(3)                                = 0
mprotect(0x7feb8c3f5000, 4096, PROT_READ) = 0
mprotect(0x7feb77fd4000, 114688, PROT_READ) = 0
openat(AT_FDCWD, "/etc/pki/tls/legacy-settings", O_RDONLY) = -1 ENOENT (No such file or directory)
access("/etc/system-fips", F_OK)        = -1 ENOENT (No such file or directory)
munmap(0x7febb614e000, 53799)           = 0
getpid()                                = 26536
newfstatat(AT_FDCWD, "/proc", {st_mode=S_IFDIR|0555, st_size=0, ...}, 0) = 0
futex(0xc000080150, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000046d50, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000080150, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000046d50, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000046d50, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000046d50, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000046950, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000080150, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000046950, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000080150, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000046950, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000080150, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000046950, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000080150, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000046950, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000080150, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xc000046950, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000080150, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000046950, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000080150, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000046950, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000046950, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000046d50, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000046d50, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000080150, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000046950, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000046d50, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000046d50, FUTEX_WAKE_PRIVATE, 1) = 1
ioctl(2, TCGETS, {B38400 opost isig icanon echo ...}) = 0
write(2, "\33[36mINFO\33[0m[0000] Starting Ser"..., 95INFO[0000] Starting Server: :8080                        source="main.go:59"
) = 95
write(2, "\33[36mINFO\33[0m[0000] GPUs Account"..., 95INFO[0000] GPUs Accounting: false                        source="main.go:60"
) = 95
futex(0xc000046d50, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000046d50, FUTEX_WAKE_PRIVATE, 1) = 1
socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_TCP) = 3
close(3)                                = 0
socket(AF_INET6, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_TCP) = 3
setsockopt(3, SOL_IPV6, IPV6_V6ONLY, [1], 4) = 0
bind(3, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = 0
socket(AF_INET6, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_TCP) = 4
setsockopt(4, SOL_IPV6, IPV6_V6ONLY, [0], 4) = 0
bind(4, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::ffff:127.0.0.1", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = 0
close(4)                                = 0
close(3)                                = 0
socket(AF_INET6, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 3
setsockopt(3, SOL_IPV6, IPV6_V6ONLY, [0], 4) = 0
openat(AT_FDCWD, "/proc/sys/net/core/somaxconn", O_RDONLY|O_CLOEXEC) = 4
epoll_create1(EPOLL_CLOEXEC)            = 5
pipe2([6, 7], O_NONBLOCK|O_CLOEXEC)     = 0
epoll_ctl(5, EPOLL_CTL_ADD, 6, {EPOLLIN, {u32=11977192, u64=11977192}}) = 0
epoll_ctl(5, EPOLL_CTL_ADD, 4, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=2353060264, u64=140649647102376}}) = 0
fcntl(4, F_GETFL)                       = 0x8000 (flags O_RDONLY|O_LARGEFILE)
fcntl(4, F_SETFL, O_RDONLY|O_NONBLOCK|O_LARGEFILE) = 0
read(4, "128\n", 65536)                 = 4
read(4, "", 65532)                      = 0
epoll_ctl(5, EPOLL_CTL_DEL, 4, 0xc00019fa14) = 0
close(4)                                = 0
setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(3, {sa_family=AF_INET6, sin6_port=htons(8080), inet_pton(AF_INET6, "::", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = 0
listen(3, 128)                          = 0
epoll_ctl(5, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=2353060264, u64=140649647102376}}) = 0
getsockname(3, {sa_family=AF_INET6, sin6_port=htons(8080), inet_pton(AF_INET6, "::", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, [112->28]) = 0
accept4(3, 0xc00019fb00, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
epoll_pwait(5, [], 128, 0, NULL, 2)     = 0
epoll_pwait(5, [{EPOLLIN, {u32=2353060264, u64=140649647102376}}], 128, -1, NULL, 0) = 1
futex(0xb3dfb8, FUTEX_WAKE_PRIVATE, 1)  = 1
futex(0xb3deb8, FUTEX_WAKE_PRIVATE, 1)  = 1
accept4(3, {sa_family=AF_INET6, sin6_port=htons(34412), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, [112->28], SOCK_CLOEXEC|SOCK_NONBLOCK) = 4
epoll_ctl(5, EPOLL_CTL_ADD, 4, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=2353060032, u64=140649647102144}}) = 0
getsockname(4, {sa_family=AF_INET6, sin6_port=htons(8080), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, [112->28]) = 0
setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
setsockopt(4, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
setsockopt(4, SOL_TCP, TCP_KEEPINTVL, [15], 4) = 0
setsockopt(4, SOL_TCP, TCP_KEEPIDLE, [15], 4) = 0
futex(0xc000046d50, FUTEX_WAKE_PRIVATE, 1) = 1
accept4(3, 0xc00019fb00, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
futex(0xb3d1d0, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
nanosleep({tv_sec=0, tv_nsec=3000}, NULL) = 0
futex(0xb3dfb8, FUTEX_WAKE_PRIVATE, 1)  = 1
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=26536, si_uid=0} ---
rt_sigreturn({mask=[]})                 = 1
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=26536, si_uid=0} ---
rt_sigreturn({mask=[]})                 = 1
futex(0xb3deb8, FUTEX_WAKE_PRIVATE, 1)  = 1
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=26536, si_uid=0} ---
rt_sigreturn({mask=[]})                 = 1
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=26536, si_uid=0} ---
rt_sigreturn({mask=[]})                 = 1
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=26536, si_uid=0} ---
rt_sigreturn({mask=[]})                 = 1
futex(0xb3dfa8, FUTEX_WAKE_PRIVATE, 1)  = 1
newfstatat(AT_FDCWD, "/share/apps/slurm/bin/sinfo", {st_mode=S_IFREG|0755, st_size=502752, ...}, 0) = 0
newfstatat(AT_FDCWD, "/share/apps/slurm/bin/squeue", {st_mode=S_IFREG|0755, st_size=642824, ...}, 0) = 0
pipe2([8, 12], O_CLOEXEC)               = 0
epoll_ctl(5, EPOLL_CTL_ADD, 8, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=2353059104, u64=140649647101216}}) = 0
fcntl(8, F_GETFL)                       = 0 (flags O_RDONLY)
fcntl(8, F_SETFL, O_RDONLY|O_NONBLOCK)  = 0
epoll_ctl(5, EPOLL_CTL_ADD, 12, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=2353058408, u64=140649647100520}}) = 0
fcntl(12, F_GETFL)                      = 0x1 (flags O_WRONLY)
fcntl(12, F_SETFL, O_WRONLY|O_NONBLOCK) = 0
openat(AT_FDCWD, "/dev/null", O_RDONLY|O_CLOEXEC) = 14
epoll_ctl(5, EPOLL_CTL_ADD, 14, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=2353057480, u64=140649647099592}}) = -1 EPERM (Operation not permitted)
openat(AT_FDCWD, "/dev/null", O_WRONLY|O_CLOEXEC) = 17
epoll_ctl(5, EPOLL_CTL_ADD, 17, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=2353056784, u64=140649647098896}}) = -1 EPERM (Operation not permitted)
fcntl(12, F_GETFL)                      = 0x801 (flags O_WRONLY|O_NONBLOCK)
fcntl(12, F_SETFL, O_WRONLY)            = 0
pipe2([10, 11], O_CLOEXEC)              = 0
getpid()                                = 26536
rt_sigprocmask(SIG_SETMASK, NULL, [], 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[], NULL, 8) = 0
clone(child_stack=NULL, flags=CLONE_VM|CLONE_VFORK|SIGCHLD) = 26556
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
close(11)                               = 0
read(10, "", 8)                         = 0
close(10)                               = 0
epoll_ctl(5, EPOLL_CTL_DEL, 12, 0xc000224afc) = 0
close(12)                               = 0
close(14)                               = 0
close(17)                               = 0
read(8, 0xc000252000, 512)              = -1 EAGAIN (Resource temporarily unavailable)
futex(0xb3d1d0, FUTEX_WAIT_PRIVATE, 0, NULL2022/06/20 14:46:03 exit status 127
) = ?
+++ exited with 1 +++

Getting "Connection Refused"

Hello, I built slurm-exporter and am running it in the foreground in a terminal per DEVELOPMENT.md. Added the node to my prometheus.yml and cycled Prometheus. RHEL 7.9 node.

./prometheus-slurm-exporter --listen-address="0.0.0.0:9500" -gpus-acct
INFO[0000] Starting Server: 0.0.0.0:9500 source="main.go:59"
INFO[0000] GPUs Accounting: true source="main.go:60"

However, in 2nd terminal window of my node,

curl http://localhost:9500/metrics
curl: (7) Failed connect to localhost:9500; Connection refused

And if I ps: it shows:

ps -ef | grep prometheus-slurm-exporter
gbeyer3 238618 225323 0 15:26 pts/1 00:00:00 grep --color=auto prometheus-slurm-exporter

So it seems not to be running, though it shows as running in the foreground in the first terminal.

Can someone suggest a solution?

Thanks

long node name causes index out of range error

First of all, great project! This works great.

Found one easily fixable issue. If the node names are too long, the memoryAlloc field value merges into the node name value:

INFO[0000] Starting Server: 0.0.0.0:9090                 source="main.go:59"
INFO[0000] GPUs Accounting: false                        source="main.go:60"
INFO[0049] sinfo fields: [production-slurm-compute-10 8192 0/4/0/4 idle]  source="node.go:55"
panic: runtime error: index out of range [4] with length 4

goroutine 39 [running]:
main.ParseNodeMetrics({0xc0001b9000, 0x594, 0xc0000adde0?})
	/home/jyost/development/git/prometheus-slurm-exporter/node.go:58 +0x5a5
main.NodeGetMetrics()
	/home/jyost/development/git/prometheus-slurm-exporter/node.go:41 +0x1e
main.(*NodeCollector).Collect(0xc00007f740, 0xc0000adf60?)
	/home/jyost/development/git/prometheus-slurm-exporter/node.go:131 +0x3e
github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
	/home/jyost/development/git/prometheus-slurm-exporter/go/modules/pkg/mod/github.com/prometheus/[email protected]/prometheus/registry.go:443 +0xfb
created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
	/home/jyost/development/git/prometheus-slurm-exporter/go/modules/pkg/mod/github.com/prometheus/[email protected]/prometheus/registry.go:535 +0xb0b

The node name is production-slurm-compute-1 and the 0 got appended to the end of the node name. I fixed it in my fork and will submit a PR soon.

missing squeue option (--states=all) in queue.go

PATCH:

cmd := exec.Command("/usr/bin/squeue", "-h", "-o %A,%T,%r", "--states=all")

ISSUE:

Does not track COMPLETED jobs.

man squeue (slurm >= 17.11.12)
...
-t <state_list>, --states=<state_list>
Specify the states of jobs to view. Accepts a comma separated list of state names or "all". If
"all" is specified then jobs of all states will be reported. If no state is specified then pending,
running, and completing jobs are reported. See the JOB STATE CODES section below for a list
of valid states. Both extended and compact forms are valid. Note the <state_list> supplied is
case insensitive ("pd" and "PD" are equivalent).
...

Pending Array Jobs

The number of pending jobs isn't quite correct, as job 12345_[1-10] is 10 pending jobs, but only shows up as one. Using squeue -r would list each array job as its own line. Thoughts?
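
A short Go sketch of that approach, assuming -r expands each array task onto its own line as described above (not the exporter's actual code):

package main

import (
	"bufio"
	"fmt"
	"os/exec"
	"strings"
)

// pendingJobs counts pending jobs with array tasks expanded: with -r,
// 12345_[1-10] should contribute ten PENDING lines instead of one.
func pendingJobs() (int, error) {
	out, err := exec.Command("squeue", "-r", "-h", "-o", "%T").Output()
	if err != nil {
		return 0, err
	}
	count := 0
	scanner := bufio.NewScanner(strings.NewReader(string(out)))
	for scanner.Scan() {
		if strings.TrimSpace(scanner.Text()) == "PENDING" {
			count++
		}
	}
	return count, scanner.Err()
}

func main() {
	n, err := pendingJobs()
	if err != nil {
		fmt.Println("squeue failed:", err)
		return
	}
	fmt.Println("pending jobs:", n)
}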

Install troubles: undefined: NewNodeCollector

Hi, Thank you for building this cool project.
I have to apologize in advance. I don't know the G in GO. I'd just like to use this project without learning GO.

I've had a minor hiccup with the installation process as documented by DEVELOPMENT.md.

go test -v *.go
=== RUN   TestCPUsMetrics
    cpus_test.go:31: &{alloc:5725 idle:877 other:34 total:6636}
--- PASS: TestCPUsMetrics (0.00s)
=== RUN   TestCPUssGetMetrics
    cpus_test.go:35: &{alloc:76 idle:84 other:40 total:200}
--- PASS: TestCPUssGetMetrics (0.01s)
=== RUN   TestNodesMetrics
    nodes_test.go:31: &{alloc:0 comp:0 down:0 drain:0 err:0 fail:0 idle:0 maint:0 mix:0 resv:0}
--- PASS: TestNodesMetrics (0.03s)
=== RUN   TestNodesGetMetrics
    nodes_test.go:35: &{alloc:0 comp:0 down:1 drain:0 err:0 fail:0 idle:1 maint:0 mix:3 resv:0}
--- PASS: TestNodesGetMetrics (0.01s)
=== RUN   TestNodeMetrics
    node_test.go:48: map[a048:0xc000258300 a049:0xc0002583c0 a050:0xc000258440 a051:0xc0002584c0 a052:0xc000258540 b001:0xc000258640 b002:0xc000258e00 b003:0xc000258f00]
--- PASS: TestNodeMetrics (0.00s)
=== RUN   TestParseQueueMetrics
    queue_test.go:31: &{pending:4 pending_dep:0 running:28 suspended:1 cancelled:1 completing:2 completed:1 configuring:1 failed:1 timeout:1 preempted:1 node_fail:1}
--- PASS: TestParseQueueMetrics (0.00s)
=== RUN   TestQueueGetMetrics
    queue_test.go:35: &{pending:69 pending_dep:0 running:12 suspended:0 cancelled:0 completing:0 completed:0 configuring:0 failed:0 timeout:0 preempted:0 node_fail:0}
--- PASS: TestQueueGetMetrics (0.01s)
=== RUN   TestSchedulerMetrics
    scheduler_test.go:31: &{threads:3 queue_size:0 dbd_queue_size:0 last_cycle:97209 mean_cycle:74593 cycle_per_minute:63 backfill_last_cycle:1.94289e+06 backfill_mean_cycle:1.96082e+06 backfill_depth_mean:29324 total_backfilled_jobs_since_start:111544 total_backfilled_jobs_since_cycle:793 total_backfilled_heterogeneous:10}
--- PASS: TestSchedulerMetrics (0.01s)
=== RUN   TestSchedulerGetMetrics
    scheduler_test.go:35: &{threads:3 queue_size:0 dbd_queue_size:0 last_cycle:63 mean_cycle:625 cycle_per_minute:1 backfill_last_cycle:0 backfill_mean_cycle:625 backfill_depth_mean:0 total_backfilled_jobs_since_start:6249 total_backfilled_jobs_since_cycle:0 total_backfilled_heterogeneous:0}
--- PASS: TestSchedulerGetMetrics (0.03s)
PASS
ok      command-line-arguments  0.125s
$ go build -o bin/prometheus-slurm-exporter {main,accounts,cpus,gpus,partitions,nodes,queue,scheduler,sshare,users}.go
# command-line-arguments
./main.go:31:26: undefined: NewNodeCollector
$go version
go version go1.15 linux/amd64

Does anyone have a clue on how I can fix this?
Thanks in advance.

Nodelist and jobID

Hi,

Thank you for your tool! It is super useful.

Is it possible to obtain the 'jobid' and the 'nodelist'? I used the Slurm dashboard in Grafana but I do not see this data.

Could you help me on that?

Crash when node names are too long

If node names are over 20 characters long, the output of sinfo -h -N -O "NodeList,AllocMem,Memory,CPUsState,StateLong", used at node.go:85, looks like this:

cpu-always-on-st-t30                   1                   0/2/0/2             idle                
cpu-spot-dy-c52xlar0                   1                   0/8/0/8             idle~               
cpu-spot-dy-c52xlar0                   1                   0/8/0/8             idle~               

You can see that the node name and memory are not separated by whitespace.

This results in a crash with the following output:

prometheus-slurm-exporter[5783]: panic: runtime error: index out of range [4] with length 4
prometheus-slurm-exporter[5783]: goroutine 9 [running]:
prometheus-slurm-exporter[5783]: main.ParseNodeMetrics(0xc00016e000, 0x5eb, 0xe00, 0xc0000b10d8)
prometheus-slurm-exporter[5783]: #011/home/ubuntu/aws-parallelcluster-monitoring/prometheus-slurm-exporter/node.go:56 +0x6cf
prometheus-slurm-exporter[5783]: main.NodeGetMetrics(0x8b7f20)
prometheus-slurm-exporter[5783]: #011/home/ubuntu/aws-parallelcluster-monitoring/prometheus-slurm-exporter/node.go:40 +0x2a
prometheus-slurm-exporter[5783]: main.(*NodeCollector).Collect(0xc00007a000, 0xc0000b1080)
prometheus-slurm-exporter[5783]: #011/home/ubuntu/aws-parallelcluster-monitoring/prometheus-slurm-exporter/node.go:128 +0x37
prometheus-slurm-exporter[5783]: github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
prometheus-slurm-exporter[5783]: #011/home/ubuntu/aws-parallelcluster-monitoring/prometheus-slurm-exporter/go/modules/pkg/mod/github.com/prometheus/[email protected]/prometheus/registry.go:443 +0x1a2
prometheus-slurm-exporter[5783]: created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
prometheus-slurm-exporter[5783]: #011/home/ubuntu/aws-parallelcluster-monitoring/prometheus-slurm-exporter/go/modules/pkg/mod/github.com/prometheus/[email protected]/prometheus/registry.go:535 +0xe8e
systemd[1]: slurm_exporter.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
systemd[1]: slurm_exporter.service: Failed with result 'exit-code'.

It expects 5 fields separated by whitespace, but finds only 4, which results in an out-of-bounds array access and a panic.

A possible fix is to change sinfo -h -N -O "NodeList,AllocMem,Memory,CPUsState,StateLong" to sinfo -h -N -O "NodeList: ,AllocMem: ,Memory: ,CPUsState: ,StateLong: ", explicitly telling SLURM to append a space after each value.
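
Until the command line is changed, a defensive check along these lines would also avoid the panic (a sketch, not the exporter's ParseNodeMetrics):

package main

import (
	"fmt"
	"strings"
)

// parseNodeLine checks the field count (NodeList, AllocMem, Memory,
// CPUsState, StateLong) before indexing, so a merged column becomes a
// log line instead of a panic.
func parseNodeLine(line string) ([]string, error) {
	fields := strings.Fields(line)
	if len(fields) != 5 {
		return nil, fmt.Errorf("expected 5 sinfo fields, got %d: %q", len(fields), line)
	}
	return fields, nil
}

func main() {
	lines := []string{
		"compute-1 0 7772 0/2/0/2 idle",        // well-formed
		"cpu-spot-dy-c52xlar0 1 0/8/0/8 idle~", // merged name+AllocMem, 4 fields
	}
	for _, line := range lines {
		fields, err := parseNodeLine(line)
		if err != nil {
			fmt.Println("skipping:", err)
			continue
		}
		fmt.Println("node:", fields[0], "state:", fields[4])
	}
}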

Wrong regex for reserved nodes

We are running Slurm 17.11.2 and the exporter is not reporting reserved nodes correctly because of different output from the sinfo command.

The exporter is using the %T field, which prints the long format. It should be %t.

From slurm manuals:

%t
State of nodes, compact form
%T
State of nodes, extended form

export queue info from squeue

Hi

would it be possible to attach queue info to the jobs? It would be nice to plot the job state graph filtered by queue.

Best
Justin

CI/CD workflow

Hi!

I'm willing to contribute a minimal CI/CD workflow for the project using GitHub actions (just building and running tests).

Is this something you'd be interested in having? Let me know and I'll prepare a PR.

panic: runtime error: index out of range [4] with length 4

Hello, I'm getting the same error as issue #56:

panic: runtime error: index out of range [4] with length 4

goroutine 26 [running]:
main.ParseNodeMetrics(0xc000140000, 0x25e, 0x600, 0x87080d)
/storage/home/hhiveman1/gbeyer3/prometheus-slurm-exporter/node.go:56 +0x6d6
main.NodeGetMetrics(0x1)
/storage/home/hhiveman1/gbeyer3/prometheus-slurm-exporter/node.go:40 +0x2a
main.(*NodeCollector).Collect(0xc000020030, 0xc000072660)
/storage/home/hhiveman1/gbeyer3/prometheus-slurm-exporter/node.go:128 +0x37
github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
/storage/home/hhiveman1/gbeyer3/go/pkg/mod/github.com/prometheus/[email protected]/prometheus/registry.go:443 +0x12b
created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
/storage/home/hhiveman1/gbeyer3/go/pkg/mod/github.com/prometheus/[email protected]/prometheus/registry.go:535 +0xe4d

Answering the questions you asked that poster:

I am using the latest version of the slurm-exporter (0.20), which I cloned directly from your repo and built using the instructions in DEVELOPMENT.md.

slurm version 22.05.0

sinfo -h -N -O "NodeList,AllocMem,Memory,CPUsState,StateLong"
atl1-1-02-018-35 0 515741 0/64/0/64 idle
atl1-1-03-002-35 0 191856 0/24/0/24 idle
compile-coda 0 191881 0/24/0/24 idle
compute-dev-slurm-1-0 3770 0/4/0/4 idle
compute-dev-slurm-1-0 3770 0/4/0/4 idle
compute-dev-slurm-2-0 3770 0/4/0/4 idle

Node Usage - GPU metrics

Could we get GPU stats added to the node usage metrics, similar to the CPU stats that have been added in 0.18?

gpu version

Hi,
I am testing the latest version and the GPU info seems to not be accurate; how can I start debugging?

# TYPE slurm_gpus_alloc gauge
slurm_gpus_alloc 21
# HELP slurm_gpus_idle Idle GPUs
# TYPE slurm_gpus_idle gauge
slurm_gpus_idle -21
# HELP slurm_gpus_total Total GPUs
# TYPE slurm_gpus_total gauge
slurm_gpus_total 0
# HELP slurm_gpus_utilization Total GPU utilization
# TYPE slurm_gpus_utilization gauge
slurm_gpus_utilization +Inf
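
As a starting point for debugging: the numbers above are consistent with idle being derived as total minus alloc and utilization as alloc over total, so a total that fails to parse (0) yields -21 and +Inf. A hedged Go sketch of a guard (illustrative, not the exporter's code):

package main

import (
	"fmt"
	"math"
)

// derivedGPUMetrics sketches a guard for the values above: with total == 0
// and alloc == 21, naive arithmetic yields idle == -21 and utilization == +Inf.
func derivedGPUMetrics(alloc, total float64) (idle, utilization float64) {
	idle = total - alloc
	if idle < 0 {
		idle = 0 // total likely failed to parse; don't export a negative gauge
	}
	utilization = math.NaN() // NaN is less misleading than +Inf when total is unknown
	if total > 0 {
		utilization = alloc / total
	}
	return idle, utilization
}

func main() {
	idle, util := derivedGPUMetrics(21, 0)
	fmt.Println("idle:", idle, "utilization:", util) // idle: 0 utilization: NaN
}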

Cheers.

Count GPU requests on running jobs per account

Hi,
We are counting the number of GPUs that a certain account is using on running jobs with something like this:

gpucount=`( squeue -p gpu -h -A $group -o "%t %b" | grep ^R | cut -f2 -d' ' | sed -e 's/gpu://g' | tr '\n' '+'; echo 0 ) | bc || echo 0`

Is this something you could easily provide in the exporter? If not, maybe I can add it, but I would need some guidance.
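
For reference, a Go sketch of the same counting logic as the shell pipeline (a sketch only; the %b output format varies across Slurm versions, e.g. gpu:2 vs gpu:tesla:2):

package main

import (
	"bufio"
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

// allocatedGPUs sums the gpu:N TRES values of running jobs for one account,
// mirroring: squeue -p gpu -h -A $group -o "%t %b" | grep ^R | ...
func allocatedGPUs(account string) (int, error) {
	out, err := exec.Command("squeue", "-p", "gpu", "-h", "-A", account, "-o", "%t %b").Output()
	if err != nil {
		return 0, err
	}
	total := 0
	scanner := bufio.NewScanner(strings.NewReader(string(out)))
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) < 2 || fields[0] != "R" {
			continue // only running jobs, as in the grep ^R above
		}
		parts := strings.Split(fields[1], ":")
		if n, err := strconv.Atoi(parts[len(parts)-1]); err == nil {
			total += n
		}
	}
	return total, scanner.Err()
}

func main() {
	n, err := allocatedGPUs("mygroup")
	if err != nil {
		fmt.Println("squeue failed:", err)
		return
	}
	fmt.Println("GPUs allocated:", n)
}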

Cheers.

No data on graphs

Hi,

I have installed prometheus-slurm-exporter as a service; I have Grafana and Prometheus running as services also. I have configured the Prometheus data source correctly in Grafana (tested and working) and added the recommended configuration to the prometheus.yml. Unfortunately the graphs in the dashboards report "No data points". Do I need to make any other configurations in order for this to work? Have you seen this type of behavior? Any hints would be very much welcomed.

Thank you

[screenshots attached: grafana_slurm_2, grafana_slurm_1]

Create an official docker image

Hi all,

I use Docker on my dev stack, and I think it would be very useful to have an official Docker image of the slurm exporter.

Regards,

Job Status not retrieved

Hi

slurm-exporter can't scrape job status from Slurm 22.05.5

slurm_queue_cancelled 0
slurm_queue_completed 0
slurm_queue_completing 0
slurm_queue_configuring 0
slurm_queue_failed 0
slurm_queue_node_fail 0
slurm_queue_pending 0
slurm_queue_pending_dependency 0
slurm_queue_preempted 0
slurm_queue_running 0
slurm_queue_suspended 0

Any ideas?

Exporter dies when Slurm accounting not enabled

Hey,

We're using this to monitor a small Slurm cluster, and it's very useful, thanks! We're facing an issue, however, after recently upgrading to 0.17.

In ParseAllocatedGPUs(), sacct is executed to get some data. We don't use Slurm accounting, so the subprocess exits with code 1 to show failure. Execute() receives the non-zero code, and considers this fatal, killing the entire exporter.

I'm happy to attempt a fix myself, but do you have any suggestions for a good logic flow in this case?

Perhaps something like an optional argument to Execute() that designates "allowable" exit codes, meaning blank data is returned and execution continues.
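
A minimal Go sketch of that proposal (names and flags are assumptions, not the exporter's actual Execute()):

package main

import (
	"fmt"
	"os/exec"
)

// executeAllowing runs a command, but treats the listed exit codes as
// "no data" instead of fatal, so the exporter keeps running.
func executeAllowing(name string, args []string, allowed ...int) ([]byte, error) {
	out, err := exec.Command(name, args...).Output()
	if exitErr, ok := err.(*exec.ExitError); ok {
		for _, code := range allowed {
			if exitErr.ExitCode() == code {
				return nil, nil // blank data; execution continues
			}
		}
	}
	return out, err
}

func main() {
	// Tolerate exit code 1, which sacct returns here when accounting is disabled.
	out, err := executeAllowing("sacct", []string{"-a", "-X", "--noheader"}, 1)
	if err != nil {
		fmt.Println("fatal:", err) // genuinely unexpected failure
		return
	}
	fmt.Printf("got %d bytes of accounting data\n", len(out))
}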

Nested accounts missing from fairshare

Hi,

We have a nested account arrangement, and those accounts aren't properly being reported on.

I dug into the code, and the command is:

$ sshare -n -P -o account,fairshare
root|0.500000
 top_1|0.999998
  nested_1_1|0.999998
  nested_1_2|1.000000
   nested_1_2_1|1.000000
 top_2|0.481723
  nested_2_1|0.858038
   nested_2_2|0.961831

However when I get the metrics, I only get root, top_1 and top_2.

'root' isn't useful. Top accounts are useful as an aggregate, but I'd also like to see the nested accounts.

Ideally, we would have "slurm_account_fairshare" as it is, and also offer "slurm_subaccount_fairshare" so that I could graph both.

Looks like ParseFairShareMetrics() is the culprit, throwing away anything that starts with more than one space.

                if ! strings.HasPrefix(line,"  ") {

I can see the argument for doing it, hence my proposal to gather two sets of metrics.
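
A Go sketch of what depth-aware parsing of the sshare output could look like, assuming one leading space per nesting level as in the sample above (illustrative, not the exporter's actual ParseFairShareMetrics):

package main

import (
	"fmt"
	"strings"
)

// account holds one sshare line: indentation depth distinguishes root
// (0), top-level accounts (1), and nested sub-accounts (>= 2).
type account struct {
	name      string
	depth     int
	fairshare string
}

func parseFairshare(out string) []account {
	var accounts []account
	for _, line := range strings.Split(strings.TrimRight(out, "\n"), "\n") {
		trimmed := strings.TrimLeft(line, " ")
		parts := strings.SplitN(trimmed, "|", 2)
		if len(parts) != 2 {
			continue
		}
		accounts = append(accounts, account{
			name:      parts[0],
			depth:     len(line) - len(trimmed),
			fairshare: parts[1],
		})
	}
	return accounts
}

func main() {
	sample := "root|0.500000\n top_1|0.999998\n  nested_1_1|0.999998\n top_2|0.481723\n"
	for _, a := range parseFairshare(sample) {
		// depth 1 could feed slurm_account_fairshare as today,
		// depth >= 2 the proposed slurm_subaccount_fairshare.
		fmt.Printf("depth=%d account=%s fairshare=%s\n", a.depth, a.name, a.fairshare)
	}
}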

no metrics

I've followed the DEVELOPMENT.md for installation, but when trying to use the command curl http://localhost:9103/metrics after ./bin/prometheus-slurm-exporter --listen-address="0.0.0.0:9103", there is no output and the curl command waits until I kill the exporter.
The output from the exporter is only:
INFO[0000] Starting Server: 0.0.0.0:9103 source="main.go:48"

Any idea why? I use CentOS 8.2 with Prometheus 2.22.0 and Slurm 20.02.5.

fork/exec /usr/bin/sdiag: no such file or directory

Good day

Ubuntu 16.04 LTS

I've compiled and run the prometheus slurm exporter as per the guide on a VM that is connected to the Slurm cluster, so it is able to get information from sinfo, squeue, etc.

As per the guide for Debian, it says to install jessie-backports; that part doesn't work (unable to locate such a package).

My issue is: after I've compiled and run the binary, once I open the URL (wget or curl) localhost:8080/metrics it gives an ERROR 404 page.

--2019-06-10 11:44:02--  http://localhost:8080/metrics
Resolving localhost (localhost)... 127.0.0.1
Connecting to localhost (localhost)|127.0.0.1|:8080... connected.
HTTP request sent, awaiting response... No data received.
Retrying.

Eventually the process stops with this:

INFO[0000] Starting Server: :8080                        source="main.go:43"

2019/06/10 11:36:00 fork/exec /usr/bin/sdiag: no such file or directory

And therefore I cannot connect at all then.

Any steps to troubleshoot this? Is there also a way to change the ports?

Nodes not shown when in drain state.

Not sure if it's me doing something wrong or if it's somewhere in the code, but when I put a node in the drain state it does not show up at all; it just disappears from the node count altogether.

Please let me know what I need to do to troubleshoot this.

Failing to build

I'm probably missing something really obvious but following the instructions I hit this on Rocky Linux 8.5:

[root@dev-control slurm-exporter]# go version
go version go1.15.15 linux/amd64
[root@dev-control slurm-exporter]# make
mkdir -p /tmp/slurm-exporter/bin
Build main.go nodes.go queue.go scheduler.go to bin/prometheus-slurm-exporter
main.go:22:3: cannot find package "github.com/prometheus/client_golang/prometheus" in any of:
        /usr/local/go/src/github.com/prometheus/client_golang/prometheus (from $GOROOT)
        /tmp/slurm-exporter/src/github.com/prometheus/client_golang/prometheus (from $GOPATH)
        /usr/share/gocode/src/github.com/prometheus/client_golang/prometheus
main.go:23:3: cannot find package "github.com/prometheus/client_golang/prometheus/promhttp" in any of:
        /usr/local/go/src/github.com/prometheus/client_golang/prometheus/promhttp (from $GOROOT)
        /tmp/slurm-exporter/src/github.com/prometheus/client_golang/prometheus/promhttp (from $GOPATH)
        /usr/share/gocode/src/github.com/prometheus/client_golang/prometheus/promhttp
main.go:21:3: cannot find package "github.com/prometheus/common/log" in any of:
        /usr/local/go/src/github.com/prometheus/common/log (from $GOROOT)
        /tmp/slurm-exporter/src/github.com/prometheus/common/log (from $GOPATH)
        /usr/share/gocode/src/github.com/prometheus/common/log
make: *** [Makefile:11: build] Error 1

I did start off with Go 1.16 (as that has a package available in EPEL for RHEL clones), but I hit the above (or a similar) error and discovered that 1.16 changes Go module handling, so I deleted it and used 1.15 as per the instructions.

I've never used Go, so I'm a bit stumped here; any help appreciated.
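For what it's worth, "cannot find package ... (from $GOPATH)" errors mean the build ran in legacy GOPATH mode, so module dependencies were never fetched. With Go 1.13–1.15, forcing module mode from the checkout (assuming it contains a go.mod) usually gets past this:

GO111MODULE=on make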

Missing exposed metrics in Prometheus

Hi
While using your exporter and the Slurm Grafana dashboard, I noticed that these metrics are not exposed:

"expr": "slurm_account_cpus_running"
"expr": "slurm_account_jobs_pending"
"expr": "slurm_account_jobs_running"
"expr": "slurm_partition_cpus_allocated"
"expr": "slurm_partition_jobs_pending"
"expr": "slurm_user_cpus_running"
"expr": "slurm_user_jobs_pending"
"expr": "slurm_user_jobs_running"

I guess I am missing some option to enable these metrics when starting the exporter, but I cannot find which one...
Could you help please?

In fact it seems that the metrics are exported only if their value is >0... but why is that? Zero is still data, whereas a missing metric shows up as N/A in the dashboard, which makes it look as though the dashboard itself has an issue!

Thanks!
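The behaviour described is typical of collectors that only emit samples for label values actually seen in squeue output. A minimal sketch of the usual workaround, seeding every known label with a zero default so the series always exists (collectJobCounts, accounts, and running are hypothetical names):

package main

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical metric description for running jobs per account.
var jobsRunningDesc = prometheus.NewDesc(
    "slurm_account_jobs_running",
    "Running jobs per account",
    []string{"account"}, nil)

// collectJobCounts emits a sample for every known account, defaulting
// to zero, so dashboards see 0 instead of N/A when an account has no
// running jobs; looking up a missing map key yields 0.
func collectJobCounts(ch chan<- prometheus.Metric,
    accounts []string, running map[string]float64) {
    for _, account := range accounts {
        ch <- prometheus.MustNewConstMetric(
            jobsRunningDesc, prometheus.GaugeValue, running[account], account)
    }
}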

Doesn't build on EL7

I've tried the CentOS build instructions on Scientific Linux 7.8 and failed:

$ make test
$GOPATH/go.mod exists but should not
make: *** [test] Error 1

This is with golang 1.13.6 from the EPEL repository.
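For what it's worth, that error means GOPATH points at the checkout itself (a directory containing go.mod); unsetting GOPATH or pointing it elsewhere, e.g. GOPATH=$HOME/go make test, normally gets past it.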

Crash with a runtime error

I've been testing the Prometheus exporter, and overnight it crashed with this message:

panic: runtime error: index out of range

goroutine 38867 [running]:
panic(0x8ccec0, 0xc82000a0f0)
	/usr/lib/go-1.6/src/runtime/panic.go:481 +0x3e6
main.SplitColonValueToFloat(0xc8202613c1, 0x11, 0x0)
	/home/test/build/src/github.com/vpenso/prometheus-slurm-exporter/scheduler.go:59 +0xc5
main.ParseSchedulerMetrics(0xc8201d6000, 0x1015, 0x1e00, 0xc8202bec00)
	/home/test/build/src/github.com/vpenso/prometheus-slurm-exporter/scheduler.go:72 +0x1c3
main.SchedulerGetMetrics(0x4307fd)
	/home/test/build/src/github.com/vpenso/prometheus-slurm-exporter/scheduler.go:81 +0x48
main.(*SchedulerCollector).Collect(0xc82005b000, 0xc8201ff380)
	/home/test/build/src/github.com/vpenso/prometheus-slurm-exporter/scheduler.go:117 +0x1c
github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func2(0xc8200fc1f0,
0xc8201ff380, 0x7f3fa171eae0, 0xc82005b000)
	/home/test/build/src/github.com/vpenso/prometheus-slurm-exporter/src/github.com/prometheus/client_golang/prometheus/registry.go:433 +0x58
created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
	/home/test/build/src/github.com/vpenso/prometheus-slurm-exporter/src/github.com/prometheus/client_golang/prometheus/registry.go:434 +0x360

I am running slurm 16.05.10 on Enterprise Linux 7.3, but the exporter was built using Go 1.6 on Ubuntu.
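The panic comes from indexing the result of a string split without checking how many parts came back, which blows up on any sdiag line that lacks a colon-separated value. A defensive sketch (the error-returning signature is illustrative, not the project's current one):

package main

import (
    "fmt"
    "strconv"
    "strings"
)

// SplitColonValueToFloat extracts the number after the colon in sdiag
// lines such as "Last cycle:   97209". Returning an error instead of
// indexing blindly avoids the index-out-of-range panic on lines that
// carry no colon-separated value.
func SplitColonValueToFloat(input string) (float64, error) {
    parts := strings.SplitN(input, ":", 2)
    if len(parts) != 2 {
        return 0, fmt.Errorf("no colon-separated value in %q", input)
    }
    return strconv.ParseFloat(strings.TrimSpace(parts[1]), 64)
}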

Slurm 20 compatible?

Good day

I previously used this exporter on Ubuntu 16 and an older version of Slurm, and it worked correctly.

However, I'm now running Ubuntu 18 LTS with Slurm 20 and get a "404" error when I query the exporter.

Jul 22 10:43:05 slurm-login systemd[1]: Started slurm exporter for prometheus.
Jul 22 10:43:05 slurm-login prometheus-slurm-exporter[4706]: time="2021-07-22T10:43:05+02:00" level=info msg="Starting Server: :9341" source="main.go:59"
Jul 22 10:43:05 slurm-login prometheus-slurm-exporter[4706]: time="2021-07-22T10:43:05+02:00" level=info msg="GPUs Accounting: true" source="main.go:60"
root@slurm-login:/opt/prometheus-slurm-exporter-0.19# curl localhost:9341
404 page not found

I'm able to run commands such as squeue and sinfo from anywhere on the box and they work correctly.

Any ideas?
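Note that the curl above requests the root path. The exporter registers its handler only on /metrics, so / returning 404 is expected; roughly this pattern (a minimal sketch of how such exporters wire up promhttp, not the project's exact main.go):

package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
    // Only /metrics is registered; any other path, including "/",
    // falls through to net/http's default 404 handler.
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":9341", nil))
}

So curl localhost:9341/metrics, rather than curl localhost:9341, should return data.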

Is this still maintained?

I see that the last commit to main was in March of 2022, and I also see a lot of outstanding PRs. Does this mean the repo is not maintained anymore? Is there a dependable fork to rely on?

panic: runtime error: index out of range [4] with length 4

Hi!

I am seeing the following problem:

[root@ip-10-3-5-236 prometheus-slurm-exporter]# curl http://localhost:8080/metrics
panic: runtime error: index out of range [4] with length 4

goroutine 66 [running]:
main.ParseNodeMetrics(0xc0006a0000, 0x328ca, 0x3fe00, 0x0)
        /root/prometheus-slurm-exporter/node.go:56 +0x6cf
main.NodeGetMetrics(0x0)
        /root/prometheus-slurm-exporter/node.go:40 +0x2a
main.(*NodeCollector).Collect(0xc0001986c0, 0xc0000a8540)
        /root/prometheus-slurm-exporter/node.go:128 +0x37
github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
        /root/prometheus-slurm-exporter/go/modules/pkg/mod/github.com/prometheus/[email protected]/prometheus/registry.go:443 +0x1a2
created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
        /root/prometheus-slurm-exporter/go/modules/pkg/mod/github.com/prometheus/[email protected]/prometheus/registry.go:535 +0xe8e
curl: (52) Empty reply from server
[1]+  Exit 2                  /usr/bin/prometheus-slurm-exporter
[root@ip-10-3-5-236 prometheus-slurm-exporter]# 

Any suggestions?
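node.go:56 indexes the fifth whitespace-separated field of each sinfo line, and the panic says only four came back. This typically happens when a wide value overflows sinfo -O's default 20-character column width and two adjacent columns run together. A defensive sketch that skips malformed rows instead of panicking (parseNodeLine is a hypothetical helper, assuming the sinfo invocation used by the exporter):

package main

import "strings"

// parseNodeLine splits one row of
// sinfo -h -N -O "NodeList,AllocMem,Memory,CPUsState,StateLong"
// and skips rows that do not carry all five fields instead of
// panicking, which is what happens when wide values overflow sinfo's
// default column width and adjacent columns merge.
func parseNodeLine(line string) ([]string, bool) {
    fields := strings.Fields(line)
    if len(fields) < 5 {
        return nil, false // malformed row: caller should log and skip
    }
    return fields, true
}

Explicitly widening the sinfo format fields (e.g. NodeList:32) is another way to keep columns from merging.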

User and Account info?

How difficult would it be to get user and account info into the exporter? Use cases would be things like pie charts of jobs per account and/or user for the cluster.

Exporter fails due to "AllocGRES is deprecated" fatal error

The latest prometheus-slurm-exporter runs for a few seconds before terminating with a fatal error:

prometheus-slurm-exporter/bin/prometheus-slurm-exporter   
INFO[0000] Starting Server: :8080                        source="main.go:48"
FATA[0004] exit status 1                                 source="gpus.go:101"

I'm running slurm-20.11.3-1, and a rebuild picked up the new gpus.go module. Digging into it a bit, it appears sacct treats the AllocGRES format option as a fatal error, which causes the Execute() routine to terminate:

sh-4.4$ sacct -a -X --format=Allocgres --state=RUNNING --noheader --parsable2
sacct: fatal: AllocGRES is deprecated, please use AllocTRES
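Since AllocGRES has been removed in favour of AllocTRES, one workaround is to query --format=AllocTRES and pull the GPU count out of the TRES string. A rough sketch, assuming Slurm's usual "name=value,..." TRES format (gpusFromTRES is a hypothetical helper):

package main

import (
    "strconv"
    "strings"
)

// gpusFromTRES extracts the GPU count from an AllocTRES string such as
// "billing=8,cpu=8,gres/gpu=2,mem=32G,node=1". It returns 0 when no
// gres/gpu entry is present.
func gpusFromTRES(tres string) float64 {
    for _, entry := range strings.Split(tres, ",") {
        kv := strings.SplitN(entry, "=", 2)
        if len(kv) == 2 && kv[0] == "gres/gpu" {
            if value, err := strconv.ParseFloat(kv[1], 64); err == nil {
                return value
            }
        }
    }
    return 0
}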

Slurm exporter crashes on Slurm 20.11.8

When I try to run curl http://localhost:8080/metrics on the latest build of the exporter, I see the following error message. Is there a fix for this?

panic: runtime error: index out of range [4] with length 4

goroutine 12 [running]:
main.ParseNodeMetrics(0xc0003c6000, 0x1f9, 0x600, 0x0)
/opt/prometheus-slurm-exporter/node.go:56 +0x6d6
main.NodeGetMetrics(0x0)
/opt/prometheus-slurm-exporter/node.go:40 +0x2a
main.(*NodeCollector).Collect(0xc0000ab710, 0xc0001a2660)
/opt/prometheus-slurm-exporter/node.go:128 +0x37
github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
/root/go/pkg/mod/github.com/prometheus/[email protected]/prometheus/registry.go:443 +0x12b
created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
/root/go/pkg/mod/github.com/prometheus/[email protected]/prometheus/registry.go:454 +0x5ce

Question: Where to install?

Can someone please point me to where exactly the exporter must be installed? The login node? The controller node? A worker node? Or on all worker nodes?

Regards

Running as systemd service with port change does not work

  1. I copied the original service file and added the option -listen-address 0.0.0.0:9101; the service starts, but the metrics page does not work.
  2. Running /usr/bin/prometheus-slurm-exporter -listen-address 0.0.0.0:9101 manually does work.

Deric
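A common pitfall here: for a simple service, adding a second ExecStart line without clearing the first makes systemd reject the unit, and edits to a copied unit file only take effect after a daemon-reload. A hedged example of a drop-in (via systemctl edit; the unit name prometheus-slurm-exporter is assumed):

[Service]
# Clear the inherited ExecStart before setting the replacement,
# otherwise systemd refuses a simple service with two ExecStart lines.
ExecStart=
ExecStart=/usr/bin/prometheus-slurm-exporter -listen-address 0.0.0.0:9101

followed by systemctl daemon-reload && systemctl restart prometheus-slurm-exporter.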

add gpu statistics

It would be great if a count of GPUs (and other GRES) could be provided in the metrics :)
