Comments (15)
@Sebi94nbg -- @vpenso and @mtds seem to have gone offline. Another bug fix, for node names of 20+ characters, with a PR that was offered in May, has also not been merged into main. 🤷
from prometheus-slurm-exporter.
Are you using -gpus-acct? I found that gpus.go needs to be updated.
- --format=Allocgres
+ --format=AllocTRES
Here's the error if you run the sacct command as is:
> sacct -a -X --format=Allocgres --state=RUNNING --noheader --parsable2
sacct: fatal: AllocGRES is deprecated, please use AllocTRES
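For readers adapting their own parsing after switching to AllocTRES, here is a minimal sketch of pulling a GPU count out of one AllocTRES value. It is not the code in gpus.go. The function name `gpusFromAllocTRES` is hypothetical, and the sample string assumes the usual comma-separated `key=value` layout with a generic `gres/gpu=N` entry.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// gpusFromAllocTRES (hypothetical helper) sums the generic "gres/gpu=N"
// entries in one AllocTRES value such as
// "billing=8,cpu=8,gres/gpu=2,mem=64G,node=1".
// Typed entries like "gres/gpu:a100=N" are ignored in this sketch.
func gpusFromAllocTRES(tres string) int {
	total := 0
	for _, entry := range strings.Split(tres, ",") {
		key, value, found := strings.Cut(strings.TrimSpace(entry), "=")
		if !found || key != "gres/gpu" {
			continue
		}
		n, err := strconv.Atoi(value)
		if err != nil {
			continue // skip values that are not plain integers
		}
		total += n
	}
	return total
}

func main() {
	fmt.Println(gpusFromAllocTRES("billing=8,cpu=8,gres/gpu=2,mem=64G,node=1")) // prints 2
}
```

Whether typed entries should be counted instead of, or in addition to, the generic one depends on how accounting is configured at your site, so treat that choice as an assumption.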
If you are using --gpus-acct, try #73.
Hi all,
I am having the same issue. I tried both solutions (from @wmoore28 and @itzsimpl).
My Slurm version is 21.08.5.
Does anyone have a workaround for the array-length panic?
The process/service is killed by this panic.
P.S.: go test *.go passes for all tests.
Thanks
Reviewing the error listed in the first post, this has nothing to do with gpus.go, but with node.go. Try running the command sinfo -h -N -O NodeList,AllocMem,Memory,CPUsState,StateLong on your system and examine the output. This is what node.go executes and then parses. Compare your output with what is in sinfo_mem.txt.
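To make the failure mode concrete, here is a minimal sketch of parsing one line of that sinfo output defensively. It is not the code in node.go. The type `nodeLine` and the function `parseNodeLine` are hypothetical names, and the second input line in `main` is a made-up example of a line that arrives with only four columns.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// nodeLine holds the five columns requested by
// sinfo -h -N -O NodeList,AllocMem,Memory,CPUsState,StateLong.
type nodeLine struct {
	Name     string
	AllocMem uint64
	TotalMem uint64
	CPUs     string // e.g. "48/0/0/48"
	State    string
}

// parseNodeLine splits a line on whitespace and returns an error,
// instead of indexing past the end of the slice, when the column
// count is not the expected five.
func parseNodeLine(line string) (nodeLine, error) {
	f := strings.Fields(line)
	if len(f) != 5 {
		return nodeLine{}, fmt.Errorf("expected 5 fields, got %d in %q", len(f), line)
	}
	alloc, err := strconv.ParseUint(f[1], 10, 64)
	if err != nil {
		return nodeLine{}, fmt.Errorf("bad AllocMem %q: %w", f[1], err)
	}
	total, err := strconv.ParseUint(f[2], 10, 64)
	if err != nil {
		return nodeLine{}, fmt.Errorf("bad Memory %q: %w", f[2], err)
	}
	return nodeLine{Name: f[0], AllocMem: alloc, TotalMem: total, CPUs: f[3], State: f[4]}, nil
}

func main() {
	lines := []string{
		"node02 386016 386385 48/0/0/48 allocated",
		"some-node-with-merged-columns0 756878 0/48/0/48 idle~", // made-up 4-column line
	}
	for _, l := range lines {
		n, err := parseNodeLine(l)
		if err != nil {
			fmt.Println("skipping line:", err)
			continue
		}
		fmt.Printf("%+v\n", n)
	}
}
```

Returning an error per line lets a collector skip or log a malformed line rather than crash the whole process.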
This fixed it in my environment. I can send a PR if it fixes the issue globally:
DImuthuUpe@3adffd8
Hi All,
Thanks for the quick reply.
Here is the output of the command sinfo -h -N -O NodeList,AllocMem,Memory,CPUsState,StateLong.
These are just a few entries, because we also have hundreds of instances in the cloud.
Please read each space below as a tab:
cloudsmp-r548-40 0 756878 0/48/0/48 idle~
cloudsmp-r548-41 0 756878 0/48/0/48 idle~
cloudsmp-r548-42 0 756878 0/48/0/48 idle~
node02 386016 386385 48/0/0/48 allocated
node03 386000 386385 48/0/0/48 allocated
node03 386000 386385 48/0/0/48 allocated
node03 386000 386385 48/0/0/48 allocated
Compared with sinfo_mem.txt, it seems to be the same (5 columns).
The second test was to run the command from @DImuthuUpe:
sinfo -h -N -O "NodeList: ,AllocMem: ,Memory: ,CPUsState: ,StateLong:"
The difference from the previous command is that the columns are separated by a single space instead of a tab:
cloudsmp-r548-37 0 756878 0/48/0/48 idle~
cloudsmp-r548-38 0 756878 0/48/0/48 idle~
cloudsmp-r548-39 0 756878 0/48/0/48 idle~
cloudsmp-r548-40 0 756878 0/48/0/48 idle~
node01 135744 386385 37/11/0/48 mixed
node01 135744 386385 37/11/0/48 mixed
node01 135744 386385 37/11/0/48 mixed
I updated node.go as suggested by @DImuthuUpe, compiled it, and ran it in the foreground:
bin/prometheus-slurm-exporter -gpus-acct
INFO[0000] Starting Server: :8080 source="main.go:59"
INFO[0000] GPUs Accounting: true source="main.go:60"
FATA[0026] exit status 1 source="gpus.go:101"
Then I amended the sacct command as suggested by @wmoore28 and it worked fine.
In summary, there are two changes:
1 - node.go, as suggested by @DImuthuUpe
2 - gpus.go, for those who use the -gpus-acct parameter, as suggested by @wmoore28
A good improvement would be an automated test of the sacct command (in this case for GPUs):
sacct -a -X --format=Allocgres --state=RUNNING --noheader --parsable2
sacct: fatal: AllocGRES is deprecated, please use AllocTRES
Thanks everyone for the quick help.
I will keep an eye on new versions available here with fixes applied.
I am pretty sure @vpenso will find a way to keep backward compatibility across Slurm versions.
Thanks again!
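The automated sacct check suggested above could be approached without a live cluster by testing the decision logic against captured stderr text. The following is only a sketch under that assumption. The helpers `allocFieldFor` and `sacctArgs` are hypothetical and are not part of the exporter.

```go
package main

import (
	"fmt"
	"strings"
)

// allocFieldFor (hypothetical) picks the sacct --format field, given the
// stderr text captured from a probe run with --format=AllocGRES.
func allocFieldFor(probeStderr string) string {
	if strings.Contains(probeStderr, "AllocGRES is deprecated") {
		return "AllocTRES"
	}
	return "AllocGRES"
}

// sacctArgs (hypothetical) builds the argument list for the
// running-jobs query quoted in this thread.
func sacctArgs(field string) []string {
	return []string{"-a", "-X", "--format=" + field, "--state=RUNNING", "--noheader", "--parsable2"}
}

func main() {
	stderr := "sacct: fatal: AllocGRES is deprecated, please use AllocTRES"
	field := allocFieldFor(stderr)
	fmt.Println("sacct", strings.Join(sacctArgs(field), " "))
}
```

A real test would still need one integration run against each supported Slurm version to capture the actual stderr. The unit-testable part is only the choice of field.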
@JaderGiacon I have integrated the fix from @DImuthuUpe, and also patched gpus.go, as there was an issue when gres does not use gpuType. Could you perhaps test if it works for you?
Thanks, @itzsimpl, for integrating the fix.
Hi @itzsimpl,
I compiled the version in your GitHub repository (https://github.com/itzsimpl/prometheus-slurm-exporter) and it worked fine.
The process is no longer being killed and the GPU information is reported correctly. go test also passed.
Thank you so much for it!
@ALL: I have merged the updated PR #73 into the development branch.
The master branch will be kept backward compatible with Slurm versions up to 18.x.
Hello!
Is there any news, or a plan, for when this fix from the development branch will be released on the master/main branch?
> Hello! Is there any news, or a plan, for when this fix from the development branch will be released on the master/main branch?
Possibly development was not the right name for the other branch. PR #73 was not merged
into master because we would like to keep master backward compatible with the version of Slurm we are still using (18.x),
and this PR is not meant for old versions.
Additional PRs that require new or updated Slurm functionality will be merged only into the development
branch. If you are using a newer version of Slurm (from 20.x onwards), our suggestion is to skip the master branch
and use only development.
Thanks for your feedback and information.
Unfortunately, a proper branch name alone is not enough here.
We need a tagged release, so that we can pick a specific version of this project and ensure that the exporter is always the same version. Simply using the branch could lead to different installations and behaviours whenever someone pushes or merges changes into it.
That is why we currently use this branch but pin a specific commit, to avoid deploying different versions.
Currently, you are simply incrementing your release version. What about these release tags?
- Releases 0.x => Slurm 18.x
- Releases 1.x => Slurm 20.x
That way you would keep backward compatibility, and users could decide which version or release they need or want.
In addition, you may also want to rename the development branch to something like slurm-20.x.
I still have the same problem with Slurm 22.05.5.1. The node exporter crashed after a few moments.