Code Monkey home page Code Monkey logo

Comments (15)

gbeyer3 avatar gbeyer3 commented on July 17, 2024 1

@Sebi94nbg -- @vpenso and @mtds seem to have gone off line. Another bug fix about 20+ character node names with PR that was offered in MAY has also not been merged into main. 🤷

from prometheus-slurm-exporter.

wmoore28 avatar wmoore28 commented on July 17, 2024

Are you using -gpus-acct? I found that gpus.go needs to be updated.

- --format=Allocgres
+ --format=AllocTRES

Here's the error if you run the sacct command as is:

> sacct -a -X --format=Allocgres --state=RUNNING --noheader --parsable2
sacct: fatal: AllocGRES is deprecated, please use AllocTRES

from prometheus-slurm-exporter.

itzsimpl avatar itzsimpl commented on July 17, 2024

If you are using --gpus-acct try #73.

from prometheus-slurm-exporter.

JaderGiacon avatar JaderGiacon commented on July 17, 2024

Hi all,

I am having the same issue. I tried both solutions (@wmoore28 and @itzsimpl).
My slurm version is 21.08.5

Does anyone have any workaround for the panic array length?
The process/service is killed with this panic.

PS.: The got test *.go runs fine for all test

Thanks

from prometheus-slurm-exporter.

itzsimpl avatar itzsimpl commented on July 17, 2024

Reviewing the error listed in the first post this has nothing to do with gpus.go, but with node.go. Try running the command sinfo -h -N -O NodeList,AllocMem,Memory,CPUsState,StateLong on your system and examine the output. This is what node.go executes and then parses. Compare your output with what is in sinfo_mem.txt.

from prometheus-slurm-exporter.

DImuthuUpe avatar DImuthuUpe commented on July 17, 2024

This fixed it in my environment. I can send PR if it fixes the issue globally
DImuthuUpe@3adffd8

from prometheus-slurm-exporter.

JaderGiacon avatar JaderGiacon commented on July 17, 2024

Hi All,

Thanks for the quick reply.
The output of the command: sinfo -h -N -O NodeList,AllocMem,Memory,CPUsState,StateLong
Just some entries, because we have hundreds instances in cloud as well:

Please, consider the space rendered as a 'tab'

cloudsmp-r548-40 0 756878 0/48/0/48 idle~
cloudsmp-r548-41 0 756878 0/48/0/48 idle~
cloudsmp-r548-42 0 756878 0/48/0/48 idle~

node02 386016 386385 48/0/0/48 allocated
node03 386000 386385 48/0/0/48 allocated
node03 386000 386385 48/0/0/48 allocated
node03 386000 386385 48/0/0/48 allocated

Comparing with sinfo_mem.txt it seems to be equal (5 columns)

Second test was run the @DImuthuUpe command:
sinfo -h -N -O "NodeList: ,AllocMem: ,Memory: ,CPUsState: ,StateLong:"

The difference from the previous command it is a single space instead of a tab

cloudsmp-r548-37 0 756878 0/48/0/48 idle~
cloudsmp-r548-38 0 756878 0/48/0/48 idle~
cloudsmp-r548-39 0 756878 0/48/0/48 idle~
cloudsmp-r548-40 0 756878 0/48/0/48 idle~

node01 135744 386385 37/11/0/48 mixed
node01 135744 386385 37/11/0/48 mixed
node01 135744 386385 37/11/0/48 mixed

I have updated the node.go as suggested by @DImuthuUpe, compiled and run in foreground:
bin/prometheus-slurm-exporter -gpus-acct
INFO[0000] Starting Server: :8080 source="main.go:59"
INFO[0000] GPUs Accounting: true source="main.go:60"
FATA[0026] exit status 1 source="gpus.go:101"

Then I amended the sacct command as suggested by @wmoore28 and it worked fine.

In resume, there are two changes::
1 - nodes.go as suggested by @DImuthuUpe
2 - gpus.go for who use -gpus-acct parameter as suggested by @wmoore28

A good improvement could be an automated test to sacct command (in this case for gpus):
sacct -a -X --format=Allocgres --state=RUNNING --noheader --parsable2
sacct: fatal: AllocGRES is deprecated, please use AllocTRES

Thanks everyone for the quick help.
I will keep an eye on new versions available here with fixes applied.
I am pretty sure @vpenso will find a way to keep back compatibility in slurm versions.

Thanks again!

from prometheus-slurm-exporter.

itzsimpl avatar itzsimpl commented on July 17, 2024

@JaderGiacon I have integrated the fix from @DImuthuUpe, and also patched gpus.go as there was an issue when gres does not use gpuType. Could you perhaps test if it works for you?

from prometheus-slurm-exporter.

DImuthuUpe avatar DImuthuUpe commented on July 17, 2024

Thanks, @itzsimpl for integrating the fix

from prometheus-slurm-exporter.

JaderGiacon avatar JaderGiacon commented on July 17, 2024

Hi @itzsimpl,

I have compiled the version present in your github (https://github.com/itzsimpl/prometheus-slurm-exporter) and it worked fine.
The process is not being killed and the gpus information is coming well. Also, the go test worked fine.

Thank you so much for it!

from prometheus-slurm-exporter.

mtds avatar mtds commented on July 17, 2024

@ALL : I have merged the updated PR #73 into the development branch

The master branch will be kept backward compatible to Slurm version (up to) 18.x.

from prometheus-slurm-exporter.

Sebbo94BY avatar Sebbo94BY commented on July 17, 2024

Hello!
Are there any news and plans when this fix from the development branch will be released on the master/main branch?

from prometheus-slurm-exporter.

mtds avatar mtds commented on July 17, 2024

Hello! Are there any news and plans when this fix from the development branch will be released on the master/main branch?

Possibly choosing the name development for the other branch was not the right one. The PR #73 was not merged
into master because we would like to keep backward compatible for the version of Slurm we are still using (18.x)
and this PR is not meant for old versions.

Additional PRs, which requires new and/or updated functionalities of Slurm will be merged only into the development
branch. If you are using newer version of Slurm (from 20.x onwards) our suggestion is to skip the master branch
and only use development.

from prometheus-slurm-exporter.

Sebbo94BY avatar Sebbo94BY commented on July 17, 2024

Thanks for your feedback and information.

It's unfortunately not enough to only have a proper branch name here.

We need a tagged release, so that we can pick a specific version of this project to ensure, that the exporter is always the same version. Simply using the branch could lead to different installations and behaviours, when someone pushes / merges changes into this branch.

That's the reason, why we currently for example use this branch, but define a specific commit to avoid different version deployments.

Currently, you're simply incrementing your release version. What about these release tags?

  • Releases 0.x => Slurm 18.x
  • Releases 1.x => Slurm 20.x

Then you would keep the backward compatibility and users could decide, which version / release they need or want.

In addition, you may also want to rename the development branch to something like slurm-20.x.

from prometheus-slurm-exporter.

imoes avatar imoes commented on July 17, 2024

I have still the same Problem with slurm 22.05.5.1. The node exporter crashed after few moments.

from prometheus-slurm-exporter.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.