Comments (5)
OK, I think I was able to reproduce this without restarting Flux, just by unloading and reloading the plugin. If I have a number of jobs in SCHED state (i.e., they've received a priority and are waiting to run), reload the plugin without updating it with flux-accounting information, and call `reprioritize` on all jobs (`flux.Flux().rpc("job-manager.mf_priority.reprioritize")`), they will have an exception raised on them saying that the plugin cannot find a valid user/bank entry for those previously held jobs.
Is there a process in restarting Flux where jobs could be reprioritized? My thinking is that might have been what caused this.
In any case, the plugin should probably handle this case more gracefully so a bunch of users' jobs don't get canceled if Flux gets restarted.
I'll need to test this, but off the top of my head, I think I can include a check in the callback for `job.state.priority` that checks the plugin's internal map for data before deciding what to do with the job going through reprioritization.
If the plugin's internal map is empty (i.e., it is waiting for flux-accounting information), it can continue to hold the job in PRIORITY until it loads some information. This would be similar to the behavior in the callback for `job.validate`.
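The proposed check could look roughly like the sketch below. This is a toy Python model of the decision only, not the plugin's actual C++ code: `Decision`, `on_priority_callback`, and the `associations` dictionary are hypothetical names standing in for the plugin's internal map of user/bank entries.

```python
from enum import Enum


class Decision(Enum):
    HOLD = "hold job in PRIORITY"
    RAISE = "raise exception on job"
    PRIORITIZE = "assign priority"


def on_priority_callback(associations, userid, bank):
    """Decide what to do with one job during reprioritization.

    associations is a hypothetical stand-in for the plugin's internal
    map, keyed by (userid, bank).
    """
    if not associations:
        # No flux-accounting data has been loaded yet, so keep the job
        # held instead of rejecting it.
        return Decision.HOLD
    if (userid, bank) not in associations:
        # Data is loaded, but this user/bank entry is genuinely unknown.
        return Decision.RAISE
    return Decision.PRIORITIZE
```

The point of the ordering is that an empty map is treated as "not ready yet" rather than "user not found", so jobs are only rejected once the plugin actually has data to check against.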
from flux-accounting.
Nice job debugging @cmoussa1!
All jobs are prioritized any time a jobtap plugin is loaded, so I had assumed this would happen after `mf_priority.so` is loaded.
Without trying to construct a reproducer (yet), I believe this type of exception message is raised when a job's user/bank information is being updated in `job.state.priority` and the plugin cannot find a valid user/bank entry for the user that this job is submitted under:
flux-accounting/src/plugins/mf_priority.cpp
Lines 676 to 679 in 1a84215
Looking at the timestamps of the eventlog, it looks like the following happened:
- the job was submitted successfully (and received a priority)
- Flux was restarted
- the priority plugin was loaded
- jobs were reprioritized before the plugin received any flux-accounting data, so it rejected this job and presumably all pending jobs.
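As a rough illustration of that sequence, here is a small Python simulation. `ToyPlugin` and its methods are hypothetical and only mimic the lookup described above; the real plugin is written in C++ and keeps its own data structures.

```python
class ToyPlugin:
    """Hypothetical model of a priority plugin's user/bank lookup."""

    def __init__(self):
        # Empty until flux-accounting data is sent to the plugin.
        self.associations = {}

    def load_accounting_data(self, entries):
        self.associations.update(entries)

    def reprioritize(self, jobs):
        """Return the IDs of jobs whose user/bank entry cannot be found."""
        return [jobid for jobid, key in jobs.items()
                if key not in self.associations]


# Pending jobs, keyed by job ID, each tied to a (userid, bank) pair.
jobs = {1: (5001, "A"), 2: (5002, "B")}

# Before the restart: data is loaded, so every lookup succeeds.
before = ToyPlugin()
before.load_accounting_data({(5001, "A"): {}, (5002, "B"): {}})
rejected_before = before.reprioritize(jobs)

# After the restart: a fresh plugin with no data rejects every pending job.
after = ToyPlugin()
rejected_after = after.reprioritize(jobs)
```

In this model the jobs themselves never change; only the order of "load data" and "reprioritize" determines whether they are rejected.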
I'll see if I can reproduce this behavior.
Closed by #407?
Ah, yes, this should be closed by #407. Sorry that I didn't close this yesterday. Closing now.