slurm-accounting-fix

Introduction

This project fixes jobs that accumulate CPU runtime indefinitely because the Slurm accounting database records no end time for them, even though their state is COMPLETED or FAILED. Jobs in the COMPLETED or FAILED state are expected to have a non-zero end time recorded in the database.

These ghost jobs distorted the accounting reports: the reports showed thousands of compute hours consumed by a few users, although actual usage was far below the reported figures. We checked for runaway jobs, but none were reported.

After evaluating a few ways to resolve the issue, we took the approach described in https://bugs.schedmd.com/show_bug.cgi?id=5988, which sets the end time to the start time.
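The effect of the fix can be illustrated with a small sketch. It assumes a simplified model in which a report treats a job with time_end equal to 0 as still running and measures its elapsed time against the current time; the function names and timestamps are hypothetical and are not part of Slurm.

```python
def billed_seconds(time_start, time_end, now):
    """Simplified model (assumption): time_end == 0 means 'still running'."""
    end = time_end if time_end != 0 else now
    return max(0, end - time_start)


def patch_end_time(time_start, time_end):
    """The fix described above: a missing end time becomes the start time."""
    return time_start if time_end == 0 else time_end


start = 1_600_000_000            # hypothetical job start (epoch seconds)
now = start + 30 * 24 * 3600     # report generated 30 days later

before = billed_seconds(start, 0, now)
after = billed_seconds(start, patch_end_time(start, 0), now)
print(before, after)
```

Under this model the ghost job is billed for the full 30 days before the patch and for zero seconds afterwards, so the patch removes the inflated usage but does not recover the job's real runtime.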

Setup

Note: First create a backup of your Slurm accounting database and work in a local test environment.

You can provision a test environment by following this repo. Make sure that the cluster_name and slurm_acct_db attributes in the group_vars/all file match the production cluster; if they do not, change them to match.

This repo will get you to a working Slurm installation on a basic cluster consisting of a master node based on the OpenHPC project (https://openhpc.community) and a compute node. You only need the master node 'ohpc' to test this patch.

After you set up the test environment, place your slurm accounting database backup dump file in your local test environment and follow the steps mentioned below.

Steps to Restore the slurmDB

Create a database (with the same name as the slurm_acct_db attribute in the group_vars/all file) before you restore from the MySQL dump file.

mysql -u root -e "create database <slurm_acct_db>;"

Restore the database by using the following command.

mysql -u root <slurm_acct_db> < <path to your DB dumpfile>
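The two restore steps can also be scripted. The following is a minimal sketch, not part of this project: the helper name, the database name slurm_acct_db, and the dump path are hypothetical placeholders. It defaults to a dry run that only returns the commands it would execute.

```python
import re
import subprocess


def restore_slurm_db(db_name, dump_path, user="root", dry_run=True):
    """Build (and optionally run) the create and restore commands."""
    # Reject names that are not plain identifiers before placing them in SQL.
    if not re.fullmatch(r"[A-Za-z0-9_]+", db_name):
        raise ValueError(f"unexpected database name: {db_name!r}")
    create_cmd = ["mysql", "-u", user, "-e", f"CREATE DATABASE {db_name};"]
    restore_cmd = ["mysql", "-u", user, db_name]
    if not dry_run:
        subprocess.run(create_cmd, check=True)
        # Feed the dump file to mysql on standard input.
        with open(dump_path, "rb") as dump:
            subprocess.run(restore_cmd, stdin=dump, check=True)
    return create_cmd, restore_cmd


# Hypothetical values; replace with your own database name and dump file.
create_cmd, restore_cmd = restore_slurm_db("slurm_acct_db", "/tmp/slurm_dump.sql")
print(create_cmd)
print(restore_cmd)
```

Pass dry_run=False only in the test environment, after checking the printed commands.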

Patch

Run the following queries to apply the patch.

Count of task_ids for ghost jobs that have id_step in step_table

SELECT COUNT(id_array_task) FROM slurm_cluster_job_table as jt JOIN slurm_cluster_step_table as st WHERE (jt.state=3 or jt.state=5) and st.time_end IS NOT NULL and st.time_end=0 and jt.time_end=0 and jt.job_db_inx=st.job_db_inx;

List of task_ids for ghost jobs that have id_step in step_table

SELECT GROUP_CONCAT(id_array_task) FROM slurm_cluster_job_table as jt JOIN slurm_cluster_step_table as st WHERE (jt.state=3 or jt.state=5) and st.time_end IS NOT NULL and st.time_end=0 and jt.time_end=0 and jt.job_db_inx=st.job_db_inx;

Update statement for step table

START TRANSACTION;
UPDATE slurm_cluster_step_table AS st INNER JOIN slurm_cluster_job_table AS jt ON st.job_db_inx=jt.job_db_inx SET st.time_end=st.time_start WHERE id_array_job=1001131 AND id_array_task IN ( <Comma separated list of taskIDs> );
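The step-table patch can be tried out on toy data before touching a restored database. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL and a hypothetical, heavily reduced schema (job_table and step_table with only the columns used here). SQLite has no multi-table UPDATE, so a subquery replaces the join, and the sketch patches every ghost step instead of filtering by id_array_job and id_array_task.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE job_table (job_db_inx INTEGER PRIMARY KEY, state INTEGER,
                            time_start INTEGER, time_end INTEGER);
    CREATE TABLE step_table (job_db_inx INTEGER, id_step INTEGER,
                             time_start INTEGER, time_end INTEGER);
    INSERT INTO job_table VALUES (1, 3, 100, 0),    -- ghost: COMPLETED, no end
                                 (2, 3, 100, 250),  -- healthy
                                 (3, 1, 100, 0);    -- still running
    INSERT INTO step_table VALUES (1, 0, 110, 0),
                                  (2, 0, 110, 240),
                                  (3, 0, 110, 0);
""")

# Patch only steps that belong to ghost jobs (state 3 or 5, time_end = 0).
cur = conn.execute("""
    UPDATE step_table SET time_end = time_start
    WHERE time_end = 0
      AND job_db_inx IN (SELECT job_db_inx FROM job_table
                         WHERE state IN (3, 5) AND time_end = 0)
""")
patched_steps = cur.rowcount
rows = conn.execute(
    "SELECT job_db_inx, time_end FROM step_table ORDER BY job_db_inx"
).fetchall()
print(patched_steps, rows)
```

Only the step of job 1 changes; the step of the running job 3 keeps time_end at 0. Because ghost steps are found through jt.time_end=0, the step table has to be patched before the job table.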

Count of JobIDs having state Completed or Failed but no time_end in Job table

SELECT COUNT(id_job) FROM slurm_cluster_job_table as jt WHERE (jt.state=3 or jt.state=5) AND jt.time_end=0;

List of JobIDs having state Completed or Failed but no time_end in Job table

SELECT GROUP_CONCAT(id_job) FROM slurm_cluster_job_table WHERE (state=3 or state=5) AND time_end=0;

Update statement for job table

UPDATE slurm_cluster_job_table SET time_end=time_start WHERE id_job in ( < Comma separated list of jobIDs from last step> );

Remember to run COMMIT; once you have verified that the patch applied as intended, or ROLLBACK; if it did not.
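The count, update, verify, then commit or roll back sequence can be sketched as follows. This again uses sqlite3 with a hypothetical reduced job_table as a stand-in for the MySQL accounting database; the helper patch_ghost_jobs is illustrative only and commits only when the number of updated rows equals the count taken beforehand.

```python
import sqlite3

COUNT_GHOSTS = ("SELECT COUNT(id_job) FROM job_table "
                "WHERE state IN (3, 5) AND time_end = 0")


def patch_ghost_jobs(conn, expected):
    """Set time_end = time_start for ghost jobs; commit only on a matching count."""
    cur = conn.execute(
        "UPDATE job_table SET time_end = time_start "
        "WHERE state IN (3, 5) AND time_end = 0"
    )
    if cur.rowcount == expected:
        conn.commit()
        return True
    conn.rollback()  # unexpected row count: undo the update
    return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job_table (id_job INTEGER, state INTEGER, "
             "time_start INTEGER, time_end INTEGER)")
conn.executemany("INSERT INTO job_table VALUES (?, ?, ?, ?)",
                 [(10, 3, 100, 0), (11, 5, 200, 0),
                  (12, 3, 300, 400), (13, 1, 500, 0)])
conn.commit()

expected = conn.execute(COUNT_GHOSTS).fetchone()[0]
rolled_back = not patch_ghost_jobs(conn, expected + 1)  # deliberate mismatch
still_ghosts = conn.execute(COUNT_GHOSTS).fetchone()[0]
committed = patch_ghost_jobs(conn, expected)
remaining = conn.execute(COUNT_GHOSTS).fetchone()[0]
print(expected, rolled_back, still_ghosts, committed, remaining)
```

The first call is given a deliberately wrong expected count to show that a rollback leaves the ghost jobs untouched; the second call commits the patch.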

DEPRECATED

Note: Before using this script, generate an input file by running the following command.

sacct -P -o jobid,state,totalcpu,timelimit,start,end,resvcpuraw --accounts 'username' > inputfile

Then run the script on the input file:

./slurm-acct-db-patch inputfile

or pipe the sacct output to it directly:

sacct -P -o jobid,state,totalcpu,timelimit,start,end,resvcpuraw --accounts username | ./slurm-acct-db-patch > outfile.sql
