slurm-accounting-fix

Introduction

This project fixes jobs that accumulate CPU runtime indefinitely because the Slurm accounting database records no end time for them, even though their state is COMPLETED or FAILED. Jobs in either state are expected to have a non-zero end time recorded in the database.

These ghost jobs distorted the accounting reports: the reports showed thousands of compute hours attributed to a few users, although actual usage was far lower. We looked for runaway jobs, but none were reported.

After evaluating a few ways to resolve this issue, we chose the approach described in https://bugs.schedmd.com/show_bug.cgi?id=5988, which sets the end time equal to the start time.

Setup

Note: First create a backup of your Slurm accounting database, and work in a local test environment.
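A minimal sketch of taking that backup with mysqldump. The helper name backup_acct_db and the output file naming are illustrative, not part of this repo, and passwordless root access is assumed to match the mysql commands used in this README.

```shell
# Hypothetical helper (not part of this repo): dump one database to a file.
# Assumes passwordless root access, as in the mysql commands in this README.
backup_acct_db() {
  db="$1"                                 # database name, e.g. the value of slurm_acct_db
  out="${2:-${db}-$(date +%Y%m%d).sql}"   # default file name: <db>-YYYYMMDD.sql
  # --single-transaction reads InnoDB tables from one consistent snapshot
  mysqldump -u root --single-transaction "$db" > "$out" && echo "$out"
}
```

Calling backup_acct_db with the database name writes the dump and prints the file name on success; an optional second argument overrides the output path.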

You can provision a test environment by following this repo. Make sure that the cluster_name and slurm_acct_db attributes in the group_vars/all file match the production cluster. If they do not, change them to match production.

This repo will get you to a working Slurm installation on a basic cluster, consisting of a master node based on the OpenHPC project (https://openhpc.community) and a compute node. You only need the master node 'ohpc' to test this patch.

After you set up the test environment, place your slurm accounting database backup dump file in your local test environment and follow the steps mentioned below.

Steps to Restore the slurmDB

Create a slurm database (with the same name mentioned in slurm_acct_db attribute of group_vars/all file) before you try to restore from the mysql dumpfile.

mysql -u root -e "create database <slurm_acct_db>;"

Restore the database by using the following command.

mysql -u root <slurm_acct_db> < <path to your DB dumpfile>
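A quick way to confirm the restore worked is to list the job tables it created. This is a sketch: the helper name list_job_tables is hypothetical, and it assumes the tables follow the <cluster_name>_job_table naming seen in the queries in the Patch section.

```shell
# Hypothetical helper: list the job tables in a restored database.
# Assumes table names end in job_table, e.g. slurm_cluster_job_table.
list_job_tables() {
  db="$1"
  # -N suppresses the column-name header so only table names are printed
  mysql -u root -N -e "SHOW TABLES LIKE '%job_table'" "$db"
}
```

An empty result suggests the dump was restored into a different database name or did not load.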

Patch

Run the following queries to apply the patch.

Count of task_ids for ghost jobs that have id_step in step_table

SELECT COUNT(id_array_task) FROM slurm_cluster_job_table AS jt JOIN slurm_cluster_step_table AS st ON jt.job_db_inx=st.job_db_inx WHERE (jt.state=3 OR jt.state=5) AND st.time_end=0 AND jt.time_end=0;

List of task_ids for ghost jobs that have id_step in step_table

SELECT GROUP_CONCAT(id_array_task) FROM slurm_cluster_job_table AS jt JOIN slurm_cluster_step_table AS st ON jt.job_db_inx=st.job_db_inx WHERE (jt.state=3 OR jt.state=5) AND st.time_end=0 AND jt.time_end=0;
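One caveat, since this list feeds the update statement: MySQL truncates GROUP_CONCAT results at group_concat_max_len, which defaults to 1024 bytes, so a long list of IDs may be cut short. The sketch below raises the limit for the session before running the query. The helper name task_list_sql and the limit value are illustrative; the table names are the ones used in this README. The same applies to the GROUP_CONCAT(id_job) query for the job table.

```shell
# Hypothetical helper: print the list query preceded by a larger
# GROUP_CONCAT limit, so the comma-separated list is not truncated.
task_list_sql() {
  cat <<'SQL'
SET SESSION group_concat_max_len = 1000000;
SELECT GROUP_CONCAT(id_array_task)
FROM slurm_cluster_job_table AS jt
JOIN slurm_cluster_step_table AS st ON jt.job_db_inx = st.job_db_inx
WHERE jt.state IN (3, 5) AND jt.time_end = 0 AND st.time_end = 0;
SQL
}
# Usage: task_list_sql | mysql -u root -N <slurm_acct_db>
```

Both statements must run in the same mysql session, which is why they are piped together rather than issued as separate commands.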

Update statement for step table (the id_array_job value below is specific to our database; substitute your own)

START TRANSACTION;
UPDATE slurm_cluster_step_table AS st INNER JOIN slurm_cluster_job_table AS jt ON st.job_db_inx=jt.job_db_inx SET st.time_end=st.time_start WHERE id_array_job=1001131 AND id_array_task IN ( <Comma separated list of taskIDs> );

Count of JobIDs having state Completed or Failed but no time_end in Job table

SELECT COUNT(id_job) FROM slurm_cluster_job_table as jt WHERE (jt.state=3 or jt.state=5) AND jt.time_end=0;

List of JobIDs having state Completed or Failed but no time_end in Job table

SELECT GROUP_CONCAT(id_job) FROM slurm_cluster_job_table WHERE (state=3 or state=5) AND time_end=0;

Update statement for job table

UPDATE slurm_cluster_job_table SET time_end=time_start WHERE id_job in ( < Comma separated list of jobIDs from last step> );

Don't forget to run COMMIT; or ROLLBACK; after you have checked whether the patch applied as intended.
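One way to do that check for the job table is a dry run that applies the update, counts the ghost jobs left, and rolls back, all in a single session. This is a sketch under assumptions: the helper name dry_run_sql is hypothetical, it filters on state and time_end directly instead of using a pasted ID list, and ROLLBACK only undoes the change if the tables use a transactional engine such as InnoDB.

```shell
# Hypothetical helper: print a dry run of the job-table patch for one table.
dry_run_sql() {
  table="$1"   # e.g. slurm_cluster_job_table
  cat <<SQL
START TRANSACTION;
UPDATE ${table} SET time_end = time_start WHERE state IN (3, 5) AND time_end = 0;
-- Ghost jobs still left; non-zero only if some also have time_start = 0
SELECT COUNT(*) AS remaining FROM ${table} WHERE state IN (3, 5) AND time_end = 0;
ROLLBACK;
SQL
}
# Usage: dry_run_sql slurm_cluster_job_table | mysql -u root <slurm_acct_db>
```

Once the reported count looks right, the same statements can be run by hand with COMMIT; in place of ROLLBACK;.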

DEPRECATED

Note: Before using this script you need an input file, generated by running the following command:

sacct -P -o jobid,state,totalcpu,timelimit,start,end,resvcpuraw --accounts 'username' > inputfile

Then run the script on the input file:

./slurm-acct-db-patch inputfile

Or pipe the sacct output to the script directly:

sacct -P -o jobid,state,totalcpu,timelimit,start,end,resvcpuraw --accounts username | ./slurm-acct-db-patch > outfile.sql
