meghanat / megha_simulator

A simple simulator for the Megha Federated Scheduling Framework. This simulator enables comparison with other frameworks such as Sparrow, Eagle and Pigeon for which simulators already exist.

Python 3.46% Jupyter Notebook 87.00% Dockerfile 0.01% Shell 0.03% HTML 9.50%

megha_simulator's Issues

Addition of extra NETWORK_DELAY constant in simulator in case of InconsistencyEvent

Issue:

Ma'am, in the simulator code, we think that one extra NETWORK_DELAY constant is added in the case of an InconsistencyEvent.
In lines 310 and 322 of the file megha_sim.py in the branch Code-Documentation (the branch referred to throughout this issue), we have:

self.simulation.event_queue.put((current_time+NETWORK_DELAY,InconsistencyEvent(task,gm,InconsistencyType.EXTERNAL_INCONSISTENCY,self.simulation)))

self.simulation.event_queue.put((current_time+NETWORK_DELAY,InconsistencyEvent(task,gm,InconsistencyType.INTERNAL_INCONSISTENCY,self.simulation)))

Ma'am, at this point we are at the local master, and this NETWORK_DELAY constant accounts for sending the inconsistency event message, along with the updated cluster state, to the global master. However, on following the chain of events, we reach the LMUpdateEvent, where an updated view of the cluster state is sent to the global master that caused the inconsistency event. In the LMUpdateEvent, in lines 149-150:

if not self.periodic:
	self.gm.update_status(current_time+NETWORK_DELAY)

Ma'am, here we add another NETWORK_DELAY constant for sending the updated cluster state to the global master, although this delay was already accounted for when handling the InconsistencyEvent.

Possible Fix:

Remove the NETWORK_DELAY constant added in lines 310 and 322, making them:

self.simulation.event_queue.put((current_time,InconsistencyEvent(task,gm,InconsistencyType.EXTERNAL_INCONSISTENCY,self.simulation)))

self.simulation.event_queue.put((current_time,InconsistencyEvent(task,gm,InconsistencyType.INTERNAL_INCONSISTENCY,self.simulation)))
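A rough way to see the double counting is a toy timeline. The sketch below is not the simulator's code: the helper `gm_update_time`, the delay value of 0.5, and the assumption that handling the InconsistencyEvent leads to the non-periodic LMUpdateEvent with no further delay in between are all illustrative.

```python
import heapq

NETWORK_DELAY = 0.5  # illustrative value only, not the simulator's constant


def gm_update_time(detect_time, delay_on_enqueue):
    """Return the time passed to the GM's update_status in this toy model.

    detect_time: time at which the LM detects the inconsistency.
    delay_on_enqueue: True models the current code (delay added when the
    InconsistencyEvent is queued), False models the proposed fix.
    """
    queue = []
    # Step 1: the LM queues the InconsistencyEvent.
    enqueue_time = detect_time + (NETWORK_DELAY if delay_on_enqueue else 0.0)
    heapq.heappush(queue, (enqueue_time, "InconsistencyEvent"))

    # Step 2: the event is popped; we assume it triggers a non-periodic
    # LMUpdateEvent at the same timestamp.
    current_time, _ = heapq.heappop(queue)

    # Step 3: the non-periodic LMUpdateEvent adds NETWORK_DELAY itself.
    return current_time + NETWORK_DELAY


current = gm_update_time(10.0, delay_on_enqueue=True)    # two delays applied
proposed = gm_update_time(10.0, delay_on_enqueue=False)  # one delay applied
print(current - 10.0, proposed - 10.0)
```

Under these assumptions the current code charges two network delays for a single LM-to-GM message, while the proposed fix charges one.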

Ma'am, requesting you to let us know if this error is valid and if the possible fix suggested is correct.

Thank You,
Rishit.C

Missing Jobs in the list simulator_utils.globals.jobs_completed when running on the YH.tr trace dataset

Issue

Jobs are missing from the list simulator_utils.globals.jobs_completed. For the YH.tr trace dataset, the list should contain 10 jobs (the trace dataset has 10 jobs in total). However, on inspection the list contains only 3 jobs; all the other jobs are missing, even though they have been completely scheduled by their respective Global Master.

Debugging Steps Taken

We analysed the various variables related to the instance of the Job class in the code and here are a few observations we made:

1. simulator_utils.globals.jobs_completed at the end does not capture all the jobs from the input trace dataset.
   a. This variable stores the jobs that have completed.
   b. However, in our experiments with the YH.tr trace dataset, it stored only 3 of the 10 completed jobs.
      i. For the YH.tr trace with 10 jobs, only the jobs with job_ids 5, 6 and 7 are recorded as completed.
2. jobs_scheduled holds the remaining completed jobs, which should have been present in simulator_utils.globals.jobs_completed.
   a. At the end, jobs_scheduled has all the jobs that are missing from simulator_utils.globals.jobs_completed.
   b. In our experiments with the YH.tr trace dataset, jobs with job_ids 1, 2, 3, 4, 8, 9 and 10 are present here at the end of the simulation.
3. No unusual behaviour was observed in the job_queue.
   a. No issues were found in the behaviour of this list.
   b. This list remains empty at the end of the simulation, as expected.

Our Diagnosis

The issue appears to be that jobs, after being scheduled and moved to the jobs_scheduled list, are not moved to the jobs_completed global variable, even though all of the job's tasks have been scheduled.

Another interesting observation was that the number of non-periodic LM update events (task completion update events) does not match the total number of tasks in YH.tr.
Below are our measurements:
a. Total tasks in the trace YH.tr = 183
b. Total number of non-periodic LM update events that occurred while running on the trace YH.tr = 223

Solution

There was an error in the update_status() function of the GM: tasks were being added to completed_tasks only for jobs that were fully scheduled, and not for jobs that still had pending tasks in the job_queue. The required fixes have been pushed to the branch job_count_fix (pull request #13).
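The behaviour described above can be sketched roughly as follows. This is not the code from pull request #13: the `Job` dataclass, the `record_completion` helper, and the use of a dict for jobs_scheduled and a list for job_queue are assumptions made only to illustrate looking up a completed task's job in both collections.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """Hypothetical stand-in for the simulator's Job class."""
    job_id: int
    num_tasks: int
    completed_tasks: set = field(default_factory=set)


def record_completion(job_id, task_id, jobs_scheduled, job_queue, jobs_completed):
    """Record a finished task whether its job is fully scheduled or not."""
    # Look in the fully scheduled jobs first ...
    job = jobs_scheduled.get(job_id)
    if job is None:
        # ... then in the jobs that still have pending tasks.
        job = next((j for j in job_queue if j.job_id == job_id), None)
    if job is None:
        return
    job.completed_tasks.add(task_id)
    # A job only counts as completed once every task has finished.
    if len(job.completed_tasks) == job.num_tasks and job_id in jobs_scheduled:
        jobs_completed.append(jobs_scheduled.pop(job_id))


# Usage: task 0 finishes while the job still waits in job_queue.
job = Job(job_id=1, num_tasks=2)
jobs_scheduled, job_queue, jobs_completed = {}, [job], []
record_completion(1, 0, jobs_scheduled, job_queue, jobs_completed)

# Later the job becomes fully scheduled and its last task finishes.
job_queue.remove(job)
jobs_scheduled[1] = job
record_completion(1, 1, jobs_scheduled, job_queue, jobs_completed)
```

If completions were recorded only for jobs found in jobs_scheduled, task 0 would be lost in this scenario and the job would never reach jobs_completed, which matches the symptom reported in this issue.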

Further testing and confirmation of the fix's effectiveness, from all assignees, is awaited.

Thank You,
Rishit.C

Clarification regarding recording of start time and end time of a Task and Job

Ma'am, currently in the code, we noted that the times were recorded as follows:

| Time | When is it recorded in the simulator? |
| --- | --- |
| `task.start_time` | This is the same as the start time of the job it is part of. |
| `task.end_time` | This is the time taken from `task.start_time` to the task completion message reaching the Local Master. |
| `job.start_time` | This time is taken from the input `.tr` file. |
| `job.end_time` | This is the `task.end_time` of the last task of that job. |

Ma'am, requesting you to let us know if these notes are correct.

Thank You,
Rishit.C

Research Sub-Task-3: Tuning the LM HEARTBEAT INTERVAL

Summary of the Task

Make the LM heartbeat interval dynamic (the value keeps changing as the state of the cluster evolves).

1. Change HEARTBEAT_INTERVAL over different runs (keeping it fixed within a run).
   - Study the effect on the number of inconsistencies caused.
2. Try dynamically changing the HEARTBEAT_INTERVAL using ML (other statistical methods are also acceptable).

Current Suggestions to Experiment/Discuss:

When the number of (job/task allocation) requests to the GM is high, send more updates to the GM from the LMs so that fewer inconsistencies are caused. (@meghanat)
When a lot of tasks complete in a short span of time, avoid sending (periodic) updates to the GM from the LM, as the GMs would already have been updated about the latest cluster state via the (aperiodic) updates sent by the LMs on task completion. (@rishitc)
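As a starting point for discussion, the two suggestions above could be combined into a simple rule-based heuristic before trying ML. Everything in this sketch is hypothetical: the function name, the thresholds, the halving and doubling factors, and the choice to let high request load take priority over a completion burst are assumptions, not part of the simulator.

```python
def next_heartbeat_interval(base_interval, request_rate, recent_completions,
                            high_request_rate=100.0, completion_burst=20,
                            min_interval=1.0, max_interval=60.0):
    """Pick the next LM heartbeat interval from recent cluster activity.

    request_rate: allocation requests per unit time seen by the GM.
    recent_completions: tasks completed since the last heartbeat.
    All thresholds are placeholder values to be tuned experimentally.
    """
    interval = base_interval
    if request_rate > high_request_rate:
        # Many requests: update the GM more often to reduce inconsistencies.
        interval = base_interval / 2
    elif recent_completions >= completion_burst:
        # Many aperiodic completion updates already reached the GM,
        # so the periodic update can be delayed.
        interval = base_interval * 2
    # Keep the interval within sane bounds.
    return max(min_interval, min(max_interval, interval))


busy = next_heartbeat_interval(10.0, request_rate=200.0, recent_completions=0)
burst = next_heartbeat_interval(10.0, request_rate=10.0, recent_completions=50)
idle = next_heartbeat_interval(10.0, request_rate=10.0, recent_completions=0)
```

A fixed-rule baseline like this would also give a reference point against which any learned policy can be compared on the number of inconsistencies.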

Thank You,
Rishit.C

Addition of extra NETWORK_DELAY constant in simulator in case of Task Completion

Issue:

Ma'am, in the simulator code, we think that one extra NETWORK_DELAY constant is added in the case of a Task Completion.
In line 331 (in LM.task_completed) of the file megha_sim.py in the branch Code-Documentation (the branch referred to throughout this issue), we have:

self.simulation.event_queue.put((task.end_time+NETWORK_DELAY,LMUpdateEvent(self.simulation,periodic=False, gm=self.simulation.gms[task.GM_id])))

Ma'am, here we are at the local master. We are sending the task completion message to the global master that had sent this task's launch message, and also updating that global master with the latest state of the cluster.

In line 331 we add the NETWORK_DELAY constant to account for the delay of sending these details from the local master to the global master, but in lines 149 and 150 in LMUpdateEvent.run, we have:

if not self.periodic:
	self.gm.update_status(current_time+NETWORK_DELAY)

Ma'am, here we add the same NETWORK_DELAY constant again for sending the task completion message and the updated state of the cluster to the global master whose task has completed, hence double counting the local master to global master delay.

Possible Fix:

Remove the NETWORK_DELAY constant added in line 331, making it:

self.simulation.event_queue.put((task.end_time,LMUpdateEvent(self.simulation,periodic=False, gm=self.simulation.gms[task.GM_id])))

Ma'am, requesting you to let us know if this error is valid and if the possible fix suggested is correct.

Thank You,
Rishit.C

Job's end_time not being updated on Job Completion

Issue

The Job class has an end_time variable associated with each job. It is initially assigned the job's start_time, but the value is never updated after this step. The Job class initializes it as follows:

self.end_time = self.start_time

Verification of the Issue

The issue was verified by printing the value of end_time for each job after the simulation ended. This was done in the last line of the runner.py file. When the output was analyzed, the end_time of each job was the same as its start_time.

for job in simulator_globals.jobs_completed:
    print(job.job_id, job.start_time, job.end_time)

Fix

The job's end_time can be updated when the job completes, in the update_status function in the gm.py file. The following change was made to address this issue:

job.end_time = job.completion_time

The required fixes have been pushed to branch job_endtime_update (pull request #17).
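For context, a minimal sketch of where such an assignment might sit is given below. It is not the code from pull request #17: the simplified `Job` class, the `remaining_tasks` counter, and the `on_task_completed` helper are hypothetical, and only `start_time`, `end_time` and `completion_time` are names taken from this issue.

```python
class Job:
    """Simplified, hypothetical stand-in for the simulator's Job class."""

    def __init__(self, job_id, start_time, num_tasks):
        self.job_id = job_id
        self.start_time = start_time
        self.end_time = start_time        # initial value, as in the issue
        self.completion_time = start_time
        self.remaining_tasks = num_tasks  # assumed bookkeeping field


def on_task_completed(job, task_end_time):
    """Update the job when one of its tasks finishes.

    Returns True once the last task of the job has completed.
    """
    job.remaining_tasks -= 1
    # The job's completion time is the latest task end time seen so far.
    job.completion_time = max(job.completion_time, task_end_time)
    if job.remaining_tasks == 0:
        # The step that was missing: propagate the completion time.
        job.end_time = job.completion_time
        return True
    return False


job = Job(job_id=5, start_time=100.0, num_tasks=2)
first_done = on_task_completed(job, 130.0)
second_done = on_task_completed(job, 120.0)
print(job.job_id, job.start_time, job.end_time)
```

With this update in place, the verification loop over jobs_completed would print an end_time that differs from start_time for any job whose tasks took non-zero time.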

Thank You,
@Saurav51
