megha_simulator's People

Contributors

eshaarun, meghanat, obliviousparadigm, rishitc, saurav51, snyk-bot

megha_simulator's Issues

Addition of extra NETWORK_DELAY constant in simulator in case of InconsistencyEvent

Issue:

Ma'am, in the simulator code, we think that one extra NETWORK_DELAY constant is added, in the case of an InconsistencyEvent.
Ma'am, in lines 310 and 322 in the file megha_sim.py in the branch Code-Documentation (branch referred throughout the issue), we have:

self.simulation.event_queue.put((current_time+NETWORK_DELAY,InconsistencyEvent(task,gm,InconsistencyType.EXTERNAL_INCONSISTENCY,self.simulation)))
self.simulation.event_queue.put((current_time+NETWORK_DELAY,InconsistencyEvent(task,gm,InconsistencyType.INTERNAL_INCONSISTENCY,self.simulation)))

Ma'am, at this point we are at the local master, and this NETWORK_DELAY constant accounts for sending the inconsistency event message, along with the updated cluster state, to the global master. However, on following the chain of events, we reach the LMUpdateEvent, which sends the updated view of the cluster state to the global master that caused the inconsistency event. In the LMUpdateEvent, lines 149-150 read:

if not self.periodic:
	self.gm.update_status(current_time+NETWORK_DELAY)

Ma'am, here we add NETWORK_DELAY a second time for sending the updated cluster state to the global master, even though this delay was already accounted for when handling the InconsistencyEvent.
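
The suspected double counting can be illustrated with a toy event queue. This is a minimal sketch under stated assumptions, not the code in megha_sim.py: the NETWORK_DELAY value, the arrival_time helper and its delay_at_enqueue flag are hypothetical, and the whole chain of events is collapsed into one enqueue followed by one handler call.

```python
import queue

NETWORK_DELAY = 0.0005  # assumed value, for illustration only


def arrival_time(current_time, delay_at_enqueue):
    """Return the time at which the GM would see the update.

    Hypothetical stand-in for the LM -> GM path: the event is enqueued
    (optionally with NETWORK_DELAY added), and the handler adds
    NETWORK_DELAY again when it calls the GM's update.
    """
    events = queue.PriorityQueue()
    enqueue_time = current_time + (NETWORK_DELAY if delay_at_enqueue else 0.0)
    events.put((enqueue_time, "update"))

    fired_at, _ = events.get()
    # Mirrors `self.gm.update_status(current_time+NETWORK_DELAY)`.
    return fired_at + NETWORK_DELAY


now = 10.0
double_counted = arrival_time(now, delay_at_enqueue=True)
single_counted = arrival_time(now, delay_at_enqueue=False)
print(double_counted - now, single_counted - now)
```

If the reading of the code in this issue is right, the first call corresponds to the current behaviour (two delays for one LM to GM message) and the second to the behaviour after the proposed fix.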

Possible Fix:

Remove the NETWORK_DELAY constant added in lines 310 and 322, so that they become:

self.simulation.event_queue.put((current_time,InconsistencyEvent(task,gm,InconsistencyType.EXTERNAL_INCONSISTENCY,self.simulation)))
self.simulation.event_queue.put((current_time,InconsistencyEvent(task,gm,InconsistencyType.INTERNAL_INCONSISTENCY,self.simulation)))

Ma'am, requesting you to let us know if this error is valid and if the possible fix suggested is correct.

Thank You,
Rishit.C

Missing Jobs in the list simulator_utils.globals.jobs_completed when running on the YH.tr trace dataset

Issue

Jobs are missing from the list simulator_utils.globals.jobs_completed. For the YH.tr trace dataset, there should be 10 jobs in the list, since the trace contains 10 jobs in total. On inspection, however, the list holds only 3 jobs; the other 7 are missing even though they were completely scheduled by their respective Global Masters.

Debugging Steps Taken

We analysed the various variables related to the instance of the Job class in the code and here are a few observations we made:

  1. simulator_utils.globals.jobs_completed at the end does not capture all the Jobs from the input trace dataset
    a. This variable stores the jobs that have completed.
    b. However, in our experiments with the YH.tr trace dataset, it stored only 3 out of the completed 10 jobs.
      i. For the YH.tr trace with 10 jobs, only the 3 jobs with job_ids 5, 6 and 7 end up in this list.

  2. jobs_scheduled holds the remaining completed jobs, which should have been present in simulator_utils.globals.jobs_completed
    a. At the end of the simulation, jobs_scheduled has all the jobs that are missing from simulator_utils.globals.jobs_completed.
    b. In our experiments with the YH.tr trace dataset, jobs with job_ids 1, 2, 3, 4, 8, 9 and 10 are present here at the end of the simulation.

  3. No unusual behaviour was observed in the job_queue
    a. No issues were found in the behaviour of this list.
    b. This list remains empty at the end of the simulation as expected.

Our Diagnosis

The issue appears to be that jobs, after being scheduled and moved to the jobs_scheduled list, are not moved to the jobs_completed global variable even though all of the job's tasks have been scheduled.

Another interesting observation was that the number of non-periodic LM update events (task completion update events) does not match the total number of tasks in YH.tr.
Below are our measurements:
a. Total tasks in the trace YH.tr = 183
b. Total number of non-periodic LM update events that occurred while running on the trace YH.tr = 223

Solution

There was an error in the update_status() function of the GM. The required fixes have been pushed to the branch job_count_fix (pull request #13). The tasks were being added to completed_tasks only for Jobs that were fully scheduled and not for those Jobs which still had pending tasks in the job_queue.
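
The bug described above can be sketched as follows. This is a simplified illustration and not the code from gm.py or pull request #13: the dict-based job records and the record_completion_buggy, record_completion_fixed, make_job and run helpers are all hypothetical.

```python
def record_completion_buggy(task_id, job_id, jobs_scheduled, job_queue):
    """Record a finished task only if its job is already fully scheduled."""
    job = jobs_scheduled.get(job_id)
    if job is not None:
        job["completed_tasks"].add(task_id)


def record_completion_fixed(task_id, job_id, jobs_scheduled, job_queue):
    """Record a finished task even if its job still has pending tasks."""
    job = jobs_scheduled.get(job_id)
    if job is None:
        # Job is still waiting in the queue with unscheduled tasks.
        job = job_queue.get(job_id)
    if job is not None:
        job["completed_tasks"].add(task_id)


def make_job():
    return {"tasks": {"t1", "t2"}, "completed_tasks": set()}


def run(record):
    """Return True if the job ends up with every task marked completed."""
    job_queue = {1: make_job()}
    jobs_scheduled = {}
    # t1 finishes while t2 is still waiting to be scheduled.
    record("t1", 1, jobs_scheduled, job_queue)
    # The job becomes fully scheduled, then t2 finishes.
    jobs_scheduled[1] = job_queue.pop(1)
    record("t2", 1, jobs_scheduled, job_queue)
    job = jobs_scheduled[1]
    return job["completed_tasks"] == job["tasks"]


buggy_done = run(record_completion_buggy)
fixed_done = run(record_completion_fixed)
print(buggy_done, fixed_done)
```

In the buggy variant the completion of t1 is silently dropped, so the job never looks complete and stays in jobs_scheduled, which matches the symptom reported in this issue.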

Further testing and confirmation of the fix's effectiveness, from all assignees, is awaited.

Thank You,
Rishit.C

Clarification regarding recording of start time and end time of a Task and Job

Ma'am, currently in the code, we noted that the times were recorded as follows:

Time             When is it recorded in the simulator?
---------------  -------------------------------------
task.start_time  This is the same as the start time of the job it is part of.
task.end_time    This is the time taken from task.start_time to the task completion message reaching the Local Master.
job.start_time   This time is taken from the input .tr file.
job.end_time     This is the time of the task.end_time of the last task of that job.

Ma'am, requesting you to let us know if these notes are correct.

Thank You,
Rishit.C

Research Sub-Task-3: Tuning the LM HEARTBEAT INTERVAL

Summary of the Task

Make the LM heartbeat interval dynamic (the value keeps changing as the state of the cluster evolves).

  • Change HEARTBEAT_INTERVAL (keeping it fixed for a run) over different runs.
    • Study the effect on the number of inconsistencies caused.
  • Try dynamically changing the HEARTBEAT_INTERVAL using ML (other statistical methods are also accepted).

Current Suggestions to Experiment/Discuss:

  1. When the number of (job/task allocation) requests to the GM is high, send more updates to the GM from the LMs so that fewer inconsistencies are caused. (@meghanat)
  2. When a lot of tasks complete in a short span of time, then avoid sending (periodic) updates to the GM from the LM as the GMs would have already been updated about the latest cluster state via the (aperiodic) updates (caused due to task completion updates) from the LMs. (@rishitc)
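
The two suggestions above could be combined into a simple rule-based starting point before trying ML. The sketch below is only an assumption about how such a rule might look: the function name, every threshold and the halving/doubling factors are hypothetical and would need tuning against the traces.

```python
def next_heartbeat_interval(base_interval, requests_per_sec, recent_completions,
                            busy_requests_per_sec=100.0, completion_burst=20,
                            min_interval=1.0, max_interval=60.0):
    """Pick the next LM heartbeat interval from simple load signals.

    All thresholds and factors are placeholder values, not measured ones.
    """
    interval = base_interval
    # Suggestion 1: many allocation requests at the GM -> update more often.
    if requests_per_sec >= busy_requests_per_sec:
        interval = interval / 2
    # Suggestion 2: a burst of task completions already triggered aperiodic
    # updates -> periodic updates can be sent less often.
    if recent_completions >= completion_burst:
        interval = interval * 2
    # Keep the result within fixed bounds.
    return max(min_interval, min(max_interval, interval))


print(next_heartbeat_interval(10.0, 200.0, 0))   # busy GM
print(next_heartbeat_interval(10.0, 10.0, 50))   # completion burst
print(next_heartbeat_interval(10.0, 10.0, 0))    # neither condition
```

When both conditions hold at once, the two adjustments cancel out in this sketch; whether that is the desired behaviour is one of the points open for discussion.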

Thank You,
Rishit.C

Addition of extra NETWORK_DELAY constant in simulator in case of Task Completion

Issue:

Ma'am, in the simulator code, we think that one extra NETWORK_DELAY constant is added, in the case of a Task Completion.
Ma'am, in line 331 (in LM.task_completed) in the file megha_sim.py in the branch Code-Documentation (branch referred throughout the issue), we have:

self.simulation.event_queue.put((task.end_time+NETWORK_DELAY,LMUpdateEvent(self.simulation,periodic=False, gm=self.simulation.gms[task.GM_id])))

Ma'am, here we are currently at the local master and we are sending the task completion message to the global master which had sent this completed task's task launch message as well as updating that global master with the updated state of the cluster.

Ma'am, in line 331 we are adding the NETWORK_DELAY constant for accounting for the delay of sending these details from the local master to the global master, but in lines 149 and 150 in LMUpdateEvent.run, we have:

if not self.periodic:
	self.gm.update_status(current_time+NETWORK_DELAY)

Ma'am, here we are again adding the same NETWORK_DELAY constant for sending the task completion message and the updated state of the cluster to that global master whose task has completed, hence double counting the local master to global master delay.

Possible Fix:

Remove the NETWORK_DELAY constant added in line 331, so that it becomes:

self.simulation.event_queue.put((task.end_time,LMUpdateEvent(self.simulation,periodic=False, gm=self.simulation.gms[task.GM_id])))

Ma'am, requesting you to let us know if this error is valid and if the possible fix suggested is correct.

Thank You,
Rishit.C

Job's end_time not being updated on Job Completion

Issue

The Job class has an end_time variable associated with each job. It is initially assigned the job's start_time, but the value is never updated after that. The following initialization is in the Job class:

self.end_time = self.start_time

Verification of the Issue

The issue was verified by printing the value of end_time for each job after the simulation ended. This was done in the last line of the runner.py file. When the output was analyzed, the end_time of every job was the same as its start_time.

for job in simulator_globals.jobs_completed:
    print(job.job_id, job.start_time, job.end_time)

Fix

The job end_time can be updated whenever each job is completed in the update_status function of the gm.py file. The following change was made to cater to this issue:

job.end_time = job.completion_time

The required fixes have been pushed to branch job_endtime_update (pull request #17).
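
For illustration, one way to keep end_time current is to update it as each task of the job finishes. This is a minimal sketch and not the change in pull request #17: the dataclass layout, the num_tasks and completed fields and the task_finished method are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    job_id: int
    start_time: float
    num_tasks: int
    end_time: float = field(init=False)
    completed: int = 0

    def __post_init__(self):
        # Same initial value as in the issue: end_time starts at start_time.
        self.end_time = self.start_time

    def task_finished(self, task_end_time):
        """Record one finished task; return True once the job is complete."""
        self.completed += 1
        # The job ends when its last task ends, so keep the latest time seen.
        self.end_time = max(self.end_time, task_end_time)
        return self.completed == self.num_tasks


job = Job(job_id=1, start_time=5.0, num_tasks=2)
first_done = job.task_finished(12.5)
second_done = job.task_finished(9.0)
print(job.job_id, job.start_time, job.end_time)
```

The final print uses the same fields as the verification loop above, so a job whose end_time still equals its start_time would be easy to spot.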

Thank You,
@Saurav51
