stanford-futuredata / gavel
Code for "Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads", which appeared at OSDI 2020
License: MIT License
Code for "Heterogenity-Aware Cluster Scheduling Policies for Deep Learning Workloads", which appeared at OSDI 2020
License: MIT License
The FIFO base policy accepts a random seed as an argument, but we aren't passing the emulator's random seed into the FIFO policy constructor.
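A minimal sketch of the intended wiring, assuming the emulator keeps its seed in self._seed; the class and attribute names below are illustrative, not the actual Gavel API:

# Hypothetical sketch -- names are illustrative, not the actual Gavel API.
import random

class FIFOPolicy:
    def __init__(self, seed=None):
        # The base policy already accepts a seed; the bug is that the
        # emulator never forwards its own seed here.
        self._rng = random.Random(seed)

class Emulator:
    def __init__(self, seed):
        self._seed = seed
        # Fix: thread the emulator's seed through to the policy so
        # emulated runs are reproducible.
        self._policy = FIFOPolicy(seed=self._seed)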
There are currently two functions for emulation: one for emulating from a trace and one for emulating with generated jobs. They share a lot of code; this common logic should be factored out.
Hi,
Is the AMI 'gavel' still available and how can I find it?
The lease variables duration and max_duration are too ambiguous - rename these.
@preconditions(lambda self: self._scheduler_lock.locked())
def _compute_kl_divergence(self, timestamp):
    """Builds, per worker type, the distribution of time actually run
    versus the distribution prescribed by the current allocation."""
    if self._allocation is None:
        return
    time_since_last_reset = timestamp - self._reset_timestamp
    # Convert absolute running times into fractions of the elapsed
    # time since the last reset.
    time_run_so_far_fraction = {}
    for job_id in self._time_run_so_far:
        time_run_so_far_fraction[job_id] = {}
        for worker_type in self._time_run_so_far[job_id]:
            if time_since_last_reset == 0:
                # No time has elapsed yet; fall back to a uniform
                # distribution over this job's worker types.
                time_run_so_far_fraction[job_id][worker_type] = \
                    1.0 / len(self._time_run_so_far[job_id])
            else:
                time_run_so_far_fraction[job_id][worker_type] = \
                    (self._time_run_so_far[job_id][worker_type] /
                     time_since_last_reset)
    for worker_type in self._worker_types:
        # Pair up, job by job, the target allocation with the observed
        # fraction of time run, skipping jobs missing from either side.
        allocation_distribution = []
        time_run_distribution = []
        for job_id in time_run_so_far_fraction:
            if (job_id not in self._allocation or
                worker_type not in self._allocation[job_id]):
                continue
            if worker_type not in time_run_so_far_fraction[job_id]:
                continue
            allocation_distribution.append(
                self._allocation[job_id][worker_type])
            time_run_distribution.append(
                time_run_so_far_fraction[job_id][worker_type])
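The snippet above ends before the divergence itself is computed. A plausible continuation (an assumption, not the repository's code) would normalize the two vectors and hand them to scipy.stats.entropy, which computes KL divergence when given two distributions:

# Hypothetical continuation -- not the actual Gavel code.
import numpy as np
from scipy.stats import entropy

def kl_divergence(time_run_distribution, allocation_distribution):
    # Compute D_KL(time_run || allocation). scipy.stats.entropy
    # normalizes its inputs itself, but doing it explicitly keeps
    # the intent clear.
    p = np.asarray(time_run_distribution, dtype=float)
    q = np.asarray(allocation_distribution, dtype=float)
    return entropy(p / p.sum(), q / q.sum())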
The Scheduler and Profiler classes currently share a lot of code - we can factor this out into a common superclass (e.g. SchedulerMechanism).
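A rough sketch of the proposed hierarchy; the method names below are placeholders for whatever logic Scheduler and Profiler currently duplicate:

# Hypothetical sketch -- method names stand in for the shared logic.
class SchedulerMechanism:
    def __init__(self, worker_types):
        self._worker_types = worker_types

    def _register_worker(self, worker_type):
        # Shared bookkeeping that both subclasses currently reimplement.
        ...

class Scheduler(SchedulerMechanism):
    pass

class Profiler(SchedulerMechanism):
    pass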
The get_policy function is repeated across many different files; we should factor it out into a single file.
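One way the shared module could look; the registry contents and class names here are placeholders, not Gavel's real policy classes:

# Hypothetical sketch of a single shared policy_utils.py.
class FIFOPolicy:
    def __init__(self, seed=None):
        self.seed = seed

class MaxMinFairnessPolicy:
    def __init__(self, seed=None):
        self.seed = seed

_POLICIES = {
    'fifo': FIFOPolicy,
    'max_min_fairness': MaxMinFairnessPolicy,
}

def get_policy(policy_name, seed=None):
    # One canonical lookup instead of a copy in every driver script.
    if policy_name not in _POLICIES:
        raise ValueError('Unknown policy: %s' % policy_name)
    return _POLICIES[policy_name](seed=seed)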
Right now, either the scheduler or worker hangs, or we get back an obtuse gRPC exception that is largely unrelated to the actual bug (such as a missing key in a dict).
Currently we mark a job/micro-task's completion time as the end of the round regardless of when the job/micro-task actually finished. This unnecessarily inflates JCTs and overestimates utilization.
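A sketch of the intended fix; the attribute and method names are illustrative, not the scheduler's real bookkeeping:

# Hypothetical sketch -- attribute names are illustrative.
class CompletionTracker:
    def __init__(self):
        self._completion_times = {}

    def mark_complete(self, job_id, actual_finish_time, round_end_time):
        # Before: snapped to the round boundary, inflating JCT and
        # overestimating utilization:
        #   self._completion_times[job_id] = round_end_time
        # After: record when the micro-task actually finished.
        self._completion_times[job_id] = actual_finish_time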
I wanted to understand a bit about the structure of the xxx-throughputs.json files present in the repository. For example, in simulation_throughputs.json:
"('ResNet-18 (batch size 16)', 1)": {
    "null": 4.795294551566172,
    "('ResNet-18 (batch size 32)', 1)": [
        2.539979567443098,
        3.1201925448827033
    ]
}
What does the key "('ResNet-18 (batch size 32)', 1)" refer to, and what does the "null" key represent? Also, do the policy names map as isolated -> max_min_fairness and max_min_fairness -> max_min_fairness_perf?
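The structure appears to be (inferred from the snippet above, not from documentation): each key of the form ('model (batch size N)', scale_factor) names a job type, the "null" entry is that job's throughput when it runs alone, and a nested job-type key gives the pair of throughputs when the two jobs are packed on the same worker. A minimal reading sketch under those assumptions:

# Hypothetical sketch; key layout inferred from the snippet above.
# In the actual file these keys may be nested under a worker type.
import json

with open('simulation_throughputs.json') as f:
    throughputs = json.load(f)

job_key = "('ResNet-18 (batch size 16)', 1)"
entry = throughputs[job_key]
isolated = entry["null"]  # throughput when running alone
packed = entry["('ResNet-18 (batch size 32)', 1)"]
# packed[0]: this job's throughput when co-located,
# packed[1]: the co-located job's throughput.
print(isolated, packed)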
I wanted to run the test scheduler_tests.py. I believe that, for a given trace, this test will write the schedule to a file /tmp/simple.output.
The traces used seem to be missing. I can use a different trace, but there is no "expected" output file for it. Also, the relevant file (run_scheduler_with_trace.py) is not prepared to take the arguments passed by the test.
To run the policy with an SLO, it's necessary to generate the AWS/Azure prices file.
Is it possible to open-source how the prices were obtained?
Or could you detail the attributes in 'log/aws' and 'log/azure'?
Hi,
Can I know where the datasets listed in artifact_evaluation.trace were downloaded from? It would save me the effort of debugging the data-processing part. For example, I downloaded Monet2Photo from the Kaggle datasets. I am getting the following issue while running Gavel on a physical cluster:
Traceback (most recent call last):
  File "cyclegan.py", line 111, in <module>
    dataloader = DataLoader(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 213, in __init__
    sampler = RandomSampler(dataset)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/utils/data/sampler.py", line 93, in __init__
    raise ValueError("num_samples should be a positive integer "
ValueError: num_samples should be a positive integer value, but got num_samples=0
Hi,
Are you going to release the code for the SJT policy? I cannot find it in the repository.
Thanks.
Hi, are there any constraints on the traces used for the simulation, e.g., on the arrival times and the number of steps?
I used a randomly generated trace and got the following error:
Traceback (most recent call last):
  File "scripts/drivers/simulate_scheduler_with_trace.py", line 95, in <module>
    main(parser.parse_args())
  File "scripts/drivers/simulate_scheduler_with_trace.py", line 56, in main
    jobs_to_complete=jobs_to_complete)
  File "/opt/tiger/gavel/scheduler/scheduler.py", line 1464, in simulate
    scheduled_jobs = self._schedule_jobs_on_workers()
  File "/opt/tiger/gavel/scheduler/scheduler.py", line 870, in _schedule_jobs_on_workers
    self._update_priorities()
  File "/opt/tiger/gavel/scheduler/scheduler.py", line 2393, in _update_priorities
    time_since_last_reset = current_time - self._last_reset_time
TypeError: unsupported operand type(s) for -: 'NoneType' and 'float'
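The traceback suggests self._last_reset_time is still None the first time _update_priorities runs. A defensive guess at a guard (an assumption about the scheduler's internals, not a confirmed fix):

# Hypothetical guard -- assumes _last_reset_time should default to the
# simulation's start time instead of None.
class SchedulerSketch:
    def __init__(self, start_time=0.0):
        self._last_reset_time = None
        self._start_time = start_time

    def _update_priorities(self, current_time):
        if self._last_reset_time is None:
            self._last_reset_time = self._start_time
        return current_time - self._last_reset_time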
Besides, does the scale factor mean the number of servers used? What if there are 8 GPUs on one server and a job requires 1/2/4 GPUs?
Hi all,
Thanks for your excellent work. A small note from playing around with the framework: would you consider modifying the creation of checkpoint_dir at line 73 of worker.py? The shutil.rmtree call there is dangerous - I almost deleted my whole workspace because I set the path outside the gavel directory (that's my fault, of course) :(
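One defensive option is to refuse to rmtree anything outside a dedicated checkpoint root. A sketch (the expected-root layout is an assumption, not Gavel's current behavior):

# Hypothetical safety guard before shutil.rmtree; paths are illustrative.
import os
import shutil

def safe_remove_checkpoint_dir(checkpoint_dir, expected_root):
    real = os.path.realpath(checkpoint_dir)
    root = os.path.realpath(expected_root)
    # Only delete directories strictly inside the expected root.
    if os.path.commonpath([real, root]) != root or real == root:
        raise ValueError('Refusing to delete %s: outside %s' % (real, root))
    shutil.rmtree(real)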
Code to generate jobs exists in multiple places (scheduler.py, scripts/utils/generate_trace.py, scripts/test/solver.py) - dedup this.