stanford-futuredata / gavel
Code for "Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads", which appeared at OSDI 2020
License: MIT License
Code for "Heterogenity-Aware Cluster Scheduling Policies for Deep Learning Workloads", which appeared at OSDI 2020
License: MIT License
The FIFO base policy accepts a random seed as an argument, but we aren't passing the emulator's random seed into the FIFO policy constructor.
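A minimal sketch of the intended wiring, assuming the emulator keeps its seed in self._seed; the class and attribute names below are illustrative, not the actual Gavel API:

# Hypothetical sketch -- names are illustrative, not the actual Gavel API.
import random

class FIFOPolicy:
    def __init__(self, seed=None):
        # The base policy already accepts a seed; the bug is that the
        # emulator never forwards its own seed here.
        self._rng = random.Random(seed)

class Emulator:
    def __init__(self, seed):
        self._seed = seed
        # Fix: thread the emulator's seed through to the policy so
        # emulated runs are reproducible.
        self._policy = FIFOPolicy(seed=self._seed)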
There are currently two functions for emulation: one for emulating from a trace and one for emulating with generated jobs. They share a lot of code; this common logic should be factored out.
Hi,
Is the AMI 'gavel' still available and how can I find it?
The lease variables duration and max_duration are too ambiguous - rename these.
@preconditions(lambda self: self._scheduler_lock.locked())
def _compute_kl_divergence(self, timestamp):
    """Builds, per worker type, the distribution of time actually run
    versus the distribution prescribed by the current allocation."""
    if self._allocation is None:
        return
    time_since_last_reset = timestamp - self._reset_timestamp
    # Convert absolute running times into fractions of the elapsed
    # time since the last reset.
    time_run_so_far_fraction = {}
    for job_id in self._time_run_so_far:
        time_run_so_far_fraction[job_id] = {}
        for worker_type in self._time_run_so_far[job_id]:
            if time_since_last_reset == 0:
                # No time has elapsed yet; fall back to a uniform
                # distribution over this job's worker types.
                time_run_so_far_fraction[job_id][worker_type] = \
                    1.0 / len(self._time_run_so_far[job_id])
            else:
                time_run_so_far_fraction[job_id][worker_type] = \
                    (self._time_run_so_far[job_id][worker_type] /
                     time_since_last_reset)
    for worker_type in self._worker_types:
        # Pair up, job by job, the target allocation with the observed
        # fraction of time run, skipping jobs missing from either side.
        allocation_distribution = []
        time_run_distribution = []
        for job_id in time_run_so_far_fraction:
            if (job_id not in self._allocation or
                worker_type not in self._allocation[job_id]):
                continue
            if worker_type not in time_run_so_far_fraction[job_id]:
                continue
            allocation_distribution.append(
                self._allocation[job_id][worker_type])
            time_run_distribution.append(
                time_run_so_far_fraction[job_id][worker_type])
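The snippet above ends before the divergence itself is computed. A plausible continuation (an assumption, not the repository's code) would normalize the two vectors and hand them to scipy.stats.entropy, which computes KL divergence when given two distributions:

# Hypothetical continuation -- not the actual Gavel code.
import numpy as np
from scipy.stats import entropy

def kl_divergence(time_run_distribution, allocation_distribution):
    # Compute D_KL(time_run || allocation). scipy.stats.entropy
    # normalizes its inputs itself, but doing it explicitly keeps
    # the intent clear.
    p = np.asarray(time_run_distribution, dtype=float)
    q = np.asarray(allocation_distribution, dtype=float)
    return entropy(p / p.sum(), q / q.sum())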
The Scheduler and Profiler classes currently share a lot of code - we can factor this out into a common superclass (e.g. SchedulerMechanism).
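A rough sketch of the proposed hierarchy; the method names below are placeholders for whatever logic Scheduler and Profiler currently duplicate:

# Hypothetical sketch -- method names stand in for the shared logic.
class SchedulerMechanism:
    def __init__(self, worker_types):
        self._worker_types = worker_types

    def _register_worker(self, worker_type):
        # Shared bookkeeping that both subclasses currently reimplement.
        ...

class Scheduler(SchedulerMechanism):
    pass

class Profiler(SchedulerMechanism):
    pass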
The get_policy function is repeated across many different files; we should factor it out into a single file.
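One way the shared module could look; the registry contents and class names here are placeholders, not Gavel's real policy classes:

# Hypothetical sketch of a single shared policy_utils.py.
class FIFOPolicy:
    def __init__(self, seed=None):
        self.seed = seed

class MaxMinFairnessPolicy:
    def __init__(self, seed=None):
        self.seed = seed

_POLICIES = {
    'fifo': FIFOPolicy,
    'max_min_fairness': MaxMinFairnessPolicy,
}

def get_policy(policy_name, seed=None):
    # One canonical lookup instead of a copy in every driver script.
    if policy_name not in _POLICIES:
        raise ValueError('Unknown policy: %s' % policy_name)
    return _POLICIES[policy_name](seed=seed)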
Right now, either the scheduler or worker hangs, or we get back an obtuse gRPC exception that is largely unrelated to the actual bug (such as a missing key in a dict).
Currently we mark a job/micro-task's completion time as the end of the round regardless of when the job/micro-task actually finished. This unnecessarily inflates JCTs and overestimates utilization.
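A sketch of the intended fix; the attribute and method names are illustrative, not the scheduler's real bookkeeping:

# Hypothetical sketch -- attribute names are illustrative.
class CompletionTracker:
    def __init__(self):
        self._completion_times = {}

    def mark_complete(self, job_id, actual_finish_time, round_end_time):
        # Before: snapped to the round boundary, inflating JCT and
        # overestimating utilization:
        #   self._completion_times[job_id] = round_end_time
        # After: record when the micro-task actually finished.
        self._completion_times[job_id] = actual_finish_time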
I wanted to understand a bit about the structure of the xxx-throughputs.json files present in the repository. For example, in simulation_throughputs.json:
"('ResNet-18 (batch size 16)', 1)": {
    "null": 4.795294551566172,
    "('ResNet-18 (batch size 32)', 1)": [
        2.539979567443098,
        3.1201925448827033
    ]
}
What does the key "('ResNet-18 (batch size 32)', 1)" refer to, and what does the "null" key represent? Also, do the policy names map as isolated -> max_min_fairness and max_min_fairness -> max_min_fairness_perf?
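The structure appears to be (inferred from the snippet above, not from documentation): each key of the form ('model (batch size N)', scale_factor) names a job type, the "null" entry is that job's throughput when it runs alone, and a nested job-type key gives the pair of throughputs when the two jobs are packed on the same worker. A minimal reading sketch under those assumptions:

# Hypothetical sketch; key layout inferred from the snippet above.
# In the actual file these keys may be nested under a worker type.
import json

with open('simulation_throughputs.json') as f:
    throughputs = json.load(f)

job_key = "('ResNet-18 (batch size 16)', 1)"
entry = throughputs[job_key]
isolated = entry["null"]  # throughput when running alone
packed = entry["('ResNet-18 (batch size 32)', 1)"]
# packed[0]: this job's throughput when co-located,
# packed[1]: the co-located job's throughput.
print(isolated, packed)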
I wanted to run the test scheduler_tests.py. I believe that, for a given trace, this test will write the schedule to a file /tmp/simple.output.
The traces used seem to be missing. I can use a different trace, but there is no "expected" output file for it. Also, the relevant file (run_scheduler_with_trace.py) is not prepared to take the arguments passed by the test.
To run the policy with an SLO, it's necessary to generate the AWS/Azure prices file.
Is it possible to open-source how the prices were obtained?
Or could you detail the attributes in 'log/aws' and 'log/azure'?
Hi,
Can I know where the datasets listed in artifact_evaluation.trace were downloaded from? It would save me the effort of debugging the data-processing part. For example, I downloaded Monet2Photo from the Kaggle datasets. I am getting the following issue while running Gavel on a physical cluster:
Traceback (most recent call last):
  File "cyclegan.py", line 111, in <module>
    dataloader = DataLoader(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 213, in __init__
    sampler = RandomSampler(dataset)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/utils/data/sampler.py", line 93, in __init__
    raise ValueError("num_samples should be a positive integer "
ValueError: num_samples should be a positive integer value, but got num_samples=0
Hi,
Are you going to release the code for the SJT policy? I cannot find it in the repository.
Thanks.
Hi, are there any constraints on the traces used for the simulation, e.g., on the arrival times and the number of steps?
I used a randomly generated trace and got the following error:
Traceback (most recent call last):
  File "scripts/drivers/simulate_scheduler_with_trace.py", line 95, in <module>
    main(parser.parse_args())
  File "scripts/drivers/simulate_scheduler_with_trace.py", line 56, in main
    jobs_to_complete=jobs_to_complete)
  File "/opt/tiger/gavel/scheduler/scheduler.py", line 1464, in simulate
    scheduled_jobs = self._schedule_jobs_on_workers()
  File "/opt/tiger/gavel/scheduler/scheduler.py", line 870, in _schedule_jobs_on_workers
    self._update_priorities()
  File "/opt/tiger/gavel/scheduler/scheduler.py", line 2393, in _update_priorities
    time_since_last_reset = current_time - self._last_reset_time
TypeError: unsupported operand type(s) for -: 'NoneType' and 'float'
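The traceback suggests self._last_reset_time is still None the first time _update_priorities runs. A defensive guess at a guard (an assumption about the scheduler's internals, not a confirmed fix):

# Hypothetical guard -- assumes _last_reset_time should default to the
# simulation's start time instead of None.
class SchedulerSketch:
    def __init__(self, start_time=0.0):
        self._last_reset_time = None
        self._start_time = start_time

    def _update_priorities(self, current_time):
        if self._last_reset_time is None:
            self._last_reset_time = self._start_time
        return current_time - self._last_reset_time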
Besides, does the scale factor mean the number of servers used? What if there are 8 GPUs on one server and a job requires 1/2/4 GPUs?
Hi all,
Thanks for your excellent work. A small note from playing around with the framework: would you consider modifying the creation of checkpoint_dir at line 73 of worker.py? The shutil.rmtree call there is dangerous - I almost deleted my whole workspace because I set the path outside the gavel directory (that's my fault, of course) :(
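One defensive option is to refuse to rmtree anything outside a dedicated checkpoint root. A sketch (the expected-root layout is an assumption, not Gavel's current behavior):

# Hypothetical safety guard before shutil.rmtree; paths are illustrative.
import os
import shutil

def safe_remove_checkpoint_dir(checkpoint_dir, expected_root):
    real = os.path.realpath(checkpoint_dir)
    root = os.path.realpath(expected_root)
    # Only delete directories strictly inside the expected root.
    if os.path.commonpath([real, root]) != root or real == root:
        raise ValueError('Refusing to delete %s: outside %s' % (real, root))
    shutil.rmtree(real)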
Code to generate jobs exists in multiple places (scheduler.py, scripts/utils/generate_trace.py, scripts/test/solver.py) - dedup this.