
gavel's People

Contributors

akshayka, deepakn94, fiodarkazhamiaka, mateiz, santhnm2


gavel's Issues

De-duplicate functions for emulation

There are currently two functions for emulation: one that emulates from a trace and one that emulates with generated jobs. They share a lot of code; this should be de-duplicated.
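
One way to de-duplicate is to keep a single emulation entry point and pass the jobs in, so both call sites share one code path. This is an illustrative sketch only; the function names here (`emulate`, `load_trace`, `generate_jobs`) are stand-ins, not Gavel's actual API.

```python
def load_trace(path):
    # Stand-in for reading jobs from a trace file.
    return [f"job-from-{path}-{i}" for i in range(3)]

def generate_jobs(num_jobs, seed=0):
    # Stand-in for synthesizing random jobs.
    return [f"generated-job-{seed}-{i}" for i in range(num_jobs)]

def emulate(jobs):
    # All shared emulation logic lives here once, instead of being
    # duplicated across two near-identical functions.
    return [f"ran {job}" for job in jobs]

# Both call sites now funnel through the same code path:
results_from_trace = emulate(load_trace("trace.txt"))
results_generated = emulate(generate_jobs(num_jobs=3))
```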

The operation of shutil.rmtree

Hi all,

Thanks for your excellent work. A small note from playing around with the framework: would you consider modifying how checkpoint_dir is created on line 73 of worker.py? The shutil.rmtree call there is dangerous; I almost deleted my whole workspace because I had set the path outside the gavel directory (my fault, of course) :(
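
One defensive pattern for this kind of cleanup is to refuse to `shutil.rmtree` any directory that the worker did not itself create, e.g. via a marker file. The marker-file convention below is an illustrative safeguard, not what worker.py actually does.

```python
import os
import shutil

MARKER = ".gavel_checkpoint_dir"  # hypothetical marker-file name

def reset_checkpoint_dir(checkpoint_dir):
    """Recreate checkpoint_dir, but only delete it if we created it."""
    if os.path.exists(checkpoint_dir):
        # Only delete directories that carry our marker file, so a
        # mis-configured path (e.g. the user's workspace) survives.
        if not os.path.isfile(os.path.join(checkpoint_dir, MARKER)):
            raise RuntimeError(
                f"Refusing to delete {checkpoint_dir!r}: "
                "it was not created by this worker.")
        shutil.rmtree(checkpoint_dir)
    os.makedirs(checkpoint_dir)
    open(os.path.join(checkpoint_dir, MARKER), "w").close()
```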

Question about the datasets

Hi,

Could you tell me where the datasets listed in artifact_evaluation.trace are downloaded from? It would save me the effort of debugging the data-processing part. For example, I downloaded Monet2Photo from Kaggle, and I get the following error while running Gavel on a physical cluster:
Traceback (most recent call last):
  File "cyclegan.py", line 111, in <module>
    dataloader = DataLoader(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 213, in __init__
    sampler = RandomSampler(dataset)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/utils/data/sampler.py", line 93, in __init__
    raise ValueError("num_samples should be a positive integer "
ValueError: num_samples should be a positive integer value, but got num_samples=0
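
A `num_samples=0` error from `RandomSampler` usually means the dataset object found zero files, i.e. the archive was unpacked into a different layout than the loader expects. A quick pre-flight check makes this failure obvious before the DataLoader is constructed; the function and file pattern below are illustrative, not part of the cyclegan.py script.

```python
import glob
import os

def check_dataset_dir(root, pattern="*.jpg"):
    """Fail loudly if the dataset directory yields no files."""
    files = glob.glob(os.path.join(root, pattern))
    if not files:
        raise FileNotFoundError(
            f"No files matching {pattern!r} under {root!r}; "
            "check how the dataset archive was unpacked.")
    return len(files)
```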

code for SJT policy

Hi,

Are you going to distribute the code for the SJT policy? I cannot find it in the repository.

Thanks.

Factor out job generation code

Code to generate jobs exists in multiple places (scheduler.py, scripts/utils/generate_trace.py, scripts/test/solver.py); this should be de-duplicated.

Running scheduler_tests.py

I wanted to run the test scheduler_tests.py. I believe that, for a given trace, this test writes the schedule to a file /tmp/simple.output.

The traces it uses seem to be missing. I can use a different trace, but there is no "expected" output file for it. Also, the relevant file (run_scheduler_with_trace.py) is not set up to take the arguments passed by the test.

Question: How can I get the spot prices aws/azure?

The aws/azure price files are necessary to run the policy with SLOs.
Is it possible to open-source how these prices were obtained?
Or could you detail the attributes in 'log/aws' and 'log/azure'?

Compute KL divergence between true distribution and allocated distribution

    @preconditions(lambda self: self._scheduler_lock.locked())
    def _compute_kl_divergence(self, timestamp):

        if self._allocation is None:
            return

        time_since_last_reset = timestamp - self._reset_timestamp

        time_run_so_far_fraction = {}
        for job_id in self._time_run_so_far:
            time_run_so_far_fraction[job_id] = {}
            for worker_type in self._time_run_so_far[job_id]:
                if time_since_last_reset == 0:
                    time_run_so_far_fraction[job_id][worker_type] = \
                        1.0 / len(self._time_run_so_far[job_id])
                else:
                    time_run_so_far_fraction[job_id][worker_type] = \
                        (self._time_run_so_far[job_id][worker_type] /
                            time_since_last_reset)

        for worker_type in self._worker_types:
            allocation_distribution = []
            time_run_distribution = []
            for job_id in time_run_so_far_fraction:
                if (job_id not in self._allocation or
                    worker_type not in self._allocation[job_id]):
                    continue
                if worker_type not in time_run_so_far_fraction[job_id]:
                    continue
                allocation_distribution.append(
                    self._allocation[job_id][worker_type])
                time_run_distribution.append(
                    time_run_so_far_fraction[job_id][worker_type])

Improve error handling

Right now, either the scheduler or worker hangs, or we get back an obtuse gRPC exception that is mostly unrelated to the actual bug (such as a missing key in a dict).
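
One common remedy is to catch low-level exceptions at the RPC boundary and re-raise them with context, so the surfaced error names the malformed input instead of a bare KeyError bubbling up through gRPC. The handler name and payload shape below are illustrative, not Gavel's actual API.

```python
def handle_update(payload):
    """Validate an incoming message before touching its fields."""
    try:
        job_id = payload["job_id"]  # the kind of missing-key bug the issue mentions
    except KeyError as e:
        # Re-raise with the full payload so the log points at the real bug.
        raise ValueError(
            f"Malformed scheduler message, missing field {e}: {payload!r}") from e
    return f"updated {job_id}"
```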

Questions about the simulation

Hi, are there any constraints on the traces used for simulation, e.g., on the arrival times and the steps?
I used a randomly generated trace and got the following error:

Traceback (most recent call last):
  File "scripts/drivers/simulate_scheduler_with_trace.py", line 95, in <module>
    main(parser.parse_args())
  File "scripts/drivers/simulate_scheduler_with_trace.py", line 56, in main
    jobs_to_complete=jobs_to_complete)
  File "/opt/tiger/gavel/scheduler/scheduler.py", line 1464, in simulate
    scheduled_jobs = self._schedule_jobs_on_workers()
  File "/opt/tiger/gavel/scheduler/scheduler.py", line 870, in _schedule_jobs_on_workers
    self._update_priorities()
  File "/opt/tiger/gavel/scheduler/scheduler.py", line 2393, in _update_priorities
    time_since_last_reset = current_time - self._last_reset_time
TypeError: unsupported operand type(s) for -: 'NoneType' and 'float'
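
The TypeError suggests `self._last_reset_time` is still `None` the first time `_update_priorities` runs for this trace. A common defensive fix is to fall back to a known start time when no reset has happened yet; this is only a sketch, with the attribute names taken from the traceback and the fallback being an assumption:

```python
class PrioritySketch:
    """Minimal stand-in for the scheduler state involved in the traceback."""

    def __init__(self, start_time=0.0):
        self._start_time = start_time
        self._last_reset_time = None  # not set until the first reset

    def _update_priorities(self, current_time):
        # Guard against the None - float subtraction from the traceback.
        last_reset = (self._last_reset_time
                      if self._last_reset_time is not None
                      else self._start_time)
        return current_time - last_reset
```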

Besides, does the scale factor mean the number of servers used? What if there are 8 GPUs on one server and a job requires 1/2/4 GPUs?

Question: Understanding structure of throughputs.json files

I wanted to understand a bit about the structure of the xxx-throughputs.json files present in the repository.
For example, in simulation_throughputs.json:

"('ResNet-18 (batch size 16)', 1)": {
    "null": 4.795294551566172,
    "('ResNet-18 (batch size 32)', 1)": [
        2.539979567443098,
        3.1201925448827033
    ]
  1. What are the two values in the array with key "('ResNet-18 (batch size 32)', 1)"?
  2. What does the "null" key represent?

It would be great if you could also provide details on how you collected/generated these files, so that they can be reproduced for a GPU not present in the repository (say, Turing).
