
harmony's People

Contributors

bchocho, beomyeol, bgchun, chenehk, dongjoon-hyun, dongjun-lee, gwsshs22, gyeongin, hjp615, jieunparklee, johnyangk, jooykim, jsjason, jsryu21, junhoekim, kijungs, mhkwon924, seojangho, swlsw, wonook, wynot12, yunseong

harmony's Issues

Improve usability

We need to clean up the APIs (e.g., launch scripts) and elaborate the documentation (e.g., README).

Change GBT to fit the new Trainer interface

#25 has changed the Trainer interface.
The new interface assumes that there is only one iteration per mini-batch.

However, the GBT app's regression mode runs internal iterations within a mini-batch.
We may change it to batch the communications of the internal iterations.
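
A minimal sketch of one way to do this, assuming a hypothetical pull/push model accessor (only the runMiniBatch method name appears in the actual Trainer interface): the regression mode keeps its internal iterations local, accumulates their updates, and pushes them once per mini-batch.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Illustrative sketch only: GBT regression keeps its internal iterations local
 * and batches their updates into a single push, so one runMiniBatch call maps
 * to a single pull -> comp -> push cycle, as the new Trainer interface assumes.
 * All names except runMiniBatch are hypothetical.
 */
public final class BatchedGbtRegressionSketch {

  /** Hypothetical accessor for the global model kept on the servers. */
  interface ModelAccessor {
    Map<Integer, Double> pull();
    void push(Map<Integer, Double> updates);
  }

  private static final int NUM_INTERNAL_ITERATIONS = 10;
  private final ModelAccessor modelAccessor;

  BatchedGbtRegressionSketch(final ModelAccessor modelAccessor) {
    this.modelAccessor = modelAccessor;
  }

  /** One mini-batch: a single pull, several local iterations, a single push. */
  public void runMiniBatch(final List<double[]> miniBatchData) {
    final Map<Integer, Double> model = modelAccessor.pull();
    final Map<Integer, Double> accumulatedUpdates = new HashMap<>();

    for (int iter = 0; iter < NUM_INTERNAL_ITERATIONS; iter++) {
      // One internal iteration, computed against the locally refined model copy.
      final Map<Integer, Double> updates = computeIteration(model, miniBatchData);
      updates.forEach((key, value) -> {
        accumulatedUpdates.merge(key, value, Double::sum);
        model.merge(key, value, Double::sum); // refine the local copy for the next iteration
      });
    }

    modelAccessor.push(accumulatedUpdates); // one batched communication instead of one per iteration
  }

  private Map<Integer, Double> computeIteration(final Map<Integer, Double> model,
                                                final List<double[]> data) {
    return new HashMap<>(); // placeholder for the actual regression step
  }
}
```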

Handle different in-memory data formats for the same input data

Different apps (e.g., MLR, GBT, Lasso) may use the same input data.
However, in some cases they use different in-memory formats for exactly the same data.

  • MLR, which is for classification tasks, maintains values as integers.
  • Lasso, which is for regression tasks, maintains values as floats.
  • GBT, which covers both classification and regression, maintains values as floats.

This becomes a problem in #21, which makes jobs share the input table for the same input file.

We may fix all of them to store data in a single type (integer or float) and convert it on use.
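
A minimal sketch of the convert-on-use idea, with hypothetical names: the shared table stores one canonical representation (float here) and each app converts at access time.

```java
import java.util.Arrays;

/**
 * Illustrative sketch only (names are hypothetical): keep one canonical
 * in-memory representation (float) in the shared input table and convert on
 * use, so MLR, Lasso, and GBT can all share one table for the same input file.
 */
public final class SharedInputRowSketch {

  private final float[] values; // canonical storage type for the shared table

  public SharedInputRowSketch(final float[] values) {
    this.values = values.clone();
  }

  /** Regression apps (e.g., Lasso, GBT regression) use the values as-is. */
  public float[] getValuesAsFloats() {
    return values.clone();
  }

  /** Classification apps (e.g., MLR) convert to integers on use. */
  public int[] getValuesAsInts() {
    final int[] converted = new int[values.length];
    for (int i = 0; i < values.length; i++) {
      converted[i] = Math.round(values[i]);
    }
    return converted;
  }

  public static void main(final String[] args) {
    final SharedInputRowSketch row = new SharedInputRowSketch(new float[]{1.0f, 3.0f});
    System.out.println(Arrays.toString(row.getValuesAsInts())); // [1, 3]
  }
}
```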

Decompose Trainer task into more fine-grained steps

It's a sub-issue of #23.

The current Trainer interface provides a runMiniBatch method, which runs a mini-batch by itself (e.g., pull -> comp -> push).
However, for jobs to run harmoniously with each other, we need to control them in a more fine-grained manner.
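
A minimal sketch of one possible decomposition, with hypothetical names (only runMiniBatch comes from the current interface): the pull, compute, and push steps become separately invokable, and the old coarse-grained behavior is recovered by running them back-to-back.

```java
import java.util.List;
import java.util.Map;

/**
 * Illustrative sketch only: decompose the single runMiniBatch call into
 * separately invokable steps, so an external scheduler can decide when each
 * job pulls, computes, and pushes. All names except runMiniBatch are hypothetical.
 */
public final class FineGrainedTrainerSketch {

  /** Hypothetical fine-grained trainer contract. */
  interface StepwiseTrainer<D, M> {
    M pullModel();                                                  // step 1: fetch the global model
    Map<Integer, Double> compute(M model, List<D> miniBatchData);   // step 2: local computation
    void pushUpdates(Map<Integer, Double> updates);                 // step 3: send updates
  }

  /** The old coarse-grained behavior is recovered by running the steps back-to-back. */
  static <D, M> void runMiniBatch(final StepwiseTrainer<D, M> trainer, final List<D> data) {
    final M model = trainer.pullModel();
    final Map<Integer, Double> updates = trainer.compute(model, data);
    trainer.pushUpdates(updates);
  }
}
```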

Enable resource sharing across jobs running on JobServer

Currently, each job running on JobServer is allocated its own partition of resources (executors).

We can greatly improve overall job performance by sharing resources between jobs instead of strictly partitioning them.

With this approach, different jobs run on the same executors, fully utilizing resources.

Copy global model when retrieving it from local tablet

In PS-collocation mode, workers retrieve the global model from servers, including their local server.
In that case, ET's get API returns the original objects stored in the table.
This becomes a problem when the returned objects are mutated in the background:
workers may observe an intermittent state of the model values, which violates the expected format.

In summary, workers need to use a copy of the mutable model, which can be updated concurrently by other (remote) worker threads.
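
A minimal sketch of the defensive-copy idea, not ET's actual API: a get served from the local tablet returns a clone taken under the same lock that update uses, so readers never observe a half-applied update.

```java
import java.util.concurrent.ConcurrentHashMap;

/**
 * Illustrative sketch only (not ET's actual API): when a get is served from the
 * collocated local tablet, return a copy of the stored model vector instead of
 * the original object, so concurrent in-place updates by other worker threads
 * cannot expose a half-updated state to the reader.
 */
public final class CopyOnLocalGetSketch {

  private final ConcurrentHashMap<Long, double[]> localTablet = new ConcurrentHashMap<>();

  /** Remote gets already receive a copy via (de)serialization; local gets must copy explicitly. */
  public double[] get(final long key) {
    final double[] stored = localTablet.get(key);
    if (stored == null) {
      return null;
    }
    synchronized (stored) {          // pair with the same lock used by update()
      return stored.clone();         // defensive copy: caller never sees later mutations
    }
  }

  public void update(final long key, final double[] delta) {
    localTablet.compute(key, (k, current) -> {
      if (current == null) {
        return delta.clone();
      }
      synchronized (current) {       // mutate in place under the per-value lock
        for (int i = 0; i < current.length; i++) {
          current[i] += delta[i];
        }
      }
      return current;
    });
  }
}
```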

Introduce GlobalTaskUnitScheduler

Currently, the TaskUnits of multiple jobs are scheduled only by the local TaskUnitScheduler.
This causes a problem: workers run jobs in different orders, which incurs unnecessary synchronization overhead.

We need to schedule them globally.

Generic job-server

We have a Dolphin-specific job server.
We can extend it to support other frameworks as well (e.g., Pregel).

Scheduler for running multiple jobs resource-efficiently

The current JobServer runs jobs with partitioned resources.

However, we can run jobs more efficiently by sharing resources across jobs.
For this, we need to coordinate jobs so that they run harmoniously without contention, maximizing resource utilization.

In detail, we need to do the following:

  • Change the worker trainer task to be controllable in a more fine-grained manner.
  • Introduce a component to control trainer tasks.

Checkpoint local models in LDA and NMF

For offline model evaluation, we need to checkpoint both global and local models.
However, currently only global models are checkpointed.

We need to extend it to cover local models.

Introduce a way to control the degree of asynchronicity across workers (clock slack)

Dolphin only supports totally asynchronous execution (TAP).
We need to extend it to support synchronous execution (BSP).

SSP is a good way to control the degree of synchronicity.

To minimize the implementation effort, we can ignore the cache layer and simply control the progress of mini-batches in workers.

Decoupling progress control from the cache policy also lets our SSP implementation support TAP, which pulls the model at every mini-batch start: slack 0 means BSP and an infinite slack means TAP.

To make it behave like the original SSP, users need to configure the cache layer (e.g., CachedModelAccessor) correspondingly.
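
A minimal sketch of the progress-control part, with hypothetical names: each worker reports a clock tick after every mini-batch and blocks before its next mini-batch whenever it is more than slack clocks ahead of the slowest worker, so slack 0 gives BSP-like lockstep and an effectively unbounded slack gives TAP.

```java
/**
 * Illustrative sketch only: control the degree of asynchronicity with a clock
 * slack, independent of any cache layer. A worker may start its next mini-batch
 * only if it is at most `slack` completed mini-batches ahead of the slowest
 * worker. Slack 0 behaves like BSP; an unbounded slack behaves like TAP.
 */
public final class ClockSlackSketch {

  private final int[] workerClocks;   // number of mini-batches completed by each worker
  private final int slack;            // allowed gap between fastest and slowest worker

  public ClockSlackSketch(final int numWorkers, final int slack) {
    this.workerClocks = new int[numWorkers];
    this.slack = slack;
  }

  /** Called by a worker before starting its next mini-batch; blocks if it is too far ahead. */
  public synchronized void waitUntilAllowed(final int workerId) throws InterruptedException {
    while (workerClocks[workerId] - minClock() > slack) {
      wait();                          // too far ahead of the slowest worker
    }
  }

  /** Called by a worker after finishing a mini-batch. */
  public synchronized void clockTick(final int workerId) {
    workerClocks[workerId]++;
    notifyAll();                       // the slowest worker may have advanced
  }

  private int minClock() {
    int min = Integer.MAX_VALUE;
    for (final int clock : workerClocks) {
      min = Math.min(min, clock);
    }
    return min;
  }
}
```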

Enable data sharing across jobs running on JobServer

Currently, jobs cannot share data (tables).

So each job has to maintain its own tables, even when the table contents are the same.

By enabling table sharing across jobs, we can eliminate the overhead incurred by individual tables (e.g., memory pressure, data loading time).

Optimize local access routine of update and get operation

#10 has fixed a concurrency bug in the local access path of the update operation.
However, it reuses the remote access routine for local access, which has unnecessary overheads (e.g., serialization of the key and update value, passing through the loopback interface).

We can further optimize the local access routine of the update operation.
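
A minimal sketch of a local fast path, not ET's actual code: when the key's block is owned locally, the update function is applied directly to the in-memory value, skipping key/value serialization and the loopback hop; otherwise the existing remote routine is used.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiFunction;

/**
 * Illustrative sketch only (not ET's actual code): a local fast path for the
 * update operation that applies the update function directly to the in-memory
 * value, skipping the key/value serialization and the loopback network hop
 * that the remote access routine would pay.
 */
public final class LocalUpdateFastPathSketch<K, V, U> {

  private final Map<K, V> localBlocks = new ConcurrentHashMap<>();
  private final BiFunction<V, U, V> updateFunction; // expected to handle a null old value

  public LocalUpdateFastPathSketch(final BiFunction<V, U, V> updateFunction) {
    this.updateFunction = updateFunction;
  }

  public void update(final K key, final U updateValue) {
    if (isLocal(key)) {
      // Fast path: no serialization, no loopback; apply the update atomically in place.
      localBlocks.compute(key, (k, oldValue) -> updateFunction.apply(oldValue, updateValue));
    } else {
      sendRemoteUpdate(key, updateValue);  // existing remote routine (serialize + send)
    }
  }

  private boolean isLocal(final K key) {
    return true;  // placeholder: consult the block ownership / routing table
  }

  private void sendRemoteUpdate(final K key, final U updateValue) {
    // placeholder: serialize the key and update value and route them to the owner executor
  }
}
```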

Tasklet-level metric collection service

The metric collection service does not support tasklet granularity.
It only supports one type of metric per executor.

We need to extend it to support different metric types and policies for multiple tasklets within the same executor.

Complete implementation of table sharing

Though #21 has introduced the table sharing feature, it still lacks several details (a possible way to cover both points is sketched after this list):

  • When apps are launched concurrently, only the first job waits for data loading, while the other jobs start training on a table that is not fully loaded yet.
  • When there are no ongoing jobs that use a specific table, we need to drop the table so it does not waste memory resources.
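
A minimal sketch covering both points, with hypothetical names: jobs acquire the shared table through a handle that blocks until loading completes, and the table is dropped when the last job releases it.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Illustrative sketch only (names are hypothetical): a shared-table handle that
 * (1) makes every job wait until the initial data loading has finished and
 * (2) drops the table when the last job that uses it has finished.
 */
public final class SharedTableHandleSketch {

  private final CountDownLatch loadComplete = new CountDownLatch(1);
  private final AtomicInteger refCount = new AtomicInteger(0);

  /** Called by the job that loads the input data, once loading has finished. */
  public void markLoadComplete() {
    loadComplete.countDown();
  }

  /** Every job acquires the table before training; later jobs block until loading is done. */
  public void acquire() throws InterruptedException {
    refCount.incrementAndGet();
    loadComplete.await();
  }

  /** Each job releases the table when it finishes; the last release drops the table. */
  public void release() {
    if (refCount.decrementAndGet() == 0) {
      dropTable();
    }
  }

  private void dropTable() {
    // placeholder: free the memory used by the table's blocks
  }
}
```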

Separate out mutable state from input tables

Currently, Dolphin maintains mutable worker state and immutable input data together in the same table.

This is convenient for applications, but it makes data sharing across jobs impossible:
the immutable data can be shared, but the mutable data that maintains worker state must be kept separate per worker.
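
A minimal sketch of the separation, with hypothetical names: the immutable parsed rows live in a shareable table, while each worker keeps its mutable per-row state in its own local structure keyed by the same row IDs.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative sketch only (names are hypothetical): the immutable parsed input
 * rows live in a table that can be shared across jobs, while each worker keeps
 * its own mutable per-row state in a separate local structure keyed by the same
 * row IDs, so sharing the input never exposes one job's state to another.
 */
public final class StateSeparationSketch {

  /** Immutable, shareable input row. */
  public static final class InputRow {
    private final float[] features;
    private final float label;

    public InputRow(final float[] features, final float label) {
      this.features = features.clone();
      this.label = label;
    }

    public float[] getFeatures() {
      return features.clone();
    }

    public float getLabel() {
      return label;
    }
  }

  /** Shared and read-only across jobs (e.g., backed by the shared input table). */
  private final Map<Long, InputRow> sharedInputTable = new HashMap<>();

  /** Worker-local and mutable (e.g., per-row assignments or residuals). */
  private final Map<Long, double[]> localWorkerState = new HashMap<>();
}
```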
