
harmony's People

Contributors

bchocho, beomyeol, bgchun, chenehk, dongjoon-hyun, dongjun-lee, gwsshs22, gyeongin, hjp615, jieunparklee, johnyangk, jooykim, jsjason, jsryu21, junhoekim, kijungs, mhkwon924, seojangho, swlsw, wonook, wynot12, yunseong

harmony's Issues

Improve usability

We need to clean up the APIs (e.g., launch scripts) and elaborate the documentation (e.g., README).

Change GBT to fit the new Trainer interface

#25 has changed the Trainer interface.
The new interface assumes that there is only one iteration per mini-batch.

However, the GBT app's regression mode runs internal iterations within a mini-batch.
We may change it to batch the communications of the internal iterations.
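
A minimal sketch of one way to do this, assuming a hypothetical pull/push model accessor (only the runMiniBatch method name appears in the actual Trainer interface): the regression mode keeps its internal iterations local, accumulates their updates, and pushes them once per mini-batch.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Illustrative sketch only: GBT regression keeps its internal iterations local
 * and batches their updates into a single push, so one runMiniBatch call maps
 * to a single pull -> comp -> push cycle, as the new Trainer interface assumes.
 * All names except runMiniBatch are hypothetical.
 */
public final class BatchedGbtRegressionSketch {

  /** Hypothetical accessor for the global model kept on the servers. */
  interface ModelAccessor {
    Map<Integer, Double> pull();
    void push(Map<Integer, Double> updates);
  }

  private static final int NUM_INTERNAL_ITERATIONS = 10;
  private final ModelAccessor modelAccessor;

  BatchedGbtRegressionSketch(final ModelAccessor modelAccessor) {
    this.modelAccessor = modelAccessor;
  }

  /** One mini-batch: a single pull, several local iterations, a single push. */
  public void runMiniBatch(final List<double[]> miniBatchData) {
    final Map<Integer, Double> model = modelAccessor.pull();
    final Map<Integer, Double> accumulatedUpdates = new HashMap<>();

    for (int iter = 0; iter < NUM_INTERNAL_ITERATIONS; iter++) {
      // One internal iteration, computed against the locally refined model copy.
      final Map<Integer, Double> updates = computeIteration(model, miniBatchData);
      updates.forEach((key, value) -> {
        accumulatedUpdates.merge(key, value, Double::sum);
        model.merge(key, value, Double::sum); // refine the local copy for the next iteration
      });
    }

    modelAccessor.push(accumulatedUpdates); // one batched communication instead of one per iteration
  }

  private Map<Integer, Double> computeIteration(final Map<Integer, Double> model,
                                                final List<double[]> data) {
    return new HashMap<>(); // placeholder for the actual regression step
  }
}
```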

Handle different in-memory data formats for the same input data

Different apps (e.g., MLR, GBT, Lasso) may use the same input data.
However, in some cases they use different in-memory formats for exactly the same data.

  • MLR, which is for classification tasks, maintains values as integers.
  • Lasso, which is for regression tasks, maintains values as floats.
  • GBT, which covers both classification and regression, maintains values as floats.

This becomes a problem in #21, which makes jobs share the input table for the same input file.

We may fix all of them to store data in a single type (integer or float) and convert it on use.
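
A minimal sketch of the convert-on-use idea, with hypothetical names: the shared table stores one canonical representation (float here) and each app converts at access time.

```java
import java.util.Arrays;

/**
 * Illustrative sketch only (names are hypothetical): keep one canonical
 * in-memory representation (float) in the shared input table and convert on
 * use, so MLR, Lasso, and GBT can all share one table for the same input file.
 */
public final class SharedInputRowSketch {

  private final float[] values; // canonical storage type for the shared table

  public SharedInputRowSketch(final float[] values) {
    this.values = values.clone();
  }

  /** Regression apps (e.g., Lasso, GBT regression) use the values as-is. */
  public float[] getValuesAsFloats() {
    return values.clone();
  }

  /** Classification apps (e.g., MLR) convert to integers on use. */
  public int[] getValuesAsInts() {
    final int[] converted = new int[values.length];
    for (int i = 0; i < values.length; i++) {
      converted[i] = Math.round(values[i]);
    }
    return converted;
  }

  public static void main(final String[] args) {
    final SharedInputRowSketch row = new SharedInputRowSketch(new float[]{1.0f, 3.0f});
    System.out.println(Arrays.toString(row.getValuesAsInts())); // [1, 3]
  }
}
```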

Decompose Trainer task into more fine-grained steps

It's a sub-issue of #23.

The current Trainer interface provides a runMiniBatch method, which runs a mini-batch by itself (e.g., pull -> comp -> push).
However, for jobs to run harmoniously with each other, we need to control them in a more fine-grained manner.
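
A minimal sketch of one possible decomposition, with hypothetical names (only runMiniBatch comes from the current interface): the pull, compute, and push steps become separately invokable, and the old coarse-grained behavior is recovered by running them back-to-back.

```java
import java.util.List;
import java.util.Map;

/**
 * Illustrative sketch only: decompose the single runMiniBatch call into
 * separately invokable steps, so an external scheduler can decide when each
 * job pulls, computes, and pushes. All names except runMiniBatch are hypothetical.
 */
public final class FineGrainedTrainerSketch {

  /** Hypothetical fine-grained trainer contract. */
  interface StepwiseTrainer<D, M> {
    M pullModel();                                                  // step 1: fetch the global model
    Map<Integer, Double> compute(M model, List<D> miniBatchData);   // step 2: local computation
    void pushUpdates(Map<Integer, Double> updates);                 // step 3: send updates
  }

  /** The old coarse-grained behavior is recovered by running the steps back-to-back. */
  static <D, M> void runMiniBatch(final StepwiseTrainer<D, M> trainer, final List<D> data) {
    final M model = trainer.pullModel();
    final Map<Integer, Double> updates = trainer.compute(model, data);
    trainer.pushUpdates(updates);
  }
}
```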

Enable resource sharing across jobs running on JobServer

Currently, each job running on JobServer is allocated its own partition of resources (executors).

We can greatly improve overall job performance by sharing resources between jobs instead of strictly partitioning them.

With this approach, different jobs run on the same executors, fully utilizing resources.

Copy global model when retrieving it from local tablet

In PS-collocation mode, workers retrieve the global model from servers, including their local server.
In that case, ET's get API returns the original objects stored in the table.
This becomes a problem when the returned objects are mutated in the background:
workers may observe an intermittent state of the model values, which violates the expected format.

In summary, workers need to use a copy of the mutable model, which can be updated concurrently by other (remote) worker threads.
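
A minimal sketch of the defensive-copy idea, not ET's actual API: a get served from the local tablet returns a clone taken under the same lock that update uses, so readers never observe a half-applied update.

```java
import java.util.concurrent.ConcurrentHashMap;

/**
 * Illustrative sketch only (not ET's actual API): when a get is served from the
 * collocated local tablet, return a copy of the stored model vector instead of
 * the original object, so concurrent in-place updates by other worker threads
 * cannot expose a half-updated state to the reader.
 */
public final class CopyOnLocalGetSketch {

  private final ConcurrentHashMap<Long, double[]> localTablet = new ConcurrentHashMap<>();

  /** Remote gets already receive a copy via (de)serialization; local gets must copy explicitly. */
  public double[] get(final long key) {
    final double[] stored = localTablet.get(key);
    if (stored == null) {
      return null;
    }
    synchronized (stored) {          // pair with the same lock used by update()
      return stored.clone();         // defensive copy: caller never sees later mutations
    }
  }

  public void update(final long key, final double[] delta) {
    localTablet.compute(key, (k, current) -> {
      if (current == null) {
        return delta.clone();
      }
      synchronized (current) {       // mutate in place under the per-value lock
        for (int i = 0; i < current.length; i++) {
          current[i] += delta[i];
        }
      }
      return current;
    });
  }
}
```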

Introduce GlobalTaskUnitScheduler

Currently, the TaskUnits of multiple jobs are scheduled only by the local TaskUnitScheduler.
This causes a problem: workers run jobs in different orders, which incurs unnecessary synchronization overhead.

We need to schedule them globally.

Generic job-server

We have a Dolphin-specific job server.
We can extend it to support other frameworks as well (e.g., Pregel).

Scheduler for running multiple jobs resource-efficiently

The current JobServer runs jobs with partitioned resources.

However, we can run jobs more efficiently by sharing resources across jobs.
For this, we need to coordinate jobs so that they run harmoniously without contention, maximizing resource utilization.

In detail, we need to do the following:

  • Change the worker trainer task to be controllable in a more fine-grained manner.
  • Introduce a component to control trainer tasks.

Checkpoint local models in LDA and NMF

For offline model evaluation, we need to checkpoint both global and local models.
However, currently only global models are checkpointed.

We need to extend it to cover local models.

Introduce a way to control the degree of asynchronicity across workers (clock slack)

Dolphin only supports totally asynchronous execution (TAP).
We need to extend it to support synchronous execution (BSP).

SSP is a good way to control the degree of synchronicity.

To minimize the implementation effort, we can ignore the cache layer and simply control the progress of mini-batches in workers.

Decoupling progress control from the cache policy also lets our SSP implementation support TAP, which pulls the model at every mini-batch start: slack 0 means BSP and an infinite slack means TAP.

To make it behave like the original SSP, users need to configure the cache layer (e.g., CachedModelAccessor) correspondingly.
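
A minimal sketch of the progress-control part, with hypothetical names: each worker reports a clock tick after every mini-batch and blocks before its next mini-batch whenever it is more than slack clocks ahead of the slowest worker, so slack 0 gives BSP-like lockstep and an effectively unbounded slack gives TAP.

```java
/**
 * Illustrative sketch only: control the degree of asynchronicity with a clock
 * slack, independent of any cache layer. A worker may start its next mini-batch
 * only if it is at most `slack` completed mini-batches ahead of the slowest
 * worker. Slack 0 behaves like BSP; an unbounded slack behaves like TAP.
 */
public final class ClockSlackSketch {

  private final int[] workerClocks;   // number of mini-batches completed by each worker
  private final int slack;            // allowed gap between fastest and slowest worker

  public ClockSlackSketch(final int numWorkers, final int slack) {
    this.workerClocks = new int[numWorkers];
    this.slack = slack;
  }

  /** Called by a worker before starting its next mini-batch; blocks if it is too far ahead. */
  public synchronized void waitUntilAllowed(final int workerId) throws InterruptedException {
    while (workerClocks[workerId] - minClock() > slack) {
      wait();                          // too far ahead of the slowest worker
    }
  }

  /** Called by a worker after finishing a mini-batch. */
  public synchronized void clockTick(final int workerId) {
    workerClocks[workerId]++;
    notifyAll();                       // the slowest worker may have advanced
  }

  private int minClock() {
    int min = Integer.MAX_VALUE;
    for (final int clock : workerClocks) {
      min = Math.min(min, clock);
    }
    return min;
  }
}
```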

Enable data sharing across jobs running on JobServer

Currently, jobs cannot share data (tables).

So each job has to maintain its own tables, even when the table contents are the same.

By enabling table sharing across jobs, we can eliminate the overhead incurred by individual tables (e.g., memory pressure, data loading time).

Optimize local access routine of update and get operation

#10 has fixed a concurrency bug in the local access path of the update operation.
However, it reuses the remote access routine for local access, which has unnecessary overheads (e.g., serialization of the key and update value, passing through the loopback interface).

We can further optimize the local access routine of the update operation.
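
A minimal sketch of a local fast path, not ET's actual code: when the key's block is owned locally, the update function is applied directly to the in-memory value, skipping key/value serialization and the loopback hop; otherwise the existing remote routine is used.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiFunction;

/**
 * Illustrative sketch only (not ET's actual code): a local fast path for the
 * update operation that applies the update function directly to the in-memory
 * value, skipping the key/value serialization and the loopback network hop
 * that the remote access routine would pay.
 */
public final class LocalUpdateFastPathSketch<K, V, U> {

  private final Map<K, V> localBlocks = new ConcurrentHashMap<>();
  private final BiFunction<V, U, V> updateFunction; // expected to handle a null old value

  public LocalUpdateFastPathSketch(final BiFunction<V, U, V> updateFunction) {
    this.updateFunction = updateFunction;
  }

  public void update(final K key, final U updateValue) {
    if (isLocal(key)) {
      // Fast path: no serialization, no loopback; apply the update atomically in place.
      localBlocks.compute(key, (k, oldValue) -> updateFunction.apply(oldValue, updateValue));
    } else {
      sendRemoteUpdate(key, updateValue);  // existing remote routine (serialize + send)
    }
  }

  private boolean isLocal(final K key) {
    return true;  // placeholder: consult the block ownership / routing table
  }

  private void sendRemoteUpdate(final K key, final U updateValue) {
    // placeholder: serialize the key and update value and route them to the owner executor
  }
}
```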

Tasklet-level metric collection service

The metric collection service does not support tasklet granularity.
It only supports one type of metric per executor.

We need to extend it to support different metric types and policies for multiple tasklets within the same executor.

Complete implementation of table sharing

Though #21 has introduced the table sharing feature, it still lacks several details (a possible way to cover both points is sketched after this list):

  • When apps are launched concurrently, only the first job waits for data loading, while the other jobs start training on a table that is not fully loaded yet.
  • When there are no ongoing jobs that use a specific table, we need to drop the table so it does not waste memory resources.
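
A minimal sketch covering both points, with hypothetical names: jobs acquire the shared table through a handle that blocks until loading completes, and the table is dropped when the last job releases it.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Illustrative sketch only (names are hypothetical): a shared-table handle that
 * (1) makes every job wait until the initial data loading has finished and
 * (2) drops the table when the last job that uses it has finished.
 */
public final class SharedTableHandleSketch {

  private final CountDownLatch loadComplete = new CountDownLatch(1);
  private final AtomicInteger refCount = new AtomicInteger(0);

  /** Called by the job that loads the input data, once loading has finished. */
  public void markLoadComplete() {
    loadComplete.countDown();
  }

  /** Every job acquires the table before training; later jobs block until loading is done. */
  public void acquire() throws InterruptedException {
    refCount.incrementAndGet();
    loadComplete.await();
  }

  /** Each job releases the table when it finishes; the last release drops the table. */
  public void release() {
    if (refCount.decrementAndGet() == 0) {
      dropTable();
    }
  }

  private void dropTable() {
    // placeholder: free the memory used by the table's blocks
  }
}
```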

Separate out mutable state from input tables

Currently, Dolphin maintains mutable worker state and immutable input data together in the same table.

This is convenient for applications, but it makes data sharing across jobs impossible:
the immutable data can be shared, but the mutable data that maintains worker state must be kept separate per worker.
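
A minimal sketch of the separation, with hypothetical names: the immutable parsed rows live in a shareable table, while each worker keeps its mutable per-row state in its own local structure keyed by the same row IDs.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative sketch only (names are hypothetical): the immutable parsed input
 * rows live in a table that can be shared across jobs, while each worker keeps
 * its own mutable per-row state in a separate local structure keyed by the same
 * row IDs, so sharing the input never exposes one job's state to another.
 */
public final class StateSeparationSketch {

  /** Immutable, shareable input row. */
  public static final class InputRow {
    private final float[] features;
    private final float label;

    public InputRow(final float[] features, final float label) {
      this.features = features.clone();
      this.label = label;
    }

    public float[] getFeatures() {
      return features.clone();
    }

    public float getLabel() {
      return label;
    }
  }

  /** Shared and read-only across jobs (e.g., backed by the shared input table). */
  private final Map<Long, InputRow> sharedInputTable = new HashMap<>();

  /** Worker-local and mutable (e.g., per-row assignments or residuals). */
  private final Map<Long, double[]> localWorkerState = new HashMap<>();
}
```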
