snuspl / nemo
Nemo: A flexible data processing system
Home Page: https://snuspl.github.io/nemo/
License: Apache License 2.0
Create simple Master/Executor abstractions that exchange messages with each other.
Please refer to the previous version of Vortex for their class structures and the types of messages exchanged.
Modify SimpleEngine to use the Master/Executor abstractions.
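As a starting point, the Master/Executor exchange could be sketched roughly as below; all class and message names here are hypothetical placeholders, not the actual Vortex classes.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// A minimal sketch of the Master/Executor message exchange (all names hypothetical).
final class MasterExecutorSketch {
  // A message sent between the Master and an Executor.
  static final class ControlMessage {
    enum Type { SCHEDULE_TASK, TASK_COMPLETED }
    final Type type;
    final String taskId;
    ControlMessage(final Type type, final String taskId) {
      this.type = type;
      this.taskId = taskId;
    }
  }

  // The Master hands tasks to Executors and collects completion reports.
  static final class Master {
    final Queue<ControlMessage> inbox = new ArrayDeque<>();
    void onMessage(final ControlMessage msg) { inbox.add(msg); }
  }

  // An Executor runs a scheduled task and reports completion back to the Master.
  static final class Executor {
    private final Master master;
    Executor(final Master master) { this.master = master; }
    void onMessage(final ControlMessage msg) {
      if (msg.type == ControlMessage.Type.SCHEDULE_TASK) {
        // ... execute the task here, then report completion.
        master.onMessage(new ControlMessage(ControlMessage.Type.TASK_COMPLETED, msg.taskId));
      }
    }
  }

  // Demonstrates one round trip: schedule a task, receive its completion report.
  static String runOnce() {
    final Master master = new Master();
    final Executor executor = new Executor(master);
    executor.onMessage(new ControlMessage(ControlMessage.Type.SCHEDULE_TASK, "task-0"));
    return master.inbox.peek().taskId;
  }
}
```

SimpleEngine could then drive execution solely through these two abstractions, without knowing how the messages travel.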
There are various types of tasks. (ex. Do, Merge, Partition)
However, tasks must be clearly defined with a thorough review of how varying jobs that can be run using these task definitions.
Currently, intermediate data and their shuffle/broadcast are managed inside SimpleEngine.
Let's extract the related code into a separate sub-package called shuffle
and hide the details behind APIs.
It would be great if the APIs were flexible and pluggable, so that we can use the same code in a distributed environment (REEF) with only different implementations.
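A minimal sketch of what such a pluggable API might look like, assuming a hypothetical ShuffleChannel interface with a local in-memory implementation (a REEF-backed implementation could plug in behind the same interface):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical pluggable shuffle API: SimpleEngine would program against this
// interface, so a distributed implementation can be swapped in later.
interface ShuffleChannel<K, V> {
  void write(K key, V value);
  List<V> read(K key);
}

// Local in-memory implementation for the single-machine SimpleEngine.
final class LocalShuffleChannel<K, V> implements ShuffleChannel<K, V> {
  private final Map<K, List<V>> store = new HashMap<>();

  public void write(final K key, final V value) {
    store.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
  }

  public List<V> read(final K key) {
    return store.getOrDefault(key, new ArrayList<>());
  }
}
```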
A web UI for visualizing Vortex job executions.
It would be similar to the Spark UI in its high-level features (e.g., visualizing the DAG, progress, faults, tasks, streaming), but different in the elements that construct them (e.g., Optimizer, Task/Channel, State Machines).
The compiler's backend is responsible for converting the IR representation into an ExecutionPlan executable by the Vortex Runtime. A simple example was introduced in #43, and we can use it to introduce a simple version of the backend.
Create an example user code for multi-windowing in edu.snu.vortex.examples.beam.
Please use Java 8 as much as possible. :)
We can make Jenkins show our build status, so that we don't miss a bad pull request once it is merged into the master branch.
Specify types (e.g., storage, compute, transient, reserved) when requesting containers from the resource manager (RM).
However, REEF, which we use to communicate with the RM, does not support this. One simple approach that requires minimal modifications to REEF is to use the node-labelling features provided by RMs, simply assuming that datacenter operators statically pre-label each node with its type.
We can add a String field for the node label to REEF's EvaluatorRequest, and modify REEF's YARN/Mesos runtimes to use the information appropriately. Then, in Vortex, we can simply set the node-label field in the EvaluatorRequest when requesting new Evaluators.
I wonder whether node labels can also be used to request dynamically labelled containers (not nodes), such as Mesos's revocable containers (http://mesos.apache.org/documentation/latest/oversubscription). This might make a good discussion topic in the REEF community.
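To illustrate the proposed change, a request builder carrying a node-label field might look roughly like this; note that this only mirrors the shape of REEF's EvaluatorRequest and is not its actual API:

```java
// Hypothetical sketch of an EvaluatorRequest-like class extended with a
// node-label field, as proposed above. Not REEF's actual API.
final class LabeledEvaluatorRequest {
  final int number;
  final int megaBytes;
  final String nodeLabel; // e.g., "transient" or "reserved"; null if unlabeled

  private LabeledEvaluatorRequest(final int number, final int megaBytes, final String nodeLabel) {
    this.number = number;
    this.megaBytes = megaBytes;
    this.nodeLabel = nodeLabel;
  }

  static final class Builder {
    private int number = 1;
    private int megaBytes = 128;
    private String nodeLabel = null;

    Builder setNumber(final int n) { this.number = n; return this; }
    Builder setMemory(final int mb) { this.megaBytes = mb; return this; }
    // The proposed addition: pass the node label through to the YARN/Mesos runtime.
    Builder setNodeLabel(final String label) { this.nodeLabel = label; return this; }

    LabeledEvaluatorRequest build() {
      return new LabeledEvaluatorRequest(number, megaBytes, nodeLabel);
    }
  }
}
```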
It should be explicit that only certain attribute values can be mapped to attribute keys.
We can do it in two ways:
graph pass
The Vortex Compiler is composed of three components: the Frontend, the Optimizer, and the Backend. The structure is similar to that of the LLVM compiler.
We will create a Vortex IR from the given DAG of a Beam program through the Frontend, process and optimize it, and pass it on to the Backend, which transforms the Vortex IR into an ExecutionPlan to be received and processed by the Vortex Runtime (#9).
The main job of the Vortex Compiler is to label each of the vertices and edges of the DAG with specific attributes, including:
The placement/labeling algorithm/policy will be pluggable and decided by the user, customized for each usage and environment. While labeling each node and edge, it will also check whether there are any anomalies in the DAG. Our previous transient/reserved-aware implementation of Vortex will use the algorithm shown in the paper. The specifications and details of the policy are under the compiler.optimizer.passes package. Each pass receives a DAG and outputs another DAG, tagged with attributes.
The Compiler can further optimize the DAG so that it runs efficiently on the Runtime layer, by adding/merging/removing operators, modifying edges, and tweaking system attributes (like FlumeJava, etc.). This would also ideally be done during the Runtime.
Then, the Compiler Backend splits the DAG into Vortex Stages (#73).
Details and the following sub-issues will be updated.
Add more sub-issues
#12 Sink Node
#25 Compiler interfaces
#76 new IR
#29 Configurable Optimizer
#21 Refactor Attributes class
#13 Join Node
#22 DAG Integrity check
#14 Multi-Output Do node
#28 VortexBackend
#30 Tang to parse user arguments
#31 Interfaces for Runtime Optimization
#56 More instantiation policies
#36 Stream support
Translate Beam's Sink Node into Vortex's Sink Node. Please provide Beam program examples for testing the translation.
Also checking if webhooks are working
Introduce a simple Task class with the following specifications.
Please change SimpleEngine to use your Task class to execute Vortex DAGs. Make sure the engine correctly runs the Beam examples after the change.
After #29, we want to receive the optimization policy to run the program with as a parameter.
Currently, the code needs some cleanup and restructuring to reflect the changes discussed.
Also, edge attributes need to be added to clarify the DAG.
Task states are roughly defined (e.g., READY, SCHEDULED, RUNNING, COMPLETE), but state transitions are implemented rather vaguely, without the explicit use of a state machine.
Use a state machine to formally manage task states in the Runtime.
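A minimal sketch of such a state machine, using the states from the example above plus a FAILED state; the transition table itself is an assumption, not Vortex's actual one:

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Task states from the issue's example, plus FAILED (an assumption).
enum TaskState { READY, SCHEDULED, RUNNING, COMPLETE, FAILED }

// An explicit state machine: every transition is checked against a table,
// so illegal transitions fail loudly instead of silently corrupting state.
final class TaskStateMachine {
  private static final Map<TaskState, Set<TaskState>> LEGAL = new EnumMap<>(TaskState.class);
  static {
    LEGAL.put(TaskState.READY, EnumSet.of(TaskState.SCHEDULED));
    LEGAL.put(TaskState.SCHEDULED, EnumSet.of(TaskState.RUNNING, TaskState.FAILED));
    LEGAL.put(TaskState.RUNNING, EnumSet.of(TaskState.COMPLETE, TaskState.FAILED));
    LEGAL.put(TaskState.COMPLETE, EnumSet.noneOf(TaskState.class)); // terminal
    LEGAL.put(TaskState.FAILED, EnumSet.noneOf(TaskState.class));   // terminal
  }

  private TaskState current = TaskState.READY;

  TaskState getCurrent() { return current; }

  // Moves to the next state, rejecting illegal transitions explicitly.
  void transitionTo(final TaskState next) {
    if (!LEGAL.get(current).contains(next)) {
      throw new IllegalStateException(current + " -> " + next + " is not allowed");
    }
    current = next;
  }
}
```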
Beam's 0.4.0-incubating release is out.
However, it is not yet uploaded to maven central (https://mvnrepository.com/artifact/org.apache.beam/beam-sdks-java-core).
Let's upgrade our dependency when it is uploaded so that our users don't have to manually install the snapshot version.
Reference:
https://github.com/cmssnu/pado/tree/master/bin
https://github.com/cmssnu/pado/tree/master/src/main/java/edu/snu/cay/vortex/beam/applications
We currently have:
RtOperator is the runtime's version of an operator. It must have a way of receiving the user-defined function from the compiler to execute.
The current optimizer statically applies the placement optimization.
Let's make it configurable so that the compiler can apply arbitrary optimizations (i.e., DAG passes) specified by the user.
It might be a good idea to introduce a new package (e.g., edu.snu.vortex.compiler.optimizer.pass) and keep all the pass-related code in it.
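One possible shape for such a pass-based optimizer, with hypothetical names: a pass maps a DAG to a new (possibly attribute-tagged) DAG, and the optimizer simply applies a user-specified list of passes in order.

```java
import java.util.List;

// Hypothetical pass interface: each pass transforms a DAG into a new DAG.
interface Pass<DAG> {
  DAG apply(DAG dag);
}

// A configurable optimizer that applies user-supplied passes in sequence,
// instead of hard-coding the placement optimization.
final class ConfigurableOptimizer<DAG> {
  private final List<Pass<DAG>> passes;

  ConfigurableOptimizer(final List<Pass<DAG>> passes) {
    this.passes = passes;
  }

  DAG optimize(DAG dag) {
    for (final Pass<DAG> pass : passes) {
      dag = pass.apply(dag); // each pass sees the output of the previous one
    }
    return dag;
  }
}
```

The DAG type is a generic parameter here only to keep the sketch self-contained; the real code would use Vortex's own DAG class.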
Be sure to:
We will need a layer to translate Beam programs into Vortex DAGs, which follow our style.
These DAGs will later be received and processed by the Vortex Compiler (#8).
Details and the following sub-issues will be updated.
Create an initial working code for us to build upon.
An initial setup of the Runtime must be implemented, with basic interfaces designed between the modularized components.
The Runtime has an interface class SchedulingPolicy defined.
A simple, naive round-robin scheduling policy is currently used in the simple Master's scheduler.
More practical scheduling policies must be implemented for general use.
Moreover, scheduling policies tuned to job characteristics are preferable.
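For illustration, the current round-robin behavior behind a SchedulingPolicy-style interface might look like this; the interface shape here is an assumption, not the actual class:

```java
import java.util.List;

// Hypothetical shape of the SchedulingPolicy interface: given the available
// executors, pick the one that should run the next task.
interface SchedulingPolicy {
  String selectExecutor(List<String> executorIds);
}

// The simple, naive round-robin policy: cycle through executors in order.
final class RoundRobinPolicy implements SchedulingPolicy {
  private int next = 0;

  public String selectExecutor(final List<String> executorIds) {
    final String chosen = executorIds.get(next % executorIds.size());
    next++;
    return chosen;
  }
}
```

A more practical policy would implement the same interface but consider, e.g., data locality or executor load.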
Compiler-related code is currently intermixed with translator/engine code.
Let's move them into a separate package and introduce APIs around them.
We should import REEF and run our code on top of it to make quite a few things easier :)
A new compiler frontend that translates Spark programs into the Vortex IR.
Given the labeled and split Vortex DAG, which is processed by the Vortex Compiler (#8), the Vortex Runtime will run the given DAG at the physical level. Its main components and contributions are as follows:
Given the Nodes and Edges stated in #8, the Runtime executes them appropriately.
Details and the following sub-issues will be updated.
Let's perform integrity checks on graphs upon their initialization as well as manipulation.
The checks should include
As a start, we can inject the checker at the beginning of DAGBuilder#build.
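A sketch of one such check, cycle detection over a simplified vertex-to-neighbors representation; the real DAGBuilder would of course operate on Vortex's own graph classes:

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Integrity check sketch: verifies that a graph (vertex -> outgoing neighbors)
// contains no cycles, using DFS with visiting/done sets (three-color DFS).
final class DagIntegrityChecker {

  static boolean isAcyclic(final Map<String, List<String>> adjacency) {
    final Set<String> visiting = new HashSet<>(); // on the current DFS path
    final Set<String> done = new HashSet<>();     // fully explored
    for (final String v : adjacency.keySet()) {
      if (hasCycle(v, adjacency, visiting, done)) {
        return false;
      }
    }
    return true;
  }

  private static boolean hasCycle(final String v,
                                  final Map<String, List<String>> adj,
                                  final Set<String> visiting,
                                  final Set<String> done) {
    if (done.contains(v)) {
      return false;
    }
    if (!visiting.add(v)) {
      return true; // back edge: v is already on the current path
    }
    for (final String next : adj.getOrDefault(v, Collections.emptyList())) {
      if (hasCycle(next, adj, visiting, done)) {
        return true;
      }
    }
    visiting.remove(v);
    done.add(v);
    return false;
  }
}
```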
Let's bring in the style checker we had in pre-vortex.
Currently, we assume that the compiler optimization happens just once, before the job commences.
Let's allow the optimization to happen multiple times, at runtime. We need to carefully think about how the interfaces between different components in the system should change.
The execution flow might look like this: The engine feeds runtime metrics into the compiler optimizer, which outputs a new IR for the compiler backend. The compiler backend then manipulates the JobDAG, with which the engine resumes execution.
O2O --> OneToOne
O2M --> Broadcast
M2M --> ScatterGather
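The renames above could be captured in a single enum, for example (the enum name here is hypothetical):

```java
// Hypothetical enum capturing the renamed edge communication patterns.
enum CommunicationPattern {
  ONE_TO_ONE,     // was O2O
  BROADCAST,      // was O2M
  SCATTER_GATHER  // was M2M
}
```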
Compiler and Runtime both have "DAG" implementations. We can clean up and merge the code.
In the client JobLauncher, we assume that the first argument is the user main class and the rest are the user main arguments.
Let's replace this assumption with Tang. Then, the user will be able to specify other types of configurations (e.g., compiler types, resource types, etc.).
VortexJobLauncher in the previous version of Vortex is a good reference for implementing this.
Currently, all of the Beam Result's APIs throw UnsupportedOperationException. Let's implement the APIs so that they correctly report the job status.
Implement stage partitioning in the optimizer of the compiler.
Currently, we have one big pom.xml at the root directory.
Let's use one pom.xml per sub-package (beam/dag/engine) and set up the dependencies as follows.
Support Beam's ParDo.UnboundMulti.
My guess is that we will have multiple edges coming out of a single Vortex Do node. We will then somehow need to match each output to an edge using Beam's multi-output tags.
It might be helpful to take a look at how Beam's side input (which sort of is a multi-input counterpart) is translated into Vortex's Broadcast.
Currently, most of the interfaces throw UnsupportedOperationException without any further explanation. Let's leave comments or exception messages; for example, we can say that streaming is currently not supported.
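For example, an unsupported streaming hook could fail with an explanatory message rather than a bare exception; the class and method below are purely illustrative:

```java
// Illustrative only: an unimplemented streaming-related method that explains
// what is unsupported and why, instead of throwing a bare exception.
final class WindowedOperatorStub {
  void onWatermark() {
    throw new UnsupportedOperationException(
        "Streaming is not yet supported: watermark handling is unimplemented.");
  }
}
```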
The simple engine we have assumes everything runs on a single computer; thus, it never serializes/deserializes anything.
But that's not the case in a distributed environment.
With #15, #16, and #17 in place, let's ser/des code/data in the message exchanges between the master and executors. The translator, and then the compiler, should pass the required class/codec information down to the runtime.
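As a placeholder until the compiler passes down proper codecs, plain Java serialization could ser/des message payloads; a minimal sketch:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Sketch of ser/des for master/executor message payloads using plain Java
// serialization; a codec supplied by the translator/compiler could replace it.
final class SerDe {

  static byte[] serialize(final Serializable obj) {
    final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(obj);
    } catch (final IOException e) {
      throw new RuntimeException(e);
    }
    return bytes.toByteArray();
  }

  static Object deserialize(final byte[] data) {
    try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
      return in.readObject();
    } catch (final IOException | ClassNotFoundException e) {
      throw new RuntimeException(e);
    }
  }
}
```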
The runtime execution plan must be generated in the compiler's backend.
The Runtime must provide APIs to generate the execution plan, including the runtime's operators, edges, and attributes that correspond to those of the IR.
We've discussed the overall architecture of our new version of Vortex, consisting of the translator layer, the compilation layer, and lastly, the runtime layer.
Keeping this in mind, do you have any opinions about our code structure?
@johnyangk @gwsshs22
In order to make Vortex as extensible as possible, the set of attributes used to decide how Runtime executes jobs must be made extensible as well.
Runtime currently has a fixed set of attributes; new attributes must be able to be added flexibly.
Support Join in Vortex DAG, and translate Beam's CoGroupByKey into it.
In the PR, please provide Beam program examples for testing the code.
Implement VortexBackend, a backend that converts the IR representation into an ExecutionPlan executable by the Vortex Runtime.
Implement stage partitioning
Currently, Jenkins seems to be having some trouble, as it doesn't have Beam compiled on the machine. I'll try to fix this ASAP so we can use our CI functionality.