snuspl / nemo
Nemo: A flexible data processing system
Home Page: https://snuspl.github.io/nemo/
License: Apache License 2.0
Create simple Master/Executor abstractions that exchange messages with each other.
Please refer to the previous version of Vortex for their class structures and the types of messages exchanged.
Modify SimpleEngine to use the Master/Executor abstractions.
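As a starting point, the Master/Executor exchange could be sketched roughly as below; all class and message names here are hypothetical placeholders, not the actual Vortex classes.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// A minimal sketch of the Master/Executor message exchange (all names hypothetical).
final class MasterExecutorSketch {
  // A message sent between the Master and an Executor.
  static final class ControlMessage {
    enum Type { SCHEDULE_TASK, TASK_COMPLETED }
    final Type type;
    final String taskId;
    ControlMessage(final Type type, final String taskId) {
      this.type = type;
      this.taskId = taskId;
    }
  }

  // The Master hands tasks to Executors and collects completion reports.
  static final class Master {
    final Queue<ControlMessage> inbox = new ArrayDeque<>();
    void onMessage(final ControlMessage msg) { inbox.add(msg); }
  }

  // An Executor runs a scheduled task and reports completion back to the Master.
  static final class Executor {
    private final Master master;
    Executor(final Master master) { this.master = master; }
    void onMessage(final ControlMessage msg) {
      if (msg.type == ControlMessage.Type.SCHEDULE_TASK) {
        // ... execute the task here, then report completion.
        master.onMessage(new ControlMessage(ControlMessage.Type.TASK_COMPLETED, msg.taskId));
      }
    }
  }

  // Demonstrates one round trip: schedule a task, receive its completion report.
  static String runOnce() {
    final Master master = new Master();
    final Executor executor = new Executor(master);
    executor.onMessage(new ControlMessage(ControlMessage.Type.SCHEDULE_TASK, "task-0"));
    return master.inbox.peek().taskId;
  }
}
```

SimpleEngine could then drive execution solely through these two abstractions, without knowing how the messages travel.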
There are various types of tasks. (ex. Do, Merge, Partition)
However, tasks must be clearly defined with a thorough review of how varying jobs that can be run using these task definitions.
Currently, intermediate data and their shuffle/broadcast are managed inside SimpleEngine.
Let's extract the related code into a separate sub-package called shuffle
and hide the details behind APIs.
It would be great if the APIs were flexible and pluggable, so that we can use the same code in a distributed environment (REEF) with only different implementations.
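A minimal sketch of what such a pluggable API might look like, assuming a hypothetical ShuffleChannel interface with a local in-memory implementation (a REEF-backed implementation could plug in behind the same interface):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical pluggable shuffle API: SimpleEngine would program against this
// interface, so a distributed implementation can be swapped in later.
interface ShuffleChannel<K, V> {
  void write(K key, V value);
  List<V> read(K key);
}

// Local in-memory implementation for the single-machine SimpleEngine.
final class LocalShuffleChannel<K, V> implements ShuffleChannel<K, V> {
  private final Map<K, List<V>> store = new HashMap<>();

  public void write(final K key, final V value) {
    store.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
  }

  public List<V> read(final K key) {
    return store.getOrDefault(key, new ArrayList<>());
  }
}
```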
A web UI for visualizing Vortex job executions.
It would be similar to the Spark UI in its high-level features (e.g., visualizing the DAG, progress, faults, tasks, streaming), but different in the elements that construct them (e.g., Optimizer, Task/Channel, State Machines).
The compiler's backend is responsible for converting the IR representation into an ExecutionPlan executable by the Vortex Runtime. A simple example was introduced in #43, and we can use it to introduce a simple version of the backend.
Create an example user code for multi-windowing in edu.snu.vortex.examples.beam.
Please use Java 8 as much as possible. :)
We can make Jenkins show our build status, so that we don't miss a bad pull request once it is merged into the master branch.
Specify types (e.g., storage, compute, transient, reserved) when requesting containers from the resource manager (RM).
However, REEF, which we use to communicate with the RM, does not support this. One simple approach that requires minimal modifications to REEF is to use the node-labelling features provided by RMs, simply assuming that datacenter operators statically pre-label each node with its type.
We can add a String field for the node label to REEF's EvaluatorRequest, and modify REEF's YARN/Mesos runtimes to use the information appropriately. Then, in Vortex, we can simply set the node-label field in the EvaluatorRequest when requesting new Evaluators.
I wonder whether node labels can also be used to request dynamically labelled containers (not nodes), such as Mesos's revocable containers (http://mesos.apache.org/documentation/latest/oversubscription). This might make a good discussion topic in the REEF community.
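To illustrate the proposed change, a request builder carrying a node-label field might look roughly like this; note that this only mirrors the shape of REEF's EvaluatorRequest and is not its actual API:

```java
// Hypothetical sketch of an EvaluatorRequest-like class extended with a
// node-label field, as proposed above. Not REEF's actual API.
final class LabeledEvaluatorRequest {
  final int number;
  final int megaBytes;
  final String nodeLabel; // e.g., "transient" or "reserved"; null if unlabeled

  private LabeledEvaluatorRequest(final int number, final int megaBytes, final String nodeLabel) {
    this.number = number;
    this.megaBytes = megaBytes;
    this.nodeLabel = nodeLabel;
  }

  static final class Builder {
    private int number = 1;
    private int megaBytes = 128;
    private String nodeLabel = null;

    Builder setNumber(final int n) { this.number = n; return this; }
    Builder setMemory(final int mb) { this.megaBytes = mb; return this; }
    // The proposed addition: pass the node label through to the YARN/Mesos runtime.
    Builder setNodeLabel(final String label) { this.nodeLabel = label; return this; }

    LabeledEvaluatorRequest build() {
      return new LabeledEvaluatorRequest(number, megaBytes, nodeLabel);
    }
  }
}
```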
It should be explicit that only certain attribute values can be mapped to attribute keys.
We can do it in two ways:
graph pass
The Vortex Compiler is composed of three components: the Frontend, the Optimizer, and the Backend. The structure is similar to that of the LLVM compiler.
We will create a Vortex IR from the given DAG of a Beam program through the Frontend, process and optimize it, and pass it on to the Backend, which transforms the Vortex IR into an ExecutionPlan to be received and processed by the Vortex Runtime (#9).
The main job of the Vortex Compiler is to label each of the vertices and edges of the DAG with specific attributes, including:
The placement/labeling algorithm/policy will be pluggable and decided by the user, customized for each usage and environment. While labeling each node and edge, it will also check whether there are any anomalies in the DAG. Our previous transient/reserved-aware implementation of Vortex will use the algorithm shown in the paper. The specifications and details of the policy are under the compiler.optimizer.passes package. Each pass receives a DAG and outputs another DAG, tagged with attributes.
The Compiler can further optimize the DAG so that it runs efficiently on the Runtime layer, by adding/merging/removing operators, modifying edges, and tweaking system attributes (like FlumeJava, etc.). This would also ideally be done during the Runtime.
Then, the Compiler Backend splits the DAG into Vortex Stages (#73).
Details and the following sub-issues will be updated.
Add more sub-issues
#12 Sink Node
#25 Compiler interfaces
#76 new IR
#29 Configurable Optimizer
#21 Refactor Attributes class
#13 Join Node
#22 DAG Integrity check
#14 Multi-Output Do node
#28 VortexBackend
#30 Tang to parse user arguments
#31 Interfaces for Runtime Optimization
#56 More instantiation policies
#36 Stream support
Translate Beam's Sink Node into Vortex's Sink Node. Please provide Beam program examples for testing the translation.
Also checking if webhooks are working
Introduce a simple Task class with the following specifications.
Please change SimpleEngine to use your Task class to execute Vortex DAGs. Make sure the engine correctly runs the Beam examples after the change.
After #29, we want to receive the optimization policy to run the program with as a parameter.
Currently, the code needs some cleanup and restructuring to reflect the changes discussed.
Also, edge attributes need to be added to clarify the DAG.
Task states are roughly defined (e.g., READY, SCHEDULED, RUNNING, COMPLETE), but state transitions are implemented rather vaguely, without the explicit use of a state machine.
Use a state machine to formally manage task states in the Runtime.
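A minimal sketch of such a state machine, using the states from the example above plus a FAILED state; the transition table itself is an assumption, not Vortex's actual one:

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Task states from the issue's example, plus FAILED (an assumption).
enum TaskState { READY, SCHEDULED, RUNNING, COMPLETE, FAILED }

// An explicit state machine: every transition is checked against a table,
// so illegal transitions fail loudly instead of silently corrupting state.
final class TaskStateMachine {
  private static final Map<TaskState, Set<TaskState>> LEGAL = new EnumMap<>(TaskState.class);
  static {
    LEGAL.put(TaskState.READY, EnumSet.of(TaskState.SCHEDULED));
    LEGAL.put(TaskState.SCHEDULED, EnumSet.of(TaskState.RUNNING, TaskState.FAILED));
    LEGAL.put(TaskState.RUNNING, EnumSet.of(TaskState.COMPLETE, TaskState.FAILED));
    LEGAL.put(TaskState.COMPLETE, EnumSet.noneOf(TaskState.class)); // terminal
    LEGAL.put(TaskState.FAILED, EnumSet.noneOf(TaskState.class));   // terminal
  }

  private TaskState current = TaskState.READY;

  TaskState getCurrent() { return current; }

  // Moves to the next state, rejecting illegal transitions explicitly.
  void transitionTo(final TaskState next) {
    if (!LEGAL.get(current).contains(next)) {
      throw new IllegalStateException(current + " -> " + next + " is not allowed");
    }
    current = next;
  }
}
```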
Beam's 0.4.0-incubating release is out.
However, it is not yet uploaded to maven central (https://mvnrepository.com/artifact/org.apache.beam/beam-sdks-java-core).
Let's upgrade our dependency when it is uploaded so that our users don't have to manually install the snapshot version.
Reference:
https://github.com/cmssnu/pado/tree/master/bin
https://github.com/cmssnu/pado/tree/master/src/main/java/edu/snu/cay/vortex/beam/applications
We currently have:
RtOperator is the runtime's version of an operator. It must have a way of receiving the user-defined function from the compiler to execute.
The current optimizer statically applies the placement optimization.
Let's make it configurable so that the compiler can apply arbitrary optimizations (i.e., DAG passes) specified by the user.
It might be a good idea to introduce a new package (e.g., edu.snu.vortex.compiler.optimizer.pass) and keep all the pass-related code in it.
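One possible shape for such a pass-based optimizer, with hypothetical names: a pass maps a DAG to a new (possibly attribute-tagged) DAG, and the optimizer simply applies a user-specified list of passes in order.

```java
import java.util.List;

// Hypothetical pass interface: each pass transforms a DAG into a new DAG.
interface Pass<DAG> {
  DAG apply(DAG dag);
}

// A configurable optimizer that applies user-supplied passes in sequence,
// instead of hard-coding the placement optimization.
final class ConfigurableOptimizer<DAG> {
  private final List<Pass<DAG>> passes;

  ConfigurableOptimizer(final List<Pass<DAG>> passes) {
    this.passes = passes;
  }

  DAG optimize(DAG dag) {
    for (final Pass<DAG> pass : passes) {
      dag = pass.apply(dag); // each pass sees the output of the previous one
    }
    return dag;
  }
}
```

The DAG type is a generic parameter here only to keep the sketch self-contained; the real code would use Vortex's own DAG class.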
Be sure to:
We will need a layer to translate Beam programs into Vortex DAGs, which follow our style.
These DAGs will later be received and processed by the Vortex Compiler (#8).
Details and the following sub-issues will be updated.
Create an initial working code for us to build upon.
An initial setup of the Runtime must be implemented, with basic interfaces designed between the modularized components.
The Runtime has an interface class SchedulingPolicy defined.
A simple, naive round-robin scheduling policy is currently used in the simple Master's scheduler.
More practical scheduling policies must be implemented for general use.
Moreover, scheduling policies tuned to job characteristics are preferable.
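For illustration, the current round-robin behavior behind a SchedulingPolicy-style interface might look like this; the interface shape here is an assumption, not the actual class:

```java
import java.util.List;

// Hypothetical shape of the SchedulingPolicy interface: given the available
// executors, pick the one that should run the next task.
interface SchedulingPolicy {
  String selectExecutor(List<String> executorIds);
}

// The simple, naive round-robin policy: cycle through executors in order.
final class RoundRobinPolicy implements SchedulingPolicy {
  private int next = 0;

  public String selectExecutor(final List<String> executorIds) {
    final String chosen = executorIds.get(next % executorIds.size());
    next++;
    return chosen;
  }
}
```

A more practical policy would implement the same interface but consider, e.g., data locality or executor load.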
Compiler-related code is currently intermixed with translator/engine code.
Let's move them into a separate package and introduce APIs around them.
We should import REEF and run our code on top of it to make quite a few things easier :)
A new compiler frontend that translates Spark programs into the Vortex IR.
Given the labeled and split Vortex DAG, which is processed by the Vortex Compiler (#8), the Vortex Runtime will run the given DAG at the physical level. Its main components and contributions are as follows:
Given the Nodes and Edges stated in #8, the Runtime executes them appropriately.
Details and the following sub-issues will be updated.
Let's perform integrity checks on graphs upon their initialization as well as manipulation.
The checks should include
As a start, we can inject the checker at the beginning of DAGBuilder#build.
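A sketch of one such check, cycle detection over a simplified vertex-to-neighbors representation; the real DAGBuilder would of course operate on Vortex's own graph classes:

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Integrity check sketch: verifies that a graph (vertex -> outgoing neighbors)
// contains no cycles, using DFS with visiting/done sets (three-color DFS).
final class DagIntegrityChecker {

  static boolean isAcyclic(final Map<String, List<String>> adjacency) {
    final Set<String> visiting = new HashSet<>(); // on the current DFS path
    final Set<String> done = new HashSet<>();     // fully explored
    for (final String v : adjacency.keySet()) {
      if (hasCycle(v, adjacency, visiting, done)) {
        return false;
      }
    }
    return true;
  }

  private static boolean hasCycle(final String v,
                                  final Map<String, List<String>> adj,
                                  final Set<String> visiting,
                                  final Set<String> done) {
    if (done.contains(v)) {
      return false;
    }
    if (!visiting.add(v)) {
      return true; // back edge: v is already on the current path
    }
    for (final String next : adj.getOrDefault(v, Collections.emptyList())) {
      if (hasCycle(next, adj, visiting, done)) {
        return true;
      }
    }
    visiting.remove(v);
    done.add(v);
    return false;
  }
}
```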
Let's bring in the style checker we had in pre-vortex.
Currently, we assume that the compiler optimization happens just once, before the job commences.
Let's allow the optimization to happen multiple times, at runtime. We need to carefully think about how the interfaces between different components in the system should change.
The execution flow might look like this: The engine feeds runtime metrics into the compiler optimizer, which outputs a new IR for the compiler backend. The compiler backend then manipulates the JobDAG, with which the engine resumes execution.
O2O --> OneToOne
O2M --> Broadcast
M2M --> ScatterGather
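The renames above could be captured in a single enum, for example (the enum name here is hypothetical):

```java
// Hypothetical enum capturing the renamed edge communication patterns.
enum CommunicationPattern {
  ONE_TO_ONE,     // was O2O
  BROADCAST,      // was O2M
  SCATTER_GATHER  // was M2M
}
```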
Compiler and Runtime both have "DAG" implementations. We can clean up and merge the code.
In the client JobLauncher, we assume that the first argument is the user main class and the rest are the user main arguments.
Let's replace this assumption with Tang. Then, the user will be able to specify other types of configurations (e.g., compiler types, resource types, etc.).
VortexJobLauncher in the previous version of Vortex is a good reference for implementing this.
Currently, all of the Beam Result's APIs throw UnsupportedOperationException. Let's implement the APIs so that they correctly report the job status.
Implement stage partitioning in the optimizer of the compiler.
Currently, we have one big pom.xml at the root directory.
Let's use one pom.xml per sub-package (beam/dag/engine) and set up the dependencies as follows.
Support Beam's ParDo.UnboundMulti.
My guess is that we will have multiple edges coming out of a single Vortex Do node. We will then somehow need to match each output to an edge using Beam's multi-output tags.
It might be helpful to take a look at how Beam's side input (which sort of is a multi-input counterpart) is translated into Vortex's Broadcast.
Currently, most of the interfaces throw UnsupportedOperationException without any further explanation. Let's leave comments or exception messages; for example, we can say that streaming is currently not supported.
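For example, an unsupported streaming hook could fail with an explanatory message rather than a bare exception; the class and method below are purely illustrative:

```java
// Illustrative only: an unimplemented streaming-related method that explains
// what is unsupported and why, instead of throwing a bare exception.
final class WindowedOperatorStub {
  void onWatermark() {
    throw new UnsupportedOperationException(
        "Streaming is not yet supported: watermark handling is unimplemented.");
  }
}
```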
The simple engine we have assumes everything runs on a single computer; thus, it never serializes/deserializes anything.
But that's not the case in a distributed environment.
With #15, #16, and #17 in place, let's ser/des code/data in the message exchanges between the master and executors. The translator, and then the compiler, should pass the required class/codec information down to the runtime.
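As a placeholder until the compiler passes down proper codecs, plain Java serialization could ser/des message payloads; a minimal sketch:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Sketch of ser/des for master/executor message payloads using plain Java
// serialization; a codec supplied by the translator/compiler could replace it.
final class SerDe {

  static byte[] serialize(final Serializable obj) {
    final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(obj);
    } catch (final IOException e) {
      throw new RuntimeException(e);
    }
    return bytes.toByteArray();
  }

  static Object deserialize(final byte[] data) {
    try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
      return in.readObject();
    } catch (final IOException | ClassNotFoundException e) {
      throw new RuntimeException(e);
    }
  }
}
```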
The runtime execution plan must be generated in the compiler's backend.
The Runtime must provide APIs to generate the execution plan, including the runtime's operators, edges, and attributes that correspond to those of the IR.
We've discussed the overall architecture of our new version of Vortex, consisting of the translator layer, the compilation layer, and lastly, the runtime layer.
Keeping this in mind, do you have any opinions about our code structure?
@johnyangk @gwsshs22
In order to make Vortex as extensible as possible, the set of attributes used to decide how Runtime executes jobs must be made extensible as well.
Runtime currently has a fixed set of attributes; new attributes must be able to be added flexibly.
Support Join in Vortex DAG, and translate Beam's CoGroupByKey into it.
In the PR, please provide Beam program examples for testing the code.
Implement VortexBackend, a backend that converts the IR representation into an ExecutionPlan executable by the Vortex Runtime.
Implement stage partitioning
Currently, Jenkins seems to be having some trouble, as it doesn't have Beam compiled on the machine. I'll try to fix this ASAP so we can use our CI functionality.