
Guagua


An iterative computing framework on both Hadoop MapReduce and Hadoop YARN.

News

Guagua 0.7.7 has been released with many improvements. Check our changes.

Conference

QCON Shanghai 2014 Slides

Getting Started

Please visit Guagua wiki site for tutorials.

What is Guagua?

Guagua, a sub-project of Shifu, is a distributed, pluggable and scalable iterative computing framework based on Hadoop MapReduce and YARN.

This graph shows the iterative computing process for Guagua.

Guagua Process

A typical use case for Guagua is distributed machine learning model training on Hadoop. Using Guagua, we implemented a distributed neural network algorithm that reduces model training time from days to hours on 1 TB data sets. The distributed neural network algorithm is based on Encog and Guagua. For details, please check our example source code.
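The iterative process shown in the graph can be sketched as a simple master/worker loop: the master broadcasts its state, each worker computes on its data split, and the master aggregates the results. The interface and method names below are simplified illustrations, not Guagua's actual API:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of Guagua-style iterative computing; the names here
// are simplified stand-ins, NOT Guagua's real interfaces.
public class IterativeSketch {
    interface Worker { double compute(double masterState); }

    // Each iteration: broadcast master state, collect worker results,
    // aggregate them into the next master state.
    static double runIterations(List<Worker> workers, int iterations) {
        double masterState = 0.0;
        for (int i = 0; i < iterations; i++) {
            double sum = 0.0;
            for (Worker w : workers) {
                sum += w.compute(masterState);   // worker-side computation
            }
            masterState = sum / workers.size();  // master-side aggregation
        }
        return masterState;
    }

    public static void main(String[] args) {
        List<Worker> workers = new ArrayList<>();
        workers.add(s -> s + 1.0);
        workers.add(s -> s + 3.0);
        // iteration 1: (1 + 3)/2 = 2; iteration 2: (3 + 5)/2 = 4
        System.out.println(runIterations(workers, 2)); // 4.0
    }
}
```

In Guagua the worker computation runs as Hadoop tasks and the aggregation runs in the master, but the control flow is the same loop.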

Google Group

Please join the Guagua group for questions, bug reports, or anything else.

Copyright and License

Copyright 2013-2017, PayPal Software Foundation. Licensed under the Apache License, Version 2.0.

Contributors

dependabot[bot], gjastrebski, jlleitschuh, pengshanzhang, zhang7575, zhangpengshan


guagua's Issues

Straggler Mitigation Improvement

The current policy detects whether a worker exceeds the time threshold three times; Guagua then kills that worker and reruns it on another machine.

In a busy Hadoop cluster this does not always work well: sometimes a worker is consistently slow yet never crosses the threshold, which causes poor overall performance.

Consider this policy instead:
In each iteration, the master collects the running times of all workers; if a worker's running time is more than one standard deviation above the mean, treat it as a straggler. This should work better than the original policy.
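The proposed policy can be sketched as follows; the mean-plus-one-standard-deviation threshold is an interpretation of "over std" above:

```java
// Sketch of the proposed straggler policy: flag a worker whose running
// time is more than one standard deviation above the mean of all
// workers' times in the current iteration.
public class StragglerPolicy {
    static double mean(long[] times) {
        double s = 0;
        for (long t : times) s += t;
        return s / times.length;
    }

    static double stdDev(long[] times) {
        double m = mean(times), s = 0;
        for (long t : times) s += (t - m) * (t - m);
        return Math.sqrt(s / times.length);
    }

    // A worker is a straggler if its time exceeds mean + stddev.
    static boolean isStraggler(long[] times, int workerIndex) {
        return times[workerIndex] > mean(times) + stdDev(times);
    }

    public static void main(String[] args) {
        long[] times = {100, 110, 105, 400}; // worker 3 is slow but may never hit a fixed threshold
        System.out.println(isStraggler(times, 3)); // true
        System.out.println(isStraggler(times, 0)); // false
    }
}
```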

Improve guagua Bash Script

  1. Add GUAGUA_CLASSPATH env.
  2. Add guagua-env.sh in the conf folder.
  3. Add GUAGUA_CONF_DIR env.
  4. Add GUAGUA_OPTS to support Java options from the CLI.

NumberFormatException: For input string: "split"

2014-09-15 20:43:30,734 ERROR [main] ml.shifu.guagua.mapreduce.GuaguaMapper: Error in guagua main run method.
java.lang.NumberFormatException: For input string: "split"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:492)
at java.lang.Integer.valueOf(Integer.java:582)
at ml.shifu.guagua.worker.AbstractWorkerCoordinator$FailOverCoordinatorCommand$1.compare(AbstractWorkerCoordinator.java:126)
at ml.shifu.guagua.worker.AbstractWorkerCoordinator$FailOverCoordinatorCommand$1.compare(AbstractWorkerCoordinator.java:123)
at java.util.TimSort.countRunAndMakeAscending(TimSort.java:324)
at java.util.TimSort.sort(TimSort.java:189)
at java.util.TimSort.sort(TimSort.java:173)
at java.util.Arrays.sort(Arrays.java:659)
at java.util.Collections.sort(Collections.java:217)
at ml.shifu.guagua.worker.AbstractWorkerCoordinator$FailOverCoordinatorCommand.doExecute(AbstractWorkerCoordinator.java:123)
at ml.shifu.guagua.BasicCoordinator$BasicCoordinatorCommand.execute(BasicCoordinator.java:461)
at ml.shifu.guagua.worker.SyncWorkerCoordinator.preApplication(SyncWorkerCoordinator.java:61)
at ml.shifu.guagua.worker.GuaguaWorkerService.start(GuaguaWorkerService.java:160)
at ml.shifu.guagua.mapreduce.GuaguaMapper.setup(GuaguaMapper.java:99)
at ml.shifu.guagua.mapreduce.GuaguaMapper.run(GuaguaMapper.java:133)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
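The trace shows Integer.parseInt failing on the token "split" while the fail-over coordinator sorts names in a comparator. A defensive parse of the trailing token avoids the crash; this helper is a hypothetical illustration of the idea, not the actual Guagua fix:

```java
// Integer.parseInt chokes when the last underscore-separated token of
// a name is not numeric (here, "split"). Falling back to a default
// instead of throwing keeps the sort from killing the worker.
// Hypothetical helper, not the actual fix in AbstractWorkerCoordinator.
public class SafeSuffixParse {
    // Return the trailing integer of an underscore-separated name,
    // or the given default if the last token is not a number.
    static int trailingInt(String name, int dflt) {
        int idx = name.lastIndexOf('_');
        String token = idx >= 0 ? name.substring(idx + 1) : name;
        try {
            return Integer.parseInt(token);
        } catch (NumberFormatException e) {
            return dflt;
        }
    }

    public static void main(String[] args) {
        System.out.println(trailingInt("worker_12", -1));   // 12
        System.out.println(trailingInt("input_split", -1)); // -1, no crash
    }
}
```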

Partial Complete Support

In the current implementation, the master can move on to the next iteration only after every worker in the current iteration succeeds. There are two parameters that define the required percentage of successful workers per iteration, but in the last iteration all mapper tasks must complete successfully for the job to succeed.

If we have 10 workers and 9 of them finish the last iteration successfully, why not just terminate the job in the successful state?

This should be easy in the YARN implementation, but not so easy in the MapReduce one.

Add KMeans Algorithm Example

  1. First K points
    In the 1st iteration, choose k random points and send them to the workers.
  2. Worker iteration
    Compare each point with the K centroids, find the closest one, and tag the point. At the same time, for each cluster, count points and sum values, then send them to the master.
  3. Master computation
    Recompute the K centroids and send them to the workers.
  4. How to save all points with tags
    Add a worker interceptor: save all data to one part file on HDFS. Data is set in Context#props and used in Interceptor#postApplication.
  5. Think about fail-over (tags?).
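The worker step in item 2 above can be sketched like this (1-D points for brevity; a simplified illustration, not the actual example code):

```java
import java.util.Arrays;

// Sketch of the worker-side KMeans step: assign each point to its
// closest centroid, then accumulate per-cluster sums and counts to
// send to the master. 1-D points for brevity.
public class KMeansWorkerStep {
    // Returns {sum_0..sum_{k-1}, count_0..count_{k-1}} for k clusters.
    static double[] partialSums(double[] points, double[] centroids) {
        int k = centroids.length;
        double[] out = new double[2 * k];
        for (double p : points) {
            int best = 0;
            for (int c = 1; c < k; c++) {
                if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[best])) {
                    best = c;
                }
            }
            out[best] += p;      // running sum for the closest cluster
            out[k + best] += 1;  // running count for the closest cluster
        }
        return out;
    }

    public static void main(String[] args) {
        double[] points = {1, 2, 10, 11};
        double[] centroids = {0, 12};
        // points 1,2 go to centroid 0; points 10,11 go to centroid 1
        System.out.println(Arrays.toString(partialSums(points, centroids)));
        // [3.0, 21.0, 2.0, 2.0]
    }
}
```

The master then divides each cluster's summed values by its count to recompute the K centroids (item 3).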

Save ZNode space for ZooKeeper

Currently we use one ZooKeeper znode per worker per iteration. Investigate whether one znode can be reused across all iterations.

Compare LR from Shifu-0.2.3 with Spark LR

Using some of the data Shifu used, configure the NN with no hidden layer or just one layer. For Spark, test SparkHDFSLR on a 2.2.0 cluster with the same data.

Compare the performance.

Consider Functions like byPass in Interceptors

A real interceptor chain should support cancellation: in any interceptor, the user may call context.byPass() to stop execution of the remaining interceptors and even the computable. Investigate whether this is doable in the Guagua interceptor chain.
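The byPass() semantics described above might look like this minimal sketch; the class and method names are simplified stand-ins for Guagua's interceptor API:

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the proposed byPass() semantics: once any interceptor
// calls byPass(), the remaining interceptors (and the computable)
// are skipped. Simplified names, not Guagua's actual API.
public class BypassChain {
    static class Context {
        private boolean bypassed = false;
        void byPass() { bypassed = true; }
        boolean isBypassed() { return bypassed; }
    }

    interface Interceptor { void preIteration(Context ctx); }

    // Run interceptors in order, stopping as soon as one bypasses;
    // returns how many interceptors actually executed.
    static int runChain(List<Interceptor> chain, Context ctx) {
        int executed = 0;
        for (Interceptor i : chain) {
            if (ctx.isBypassed()) break;
            i.preIteration(ctx);
            executed++;
        }
        return executed;
    }

    public static void main(String[] args) {
        Context ctx = new Context();
        List<Interceptor> chain = Arrays.asList(
            c -> {},            // runs
            Context::byPass,    // runs and cancels the rest
            c -> {}             // skipped
        );
        System.out.println(runChain(chain, ctx)); // 2
    }
}
```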

Add In-Memory ZooKeeper Support

ZooKeeper supports an in-memory setup. This could be used for an in-memory coordinator, and may be a good choice for an embedded ZooKeeper server.

Set Worker Results in Master Context not as A List to Save Memory

This optimization is for large models. Currently the master holds all worker results in a list, but with large models and many workers, master memory can become a bottleneck. To optimize the master, worker results should be exposed not as a list but as an Iterable, so they are loaded into memory one by one.
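The idea can be sketched with a lazy Iterable; here the worker results are fabricated inside the iterator for illustration, where Guagua would deserialize one real result at a time:

```java
import java.util.Arrays;
import java.util.Iterator;

// Sketch of exposing worker results as a lazy Iterable so the master
// materializes one result at a time instead of holding all of them in
// a list. The results here are fabricated on demand for illustration;
// in Guagua each next() would deserialize one real worker result.
public class LazyWorkerResults implements Iterable<double[]> {
    private final int workerCount;
    private final int modelSize;

    LazyWorkerResults(int workerCount, int modelSize) {
        this.workerCount = workerCount;
        this.modelSize = modelSize;
    }

    @Override
    public Iterator<double[]> iterator() {
        return new Iterator<double[]>() {
            private int next = 0;
            public boolean hasNext() { return next < workerCount; }
            public double[] next() {
                // Only one worker result exists in memory at a time.
                double[] result = new double[modelSize];
                Arrays.fill(result, next++);
                return result;
            }
        };
    }

    public static void main(String[] args) {
        double[] aggregate = new double[3];
        for (double[] r : new LazyWorkerResults(4, 3)) {
            for (int i = 0; i < 3; i++) aggregate[i] += r[i]; // running sum
        }
        System.out.println(Arrays.toString(aggregate)); // [6.0, 6.0, 6.0]
    }
}
```

Aggregation that only needs a running sum (like gradient averaging) works unchanged, while peak master memory drops from all results to one.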

Extract guagua-common and guagua-examples projects

guagua-common: Hadoop-related common features such as InputFormat; guagua-mapreduce and guagua-yarn should depend on it.

guagua-examples: merge guagua-mapreduce-examples and guagua-yarn-examples together.

Add guagua-site.xml Support

Currently, users can only set Guagua parameters on the command line with '-Dkey=value'. We'd like to support an XML configuration file so that users can configure Guagua parameters in guagua-site.xml.
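Assuming guagua-site.xml follows the usual Hadoop site-file shape (`<property><name>...</name><value>...</value></property>`), it could be read with a small parser like this sketch (the key name in the example is hypothetical):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Sketch of reading guagua-site.xml in the Hadoop site-file shape.
// The file format is an assumption taken from this proposal; in a
// Hadoop job one could also just call Configuration.addResource().
public class GuaguaSiteLoader {
    static Map<String, String> parse(InputStream in) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(in);
            Map<String, String> props = new HashMap<>();
            NodeList nodes = doc.getElementsByTagName("property");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element p = (Element) nodes.item(i);
                String name = p.getElementsByTagName("name").item(0).getTextContent();
                String value = p.getElementsByTagName("value").item(0).getTextContent();
                props.put(name.trim(), value.trim());
            }
            return props;
        } catch (Exception e) {
            throw new RuntimeException("Cannot parse guagua-site.xml", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<configuration><property>"
                + "<name>guagua.iteration.count</name><value>100</value>"
                + "</property></configuration>";
        Map<String, String> props =
                parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        System.out.println(props.get("guagua.iteration.count")); // 100
    }
}
```

Command-line '-Dkey=value' settings would then override values loaded from the file.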

AbstractWorker to load and iterate all data in only one function

Currently, we first load data into a collection, then iterate over that collection in each iteration. There should be a good abstraction that combines data loading and collection iteration, so the user only needs to write a single function, like map in MapReduce, to process each record.

Not sure about the exact scenario, but it should be a good abstraction.
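The proposed abstraction can be sketched as a framework-owned loop plus a single user-supplied per-record function (hypothetical API, not AbstractWorker's real signature):

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.DoubleBinaryOperator;

// Sketch of the proposed abstraction: the framework owns loading and
// iterating the records; the user supplies only one per-record
// function, similar to map() in MapReduce. Hypothetical API.
public class SingleFunctionWorker {
    // Framework side: iterate all records and fold the user's
    // per-record function into an accumulated result.
    static double eachRecord(List<Double> records, double init,
                             DoubleBinaryOperator perRecord) {
        double acc = init;
        for (double r : records) {
            acc = perRecord.applyAsDouble(acc, r); // user code runs per record
        }
        return acc;
    }

    public static void main(String[] args) {
        List<Double> records = Arrays.asList(1.0, 2.0, 3.0);
        // The user writes only the per-record logic, here a running sum.
        System.out.println(eachRecord(records, 0.0, (acc, r) -> acc + r)); // 6.0
    }
}
```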

Investigate Vector Computing on Gradient Computing

Currently the worker computes gradients one record at a time; this should run in parallel. Vector abstractions appear to address this; check other vector implementations and learn from them.

Or just use multiple threads.
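The multi-threaded option can be sketched with a plain thread pool: split the records into chunks, compute partial gradients in parallel, and combine them. The per-record "gradient" here is a stand-in value:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of multi-threaded gradient computation: records are split
// across a fixed thread pool, each task produces a partial result,
// and the partials are summed. The "gradient" per record is a
// stand-in (just the record value) for illustration.
public class ParallelGradient {
    static double parallelSum(double[] records, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Double>> parts = new ArrayList<>();
            int chunk = (records.length + threads - 1) / threads;
            for (int t = 0; t < threads; t++) {
                final int start = t * chunk;
                final int end = Math.min(records.length, start + chunk);
                parts.add(pool.submit(() -> {
                    double s = 0;
                    for (int i = start; i < end; i++) s += records[i]; // partial gradient
                    return s;
                }));
            }
            double total = 0;
            for (Future<Double> f : parts) total += f.get(); // combine partials
            return total;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        double[] records = {1, 2, 3, 4, 5, 6, 7, 8};
        System.out.println(parallelSum(records, 4)); // 36.0
    }
}
```

This works because gradient contributions are additive, so chunk order does not affect the combined result.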

BIG MODEL: Another Option to Store Models (Bytable Master and Worker Results) into HDFS

Currently, Bytable results are stored in ZooKeeper znodes. Because of the znode size limit (although multiple znodes can be used to store one big Bytable result), another option such as HDFS should be supported for bigger Bytable results.

HDFS reads and writes are slow, so think about a good in-memory solution that supports bigger models while keeping good speed.

Release Guagua 0.5.0

A big improvement, including an embedded ZooKeeper, timeout computable, and iterable worker results instead of a list (saving memory efficiently in the master).

It's time to release 0.5.0.

KMeans: Initial K Centroids Selection

In the current implementation, the K initial centroids are simply configured, which is not ideal.

We'd like to use the first iteration to choose the K centroids:

  1. Each worker randomly chooses k points and sends them to the master.
  2. The master sorts all m * k points and chooses K well-separated points as the first centroids.
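The master-side step can be sketched like this (1-D points for brevity; picking evenly spaced points from the sorted candidates is one way to keep the initial centroids far apart):

```java
import java.util.Arrays;

// Sketch of the master-side step: collect the m * k candidate points
// sent by the workers, sort them, and pick K evenly spaced points as
// the initial centroids. 1-D points; assumes k >= 2.
public class InitialCentroids {
    static double[] chooseK(double[] candidates, int k) {
        double[] sorted = candidates.clone();
        Arrays.sort(sorted);
        double[] centroids = new double[k];
        for (int i = 0; i < k; i++) {
            // Spread picks across the sorted candidates so the
            // chosen centroids are well separated.
            int idx = (int) Math.round(
                    (double) i * (sorted.length - 1) / (k - 1));
            centroids[i] = sorted[idx];
        }
        return centroids;
    }

    public static void main(String[] args) {
        // candidates from m = 3 workers, each sending k = 3 points
        double[] candidates = {5, 1, 9, 2, 8, 4, 7, 3, 6};
        System.out.println(Arrays.toString(chooseK(candidates, 3)));
        // [1.0, 5.0, 9.0]
    }
}
```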

GuaguaAppMaster Fault Tolerance

In Hadoop 2.2.0, even MRAppMaster has no fault tolerance, so of course GuaguaAppMaster has none either.

Check Hadoop 2.4.1 to see whether MRAppMaster has fault tolerance there; if so, learn from it and bring it to GuaguaAppMaster.

Coordinator Improvement for Big Model

For big models, ZooKeeper is a bottleneck; sometimes we need a big ZooKeeper heap (10 GB) and a big disk (500 GB). Consider using ZooKeeper only for coordination, and transferring results through a TCP server/client instead.

One big issue: fault tolerance doesn't work well this way. Consider persisting the master result to HDFS or ZooKeeper.

Big Model Support Test

Currently, gradient results are stored in one ZooKeeper znode; if they exceed 1 MB, secondary znodes are used. But a big model of, say, 20 MB (which may need 20 znodes) is still untested.

Test this case with big data on our Hadoop cluster.
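The chunking described above can be sketched as follows; the chunk size is shrunk here so the demo stays small:

```java
// Sketch of the chunking described above: a serialized result larger
// than ZooKeeper's ~1 MB znode limit is split across a primary znode
// plus secondary znodes. The chunk size is tiny here for the demo.
public class ZnodeChunking {
    static byte[][] split(byte[] data, int chunkSize) {
        int chunks = (data.length + chunkSize - 1) / chunkSize;
        byte[][] out = new byte[chunks][];
        for (int i = 0; i < chunks; i++) {
            int start = i * chunkSize;
            int len = Math.min(chunkSize, data.length - start);
            out[i] = new byte[len];
            System.arraycopy(data, start, out[i], 0, len);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] model = new byte[25];        // pretend 25-byte "model"
        byte[][] znodes = split(model, 10); // 10-byte "znode limit"
        // 25 bytes over a 10-byte limit needs 3 znodes: 10 + 10 + 5
        System.out.println(znodes.length + " " + znodes[2].length); // 3 5
        // likewise, a 20 MB model over a 1 MB limit needs 20 znodes
    }
}
```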
