
Guagua


An iterative computing framework on both Hadoop MapReduce and Hadoop YARN.

News

Guagua 0.7.7 has been released with many improvements. Check our changes.

Conference

QCON Shanghai 2014 Slides

Getting Started

Please visit Guagua wiki site for tutorials.

What is Guagua?

Guagua, a sub-project of Shifu, is a distributed, pluggable and scalable iterative computing framework based on Hadoop MapReduce and YARN.

This graph shows the iterative computing process for Guagua.

Guagua Process

A typical use case for Guagua is distributed machine learning model training on Hadoop. Using Guagua, we implemented a distributed neural network algorithm that reduces model training time from days to hours on 1 TB data sets. The distributed neural network algorithm is based on Encog and Guagua. For details, please check our example source code.
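The iterative process shown in the graph can be sketched as a simple master/worker loop: the master broadcasts its state, each worker computes on its data split, and the master aggregates the results. The interface and method names below are simplified illustrations, not Guagua's actual API:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of Guagua-style iterative computing; the names here
// are simplified stand-ins, NOT Guagua's real interfaces.
public class IterativeSketch {
    interface Worker { double compute(double masterState); }

    // Each iteration: broadcast master state, collect worker results,
    // aggregate them into the next master state.
    static double runIterations(List<Worker> workers, int iterations) {
        double masterState = 0.0;
        for (int i = 0; i < iterations; i++) {
            double sum = 0.0;
            for (Worker w : workers) {
                sum += w.compute(masterState);   // worker-side computation
            }
            masterState = sum / workers.size();  // master-side aggregation
        }
        return masterState;
    }

    public static void main(String[] args) {
        List<Worker> workers = new ArrayList<>();
        workers.add(s -> s + 1.0);
        workers.add(s -> s + 3.0);
        // iteration 1: (1 + 3)/2 = 2; iteration 2: (3 + 5)/2 = 4
        System.out.println(runIterations(workers, 2)); // 4.0
    }
}
```

In Guagua the worker computation runs as Hadoop tasks and the aggregation runs in the master, but the control flow is the same loop.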

Google Group

Please join the Guagua group for questions, bug reports, or anything else.

Copyright and License

Copyright 2013-2017, PayPal Software Foundation. Licensed under the Apache License, Version 2.0.

Contributors

dependabot[bot], gjastrebski, jlleitschuh, pengshanzhang, zhang7575, zhangpengshan


guagua's Issues

Straggler Mitigation Improvement

The current policy detects whether a worker exceeds the time threshold three times; Guagua then kills that worker and reruns it on another machine.

In a busy Hadoop cluster this does not always work well: sometimes a worker is consistently slow yet never crosses the threshold, which causes poor overall performance.

Consider this policy instead:
In each iteration, the master collects the running times of all workers; if a worker's running time is more than one standard deviation above the mean, treat it as a straggler. This should work better than the original policy.
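The proposed policy can be sketched as follows; the mean-plus-one-standard-deviation threshold is an interpretation of "over std" above:

```java
// Sketch of the proposed straggler policy: flag a worker whose running
// time is more than one standard deviation above the mean of all
// workers' times in the current iteration.
public class StragglerPolicy {
    static double mean(long[] times) {
        double s = 0;
        for (long t : times) s += t;
        return s / times.length;
    }

    static double stdDev(long[] times) {
        double m = mean(times), s = 0;
        for (long t : times) s += (t - m) * (t - m);
        return Math.sqrt(s / times.length);
    }

    // A worker is a straggler if its time exceeds mean + stddev.
    static boolean isStraggler(long[] times, int workerIndex) {
        return times[workerIndex] > mean(times) + stdDev(times);
    }

    public static void main(String[] args) {
        long[] times = {100, 110, 105, 400}; // worker 3 is slow but may never hit a fixed threshold
        System.out.println(isStraggler(times, 3)); // true
        System.out.println(isStraggler(times, 0)); // false
    }
}
```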

Improve guagua Bash Script

  1. Add GUAGUA_CLASSPATH env.
  2. Add guagua-env.sh in the conf folder.
  3. Add GUAGUA_CONF_DIR env.
  4. Add GUAGUA_OPTS to support Java options from the CLI.

NumberFormatException: For input string: "split"

2014-09-15 20:43:30,734 ERROR [main] ml.shifu.guagua.mapreduce.GuaguaMapper: Error in guagua main run method.
java.lang.NumberFormatException: For input string: "split"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:492)
at java.lang.Integer.valueOf(Integer.java:582)
at ml.shifu.guagua.worker.AbstractWorkerCoordinator$FailOverCoordinatorCommand$1.compare(AbstractWorkerCoordinator.java:126)
at ml.shifu.guagua.worker.AbstractWorkerCoordinator$FailOverCoordinatorCommand$1.compare(AbstractWorkerCoordinator.java:123)
at java.util.TimSort.countRunAndMakeAscending(TimSort.java:324)
at java.util.TimSort.sort(TimSort.java:189)
at java.util.TimSort.sort(TimSort.java:173)
at java.util.Arrays.sort(Arrays.java:659)
at java.util.Collections.sort(Collections.java:217)
at ml.shifu.guagua.worker.AbstractWorkerCoordinator$FailOverCoordinatorCommand.doExecute(AbstractWorkerCoordinator.java:123)
at ml.shifu.guagua.BasicCoordinator$BasicCoordinatorCommand.execute(BasicCoordinator.java:461)
at ml.shifu.guagua.worker.SyncWorkerCoordinator.preApplication(SyncWorkerCoordinator.java:61)
at ml.shifu.guagua.worker.GuaguaWorkerService.start(GuaguaWorkerService.java:160)
at ml.shifu.guagua.mapreduce.GuaguaMapper.setup(GuaguaMapper.java:99)
at ml.shifu.guagua.mapreduce.GuaguaMapper.run(GuaguaMapper.java:133)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
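The trace shows Integer.parseInt failing on the token "split" while the fail-over coordinator sorts names in a comparator. A defensive parse of the trailing token avoids the crash; this helper is a hypothetical illustration of the idea, not the actual Guagua fix:

```java
// Integer.parseInt chokes when the last underscore-separated token of
// a name is not numeric (here, "split"). Falling back to a default
// instead of throwing keeps the sort from killing the worker.
// Hypothetical helper, not the actual fix in AbstractWorkerCoordinator.
public class SafeSuffixParse {
    // Return the trailing integer of an underscore-separated name,
    // or the given default if the last token is not a number.
    static int trailingInt(String name, int dflt) {
        int idx = name.lastIndexOf('_');
        String token = idx >= 0 ? name.substring(idx + 1) : name;
        try {
            return Integer.parseInt(token);
        } catch (NumberFormatException e) {
            return dflt;
        }
    }

    public static void main(String[] args) {
        System.out.println(trailingInt("worker_12", -1));   // 12
        System.out.println(trailingInt("input_split", -1)); // -1, no crash
    }
}
```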

Partial Complete Support

In the current implementation, the master can move on to the next iteration only after every worker in the current iteration succeeds. There are two parameters that define the required percentage of successful workers per iteration, but in the last iteration all mapper tasks must complete successfully for the job to succeed.

If we have 10 workers and 9 of them finish the last iteration successfully, why not just terminate the job in the successful state?

This should be easy in the YARN implementation, but not so easy in the MapReduce one.

Add KMeans Algorithm Example

  1. First K points
    In the 1st iteration, choose k random points and send them to the workers.
  2. Worker iteration
    Compare each point with the K centroids, find the closest one, and tag the point. At the same time, for each cluster, count points and sum values, then send them to the master.
  3. Master computation
    Recompute the K centroids and send them to the workers.
  4. How to save all points with tags
    Add a worker interceptor: save all data to one part file on HDFS. Data is set in Context#props and used in Interceptor#postApplication.
  5. Think about fail-over (tags?).
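The worker step in item 2 above can be sketched like this (1-D points for brevity; a simplified illustration, not the actual example code):

```java
import java.util.Arrays;

// Sketch of the worker-side KMeans step: assign each point to its
// closest centroid, then accumulate per-cluster sums and counts to
// send to the master. 1-D points for brevity.
public class KMeansWorkerStep {
    // Returns {sum_0..sum_{k-1}, count_0..count_{k-1}} for k clusters.
    static double[] partialSums(double[] points, double[] centroids) {
        int k = centroids.length;
        double[] out = new double[2 * k];
        for (double p : points) {
            int best = 0;
            for (int c = 1; c < k; c++) {
                if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[best])) {
                    best = c;
                }
            }
            out[best] += p;      // running sum for the closest cluster
            out[k + best] += 1;  // running count for the closest cluster
        }
        return out;
    }

    public static void main(String[] args) {
        double[] points = {1, 2, 10, 11};
        double[] centroids = {0, 12};
        // points 1,2 go to centroid 0; points 10,11 go to centroid 1
        System.out.println(Arrays.toString(partialSums(points, centroids)));
        // [3.0, 21.0, 2.0, 2.0]
    }
}
```

The master then divides each cluster's summed values by its count to recompute the K centroids (item 3).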

Save ZNode space for ZooKeeper

Currently we use one ZooKeeper znode per worker per iteration. Investigate whether one znode can be reused across all iterations.

Compare LR from Shifu-0.2.3 with Spark LR

Using some of the data Shifu used, configure the NN with no hidden layer or just one layer. For Spark, test SparkHDFSLR on a 2.2.0 cluster with the same data.

Compare the performance.

Consider Functions like byPass in Interceptors

A real interceptor chain should support cancellation: in any interceptor, the user may call context.byPass() to stop execution of the remaining interceptors and even the computable. Investigate whether this is doable in the Guagua interceptor chain.
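The byPass() semantics described above might look like this minimal sketch; the class and method names are simplified stand-ins for Guagua's interceptor API:

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the proposed byPass() semantics: once any interceptor
// calls byPass(), the remaining interceptors (and the computable)
// are skipped. Simplified names, not Guagua's actual API.
public class BypassChain {
    static class Context {
        private boolean bypassed = false;
        void byPass() { bypassed = true; }
        boolean isBypassed() { return bypassed; }
    }

    interface Interceptor { void preIteration(Context ctx); }

    // Run interceptors in order, stopping as soon as one bypasses;
    // returns how many interceptors actually executed.
    static int runChain(List<Interceptor> chain, Context ctx) {
        int executed = 0;
        for (Interceptor i : chain) {
            if (ctx.isBypassed()) break;
            i.preIteration(ctx);
            executed++;
        }
        return executed;
    }

    public static void main(String[] args) {
        Context ctx = new Context();
        List<Interceptor> chain = Arrays.asList(
            c -> {},            // runs
            Context::byPass,    // runs and cancels the rest
            c -> {}             // skipped
        );
        System.out.println(runChain(chain, ctx)); // 2
    }
}
```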

Add In-Memory ZooKeeper Support

ZooKeeper supports an in-memory setup. This could be used for an in-memory coordinator, and may be a good choice for an embedded ZooKeeper server.

Set Worker Results in Master Context not as A List to Save Memory

This optimization is for large models. Currently the master holds all worker results in a list, but with large models and many workers, master memory can become a bottleneck. To optimize the master, worker results should be exposed not as a list but as an Iterable, so they are loaded into memory one by one.
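The idea can be sketched with a lazy Iterable; here the worker results are fabricated inside the iterator for illustration, where Guagua would deserialize one real result at a time:

```java
import java.util.Arrays;
import java.util.Iterator;

// Sketch of exposing worker results as a lazy Iterable so the master
// materializes one result at a time instead of holding all of them in
// a list. The results here are fabricated on demand for illustration;
// in Guagua each next() would deserialize one real worker result.
public class LazyWorkerResults implements Iterable<double[]> {
    private final int workerCount;
    private final int modelSize;

    LazyWorkerResults(int workerCount, int modelSize) {
        this.workerCount = workerCount;
        this.modelSize = modelSize;
    }

    @Override
    public Iterator<double[]> iterator() {
        return new Iterator<double[]>() {
            private int next = 0;
            public boolean hasNext() { return next < workerCount; }
            public double[] next() {
                // Only one worker result exists in memory at a time.
                double[] result = new double[modelSize];
                Arrays.fill(result, next++);
                return result;
            }
        };
    }

    public static void main(String[] args) {
        double[] aggregate = new double[3];
        for (double[] r : new LazyWorkerResults(4, 3)) {
            for (int i = 0; i < 3; i++) aggregate[i] += r[i]; // running sum
        }
        System.out.println(Arrays.toString(aggregate)); // [6.0, 6.0, 6.0]
    }
}
```

Aggregation that only needs a running sum (like gradient averaging) works unchanged, while peak master memory drops from all results to one.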

Extract guagua-common and guagua-examples projects

guagua-common: Hadoop-related common features such as InputFormat; guagua-mapreduce and guagua-yarn should depend on it.

guagua-examples: merge guagua-mapreduce-examples and guagua-yarn-examples together.

Add guagua-site.xml Support

Currently, users can only set Guagua parameters on the command line with '-Dkey=value'. We'd like to support an XML configuration file so that users can configure Guagua parameters in guagua-site.xml.
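Assuming guagua-site.xml follows the usual Hadoop site-file shape (`<property><name>...</name><value>...</value></property>`), it could be read with a small parser like this sketch (the key name in the example is hypothetical):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Sketch of reading guagua-site.xml in the Hadoop site-file shape.
// The file format is an assumption taken from this proposal; in a
// Hadoop job one could also just call Configuration.addResource().
public class GuaguaSiteLoader {
    static Map<String, String> parse(InputStream in) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(in);
            Map<String, String> props = new HashMap<>();
            NodeList nodes = doc.getElementsByTagName("property");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element p = (Element) nodes.item(i);
                String name = p.getElementsByTagName("name").item(0).getTextContent();
                String value = p.getElementsByTagName("value").item(0).getTextContent();
                props.put(name.trim(), value.trim());
            }
            return props;
        } catch (Exception e) {
            throw new RuntimeException("Cannot parse guagua-site.xml", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<configuration><property>"
                + "<name>guagua.iteration.count</name><value>100</value>"
                + "</property></configuration>";
        Map<String, String> props =
                parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        System.out.println(props.get("guagua.iteration.count")); // 100
    }
}
```

Command-line '-Dkey=value' settings would then override values loaded from the file.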

AbstractWorker to load and iterate all data in only one function

Currently, we first load data into a collection, then iterate over that collection in each iteration. There should be a good abstraction that combines data loading and collection iteration, so the user only needs to write a single function, like map in MapReduce, to process each record.

Not sure about the exact scenario, but it should be a good abstraction.
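The proposed abstraction can be sketched as a framework-owned loop plus a single user-supplied per-record function (hypothetical API, not AbstractWorker's real signature):

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.DoubleBinaryOperator;

// Sketch of the proposed abstraction: the framework owns loading and
// iterating the records; the user supplies only one per-record
// function, similar to map() in MapReduce. Hypothetical API.
public class SingleFunctionWorker {
    // Framework side: iterate all records and fold the user's
    // per-record function into an accumulated result.
    static double eachRecord(List<Double> records, double init,
                             DoubleBinaryOperator perRecord) {
        double acc = init;
        for (double r : records) {
            acc = perRecord.applyAsDouble(acc, r); // user code runs per record
        }
        return acc;
    }

    public static void main(String[] args) {
        List<Double> records = Arrays.asList(1.0, 2.0, 3.0);
        // The user writes only the per-record logic, here a running sum.
        System.out.println(eachRecord(records, 0.0, (acc, r) -> acc + r)); // 6.0
    }
}
```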

Investigate Vector Computing on Gradient Computing

Currently the worker computes gradients one record at a time; this should run in parallel. Vector abstractions appear to address this; check other vector implementations and learn from them.

Or just use multiple threads.
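The multi-threaded option can be sketched with a plain thread pool: split the records into chunks, compute partial gradients in parallel, and combine them. The per-record "gradient" here is a stand-in value:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of multi-threaded gradient computation: records are split
// across a fixed thread pool, each task produces a partial result,
// and the partials are summed. The "gradient" per record is a
// stand-in (just the record value) for illustration.
public class ParallelGradient {
    static double parallelSum(double[] records, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Double>> parts = new ArrayList<>();
            int chunk = (records.length + threads - 1) / threads;
            for (int t = 0; t < threads; t++) {
                final int start = t * chunk;
                final int end = Math.min(records.length, start + chunk);
                parts.add(pool.submit(() -> {
                    double s = 0;
                    for (int i = start; i < end; i++) s += records[i]; // partial gradient
                    return s;
                }));
            }
            double total = 0;
            for (Future<Double> f : parts) total += f.get(); // combine partials
            return total;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        double[] records = {1, 2, 3, 4, 5, 6, 7, 8};
        System.out.println(parallelSum(records, 4)); // 36.0
    }
}
```

This works because gradient contributions are additive, so chunk order does not affect the combined result.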

BIG MODEL: Another Option to Store Models (Bytable Master and Worker Results) into HDFS

Currently, Bytable results are stored in ZooKeeper znodes. Because of the znode size limit (although multiple znodes can be used to store one big Bytable result), another option such as HDFS should be supported for bigger Bytable results.

HDFS reads and writes are slow, so think about a good in-memory solution that supports bigger models while keeping good speed.

Release Guagua 0.5.0

A big improvement, including an embedded ZooKeeper, timeout computable, and iterable worker results instead of a list (saving memory efficiently in the master).

It's time to release 0.5.0.

KMeans: Initial K Centroids Selection

In the current implementation, the K initial centroids are simply configured, which is not ideal.

We'd like to use the first iteration to choose the K centroids:

  1. Each worker randomly chooses k points and sends them to the master.
  2. The master sorts all m * k points and chooses K well-separated points as the first centroids.
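The master-side step can be sketched like this (1-D points for brevity; picking evenly spaced points from the sorted candidates is one way to keep the initial centroids far apart):

```java
import java.util.Arrays;

// Sketch of the master-side step: collect the m * k candidate points
// sent by the workers, sort them, and pick K evenly spaced points as
// the initial centroids. 1-D points; assumes k >= 2.
public class InitialCentroids {
    static double[] chooseK(double[] candidates, int k) {
        double[] sorted = candidates.clone();
        Arrays.sort(sorted);
        double[] centroids = new double[k];
        for (int i = 0; i < k; i++) {
            // Spread picks across the sorted candidates so the
            // chosen centroids are well separated.
            int idx = (int) Math.round(
                    (double) i * (sorted.length - 1) / (k - 1));
            centroids[i] = sorted[idx];
        }
        return centroids;
    }

    public static void main(String[] args) {
        // candidates from m = 3 workers, each sending k = 3 points
        double[] candidates = {5, 1, 9, 2, 8, 4, 7, 3, 6};
        System.out.println(Arrays.toString(chooseK(candidates, 3)));
        // [1.0, 5.0, 9.0]
    }
}
```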

GuaguaAppMaster Fault Tolerance

In Hadoop 2.2.0, even MRAppMaster has no fault tolerance, so of course GuaguaAppMaster has none either.

Check Hadoop 2.4.1 to see whether MRAppMaster has fault tolerance there; if so, learn from it and bring it to GuaguaAppMaster.

Coordinator Improvement for Big Model

For big models, ZooKeeper is a bottleneck; sometimes we need a big ZooKeeper heap (10 GB) and a big disk (500 GB). Consider using ZooKeeper only for coordination, and transferring results through a TCP server/client instead.

One big issue: fault tolerance doesn't work well this way. Consider persisting the master result to HDFS or ZooKeeper.

Big Model Support Test

Currently, gradient results are stored in one ZooKeeper znode; if they exceed 1 MB, secondary znodes are used. But a big model of, say, 20 MB (which may need 20 znodes) is still untested.

Test this case with big data on our Hadoop cluster.
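The chunking described above can be sketched as follows; the chunk size is shrunk here so the demo stays small:

```java
// Sketch of the chunking described above: a serialized result larger
// than ZooKeeper's ~1 MB znode limit is split across a primary znode
// plus secondary znodes. The chunk size is tiny here for the demo.
public class ZnodeChunking {
    static byte[][] split(byte[] data, int chunkSize) {
        int chunks = (data.length + chunkSize - 1) / chunkSize;
        byte[][] out = new byte[chunks][];
        for (int i = 0; i < chunks; i++) {
            int start = i * chunkSize;
            int len = Math.min(chunkSize, data.length - start);
            out[i] = new byte[len];
            System.arraycopy(data, start, out[i], 0, len);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] model = new byte[25];        // pretend 25-byte "model"
        byte[][] znodes = split(model, 10); // 10-byte "znode limit"
        // 25 bytes over a 10-byte limit needs 3 znodes: 10 + 10 + 5
        System.out.println(znodes.length + " " + znodes[2].length); // 3 5
        // likewise, a 20 MB model over a 1 MB limit needs 20 znodes
    }
}
```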
