
zen's Introduction

Zen

Zen aims to provide the largest-scale and most efficient machine learning platform on top of Spark, including but not limited to logistic regression, latent Dirichlet allocation, factorization machines, and DNN.

Zen is based on Apache Spark, MLlib and GraphX, but with sophisticated optimizations and newly added features to optimize and scale up machine learning training. Zen is developed with the conviction that a successful machine learning platform must combine data insight, ML algorithms, and systems experience.

zen's People

Contributors

benmccann, bhoppi, hucheng, metaflowrepo, myasuka, titsuki, witgo


zen's Issues

(LDA) Example/LDADriver: Job aborted due to stage failure: java.lang.ArrayIndexOutOfBoundsException: -6

Example/LDADriver

Job aborted due to stage failure: Task 9 in stage 28.1 failed 4 times, most recent failure: Lost task 9.3 in stage 28.1 (TID 355, cloud1014121118.wd.nm.ss.nop.ted): java.lang.ArrayIndexOutOfBoundsException: -6
at org.apache.spark.graphx2.impl.EdgePartition.dstIds(EdgePartition.scala:114)
at org.apache.spark.graphx2.impl.EdgePartition$$anon$1.next(EdgePartition.scala:341)
at org.apache.spark.graphx2.impl.EdgePartition$$anon$1.next(EdgePartition.scala:333)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at org.apache.spark.graphx2.impl.EdgePartition$$anon$1.foreach(EdgePartition.scala:333)
at org.apache.spark.graphx2.impl.RoutingTablePartition$.edgePartitionToMsgs(RoutingTablePartition.scala:58)
at org.apache.spark.graphx2.VertexRDD$$anonfun$4$$anonfun$apply$2.apply(VertexRDD.scala:359)
at org.apache.spark.graphx2.VertexRDD$$anonfun$4$$anonfun$apply$2.apply(VertexRDD.scala:359)
at scala.Function$$anonfun$tupled$1.apply(Function.scala:77)
at scala.Function$$anonfun$tupled$1.apply(Function.scala:76)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

(FM/MVM, etc.): GraphX limitations

The typical flow of FM/MVM on GraphX is as follows:

  val margin = forward(iter)
  var (_, rmse, gradient) = backward(margin, iter) {
       multi = multiplier(q, iter)
  }
  gradient = updateGradientSum(gradient, iter)
  vertices = updateWeight(gradient, iter)

Logically, margin and multiplier are temporary variables that all samples in a partition can share, since samples are processed sequentially one by one, so only one margin/multiplier copy needs to exist. However, in the current GraphX implementation there are as many margin/multiplier variables as there are samples. This costs a huge amount of memory and therefore hurts scalability.
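
A minimal sketch of the intended fix, assuming samples in a partition are consumed sequentially (names like Sample, numViews and rank are hypothetical placeholders, not the Zen API): allocate one margin buffer per partition and reuse it for every sample, instead of materializing one array per sample vertex.

  case class Sample(features: Array[(Int, Double)], label: Double)

  def forwardPartition(samples: Iterator[Sample], numViews: Int, rank: Int): Iterator[Double] = {
    // One shared temporary, reused by every sample processed sequentially in this partition.
    val margin = new Array[Double](numViews * rank)
    samples.map { sample =>
      java.util.Arrays.fill(margin, 0.0)
      // ... accumulate the view/rank partial sums of this sample into `margin` ...
      margin.sum // placeholder for the per-sample prediction
    }
  }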

(FM/MVM, etc.) Change the data type of margin/multiplier/gradient, etc. from Double to Float

Currently, the margin/multiplier/gradient RDDs cost a lot of memory, since each sample vertex carries an array of length view × rank, i.e. view × rank × sizeof(Double) bytes. For instance, given 1B samples, 3 views and 20 ranks, the margin or multiplier RDD costs 480 GB. Changing the data type from Double to Float halves the RDD size, with negligible accuracy loss.
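
A quick back-of-the-envelope check of the figures above, in plain Scala with the values from the example:

  val samples = 1e9                                 // 1B sample vertices
  val views = 3
  val rank = 20
  val doubleGB = samples * views * rank * 8 / 1e9   // 480 GB with Double (8 bytes per element)
  val floatGB  = samples * views * rank * 4 / 1e9   // 240 GB with Float (4 bytes per element)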

(GraphX): shipVertexAttributes is costly

  1. The master vertex ships the updated attribute to all of its slaves. When there are multiple partitions (slaves) on one machine, it is unnecessary to ship the attribute to that machine multiple times; once is enough.
  2. shipVertexAttributes is called twice in LDA: once in mapTriplets during sampleToken to ship attributes from masters to slaves, and once in joinVertices during updateCounter to ship attributes from vertices to edges. In fact it is unnecessary to copy and ship the attribute at all; keeping a hash map (vid -> local vertex index) suffices, as sketched below.
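
A conceptual sketch of point 2, assuming nothing about the GraphX internals (the class and field names here are illustrative only): edges reference master attributes through a vid -> local index map instead of carrying shipped copies.

  class LocalVertexView[VD](vids: Array[Long], attrs: Array[VD]) {
    // vid -> local vertex index, built once; attributes are looked up rather than copied per edge.
    private val index = scala.collection.mutable.HashMap(vids.zipWithIndex: _*)
    def attr(vid: Long): VD = attrs(index(vid))
  }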

(LDA): scalability issue in updateCounter.

The current updateCounter is implemented via aggregateMessages. It creates a dense array whose length is the number of vertices in each partition, where each element is a sparseVector. When the number of vertices in one partition is huge (consider 1B vertices), it cannot be held in memory.
This can be solved by:

  1. a better graph partitioning approach, so that the number of vertices in a partition stays bounded even when the graph (input data) is very large.
  2. using aggregateByKey. There are several advantages over aggregateMessages (see the sketch after this list):
    (1). For each edge, aggregateMessages allocates a new sparseVector (the edge attribute is an array), plus another new sparseVector as the result of each sparseVector + sparseVector.
    (2). The reason not to use reduceByKey is that aggregateByKey's seqOp, (U, V) => U, does not require V to be the same type as U, so there is no need to build a new sparseVector from the edge attribute array. Besides, to avoid memory allocation, both seqOp and combOp are allowed to modify and return their first argument instead of creating a new U.
    (3). The side effect of aggregateByKey is that it needs a sort phase when sort-based shuffle is used.
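
A hedged sketch of option 2, using illustrative names and types rather than the Zen code: the zero value is a mutable per-vertex counter that both seqOp and combOp modify and return in place, and the Array[Int] edge attribute never has to be wrapped into a sparse vector first.

  import breeze.linalg.SparseVector
  import org.apache.spark.rdd.RDD

  def updateCounter(edgeTopics: RDD[(Long, Array[Int])], numTopics: Int): RDD[(Long, SparseVector[Int])] =
    edgeTopics.aggregateByKey(SparseVector.zeros[Int](numTopics))(
      // seqOp: (U, V) => U -- V stays an Array[Int], no temporary SparseVector is allocated
      (counter, topics) => { topics.foreach(t => counter(t) += 1); counter },
      // combOp mutates and returns its first argument instead of creating a new U
      (a, b) => { b.activeIterator.foreach { case (t, c) => a(t) += c }; a }
    )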

(LDA): aliasTable opts

There are several opportunities for optimizing the aliasTable:

  1. change the probability type from Double to Float to save space
  2. a JPriorityQueue is unnecessary; its cost comes from sorting, and a simple FIFO queue is enough (see the sketch below)
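
A sketch of point 2, not the Zen implementation: Vose-style alias-table construction only needs two plain FIFO queues for the under-full and over-full buckets, and per point 1 the probabilities are kept as Float (`probs` is assumed to sum to 1).

  def buildAliasTable(probs: Array[Float]): (Array[Float], Array[Int]) = {
    val k = probs.length
    val prob = new Array[Float](k)
    val alias = new Array[Int](k)
    val scaled = probs.map(_ * k)
    // Two simple FIFO queues replace the priority queue: no sorting cost.
    val small = scala.collection.mutable.Queue(scaled.indices.filter(scaled(_) < 1f): _*)
    val large = scala.collection.mutable.Queue(scaled.indices.filter(scaled(_) >= 1f): _*)
    while (small.nonEmpty && large.nonEmpty) {
      val s = small.dequeue()
      val l = large.dequeue()
      prob(s) = scaled(s)
      alias(s) = l
      scaled(l) -= 1f - scaled(s)
      if (scaled(l) < 1f) small.enqueue(l) else large.enqueue(l)
    }
    (small ++ large).foreach(i => prob(i) = 1f) // leftovers get probability 1
    (prob, alias)
  }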

(GraphX): better partitioning strategies

There are four partitioning strategies in GraphX:

  1. random hash
  2. edge1D (src or dst)
  3. edgePartition2D

Besides, we also implemented:

  1. DBH (Degree-Based Hashing), sketched below
  2. balanced label propagation from Facebook (http://stanford.edu/~jugander/papers/wsdm13-blp.pdf and https://code.facebook.com/posts/274771932683700/large-scale-graph-partitioning-with-apache-giraph/)
  3. Bounded and Balanced Partitioner (two stages: each edge belongs to the vertex partition with the larger degree, followed by a re-balancing partitioner; details later)
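
A minimal sketch of DBH (item 1), with an assumed `degree` lookup; see DBHPartitioner in Zen for the real implementation. Each edge is hashed on its lower-degree endpoint, so that high-degree (hub) vertices are the ones replicated across partitions.

  def dbhPartition(srcId: Long, dstId: Long, degree: Long => Int, numParts: Int): Int = {
    // Hash the endpoint with the smaller degree.
    val pivot = if (degree(srcId) <= degree(dstId)) srcId else dstId
    (pivot.hashCode & Int.MaxValue) % numParts
  }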

Project's status

What's the status of the project?

The version published on Maven is 0.1.1 and dates back to 2015; the version in the repository is 0.4-SNAPSHOT, with the latest update dating back 7 months. All of this raises the question of whether the project is still alive.

mvn package fail

Hi, I checked out the latest master branch and ran mvn package, but eventually got some errors.

Downloaded: https://repo.maven.apache.org/maven2/org/scala-lang/scala-compiler/2.10.6/scala-compiler-2.10.6.jar (14134 KB at 127.9 KB/sec)
[INFO]
[INFO] --- maven-enforcer-plugin:1.2:enforce (enforce-versions) @ zen-ml_2.10 ---
[INFO]
[INFO] --- maven-enforcer-plugin:1.2:enforce (enforce-maven) @ zen-ml_2.10 ---
[INFO]
[INFO] --- scala-maven-plugin:3.2.2:add-source (eclipse-add-source) @ zen-ml_2.10 ---
[INFO] Add Source directory: /search/odin/yulei/spark_models/zen/ml/src/main/scala
[INFO] Add Test Source directory: /search/odin/yulei/spark_models/zen/ml/src/test/scala
[INFO]
[INFO] --- build-helper-maven-plugin:1.7:add-source (add-scala-sources) @ zen-ml_2.10 ---
[INFO] Source directory: /search/odin/yulei/spark_models/zen/ml/src/main/scala added.
[INFO]
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ zen-ml_2.10 ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory /search/odin/yulei/spark_models/zen/ml/src/main/resources
[INFO]
[INFO] --- scala-maven-plugin:3.2.2:compile (scala-compile-first) @ zen-ml_2.10 ---
[WARNING] Zinc server is not available at port 3030 - reverting to normal incremental compile
[INFO] Using incremental compilation
[INFO] 'compiler-interface' not yet compiled for Scala 2.10.6. Compiling...
[INFO] Compilation completed in 19.924 s
[INFO] compiler plugin: BasicArtifact(org.scalamacros,paradise_2.10.6,2.0.1,null)
Downloading: https://repo.maven.apache.org/maven2/org/scalamacros/paradise_2.10.6/2.0.1/paradise_2.10.6-2.0.1.jar
Downloaded: https://repo.maven.apache.org/maven2/org/scalamacros/paradise_2.10.6/2.0.1/paradise_2.10.6-2.0.1.jar (1807 KB at 107.2 KB/sec)
[INFO] Compiling 98 Scala sources and 4 Java sources to /search/odin/yulei/spark_models/zen/ml/target/scala-2.10/classes...
[WARNING] /search/odin/yulei/spark_models/zen/ml/src/main/scala/com/github/cloudml/zen/ml/recommendation/BSFMModel.scala:117: non-variable type argument Double in type pattern Seq[Double] is unchecked since it is eliminated by erasure
[WARNING] case Row(featureId: Long, factors: Seq[Double]) =>
[WARNING] ^
[WARNING] /search/odin/yulei/spark_models/zen/ml/src/main/scala/com/github/cloudml/zen/ml/recommendation/FMModel.scala:112: non-variable type argument Double in type pattern Seq[Double] is unchecked since it is eliminated by erasure
[WARNING] case Row(featureId: Long, factors: Seq[Double]) =>
[WARNING] ^
[WARNING] /search/odin/yulei/spark_models/zen/ml/src/main/scala/com/github/cloudml/zen/ml/recommendation/MVMModel.scala:115: non-variable type argument Double in type pattern Seq[Double] is unchecked since it is eliminated by erasure
[WARNING] case Row(featureId: Long, factors: Seq[Double]) =>
[WARNING] ^
[WARNING] /search/odin/yulei/spark_models/zen/ml/src/main/scala/com/github/cloudml/zen/ml/tree/Node.scala:20: imported `Node' is permanently hidden by definition of object Node in package tree
[WARNING] import org.apache.spark.mllib.tree.model.{Node, Predict}
[WARNING] ^
[WARNING] four warnings found
[WARNING] warning: [options] bootstrap class path not set in conjunction with -source 1.6
[WARNING] 1 warning
[INFO]
[INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ zen-ml_2.10 ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 4 source files to /search/odin/yulei/spark_models/zen/ml/target/scala-2.10/classes
[INFO]
[INFO] --- build-helper-maven-plugin:1.7:add-test-source (add-scala-test-sources) @ zen-ml_2.10 ---
[INFO] Test Source directory: /search/odin/yulei/spark_models/zen/ml/src/test/scala added.
[INFO]
[INFO] --- maven-resources-plugin:2.6:testResources (default-testResources) @ zen-ml_2.10 ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 1 resource
[INFO]
[INFO] --- scala-maven-plugin:3.2.2:testCompile (scala-test-compile-first) @ zen-ml_2.10 ---
[WARNING] Zinc server is not available at port 3030 - reverting to normal incremental compile
[INFO] Using incremental compilation
[INFO] compiler plugin: BasicArtifact(org.scalamacros,paradise_2.10.6,2.0.1,null)
[INFO] Compiling 12 Scala sources to /search/odin/yulei/spark_models/zen/ml/target/scala-2.10/test-classes...
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Zen Project Parent POM ............................. SUCCESS [ 2.416 s]
[INFO] Zen Project ML Library ............................. FAILURE [03:58 min]
[INFO] Zen Project Assembly ............................... SKIPPED
[INFO] Zen Project Examples ............................... SKIPPED

(LDA) How to set up scale-related parameters?

Hi, I'm testing LDA on a large-scale dataset, roughly a billion docs × a million words. However, Spark's built-in LDA always HANGS, and that is how I found Zen.
However, I have not found a parameter setup guide for LDA, apart from the brief descriptions in the source code.
My questions are:

  1. There are several parameters related to SCALE; is there a guide for setting them up?
  2. As for the partition number, how should it be chosen for better parallelization?
    Thanks!

(LDA): Unnecessary updates, and thus unnecessary network traffic, happen in updateCounter.

updateCounter: the newly sampled topics on the edges are sent to the word/doc vertices to update the word-topic/doc-topic tables.

Unnecessary updates, and thus unnecessary network traffic, happen in updateCounter when there is NO topic change on an edge (the newly sampled topic is the same as the old topic). In other words, only edges with a topic change need to update the counters on the vertices.

The solution is to send the delta rather than the value. The attribute of an edge (a token) is changed from topicId to an (oldTopicId, newTopicId) pair, meaning the oldTopicId entry in the word-topic/doc-topic table subtracts the aggregated delta while the newTopicId entry adds it.
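
A sketch of the delta update with illustrative names, not the Zen code: only edges whose topic actually changed contribute, and the counter is adjusted by ±1 per changed token instead of being resent in full.

  import breeze.linalg.SparseVector

  case class TopicChange(oldTopic: Int, newTopic: Int)

  def applyDeltas(counter: SparseVector[Int], changes: Iterator[TopicChange]): SparseVector[Int] = {
    changes.foreach { c =>
      if (c.oldTopic != c.newTopic) { // unchanged edges generate no message at all
        counter(c.oldTopic) -= 1
        counter(c.newTopic) += 1
      }
    }
    counter
  }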

Question about using the project

@witgo I came across the cloudml/zen project by chance and would like to use some of its algorithms. Do I need to download the project, modify the pom file for my own cluster environment, and then compile and package it for use? (I'm a complete beginner; this is my first time using a GitHub project.)

(Graphx) Upstream necessary changes to graphx

I'd like to make whatever changes are necessary to graphx so that we can use the upstream library. This repo is still on graphx 1.4, I believe, so we're not getting any of the graphx bug fixes.

I sent #56 and #57 to reduce the diff between graphx2 and upstream

I sent apache/spark#14291 to upstream addition of a new method

Changes left:

(LDA) Multi-thread GraphX implementation

The current GraphX implementation has serious scalability issues. The reason is that its RDD data structure is specially optimized for join operations (edges with vertices, inner edge joins, outer vertex joins, etc.), which means its data is not loaded block by block like other RDDs but one partition as a whole, so OOM occurs often when many partitions are loaded at the same time.
So our solution to this issue is to constrain the number of partitions loaded at the same time and to process each partition using multi-thread techniques, as sketched below.
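
A hedged sketch of the per-partition multi-threading half of this proposal (all names are illustrative; limiting how many partitions are resident at once would be handled separately, e.g. through the scheduler):

  import java.util.concurrent.{Callable, Executors}
  import scala.collection.JavaConverters._
  import scala.reflect.ClassTag

  def samplePartition[T: ClassTag](tokens: Array[T], numThreads: Int)(sampleOne: T => T): Array[T] = {
    val pool = Executors.newFixedThreadPool(numThreads)
    try {
      // Split the partition's tokens into chunks and sample each chunk on its own thread.
      val chunkSize = math.max(1, tokens.length / numThreads)
      val tasks = tokens.grouped(chunkSize).map { chunk =>
        new Callable[Array[T]] { def call(): Array[T] = chunk.map(sampleOne) }
      }.toSeq
      pool.invokeAll(tasks.asJava).asScala.flatMap(_.get).toArray
    } finally pool.shutdown()
  }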

(FM/MVM, etc.) FM is controlled by zen.lda.numPartitions

To change the number of partitions in FM you have to set zen.lda.numPartitions. It's strange that FM is controlled by an LDA configuration. Perhaps this config property should be renamed so that it does not include LDA in the name, or a second configuration property should be introduced for FM?

FM.initializeDataSet calls DBHPartitioner.partitionByDBH which references LDADefines.cs_numPartitions which is defined as zen.lda.numPartitions.
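
Until the property is renamed or a dedicated FM property is added, the current workaround is to set the LDA key explicitly (the value 200 below is just an example):

  val conf = new org.apache.spark.SparkConf()
    .setAppName("FM training")
    .set("zen.lda.numPartitions", "200") // today this also controls FM's partition count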

(LDA) Could you please give more detail about tested dataset and configuration?

I tried a 10k-80k doc-word dataset and the program worked, but when I tested a 100k-160k doc-word dataset the program failed with this exception:

Exception in thread "main" java.lang.UnsupportedOperationException: empty collection
at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$40.apply(RDD.scala:1027)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$40.apply(RDD.scala:1027)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1027)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.reduce(RDD.scala:1007)
at org.apache.spark.rdd.RDD$$anonfun$max$1.apply(RDD.scala:1410)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.max(RDD.scala:1409)
at com.github.cloudml.zen.ml.clustering.DistributedLDAModel.save(LDAModel.scala:242)
at com.github.cloudml.zen.ml.clustering.DistributedLDAModel.save(LDAModel.scala:222)
at com.github.cloudml.zen.examples.ml.LDADriver$.runTraining(LDADriver.scala:139)
at com.github.cloudml.zen.examples.ml.LDADriver$.main(LDADriver.scala:114)
at com.github.cloudml.zen.examples.ml.LDADriver.main(LDADriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

I am wondering whether it is necessary to set some specific configuration or meet certain hardware requirements, so could you please give more detail about your configuration when working with billions of documents?

And thanks so much for your awesome work!


Tests failing

If I run mvn test I get:

Tests: succeeded 4, failed 5, canceled 0, ignored 8, pending 0
*** 5 TESTS FAILED ***

Resubmit to Spark Packages

Hi,

Thank you very much for your submission to Spark Packages! Due to an internal error, your process couldn't be completed. Could you please resubmit it?

Thanks,
Burak

Constrained ALS and ALM

@witgo

I have a package for factorization that's based on ml.recommendation.ALS but with several major changes:

  1. For ALS, user and product constraints can be specified. This allows us to add column-wise L2 regularization for words and L1 regularization for documents (through Breeze QuadraticMinimizer) to run sparse coding.
  2. In place of L1 regularization, a probability simplex can be added on documents and positivity constraints on words to get PLSA constraints with least-squares loss.
  3. Alternating minimization supports KL divergence and likelihood loss with positivity constraints in matrix factorization, to run the PLSA formulation and generate LDA results through factorization.
  4. Alternating minimization shuffles sparse vectors and is designed to scale to large-rank matrix factorization, like Petuum.

Details are on the following JIRAs:

  1. https://issues.apache.org/jira/browse/SPARK-2426
  2. https://issues.apache.org/jira/browse/SPARK-6323

If it looks useful, I can add a factorization package in zen and bring in the code from the Spark PRs. zen is already in spark-packages, so I don't have to introduce another new package. If users find it useful, maybe later we can move it back to ml; it changes the user-facing API significantly.

Next I want to move these algorithms to graphx API and compare the runtime and efficiency. Since zen is focused on optimizing graphx for ML, I feel zen is an ideal package for these factorization algorithms.

Factorization outputs are large distributed models, and a natural extension is to add a few hidden layers between user/word and item/document and develop a distributed neural net formulation, which should use the optimized graphx API; I think you have already built many of these optimizations into zen.

(FM/MVM, etc.): ArrayIndexOutOfBoundsException when scaling

I get the exception below whenever I run the FM code on any of my real datasets. It seems to break roughly when you have >100k training examples and >100 machines.

java.lang.ArrayIndexOutOfBoundsException: -1
    at org.apache.spark.graphx.util.collection.GraphXPrimitiveKeyOpenHashMap$mcJI$sp.apply$mcJI$sp(GraphXPrimitiveKeyOpenHashMap.scala:64)
    at org.apache.spark.graphx.impl.EdgePartition.updateVertices(EdgePartition.scala:91)
    at org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:75)
    at org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:73)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:163)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Here's the driver stacktrace:

org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:706)
com.github.cloudml.zen.ml.partitioner.DBHPartitioner$.partitionByDBH(DBHPartitioner.scala:70)
com.github.cloudml.zen.ml.recommendation.FM$.initializeDataSet(FM.scala:498)

The odd thing is that the driver stacktrace shows the error happening in initializeDataSet, but it doesn't seem to occur until training is done. To speed reproduction of the problem I set numIterations to 1.

(LDA): sparse initialization rather than uniformly random initialization

A known issue with LDA training is that the first several iterations are very costly. This is largely because the uniformly random initialization makes the word-topic, and hence doc-topic, tables quite dense.

There are two approaches:

  1. sparse initialization, which constrains each word to only a small random part (say 1%) of all topics; each token of that word is then randomly sampled from those constrained topics rather than from all topics (see the sketch after this list).
  2. first use part of the corpus (say 1%) to train for several iterations to initialize the word-topic distribution, which should be much sparser than a uniformly random initialization.
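
A sketch of approach 1 with illustrative names, not the Zen code: each word is restricted to a small random subset of topics, and each of its tokens draws its initial topic from that subset instead of from all K topics.

  import scala.util.Random

  def sparseInitTopics(wordId: Long, tokenCount: Int, numTopics: Int, subsetRatio: Double = 0.01): Array[Int] = {
    val rnd = new Random(wordId) // a word-specific seed keeps the subset stable across that word's tokens
    val subsetSize = math.max(1, (numTopics * subsetRatio).toInt)
    val topicSubset = Array.fill(subsetSize)(rnd.nextInt(numTopics))
    Array.fill(tokenCount)(topicSubset(rnd.nextInt(subsetSize)))
  }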

(LDA): Search complexity in SparseVector (word-topic and doc-topic vectors) is log(K). Please consider using HashVector in Breeze (or OpenAddressHashArray in Breeze), which is O(1)

During sampleToken, when computing the probability Nkd*(Nkw+beta), given a topic k in Nkd, finding the corresponding Nkw for that topic k costs O(Kw). If we choose to use a HashVector, the complexity can be reduced to O(1), at the cost of a little extra space.
Since HashVector in Breeze is not serializable, please consider directly using OpenAddressHashArray in Breeze.
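
A hedged sketch of the suggestion, assuming Breeze's OpenAddressHashArray supports apply/update by index; the topic count and topic id below are illustrative only.

  import breeze.collection.mutable.OpenAddressHashArray

  val numTopics = 10000 // assumed K
  val wordTopicCounts = new OpenAddressHashArray[Int](numTopics)
  wordTopicCounts(42) += 1      // O(1) update for topic 42
  val nkw = wordTopicCounts(42) // O(1) lookup during sampleToken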

Documentation to /run/ examples?

Hello,

Could you provide some documentation on how to run the examples? I'm having issues compiling: since mvn install is disabled, it's difficult to do a mvn exec:java from inside the examples directory. Is there a way to get a single shaded jar for all sub-modules? The jar created in assembly/target does not include the examples, and with install disabled I'm not clear on how to run an example.

Any advice?

Craig

(LDA): different terminate condition for different vertices.

The insight is that the convergence speed of the topics of some edges, or the word-topic distributions of some words, differs: some converge earlier. For converged edges/words, it is unnecessary to include them in the working set of the next iteration.
The question is how to determine whether an edge/word has converged. A feasible solution is to use the Bhattacharyya coefficient (https://en.wikipedia.org/wiki/Bhattacharyya_distance) to compare the word-topic similarity of two consecutive iterations: the more similar, the more likely that word has converged.
We do not simply filter out converged words based on a threshold; instead, we sample the edges of each word with a probability that is negatively proportional to the similarity, and we also consider a time factor: the longer an edge has not been sampled, the higher its new sampling probability. A sketch follows.
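
A sketch of the proposed test, with an assumed boost factor for the time component (illustrative, not the Zen code): the Bhattacharyya coefficient of a word's topic distribution across two consecutive iterations, and a resampling probability that falls with similarity and rises with the number of iterations since the word was last sampled.

  def bhattacharyya(p: Array[Double], q: Array[Double]): Double =
    p.zip(q).map { case (a, b) => math.sqrt(a * b) }.sum // 1.0 means the two distributions are identical

  def resampleProb(similarity: Double, itersSinceSampled: Int): Double =
    math.min(1.0, (1.0 - similarity) + 0.1 * itersSinceSampled) // 0.1 is an assumed per-iteration boost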
