I tried with 10k-80k doc-word dataset, the program worked, but when I tested with 100


(LDA) Could you please give more detail about tested dataset and configuration? (zen, closed)

razrLeLe commented on August 10, 2024

(LDA) Could you please give more detail about tested dataset and configuration?

from zen.

Comments (4)

bhoppi commented on August 10, 2024

Is your input corpus in LIBSVM format?
And what command arguments did you use?


razrLeLe commented on August 10, 2024

Yes, I used the LIBSVM format, and the submit command is as follows:
spark-submit \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=hdfs://goyoo/user/yuyue/log \
  --executor-cores 4 --num-executors 21 \
  --driver-memory 4G --executor-memory 4G \
  --master yarn-client \
  --class com.github.cloudml.zen.examples.ml.LDADriver \
  zen-examples-0.3-SNAPSHOT-spark1.6.1.jar \
  -numPartitions=21 -LDAAlgorithm=LightLDA -numThreads=16 \
  -numTopics=500 -alpha=0.01 -beta=0.01 -alphaAS=1.0 -totalIter=1500 \
  hdfs://goyoo/user/yuyue/10w_doc.libsvm \
  hdfs://goyoo/user/yuyue/zen_10w_result


bhoppi commented on August 10, 2024

What's wrong: -numThreads is the number of threads allocated to each partition, so it must be <= --executor-cores (your command sets -numThreads=16 with --executor-cores 4); otherwise the job won't start.
May need tuning: I don't know how large your corpus is, but if it is very big, --executor-memory 4G may not be enough, and you may need to increase it if OOM happens.
Other suggestions: -LDAAlgorithm=ZenLDA is the fastest algorithm among all the LDA implementations; -chkptinterval=100 (for example) is needed to checkpoint every 100 iterations, otherwise your job will become very slow after hundreds of iterations (because driver memory is eaten up by the very long RDD lineage information).
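
The two hard constraints in this reply (threads per partition versus executor cores, and the checkpoint interval) can be checked before submitting. The following is only a rough sketch, not part of zen: the function name `check_lda_submit_args` is hypothetical, and it assumes the flags are spelled exactly as they appear in this thread.

```python
import shlex


def check_lda_submit_args(command: str) -> list[str]:
    """Return warnings for a spark-submit command line (hypothetical helper)."""
    tokens = shlex.split(command)
    executor_cores = None
    num_threads = None
    has_checkpoint = False

    for i, tok in enumerate(tokens):
        # Spark option, given either as "--executor-cores 4" or "--executor-cores=4".
        if tok == "--executor-cores" and i + 1 < len(tokens):
            executor_cores = int(tokens[i + 1])
        elif tok.startswith("--executor-cores="):
            executor_cores = int(tok.split("=", 1)[1])
        # Driver options as quoted in the thread.
        elif tok.startswith("-numThreads="):
            num_threads = int(tok.split("=", 1)[1])
        elif tok.startswith("-chkptinterval="):
            has_checkpoint = True

    warnings = []
    if (
        executor_cores is not None
        and num_threads is not None
        and num_threads > executor_cores
    ):
        warnings.append(
            f"-numThreads={num_threads} exceeds --executor-cores {executor_cores}"
        )
    if not has_checkpoint:
        warnings.append("no -chkptinterval set; RDD lineage may grow without bound")
    return warnings


if __name__ == "__main__":
    cmd = "spark-submit --executor-cores 4 app.jar -numThreads=16 -totalIter=1500"
    for warning in check_lda_submit_args(cmd):
        print("WARNING:", warning)
```

Run against the command posted earlier in the thread, a check like this would flag both the thread count and the missing checkpoint interval.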


razrLeLe commented on August 10, 2024

@bhoppi Thanks so much for your help. I finally found out that something was wrong with the preprocessing of the corpus, which caused every word count in every document to be zero, and that led to the exception.
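
A corpus problem like this one can be caught before launching the job. The sketch below is an illustration only, with a hypothetical function name `find_empty_documents`; it assumes each line follows the usual LIBSVM layout of a leading label or document id followed by `index:value` pairs, where the value is the word count.

```python
def find_empty_documents(lines):
    """Return 1-based line numbers of documents whose word counts sum to zero."""
    empty = []
    for line_no, line in enumerate(lines, start=1):
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue  # skip blank lines entirely
        fields = line.split()
        total = 0.0
        # fields[0] is assumed to be the label / document id.
        for field in fields[1:]:
            if ":" not in field:
                continue  # ignore tokens that are not index:value pairs
            _, _, value = field.partition(":")
            total += float(value)
        if total == 0:
            empty.append(line_no)
    return empty


if __name__ == "__main__":
    sample = ["0 1:2 5:1", "1 1:0 3:0", "2", "3 4:1"]
    print(find_empty_documents(sample))
```

For a real corpus, the same function can be fed an open file handle, since it only iterates over lines.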
