dmlc / wormhole Goto Github PK

Deprecated

License: Other

Makefile 5.12% Shell 1.75% C++ 88.68% Protocol Buffer 4.45%

wormhole's Introduction

This repo is deprecated. We have realized that it is much better to build individual projects that have deep optimization and integration with the existing system.

Pointers to the Projects

XGBoost for tree boosting.
MXNet for for deep learning
DiFaco for factorization.

List of Tools

Boosted Trees (GBDT): XGBoost: eXtreme Gradient Boosting
Clustering: kmeans
Linear method: Asynchrouns SGD L-BFGS
Factorization Machine: DiFacto

wormhole's People

Contributors

Stargazers

Watchers

Forkers

uwroute thomasdic2000 jethrotan 7472741 goldencwcui damingnju chagge prosoul yiiwood luyee orangelpai nkhuyu nelsonwhf 52nlp zjf rt0220 litaoshao sdd031215 betulike ericxsun xinchoubiology alienfeel sadapple fangzheng354 ty01csbaidu ambier yfszzx txd866 wycharry erickousz nagyistoce sinzero elviswf louiss007 julianyu123456 zkailinzhang leanhd evansky wellsoftware guzhaki zhdrfirst tjevgerres ctozlm zhouxiazx jjdblast winggyn ericchendm chenghuige dlotw miaoli elephann dfgusdf jl9n ml-lab travisbarrydick novieq aialbert lihuibng zjucsxxd yanqingmen milestonesvn shilad c4e8ece0 poneyo kevinhsu mindis yongchn easonchan1213 bssrdf tongming jackiexie168 isnowfy huangpingchun wachaong zbxzc35 lqzhao891016 antonymayi alexchao2012 huangjun6919 xiangqiaolxq liuchenxjtu edi-bice fedorajzf sinaad zhimingz pablete liyong3forever nexcafe xiangbai yigenliang stevencoding intoraw bikong2 datamining4science mstebelev hexi2015 ygan1129 vivekhariharan-zz ericsimonzhu wenmin-wu

wormhole's Issues

* Check failure stack trace: * 已放弃 (核心已转储)

按照http://wormhole.readthedocs.org/en/latest/tutorial/criteo_kaggle.html，
运行wormhole/bin/text2crb.dmlc train.txt data/train criteo 300时，出现
*** Check failure stack trace: *** 已放弃 (核心已转储)

xgboost in yarn container failing to start without obvious error

I am unable to launch xgboost on yarn - all the containers fails upon starting spilling following output:

antony.mayi(&container_1442490159810_0228_01_000007��stderr�0�stdout�0(&container_1442490159810_0228_01_000002��stderr�0�stdout�0(&container_1442490159810_0228_01_000011��stderr�0�stdout�0��
VERSION*(&container_1442490159810_0228_01_000011��none��������data:BCFile.index�none�R
�data:TFile.index�none��66data:TFile.meta�none�����]����h��׶9�A@���P

I am quite puzzled where to look now,
thanks Antony.

[DMLC] Task 0 killed because of exceeding allocated virtual memory

I submit job via

tracker/dmlc-submit \
    --cluster yarn \
    --num-workers 1 \
    --num-servers 1 \
    --queue my_queue \
    --worker-cores 4 \
    --server-cores 4 \
    --ship-libcxx /opt/gcc-4.8.2/lib64/ \
    bin/linear.dmlc demo/linear/conf.linear.train

But the application is killed because of exceeding allocated virtual memory. The full log listed below.

16/09/01 14:48:03 INFO impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_1472556645971_451712_01_000003
16/09/01 14:48:03 INFO impl.ContainerManagementProtocolProxy: Opening proxy : rz-data.rz.xxx.com:8043
16/09/01 14:48:03 INFO dmlc.ApplicationMaster: onContainerStarted Invoked
16/09/01 14:48:03 INFO dmlc.ApplicationMaster: onContainerStarted Invoked
16/09/01 14:48:13 INFO dmlc.ApplicationMaster: [DMLC] Task 0 killed because of exceeding allocated virtual memory
16/09/01 14:48:13 INFO impl.NMClientAsyncImpl: Processing Event EventType: STOP_CONTAINER for Container container_1472556645971_451712_01_000002
16/09/01 14:48:13 INFO impl.NMClientAsyncImpl: Processing Event EventType: STOP_CONTAINER for Container container_1472556645971_451712_01_000003
16/09/01 14:48:13 INFO impl.ContainerManagementProtocolProxy: Opening proxy : rz-data.rz.xxx.com:8043
16/09/01 14:48:13 INFO impl.ContainerManagementProtocolProxy: Opening proxy : rz-data.rz.xxx.com:8043
16/09/01 14:48:13 INFO dmlc.ApplicationMaster: onContainerStopped Invoked
16/09/01 14:48:13 INFO dmlc.ApplicationMaster: onContainerStopped Invoked
16/09/01 14:48:13 INFO dmlc.ApplicationMaster: Application completed. Stopping running containers
16/09/01 14:48:13 INFO dmlc.ApplicationMaster: Diagnostics., num_tasks2, finished=0, failed=2
[DMLC] Task 0 killed because of exceeding allocated virtual memory
16/09/01 14:48:13 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
Exception in thread "main" java.lang.Exception: Application not successful
        at org.apache.hadoop.yarn.dmlc.ApplicationMaster.run(ApplicationMaster.java:290)
        at org.apache.hadoop.yarn.dmlc.ApplicationMaster.main(ApplicationMaster.java:115)
End of LogType:stderr

How to solve this problem?

Build failed .. need help .. thank you ..

[spark@a1_sw_3552p wormhole]$ make
make -C repo/ps-lite ps config=/home/spark/wormhole/make/config.mk DEPS_PATH=/home/spark/wormhole/deps CXX=g++
make[1]: Entering directory /home/spark/wormhole/repo/ps-lite' make[1]: Nothing to be done forps'.
make[1]: Leaving directory /home/spark/wormhole/repo/ps-lite' make -C repo/dmlc-core libdmlc.a config=/home/spark/wormhole/make/config.mk DEPS_PATH=/home/spark/wormhole/deps CXX=g++ make[1]: Entering directory/home/spark/wormhole/repo/dmlc-core'
make[1]: “libdmlc.a”是最新的。
make[1]: Leaving directory /home/spark/wormhole/repo/dmlc-core' make -C learn/test build.mk config=/home/spark/wormhole/make/config.mk DEPS_PATH=/home/spark/wormhole/deps CXX=g++ make[1]: Entering directory/home/spark/wormhole/learn/test'
make[1]: Nothing to be done for build.mk'. make[1]: Leaving directory/home/spark/wormhole/learn/test'
make -C learn/kmeans kmeans.dmlc DEPS_PATH=/home/spark/wormhole/deps CXX=g++
make[1]: Entering directory /home/spark/wormhole/learn/kmeans' g++ -Wall -msse2 -Wno-unknown-pragmas -fPIC -I../../repo/rabit/include -I../../repo/dmlc-core/include -std=c++11 -o kmeans.dmlc kmeans.cc ../../repo/dmlc-core/libdmlc.a ../../repo/rabit/lib/librabit.a -L../../lib /home/spark/wormhole/deps/lib/libglog.a /home/spark/wormhole/deps/lib/libgflags.a -pthread -lm -lrt -fopenmp -lrt ../../repo/dmlc-core/libdmlc.a(data.o): In functionstd::_cxx11::basic_string<char, std::char_traits, std::allocator >* google::MakeCheckOpString<bool, bool>(bool const&, bool const&, char const)':
data.cc:(.text._ZN6google17MakeCheckOpStringIbbEEPNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKT_RKT0_PKc[_ZN6google17MakeCheckOpStringIbbEEPNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKT_RKT0_PKc]+0x42): undefined reference to _ZN6google4base21CheckOpMessageBuilder9NewStringB5cxx11Ev' ../../repo/dmlc-core/libdmlc.a(data.o): In functionstd::cxx11::basic_string<char, std::char_traits, std::allocator > google::MakeCheckOpString<unsigned long, unsigned long>(unsigned long const&, unsigned long const&, char const)':
data.cc:(.text._ZN6google17MakeCheckOpStringImmEEPNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKT_RKT0_PKc[_ZN6google17MakeCheckOpStringImmEEPNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKT_RKT0_PKc]+0x41): undefined reference to _ZN6google4base21CheckOpMessageBuilder9NewStringB5cxx11Ev' ../../repo/dmlc-core/libdmlc.a(data.o): In functionstd::cxx11::basic_string<char, std::char_traits, std::allocator > google::MakeCheckOpString<unsigned long, int>(unsigned long const&, int const&, char const)':
data.cc:(.text._ZN6google17MakeCheckOpStringImiEEPNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKT_RKT0_PKc[_ZN6google17MakeCheckOpStringImiEEPNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKT_RKT0_PKc]+0x40): undefined reference to _ZN6google4base21CheckOpMessageBuilder9NewStringB5cxx11Ev' collect2: 错误：ld 返回 1 make[1]: *_\* [kmeans.dmlc] 错误 1 make[1]: Leaving directory /home/spark/wormhole/learn/kmeans'
make: *** [learn/kmeans/kmeans.dmlc] 错误 2
[spark@a1_sw_3552p wormhole]$

where is or how to build the file wormhole/bin/text2crb.dmlc

base/crc32.h is missing

make linear fails now

make[1]: Entering directory /mnt/home.dokt/antonkos/code/dmlc/wormhole/learn/linear' make[1]: *** No rule to make target../base/crc32.h', needed by build/linear.o'. Stop. make[1]: Leaving directory/mnt/home.dokt/antonkos/code/dmlc/wormhole/learn/linear'
make: *** [learn/linear/build/linear.dmlc] Error 2

what is relationship between the kv_app(KVWorker, KVServer) and Apps(BCD, LBFGS, SGD)

Hi, guys:
As has been introduced in the paper 'DiFacto — Distributed Factorization Machines', Difacto's communication relies on the Parameter Server. From the project's dictionary, we see that kv_app.h encapsulates the van logic to supply some higher but simplistic interface Pull, Push;
However we don't see any function call of Push and Pull in apps such as BCD, LBFGS, SGD etc., which is called by store_local's Pull through the function Get. So, I am puzzeled what is relationship between the kv_app(KVWorker, KVServer) and Apps(BCD, LBFGS, SGD). Can anybody help me ?

wormhole no longer makes

I had no issue running make for wormhole in November, but I recently tried again after cloning the latest git and received the following error and the make did not succeed:

g++ -O3 -ggdb -Wall -std=c++11 -I./ -I../ -I../../repo/ps-lite/src -I../../repo/dmlc-core/include -I../../repo/dmlc-core/src -I/home/hadoop/wormhole/deps/include -fopenmp -fPIC -DDMLC_USE_HDFS=1 -I/home/hadoop/.versions/2.4.0-amzn-5/include -I/usr/java/latest/include -DDMLC_USE_S3=1 -DDMLC_USE_GLOG=1 -DDMLC_USE_AZURE=0 -MM -MT build/linear.o linear.cc >build/linear.d
g++ -O3 -ggdb -Wall -std=c++11 -I./ -I../ -I../../repo/ps-lite/src -I../../repo/dmlc-core/include -I../../repo/dmlc-core/src -I/home/hadoop/wormhole/deps/include -fopenmp -fPIC -DDMLC_USE_HDFS=1 -I/home/hadoop/.versions/2.4.0-amzn-5/include -I/usr/java/latest/include -DDMLC_USE_S3=1 -DDMLC_USE_GLOG=1 -DDMLC_USE_AZURE=0 -c linear.cc -o build/linear.o
In file included from ../base/workload_pool.h:5:0,
from ../solver/data_parallel.h:8,
from ../solver/iter_solver.h:5,
from ../solver/minibatch_solver.h:5,
from async_sgd.h:5,
from linear.cc:1:
../base/match_file.h: In function ‘void dmlc::MatchFile(const string&, std::vectorstd::basic_string)’:
../base/match_file.h:22:58: error: no matching function for call to ‘dmlc::io::FileSystem::GetInstance(std::string&)’
dmlc::io::FileSystem::GetInstance(path_uri.protocol);
^
../base/match_file.h:22:58: note: candidate is:
In file included from ../base/match_file.h:2:0,
from ../base/workload_pool.h:5,
from ../solver/data_parallel.h:8,
from ../solver/iter_solver.h:5,
from ../solver/minibatch_solver.h:5,
from async_sgd.h:5,
from linear.cc:1:
../../repo/dmlc-core/src/io/filesys.h:84:22: note: static dmlc::io::FileSystem dmlc::io::FileSystem::GetInstance(const dmlc::io::URI&)
static FileSystem GetInstance(const URI &path);
^
../../repo/dmlc-core/src/io/filesys.h:84:22: note: no known conversion for argument 1 from ‘std::string {aka std::basic_string}’ to ‘const dmlc::io::URI&’
make[1]: ** [build/linear.o] Error 1
make[1]: Leaving directory `/home/hadoop/wormhole/learn/linear'
make: *** [learn/linear/build/linear.dmlc] Error 2

The deps built successfully, it's only the wormhole packages that aren't building.

document xgboost YARN build parameters

the wiki/readthedocs build instructions do not mention dmlc parameter - one finds out the hard way after running in cluster and looking at job logs.

furthermore dmlc=1 which is suggested in those job error logs doesn't cut it - the build fails with a cryptic "file does not exist 1/make/config.mk" ... which apparently is due to the "dmlc=1"

googling, and reading, around it appears one can/should specify a path to the dmlc-core via dmlc build parameter like so "make dmlc=../dmlc-core"

difacto.dmlc hanged in nanosleep()

Hello and Regards...
I tried to replicate http://wormhole.readthedocs.org/en/latest/tutorial/criteo_kaggle.html on single server single worker.I am stuck at prediction in Factorization Machine. Any I am sorry for not knowing the proper format to describe the issues...
ISSUES:

I started the script with 1 worker still it created 3 workers. all of the workers are stuck at nanosleep() never to return (plz see below for strace)..
output is not created properly(plz see below for wc -l of the output folder and testData) .
tracker script never returned.

$ wormhole/tracker/dmlc_local.py -n 1 -s 1 wormhole/bin/difacto.dmlc difacto.test.conf.small
INFO start listen on 127.0.1.1:9091
Connected 1 servers and 1 workers
Loading the last model
Predicting
sec ttl #ex inc #ex | |w|_0 logloss_w | |V|_0 logloss AUC

$cat difacto.test.conf.small
val_data = "data/train-part_80"
data_format = "libsvm"
model_in = "model/criteo"
predict_out = "output/criteo"
embedding {
dim = 7
threshold = 7
lambda_l2 = 0.0001
}

$ wc -l output/* data/train-part_80
47870 output/criteotrain-part_80_part-0
47876 output/criteotrain-part_80_part-9
482611 data/train-part_80

$ ps -Af | grep dmlc
madhur 9304 3379 0 13:06 pts/11 00:00:04 python wormhole/tracker/dmlc_local.py -n 1 -s 1 wormhole/bin/difacto.dmlc difacto.test.conf.small
madhur 9306 9304 0 13:06 pts/11 00:00:00 /bin/sh -c wormhole/bin/difacto.dmlc difacto.test.conf.small
madhur 9309 9304 0 13:06 pts/11 00:00:00 bash -c nrep=0 rc=254 while [ $rc -eq 254 ]; do export DMLC_NUM_ATTEMPT=$nrep wormhole/bin/difacto.dmlc difacto.test.conf.small rc=$?; nrep=$((nrep+1)); done
madhur 9311 9304 0 13:06 pts/11 00:00:00 bash -c nrep=0 rc=254 while [ $rc -eq 254 ]; do export DMLC_NUM_ATTEMPT=$nrep wormhole/bin/difacto.dmlc difacto.test.conf.small rc=$?; nrep=$((nrep+1)); done
madhur 9312 9309 0 13:06 pts/11 00:00:02 wormhole/bin/difacto.dmlc difacto.test.conf.small
madhur 9313 9311 0 13:06 pts/11 00:00:01 wormhole/bin/difacto.dmlc difacto.test.conf.small
madhur 9330 9306 0 13:06 pts/11 00:00:00 wormhole/bin/difacto.dmlc difacto.test.conf.small

$ sudo strace -p 9312
Process 9312 attached
restart_syscall(<... resuming interrupted call ...>) = 0
nanosleep({0, 50000000}, NULL) = 0
nanosleep({0, 50000000}, NULL) = 0
nanosleep({0, 50000000}, NULL) = 0
nanosleep({0, 50000000}, NULL) = 0
nanosleep({0, 50000000}, ^CProcess 9312 detached
<detached ...>

$sudo strace -p 9313 same as above

$sudo strace -p 9330
restart_syscall(<... resuming interrupted call ...>) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({1, 0}, 0x7ffef260e300) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({1, 0}, ^CProcess 9330 detached
<detached ...>

[question] Asynchronous SGD?

Hi, I am maintain a history project which using wormhole linear Asynchronous SGD model, however, now wormhole is deperacated and cannot compiling success. I'm wondering that is it move to exists projects?

xgboost is compiled in local mode

Hi,

really struggling to compile xgboost for yarn with hdfs support. I can launch the yarn app but in the container logs it show:

terminate called after throwing an instance of 'std::runtime_error'
  what():  xgboost is compiled in local mode
to use hdfs, s3 or distributed version, compile with make dmlc=1

The libhdfs library on my system is placed as follows:

/usr/lib64/libhdfs.so
/usr/lib64/libhdfs.so.0.0.0
/usr/lib/hadoop/lib/native/libhdfs.a
/usr/include/hdfs.h

I tried building xgboost using these options:

make USE_HDFS=1 HDFS_LIB_PATH=/usr/lib64 JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.45-28.b13.el6_6.x86_64 HADOOP_HDFS_HOME=/usr
make USE_HDFS=1 HDFS_LIB_PATH=/usr/lib/hadoop/lib/native/ JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.45-28.b13.el6_6.x86_64 HADOOP_HDFS_HOME=/usr
make USE_HDFS=1 HDFS_LIB_PATH=/usr/lib64 HDFS_INC_PATH=/usr/include JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.45-28.b13.el6_6.x86_64
make USE_HDFS=1 HDFS_LIB_PATH=/usr/lib/hadoop/lib/native/ HDFS_INC_PATH=/usr/include JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.45-28.b13.el6_6.x86_64

But still no joy, keep getting the "xgboost is compiled in local mode".

Can you please advise?

Thanks,
Antony.

xgboost on yarn seems getting stuck

I'm running mushroom example on yarn like this,
tracker/dmlc_yarn.py -n 4 --vcores 2
bin/xgboost.dmlc
learn/xgboost/mushroom.hadoop.conf nthread=2
data=hdfs://${hdfs_path}/expdata/agaricus.txt.train
eval[testhdfs://${hdfs_path}/expdata/agaricus.txt.test
model_out=s://${hdfs_path}/mushroom.final.model

These messages are printed on screen:
2015-11-26 14:26:25,491 INFO @TracKer All of 4 nodes getting started
2015-11-26 14:26:29,634 INFO [0] train-error:0.014433
[0] train-error:0.014433
2015-11-26 14:26:29,914 INFO [1] train-error:0.001228
[1] train-error:0.001228

and it has not finished after 20 minutes
Have anyone seen this before ?

wormhole xgboost doesn't build with HDFS

Despite building wormhole with USE_HDFS=1 the xgboost tool is not built that way:

terminate called after throwing an instance of 'std::runtime_error'
what(): xgboost is compiled in local mode
to use hdfs, s3 or distributed version, compile with make dmlc=1
Aborted

The link to "parameter server" is invalid in README.md

Ah, that may confuse users who prefer to build the dependent libs one by one.

can not get the correct predict result

I has used xgboost (single machine version) before， recently I used distributed version on hadoop to train model. And I get the trained model from hadoop, then used the python wrappered xgboost to predict, but the predict result is terrible. But if I use the xgboost.dmlc tool to predict, the result is correct. I'm reading the source code, may be the python wrappered xgboost not load model correct？

Readable model dump.

Hi,

Currently all learning methods in wormhole save resulted models in binary format. This is pretty well in cases of solving machine learning competitions, i.e training and predicting both using wormhole components. However in more general cases when we train the models offline and want to apply them in an online component (in our case it's a server running on JVM), the binary format results in some inconvenience. So a readable model output in text format (or other exchangeable format such as protobuf) is highly expected.

Thanks,
Gang

linear local mode error:JUST_A_UNKNOWN_NODE is disconnected

command is :
repo/dmlc-core/tracker/dmlc-submit --cluster local --env DMLC_CPU_VCORES=1 --env DMLC_MEMORY_MB=512 --num-workers 2 --num-servers 1 --worker-cores 1 --server-cores 1 learn/linear/build/linear.dmlc learn/linear/guide/demo.conf

client error show:
Connected 1 servers and 2 workers
Training: iter = 0
sec ttl #ex inc #ex |w|_0 logloss accuracy AUC
Exception in thread Thread-3:
Traceback (most recent call last):
File "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
self.run()
File "/usr/lib64/python2.7/threading.py", line 764, in run
self.__target(*self.__args, **self.__kwargs)
File "/opt/alalei/wormhole_new2/repo/dmlc-core/tracker/dmlc_tracker/local.py", line 45, in exec_cmd
raise RuntimeError('Get nonzero return code=%d' % ret)
RuntimeError: Get nonzero return code=-11

Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
self.run()
File "/usr/lib64/python2.7/threading.py", line 764, in run
self.__target(*self.__args, **self.__kwargs)
File "/opt/alalei/wormhole_new2/repo/dmlc-core/tracker/dmlc_tracker/local.py", line 45, in exec_cmd
raise RuntimeError('Get nonzero return code=%d' % ret)
RuntimeError: Get nonzero return code=-11

/tmp/linear.dmlc.H.log.INFO.20170918-133532.8524 show:
I0918 13:35:32.054481 8524 van.cc:30] I'm [role: SCHEDULER id: "H" hostname: "10.2.177.240" port: 9092]
I0918 13:35:32.057025 8524 manager.cc:34] Staring system. Logging into /tmp/linear.dmlc.log.*
I0918 13:35:32.068665 8557 workload_pool.h:168] assign W_10.2.177.240_52648 job learn/data/agaricus.txt.train 0 / 10. 1 #jobs on processing.
I0918 13:35:32.068797 8557 workload_pool.h:168] assign W_10.2.177.240_43323 job learn/data/agaricus.txt.train 1 / 10. 2 #jobs on processing.
I0918 13:35:32.125037 8556 manager.cc:275] JUST_A_UNKNOWN_NODE is disconnected

/tmp/linear.dmlc.W_10.2.177.240_43323.log.INFO.20170918-133532.8525 show:
I0918 13:35:32.048825 8525 van.cc:30] I'm [role: WORKER id: "W_10.2.177.240_43323" hostname: "10.2.177.240" port: 43323]
I0918 13:35:32.069018 8551 minibatch_solver.h:291] iter = 0, training, learn/data/agaricus.txt.train 1 / 10, minibatch = 1000, concurrency = 2, shuffle ratio = 10000, negative sampling =

/tmp/linear.dmlc.S_10.2.177.240_40067.log.INFO.20170918-133532.8529 show:
I0918 13:35:32.052947 8529 van.cc:30] I'm [role: SERVER id: "S_10.2.177.240_40067" hostname: "10.2.177.240" port: 40067]

Can we change the thread number while running the application?

When I run linear locally, I find there is a config.proto where I get some parameters, such as max_concurrency and num_threads .

But when I change the num_threads , the efficiency wasn't improved. After I read the src, I wonder whether the num_threads can change through the config, and how can I get the thread number of the application.

example '' ../learn/tool/text2crb train.txt data/train criteo 300" failed

ubgpu@ubgpu:/github/DMLC/wormhole/data$ ll ../learn/tool
total 9732
drwxrwxr-x 2 ubgpu ubgpu 4096 8月 22 01:30 ./
drwxrwxr-x 12 ubgpu ubgpu 4096 8月 21 22:19 ../
-rwxrwxr-x 1 ubgpu ubgpu 3275903 8月 21 22:22 convert*
-rw-rw-r-- 1 ubgpu ubgpu 3293 8月 21 22:19 convert.cc
-rw-rw-r-- 1 ubgpu ubgpu 1737 8月 21 22:22 convert.d
-rw-rw-r-- 1 ubgpu ubgpu 1746584 8月 21 22:22 convert.o
-rw-rw-r-- 1 ubgpu ubgpu 9 8月 21 22:19 .gitignore
-rw-rw-r-- 1 ubgpu ubgpu 805 8月 21 22:19 Makefile
-rwxrwxr-x 1 ubgpu ubgpu 3241724 8月 22 01:30 text2crb*
-rw-rw-r-- 1 ubgpu ubgpu 2456 8月 21 22:19 text2crb.cc
-rw-rw-r-- 1 ubgpu ubgpu 1541 8月 22 01:30 text2crb.d
-rw-rw-r-- 1 ubgpu ubgpu 1661216 8月 22 01:30 text2crb.o
ubgpu@ubgpu:/github/DMLC/wormhole/data$ ../learn/tool/text2crb train.txt data/train criteo 300
F0822 01:32:35.664461 21683 local_filesys.cc:150] Check failed: allow_null LocalFileSystem: fail to open data/train-part_00
*** Check failure stack trace: ***
@ 0x4218ea google::LogMessage::Fail()
@ 0x42381f google::LogMessage::SendToLog()
@ 0x4214cf google::LogMessage::Flush()
@ 0x42415e google::LogMessageFatal::LogMessageFatal()
@ 0x41bc8e dmlc::io::LocalFileSystem::Open()
@ 0x410ce4 dmlc::Stream::Create()
@ 0x4054a4 main
@ 0x7f5f9e1fcec5 __libc_start_main
@ 0x4071b6 (unknown)
Aborted (core dumped)
ubgpu@ubgpu:/github/DMLC/wormhole/data$
ubgpu@ubgpu:/github/DMLC/wormhole/data$ ll
total 12311968
drwxrwxr-x 2 ubgpu ubgpu 4096 8月 21 23:29 ./
drwxrwxr-x 10 ubgpu ubgpu 4096 8月 22 00:50 ../
-rw-rw-r-- 1 ubgpu ubgpu 0 8月 21 23:29 put_dac.tar.gz_in_big_data
-rw-r--r-- 1 ubgpu ubgpu 1927 8月 22 2014 readme.txt
-rw-r--r-- 1 ubgpu ubgpu 1460246311 8月 22 2014 test.txt
-rw-r--r-- 1 ubgpu ubgpu 11147184845 5月 12 2014 train.txt
ubgpu@ubgpu:/github/DMLC/wormhole/data$

./build.sh

src/org/apache/hadoop/yarn/dmlc/ApplicationMaster.java:549: error: error while writing ApplicationMaster.RMCallbackHandler: could not create parent directories
private class RMCallbackHandler implements AMRMClientAsync.CallbackHandler {
^
1 error
java.io.FileNotFoundException: dmlc-yarn.jar (Permission denied)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.(FileOutputStream.java:213)
at java.io.FileOutputStream.(FileOutputStream.java:101)
at sun.tools.jar.Main.run(Main.java:195)
at sun.tools.jar.Main.main(Main.java:1288)

Python/R Interface to wormhole

Feature enhancement request.
As per XGBoost and MXnet, is there any plans to add python / R interfaces to wormhole (specifically difacto is my use case)

falied on make

ubgpu@ubgpu:/github/wormhole$ cp make/config.mk ./
ubgpu@ubgpu:/github/wormhole$ make
cd dmlc-core; make libdmlc.a config=/home/ubgpu/github/wormhole/config.mk; cd /home/ubgpu/github/wormhole
make[1]: Entering directory /home/ubgpu/github/wormhole/dmlc-core' g++ -c -O3 -Wall -msse2 -Wno-unknown-pragmas -Iinclude -std=c++0x -fopenmp -fPIC -DDMLC_USE_HDFS=1 -I/include -I/include -DDMLC_USE_S3=1 -o io.o src/io.cc In file included from src/io.cc:17:0: src/io/hdfs_filesys.h:10:18: fatal error: hdfs.h: No such file or directory #include <hdfs.h> ^ compilation terminated. make[1]: *** [io.o] Error 1 make[1]: Leaving directory/home/ubgpu/github/wormhole/dmlc-core'
cd repo/xgboost; make dmlc=/home/ubgpu/github/wormhole/dmlc-core config=/home/ubgpu/github/wormhole/config.mk
make[1]: Entering directory /home/ubgpu/github/wormhole/repo/xgboost' make[1]: *** No rule to make target/home/ubgpu/github/wormhole/dmlc-core/libdmlc.a', needed by xgboost'. Stop. make[1]: Leaving directory/home/ubgpu/github/wormhole/repo/xgboost'
make: *** [repo/xgboost/xgboost] Error 2
ubgpu@ubgpu:~/github/wormhole$

even I build xgboost like

ubgpu@ubgpu:/github/wormhole/repo/xgboost$ ./build.sh
g++ -Wall -O3 -msse2 -Wno-unknown-pragmas -funroll-loops -fopenmp -fPIC -fPIC -shared -o wrapper/libxgboostwrapper.so wrapper/xgboost_wrapper.cpp updater.o gbm.o io.o subtree/rabit/lib/librabit.a dmlc_simple.o -pthread -lm
Successfully build multi-thread xgboost
ubgpu@ubgpu:/github/wormhole/repo/xgboost$ cd -
/home/ubgpu/github/wormhole

I still got this error

RabitTracker instance has no attribute 'port'

It always fails when I run the demo tracker/dmlc_local.py -n 2 -s 1 bin/linear.dmlc learn/linear/guide/demo.conf

And I find out that it can't find an available port from 9091 to 9999. So self.port is not defined.

Traceback (most recent call last):
File "tracker/dmlc_local.py", line 100, in
pscmd= (' '.join(args.command) + ' ' + ' '.join(unknown)))
File "/home/jonny/workspace/wormhole/repo/dmlc-core/tracker/tracker.py", line 380, in submit
rabit = RabitTracker(hostIP = hostIP, nslave = nworker)
File "/home/jonny/workspace/wormhole/repo/dmlc-core/tracker/tracker.py", line 142, in init
logging.info('start listen on %s:%d' % (hostIP, self.port))
AttributeError: RabitTracker instance has no attribute 'port'

XGBoost repeatedly copying data across machines - slowing down computation

Fellow XGBoost Users,

I am facing a strange problem that I am hoping to get some help from you!
It seems that multi-machine multi-threaded XGBoost is taking more time to finish the task as compared to the multi-threaded version on a single machine!

Initially, I was experiencing trouble that XGBoost kept complaining that it was compiled in the local mode. However, I followed this issue reported by another user: xgboost is compiled in local mode #31 and solved it by following their advice.

However, now my job when run with a single machine but two threads completes in 17 seconds, whereas the same job with two machines and three threads (2 threads on one machine and 1 thread on another machine) takes ~90 seconds. I am running these jobs on AWS t2.medium and t2.micro instance.

Does anyone know why this might be happening? At this point of time, it seems to me, that either there is some thing wrong with my MPI setup (not sure what that might be though) or perhaps the way distributed XGBoost was compiled in issue #31 is not the correct way.

Thanks,
Ankur

issue for the run_yarn.sh

https://github.com/dmlc/wormhole/blob/master/learn/xgboost/run_yarn.sh

../../dmlc-core/tracker/dmlc_yarn.py used, but this file does not exist in dmlc-core.

does wormhole work well with dmlc-core ?

linear/difacto coredump

i use both dmlc_local.py and dmlc-submit,
repo/dmlc-core_old/tracker/dmlc_local.py -n 1 -s 1 learn/difacto/build/difacto.dmlc learn/difacto/guide/demo.conf
2017-09-18 17:56:26,954 INFO start listen on 10.2.177.240:9095
Connected 1 servers and 1 workers
Training: iter = 0
sec ttl #ex inc #ex | |w|_0 logloss_w | |V|_0 logloss AUC
bash: line 9: 6718 Segmentation fault (core dumped) learn/difacto/build/difacto.dmlc learn/difacto/guide/demo.conf

and gdb showes:
gdb ./bin/linear.dmlc -c core.15514
Core was generated by `bin/linear.dmlc learn/linear/guide/demo.conf'.
Program terminated with signal 11, Segmentation fault.
#0 dmlc::data::RowBlockContainer::Push (this=this@entry=0x7fa838000af8, batch=...) at ../../repo/dmlc-core/src/data/row_block.h:132
132 CHECK_LE(batch.field[i], std::numeric_limits::max())
Missing separate debuginfos, use: debuginfo-install gflags-2.1.1-6.el7.x86_64 glibc-2.17-106.el7_2.4.x86_64 glog-0.3.3-8.el7.x86_64 libgcc-4.8.5-11.el7.x86_64 libgomp-4.8.5-11.el7.x86_64 libstdc++-4.8.5-11.el7.x86_64
(gdb) bt
#0 dmlc::data::RowBlockContainer::Push (this=this@entry=0x7fa838000af8, batch=...) at ../../repo/dmlc-core/src/data/row_block.h:132
#1 0x0000000000423584 in Push (len=, pos=0, this=0x7fa838000a90) at ../base/minibatch_iter.h:147
#2 dmlc::data::MinibatchIter::Next (this=0x7fa838000a90) at ../base/minibatch_iter.h:99
#3 0x00000000004236af in dmlc::data::MinibatchIter::Next (this=this@entry=0x7fa847ffe730) at ../base/minibatch_iter.h:85
#4 0x0000000000423bcc in dmlc::solver::MinibatchWorker::Process (this=0x1307070, wl=...) at ../solver/minibatch_solver.h:307
#5 0x000000000043700f in dmlc::solver::DataParWorker::ProcessRequest (this=0x1307070, request=0x7fa83c00c780) at ../solver/data_parallel.h:203
#6 0x000000000046c899 in ps::Executor::ProcessActiveMsg (this=this@entry=0x1307088) at src/system/executor.cc:245
#7 0x00000000004701d8 in ps::Executor::Run (this=0x1307088) at ./src/system/executor.h:53
#8 0x00007fa84f1172a0 in ?? () from /lib64/libstdc++.so.6
#9 0x00007fa84e70fdc5 in start_thread () from /lib64/libpthread.so.0
#10 0x00007fa84e43d28d in clone () from /lib64/libc.so.6

Rabit Module currently only work with dmlc worker when running xgboost on yarn

Trying to run xgboost using dmlc_yarn.py but getting just:

2015-08-04 10:16:53,375 INFO start listen on 10.7.202.80:9094
Rabit Module currently only work with dmlc worker, quit this program by exit 0

Conflict between dynamic and static glog linkages in OS-X

When I build, for example, linear, the very last step is the following:

g++-4.9 -ggdb -O3 -ggdb -Wall -std=c++11 -I./ -I../ -I../../repo/ps-lite/src -I../../repo/dmlc-core/include 
    -I../../repo/dmlc-core/src -I/dir/Projects/wormhole/deps/include -fPIC -DDMLC_USE_HDFS=0 
    -DDMLC_USE_S3=0 -DDMLC_USE_GLOG=1 -DDMLC_USE_AZURE=0  
     build/config.pb.o build/linear.o ../../repo/dmlc-core/libdmlc.a ../../repo/ps-lite/build/libps.a 
    -lglog /dir/Projects/wormhole/deps/lib/libprotobuf.a /dir/Projects/wormhole/deps/lib/libglog.a
    /dir/Projects/wormhole/deps/lib/libgflags.a /dir/Projects/wormhole/deps/lib/libzmq.a 
    /dir/Projects/wormhole/deps/lib/libcityhash.a /dir/Projects/wormhole/deps/lib/liblz4.a  
    -o build/linear.dmlc

Notice both the inclusion of -lglog (the only shared lib) and libglog.a. This triggers a segfault, as described in google/glog#53.

Manually removing the shared lib in the last step fixes the problem, but I'm not sure how to properly address this.

troubleshooting yarn job

I built wormhole for Yarn, HDFS, and S3, and the make succeeded, but when I try to run the example xgboost command the job reads as FINISHED but FAILED in the Resource Manager and no final model is output to hdfs. What's a good way to troubleshoot this? There's nothing in the stderr, stdout, or syslog logs, and I'm struggling to figure out what the error is.

Extend wormhole Async to include dropouts

Have you thought about adding dropouts as an enhancement?
Similar to the below
https://github.com/bingzhengwei/ftrl_proximal_lr

redhat-release-5Server-5.3.0.3 `make` error

Dear all,
I download wormhole and make it on redhat-release-5Server-5.3.0.3. An error happened as following:
In file included from ./src/base/common.h:40:0,
from src/system/postoffice.h:2,
from src/ps.h:8,
from src/ps_main.cc:1:
./src/base/resource_usage.h: In function ‘timespec ps::hwtic()’:
./src/base/resource_usage.h:50:17: error: ‘CLOCK_MONOTONIC_RAW’ was not declared in this scope
clock_gettime(CLOCK_MONOTONIC_RAW, &tv);
I hope you can tell me how to fix it. Thanks!

Train on apache hadoop yarn takes more time as the worker >=2 in configuration

It seems that the train speed is slower as the worker(node) in configuration changed from 1 to 2 or more.Can anyone tell me why this happens, is there anything wrong with my configuration?

The configuration：

booster = gbtree
objective = multi:softmax
eta = 0.5
max_depth = 5
num_class = 10
num_round = 50
save_period = 0
eval_train = 1

The Shell Script:

../../dmlc-core/tracker/dmlc-submit --cluster=yarn --num-workers=4 --worker-cores=2
../../xgboost parameter.conf nthread=16
data=hdfs://hadoop01:8020/xgb-demo/train
eval[test]=hdfs://hadoop01:8020/xgb-demo/test
model_dir=hdfs://hadoop01:8020/xgb-demo/model

get errors running on local machine

when i run the command tracker/dmlc_local.py -n 1 -s 1 bin/linear.dmlc learn/linear/guide/demo.conf i got following results:

2016-02-11 15:01:01,856 INFO start listen on ::1:9091
F0211 15:01:01.863415 9390 van.cc:48] Check failed: !zmq_socket_monitor( senders_[scheduler_.id()], "inproc://monitor", ZMQ_EVENT_ALL)
F0211 15:01:01.863451 9392 van.cc:48] Check failed: !zmq_socket_monitor( senders_[scheduler_.id()], "inproc://monitor", ZMQ_EVENT_ALL)
*** Check failure stack trace: ***
*** Check failure stack trace: ***
@ 0x7f1186763e6d (unknown)
@ 0x7fa56a478e6d (unknown)
@ 0x7f1186765ced (unknown)
@ 0x7f1186763a5c (unknown)
@ 0x7fa56a47aced (unknown)
@ 0x7f118676663e (unknown)
F0211 15:01:01.863739 9385 manager.cc:173] Check failed: van_.Connect(node)
@ 0x474171 ps::Van::Init()
*** Check failure stack trace: ***
@ 0x7fa56a478a5c (unknown)
@ 0x47926c ps::Manager::Init()
@ 0x46d748 ps::Postoffice::Run()
@ 0x7fa56a47b63e (unknown)
@ 0x408681 main
@ 0x7f0974fb2e6d (unknown)
@ 0x474171 ps::Van::Init()
@ 0x7f0974fb4ced (unknown)
@ 0x7f1185765b15 __libc_start_main
@ 0x47926c ps::Manager::Init()
@ 0x7f0974fb2a5c (unknown)
@ 0x46d748 ps::Postoffice::Run()
@ 0x7f0974fb563e (unknown)
@ 0x409a21 (unknown)
@ 0x408681 main
@ 0x47903a ps::Manager::AddNode()
@ 0x4793c3 ps::Manager::Init()
@ 0x46d748 ps::Postoffice::Run()
@ 0x7fa56947ab15 __libc_start_main
@ 0x408681 main
@ 0x409a21 (unknown)
@ 0x7f0973fb4b15 __libc_start_main
@ 0x409a21 (unknown)
bash: line 9: 9392 Aborted (core dumped) bin/linear.dmlc learn/linear/guide/demo.conf
bash: line 9: 9390 Aborted (core dumped) bin/linear.dmlc learn/linear/guide/demo.conf
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
self.run()
File "/usr/lib64/python2.7/threading.py", line 764, in run
self.__target(_self.__args, *_self.__kwargs)
File "/home/xiaxin/Documents/parameter_application/wormhole/repo/dmlc-core/tracker/tracker.py", line 354, in
self.thread = Thread(target = (lambda : subprocess.check_call(self.cmd, env=env, shell=True)), args = ())
File "/usr/lib64/python2.7/subprocess.py", line 542, in check_call
raise CalledProcessError(retcode, cmd)
CalledProcessError: Command 'bin/linear.dmlc learn/linear/guide/demo.conf ' returned non-zero exit status -6

to find the problem i then ran the command : bin/linear.dmlc learn/linear/guide/demo.conf and got the messages below:

F0211 15:01:30.553581 9454 manager.cc:55] Timeout (10 sec) to wait all other nodes initialized. See commmets for more information
*** Check failure stack trace: ***
@ 0x7fc8faf78e6d (unknown)
@ 0x7fc8faf7aced (unknown)
@ 0x7fc8faf78a5c (unknown)
@ 0x7fc8faf7b63e (unknown)
@ 0x475c42 ps::Manager::Run()
@ 0x46d939 ps::Postoffice::Run()
@ 0x408681 main
@ 0x7fc8f9f7ab15 __libc_start_main
@ 0x409a21 (unknown)
Aborted (core dumped)

Every example get the same error, local environment is centos 7.

ubuntu14.04不能编译

rm -rf cityhash-1.1.1.tar.gz cityhash-1.1.1
wget https://raw.githubusercontent.com/mli/deps/master/build/cityhash-1.1.1.tar.gz && tar -zxf cityhash-1.1.1.tar.gz
--2015-08-26 04:49:28-- https://raw.githubusercontent.com/mli/deps/master/build/cityhash-1.1.1.tar.gz
正在解析主机 raw.githubusercontent.com (raw.githubusercontent.com)... 103.245.222.133
正在连接 raw.githubusercontent.com (raw.githubusercontent.com)|103.245.222.133|:443... 已连接。
已发出 HTTP 请求，正在等待回应... 200 OK
长度： 376456 (368K) [application/octet-stream]
正在保存至: “cityhash-1.1.1.tar.gz”

100%[======================================>] 376,456 106KB/s 用时 3.5s

2015-08-26 04:49:34 (106 KB/s) - 已保存 “cityhash-1.1.1.tar.gz” [376456/376456])

cd cityhash-1.1.1 && ./configure -prefix=/home/selay/wormhole/deps --enable-sse4.2 && make CXXFLAGS="-g -O3 -msse4.2" && make install
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking for gawk... no
checking for mawk... mawk
checking whether make sets $(MAKE)... yes
checking build system type... i686-pc-linux-gnu
checking host system type... i686-pc-linux-gnu
checking how to print strings... printf
checking for style of include used by make... GNU
checking for gcc... gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking dependency style of gcc... gcc3
checking for a sed that does not truncate output... /bin/sed
checking for grep that handles long lines and -e... /bin/grep
checking for egrep... /bin/grep -E
checking for fgrep... /bin/grep -F
checking for ld used by gcc... /usr/bin/ld
checking if the linker (/usr/bin/ld) is GNU ld... yes
checking for BSD- or MS-compatible name lister (nm)... /usr/bin/nm -B
checking the name lister (/usr/bin/nm -B) interface... BSD nm
checking whether ln -s works... yes
checking the maximum length of command line arguments... 805306365
checking whether the shell understands some XSI constructs... yes
checking whether the shell understands "+="... yes
checking how to convert i686-pc-linux-gnu file names to i686-pc-linux-gnu format... func_convert_file_noop
checking how to convert i686-pc-linux-gnu file names to toolchain format... func_convert_file_noop
checking for /usr/bin/ld option to reload object files... -r
checking for objdump... objdump
checking how to recognize dependent libraries... pass_all
checking for dlltool... no
checking how to associate runtime and link libraries... printf %s\n
checking for ar... ar
checking for archiver @file support... @
checking for strip... strip
checking for ranlib... ranlib
checking command to parse /usr/bin/nm -B output from gcc object... ok
checking for sysroot... no
checking for mt... mt
checking if mt is a manifest tool... no
checking how to run the C preprocessor... gcc -E
checking for ANSI C header files... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking for dlfcn.h... yes
checking for objdir... .libs
checking if gcc supports -fno-rtti -fno-exceptions... no
checking for gcc option to produce PIC... -fPIC -DPIC
checking if gcc PIC flag -fPIC -DPIC works... yes
checking if gcc static flag -static works... yes
checking if gcc supports -c -o file.o... yes
checking if gcc supports -c -o file.o... (cached) yes
checking whether the gcc linker (/usr/bin/ld) supports shared libraries... yes
checking whether -lc should be explicitly linked in... no
checking dynamic linker characteristics... GNU/Linux ld.so
checking how to hardcode library paths into programs... immediate
checking whether stripping libraries is possible... yes
checking if libtool supports shared libraries... yes
checking whether to build shared libraries... yes
checking whether to build static libraries... yes
checking for g++... g++
checking whether we are using the GNU C++ compiler... yes
checking whether g++ accepts -g... yes
checking dependency style of g++... gcc3
checking how to run the C++ preprocessor... g++ -E
checking for ld used by g++... /usr/bin/ld
checking if the linker (/usr/bin/ld) is GNU ld... yes
checking whether the g++ linker (/usr/bin/ld) supports shared libraries... yes
checking for g++ option to produce PIC... -fPIC -DPIC
checking if g++ PIC flag -fPIC -DPIC works... yes
checking if g++ static flag -static works... yes
checking if g++ supports -c -o file.o... yes
checking if g++ supports -c -o file.o... (cached) yes
checking whether the g++ linker (/usr/bin/ld) supports shared libraries... yes
checking dynamic linker characteristics... (cached) GNU/Linux ld.so
checking how to hardcode library paths into programs... immediate
checking whether byte ordering is bigendian... no
checking for stdint.h... (cached) yes
checking for stdlib.h... (cached) yes
checking for inline... inline
checking for size_t... yes
checking for ssize_t... yes
checking for uint32_t... yes
checking for uint64_t... yes
checking for uint8_t... yes
checking if the compiler supports __builtin_expect... yes
configure: creating ./config.status
config.status: creating Makefile
config.status: creating src/Makefile
config.status: creating config.h
config.status: executing depfiles commands

config.status: executing libtool commands

CityHash Version 1.1.1

Prefix: '/home/selay/wormhole/deps'.
Compiler: 'g++ -g -O2'

Now type 'make []'
where the optional is:
all - build everything
check - build and run tests
install - install everything

make[1]: 正在进入目录 /home/selay/wormhole/cityhash-1.1.1' make all-recursive make[2]: 正在进入目录/home/selay/wormhole/cityhash-1.1.1'
Making all in src
make[3]: 正在进入目录 /home/selay/wormhole/cityhash-1.1.1/src' /bin/bash ../libtool --tag=CXX --mode=compile g++ -DHAVE_CONFIG_H -I. -I.. -g -O3 -msse4.2 -MT city.lo -MD -MP -MF .deps/city.Tpo -c -o city.lo city.cc libtool: compile: g++ -DHAVE_CONFIG_H -I. -I.. -g -O3 -msse4.2 -MT city.lo -MD -MP -MF .deps/city.Tpo -c city.cc -fPIC -DPIC -o .libs/city.o city.cc: In function 'void CityHashCrc256Long(const char*, size_t, uint32, uint64*)': city.cc:535:31: error: '_mm_crc32_u64' was not declared in this scope z = _mm_crc32_u64(z, b + g); \ ^ city.cc:542:5: note: in expansion of macro 'CHUNK' CHUNK(0); PERMUTE3(a, h, c); ^ city.cc:535:31: error: '_mm_crc32_u64' was not declared in this scope z = _mm_crc32_u64(z, b + g); \ ^ city.cc:551:5: note: in expansion of macro 'CHUNK' CHUNK(29); ^ city.cc:535:31: error: '_mm_crc32_u64' was not declared in this scope z = _mm_crc32_u64(z, b + g); \ ^ city.cc:561:5: note: in expansion of macro 'CHUNK' CHUNK(33); ^ make[3]: *** [city.lo] 错误 1 make[3]:正在离开目录/home/selay/wormhole/cityhash-1.1.1/src'
make[2]: *** [all-recursive] 错误 1
make[2]:正在离开目录 /home/selay/wormhole/cityhash-1.1.1' make[1]: *** [all] 错误 2 make[1]:正在离开目录/home/selay/wormhole/cityhash-1.1.1'
make: *** [/home/selay/wormhole/deps/include/city.h] 错误 2

Error in building wormhole

When I try to build wormhole, run make -j4,

g++ -std=c++0x -Wall -O3 -msse2  -Wno-unknown-pragmas -funroll-loops -Iinclude   -Idmlc-core/include -Irabit/include -fPIC -DDISABLE_OPENMP -o xgboost  build/cli_main.o build/learner.o build/logging.o build/c_api/c_api.o build/c_api/c_api_error.o build/common/common.o build/data/data.o build/data/simple_csr_source.o build/data/simple_dmatrix.o build/data/sparse_page_dmatrix.o build/data/sparse_page_raw_format.o build/data/sparse_page_source.o build/data/sparse_page_writer.o build/gbm/gblinear.o build/gbm/gbm.o build/gbm/gbtree.o build/metric/elementwise_metric.o build/metric/metric.o build/metric/multiclass_metric.o build/metric/rank_metric.o build/objective/multiclass_obj.o build/objective/objective.o build/objective/rank_obj.o build/objective/regression_obj.o build/tree/tree_model.o build/tree/tree_updater.o build/tree/updater_colmaker.o build/tree/updater_histmaker.o build/tree/updater_prune.o build/tree/updater_refresh.o build/tree/updater_skmaker.o build/tree/updater_sync.o dmlc-core/libdmlc.a  -pthread -lm  -fopenmp -lrt -lglog  -lrt
kmeans.cc:8:19: fatal error: rabit.h: No such file or directory
 #include <rabit.h>
                   ^
compilation terminated.

how to fix this ?

run dmlc yarn error, "failure to login"

hi all,
i tried to run agaricus example on yarn. The following exception is thrown:

~/platform/java-1.8.0//bin/java -cp /home/hadoop/bin/hadoop classpath:tracker/../yarn//dmlc-yarn.jar org.apache.hadoop.yarn.dmlc.Client -file tracker/../yarn//dmlc-yarn.jar -file tracker/../yarn//run_hdfs_prog.py -file bin/xgboost.dmlc -jobname DMLC[nworker=4]:xgboost.dmlc -tempdir /tmp -queue default ./run_hdfs_prog.py ./xgboost.dmlc
15/11/25 20:01:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" java.io.IOException: failure to login
at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:782)
at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:734)
at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:607)
at org.apache.hadoop.fs.FileSystem$Cache$Key.(FileSystem.java:2748)
at org.apache.hadoop.fs.FileSystem$Cache$Key.(FileSystem.java:2740)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2606)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
at org.apache.hadoop.yarn.dmlc.Client.(Client.java:73)
at org.apache.hadoop.yarn.dmlc.Client.main(Client.java:322)
Caused by: javax.security.auth.login.LoginException: java.lang.IllegalArgumentException: Illegal principal name [email protected]
at org.apache.hadoop.security.User.(User.java:50)
at org.apache.hadoop.security.User.(User.java:43)
at org.apache.hadoop.security.UserGroupInformation$HadoopLoginModule.commit(UserGroupInformation.java:179)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755)
at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195)
at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682)
at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680)
at javax.security.auth.login.LoginContext.login(LoginContext.java:588)
at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:757)
at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:734)
at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:607)
at org.apache.hadoop.fs.FileSystem$Cache$Key.(FileSystem.java:2748)
at org.apache.hadoop.fs.FileSystem$Cache$Key.(FileSystem.java:2740)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2606)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
at org.apache.hadoop.yarn.dmlc.Client.(Client.java:73)
at org.apache.hadoop.yarn.dmlc.Client.main(Client.java:322)
Caused by: org.apache.hadoop.security.authentication.util.KerberosName$NoMatchingRule: No rules applied to [email protected]
at org.apache.hadoop.security.authentication.util.KerberosName.getShortName(KerberosName.java:389)
at org.apache.hadoop.security.User.(User.java:48)
... 23 more

at javax.security.auth.login.LoginContext.invoke(LoginContext.java:856)
at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195)
at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682)
at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680)
at javax.security.auth.login.LoginContext.login(LoginContext.java:588)
at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:757)
... 9 more

Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 505, in run
self.__target(_self.__args, *_self.__kwargs)
File "tracker/dmlc_yarn.py", line 191, in run
subprocess.check_call(cmd, shell = True, env = env)
File "/usr/lib/python2.7/subprocess.py", line 511, in check_call
raise CalledProcessError(retcode, cmd)

RAM not scaling

Seems xgboost job needs more then double amount of RAM on each yarn NM then it needs (on single) host when running natively (no wormhole). That's not helping with scaling. Is there anything wrong with my config?

Example:

SVM dataset of ~35M rows, 5.5GB in size
using following xgboost config:

colsample_bytree = 0.836855129273
max_delta_step = 0.0
min_child_weight = 8
subsample = 0.813015161805
eta = 0.177476177765
model_out = "/tmp/xgbtest.dat"
num_round = 94
data = "/tmp/xgbtest.svm"
max_depth = 13
gamma = 0.0

result:
- on 6 node YARN - 27.5GB on each node (so 6 * 27.5GB = 165GB taken!)
- local mode - just 11GB

[WIP] Documenting Linear SGD models

@mli I have write some todos for you on better documenting linear sgd models, please feel free to modify it

Use https://github.com/dmlc/wormhole/tree/master/learn/data for toy example
- Normally the user need an very quick example to test out the process
Give a complete example that runs on YARN
- See the L-BFGS Linear example that can be related
Add build script to Makefile and add a guide on how to build

Online Prediction

FEATURE REQUEST:

Is it possible to link wormhole as a library and call the predict function in-process for realtime predictions off a set of (pred_model*) prediction files?

linear-dmlc: segmentation fault for large train file

Hi,

Thank you very much for such a great tools!
Recently I'm trying to use linear-dmlc base on the provided demo. However, when I change the train file to a real-world file as large as 5GB, I got segmentation fault:

Core was generated by `wormhole/bin/linear.dmlc news.conf'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 dmlc::Localizer::CountUniqIndex (

idx_frq=0x0, uniq_idx=0x7ff50c000c40, blk=..., this=0x7ff51f1de090)
at ../base/localizer.h:141

141 I curr = pair_[0].k;
[Current thread is 1 (Thread 0x7ff51f1df700 (LWP 22147))]

I checked the size of "pair_", and it returned "0".
If I split the 5GB file to 5 1GB small files, linear-dmlc works as expected.

Any idea on the problem? Thanks!

Check failed at the end of linear trainer in YARN mode

Hi all,

I encounter an issue with yarn mode (the same data works with local and MPI mode). The trainer seemed to progress normally until the end and died before it tried to save the model.

206 8.5e+07 1e+04 1.6328e+07 0.032549 0.991400 0.952465
207 8.5e+07 2.21e+04 1.63314e+07 0.031823 0.991887 0.938879
208 8.5e+07 2.21e+04 1.63347e+07 0.031823 0.991887 0.938879
209 8.5e+07 2e+04 1.63347e+07 0.046296 0.989800 0.873966
210 8.51e+07 4.23e+04 1.63386e+07 0.035798 0.991061 0.942518
211 8.51e+07 3.5e+04 1.63419e+07 0.033168 0.991114 0.953599
212 8.51e+07 1.73e+04 1.63419e+07 0.035980 0.989987 0.953084
Validating: iter = 0
sec ttl #ex inc #ex |w|_0 logloss accuracy AUC
263 9.4e+07 8.88e+06 1.63434e+07 0.045364 0.990129 0.882638
Hit max number of data passes 1
Saving the final model
F0223 01:41:42.488441 29536 range.h:107] Check failed: i < n (20 vs. 20)
*** Check failure stack trace: ***
@ 0x7ff283c3ce6d (unknown)
@ 0x7ff283c3eced (unknown)
@ 0x7ff283c3ca5c (unknown)
@ 0x7ff283c3f63e (unknown)
@ 0x47c35c ps::Range<>::EvenDivide()
@ 0x47b9b6 ps::Manager::Process()
@ 0x4704f7 ps::Postoffice::Recv()
@ 0x7ff2839df1e0 (unknown)
@ 0x7ff282fe6df5 start_thread
@ 0x7ff282d141ad __clone

Has anyone encounter similar problem before? Any hint is much appreciated.

My trainer config:

train_data = "hdfs://hdcluster/user/hduser/spark-daily/20160222/libsvm-wh.txt/part-000[0-9][0-9]"
val_data = "hdfs://hdcluster/user/hduser/spark-daily/20160222/libsvm-wh.txt/part-0030[0-9]"
data_format = "libsvm"
model_out = "hdfs://hdcluster/user/hduser/spark-daily/20160222/click-model-yarn"
lambda_l1 = 0.1
lr_eta = .1
minibatch = 5000
max_data_pass = 1

errors when running kmeans algorithm

I run this code in windows, rabit.lib is complied with VS2010.
Program output 'Socket Connect Error:No error', could you suggest a possible way to solve this?
Does this problem cause by socket?

python ../tracker/dmlc_local.py -n 3 kmeans.exe ../data/data.svm 10 10 .output
2015-05-21 21:29:52,278 INFO start listen on 137.189.56.119:9091
2015-05-21 21:29:52,509 INFO @tracker All of 3 nodes getting started
[21:29:52] d:\data\wt\tools\dmlc-core\src\data/basic_row_[i2t1e:r2.9h::572[9]2: 1 d:f:2i\9nd:ia5st2ha] \ rwdet:at\didiaanntg\a t\aowtot lt5si\4ad.
nm5\l4t5co5-o clMosBr\/eds\meslcrc
c-\cdoartea\/sbracs\idca_trao/wb_aistiecr_.rho:w7_9i:t efri.nhi:s7h9 :r efaidniinsgh  arte a4d6i.n1g5 3a8t  MB5/4s.e5c4
55 MB/sec
Socket Connect Error:No error
Socket Connect Error:No error
Socket Connect Error:No error
Exception in thread Thread-2:
Traceback (most recent call last):
  File "C:\Users\wt\AppData\Local\Enthought\Canopy\App\appdata\canopy-1.3.0.1715.win-x86_64\lib\threading.py", line 810, in __bootstrap_inner
    self.run()
  File "C:\Users\wt\AppData\Local\Enthought\Canopy\App\appdata\canopy-1.3.0.1715.win-x86_64\lib\threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "../tracker/dmlc_local.py", line 69, in exec_cmd
    os.exit(-1)
AttributeError: 'module' object has no attribute 'exit'

Exception in thread Thread-3:
Traceback (most recent call last):
  File "C:\Users\wt\AppData\Local\Enthought\Canopy\App\appdata\canopy-1.3.0.1715.win-x86_64\lib\threading.py", line 810, in __bootstrap_inner
    self.run()
  File "C:\Users\wt\AppData\Local\Enthought\Canopy\App\appdata\canopy-1.3.0.1715.win-x86_64\lib\threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "../tracker/dmlc_local.py", line 69, in exec_cmd
    os.exit(-1)
AttributeError: 'module' object has no attribute 'exit'

Exception in thread Thread-4:
Traceback (most recent call last):
  File "C:\Users\wt\AppData\Local\Enthought\Canopy\App\appdata\canopy-1.3.0.1715.win-x86_64\lib\threading.py", line 810, in __bootstrap_inner
    self.run()
  File "C:\Users\wt\AppData\Local\Enthought\Canopy\App\appdata\canopy-1.3.0.1715.win-x86_64\lib\threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "../tracker/dmlc_local.py", line 69, in exec_cmd
    os.exit(-1)
AttributeError: 'module' object has no attribute 'exit'

Does wormhole support Torque/PBS?

hi, does wormhole have a plan to support Torque/PBS? Thanks

CTR dataset on AWS -- mount timeout and mpirun connection timeout

I ran the Criteo CTR on AWS EC2, but I trapped in the mount step.

When I ran command sudo mount <master_ip>:/home/ubuntu /home/ubuntu, I got a error: mount.nfs: Connection timed out.

And I also got a timeout error when I ran command mpirun -hostfile hosts pwd saying that unable to connect from "ip-172-31-22-52" to "ip-172-31-22-51" (Connection timed out).

Would you please give some hints how to address this.

Thanks a lot.