dmlc / xgboost Goto Github PK

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow

Home Page: https://xgboost.readthedocs.io/en/stable/

License: Apache License 2.0

R 6.88% C 0.52% Python 20.54% Shell 0.70% C++ 44.99% Java 3.61% Scala 6.49% CMake 0.83% Cuda 15.26% M4 0.04% Groovy 0.02% PowerShell 0.05% CSS 0.06% TeX 0.01%

gbdt gbrt gbm distributed-systems xgboost machine-learning

xgboost's Introduction

eXtreme Gradient Boosting

Community | Documentation | Resources | Contributors | Release Notes

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides a parallel tree boosting (also known as GBDT, GBM) that solve many data science problems in a fast and accurate way. The same code runs on major distributed environment (Kubernetes, Hadoop, SGE, Dask, Spark, PySpark) and can solve problems beyond billions of examples.

License

Contribute to XGBoost

XGBoost has been developed and used by a group of active community members. Your help is very valuable to make the package better for everyone. Checkout the Community Page.

Reference

Tianqi Chen and Carlos Guestrin. XGBoost: A Scalable Tree Boosting System. In 22nd SIGKDD Conference on Knowledge Discovery and Data Mining, 2016
XGBoost originates from research project at University of Washington.

Open Source Collective sponsors

Backers

[Become a backer]

xgboost's People

Contributors

Stargazers

Watchers

Forkers

luheng wnzhang wangby joelee2013 yindlib dongh11 yanping lqshixinlei ericchanbd dongyu1990 wangdongfrank yanchen036 wachaong kalenhaha sinzero popfido zhuxiaoqiang chagge ljwsummer mackwang chenglongchen jianminsun wycg1984 dailiang vlsi1217 timliutong jrings lagvier aurora1625 peterwilliams97 nancysam amunategui aromazyl agathawrj xuyuandong wavelets jmliu88 fleogefyr jeongyoonlee xjzhou houkun tianluyuan alienfeel smartphp timwee chaserelock huangguanmayday poseidon1214 ramarlina qinghanmeng yiweishe smly manglam ekoziol daxiongshu shady007 abhik1368 miguelpicallo dr-dos-ok rouseguy bzshang nhammes1 msdw alexargus wjzdoremon mindis xiaozhouwang quantshah christang tharunniranjan arakitin kgfeman baggioss thomasdic2000 lesaffrea fulmicoton andrewjohnlowe syhrz fbaluch dheerajvc ekabiri simengy shustasch nianxue dreadlord1984 qingsong99 weijunyan adognamedtesla pl8787 rbroberg ahsancse emanuelaboros putaozhuose victorchi2009 wush978 ivanliu1989 bolddata rennell-garrett blindape snowytk

xgboost's Issues

What are the ways of treatng missing values in XGboost?

Generally does the model performance get better with that ?

Regression Demo

Hi,

Could you possibly add one more regression demo, with greater training/test data size?

It may look a dump question but i am a little bit confused, i see that there are only two regression types we can perform by xgboost, linear and logistic regressions, Cant we perform nonlinear regression with xgboost? can we only find the relations like y = a0+ a1_X1+a2_X2+....+anXn with reg:linear? or does it perform like scikit-learn GBM regression Tree ?

Regards,
Davut

kaggle higgs demo: rescaling weights

Hi!
The weight statistics in the kaggle's demo are misleading.
The sum of weights is conserved, so it is misleading to show it as doubled.

a possible fix could be to change the following lines from:

weight = dtrain[:,31] * float(test_size) / len(label)
sum_wpos = sum( weight[i] for i in range(len(label)) if label[i] == 1.0 )
sum_wneg = sum( weight[i] for i in range(len(label)) if label[i] == 0.0 )
print ('weight statistics: wpos=%g, wneg=%g, ratio=%g' % ( sum_wpos, sum_wneg, sum_wneg/sum_wpos ))

to:

weight = dtrain[:,31]
sum_wpos = sum( weight[i] for i in range(len(label)) if label[i] == 1.0 )
sum_wneg = sum( weight[i] for i in range(len(label)) if label[i] == 0.0 )
print ('weight statistics: wpos=%g, wneg=%g, ratio=%g' % ( sum_wpos, sum_wneg, sum_wneg/sum_wpos ))
weight = dtrain[:,31] * float(test_size) / len(label)

Please note also that I'm not using R but I imagine that the two demo files should be aligned.

How can this package work on large amount of data?

How can i use the xgboost package to train my dataset when training data cannot be fitted into memory?

different results across different runs with no change in parameters

I was able to get results that would replicate with xgboost up to 750 trees or so. With these parameters on the Kaggle Higgs boson problem:

param['bst:eta'] = 0.025
param['bst:max_depth'] = 6
param['eval_metric'] = 'auc'
param['silent'] = 1
param['nthread'] = 32
num_round = 1100
threshold_ratio = 0.148

I get slightly different answers each time I run. Code for higgs-numpy.py here:

http://pastebin.com/EEK4FAQK

Code for higgs-pred.py here

http://pastebin.com/hRCL15X2

how to code a customized objective function

I want to try a new objective function: -AMS (with the minus sign because it must be a "cost") defined as below

...
if (!strcmp("binary:ams", name)) return new RegLossObj(LossType::kAMS);
...
static const int kAMS = 4;
...
if (loss_type == kAMS) return "ams@0";
...

Now I would like to change the gradient into:

// I'm assuming an approximate AMS function s/sqrt(b) where s=x and b=1-x
// and I'm doing first and second derivatives
inline float PredTransform(float x) const {
switch (loss_type) {
...
case kAMS: return - x / std::sqrt(1.0f - x);
...
}
}
...
inline float FirstOrderGradient(float predt, float label) const {
switch (loss_type) {
...
case kAMS: return - std::sqrt(1.0f - x) * (x - 2.0f) / ( 2 * (x - 1.0f) * (x - 1.0f));
...
}
}
...
inline float SecondOrderGradient(float predt, float label) const {
switch (loss_type) {
...
case kAMS: return std::sqrt(1.0f - x) * (x - 4.0f) / ( 4 * (x - 1.0f) * (x - 1.0f) * (x - 1.0f) );
...
}
}

But I don't know how to face a couple of problems:

I see that first and second order functions are defined only as a signle point=prediction functions,
that means that I have a function of "(float predt, float label)", so one signle float preds and its corresponding float label, instead of a vector like its parent
virtual void GetGradient(const std::vector &preds, ...
The issue is that I have an objective function that is not a simple sum like rmse.
The solution of this question 1) could be to call the train function like python demo does, but I don't like very much that approach (*)
for logistic loss, I would expect to see a predt in (0,1) like the base score is required to be, but predt is well out of that range already before entering in PredTransform... why?
besides points 1) and 2) above, it seems that there is something else in the code that prevents the new gradient from working properly (what am I missing?)
Maybe are the predictions constant just because my gradient is completely wrong from a mathematical point of view? LOL

(*) I know that in the demo.py there is an example of customized objective function and I've tried also in that way, but usually I prefer to use a compiled programming language and not a scripting language :-) ... anyway, my questions 2) and 3) still apply also to the python version...

evalutating only the first n classifiers in an ensemble

is it possible to use only the first n classifiers (e.g. trees) in the ensemble ? Is it possible to do this from python ?

(this would be interesting in order to see at which point the 'loss' on the test set increases again or becomes larger than the loss on the training set)

Addition of a build system generator

I suggest to reuse a higher level build system than your current make file so that powerful checks for software features will become easier.

multi:softmax : output probabilities

Hi,

I am trying to use your library to tackle a multi-class classification problem.
I may have miss something in the documentation but is it possible to have access to the probabilities of belonging to each class?
If it is not, do you plan on adding an option similar to "binary:logistic" for multi-class problems?

Thanks you.
Jean-Baptiste Regli

AMS weight rescaling

Using xgb.cv with 8 folds, test_AMS are in 1.10 range which is the mean of AMS of each fold.

I think the problem is the rescaling in AMS function

With Higgs train data the sum of positive weights is wS = 691.9886
and for negative wB = 410999.8

So in order all AMS be comparable we need in each fold the sum of positive and negative be wS and wB

in R:
weight_fold[labels == 0] <- weight_fold[labels == 0] * wB / sum(weight_fold[labels == 0])
weight_fold[labels == 1] <- weight_fold[labels == 1] * wS / sum(weight_fold[labels == 1])

matrix row ptr bound

I think that you are not checking that
row_ptr_[row_ptr_.size()-1] is < than row_data_.size()
when you do
inline RowIter GetRow(long ridx) const{
utils::Assert( ridx < this->NumRow(), "row id exceed bound");
return RowIter(&row_data_[row_ptr_[ridx]] - 1, &row_data_[row_ptr_[ridx + 1]] - 1);
}
in xgboost_data.h at line 250

Can you add a check like this one?
utils::Assert(row_ptr_[ridx + 1] < row_data_.size(), "row ptr exceed bound");

Sorry if I'm confused and thank you for your clarification

How can i visually see a few trees made during Binary Classification in XGboost?

I'm trying to study a few trees to get a better sense of how the split was made and if is making intuitive sense. Can someone tell me how can i visually print a few trees?

CV impementation

Hi...
First of all, many thanks for providing XGboost for classification problem. Its application in my work has given better results than what I get from Random Forest. XGboost is awesome.

I have a query related to higgs competition. Do you have any plan to implement CV in xgboost?

Thanks
Mradul

Python Module in Mac

Seems some one report python module crash with segfault on mac, maybe some issues with ctypes, need to find a mac machine to fix

Exception AttributeError

Hi, should I care about the following message?

Exception AttributeError: "'NoneType' object has no attribute 'XGBoosterFree'" in <bound method Booster.del of <xgboost.Booster instance at 0x7fbb4845f488>> ignored

Reproducible random seeds

Need to check if random seed is set properly, whether subsample results is reproducible

Custom obj with weights

I’ve checked custom obj for the logregobj in docs and results are poorest than binary:logitraw with Higgs data.(in 8 folds CV AUC 0.93 vs 0.935, consistent over a grid of parameters)
I think binary:logitraw should be the same that logregobj.
Do you think the difference could be related with the bug in obj / eval you detected? Other explain?

xgb.DMatrix cbind

Is there a way to concatenate 2 DMatrix datasets and consequently 2 or more binary buffer files?

Missing -lgomp when installing

I get the following output when I try to run make in the python directory. Do you know what may be wrong?

Williams-MacBook-Air:python william$ make
g++ -Wall -O3 -msse2  -Wno-unknown-pragmas -fopenmp -fPIC -pthread -lm  -shared -o libxgboostpy.so xgboost_python.cpp
In file included from xgboost_python.cpp:3:
In file included from ./../regrank/xgboost_regrank.h:12:
In file included from ./../regrank/xgboost_regrank_eval.h:13:
./../regrank/../utils/xgboost_omp.h:13:2: warning: "OpenMP is not available, compile to single thread code" [-W#warnings]
#warning "OpenMP is not available, compile to single thread code"
 ^
1 warning generated.
ld: library not found for -lgomp
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [libxgboostpy.so] Error 1

Is there a R library available for XGboost?

specifying number of iterations during predictions

A few of the existing R functions for boosting (e.g. gbm, C5.0 and a few others) allow you to declare how many trees should be used when predicting. So if your boosted tree used used 1,000 trees, you theoretically could use the same object to generate predictions for any model with <= 1,000 tree (all other parameters being equal).

This could be a big deal when it comes to tuning the model. If you are trying to optimize the number of iterations, packages like caret can exploit this and get big time savings above and beyond what your package already offers.

Thanks,

Max

[build error on windows 8] 2>LINK : fatal error LNK1561: entry point must be defined

hi,

on windows 8, when i try to build xgboost solution (comprising xgboost and xgboost_wrapper projects) on microsoft visual studio 2010 professional (with omp), the following error is generated (see below for full log):

2>LINK : fatal error LNK1561: entry point must be defined

when i then try to debug using f5, the following errors are generated:

'xgboost.exe': Loaded 'C:\Users\Chong\Downloads\xgboost-master\xgboost-master\windows\Release\xgboost.exe', Symbols loaded.
'xgboost.exe': Loaded 'C:\Windows\SysWOW64\ntdll.dll', Cannot find or open the PDB file
'xgboost.exe': Loaded 'C:\Windows\SysWOW64\kernel32.dll', Cannot find or open the PDB file
'xgboost.exe': Loaded 'C:\Windows\SysWOW64\KernelBase.dll', Cannot find or open the PDB file
<... A
The program '[5512] xgboost.exe: Native' has exited with code 0 (0x0).

any idea what went wrong?

------ Build started: Project: xgboost, Configuration: Release Win32 ------
2>------ Build started: Project: xgboost_wrapper, Configuration: Release Win32 ------
1>Build started 5/9/2014 4:19:40 PM.
2>Build started 5/9/2014 4:19:40 PM.
1>InitializeBuildStatus:
1> Creating "Release\xgboost.unsuccessfulbuild" because "AlwaysCreate" was specified.
2>InitializeBuildStatus:
2> Touching "Release\xgboost_wrapper.unsuccessfulbuild".
1>ClCompile:
1> gbm.cpp
2>ClCompile:
2> gbm.cpp
2> io.cpp
1> io.cpp
2> updater.cpp
1> updater.cpp
1> xgboost_main.cpp
2> xgboost_wrapper.cpp
1>Link:
1> Creating library C:\Users\Chong\Downloads\xgboost-master\xgboost-master\windows\Release\xgboost.lib and object C:\Users\Chong\Downloads\xgboost-master\xgboost-master\windows\Release\xgboost.exp
1> Generating code
2>Link:
2> Creating library C:\Users\Chong\Downloads\xgboost-master\xgboost-master\windows\Release\xgboost_wrapper.lib and object C:\Users\Chong\Downloads\xgboost-master\xgboost-master\windows\Release\xgboost_wrapper.exp
2>LINK : fatal error LNK1561: entry point must be defined
2>
2>Build FAILED.
2>
2>Time Elapsed 00:00:02.81
1> Finished generating code
1> xgboost.vcxproj -> C:\Users\Chong\Downloads\xgboost-master\xgboost-master\windows\Release\xgboost.exe
1>FinalizeBuildStatus:
1> Deleting file "Release\xgboost.unsuccessfulbuild".
1> Touching "Release\xgboost.lastbuildstate".
1>
1>Build succeeded.
1>
1>Time Elapsed 00:00:06.06
========== Build: 1 succeeded, 1 failed, 0 up-to-date, 0 skipped ==========

Feature request for supporting continue training

It would be great to load a previously saved model and to continue training from the last boosting round on!
Is it possibile to do that with xgboost?
Thank you

concatenating 2 .buffer files

There maybe a file that does not fit in memory. In such cases, can we use cat file1.buffer file2.buffer > file3.buffer to conactenate the .buffer files of xgboost?

Thanks
Kiran

Determine feature importances

Hi; could you please indicate me the way to determine the importance of each feature? I'm using the Python interface and I could not find anything in the doc. Thanks!

Split MetaInfo During CV

After this issue:
https://github.com/tqchen/xgboost/issues/83

I'm trying to find a workaround to this problem:
My model fits desegregated cases but the metric error needs merging all cases of the same item code and computing a weighted metric at item code level.

If I don't use 'base_margin' slot, could I use it for this purpose? Or if I set any info in this slot xgboost use it as base margin automatically?

In case the answer be NO. Any tip to workaround?
Could be possible to introduce an 'id' and 'merging_group' slots for future versions?

Thank you!

Testing with different number of rounds

Hi,

i trained a model with 2500 rounds, it took hours to train it, i also wonder result of model which is trained with 2000 or less round, but it will take hours again, i want to use this model (with 2500 rounds) but want test step to consider its first 2000 rounds. does xgboost support to test a model with different number of rounds?

Regards,
Davut

variable importance in R

hello, i know you had a thread about variable importance in python. what's the command for R or if it is available? variable importance is really important when i build my models.

[d74afec7d7] dont actually learn

It's produce constant preditiction.
Demo (higgs):

./run.sh

finish loading from csv
weight statistics: wpos=1522.37, wneg=904200, ratio=593.94
loading data end, start to boost trees
[0] train-auc:0.500000 [email protected]:0.620265
[1] train-auc:0.500000 [email protected]:0.620265
[2] train-auc:0.500000 [email protected]:0.620265
[3] train-auc:0.500000 [email protected]:0.620265
[4] train-auc:0.500000 [email protected]:0.620265
[5] train-auc:0.500000 [email protected]:0.620265
[6] train-auc:0.500000 [email protected]:0.620265
[7] train-auc:0.500000 [email protected]:0.620265
[8] train-auc:0.500000 [email protected]:0.620265
[9] train-auc:0.500000 [email protected]:0.620265
[10] train-auc:0.500000 [email protected]:0.620265
[11] train-auc:0.500000 [email protected]:0.620265
[12] train-auc:0.500000 [email protected]:0.620265
....

Excluding certain features

Once we create a libsvm format file, all the variables will be used in model building. It will be great if we can have a concept of a namespace whereby we can ignore some names that are not required or ignore some namespaces altogether

sigmoid range constrain

I 'm getting this error: "sigmoid range constrain" when I run higgs numpy script at the point when the "train" is called: what I'm doing wrong? Thanks

Is there an implementation of LambdaMART model?

Posteriors from binary:logistic

When I use param['objective'] = 'binary:logistic' the range of output values is not in [0,1] as mentioned in the wiki.

Is there something else I must do to get the posterior probability?

thanks

Error:buffer_indexexceed num_pbuffer

I sometimes got this error when I ran xgboost on the data that is used in Elements of Statistical Learning Example 10.2

Its available in sklearn via sklearn.datasets.make_hastie_10_2

Can we initialize xgboost with the outputs of other classifiers?

Hi, how can I initialize xgboost with the outputs of other classifiers, for example, initialize xgboost with randomForest ?

Liu

[old version] output_margin

Assume I need to execute the old version (before unity) for a while (transition phase).
There wasn't the output_margin yet available (is it correct?) ...
... so could I try to simulate it recompiling with the old line 69 of xgboost_regrank_obj.h
from

        inline float PredTransform(float x){
            switch (loss_type){
            case kLogisticRaw:
            case kLinearSquare: return x;
            case kLogisticClassify:
            case kLogisticNeglik: return 1.0f / (1.0f + expf(-x));
            default: utils::Error("unknown loss_type"); return 0.0f;
            }
        }

        inline float PredTransform(float x){
            switch (loss_type){
            case kLogisticRaw:
            case kLinearSquare: return x;
            case kLogisticClassify:
            case kLogisticNeglik: return x; // / return 1.0f / (1.0f + expf(-x));
            default: utils::Error("unknown loss_type"); return 0.0f;
            }
        }

or even

        inline float PredTransform(float x){
            return x;
            }
        }

Thanks!

after "bash build.sh" How to install xgboost

I'm new to python and new to ubuntu.
After "bash build.sh" I saw a program "xgboost". Then how to install the model ?
I have "import xgboost as xgb" in my py program but looks like they didn't find it.
I got "ImportError: No module named xgboost"
Thanks for you help.

problem of make on MAC

Hi Guys,

I cannot make on my Mac. Could you please help me?

Below is the information on my terminal when making.

Hiris-Guteki-MacBook-Pro-5:xgboost hirisgu$ make
g++ -Wall -O3 -msse2 -Wno-unknown-pragmas -fopenmp -pthread -lm -o xgboost regrank/xgboost_regrank_main.cpp
In file included from regrank/xgboost_regrank_main.cpp:7:
In file included from regrank/xgboost_regrank.h:12:
In file included from regrank/xgboost_regrank_eval.h:13:
regrank/../utils/xgboost_omp.h:13:2: warning: "OpenMP is not available, compile
to single thread code" [-W#warnings]

warning "OpenMP is not available, compile to single thread code"

^
1 warning generated.
ld: library not found for -lgomp
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [xgboost] Error 1

Some software versions `i have
make:
GNU Make 3.81

gcc:
Hiris-Guteki-MacBook-Pro-5:xgboost hirisgu$ gcc -v
Configured with: --prefix=/Library/Developer/CommandLineTools/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 5.1 (clang-503.0.40) (based on LLVM 3.4svn)
Target: x86_64-apple-darwin13.3.0
Thread model: posix

User defined objective function - question

I have a question about the definition of the objective function. It is given in the demo.py example file as

# user define objective function, given prediction, return gradient and second order gradient
def logregobj( preds, dtrain ):
    labels = dtrain.get_label()
    grad = preds - labels
    hess = preds * (1.0-preds)
    return grad, hess

What is the objective function L(x_i,y_i) in this example as a function of current predictions x_i and the true labels y_i?

The grad is a vector which is the derivative dL/dx_i ? But I am confused about the meaning of Hessian here - here it is a vector while the Hessian is typically a matrix of second derivatives d^2L/dx_i dx_j. If it would be d^2L/dx_i^2 then I do not see what function L(x_i,y_i) would give

dL/dx_i =x_i-y_i
d^2L/dx_i^2= x_i (1-x_i)

as in the example above.
I would be grateful for a hint. I am not from the Machine Learning community so I may not know some obvious conventions. It would be helpful to update the wiki with 1-2 sentences more on this.

Seems getting info of 'group' slot doesn't works (in R)

require(xgboost)
data(agaricus.train, package='xgboost')
train <- agaricus.train
group = rep(seq(1:3), length.out = length(train[[1]]))
dtrain <- xgb.DMatrix(data = train$data, label = train$label, group = group)

works

labels <- getinfo(dtrain, 'label')
head(labels)
[1] 1 0 0 1 0 0

doesn't works

groups <- getinfo(dtrain, 'group')
Error en .local(object, ...) : xgb.getinfo: unknown info name group

low level works

labels <- .Call("XGDMatrixGetInfo_R", dtrain, 'label', PACKAGE = "xgboost")
head(labels)
[1] 1 0 0 1 0 0

low level doesn't works

groups <- .Call("XGDMatrixGetInfo_R", dtrain, 'group', PACKAGE = "xgboost")
Error: unknown field group

explicit setinfo

setinfo(dtrain, 'group', group)
[1] TRUE
groups <- .Call("XGDMatrixGetInfo_R", dtrain, 'group', PACKAGE = "xgboost")
Error: unknown field group

xgboost generates root-only trees for python example

when I run the example in python/example/demo.py I get the following output:

~/workspace/xgboost/python/example:master$ python demo.py 
[0] eval-error:0.481688 train-error:0.482113
[1] eval-error:0.481688 train-error:0.482113
error=0.481688
start running example of build DMatrix in python
[0] eval-error:0.481688 train-error:0.482113
[1] eval-error:0.481688 train-error:0.482113
start running example of build DMatrix from scipy.sparse
[0] eval-error:0.481688 train-error:0.482113
[1] eval-error:0.481688 train-error:0.482113
start running example of build DMatrix from numpy array
[0] eval-error:0.481688 train-error:0.482113
[1] eval-error:0.481688 train-error:0.482113
start running example to used cutomized objective function
[0] eval-error:0.042831 train-error:0.046522
[1] eval-error:0.021726 train-error:0.022263

it seems that it only builds one tree per run with only one node:

tree train end, 1 roots, 0 extra nodes, 0 pruned nodes ,max_depth=0

maybe my build was incorrect: I run ubuntu 12.04, g++ 4.6.3, numpy 1.8.0, python 2.7

Remove unnecessary null pointer checks

An extra null pointer check is not needed in functions like the following.

Averaging multiple models

Hi,
First of all, thank you very much for developing such a great project, My question is about using multiple models in Higgs competition, How can i average models' output ? , i averaged 5 models' output, it gave me 0.80 AMS on LB , i did it as follow:

ypred1 = bst1.predict( xgmat )
ypred2 = bst2.predict( xgmat )
ypred3 = bst3.predict( xgmat )
ypred4 = bst4.predict( xgmat )
ypred5 = bst5.predict( xgmat )

ypred = (ypred1+ypred2+ypred3+ypred4+ypred5)/5.0

what am i doing wrong here ?

Thanks in advance,
Regards

Incremental Loads

XGboost is a great package. Thanks for writing this.

When the dataset is very big and does not fit in memory, Vowpal wabbit has a nice way of building the model incrementally loading chunks into memory, making a model and updating the model with new data that is loaded. It will be great to have this feature in xgboost

Parameters not being passed via the python interface

It appears the parameters that are passed viat the python interface to not get set in xgboost.

For example, changing the silent flag in
demo/guide-python/generalized_linear_model.py

xgboost/demo/guide-python$ git diff generalized_linear_model.py
diff --git a/demo/guide-python/generalized_linear_model.py b/demo/guide-python/generalized_linear_model.py
index b6b60be..3e720a6 100755
--- a/demo/guide-python/generalized_linear_model.py
+++ b/demo/guide-python/generalized_linear_model.py
@@ -12,7 +12,7 @@ dtest = xgb.DMatrix('../data/agaricus.txt.test')

alpha is the L1 regularizer

lambda is the L2 regularizer

you can also set lambda_bias which is L2 regularizer on the bias term

-param = {'silent':1, 'objective':'binary:logistic', 'booster':'gblinear',
+param = {'silent':0, 'objective':'binary:logistic', 'booster':'gblinear',
'alpha': 0.0001, 'lambda': 1 }

normally, you do not need to set eta (step_size)

Then when I run this example I get
$ ./generalized_linear_model.py

build GBRT with 6513 instances
tree train end, 1 roots, 20 extra nodes, 0 pruned nodes ,max_depth=5
[0] eval-error:0.000000 train-error:0.000614

build GBRT with 6513 instances
tree train end, 1 roots, 18 extra nodes, 0 pruned nodes ,max_depth=5
[1] eval-error:0.000000 train-error:0.001228

build GBRT with 6513 instances
tree train end, 1 roots, 22 extra nodes, 0 pruned nodes ,max_depth=5
[2] eval-error:0.000000 train-error:0.000614

build GBRT with 6513 instances
tree train end, 1 roots, 22 extra nodes, 0 pruned nodes ,max_depth=5
[3] eval-error:0.000000 train-error:0.000614
error=0.000000

It appears to be still being using the tree booster.

Can we get all the parameters printed out from the xgboost model so we know exactly what is being used?

LESS_OR_EQUAL operator in dump format

Hello,

I implement prediction for constructed model via other programming language. But results of xgboost and my programm are not equal when I use LESS operator for compare value with bound in each node.
For example
...
0:[trait=18.375<15.5] yes=1,no=2,missing=1
1:[trait=239.498<100.5] yes=3,no=4,missing=4
3:[sSz=187537-1461<10.5] yes=7,no=8,missing=8
7:[trait=18.505<30] yes=15,no=16,missing=16
15:leaf=0.0550458
16:leaf=0.364126
...
When I change var "18.505" from 30 to 29 xgboost probability doesn't change, but it should.

Note:
Results of my programm and xgboost are equal when I use LESS_OR_EQUAL operator in my programm.

Wiki on suitable data preprocessing

Some detailed tutorial on what are the suitable data preprocessing techniques that should be used with xgboost will be very useful for users.

Question about predictions

Hi Tianqi,
Thank you for this great utility! It works incredible fastly.
I have next question. I need to imlement prediction for constructed model via other programming language. I already did it for gbm model from R package 'gbm' and sum of all trees (all selected tree leafs) I should add to initial value.
In xgboost model dump I don't see the initial value. Does it mean that initial value equals 0? Thanks.

Loading data with DMatrix

I have a large file with 300+ features in each record.
While trying to load the data with DMatrix in python, I got the following message:

dtest = xgb.DMatrix(tsDir+'xgbTest.csv', missing=-999.0)
86x397 matrix with 328730778 entries is loaded from ../data/xgbTest.csv

I know I have 1834123 lines of record.

I looked into the file at 86 line, which is no different than any other line.

What could be a possible reason for this?

Thanks very much!

Rui

Method for access to the current prediction for both the train and test set

From the wiki:
'The buffers are used to save the prediction results of last boosting step'

Would be very interesting a method for access to the current prediction for both the train and test set.
This could speed up the xgb.iter.update function when using customized obj (the xgb.predict call each iteration) and would let to estimate micro CV error (merging cv predictions for each fold and compute the error metric based in the 'honest' join prediction).