
gbdt's Introduction

Gradient Boosting Regression Tree

Quick Start

  • Download the code: git clone https://github.com/qiyiping/gbdt.git
  • Run make to compile
  • Run the demo script in the `test` directory: ./demo.sh

Data Format

[InitialGuess] Label Weight Index0:Value0 Index1:Value1 ...

Each line contains one instance and ends with a '\n' character. The initial guess is optional. For two-class classification, Label is -1 or 1. For regression, Label is the target value, which can be any real number. Feature indices start from 0. Feature values can be any real number.
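For example, a two-class training file might contain lines like the following (the values are illustrative, not taken from the repository):

1 1.0 0:0.5 2:1.25 7:-3.0
-1 1.0 1:2.0 4:0.3 7:1.5
0.3 1 1.0 0:0.5 2:1.25 7:-3.0

The last line prepends an optional initial guess of 0.3. Feature indices that do not apply to an instance (such as 3, 5, and 6 here) are simply left out.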

Training Configuration

class Configure {
 public:
  size_t number_of_feature;      // number of features
  size_t max_depth;              // max depth of each tree
  size_t iterations;             // number of trees in the gbdt
  double shrinkage;              // shrinkage parameter
  double feature_sample_ratio;   // portion of features considered for splitting
  double data_sample_ratio;      // portion of data fitted in each iteration
  size_t min_leaf_size;          // min number of instances in a leaf

  Loss loss;                     // loss type

  bool debug;                    // show debug info?

  double *feature_costs;         // manually set feature costs in order to tune the model
  bool enable_feature_tunning;   // when set true, `feature_costs' is used to tune the model

  bool enable_initial_guess;
...
};
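A minimal sketch of filling in this configuration. The values below are illustrative; the Loss enum's values and the actual training entry point are not shown in this README, so they are left out here:

// Sketch: configure a 100-tree model with stochastic boosting.
Configure conf;
conf.number_of_feature = 10;       // features are indexed 0..9 in the data file
conf.max_depth = 4;                // shallow trees, as is typical for boosting
conf.iterations = 100;             // build 100 trees
conf.shrinkage = 0.1;              // scale each tree's contribution by 0.1
conf.feature_sample_ratio = 0.8;   // consider 80% of features at each split
conf.data_sample_ratio = 0.5;      // fit each tree on a 50% sample of the data
conf.min_leaf_size = 20;           // do not split nodes below 20 instances
conf.debug = false;
conf.feature_costs = NULL;         // no manual feature costs
conf.enable_feature_tunning = false;
conf.enable_initial_guess = false; // data file has no leading InitialGuess column
// conf.loss must also be set from the Loss enum, which is elided above.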

Reference

  • Friedman, J. H. “Greedy Function Approximation: A Gradient Boosting Machine.” (February 1999)
  • Friedman, J. H. “Stochastic Gradient Boosting.” (March 1999)
  • Ye, J., et al. “Stochastic Gradient Boosted Distributed Decision Trees.” (2009) (distributed implementation)


gbdt's Issues

A couple of questions

1. What exactly does enable_initial_guess mean?
2. Why does gbdt add a bias term?

The method to process missing values

The model seems to have a problem handling missing values; please take a look, thanks!
This is training set A, which I constructed: 20 samples, with AUC = 0.55 on the training set.
0 1 0:1
0 1 0:1
0 1 0:1
0 1 0:1
0 1 0:1
0 1 0:1
0 1 0:1
0 1 0:1
0 1 0:1
0 1 0:1
1 1 1:1
1 1 1:1
1 1 1:1
1 1 1:1
1 1 1:1
1 1 1:1
1 1 1:1
1 1 1:1
1 1 1:1
1 1 1:1

Training set B, which is A with the missing features filled in: 20 samples, with AUC = 1 on the training set.
0 1 0:1 1:0
0 1 0:1 1:0
0 1 0:1 1:0
0 1 0:1 1:0
0 1 0:1 1:0
0 1 0:1 1:0
0 1 0:1 1:0
0 1 0:1 1:0
0 1 0:1 1:0
0 1 0:1 1:0
1 1 0:0 1:1
1 1 0:0 1:1
1 1 0:0 1:1
1 1 0:0 1:1
1 1 0:0 1:1
1 1 0:0 1:1
1 1 0:0 1:1
1 1 0:0 1:1
1 1 0:0 1:1

Question about the impurity computation

The logic in the code for computing the Gini impurity is as follows; see fitness.cpp:

162   for (size_t j = unknown; j < len-1; ++j) {
163     s = data_copy[j]->target * data_copy[j]->weight;
164     ss = Squared(data_copy[j]->target) * data_copy[j]->weight;
165     c = data_copy[j]->weight;
166
167     ls += s;
168     lss += ss;
169     lc += c;
170
171     rs -= s;
172     rss -= ss;
173     rc -= c;
174
175     ValueType f1 = data_copy[j]->feature[index];
176     ValueType f2 = data_copy[j+1]->feature[index];
177     if (AlmostEqual(f1, f2))
178       continue;
179
180     fitness1 = lc > 1? (lss - ls*ls/lc) : 0;
181     if (fitness1 < 0) {
182       // std::cerr << "fitness1 < 0: " << fitness1 << std::endl;
183       fitness1 = 0;
184     }
185
186     fitness2 = rc > 1? (rss - rs*rs/rc) : 0;
187     if (fitness2 < 0) {
188       // std::cerr << "fitness2 < 0: " << fitness2 << std::endl;
189       fitness2 = 0;
190     }
191
192     double fitness = fitness0 + fitness1 + fitness2;
193
194     if (g_conf.feature_costs && g_conf.enable_feature_tunning) {
195       fitness *= g_conf.feature_costs[index];
196     }
197
198     if (*impurity > fitness) {
199       *impurity = fitness;
200       *value = (f1+f2)/2;
201       *gain = fitness00 - fitness1 - fitness2;
202     }
203   }
204
205   return *impurity != std::numeric_limits<double>::max();
206 }

My understanding is that the impurity should be computed over the full set of samples, but the code seems to loop over each sample whose feature is not missing and take the minimum. So my question is: should the code at lines 198-202 be outside the for loop, or is my understanding off? Any clarification would be appreciated.
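For reference, the per-side quantity ss - s*s/c that the loop accumulates is the weighted sum of squared deviations from the weighted mean (a squared-error criterion rather than a Gini index in the strict sense), by the standard identity

$$\sum_i w_i y_i^2 - \frac{\bigl(\sum_i w_i y_i\bigr)^2}{\sum_i w_i} \;=\; \sum_i w_i \,(y_i - \bar{y}_w)^2, \qquad \bar{y}_w = \frac{\sum_i w_i y_i}{\sum_i w_i}.$$

The loop evaluates this for the left and right partitions at every distinct candidate split value, which appears to be why the best-so-far update at lines 198-202 sits inside the loop: it is minimizing over all candidate split points for this feature.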
