
gbdt's Introduction

Gradient Boosting Regression Tree

Quick Start

  • Download the code: git clone https://github.com/qiyiping/gbdt.git
  • Run make to compile
  • Run the demo script in the `test` directory: ./demo.sh

Data Format

[InitialGuess] Label Weight Index0:Value0 Index1:Value1 ...

Each line contains one instance and ends with a '\n' character. The initial guess is optional. For two-class classification, Label is -1 or 1. For regression, Label is the target value, which can be any real number. Feature indices start from 0. Feature values can be any real number.
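For example, a two-class training file might contain lines like the following (the values are illustrative, not taken from the repository):

1 1.0 0:0.5 2:1.25 7:-3.0
-1 1.0 1:2.0 4:0.3 7:1.5
0.3 1 1.0 0:0.5 2:1.25 7:-3.0

The last line prepends an optional initial guess of 0.3. Feature indices that do not apply to an instance (such as 3, 5, and 6 here) are simply left out.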

Training Configuration

class Configure {
 public:
  size_t number_of_feature;      // number of features
  size_t max_depth;              // max depth of each tree
  size_t iterations;             // number of trees in the gbdt
  double shrinkage;              // shrinkage parameter
  double feature_sample_ratio;   // portion of features considered for splitting
  double data_sample_ratio;      // portion of data fitted in each iteration
  size_t min_leaf_size;          // min number of instances in a leaf

  Loss loss;                     // loss type

  bool debug;                    // show debug info?

  double *feature_costs;         // manually set feature costs in order to tune the model
  bool enable_feature_tunning;   // when set true, `feature_costs' is used to tune the model

  bool enable_initial_guess;
...
};
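A minimal sketch of filling in this configuration. The values below are illustrative; the Loss enum's values and the actual training entry point are not shown in this README, so they are left out here:

// Sketch: configure a 100-tree model with stochastic boosting.
Configure conf;
conf.number_of_feature = 10;       // features are indexed 0..9 in the data file
conf.max_depth = 4;                // shallow trees, as is typical for boosting
conf.iterations = 100;             // build 100 trees
conf.shrinkage = 0.1;              // scale each tree's contribution by 0.1
conf.feature_sample_ratio = 0.8;   // consider 80% of features at each split
conf.data_sample_ratio = 0.5;      // fit each tree on a 50% sample of the data
conf.min_leaf_size = 20;           // do not split nodes below 20 instances
conf.debug = false;
conf.feature_costs = NULL;         // no manual feature costs
conf.enable_feature_tunning = false;
conf.enable_initial_guess = false; // data file has no leading InitialGuess column
// conf.loss must also be set from the Loss enum, which is elided above.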

Reference

  • Friedman, J. H. “Greedy Function Approximation: A Gradient Boosting Machine.” (February 1999)
  • Friedman, J. H. “Stochastic Gradient Boosting.” (March 1999)
  • Ye, J., et al. “Stochastic Gradient Boosted Distributed Decision Trees.” (2009) (distributed implementation)


gbdt's Issues

A couple of questions

1. What exactly does enable_initial_guess mean?
2. Why does gbdt add a bias term?

The method to process missing values

The model seems to have a problem handling missing values; please take a look, thanks!
This is training set A, which I constructed: 20 samples, with AUC = 0.55 on the training set.
0 1 0:1
0 1 0:1
0 1 0:1
0 1 0:1
0 1 0:1
0 1 0:1
0 1 0:1
0 1 0:1
0 1 0:1
0 1 0:1
1 1 1:1
1 1 1:1
1 1 1:1
1 1 1:1
1 1 1:1
1 1 1:1
1 1 1:1
1 1 1:1
1 1 1:1
1 1 1:1

Training set B, which is A with the missing features filled in: 20 samples, with AUC = 1 on the training set.
0 1 0:1 1:0
0 1 0:1 1:0
0 1 0:1 1:0
0 1 0:1 1:0
0 1 0:1 1:0
0 1 0:1 1:0
0 1 0:1 1:0
0 1 0:1 1:0
0 1 0:1 1:0
0 1 0:1 1:0
1 1 0:0 1:1
1 1 0:0 1:1
1 1 0:0 1:1
1 1 0:0 1:1
1 1 0:0 1:1
1 1 0:0 1:1
1 1 0:0 1:1
1 1 0:0 1:1
1 1 0:0 1:1

Question about the impurity computation

The logic in the code for computing the Gini impurity is as follows; see fitness.cpp:

162   for (size_t j = unknown; j < len-1; ++j) {
163     s = data_copy[j]->target * data_copy[j]->weight;
164     ss = Squared(data_copy[j]->target) * data_copy[j]->weight;
165     c = data_copy[j]->weight;
166
167     ls += s;
168     lss += ss;
169     lc += c;
170
171     rs -= s;
172     rss -= ss;
173     rc -= c;
174
175     ValueType f1 = data_copy[j]->feature[index];
176     ValueType f2 = data_copy[j+1]->feature[index];
177     if (AlmostEqual(f1, f2))
178       continue;
179
180     fitness1 = lc > 1? (lss - ls*ls/lc) : 0;
181     if (fitness1 < 0) {
182       // std::cerr << "fitness1 < 0: " << fitness1 << std::endl;
183       fitness1 = 0;
184     }
185
186     fitness2 = rc > 1? (rss - rs*rs/rc) : 0;
187     if (fitness2 < 0) {
188       // std::cerr << "fitness2 < 0: " << fitness2 << std::endl;
189       fitness2 = 0;
190     }
191
192     double fitness = fitness0 + fitness1 + fitness2;
193
194     if (g_conf.feature_costs && g_conf.enable_feature_tunning) {
195       fitness *= g_conf.feature_costs[index];
196     }
197
198     if (*impurity > fitness) {
199       *impurity = fitness;
200       *value = (f1+f2)/2;
201       *gain = fitness00 - fitness1 - fitness2;
202     }
203   }
204
205   return *impurity != std::numeric_limits<double>::max();
206 }

My understanding is that the impurity should be computed over the full set of samples, but the code seems to loop over each sample whose feature is not missing and take the minimum. So my question is: should the code at lines 198-202 be outside the for loop, or is my understanding off? Any clarification would be appreciated.
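For reference, the per-side quantity ss - s*s/c that the loop accumulates is the weighted sum of squared deviations from the weighted mean (a squared-error criterion rather than a Gini index in the strict sense), by the standard identity

$$\sum_i w_i y_i^2 - \frac{\bigl(\sum_i w_i y_i\bigr)^2}{\sum_i w_i} \;=\; \sum_i w_i \,(y_i - \bar{y}_w)^2, \qquad \bar{y}_w = \frac{\sum_i w_i y_i}{\sum_i w_i}.$$

The loop evaluates this for the left and right partitions at every distinct candidate split value, which appears to be why the best-so-far update at lines 198-202 sits inside the loop: it is minimizing over all candidate split points for this feature.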
