
gbm3's Introduction

gbm3: generalized boosted models

Originally written by Greg Ridgeway between 1999 and 2003, extended by various authors, extensively updated and polished by James Hickey in 2016, with survival models greatly improved by Terry Therneau in 2016; currently maintained by Greg Ridgeway. Development is discussed --- somewhat --- at https://groups.google.com/forum/#!forum/gbm-dev .

This is the shiny new gbm3 package. It is not backwards compatible with R code that calls the original gbm package, but it is fast, parallel, and actively developed.

Non-production releases (bug fixes, mostly) will be released via the GitHub release workflow. To install from GitHub, first install remotes from CRAN:

install.packages("remotes")

Then install gbm3 from GitHub:

remotes::install_github("gbm-developers/gbm3")

# or, to make sure you get everything:
remotes::install_github("gbm-developers/gbm3", build_vignettes = TRUE, force = TRUE)

gbm3's People

Contributors

ajkl, arnocandel, az0, bgreenwell, bkchoi80, bobthecat, dexgroves, erbas, gregridgeway, harrysouthworth, jackstat, jhickeyphysics, jiho, kferris10, madrury, neil-schneider, paulrobshannon, pdmetcalfe, schneider-neil, scttl, stefan-schroedl


gbm3's Issues

Laplace memory leak

From dereli fatih [email protected]:

I work for a clothing retail company in a forecasting role. As part of my job, I maintain and develop forecasting models in R. One of those models is a GBM, and we are having memory-related issues while running the training code that uses the GBM package. We work on a virtual machine with 32 GB of RAM, a 64-bit operating system, and an 8-core Intel Xeon 2.20 GHz CPU.

The problem occurs while fitting models with the laplace distribution. It does not happen with the gaussian distribution, where the code runs at a stable 30% memory usage. With laplace, memory usage grows gradually until it reaches 100%, which causes the software to stop. We tried removing big objects and forcing garbage collection, but it did not help.

The current inputs to gbm.fit are a bag fraction of 0.4, n.trees of 400, interaction.depth of 5, shrinkage of 0.025, and the laplace distribution. The fitted object is 67 MB, from 157k rows and 50 columns. We have seen bug fixes on the internet for the multinomial distribution but found little for laplace. Have you had similar feedback from other people? Is there a bug fix for laplace and, if not, is one planned to be released?
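For reference, a minimal sketch of the reported call (hedged: the data below are random stand-ins, scaled down from the reported 157k x 50; the thread does not include the real data):

library(gbm)

# random stand-ins for the reporter's data (157k rows x 50 columns in the report)
n <- 10000
train_x <- as.data.frame(matrix(rnorm(n * 50), ncol = 50))
train_y <- rnorm(n)

fit <- gbm.fit(x = train_x, y = train_y,
               distribution = "laplace",  # the problematic case; gaussian was stable
               bag.fraction = 0.4,
               n.trees = 400,
               interaction.depth = 5,
               shrinkage = 0.025)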

Add coveralls or some other code coverage service

I think we should add a code coverage service to help us identify areas of the code base that aren't backed by good testing.

  • codecov.io
  • coveralls.io

Plus we get a new badge to put in the readme :)

Error in checkForRemoteErrors(val)

Hello,
I have been getting an "Error in checkForRemoteErrors(val) : one node produced an error: incorrect number of dimensions"
in some gbm runs with cv.folds > 1. Runs with cv.folds = 0 are OK. (In some larger data-set runs, the RGui has shut down.)

Here is a reproducible example.
library(gbm)

test <- data.frame(
  Y = c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
        0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1),
  R = as.factor(c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,
                  2,2,2,2,2,3,3,3,3,3,4,4,1,1,1,1,1,1,1,2,2,3,3,4,4))
)
str(test)

set.x <- gbm(Y ~ R,
             data = test,
             var.monotone = NULL,
             distribution = "bernoulli",
             n.trees = 10,
             shrinkage = 0.01,
             interaction.depth = 1,
             bag.fraction = 1,
             train.fraction = 1,
             n.minobsinnode = 2,
             cv.folds = 2,
             keep.data = TRUE,
             verbose = FALSE,
             n.cores = 1)

sessionInfo()

output:

Error in checkForRemoteErrors(val) :
one node produced an error: incorrect number of dimensions

sessionInfo()

R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] parallel splines stats graphics grDevices utils datasets
[8] methods base

other attached packages:
[1] gbm_2.1.1 lattice_0.20-33 survival_2.38-3

loaded via a namespace (and not attached):
[1] grid_3.2.2

Thanks,

Van L. Parsons
National Center for Health Statistics
E-mail: [email protected]

example code fails

From @harrysouthworth on March 24, 2014 17:5

Sent to my inbox. The first example in the help file for gbm fails on some runs and runs OK on others.

The issue is caused by one of the models in cv.models having all NaNs or Infs in its valid.error element. Those values come out of the .Call("gbm", ...) call looking like that in gbm.fit.R.

N <- 1000
X1 <- runif(N)
X2 <- 2*runif(N)
X3 <- ordered(sample(letters[1:4],N,replace=TRUE),levels=letters[4:1])
X4 <- factor(sample(letters[1:6],N,replace=TRUE))
X5 <- factor(sample(letters[1:3],N,replace=TRUE))
X6 <- 3*runif(N)
mu <- c(-1,0,1,2)[as.numeric(X3)]

SNR <- 10 # signal-to-noise ratio
Y <- X1^1.5 + 2 * (X2^0.5) + mu
sigma <- sqrt(var(Y)/SNR)
Y <- Y + rnorm(N,0,sigma)

# introduce some missing values
X1[sample(1:N,size=500)] <- NA
X4[sample(1:N,size=300)] <- NA

data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)

gbm1 <-
  gbm(Y~X1+X2+X3+X4+X5+X6,         # formula
      data=data,                   # dataset
      var.monotone=c(0,0,0,0,0,0), # -1: monotone decrease,
                                   # +1: monotone increase,
                                   #  0: no monotone restrictions
      distribution="gaussian",     # see the help for other choices
      n.trees=1000,                # number of trees
      shrinkage=0.05,              # shrinkage or learning rate,
                                   # 0.001 to 0.1 usually work
      interaction.depth=3,         # 1: additive model, 2: two-way interactions, etc.
      bag.fraction = 0.5,          # subsampling fraction, 0.5 is probably best
      train.fraction = 0.5,        # fraction of data for training,
                                   # first train.fraction*N used for training
      n.minobsinnode = 10,         # minimum total weight needed in each node
      cv.folds = 3,                # do 3-fold cross-validation
      keep.data=TRUE,              # keep a copy of the dataset with the object
      verbose=FALSE,               # don't print out progress
      n.cores=1)                   # use only a single core (detecting #cores is
                                   # error-prone, so avoided here)

Copied from original issue: harrysouthworth#14

In development gbm_2.1-06 multinomial distribution no longer supported?

Harry and Greg,

About gbm_2.1-06 (in development):
I noticed the 'drop in support' for the "multinomial" distribution.
I tried "Surv". That did not work.
I tried nothing. It assumed "gaussian".

I was following the new gbm_2.1-06 (in development) docs, ?gbm::gbm:

if the response has
only two unique values, bernoulli is assumed
(two-factor responses are converted to 0,1);
otherwise, if the response has class "Surv",
coxph is assumed; otherwise, gaussian is assumed.

Any recommendations for a 'replacement'?

I was trying out

# package gbm NOT ON search() . . .
# R 3.2.2 (3.2.3 = current R); gbm_2.1-06 (beta)
# February 15, 2016

# I tried

#   Error in gbm.fit(x, y, offset = offset, distribution = distribution, w = w,  : 
#   Distribution multinomial is not supported
#   
library(gbm)
data(iris)
scales <- seq(1,NROW(iris))
iris.mod2 <- gbm::gbm(Species ~ ., distribution="multinomial", data=iris,
                     n.trees=2000, shrinkage=0.01, cv.folds=5,
                     weights = scales,  # NO ERROR
                     verbose=FALSE, n.cores=1)
# I tried.

# Error in gbm.fit(x, y, offset = offset, distribution = distribution, w = w,  :
# Distribution Surv is not supported
# 
library(gbm)
data(iris)
scales <- seq(1,NROW(iris))
iris.mod2 <- gbm::gbm(Species ~ ., distribution="Surv", data=iris,
                     n.trees=2000, shrinkage=0.01, cv.folds=5,
                     weights = scales,  # NO ERROR
                     verbose=FALSE, n.cores=1)
# I did nothing.

# Distribution not specified, assuming gaussian ...
#
library(gbm)
data(iris)
scales <- seq(1,NROW(iris))
iris.mod2 <- gbm::gbm(Species ~ ., data=iris,
                     n.trees=2000, shrinkage=0.01, cv.folds=5,
                     weights = scales,  # NO ERROR
                     verbose=FALSE, n.cores=1)

Thanks,
[email protected]
Andre Mikulec

sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] gbm_2.1-06  ## IN DEVELOPMENT ##

loaded via a namespace (and not attached):
[1] parallel_3.2.2  tools_3.2.2     survival_2.38-3 Rcpp_0.12.1     splines_3.2.2  
[6] grid_3.2.2      lattice_0.20-33

Encounter NaN depending on number of training observations and setting for shrinkage parameter in gbm.fit()

This was initially posted at @harrysouthworth/gbm. I understand this repo has taken over development of gbm, so I'm reposting here for the record.

Depending on the number of observations used for training and the setting of the shrinkage parameter, I encounter the following problem in gbm.fit():

> mdl <- gbm.fit(x=train.df[,1:(ncol(train.df)-1)],
+                 y=train.df[,ncol(train.df)],
+                 distribution = "multinomial",
+   .... [TRUNCATED] 
Iter   TrainDeviance   ValidDeviance   StepSize   Improve
     1        2.1972             nan     0.1000    0.6183
     2        1.8424             nan     0.1000    0.3274
     3        1.6543             nan     0.1000    0.2367
     4        1.5193             nan     0.1000    0.1802
     5        1.4185             nan     0.1000    0.1462
     6        1.3352             nan     0.1000    0.1213
     7        1.2657             nan     0.1000    0.0945
     8        1.2104             nan     0.1000    0.0811
     9        1.1629             nan     0.1000    0.0742
    10        1.1200             nan     0.1000    0.0609
    20        0.8972             nan     0.1000    0.0206
    40        0.7478             nan     0.1000    0.0040
    60        0.6810             nan     0.1000    0.0014
    80        0.6490             nan     0.1000       nan
   100           nan             nan     0.1000       nan
   120           nan             nan     0.1000       nan
   140           nan             nan     0.1000       nan
   150           nan             nan     0.1000       nan

This zip file contains source code and data to reproduce the problem:
https://drive.google.com/open?id=0B95myaZR5glcfm5XaXM3aGYyRkZQbG9hN1BBdElaNUVwcXVsZDdyN2tpMzYxeV94LWxyY28&authuser=0

Error: object 'gbm.fit' not found

I followed the instructions and installed the gbm package using install_github("gbm-developers/gbm") on R 3.1.1. After installation, I ran the following in R:
library(gbm)
gbm.fit
then I see the error message:
Error: object 'gbm.fit' not found
Any idea? Thanks.

Bug in CoxPH cpp code???

From @jeffwong on June 15, 2014 5:38

The CoxPH cpp code to calculate deviance does not agree with the equation listed in the vignette. The function

double CCoxPH::Deviance
(
    double *adT,
    double *adDelta,
    double *adOffset,
    double *adWeight,
    double *adF,
    unsigned long cLength,
    int cIdxOff
)
{
    unsigned long i = 0;
    double dL = 0.0;
    double dF = 0.0;
    double dW = 0.0;
    double dTotalAtRisk = 0.0;

    for(i=cIdxOff; i<cLength+cIdxOff; i++)
    {
        dF = adF[i] + ((adOffset==NULL) ? 0.0 : adOffset[i]);
        dTotalAtRisk += adWeight[i]*exp(dF);
        if(adDelta[i]==1.0)
        {
            dL += adWeight[i]*(dF - log(dTotalAtRisk));
            dW += adWeight[i];
        }
    }

    return -2*dL/dW;
}

has two key values, dL and dW. dL is calculating

dL = \sum_i w_i \delta_i (f(x_i) - \log R_i)

and dW is calculating

dW = \sum_i w_i \delta_i

If we expand the formula for the deviance in the vignette, we get

-2 \sum_i w_i \delta_i [ f(x_i) - \log R_i + \log w_i ]

The calculation of dL is valid, but dW seems to be wrong. We would want dW to be calculating

dW = \sum_i w_i \delta_i \log w_i

and the final answer should combine the values dL and dW via

-2 (dL + dW)
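Restating the proposed correction in one line (as reported above, not independently verified here): the code currently returns

-2 dL / (\sum_i w_i \delta_i)

whereas matching the vignette's expanded deviance would require

-2 ( \sum_i w_i \delta_i (f(x_i) - \log R_i) + \sum_i w_i \delta_i \log w_i )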

Copied from original issue: harrysouthworth#22

Bernoulli requires the response to be in {0,1}...

From @Mullefa on November 23, 2014 17:12

... and similarly for other distributions. Whilst this doesn't affect the functionality of the package, it is a bit inconvenient.

Could the requirement be lowered to factors with two levels, and implicit conversion done within gbm.fit() (or whatever function from within the package is being used)?

Likewise, in the corresponding predict() method, could a factor be returned instead of 0's and 1's?
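In the meantime, a hedged sketch of the requested conversion done by hand (yf is a hypothetical two-level factor response; x and newdata are placeholders):

yf  <- factor(c("no", "yes", "no", "yes"))
y01 <- as.integer(yf) - 1L   # first level -> 0, second level -> 1
# fit  <- gbm.fit(x, y01, distribution = "bernoulli", n.trees = 100)
# p    <- predict(fit, newdata, n.trees = 100, type = "response")
# pred <- factor(levels(yf)[(p > 0.5) + 1L], levels = levels(yf))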

Copied from original issue: harrysouthworth#29

fit and predict binomial gbm with two offset terms

From @dmarch on March 17, 2014 18:34

Dear Harry,

I am using the 'dismo' package to fit boosted regression trees (BRT) to both binary and count data. The dismo package uses the 'gbm' package for its BRT implementation. I would like to incorporate two offset terms in the model, and to be able to make predictions.
For the count data I am using a Poisson model. Based on a previous post (https://stat.ethz.ch/pipermail/r-help/2010-September/253647.html), I implemented the following code:

library(gbm)
library(dismo)

# define offset
offset <- log(data$off1) + log(data$off2)  # equivalent to log(data$off1*data$off2)

# fit poisson
m.pois <- gbm.step(data=data, gbm.x=7:8, gbm.y=4, offset=offset, family="poisson",
                   tree.complexity=1, learning.rate=0.001, bag.fraction=0.7, n.folds=10)

# predict poisson
link <- predict.gbm(m.pois, data, n.trees=n.trees, type="link")
link.offset <- link + offset
pred <- exp(link.offset)

My question is how to implement the same for a binomial model. I have tried to look in different forums and documentation without success. The only clue I have is the following document: https://r-forge.r-project.org/scm/viewvc.php/*checkout*/pkg/inst/doc/gbm.pdf?revision=18&root=gbm&pathrev=22
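One plausible analogue for the binomial case (an editor's hedged sketch, not from the original thread; it assumes a Bernoulli model on the logit scale, so the inverse link is plogis(), and gbm.y=5 is a placeholder column index):

# fit bernoulli with the same offset
m.bin <- gbm.step(data=data, gbm.x=7:8, gbm.y=5, offset=offset, family="bernoulli",
                  tree.complexity=1, learning.rate=0.001, bag.fraction=0.7, n.folds=10)

# add the offset back on the link scale, then apply the inverse logit
link <- predict.gbm(m.bin, data, n.trees=n.trees, type="link")
pred <- plogis(link + offset)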

Any advice and/or additional references on this issue would be more than welcome.

Thank you in advance,

Copied from original issue: harrysouthworth#12

segfault

From @harrysouthworth on March 20, 2014 17:41

Sent to me by email.

I am testing 'gbm' on some new data, using gbm v2.1-05 and R 3.0.3 (via the GUI) on Mac OS 10.9.2. I've attached a .rds file with the test data. My response variable is a factor with >40 levels. The predictors are a mix of categorical and numeric/integer. I get an error (actually, R crashes) using 'gbm.more'. I can replicate it with the following code:

require(gbm)
Loading required package: gbm
Loading required package: survival
Loading required package: splines
Loading required package: lattice
Loading required package: parallel
Loaded gbm 2.1-05

dset = readRDS("test.rds") # Attached .rds file

This will complete in ~90 seconds (no errors)

q = gbm(puma00~year+loc+hincp+age+race+educ+hstat+rent+mortgage,
distribution="multinomial",
data=dset,
interaction.depth=4,
shrinkage = 0.01,
n.minobsinnode=max(30,ceiling(nrow(dset)*0.001)),
n.cores=1,
n.trees=100)

Not sure what this error is when printing 'q' -- not fatal, just FYI:

q
gbm(formula = puma00 ~ year + loc + hincp + age + race + educ +
hstat + rent + mortgage, distribution = "multinomial", data = dset,
n.trees = 100, interaction.depth = 4, n.minobsinnode = max(30,
ceiling(nrow(dset) * 0.001)), shrinkage = 0.01, n.cores = 1)
A gradient boosted model with multinomial loss function.
100 iterations were performed.
There were 9 predictors of which 9 had non-zero influence.
Error in apply(x$cv.fitted, 1, function(x, labels) { :
dim(X) must have a positive length

HERE is the real problem: attempt to add 100 trees

q2 = gbm.more(q, n.new.trees=100)

*** caught segfault ***
address 0x1c74a8000, cause 'memory not mapped'

Traceback:
1: .Call("gbm", Y = as.double(y), Offset = as.double(offset), X = as.double(x), X.order = as.integer(x.order), weights = as.double(w), Misc = as.double(Misc), cRows = as.integer(cRows), cCols = as.integer(cCols), var.type = as.integer(object$var.type), var.monotone = as.integer(object$var.monotone), distribution = as.character(distribution.call.name), n.trees = as.integer(n.new.trees), interaction.depth = as.integer(object$interaction.depth), n.minobsinnode = as.integer(object$n.minobsinnode), n.classes = as.integer(object$num.classes), shrinkage = as.double(object$shrinkage), bag.fraction = as.double(object$bag.fraction), train.fraction = as.integer(nTrain), fit.old = as.double(object$fit), n.cat.splits.old = as.integer(length(object$c.splits)), n.trees.old = as.integer(object$n.trees), verbose = as.integer(verbose), PACKAGE = "gbm")
2: gbm.more(q, n.new.trees = 100)

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace

INTERESTINGLY, if I switch to a numeric response variable, I get different issues:

q = gbm(hincp~year+loc+age+race+educ+hstat+rent+mortgage,
distribution="gaussian",
data=dset,
interaction.depth=4,
shrinkage = 0.01,
n.minobsinnode=max(30,ceiling(nrow(dset)*0.001)),
n.cores=1,
n.trees=100)

No predictors are included...

q
gbm(formula = hincp ~ year + loc + age + race + educ + hstat +
rent + mortgage, distribution = "gaussian", data = dset,
n.trees = 100, interaction.depth = 4, n.minobsinnode = max(30,
ceiling(nrow(dset) * 0.001)), shrinkage = 0.01, n.cores = 1)
A gradient boosted model with gaussian loss function.
100 iterations were performed.
There were 8 predictors of which 0 had non-zero influence.

Summary of cross-validation residuals:
0% 25% 50% 75% 100%
NA NA NA NA NA

Cross-validation pseudo R-squared: 1

summary(q)
Error in plot.window(xlim, ylim, log = log, ...) :
need finite 'xlim' values
In addition: Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf

BUT 'gbm.more' does not cause R to crash...

q2 = gbm.more(q, n.new.trees=100) # No error

Any ideas? Maybe a pointer issue when distribution="multinomial"?

Many thanks,
Kevin

Copied from original issue: harrysouthworth#13

More coxph woes, plus solution

From Terry Therneau:

I was looking at the gbm code with the idea of adding the counting-process style of Cox models. Since I'm not a C++ programmer, some details are of course a bit hazy.

First, the loglik and derivative used for the Cox models are not correct if there are tied death times. This is worrying, since real data often have lots of ties. This deficit should at least be made clear in the documentation.

Second, one way to write the first derivative for a Cox model is X'm, where m is the vector of martingale residuals. I have fast code to compute m for both the ordinary and counting process type of Cox model (as part of the work for the coxme package). One alternative for computation would be to plug this in 'midstream' to the code for continuous y (for which X'r is the derivative). Does this seem like a workable approach?

The calling parameters for the routine are (time, status, score, index) for the coxph case, where index = order(-time, -status) and the data are in the original order. For the start/stop case one needs (start, stop, status, score, index1, index2), where index1 = order(-stop, -status) and index2 = order(-start). The algorithm is O(n) for the first case and O(2n) for the second. The routines return the residuals and the loglik.

Terry Therneau

Memory problems

It is likely that gbm contains a 'memory bug', in general or at least when running a multinomial classification model in caret. Developers, please review issue #263 in topepo/caret.
A single-core task consumes all memory on a 16 GB machine and doesn't return memory after the task is completed.

Deprecate gbm.fit?

Currently, the main training functionality is split between the functions gbm and gbm.fit. This leads to some code bloat and duplication (passing and (re)checking the arguments, extracting dimensions, etc.). In order to simplify the code, should we get rid of gbm.fit? It is not really necessary for users; the documentation cites saving the time needed to generate the model frame, but for non-trivial applications I would think this is a marginal fraction of the overall run time. Thoughts?
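To illustrate the duplication, the two entry points fit the same model (a hedged sketch with made-up data):

library(gbm)
d <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))

# formula interface: builds the model frame, then calls gbm.fit internally
m1 <- gbm(y ~ x1 + x2, data = d, distribution = "gaussian", n.trees = 100)

# lower-level interface: takes x and y directly, skipping the model frame
m2 <- gbm.fit(x = d[, c("x1", "x2")], y = d$y,
              distribution = "gaussian", n.trees = 100)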

Core parallelization of split point search algorithm

With the latest GBM version, an option was introduced to parallelize cross-validation by distributing the folds to multiple cores. However, not everyone uses cross-validation all the time; it would be nice to parallelize the general algorithm as well. I am using GBM with large data files (millions of rows and hundreds of columns), and one run takes many hours.

About 90% of the GBM run time is spent in the loop calculating the possible loss over all possible features/split points. The most effective way to parallelize this would be to distribute different feature subsets to different cores (the proprietary Yahoo GBT code solved it like this); see the conceptual sketch below.
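A conceptual sketch of that idea in R (hedged: the real split search lives in C++; this only illustrates distributing a per-feature search across cores, and best_split_for_feature is a hypothetical stand-in for the inner loop):

library(parallel)

best_split_for_feature <- function(x, grad) {
  cuts <- sort(unique(x))
  if (length(cuts) < 2) return(list(cut = NA, loss = Inf))
  cuts <- cuts[-length(cuts)]                # candidate thresholds
  loss <- vapply(cuts, function(cc) {
    l <- grad[x <= cc]; r <- grad[x > cc]
    -(sum(l)^2 / length(l) + sum(r)^2 / length(r))  # squared-gradient gain proxy
  }, numeric(1))
  list(cut = cuts[which.min(loss)], loss = min(loss))
}

parallel_split_search <- function(X, grad, n_cores = 2L) {
  # one task per feature; mc.cores > 1 requires a non-Windows OS
  res <- mclapply(seq_len(ncol(X)),
                  function(j) best_split_for_feature(X[, j], grad),
                  mc.cores = n_cores)
  j <- which.min(vapply(res, function(r) r$loss, numeric(1)))
  list(feature = j, cut = res[[j]]$cut, loss = res[[j]]$loss)
}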

training after the refactoring (b1628fc) is much slower

Before the refactoring, training on a 100k-row toy data set with depth 3 and 100 trees finished in 5 minutes; after b1628fc, it takes hours.

From my superficial understanding, it looks like:

  1. nodes' best splits are no longer cached [1]: previously, only the new nodes' best splits were calculated, from a portion of the data, but now we re-calculate all nodes in each iteration;
  2. an extra loop [2] over the nodes has been introduced; because of the cache's absence, it makes things even slower.

[1] https://github.com/gbm-developers/gbm/blob/22a019d25e1527f5bcf4ae1825fdf79eeec2f5e6/src/node_search.cpp#L56
[2] https://github.com/gbm-developers/gbm/blob/master/src/node_search.cpp#L47

MRR in LambdaMART

Hi,

Is there an example/demo of using MRR with the pairwise distribution in the LambdaMART implementation?
I am not able to understand how the ordering is maintained to calculate MRR. Also, MRR needs a binary response variable; does that mean every query "group" can have only a single click instance in the training data? How would you model multi-session clicks for the same query but different clicked documents?

Thank you for the LambdaMART implementation, much appreciated!

INTEGER() can only be applied to a 'integer', not a 'double'

I moved to this development version of the gbm package as suggested by @harrysouthworth in his gbm repo (due to the gbm.more bug with the multinomial distribution). But after downloading this one, I ran into the issue below.
The issue occurred on:

  • Windows 7, 64-bit
  • R 3.1.2 (2014-10-31)
  • multinomial dependent variable
  • settings left at the defaults as much as possible

(Screenshots of the package issue and traceback omitted.) Note that the same error occurred even with the package's demo.

Incorrect prediction with new levels due to invalid memory addresses

d <- data.frame(x=as.factor(1:20), y=1:20)

train <- d[1:10,]
test  <- d[11:20,]

p <- rep(0, 10)

while(sum(abs(p)) == 0)
{
    g <- gbm(y ~ x,
             distribution="gaussian",
             bag.fraction=1,
             data=train, 
             n.trees=1,
             shrinkage=1,
             n.minobsinnode=1)

    p <- predict(g, newdata=test, n.trees=1) - g$initF
}

pretty.gbm.tree(g, 1)
#  SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight Prediction
#0        0           0.0        1         2           3           62.5     10        0.0
#1       -1          -2.5       -1        -1          -1            0.0      5       -2.5
#2       -1           2.5       -1        -1          -1            0.0      5        2.5
#3       -1           0.0       -1        -1          -1            0.0     10        0.0

g$c.splits[[1]]
#[1] -1 -1 -1 -1 -1  1  1  1  1  1

print(p)
#[1] 0.0 2.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

In this example the test data has 10 levels unseen during training. The predictions should all be 0.0 since (according to gbmentry.cpp) the intent is that new levels are treated as missing:

iCatSplitIndicator = INTEGER(
            VECTOR_ELT(rCSplits,
                       (int)adSplitCode[iCurrentNode]))[(int)dX];
if(iCatSplitIndicator==-1)
{
   iCurrentNode = aiLeftNode[iCurrentNode];
}
else if(iCatSplitIndicator==1)
{
   iCurrentNode = aiRightNode[iCurrentNode];
}
else // categorical level not present in training
{
   iCurrentNode = aiMissingNode[iCurrentNode];
}

The problem is that INTEGER(VECTOR_ELT(rCSplits, (int)adSplitCode[iCurrentNode])) has length equal to the number of levels in the training data, and yet the program retrieves values at positions 11, 12, ..., 20, as these are the values of (int)dX in the test data. This can easily be verified by adding a printf. Surprisingly, there is no segfault. The values at the addresses immediately following the array are in general not equal to -1 or 1, so in general the program correctly uses the missing node. However, if by random chance there is a -1 or 1, the record will be scored at the left or right child respectively.

This is more illustrative:

                                    |-> shouldn't be accessing here
                                    |
[-1, -1, -1, -1, -1, 1, 1, 1, 1, 1, 3245, 1, 64, 2342, 93348, -34857, 82, -8634, 9, 239]
                                          ^by chance this is a 1, so level 12 goes to 
                                           the right child instead of the missing node
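A hedged user-side workaround (not a fix for the underlying C++ bug): recode any factor levels unseen during training to NA before predicting, so those rows are routed to the missing node as intended. safe_newdata is an illustrative helper, not part of the package.

safe_newdata <- function(train, newdata) {
  for (v in names(newdata)) {
    if (is.factor(newdata[[v]]) && v %in% names(train) && is.factor(train[[v]])) {
      # re-coding against the training levels turns unseen levels into NA
      newdata[[v]] <- factor(as.character(newdata[[v]]),
                             levels = levels(train[[v]]))
    }
  }
  newdata
}

p <- predict(g, newdata = safe_newdata(train, test), n.trees = 1) - g$initF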

subscript out of bounds

When training some GBMs using caret, it is impossible to use the gbm object for prediction, since the following error occurs:

Error in object$var.levels[[i]] : subscript out of bounds

Is it possible to fix this issue?

plot.gbm: use split points from trees as grid

The parameter 'continuous.resolution' allows setting equidistant spacing for 1-3 way interaction plots. However, especially for 1-D plots, it is often useful to see all the steps (not necessarily equidistant) induced by any node split in any of the decision trees: more precisely, a grid made of the pairs of points x-epsilon and x+epsilon for each threshold x of a continuous split (respectively, all values x of a categorical feature). This gives an idea of the split distribution and, more importantly, gives a ready visual hint of overfitting, which shows up as discontinuities or 'spikes' (see the attached screenshot, Screen Shot 2015-07-07 at 1.25.32 PM, not reproduced here, for an example).
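A hedged sketch of extracting such a grid from a fitted gbm object (split_grid is an illustrative helper; it assumes variable i.var is continuous, and uses pretty.gbm.tree(), whose SplitVar column is 0-based):

split_grid <- function(object, i.var, eps = 1e-6) {
  splits <- unlist(lapply(seq_len(object$n.trees), function(t) {
    tr <- pretty.gbm.tree(object, i.tree = t)
    tr$SplitCodePred[tr$SplitVar == i.var - 1L]  # split thresholds for this variable
  }))
  x <- sort(unique(splits))
  sort(c(x - eps, x + eps))  # evaluate just either side of each threshold
}

# e.g. grid <- split_grid(gbm1, i.var = 1) for a fitted model gbm1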

More notes on coxph

Our group is using gbm survival models more and more, so I am motivated to contribute a solid fix for this. I have C code that is rock solid; it may require a bit of input from you all to convert it to C++. Last time I counted, I think there were 11 different computer languages in which I've done substantive work at some time in my career, but C++ isn't on the list. Three issues need to be sorted out before I dive in:

  1. Stratified Cox models. These have a lot of uses. The loglik and derivative for each subject are computed within strata. Do you want to allow it? Easy for the user to specify, but one more variable to pass down the calling chain.
  2. Breslow vs Efron approximation. The survival::coxph and coxme::coxme functions use the Efron approximation by default; most of the rest of the world defaults to Breslow. The latter is a tad easier to program and ends up as the default because it was done first and Efron was added later. I use Efron because it is the more accurate one and I like to dot the i's, but in most data sets the difference between the two is so small as to not be relevant. There is one counterexample: stratified logistic regression. This is used for matched case-control studies, common in epidemiology and now showing up a fair bit in genetic association studies. It can be fit using a stratified Cox model, and for this you want Efron. I don't know if you are getting requests for stratified logistic.
     You can decide to do only Efron, only Breslow, or allow it as an option. If an option, that is another argument to pass down the chain.
  3. Counting-process data. This is hugely useful. Users have Surv(time1, time2, status) ~ x1 + x2 + ... as the formula, so it makes no impact on them. Again, more variables to pass down the calling chain.

Looking at the code, you have chosen to make all of the fitting routines have the same argument list. One thing I did in rpart that might be an approach here was to let y be a matrix and pass nrow/ncol.

Terry T.

Refactoring proposal for distribution class / alternative options for using Hessian

The function API in the current distribution class has instance vectors for the target (adY), the current ensemble value (adF), the tree value (adFadj), the gradient (adZ), the weights (adW), and the offset (adOffset). First, ComputeWorkingResponse is called to calculate the gradient. Then, it is passed to FitBestConstant.

Although FitBestConstant is implemented separately in every distribution, it is quite similar each time: a numerator array keeps track of the sum of gradients per terminal node, and a denominator array sums the diagonals of the Hessian (computed inline). The final predicted value is the ratio of the two.
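In symbols (restating the description above; weights, where used, are folded into both sums), the fitted constant for terminal node k is the Newton step

\gamma_k = \sum_{i \in node k} z_i / \sum_{i \in node k} h_i

where z_i is the gradient and h_i the diagonal Hessian entry for observation i.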

Proposal: If we changed the interfaces of ComputeWorkingResponse and FitBestConstant to include the Hessian as well, it might be possible to reuse the same single implementation of FitBestConstant. Moreover, this would make it easier to allow options to use the Hessian differently, or not at all.

While the Newton algorithm helps find a good solution fast, the final model might sometimes actually be better using gradients alone (or, as a compromise, limiting/capping the gradients); low Hessians can easily lead to overfitting. I realize such a cap is implemented for the Bernoulli distribution, but we could make the procedure generally applicable to all distributions, or give the user an option to use only gradients (for the initial trees). Thoughts?

make gbm predict to handle poisson

According to the current package user guide, "The predictions from gbm do not include the offset term. The user may add the value of the offset to the predicted value if desired." Is it possible to accommodate this automatically?
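Until then, a hedged sketch of doing it by hand for a Poisson model (m, newdata, n.trees and log_offset are placeholders):

link <- predict(m, newdata, n.trees = n.trees, type = "link")
pred <- exp(link + log_offset)  # add the offset on the link scale, then invert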

problem with local variables and gbm formula interface

I ran into the following problem, where gbm does not find a predictor in the local environment (the data argument is effectively not optional):

do.gbm <- function(y,x){
  require(gbm)
  l <- x
  o <- gbm(y~l)
}

x <- rnorm(100)
y <- x^2
do.gbm(y,x)

Produces

 Error in eval(expr, envir, enclos) : object 'l' not found 

It occurs at the following line (456), which looks for the variables within data rather than parent.frame():

   x <- model.frame(terms(reformulate(var.names)),
                    data,
                    na.action=na.pass,
                    subset=subset)

A proposed solution could be to just select the predictors from mf using the var.names computed on the line before; mf should have all the variables, whether they come from data or from the parent.frame. I've not tested this extensively.

x <- mf[,var.names,drop=F]
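A hedged user-side workaround in the meantime: build a data frame from the local variables and pass it via data, so the formula never has to search the calling environment.

do.gbm <- function(y, x){
  require(gbm)
  d <- data.frame(y = y, l = x)  # capture the local variables explicitly
  gbm(y ~ l, data = d)
}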

results not reproducible

set.seed(20150312)
wh <- gbm(Species ~ ., data=iris, cv.folds=3, n.cores=1)
ri1 <- relative.influence(wh)

set.seed(20150312)
wh <- gbm(Species ~ ., data=iris, cv.folds=3, n.cores=1)
ri2 <- relative.influence(wh)

ri1
ri2

Presumably this is due to std::random_shuffle in tree.cpp.

Bug in gbm.more for model trained with keep.data = FALSE

In gbm.more, if the model has been trained with keep.data=FALSE, alternative data can be passed in via the data argument. However, the parameter for the number of training examples (nTrain) is copied from the model; this leads to a crash if the new data has fewer than nTrain rows. More generally, I think we should not implicitly assume the passed-in data is identical to the original one, otherwise the option would not be necessary (we could just require that the model always be trained with keep.data=TRUE in order to be able to run gbm.more). Or maybe we should, as this would greatly simplify the code?

Rounding error in quantile() - variable N: foo has no variation.

From @w3iBStime on April 8, 2014 20:1

Due to what appears to be a bug in the quantile() function from the stats package, quantile() doesn't seem to be a reliable way of determining whether a numeric vector is non-varying.

Here's the bug I filed against the quantile() function:
https://bugs.r-project.org/bugzilla/show_bug.cgi?id=15746

...and here's where it gets used in gbm:
https://github.com/harrysouthworth/gbm/blob/master/R/gbm.fit.R#L90-L100

Copied from original issue: harrysouthworth#16

Memory leak with "laplace" still present?

From @johnrolfeellis on March 11, 2014 4:2

I'm experiencing a significant memory leak in 2.1-0.3 with the "laplace" distribution, similar to what is described here:

https://code.google.com/p/gradientboostedmodels/issues/detail?id=32

The file laplace.cpp was fixed at line 97 in 2.1-0.3 to address this. But although I'm running 2.1-0.3 (at least that's what library(gbm) says), I'm still getting a memory leak.

The training set has 280K rows with 11 columns, and the parameters are:

gbm.fit (trainingSet, outcomes, nTrain = 279870, distribution = "laplace", interaction.depth = 4, n.trees = 1000)

which uses 1.1 GB of working set and 4.7 GB of committed memory. Very roughly, adding 2000 more trees uses an additional 2 GB of working set and 4-5 GB of committed memory. But when the distribution is changed to "gaussian", the working set and committed memory stay around 550 MB, no matter how many trees.

I tried building the master branch, but gbm.fit() doesn't appear to work:

Iter   TrainDeviance   ValidDeviance   StepSize   Improve
     1           nan             nan     0.0010       nan
     2           nan             nan     0.0010       nan
     3           nan             nan     0.0010       nan

(R version 3.0.2, gbm 2.1-0.3, Windows 7 64-bit.)

Copied from original issue: harrysouthworth#11

Improvement for coxph

From @jeffwong on June 3, 2014 4:30

Would really appreciate the ability to use intervals in the Surv class: instead of passing Surv(time, censor indicator), it would be great to pass Surv(start time, end time, censor indicator). This would require rewriting some of the C++ code, because adding this flexibility implies that at a particular time t not all records are part of the risk set. In the current version, as long as a record is not censored it is in the risk set; with a start and end time, we must have t >= start time and t < end time for a record to be in the risk set.
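On the R side the requested form already exists in the survival package (a hedged illustration with made-up times; only gbm's C++ would need to change to accept it):

library(survival)
# counting-process form: each row is at risk only between its start and stop times
y <- Surv(time = c(0, 2, 1), time2 = c(5, 7, 3), event = c(1, 0, 1))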

Copied from original issue: harrysouthworth#21


Levels with zero weight are always assigned to the right child

From @PatrickOReilly on March 16, 2015 18:48

g <- gbm(y ~ x, 
         distribution="gaussian", 
         train.fraction=1,
         bag.fraction=0.1,
         data=data.frame(x=as.factor(1:100), y=rnorm(100)), 
         n.trees=1,
         n.minobsinnode=1)

g$c.splits[[1]]
 [1]  1  1 -1  1  1  1  1 -1  1 -1  1  1 -1  1  1  1  1  1  1  1  1  1
[23]  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
[45]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
[67]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
[89]  1  1  1  1  1  1  1  1  1  1  1  1

In this example 5 levels are assigned to the left child and 95 to the right child. Of those at the right child, 90 will have had zero weight while training (since bag.fraction=0.1). It's an artificial example, but a node having zero weight for a particular level is very possible when training a real model, due to

  • correlation among predictors
  • bag.fraction < 1
  • predictors with high cardinality.

I'm wondering if this could be an issue, since the tree is (in expectation) over-predicting for zero-weight levels. Granted, later trees will attempt to correct for any over-prediction, but

  1. the later trees may also have zero-weight levels at their nodes
  2. convergence may be improved if the need for correction is avoided.

Perhaps in these circumstances it is more reasonable to use the prediction at the parent node since there is no data to suggest whether the zero-weight levels should be assigned to either the left or right child?

Copied from original issue: harrysouthworth#45

Setting seed yields different results on mac for multicore gbm

This test fails on my Mac:

test_that("Setting the seed causes result to be reproducible (multicore)", {
  skip_on_cran()
  set.seed(18900217)
  mod <- gbm(Species == 'setosa' ~ ., data=iris, distribution="bernoulli",
             cv.folds=3, n.trees=100, shrinkage=.1)
  nt1 <- gbm.perf(mod, method="cv", plot.it=FALSE)
  ri1 <- relative.influence(mod, n.trees=nt1)

  set.seed(18900217)
  mod <- gbm(Species == 'setosa' ~ ., data=iris, distribution="bernoulli",
             cv.folds=3, n.trees=100, shrinkage=.1)
  nt2 <- gbm.perf(mod, method="cv", plot.it=FALSE)
  ri2 <- relative.influence(mod, n.trees=nt2)

  expect_equal(nt1, nt2,
               label="Number of trees match when same seed is used")
  expect_equal(ri1, ri2,
               label="Relative influences match when same seed is used")
})

I am wondering if we should add Mac OS X to our Travis builds:
http://docs.travis-ci.com/user/osx-ci-environment/

Question on Vignette : Gradient for Bernoulli

Is the gradient on page 10 (z_i) showing the derivative of the log-likelihood rather than of the deviance shown above it (i.e. the negative log-likelihood)? It seems the gradient shown does not take the minus sign into account (or the factor of 2, for that matter, though I have seen that one ignored or cancelled).
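Working the algebra out (an editor's sketch of the standard unweighted Bernoulli calculation, not quoted from the vignette): with p_i = 1/(1 + e^{-f(x_i)}), the log-likelihood and deviance are

\ell = \sum_i y_i f(x_i) - \log(1 + e^{f(x_i)}),    D = -2 \ell

so that

\partial \ell / \partial f(x_i) = y_i - p_i,    \partial D / \partial f(x_i) = -2 (y_i - p_i)

A step along z_i = y_i - p_i therefore ascends the log-likelihood, which is the same direction as descending the deviance; the minus sign and the factor of 2 only flip and rescale the gradient, so the fitted ensemble is unaffected.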

Release tag for GBM 2.1.1

CRAN is on GBM v2.1.1, but it is not clear which commit/hash that is related to. Can we get a release tag set at this commit?

Quantile regression bug?

Hi Harry,

I was looking at the GitHub gbm change log (https://github.com/harrysouthworth/gbm/blob/master/CHANGES), and I see the following was apparently introduced in v2.0:

  • the "quantile" distribution now handles weighted data

However, using v2.1.1, I get the following message when I actually try to do that:

test <- gbm(AGE~SEX+SEAT_POS+AIR_BAG, data=traindf, distribution=list(name="quantile", alpha=0.5), weights=WEIGHT)
Error in gbm.fit(x, y, offset = offset, distribution = distribution, w = w, :
This version of gbm for the quantile regression lacks a weighted quantile. For now the weights must be constant.

Is this an oversight in the GitHub documentation, or can the quantile method actually handle weights now?

Many thanks,
Kevin

Kevin Ummel
Research Scholar, Energy Program
International Institute for Applied Systems Analysis
