Comments (6)
the forest is an ensemble of multiple trees, and each tree is constructed by
sampling the training examples with replacement (bagging). that means about
63.2% of the training data is used for constructing each tree (but different
trees see different training examples due to bagging).
RF uses this property: each tree has about 100% - 63.2% ≈ 36.8% of the data not
used for training, which serves as a validation set and is used to predict on
the out-of-bag (OOB) examples (search for that in the tutorial file)
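The ~63.2% figure comes from bootstrap sampling: when n examples are drawn with replacement, each example is missed with probability (1 - 1/n)^n ≈ 1/e ≈ 0.368. A minimal stdlib-Python sketch (not part of this package) that checks the in-bag fraction empirically:

```python
import random

random.seed(0)
n = 10000  # training set size

# draw one bootstrap sample of size n, with replacement
sample = [random.randrange(n) for _ in range(n)]
in_bag = len(set(sample))   # distinct examples this "tree" trains on
out_of_bag = n - in_bag     # examples left out, usable as a validation set

# in-bag fraction is close to 1 - 1/e ~ 0.632
print(in_bag / n, out_of_bag / n)
```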
usually 5x2 CV or 10-fold CV is the standard when reporting an algorithm's
performance, and the OOB idea is limited to classifiers that use an ensemble +
bagging. so people usually (as with SVM) split training into training +
validation, choose the best model on validation, and then use those parameters
to train a single model on the training set and predict on test
whereas if you are using RF, you don't have to create a validation set: use all
the training data to create models, find the model with the lowest OOB error,
and then use that model to predict on test. usually i use all the training
data, set a fixed ntree=1000, search over multiple mtry values
mtry = D/10:D/10:D (where D = number of features), choose the model that had
the lowest OOB error, and use that model to predict on test.
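The search loop described above can be sketched as follows. This is stdlib Python rather than this package's MATLAB interface, and `train_rf_oob` is a hypothetical stand-in for a forest trainer that fits on all training data and returns an OOB error (the fake errors inside it are for illustration only):

```python
D = 50        # number of features (example value)
ntree = 1000  # fixed, as in the protocol above

# mtry grid D/10 : D/10 : D, i.e. {5, 10, ..., 50} for D = 50
step = max(D // 10, 1)
mtry_grid = list(range(step, D + 1, step))

def train_rf_oob(ntree, mtry):
    """Hypothetical stand-in for a forest trainer that fits on ALL the
    training data and reports its out-of-bag error; the returned values
    here are fake and only make the sketch runnable."""
    return abs(mtry - 15) / 100.0

# pick the mtry with the lowest OOB error; no validation split needed,
# then that model predicts on the held-out test set
best_mtry = min(mtry_grid, key=lambda m: train_rf_oob(ntree, m))
print(best_mtry)
```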
comparing it to SVM: i would ideally create 10 different folds, randomly pick
one fold for validation, 8 for training, and one for test, then parametrically
search over various kernels etc. by creating models on training and predicting
on validation. once i find the best model parameters i create a single model
using training + validation and then predict on the test fold. and repeat this
many times
Original comment by abhirana
on 11 May 2012 at 4:51
from randomforest-matlab.
i meant that for 10-fold CV and SVM i would do the following:
i would ideally create 10 different folds, randomly pick one fold for
validation, 8 for training, and one for test, then parametrically search over
various kernels etc. by creating models on training and predicting on
validation. once i find the best model parameters i create a single model using
training + validation and then predict on the test fold. and repeat this many
times
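The fold protocol above can be sketched in stdlib Python (the model fitting itself is left out; only the index bookkeeping is shown):

```python
import random

random.seed(0)
n, folds = 100, 10

# assign each example to one of 10 folds at random
idx = list(range(n))
random.shuffle(idx)
fold_of = {i: k % folds for k, i in enumerate(idx)}

test_fold, val_fold = 0, 1   # e.g. fold 0 = test, fold 1 = validation
train = [i for i in range(n) if fold_of[i] not in (test_fold, val_fold)]
val   = [i for i in range(n) if fold_of[i] == val_fold]
test  = [i for i in range(n) if fold_of[i] == test_fold]

# search kernels/parameters by fitting on `train`, scoring on `val`;
# then refit the best setting on train + val and predict on `test`,
# rotating which folds play the val/test roles across repetitions
print(len(train), len(val), len(test))
```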
Original comment by abhirana
on 11 May 2012 at 5:00
Thank you for your explanations. I just want to make sure that I understand
your meaning:
I still need to have a separate test set to test the best model on, but since
the validation part is done internally in RF, I don't need to have a
training + validation split and can use all the training data for training,
correct? If so, I am still confused about what Breiman's website says regarding
no need for a separate test set.
I want to compare the result of RF with an fkNN classifier on my data set. For
fkNN I leave one subject (101*101 pixels) out to validate the accuracy and use
69 subjects (69*101*101 pixels) as the training set. In order to do a fair
comparison, is it correct if I do the same thing: create the best model using
the training set, test the model on the subject that was left out, and repeat
for every other subject?
Sorry, but I still cannot understand how I can evaluate the best model without
using any separate test set, as is said on Breiman's website.
Appreciate your help and time.
Original comment by [email protected]
on 11 May 2012 at 7:22
I still need to have a separate test set to test the best model on, but since
the validation part is done internally in RF, I don't need to have a
training + validation split and can use all the training data for training,
correct? If so, I am still confused about what Breiman's website says regarding
no need for a separate test set.
- yup, this is correct. breiman showed that the OOB error gives an upper bound
on the validation-set error. the reason why a test set is not required is that
the results on validation (via OOB error) are similar to those on the test set,
and RF usually behaves nicely with around 1000 trees and the default mtry
parameter; maybe that is why they say a separate test set is not needed. but
for publishable results, and to be equivalent to reporting on other
classifiers, it is important to do a training + (validation) + test split
I want to compare the result of RF with an fkNN classifier on my data set. For
fkNN I leave one subject (101*101 pixels) out to validate the accuracy and use
69 subjects (69*101*101 pixels) as the training set. In order to do a fair
comparison, is it correct if I do the same thing: create the best model using
the training set, test the model on the subject that was left out, and repeat
for every other subject?
- yeah, or instead of leave-one-out go with 5x2 CV or 10-fold CV; that might be
faster. also make sure that the splits used to train/test kNN are the same
splits used for training/testing RF, so that you can do some paired testing on
the results.
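The paired testing mentioned above can be sketched in stdlib Python. Both classifiers must be scored on identical test folds; the per-fold accuracies below are placeholders, not measured results:

```python
import statistics

# per-fold accuracies measured on IDENTICAL test folds for both
# classifiers; these numbers are placeholders for illustration only
acc_rf  = [0.91, 0.88, 0.90, 0.93, 0.89, 0.92, 0.90, 0.91, 0.88, 0.92]
acc_knn = [0.87, 0.86, 0.88, 0.90, 0.85, 0.89, 0.88, 0.87, 0.86, 0.90]

# because the folds are shared, the fold-wise differences are paired
# and can feed a paired t-test (or sign test) on the two classifiers
diffs = [a - b for a, b in zip(acc_rf, acc_knn)]
print(statistics.mean(diffs), statistics.stdev(diffs))
```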
Sorry, but I still cannot understand how I can evaluate the best model without
using any separate test set, as is said on Breiman's website.
- well, let's say you do not fix any parameter of RF except setting it to
1000 trees and the default mtry value, and then create a bunch of trees; for
each tree, part of the dataset is not used for training (due to bagging). now
use the individual trees to predict on all examples that were not used for
training those trees, then take the ensemble votes on those examples (out of
bag for those trees) and report those results. now consider what you do with
typical classifiers: you will ideally create a training/test split, then divide
that training set into training + validation to pick the best parameter, then
create a single model with the training set and the best parameter, and then
use that model to predict on test. you will do this tons of times and report
the final test error. this is no different from an individual tree in the
forest, which trains on a unique dataset and predicts on a held-out set,
repeated over a ton of different trees. the only difference is that, because
it is an ensemble, it takes the final votes over held-out examples at the very
end. and some research has shown that a held-out validation error and the OOB
error tend to be similar, and if you take all the data in your dataset then
ooberr = tsterr
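The vote-aggregation mechanism described above can be sketched from scratch in stdlib Python. The "tree" here is deliberately trivial (it predicts the majority class of its bootstrap sample) so the sketch stays self-contained; only the OOB bookkeeping, not real tree learning, is the point:

```python
import random
from collections import Counter

random.seed(1)
n, ntree = 200, 25
y = [random.randrange(2) for _ in range(n)]   # toy binary labels

votes = [Counter() for _ in range(n)]
for _ in range(ntree):
    bag = [random.randrange(n) for _ in range(n)]  # bootstrap sample
    in_bag = set(bag)
    # stand-in "tree": always predicts the majority class of its bag
    pred = Counter(y[i] for i in bag).most_common(1)[0][0]
    # each tree votes only on examples it never saw (its out-of-bag set)
    for i in range(n):
        if i not in in_bag:
            votes[i][pred] += 1

# OOB prediction for an example = majority vote of its out-of-bag trees
voted = [i for i in range(n) if votes[i]]
oob_pred = {i: votes[i].most_common(1)[0][0] for i in voted}
oob_err = sum(oob_pred[i] != y[i] for i in voted) / len(voted)
print(oob_err)
```

With enough trees, essentially every example receives some out-of-bag votes, which is why no separate held-out set is needed to get a test-like error estimate.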
Original comment by abhirana
on 11 May 2012 at 8:01
Thank you so much for the clarification. I guess now I have a better
understanding of what Breiman said. If I use the whole dataset for training and
compute the OOB error, that would be similar to the test error, but in order to
publish the classification result as the classifier accuracy and compare with
other classifiers, I'd better do a CV.
Thanks again!
Original comment by [email protected]
on 11 May 2012 at 9:27
Original comment by abhirana
on 19 Dec 2012 at 9:07
- Changed state: Done