guitargeek / XGBoost-FastForest
Minimal library code to deploy XGBoost models in C++.
License: MIT License
I'm interested in comparing my own custom reader to this one, but I'm having trouble getting benchmark results consistent with what the README reports. Details:
My results:
FastForest: 1.61 s
treelite: -*
m2cgen: 1.71 s
xgboost: 0.12 s
TMVA: 8.95 s
The relative differences between these numbers are rather different from the README's (and xgboost is somehow blazing fast).
It could be useful to distribute a Dockerfile that sets up and runs all the benchmarks in a more controlled environment (OS, versions, etc.).
* The treelite benchmark (using version 2.1.0) doesn't work at all. I get the following error:
ModuleNotFoundError: No module named 'treelite.runtime'
If I swap it with treelite_runtime (a separate pip package), then there's another error:
AttributeError: module 'treelite_runtime' has no attribute 'Batch'
Hello,
I'm currently working on a project that involves using FastForest, and as part of my validation process, I've been comparing inference results between XGBoost and FastForest using a single vector. However, I've come across an unexpected issue that I'm seeking assistance with.
During my experiment, I noticed that when I use the 'binary:logistic' objective in XGBoost, the predicted values differ from those obtained using FastForest. Strangely, when I switch to the 'binary:logitraw' objective in XGBoost, the predicted scores align with those from FastForest.
I suspect that the difference might be due to distinct logistic transformations applied in XGBoost and FastForest. To address this, I've tried exploring XGBoost's documentation for details about the logistic transformation applied with the 'binary:logistic' objective. Unfortunately, I couldn't find the specific information I was looking for.
In the example provided in the README for FastForest, a sigmoid transformation is explicitly applied to the score obtained from the model. Based on this, I assumed that XGBoost also applies a sigmoid transformation for the 'binary:logistic' objective. Is that right?
The code for reproduction is the following:
XGBoost model:

from xgboost import XGBClassifier
from sklearn.datasets import make_classification
import numpy as np

X, y = make_classification(n_samples=10000, n_features=5, random_state=42, n_classes=2, weights=[0.5])
model = XGBClassifier(objective='binary:logistic').fit(X, y)  # switch objective to 'binary:logitraw' to match results
booster = model.get_booster()
print(model.predict_proba(np.array([[0.0, 0.2, 0.4, 0.6, 0.8]])))  # [[0.37146312 0.6285369 ]]
booster.dump_model("model.txt")
Load the model into FastForest and perform inference:
#include "fastforest.h"
#include <cmath>
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> features{"f0", "f1", "f2", "f3", "f4"};
    const auto fastForest = fastforest::load_txt("model.txt", features);
    std::vector<float> input{0.0, 0.2, 0.4, 0.6, 0.8};
    float score = fastForest(input.data());  // 1.02595
    float sigmoid = 1. / (1. + std::exp(-score));
    std::cout << "sigmoid: " << sigmoid << std::endl;  // 0.736129
}
I'm interested in understanding the logistic regression mismatching between XGBoost and FastForest when using the 'binary:logistic' objective. I'm eager to get to the bottom of this issue and would greatly appreciate any help.
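Not an authoritative answer, but the transformation itself is easy to pin down: 'binary:logistic' applies the standard logistic (sigmoid) function to the raw margin, and its inverse is the logit. A minimal sketch using only the numbers quoted in the snippets above (1.02595 is the raw FastForest score, 0.6285369 the XGBoost probability):

```python
import math

def sigmoid(raw):
    # The logistic transformation applied by the 'binary:logistic' objective
    return 1.0 / (1.0 + math.exp(-raw))

def logit(p):
    # Inverse of the sigmoid; maps a probability back to a raw margin
    return math.log(p / (1.0 - p))

raw = 1.02595            # raw FastForest score from the C++ snippet
print(sigmoid(raw))      # ~0.736129, the value the C++ code prints
print(logit(0.6285369))  # ~0.526, the margin implied by predict_proba
```

If the two margins differ by a constant like this, the model's base_score (which is folded into XGBoost's margin but not into the sum of dumped leaf values) is the usual suspect, though that is an educated guess rather than a verified diagnosis for this model.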
Hi!
We've integrated your library into our project aiming to achieve machine learning results that closely align with an alternative XGBoost implementation.
The basic pipeline is that we dump the XGBoost JSON model to txt and use it to perform inference. We've encountered discrepancies when comparing the results of our FastForest model with those of XGBoost.
Throughout our experiments, we've observed instances (with certain input vectors) where the FastForest model produces unexpected results, differing from XGBoost.
We have reduced the problem to a single input vector. This is a self-contained minimal repo to reproduce the problem: https://github.com/andriiknu/fastforest_issue/tree/master
Thank you in advance! Looking forward to any assistance.
When I use gcc 4.8 to install, I get an error like this:

/home/fanni/za/c_plus/XGBoost-FastForest-master/src/fastforest.cpp:36:35: fatal error: experimental/filesystem: No such file or directory
 #include <experimental/filesystem>
compilation terminated.
CMakeFiles/fastforest.dir/build.make:62: recipe for target 'CMakeFiles/fastforest.dir/src/fastforest.cpp.o' failed
make[2]: *** [CMakeFiles/fastforest.dir/src/fastforest.cpp.o] Error 1
CMakeFiles/Makefile2:67: recipe for target 'CMakeFiles/fastforest.dir/all' failed
make[1]: *** [CMakeFiles/fastforest.dir/all] Error 2
Makefile:140: recipe for target 'all' failed
make: *** [all] Error 2

So I tried gcc 7.5.0 instead; the install succeeded, but when I #include the header in main.cpp, a new issue appears:

/tmp/ccWRi8pr.o: in function `main':
main.cpp:(.text+0x232): undefined reference to `FastForest::FastForest(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&)'
main.cpp:(.text+0x2ca): undefined reference to `FastForest::operator()(float const*) const'
collect2: error: ld returned 1 exit status

So how should I use it? Looking forward to your reply!
I'm following your recipe for compiling your library but am facing the following issue that I'm unsure how to resolve. Do you have any suggestions?
[[email protected] build]$ cmake --version
cmake3 version 3.6.1
CMake suite maintained and supported by Kitware (kitware.com/cmake).
[[email protected] build]$ cmake ..
-- The C compiler identification is GNU 4.4.7
-- The CXX compiler identification is GNU 4.4.7
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
CMake Error at CMakeLists.txt:7 (project):
project with VERSION must use LANGUAGES before language names.
-- Configuring incomplete, errors occurred!
See also "/disk1/mpadilla/projects/conf_eng/XGBoost-FastForest/build/CMakeFiles/CMakeOutput.log".
[[email protected] build]$
Unfortunately, looking at the log file wasn't immediately useful. Any help/suggestions would be appreciated. Thank you!
Hello,
When I run a regression model with gcc 4.9.2 (Debian 4.9.2-10+deb8u2), the results are inconsistent. But with gcc 8.3.0 (Debian 8.3.0-6), the results agree with the Python ones.
How do I run FastForest with gcc 4.9.2? Thank you!
I got the following linking error when I compile my test code:
undefined reference to `fastforest::load_txt(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&)'
Here is my test code:
#include <stdlib.h>
#include <stdio.h>
#include <iostream>
#include <string>
#include <vector>
#include "fastforest.h"

int main() {
    std::vector<std::string> features{"f0", "f1", "f2", "f3", "f4", "f5", "f6", "f7", "f8", "f9",
                                      "f10", "f11", "f12", "f13", "f14", "f15", "f16", "f17", "f18", "f19",
                                      "f20", "f21", "f22", "f23", "f24", "f25", "f26", "f27", "f28", "f29"};
    const auto fastForest = fastforest::load_txt("model.txt", features);
    std::vector<float> input{1, 0, 6, 4, 0, 0, 0, 1, 6, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 6, 0, 0, 0, 0, 0, 0, 0};
    float bond_length = fastForest(input.data());
    std::cout << "Bond Length = " << bond_length << "\n";
}
Any help will be highly appreciated! Thanks.
Hello! Thank you for that nice library!
Does it support regression tasks? I tried some examples, but the result falls 0.5 below the true value compared to the Python code (Python: input -4 predicts 1; C++: input -4 predicts 0.5). Does it always behave like that?
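For context on the constant 0.5 gap (an educated guess, not verified against this particular model): XGBoost's regression objectives add the global base_score (historically 0.5 by default) on top of the summed leaf values, while the raw sum over a dumped text model contains only the leaves. A toy sketch of that relationship, where `leaf_sum` stands in for the raw score a text-dump evaluator would compute:

```python
BASE_SCORE = 0.5  # XGBoost's historical default for regression objectives

def xgb_style_prediction(leaf_sum, base_score=BASE_SCORE):
    # XGBoost's regression prediction is the tree sum plus the global bias
    return leaf_sum + base_score

# Matches the numbers in the report: raw tree sum 0.5 vs Python prediction 1.0
print(xgb_style_prediction(0.5))
```

If this is the cause, adding the model's base_score to the FastForest output should close the gap.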
Very simple to use so far: I was able to use the library for binary classification, and it is speedier than m2cgen. I noticed that the return type of operator() is just float. Is it possible to return a vector of probabilities for multiclass classification, like model.predict_proba()? My task involves sampling from these probabilities.
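For the sampling part, one hedged sketch of the consumer side (shown in Python with NumPy for brevity; the raw scores array is made up): turn per-class raw scores into probabilities with a softmax, then sample a class index from them.

```python
import numpy as np

def softmax(scores):
    # Numerically stable softmax over per-class raw scores
    z = scores - np.max(scores)
    e = np.exp(z)
    return e / e.sum()

raw = np.array([0.2, 1.5, -0.3])  # hypothetical per-class raw scores
probs = softmax(raw)
rng = np.random.default_rng(0)
sampled_class = rng.choice(len(probs), p=probs)  # sample from the probabilities
print(probs, sampled_class)
```

The same two steps translate directly to C++ (an exp/normalize loop plus std::discrete_distribution) once the per-class scores are available.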
Hi, I'm following the normal build procedure to build the library (under WSL) but encounter some link issues during make, do you have some suggestions? Thanks. Here is the log:
cloudray@LEGION7000:/mnt/d/XGBoost-FastForest/build$ cmake ..
-- The C compiler identification is GNU 9.3.0
-- The CXX compiler identification is GNU 9.3.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Boost: /usr/local/lib/cmake/Boost-1.74.0/BoostConfig.cmake (found version "1.74.0") found components: system filesystem unit_test_framework
-- Configuring done
-- Generating done
-- Build files have been written to: /mnt/d/XGBoost-FastForest/build
cloudray@LEGION7000:/mnt/d/XGBoost-FastForest/build$ make -j8
Scanning dependencies of target fastforest
[ 20%] Building CXX object CMakeFiles/fastforest.dir/src/common_details.cpp.o
[ 40%] Building CXX object CMakeFiles/fastforest.dir/src/fastforest.cpp.o
[ 60%] Linking CXX shared library libfastforest.so
[ 60%] Built target fastforest
Scanning dependencies of target Test
[ 80%] Building CXX object test/CMakeFiles/Test.dir/test.cpp.o
[100%] Linking CXX executable Test
/usr/bin/ld: CMakeFiles/Test.dir/test.cpp.o: in function `SerializationTest::test_method()':
test.cpp:(.text+0x2f7): undefined reference to `fastforest::FastForest::write_bin(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const'
/usr/bin/ld: test.cpp:(.text+0x441): undefined reference to `fastforest::load_bin(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)'
/usr/bin/ld: test.cpp:(.text+0x6ed): undefined reference to `fastforest::FastForest::evaluate(float const*, float*, int) const'
/usr/bin/ld: CMakeFiles/Test.dir/test.cpp.o: in function `ManyfeaturesTest::test_method()':
test.cpp:(.text+0x274c): undefined reference to `fastforest::FastForest::evaluate(float const*, float*, int) const'
/usr/bin/ld: CMakeFiles/Test.dir/test.cpp.o: in function `ExampleTest::test_method()':
test.cpp:(.text+0x4499): undefined reference to `fastforest::FastForest::evaluate(float const*, float*, int) const'
/usr/bin/ld: CMakeFiles/Test.dir/test.cpp.o: in function `SoftmaxArrayTest::test_method()':
test.cpp:(.text+0x61af): undefined reference to `fastforest::FastForest::evaluate(float const*, float*, int) const'
/usr/bin/ld: test.cpp:(.text+0x61bc): undefined reference to `fastforest::details::softmaxTransformInplace(float*, int)'/usr/bin/ld: CMakeFiles/Test.dir/test.cpp.o: in function `BasicTest::test_method()':
test.cpp:(.text+0x873c): undefined reference to `fastforest::FastForest::evaluate(float const*, float*, int) const'
/usr/bin/ld: CMakeFiles/Test.dir/test.cpp.o: in function `DiscreteTest::test_method()':
test.cpp:(.text+0xa704): undefined reference to `fastforest::FastForest::evaluate(float const*, float*, int) const'
/usr/bin/ld: CMakeFiles/Test.dir/test.cpp.o: in function `SoftmaxTest::test_method()':
test.cpp:(.text+0xc6c7): undefined reference to `fastforest::FastForest::softmax(float const*, int) const'
collect2: error: ld returned 1 exit status
make[2]: *** [test/CMakeFiles/Test.dir/build.make:88: test/Test] Error 1
make[1]: *** [CMakeFiles/Makefile2:144: test/CMakeFiles/Test.dir/all] Error 2
make: *** [Makefile:141: all] Error 2
So, I tried to clone this repo and set up the tests in order to look at the recently closed issue about C++98 compatibility, and I ran into a lot of problems getting the script create_test_data.py to execute on my machine, mainly because of dependencies. I made a virtualenv and attached the list of dependencies pip had to retrieve in order for me to perform the following sequence of commands:
python3 create_test_data.py
python3 test_cppyy.py
-- this was all done to satisfy the test.cpp requirement for the model.txt
g++ -std=c++98 -pedantic test.cpp -lfastforest
./a.out
One last issue I had to resolve was the test script's inability to find the module xgboost2tmva.py; my solution was to copy it from the benchmark folder to the test folder. I also tried the export PYTHONPATH route, but it didn't seem to work. You might want to update the documentation to let people know how to solve that issue, or provide some other solution.
I was met with "Tests PASSED", which is a relief, but initiating these tests felt painful. I'm going to attach a requirements.txt so that people can run pip install -r requirements.txt after cloning the repo and have the dependencies resolved.
If you'd like me to submit a pull request with an update to the README.md I can also go that route, but I'd rather you have the opportunity/choice to solve this your own way if necessary. I'd just thought I would bring up this pain point I experienced. Either way, I really like this library, great work!
Raw Requirements.txt
cppyy==2.3.0
cppyy-backend==1.14.8
cppyy-cling==6.25.3
CPyCppyy==1.12.9
cycler==0.11.0
fonttools==4.31.2
joblib==1.1.0
kiwisolver==1.4.1
matplotlib==3.5.1
numpy==1.22.3
packaging==21.3
pandas==1.4.1
Pillow==9.0.1
pyparsing==3.0.7
python-dateutil==2.8.2
pytz==2022.1
scikit-learn==1.0.2
scipy==1.8.0
six==1.16.0
sklearn==0.0
threadpoolctl==3.1.0
xgboost==1.5.2
The only reason I had to execute the tests this way is that the README.md install process creates a test target, but the resources required for that executable do not exist (continous/model.txt, discrete/model.txt, etc.), so when you invoke the ./Test executable generated by CMake, it crashes almost instantly.
Hello!
I'm trying to make predictions with FastForest, but the outputs don't match the XGBoost ones, and I can't figure out why.
Here are 10 samples and - respectively - their XGBoost raw prediction, XGBoost prediction with logistic transformation, FastForest raw prediction and FastForest prediction with logistic transformation.
| | XGBoost raw | LT | FastForest raw | LT |
|---|---|---|---|---|
| sample 1 | 4.6465325 | 0.990 | 1.39314 | 0.801 |
| sample 2 | 4.5409245 | 0.989 | 1.39692 | 0.801 |
| sample 3 | 4.5436025 | 0.989 | 1.7282 | 0.849 |
| sample 4 | 4.6465325 | 0.990 | 1.70365 | 0.846 |
| sample 5 | 3.681776 | 0.975 | 0.09692 | 0.524 |
| sample 6 | 4.644615 | 0.990 | 1.44151 | 0.808 |
| sample 7 | 4.6465325 | 0.990 | 1.30975 | 0.787 |
| sample 8 | 4.6402144 | 0.990 | 1.26588 | 0.780 |
| sample 9 | 4.6465325 | 0.990 | 1.59832 | 0.831 |
| sample 10 | 4.2298365 | 0.985 | 0.644576 | 0.655 |
The figures have been computed as follows.

Python ('binary:logitraw' model):

raw_output = model.predict_proba(sample)[0][1]
probability = 1 / (1 + math.exp(-raw_output))

C++:

float raw_output = fastForest(input.data());  // input holds the sample's features
float probability = 1. / (1. + std::exp(-raw_output));
Do you have any clue? Thanks!
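One observation that may narrow this down: both LT columns are self-consistent sigmoids of their raw columns, so the logistic transformation is applied identically on both sides and the discrepancy must originate in the raw scores themselves (i.e., the tree sums or a margin offset), not in the transformation. A quick check against sample 1 from the table:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Sample 1 from the table: each LT value is the sigmoid of its raw value
print(round(sigmoid(4.6465325), 3))  # XGBoost side -> 0.99
print(round(sigmoid(1.39314), 3))    # FastForest side -> 0.801
```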
Hi,
I got std::length_error when I was trying to build and run example codes on windows visual studio 2017. I have no idea what is wrong here. Could you help me with it?
Thanks,
Building information:
$ cmake -DCMAKE_GENERATOR_PLATFORM=x64 -DCMAKE_WINDOWS_EXPORT_ALL_SYMBOLS=ON ..
-- Building for: Visual Studio 15 2017
-- Selecting Windows SDK version 10.0.17763.0 to target Windows 10.0.19042.
-- The C compiler identification is MSVC 19.16.27045.0
-- The CXX compiler identification is MSVC 19.16.27045.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: C:/Program Files (x86)/Microsoft Visual Studio/2017/Professional/VC/Tools/MSVC/14.16.27023/bin/Hostx86/x64/cl.exe - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: C:/Program Files (x86)/Microsoft Visual Studio/2017/Professional/VC/Tools/MSVC/14.16.27023/bin/Hostx86/x64/cl.exe - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Boost: C:/local/boost_1_71_0 (found version "1.71.0") found components: system filesystem unit_test_framework
-- Configuring done
-- Generating done
-- Build files have been written to: E:/workspace/c/XGBoost-FastForest/build2
$ cmake --build . --config Release
Microsoft (R) Build Engine version 15.9.21+g9802d43bc3 for .NET Framework
Copyright (C) Microsoft Corporation. All rights reserved.
fastforest_functions.cpp
C:\Program Files (x86)\Microsoft Visual Studio\2017\Professional\VC\Tools\MSVC\14.16.27023\include\xlocale(319): warning C4530: C++ exception handler used, but unwind semantics are not enabled. Specify /EHsc [E:\workspace\c\XGBoost-FastForest\build2\fastforest.vcxproj]
Auto build dll exports
LINK : warning LNK4075: ignoring '/INCREMENTAL' due to '/OPT:ICF' specification [E:\workspace\c\XGBoost-FastForest\build2\fastforest.vcxproj]
Creating library E:/workspace/c/XGBoost-FastForest/build2/Release/fastforest.lib and object E:/workspace/c/XGBoost-FastForest/build2/Release/fastforest.exp
fastforest.vcxproj -> E:\workspace\c\XGBoost-FastForest\build2\Release\fastforest.dll
Test.vcxproj -> E:\workspace\c\XGBoost-FastForest\build2\test\Release\Test.exe
Hi,
Following the example code, I ran the following code samples.
Python:
import xgboost as xgb
from sklearn.datasets import make_classification
import numpy as np
X, y = make_classification(n_samples=10000, n_features=5, random_state=42, n_classes=2, weights=[0.5])
model = xgb.XGBClassifier().fit(X, y)
predictions = model.predict(X)
prob_predictions = model.predict_proba(X)
n = 0
print(X[n,:])
print(predictions[n])
print(prob_predictions[n])
np.save('model_predictions.npy', predictions)
booster = model._Booster
booster.dump_model("model.txt")
booster.save_model("model.bin")
With output of:
[-2.24456934 -1.36232827 1.55433334 -2.0869092 -1.27760482]
0
[9.994567e-01 5.432876e-04]
But when I try to run this code in C++:
#include "fastforest.h"
#include <cmath>
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> features{"f0", "f1", "f2", "f3", "f4"};
    const auto fastForest = fastforest::load_txt("model.txt", features);
    // std::vector<float> input{0.0, 0.2, 0.4, 0.6, 0.8};
    std::vector<float> input{-2.24456934, -1.36232827, 1.55433334, -2.0869092, -1.27760482};
    float orig = fastForest(input.data());
    float score = 1. / (1. + std::exp(-orig));
    std::vector<float> probas = fastForest.softmax(input.data());
    std::cout << orig << std::endl;
    std::cout << score << std::endl;
    std::cout << probas[0] << " , " << probas[1] << std::endl;
}
I'm getting other results (see below).
What can be wrong here?
P.S.: 'fastForest.softmax' was changed so it won't raise an error.
-7.01733
0.000895414
0.420152 , 0.579848
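As a side note that may help interpret these numbers (a mathematical identity, not a claim about how FastForest's softmax is implemented): for a binary model, a two-class softmax over the pair (0, raw) reduces exactly to the sigmoid of raw, so the two transformations cannot legitimately disagree.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax2(a, b):
    # Plain two-class softmax
    ea, eb = math.exp(a), math.exp(b)
    return ea / (ea + eb), eb / (ea + eb)

raw = -7.01733            # the raw score printed by the C++ snippet above
p0, p1 = softmax2(0.0, raw)
print(p1)                 # equals sigmoid(raw), ~0.000895
```

Since the printed score line (0.000895414) is exactly sigmoid(raw) while the softmax output is not, the softmax call is evidently being fed something other than the pair (0, raw).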
Hi @guitargeek,
Thanks for sharing such a great tool. Overall, it works quite well. But I still identified a small "bug" when I tried to convert a multi-class model trained with python to C++. I feel like the package cannot handle a multi-class model containing a tree with a single leaf node. To quickly replicate this issue, we just need to train a "large" model with less training data:
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

training_X, training_Y = make_classification(n_samples=100, n_features=100, n_informative=3, random_state=42, n_classes=3, weights=[0.33, 0.33])
model = XGBClassifier(n_estimators=100, max_depth=7, objective='multi:softmax', eval_metric='mlogloss', use_label_encoder=False).fit(training_X, training_Y)
After converting this model using FastForest, there were discrepancies between C++ and python probability output. Of course, this is just an extremely rare example [e.g., we only have 100 data samples for training]. However, I did notice that as long as the trained model [even trained with a large amount of data] contains a tree with only one leaf node, the C++ output and python output won't be exactly the same.
More than happy to provide more details if I am not clear. Looking forward to your solution.
Thanks.
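To make reports like this easier to reproduce, one way to check whether a dumped model contains such degenerate trees is to scan the text dump for boosters whose entire body is a single leaf line. The parsing below assumes the standard `booster[i]:` / node-line layout of XGBoost's text dump; the sample dump string is fabricated for illustration.

```python
def count_single_leaf_trees(dump_text):
    """Count trees in an XGBoost text dump that consist of a single leaf node."""
    trees = []
    for line in dump_text.splitlines():
        if line.startswith("booster["):
            trees.append([])       # start collecting nodes of a new tree
        elif line.strip():
            trees[-1].append(line.strip())
    return sum(1 for t in trees if len(t) == 1 and "leaf=" in t[0])

# Fabricated two-tree dump: tree 0 is a normal split, tree 1 is a single leaf
dump = """booster[0]:
0:[f0<0.5] yes=1,no=2,missing=1
\t1:leaf=0.1
\t2:leaf=-0.1
booster[1]:
0:leaf=-0.05
"""
print(count_single_leaf_trees(dump))  # 1
```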
With XGBClassifier in Python, I got [[0.05979478 0.9402052]] for binary classification via predict_proba. However, I get 0.9898 from FastForest after the logistic transformation when loading the xgb_model.txt dumped from Python, with exactly the same input as in the Python inference.
Why the difference?
Hello Jonas,
We are attempting to back-port your FastForest code to run on C++98. I am strictly a Python developer but am looking to hand this task off to a C++ developer. Do you believe this is a feasible task, and if so, how long do you think such a back-port would take? Thank you for your help and input.
I am trying to add this project to my existing project.
As per cmake I have the following:
include(FetchContent)
FetchContent_Declare(
XGBoost-FastForest
GIT_REPOSITORY https://github.com/guitargeek/XGBoost-FastForest
)
FetchContent_MakeAvailable(XGBoost-FastForest)
However, when I try to compile, I am having issues with including the main header file fastforest.h.
The support for C++98 was a great step in the right direction, but the vector method data() is not available in C++98 unless you are using one of the most recent gcc compilers (>= gcc 10, as far as I've checked). I would consider simply returning the pointer to the first element instead, since the operations should theoretically be the same; the vector method data() just additionally ensures you can use it on an empty vector.
Current code:

fastforest ff = load_txt("path", features);
vector<float> i = {1.0, 2.0, 3.0};
ff(i.data());

C++98 compiler-independent code:

fastforest ff = load_txt("path", features);
vector<float> i = {1.0, 2.0, 3.0};
const float* arr = &i[0];  // this doesn't work on empty vectors, unfortunately
ff(arr);
You might have to adjust the pointer checks for the array access via the new method, but I thought I'd let you know about this issue existing for C++98 support. Any compiler older than GCC 10.0 will throw these issues out (at least on my ubuntu box). I compiled with GCC 9, GCC 6, GCC 4.9.4, GCC 4.9.2 and all of them threw this data() method as an error.
https://en.cppreference.com/w/cpp/container/vector/data <- States that this has been a feature since C++11, no mention of C++98 support.