guitargeek / XGBoost-FastForest
Minimal library code to deploy XGBoost models in C++.
License: MIT License
I'm interested in comparing my own custom reader to this one, but I'm having trouble getting benchmark results consistent with what the README reports. Details:
My results:
FastForest: 1.61 s
treelite: -*
m2cgen: 1.71 s
xgboost: 0.12 s
TMVA: 8.95 s
The relative differences between these numbers are rather different from the README's (and xgboost is somehow blazing fast).
It could be useful to distribute a Dockerfile that sets up and runs all the benchmarks in a more controlled environment (OS, versions, etc.).
* The treelite benchmark (using version 2.1.0) doesn't work at all. I get the following error:
ModuleNotFoundError: No module named 'treelite.runtime'
If I swap it with treelite_runtime (a separate pip package), then there's another error:
AttributeError: module 'treelite_runtime' has no attribute 'Batch'
Hello,
I'm currently working on a project that involves using FastForest, and as part of my validation process, I've been comparing inference results between XGBoost and FastForest using a single vector. However, I've come across an unexpected issue that I'm seeking assistance with.
During my experiment, I noticed that when I use the 'binary:logistic' objective in XGBoost, the predicted values differ from those obtained using FastForest. Strangely, when I switch to the 'binary:logitraw' objective in XGBoost, the predicted scores align with those from FastForest.
I suspect that the difference might be due to distinct logistic transformations applied in XGBoost and FastForest. To address this, I've tried exploring XGBoost's documentation for details about the logistic transformation applied with the 'binary:logistic' objective. Unfortunately, I couldn't find the specific information I was looking for.
In the example provided in the README for FastForest, a sigmoid transformation is explicitly applied to the score obtained from the model. Based on this, I assumed that XGBoost also applies a sigmoid transformation for the 'binary:logistic' objective. Is that right?
The code for reproduction is the following:
XGBoost model:

from xgboost import XGBClassifier
from sklearn.datasets import make_classification
import numpy as np

X, y = make_classification(n_samples=10000, n_features=5, random_state=42, n_classes=2, weights=[0.5])
model = XGBClassifier(objective='binary:logistic').fit(X, y)  # switch objective to 'binary:logitraw' to match results
booster = model.get_booster()
print(model.predict_proba(np.array([[0.0, 0.2, 0.4, 0.6, 0.8]])))  # [[0.37146312 0.6285369 ]]
booster.dump_model("model.txt")
Load the model into FastForest and perform inference:
#include "fastforest.h"
#include <cmath>
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> features{"f0", "f1", "f2", "f3", "f4"};
    const auto fastForest = fastforest::load_txt("model.txt", features);
    std::vector<float> input{0.0, 0.2, 0.4, 0.6, 0.8};
    float score = fastForest(input.data());  // 1.02595
    float sigmoid = 1. / (1. + std::exp(-score));
    std::cout << "sigmoid: " << sigmoid << std::endl;  // 0.736129
}
I'm interested in understanding the logistic regression mismatching between XGBoost and FastForest when using the 'binary:logistic' objective. I'm eager to get to the bottom of this issue and would greatly appreciate any help.
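Not an authoritative answer, but the transformation itself is easy to pin down: 'binary:logistic' applies the standard logistic (sigmoid) function to the raw margin, and its inverse is the logit. A minimal sketch using only the numbers quoted in the snippets above (1.02595 is the raw FastForest score, 0.6285369 the XGBoost probability):

```python
import math

def sigmoid(raw):
    # The logistic transformation applied by the 'binary:logistic' objective
    return 1.0 / (1.0 + math.exp(-raw))

def logit(p):
    # Inverse of the sigmoid; maps a probability back to a raw margin
    return math.log(p / (1.0 - p))

raw = 1.02595            # raw FastForest score from the C++ snippet
print(sigmoid(raw))      # ~0.736129, the value the C++ code prints
print(logit(0.6285369))  # ~0.526, the margin implied by predict_proba
```

If the two margins differ by a constant like this, the model's base_score (which is folded into XGBoost's margin but not into the sum of dumped leaf values) is the usual suspect, though that is an educated guess rather than a verified diagnosis for this model.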
Hi!
We've integrated your library into our project aiming to achieve machine learning results that closely align with an alternative XGBoost implementation.
The basic pipeline is that we dump the XGBoost JSON model to txt and use it to perform inference. We've encountered discrepancies when comparing the results of our FastForest model with those of XGBoost.
Throughout our experiments, we've observed instances (with certain input vectors) where the FastForest model produces unexpected results, differing from XGBoost.
We have reduced the problem to a single input vector. This is a self-contained minimal repo to reproduce the problem: https://github.com/andriiknu/fastforest_issue/tree/master
Thank you in advance! Looking forward to any assistance.
When I use gcc 4.8 to install, I get an error like this:

/home/fanni/za/c_plus/XGBoost-FastForest-master/src/fastforest.cpp:36:35: fatal error: experimental/filesystem: No such file or directory
 #include <experimental/filesystem>
compilation terminated.
CMakeFiles/fastforest.dir/build.make:62: recipe for target 'CMakeFiles/fastforest.dir/src/fastforest.cpp.o' failed
make[2]: *** [CMakeFiles/fastforest.dir/src/fastforest.cpp.o] Error 1
CMakeFiles/Makefile2:67: recipe for target 'CMakeFiles/fastforest.dir/all' failed
make[1]: *** [CMakeFiles/fastforest.dir/all] Error 2
Makefile:140: recipe for target 'all' failed
make: *** [all] Error 2

So I tried gcc 7.5.0 instead; the install succeeded, but when I #include the header in main.cpp, a new issue appears:

/tmp/ccWRi8pr.o: in function `main':
main.cpp:(.text+0x232): undefined reference to `FastForest::FastForest(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&)'
main.cpp:(.text+0x2ca): undefined reference to `FastForest::operator()(float const*) const'
collect2: error: ld returned 1 exit status

So how should I use it? Looking forward to your reply!
I'm following your recipe for compiling your library but am facing the following issue that I'm unsure how to resolve. Do you have any suggestions?
[[email protected] build]$ cmake --version
cmake3 version 3.6.1
CMake suite maintained and supported by Kitware (kitware.com/cmake).
[[email protected] build]$ cmake ..
-- The C compiler identification is GNU 4.4.7
-- The CXX compiler identification is GNU 4.4.7
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
CMake Error at CMakeLists.txt:7 (project):
project with VERSION must use LANGUAGES before language names.
-- Configuring incomplete, errors occurred!
See also "/disk1/mpadilla/projects/conf_eng/XGBoost-FastForest/build/CMakeFiles/CMakeOutput.log".
[[email protected] build]$
Unfortunately, looking at the log file wasn't immediately useful. Any help/suggestions would be appreciated. Thank you!
Hello,
When I run a regression model with gcc 4.9.2 (Debian 4.9.2-10+deb8u2), the results are inconsistent. But with gcc 8.3.0 (Debian 8.3.0-6), the results agree with the Python ones.
How do I run FastForest with gcc 4.9.2? Thank you!
I got the following linking error when I compile my test code:
undefined reference to `fastforest::load_txt(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&)'
Here is my test code:
#include <stdlib.h>
#include <stdio.h>
#include <iostream>
#include <string>
#include <vector>
#include "fastforest.h"

int main() {
    std::vector<std::string> features{"f0", "f1", "f2", "f3", "f4", "f5", "f6", "f7", "f8", "f9",
                                      "f10", "f11", "f12", "f13", "f14", "f15", "f16", "f17", "f18", "f19",
                                      "f20", "f21", "f22", "f23", "f24", "f25", "f26", "f27", "f28", "f29"};
    const auto fastForest = fastforest::load_txt("model.txt", features);
    std::vector<float> input{1, 0, 6, 4, 0, 0, 0, 1, 6, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 6, 0, 0, 0, 0, 0, 0, 0};
    float bond_length = fastForest(input.data());
    std::cout << "Bond Length = " << bond_length << "\n";
}
Any help will be highly appreciated! Thanks.
Hello! Thank you for that nice library!
Does it support regression tasks? I tried some examples, but the result falls 0.5 below the true value compared to the Python code (Python: input -4 predicts 1; C++: input -4 predicts 0.5). Does it always behave like that?
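For context on the constant 0.5 gap (an educated guess, not verified against this particular model): XGBoost's regression objectives add the global base_score (historically 0.5 by default) on top of the summed leaf values, while the raw sum over a dumped text model contains only the leaves. A toy sketch of that relationship, where `leaf_sum` stands in for the raw score a text-dump evaluator would compute:

```python
BASE_SCORE = 0.5  # XGBoost's historical default for regression objectives

def xgb_style_prediction(leaf_sum, base_score=BASE_SCORE):
    # XGBoost's regression prediction is the tree sum plus the global bias
    return leaf_sum + base_score

# Matches the numbers in the report: raw tree sum 0.5 vs Python prediction 1.0
print(xgb_style_prediction(0.5))
```

If this is the cause, adding the model's base_score to the FastForest output should close the gap.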
Very simple to use so far: I was able to use the library for binary classification, and it is speedier than m2cgen. I noticed that the return type of operator() is just float. Is it possible to return a vector of probabilities for multiclass classification, like model.predict_proba()? My task involves sampling from these probabilities.
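For the sampling part, one hedged sketch of the consumer side (shown in Python with NumPy for brevity; the raw scores array is made up): turn per-class raw scores into probabilities with a softmax, then sample a class index from them.

```python
import numpy as np

def softmax(scores):
    # Numerically stable softmax over per-class raw scores
    z = scores - np.max(scores)
    e = np.exp(z)
    return e / e.sum()

raw = np.array([0.2, 1.5, -0.3])  # hypothetical per-class raw scores
probs = softmax(raw)
rng = np.random.default_rng(0)
sampled_class = rng.choice(len(probs), p=probs)  # sample from the probabilities
print(probs, sampled_class)
```

The same two steps translate directly to C++ (an exp/normalize loop plus std::discrete_distribution) once the per-class scores are available.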
Hi, I'm following the normal build procedure to build the library (under WSL) but encounter some link issues during make, do you have some suggestions? Thanks. Here is the log:
cloudray@LEGION7000:/mnt/d/XGBoost-FastForest/build$ cmake ..
-- The C compiler identification is GNU 9.3.0
-- The CXX compiler identification is GNU 9.3.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Boost: /usr/local/lib/cmake/Boost-1.74.0/BoostConfig.cmake (found version "1.74.0") found components: system filesystem unit_test_framework
-- Configuring done
-- Generating done
-- Build files have been written to: /mnt/d/XGBoost-FastForest/build
cloudray@LEGION7000:/mnt/d/XGBoost-FastForest/build$ make -j8
Scanning dependencies of target fastforest
[ 20%] Building CXX object CMakeFiles/fastforest.dir/src/common_details.cpp.o
[ 40%] Building CXX object CMakeFiles/fastforest.dir/src/fastforest.cpp.o
[ 60%] Linking CXX shared library libfastforest.so
[ 60%] Built target fastforest
Scanning dependencies of target Test
[ 80%] Building CXX object test/CMakeFiles/Test.dir/test.cpp.o
[100%] Linking CXX executable Test
/usr/bin/ld: CMakeFiles/Test.dir/test.cpp.o: in function `SerializationTest::test_method()':
test.cpp:(.text+0x2f7): undefined reference to `fastforest::FastForest::write_bin(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const'
/usr/bin/ld: test.cpp:(.text+0x441): undefined reference to `fastforest::load_bin(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)'
/usr/bin/ld: test.cpp:(.text+0x6ed): undefined reference to `fastforest::FastForest::evaluate(float const*, float*, int) const'
/usr/bin/ld: CMakeFiles/Test.dir/test.cpp.o: in function `ManyfeaturesTest::test_method()':
test.cpp:(.text+0x274c): undefined reference to `fastforest::FastForest::evaluate(float const*, float*, int) const'
/usr/bin/ld: CMakeFiles/Test.dir/test.cpp.o: in function `ExampleTest::test_method()':
test.cpp:(.text+0x4499): undefined reference to `fastforest::FastForest::evaluate(float const*, float*, int) const'
/usr/bin/ld: CMakeFiles/Test.dir/test.cpp.o: in function `SoftmaxArrayTest::test_method()':
test.cpp:(.text+0x61af): undefined reference to `fastforest::FastForest::evaluate(float const*, float*, int) const'
/usr/bin/ld: test.cpp:(.text+0x61bc): undefined reference to `fastforest::details::softmaxTransformInplace(float*, int)'/usr/bin/ld: CMakeFiles/Test.dir/test.cpp.o: in function `BasicTest::test_method()':
test.cpp:(.text+0x873c): undefined reference to `fastforest::FastForest::evaluate(float const*, float*, int) const'
/usr/bin/ld: CMakeFiles/Test.dir/test.cpp.o: in function `DiscreteTest::test_method()':
test.cpp:(.text+0xa704): undefined reference to `fastforest::FastForest::evaluate(float const*, float*, int) const'
/usr/bin/ld: CMakeFiles/Test.dir/test.cpp.o: in function `SoftmaxTest::test_method()':
test.cpp:(.text+0xc6c7): undefined reference to `fastforest::FastForest::softmax(float const*, int) const'
collect2: error: ld returned 1 exit status
make[2]: *** [test/CMakeFiles/Test.dir/build.make:88: test/Test] Error 1
make[1]: *** [CMakeFiles/Makefile2:144: test/CMakeFiles/Test.dir/all] Error 2
make: *** [Makefile:141: all] Error 2
So, I tried to clone this repo and set up the tests in order to look at the recently closed issue about C++98 compatibility, and I ran into a lot of problems getting the script create_test_data.py to execute on my machine, mainly because of dependencies. I made a virtualenv and attached the list of dependencies pip had to retrieve in order for me to perform the following sequence of commands:
python3 create_test_data.py
python3 test_cppyy.py
-- this was all done to satisfy the test.cpp requirement for the model.txt
g++ -std=c++98 -pedantic test.cpp -lfastforest
./a.out
One last issue I had to resolve was the test script's inability to find the module xgboost2tmva.py; my solution was to copy it from the benchmark folder to the test folder. I also tried the export PYTHONPATH route, but it didn't seem to work. You might want to update the documentation to let people know how to solve that issue, or provide some other solution.
I was met with "Tests PASSED", which is a relief, but initiating these tests felt painful. I'm going to attach a requirements.txt so that people can run pip install -r requirements.txt after cloning the repo and have the dependencies resolved.
If you'd like me to submit a pull request with an update to the README.md I can also go that route, but I'd rather you have the opportunity/choice to solve this your own way if necessary. I'd just thought I would bring up this pain point I experienced. Either way, I really like this library, great work!
Raw Requirements.txt
cppyy==2.3.0
cppyy-backend==1.14.8
cppyy-cling==6.25.3
CPyCppyy==1.12.9
cycler==0.11.0
fonttools==4.31.2
joblib==1.1.0
kiwisolver==1.4.1
matplotlib==3.5.1
numpy==1.22.3
packaging==21.3
pandas==1.4.1
Pillow==9.0.1
pyparsing==3.0.7
python-dateutil==2.8.2
pytz==2022.1
scikit-learn==1.0.2
scipy==1.8.0
six==1.16.0
sklearn==0.0
threadpoolctl==3.1.0
xgboost==1.5.2
The only reason I had to execute the tests this way is that the README.md install process creates a test target, but the resources required for that executable do not exist (continous/model.txt, discrete/model.txt, etc.), so when you invoke the ./Test executable generated by CMake, it crashes almost instantly.
Hello!
I'm trying to make predictions with FastForest, but the outputs don't match the XGBoost ones, and I can't figure out why.
Here are 10 samples and - respectively - their XGBoost raw prediction, XGBoost prediction with logistic transformation, FastForest raw prediction and FastForest prediction with logistic transformation.
| | XGBoost raw | LT | FastForest raw | LT |
|---|---|---|---|---|
| sample 1 | 4.6465325 | 0.990 | 1.39314 | 0.801 |
| sample 2 | 4.5409245 | 0.989 | 1.39692 | 0.801 |
| sample 3 | 4.5436025 | 0.989 | 1.7282 | 0.849 |
| sample 4 | 4.6465325 | 0.990 | 1.70365 | 0.846 |
| sample 5 | 3.681776 | 0.975 | 0.09692 | 0.524 |
| sample 6 | 4.644615 | 0.990 | 1.44151 | 0.808 |
| sample 7 | 4.6465325 | 0.990 | 1.30975 | 0.787 |
| sample 8 | 4.6402144 | 0.990 | 1.26588 | 0.780 |
| sample 9 | 4.6465325 | 0.990 | 1.59832 | 0.831 |
| sample 10 | 4.2298365 | 0.985 | 0.644576 | 0.655 |
The figures have been computed as follows.

Python ('binary:logitraw' model):

raw_output = model.predict_proba(sample)[0][1]
probability = 1 / (1 + math.exp(-raw_output))

C++:

float raw_output = fastForest(input.data());  // input holds the sample's features
float probability = 1. / (1. + std::exp(-raw_output));
Do you have any clue? Thanks!
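One observation that may narrow this down: both LT columns are self-consistent sigmoids of their raw columns, so the logistic transformation is applied identically on both sides and the discrepancy must originate in the raw scores themselves (i.e., the tree sums or a margin offset), not in the transformation. A quick check against sample 1 from the table:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Sample 1 from the table: each LT value is the sigmoid of its raw value
print(round(sigmoid(4.6465325), 3))  # XGBoost side -> 0.99
print(round(sigmoid(1.39314), 3))    # FastForest side -> 0.801
```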
Hi,
I got std::length_error when I was trying to build and run example codes on windows visual studio 2017. I have no idea what is wrong here. Could you help me with it?
Thanks,
Building information:
$ cmake -DCMAKE_GENERATOR_PLATFORM=x64 -DCMAKE_WINDOWS_EXPORT_ALL_SYMBOLS=ON ..
-- Building for: Visual Studio 15 2017
-- Selecting Windows SDK version 10.0.17763.0 to target Windows 10.0.19042.
-- The C compiler identification is MSVC 19.16.27045.0
-- The CXX compiler identification is MSVC 19.16.27045.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: C:/Program Files (x86)/Microsoft Visual Studio/2017/Professional/VC/Tools/MSVC/14.16.27023/bin/Hostx86/x64/cl.exe - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: C:/Program Files (x86)/Microsoft Visual Studio/2017/Professional/VC/Tools/MSVC/14.16.27023/bin/Hostx86/x64/cl.exe - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Boost: C:/local/boost_1_71_0 (found version "1.71.0") found components: system filesystem unit_test_framework
-- Configuring done
-- Generating done
-- Build files have been written to: E:/workspace/c/XGBoost-FastForest/build2
$ cmake --build . --config Release
Microsoft (R) Build Engine version 15.9.21+g9802d43bc3 for .NET Framework
Copyright (C) Microsoft Corporation. All rights reserved.
fastforest_functions.cpp
C:\Program Files (x86)\Microsoft Visual Studio\2017\Professional\VC\Tools\MSVC\14.16.27023\include\xlocale(319): warning C4530: C++ exception handler used, but unwind semantics are not enabled. Specify /EHsc [E:\workspace\c\XGBoost-FastForest\build2\fastforest.vcxproj]
Auto build dll exports
LINK : warning LNK4075: ignoring '/INCREMENTAL' due to '/OPT:ICF' specification [E:\workspace\c\XGBoost-FastForest\build2\fastforest.vcxproj]
Creating library E:/workspace/c/XGBoost-FastForest/build2/Release/fastforest.lib and object E:/workspace/c/XGBoost-FastForest/build2/Release/fastforest.exp
fastforest.vcxproj -> E:\workspace\c\XGBoost-FastForest\build2\Release\fastforest.dll
Test.vcxproj -> E:\workspace\c\XGBoost-FastForest\build2\test\Release\Test.exe
Hi,
Following the example code, I ran the following code samples.
Python:
import xgboost as xgb
from sklearn.datasets import make_classification
import numpy as np
X, y = make_classification(n_samples=10000, n_features=5, random_state=42, n_classes=2, weights=[0.5])
model = xgb.XGBClassifier().fit(X, y)
predictions = model.predict(X)
prob_predictions = model.predict_proba(X)
n = 0
print(X[n,:])
print(predictions[n])
print(prob_predictions[n])
np.save('model_predictions.npy', predictions)
booster = model._Booster
booster.dump_model("model.txt")
booster.save_model("model.bin")
With output of:
[-2.24456934 -1.36232827 1.55433334 -2.0869092 -1.27760482]
0
[9.994567e-01 5.432876e-04]
But when I try to run this code in C++:
#include "fastforest.h"
#include <cmath>
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> features{"f0", "f1", "f2", "f3", "f4"};
    const auto fastForest = fastforest::load_txt("model.txt", features);
    // std::vector<float> input{0.0, 0.2, 0.4, 0.6, 0.8};
    std::vector<float> input{-2.24456934, -1.36232827, 1.55433334, -2.0869092, -1.27760482};
    float orig = fastForest(input.data());
    float score = 1. / (1. + std::exp(-orig));
    std::vector<float> probas = fastForest.softmax(input.data());
    std::cout << orig << std::endl;
    std::cout << score << std::endl;
    std::cout << probas[0] << " , " << probas[1] << std::endl;
}
I'm getting other results (see below).
What can be wrong here?
P.S.: 'fastForest.softmax' was changed so it won't raise an error.
-7.01733
0.000895414
0.420152 , 0.579848
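As a side note that may help interpret these numbers (a mathematical identity, not a claim about how FastForest's softmax is implemented): for a binary model, a two-class softmax over the pair (0, raw) reduces exactly to the sigmoid of raw, so the two transformations cannot legitimately disagree.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax2(a, b):
    # Plain two-class softmax
    ea, eb = math.exp(a), math.exp(b)
    return ea / (ea + eb), eb / (ea + eb)

raw = -7.01733            # the raw score printed by the C++ snippet above
p0, p1 = softmax2(0.0, raw)
print(p1)                 # equals sigmoid(raw), ~0.000895
```

Since the printed score line (0.000895414) is exactly sigmoid(raw) while the softmax output is not, the softmax call is evidently being fed something other than the pair (0, raw).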
Hi @guitargeek,
Thanks for sharing such a great tool. Overall, it works quite well. But I still identified a small "bug" when I tried to convert a multi-class model trained with python to C++. I feel like the package cannot handle a multi-class model containing a tree with a single leaf node. To quickly replicate this issue, we just need to train a "large" model with less training data:
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

training_X, training_Y = make_classification(n_samples=100, n_features=100, n_informative=3, random_state=42, n_classes=3, weights=[0.33, 0.33])
model = XGBClassifier(n_estimators=100, max_depth=7, objective='multi:softmax', eval_metric='mlogloss', use_label_encoder=False).fit(training_X, training_Y)
After converting this model using FastForest, there were discrepancies between C++ and python probability output. Of course, this is just an extremely rare example [e.g., we only have 100 data samples for training]. However, I did notice that as long as the trained model [even trained with a large amount of data] contains a tree with only one leaf node, the C++ output and python output won't be exactly the same.
More than happy to provide more details if I am not clear. Looking forward to your solution.
Thanks.
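To make reports like this easier to reproduce, one way to check whether a dumped model contains such degenerate trees is to scan the text dump for boosters whose entire body is a single leaf line. The parsing below assumes the standard `booster[i]:` / node-line layout of XGBoost's text dump; the sample dump string is fabricated for illustration.

```python
def count_single_leaf_trees(dump_text):
    """Count trees in an XGBoost text dump that consist of a single leaf node."""
    trees = []
    for line in dump_text.splitlines():
        if line.startswith("booster["):
            trees.append([])       # start collecting nodes of a new tree
        elif line.strip():
            trees[-1].append(line.strip())
    return sum(1 for t in trees if len(t) == 1 and "leaf=" in t[0])

# Fabricated two-tree dump: tree 0 is a normal split, tree 1 is a single leaf
dump = """booster[0]:
0:[f0<0.5] yes=1,no=2,missing=1
\t1:leaf=0.1
\t2:leaf=-0.1
booster[1]:
0:leaf=-0.05
"""
print(count_single_leaf_trees(dump))  # 1
```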
With XGBClassifier in Python, I got [[0.05979478 0.9402052]] for binary classification via predict_proba. However, I get 0.9898 from FastForest after the logistic transformation when loading the xgb_model.txt dumped from Python, with exactly the same input as in the Python inference.
Why the difference?
Hello Jonas,
We are attempting to back-port your FastForest code to run on C++98. I am strictly a Python developer but am looking to hand this task off to a C++ developer. Do you believe this is a feasible task, and if so, how long do you think such a back-port would take? Thank you for your help and input.
I am trying to add this project to my existing project.
As per cmake I have the following:
include(FetchContent)
FetchContent_Declare(
XGBoost-FastForest
GIT_REPOSITORY https://github.com/guitargeek/XGBoost-FastForest
)
FetchContent_MakeAvailable(XGBoost-FastForest)
However, when I try to compile, I am having issues with including the main header file fastforest.h.
The support for C++98 was a great step in the right direction, but the vector method data() is not available in C++98 unless you are using one of the most recent gcc compilers (>= gcc 10, as far as I've checked). I would consider simply returning the pointer to the first element instead, since the operations should theoretically be the same; the vector method data() just additionally ensures you can use it on an empty vector.
Current code:

fastforest ff = load_txt("path", features);
vector<float> i = {1.0, 2.0, 3.0};
ff(i.data());

C++98 compiler-independent code:

fastforest ff = load_txt("path", features);
vector<float> i = {1.0, 2.0, 3.0};
const float* arr = &i[0];  // this doesn't work on empty vectors, unfortunately
ff(arr);
You might have to adjust the pointer checks for the array access via the new method, but I thought I'd let you know about this issue existing for C++98 support. Any compiler older than GCC 10.0 will throw these issues out (at least on my ubuntu box). I compiled with GCC 9, GCC 6, GCC 4.9.4, GCC 4.9.2 and all of them threw this data() method as an error.
https://en.cppreference.com/w/cpp/container/vector/data <- States that this has been a feature since C++11, no mention of C++98 support.