jimmy-lin / generalizedoptimalsparsedecisiontrees Goto Github PK

View Code? Open in Web Editor NEW

59.0 5.0 40.0 3.59 MB

Generalized Optimal Sparse Decision Trees

Makefile 2.42% M4 2.49% Shell 0.83% C++ 89.45% CMake 0.05% C 1.24% Python 3.52%

generalizedoptimalsparsedecisiontrees's Introduction

This is the code used in the ICML 2020 paper. Here is a newer version used in the AAAI 2022 paper

GOSDT Documentation

Implementation of Generalized Optimal Sparse Decision Tree.

Table of Content

Usage
Development
Configuration
Dependencies
License
FAQs

Usage

Guide for end-users who want to use the library without modification.

Describes how to install and use the library as a stand-alone command-line program or as an embedded extension in a larger project. Currently supported as a Python extension.

Installing Dependencies

Refer to Dependency Installation

As a Stand-Alone Command Line Program

Installation

./autobuild --install

Executing the Program

gosdt dataset.csv config.json
# or 
cat dataset.csv | gosdt config.json >> output.json

For examples of dataset files, refer to experiments/datasets/compas/binned.csv. For an example configuration file, refer to experiments/configurations/compas.json. For documentation on the configuration file, refer to Dependency Installation

As a Python Library with C++ Extensions

Build and Installation

./autobuild --install-python

If you have multiple Python installations, please make sure to build and install using the same Python installation as the one intended for interacting with this library.

Importing the C++ Extension

import gosdt

with open ("data.csv", "r") as data_file:
    data = data_file.read()

with open ("config.json", "r") as config_file:
    config = config_file.read()


print("Config:", config)
print("Data:", data)

gosdt.configure(config)
result = gosdt.fit(data)

print("Result: ", result)
print("Time (seconds): ", gosdt.time())
print("Iterations: ", gosdt.iterations())
print("Graph Size: ", gosdt.size())

Importing Extension with local Python Wrapper

import pandas as pd
import numpy as np
from model.gosdt import GOSDT

dataframe = pd.DataFrame(pd.read_csv("experiments/datasets/monk_2/data.csv"))

X = dataframe[dataframe.columns[:-1]]
y = dataframe[dataframe.columns[-1:]]

hyperparameters = {
    "regularization": 0.1,
    "time_limit": 3600,
    "verbose": True,
}

model = GOSDT(hyperparameters)
model.fit(X, y)
print("Execution Time: {}".format(model.time))

prediction = model.predict(X)
training_accuracy = model.score(X, y)
print("Training Accuracy: {}".format(training_accuracy))
print(model.tree)

Development

Guide for developers who want to use, modify and test the library.

Describes how to install and use the library with details on project structure.

Repository Structure

notebooks - interactive notebooks for examples and visualizations
experiments - configurations, datasets, and models to run experiments
doc - documentation
python - code relating to the Python implementation and wrappers around C++ implementation
auto - automations for checking and installing project dependencies
dist - compiled binaries for distribution
build - compiled binary objects and other build artifacts
lib - headers for external libraries
log - log files
src - source files
test - test files

Installing Dependencies

Refer to Dependency Installation

Build Process

Check Updates to the Dependency Tests or Makefile
```
./autobuild --regenerate
```
Check for Missing Dependencies
```
./autobuild --configure --enable-tests
```
Build and Run Test Suite
```
./autobuild --test
```
Build and Install Program
```
./autobuild --install --enable-tests
```
Run the Program
```
gosdt dataset.csv config.json
```
Build and Install the Python Extension
```
./autobuild --install-python
```

For a full list of build options, run ./autobuild --help

Configuration

Details on the configuration options.

gosdt dataset.csv config.json
# or
cat dataset.csv | gosdt config.json

Here the file config.json is optional. There is a default configuration which will be used if no such file is specified.

Configuration Description

The configuration file is a JSON object and has the following structure and default values:

{
  "balance": false,
  "cancellation": true,
  "look_ahead": true,
  "similar_support": true,
  "feature_exchange": true,
  "continuous_feature_exchange": true,
  "rule_list": false,

  "diagnostics": false,
  "verbose": false,

  "regularization": 0.05,
  "uncertainty_tolerance": 0.0,
  "upperbound": 0.0,

  "model_limit": 1,
  "precision_limit": 0,
  "stack_limit": 0,
  "tile_limit": 0,
  "time_limit": 0,
  "worker_limit": 1,

  "costs": "",
  "model": "",
  "profile": "",
  "timing": "",
  "trace": "",
  "tree": ""
}

Key parameters

regularization

Values: Decimal within range [0,1]
Description: Used to penalize complexity. A complexity penalty is added to the risk in the following way.
```
ComplexityPenalty = # Leaves x regularization
```
Default: 0.05
Note: We highly recommend setting the regularization to a value larger than 1/num_samples. A small regularization could lead to a longer training time.

time_limit

Values: Decimal greater than or equal to 0
Description: A time limit upon which the algorithm will terminate. If the time limit is reached, the algorithm will terminate with an error.
Special Cases: When set to 0, no time limit is imposed.

Flags

balance

Values: true or false
Description: Enables overriding the sample importance by equalizing the importance of each present class

cancellation

Values: true or false
Description: Enables propagate up the dependency graph of task cancellations

look_ahead

Values: true or false
Description: Enables the one-step look-ahead bound implemented via scopes

similar_support

Values: true or false
Description: Enables the similar support bound imeplemented via the distance index

feature_exchange

Values: true or false
Description: Enables pruning of pairs of features using subset comparison

continuous_feature_exchange

Values: true or false
Description: Enables pruning of pairs continuous of feature thresholds using subset comparison

diagnostics

Values: true or false
Description: Enables printing of diagnostic trace when an error is encountered to standard output

verbose

Values: true or false
Description: Enables printing of configuration, progress, and results to standard output

Tuners

uncertainty_tolerance

Values: Decimal within range [0,1]
Description: Used to allow early termination of the algorithm. Any models produced as a result are guaranteed to score within the lowerbound and upperbound at the time of termination. However, the algorithm does not guarantee that the optimal model is within the produced model unless the uncertainty value has reached 0.
Values: Decimal within range [0,1]
Description: Used to limit the risk of model search space. This can be used to ensure that no models are produced if even the optimal model exceeds a desired maximum risk. This also accelerates learning if the upperbound is taken from the risk of a nearly optimal model.

Limits

model_limit

Values: Decimal greater than or equal to 0
Description: The maximum number of models that will be extracted into the output.
Special Cases: When set to 0, no output is produced.

precision_limit

Values: Decimal greater than or equal to 0
Description: The maximum number of significant figures considered when converting ordinal features into binary features.
Special Cases: When set to 0, no limit is imposed.

stack_limit

Values: Decimal greater than or equal to 0
Description: The maximum number of bytes considered for use when allocating local buffers for worker threads.
Special Cases: When set to 0, all local buffers will be allocated from the heap.

tile_limit

Values: Decimal greater than or equal to 0
Description: The maximum number of bits used for the finding tile-equivalence
Special Cases: When set to 0, no tiling is performed.

worker_limit

Values: Decimal greater than or equal to 1
Description: The maximum number of threads allocated to executing th algorithm.
Special Cases: When set to 0, a single thread is created for each core detected on the machine.

Files

costs

Values: string representing a path to a file.
Description: This file must contain a CSV representing the cost matrix for calculating loss.
- The first row is a header listing every class that is present in the training data
- Each subsequent row contains the cost incurred of predicitng class i when the true class is j, where i is the row index and j is the column index
- Example where each false negative costs 0.1 and each false positive costs 0.2 (and correct predictions costs 0.0):
```
negative,positive
0.0,0.1
0.2,0.0
```
- Example for multi-class objectives:
```
class-A,class-B,class-C
0.0,0.1,0.3
0.2,0.0,0.1
0.8,0.3,0.0
```
- Note: costs values are not normalized, so high cost values lower the relative weight of regularization
Special Case: When set to empty string, a default cost matrix is used which represents unweighted training misclassification.

model

Values: string representing a path to a file.
Description: The output models will be written to this file.
Special Case: When set to empty string, no model will be stored.

profile

Values: string representing a path to a file.
Description: Various analytics will be logged to this file.
Special Case: When set to empty string, no analytics will be stored.

timing

Values: string representing a path to a file.
Description: The training time will be appended to this file.
Special Case: When set to empty string, no training time will be stored.

trace

Values: string representing a path to a directory.
Description: snapshots used for trace visualization will be stored in this directory
Special Case: When set to empty string, no snapshots are stored.

tree

Values: string representing a path to a directory.
Description: snapshots used for trace-tree visualization will be stored in this directory
Special Case: When set to empty string, no snapshots are stored.

Optimizing Different Loss Functions

When using the Python interface python/model/gosdt.py additional loss functions are available. Here is the list of loss functions implemented along with descriptions of their hyperparameters.

Accuracy

{ "objective": "acc" }

This optimizes the loss defined as the uniformly weighted number of misclassifications.

Balanced Accuracy

{ "objective": "bacc" }

This optimizes the loss defined as the number of misclassifications, adjusted for imbalanced representation of positive or negative samples.

Weighted Accuracy

{ "objective": "wacc", "w": 0.5 }

This optimizes the loss defined as the number of misclassifications, adjusted so that negative samples have a weight of w while positive samples have a weight of 1.0

F - 1 Score

{ "objective": "f1", "w": 0.9 }

This optimizes the loss defined as the F-1 score of the model's predictions.

Area under the Receiver Operanting Characteristics Curve

{ "objective": "auc" }

This maximizes the area under the ROC curve formed by varying the prediction of the leaves.

Partial Area under the Receiver Operanting Characteristics Curve

{ "objective": "pauc", "theta": 0.1 }

This maximizes the partial area under the ROC curve formed by varying the prediction of the leaves. The area is constrained so that false-positive-rate is in the closed interval [0,theta]

Dependencies

List of external dependencies

The following dependencies need to be installed to build the program.

Boost - Collection of portable C++ source libraries
GMP - Collection of functions for high-precision artihmetics
Intel TBB - Rich and complete approach to parallelism in C++
WiredTiger - WiredTiger is an high performance, scalable, production quality, NoSQL, Open Source extensible platform for data management
OpenCL(Optional) - A framework for execution across heterogeneous hardware accelerators.

Bundled Dependencies

The following dependencies are included as part of the repository, thus requiring no additional installation.

nlohmann/json - JSON Parser
ben-strasser/fast-cpp-csv-parser - CSV Parser
OpenCL C++ Bindings 1.2 - OpenCL bindings for GPU computing

Installation

Install these using your system package manager. There are also installation scripts provided for your convenience: trainer/auto

These currently support interface with brew and apt

Boost - auto/boost.sh --install
GMP - auto/gmp.sh --install
Intel TBB - auto/tbb.sh --install
WiredTiger - auto/wiredtiger.sh --install
OpenCL - auto/opencl.sh --install

FAQs

If you run into any issues, consult the FAQs first.

License

Licensing information

Inquiries

For general inquiries, send an email to [email protected]

generalizedoptimalsparsedecisiontrees's People

Contributors

Stargazers

Watchers

Forkers

anninnisi michal512 gilblankenship zeta1999 marlovss fernandolpz andro536 yanlirock leannejdong mcomsa vibratingpig keyan3 sofiaek ubc-systopia adarsa rohitpandey13

generalizedoptimalsparsedecisiontrees's Issues

Python installation does not work when calling `fit`

Hi,

After the installation of the dependencies and of the python library, when calling the fit method (executing the code from the example.py) it gives this error:

Traceback (most recent call last):
  File "example.py", line 17, in <module>
    model.fit(X, y)
  File "/home/ettore.mariotti/Documents/gosdt_repo/GeneralizedOptimalSparseDecisionTrees/python/model/gosdt.py", line 99, in fit
    self.__train__(X, y)
  File "/home/ettore.mariotti/Documents/gosdt_repo/GeneralizedOptimalSparseDecisionTrees/python/model/gosdt.py", line 48, in __train__
    dataset.insert(m, "class", y) # It is expected that the last column is the label column
  File "/home/ettore.mariotti/.local/lib/python3.8/site-packages/pandas/core/frame.py", line 4418, in insert
    value = self._sanitize_column(value)
  File "/home/ettore.mariotti/.local/lib/python3.8/site-packages/pandas/core/frame.py", line 4510, in _sanitize_column
    return sanitize_array(value, self.index, copy=True, allow_2d=True)
  File "/home/ettore.mariotti/.local/lib/python3.8/site-packages/pandas/core/construction.py", line 571, in sanitize_array
    subarr = maybe_convert_platform(data)
  File "/home/ettore.mariotti/.local/lib/python3.8/site-packages/pandas/core/dtypes/cast.py", line 126, in maybe_convert_platform
    arr = lib.maybe_convert_objects(arr)
  File "pandas/_libs/lib.pyx", line 2385, in pandas._libs.lib.maybe_convert_objects
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

No source distribution or M1 Mac distribution available on PyPI

On https://pypi.org/project/gosdt/#files, it appears there's no source distribution or M1 Mac distribution available, which means the package can't be installed from PyPI on M1 Macs

Fail to build

Hi Jimmy, i tried to rebuild your very original project under your developer instruction.
In step 3, ./autobuild --test
I got error towards the end of the process, i.e.
config.status: creating Makefile
config.status: executing depfiles commands
g++ -std=gnu++11 -O3 -o gosdt_test -lgmp
/usr/bin/ld: /usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/Scrt1.o: in function _start': (.text+0x24): undefined reference to main'
collect2: error: ld returned 1 exit status
make: *** [Makefile:681: gosdt_test] Error 1

In step 4, ./autobuild --install --enable-tests
g++ -std=gnu++11 -O3 -Wall -I include -pthread -o gosdt src/gosdt-bitmask.o src/gosdt-configuration.o src/gosdt-task.o src/gosdt-message.o src/gosdt-tile.o src/gosdt-graph.o src/gosdt-queue.o src/gosdt-model.o src/gosdt-encoder.o src/gosdt-index.o src/gosdt-dataset.o src/gosdt-optimizer.o src/gosdt-state.o src/gosdt-local_state.o src/gosdt-gosdt.o src/gosdt-main.o -ltbb -ltbbmalloc -lgmp -lgmp
make[1]: Entering directory '/home/leanne/Dev/GeneralizedOptimalSparseDecisionTrees'
/usr/bin/mkdir -p '/usr/local/bin'
/usr/bin/install -c gosdt '/usr/local/bin'
/usr/bin/install: cannot remove '/usr/local/bin/gosdt': Permission denied
make[1]: *** [Makefile:558: install-binPROGRAMS] Error 1
make[1]: Leaving directory '/home/leanne/Dev/GeneralizedOptimalSparseDecisionTrees'
make: *** [Makefile:1432: install-am] Error 2

Would you look into them? I'm on ubuntu20.04 with a modern compiler installed. Thanks so much, leanne

Installing as a Python Library with C++ Extensions

Hi all!

I'm not well versed using terminal commands and I'm having troubles installing this tool. Whenever I try to run ./autobuild --install-python I get the error

checking for __gmpz_init in -lgmp... no
and then I get the final error:
configure: error: GNU MP is missing on this system. Please install GNU MP and try again.
Although I believe that I installed GNU MP correctly and even ran a make check afterwards. I installed the last version available at https://gmplib.org/. Does anyone experienced this error and/or knows how to fix it?

Install on docker container failing due to missing make file

I'm trying to install gosdt on a docker container (just to have a clean linux environment, I'm pretty sure the issue is unrelated to docker)

This is after all listed dependencies installed correctly. This is the bash output:

root@95578a67e637:/src# ./autobuild --install
./autobuild: 61: ./autobuild: make: not found
root@95578a67e637:/src# ./autobuild --install-python
./autobuild: 63: ./autobuild: make: not found
root@95578a67e637:/src# ls
Makefile  Makefile.am  Makefile.in  README.md  auto  autobuild  build  config.status  configure  configure.ac  doc  experiments  include  log  python  setup.py  src  test

any suggestions?

The docker image is the slim python base image (3.7)

BTW i think auto/boost.sh has (i believe) a typo:

  sudo apt-get install liboost-dev
  sudo apt-get install liboost-all-dev

should read

sudo apt-get install libboost-dev
sudo apt-get install libboost-all-dev

[Minor] Misspelling in `auto/boost.sh`

libboost-dev is spelt as liboost-dev and libboost-all-dev is spelt as liboost-all-dev throughout the script. Caused us a bit of a headache :)

unable to build due to error in concurrent hash map: "value_type of the container must be the same as its allocator's"

First, thanks for contributing this awesome package. I just cloned the repository and was following the installation instructions for the python extension (./autobuild --install-python) when I ran into the following error. It seems that the instantiation of TBB's concurrent_hash_map that takes place in src/queue is faulty:

g++ -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -isysroot 
/Library/Developer/CommandLineTools/SDKs/MacOSX11.sdk -I/usr/local/include -I/usr/local/opt/[email protected]/include -I/usr/local/opt/sqlite/include -I/usr/local/opt/[email protected]/Frameworks/Python.framework/Versions/3.9/include/python3.9 
-c src/dataset.cpp -o build/temp.macosx-11-x86_64-3.9/src/dataset.o -O3 -std=c++11 -I include -stdlib=libc++

In file included from src/dataset.cpp:1:
In file included from src/dataset.hpp:16:
In file included from /usr/local/include/tbb/concurrent_hash_map.h:17:
/usr/local/include/tbb/../oneapi/tbb/concurrent_hash_map.h:625:5: error: static_assert failed due to requirement 
'std::is_same<std::__1::pair<Message *const, bool>, std::__1::pair<Message *, bool>>::value'
"value_type of the container must be the same as its allocator's"

    static_assert(std::is_same<value_type, typename Allocator::value_type>::value,
    ^             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/queue.hpp:76:27: note: in instantiation of template class
'tbb::detail::d2::concurrent_hash_map<Message *, bool, MembershipKeyHashCompare, tbb::detail::d1::scalable_allocator<std::__1::pair<Message *, bool>>>' requested here
    membership_table_type membership;
                          ^

This is using clang 12.0.5 on a MacBook Pro. Here is the output of g++ --version for reference:

(base) keyan@MacBook-Pro-2 gosdt % g++ --version
Configured with: --prefix=/Library/Developer/CommandLineTools/usr --with-gxx-include-dir=/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/4.2.1
Apple clang version 12.0.5 (clang-1205.0.22.9)
Target: x86_64-apple-darwin20.6.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

I can't find path

I can't find this path: from python.model.tree_encoder import TreeEncoder

Cannot use F1 objective, because w is set to None

When running the fit(X,y) method of the GOSDT python class, I consistently get the following error message (trace shortened):

Traceback (most recent call last):
...
  File ".../gosdt.py", line 95, in fit
    self.__python_train__(X, y)
  File ".../gosdt.py", line 273, in __python_train__
    leaves_c, pred_c, dic, nleaves, m, n, totaltime, time_c, R_c, COUNT, C_c, accu, best_is_cart, clf = bbound(
  File ".../imbalance/osdt_imb_v9.py", line 1875, in bbound
    root_leaf = CacheLeaf(name, n, P, N, (), x, y, y_mpz, z_mpz, make_all_ones(n + 1),
  File ".../imbalance/osdt_imb_v9.py", line 1676, in __init__
    self.pred = bound.leaf_predict(self.p, self.n, w)
  File ".../imbalance/osdt_imb_v9.py", line 1344, in leaf_predict
    if w * p <= n:
TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'

When printing the variables w, p and n prior to the last call (line 1344 in osdt_imb_v9.py) I can see that w is None, the other two are not.

It seems to me that using the F1-score is impossible, because in gosdt.py the configuration entry for w is set to None whenever F1 should be used. Unless w is altered somewhere the operation in osdt_imb_v9.py consequently has to fail. File gosdt.py, lines 85-87:

elif self.configuration["objective"] == "f1":
    self.configuration["theta"] = None
    self.configuration["w"] = None

Running the algorithm without specifying the objective and with other objectives like acc, bacc or wacc works perfectly fine. How can I use F1?
Is the C++ implementation able to optimize for an F1 objective function?

will it work on windows?

Provide Example Demonstrating Early Stopping, Optimality Gap

The paper suggested that it would be possible to stop training prior to finding the optimum and reporting the optimality gap but it's unclear from gosdt.py how the train method supports this.

Time limit ignored on Linux

Hi Jimmy,

I'm trying to illustrate GOSDT with the diabetes dataset located here, and it seems that the time limit is being ignored. I've tried with continuous features, as well as discretizing on my own, but I can't seem to get anything to return in the amount of time I would expect. I'm running the following code in a Jupyter notebook on a Debian Linux instance with 8 cores and 30GB RAM, so I wouldn't suspect a hardware issue (RAM particularly is hovering around ~1GB used). This example took ~23 minutes on my machine.

## --- env setup --- ##
import pandas as pd
from sklearn.model_selection import train_test_split
import sys
sys.path.append('../GeneralizedOptimalSparseDecisionTrees/python/') # location of cloned GOSDT repo
from model.gosdt import GOSDT 

## --- load data (directly from Kaggle location) --- ##
diabetes = pd.read_csv('diabetes.csv')

## --- same training/test split I'm using --- ##
train, _ = train_test_split(diabetes, random_state=0, test_size=0.2)
X = train.drop(columns="Outcome")
y = train['Outcome']

## --- specify and fit model --- ##
hyperparams = {
    "regularization": 0.1,
    "time_limit":10,
    "precision_limit":0.1,
    "worker_limit":8,
    "verbose": True
}

model = GOSDT(hyperparams)
model.fit(X, pd.DataFrame(y))
print(model.time / 60) # 23.861

A potentially separate issue is that I have only been able to get trees with a single split and two terminal nodes - this is perhaps due in part to the time limit issue not allowing any regularization lower than ~0.1, but I wanted to see if you could offer any advice. Here is the tree I'm getting, pretty much regardless of which combination of hyperparameters I use:

if 144 <= Glucose then:
    predicted class: 1
    misclassification penalty: 0.065
    complexity penalty: 0.1

else if Glucose < 144 then:
    predicted class: 0
    misclassification penalty: 0.191
    complexity penalty: 0.1

Thank you!
-Mitch

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.