Efficient Inference of Optimal Decision Trees

Paper link: http://florent.avellaneda.free.fr/dl/AAAI20.pdf

The source code is designed to be compiled and executed on GNU/Linux.

Dependencies

g++
cmake
minisat : http://minisat.se/
xdot : https://github.com/jrfonseca/xdot.py
CLI11 : https://github.com/CLIUtils/CLI11

Example of installing dependencies for Ubuntu

The following instructions have been tested on Ubuntu 19.04

sudo apt install g++ cmake minisat2 xdot libzip-dev libboost-dev

CentOS 7

search yum install minisat2 https://centos.pkgs.org/7/lux/minisat2-libs-2.2.1-1.el7.lux.x86_64.rpm.html

git clone https://github.com/zhzdeng/minisat sudo cp -r minisat/minisat /usr/local/include/

yum install epel-release yum install gcc-c++ cmake xdotool libzip-devel boost-devel

sudo echo '[group_kdesig-cmake3_EPEL] name=Copr repo for cmake3_EPEL owned by @kdesig baseurl=https://copr-be.cloud.fedoraproject.org/results/@kdesig/cmake3_EPEL/epel-7-$basearch/ type=rpm-md skip_if_unavailable=True gpgcheck=1 gpgkey=https://copr-be.cloud.fedoraproject.org/results/@kdesig/cmake3_EPEL/pubkey.gpg repo_gpgcheck=0 enabled=1 enabled_metadata=1' >> /etc/yum.repos.d/cmake3.repo

yum install cmake3

sudo yum install centos-release-scl sudo yum install devtoolset-9-gcc* scl enable devtoolset-9 bash gcc -v

Build

cmake .
make

Running

Infer a decision tree with the algorithm DT_depth for the dataset "mouse":

./InferDT -d data/mouse.csv -v infer

Infer a decision tree with the algorithm DT_size for the dataset "car":

./InferDT data/car.csv -v infer

Run a 10-cross-validation on the dataset "mouse" with the algorithm DT_size:

./InferDT data/mouse.csv bench

Infer a decision tree with the algorithm DT_size for the training set balance-scale.csv.train1 and evaluate the model with the testing set balance-scale.csv.test1:

./InferDT data/balance-scale.csv.train1.csv infer -t data/balance-scale.csv.test1.csv

Print a help message:

./InferDT --help

Benchmarks

We ran experiments on Ubuntu with Intel Core CPU i7-2600K @ 3.40 GHz.

Verwer and Zhang Datasets

The datasets we used are extracted from the paper of Verwer and Zhang and are available at (https://github.com/SiccoVerwer/binoct).

Dataset	S	B	Time DT_depth	Accuracy DT_depth	k	n for DT_size	Time DT_size	Accuracy DT_size	Accuracy BinOCT*
iris	150	114	18 ms	92.9 %	3	10.6	30 ms	93.2 %	98.4 %
monks1	124	17	24 ms	90.3 %	4.4	17	80 ms	95.5 %	87.1 %
monks2	169	17	190 ms	70.2 %	5.8	47.8	9.1 sec	74.0 %	63.3 %
monks3	122	17	30 ms	78.1 %	4.8	23.4	210 ms	82.6 %	93.5 %
wine	178	1276	600 ms	89.3 %	3	7.8	1.2 sec	92.0 %	88.9 %
balance-scale	625	20	50 sec	93.0 %	8	268	183 sec	92.6 %	77.5 %

S: Number of examples in the dataset

B: Number of Boolean features in the dataset

Time DT_depth: Time used by our algorithm with parameter "-d"

Accuracy DT_depth: Accuracy of our algorithm with parameter "-d"

k: Depth of inferred decision trees

n: Number of nodes in inferred decision trees

Time DT_size: Time used by our algorithm without parameter "-d"

Accuracy DT_size: Accuracy of our algorithm without parameter "-d"

Accuracy BinOCT: Accuracy of algorithm from https://github.com/SiccoVerwer/binoct

Mouse

We used dataset Mouse that the authors Bessiere, Hebrard and O'Sullivan shared with us. Each entry in rows DT_size and DT_depth corresponds to the average over 100 runs. The first columns correspond to the name of each algorithm used. The next three columns correspond to inferring a decision tree from the whole dataset. The last column corresponds the 10-fold cross-validations.

Algorithm	Time	k	n	Accuracy
DT2 from paper Minimising Decision Tree Size as Combinatorial Optimisation	577 sec	4	15	83.8 %
DT1 from paper Learning Optimal Decision Trees with SAT	12.9 sec	4	15	83.8 %
DT_size: our algorithm without parameter "-d"	70 ms	4	15	83.5 %
DT_depth: our algorithm with parameter "-d"	20 ms	4	31	85.8 %

Other

In this section we perform x 10-cross-validations made randomly and record the average.

Dataset	S	B	Time DT_depth	Accuracy DT_depth	k	n for DT_size	Time DT_size	Accuracy DT_size	x
zoo	101	136	47 ms	91.7 %	4	20	200 ms	91 %	200
BodyMassIndex	500	172	53 sec	85 %	6.6	109	6.4 h	85.4 %	1
lungCancerDataset	59	72	3.8 ms	89.6 %	2.6	7	7.5 ms	90.6 %	200
iris	113	114	30 ms	93.6 %	3.6	14	120 ms	94.0 %	200
monks1	124	17	30 ms	97.0 %	4.9	15	140 ms	99.5 %	200
monks2	169	17	330 ms	88.0 %	6	64	9.3 sec	88.7 %	100
monks3	92	17	125 ms	79.6 %	5.8	35.5	2.5 sec	81.5 %	800
balance-scale	625	20	200 sec	71.8 %	9	276	11 h	73.6 %	1
wine	178	1276	4.5 sec	91.5 %	3	12.2	14.5 sec	91.2 %	200

S: Number of examples in the dataset

B: Number of Boolean features in the dataset

k: Average depth of inferred decision trees

n: Average number of nodes in inferred decision trees

x: Number of 10-cross-validation performed

showkeyjar / inferdt Goto Github PK

inferdt's Introduction

Efficient Inference of Optimal Decision Trees

Dependencies

Example of installing dependencies for Ubuntu

CentOS 7

Build

Running

Benchmarks

Verwer and Zhang Datasets

Mouse

Other

inferdt's People

Contributors

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent