tabular-benchmark's Introduction

Tabular data learning benchmark

Accompanying repository for the paper "Why do tree-based models still outperform deep learning on tabular data?"

Replicating the paper's results

Downloading the datasets

To download these datasets, simply run python data/download_data.py.

Training the models

You can re-run the training using WandB sweeps.

  1. Copy / clone this repo on the different machines / clusters you want to use.
  2. Log in to WandB and create new projects.
  3. Enter your project names in launch_config/launch_benchmarks.py (or launch_config/launch_xps.py).
  4. Run python launch_config/launch_benchmarks.py.
  5. Run the generated sweeps using wandb agent <USERNAME/PROJECTNAME/SWEEPID> on the machine of your choice. More information is available in the WandB documentation.
  6. After you've stopped the runs, enter your WandB login in launch_config/download_data.py, then download the results with python launch_config/download_data.py.
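
For readers unfamiliar with WandB sweeps, the following is a minimal sketch of what a random-search sweep configuration can look like. It is not taken from launch_config/launch_benchmarks.py: the helper names (build_sweep_config, create_sweep), the metric name mean_test_score, the model name my_model, and the example hyperparameters are all hypothetical and only illustrate the general shape. The only WandB call used is wandb.sweep, which registers a configuration and returns a sweep ID that can then be passed to wandb agent.

```python
def build_sweep_config(model_name, search_space, metric_name="mean_test_score"):
    """Return a WandB random-search sweep configuration as a plain dict.

    `metric_name` is a hypothetical metric; use whatever your runs log.
    """
    # Fixed values use {"value": ...}; searched values use a distribution
    # or an explicit list of choices.
    parameters = {"model_name": {"value": model_name}}
    parameters.update(search_space)
    return {
        "method": "random",
        "metric": {"name": metric_name, "goal": "maximize"},
        "parameters": parameters,
    }


def create_sweep(config, project):
    """Register the sweep with WandB and return its sweep ID (needs a login)."""
    import wandb  # imported lazily so the config can be built without wandb

    return wandb.sweep(sweep=config, project=project)


# Hypothetical search space for an illustrative model.
example_config = build_sweep_config(
    "my_model",
    {
        "learning_rate": {
            "distribution": "log_uniform_values",
            "min": 1e-5,
            "max": 1e-2,
        },
        "n_layers": {"values": [2, 4, 8]},
    },
)
```

Calling create_sweep(example_config, "PROJECTNAME") would return the SWEEPID used in step 5; the repository's launch script presumably generates its sweeps for you, so this sketch is only meant to clarify what a sweep is.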

We're planning to release a version that uses Benchopt instead of WandB to make the benchmark easier to run.

Replicating the analyses / figures

All the R code used to generate the analyses and figures is available in the analyses folder.

Benchmarking your own algorithm

Downloading the datasets

The datasets used in the benchmark have been uploaded as OpenML benchmarks, with the same transformations that are used in the paper.

import openml
#openml.config.apikey = 'FILL_IN_OPENML_API_KEY'  # set the OpenML Api Key
SUITE_ID = 297 # Regression on numerical features
#SUITE_ID = 298 # Classification on numerical features
#SUITE_ID = 299 # Regression on numerical and categorical features
#SUITE_ID = 304 # Classification on numerical and categorical features
benchmark_suite = openml.study.get_suite(SUITE_ID)  # obtain the benchmark suite
for task_id in benchmark_suite.tasks:  # iterate over all tasks
    task = openml.tasks.get_task(task_id)  # download the OpenML task
    dataset = task.get_dataset()
    X, y, categorical_indicator, attribute_names = dataset.get_data(
        dataset_format="dataframe", target=dataset.default_target_attribute
    )

Using our results

If you want to compare your own algorithms with the models used in this benchmark for a given number of random search iterations, you can use the results from our random searches, which we share as two CSV files located in the analyses/results folder.
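
As an illustration of what "for a given number of random search iterations" can mean in practice, here is a minimal sketch that takes a list of validation scores from one model's random search and estimates the best score reached after n iterations, averaged over random orderings of the search. This is one common way to make such a comparison and is not claimed to be the exact procedure of the paper; the function names are hypothetical, and extracting the scores from the CSV files is left out because the column layout is not described here.

```python
import random
from statistics import mean


def best_so_far(scores):
    """Running maximum: best score seen after each iteration, in order."""
    best = float("-inf")
    curve = []
    for score in scores:
        best = max(best, score)
        curve.append(best)
    return curve


def expected_best(scores, n_iterations, n_shuffles=100, seed=0):
    """Average best score after `n_iterations`, over shuffled search orders."""
    rng = random.Random(seed)
    finals = []
    for _ in range(n_shuffles):
        order = list(scores)
        rng.shuffle(order)
        finals.append(max(order[:n_iterations]))
    return mean(finals)
```

For example, best_so_far([0.5, 0.7, 0.6, 0.9]) gives [0.5, 0.7, 0.7, 0.9]. Computing expected_best for your own model and for a benchmark model at the same n_iterations gives a like-for-like comparison under the same tuning budget.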

Using our code

To benchmark your own algorithm using our code, you'll need:

  • a model that follows sklearn's API, i.e. one with fit and predict methods. We recommend using Skorch to expose a PyTorch model through sklearn's API.
  • to add your model hyperparameters search space to launch_config/model_config.
  • to add your model name in launch_config/launch_benchmarks and utils/keyword_to_function_conversion.py
  • to run the benchmarks as explained in Training the models.
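
To make the first requirement concrete, here is a minimal sketch of a model exposing fit and predict in the sklearn style. The class MeanRegressor and its shrinkage hyperparameter are hypothetical and exist only for illustration; a real model would usually inherit from sklearn.base.BaseEstimator (which provides get_params and set_params) or be wrapped with Skorch, and whether the benchmark code needs anything beyond fit and predict is not stated here.

```python
class MeanRegressor:
    """Toy regressor: predicts a (optionally shrunk) mean of the training targets."""

    def __init__(self, shrinkage=0.0):
        # Hyperparameters are stored unchanged in __init__, as sklearn expects.
        self.shrinkage = shrinkage

    def get_params(self, deep=True):
        return {"shrinkage": self.shrinkage}

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit(self, X, y):
        values = list(y)
        # Learned attributes get a trailing underscore by sklearn convention.
        self.mean_ = (1.0 - self.shrinkage) * sum(values) / len(values)
        return self  # returning self allows model.fit(X, y).predict(X)

    def predict(self, X):
        # One prediction per row of X.
        return [self.mean_] * len(X)
```

A call such as MeanRegressor().fit(X_train, y_train).predict(X_test) then works the same way as with any sklearn estimator, and a hyperparameter such as shrinkage is the kind of value a search space would vary.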

We're planning to release a version that uses Benchopt instead of WandB to make the benchmark easier to run.

Contributors

leogrin
