UCI Madelon Dataset: Feature Selection + Classification

Data

The goal of this project is to demonstrate a capacity to identify relevant features using machine learning, working with the Madelon dataset. The dataset is described as follows:

"MADELON is an artificial dataset containing data points grouped in 32 clusters placed on the vertices of a five dimensional hypercube and randomly labeled +1 or -1. The five dimensions constitute 5 informative features. 15 linear combinations of those features were added to form a set of 20 (redundant) informative features. Based on those 20 features one must separate the examples into the 2 classes (corresponding to the +-1 labels). We added a number of distractor feature called 'probes' having no predictive power. The order of the features and patterns were randomized."

The Madelon dataset is distributed without attribute information, so that feature selection is not biased.

MADELON          Positive ex.   Negative ex.   Total
Training set     1000           1000           2000
Validation set   300            300            600
Test set         900            900            1800
All              2200           2200           4400

Number of variables/features/attributes: 20 real, 480 probes, 500 total.
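
The structure described above can be approximated locally for experimentation. The following is a minimal sketch, not the real data: it assumes scikit-learn is available and uses `make_classification`, whose generator is documented as adapted from the one designed for Madelon. The helper name `make_madelon_like` and its parameters are hypothetical, and the labels come out as 0/1 rather than +1/-1.

```python
from sklearn.datasets import make_classification


def make_madelon_like(n_samples=4400, random_state=0):
    """Generate a synthetic dataset shaped like Madelon (not the real data)."""
    X, y = make_classification(
        n_samples=n_samples,
        n_features=500,           # 20 real features + 480 probes
        n_informative=5,          # the five hypercube dimensions
        n_redundant=15,           # linear combinations of the informative ones
        n_repeated=0,
        n_classes=2,
        n_clusters_per_class=16,  # 2 classes x 16 clusters = 32 clusters
        shuffle=True,             # randomize the order of rows and columns
        random_state=random_state,
    )
    return X, y


X, y = make_madelon_like()
print(X.shape, y.shape)
```

Because the columns are shuffled, the positions of the 20 real features are unknown to downstream code, which mirrors the missing attribute information in the original dataset.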

Problem Statement

The challenge is to develop a series of models for two purposes:

  1. identifying relevant features
  2. generating predictions from the model

Content

Data Sampling

Do substantive work on at least six subsets of the data.

  • 3 sets of 10% of the data from the UCI Madelon set
  • 3 sets of 10% of the data from the Madelon set made available by your instructors
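
Drawing the subsets can be sketched as follows. This is an illustration under assumptions: the helper `draw_subsets` is hypothetical, each subset is sampled without replacement from the row indices, and separate subsets are allowed to overlap. The source does not say whether the subsets should be disjoint.

```python
import numpy as np


def draw_subsets(n_rows, n_subsets=3, frac=0.10, seed=0):
    """Return a list of sorted row-index arrays, each a random `frac` sample."""
    rng = np.random.default_rng(seed)
    size = int(round(n_rows * frac))
    # Rows are unique within a subset; different subsets may share rows.
    return [
        np.sort(rng.choice(n_rows, size=size, replace=False))
        for _ in range(n_subsets)
    ]


# Example: three 10% subsets of a 2000-row training set.
subsets = draw_subsets(n_rows=2000)
print([len(idx) for idx in subsets])
```

Each index array can then be used to slice the feature matrix and the label vector together, for example `X[idx]` and `y[idx]`.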

EDA

  • Perform EDA on each set as necessary
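
One possible EDA step, given that 15 of the real features are linear combinations of the 5 informative ones, is to look for features that correlate strongly with some other feature. The sketch below is only one option among many. It assumes scikit-learn and NumPy, and it runs on a small synthetic stand-in rather than the real data.

```python
import numpy as np
from sklearn.datasets import make_classification

# Small synthetic stand-in for one data subset (assumption, not the real data).
X, y = make_classification(
    n_samples=400,
    n_features=50,
    n_informative=5,
    n_redundant=15,
    n_clusters_per_class=16,
    random_state=0,
)

# Pairwise Pearson correlation between feature columns.
corr = np.corrcoef(X, rowvar=False)
np.fill_diagonal(corr, 0.0)  # ignore each feature's correlation with itself

# For every feature, its strongest absolute correlation with any other feature.
max_abs_corr = np.abs(corr).max(axis=0)

# Features ranked from most to least correlated with some other feature.
ranked = np.argsort(max_abs_corr)[::-1]
print(ranked[:20])
```

Features near the top of this ranking are candidates for further inspection. The ranking alone does not prove that a feature is informative.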

Benchmarking

  • Perform a naive fit for each of the base model classes:
    • logistic regression
    • decision tree
    • k nearest neighbors
    • support vector classifier
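
A naive benchmark over the four model classes listed above might look like the sketch below. It assumes scikit-learn and a synthetic stand-in dataset. The models use library defaults, except that `max_iter` is raised for logistic regression to avoid convergence warnings and `random_state` is fixed for the decision tree.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for one data subset (assumption, not the real data).
X, y = make_classification(
    n_samples=600,
    n_features=100,
    n_informative=5,
    n_redundant=15,
    n_clusters_per_class=16,
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "k nearest neighbors": KNeighborsClassifier(),
    "support vector classifier": SVC(),
}

# Fit each model without tuning and record held-out accuracy.
benchmark_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    benchmark_scores[name] = model.score(X_test, y_test)

for name, score in benchmark_scores.items():
    print(f"{name}: {score:.3f}")
```

These scores serve only as a baseline against which later feature selection and tuning can be compared.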

Identify Features & Feature Importance

  • Considering these results, build a final predictive model
  • Approaches:
    • Use feature selection to reduce the dataset to a manageable size then use conventional methods
    • Use an iterative model training method to find relevant features (ANOVA)
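
The first approach can be sketched with a univariate filter followed by a conventional classifier. This example assumes scikit-learn and uses the ANOVA F-test (`f_classif`) inside `SelectKBest`. The choice of `k=20`, the scaler, and the k nearest neighbors classifier are illustrative assumptions, not the project's confirmed pipeline, and the iterative approach is not shown here.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one data subset (assumption, not the real data).
X, y = make_classification(
    n_samples=600,
    n_features=100,
    n_informative=5,
    n_redundant=15,
    n_clusters_per_class=16,
    random_state=0,
)

# Keep the 20 features with the highest ANOVA F-scores, then classify.
pipeline = make_pipeline(
    SelectKBest(score_func=f_classif, k=20),
    StandardScaler(),
    KNeighborsClassifier(),
)

# Selection happens inside each fold, so held-out rows do not leak into it.
cv_scores = cross_val_score(pipeline, X, y, cv=5)
print(cv_scores.mean())

# Refit on all rows to inspect which column indices were kept.
pipeline.fit(X, y)
selected = pipeline.named_steps["selectkbest"].get_support(indices=True)
print(selected)
```

Placing the selector inside the pipeline matters: selecting features on the full dataset before cross-validation would give optimistic scores.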

Build Model

  • Implement final model

Additional Items to Add (forthcoming)

  • ROC visualizations
  • Comparative score visualizations for different classification pipelines
  • Tune hyperparameters to improve accuracy, precision, and recall and to reduce log loss
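
As a rough sketch of what these forthcoming items could involve, the example below tunes one hyperparameter against log loss and computes the points of a ROC curve. It assumes scikit-learn and a synthetic stand-in dataset. The classifier and the parameter grid are illustrative assumptions, and plotting is left out.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import log_loss, roc_auc_score, roc_curve
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for an already feature-reduced subset (assumption).
X, y = make_classification(
    n_samples=600,
    n_features=30,
    n_informative=5,
    n_redundant=15,
    n_clusters_per_class=16,
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Grid search that picks the neighbor count with the lowest log loss.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 11, 21]},
    scoring="neg_log_loss",
    cv=5,
)
search.fit(X_train, y_train)

# Predicted probability of the positive class on held-out rows.
proba = search.predict_proba(X_test)[:, 1]

# ROC curve points, area under the curve, and held-out log loss.
fpr, tpr, thresholds = roc_curve(y_test, proba)
auc = roc_auc_score(y_test, proba)
test_log_loss = log_loss(y_test, proba)
print(search.best_params_, auc, test_log_loss)
```

The `fpr` and `tpr` arrays can be passed to any plotting library to draw the ROC curve, and the same loop can be repeated per pipeline to build comparative score charts.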
