A Model for Automated Feature Selection

We investigate multiple techniques and propose a model for feature selection, a problem that has become central in application areas where datasets contain hundreds or thousands of variables. To this end, we use a dataset that was artificially constructed for evaluating feature selection techniques. Our investigation shows that, in datasets where many variables carry no useful information, feature selection can enhance the predictive ability of a machine learning algorithm.

MADELON is an artificial dataset containing data points grouped in 32 clusters placed on the vertices of a five-dimensional hypercube and randomly labeled +1 or -1. The five dimensions constitute five informative features; together with redundant and distractor features, the dataset has 500 features in total. The dataset was obtained from the UCI Machine Learning Repository.
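
As a point of reference, the sketch below shows one way to load the MADELON training split; it assumes the whitespace-separated files madelon_train.data and madelon_train.labels have already been downloaded locally from the UCI repository (the file names and use of numpy/pandas are assumptions for illustration, not part of this project's code).

```python
# Minimal loading sketch (assumed local file names): the MADELON training
# split is stored as whitespace-separated text files.
import numpy as np
import pandas as pd

X = pd.DataFrame(np.loadtxt("madelon_train.data"))            # 2000 rows x 500 features
y = pd.Series(np.loadtxt("madelon_train.labels"), dtype=int)  # class labels in {-1, +1}

print(X.shape)           # expected: (2000, 500)
print(y.value_counts())  # the two classes are roughly balanced
```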

Feature selection is the process of picking those attributes in a dataset (such as the columns of tabular data) that are most relevant to the problem at hand. For example, if we are trying to differentiate a human from a cat, the number of feet would be a relevant feature for the prediction. Although it is possible to select features using prior knowledge and other manual techniques, this is not always feasible: our assumptions may bias which features we select, or the dataset may simply contain too many features to filter by hand.
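
To make the idea concrete, the sketch below shows a simple automated, filter-based selector using scikit-learn's SelectKBest; the mutual-information scoring function and k=20 are assumptions for demonstration only and are not the model proposed in this project.

```python
# Illustrative filter-based feature selection: rank features by mutual
# information with the label and keep the 20 highest-scoring ones.
from sklearn.feature_selection import SelectKBest, mutual_info_classif

selector = SelectKBest(score_func=mutual_info_classif, k=20)
X_selected = selector.fit_transform(X, y)          # reduced feature matrix
kept = selector.get_support(indices=True)          # indices of retained columns
print(X_selected.shape, kept)
```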

Our motivation for choosing this problem comes from the Occam's razor (or Ockham's razor) principle, a problem-solving principle from philosophy which says that if there exist two explanations for an occurrence, the one requiring the fewest assumptions is usually the better one; the more assumptions you have to make, the more likely the explanation is wrong. Our motivation also stems largely from modelling various machine learning problems in our assignments. Even though we use statistical methods such as examining the collinearity of the variables, this is often not sufficient to make the chosen set of features optimal. Likewise, going back and forth to check whether accuracy increases or decreases when a feature is dropped, after the whole model has already been built, is not an efficient technique.

Data Distribution

Before applying any machine learning or statistical methods, it is crucial to understand the underlying distribution of the data. The distribution helps us state our assumptions clearly and base our models upon them. We visualize the distribution of the attributes using histograms. Since the number of features is very high, we plot histograms for a random sample of 20 of the 500 features, as shown in Fig. 1 (a sketch of this plot follows the list below). The observations made from the visualization are as follows:

  1. Some of the features are skewed (i.e., their mean and median differ).
  2. Most of the features follow an approximately normal distribution.
  3. Some of the features have a bimodal distribution.
  4. Outliers are present in the data.
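
The sketch below reproduces a Fig. 1-style grid of histograms over 20 randomly sampled features; the use of matplotlib, the fixed seed, and the 4x5 grid layout are assumptions for illustration.

```python
# Histograms of 20 randomly sampled features (out of 500) to inspect
# skewness, normality, bimodality, and outliers.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)                        # arbitrary fixed seed
sampled = rng.choice(X.shape[1], size=20, replace=False)

fig, axes = plt.subplots(4, 5, figsize=(15, 10))
for ax, col in zip(axes.ravel(), sampled):
    ax.hist(X.iloc[:, col], bins=30)
    ax.set_title(f"feature {col}")
fig.tight_layout()
plt.show()
```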
