Code Monkey home page Code Monkey logo

wrar's Introduction

Weighted Relevance and Redundancy Scoring

This repository contains a Python 3 implementation of the wRaR algorithm for feature selection on imbalanced dataset. The full paper (bachelor's thesis) on this algorithm can be found here. The algorithm is based on RaR (see references), itself based on this paper.   The implementation is based around a heavily modified and extended version of this HiCS implementation. It uses the Gurobi Optimizer for quadratic optimization estimating single-feature relevance, you must acquire a license to be able to use it (academic licenses should be free).

Install

Simply run

pip install -r requirements.txt
python setup.py install

How to use

Here is a quick example of running wRaR. It uses Pandas dataframes

import wrar
import pandas as pd

data = pd.read_csv('data.csv', header=None)
target = 'Class' # data[target] should be the target column

rar = wrar.rar.RaR(data)
rar.run(target, k=5, runs=200, split_iterations=20, compensate_imbalance=True)

After finishing, wRaR prints a summary of the feature ranking. It is also available as attribute, i.e. rar.feature_ranking. The parameters of rar.run are as follows:

  • target The column name of the target. This also means that the passed dataframe should contain all columns.
  • k (default: 5) This is the maximum size of subsets to be picked for relevance and redundancy estimation.
  • runs (default: number of features) This controls how many subsets are used to estimate relevance, i.e. the number of Monte Carlo estimations.
  • split_iterations (default: 3) This controls how many subsets are picked for the redundancy estimation phase, i.e. for each newly ranked feature.
  • compensate_imbalance (default: False) Boolean value that controls whether to use wRaR or RaR. If true, a cost matrix compensating class imbalance is generated and used in the feature selection process.
  • weight_mod (default: 1) Exponenent for compensation matrix. A higher exponent will result in a significantly higher "counterweight" to imbalance, while a lower one equalizes the weights. Might be useful for fine-tuning.
  • cost_matrix This is a custom dataframe that can be passed in order to be multiplied with the generated compensation matrix. It has only one row, and a column for each unique value in the target column.

Contributors (wRaR)

Contributors (HiCS)

References

  • A. K. Shekar, T. Bocklisch, C. N. Straehle, P. I. Sánchez, and E. Müller, “Including multi-feature interactions and redundancy for feature ranking in mixed datasets,” in Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2017, Macedonia, Skopje, September 18-22, 2017, Proceedings, Lecture Notes in Computer Science, Springer, 2017.
  • F. Keller, E. Muller, and K. Bohm, “Hics: High contrast subspaces for density-based outlier ranking,” in Data Engineering (ICDE), 2012 IEEE 28th International Conference on, pp. 1037–1048, IEEE, 2012.

wrar's People

Contributors

danthe96 avatar marcuspappik avatar nikriek avatar

Stargazers

 avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.