
car-acceptability-classification

Table of contents

  • General info
  • Data
  • Results

General info

Goal:

The goal of this project is to compare the Naive Bayes and Logistic Regression classifiers on a binary classification problem: predicting whether a car's evaluation is "Positive" or "Negative". Additionally, class imbalance is handled with the Synthetic Minority Oversampling Technique (SMOTE).

Libraries:

This project was created with R version 4.0.1. The following libraries are used (a loading sketch follows the list):

  • e1071
  • ggplot2
  • gridExtra
  • ROCR
  • DMwR
  • MASS
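
To set up the environment, the packages can be loaded as in the minimal sketch below. Note that DMwR has since been archived on CRAN, so it may need to be installed from the CRAN archive.

```r
# Load the packages used across both scripts
library(e1071)      # naiveBayes()
library(ggplot2)    # plotting
library(gridExtra)  # arranging multiple ggplots
library(ROCR)       # prediction() / performance() for ROC analysis
library(DMwR)       # SMOTE(); archived on CRAN, install from the archive if needed
library(MASS)       # general statistics utilities
```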

Scripts:

  • Comparative_Study.R contains a model comparison between three Naive Bayes models with different Laplace smoothing parameters and a full Logistic Regression model (see the fitting sketch after this list).
  • Classification.R addresses the problem of unbalanced classes and compares the performance of the models when trained on a balanced versus an unbalanced training set.
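
A minimal sketch of the two model families being compared; object names such as `train` and `test`, and the Laplace values 0/1/5, are illustrative rather than necessarily those used in the scripts:

```r
library(e1071)

# Three Naive Bayes models with different Laplace smoothing parameters
nb0 <- naiveBayes(eval ~ ., data = train, laplace = 0)
nb1 <- naiveBayes(eval ~ ., data = train, laplace = 1)
nb5 <- naiveBayes(eval ~ ., data = train, laplace = 5)

# Full logistic regression model on all predictors
lr <- glm(eval ~ ., data = train, family = binomial)

# Predictions on a held-out test set
nb_pred <- predict(nb1, newdata = test)                    # class labels
lr_prob <- predict(lr, newdata = test, type = "response")  # P(second factor level)
lr_pred <- ifelse(lr_prob > 0.5, "Positive", "Negative")   # assumes "Positive" is the second level
```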

Data

The file car.data.txt contains the car evaluation dataset, which can be found in the UCI Machine Learning repository. The data consist of 1728 complete observations on the following 7 variables (a loading sketch follows the list):

  • buying : buying price
  • maint : price of maintenance
  • doors : number of doors
  • persons : passenger capacity
  • lug_boot : size of the luggage boot
  • safety : estimated safety of the car
  • eval : evaluation, with two categories, "Positive" and "Negative"
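
A minimal loading sketch, assuming the bundled car.data.txt is the comma-separated UCI file (no header row) already carrying the binary labels described above:

```r
# Read the raw file and name the columns as in the variable list above
car <- read.table("car.data.txt", sep = ",",
                  col.names = c("buying", "maint", "doors", "persons",
                                "lug_boot", "safety", "eval"),
                  stringsAsFactors = TRUE)

str(car)  # should show 1728 observations of 7 factor variables
```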

![Exploratory plots of the dataset](Explore.jpeg)

Results

Comparative study

![Model comparison](Coparison.jpeg)

Naive Bayes and Logistic Regression fall into two different categories of algorithms: generative and discriminative. Naive Bayes is a generative algorithm, meaning that the joint probability P(x, y) = P(y)P(x|y) is estimated from the training set, and Bayes' rule is then used to estimate P(y|x) on the test set. Logistic Regression, on the other hand, is a discriminative algorithm: it estimates P(y|x) directly from the training data by minimizing an error function.
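
Concretely, with features x = (x_1, …, x_d), Bayes' rule combined with the Naive Bayes conditional-independence assumption gives the following factorized posterior; this is the standard derivation, not anything specific to this project:

```latex
% Bayes' rule plus the Naive Bayes independence assumption:
P(y \mid x_1,\dots,x_d)
  = \frac{P(y)\,P(x_1,\dots,x_d \mid y)}{P(x_1,\dots,x_d)}
  = \frac{P(y)\prod_{j=1}^{d} P(x_j \mid y)}{P(x_1,\dots,x_d)}

% The denominator does not depend on y, so the predicted class is
\hat{y} = \arg\max_{y} \; P(y) \prod_{j=1}^{d} P(x_j \mid y)
```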

Both categories have advantages and disadvantages, so the best method should be chosen based on the available data. Among the advantages of Naive Bayes as a generative classifier, the assumption of conditionally independent features makes computation simple and fast, and the model reaches its asymptotic error faster. The latter is an important advantage over Logistic Regression when the training sample is small, because Logistic Regression tends to overfit small samples; Naive Bayes therefore needs less training data to converge. A disadvantage of Naive Bayes is that it cannot learn interactions between features, whereas Logistic Regression can handle correlated features and can be regularized with Ridge or LASSO penalties. The main advantage of Logistic Regression is that, even though it can overfit small samples, it usually outperforms Naive Bayes on large training sets: its asymptotic error is lower, even though it approaches that error more slowly (Andrew Y. Ng and Michael I. Jordan, "On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes", https://papers.nips.cc/paper/2020-on-discriminative-vs-generative-classifiers-a-comparison-of-logistic-regression-and-naive-bayes.pdf).
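
To illustrate the regularization point above, here is a minimal sketch of Ridge- and LASSO-penalized logistic regression. It assumes the glmnet package, which is not among the libraries listed above, and the objects `train` and `x_test` are illustrative:

```r
library(glmnet)  # assumed here; not part of this project's library list

# x: model matrix of predictors, y: binary response factor
x <- model.matrix(eval ~ ., data = train)[, -1]  # drop the intercept column
y <- train$eval

# alpha = 0 gives Ridge, alpha = 1 gives LASSO; lambda chosen by cross-validation
ridge_fit <- cv.glmnet(x, y, family = "binomial", alpha = 0)
lasso_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)

# Predicted probabilities at the cross-validated lambda (x_test: a test model matrix)
p_ridge <- predict(ridge_fit, newx = x_test, s = "lambda.min", type = "response")
```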

Class Imbalance

The two models were fitted both on the original unbalanced training set and on a balanced training set generated with the Synthetic Minority Oversampling Technique (SMOTE). They were then evaluated on a similarly unbalanced test set. In both the unbalanced training and test sets, "Negative" is the majority class and "Positive" is the minority class.

SMOTE is an oversampling technique that generates synthetic training cases of the minority class by interpolating between each minority case and its k nearest neighbors.
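
A minimal sketch of balancing the training set with DMwR's SMOTE; the object name `train` and the oversampling percentages are illustrative, as the exact values used in Classification.R are not stated here:

```r
library(DMwR)  # archived on CRAN; install from the archive if needed

# perc.over = 200 creates 2 synthetic cases per minority ("Positive") case,
# interpolating between each case and its k = 5 nearest neighbours;
# perc.under = 200 keeps 2 majority cases per synthetic case generated.
balanced_train <- SMOTE(eval ~ ., data = train,
                        perc.over = 200, perc.under = 200, k = 5)

table(balanced_train$eval)  # inspect the new class distribution
```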

Table 1: Unbalanced training set

| Model               | Metric    | Negative | Positive |
|---------------------|-----------|----------|----------|
| Naive Bayes         | Precision | 0.950    | 0.929    |
|                     | Recall    | 0.969    | 0.887    |
|                     | F1 score  | 0.960    | 0.907    |
| Logistic Regression | Precision | 0.979    | 0.913    |
|                     | Recall    | 0.959    | 0.954    |
|                     | F1 score  | 0.969    | 0.933    |

Table 1 shows the performance of the models fitted on the unbalanced training set. The issue here is that these results are highly biased, since the class distribution in the test set has the same kind of imbalance present in the training set. This naturally produces very good scores on the test set, but it is also a misleading result, since the model is unable to generalize.
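
For reference, the per-class metrics reported in the tables can be computed from a confusion matrix as in the following sketch (`pred` and `truth` are illustrative names):

```r
# pred and truth: factors with levels c("Negative", "Positive")
cm <- table(Predicted = pred, Actual = truth)

# Precision, recall and F1 for the "Positive" class
tp <- cm["Positive", "Positive"]
fp <- cm["Positive", "Negative"]
fn <- cm["Negative", "Positive"]

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)
```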

Table 2: Balanced training set

| Model               | Metric    | Negative | Positive |
|---------------------|-----------|----------|----------|
| Naive Bayes         | Precision | 1.000    | 0.773    |
|                     | Recall    | 0.869    | 1.000    |
|                     | F1 score  | 0.930    | 0.872    |
| Logistic Regression | Precision | 1.000    | 0.826    |
|                     | Recall    | 0.906    | 1.000    |
|                     | F1 score  | 0.950    | 0.904    |

Table 2 provides the performance results when the models are fitted on the balanced training set and evaluated on the unbalanced test set.
