thebrownboy / biometrica

The project tackles challenges such as imbalanced data and skewed features through exploratory data analysis and model training. Techniques such as focal loss and controlled oversampling address the imbalanced nature of the data and improve model performance.

Jupyter Notebook 99.46% Python 0.54%
focal-loss imbalanced-classification imbalanced-data imbalanced-learning logistic-regression machine-learning neural-network random-forest svm

BioMetrica

Problem Definition

The problem at hand is to develop a machine learning solution for "Body Level Classification" based on a given dataset. The dataset comprises various attributes related to the physical, genetic, and habitual conditions of individuals. These attributes consist of both categorical and continuous variables. The goal is to accurately classify the body level of a person into one of four distinct classes.

The dataset contains 1477 samples, and its classes are unevenly distributed: some classes have significantly more instances than others. The models therefore need to adapt to this class imbalance while still achieving the best possible classification results.

Data Visualization

  • Imbalanced target feature
  • Some features are imbalanced as well: most of their values concentrate on a single value.

  • Skewed distributions can also be observed in some of the data.

Data preprocessing

To prepare the data for analysis, the following preprocessing techniques will be applied:

  • Standardization: The feature values will be standardized to have a mean of 0 and a standard deviation of 1, ensuring consistent scaling across different features.

  • Log Transformation (for skewed data): When data exhibits skewness, a logarithmic transformation will be applied to reduce the impact of extreme values and achieve a more normal distribution.

  • Oversampling (to tackle the class imbalance): Oversampling techniques such as the Synthetic Minority Over-sampling Technique (SMOTE) could be employed to address the imbalance between classes, but we found that plain random oversampling was a good choice.

By implementing these preprocessing steps, we aim to improve the quality and suitability of the data for the subsequent stages of the project.
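The log transformation and standardization steps can be sketched for a single feature column as follows. This is a minimal illustration, not the project's code: the helper names and toy values are hypothetical, `log1p` (log of 1 + x) is used as one common variant that tolerates zeros, and fitting the scaler on the training split only is an assumption about good practice rather than something the README states.

```python
import math
import statistics


def log_transform(values):
    """Compress a right-skewed, non-negative feature with log(1 + x)."""
    return [math.log1p(v) for v in values]


def fit_standardizer(values):
    """Return the mean and population standard deviation of a column."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return mean, (std if std > 0 else 1.0)  # guard against constant columns


def standardize(values, mean, std):
    """Shift and scale a column with previously fitted statistics."""
    return [(v - mean) / std for v in values]


# Hypothetical toy column with one extreme value.
train_col = [1.0, 2.0, 3.0, 200.0]
val_col = [2.0, 50.0]

train_log = log_transform(train_col)
mean, std = fit_standardizer(train_log)          # fitted on training data only
train_scaled = standardize(train_log, mean, std)
val_scaled = standardize(log_transform(val_col), mean, std)
```

After this step the training column has mean 0 and standard deviation 1, while the validation column is scaled with the same statistics so that no information leaks from it into training.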

Insights

The feature importance analysis reveals that the weight, age, and height of the person are the primary features that any model prioritizes. Weight and height can be combined to calculate the Body Mass Index (BMI), which is weight divided by height squared. It is important to note that BMI represents the true function in this problem: the body level can be read off from it directly, so adding it as a feature would hand the model the answer rather than help it learn. Strictly speaking, this particular problem does not require a machine learning model at all.
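As a small illustration of why BMI makes the task almost trivial, the formula itself is a one-liner. The sketch below assumes weight in kilograms and height in meters; the dataset's actual units and the BMI thresholds that separate the four body levels are not given here, so no class mapping is attempted.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body Mass Index: weight divided by height squared (kg / m^2)."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


# Hypothetical person: 70 kg, 1.75 m.
example = bmi(70.0, 1.75)
```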

Models

  • Logistic Regression
  • Random Forest
  • SVM
  • NN (Neural Network)

Logistic Regression

Firstly, we begin with a basic implementation of Logistic Regression without any additional techniques. This initial step allows us to tune the hyperparameters and identify the optimal configuration.

  • Tuning the 'C' Hyperparameter
  • Exploring various approaches: class weights (CW) and over-sampling (OS)
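To make the class-weight idea concrete, the sketch below computes inverse-frequency weights with the same heuristic scikit-learn applies for `class_weight="balanced"`, namely n_samples / (n_classes * class_count). The label names and counts are hypothetical, and the weighting scheme actually used in the notebooks may differ.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical imbalanced target with four classes.
labels = ["A"] * 2 + ["B"] * 3 + ["C"] * 5 + ["D"] * 10
weights = balanced_class_weights(labels)
```

Rare classes receive weights above 1 and frequent classes weights below 1, so errors on rare classes cost more during training. A dictionary like this can be passed as the `class_weight` argument of scikit-learn's `LogisticRegression`, alongside the `C` value being tuned.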

Focal Loss (Modifying the learning methodology)

Instead of relying solely on preprocessing steps to address the issue of imbalanced data, we can incorporate knowledge of the problem directly into the learning process itself. By explicitly informing the loss function about the imbalanced nature of the data, we enable it to handle this situation more effectively. This approach can reduce the need for extensive preprocessing steps aimed specifically at the imbalance problem.

Problems with imbalanced datasets

  • No learning due to easy negatives
  • Cumulative effect of many easy negatives

Cross entropy does not handle either of these problems. Let's see how focal loss helps solve them by balancing between easy and hard examples.

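Handling the easy examples problem

  • The idea is that if a sample is already well classified, we can significantly decrease (down-weight) its contribution to the loss.
  • This is done with a modulating factor controlled by the focusing parameter gamma (γ): the larger γ is, the more strongly easy examples are down-weighted.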
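For reference, the focal loss in its commonly used form is written below, where p_t denotes the probability the model assigns to the true class. The value γ = 2 in the worked numbers is only an illustrative choice, not necessarily the setting used in this project.

```latex
\mathrm{FL}(p_t) = -(1 - p_t)^{\gamma}\,\log(p_t)

% Worked example with \gamma = 2:
%   easy example, p_t = 0.9:  (1 - 0.9)^2 = 0.01  (loss scaled down 100x)
%   hard example, p_t = 0.1:  (1 - 0.1)^2 = 0.81  (loss almost unchanged)
```

With γ = 0 the modulating factor equals 1 and the expression reduces to ordinary cross entropy.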

Cumulative effect of many easy negatives

  • To counter this, we add a weighting factor α, which is usually set from the inverse class frequency. The loss of a positive example is weighted by α and the loss of a negative example by 1 − α.
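Putting both ideas together, a per-example alpha-balanced focal loss for the binary case can be sketched as below. This is a minimal illustration under stated assumptions: the function names are hypothetical, the values γ = 2 and α = 0.75 are arbitrary examples rather than the project's settings, and the project's four-class setup would need a multi-class variant.

```python
import math


def cross_entropy(p: float, y: int) -> float:
    """Binary cross entropy for one example; p is P(positive), y is 0 or 1."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(p: float, y: int, gamma: float, alpha: float) -> float:
    """Alpha-balanced focal loss for one binary example."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)              # avoid log(0)
    p_t = p if y == 1 else 1.0 - p               # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha   # class-dependent weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy negative (confidently correct) versus a hard positive (mostly wrong).
easy_negative = focal_loss(p=0.05, y=0, gamma=2.0, alpha=0.75)
hard_positive = focal_loss(p=0.20, y=1, gamma=2.0, alpha=0.75)

# Share of the plain cross-entropy loss that survives for the easy negative.
easy_share = easy_negative / cross_entropy(0.05, 0)
```

The easy negative keeps only a tiny fraction of its cross-entropy loss, while the hard positive keeps most of it, so many easy negatives can no longer dominate the total loss.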

Results

Model            train-accuracy  val-accuracy  test-accuracy
before-sampling  0.9878          0.983         0.9715
after-sampling   0.9965          0.9863        0.993

Interpreting the Results

As demonstrated above, the focal loss approach yields the best model performance in terms of accuracy and F1-score, even without any imbalance-specific preprocessing of the data. When we additionally applied oversampling with a 0.5 ratio, accuracy and F1-score improved further. This indicates that combining focal loss with a controlled oversampling strategy improves on focal loss alone: partially balancing the class distribution while keeping the benefits of focal loss addresses the challenges posed by imbalanced data more effectively.
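Controlled random oversampling can be sketched as follows. The README does not define what the 0.5 ratio refers to, so this sketch assumes it means that every class is brought up to at least half the size of the largest class; the function name and toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(samples, labels, ratio=0.5, seed=0):
    """Duplicate random minority samples until each class reaches
    at least ratio * (size of the largest class)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = int(ratio * max(counts.values()))
    out_samples, out_labels = list(samples), list(labels)
    for cls, count in counts.items():
        if count >= target:
            continue  # class is already large enough
        pool = [s for s, l in zip(samples, labels) if l == cls]
        extra = rng.choices(pool, k=target - count)  # sample with replacement
        out_samples.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_samples, out_labels


# Hypothetical toy data: 2 samples of class "A", 10 of class "B".
labels = ["A"] * 2 + ["B"] * 10
samples = list(range(12))
new_samples, new_labels = random_oversample(samples, labels, ratio=0.5)
new_counts = Counter(new_labels)
```

Oversampling like this is normally applied to the training split only, so that validation and test accuracy are measured on the original class distribution.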

Contributors

thebrownboy, yousefelmahdy
