
ft_logistic_regression's Introduction

DataScience | Logistic Regression | 42Paris

This project implements one-vs-all logistic regression to solve a classification problem. It includes:

  • An implementation of pandas.DataFrame.describe from scratch
  • Implementations of data visualization tools from scratch, used to gain insight into the data and build an intuition for what it looks like
  • A re-creation of Poudlard's Sorting Hat, built by implementing logistic regression from scratch

Requirements:

  • Python 3
  • NumPy
  • pandas
  • Matplotlib
  • scikit-learn
  • tabulate
  • SciPy

How to Run:


  git clone https://github.com/shimazadeh/Ft_logistic_regression.git DSLR
  cd DSLR
  pip3 install -r requirements.txt
  python main.py config.yaml

The config.yaml file must include the information needed for training and testing.

Implementation

The following sections describe the method and results for each part of the program. All methods are developed from scratch.

Data Analysis

describe.py is an implementation of pandas.DataFrame.describe. The program takes a dataset as a parameter and displays summary statistics for every numerical feature. See the data analysis folder for the code. Here is its output for the dataset used in this project:

| | Arithmancy | Astronomy | Herbology | Defense Against the Dark Arts | Divination | Muggle Studies | Ancient Runes | History of Magic | Transfiguration | Potions | Care of Magical Creatures | Charms | Flying |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1251 | 1251 | 1251 | 1251 | 1251 | 1251 | 1251 | 1251 | 1251 | 1251 | 1251 | 1251 | 1251 |
| mean | 49453.1 | 46.4764 | 1.1895 | -0.4648 | 3.2138 | -222.904 | 496.252 | 2.9786 | 1029.86 | 5.9613 | -0.0643 | -243.326 | 23.109 |
| std | 16701.6 | 520.946 | 5.2231 | 5.2095 | 4.111 | 484.986 | 106.711 | 4.457 | 43.9829 | 3.1029 | 0.9726 | 8.7904 | 97.755 |
| skew | 2.78942e+08 | 271385 | 27.2812 | 27.1385 | 16.9003 | 235211 | 11387.2 | 19.8645 | 1934.49 | 9.6281 | 0.946 | 77.2712 | 9556.04 |
| kurtosis | -0.0525 | -0.1174 | -0.4316 | 0.1174 | -1.4067 | 0.8039 | 0.0318 | -1.0414 | -1.2183 | 0.0033 | -0.0202 | 0.3781 | 0.859 |
| variance | 0.2119 | -1.693 | -1.3692 | -1.693 | 0.6879 | -0.7592 | -1.5902 | -0.1 | 0.1994 | -0.5513 | 0.0342 | -1.088 | -0.1605 |
| min | -24370 | -966.74 | -10.2957 | -10.1621 | -8.727 | -1043.96 | 283.87 | -8.4311 | 906.627 | -3.6208 | -3.3137 | -261.049 | -181.47 |
| 25% | 38180 | -485.323 | -4.2523 | -5.2835 | 3.1205 | -573.969 | 396.41 | 2.2309 | 1025.64 | 3.6842 | -0.6944 | -250.586 | -40.085 |
| 50% | 48793 | 272.072 | 3.5264 | -2.7207 | 4.621 | -419.164 | 464.328 | 4.4026 | 1045.48 | 5.8685 | -0.0651 | -244.789 | -1.92 |
| 75% | 60794.5 | 528.346 | 5.4637 | 4.8532 | 5.727 | 264.144 | 597.517 | 5.8939 | 1058.33 | 8.2067 | 0.5756 | -232.528 | 52.625 |
| max | 104956 | 1016.21 | 10.2968 | 9.6674 | 10.032 | 1092.39 | 745.396 | 11.8897 | 1094.46 | 13.5368 | 3.0565 | -225.428 | 279.07 |
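
As an illustration of what computing these statistics from scratch can look like, here is a minimal sketch in plain Python. It is not the code in describe.py: the function names are hypothetical, and it assumes the pandas conventions of skipping missing values, using the sample (n - 1) standard deviation, and interpolating percentiles linearly.

```python
import math


def percentile(sorted_values, q):
    """Linearly interpolated percentile (q in [0, 1]) of a pre-sorted list."""
    position = (len(sorted_values) - 1) * q
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    low_value, high_value = sorted_values[lower], sorted_values[upper]
    return low_value + (high_value - low_value) * fraction


def describe_column(values):
    """Summarise one numerical column, skipping missing (None/NaN) entries.

    Assumes at least two valid values, since the sample variance divides
    by (count - 1).
    """
    data = sorted(v for v in values if v is not None and not math.isnan(v))
    count = len(data)
    mean = sum(data) / count
    # Sample variance (n - 1 denominator), the convention pandas uses for std.
    variance = sum((v - mean) ** 2 for v in data) / (count - 1)
    return {
        "count": count,
        "mean": mean,
        "std": math.sqrt(variance),
        "min": data[0],
        "25%": percentile(data, 0.25),
        "50%": percentile(data, 0.50),
        "75%": percentile(data, 0.75),
        "max": data[-1],
    }


# Toy column (not project data) with two missing entries.
stats = describe_column([2.0, 4.0, 4.0, None, 5.0, 7.0, 9.0, float("nan")])
for name, value in stats.items():
    print(f"{name:>5} {value:.4f}")
```

Sorting once and reusing the sorted list for min, max, and all three percentiles keeps the sketch short; a full implementation would repeat this per numerical column.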

Data Visualization

Three programs implement a histogram, a scatter plot, and a pair plot in Python:

  • Histogram.py: Generates the histogram of the features to see the homogeneous score distribution between all four houses. [Image: Histogram Screenshot]
  • scatter_plot.py: Displays a scatter plot of similar features to identify those that can be eliminated. [Image: Scatter Plot Screenshot]
  • pair_plot.py: Displays a pair plot matrix of the data to identify features for the logistic regression model. [Image: Pair Plot Screenshot]
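
The per-house histogram can be thought of as a binning step followed by drawing. The sketch below shows only the binning, in plain Python, under stated assumptions: the row keys "house" and "score", the function names, and the toy data are all hypothetical and not taken from Histogram.py.

```python
def histogram_counts(values, bins, low, high):
    """Count how many values fall into each of `bins` equal-width intervals.

    Assumes high > low so that the bin width is non-zero.
    """
    width = (high - low) / bins
    counts = [0] * bins
    for value in values:
        # A value equal to `high` is clamped into the last bin.
        index = min(int((value - low) / width), bins - 1)
        counts[index] += 1
    return counts


def histogram_by_house(rows, feature, bins=4):
    """Group one feature's scores by house and bin each group on a shared axis."""
    scores = {}
    for row in rows:
        scores.setdefault(row["house"], []).append(row[feature])
    all_values = [value for group in scores.values() for value in group]
    low, high = min(all_values), max(all_values)
    return {
        house: histogram_counts(group, bins, low, high)
        for house, group in scores.items()
    }


# Toy rows (not project data): two groups, one feature.
rows = [
    {"house": "A", "score": 0.0},
    {"house": "A", "score": 1.0},
    {"house": "A", "score": 4.0},
    {"house": "B", "score": 2.0},
    {"house": "B", "score": 3.0},
    {"house": "B", "score": 8.0},
]
counts_by_house = histogram_by_house(rows, "score", bins=4)
for house, counts in counts_by_house.items():
    print(house, counts)
```

Using the same bin edges for every house is what makes the distributions comparable: a feature whose per-house counts look alike is one with a homogeneous score distribution.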

Training and Evaluation

The program is modular and can be run with different settings. Adjust the config.yml file with your specific parameters and features. The program can be run in two modes, training and testing:

  • Training: provide the model's parameters, the dataset, and the features to train on in the yml file.
  • Testing: this mode uses the model.joblib file generated during the training phase and writes the results to a JSON file.
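
To make the one-vs-all idea behind the training mode concrete, here is a minimal sketch in plain Python. It is an illustration under assumptions, not the project's implementation: the function names, the learning rate, the iteration count, and the toy data are hypothetical, and it uses plain batch gradient descent without feature scaling or regularization.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def probability(weights, row):
    """Predicted probability for one sample; weights[0] is the bias term."""
    inputs = [1.0] + list(row)
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)))


def train_binary(features, targets, learning_rate=0.1, iterations=500):
    """Fit one binary classifier with batch gradient descent on the log loss."""
    weights = [0.0] * (len(features[0]) + 1)
    for _ in range(iterations):
        gradient = [0.0] * len(weights)
        for row, target in zip(features, targets):
            error = probability(weights, row) - target
            for j, x in enumerate([1.0] + list(row)):
                gradient[j] += error * x
        weights = [
            w - learning_rate * g / len(features)
            for w, g in zip(weights, gradient)
        ]
    return weights


def train_one_vs_all(features, labels):
    """Train one 'this class vs. the rest' classifier per distinct label."""
    return {
        label: train_binary(features, [1.0 if y == label else 0.0 for y in labels])
        for label in sorted(set(labels))
    }


def predict(models, row):
    """Pick the label whose classifier reports the highest probability."""
    return max(models, key=lambda label: probability(models[label], row))


# Toy data (not project data): three well-separated classes in 2D.
features = [
    (0.0, 2.0), (0.3, 1.8), (-0.3, 2.2),
    (-2.0, -1.0), (-1.8, -1.2), (-2.2, -0.8),
    (2.0, -1.0), (1.8, -0.8), (2.2, -1.2),
]
labels = ["A"] * 3 + ["B"] * 3 + ["C"] * 3
models = train_one_vs_all(features, labels)
print([predict(models, row) for row in features])
```

Each class gets its own weight vector, so with four houses the same binary training loop would run four times, and prediction reduces to comparing four probabilities.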

During training, the loss of each category is printed to the terminal at each iteration. At the end of training, a confusion matrix with the performance of each category is also printed to the terminal.
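
A confusion matrix and per-category scores can be derived as in the sketch below. The source does not say which performance measures the program reports, so precision and recall are an assumption here, as are the function names and the toy labels.

```python
def confusion_matrix(actual, predicted, labels):
    """Build a matrix where rows are true labels and columns are predictions."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, guess in zip(actual, predicted):
        matrix[index[truth]][index[guess]] += 1
    return matrix


def per_class_scores(matrix, labels):
    """Precision and recall for each label, read off the confusion matrix."""
    scores = {}
    for i, label in enumerate(labels):
        true_positive = matrix[i][i]
        predicted_total = sum(row[i] for row in matrix)  # column sum
        actual_total = sum(matrix[i])                    # row sum
        scores[label] = {
            "precision": true_positive / predicted_total if predicted_total else 0.0,
            "recall": true_positive / actual_total if actual_total else 0.0,
        }
    return scores


# Toy labels (not project results).
labels = ["A", "B", "C"]
actual = ["A", "A", "B", "B", "C", "C"]
predicted = ["A", "B", "B", "B", "C", "A"]

matrix = confusion_matrix(actual, predicted, labels)
scores = per_class_scores(matrix, labels)
for label, row in zip(labels, matrix):
    print(label, row, scores[label])
```

Diagonal entries count correct predictions; everything off the diagonal shows which category was mistaken for which.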


[Figures: Stochastic GD, Mini-Batch GD, GD]
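
The three gradient descent variants named in these figures differ mainly in how many samples feed each weight update. The sketch below illustrates that difference only. The method names, the mini-batch size of 32, and the helper functions are hypothetical and are not taken from the project's configuration.

```python
import random


def iterate_batches(sample_count, batch_size, rng):
    """Yield shuffled index batches that together cover one epoch."""
    order = list(range(sample_count))
    rng.shuffle(order)
    for start in range(0, sample_count, batch_size):
        yield order[start:start + batch_size]


def batch_size_for(method, sample_count, mini_batch_size=32):
    """Map a method name to the number of samples used per weight update."""
    sizes = {
        "gd": sample_count,             # batch GD: one update per epoch
        "mini_batch": mini_batch_size,  # mini-batch GD: one update per small batch
        "sgd": 1,                       # stochastic GD: one update per sample
    }
    return sizes[method]


rng = random.Random(0)
for method in ("gd", "mini_batch", "sgd"):
    size = batch_size_for(method, sample_count=100)
    updates = sum(1 for _ in iterate_batches(100, size, rng))
    print(f"{method}: {updates} update(s) per epoch")
```

With this framing, a single training loop can serve all three variants: it computes the gradient on each yielded batch and updates the weights once per batch.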

ft_logistic_regression's People

Contributors

shimazadeh


ft_logistic_regression's Issues

Options to the program

Modes:
  • Option 1: run a hyperparameter search to find the best parameters
  • Option 2: train using the given parameters
Data Visualization:
  • Some visualization of the data before training
