Diabetes Prediction

Prompt

Diabetes is a disease that occurs when your blood glucose, also called blood sugar, is too high. Blood glucose is your main source of energy and comes from the food you eat. Insulin, a hormone made by the pancreas, helps glucose from food get into your cells to be used for energy. The disease depends on other health factors such as glucose level and blood pressure. The aim of this project is to predict the likelihood of a person having diabetes (now or in the near future) by analysing statistics on those health factors.

Solution

I have used a machine learning model called KNN (k-Nearest Neighbours) to predict whether a person has diabetes. The steps involved in reaching the final results are:

  • Reading the dataset
  • Extracting the useful information
  • Cleaning the dataset
  • Understanding the influence of each factor
  • Dividing the dataset into train and test sets
  • Creating the algorithm for prediction
  • Making test predictions
  • Calculating the accuracy of the model

I have also made predictions using the KNN model provided by sklearn, to compare the results of the two models.
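The splitting and accuracy steps listed above can be sketched in plain Python. This is only an illustration, not the notebook's code: the helper names `train_test_split` and `accuracy`, the 20% test fraction, and the sample rows are all assumptions made for the example.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical rows: (glucose, blood_pressure, outcome)
rows = [(85, 66, 0), (183, 64, 1), (89, 66, 0), (137, 40, 1), (116, 74, 0),
        (78, 50, 1), (115, 70, 0), (197, 70, 1), (125, 96, 1), (110, 92, 0)]

train, test = train_test_split(rows, test_fraction=0.2, seed=42)
print(len(train), len(test))                  # 8 2
print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))   # 0.75
```

Fixing the seed keeps the split reproducible, so accuracy figures from different runs or different models can be compared on the same test rows.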

Algorithm

Let's look at what the KNN algorithm is.

Overview

KNN is a supervised machine learning algorithm: it relies on labeled input data to produce an appropriate output when given new, unlabeled data.

The KNN algorithm assumes that similar things exist in close proximity. In other words, similar things are near to each other.

Working

Suppose we have a dataset of cells labeled with two categories, Plant Cell and Animal Cell, and a new unlabeled cell. Our task is to find out which category the new cell belongs to.

image

Next, choose a value for K; here we take K = 5. We calculate the distance from the new cell to every labeled cell (the most common measure is the Euclidean distance), select the 5 nearest cells, and pick the category with the most votes among them. Here the new cell is assigned to the Animal Cell category.

image
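As a rough illustration of the two ingredients described above, the sketch below computes a Euclidean distance and takes a majority vote over five neighbour labels. The function names and the 3-to-2 vote split are assumptions made for the example; the figures only tell us that the new cell ends up in the Animal Cell category.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of features")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def majority_vote(labels):
    """Most frequent label among the nearest neighbours."""
    return Counter(labels).most_common(1)[0][0]


print(euclidean_distance((0, 0), (3, 4)))  # 5.0

# Hypothetical labels of the K = 5 nearest cells.
nearest_labels = ["Animal", "Animal", "Plant", "Animal", "Plant"]
print(majority_vote(nearest_labels))       # Animal
```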

Steps for Implementing Algorithm

  1. Load the data and initialize K to your chosen number of neighbors

  2. For each example in the data

    2.1 Calculate the distance between the query example and the current example from the data.

    2.2 Add the distance and the index of the example to an ordered collection

  3. Sort the ordered collection of distances and indices from smallest to largest (in ascending order) by the distances

  4. Pick the first K entries from the sorted collection

  5. Get the labels of the selected K entries

  6. If regression, return the mean of the K labels

  7. If classification, return the mode of the K labels

Snippet

image
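Since the snippet survives only as an image, here is a minimal plain-Python sketch that follows the numbered steps above. It is not the notebook's implementation: the function name `knn_predict`, the `(features, label)` data layout, and the sample values are assumptions for illustration.

```python
import math
from collections import Counter


def knn_predict(data, query, k, mode="classification"):
    """Predict a label for query from data, a list of (features, label) pairs."""
    if not 1 <= k <= len(data):
        raise ValueError("k must be between 1 and len(data)")

    # Steps 2-3: distance from the query to every example, sorted ascending.
    distances = sorted(
        (math.dist(features, query), index)
        for index, (features, _) in enumerate(data)
    )

    # Steps 4-5: labels of the k nearest examples.
    k_labels = [data[index][1] for _, index in distances[:k]]

    if mode == "regression":
        return sum(k_labels) / k                    # Step 6: mean of the labels
    return Counter(k_labels).most_common(1)[0][0]   # Step 7: mode of the labels


# Hypothetical examples: ((glucose, bmi), outcome)
data = [((85, 26.6), 0), ((89, 28.1), 0), ((116, 25.6), 0),
        ((183, 23.3), 1), ((137, 43.1), 1), ((197, 30.5), 1)]

print(knn_predict(data, (150, 35.0), k=3))  # 1
```

Features on very different scales (for example glucose versus BMI) let the larger-valued feature dominate the distance, so scaling the columns before computing distances is usually worth considering.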

Choosing the right value for K

To select the K that is right for the data, we run the KNN algorithm several times with different values of K and choose the K that produces the fewest errors while still predicting accurately on data the algorithm has not seen before.
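One way to carry out this search is to score each candidate K on a held-out validation set and keep the best one. The sketch below assumes that approach; the helper names, the one-dimensional toy data, and the candidate values 1, 3 and 5 are illustrative only.

```python
import math
from collections import Counter


def knn_classify(train, query, k):
    """Majority label among the k training rows nearest to query."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def validation_accuracy(train, validation, k):
    """Share of validation rows that are classified correctly."""
    hits = sum(knn_classify(train, x, k) == y for x, y in validation)
    return hits / len(validation)


def choose_k(train, validation, candidate_ks):
    """Return (best_k, scores); ties go to the smallest k."""
    scores = {k: validation_accuracy(train, validation, k) for k in candidate_ks}
    best_k = max(sorted(scores), key=lambda k: scores[k])
    return best_k, scores


# Hypothetical 1-D data; the point at 4.5 is a deliberately noisy example.
train = [((1.0,), 0), ((2.0,), 0), ((3.0,), 0), ((4.5,), 1),
         ((8.0,), 1), ((9.0,), 1), ((10.0,), 1)]
validation = [((4.0,), 0), ((2.5,), 0), ((8.5,), 1), ((7.0,), 1)]

best_k, scores = choose_k(train, validation, [1, 3, 5])
print(best_k, scores)  # 3 {1: 0.75, 3: 1.0, 5: 1.0}
```

With K = 1 the noisy point misleads the prediction for the query at 4.0, while larger values of K outvote it. Odd values of K avoid tied votes in a two-class problem.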

Advantages

  • The algorithm is simple and easy to implement.
  • There’s no need to build a model, tune several parameters, or make additional assumptions.
  • The algorithm is versatile. It can be used for classification, regression, and search.

Disadvantages

  • The algorithm gets significantly slower as the number of examples and/or predictors (independent variables) increases.

Data Analysis

Dataframe

image

Binary Histogram for categorization

image

Dependency of each factor on outcome

  • With NaN values

image

image

image

  • Without NaN values

image

image

image

Pair plots

Pair plots are a simple (one line of code!) way to visualize the relationships between variables. A pair plot produces a matrix of pairwise relationships between the variables in the data, which allows a quick examination of the dataset. It can also be a good starting point for deciding which types of regression analysis to use.

image

Heatmaps

A heatmap is a two-dimensional graphical representation of data that uses color to represent values. Heatmaps are a helpful visual aid, letting a viewer take in statistical or data-driven information quickly.

image

Visualising Results

image

Confusion Matrix

image
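For readers who cannot see the figure, a binary confusion matrix simply counts how often each true label was paired with each predicted label. The sketch below uses the same layout as sklearn's `confusion_matrix` (rows are true labels, columns are predicted labels); the function name and the sample labels are made up for illustration and are not the project's results.

```python
def binary_confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for labels encoded as 0 and 1."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matrix = [[0, 0], [0, 0]]
    for truth, prediction in zip(y_true, y_pred):
        matrix[truth][prediction] += 1  # row = true label, column = prediction
    return matrix


# Hypothetical labels, not taken from result.csv.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]

matrix = binary_confusion_matrix(y_true, y_pred)
print(matrix)  # [[2, 1], [1, 2]]

# Accuracy is the diagonal (correct predictions) divided by the total.
print((matrix[0][0] + matrix[1][1]) / len(y_true))
```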

Directory Structure

Dataset: diabetes_dataset.csv

Source Code: diabetes_prediction.ipynb

Results: result.csv

Readme File: README.md

Contribution File: CONTRIBUTION.md

Testing

To test this project on your local computer, follow these steps:

1. Fork this repository.

2. Clone it.

3. Make sure you have all the prerequisites mentioned below.

4. Run the diabetes_prediction.ipynb file.

Prerequisites

Make sure you have the latest version of Python 3; if not, download and install it before continuing.

Make sure to update pip to the latest version using 'python -m pip install --upgrade pip'.

The project uses a few Python libraries, so make sure you have them too:

numpy: download it using this documentation.

pandas: download it using this documentation.

matplotlib: download it using this documentation.

scikit-learn: download it using this documentation.

seaborn: download it using this documentation.

Conclusion

The KNN algorithm implemented in this project had an accuracy of 73.37%. The KNN algorithm provided by sklearn had an accuracy of 75.32%.

To make the KNN algorithm more accurate, we can experiment with the value of K.

If you do not wish to use kNN, there are other machine learning models that may be more accurate, such as Vector Quantization, Naive Bayes, and Support Vector Machines. I plan to solve this problem using different algorithms to show the difference.

References

If you are curious about kNN algorithms, you can learn more from StatQuest.

Want to contribute?

I would love to receive your contributions to this project. Refer to CONTRIBUTION.md for more details.

Thanks! ✨

Contributors

sakshigupta265
