Diabetes Prediction

Prompt

Diabetes is a disease that occurs when your blood glucose, also called blood sugar, is too high. Blood glucose is your main source of energy and comes from the food you eat. Insulin, a hormone made by the pancreas, helps glucose from food get into your cells to be used for energy. The disease depends on other health factors such as glucose level and blood pressure. The aim of this project is to predict the possibility of a person having diabetes (presently or in the near future) by analysing the statistics of these other health factors.

Solution

I used a machine learning model called KNN (k-Nearest Neighbours) to predict whether a person has diabetes. The steps involved in reaching the final results are:

Reading the dataset
Extracting the useful information
Cleaning the dataset
Understanding the influence of each factor
Dividing the dataset into train and test sets
Creating the algorithm for prediction
Making test predictions
Calculating the accuracy of the model

I have also made predictions using the KNN model provided by sklearn, to compare the end results of both models.
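As a rough sketch of the scikit-learn side of that comparison, the snippet below trains `KNeighborsClassifier` and measures its accuracy on held-out rows. The synthetic data from `make_classification`, the 80/20 split, and `n_neighbors=5` are assumptions made for illustration; they are not the notebook's actual settings.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the diabetes dataset
# (assumption: 8 numeric features and a binary label).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Hold out 20% of the rows for testing (assumed split ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit scikit-learn's KNN classifier with an assumed K of 5.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Accuracy is the fraction of test rows predicted correctly.
sklearn_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"sklearn KNN accuracy: {sklearn_accuracy:.2%}")
```

To run this on the real data, replace `X` and `y` with the feature columns and the label column loaded from diabetes_dataset.csv.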

Algorithm

Let's see what the KNN algorithm is.

Overview

KNN is a supervised machine learning algorithm, which relies on labeled input data to learn a function that produces an appropriate output when given new unlabeled data.

The KNN algorithm assumes that similar things exist in close proximity. In other words, similar things are near to each other.

Working

Suppose we have a dataset of cells, each labeled with one of two categories, Plant Cell or Animal Cell, and a new unlabeled cell. Our task is to find out which category the new cell belongs to.

First, decide on a value for 'K'; for now, let's take it to be 5. We then calculate the distance from the new cell to every labeled cell (the most common measure is the Euclidean distance), keep the 5 nearest cells, and simply pick the category with the most votes among them. In this example, the new cell is assigned to the Animal Cell category.
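For reference, the Euclidean distance between two points is given by the standard formula below. The symbols p, q (the two points) and n (the number of features) are notation introduced here for illustration.

```latex
d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
```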

Steps for Implementing Algorithm

1. Load the data
2. Initialize K to your chosen number of neighbors
3. For each example in the data
3.1 Calculate the distance between the query example and the current example from the data
3.2 Add the distance and the index of the example to an ordered collection
4. Sort the ordered collection of distances and indices from smallest to largest (in ascending order) by the distances
5. Pick the first K entries from the sorted collection
6. Get the labels of the selected K entries
7. If regression, return the mean of the K labels
8. If classification, return the mode of the K labels

Snippet
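The steps listed above can be sketched in plain Python as follows. This is a minimal illustration, not the notebook's actual code: the function names `euclidean_distance` and `knn_predict`, the `mode` parameter, and the toy data are all assumptions.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(features, labels, query, k, mode="classification"):
    """Predict a label for `query` from its k nearest training examples."""
    # Record (distance, index) for every training example.
    distances = [
        (euclidean_distance(example, query), index)
        for index, example in enumerate(features)
    ]
    # Sort by distance (ascending) and keep the k closest entries.
    nearest = sorted(distances)[:k]
    # Look up the labels of those k entries.
    k_labels = [labels[index] for _, index in nearest]
    # Regression returns the mean of the labels, classification the mode.
    if mode == "regression":
        return sum(k_labels) / len(k_labels)
    return Counter(k_labels).most_common(1)[0][0]


# Toy data (made up): [glucose, BMI], label 1 = diabetic, 0 = not diabetic.
train_features = [
    [85, 22.0], [90, 24.5], [95, 23.1],
    [150, 33.2], [160, 35.0], [170, 31.8],
]
train_labels = [0, 0, 0, 1, 1, 1]

prediction = knn_predict(train_features, train_labels, query=[155, 34.0], k=3)
print(prediction)  # 1: all three nearest neighbours are labeled 1
```

Because the features are compared by raw distance, columns with large ranges dominate the result; scaling the features before calling a function like this is usually worth considering.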

Choosing the right value for K

To select the K that’s right for your data, we run the KNN algorithm several times with different values of K and choose the K that reduces the number of errors we encounter while maintaining the algorithm’s ability to accurately make predictions when it’s given data it hasn’t seen before.
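One way to sketch this search is to hold some rows back as unseen data and measure the error rate for each candidate K. The example below is an illustration under stated assumptions: the data is randomly generated, the 90/30 split and the list of candidate values are arbitrary, and the helper names `knn_classify` and `error_rate` are hypothetical.

```python
import math
import random
from collections import Counter


def knn_classify(train, query, k):
    """Majority vote among the k training rows closest to `query`.

    Each training row is a (features, label) pair.
    """
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def error_rate(train, validation, k):
    """Fraction of validation rows that KNN with this k misclassifies."""
    wrong = sum(
        knn_classify(train, feats, k) != label for feats, label in validation
    )
    return wrong / len(validation)


# Made-up two-class data: two overlapping 2-D Gaussian clusters.
rng = random.Random(0)
data = [((rng.gauss(0, 1), rng.gauss(0, 1)), 0) for _ in range(60)]
data += [((rng.gauss(2, 1), rng.gauss(2, 1)), 1) for _ in range(60)]
rng.shuffle(data)

# Keep 30 rows aside as unseen data and use the remaining 90 for training.
validation, train = data[:30], data[30:]

# Odd values of K avoid tied votes in a two-class problem.
candidate_ks = [1, 3, 5, 7, 9, 11]
errors = {k: error_rate(train, validation, k) for k in candidate_ks}
best_k = min(errors, key=errors.get)
print(errors, "best K:", best_k)
```

A single hold-out split can be noisy on a small dataset, so repeating the split or using cross-validation gives a steadier estimate.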

Advantages

The algorithm is simple and easy to implement.
There’s no need to build a model, tune several parameters, or make additional assumptions.
The algorithm is versatile. It can be used for classification, regression, and search.

Disadvantages

The algorithm gets significantly slower as the number of examples and/or predictors/independent variables increases.

Data Analysis

Dataframe

Binary Histogram for categorization

Dependency of each factor on outcome

With NaN values

Without NaN values

Pair plots

Pair plots are a really simple (one-line-of-code simple!) way to visualize the relationship between every pair of variables. A pair plot produces a matrix of these relationships for an instant examination of the data. It can also be a great starting point for deciding which types of regression analysis to use.

Heatmaps

A heatmap is a two-dimensional graphical representation of data that uses colors to represent different values. Heatmaps are a helpful visual aid for a viewer, enabling the quick dissemination of statistical or data-driven information.

Visualising Results

Confusion Matrix
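A confusion matrix counts how often each actual class was predicted as each class, and accuracy can be read off its diagonal. The sketch below shows the idea for binary labels. The function names and the label lists are made up for illustration and are not the project's results.

```python
def confusion_matrix(actual, predicted):
    """Return [[TN, FP], [FN, TP]] for binary labels 0 and 1."""
    matrix = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1  # Row = actual class, column = predicted class.
    return matrix


def accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return correct / total


# Made-up labels for illustration only.
actual = [0, 0, 1, 1, 0, 1, 0, 1]
predicted = [0, 1, 1, 0, 0, 1, 0, 1]

cm = confusion_matrix(actual, predicted)
print(cm, accuracy(cm))  # [[3, 1], [1, 3]] 0.75
```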

Directory Structure

Dataset: diabetes_dataset.csv

Source Code: diabetes_prediction.ipynb

Results: result.csv

Readme File: README.md

Contribution File: CONTRIBUTION.md

Testing

To test this project on your local computer, follow the given steps:

1. Fork this repository

2. Clone it

3. Make sure you have all the Prerequisites mentioned below

4. Run the diabetes_prediction.ipynb file

Prerequisites

Make sure you have the latest version of Python 3; if not, you can easily download it from here.

Make sure to update pip to the latest version using 'python -m pip install --upgrade pip'.

The project uses a few Python libraries, so make sure you have them too:

numpy: download it using this documentation.

pandas: download it using this documentation.

matplotlib: download it using this documentation.

scikit-learn: download it using this documentation.

seaborn: download it using this documentation.

Conclusion

The KNN algorithm which we implemented had an accuracy of 73.37%. The KNN algorithm provided by sklearn had an accuracy of 75.32%.

To make the KNN algorithm more accurate, we can experiment with the value of 'K'.

If you do not wish to use KNN, you can try other machine learning models that may be more accurate, such as Vector Quantization, Naive Bayes, or Support Vector Machines. I plan to solve this problem using different algorithms to show the difference.

References

If you are curious about KNN algorithms, you can learn more from StatQuest.

Want to contribute?

I would love to receive your contributions towards this project. Refer to CONTRIBUTION.md for more details.

sakshigupta265 / diabetes_prediction