- Installation
- Project Motivation
- File Description
- Results
- Licensing, Authors, and Acknowledgements
Use Python 3.
The idea behind the analysis is to identify the main predictors of stroke.
To understand this, I looked at three main questions:
- Does smoking increase the probability of getting a stroke?
- Are patients with other underlying conditions more susceptible to stroke?
- What are the top three predictors of stroke?
The data contains historical records of 5000 patients with 12 attributes, covering both demographic and patient health information. Below is a description of the variables:
- Id: Unique id
- gender: Gender of the patient
- Age: Age in years of the patient
- Hypertension: Hypertension binary feature (1- has disease ,0 – No disease)
- Heart Disease: Heart disease binary feature (1- has disease ,0 – No disease)
- Ever Married: Whether the patient has ever been married
- Work Type: Work type of the patient
- Residence_type: Residence type of the patient
- avg_glucose_level: Average glucose level in blood
- BMI: Body Mass Index
- smoking_status: Smoking status of the patient
- stroke: Stroke event - indicates whether a patient had a stroke or not (1- has stroke ,0 – No stroke)
More info about the data can be found in this link
The data was generally clean, with only one attribute containing missing values. The missing values were imputed using the column mean.
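A minimal sketch of this imputation step with pandas, assuming the data is loaded into a DataFrame and that the incomplete column is numeric (the column name `bmi` here is illustrative, not taken from the notebooks):

```python
import pandas as pd

# Toy stand-in for the patient data; in the project this would come from the CSV.
df = pd.DataFrame({"bmi": [22.5, None, 30.1, None, 27.8]})

# Impute missing values with the column mean
df["bmi"] = df["bmi"].fillna(df["bmi"].mean())

# After imputation, no missing values remain in the column
assert df["bmi"].isna().sum() == 0
```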
Categorical variables were converted into dummy variables, creating k-1 indicator columns per feature and dropping the original column.
Since it is a classification problem, I built a classification model. After running several models and optimizing them, the AdaBoost classifier performed best. It is an ensemble model that trains weak learners sequentially and then combines them into a stronger model. More info about adaptive boosting can be found in this link
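A hedged sketch of fitting this kind of model with scikit-learn: the synthetic imbalanced data stands in for the stroke dataset, and the hyperparameters are placeholders rather than the tuned values from the notebooks:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (~10% positive class) as a stand-in for the real data
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Placeholder hyperparameters; in practice these would come from a search
clf = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=42)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
```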
We use recall, precision, and F1-score to evaluate model performance. The reason for using these rather than accuracy or area under the curve is that the data is highly imbalanced: even if the model naively predicted that no one has a stroke, accuracy would still be above 80%. The F1-score balances precision and recall. I also optimized more for recall than for precision, since it is the lesser harm to tell a patient they will get a stroke when they will not than to fail to identify a patient who actually will.
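As a toy illustration of these metrics, using made-up labels rather than actual model output:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Imbalanced toy labels: 1 = stroke, 0 = no stroke (not real predictions)
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
```

Here a false negative (the missed stroke at index 5) hurts recall, which is the quantity the model was tuned to keep high.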
A summary of the results is published in a Medium article.
The repo has three notebooks, each addressing one of the three questions above.
The data was sourced from Kaggle, and the code is free to use.