Practical-Data-Science

My data science experiments. A project by Amal Joy.

Instructions for running the source code file:

  1. The script reads the data from the internet, so you do not need a local copy of the data file.
  2. Line 75 plots the decision tree directly in the Python window. It requires some additional packages. If you have trouble installing them, just delete that entire chunk. It won't affect the rest of the program.
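As an alternative to deleting the plotting chunk, the optional dependency can be guarded so the script keeps running when the packages are missing. The sketch below is not the project's own code: it assumes matplotlib and scikit-learn's `plot_tree` as the plotting route, saves to a file instead of drawing on screen, and uses a hypothetical helper `try_plot_tree` with a throwaway synthetic dataset.

```python
import os
import tempfile

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def try_plot_tree(model, path):
    """Save a plot of the fitted tree to `path`; return False if plotting is unavailable."""
    try:
        import matplotlib
        matplotlib.use("Agg")  # non-interactive backend: write to a file, no window
        import matplotlib.pyplot as plt
        from sklearn.tree import plot_tree
    except ImportError:
        print("Plotting packages not installed; skipping the tree plot.")
        return False
    fig, ax = plt.subplots(figsize=(8, 5))
    plot_tree(model, filled=True, ax=ax)
    fig.savefig(path)
    plt.close(fig)
    return True


# Synthetic stand-in data, only to have a fitted tree to plot.
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)

out_path = os.path.join(tempfile.mkdtemp(), "tree.png")
plotted = try_plot_tree(tree, out_path)
```

With this guard, the rest of the program runs the same whether or not the plot was produced.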

The project aimed to build a model that predicts whether a person is diabetic based on their medical diagnostics. The experiment used the Pima Indians Diabetes data. The data had numerous issues, including impossible values and missing values. These were handled in most cases by imputing the class average. The variables were then normalized. A thorough analysis of the features was conducted, and the most important features were selected for building the models.

Several models were built on the data: KNN, decision trees and random forest. Because the classes are imbalanced, oversampling with SMOTE was also tried on the KNN model, but it turned out that the oversampling caused the model to overfit, so that model would not be useful. Important features were identified from the random forest model, and a new data set was made using only those features.

Parameter tuning was performed on all the models. A randomized grid search was applied to the decision tree and the random forest models. The tuning had little effect on the random forest model, but it gave a 3.75% improvement on the decision tree model. Cross-validation was applied to all the models. A comparison showed that the random forest model gave the best results and was more stable than any other model in the experiment, with a 10-fold cross-validation accuracy of 87.11%.
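The main steps described above can be sketched roughly as follows. This is an illustrative sketch under assumptions, not the project's actual code: the data is a small synthetic stand-in, the column names `Glucose`, `BMI` and `Outcome` and the helper `impute_class_mean` are hypothetical, a value of 0 is assumed to mark an impossible measurement, and `MinMaxScaler` is assumed as the normalization since the text does not say which one was used.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier


def impute_class_mean(df, columns, target):
    """Replace zeros in `columns` with the mean of the rows sharing the same class."""
    out = df.copy()
    for col in columns:
        # Assumption: a zero is an impossible value and is treated as missing.
        values = out[col].astype(float).where(out[col] != 0)
        # Mean of the valid values within each outcome class, aligned row by row.
        class_means = values.groupby(out[target]).transform("mean")
        out[col] = values.fillna(class_means)
    return out


# Synthetic stand-in for the real data; all values are made up.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Glucose": rng.normal(120, 30, n).clip(50, 200),
    "BMI": rng.normal(32, 6, n).clip(18, 60),
})
toy["Outcome"] = (toy["Glucose"] + rng.normal(0, 20, n) > 130).astype(int)
toy.loc[toy.index[:10], "Glucose"] = 0  # inject impossible values

clean = impute_class_mean(toy, ["Glucose", "BMI"], "Outcome")
X = clean[["Glucose", "BMI"]]
y = clean["Outcome"]

# Normalization plus random forest, scored with 10-fold cross-validation.
forest = make_pipeline(
    MinMaxScaler(),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
scores = cross_val_score(forest, X, y, cv=10, scoring="accuracy")
print(f"Random forest 10-fold accuracy: {scores.mean():.3f}")

# Randomized search over a small, arbitrary decision tree parameter grid.
search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions={
        "max_depth": [2, 3, 4, 5, None],
        "min_samples_leaf": [1, 2, 5, 10],
    },
    n_iter=8,
    cv=5,
    random_state=0,
)
search.fit(X, y)
print("Best decision tree parameters:", search.best_params_)
```

Note that class-average imputation uses the target column, so in this sketch it is applied to the whole data set before cross-validation. A stricter setup would compute the class means inside each training fold only.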

Thanks for reading
