Classification-with-Health-Records

Dataset Description

This dataset covers more than 100,000 medical appointments made by patients from different neighborhoods in Brazil. It addresses an important question: why does a person make a doctor's appointment, receive all the instructions, and then not show up?

Part I: Data Prep

This includes:

Read in the data CSV file

Clean up the data column names

Remove records with erroneous entries (e.g., negative ages; see what others have done on Kaggle)
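The cleanup steps above can be sketched with pandas. This is a minimal illustration on a tiny in-memory frame rather than the real file; the helper name `clean_appointments`, the snake_case naming scheme, and the "age must be non-negative" rule are assumptions made for the example, not the project's actual code.

```python
import pandas as pd


def clean_appointments(df: pd.DataFrame) -> pd.DataFrame:
    """Normalize column names and drop rows with impossible ages (hypothetical helper)."""
    out = df.copy()
    # Lower-case the names and turn hyphens/spaces into underscores,
    # e.g. "No-show" becomes "no_show".
    out.columns = (
        out.columns.str.strip().str.lower().str.replace(r"[-\s]+", "_", regex=True)
    )
    # Assumed validity rule: an age cannot be negative.
    out = out[out["age"] >= 0].reset_index(drop=True)
    return out


# Stand-in for the frame that pd.read_csv would return for the real CSV file.
raw = pd.DataFrame({"Age": [34, -1, 0, 62], "No-show": ["No", "Yes", "No", "Yes"]})
clean = clean_appointments(raw)
```

On the real data, the same function would be applied to the result of `pd.read_csv` on the downloaded file.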

Create a test set of 20k records that you won't touch again for the remainder of this project until Part III. Use stratified sampling on the No-show variable to ensure that the test set and training set have the same class proportions. Save the train and test sets as CSV files in the processed_data directory.
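A stratified hold-out split of this kind might look like the sketch below. It uses synthetic data scaled down to 1,000 rows with a 200-row test set; the column name `no_show` and the output file names in the comments are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned appointments table.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(0, 90, size=1000),
    "no_show": rng.random(1000) < 0.2,
})

# An integer test_size is an absolute number of rows (20_000 on the real data).
# stratify keeps the no_show class proportions equal in both parts.
train_df, test_df = train_test_split(
    df, test_size=200, stratify=df["no_show"], random_state=42
)

# Hypothetical file names inside the processed_data directory:
# train_df.to_csv("processed_data/train.csv", index=False)
# test_df.to_csv("processed_data/test.csv", index=False)
```

Fixing `random_state` makes the split reproducible, which matters because the test set must stay the same for the rest of the project.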

Plot the No-show variable against the other variables in the dataset as part of exploratory data analysis

Create a preprocessing pipeline using scikit-learn to prepare the data for the ML algorithms we will use. At a minimum, standardize numerical variables and transform categorical variables into one or more numerical values. You may apply other transformations that you think would be useful (e.g., logarithmic transformations).
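One way to assemble such a pipeline is with `ColumnTransformer`. The sketch below is an assumption-laden example: the column names (including the derived `days_waiting` feature) are hypothetical, and the choice to log-transform every numeric column before scaling is only one possible design.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Hypothetical column names; adapt them to the cleaned dataset.
numeric_cols = ["age", "days_waiting"]
categorical_cols = ["gender", "neighbourhood"]

# Numeric branch: log(1 + x) to compress long tails, then standardize.
numeric_pipe = Pipeline([
    ("log", FunctionTransformer(np.log1p)),
    ("scale", StandardScaler()),
])

preprocess = ColumnTransformer(
    [
        ("num", numeric_pipe, numeric_cols),
        # Categories unseen during fit are encoded as all zeros.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

X = pd.DataFrame({
    "age": [5, 34, 62, 80],
    "days_waiting": [0, 3, 10, 30],
    "gender": ["F", "M", "F", "F"],
    "neighbourhood": ["A", "B", "A", "C"],
})
Xt = preprocess.fit_transform(X)  # 2 numeric + 2 gender + 3 neighbourhood columns
```

Fitting the transformer on the training set only, and reusing it unchanged on the test set, avoids leaking test statistics into the scaler.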

Part II: Classification Methods

Here are the steps involved in this part:

Using scikit-learn, fit a DecisionTree, a RandomForest, a linear SVM, and an SVM with a radial basis function (RBF) kernel to the transformed data. For now, use default parameters for each method.

Use 10-fold cross-validation to estimate the performance of each of the above methods, using both accuracy and AUC as metrics.
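A possible way to run this comparison is `cross_validate` with two scorers. The sketch uses synthetic data from `make_classification` in place of the transformed appointments; the dictionary keys are arbitrary labels, and `random_state` is set only for reproducibility.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed training data.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

models = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
    "linear_svm": LinearSVC(),
    "rbf_svm": SVC(kernel="rbf"),
}

results = {}
for name, model in models.items():
    # cv=10 uses stratified 10-fold splits for classifiers.
    # roc_auc is computed from decision_function or predict_proba scores.
    cv = cross_validate(model, X, y, cv=10, scoring=["accuracy", "roc_auc"])
    results[name] = {
        "accuracy": cv["test_accuracy"].mean(),
        "auc": cv["test_roc_auc"].mean(),
    }
```

With imbalanced classes, as is likely for no-shows, AUC is usually the more informative of the two metrics, since a classifier that always predicts the majority class can still reach high accuracy.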

Based on the results above, choose two of the ML methods and fit a model for each, using 5-fold cross-validation for model selection and 10-fold cross-validation for model assessment.
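Combining an inner loop for selection with an outer loop for assessment is commonly called nested cross-validation. The sketch below shows one way to set it up, assuming a decision tree as the chosen method and an illustrative `max_depth` grid; neither choice comes from the project itself.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed training data.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Inner loop: 5-fold CV picks the best hyperparameters (model selection).
param_grid = {"max_depth": [2, 4, 8]}  # illustrative grid
inner = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, cv=5, scoring="roc_auc"
)

# Outer loop: 10-fold CV scores the whole selection procedure (model assessment).
outer_scores = cross_val_score(inner, X, y, cv=10, scoring="roc_auc")
mean_auc = outer_scores.mean()
```

Because the grid search is refit inside every outer fold, the outer scores are not biased by the hyperparameter tuning.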

Implement gradient descent for a linear SVM and test it on the training set.
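A minimal sketch of such an implementation is given below, assuming the L2-regularized hinge loss J(w, b) = (lam / 2) * ||w||^2 + mean(max(0, 1 - y * (X w + b))) with labels in {-1, +1}. Since the hinge loss is not differentiable at the margin, this is strictly subgradient descent. The function names, the hyperparameter defaults, and the synthetic two-blob data are all assumptions for illustration.

```python
import numpy as np


def fit_linear_svm(X, y, lam=0.01, lr=0.1, n_iter=500):
    """Full-batch subgradient descent on the regularized hinge loss; y must be -1/+1."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_iter):
        margins = y * (X @ w + b)
        active = margins < 1  # points that violate the margin
        # Subgradient: regularizer term minus the averaged contribution of violators.
        grad_w = lam * w - (X[active].T @ y[active]) / n
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def predict_linear_svm(X, w, b):
    """Predict -1/+1 from the sign of the decision function."""
    return np.where(X @ w + b >= 0, 1, -1)


# Two well-separated Gaussian blobs as a stand-in training set.
rng = np.random.default_rng(0)
X_train = np.vstack([
    rng.normal(2.0, 1.0, size=(100, 2)),
    rng.normal(-2.0, 1.0, size=(100, 2)),
])
y_train = np.concatenate([np.ones(100), -np.ones(100)])

w, b = fit_linear_svm(X_train, y_train)
train_accuracy = (predict_linear_svm(X_train, w, b) == y_train).mean()
```

On the appointments data, the No-show labels would first need to be mapped to -1 and +1, and the features should be standardized so that a single learning rate suits all of them.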

Part III: Ensembles

Here are the steps involved in this part:

Train an AdaBoost classifier and compare its performance to the results obtained in Part II, using 10-fold cross-validation as before

Train an XGBoost classifier and compare its performance to the results obtained in Part II

Choose a set of 5 or so classifiers, e.g., decision trees of diverse depths, linear SVMs over diverse subsets of features, RBF-kernel SVMs with diverse bandwidths, or random forests with diverse numbers of trees in their ensembles. Be creative! Write a function that, given a training set, does the following:

    Create a validation set using 20% of the training set
    Train each of your chosen classifiers on the training set
    Using the validation set, create a new dataset where the features are the predictions made by each of your chosen classifiers
    Train a logistic regression classifier to blend the predictions
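The four steps above can be sketched as a single function. This example makes several assumptions: the base classifiers are fit on the 80% that remains after the validation split, the split is stratified, hard class predictions (rather than probabilities) are used as blend features, and the names `fit_blender` and `blend_predict` as well as the particular base models are hypothetical.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier


def fit_blender(X, y, base_models, random_state=0):
    """Fit base models on 80% of the data and a logistic blender on the other 20%."""
    # Step 1: hold out 20% as a validation set (stratification is an assumption).
    X_fit, X_val, y_fit, y_val = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=random_state
    )
    # Step 2: train a fresh copy of each base classifier.
    fitted = [clone(m).fit(X_fit, y_fit) for m in base_models]
    # Step 3: one column of validation-set predictions per classifier.
    meta_X = np.column_stack([m.predict(X_val) for m in fitted])
    # Step 4: logistic regression learns how to weight the predictions.
    blender = LogisticRegression().fit(meta_X, y_val)
    return fitted, blender


def blend_predict(fitted, blender, X):
    """Predict by feeding the base models' predictions to the blender."""
    meta_X = np.column_stack([m.predict(X) for m in fitted])
    return blender.predict(meta_X)


# Illustrative set of five diverse classifiers.
base_models = [
    DecisionTreeClassifier(max_depth=2, random_state=0),
    DecisionTreeClassifier(max_depth=5, random_state=0),
    LinearSVC(),
    SVC(kernel="rbf", gamma=0.1),
    RandomForestClassifier(n_estimators=50, random_state=0),
]

X, y = make_classification(n_samples=400, n_features=8, random_state=0)
fitted, blender = fit_blender(X, y, base_models)
preds = blend_predict(fitted, blender, X)
```

Training the blender on held-out predictions, rather than on predictions for data the base models have already seen, keeps it from simply trusting whichever base model overfits the most.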

hassanshabbir1960 / classification-with-health-records