
Donors-for-Charity

In this project, several supervised learning algorithms are employed to model individuals' income using data collected from the 1994 U.S. Census. The best candidate algorithm is then chosen from the preliminary results and further optimized to best model the data. The goal is to construct a model that accurately predicts whether an individual makes more than $50,000 a year. This sort of task arises in a non-profit setting, where organizations survive on donations. Understanding an individual's income can help a non-profit decide how large a donation to request, or whether to reach out at all. While it can be difficult to determine an individual's income bracket directly from public sources, we can (as we will see) infer this value from other publicly available features.

The dataset for this project originates from the UCI Machine Learning Repository and was donated by Ron Kohavi and Barry Becker. The data investigated here contains small changes to the original dataset, such as the removal of the 'fnlwgt' feature and of records with missing or ill-formatted entries.

Data Exploration:

A cursory investigation of the dataset determines how many individuals fall into each income group and what percentage of individuals make more than $50,000.

  • Total number of records: 45222
  • Individuals making more than $50,000: 11208
  • Individuals making at most $50,000: 34014
  • Percentage of individuals making more than $50,000: 24.7844%
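These summary statistics can be computed directly from the cleaned dataset's income column. The sketch below uses a tiny illustrative DataFrame in place of the full census data (the real project would load the 45,222-record CSV instead):

```python
import pandas as pd

# Illustrative stand-in for the cleaned census data; in the project these
# counts come from the full 45,222-record dataset.
data = pd.DataFrame({
    "income": ["<=50K", ">50K", "<=50K", "<=50K",
               ">50K", "<=50K", "<=50K", "<=50K"]
})

n_records = len(data)                                  # total records
n_greater_50k = (data["income"] == ">50K").sum()       # earning > $50K
n_at_most_50k = (data["income"] == "<=50K").sum()      # earning <= $50K
greater_percent = 100 * n_greater_50k / n_records      # percentage > $50K

print(n_records, n_greater_50k, n_at_most_50k, round(greater_percent, 2))
```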

(a) Feature Observation:

  • age: continuous.
  • workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
  • education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
  • education-num: continuous.
  • marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
  • occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
  • relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
  • race: Black, White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other.
  • sex: Female, Male.
  • capital-gain: continuous.
  • capital-loss: continuous.
  • hours-per-week: continuous.
  • native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

Evaluating Model Performance

Naive Baseline:

(a) Naive Predictor Performance:

Naive Predictor: [Accuracy score: 0.2478, F-score: 0.2917]
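The naive predictor always predicts ">50K", so its accuracy equals the fraction of individuals who actually earn more than $50,000, its recall is 1, and the F-score follows from the F-beta formula with beta = 0.5 (weighting precision over recall):

```python
# Naive predictor: always predict ">50K".
n_records = 45222
n_greater_50k = 11208

accuracy = n_greater_50k / n_records   # also the precision of this model
recall = 1.0                           # every ">50K" record is "found"
beta = 0.5
fscore = (1 + beta**2) * (accuracy * recall) / (beta**2 * accuracy + recall)

print(round(accuracy, 4), round(fscore, 4))  # 0.2478 0.2917
```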

Supervised Algorithms:

(a) Decision Tree

(b) SVM

(c) Random Forest
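A minimal sketch of comparing the three candidates, using a synthetic stand-in for the preprocessed census data and the same F0.5 metric as above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, fbeta_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed census data.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

results = {}
for clf in (DecisionTreeClassifier(random_state=0),
            SVC(random_state=0),
            RandomForestClassifier(random_state=0)):
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    results[type(clf).__name__] = (accuracy_score(y_test, pred),
                                   fbeta_score(y_test, pred, beta=0.5))

for name, (acc, f05) in results.items():
    print(f"{name}: accuracy={acc:.4f}, F0.5={f05:.4f}")
```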


Improving Results

The Random Forest classifier was chosen over SVM and Decision Tree for the following reasons:

  1. Test accuracy
  2. F-score on the test set at 100% of the training data
  3. Prediction and training time

At 100% of the training data, SVM's test F-score is only ~0.2% better than Random Forest's, and the test accuracy is the same at 83.77%. Taking the time complexity of training and prediction into account, however, Random Forest is much faster than SVM.

Unoptimized model:

Accuracy score on testing data: 0.8392

F-score on testing data: 0.6755

Optimized Model:

Final accuracy score on the testing data: 0.8489

Final F-score on the testing data: 0.7129
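The optimization step is typically done with a grid search over a few Random Forest hyperparameters, scored with F0.5. A sketch under those assumptions, again on synthetic data (the parameter grid here is illustrative, not the project's exact grid):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the preprocessed census data.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Tune a couple of hyperparameters with F0.5 as the selection metric.
scorer = make_scorer(fbeta_score, beta=0.5)
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [4, None]},
    scoring=scorer,
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```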


Feature Importance

Feature Relevance Observation:

In my opinion, the following five features seem most important for this project:

  • occupation - Different jobs have different payscales. Some jobs pay higher than others.
  • education - People who have completed a higher level of education are better equipped to handle more technical/specialized jobs that pay well.
  • age - As people get older, they accumulate greater wealth.
  • workclass - The class of work an individual belongs to can also correlate with how much they make.
  • hours-per-week - If you work more hours per week, you're likely to earn more.

Income varies mostly with these five factors, since different occupations have different pay scales. Level of education combined with work experience has a big impact on pay for each role. Age is again a critical factor, along with workclass and hours-per-week.

Extracting Feature Importance:

Of the five features chosen above, age, hours-per-week, and education-num (a numerical label for education) appear in the list of features the Random Forest considers most important, although with different rankings. The other two important features, capital-gain (profit from the sale of property or assets) and marital-status_Married-civ-spouse (a relationship status that implies family dependents), contribute to the predictions as well. On reflection, it makes sense that these features play an important role in predicting an individual's income.
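A fitted Random Forest exposes these rankings through its `feature_importances_` attribute; the sketch below extracts the top five on synthetic data (feature names are placeholders):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the preprocessed census data.
X, y = make_classification(n_samples=300, n_features=6,
                           n_informative=3, random_state=0)
names = [f"feat_{i}" for i in range(X.shape[1])]  # placeholder names

model = RandomForestClassifier(random_state=0).fit(X, y)
importances = model.feature_importances_          # sums to 1.0

# Indices of the five most important features, highest first.
top5 = np.argsort(importances)[::-1][:5]
for i in top5:
    print(names[i], round(importances[i], 4))
```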

Feature Selection:

Final Model trained on full data


Accuracy on testing data: 0.8489

F-score on testing data: 0.7129

Final Model trained on reduced data


Accuracy on testing data: 0.8444

F-score on testing data: 0.7014
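The reduced-data experiment amounts to keeping only the five most important columns and retraining the tuned model; a sketch of that comparison on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed census data.
X, y = make_classification(n_samples=400, n_features=10,
                           n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Model trained on all features.
full = RandomForestClassifier(random_state=0).fit(X_train, y_train)
acc_full = accuracy_score(y_test, full.predict(X_test))

# Retrain on only the five most important features.
top5 = np.argsort(full.feature_importances_)[::-1][:5]
reduced = RandomForestClassifier(random_state=0).fit(X_train[:, top5], y_train)
acc_reduced = accuracy_score(y_test, reduced.predict(X_test[:, top5]))

print(f"full: {acc_full:.4f}, reduced: {acc_reduced:.4f}")
```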

The final model trained on the full data has a better test accuracy (84.89%) and F-score (0.7129) than the model trained on the reduced data (84.44% and 0.7014, respectively). Still, the performance of the reduced model is quite good considering that only five features were used to build it. With a much larger dataset, a reduced model would be very effective given its training- and prediction-time advantage. Since this dataset is not that large, however, the final model trained on the full data is preferred.

Contributors

geekquad
