
Donors-for-Charity

In this project, several supervised learning algorithms are employed to model individuals' income using data collected from the 1994 U.S. Census. The best candidate algorithm is then chosen from the preliminary results and further optimized to best model the data. The goal is to construct a model that accurately predicts whether an individual makes more than $50,000 a year. This sort of task arises in a non-profit setting, where organizations survive on donations. Understanding an individual's income can help a non-profit decide how large a donation to request, or whether to reach out at all. While it can be difficult to determine an individual's income bracket directly from public sources, we can (as we will see) infer this value from other publicly available features.

The dataset for this project originates from the UCI Machine Learning Repository and was donated by Ron Kohavi and Barry Becker. The data investigated here contains small changes to the original dataset, such as the removal of the 'fnlwgt' feature and of records with missing or ill-formatted entries.

Data Exploration:

A cursory investigation of the dataset determines how many individuals fall into each income group and what percentage of individuals make more than $50,000.

  • Total number of records: 45222
  • Individuals making more than $50,000: 11208
  • Individuals making at most $50,000: 34014
  • Percentage of individuals making more than $50,000: 24.7844%
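These summary statistics can be computed directly from the cleaned dataset's income column. The sketch below uses a tiny illustrative DataFrame in place of the full census data (the real project would load the 45,222-record CSV instead):

```python
import pandas as pd

# Illustrative stand-in for the cleaned census data; in the project these
# counts come from the full 45,222-record dataset.
data = pd.DataFrame({
    "income": ["<=50K", ">50K", "<=50K", "<=50K",
               ">50K", "<=50K", "<=50K", "<=50K"]
})

n_records = len(data)                                  # total records
n_greater_50k = (data["income"] == ">50K").sum()       # earning > $50K
n_at_most_50k = (data["income"] == "<=50K").sum()      # earning <= $50K
greater_percent = 100 * n_greater_50k / n_records      # percentage > $50K

print(n_records, n_greater_50k, n_at_most_50k, round(greater_percent, 2))
```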

(a) Feature Observation:

  • age: continuous.
  • workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
  • education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
  • education-num: continuous.
  • marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
  • occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
  • relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
  • race: Black, White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other.
  • sex: Female, Male.
  • capital-gain: continuous.
  • capital-loss: continuous.
  • hours-per-week: continuous.
  • native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

Evaluating Model Performance

Naive Baseline:

(a) Naive Predictor Performance:

Naive Predictor: [Accuracy score: 0.2478, F-score: 0.2917]
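The naive predictor always predicts ">50K", so its accuracy equals the fraction of individuals who actually earn more than $50,000, its recall is 1, and the F-score follows from the F-beta formula with beta = 0.5 (weighting precision over recall):

```python
# Naive predictor: always predict ">50K".
n_records = 45222
n_greater_50k = 11208

accuracy = n_greater_50k / n_records   # also the precision of this model
recall = 1.0                           # every ">50K" record is "found"
beta = 0.5
fscore = (1 + beta**2) * (accuracy * recall) / (beta**2 * accuracy + recall)

print(round(accuracy, 4), round(fscore, 4))  # 0.2478 0.2917
```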

Supervised Algorithms:

(a) Decision Tree

(b) SVM

(c) Random Forest
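A minimal sketch of comparing the three candidates, using a synthetic stand-in for the preprocessed census data and the same F0.5 metric as above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, fbeta_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed census data.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

results = {}
for clf in (DecisionTreeClassifier(random_state=0),
            SVC(random_state=0),
            RandomForestClassifier(random_state=0)):
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    results[type(clf).__name__] = (accuracy_score(y_test, pred),
                                   fbeta_score(y_test, pred, beta=0.5))

for name, (acc, f05) in results.items():
    print(f"{name}: accuracy={acc:.4f}, F0.5={f05:.4f}")
```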


Improving Results

The Random Forest classifier was chosen over SVM and Decision Tree for the following reasons:

  1. Test accuracy
  2. F-score on the test set at 100% of the training data
  3. Prediction and training time

At 100% of the training data, SVM's test F-score is only ~0.2% better than Random Forest's, and the test accuracy is the same at 83.77%. Taking the time complexity of training and prediction into account, however, Random Forest is much faster than SVM.

Unoptimized model:

Accuracy score on testing data: 0.8392

F-score on testing data: 0.6755

Optimized Model:

Final accuracy score on the testing data: 0.8489

Final F-score on the testing data: 0.7129
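The optimization step is typically done with a grid search over a few Random Forest hyperparameters, scored with F0.5. A sketch under those assumptions, again on synthetic data (the parameter grid here is illustrative, not the project's exact grid):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the preprocessed census data.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Tune a couple of hyperparameters with F0.5 as the selection metric.
scorer = make_scorer(fbeta_score, beta=0.5)
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [4, None]},
    scoring=scorer,
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```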


Feature Importance

Feature Relevance Observation:

In my opinion, the following five features seem most important for this project:

  • occupation - Different jobs have different payscales. Some jobs pay higher than others.
  • education - People who have completed a higher level of education are better equipped to handle more technical/specialized jobs that pay well.
  • age - As people get older, they accumulate greater wealth.
  • workclass - The class of work an individual belongs to can also correlate with how much they make.
  • hours-per-week - If you work more hours per week, you're likely to earn more.

Income varies mostly with these five factors, since different occupations have different pay scales. Level of education combined with work experience has a big impact on pay for each role. Age is again a critical factor, along with workclass and hours-per-week.

Extracting Feature Importance:

Of the five features chosen above, age, hours-per-week, and education-num (a numerical label for education) appear in the list of features the Random Forest considers most important, although with different rankings. The other two important features, capital-gain (profit from the sale of property or assets) and marital-status_Married-civ-spouse (a relationship status that implies family dependents), contribute to the predictions as well. On reflection, it makes sense that these features play an important role in predicting an individual's income.
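A fitted Random Forest exposes these rankings through its `feature_importances_` attribute; the sketch below extracts the top five on synthetic data (feature names are placeholders):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the preprocessed census data.
X, y = make_classification(n_samples=300, n_features=6,
                           n_informative=3, random_state=0)
names = [f"feat_{i}" for i in range(X.shape[1])]  # placeholder names

model = RandomForestClassifier(random_state=0).fit(X, y)
importances = model.feature_importances_          # sums to 1.0

# Indices of the five most important features, highest first.
top5 = np.argsort(importances)[::-1][:5]
for i in top5:
    print(names[i], round(importances[i], 4))
```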

Feature Selection:

Final Model trained on full data


Accuracy on testing data: 0.8489

F-score on testing data: 0.7129

Final Model trained on reduced data


Accuracy on testing data: 0.8444

F-score on testing data: 0.7014
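The reduced-data experiment amounts to keeping only the five most important columns and retraining the tuned model; a sketch of that comparison on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed census data.
X, y = make_classification(n_samples=400, n_features=10,
                           n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Model trained on all features.
full = RandomForestClassifier(random_state=0).fit(X_train, y_train)
acc_full = accuracy_score(y_test, full.predict(X_test))

# Retrain on only the five most important features.
top5 = np.argsort(full.feature_importances_)[::-1][:5]
reduced = RandomForestClassifier(random_state=0).fit(X_train[:, top5], y_train)
acc_reduced = accuracy_score(y_test, reduced.predict(X_test[:, top5]))

print(f"full: {acc_full:.4f}, reduced: {acc_reduced:.4f}")
```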

The final model trained on the full data has a better test accuracy (84.89%) and F-score (0.7129) than the model trained on the reduced data (84.44% and 0.7014, respectively). Still, the performance of the reduced model is quite good considering that only five features were used to build it. With a much larger dataset, a reduced model would be very effective given its training- and prediction-time advantage. Since this dataset is not that large, however, the final model trained on the full data is preferred.

Contributors

geekquad
