ds_salary_proj's Introduction

Data Science Salary Estimator: Project Overview

  • Created a tool that estimates data science salaries (MAE ~ $11K) to help data scientists negotiate their income when they get a job.
  • Scraped over 1000 job descriptions from Glassdoor using Python and Selenium.
  • Engineered features from the text of each job description to quantify the value companies put on Python, Excel, AWS, and Spark.
  • Optimized Linear, Lasso, and Random Forest Regressors using GridSearchCV to reach the best model.
  • Built a client-facing API using Flask.

Code and Resources Used

Python Version: 3.7
Packages: pandas, numpy, sklearn, matplotlib, seaborn, selenium, flask, json, pickle
For Web Framework Requirements: pip install -r requirements.txt
Scraper Github: https://github.com/arapfaik/scraping-glassdoor-selenium
Scraper Article: https://towardsdatascience.com/selenium-tutorial-scraping-glassdoor-com-in-10-minutes-3d0915c6d905
Flask Productionization: https://towardsdatascience.com/productionize-a-machine-learning-model-with-flask-and-heroku-8201260503d2

YouTube Project Walk-Through

https://www.youtube.com/playlist?list=PL2zq7klxX5ASFejJj80ob9ZAnBHdz5O1t

Web Scraping

Tweaked the web scraper GitHub repo (above) to scrape 1000 job postings from glassdoor.com. For each job, we collected the following:

  • Job title
  • Salary Estimate
  • Job Description
  • Rating
  • Company
  • Location
  • Company Headquarters
  • Company Size
  • Company Founded Date
  • Type of Ownership
  • Industry
  • Sector
  • Revenue
  • Competitors

Data Cleaning

After scraping the data, I needed to clean it up so that it was usable for our model. I made the following changes and created the following variables:

  • Parsed numeric data out of salary
  • Made columns for employer provided salary and hourly wages
  • Removed rows without salary
  • Parsed rating out of company text
  • Made a new column for company state
  • Added a column for whether the job was at the company’s headquarters
  • Transformed founded date into age of company
  • Made columns for whether different skills were listed in the job description:
    • Python
    • R
    • Excel
    • AWS
    • Spark
  • Columns for simplified job title and seniority
  • Column for description length
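A few of the cleaning steps above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual cleaning script: the function names, the output keys other than `avg_salary`, and the assumed input shapes (salary text like "$53K-$91K (Glassdoor est.)", locations like "City, ST") are assumptions made for the example.

```python
import re


def parse_salary(estimate):
    """Parse a raw salary string into numeric fields (values in $K, or $/hour).

    Assumes text shaped like "$53K-$91K (Glassdoor est.)". Returns None when
    no range is found, so the caller can drop the row.
    """
    text = estimate.lower()
    numbers = [int(n) for n in re.findall(r"\$(\d+)", text)]
    if len(numbers) < 2:
        return None
    low, high = numbers[:2]
    return {
        "min_salary": low,
        "max_salary": high,
        "avg_salary": (low + high) / 2,
        "hourly": int("per hour" in text),
        "employer_provided": int("employer provided salary" in text),
    }


def skill_flags(description):
    """Return 1/0 flags for whether each skill appears as a whole word.

    Whole-word matching keeps "excellent" from counting as Excel. The
    single-letter pattern for R is crude and will also match text like "R&D".
    """
    text = description.lower()
    skills = ["python", "r", "excel", "aws", "spark"]
    return {s: int(re.search(rf"\b{s}\b", text) is not None) for s in skills}


def company_state(location):
    """Take the part after the last comma of "City, ST" as the state."""
    return location.split(",")[-1].strip()


def company_age(founded_year, current_year):
    """Convert a founding year into an age; None when the year is missing."""
    if founded_year is None or founded_year < 1:
        return None
    return current_year - founded_year
```

For example, `parse_salary("$17-$24 Per Hour(Glassdoor est.)")` sets `hourly` to 1 and leaves the numbers as hourly figures; any conversion to an annual salary would be a separate step.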

EDA

I looked at the distributions of the data and the value counts for the various categorical variables. Below are a few highlights from the pivot tables.

[Figures: three images from the original README, not reproduced here]

Model Building

First, I transformed the categorical variables into dummy variables. I also split the data into train and test sets with a test size of 20%.
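As a rough sketch of that preparation step, the snippet below one-hot encodes a categorical column with `pandas.get_dummies` and holds out 20% of the rows with scikit-learn's `train_test_split`. The rows are invented toy data, and the column names `job_simp` and `python_yn` are hypothetical; only `avg_salary` is a column name mentioned for this project.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented toy rows; the real dataset has many more columns and records.
df = pd.DataFrame({
    "avg_salary": [72.0, 87.5, 85.0, 76.5, 114.5, 95.0, 73.5, 114.0, 61.0, 140.0],
    "job_simp": ["data scientist", "analyst", "data scientist", "analyst",
                 "data engineer", "data scientist", "analyst", "data engineer",
                 "analyst", "data scientist"],
    "python_yn": [1, 0, 1, 0, 1, 1, 0, 1, 0, 1],
})

# Categorical (text) columns become one 0/1 column per category;
# numeric columns pass through unchanged.
X = pd.get_dummies(df.drop(columns="avg_salary"))
y = df["avg_salary"]

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```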

I evaluated the models using Mean Absolute Error (MAE). I chose MAE because it is relatively easy to interpret and outliers aren’t particularly bad for this type of model.
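To make the metric concrete, here is a small worked example with made-up salaries in thousands of dollars. MAE is the average of the absolute errors, so it reads directly as "off by about this many thousand dollars". RMSE is included only for contrast, since it squares the errors and therefore weights large misses more heavily.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of the average squared error; penalizes large misses more."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Made-up values in $K: absolute errors are 8, 2.5, 14, and 9.
actual = [72.0, 87.5, 114.0, 61.0]
predicted = [80.0, 85.0, 100.0, 70.0]

mae = mean_absolute_error(actual, predicted)       # (8 + 2.5 + 14 + 9) / 4 = 8.375
rmse = root_mean_squared_error(actual, predicted)  # larger than MAE here
```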

I tried three different models:

  • Multiple Linear Regression – Baseline for the model
  • Lasso Regression – Because of the sparse data from the many categorical variables, I thought a regularized regression like lasso would be effective.
  • Random Forest – Again, with the sparsity associated with the data, I thought that this would be a good fit.
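The comparison of these three models can be sketched with scikit-learn's `GridSearchCV`, scoring each candidate by negative MAE. This is an illustration under assumptions rather than the project's tuning code: the data is synthetic, and the parameter grids, the 3-fold cross-validation, and the random seeds are choices made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the dummy-encoded job data: 8 binary features,
# only some of which affect the (made-up) salary in $K.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 8)).astype(float)
coef = np.array([20, 10, 0, 0, 5, 0, 0, 15], dtype=float)
y = 60 + X @ coef + rng.normal(0, 5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Each candidate pairs an estimator with an illustrative parameter grid.
candidates = {
    "linear": (LinearRegression(), {"fit_intercept": [True]}),
    "lasso": (Lasso(max_iter=10000), {"alpha": [0.01, 0.1, 1.0]}),
    "random_forest": (
        RandomForestRegressor(random_state=42),
        {"n_estimators": [50, 100], "max_features": [0.5, 1.0]},
    ),
}

test_mae = {}
for name, (estimator, grid) in candidates.items():
    # scikit-learn maximizes scores, so MAE is passed in negated form.
    search = GridSearchCV(estimator, grid, scoring="neg_mean_absolute_error", cv=3)
    search.fit(X_train, y_train)
    # By default the search refits the best setting on the full training set.
    test_mae[name] = mean_absolute_error(y_test, search.predict(X_test))
```

On synthetic linear data like this the ranking need not match the project's results; the point is only the shape of the tuning loop.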

Model performance

The Random Forest model far outperformed the other approaches on the test and validation sets (MAE in thousands of dollars).

  • Random Forest: MAE = 11.22
  • Linear Regression: MAE = 18.86
  • Lasso Regression: MAE = 19.67

Productionization

In this step, I built a Flask API endpoint, hosted on a local web server, by following along with the TDS tutorial in the reference section above. The API endpoint takes in a request with a list of values from a job listing and returns an estimated salary.
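A minimal sketch of such an endpoint follows. The route name `/predict`, the JSON keys `input` and `response`, and the `PlaceholderModel` class are all assumptions for illustration; in the real project a trained model loaded from a pickle file would take the placeholder's place.

```python
from flask import Flask, jsonify, request


class PlaceholderModel:
    """Hypothetical stand-in for the trained regressor loaded from a pickle."""

    def predict(self, rows):
        # Returns a constant so the sketch runs without a trained model.
        return [100.0 for _ in rows]


app = Flask(__name__)
model = PlaceholderModel()


@app.route("/predict", methods=["POST"])
def predict():
    """Accept JSON like {"input": [v1, v2, ...]} and return a salary estimate."""
    payload = request.get_json(silent=True) or {}
    features = payload.get("input")
    if not isinstance(features, list) or not features:
        return jsonify({"error": "expected a JSON body with a non-empty 'input' list"}), 400
    # The model expects a 2D structure: one row per job listing.
    prediction = model.predict([features])[0]
    return jsonify({"response": float(prediction)})
```

The endpoint can be exercised without starting a server through Flask's built-in test client, for example `app.test_client().post("/predict", json={"input": [1, 0, 3.5]})`. To serve it locally, the module can be started with the `flask run` command.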

ds_salary_proj's Issues

Regarding Average Salary

Hello Ken,

I followed your YouTube video on this project and I have a question about the salary column. When we created the avg_salary column, we took the average of the min and max salary for all records, including the records where the salary estimate is given per hour. But when we convert hourly to annual in the EDA notebook, we don't update the avg_salary column. Would it make any difference, or am I mistaken? Please let me know. @PlayingNumbers

Data Scraping

When I run data_collection.py the chrome window opens but I get an error:

NoSuchElementException: no such element: Unable to locate element: {"method":"css selector","selector":".selected"}
(Session info: chrome=91.0.4472.77)

StaleElementReferenceException: stale element reference: element is not attached to the page document

Hi,
When I run the Glassdoor scraper code, I get the error below in the Spyder console.
StaleElementReferenceException: stale element reference: element is not attached to the page document
(Session info: chrome=87.0.4280.88)
I tried adding a delay after every job list click, but I still get this error.
I suspect recent updates to the Glassdoor website are causing this, but I have not been able to solve it.
Please help me resolve this issue.

Updated the project as per 2023 requirements

Hey! My name is Anurag, and I am a fresher and a data science enthusiast, still in the intermediate stages of learning this amazing field.
I needed to learn how to deploy models using API endpoints and build an end-to-end project with them.
As I was searching through cheap courses because I was on a budget, I came across Ken Jee's channel, and man, he is awesome. He provides a detailed end-to-end project with a very comprehensive approach that even beginners can understand. Gotta respect that.

The project is almost four years old and needed many tweaks and changes in the code, so I updated it as per 2023 requirements. I hope this helps someone. All credit goes to Ken Jee, but I would be glad if I was of a little help.

I had updated the glassdoor_scraper.

Hi, I'm just copying Ken Jee's project because I'm new to data science.

I couldn't run glassdoor_scraper, so I tried to tweak it.
I was successful: I have updated it for the current Glassdoor page.
https://github.com/echestare/001KenJeeFromScratch_DSSalary

Please take into account that I have not used Selenium before and I don't know anything about JavaScript, HTML, CSS, and so on. So the code could be a little messy, but it works.

Thanks, Ken Jee!!!!

I'm looking at the pull request tab. I think I should put this there.
I still don't know how to do it. I'm still new here on GitHub. Sorry.
I will do it.
