playingnumbers / ds_salary_proj

Repo for the data science salary prediction project from the Data Science Project From Scratch video series on my YouTube channel

Data Science Salary Estimator: Project Overview

Created a tool that estimates data science salaries (MAE ~ $11K) to help data scientists negotiate their income when they get a job.
Scraped over 1000 job descriptions from Glassdoor using Python and Selenium.
Engineered features from the text of each job description to quantify the value companies put on Python, Excel, AWS, and Spark.
Optimized Linear, Lasso, and Random Forest Regressors using GridSearchCV to reach the best model.
Built a client-facing API using Flask.

Code and Resources Used

Python Version: 3.7
Packages: pandas, numpy, sklearn, matplotlib, seaborn, selenium, flask, json, pickle
For Web Framework Requirements: pip install -r requirements.txt
Scraper Github: https://github.com/arapfaik/scraping-glassdoor-selenium
Scraper Article: https://towardsdatascience.com/selenium-tutorial-scraping-glassdoor-com-in-10-minutes-3d0915c6d905
Flask Productionization: https://towardsdatascience.com/productionize-a-machine-learning-model-with-flask-and-heroku-8201260503d2

YouTube Project Walk-Through

https://www.youtube.com/playlist?list=PL2zq7klxX5ASFejJj80ob9ZAnBHdz5O1t

Web Scraping

Tweaked the web scraper GitHub repo (above) to scrape 1000 job postings from glassdoor.com. For each job, we collected the following:

Job title
Salary Estimate
Job Description
Rating
Company
Location
Company Headquarters
Company Size
Company Founded Date
Type of Ownership
Industry
Sector
Revenue
Competitors

Data Cleaning

After scraping the data, I needed to clean it up so that it was usable for our model. I made the following changes and created the following variables:

Parsed numeric data out of salary
Made columns for employer provided salary and hourly wages
Removed rows without salary
Parsed rating out of company text
Made a new column for company state
Added a column for whether the job was at the company’s headquarters
Transformed founded date into age of company
Made columns for whether different skills were listed in the job description:
- Python
- R
- Excel
- AWS
- Spark
Column for simplified job title and Seniority
Column for description length
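
A few of the cleaning steps above can be sketched with the standard library alone. This is a minimal illustration, not the code from the repo: the function names, the sample salary strings, and the convention that a missing value is stored as -1 are all assumptions made for the example.

```python
import re

# Skills to flag in each job description (taken from the list above)
SKILLS = ("python", "r", "excel", "aws", "spark")


def parse_salary(text):
    """Parse a salary string into a dict of figures, or None if no salary is listed."""
    # Assumption: a missing salary estimate is stored as "-1"
    if text.strip() == "-1":
        return None
    lowered = text.lower()
    hourly = int("per hour" in lowered)
    employer_provided = int("employer provided salary" in lowered)
    # Drop any trailing parenthetical such as "(Glassdoor est.)", then pull out the digits
    numbers = [int(n) for n in re.findall(r"\d+", lowered.split("(")[0])]
    if not numbers:
        return None
    low, high = numbers[0], numbers[-1]
    return {
        "min": low,
        "max": high,
        "avg": (low + high) / 2,
        "hourly": hourly,
        "employer_provided": employer_provided,
    }


def skill_flags(description):
    """Return a 0/1 flag per skill, matching whole words only."""
    text = description.lower()
    # Word boundaries keep "excel" from matching "excellent"; the single-letter
    # match for "r" is still crude and can produce false positives
    return {skill: int(re.search(rf"\b{skill}\b", text) is not None) for skill in SKILLS}


def company_age(founded, current_year):
    """Convert a founding year into an age; assumption: missing years are stored as -1."""
    return current_year - founded if founded > 0 else -1
```

For example, `parse_salary("$53K-$91K (Glassdoor est.)")` yields a minimum of 53, a maximum of 91, and an average of 72.0, all in thousands of dollars.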

EDA

I looked at the distributions of the data and the value counts for the various categorical variables. Below are a few highlights from the pivot tables.

Model Building

First, I transformed the categorical variables into dummy variables. I also split the data into train and test sets with a test size of 20%.
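
The dummy-variable and split steps might look roughly like the sketch below. The toy DataFrame and its column names are hypothetical stand-ins for the cleaned data, and `random_state=42` is an arbitrary choice for reproducibility.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the cleaned data; column names and values are hypothetical
df = pd.DataFrame({
    "avg_salary": [72.0, 87.5, 85.0, 76.5, 114.5, 95.0, 73.5, 114.0, 61.0, 140.0],
    "rating": [3.8, 3.4, 4.8, 3.8, 2.9, 3.4, 4.1, 3.8, 3.3, 4.6],
    "job_state": ["NM", "MD", "FL", "WA", "NY", "TX", "MD", "CA", "NY", "CA"],
    "python_yn": [1, 1, 1, 1, 1, 1, 0, 1, 0, 1],
})

# One-hot encode the categorical columns (here only job_state)
df_dum = pd.get_dummies(df)

# Separate the features from the target
X = df_dum.drop("avg_salary", axis=1)
y = df_dum["avg_salary"].values

# Hold out 20% of the rows as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```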

I evaluated the models using Mean Absolute Error (MAE). I chose MAE because it is relatively easy to interpret and outliers aren’t particularly bad for this type of model.
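
As a reminder of what the metric measures, MAE is the average of the absolute differences between actual and predicted values, so it is expressed in the same units as the salary itself. A minimal plain-Python sketch (the sample numbers are invented for illustration):

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Invented salaries in thousands of dollars
actual = [100, 80, 120]
predicted = [110, 75, 120]
mae = mean_absolute_error(actual, predicted)  # (10 + 5 + 0) / 3
```

Here `mae` is 5.0, meaning the predictions are off by $5K on average.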

I tried three different models:

Multiple Linear Regression – Baseline for the model
Lasso Regression – Because of the sparse data from the many categorical variables, I thought a regularized regression like lasso would be effective.
Random Forest – Again, with the sparsity associated with the data, I thought that this would be a good fit.
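
Comparing the three models and tuning one of them with GridSearchCV could be sketched as follows. The synthetic data, the hyperparameter grid, and the 3-fold cross-validation are assumptions chosen to keep the example small; they are not the settings used in the project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in for the training data (assumption, not the scraped dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = 100 + 20 * X[:, 0] - 10 * X[:, 1] + rng.normal(scale=5, size=120)

models = {
    "linear": LinearRegression(),
    "lasso": Lasso(alpha=0.1),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# scikit-learn reports negated MAE so that higher is better; flip the sign back
cv_mae = {
    name: -cross_val_score(model, X, y, scoring="neg_mean_absolute_error", cv=3).mean()
    for name, model in models.items()
}

# Small illustrative grid for the random forest
param_grid = {"n_estimators": [20, 50], "max_features": ["sqrt", 1.0]}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=3,
)
search.fit(X, y)
best_mae = -search.best_score_
best_params = search.best_params_
```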

Model performance

The Random Forest model far outperformed the other approaches on the test and validation sets.

Random Forest: MAE = 11.22
Linear Regression: MAE = 18.86
Lasso Regression: MAE = 19.67

Productionization

In this step, I built a Flask API endpoint that was hosted on a local web server by following along with the TDS tutorial in the reference section above. The API endpoint takes in a request with a list of values from a job listing and returns an estimated salary.
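
An endpoint of this shape can be sketched as below. The `/predict` route, the `input` and `response` JSON keys, and the `StubModel` class are hypothetical; in the real project the stub would be replaced by the trained model loaded from a pickle file.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


class StubModel:
    """Hypothetical stand-in for the pickled regressor; returns the mean of each row."""

    def predict(self, rows):
        return [sum(row) / len(row) for row in rows]


model = StubModel()


@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body such as {"input": [1, 2, 3]}
    payload = request.get_json(silent=True) or {}
    values = payload.get("input")
    if not isinstance(values, list) or not values:
        return jsonify({"error": "expected a non-empty list under 'input'"}), 400
    # The model expects a 2D structure: one row per job listing
    prediction = model.predict([values])[0]
    return jsonify({"response": float(prediction)})


# Exercise the endpoint with Flask's built-in test client, no server needed
with app.test_client() as client:
    reply = client.post("/predict", json={"input": [1, 2, 3]})
    print(reply.get_json())
```

The test client keeps the example self-contained; to serve the endpoint locally, the app could instead be started with the `flask run` command.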

ds_salary_proj's Issues

Regarding Average Salary

Hello Ken,

I followed your YouTube video on this project and I have a doubt regarding the salary column. When we created the avg_salary column in the data, we took the average of the min and max salary for all records, including the records where the salary estimate is given per hour. But when we convert hourly to annual in the EDA notebook, we don't update the avg_salary column. Would it make any difference or am I mistaken? Please let me know. @PlayingNumbers
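
The concern can be illustrated with a small arithmetic sketch. It assumes 2,000 working hours per year and salaries expressed in thousands of dollars; both are assumptions for the example, and the hourly figures are invented.

```python
HOURS_PER_YEAR = 2000  # assumption: roughly 40 hours a week for 50 weeks


def to_annual_k(hourly_rate):
    """Convert an hourly rate in dollars to an annual salary in thousands of dollars."""
    return hourly_rate * HOURS_PER_YEAR / 1000


# Invented hourly posting: $17-$24 per hour
min_hourly, max_hourly = 17, 24

# Average computed before the conversion (still in dollars per hour)
stale_avg = (min_hourly + max_hourly) / 2

# Convert min and max to annual figures, then recompute the average
min_annual = to_annual_k(min_hourly)
max_annual = to_annual_k(max_hourly)
fresh_avg = (min_annual + max_annual) / 2
```

Under these assumptions `stale_avg` is 20.5 while `fresh_avg` is 41.0, so an average left over from before the conversion would understate the salary for hourly rows.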

glassdoor scraping file error

After running the data collection.py file, I am getting an error.

Can you help me?

Data Scraping

When I run data_collection.py the chrome window opens but I get an error:

NoSuchElementException: no such element: Unable to locate element: {"method":"css selector","selector":".selected"}
(Session info: chrome=91.0.4472.77)

StaleElementReferenceException: stale element reference: element is not attached to the page document

Hi,
When I run the Glassdoor scraper code, I get the error below in the Spyder console.
StaleElementReferenceException: stale element reference: element is not attached to the page document
(Session info: chrome=87.0.4280.88)
I tried adding a delay after every job listing click, but I still get this error.
I suspect recent updates to the Glassdoor website are causing it, but I have not been able to solve it.
Please help me resolve this issue.

Updated the project as per 2023 requirements

Hey! My name is Anurag, and I am a fresher and a data science enthusiast, still in the intermediate stages of learning this amazing field.
I needed to learn how to deploy models using API endpoints and build an end-to-end project with them.
As I was searching through cheap courses because I was on a budget, I came across Ken Jee's channel, and man, he is awesome. He provides a detailed end-to-end project with a very comprehensive approach that even beginners can understand. Gotta respect that.

The project is almost four years old, so the code needed many tweaks and changes, and I updated it as per 2023 requirements. I hope this helps someone. All credit goes to Ken Jee, but I would be glad if I was of a little help...

I had updated the glassdoor_scraper.

Hi, I'm just copying Ken Jee's project because I'm new to data science.

I couldn't run glassdoor_scraper, so I tried to tweak it.
I was successful and have updated it for the current Glassdoor page.
https://github.com/echestare/001KenJeeFromScratch_DSSalary

Please take into account that I have not used Selenium before and I don't know anything about JavaScript, HTML, CSS, and so on, so the code could be a little messy... but it works.

Thanks, Ken Jee!!!!

I'm looking at the pull request tab. I think I should put this there.
I still don't know how to do it. I'm still new here on GitHub. Sorry.
I will do it.