anouk2311 / indeed-job-listings

This repository contains the entire workflow for our Online Data Collection & Management and Data Preparation & Workflow Management group projects (group 3).

Jupyter Notebook 17.67% Makefile 6.90% R 75.43%

data-analist data-scientists job-market marketeer marketing-analist netherlands

indeed-job-listings's Introduction

print("Hello, world!") 👋🌍

I'm Anouk!

🎓 Student MSc Marketing Analytics and MSc Data Science & Society at Tilburg University
🏫 Current courses:
- Interactive Data Transformation
- Social Media and Web Analytics
- Experimental Research
🌱 Currently teaching myself:
- Tableau
- Building Shiny apps
- Norwegian 🇳🇴

indeed-job-listings's People

Contributors

Stargazers

Watchers

Forkers

reneen1998 georgianahutanu alantjee

indeed-job-listings's Issues

Error marketeer data

Hi! I was checking for that error in line 5 of the marketeer data, and I think there is something wrong with the separator. Not sure why, because we did the same with this data set as we did with the others.

Fix last small issues/data cleaning

Finish makefile and do a last check if everything runs

Data cleaning dirty location goes wrong

Hi! I was checking the data_clean.R code, and our remove_dirty_location also removes 'Noord' from Noord-Holland and Noordwijk, so they end up as '-holland' and 'wijk'. So maybe we should not delete the words first, but directly replace the whole string?

E.g. so replace Amsterdam-noord by Amsterdam, instead of removing noord everywhere.
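The whole-string replacement suggested above could look roughly like the sketch below. It is written in Python (the scraper's language) purely for illustration, since the actual cleaning code lives in R. The lookup table `LOCATION_FIXES` and the function `clean_location` are hypothetical names, not part of the repository.

```python
# Hypothetical lookup table: complete dirty string -> clean city name.
# The entries are illustrative assumptions, not the project's real list.
LOCATION_FIXES = {
    "amsterdam-noord": "Amsterdam",
    "amsterdam noord": "Amsterdam",
}


def clean_location(raw: str) -> str:
    """Replace a location only when the WHOLE string is a known dirty variant."""
    trimmed = raw.strip()
    # Exact match on the full (lower-cased) string, so 'Noord' is never
    # stripped out of names such as Noord-Holland or Noordwijk.
    return LOCATION_FIXES.get(trimmed.lower(), trimmed)


for example in ["Amsterdam-noord", "Noord-Holland", "Noordwijk"]:
    print(example, "->", clean_location(example))
```

Because the match is on the complete string, names that merely contain 'noord' pass through unchanged.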

clean data functions

Hi! I tried to change the clean_data.R file into a file with functions for cleaning the data. However, apparently there are some problems with using dplyr within a function in R. Can one of you take a look at this as well? I don't see how to fix it. Right now, it doesn't create a new column named location_trimmed.

I also included a prototype function to run the cleaning functions on all datasets. But we should first fix that dplyr problem, before testing if this works:

Switching of the review/location

Hey guys, in the #getjoblocation and #getcompanyreview parts of the scraper, the locations and the reviews get mixed up with each other. I think I found the problem, but I don't know how to fix it. Could one of you have a look at it? See the pictures. In this case, text.splitlines()[1] gives the review and text.splitlines()[2] gives the location.

The two get mixed up because not every vacancy has a review component, so the line positions shift.
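One possible way around the shifting positions is to check what the second line looks like instead of relying on a fixed index. The sketch below assumes that a review line is a bare rating such as "4,2" or "3.9" and that the first line is the company name. Both are assumptions about the scraped text, and `split_review_and_location` is a hypothetical helper.

```python
import re

# Assumption: a review line consists only of a rating like "4", "4,2" or "3.9".
RATING_PATTERN = re.compile(r"^\d(?:[.,]\d)?$")


def split_review_and_location(text: str):
    """Return (review, location); review is None when the vacancy has none."""
    lines = [line.strip() for line in text.splitlines()]
    rest = lines[1:]  # assumption: line 0 holds the company name
    review = None
    if rest and RATING_PATTERN.match(rest[0]):
        # The second line looks like a rating, so the location follows it.
        review = rest[0]
        rest = rest[1:]
    location = rest[0] if rest else None
    return review, location


print(split_review_and_location("Some Company\n4,2\nAmsterdam"))
print(split_review_and_location("Other Company\nUtrecht"))
```

If the rating turns out to have a different format on the site, only `RATING_PATTERN` would need to change.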

Salary cleaning in cleaning file

Hi guys, most of the code with functions seems to work so far. It only goes wrong in the cleaning file, at the salary cleaning function (the 4th one) that I added. It would be great if you could have a look as well. I'm trying to fix it right now but don't seem to be getting any closer.

How to name ambiguous location strings

We now give the name Unknown to all locations that are not in a specific city. Should we maybe change this to Remote or Online or something similar?

Clean up repository for files not necessary

Format changes analysis

Hey guys,

I made some changes to the formatting of the R Markdown file so that the PDF becomes more readable (like blank lines between headers and text, new chapters on new pages, and plots no longer floating to the right). Please take a look at it and let me know if you still see some things that need changing. Thanks!

last parts frequency and location

The keyword analysis and location frequency are now largely turned into functions. It would be nice if you could check whether it runs on your own computer as well. I did not really make functions for some of the last parts, so the code could still be more efficient. But it already cut the code in half, so we are on the right track.

download_data issue

Hi guys, I just tested download_data.R and it seems that there is still an issue with the marketing-analist data: I only get the listings and the first 3 descriptions. Can you have a look at it?

Scraper loops over the first 80 pages only

Do we only want to scrape 80 pages? Or all available job listings?
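If the answer is "all available listings", one option is to keep requesting pages until one comes back empty, with an optional cap for testing. This is only a sketch: `fetch_page` stands in for whatever the scraper does to load one results page, and `fake_fetch` is made-up demo data, not the real site.

```python
from typing import Callable, List, Optional


def scrape_all_pages(
    fetch_page: Callable[[int], List[str]],
    max_pages: Optional[int] = None,
) -> List[str]:
    """Collect listings page by page until a page returns no results."""
    listings: List[str] = []
    page = 0
    while max_pages is None or page < max_pages:
        results = fetch_page(page)
        if not results:
            break  # an empty page is taken to mean there are no more listings
        listings.extend(results)
        page += 1
    return listings


def fake_fetch(page: int) -> List[str]:
    """Made-up stand-in for the real scraper: 3 pages with 2 listings each."""
    return [f"job-{page}-{i}" for i in range(2)] if page < 3 else []


print(len(scrape_all_pages(fake_fetch)))
print(len(scrape_all_pages(fake_fetch, max_pages=2)))
```

Passing `max_pages=80` would reproduce the current behaviour, while leaving it at `None` scrapes until the results run out.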

Combined plot salary analysis

Our combined plot for the salary analysis for top locations salary wise only shows 3 cities because they are the only 3 cities that show up in all 4 plots. If we relax the filter of 3 job postings per location we will get a plot with more cities in it but some of these cities will have only one job ad per search term and thus an average is not very useful in this case.

What do you guys want: keep the plots like this with only a few cities to compare, or remove the minimum number of job ads needed per location to get plots with more cities, but with single job ads having a larger influence on average salaries?
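To make the trade-off concrete, here is a small sketch of the two steps involved: averaging per city with a minimum number of postings, and then keeping only the cities that appear for every search term. It is written in Python for illustration (the analysis itself is in R). The function names are hypothetical and any data passed in is up to the caller.

```python
from collections import defaultdict
from statistics import mean


def average_salary_by_city(postings, min_postings=3):
    """postings: iterable of (city, salary) pairs.

    Returns {city: average salary} for cities with at least min_postings ads.
    """
    by_city = defaultdict(list)
    for city, salary in postings:
        by_city[city].append(salary)
    return {
        city: mean(salaries)
        for city, salaries in by_city.items()
        if len(salaries) >= min_postings
    }


def cities_in_all(averages_per_term):
    """Cities present in the averages of every search term."""
    city_sets = [set(averages) for averages in averages_per_term.values()]
    return set.intersection(*city_sets) if city_sets else set()
```

Lowering `min_postings` to 1 lets more cities survive the intersection, but a city's "average" may then rest on a single ad. An intermediate value such as 2 would be a middle ground to try.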

Last things Readme

- Update the repository overview <-- I'll pick that up
- Integrate the salary analysis part in the results overview <-- I'll pick it up when the salary analysis is fixed
- Should we give a description of how to run the makefile? Or is that 'common sense'?
- Overall read-through and last checks

Almost done! Great job everyone :)

Salary cleaning needs to be performed in a subsequent file.

Hi, I am currently updating the analysis scripts to cover all 4 datasets and the keyword analysis for each job search term. However, the datasets I use should not go through the salary cleaning, because this reduces the number of observations massively. I think it is a better idea to put the cleaning of location strings and the removal of duplicates into one file, and the salary cleaning into the salary analysis file.

Download data improved file

Hi! Could you please check if the new download_data.R file works for you? I included functions and it now downloads the data directly from Google Drive instead of Github.

If it works, we can delete all datasets from Github so they are not public anymore.

Documentation for ODCM is completely done, Readme needs additions

Hey guys,

The documentation (datasheet) for ODCM is imo completely done, please do have a look before the deadline to make any enhancements or changes.

The Readme file is mostly filled in already; it still needs the part where we explain exactly how to run everything. I will start working on that. Maybe it is a good idea to shorten the Method & Results part a little bit? That makes it easier to read.

Create driver object in selenium scraper gives error

The code we now have is "driver = webdriver.Chrome()". I don't know how it runs on your computers, but I have to put in my path between the brackets, e.g. "driver = webdriver.Chrome('/usr/local/bin/chromedriver')".

The code has to run on all computers without adjustments, right?
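One way to avoid hard-coding a personal path is to read it from an environment variable and fall back to the plain call when none is set. The sketch below assumes Selenium 4, where the driver path is passed through a `Service` object. The variable name `CHROMEDRIVER_PATH` and both function names are hypothetical choices, not something the project already uses.

```python
import os
from typing import Optional


def chromedriver_path_from_env(var_name: str = "CHROMEDRIVER_PATH") -> Optional[str]:
    """Return the driver path from the environment, or None when it is unset."""
    path = os.environ.get(var_name, "").strip()
    return path or None


def make_driver():
    """Create a Chrome driver without a machine-specific path in the code."""
    # Imported here so the helper above works even without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    path = chromedriver_path_from_env()
    if path:
        # Assumption: Selenium 4 style, explicit driver location via Service.
        return webdriver.Chrome(service=Service(executable_path=path))
    # No path given: rely on chromedriver being discoverable on this machine.
    return webdriver.Chrome()
```

Someone who needs a custom location would then set the variable once in their shell, for example to '/usr/local/bin/chromedriver', and the script itself stays identical on every computer.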

Error in makefile

Not sure how to add the analysis and analysis/output folders in gen. I tried it with directory.R now (see workflow), but it still gives an error.