Skills of a Data Scientist Team Project

This is a project for CUNY SPS 607 - Data Acquisition and Management. This project was completed by

  • Elizabeth Drikman
  • Michael Yampol
  • Michael Silva
  • Corey Arnouts

Motivation

Our motivation for this study is to gain an understanding of which skills are the most useful for a data scientist to have so that we can plan what courses to take in our Master's program.

Approach

To answer this question, we scrape data scientist job listings on dice.com and extract the skills listed in the postings.

Findings

The following figure shows the top 20 skills of a data scientist:

This word cloud summarizes the top skills listed in the job postings:

The full findings are available at https://rpubs.com/mikesilva/skills-of-a-data-scientist.

Replication

System Requirements

This study uses both R and Python 3. In order to replicate it you will need the following installed:

  • R
    • DBI
    • RMySQL
    • dplyr
    • tidyr
    • stringr
    • splitstackshape
    • Hmisc
    • wordcloud
    • ggplot2
    • rmdformats
  • Python 3
    • requests
    • sqlalchemy
    • pymysql
    • beautifulsoup4
    • selenium
    • pandas

You will also need the Selenium driver for Chrome (ChromeDriver) installed on your local machine.
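Before starting, it can help to confirm that the Python packages listed above are importable. The following is a minimal sketch, not part of the project itself; the function name missing_modules is hypothetical. Note that beautifulsoup4 is imported under the name bs4.

```python
import importlib.util

# Import names for the Python packages listed above.
# beautifulsoup4 is installed under that name but imported as "bs4".
REQUIRED_MODULES = ["requests", "sqlalchemy", "pymysql", "bs4", "selenium", "pandas"]


def missing_modules(modules=REQUIRED_MODULES):
    """Return the module names that cannot be found by this interpreter."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing Python packages:", ", ".join(missing))
    else:
        print("All required Python packages are installed.")
```

This only checks the Python side; the R packages and the Chrome driver still need to be verified separately.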

Configuration

You will need to edit Credentials.R with your MySQL credentials.
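On the Python side, sqlalchemy and pymysql connect to MySQL through a connection URL. How the project's Python scripts actually read the credentials is not described here, so the sketch below is only an illustration: the function name mysql_url and the default database name "dice" are assumptions.

```python
from urllib.parse import quote_plus


def mysql_url(user, password, host="localhost", port=3306, database="dice"):
    """Build a SQLAlchemy URL for MySQL using the PyMySQL driver.

    The user name and password are URL-encoded so that characters
    such as "@" or ":" do not break the URL. The default database
    name is a placeholder, not the project's real one.
    """
    return (
        f"mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


if __name__ == "__main__":
    # The resulting string can be passed to sqlalchemy.create_engine().
    print(mysql_url("me", "p@ss", database="jobs"))
```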

Replicating the Study

Step 1: Setup.R

First, run the Setup.R script. It sets up your local MySQL database and then launches a Selenium-controlled Chrome browser, which searches dice.com using "Data Scientist" (in quotes) as the keyword.

Step 2: Dice_Scraper.py

The next step is to run the Dice Scraper script which will scrape all the URLs previously stored in the MySQL database. Note: You may need to run this more than once.
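Re-running is cheap if each run skips the URLs that were already scraped and retries only the rest. The sketch below illustrates that idea under stated assumptions; it is not the code in Dice_Scraper.py. The function scrape_pending and its fetch argument are hypothetical, and in practice the set of finished URLs would come from the MySQL database.

```python
def scrape_pending(urls, already_scraped, fetch):
    """Fetch every URL that has not been scraped yet.

    urls            -- iterable of posting URLs to process
    already_scraped -- set of URLs completed in earlier runs
    fetch           -- callable that takes a URL and returns its content;
                       it may raise on network errors or timeouts
    Returns (results, failed): a dict of new content keyed by URL, and
    a list of URLs that failed and should be retried on the next run.
    """
    results, failed = {}, []
    for url in urls:
        if url in already_scraped:
            continue  # finished in a previous run
        try:
            results[url] = fetch(url)
        except Exception:
            failed.append(url)  # leave it for the next run
    return results, failed
```

Running the script again with the updated set of finished URLs then picks up only the entries in failed.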

Step 3: Skills Miner.py

Next, run the Skills Miner script, which creates locations.csv, skills_counts.csv, and skills_list.csv.
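As a rough illustration of how a skills count file could be produced, the sketch below counts the number of postings that mention each skill and writes the result as CSV. It is an assumption-laden example, not the Skills Miner script: the function names and the column names "skill" and "count" are hypothetical.

```python
import csv
from collections import Counter


def count_skills(postings):
    """Count how many postings mention each skill (case-insensitive).

    postings is an iterable of skill lists, one list per job posting.
    """
    counts = Counter()
    for skills in postings:
        # A set prevents counting a skill twice within one posting.
        counts.update({s.strip().lower() for s in skills if s.strip()})
    return counts


def write_counts(counts, handle):
    """Write counts as CSV rows, most frequent skill first."""
    writer = csv.writer(handle)
    writer.writerow(["skill", "count"])  # hypothetical column names
    for skill, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        writer.writerow([skill, n])
```

To write to disk, open the target file with newline="" as the csv module recommends, for example open("skills_counts.csv", "w", newline="").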

Step 4: Curate the list of data science skills

The next step is to clean up and standardize the skills list. Project_3_V2.R was an attempt to clean up the skills programmatically, but human interpretation was still needed. Running Project_3_V2.R cleans up skills_counts.csv and creates data_science.csv.
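One common way to standardize a skills list is a hand-maintained alias table that maps spelling variants to a single canonical name and then merges their counts. The sketch below shows that technique in Python for illustration only; it is not a port of Project_3_V2.R, and the alias entries and the function name curate are hypothetical.

```python
from collections import Counter

# Hypothetical alias table. In practice a person reviews the raw
# skills list and decides which variants mean the same thing.
ALIASES = {
    "ml": "machine learning",
    "machine-learning": "machine learning",
    "r programming": "r",
}


def curate(counts, aliases=ALIASES):
    """Merge counts of skills that share a canonical name."""
    merged = Counter()
    for skill, n in counts.items():
        key = skill.strip().lower()
        merged[aliases.get(key, key)] += n
    return merged
```

Keeping the table as plain data makes the manual judgment calls easy to review and extend.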

Step 5: presentation.Rmd

The final step is to run presentation.Rmd, which creates the visualizations.
