
nltweets's Introduction

nltweets

Building an end-to-end NLP pipeline for small teams to do user research with Twitter data

Project Intro/Objective

See the Wiki! This project is a part of the Data Science Working Group at Code for San Francisco. Other DSWG projects can be found at the main GitHub repo.

Organization

Please refer to this article for how these folders should work together.

The "/main" folder is for production code and has four subfolders:

  • /data
  • /code
  • /pipeline
  • /output

Use the "/sandbox" folder for storing experiments and playing around. The "/outreach" folder is for organizing materials for presentations.

-- Project Status: [In Discovery]

Methods Used

Technologies

  • Python
  • Spacy
  • scikit-learn
  • gensim

Overview

Contributing NLTweets Members

Name Slack Handle
Daniel Zou @daniel.zou
Josh Freivogel @Josh Freivogel
Nathan Chau @Nathan Chau

Contact

  • If you haven't joined the SF Brigade Slack, you can do that here.
  • Our Slack channel is #nltweets
  • Feel free to contact team leads with any questions or if you are interested in contributing!

nltweets's People

Contributors

bleeeeee, frhino, josh-cqg, nathanhc, nhilton92, pahdo, rileypredum, russhp, vincentla, vincentla14


nltweets's Issues

Standardize Development Environment

In other Data Science Working Group projects, I've used Anaconda to manage Python environments. I'm open to continuing with this or to other solutions (e.g. a Docker image), but we should have some way to standardize the development environment to ease onboarding.

A very basic MVP could even just be a list of steps to do manually. We should then document this in our onboarding docs. My proposal is to create a folder called "onboarding" in the root directory, with Markdown pages inside that directory.

For example, the Data Science Working Group Small Business Administration Project had this: https://github.com/sfbrigade/datasci-sba/tree/master/onboarding

Add twitter data cleaning step prior to labeling

Occasionally people will spam @sfmta_muni with more than 2 tweets in a short amount of time (e.g. < 30 minutes). We should purge all but the first, or better, merge them all via concatenation so the two possible topics must be pulled from the entirety of the diatribe rather than scoring multiple hits for the same issue.
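The merge-by-concatenation idea above could be sketched as follows (the field names and the 30-minute window are assumptions for illustration, not the project's actual schema; the input is assumed sorted by timestamp):

```python
from datetime import timedelta

def merge_bursts(tweets, window=timedelta(minutes=30)):
    """Merge tweets sent by the same user within `window` of their previous
    tweet, concatenating the text so topic scoring sees one document per
    "diatribe" instead of scoring multiple hits for the same issue."""
    merged = []
    last_by_user = {}
    for t in tweets:
        prev = last_by_user.get(t["user"])
        if prev is not None and t["created_at"] - prev["created_at"] <= window:
            prev["text"] += " " + t["text"]
            prev["created_at"] = t["created_at"]  # extend the burst window
        else:
            entry = dict(t)
            merged.append(entry)
            last_by_user[t["user"]] = entry
    return merged
```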

Clean twitter output of UTF-8 code

Twitter output contains raw UTF-8 byte escapes rather than punctuation, likely because the response bytes are not being decoded (curly quotes and apostrophes encode to multi-byte sequences), such as:

b'@sfmta_muni how\xe2\x80\x99s this looking for the NX??? a lot of people waiting'
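Decoding the raw bytes as UTF-8 recovers the intended punctuation; `\xe2\x80\x99` is simply the UTF-8 encoding of the right single quotation mark. A minimal sketch of the fix:

```python
raw = b'@sfmta_muni how\xe2\x80\x99s this looking for the NX??? a lot of people waiting'

# Decode the UTF-8 bytes instead of keeping the bytes literal around
text = raw.decode("utf-8")
print(text)  # @sfmta_muni how’s this looking for the NX??? a lot of people waiting
```

The same applies wherever tweet bytes are written to disk: open output files with `encoding="utf-8"` so the decoded text round-trips cleanly.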

Create Twitter Developer Account with Code for San Francisco Account

Goal is to create a Twitter Developer Account with Code for San Francisco credentials that can be shared amongst all group members so that we're not tied to a single account. It will also make onboarding potentially easier as new members don't need to go off and create their own account.

Note: It might be possible for us to get an Enterprise Account. To help kick off potential conversations, let's document:

  1. What are our use cases?
  2. What do we currently have access to via a free account, and what are the shortcomings?
  3. What would we want out of an "enterprise account" and how would that help us accomplish our goals?

Add dummy example of creds.txt

I'm getting

'{"errors":[{"code":89,"message":"Invalid or expired token."}]}'

in the CodeLabs example CodeLab0TwitterAPI.ipynb. Can we include an example of the file's structure with dummy values? Something like: "Your creds.txt file will look like this:

ACCESS_KEY.123456789876543210
ACCESS_SECRET.abCD123E4f6ghi7jkLmnOPqR89sTUv8WXYZy7Xw6vUtsr
CONSUMER_KEY.1ABC2defG3HI4JkLmNOpQrs5T
CONSUMER_SECRET.abCD123E4f6ghi7jkLmnOPqR89sTUv8WXYZy7Xw6vUtsr5qp4o

But not those exact values, since the dummy credentials will throw an error."
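For the CodeLab, a hypothetical loader for this file could split each line on the first period, assuming the NAME.VALUE layout shown in the dummy example (the function name and path are illustrative):

```python
def load_creds(path="creds.txt"):
    """Parse creds.txt into a dict, assuming each line is NAME.VALUE
    split on the first period (the layout shown in the dummy example)."""
    creds = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                name, _, value = line.partition(".")
                creds[name] = value
    return creds
```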

Sanitize "twitter" folder

  • Remove less-functional notebooks once we have established our official master that includes rate-limiter logic (#1) and retrieves all features of interest (#3, #4)
  • Move existing data files to a "data" folder and delete deprecated data-structure files

Make some CodeLabs to improve onboarding

We've noticed that new members without prior data science experience don't have great resources to ramp up and start doing interesting experiments. Since a lot of our work is exploratory, we benefit from having as many members as possible running experiments in parallel. We're thinking about building a set of CodeLabs that cover:

  1. Using the Twitter API #62
  2. NLTK/Spacy for preprocessing text data #41
  3. Classification with scikit-learn #42

Ideas:
  • Using Tweepy to scrape tweets
  • scikit-learn for transforming text data into document embeddings
  • Interpreting the results of an experiment

Seems like a good investment to prioritize this now for greater velocity later.

Add @user to twitter puller

Should be part of #3 and #4. We need to filter out messages authored by our target Twitter user (e.g. @sfmta_muni). We also need to handle multiple tweets by the same user about the same topic within a relatively short period, e.g. more than 2 tweets in one hour, more than 3 in one day, or more than 4 in two days.
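The first half of this (dropping tweets authored by the target account) could be as small as the sketch below; the field name `user` and the handle normalization are assumptions about the puller's output:

```python
def drop_tweets_by(tweets, handle="sfmta_muni"):
    """Remove tweets authored by the target account itself, so replies
    from @sfmta_muni don't pollute the rider dataset. Accepts handles
    with or without a leading '@'."""
    return [t for t in tweets
            if t["user"].lstrip("@").lower() != handle.lstrip("@").lower()]
```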

Move or copy twitter/tweet_data.ipynb to main/code

Once tweet_data.ipynb is in main/code, also make sure the output data file is added to .gitignore, so we aren't constantly overwriting the output with each new PR. The output of this code for analysis should be renamed and submitted via PR to main/data/manual. When we want to refactor the data structure for any reason, we'll create a new PR to maintain reproducibility.

Delimit tweet output

When running an updated Twitter query app with the image URLs added, the output is not clearly delimited, so when opening it in Excel we don't get individual tweets' text and image URLs split into columns. Right now, the output looks like:

A "protected" bike lane in San Francisco.  6th &amp; Howard.

@sfmta_muni @SF311 @sfbike https://t.co/viVGGLkjKK  http://pbs.twimg.com/ext_tw_video_thumb/1091160885630910464/pu/img/PbNsVNblDqMzCmeK.jpg
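One way to get clean columns is to write through Python's `csv` module with everything quoted, so commas, quotes, and newlines inside tweet text can't break the layout (the filename and column names here are illustrative):

```python
import csv

# Example row in the shape described above: tweet text plus an image URL
rows = [
    ('A "protected" bike lane in San Francisco. 6th & Howard. '
     '@sfmta_muni @SF311 @sfbike https://t.co/viVGGLkjKK',
     'http://pbs.twimg.com/ext_tw_video_thumb/1091160885630910464/pu/img/PbNsVNblDqMzCmeK.jpg'),
]

with open("tweets.csv", "w", newline="", encoding="utf-8") as f:
    # QUOTE_ALL quotes every field, so embedded commas/quotes stay in one cell
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    writer.writerow(["text", "image_url"])
    writer.writerows(rows)
```

Excel then sees exactly two columns per tweet regardless of what the text contains.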

Create Pipeline to Classify Tweets by Topic

Key topics

  • Machine learning
  • Human-in-the-loop
  • Predictive modeling

Objective
We want to develop a pipeline that can tag tweets by topic. In this project, a topic is something that Muni riders are talking about. For example, bus bunching and bike safety are topics. If we can tag tweets by topic in an automated fashion, we will have a data-backed understanding of what riders care about at a given moment.

First steps
We have an initial set of manually labeled tweets located in this Google Sheet. Also, it may be valuable to look at the ClassificationExperiment CodeLab, located here.

Useful tools
  • Classification: scikit-learn
  • Text preprocessing: spaCy
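A first-cut version of this pipeline could be a TF-IDF vectorizer feeding a linear classifier in scikit-learn. The tiny labeled examples below are stand-ins for the manually labeled Google Sheet, and the topic labels are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the manually labeled tweets
texts = ["the 38R keeps bunching again", "three buses in a row, bunching",
         "cars parked in the bike lane", "bike lane blocked on Howard"]
labels = ["bus_bunching", "bus_bunching", "bike_safety", "bike_safety"]

# TF-IDF over word unigrams/bigrams, then logistic regression
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)

print(clf.predict(["buses bunching on Geary"]))
```

With the real labeled set, the same pipeline drops in unchanged; only `texts` and `labels` come from the sheet instead.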

Write up notes from user research interviews

We conducted interviews with other projects about how they do user research: do stakeholders around their issue interact on Twitter, would statistics on that activity be interesting to them, in what format, etc.

Research labeling platform

Bob is going to sketch out the effort required for creating a mobile app labeling product for projects to quickly enlist a number of volunteers to label data.

Clean up environment.yml file

The environment.yml file in the root directory causes a conflict when I try to create the environment:

(base) C:\Users\Joshf\Documents\nltweets>conda env create -f environment.yml
Solving environment: failed

UnsatisfiableError: The following specifications were found to be in conflict:
  - python==3.6.0
  - sqlalchemy==1.1 -> python=2.7
Use "conda info <package>" to see the dependencies for each package.

My base environment is python 3.6, so I think that's what's causing the conflict.
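The error itself points at the pins rather than the base environment: `sqlalchemy==1.1` only resolves to a build that requires Python 2.7, which conflicts with `python==3.6.0`. One possible fix is to relax the SQLAlchemy pin so conda can pick a Python 3-compatible build; the exact versions below are illustrative, not the project's actual file:

```yaml
# Sketch of a compatible environment.yml (versions are assumptions)
name: nltweets
dependencies:
  - python=3.6
  - sqlalchemy>=1.2   # relaxed from ==1.1, which resolves to a py2.7 build
  - pip
```

After editing, `conda env create -f environment.yml` should solve cleanly.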

Develop System to Tag Tweets with Entities

Key topics

  • Information extraction
  • Named entity recognition

Objective
We want to build a system that can tag our tweets with entities that are useful to us. For example, we may want to tag a tweet with the entity "N Line" or "4th and King." This helps us know exactly which transit-related entities are being talked about.

First steps
There are two approaches for this task. One is a supervised approach, where we determine a set of entities that interest us and tag tweets either by keyword matching or with a classifier. The other is an unsupervised approach using named entity recognition, where we try to automatically extract named entities from the dataset.

Useful tools
  • Classification: scikit-learn
  • Named entity recognition: spaCy
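The curated-list half of the first approach can start as plain keyword matching before anything learned is needed; the entity list below is illustrative, not a project asset (the unsupervised route would swap this for spaCy's `doc.ents`):

```python
# Hypothetical curated list of transit entities we care about
ENTITIES = ["N Line", "4th and King", "38R", "Muni Metro"]

def tag_entities(text, entities=ENTITIES):
    """Return the curated entities that appear in `text`,
    matched case-insensitively, in list order."""
    lowered = text.lower()
    return [e for e in entities if e.lower() in lowered]
```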

Add geo coords to output data

My beta version just pulls the extended tweet text with no parsing (see @steven4354's link in issue 1). We want to add geo coordinates to the output so we can aggregate and visualize data according to where tweets were created.
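In the v1.1 API, geotagged tweets carry a `coordinates` field whose inner array is `[longitude, latitude]`; a small extractor (sketch, assuming the tweet is already parsed into a dict) might look like:

```python
def extract_geo(tweet):
    """Return (lat, lon) for a geotagged v1.1 tweet dict, or None.
    Note Twitter stores the point as [longitude, latitude], so we
    swap to the more conventional (lat, lon) order."""
    geo = tweet.get("coordinates")
    if geo and geo.get("coordinates"):
        lon, lat = geo["coordinates"]
        return lat, lon
    return None
```

Most tweets are not geotagged, so callers should expect `None` and fall back to the tweet's `place` bounding box if coarse location is acceptable.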

Add links to Welcome wiki page

In the "concepts to be familiar with" section of the Welcome page, we should point people in the direction of where they can start learning about these technologies.

Add uploaded image urls to output data

Also available in the Twitter API response is the URL of any image attached to a tweet. Later iterations of the ML pipeline might include facial or object recognition on images, so we might as well engineer our output data files now to include this info; our database needs are small and it only adds a few bytes.
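In v1.1 tweet objects the attached photos live under `extended_entities.media` (plain `entities.media` only carries the first one); a small helper (a sketch over an already-parsed tweet dict) could collect them:

```python
def image_urls(tweet):
    """Collect attached media URLs from a v1.1 tweet dict.
    'extended_entities' lists all attached photos; fall back to
    'entities' for older payloads. Returns [] when there is no media."""
    container = tweet.get("extended_entities") or tweet.get("entities") or {}
    media = container.get("media", [])
    return [m["media_url_https"] for m in media if "media_url_https" in m]
```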
