
nltweets's Introduction

nltweets

Building an end-to-end NLP pipeline for small teams to do user research with Twitter data

Project Intro/Objective

See the Wiki! This project is a part of the Data Science Working Group at Code for San Francisco. Other DSWG projects can be found at the main GitHub repo.

Organization

Please refer to this article for how these folders should work together.

The "/main" folder is for production code and has four subfolders:

  • /data
  • /code
  • /pipeline
  • /output

Use the "/sandbox" folder for storing experiments and playing around. The "/outreach" folder is for organizing materials for presentations.

-- Project Status: [In Discovery]

Methods Used

Technologies

  • Python
  • Spacy
  • scikit-learn
  • gensim

Overview

Contributing NLTweets Members

Name Slack Handle
Daniel Zou @daniel.zou
Josh Freivogel @Josh Freivogel
Nathan Chau @Nathan Chau

Contact

  • If you haven't joined the SF Brigade Slack, you can do that here.
  • Our Slack channel is #nltweets
  • Feel free to contact team leads with any questions or if you are interested in contributing!

nltweets's People

Contributors

bleeeeee, frhino, josh-cqg, nathanhc, nhilton92, pahdo, rileypredum, russhp, vincentla, vincentla14


nltweets's Issues

Standardize Development Environment

In other Data Science Working Group projects, I've used Anaconda to manage Python environments. I'm open to continuing with this or to other solutions (e.g. a Docker image), but we should have some way to standardize the development environment to ease onboarding.

A very basic MVP could even just be a list of steps to do manually. We should then document this in our onboarding docs. My proposal is to create a folder called "onboarding" in the root directory, with Markdown pages inside that directory.

For example, the Data Science Working Group Small Business Administration Project had this: https://github.com/sfbrigade/datasci-sba/tree/master/onboarding

Add twitter data cleaning step prior to labeling

Occasionally people will spam @sfmta_muni with more than 2 tweets in a short amount of time (e.g. < 30 minutes). We should purge all but the first, or better, merge them all via concatenation so the two possible topics must be pulled from the entirety of the diatribe rather than scoring multiple hits for the same issue.
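The merge-by-concatenation idea above could be sketched as follows (the field names and the 30-minute window are assumptions for illustration, not the project's actual schema; the input is assumed sorted by timestamp):

```python
from datetime import timedelta

def merge_bursts(tweets, window=timedelta(minutes=30)):
    """Merge tweets sent by the same user within `window` of their previous
    tweet, concatenating the text so topic scoring sees one document per
    "diatribe" instead of scoring multiple hits for the same issue."""
    merged = []
    last_by_user = {}
    for t in tweets:
        prev = last_by_user.get(t["user"])
        if prev is not None and t["created_at"] - prev["created_at"] <= window:
            prev["text"] += " " + t["text"]
            prev["created_at"] = t["created_at"]  # extend the burst window
        else:
            entry = dict(t)
            merged.append(entry)
            last_by_user[t["user"]] = entry
    return merged
```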

Clean twitter output of UTF-8 code

Twitter output contains raw UTF-8 byte escapes rather than punctuation, likely because the response bytes are not being decoded (curly quotes and apostrophes encode to multi-byte sequences), such as:

b'@sfmta_muni how\xe2\x80\x99s this looking for the NX??? a lot of people waiting'
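Decoding the raw bytes as UTF-8 recovers the intended punctuation; `\xe2\x80\x99` is simply the UTF-8 encoding of the right single quotation mark. A minimal sketch of the fix:

```python
raw = b'@sfmta_muni how\xe2\x80\x99s this looking for the NX??? a lot of people waiting'

# Decode the UTF-8 bytes instead of keeping the bytes literal around
text = raw.decode("utf-8")
print(text)  # @sfmta_muni how’s this looking for the NX??? a lot of people waiting
```

The same applies wherever tweet bytes are written to disk: open output files with `encoding="utf-8"` so the decoded text round-trips cleanly.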

Create Twitter Developer Account with Code for San Francisco Account

Goal is to create a Twitter Developer Account with Code for San Francisco credentials that can be shared amongst all group members so that we're not tied to a single account. It will also make onboarding potentially easier as new members don't need to go off and create their own account.

Note: It might be possible for us to get an Enterprise Account. To help kick off potential conversations, let's document:

  1. What are our use cases?
  2. What do we currently have access to via a free account, and what are the shortcomings?
  3. What would we want out of an "enterprise account" and how would that help us accomplish our goals?

Add dummy example of creds.txt

I'm getting

'{"errors":[{"code":89,"message":"Invalid or expired token."}]}'

in the CodeLabs example CodeLab0TwitterAPI.ipynb. Can we include an example of the file's structure with dummy values? Something like: "Your creds.txt file will look like this:

ACCESS_KEY.123456789876543210
ACCESS_SECRET.abCD123E4f6ghi7jkLmnOPqR89sTUv8WXYZy7Xw6vUtsr
CONSUMER_KEY.1ABC2defG3HI4JkLmNOpQrs5T
CONSUMER_SECRET.abCD123E4f6ghi7jkLmnOPqR89sTUv8WXYZy7Xw6vUtsr5qp4o

But not those exact values, since the dummy credentials will throw an error."
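For the CodeLab, a hypothetical loader for this file could split each line on the first period, assuming the NAME.VALUE layout shown in the dummy example (the function name and path are illustrative):

```python
def load_creds(path="creds.txt"):
    """Parse creds.txt into a dict, assuming each line is NAME.VALUE
    split on the first period (the layout shown in the dummy example)."""
    creds = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                name, _, value = line.partition(".")
                creds[name] = value
    return creds
```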

Sanitize "twitter" folder

  • Remove less-functional notebooks once we have established our official master that includes rate-limiter logic (#1) and retrieves all features of interest (#3, #4)
  • Move existing data files to a "data" folder and delete deprecated data-structure files

Make some CodeLabs to improve onboarding

We've noticed that new members without prior data science experience don't have great resources to ramp up and start doing interesting experiments. Since a lot of our work is exploratory, we benefit from having as many members as possible running experiments in parallel. We're thinking about building a set of CodeLabs that cover:

  1. Using the Twitter API #62
  2. NLTK/Spacy for preprocessing text data #41
  3. Classification with scikit-learn #42

Ideas:
  • Using Tweepy to scrape tweets
  • scikit-learn for transforming text data into document embeddings
  • Interpreting the results of an experiment

Seems like a good investment to prioritize this now for greater velocity later.

Add @user to twitter puller

Should be part of #3 and #4. We need to filter out messages authored by our target Twitter user (e.g. @sfmta_muni). We also need to handle multiple tweets by the same user about the same topic within a relatively short period, e.g. more than 2 tweets in one hour, more than 3 in one day, or more than 4 in two days.
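The first half of this (dropping tweets authored by the target account) could be as small as the sketch below; the field name `user` and the handle normalization are assumptions about the puller's output:

```python
def drop_tweets_by(tweets, handle="sfmta_muni"):
    """Remove tweets authored by the target account itself, so replies
    from @sfmta_muni don't pollute the rider dataset. Accepts handles
    with or without a leading '@'."""
    return [t for t in tweets
            if t["user"].lstrip("@").lower() != handle.lstrip("@").lower()]
```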

Move or copy twitter/tweet_data.ipynb to main/code

Once tweet_data.ipynb is in main/code, also make sure the output data file is added to .gitignore, so we aren't constantly overwriting the output with each new PR. The output of this code for analysis should be renamed and submitted via PR to main/data/manual. When we want to refactor the data structure for any reason, we'll create a new PR to maintain reproducibility.

Delimit tweet output

When running an updated Twitter query app with the image URLs added, the output is not clearly delimited, so when opening it in Excel we don't get individual tweets' text and image URLs split into columns. Right now, the output looks like:

A "protected" bike lane in San Francisco.  6th &amp; Howard.

@sfmta_muni @SF311 @sfbike https://t.co/viVGGLkjKK  http://pbs.twimg.com/ext_tw_video_thumb/1091160885630910464/pu/img/PbNsVNblDqMzCmeK.jpg
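One way to get clean columns is to write through Python's `csv` module with everything quoted, so commas, quotes, and newlines inside tweet text can't break the layout (the filename and column names here are illustrative):

```python
import csv

# Example row in the shape described above: tweet text plus an image URL
rows = [
    ('A "protected" bike lane in San Francisco. 6th & Howard. '
     '@sfmta_muni @SF311 @sfbike https://t.co/viVGGLkjKK',
     'http://pbs.twimg.com/ext_tw_video_thumb/1091160885630910464/pu/img/PbNsVNblDqMzCmeK.jpg'),
]

with open("tweets.csv", "w", newline="", encoding="utf-8") as f:
    # QUOTE_ALL quotes every field, so embedded commas/quotes stay in one cell
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    writer.writerow(["text", "image_url"])
    writer.writerows(rows)
```

Excel then sees exactly two columns per tweet regardless of what the text contains.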

Create Pipeline to Classify Tweets by Topic

Key topics

  • Machine learning
  • Human-in-the-loop
  • Predictive modeling

Objective
We want to develop a pipeline that can tag tweets by topic. In this project, a topic is something that Muni riders are talking about. For example, bus bunching and bike safety are topics. If we can tag tweets by topic in an automated fashion, we will have a data-backed understanding of what riders care about at a given moment.

First steps
We have an initial set of manually labeled tweets located in this Google Sheet. Also, it may be valuable to look at the ClassificationExperiment CodeLab, located here.

Useful tools
  • Classification: scikit-learn
  • Text preprocessing: spaCy
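A first-cut version of this pipeline could be a TF-IDF vectorizer feeding a linear classifier in scikit-learn. The tiny labeled examples below are stand-ins for the manually labeled Google Sheet, and the topic labels are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the manually labeled tweets
texts = ["the 38R keeps bunching again", "three buses in a row, bunching",
         "cars parked in the bike lane", "bike lane blocked on Howard"]
labels = ["bus_bunching", "bus_bunching", "bike_safety", "bike_safety"]

# TF-IDF over word unigrams/bigrams, then logistic regression
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)

print(clf.predict(["buses bunching on Geary"]))
```

With the real labeled set, the same pipeline drops in unchanged; only `texts` and `labels` come from the sheet instead.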

Write up notes from user research interviews

We conducted interviews with other projects about how they do user research: do stakeholders around their issue interact on Twitter, would statistics on that activity be interesting to them, in what format, etc.

Research labeling platform

Bob is going to sketch out the effort required for creating a mobile app labeling product for projects to quickly enlist a number of volunteers to label data.

Clean up environment.yml file

The environment.yml file in the root directory causes a conflict when I try to create the environment:

(base) C:\Users\Joshf\Documents\nltweets>conda env create -f environment.yml
Solving environment: failed

UnsatisfiableError: The following specifications were found to be in conflict:
  - python==3.6.0
  - sqlalchemy==1.1 -> python=2.7
Use "conda info <package>" to see the dependencies for each package.

My base environment is python 3.6, so I think that's what's causing the conflict.
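The error itself points at the pins rather than the base environment: `sqlalchemy==1.1` only resolves to a build that requires Python 2.7, which conflicts with `python==3.6.0`. One possible fix is to relax the SQLAlchemy pin so conda can pick a Python 3-compatible build; the exact versions below are illustrative, not the project's actual file:

```yaml
# Sketch of a compatible environment.yml (versions are assumptions)
name: nltweets
dependencies:
  - python=3.6
  - sqlalchemy>=1.2   # relaxed from ==1.1, which resolves to a py2.7 build
  - pip
```

After editing, `conda env create -f environment.yml` should solve cleanly.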

Develop System to Tag Tweets with Entities

Key topics

  • Information extraction
  • Named entity recognition

Objective
We want to build a system that can tag our tweets with entities that are useful to us. For example, we may want to tag a tweet with the entity "N Line" or "4th and King." This helps us know exactly which transit-related entities are being talked about.

First steps
There are two approaches for this task. One is a supervised approach, where we determine a set of entities that interest us and tag tweets either by keyword matching or with a classifier. The other is an unsupervised approach using named entity recognition, where we try to automatically extract named entities from the dataset.

Useful tools
  • Classification: scikit-learn
  • Named entity recognition: spaCy
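The curated-list half of the first approach can start as plain keyword matching before anything learned is needed; the entity list below is illustrative, not a project asset (the unsupervised route would swap this for spaCy's `doc.ents`):

```python
# Hypothetical curated list of transit entities we care about
ENTITIES = ["N Line", "4th and King", "38R", "Muni Metro"]

def tag_entities(text, entities=ENTITIES):
    """Return the curated entities that appear in `text`,
    matched case-insensitively, in list order."""
    lowered = text.lower()
    return [e for e in entities if e.lower() in lowered]
```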

Add geo coords to output data

My beta version just pulls the extended tweet text with no parsing (see @steven4354's link in issue 1). We want to add geo coordinates to the output so we can aggregate and visualize data according to where tweets were created.
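In the v1.1 API, geotagged tweets carry a `coordinates` field whose inner array is `[longitude, latitude]`; a small extractor (sketch, assuming the tweet is already parsed into a dict) might look like:

```python
def extract_geo(tweet):
    """Return (lat, lon) for a geotagged v1.1 tweet dict, or None.
    Note Twitter stores the point as [longitude, latitude], so we
    swap to the more conventional (lat, lon) order."""
    geo = tweet.get("coordinates")
    if geo and geo.get("coordinates"):
        lon, lat = geo["coordinates"]
        return lat, lon
    return None
```

Most tweets are not geotagged, so callers should expect `None` and fall back to the tweet's `place` bounding box if coarse location is acceptable.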

Add links to Welcome wiki page

In the "concepts to be familiar with" section of the Welcome page, we should point people in the direction of where they can start learning about these technologies.

Add uploaded image urls to output data

Also available in the Twitter API response is the URL of any image attached to a tweet. Later iterations of the ML pipeline might include facial or object recognition on images, so we might as well engineer our output data files now to include this info; our database needs are small and it only adds a few bytes.
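In v1.1 tweet objects the attached photos live under `extended_entities.media` (plain `entities.media` only carries the first one); a small helper (a sketch over an already-parsed tweet dict) could collect them:

```python
def image_urls(tweet):
    """Collect attached media URLs from a v1.1 tweet dict.
    'extended_entities' lists all attached photos; fall back to
    'entities' for older payloads. Returns [] when there is no media."""
    container = tweet.get("extended_entities") or tweet.get("entities") or {}
    media = container.get("media", [])
    return [m["media_url_https"] for m in media if "media_url_https" in m]
```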
