sfbrigade / nltweets
"Our corpus is tweets."
Home Page: http://www.codeforsanfrancisco.org/
In other Data Science Working Group projects, I've used Anaconda to manage Python environments. I'm open to continuing with this or to other solutions (e.g. a Docker image), but we should have some way to standardize the development environment to facilitate onboarding.
A very basic MVP could even just be a list of steps to perform manually. We should then document this in our onboarding docs. My proposal is to create a folder called "onboarding" within the root directory, with separate Markdown pages inside it.
For example, the Data Science Working Group Small Business Administration Project had this: https://github.com/sfbrigade/datasci-sba/tree/master/onboarding
When running an updated Twitter query app with the image URLs added, the output is not clearly delimited, so when opening it in Excel, we do not get individual tweets' text and image URLs split into columns. Right now, the output looks like:
A "protected" bike lane in San Francisco. 6th & Howard.
@sfmta_muni @SF311 @sfbike https://t.co/viVGGLkjKK http://pbs.twimg.com/ext_tw_video_thumb/1091160885630910464/pu/img/PbNsVNblDqMzCmeK.jpg
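One way to get clean columns in Excel is to emit proper CSV rows instead of raw concatenated text. A minimal sketch using the standard library — field names here are assumptions, and `io.StringIO` can be swapped for `open("tweets.csv", "w", newline="")` to write a real file:

```python
import csv
import io

# Hypothetical tweet records — the field names are assumptions.
tweets = [
    {
        "text": 'A "protected" bike lane in San Francisco. 6th & Howard.',
        "image_url": "http://pbs.twimg.com/ext_tw_video_thumb/example.jpg",
    },
]

out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["text", "image_url"])  # header row
for t in tweets:
    # csv.writer quotes embedded commas, quotes, and newlines automatically,
    # so Excel splits text and image URL into separate columns
    writer.writerow([t["text"], t["image_url"]])

csv_text = out.getvalue()
```

Because the writer quotes fields for us, tweets containing commas or quotation marks (like the one above) survive the round trip intact.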
Bob is going to sketch out the effort required to create a mobile labeling app that lets projects quickly enlist volunteers to label data.
Label tweets 201 - 300 in the google sheets file.
Add instructions for working with Colab prominently in the wiki, as a way to onboard people to the technology we're using and the work that's being done. When done with each element in issue 25 (and break that issue into multiple issues, one per use case), make a corresponding entry in the wiki.
Goal is to create a Twitter Developer Account with Code for San Francisco credentials that can be shared amongst all group members so that we're not tied to a single account. It will also make onboarding potentially easier as new members don't need to go off and create their own account.
Note: It might be possible for us to get an Enterprise Account. To help kick off potential conversations, let's document:
Label tweets 100 - 200 in google drive file.
Noticed that new members without prior data science experience don't have great resources to ramp up and start doing interesting experiments. Since a lot of our work is exploratory, we benefit from having as many members as possible running experiments in parallel. Thinking about building a set of CodeLabs to cover this.
Ideas:
3. Using Tweepy to scrape tweets
4. Scikit-learn for transforming text data into document embeddings
5. Interpreting the results of an experiment
Seems like a good investment to prioritize this now for greater velocity later.
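One of the CodeLab ideas above — transforming text into document embeddings with scikit-learn — could start with something as small as this sketch. The toy corpus is an assumption standing in for real tweets:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for tweets pulled from the Twitter query app.
docs = [
    "bus bunching on the N line again",
    "protected bike lane needed on Howard",
    "N line delayed, buses bunching at 4th and King",
]

# Transform raw text into TF-IDF document vectors (one row per tweet).
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(docs)

print(X.shape)  # (number of tweets, vocabulary size)
```

TF-IDF is the simplest document embedding to teach first; a later CodeLab could swap in denser embeddings without changing the surrounding pipeline.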
Also available in the Twitter API response is the URL of any image attached to a tweet. Later iterations of the ML pipeline might include facial or object recognition on images, so we might as well engineer our output data files now to include this info, since our database needs are small and this info only adds a few bytes.
Twitter output contains raw byte escapes rather than punctuation. This is not users typing the wrong character: the text is an undecoded UTF-8 byte string, so a curly apostrophe (U+2019) shows up as \xe2\x80\x99, e.g.
b'@sfmta_muni how\xe2\x80\x99s this looking for the NX??? a lot of people waiting'
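Those escapes are the UTF-8 encoding of real punctuation; decoding the byte string restores the original text. A minimal sketch:

```python
# The raw bytes as returned before decoding.
raw = b'@sfmta_muni how\xe2\x80\x99s this looking for the NX??? a lot of people waiting'

# \xe2\x80\x99 is the UTF-8 encoding of the curly apostrophe (U+2019);
# decoding the byte string turns it back into readable text.
text = raw.decode("utf-8")
print(text)
```

In practice this usually means decoding (or requesting already-decoded strings) at the point where tweets enter the pipeline, rather than writing `b'...'` reprs into the output file.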
Key topics
Objective
We want to develop a pipeline that can tag tweets by topic. In this project, a topic is something that Muni riders are talking about. For example, bus bunching and bike safety are topics. If we can tag tweets by topic in an automated fashion, we will have a data-backed understanding of what riders care about at a given moment.
First steps
We have an initial set of manually labeled tweets located in this Google Sheet. Also, it may be valuable to look at the ClassificationExperiment CodeLab, located here.
Useful tools
Classification - scikit-learn
Text preprocessing - spacy
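A topic tagger along these lines could begin as a scikit-learn pipeline trained on the manually labeled tweets. The texts, topics, and test sentence below are toy assumptions; the real labels live in the Google Sheet:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled tweets — stand-ins for the manually labeled Google Sheet rows.
texts = [
    "two 38R buses arrived back to back again",
    "buses bunching on geary",
    "no protected bike lane here, felt unsafe",
    "car parked in the bike lane on howard",
]
labels = ["bus bunching", "bus bunching", "bike safety", "bike safety"]

# TF-IDF features feeding a linear classifier: a common first baseline.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["car blocking the bike lane"]))
```

A baseline like this also gives us a number (held-out accuracy) to beat when we try fancier models.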
In the "concepts to be familiar with" section of the Welcome page, we should point people in the direction of where they can start learning about these technologies.
Label tweets 100 - 200 in the google sheets file.
Research feasibility, possible approaches, etc.
Conducted interviews with projects about how they do user research: do stakeholders surrounding their issue interact on Twitter, would statistics of this activity be interesting, and in what format?
The environment.yml file in the root directory causes a conflict when I try to use it:
(base) C:\Users\Joshf\Documents\nltweets>conda env create -f environment.yml
Solving environment: failed
UnsatisfiableError: The following specifications were found to be in conflict:
- python==3.6.0
- sqlalchemy==1.1 -> python=2.7
Use "conda info <package>" to see the dependencies for each package.
My base environment is python 3.6, so I think that's what's causing the conflict.
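If the repo is free to relax the pin, one way to resolve the conflict is to loosen the sqlalchemy version so conda can pick a build compatible with Python 3.6. A hypothetical sketch — the actual package list in the repo's environment.yml is an assumption here:

```yaml
# Hypothetical environment.yml sketch — the real package list may differ.
name: nltweets
dependencies:
  - python=3.6
  - sqlalchemy>=1.1   # a loose pin lets conda choose a Python 3 build
                      # instead of the ==1.1 build that requires python=2.7
```

Alternatively, pinning to a specific newer sqlalchemy release known to support Python 3.6 would keep the environment fully reproducible.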
I'm getting
'{"errors":[{"code":89,"message":"Invalid or expired token."}]}'
in the CodeLabs example CodeLab0TwitterAPI.ipynb. Can we include an example of the structure, with dummy values like the following? "Your creds.txt file will look like this:
ACCESS_KEY.123456789876543210
ACCESS_SECRET.abCD123E4f6ghi7jkLmnOPqR89sTUv8WXYZy7Xw6vUtsr
CONSUMER_KEY.1ABC2defG3HI4JkLmNOpQrs5T
CONSUMER_SECRET.abCD123E4f6ghi7jkLmnOPqR89sTUv8WXYZy7Xw6vUtsr5qp4o
But not that, because that throws an error.
Occasionally people spam @sfmta_muni with more than 2 tweets in a short amount of time (e.g. < 30 minutes). We should purge all but the first or, better, merge them all via concatenation, so that the possible topics must be pulled from the entirety of the diatribe rather than scoring multiple hits for the same issue.
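The merge-via-concatenation idea could be sketched with the standard library alone. The record shape (`user`, `time`, `text` fields) and the sample tweets are assumptions:

```python
from datetime import datetime, timedelta

# Hypothetical tweet records — field names and values are assumptions.
tweets = [
    {"user": "rider1", "time": datetime(2019, 2, 1, 9, 0), "text": "N line late again"},
    {"user": "rider1", "time": datetime(2019, 2, 1, 9, 10), "text": "still waiting"},
    {"user": "rider1", "time": datetime(2019, 2, 1, 9, 20), "text": "20 minutes now!!"},
    {"user": "rider2", "time": datetime(2019, 2, 1, 9, 5), "text": "bike lane blocked"},
]

def merge_bursts(tweets, window=timedelta(minutes=30)):
    """Concatenate tweets from the same user posted within `window` of the
    previous tweet in the burst, so a diatribe is scored as one document."""
    merged = []
    for t in sorted(tweets, key=lambda t: (t["user"], t["time"])):
        last = merged[-1] if merged else None
        if last and last["user"] == t["user"] and t["time"] - last["time"] <= window:
            last["text"] += " " + t["text"]
            last["time"] = t["time"]  # extend the burst from the latest tweet
        else:
            merged.append(dict(t))
    return merged

result = merge_bursts(tweets)
```

Extending the window from the latest tweet (rather than the first) means a sustained rant still collapses into one document.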
My beta version just pulls the extended tweet text with no parsing, as in @steven4354's link in issue 1. We want to add geo coordinates to the output so we can aggregate and visualize data according to where tweets were created.
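In the v1.1 Twitter API, a geotagged tweet carries a GeoJSON `coordinates` field in `[longitude, latitude]` order; a small helper can normalize that to `(lat, lon)` for the output file. The sample tweet dict below is a made-up illustration:

```python
def tweet_coords(tweet):
    """Return (lat, lon) for a v1.1 API tweet dict, or None if not geotagged.

    The 'coordinates' field, when present, is GeoJSON with
    coordinates in [longitude, latitude] order.
    """
    geo = tweet.get("coordinates")
    if not geo:
        return None
    lon, lat = geo["coordinates"]
    return lat, lon

# Hypothetical geotagged tweet for illustration.
sample = {
    "text": "stuck at 4th and King",
    "coordinates": {"type": "Point", "coordinates": [-122.3927, 37.7765]},
}
print(tweet_coords(sample))  # (37.7765, -122.3927)
```

Most tweets are not geotagged, so the output column should tolerate missing values rather than assume every row has coordinates.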
Once tweet_data.ipynb is in main/code, also make sure the output data file is added to .gitignore, so we aren't constantly overwriting the output with each new PR. The output of this code for analysis should be renamed and submitted as a PR to main/data/manual. When we want to refactor the data structure for any reason, we'll create a new PR to maintain reproducibility.
Label tweets 301 - 400 in the google sheets file.
Covers text cleaning and named entity recognition
I'm not 100% positive this info is completely safe. If you know this area, please confirm, fix if necessary, and close this issue.
Dedupe posts for labelling from
https://docs.google.com/spreadsheets/d/1duD-RUOTLNu9zW4aXqfgmow62pOcKUuUitxqeVRkbPs/edit#gid=0
and
https://docs.google.com/spreadsheets/d/12mTqmgyiCDanTiPFrNdjj96UDZuQNCcnJdzdhOXeZn0/edit#gid=0
Combine the tweets into one document
Make sure not to delete any labels already applied to tweets
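The dedupe-without-losing-labels step can be sketched with the standard library, assuming each sheet exports as (tweet text, label) pairs with an empty string meaning "not yet labeled" — the column layout and sample rows are assumptions:

```python
# Hypothetical rows exported from the two sheets: (tweet_text, label) pairs,
# with "" meaning not yet labeled. Column layout is an assumption.
sheet_a = [("N line late again", "delay"), ("bike lane blocked", "")]
sheet_b = [("bike lane blocked", "bike safety"), ("driver was great today", "")]

def dedupe(*sheets):
    """Combine sheets into one list, dropping duplicate tweets but never
    discarding a label that has already been applied."""
    merged = {}
    for sheet in sheets:
        for text, label in sheet:
            # keep the first copy; upgrade it if a later duplicate has a label
            if text not in merged or (label and not merged[text]):
                merged[text] = label
    return list(merged.items())

combined = dedupe(sheet_a, sheet_b)
```

Keying on tweet text catches exact duplicates across the two sheets; if the sheets contain retweets or near-duplicates, a normalization pass (lowercasing, stripping URLs) could run before the merge.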
Update the root README file, especially the Overview section and the Needs of the Project. This will make it easier to onboard new members and let them know what we need to pitch for.
Add an input field for whether the desired data is a hashtag or a mention.
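Behind that input field, a hypothetical helper could turn the user's choice into a Twitter search term — the function name and the "hashtag"/"mention" field values are assumptions:

```python
def build_query(term, kind):
    """Build a Twitter search term from a raw input and the requested kind.

    `kind` is either "hashtag" or "mention" — a hypothetical value
    coming from the new input field.
    """
    term = term.lstrip("#@")  # tolerate users typing the prefix themselves
    if kind == "hashtag":
        return "#" + term
    if kind == "mention":
        return "@" + term
    raise ValueError(f"unknown kind: {kind!r}")

print(build_query("sfmta_muni", "mention"))  # @sfmta_muni
```

Stripping any prefix the user already typed keeps "#MuniMetro" and "MuniMetro" from producing different queries.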
Probably use an AWS Lightsail Linux machine to run
https://github.com/mlegy/Slack-logger-bot
Classification with scikit-learn
Key topics
Objective
We want to build a system that can tag our tweets with entities that are useful to us. For example, we may want to tag a tweet with the entity "N Line" or "4th and King." This helps us know exactly which transit-related entities are being talked about.
First steps
There are two approaches to this task. One is a supervised approach, where we determine a set of entities that are interesting to us and tag tweets either by keyword matching or using a classifier. The other is unsupervised from our perspective: named entity recognition, where we try to automatically extract named entities from a dataset without labeling it ourselves.
Useful tools
Classification - scikit-learn
Named entity recognition - spacy
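The supervised keyword-matching variant could start as simply as the sketch below; the entity-to-keyword map is a made-up example, and the real list would be curated by the group:

```python
# Hypothetical entity -> keyword map; the real list would be curated by us.
ENTITY_KEYWORDS = {
    "N Line": ["n line", "n judah"],
    "4th and King": ["4th and king", "4th & king"],
}

def tag_entities(text):
    """Return the entities whose keywords appear in the tweet text."""
    lowered = text.lower()
    return [
        entity
        for entity, keywords in ENTITY_KEYWORDS.items()
        if any(k in lowered for k in keywords)
    ]

print(tag_entities("N Judah stalled near 4th & King again"))
```

A keyword matcher like this gives an immediate baseline to compare against spacy's NER output, and its misses tell us which entity aliases the curated list is missing.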
Should be part of 3 and 4. We need to filter out messages from our target Twitter user (e.g. @sfmta_muni). We also need to handle multiple tweets by the same user about the same topic in a relatively short period of time, e.g. more than 2 tweets in one hour, more than 3 tweets in one day, or more than 4 tweets in 2 days.
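The first filter — dropping tweets authored by the account we're querying about — is a one-liner over the v1.1 tweet dicts. The sample records here are assumptions:

```python
# Hypothetical tweet dicts shaped like the v1.1 API's nested user field.
tweets = [
    {"user": {"screen_name": "sfmta_muni"}, "text": "N line resuming normal service"},
    {"user": {"screen_name": "rider1"}, "text": "finally moving again"},
]

def drop_target_user(tweets, target="sfmta_muni"):
    """Remove tweets authored by the account we are querying about,
    so the agency's own replies don't count as rider sentiment."""
    return [t for t in tweets if t["user"]["screen_name"].lower() != target.lower()]

filtered = drop_target_user(tweets)
```

The case-insensitive comparison matters because Twitter screen names are case-insensitive but the API returns them as the user typed them.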
The issue is that when calling the Twitter API for many rows, we hit rate limiting. So we need to figure out a way around that.
Label tweets 401 - 500 in the google sheets file.