
stack's Introduction

STACKS - Social Media Tracker, Analyzer, & Collector Toolkit at Syracuse

STACKS is an extensible social media research toolkit designed to collect, process, and store data from online social networks. The toolkit is an ongoing project at the Syracuse University iSchool and currently supports the Twitter Streaming API; collecting from the Twitter Search API is under development. The toolkit's architecture is modular and designed to be extended.

You can cite this repository:

Jeff Hemsley, Sam Jackson, Sikana Tanupabrungsun, & Billy Ceskavich. (2019). bitslabsyr/stack: STACKS 3.1 (Version 3.1). http://doi.org/10.5281/zenodo.2638848

This documentation assumes the following:

  • You know how to use ssh.
  • Your server has MongoDB already installed.
  • You understand how to edit files using vim ("vi") or nano.
  • You have rights and know how to install Python libraries.

Installation

Please read through Install to go through the STACK installation process.

Prior to installing STACK, make sure you have MongoDB installed and running on your server. Learn how to install MongoDB here.

Wiki

To learn more about STACK semantics, logging, and processing parameters, refer to our wiki.

Ongoing Work + Next Action Items

This list will be updated soon with more detailed action items. Please note again that we are actively working on this toolkit!

  1. Full move away from .ini file use
  2. Extensible module format for future social network implementations
  3. Extensible back-end API

Credits

Lovingly maintained at Syracuse University by:

Distributed under the MIT License:

The MIT License (MIT)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

stack's People

Contributors

alexanderosmith, bceskavich, danyfdz92, jeckert1, jhemsley, sikana, sjacks26, stanupab


stack's Issues

Add more CLI feedback

Add more CLI feedback for users running any of the processes. Specifically, add prompts when invalid inputs are entered for language, location, API, network, etc.
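For illustration, input validation of this kind might look like the following sketch (the option sets and function name here are hypothetical, not STACK's actual API):

```python
# Hypothetical CLI validation helper: reject unknown values for fields
# like "network" or "api" and tell the user what the valid options are.
# The option sets below are assumptions for illustration.

VALID_NETWORKS = {"twitter"}
VALID_APIS = {"track", "follow", "locations"}

def validate_input(field, value, valid_options):
    """Return the value if valid, otherwise raise with a helpful prompt."""
    if value not in valid_options:
        raise ValueError(
            "Invalid %s: %r. Valid options are: %s"
            % (field, value, ", ".join(sorted(valid_options)))
        )
    return value
```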

Follow collection with no handles

If a Twitter collector is made using follow, STACK and Twitter expect all the collection terms to be account handles. If collection terms aren't account handles, they are given a "null" value for the "id" field in the terms list in the config doc.
If none of the collection terms are account handles, the collector will continuously throw an error, reporting a 406 error from Twitter.

We should add a reminder to the documentation not to include any spaces in collection terms when using follow. If possible, it would be great to show a dynamic warning when someone tries to do that.
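As an illustration of such a warning, a hypothetical validation step could flag any follow term that does not look like a Twitter handle (the function name and regex are assumptions, not STACK code):

```python
import re

# Illustrative check: every term in a "follow" collector should look like
# a Twitter handle (letters, digits, underscores, optionally a leading @,
# at most 15 characters) with no spaces.
HANDLE_RE = re.compile(r"^@?\w{1,15}$")

def non_handle_terms(terms):
    """Return the terms that would get a null "id" from Twitter's user lookup."""
    return [t for t in terms if not HANDLE_RE.match(t)]
```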

Remove (or prevent) duplicate media files

Right now, getmedia.py downloads all media files that it can. But with retweets and resharing images and other forms of media, we might not need to keep all copies of media files.
Instead, we might want to create unique identifiers for all unique media files, then link those unique identifiers with each tweet in which that media appears.

If that's the case, we might hash media files to get a bit-level representation of unique files; store all unique hashes somewhere (perhaps alongside the main data collection, like we do with stream limit messages); compare new media file hashes against the list of existing media file hashes; and only retain unique media files.
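That hashing approach could be sketched roughly as follows (illustrative only; the function names and storage details are assumptions):

```python
import hashlib

def media_hash(data):
    """Bit-level fingerprint of a media file's contents."""
    return hashlib.sha256(data).hexdigest()

def retain_unique(files, seen_hashes):
    """Yield (hash, data) only for files whose hash has not been seen,
    updating seen_hashes (which would be persisted alongside the main
    data collection) as we go."""
    for data in files:
        h = media_hash(data)
        if h not in seen_hashes:
            seen_hashes.add(h)
            yield h, data
```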

Rate Limit central collection fields not populating properly

I haven't looked into this at all yet, but it seems that some of the fields don't always populate correctly in the central rate limit collection.

The following comes from the db called "Houston" on houston.

{
    "_id" : ObjectId("5b29e0fc7315ac1c3da8ba15"),
    "project_name" : "Gunsv7",
    "lost_count" : 1,
    "server_name" : "houston",
    "notified" : false,
    "timestamp_ms" : "1529469968212",
    "collector_id" : "streamlimits",
    "collection_type" : "UNDEFINED",
    "time" : "2018-06-20T00:46:08",
    "project_id" : "5ad7a9f47315ac32f5d8b9bf"
}
{
    "_id" : ObjectId("5b2a275c7315ac1c3da8ff4c"),
    "project_name" : "Gunsv7",
    "lost_count" : 2,
    "server_name" : "houston",
    "notified" : false,
    "timestamp_ms" : "1529488688607",
    "collector_id" : "5ad7ab407315ac330929bc02",
    "collection_type" : "track",
    "time" : "2018-06-20T05:58:08",
    "project_id" : "5ad7a9f47315ac32f5d8b9bf"
}

kombu error?

Running sudo python __main__.py db create_project leads to the following Traceback.

Traceback (most recent call last):
  File "__main__.py", line 7, in <module>
    from app.controller import Controller
  File "/home/bits/stackv1/app/__init__.py", line 3, in <module>
    from celery import Celery
  File "/usr/local/lib/python2.7/dist-packages/celery/__init__.py", line 130, in <module>
    from celery import five
  File "/usr/local/lib/python2.7/dist-packages/celery/five.py", line 149, in <module>
    from kombu.utils.compat import OrderedDict  # noqa
  File "/usr/local/lib/python2.7/dist-packages/kombu/utils/__init__.py", line 19, in <module>
    from uuid import UUID, uuid4 as _uuid4, _uuid_generate_random
ImportError: cannot import name _uuid_generate_random

Solve by updating kombu to 3.0.30 in requirements.txt

Better error handling for getmedia.py

getmedia.py sometimes times out or gets a socket error. The code should be able to handle these kinds of errors and continue, but right now, the script crashes.
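A hedged sketch of the fix: wrap each download attempt in a retry helper so a timeout or socket error skips that file instead of crashing the script. The helper and its defaults are illustrative, not getmedia.py's actual code:

```python
import socket
import time

def with_retry(fetch, retries=3, delay=2.0):
    """Call fetch(); on a timeout or socket error, wait and retry.
    After `retries` failures, return None so the caller can move on
    to the next media file instead of crashing."""
    for attempt in range(retries):
        try:
            return fetch()
        except (socket.timeout, OSError):
            time.sleep(delay * (attempt + 1))  # simple linear backoff
    return None
```

Here `fetch` would be a zero-argument callable that performs one download (for example, a `functools.partial` around the existing download function).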

Add count of tweets to config doc

We have a count of the number of tweets lost to stream limits in the collector config doc. Can we add a count of the number of tweets we collect?

Fix stream limit loss count in config doc

Have the stream limit counter check the config doc to see if there's a value for the number of stream limit losses when starting a process. Right now, if a collector is restarted, that number is reset.
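The fix could amount to seeding the in-memory counter from the stored value at startup; a minimal sketch, assuming a hypothetical config-doc field name:

```python
# Illustrative only: "stream_limit_loss" / "total" stand in for whatever
# field the config doc actually uses to store the running loss count.

def initial_loss_count(config_doc):
    """Resume the stream-limit loss count from the config doc, defaulting to 0."""
    return config_doc.get("stream_limit_loss", {}).get("total", 0)
```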

Restart collector (and maybe processor) on update_collector_detail.

Apparently, if I use the update_collector_detail method to add new collection terms (and presumably to remove collection terms, too), those changes don't take effect until that collector has been restarted. I can't imagine a use case where someone would want to update collector parameters but not have those updates take effect, so I think it's logical to have the collector automatically restart when update_collector_detail is run.

Also, updates to collection terms won't take effect in how the processor builds track_kw, so it might be good to automatically restart the processor, too.

Add more info to limits output

In the main project db, if there are rate limits, there's a collection for limits. Sample item here:
{
    "_id" : ObjectId("5966b98f122fcc7e414fd2f2"),
    "limit" : {
        "track" : 1,
        "time" : "2017-07-12T19:09:02",
        "timestamp_ms" : "1499900942909"
    }
}

It would be useful to add a field for which collector the limit comes from. I imagine that shouldn't be too difficult, since the project config db keeps track of tweets lost to stream limits for each collector separately.

Track_kw broken

"track_kw" field should tell us why we collected a given tweet - what of our collection criteria led to a particular tweet being collected. Right now, that field is blank.
It's in app/twttier/tweetprocessing.py
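For reference, the intended behavior might be sketched like this (a simplified stand-in for the tweetprocessing.py logic, using plain case-insensitive substring matching):

```python
def match_track_terms(tweet_text, track_terms):
    """Return the track terms found (case-insensitively) in the tweet text,
    i.e. the collection criteria that led to this tweet being collected."""
    text = tweet_text.lower()
    return [term for term in track_terms if term.lower() in text]
```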

Character conversion error

We are getting the following error when attempting to collect from Twitter.

TOOKLKIT STREAM: Unkown stream exception caught.
'ascii' codec can't encode character u'\xad' in position 8: ordinal not in range(128)
Retrying in 320 seconds.

The error occurs every time the collector runs. No tweets are processed. I'm not sure exactly where this is being thrown, so it is hard to tell if it is tweet content or something else that the processor is choking on.
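The message suggests that tweet text containing non-ASCII characters (here U+00AD, the soft hyphen) is being encoded as ASCII somewhere in the pipeline. A small demonstration of the failure mode, and the UTF-8 encoding that avoids it:

```python
# U+00AD (soft hyphen) cannot be encoded as ASCII, but encodes fine as
# UTF-8. Wherever the collector encodes tweet text, it should use UTF-8
# (or at least errors="replace") rather than the default ASCII codec.

text = "gun\xadcontrol"  # example text containing a soft hyphen

try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False      # this is the error the collector is hitting

utf8_bytes = text.encode("utf-8")  # succeeds
```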

Stream limit loss count is misleading

According to Twitter docs, Twitter keeps track of the running count of tweets lost to stream limits since an API connection is opened. That means that our limit collection should probably update with each new stream limit message rather than adding a new doc for each stream limit message. If we want to keep all messages in the limit collection, we need to make the STACKS documentation really clear that the count in the track field is cumulative back to when the API connection was opened.
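If we do keep one doc per limit message, per-message losses can still be recovered by differencing consecutive cumulative counts; a minimal illustrative helper:

```python
def per_message_losses(cumulative_counts):
    """Convert a cumulative "track" series (per Twitter's docs, a running
    total since the connection opened), e.g. [1, 3, 7], into per-message
    deltas, e.g. [1, 2, 4]."""
    losses, previous = [], 0
    for count in cumulative_counts:
        losses.append(count - previous)
        previous = count
    return losses
```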
