
stack's Introduction

STACKS - Social Media Tracker, Analyzer, & Collector Toolkit at Syracuse

STACKS is an extensible social media research toolkit designed to collect, process, and store data from online social networks. The toolkit is an ongoing project at the Syracuse University iSchool and currently supports the Twitter Streaming API; collecting from the Twitter Search API is under development. The toolkit's architecture is modular and designed to be extended.

You can cite this repository:

Jeff Hemsley, Sam Jackson, Sikana Tanupabrungsun, & Billy Ceskavich. (2019). bitslabsyr/stack: STACKS 3.1 (Version 3.1). http://doi.org/10.5281/zenodo.2638848

This documentation assumes the following:

  • You know how to use ssh.
  • Your server has MongoDB already installed.
  • You understand how to edit files using vim ("vi") or nano.
  • You have rights and know how to install Python libraries.

Installation

Please read through Install to go through the STACK installation process.

Prior to installing STACK, make sure you have MongoDB installed and running on your server. Learn how to install MongoDB here.

Wiki

To learn more about STACK semantics, logging, and processing parameters, refer to our wiki.

Ongoing Work + Next Action Items

This list will be updated soon with more detailed action items. Please note again that we are actively working on this toolkit!

  1. Full move away from .ini file use
  2. Extensible module format for future social network implementations
  3. Extensible back-end API

Credits

Lovingly maintained at Syracuse University by:

Distributed under the MIT License:

The MIT License (MIT)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

stack's People

Contributors

alexanderosmith, bceskavich, danyfdz92, jeckert1, jhemsley, sikana, sjacks26, stanupab


stack's Issues

Add more CLI feedback

Add more CLI feedback for users running any of the processes. Specifically, add prompts when invalid inputs are entered for language, location, API, network, etc.
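For illustration, input validation of this kind might look like the following sketch (the option sets and function name here are hypothetical, not STACK's actual API):

```python
# Hypothetical CLI validation helper: reject unknown values for fields
# like "network" or "api" and tell the user what the valid options are.
# The option sets below are assumptions for illustration.

VALID_NETWORKS = {"twitter"}
VALID_APIS = {"track", "follow", "locations"}

def validate_input(field, value, valid_options):
    """Return the value if valid, otherwise raise with a helpful prompt."""
    if value not in valid_options:
        raise ValueError(
            "Invalid %s: %r. Valid options are: %s"
            % (field, value, ", ".join(sorted(valid_options)))
        )
    return value
```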

Follow collection with no handles

If a Twitter collector is made using follow, STACK and Twitter expect all the collection terms to be account handles. If collection terms aren't account handles, they are given a "null" value for the "id" field in the terms list in the config doc.
If none of the collection terms are account handles, the collector will continuously throw an error, reporting a 406 error from Twitter.

We should add a reminder to the documentation not to include any spaces in collection terms when using follow. If possible, it would be great to show a dynamic warning when someone tries to do that.
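As an illustration of such a warning, a hypothetical validation step could flag any follow term that does not look like a Twitter handle (the function name and regex are assumptions, not STACK code):

```python
import re

# Illustrative check: every term in a "follow" collector should look like
# a Twitter handle (letters, digits, underscores, optionally a leading @,
# at most 15 characters) with no spaces.
HANDLE_RE = re.compile(r"^@?\w{1,15}$")

def non_handle_terms(terms):
    """Return the terms that would get a null "id" from Twitter's user lookup."""
    return [t for t in terms if not HANDLE_RE.match(t)]
```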

Remove (or prevent) duplicate media files

Right now, getmedia.py downloads all media files that it can. But with retweets and resharing images and other forms of media, we might not need to keep all copies of media files.
Instead, we might want to create unique identifiers for all unique media files, then link those unique identifiers with each tweet in which that media appears.

If that's the case, we might hash media files to get a bit-level representation of unique files; store all unique hashes somewhere (perhaps alongside the main data collection, like we do with stream limit messages); compare new media file hashes against the list of existing media file hashes; and only retain unique media files.
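That hashing approach could be sketched roughly as follows (illustrative only; the function names and storage details are assumptions):

```python
import hashlib

def media_hash(data):
    """Bit-level fingerprint of a media file's contents."""
    return hashlib.sha256(data).hexdigest()

def retain_unique(files, seen_hashes):
    """Yield (hash, data) only for files whose hash has not been seen,
    updating seen_hashes (which would be persisted alongside the main
    data collection) as we go."""
    for data in files:
        h = media_hash(data)
        if h not in seen_hashes:
            seen_hashes.add(h)
            yield h, data
```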

Rate Limit central collection fields not populating properly

I haven't looked into this at all yet, but it seems that some of the fields don't always populate correctly in the central rate limit collection.

The following comes from the db called "Houston" on houston.

{
    "_id" : ObjectId("5b29e0fc7315ac1c3da8ba15"),
    "project_name" : "Gunsv7",
    "lost_count" : 1,
    "server_name" : "houston",
    "notified" : false,
    "timestamp_ms" : "1529469968212",
    "collector_id" : "streamlimits",
    "collection_type" : "UNDEFINED",
    "time" : "2018-06-20T00:46:08",
    "project_id" : "5ad7a9f47315ac32f5d8b9bf"
}
{
    "_id" : ObjectId("5b2a275c7315ac1c3da8ff4c"),
    "project_name" : "Gunsv7",
    "lost_count" : 2,
    "server_name" : "houston",
    "notified" : false,
    "timestamp_ms" : "1529488688607",
    "collector_id" : "5ad7ab407315ac330929bc02",
    "collection_type" : "track",
    "time" : "2018-06-20T05:58:08",
    "project_id" : "5ad7a9f47315ac32f5d8b9bf"
}

kombu error?

Running sudo python __main__.py db create_project leads to the following Traceback.

Traceback (most recent call last):
  File "__main__.py", line 7, in <module>
    from app.controller import Controller
  File "/home/bits/stackv1/app/__init__.py", line 3, in <module>
    from celery import Celery
  File "/usr/local/lib/python2.7/dist-packages/celery/__init__.py", line 130, in <module>
    from celery import five
  File "/usr/local/lib/python2.7/dist-packages/celery/five.py", line 149, in <module>
    from kombu.utils.compat import OrderedDict  # noqa
  File "/usr/local/lib/python2.7/dist-packages/kombu/utils/__init__.py", line 19, in <module>
    from uuid import UUID, uuid4 as _uuid4, _uuid_generate_random
ImportError: cannot import name _uuid_generate_random

Solve by updating kombu to 3.0.30 in requirements.txt

Better error handling for getmedia.py

getmedia.py sometimes times out or gets a socket error. The code should be able to handle these kinds of errors and continue, but right now, the script crashes.
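A hedged sketch of the fix: wrap each download attempt in a retry helper so a timeout or socket error skips that file instead of crashing the script. The helper and its defaults are illustrative, not getmedia.py's actual code:

```python
import socket
import time

def with_retry(fetch, retries=3, delay=2.0):
    """Call fetch(); on a timeout or socket error, wait and retry.
    After `retries` failures, return None so the caller can move on
    to the next media file instead of crashing."""
    for attempt in range(retries):
        try:
            return fetch()
        except (socket.timeout, OSError):
            time.sleep(delay * (attempt + 1))  # simple linear backoff
    return None
```

Here `fetch` would be a zero-argument callable that performs one download (for example, a `functools.partial` around the existing download function).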

Add count of tweets to config doc

We have a count of the number of tweets lost to stream limits in the collector config doc. Can we add a count of the number of tweets we collect?

Fix stream limit loss count in config doc

Have the stream limit counter check the config doc to see if there's a value for the number of stream limit losses when starting a process. Right now, if a collector is restarted, that number is reset.
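The fix could amount to seeding the in-memory counter from the stored value at startup; a minimal sketch, assuming a hypothetical config-doc field name:

```python
# Illustrative only: "stream_limit_loss" / "total" stand in for whatever
# field the config doc actually uses to store the running loss count.

def initial_loss_count(config_doc):
    """Resume the stream-limit loss count from the config doc, defaulting to 0."""
    return config_doc.get("stream_limit_loss", {}).get("total", 0)
```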

Restart collector (and maybe processor) on update_collector_detail.

Apparently, if I use the update_collector_detail method to add new collection terms (and presumably to remove collection terms, too), those changes don't take effect until that collector has been restarted. I can't imagine a use case where someone would want to update collector parameters but not have those updates take effect, so I think it's logical to have the collector automatically restart when update_collector_detail is run.

Also, updates to collection terms won't take effect in how the processor builds track_kw, so it might be good to automatically restart the processor, too.

Add more info to limits output

In the main project db, if there are rate limits, there's a collection for limits. Sample item here:
{
    "_id" : ObjectId("5966b98f122fcc7e414fd2f2"),
    "limit" : {
        "track" : 1,
        "time" : "2017-07-12T19:09:02",
        "timestamp_ms" : "1499900942909"
    }
}

It would be useful to add a field for which collector the limit comes from. I imagine that shouldn't be too difficult, since the project config db keeps track of tweets lost to stream limits for each collector separately.

Track_kw broken

"track_kw" field should tell us why we collected a given tweet - what of our collection criteria led to a particular tweet being collected. Right now, that field is blank.
It's in app/twttier/tweetprocessing.py
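For reference, the intended behavior might be sketched like this (a simplified stand-in for the tweetprocessing.py logic, using plain case-insensitive substring matching):

```python
def match_track_terms(tweet_text, track_terms):
    """Return the track terms found (case-insensitively) in the tweet text,
    i.e. the collection criteria that led to this tweet being collected."""
    text = tweet_text.lower()
    return [term for term in track_terms if term.lower() in text]
```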

Character conversion error

We are getting the following error when attempting to collect from Twitter.

TOOKLKIT STREAM: Unkown stream exception caught.
'ascii' codec can't encode character u'\xad' in position 8: ordinal not in range(128)
Retrying in 320 seconds.

The error occurs every time the collector runs. No tweets are processed. I'm not sure exactly where this is being thrown, so it is hard to tell if it is tweet content or something else that the processor is choking on.
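The message suggests that tweet text containing non-ASCII characters (here U+00AD, the soft hyphen) is being encoded as ASCII somewhere in the pipeline. A small demonstration of the failure mode, and the UTF-8 encoding that avoids it:

```python
# U+00AD (soft hyphen) cannot be encoded as ASCII, but encodes fine as
# UTF-8. Wherever the collector encodes tweet text, it should use UTF-8
# (or at least errors="replace") rather than the default ASCII codec.

text = "gun\xadcontrol"  # example text containing a soft hyphen

try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False      # this is the error the collector is hitting

utf8_bytes = text.encode("utf-8")  # succeeds
```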

Stream limit loss count is misleading

According to Twitter docs, Twitter keeps track of the running count of tweets lost to stream limits since an API connection is opened. That means that our limit collection should probably update with each new stream limit message rather than adding a new doc for each stream limit message. If we want to keep all messages in the limit collection, we need to make the STACKS documentation really clear that the count in the track field is cumulative back to when the API connection was opened.
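If we do keep one doc per limit message, per-message losses can still be recovered by differencing consecutive cumulative counts; a minimal illustrative helper:

```python
def per_message_losses(cumulative_counts):
    """Convert a cumulative "track" series (per Twitter's docs, a running
    total since the connection opened), e.g. [1, 3, 7], into per-message
    deltas, e.g. [1, 2, 4]."""
    losses, previous = [], 0
    for count in cumulative_counts:
        losses.append(count - previous)
        previous = count
    return losses
```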
