hayesdavis / flamingo

Flamingo is a service for wading into the Twitter Streaming API.

License: MIT License

Ruby 100.00%

flamingo's Issues

OAuth Support

What it says in the title.

Separate Redis Instances for Internal Use and Subscription Dispatch

Internally, flamingo uses Redis as a queue between the wader and dispatcher processes. It's also used for stats tracking. Currently it assumes it can dispatch Resque jobs for subscriptions to this Redis instance as well.

If we allow separate subscription Redis instance(s) that are unrelated to the "core" instance, we can ensure that flamingo's performance is not affected by a misbehaving Redis instance that is shared with other applications.

The design of this separation should make it possible to still use one Redis instance for both core functionality and for subscription dispatching if desired.
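
A minimal sketch of how such a configuration could be resolved is shown here. The `RedisSettings` class and the `core` / `subscriptions` config keys are hypothetical names made up for illustration, not flamingo's actual schema. The only point is that subscription dispatch falls back to the core instance when no separate instance is configured.

```ruby
# Hypothetical helper: decides which Redis instance each role should use.
# The config keys are illustrative assumptions, not flamingo's real schema.
class RedisSettings
  DEFAULT = { "host" => "localhost", "port" => 6379 }.freeze

  def initialize(config)
    @redis = (config || {}).fetch("redis", {})
  end

  # Settings for the internal wader/dispatcher queue and stats tracking.
  def core
    DEFAULT.merge(@redis.fetch("core", {}))
  end

  # Settings for subscription dispatch. Falls back to the core instance
  # when no separate "subscriptions" section is present.
  def subscriptions
    @redis.key?("subscriptions") ? DEFAULT.merge(@redis["subscriptions"]) : core
  end

  # True when both roles resolve to the same instance.
  def shared?
    core == subscriptions
  end
end

shared = RedisSettings.new("redis" => { "core" => { "port" => 6380 } })
split  = RedisSettings.new(
  "redis" => {
    "core"          => { "port" => 6380 },
    "subscriptions" => { "host" => "resque.internal", "port" => 6379 }
  }
)

puts shared.shared? # single instance for everything
puts split.shared?  # subscriptions isolated from the core instance
```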

Make wader reconnects more robust

The current method of reconnecting to the stream after a predicate change is to simply kill the wader and start a new one. This should be made more robust as described in the Streaming API docs.
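
As a rough illustration of the kind of schedule the Streaming API docs recommend (linear backoff for network errors, exponential backoff for HTTP errors), here is a small sketch. The `ReconnectBackoff` class is hypothetical, and the numeric start and cap values are assumptions chosen for the example; the current figures should be taken from the docs.

```ruby
# Sketch of a reconnect backoff schedule. The numbers are illustrative
# assumptions; consult the Streaming API docs for the recommended values.
class ReconnectBackoff
  NETWORK = { start: 0.25, cap: 16.0 }.freeze  # grows linearly
  HTTP    = { start: 10.0, cap: 240.0 }.freeze # grows exponentially

  # attempt is 1-based; kind is :network or :http. Returns seconds to wait.
  def delay(kind, attempt)
    case kind
    when :network
      [NETWORK[:start] * attempt, NETWORK[:cap]].min
    when :http
      [HTTP[:start] * (2**(attempt - 1)), HTTP[:cap]].min
    else
      raise ArgumentError, "unknown failure kind: #{kind.inspect}"
    end
  end
end

backoff = ReconnectBackoff.new
network_delays = (1..4).map { |n| backoff.delay(:network, n) }
http_delays    = (1..6).map { |n| backoff.delay(:http, n) }
```

The wader would reset its attempt counter once a connection succeeds, so a later failure starts again from the shortest delay.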

Remove Resque Dependency

Resque is likely too heavyweight a solution for the dispatcher. Ideally the dispatcher would be stripped down to a simple process responsible for taking an event from a Redis list (put there by the wader), dispatching it to subscribers and then repeating. No forking, etc.

This would need to be coupled with a new subscription mechanism that would make it easier to create different types of subscriptions, one of which would place the event onto a Resque queue so it could be consumed by a worker. Another type could append the event to a file, etc.
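
The shape of such a dispatcher could look roughly like the sketch below. All class names are hypothetical, and a plain Ruby array stands in for both the Redis list and the Resque queue so the example runs without either dependency; a real loop would block on Redis instead of draining an array.

```ruby
require "json"
require "stringio"

# Hypothetical subscription types; each only needs to respond to #deliver.
class QueueSubscription
  attr_reader :queue

  def initialize(queue = [])
    @queue = queue # stand-in for a Resque queue
  end

  def deliver(event)
    @queue << event
  end
end

class FileSubscription
  def initialize(io)
    @io = io
  end

  # Appends the event as one line of JSON.
  def deliver(event)
    @io.puts(JSON.generate(event))
  end
end

class Dispatcher
  def initialize(source, subscriptions)
    @source = source # stand-in for the Redis list filled by the wader
    @subscriptions = subscriptions
  end

  # Takes one event at a time and hands it to every subscriber. No forking.
  def run
    while (raw = @source.shift)
      event = JSON.parse(raw)
      @subscriptions.each { |subscription| subscription.deliver(event) }
    end
  end
end

incoming  = ['{"text":"first"}', '{"text":"second"}']
queue_sub = QueueSubscription.new
file_io   = StringIO.new
Dispatcher.new(incoming, [queue_sub, FileSubscription.new(file_io)]).run
```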

REST Resources Intermittently Unresponsive

On low-volume streams on the development branch at version 0.3.0, it seems the *.json resources in the REST API will sometimes become unresponsive. The / resource will return correctly.

This may be due to a Redis client connection issue after forking child processes. 0.3.0 now talks to Redis prior to forking the flamingod children, which initiates a socket connection in the flamingod process. This socket may need to be reset after the fork.
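
If the shared socket is indeed the cause, one common remedy is to rebuild the connection whenever the process ID changes. The sketch below shows the idea with a hypothetical `ForkSafeConnection` wrapper; the factory block stands in for whatever creates the real client (for example `Redis.new` from the redis gem), and the PID source is injectable only so the behavior can be demonstrated without actually forking.

```ruby
# Wraps a client factory and builds a fresh client whenever the process ID
# differs from the one that created the current client, as in a forked child.
class ForkSafeConnection
  def initialize(pid_source = -> { Process.pid }, &factory)
    @pid_source = pid_source
    @factory = factory
    @owner_pid = nil
    @client = nil
  end

  def client
    current = @pid_source.call
    if @client.nil? || @owner_pid != current
      # The inherited client is dropped rather than closed, since the parent
      # still uses the same underlying socket.
      @client = @factory.call
      @owner_pid = current
    end
    @client
  end
end

fake_pid = 100
connections = 0
conn = ForkSafeConnection.new(-> { fake_pid }) do
  connections += 1
  Object.new # stand-in for a real Redis client
end

parent_client = conn.client
same_client   = conn.client
fake_pid = 101 # simulate running inside a forked child
child_client  = conn.client
```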

Wader goes into limbo after too many reconnect attempts

After exhausting the maximum number of reconnects, the wader just stops trying and basically goes into limbo. This can be a problem if there are connectivity issues with Twitter for an extended period.
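
One possible alternative to giving up is to drop to a slow, steady retry interval once the quick attempts are exhausted. The sketch below is only an illustration: `connect_with_retries` and its parameters are hypothetical, the block stands in for the real connection attempt, and the sleeper is injectable so the example does not actually wait.

```ruby
# Retries until the block returns true: quick attempts first, then a slower
# probe that continues indefinitely. Returns the number of attempts made.
def connect_with_retries(max_fast_attempts:, fast_delay:, slow_delay:,
                         sleeper: ->(seconds) { sleep(seconds) })
  attempt = 0
  loop do
    attempt += 1
    return attempt if yield(attempt)

    sleeper.call(attempt <= max_fast_attempts ? fast_delay : slow_delay)
  end
end

waits = []
attempts = connect_with_retries(
  max_fast_attempts: 2,
  fast_delay: 1,
  slow_delay: 60,
  sleeper: ->(seconds) { waits << seconds } # record instead of sleeping
) { |n| n >= 5 } # pretend the connection succeeds on the fifth try
```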

Should Revert to Last Known Good Stream Config

There are times when a user might specify a bad track term or some other invalid data for a stream, which can result in a 406 error from Twitter that makes the wader die fatally. It would be nice if the system could revert to the last known good config so that tracking continues. This would likely need to be coupled with some notification system so that this situation can be detected in production.
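
A sketch of the bookkeeping this would need is shown here. `StreamConfigHistory` and its method names are hypothetical; the notifier block stands in for whatever notification system ends up being used.

```ruby
# Tracks the current stream config and the last one known to have connected.
class StreamConfigHistory
  attr_reader :current, :last_good

  def initialize(initial, &notifier)
    @current = initial
    @last_good = nil
    @notifier = notifier
  end

  # Record a new config that has not been validated against Twitter yet.
  def propose(config)
    @current = config
  end

  # Call once the stream connects successfully with the current config.
  def confirm!
    @last_good = @current
  end

  # Call when the current config is rejected (for example with a 406).
  # Returns true if a known good config was restored.
  def reject!(reason)
    bad = @current
    restored = !@last_good.nil?
    @current = @last_good if restored
    @notifier&.call(bad, reason, restored)
    restored
  end
end

alerts = []
history = StreamConfigHistory.new({ "track" => ["ruby"] }) do |bad, reason, restored|
  alerts << [bad, reason, restored]
end

history.confirm!                      # initial config connected fine
history.propose({ "track" => [""] })  # user submits an invalid track term
reverted = history.reject!("406 from stream endpoint")
```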

Better Limit Handling

Limit events are very important and should be handled as follows:

Each limit event should be logged as a WARN message to the flamingo log
The limit status of the current stream should be stored in the meta info for easy lookup

Limits are a single key-value pair of the form {"limit":{STREAM:NNN}} where STREAM is the name of the stream endpoint (usually "filter") and NNN is the integer number of events that have not been delivered since the current connection began. Limit meta information should take into account restarts of the wader, which will reset the limit values.
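
The handling described above might be sketched as follows. `LimitTracker` and the layout of its meta hash are hypothetical; the example logs to a `StringIO` rather than the flamingo log and keeps the meta info in memory rather than in Redis.

```ruby
require "json"
require "logger"
require "stringio"

# Tracks undelivered-event counts per stream, surviving wader restarts.
class LimitTracker
  attr_reader :meta

  def initialize(logger)
    @logger = logger
    @meta = Hash.new { |hash, key| hash[key] = { "current" => 0, "carried" => 0 } }
  end

  # Returns true if the raw event was a limit notice.
  def handle(raw)
    event = JSON.parse(raw)
    return false unless event.is_a?(Hash) && event.size == 1 && event["limit"].is_a?(Hash)

    event["limit"].each do |stream, count|
      @meta[stream]["current"] = count
      @logger.warn("limit on #{stream}: #{count} undelivered on this connection, " \
                   "#{total(stream)} overall")
    end
    true
  end

  # Call when the wader restarts, because the reported count starts over.
  def connection_restarted
    @meta.each_value do |entry|
      entry["carried"] += entry["current"]
      entry["current"] = 0
    end
  end

  def total(stream)
    @meta[stream]["carried"] + @meta[stream]["current"]
  end
end

log = StringIO.new
tracker = LimitTracker.new(Logger.new(log))
tracker.handle('{"limit":{"filter":12}}')
tracker.handle('{"limit":{"filter":30}}')
tracker.connection_restarted
tracker.handle('{"limit":{"filter":5}}')
was_limit = tracker.handle('{"text":"hello"}')
```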

Web UI

Create a resque-web style UI for flamingo that makes it easy to see what's going on.

Incorrect Event Handling

The dispatcher only distinguishes events of type delete and link from tweets. The dispatcher needs to parse and handle all non-tweet events with correct typing for sending to downstream subscribers, including unexpected types.
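
One way to type events without hard-coding every kind is sketched below. It rests on an assumption that should be checked against the Streaming API docs: that tweets carry `text` and `user` keys, while other messages are objects with a single top-level key naming their type. The `event_type` helper is hypothetical.

```ruby
require "json"

# Heuristic typing: tweets are recognized by their fields; any other
# single-key message takes its type from that key; the rest are :unknown.
def event_type(event)
  return :tweet if event.key?("text") && event.key?("user")
  return event.keys.first.to_sym if event.size == 1

  :unknown
end

samples = [
  '{"text":"hi","user":{"id":1}}',
  '{"delete":{"status":{"id":2}}}',
  '{"limit":{"filter":7}}',
  '{"scrub_geo":{"user_id":3}}',
  '{"a":1,"b":2}'
]
types = samples.map { |raw| event_type(JSON.parse(raw)) }
```

Because the type comes from the message itself, a message kind the dispatcher has never seen still reaches subscribers with a usable label instead of being treated as a tweet.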

~/flamingo.yml Detection

Flamingo doesn't detect the existence of ~/flamingo.yml. Detection of ./flamingo.yml works just fine.

To reproduce:

Create $HOME_DIR/flamingod.yml
Run $ flamingo-web
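
The cause is not stated in this report. One thing worth checking, purely as a guess, is tilde handling: Ruby's `File.exist?("~/flamingo.yml")` treats `~` as a literal directory name, so the path has to go through `File.expand_path` first. The hypothetical `find_config` helper below sketches a lookup that does this.

```ruby
# Returns the first existing config file among the given directories,
# or nil if none is found. Directories may use "~" or relative paths.
def find_config(name, dirs)
  dirs.map { |dir| File.join(File.expand_path(dir), name) }
      .find { |path| File.file?(path) }
end

# Typical call: current directory first, then the home directory.
path = find_config("flamingo.yml", [".", "~"])
puts(path || "no config found")
```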

Event Log

Currently if the log level is set to DEBUG, the dispatcher will log the JSON of each event received to the overall flamingo log. This is annoying for a few reasons:

The application log is now filled with JSON, making it hard to locate any actual issues
Replaying or reusing that JSON in some way (in case of a failure somewhere downstream) is difficult because it requires extracting events from a noisy log.

I'm proposing to log all JSON events received to a rotating set of logs where they will be stored newline-separated. Log rotation will be configurable based on a number of events per log. Log retention is outside the scope of this feature. It will be up to the user to occasionally purge logs that are no longer needed.

The event log should be optional based on configuration parameters. It should also be possible to configure flamingo with no subscriptions and an event log if the user wishes to simply capture information and store it to a file.
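
A minimal sketch of count-based rotation is shown below. The `EventLog` class and the `events-NNNNN.log` naming are hypothetical, and the sketch assumes each event arrives as a single line of JSON with no raw newlines. It does not handle resuming numbering after a restart.

```ruby
# Writes raw JSON events one per line, starting a new file after a fixed
# number of events. Old files are never deleted (retention is up to the user).
class EventLog
  def initialize(dir, events_per_file)
    @dir = dir
    @events_per_file = events_per_file
    @count = 0
    @index = 0
    @io = nil
  end

  def append(raw_json)
    rotate if @io.nil? || @count >= @events_per_file
    @io.puts(raw_json)
    @io.flush
    @count += 1
  end

  def close
    @io&.close
    @io = nil
  end

  private

  # Closes the current file, if any, and opens the next one in sequence.
  def rotate
    close
    @index += 1
    @count = 0
    @io = File.open(File.join(@dir, format("events-%05d.log", @index)), "a")
  end
end
```

With `EventLog.new("/var/log/flamingo/events", 10_000)` (an example path, not a flamingo default), the dispatcher would call `append` for each raw event and `close` on shutdown.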

Dispatcher Queue Can Grow Large Under Heavy Loads

The dispatcher throughput is too low to ensure that the queue doesn't grow and that tweets are dispatched in real time under heavy stream rates. On a Rackspace cloud VM, throughput tops out at around 1.2k events per minute, after which we see queue growth.

This is likely due to the forking overhead imposed by using Resque for the dispatching infrastructure.