
diffengine's Introduction

docnow

The web is a big and rapidly changing place, so it can be challenging to discover what resources related to a particular event or topic are in need of archiving. Appraisal is an umbrella term for the many processes by which archivists identify records of enduring value for preservation in an archive. DocNow is an appraisal tool for the social web that uses Twitter.

DocNow allows archivists to tap into conversations on Twitter to help them discover which web resources to collect and preserve. It also connects archivists with content creators in order to make the process of archiving web content more collaborative and consentful. The purpose of DocNow is to help ensure ethical practices in web archiving by building conversations between archivists and the communities they are documenting.

The DocNow application has been developed with generous support from the Mellon Foundation.

Architecture

This repository houses the complete DocNow application, which is made up of a few components:

  • a client side application (React)
  • a server side REST API (Node)
  • a database (PostgreSQL)
  • a messaging queue database (Redis)

Production

If you are running DocNow in production you will want to check out docnow-ansible which allows you to provision and configure DocNow in the cloud.

Development

The main branch of this repository represents the latest tested features of the DocNow application following the trunk based development model. Tagged version releases can be used for production deployments. Development usually happens on short lived branches which are merged into main once they have been reviewed and approved. If you'd like to contribute to the DocNow project please fork this repository, create a branch for your feature or bug fix, and then send a pull request to have it reviewed and merged.

To set up DocNow locally on your workstation you will need to install Git and Docker. Once you've got them installed open a terminal window and follow these instructions:

  1. git clone https://github.com/docnow/docnow
  2. cd docnow
  3. cp .env.dev .env
  4. docker-compose build --no-cache
  5. docker-compose up
  6. make some ☕️
  7. open http://localhost:3000

If you run into an error above and want to clean out all your docker containers and images you can run this:

  1. sh clean-up.sh

Testing

The test suite runs automatically via a GitHub Action. If you want to run the tests yourself you will need to:

cp .env.test-sample .env.test

Replace the CHANGE_ME values in .env.test with your respective Twitter API credentials. Then run the tests.

npm run test

Do not commit .env.test to git since it contains your Twitter API keys!

diffengine's People

Contributors

andresfib, atomotic, edsu, elks, hugovk, ibnesayeed, kant, nahuelhds, ppival, ruebot, ryanfb


diffengine's Issues

pip 19 breaks install

Because of pypa/pip#4187, --process-dependency-links is not supported anymore and installation fails. Use pip3 install --upgrade pip==18.0.0 to downgrade to a pip version that still supports it.

Remove Entry.blogged

It looks like Diff.tweeted is no longer set. It should be set when a tweet for a diff has been sent. Also I think we can remove Diff.blogged now.

unable to tweet image

I think images may need to be resized, sometimes they fail like this:

2017-01-18 07:39:22,299 - root - ERROR - unable to tweet: [{'message': 'Image dimensions must be >= 4x4 and <= 8192x8192', 'code': 324}]
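Pending a proper fix, the dimension check can be sketched as pure math (an illustration, not diffengine code; actual resizing would go through an imaging library such as Pillow):

```python
def clamp_scale(width, height, min_side=4, max_side=8192):
    """Return a scale factor that brings an image within Twitter's
    4x4..8192x8192 limits (1.0 if it already fits). Degenerate aspect
    ratios that violate both bounds at once cannot be satisfied."""
    scale = 1.0
    # Shrink if either side is too large.
    if max(width, height) > max_side:
        scale = max_side / max(width, height)
    # Grow if either side is too small (e.g. tracking pixels).
    if min(width, height) < min_side:
        scale = max(scale, min_side / min(width, height))
    return scale
```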

Photo changes

Franc suggests that it would be useful to track changes in images as well. At the moment only textual changes are noted, but it should be possible to notice a substantial change in the images used in the body of an article.

diffengine installation fails on Ubuntu (htmldiff version)

pip3 install --process-dependency-links diffengine fails on Ubuntu 17.04 with the following message:

Collecting htmldiff==0.2 (from diffengine)
  Could not find a version that satisfies the requirement htmldiff==0.2 (from diffengine) (from versions: 0.1)
No matching distribution found for htmldiff==0.2 (from diffengine)

Track changes only in headlines

Hi,

at first thanks for your efforts and the project!

I got the script running and tweeting but would like to tweet only changes in headlines. Is there a setting to compare only changes in the headline?

Thanks and kind regards

Tibor

Min / max change feature

I think it would be good to track and tweet articles only when there is a minimum or maximum amount of change:

  • Small changes are often edits for typos or style guide conformance, or
  • very large changes are often made because the article was fetched and auto-published from a news agency and edited afterwards.

Also a “heatmap” feature would be interesting: not only looking at the amount of change in the whole article, but also at where the changes were made (paragraph, heading).

This way it would also be easier to generate smaller captures like mentioned in #34 because we would know which parts of the article have the most relevant changes.

Empty URLs/documents in feed seem to crash diffengine

Last line in log: 2017-10-18 13:18:57,133 - root - INFO - checking https://gateway.itstgate.com/WebLink2/WebLink.aspx

Traceback:

Traceback (most recent call last):
  File "/usr/local/bin/diffengine", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/site-packages/diffengine/__init__.py", line 483, in main
    version = entry.get_latest()
  File "/usr/local/lib/python3.6/site-packages/diffengine/__init__.py", line 156, in get_latest
    title = doc.title()
  File "/usr/local/lib/python3.6/site-packages/readability/readability.py", line 137, in title
    return get_title(self._html(True))
  File "/usr/local/lib/python3.6/site-packages/readability/readability.py", line 108, in _html
    self.html = self._parse(self.input)
  File "/usr/local/lib/python3.6/site-packages/readability/readability.py", line 117, in _parse
    doc, self.encoding = build_doc(input)
  File "/usr/local/lib/python3.6/site-packages/readability/htmls.py", line 21, in build_doc
    doc = lxml.html.document_fromstring(decoded_page.encode('utf-8', 'replace'), parser=utf8_parser)
  File "/usr/local/lib/python3.6/site-packages/lxml/html/__init__.py", line 765, in document_fromstring
    "Document is empty")
lxml.etree.ParserError: Document is empty

strange diffs getting tweeted

@ruebot noticed a series of odd updates like this which led to the discovery that readability returns very little content sometimes. For example:

import requests
import readability

html = requests.get("https://www.thestar.com/news/world/2017/01/11/uk-teen-charged-with-murder-of-7-year-old-girl.html").content
doc = readability.Document(html)

print(doc.summary())

returns (at the moment):

<html><body><div><div class="article__subheadline" data-reactid="93"><p data-reactid="94">The 15-year-old was remanded into secure accommodation on Wednesday and was also charged with possession of an offensive weapon. </p></div></div></body></html>

Perhaps there should be a configurable threshold below which the content will be ignored or at least not tweeted? Could readability be tuned in this case to return content that is more appropriate like the text of the AP press release?
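A guard like the one proposed could look like this (a sketch; `min_chars` and both function names are hypothetical, not existing diffengine settings):

```python
def substantial(text, min_chars=200):
    """Heuristic guard: treat an extracted article as usable only if
    readability returned a reasonable amount of text. min_chars is an
    illustrative threshold, not an existing diffengine option."""
    return text is not None and len(text.strip()) >= min_chars

def should_tweet(old_text, new_text, min_chars=200):
    # Skip diffs where either side looks like a readability misfire.
    return substantial(old_text, min_chars) and substantial(new_text, min_chars)
```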

Python code formatter integration

Hi @edsu, I'm working on the first PR, related to the envyaml package integration, as the first step toward the full set of features I've been adding in my own fork.

I know you created the thread branch, but I'm sure this way would be much easier, as we can discuss every feature separately and avoid conflicts, since it's all one big file.

Anyway, one of the first things I'm thinking about is this: what do you think about using a common Python formatter as a requirement for collaborators (myself in this case)?

This way every code addition complies with the same coding style. This can be done automatically by installing the code formatter on each collaborator's own computer. E.g.: I've installed Black, which formats the code when I save the file, so I don't have to worry about that kind of stuff.

The thing here is that it modifies the entire file the first time it runs. So this would be the very first PR to integrate into the master branch, if you agree.

UnicodeEncodeError being raised by calls to logging.info

UnicodeEncodeError: 'ascii' codec can't encode character '\u279c' in position 280: ordinal not in range(128)
Call stack:
  File "/usr/local/bin/diffengine", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/site-packages/diffengine/__init__.py", line 464, in main
    tweet_diff(version.diff, f['twitter'])
  File "/usr/local/lib/python3.6/site-packages/diffengine/__init__.py", line 420, in tweet_diff
    logging.info("tweeted %s", status)
Message: 'tweeted %s'
Arguments: ('Trump wants good relationship with Russia, May says sanctions should stay | Reuters https://wayback.archive.org/web/20170127111722/http://www.reuters.com/article/us-usa-trump-britain-idUSKBN15B104?feedType=RSS&feedName=politicsNews \u279c https://wayback.archive.org/web/20170127193013/http://www.reuters.com/article/us-usa-trump-britain-idUSKBN15B104?feedType=RSS&feedName=politicsNews',)

Should we explicitly call status.encode('utf-8') before logging? http://stackoverflow.com/questions/9942594/unicodeencodeerror-ascii-codec-cant-encode-character-u-xa0-in-position-20
I've also now set LC_ALL='en_US.utf8' in my crontab as suggested by another answer there to see if that fixes it as well.
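An alternative to encoding the status by hand is to give the log file an explicit UTF-8 encoding, which sidesteps the locale entirely (a sketch; the file path is illustrative and diffengine's actual logging setup may differ):

```python
import logging

# Configure a file handler with an explicit UTF-8 encoding so characters
# like the "\u279c" arrow don't depend on the locale (LC_ALL).
handler = logging.FileHandler("/tmp/diffengine-demo.log", encoding="utf-8")
handler.setFormatter(
    logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s"))
log = logging.getLogger("diffengine-demo")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("tweeted %s", "old-url \u279c new-url")  # no UnicodeEncodeError
```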

Database migrations

The next version of diffengine will require some database modifications for existing installs. peewee supports migrations. I think we have an example of one in `__init__.py`, but maybe we should pull these out into a separate module?

I can test the migrations on a v0.2.7 database that I have.

staleness heuristic / performance

The longer diffengine runs the more URLs it needs to check, and the longer a full pass through them takes. The assumption I've had so far, based on watching news websites, is that the older a page gets the less likely it is to change its content.

There is a method on the Entry object that calculates whether an entry is stale or not. It uses what I call a staleness ratio or s. If s is greater than a given value (currently .2) it is deemed stale. I've thought about making this magic number configurable per feed. Here's how it works:

hotness = current time - entry creation time
staleness = current time - time last checked
s = staleness / hotness

stale? = s >= .2

So if an entry is 3 hours (10800 seconds) old and it was last checked 20 minutes (1200 seconds) ago, the calculation is:

1200 / 10800 = .11 (not stale)

Or if the entry is 3 hours old and it was last checked 1 hour ago:

3600 / 10800 = .33 (stale)
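The calculation above can be sketched as follows (times in seconds; the function names are illustrative):

```python
def staleness_ratio(age_seconds, since_last_check_seconds):
    """s = (time since last check) / (age of the entry)."""
    return since_last_check_seconds / age_seconds

def is_stale(age_seconds, since_last_check_seconds, threshold=0.2):
    # An entry is due for a re-check once s crosses the threshold
    # (currently .2 in diffengine).
    return staleness_ratio(age_seconds, since_last_check_seconds) >= threshold
```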

The idea is that things get checked less often as they get older, but the problem that I haven't really verified yet is that I think it can still result in thresholds over which lots of checks need to happen. So periodically diffengine will spend a lot of time checking URLs as they cross over that threshold.

I was wondering if it might make sense to take a more probabilistic approach where URLs are checked more often when they are new and less often as they get older using some sort of probability sampling. For example when an entry is new it is checked 80% of the time, and as it gets to be old, say a month old, it is checked only 50% of the time. So a gradient of some kind like that? Or maybe it should also factor in the total number of entries that need to be checked, and the desired time it should take for a complete run?
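The probabilistic idea could be sketched like this, linearly interpolating between the 80% and 50% figures mentioned above (all numbers and names are illustrative):

```python
import random

def check_probability(age_days, p_new=0.8, p_old=0.5, horizon_days=30):
    """Linearly interpolate from p_new (a brand-new entry) down to p_old
    (an entry horizon_days old or older)."""
    frac = min(age_days / horizon_days, 1.0)
    return p_new + (p_old - p_new) * frac

def should_check(age_days, rng=random.random):
    # rng is injectable so the sampling can be tested deterministically.
    return rng() < check_probability(age_days)
```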

It takes about a second to check an entry, and after running against the Washington Post, the Guardian, and Breitbart for a week I have 1531 URLs to check. If there were no backing off at all this would be 25 minutes of runtime, and it would just get worse. This would mean that new entries would not be monitored closely enough. Also it would unduly burden the webservers being checked with tons of requests.

I suspect this problem may have been solved elsewhere before, so if you have ideas or pointers they would be appreciated!

Fair use screenshots

At the moment the project always captures the whole article, even if only a few words in a specific section have changed.

I guess it would be great to capture only the paragraph that has changed. This way we would not copy the whole article and it would be easier to read on Twitter as well.

404s

I'm just now seeing that if something that was previously available becomes 404 Not Found diffengine logs it, but doesn't tweet it. Ideally I think it should tweet it right? Or at least it should be configurable to tweet it. I noticed this because I've been watching the White House website, and a large number of posts from 2017 went missing during the Drupal -> WordPress switch.

https://inkdroid.org/2017/12/20/whitehouse-redesign/

unexpected archive.org response: name 'url' is not defined

Looking at diffengine.log I noticed the following error:

2017-01-19 15:25:06,606 - root - ERROR - unexpected archive.org response for https://web.archive.org/save/http://www.presseportal.de/pm/58964/3539208: name 'url' is not defined

Opening the same URL in a browser just loads archive.org fine and it returns the saved URL.

Not sure whether this is just a temporary error due to connection speed or similar issues, or a bug in my PhantomJS install.

Document how to work with multiple feeds

I have accounts set up, and a couple of them have many URLs for RSS feeds. I'm not sure if I have everything set up right. So, we should probably document the best way to set up multiple accounts, and an account that has multiple RSS feeds in it.

Happy to do this work.

I have each account set up in its own home directory:

  • Toronto Sun /home/nruest/.torontosun
  • Toronto Star /home/nruest/.diffengine
  • Globe & Mail /home/nruest/.globemail
  • Canadaland /home/nruest/.canadaland
  • CBC /home/nruest/.cbc

Toronto Sun has multiple RSS feeds, and I have config.yaml setup like so:

- name: Top Home stories
  twitter:
    access_token: SOMETHING
    access_token_secret: SOMETHING
  url: http://www.torontosun.com/photos/rss.xml
- name: Top Home stories
  twitter:
    access_token: SOMETHING
    access_token_secret: SOMETHING
  url: http://www.torontosun.com/videos/rss.xml
- name: Top Home stories
  twitter:
    access_token: SOMETHING
    access_token_secret: SOMETHING
  url: http://www.torontosun.com/sunshine-girl/rss.xml
...
phantomjs: phantomjs
twitter:
  consumer_key: SOMETHING
  consumer_secret: SOMETHING

smarter stale measure

The current test for staleness doesn't seem to be smart enough. After running diffengine for over a month it is taking it 8 hours to check Breitbart, The Guardian and The Washington Post. I think it needs to be smarter about what to do with the backlog of sites. Perhaps randomly sampling from them?

lockfile

It would be useful for diffengine to establish a lock before running in order to prevent a long running cron job from interfering with a newly started one.
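On Unix this can be done with `flock` (a sketch; the lock path is illustrative):

```python
import fcntl

def acquire_lock(path="/tmp/diffengine.lock"):
    """Take an exclusive, non-blocking lock. Returns the open lock file
    on success, or None if another diffengine run already holds it."""
    f = open(path, "w")
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return f
    except BlockingIOError:
        f.close()
        return None
```

A cron-launched run would call this at startup and exit immediately if it gets None back; the lock is released automatically when the process exits.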

Style the output

Hi,

could you tell me how I can modify / customize the tweeted images?

I first thought I would be able to do so with the ./diffengine/diff.html, but it seems like the output hasn’t changed.

Thanks in advance!

unable to tweet: Read-only application cannot POST

Looking at the Twitter account's "Apps" settings this is actually correct: it reads "Permissions: read-only", however I just provided the tokens and secret as requested and opened the Twitter authorize URL that returned the pin.

I haven't played with Twitter OAuth for ages and I can't seem to find a way to change the permissions afterwards using the Twitter UI, so what would be the proper way to grant read and write to my app?

EDIT: I simply clicked "Regenerate My Access Token and Token Secret" and Twitter then magically made it a read-and-write token instead of a read-only one.

Occasional WebDriverExceptions raised from _generate_diff_images

Traceback (most recent call last):
  File "/usr/local/bin/diffengine", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/site-packages/diffengine/__init__.py", line 460, in main
    version = entry.get_latest()
  File "/usr/local/lib/python3.6/site-packages/diffengine/__init__.py", line 175, in get_latest
    diff.generate()
  File "/usr/local/lib/python3.6/site-packages/diffengine/__init__.py", line 271, in generate
    self._generate_diff_images()
  File "/usr/local/lib/python3.6/site-packages/diffengine/__init__.py", line 297, in _generate_diff_images
    self.browser = webdriver.PhantomJS(phantomjs)
  File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/phantomjs/webdriver.py", line 52, in __init__
    self.service.start()
  File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/common/service.py", line 102, in start
    raise WebDriverException("Can not connect to the Service %s" % self.path)
selenium.common.exceptions.WebDriverException: Message: Can not connect to the Service phantomjs

https://github.com/DocNow/diffengine/blob/master/diffengine/__init__.py#L297

Maybe we should catch this and retry after waiting? Not sure exactly what's causing it (and if retrying would help or not).

User-configurable deletions for content normalization

It might be nice for users to be able to put an array of strings or regexes in config.yaml that can be used to normalize content before diffing.

For example, I could put 'Scroll down for video' in for deletion for dailymail_diff, or with regexes globemail_diff might be able to remove stock price changes.

Related to #10, there might be a tradeoff in where to put such an array in the YAML hierarchy. Putting it as a top-level key would mean less repetition for people using one config per news source; putting it as a key under each feed would allow people using one config for multiple news sources to have different deletions for each.
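A normalization pass over such an array could be sketched like this (illustrative; `normalize` is not an existing diffengine function):

```python
import re

def normalize(content, deletions):
    """Strip boilerplate before diffing. Each deletion is either a plain
    string (removed literally) or a compiled regex (removed by pattern)."""
    for d in deletions:
        if isinstance(d, re.Pattern):
            content = d.sub("", content)
        else:
            content = content.replace(d, "")
    return content
```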

See also: #14

Unable to complete setup

I finished installing using AWS cloud9 and got this:

Fetching initial set of entries.
Traceback (most recent call last):
  File "/usr/local/bin/diffengine", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/diffengine/__init__.py", line 473, in main
    init(home)
  File "/usr/local/lib/python3.6/dist-packages/diffengine/__init__.py", line 463, in init
    setup_browser()
  File "/usr/local/lib/python3.6/dist-packages/diffengine/__init__.py", line 421, in setup_browser
    browser = webdriver.Firefox(options=opts)
  File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/firefox/webdriver.py", line 174, in __init__
    keep_alive=True)
  File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/webdriver.py", line 157, in __init__
    self.start_session(capabilities, browser_profile)
  File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/webdriver.py", line 252, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.SessionNotCreatedException: Message: Unable to find a matching set of capabilities

I downgraded selenium (I saw this might help and I'm really new to this stuff). Then I got this:

Traceback (most recent call last):
  File "/usr/local/bin/diffengine", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/diffengine/__init__.py", line 473, in main
    init(home)
  File "/usr/local/lib/python3.6/dist-packages/diffengine/__init__.py", line 463, in init
    setup_browser()
  File "/usr/local/lib/python3.6/dist-packages/diffengine/__init__.py", line 421, in setup_browser
    browser = webdriver.Firefox(options=opts)
TypeError: __init__() got an unexpected keyword argument 'options'

Please could someone help? Thanks!!

$HOME/.diffengine default profile location

Rather than having the user always select the location for their profile directory perhaps $HOME/.diffengine could be the default, and it could be overridden with a --profile command line option?

Dealing with url changes

I'm setting up a tracker for http://visir.is (an Icelandic news site).

I've noticed that when changes are done on headlines, their system makes new urls.

The urls are made up of these elements:

http://visir.is/g/<ARTICLE_ID>/<HEADLINE>

To view the article the <HEADLINE> part is redundant.

To get around it I made some changes that allow a regex to be applied to URLs from the RSS feed. See here:

pallih@0519f31

This makes the url checked: http://visir.is/g/<ARTICLE_ID> so subsequent changes to the headline are picked up, and not stored as a new article.

I'm not sure introducing a config variable is appropriate for the project, but at least my solution is there, if anyone needs it.

Non-canonical URLs pointing at the same article from multiple feeds result in duplicates

For example: https://twitter.com/search?f=tweets&q=fox_diff%20Former%20President%20Bush%20intensive%20care&src=typd

All the Archive URLs are the same. However, in the logs I see both:

checking http://feedproxy.google.com/~r/foxnews/politics/~3/_eSUNInb95I/bush-41-cant-make-inauguration-tells-trump-sitting-outside-could-put-me-six-feet-under.html
checking http://feedproxy.google.com/~r/foxnews/most-popular/~3/_eSUNInb95I/bush-41-cant-make-inauguration-tells-trump-sitting-outside-could-put-me-six-feet-under.html

Which I'm guessing is how these duplicates are getting tweeted.

Maybe we need to do some URL de-referencing/canonicalization before storing/checking URLs from feeds? If I curl -I those feedproxy URLs I get a 301 response with a semi-canonical URL in the location (would need to have parameters stripped).
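A first canonicalization step, stripping fragments and tracking parameters, could be sketched like this (illustrative; actually resolving the feedproxy 301s would still require an HTTP request):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

TRACKING_PREFIXES = ("utm_",)  # illustrative; extend as needed

def canonicalize(url):
    """Drop fragments and common tracking parameters so the same article
    reached from different feeds maps to one stored URL."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if not k.startswith(TRACKING_PREFIXES)]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(query), ""))
```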

Make `time.sleep` configurable and default to zero

When asked about its use in #74, @edsu said:

I put it there to prevent getting blocked by websites that saw repeated and rapid crawling as a threat. But honestly I had kind of forgotten about it. It might be nice to retain it as a configuration option, and have it default to zero?

Info

Hi.

I am sorry to contact you here but I am new and I did not find a better way to reach you.

I am looking for a tool that allows me to automatically monitor webpages. Examples of use: tracking the price of an item and/or checking item availability.

Features that I would like to have:

  • Option to track just a portion of a page, instead of the whole page
  • Notifications directly on my phone
  • Option to customize the frequency of the checks

Does your tool allow this?

My intention would be to use this on a Raspberry Pi, as it is a cheap solution with low power consumption. Is it, in your opinion, suitable for this?

Thanks

Fetching data with NewsAPI

I think the NewsAPI would be a nice addition to RSS feeds.

There are some pros:

  • reliable data and a proven option for fetching media articles
  • 70 sources available, with access to titles, authors, article images and excerpts

Cons:

  • only excerpts available, not the full article
  • dependency on a 3rd-party developer
  • not usable for blogs or smaller news websites

However, I think NewsAPI would make diffengine more reliable and give us further options to style the output (with image layouts and author names).

The developer is also thinking about adding an RSS / Atom feature. Maybe a collaboration would be great for both projects?

diff.html

... is not getting installed by pip install diffengine. It ought to be possible to update setup.py to handle this.

X-Archive-Orig-Date

Internet Archive's Save Page Now functionality seems to have some logic to return a previous snapshot if it has one that is 5 minutes or so old. The time of the snapshot is made available in the X-Archive-Orig-Date header. It ought to be possible to parse this and compare it against the current time to see if the snapshot was current.

I'm not quite sure what to do if it isn't though...I guess it could at least be logged? Alternatively it could decide not to tweet it so that this doesn't happen. Notice how the new version doesn't have the new change?
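Parsing the header and comparing it against the current time could be sketched like this (illustrative names, assuming the header carries an HTTP-style date):

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

def snapshot_is_fresh(headers, max_age=timedelta(minutes=5), now=None):
    """Compare X-Archive-Orig-Date against the current time to detect
    when SavePageNow handed back an older snapshot."""
    value = headers.get("X-Archive-Orig-Date")
    if value is None:
        return True  # no header: assume a fresh capture
    snapshot_time = parsedate_to_datetime(value)
    now = now or datetime.now(timezone.utc)
    return now - snapshot_time <= max_age
```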

Error message: WARNING - not tweeting without archive urls

I tried to set up diffengine yesterday and after a few teething issues, including installation failing due to the "--process-dependency-links" error, it seemed to run fine when I rolled back to Pip 18.

However, it does not tweet.

I do have 3 directories in my diffs folder but I noticed that differences were not tweeted. Looking through the diffengine log, there are three error messages that presumably relate to this: "WARNING - not tweeting without archive urls"

Not sure whether other errors in the diffengine.log are related:

"ERROR - unable to get archive id from None"
"ERROR - unexpected archive.org response for https://web.archive.org/save/https://www.blahblah.com"

Any ideas?

"Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead"

I have hundreds of mails on my server with this warning:

/usr/local/lib/python3.6/dist-packages/selenium/webdriver/phantomjs/webdriver.py:49: UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
  warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless '

(The PhantomJS repo is being archived: ariya/phantomjs#15344 There may be a fork soon: ariya/phantomjs#15345.)

Ideally diffengine should switch from PhantomJS to headless Chrome (eg.) or Firefox (or the fork), but it'd be good to silence this specific warning in the meantime.

handle too many redirects

I caught this via an email from cron. It looks like some better handling of this type of error is needed?

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/diffengine/__init__.py", line 233, in archive
    resp = requests.get(save_url, headers={"User-Agent": UA})
  File "/usr/local/lib/python3.5/dist-packages/requests/api.py", line 70, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/requests/api.py", line 56, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/requests/sessions.py", line 488, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.5/dist-packages/requests/sessions.py", line 630, in send
    history = [resp for resp in gen] if allow_redirects else []
  File "/usr/local/lib/python3.5/dist-packages/requests/sessions.py", line 630, in <listcomp>
    history = [resp for resp in gen] if allow_redirects else []
  File "/usr/local/lib/python3.5/dist-packages/requests/sessions.py", line 111, in resolve_redirects
    raise TooManyRedirects('Exceeded %s redirects.' % self.max_redirects, response=resp)
requests.exceptions.TooManyRedirects: Exceeded 30 redirects.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/diffengine", line 9, in <module>
    load_entry_point('diffengine==0.0.27', 'console_scripts', 'diffengine')()
  File "/usr/local/lib/python3.5/dist-packages/diffengine/__init__.py", line 459, in main
    version = entry.get_latest()
  File "/usr/local/lib/python3.5/dist-packages/diffengine/__init__.py", line 173, in get_latest
    new.archive()
  File "/usr/local/lib/python3.5/dist-packages/diffengine/__init__.py", line 242, in archive
    save_url, resp.headers, e
UnboundLocalError: local variable 'resp' referenced before assignment
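The UnboundLocalError happens because `resp` is only assigned if the request succeeds. Wrapping the request so that every requests failure is handled in one place would avoid it (a sketch, not the actual diffengine code):

```python
import logging
import requests

def save_page(save_url, ua="diffengine"):
    """Catch any requests-level failure (redirect loops, timeouts,
    connection errors) so later code never touches an unbound `resp`.
    Returns the response, or None on failure."""
    try:
        return requests.get(save_url, headers={"User-Agent": ua}, timeout=30)
    except requests.exceptions.RequestException as e:
        # TooManyRedirects is a RequestException, so it lands here too.
        logging.error("unable to archive %s: %s", save_url, e)
        return None
```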

diffengine sometimes hangs without explanation

Noticed on 2017-01-27 for cnn_diff:

2017-01-25 10:16:43,128 - root - INFO - shutting down: new=13 checked=495 skipped=1691 elapsed=0:16:41.543544
2017-01-25 10:30:02,301 - root - INFO - starting up with home=/Users/ryan/source/diffengine/cnn_diff
2017-01-25 10:30:02,317 - root - INFO - fetching feed: http://rss.cnn.com/rss/cnn_topstories.rss
2017-01-25 10:30:03,048 - root - INFO - found new entry: http://rss.cnn.com/~r/rss/cnn_topstories/~3/KiO2MctO3eI/index.html
2017-01-25 10:30:03,240 - root - INFO - found new entry: http://rss.cnn.com/~r/rss/cnn_topstories/~3/iyOB4KTWINU/index.html
2017-01-25 10:30:03,413 - root - INFO - found new entry: http://rss.cnn.com/~r/rss/cnn_topstories/~3/q2hFlt0ZpK0/index.html

and bbc_diff:

2017-01-26 05:52:10,719 - root - WARNING - Got 404 when fetching http://www.bbc.co.uk/news/world-us-canada-38702983
2017-01-26 05:53:29,932 - root - INFO - fetching feed: http://feeds.bbci.co.uk/news/rss.xml?edition=int
2017-01-26 05:53:38,509 - root - INFO - fetching feed: http://feeds.bbci.co.uk/news/system/latest_published_content/rss.xml
2017-01-26 05:53:38,673 - root - INFO - shutting down: new=6 checked=226 skipped=1851 elapsed=0:08:36.948907
2017-01-26 06:00:01,536 - root - INFO - starting up with home=/Users/ryan/source/diffengine/bbc_diff
2017-01-26 06:00:01,545 - root - INFO - fetching feed: http://feeds.bbci.co.uk/news/rss.xml
2017-01-26 06:00:01,674 - root - INFO - found new entry: http://www.bbc.co.uk/news/science-environment-38755229
2017-01-26 06:00:01,717 - root - INFO - found new entry: http://www.bbc.co.uk/news/business-38748296
2017-01-26 06:00:01,765 - root - INFO - found new entry: http://www.bbc.co.uk/news/magazine-38722929
2017-01-26 06:01:41,721 - root - WARNING - Got 404 when fetching http://www.bbc.co.uk/news/world-us-canada-38702983

Both had long-running processes but didn't appear to be logging or doing anything new.

I tried using dtruss to trace syscalls in the running processes before killing them, but no syscalls were being made (using dtruss on a successfully-running diffengine instance produces a lot of output).

configure text element

It might be useful to be able to configure a feed with a CSS selector specifying which element to extract text from with readability. For example, the Washington Post currently uses

<article itemprop="articleBody">...</article>

to enclose the text of the article, using https://schema.org/NewsArticle microdata. Perhaps the config could look like:

- name: Washington Post - Politics
  url: http://feeds.washingtonpost.com/rss/politics
  css_selector: article[itemprop="articleBody"]
  twitter:
   access_token: foo
   access_token_secret: bar

I guess the downside to this is that sites change, so unless you are watching it you may not notice when their markup changes, and your diffengine instance would quietly stop working.

SavePageNow 503 Service Unavailable

I've noticed that the SavePageNow service gives out occasional 503 Service Unavailable errors. diffengine should guard against that, retry, and log the failure if it persists.
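A generic retry wrapper could be sketched like this (illustrative; not an existing diffengine helper):

```python
import time

def with_retries(fn, attempts=3, backoff=2.0, sleep=time.sleep):
    """Retry a SavePageNow-style call with exponential backoff.
    The last failure propagates to the caller, where it can be logged."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            sleep(backoff * (2 ** i))
```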

tweet throttling

It's probably important to put in some kind of tweet throttling so that global changes in content on a website don't trigger a rash of tweets. I think it's probably ok to generate diffs for this content, but excessive tweeting can get your account blocked.
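A simple per-feed throttle could be sketched as a sliding window (all numbers and names are illustrative):

```python
import time

class TweetThrottle:
    """Cap tweets per window so a site-wide template change doesn't
    produce a burst that gets the account blocked. Diffs are still
    generated; only tweeting is suppressed."""

    def __init__(self, max_tweets=10, window_seconds=3600, clock=time.time):
        self.max_tweets = max_tweets
        self.window = window_seconds
        self.clock = clock  # injectable for testing
        self.sent = []      # timestamps of recent tweets

    def allow(self):
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        self.sent = [t for t in self.sent if now - t < self.window]
        if len(self.sent) < self.max_tweets:
            self.sent.append(now)
            return True
        return False
```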

use web.archive.org directly

It's probably a good idea to use web.archive.org directly instead of pragma as a middle man for adding a URL to Internet Archive? The relevant code can be found here.
