Reuters article URLs not unique

Similar to #3, this time with Reuters. Here the duplicates are not caused by a URL query string: the only unique part of the duplicate URLs is the last portion after a dash.

Snopes is broken

Getting the following error from the snopes parser:
KeyError: 'data-lazy-srcset'
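
One way to avoid this kind of KeyError is to try a list of candidate attributes instead of indexing a single one. This is a minimal sketch, assuming the parser reads image attributes from a dict-like mapping (BeautifulSoup tags expose one as tag.attrs). The helper name first_attr and every attribute name other than data-lazy-srcset are assumptions, not taken from the parser.

```python
from typing import Mapping, Optional, Sequence

# Candidate attributes, tried in order. Only 'data-lazy-srcset' comes from the
# error above; the others are assumed fallbacks.
IMAGE_ATTRS = ("data-lazy-srcset", "data-lazy-src", "srcset", "src")


def first_attr(attrs: Mapping[str, str], names: Sequence[str] = IMAGE_ATTRS) -> Optional[str]:
    """Return the first non-empty attribute value, or None if none is present."""
    for name in names:
        value = attrs.get(name)
        if value:
            return value
    return None
```

A caller would then skip the thumbnail when the result is None rather than failing the whole update.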

Snopes parser is failing

Looks like they changed their markup a bit. Need to make the parser more generic and fault-tolerant.

Traceback (most recent call last):
  File "firehose\sources\__init__.py", line 14, in run
    await self.update(db, queue)
  File "firehose\sources\snopes.py", line 22, in update
    articles = div.find_all('article', {'class': 'media-wrapper'})
AttributeError: 'NoneType' object has no attribute 'find_all'

C-SPAN thumbnails

These are broken. Their server is returning a 403. The URLs contain base64-encoded JSON so maybe it's possible to parse those and retrieve a working image link?
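
A rough sketch of the decoding step, assuming the encoded JSON is the last path segment of the thumbnail URL and that padding may have been stripped. The function name decode_url_payload is hypothetical, and what the decoded JSON contains is not known from this issue.

```python
import base64
import binascii
import json
from typing import Any, Optional
from urllib.parse import urlsplit


def decode_url_payload(url: str) -> Optional[Any]:
    """Try to decode base64-encoded JSON from the last path segment of a URL."""
    segment = urlsplit(url).path.rstrip("/").rsplit("/", 1)[-1]
    # Restore any padding that was stripped from the encoded segment.
    padded = segment + "=" * (-len(segment) % 4)
    try:
        # altchars lets the same call accept the URL-safe alphabet ('-' and '_').
        raw = base64.b64decode(padded, altchars=b"-_", validate=True)
        return json.loads(raw)
    except (binascii.Error, ValueError):
        # Not base64, or not JSON once decoded.
        return None
```

If the payload turns out to name a storage location for the image, that could be used to build a working link, but that part would need checking against real URLs.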

URLs not unique

Relying on URLs as a unique ID for articles from a single source isn't enough. Often they'll publish the same article in different categories with different URLs, which causes duplicates in the stream.

Add a new abstract method that allows RSS subclasses to do their own transformations/filters on the RSS items before reaching the cache/push stage. This will allow for other touch-ups too (like filtering certain content or changing weird unicode characters, etc).
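
The hook could look something like the sketch below. The RSSSource class here is a simplified stand-in for the real base class, and the names transform, prepare, and ExampleSource are assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from typing import Iterable, Iterator, Optional


class RSSSource(ABC):
    """Simplified stand-in for the real RSS base class (assumed shape)."""

    @abstractmethod
    def transform(self, item: dict) -> Optional[dict]:
        """Return the (possibly modified) item, or None to drop it."""

    def prepare(self, items: Iterable[dict]) -> Iterator[dict]:
        # Run every item through the subclass hook before the cache/push stage.
        for item in items:
            result = self.transform(item)
            if result is not None:
                yield result


class ExampleSource(RSSSource):
    """Hypothetical subclass that tidies titles and drops untitled items."""

    def transform(self, item: dict) -> Optional[dict]:
        # Replace non-breaking spaces, then trim surrounding whitespace.
        title = item.get("title", "").replace("\u00a0", " ").strip()
        if not title:
            return None
        return {**item, "title": title}
```

Returning None to drop an item keeps filtering and rewriting in the same method, so each source only has one place to customize.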

This issue was merged from the following specific issues and the above solution would resolve both:

The Daily Beast

Looks like they add a ?source= query string to their URLs, which creates duplicates when the same article is published in two feeds (for instance the main feed and the politics feed).

The easiest way to fix it would be to just strip everything after the ? since it's not required for viewing the article.
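
A minimal sketch of that normalization using the standard library. The function name strip_query is an assumption, and it also drops any fragment since that comes after the ? as well.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url: str) -> str:
    """Drop the query string and fragment so feed-specific variants compare equal."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
```

With this, two hypothetical variants such as .../some-article?source=politics and .../some-article?source=articles reduce to the same key.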

Reuters

In this case it's not because of a URL query string, the only unique part of duplicate URLs is the last portion after a dash.
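
One possible approach, reading the description as meaning the segment after the last dash identifies the article: use that segment as the deduplication key instead of the full URL. The function name trailing_id is an assumption, and so is the idea that the trailing segment is stable across categories. If it is actually the part that differs between duplicates, the key would be the text before the dash instead.

```python
from urllib.parse import urlsplit


def trailing_id(url: str) -> str:
    """Return the text after the last dash in the URL path (the whole path if there is no dash)."""
    path = urlsplit(url).path.rstrip("/")
    return path.rsplit("-", 1)[-1]
```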

Replace Websocket with SSE

Updates are one-directional, so it's a waste to open a full WebSocket connection for each client. Just use Server-Sent Events, which will also simplify the code because clients reconnect automatically.
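
For reference, an SSE message is just text: optional id and event fields, a data field, and a blank line to end the message. Below is a small sketch of encoding one article as an SSE message. The function name format_sse and the payload shape are assumptions.

```python
import json
from typing import Any, Optional


def format_sse(data: Any, event: Optional[str] = None, event_id: Optional[str] = None) -> bytes:
    """Encode one Server-Sent Events message carrying a JSON payload."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    if event is not None:
        lines.append(f"event: {event}")
    # json.dumps emits a single line by default, so one data field is enough.
    lines.append(f"data: {json.dumps(data)}")
    # A blank line terminates the message.
    return ("\n".join(lines) + "\n\n").encode("utf-8")
```

The response needs a Content-Type of text/event-stream. If messages carry an id, the browser sends the last one it saw in a Last-Event-ID header when it reconnects, which could be used to resume the stream.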

Add published date to DOM

The WebSocket JSON has the date but it's not being rendered anywhere in the DOM for the stream preview.

SSL Verification Errors

Pew Research, Route Fifty, and Slate are all giving SSL verification errors that look like this:
Cannot connect to host slate.com:443 ssl:True [SSLCertVerificationError: (1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:997)')]
However, the certificate info seems normal (this request was made on January 29, 2022, well before the March 11, 2022 expiration):

```
{
	"Connection:": {
		"Protocol version:": "TLSv1.2",
		"Cipher suite:": "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256",
		"Key Exchange Group:": "x25519",
		"Signature Scheme:": "RSA-PSS-SHA256"
	},
	"Host slate.com:": {
		"HTTP Strict Transport Security:": "Disabled",
		"Public Key Pinning:": "Disabled"
	},
	"Certificate:": {
		"Issued To": {
			"Common Name (CN):": "slate.com",
			"Organization (O):": "<Not Available>",
			"Organizational Unit (OU):": "<Not Available>"
		},
		"Issued By": {
			"Common Name (CN):": "R3",
			"Organization (O):": "Let's Encrypt",
			"Organizational Unit (OU):": "<Not Available>"
		},
		"Period of Validity": {
			"Begins On:": "Sat, 11 Dec 2021 17:44:42 GMT",
			"Expires On:": "Fri, 11 Mar 2022 17:44:41 GMT"
		},
		"Fingerprints": {
			"SHA-256 Fingerprint:": "36:74:A8:DB:79:58:52:23:68:E7:06:87:8D:E9:51:97:DE:E6:7F:FC:69:90:5F:09:71:77:EE:3B:B4:7A:AF:8C",
			"SHA1 Fingerprint:": "1D:5A:91:D0:DE:1D:65:68:5E:D5:7B:9D:3B:DF:8F:07:5C:28:9F:7E"
		},
		"Transparency:": "<Not Available>"
	}
}
```
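
As a sanity check on the dates above, the sketch below confirms that the request time falls inside the leaf certificate's validity period. The function name within_validity is an assumption. Note that a "certificate has expired" verification error can refer to any certificate in the chain being verified, not only the leaf, so an expired intermediate or root in the local trust store may be worth ruling out. That is a guess, not something confirmed here.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def within_validity(not_before: str, not_after: str, when: datetime) -> bool:
    """Check that `when` falls inside a certificate's validity period."""
    return parsedate_to_datetime(not_before) <= when <= parsedate_to_datetime(not_after)


# Dates copied from the certificate info above; request date from the report.
request_time = datetime(2022, 1, 29, tzinfo=timezone.utc)
leaf_ok = within_validity(
    "Sat, 11 Dec 2021 17:44:42 GMT",
    "Fri, 11 Mar 2022 17:44:41 GMT",
    request_time,
)
```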

Reuters Wire API

The Reuters RSS feeds are pretty bad (no thumbnails, too much delay, etc) but they have a better JSON API beneath their main content.

URLS:

Parameters:

count: the number of items to return
since: return stories since a wireitem_id
until: return stories up to a wireitem_id

This would be much more efficient than RSS if it uses the "since" parameter to only query for new stories every update.
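
A sketch of how the incremental query could be built. The endpoint is not listed in this issue, so base_url is left as a parameter, and the function name wire_query is an assumption. Only the count and since parameters come from the list above.

```python
from typing import Optional
from urllib.parse import urlencode


def wire_query(base_url: str, count: int, since: Optional[str] = None) -> str:
    """Build a request URL that asks only for stories newer than `since`."""
    params = {"count": count}
    if since is not None:
        # On the first run there is no previous wireitem_id, so it is omitted.
        params["since"] = since
    return f"{base_url}?{urlencode(params)}"
```

Each update would then remember the newest wireitem_id it received and pass it as since on the next call.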

CBS News feed keeps resurrecting old videos

For some reason they randomly throw in videos more than a year or two old that have no apparent current context or reason for being brought back. Any URLs that start with https://www.cbsnews.com/video/ can probably just be filtered out anyway.
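
The filter itself is a one-line prefix check. A minimal sketch, with keep_item as a hypothetical name:

```python
CBS_VIDEO_PREFIX = "https://www.cbsnews.com/video/"


def keep_item(url: str) -> bool:
    """Return False for CBS News video URLs so they can be filtered out."""
    return not url.startswith(CBS_VIDEO_PREFIX)
```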

BeautifulSoup error

Every now and then I'm getting this error from bs4:

Python\Python310\lib\site-packages\bs4\__init__.py:337: MarkupResemblesLocatorWarning:
"..." looks like a directory name, not markup. You may want to open a file found in this directory
and pass the filehandle into Beautiful Soup.

It looks like one of the webpages is returning something that isn't HTML and it's confusing bs4. Not sure which site is doing it yet, but it may just be something that needs to be suppressed.
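
If suppression is the way to go, a warnings filter matching the message text would do it without touching the parsers. This is a sketch using only the standard library, and the function name is an assumption. Filtering on the MarkupResemblesLocatorWarning category named in the output above would be an alternative, if the installed bs4 version exposes it for import.

```python
import warnings


def suppress_locator_warning() -> None:
    """Ignore the 'looks like a directory name' warning raised by bs4."""
    # The pattern is matched against the start of the warning text, so the
    # leading '.*' skips the quoted input that bs4 puts first. (?s) lets it
    # span newlines in case the input contains any.
    warnings.filterwarnings(
        "ignore",
        message=r"(?s).*looks like a directory name, not markup",
    )
```

Calling it once at startup is enough. Until the offending site is identified, it may be more useful to log the source name when the warning fires than to silence it.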

AP News is broken

It looks like they've changed their markup so the parser needs to be updated.

Add storage backend

Currently the firehose only stores keys used for deduplication. It would be more useful to store all of the article metadata to allow for consumers to "catch up" if they go offline for some time.

This doesn't need to be fancy, SQLite should do just fine for the volume of data and simple range queries we're working with.

If it's possible to do the deduplication in the SQLite DB and ditch the log files entirely, all the better.
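
A rough sketch of what this could look like with the standard sqlite3 module. The table layout, column names, and function names are all assumptions. A UNIQUE constraint on the deduplication key lets INSERT OR IGNORE take over the job of the log files, and an index on the publish time covers the "catch up" range query.

```python
import sqlite3
from typing import List, Tuple

SCHEMA = """
CREATE TABLE IF NOT EXISTS articles (
    id        INTEGER PRIMARY KEY AUTOINCREMENT,
    dedup_key TEXT NOT NULL UNIQUE,   -- replaces the dedup log files
    source    TEXT NOT NULL,
    title     TEXT NOT NULL,
    url       TEXT NOT NULL,
    published INTEGER NOT NULL        -- Unix timestamp, for range queries
);
CREATE INDEX IF NOT EXISTS idx_articles_published ON articles (published);
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the schema if it does not exist yet."""
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db


def add_article(db: sqlite3.Connection, dedup_key: str, source: str,
                title: str, url: str, published: int) -> bool:
    """Insert an article; return False if its key was already stored."""
    cur = db.execute(
        "INSERT OR IGNORE INTO articles (dedup_key, source, title, url, published) "
        "VALUES (?, ?, ?, ?, ?)",
        (dedup_key, source, title, url, published),
    )
    db.commit()
    return cur.rowcount == 1


def articles_since(db: sqlite3.Connection, timestamp: int) -> List[Tuple]:
    """Return articles published at or after `timestamp`, oldest first."""
    return db.execute(
        "SELECT source, title, url, published FROM articles "
        "WHERE published >= ? ORDER BY published, id",
        (timestamp,),
    ).fetchall()
```

The return value of add_article tells the caller whether the article is new, so only new ones get pushed to the stream.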

Add categories

Most news sources have multiple category feeds (politics, technology, science, world, etc.). It would be useful to include the category in the output stream.

Maybe .feeds in the RSSSource classes could accept tuples of (category, url). If an entry is not a tuple, the category is None (backwards compatible).
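
A small sketch of the backwards-compatible handling, assuming .feeds is a list whose entries are either a URL string or a (category, url) tuple. The function name normalize_feeds and the Feed alias are hypothetical.

```python
from typing import Iterable, Iterator, Optional, Tuple, Union

Feed = Union[str, Tuple[Optional[str], str]]


def normalize_feeds(feeds: Iterable[Feed]) -> Iterator[Tuple[Optional[str], str]]:
    """Yield (category, url) pairs, using None when no category was given."""
    for feed in feeds:
        if isinstance(feed, tuple):
            category, url = feed
            yield category, url
        else:
            # Plain URL string: existing sources keep working unchanged.
            yield None, feed
```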

wybiral / firehose

firehose's Introduction

Davy Wybiral
