wybiral / firehose
Real-time news aggregate system.
License: MIT License
Similar issue to #3 with Reuters. In this case it's not because of a URL query string; the only unique part of duplicate URLs is the last portion after a dash.
Getting the following error from the snopes parser:
KeyError: 'data-lazy-srcset'
Looks like they changed their markup a bit. Need to make the parser more generic/fault tolerant.
Traceback (most recent call last):
File "firehose\sources\__init__.py", line 14, in run
await self.update(db, queue)
File "firehose\sources\snopes.py", line 22, in update
articles = div.find_all('article', {'class': 'media-wrapper'})
AttributeError: 'NoneType' object has no attribute 'find_all'
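Both failures come from assuming the markup is exactly what it was when the parser was written. A minimal sketch of the more fault-tolerant approach (helper names are mine, not the actual snopes.py code; the bs4 call sites would just use them instead of indexing attributes directly):

```python
def find_image_src(attrs):
    """Pick the first usable image attribute from a tag's attrs dict.

    Snopes apparently moved away from 'data-lazy-srcset', so try several
    candidate attributes instead of indexing a single key (the KeyError).
    """
    for name in ("data-lazy-srcset", "data-lazy-src", "srcset", "src"):
        value = attrs.get(name)
        if value:
            return value
    return None


def find_articles(div):
    """Guard against the container div missing entirely (the NoneType error)."""
    if div is None:
        return []
    return div.find_all("article", {"class": "media-wrapper"})
```

Missing containers and missing attributes then degrade to empty results instead of killing the source's update loop.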
These are broken. Their server is returning a 403. The URLs contain base64-encoded JSON so maybe it's possible to parse those and retrieve a working image link?
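If the encoded segment really is base64 JSON, decoding it is straightforward. This is a guess at the format (base64url with padding stripped, as URLs usually do), not confirmed against the actual URLs:

```python
import base64
import json


def decode_embedded_json(segment):
    """Decode a base64url-encoded JSON blob from a URL path segment.

    Hypothetical: assumes the broken image URLs embed base64url JSON with
    the '=' padding dropped, so re-pad to a multiple of 4 before decoding.
    """
    pad = "=" * (-len(segment) % 4)
    return json.loads(base64.urlsafe_b64decode(segment + pad).decode("utf-8"))
```

If one of the decoded fields turns out to be a direct image URL, we could rewrite the links to that and sidestep the 403.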
Relying on URLs as a unique ID for articles from a single source isn't enough. Often they'll publish the same article in different categories with different URLs which causes duplicates in the stream.
Add a new abstract method that allows RSS subclasses to do their own transformations/filters on the RSS items before reaching the cache/push stage. This will allow for other touch-ups too (like filtering certain content or changing weird unicode characters, etc).
This issue was merged from the following specific issues and the above solution would resolve both:
Looks like they add a ?source= query string to them that creates duplicates when the same article is published in two feeds (for instance the main feed and the politics feed). The easiest way to fix it would be to just strip everything after the ? since it's not required for viewing the article.
In this case it's not because of a URL query string, the only unique part of duplicate URLs is the last portion after a dash.
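A rough sketch of what the hook could look like (the method name `transform` and the subclass are placeholders, not existing code; returning None would drop the item):

```python
from urllib.parse import urlsplit


class RSSSource:
    def transform(self, item):
        """Hook for subclasses: return a cleaned-up item, or None to drop it.

        Runs on each RSS item before the cache/push stage. The base
        implementation is a no-op, so existing sources are unaffected.
        """
        return item


class StripQuerySource(RSSSource):
    """Hypothetical subclass for the ?source= duplicate case."""

    def transform(self, item):
        # strip the query string (and fragment) so both feed copies of an
        # article normalize to the same URL / dedup key
        item["link"] = urlsplit(item["link"])._replace(query="", fragment="").geturl()
        return item
```

The dash-suffix case would get its own subclass override doing the equivalent trim on the last path segment.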
Updates are one-directional so it's a waste to open a full websocket connection for each client. Just use Server-Sent Events, which will also simplify the code by automatically reconnecting.
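The wire format for SSE is trivial to emit. A minimal sketch of the event serializer (the handler/framework wiring is omitted; field layout follows the text/event-stream format):

```python
import json


def format_sse(data, event=None):
    """Serialize one dict as a text/event-stream message.

    The browser side then only needs `new EventSource(url)`, which
    reconnects automatically, unlike a raw WebSocket.
    """
    lines = []
    if event:
        lines.append(f"event: {event}")
    lines.append(f"data: {json.dumps(data)}")
    return "\n".join(lines) + "\n\n"
```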
The WebSocket JSON has the date but it's not being rendered anywhere in the DOM for the stream preview.
Pew Research, Route Fifty, and Slate are all giving SSL verification errors that look like this:
Cannot connect to host slate.com:443 ssl:True [SSLCertVerificationError: (1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:997)')]
However the certificate info seems normal (this request was on January 29, 2022, well before the March 11, 2022 expiration):
"Connection:": {
"Protocol version:": "TLSv1.2",
"Cipher suite:": "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256",
"Key Exchange Group:": "x25519",
"Signature Scheme:": "RSA-PSS-SHA256"
},
"Host slate.com:": {
"HTTP Strict Transport Security:": "Disabled",
"Public Key Pinning:": "Disabled"
},
"Certificate:": {
"Issued To": {
"Common Name (CN):": "slate.com",
"Organization (O):": "<Not Available>",
"Organizational Unit (OU):": "<Not Available>"
},
"Issued By": {
"Common Name (CN):": "R3",
"Organization (O):": "Let's Encrypt",
"Organizational Unit (OU):": "<Not Available>"
},
"Period of Validity": {
"Begins On:": "Sat, 11 Dec 2021 17:44:42 GMT",
"Expires On:": "Fri, 11 Mar 2022 17:44:41 GMT"
},
"Fingerprints": {
"SHA-256 Fingerprint:": "36:74:A8:DB:79:58:52:23:68:E7:06:87:8D:E9:51:97:DE:E6:7F:FC:69:90:5F:09:71:77:EE:3B:B4:7A:AF:8C",
"SHA1 Fingerprint:": "1D:5A:91:D0:DE:1D:65:68:5E:D5:7B:9D:3B:DF:8F:07:5C:28:9F:7E"
},
"Transparency:": "<Not Available>"
}
}
The Reuters RSS feeds are pretty bad (no thumbnails, too much delay, etc) but they have a better JSON API beneath their main content.
This would be much more efficient than RSS if it uses the "since" parameter to only query for new stories every update.
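The polling loop around that parameter is simple to keep incremental. A sketch with the HTTP call stubbed out (the `fetch(since)` callable and the `updated` field name are assumptions, not Reuters' actual schema):

```python
def poll_updates(fetch, cursor):
    """One polling step against a since-style JSON API.

    `fetch(since)` is a placeholder for the HTTP request; it should return
    only articles newer than `since`. The cursor advances to the newest
    timestamp seen, so each update round-trips only new stories.
    """
    items = fetch(cursor)
    if items:
        cursor = max(item["updated"] for item in items)
    return items, cursor
```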
For some reason they randomly throw in videos more than a year or two old that don't seem to have any modern context or reason for being brought back. Any URLs that start with https://www.cbsnews.com/video/ can probably just be filtered out anyway.
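The filter itself is a one-liner; sketch below (function name is mine):

```python
def is_recycled_video(url):
    """CBS resurfaces old videos with no apparent reason; drop /video/ URLs."""
    return url.startswith("https://www.cbsnews.com/video/")
```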
Every now and then I'm getting this error from bs4:
Python\Python310\lib\site-packages\bs4\__init__.py:337: MarkupResemblesLocatorWarning:
"..." looks like a directory name, not markup. You may want to open a file found in this directory
and pass the filehandle into Beautiful Soup.
It looks like one of the webpages is returning something that isn't HTML and it's confusing bs4. Not sure which site is doing it yet, but it may just be something that needs to be suppressed.
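Rather than suppressing the warning globally, one option is a cheap guard before the body ever reaches bs4, which would also log which source is misbehaving. A heuristic sketch (thresholds and name are arbitrary):

```python
def looks_like_markup(text, content_type=None):
    """Heuristic pre-check before handing a response body to BeautifulSoup.

    Rejects the non-HTML payloads (e.g. a bare path or plain string)
    that trigger MarkupResemblesLocatorWarning.
    """
    if content_type and "html" not in content_type and "xml" not in content_type:
        return False
    # real markup should show a tag early in the body
    return "<" in text[:512]
```

Wiring it in as `if not looks_like_markup(body, resp.content_type): log and skip` would surface the offending site instead of hiding it.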
It looks like they've changed their markup so the parser needs to be updated.
Currently the firehose only stores keys used for deduplication. It would be more useful to store all of the article metadata to allow for consumers to "catch up" if they go offline for some time.
This doesn't need to be fancy, SQLite should do just fine for the volume of data and simple range queries we're working with.
If it's possible to do the deduplication in the SQLite DB and ditch the log files entirely, all the better.
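Rough sketch of what that could look like (table/column names are placeholders). A primary key on the existing dedup key lets `INSERT OR IGNORE` do the deduplication, and a timestamp column covers the catch-up range queries:

```python
import sqlite3
import time


def open_db(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("""
        CREATE TABLE IF NOT EXISTS articles (
            key    TEXT PRIMARY KEY,  -- the dedup key the firehose already computes
            source TEXT,
            title  TEXT,
            url    TEXT,
            ts     REAL               -- unix timestamp, for range queries
        )
    """)
    return db


def push(db, key, source, title, url, ts=None):
    # INSERT OR IGNORE makes the primary key double as the dedup check,
    # replacing the log files in one step
    cur = db.execute(
        "INSERT OR IGNORE INTO articles VALUES (?, ?, ?, ?, ?)",
        (key, source, title, url, ts if ts is not None else time.time()),
    )
    db.commit()
    return cur.rowcount == 1  # True if new, False if a duplicate


def fetch_since(db, ts):
    # lets a consumer that went offline catch up from its last-seen timestamp
    rows = db.execute(
        "SELECT key, source, title, url, ts FROM articles WHERE ts > ? ORDER BY ts",
        (ts,),
    )
    return rows.fetchall()
```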
Most news sources contain multiple category feeds (politics, technology, science, world, etc), so it would be useful to include those in the output stream.
Maybe the RSSSource classes' .feeds can allow tuples of (category, url). If an entry isn't a tuple, the category is None (backwards compatible).
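The normalization is a couple of lines; a sketch (helper name is mine):

```python
def split_feed(entry):
    """Accept either a bare URL string or a (category, url) tuple.

    Bare strings map to category None, so existing .feeds lists keep
    working unchanged.
    """
    if isinstance(entry, tuple):
        return entry
    return None, entry
```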