flairnlp / fundus
A very simple news crawler with a funny name
License: MIT License
I did some work on qse when I noticed that some parsers did not work. The reason is that the publishers updated the layout of their websites. Some of these changes affected fundus (#162). These incidents raise the question of how such changes will be detected in the future. An automated approach would be best, but I have no solution at the moment.
The current code is written for Python 3.10, but that would make the library unattractive in projects that use older Python versions. The oldest currently supported Python version is 3.7, so it would make sense to refactor fundus for 3.7 (or 3.8 if we think 3.7 might be retired somewhat soon).
The current implementation differs from the state of fundus such that we would have to rewrite it a bit to be compatible again.
I tried writing a parser for https://freebeacon.com/, but our current code base fails to parse their LD:
(The first block is the LD that breaks; the traceback follows.)
{'@context': 'https://schema.org', '@graph': [{'@type': 'Article', '@id': 'https://freebeacon.com/biden-administration/senators-accuse-biden-admin-of-sneaking-left-wing-policies-into-semiconductor-bill/#article', 'isPartOf': {'@id': 'https://freebeacon.com/biden-administration/senators-accuse-biden-admin-of-sneaking-left-wing-policies-into-semiconductor-bill/'}, 'author': [{'@id': 'https://freebeacon.com/#/schema/person/d77e0840977af7e659efd282d1ef5122'}], 'headline': 'Senators Accuse Biden Admin of Sneaking Left-Wing Policies Into Semiconductor Bill', 'datePublished': '2023-03-29T21:50:55+00:00', 'dateModified': '2023-03-29T20:48:27+00:00', 'mainEntityOfPage': {'@id': 'https://freebeacon.com/biden-administration/senators-accuse-biden-admin-of-sneaking-left-wing-policies-into-semiconductor-bill/'}, 'wordCount': 342, 'publisher': {'@id': 'https://freebeacon.com/#organization'}, 'image': {'@id': 'https://freebeacon.com/biden-administration/senators-accuse-biden-admin-of-sneaking-left-wing-policies-into-semiconductor-bill/#primaryimage'}, 'thumbnailUrl': 'https://freebeacon.com/wp-content/uploads/2021/01/rsz_gettyimages-492468882.jpg', 'keywords': ['Biden Administration', 'Department of Commerce', 'Environmentalism', 'Gina Raimondo', 'Technology', 'Ted Cruz', 'woke'], 'articleSection': ['Biden Administration'], 'inLanguage': 'en-US'}, {'@type': 'WebPage', '@id': 'https://freebeacon.com/biden-administration/senators-accuse-biden-admin-of-sneaking-left-wing-policies-into-semiconductor-bill/', 'url': 'https://freebeacon.com/biden-administration/senators-accuse-biden-admin-of-sneaking-left-wing-policies-into-semiconductor-bill/', 'name': 'Senators Accuse Biden Admin of Sneaking Left-Wing Policies Into Semiconductor Bill', 'isPartOf': {'@id': 'https://freebeacon.com/#website'}, 'primaryImageOfPage': {'@id': 'https://freebeacon.com/biden-administration/senators-accuse-biden-admin-of-sneaking-left-wing-policies-into-semiconductor-bill/#primaryimage'}, 'image': {'@id': 'https://freebeacon.com/biden-administration/senators-accuse-biden-admin-of-sneaking-left-wing-policies-into-semiconductor-bill/#primaryimage'}, 'thumbnailUrl': 'https://freebeacon.com/wp-content/uploads/2021/01/rsz_gettyimages-492468882.jpg', 'datePublished': '2023-03-29T21:50:55+00:00', 'dateModified': '2023-03-29T20:48:27+00:00', 'description': 'Republican senators are accusing the Biden administration of pursuing liberal social policies through the implementation of the CHIPS Act, the bipartisan bill enacted last year to boost domestic semiconductor production.', 'breadcrumb': {'@id': 'https://freebeacon.com/biden-administration/senators-accuse-biden-admin-of-sneaking-left-wing-policies-into-semiconductor-bill/#breadcrumb'}, 'inLanguage': 'en-US', 'potentialAction': [{'@type': 'ReadAction', 'target': ['https://freebeacon.com/biden-administration/senators-accuse-biden-admin-of-sneaking-left-wing-policies-into-semiconductor-bill/']}]}, {'@type': 'ImageObject', 'inLanguage': 'en-US', '@id': 'https://freebeacon.com/biden-administration/senators-accuse-biden-admin-of-sneaking-left-wing-policies-into-semiconductor-bill/#primaryimage', 'url': 'https://freebeacon.com/wp-content/uploads/2021/01/rsz_gettyimages-492468882.jpg', 'contentUrl': 'https://freebeacon.com/wp-content/uploads/2021/01/rsz_gettyimages-492468882.jpg', 'width': 736, 'height': 514, 'caption': 'Commerce Secretary Gina Raimondo / Getty Images'}, {'@type': 'BreadcrumbList', '@id': 
'https://freebeacon.com/biden-administration/senators-accuse-biden-admin-of-sneaking-left-wing-policies-into-semiconductor-bill/#breadcrumb', 'itemListElement': [{'@type': 'ListItem', 'position': 1, 'name': 'Home', 'item': 'https://freebeacon.com/'}, {'@type': 'ListItem', 'position': 2, 'name': 'Senators Accuse Biden Admin of Sneaking Left-Wing Policies Into Semiconductor Bill'}]}, {'@type': 'WebSite', '@id': 'https://freebeacon.com/#website', 'url': 'https://freebeacon.com/', 'name': 'Washington Free Beacon', 'description': '', 'publisher': {'@id': 'https://freebeacon.com/#organization'}, 'potentialAction': [{'@type': 'SearchAction', 'target': {'@type': 'EntryPoint', 'urlTemplate': 'https://freebeacon.com/?s={search_term_string}'}, 'query-input': 'required name=search_term_string'}], 'inLanguage': 'en-US'}, {'@type': 'Organization', '@id': 'https://freebeacon.com/#organization', 'name': 'Washington Free Beacon', 'url': 'https://freebeacon.com/', 'logo': {'@type': 'ImageObject', 'inLanguage': 'en-US', '@id': 'https://freebeacon.com/#/schema/logo/image/', 'url': 'https://freebeacon.com/wp-content/uploads/2021/09/open-graph-fallback.png', 'contentUrl': 'https://freebeacon.com/wp-content/uploads/2021/09/open-graph-fallback.png', 'width': 323, 'height': 198, 'caption': 'Washington Free Beacon'}, 'image': {'@id': 'https://freebeacon.com/#/schema/logo/image/'}, 'sameAs': ['https://www.instagram.com/washingtonfreebeacon/', 'https://www.linkedin.com/company-beta/6343616/', 'https://www.youtube.com/user/WashingtonFreeBeacon', 'https://www.facebook.com/FreeBeacon/', 'https://twitter.com/FreeBeacon']}, {'@type': 'Person', '@id': 'https://freebeacon.com/#/schema/person/d77e0840977af7e659efd282d1ef5122', 'name': 'Claire Sprang', 'image': {'@type': 'ImageObject', 'inLanguage': 'en-US', '@id': 'https://freebeacon.com/#/schema/person/image/8786285c5f340898704e49c40c3b5f51', 'url': 'https://secure.gravatar.com/avatar/89b9be77282462fe9431c3568792f1bb?s=96&d=https%3A%2F%2Ffreebeacon.com%2Fwp-content%2Fthemes%2Ffreebeacon%2Fimages%2Fsmoking-man-100.jpg&r=g', 'contentUrl': 'https://secure.gravatar.com/avatar/89b9be77282462fe9431c3568792f1bb?s=96&d=https%3A%2F%2Ffreebeacon.com%2Fwp-content%2Fthemes%2Ffreebeacon%2Fimages%2Fsmoking-man-100.jpg&r=g', 'caption': 'Claire Sprang'}, 'url': 'https://freebeacon.com/author/claire-sprang/'}]}
Traceback (most recent call last):
File "/home/aaron/Code/Python/Fundus/examples/example_crawler.py", line 30, in <module>
for article in crawler.crawl(max_articles=5, error_handling="raise"):
File "/home/aaron/Code/Python/Fundus/src/scraping/pipeline.py", line 28, in run
yield next(robin)
File "/home/aaron/.conda/envs/fundus/lib/python3.8/site-packages/more_itertools/more.py", line 1111, in <genexpr>
return (x for x in i if x is not _marker)
File "/home/aaron/Code/Python/Fundus/src/scraping/scraper.py", line 22, in scrape
raise err
File "/home/aaron/Code/Python/Fundus/src/scraping/scraper.py", line 18, in scrape
data = self.parser.parse(article_source.html, error_handling)
File "/home/aaron/Code/Python/Fundus/src/parser/html_parser/base_parser.py", line 136, in parse
self._base_setup(html)
File "/home/aaron/Code/Python/Fundus/src/parser/html_parser/base_parser.py", line 132, in _base_setup
self.precomputed = Precomputed(html, doc, get_meta_content(doc), LinkedData(collapsed_lds))
File "/home/aaron/Code/Python/Fundus/src/parser/html_parser/data.py", line 26, in __init__
raise ValueError(f"Found no type for LD")
ValueError: Found no type for LD
Process finished with exit code 1
They are using https://schema.org/ as a baseline, so we should support this eventually. I will stop working on this particular parser until this is resolved.
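For reference, the LD that breaks has no top-level '@type' at all; the typed entities sit inside an '@graph' list. A minimal sketch of how such JSON-LD could be flattened before the type lookup (the function is illustrative, not the current fundus API):

from typing import Any, Dict, List

def flatten_graph(ld: Dict[str, Any]) -> List[Dict[str, Any]]:
    # schema.org pages often ship one JSON-LD object whose top level only
    # carries '@context' and an '@graph' list of typed entities
    if "@graph" in ld:
        return [entity for entity in ld["@graph"] if "@type" in entity]
    # fall back to treating the object itself as a single typed entity
    return [ld] if "@type" in ld else []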
Currently, AutoPipeline is the main access point to our project for users:
from src.library.collection import PublisherCollection
from src.scraping.pipeline import AutoPipeline

pipeline = AutoPipeline(PublisherCollection.de_de)  # e.g. all German-language publishers
for article in pipeline.run(max_articles=5):
    print(article)
As @alanakbik suggested, this naming schema (especially AutoPipeline) and import structure may have a deterrent effect on new users.
This issue's goal is to rethink the naming of:
- the AutoPipeline class
- PublisherCollection
- the former ArticlePipeline
https://requests.readthedocs.io/en/latest/user/quickstart/#binary-response-content suggests that it does, but we are not quite sure.
This may include:
- Links to the different types of contributions
- Best practices
- ...
@register_attribute
def authors(self) -> List[str]:
    raw_str = self.cache['doc'].xpath('normalize-space('
                                      '//ul[@class="smallList"]'
                                      '/li[strong[contains(text(), "Auto")]]'
                                      '/text()[last()]'
                                      ')')
    if raw_str:
        return raw_str.split(', ')
    return []
This is the current code for authors of the dw-parser. But what should happen if there are no authors? This issue appears with the other attributes as well. We should document our decision!
I am in favor of returning empty lists etc. and changing the type hints accordingly.
Currently, the Precomputed class consists of very overloaded names, while the class name itself yields little information about its nature:
class Precomputed:
    html: str
    doc: lxml.html.HtmlElement
    meta: Dict[str, str]
    ld: LinkedData
    cache: Dict[str, Any] = field(default_factory=dict)
The goal of this issue is to rename:
- Precomputed, as well as the precomputed attribute of BaseParser
- ld, doc, meta
The above list also includes renaming all instances of those names, and this issue should be closed with a PR.
We have to formulate some contribution guidelines, especially concerning the library section. E.g.:
- Every parser added to the repo library is only allowed to use attributes (name, semantics, return type) from the attribute guideline.
- How to propose new attributes?
- etc.
The current Focus Parser extracts authors like 'Von FOCUS-online-Redakteur Thomas Sabin'. Should we clean them?
All of these points are suggestions, discussion is encouraged.
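If we decide to clean author strings like the Focus example above, a minimal sketch (the prefix pattern is an assumption derived from that single example):

import re

# strips role prefixes like "Von FOCUS-online-Redakteur " from author strings
_AUTHOR_PREFIX = re.compile(r"^Von\s+(FOCUS-online-\w+\s+)?", flags=re.IGNORECASE)

def clean_author(raw: str) -> str:
    return _AUTHOR_PREFIX.sub("", raw).strip()

assert clean_author("Von FOCUS-online-Redakteur Thomas Sabin") == "Thomas Sabin"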
Article class for direct access
I would expect the source field of Article to contain information on the article source. I.e. if an article was crawled from welt.de, I would expect source to contain the value DieWelt or WELT. Similarly, if an article was crawled from FAZ, I would expect this field to contain the string FAZ.
However, when I run this code:
from src.library.collection import PublisherCollection
from src.scraping.pipeline import AutoPipeline

pipeline = AutoPipeline(PublisherCollection.de_de.FAZ)
for article in pipeline.run(max_articles=5):
    print(article.source)
It just prints:
<src.scraping.crawler.crawler.RSSCrawler object at 0x7f9f2699af10>
<src.scraping.crawler.crawler.RSSCrawler object at 0x7f9f2699af10>
<src.scraping.crawler.crawler.RSSCrawler object at 0x7f9f2699af10>
<src.scraping.crawler.crawler.RSSCrawler object at 0x7f9f2699af10>
<src.scraping.crawler.crawler.RSSCrawler object at 0x7f9f2699af10>
i.e. a reference to the crawler object.
Is this desired behavior? Is there any way for me to get from an Article the information about which source it is from (aside from parsing the url field)?
The current test cases for attribute annotations rely on a hard-coded mapping. This approach has two major issues.
To solve this, the mapping should be parsed from the guidelines.
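A sketch of what parsing that mapping could look like, assuming the guidelines are kept as a markdown table whose first and last columns are attribute name and return type (file path and layout are assumptions):

import re
from pathlib import Path
from typing import Dict

GUIDELINES = Path("docs/attribute_guidelines.md")  # hypothetical location
ROW = re.compile(r"^\|\s*(\w+)\s*\|.*\|\s*([^|]+?)\s*\|$")

def load_attribute_mapping() -> Dict[str, str]:
    # maps attribute name -> return type string, e.g. 'authors' -> 'List[str]'
    # (filtering out the table header row is omitted for brevity)
    mapping: Dict[str, str] = {}
    for line in GUIDELINES.read_text(encoding="utf-8").splitlines():
        match = ROW.match(line.strip())
        if match:
            attribute, return_type = match.groups()
            mapping[attribute] = return_type
    return mapping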
This code currently throws an error:
from src.library.collection import PublisherCollection
from src.scraping.pipeline import AutoPipeline

pipeline = AutoPipeline(PublisherCollection.de_de.Merkur)
for article in pipeline.run(max_articles=5):
    print(article)
    print()
It throws:
Traceback (most recent call last):
File "/home/alan/PycharmProjects/fundus/local_crawl.py", line 6, in <module>
for article in pipeline.run(max_articles=5):
File "/home/alan/PycharmProjects/fundus/src/scraping/pipeline.py", line 54, in run
yield next(robin)
File "/home/alan/.environments/fundus/lib/python3.8/site-packages/more_itertools/more.py", line 1111, in <genexpr>
return (x for x in i if x is not _marker)
File "/home/alan/PycharmProjects/fundus/src/scraping/scraper.py", line 28, in scrape
raise err
File "/home/alan/PycharmProjects/fundus/src/scraping/scraper.py", line 23, in scrape
data = self.parser.parse(article_source.html, error_handling)
File "/home/alan/PycharmProjects/fundus/src/parser/html_parser/base_parser.py", line 164, in parse
raise err
File "/home/alan/PycharmProjects/fundus/src/parser/html_parser/base_parser.py", line 161, in parse
parsed_data[attribute_name] = func()
File "/home/alan/PycharmProjects/fundus/src/parser/html_parser/base_parser.py", line 42, in __call__
return self.__func__(self.__self__, *args, *kwargs)
File "/home/alan/PycharmProjects/fundus/src/library/de_de/merkur_parser.py", line 20, in authors
return generic_author_parsing(self.precomputed.ld.bf_search("author"))
File "/home/alan/PycharmProjects/fundus/src/parser/html_parser/utility.py", line 114, in generic_author_parsing
return [name.strip() for name in authors]
File "/home/alan/PycharmProjects/fundus/src/parser/html_parser/utility.py", line 114, in <listcomp>
return [name.strip() for name in authors]
AttributeError: 'list' object has no attribute 'strip'
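For reference, the traceback shows generic_author_parsing receiving a nested list from bf_search("author"). A sketch of a more defensive normalization (not the actual fundus implementation) that flattens the common JSON-LD author shapes:

from typing import Any, List

def generic_author_parsing(value: Any) -> List[str]:
    if value is None:
        return []
    if isinstance(value, str):
        return [name.strip() for name in value.split(",")]
    if isinstance(value, dict):
        # JSON-LD authors are often objects like {'@type': 'Person', 'name': ...}
        name = value.get("name")
        return [name.strip()] if isinstance(name, str) else []
    if isinstance(value, list):
        # flatten nested lists, the case that triggered the AttributeError above
        names: List[str] = []
        for item in value:
            names.extend(generic_author_parsing(item))
        return names
    return []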
One way to test Fundus would be to execute a few example projects ourselves and see if Fundus is up to the task :)
Here is an idea for an example project: Make a corpus of funny German text
Steps:
Since there is currently some progress on attribute guidelines specifying attribute names, their semantic meaning, and Python return types, the question came up once again whether we should validate attribute return types at runtime. For example, if we have a specified attribute body with some description and ArticleBody as the specified return type, should we throw a warning/error if a particular parser implementation specifies the wrong return type, or none at all, as shown in this example:
class ExampleParser(BaseParser):
    @register_attribute
    def body(self) -> str:
        return 'body'
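A sketch of what such a runtime check could look like, comparing the registered attribute's return annotation against the guideline type (function name and hook point are assumptions, and register_attribute is assumed to leave the annotations accessible):

from typing import get_type_hints

def validate_attribute_annotation(parser_cls: type, attribute_name: str, expected: type) -> None:
    func = getattr(parser_cls, attribute_name)
    annotated = get_type_hints(func).get("return")
    if annotated is None:
        raise TypeError(f"{attribute_name!r} specifies no return type at all")
    if annotated != expected:
        raise TypeError(f"{attribute_name!r}: guideline expects {expected}, parser returns {annotated}")

With a guideline entry mapping body to ArticleBody, validate_attribute_annotation(ExampleParser, "body", ArticleBody) would raise for the str annotation above.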
Calling any code currently throws an error, since an import is incorrect.
For instance, calling
from src.library.collection import PublisherCollection
from src.scraping.pipeline import AutoPipeline

pipeline = AutoPipeline(PublisherCollection.de_de.BerlinerZeitung)
for article in pipeline.run(max_articles=5):
    print(article)
    print()
will throw:
ImportError: cannot import name 'extract_article_body_with_css' from 'src.parser.html_parser.utility' (/home/alan/PycharmProjects/fundus/src/parser/html_parser/utility.py)
Consisting of:
As #12 stated, it appears that the parser sometimes returns None for plaintext. I guess that in this case the plaintext actually is None, mostly due to the processed article type (video etc.), but this should be investigated and properly communicated.
I want to analyze the language used when discussing soccer.
Task: Create a corpus of soccer-related language using Fundus. At least 500 articles.
@register_attribute
def topics(self) -> List[str]:
    if keyword_str := self.meta().get('keywords'):
        return keyword_str.split(', ')
Quite a few pieces of code from the parsers work like this example. I was wondering why meta needs parentheses. The reason is that meta is exposed to the public as well:
@register_attribute
def meta(self) -> dict[str, Any]:
    return self.cache.get('meta')
This is a bad dependency: you may alter the return value of meta without considering its uses downstream.
Unfortunately, I have no good solution for this at the moment.
First of all, we have to agree on a place where to put them:
Then we have to formulate specific guidelines for every parser attribute added in the base version.
These guidelines should especially restrict names, as well as define the semantic meaning and the return type of an attribute. E.g. this project understands plaintext as the ordered accumulation of all paragraphs included in the actual article body. This excludes everything else (headline, section headlines, etc.).
There are currently test cases like this which include empty values for attributes. I would consider those bad test cases, since they cover less than they could. To avoid example HTML like this in the future, I would suggest adding an assertion to test_parsing() asserting bool(attr) if attr is not of type bool.
Update:
Points to a valid example JSON again. Thanks to @dobbersc for pointing this out.
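A sketch of the suggested assertion, assuming test_parsing() holds the parsed attributes in a dict (the variable name extraction is hypothetical):

for attribute_name, value in extraction.items():
    # booleans are exempt, since False is a legitimate parsed value
    if not isinstance(value, bool):
        assert bool(value), f"empty example value for attribute {attribute_name!r}"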
Besides a sitemap crawler, #58 also introduced automatic sitemap detection. While this seemed to be an enhancement at first, loading the sitemaps takes between 1 and 5 seconds, which really impacts the user experience in a bad way.
I think this feature should be exchanged for the former static one.
This is important, since most of the work in adding a parser consists of writing XPath/CSS selectors. We should ease this part of the contribution.
It would be very nice to include a short printout of the additional available attributes (supported and unsupported). I'd prefer this in a separate PR but wanted to mention it to not get forgotten.
Originally posted by @dobbersc in #155 (comment)
As @dobbersc pointed out in #148, the current resolution comes with some problems. To keep an implementation using absolute paths rather than relative ones, I would rework the current one as follows: introduce root, referencing the absolute path to the project root directory.
Update: @dobbersc pointed out in #148 that there wouldn't be any difference in using pathlib over os besides the advantages of the former, and thus proposed to use pathlib instead of os.
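A sketch of the proposed constant with pathlib, assuming it is defined in a module directly below the project root (e.g. a hypothetical src/resolve.py):

from pathlib import Path

# parents[0] is src/, parents[1] is the project root
root = Path(__file__).resolve().parents[1]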
Since we are using mypy 1.0, the Self type hint is supported and can be used in our repository. For our supported Python versions we can import it from typing_extensions with from typing_extensions import Self. For more information, see the corresponding PEP (PEP 673).
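For illustration, a generic example (not from the fundus code base) of where Self pays off:

from typing_extensions import Self

class BaseBuilder:
    def with_option(self, key: str, value: str) -> Self:
        # annotating Self instead of 'BaseBuilder' lets subclasses chain
        # calls without mypy downcasting the result to the base class
        return self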
This will help users with choosing the best crawlers for their purpose.
Should be tests/resources
instead of tests/ressources
;)
see https://github.com/flairNLP/fundus/tree/master/tests/ressources
The current documentation and guidelines lack crucial information and should be expanded with the following:
- BaseParser.parse() and its correlation with Precomputed (should be done after #94)
- Remove @function from the documentation, since it isn't in use yet.
- self.precomputed.ld.bf_search("datePublished")
- src/parser/html_parser/utils.py
https://choosealicense.com/ has a solid overview of the options.
It would be helpful if printing an Article object resulted in a nicely readable string representation of the article, without full information on all attributes.
I.e.:
crawler = Crawler(PublisherCollection.de_de)
for article in crawler.crawl(max_articles=5):
    print(article)
would print something like:
--------------------
"IT-Schule von VW und SAP: Der Staat scheitert am Fachkräftemangel"
- by Daniel Zwick (WELT.DE)
"Während die Bundesregierung ausländischen
Fachkräften den Zuzug erleichtern will, helfen
sich Unternehmen wie SAP und Volkswagen selbst.
Im WELT-Interview erklären die Personalchefs der
beiden Dax-Konzerne, warum sie ihren IT- [...]"
from: https://www.welt.de/wirtschaft/article242481623/IT-Schule-von-VW-und-SAP-Der-Staat-scheitert-am-Fachkraeftemangel.html (crawled 2022-12-06 08:09:44.059753)
--------------------
"Zweiter Weltkrieg: Polen will im Streit um Reparationen weiter eskalieren"
- by WELT (WELT.DE)
"Zu seinem Antrittsbesuch in Berlin reist Polens
neuer Vize-Außenminister Arkadiusz Mularczyk mit
einem Bericht im Gepäck. Darin fordert Warschau
von Deutschland Reparationen in Höhe von 1,3
Billionen Euro – und droht mit weiterer [...]"
from: https://www.welt.de/politik/ausland/article242511375/Zweiter-Weltkrieg-Polen-will-im-Streit-um-Reparationen-weiter-eskalieren.html (crawled 2022-12-06 08:09:43.338912)
--------------------
Here is my attempt at a script for this:
import textwrap

from src.library.collection import PublisherCollection
from src.scraping.crawler.crawler import Crawler  # assumed import path for Crawler

de_de = PublisherCollection.de_de
crawler = Crawler(de_de)
for article in crawler.crawl(max_articles=2, error_handling='raise'):
    title = article.extracted['title']
    author = article.extracted['authors']

    # Wrap this text.
    wrapper = textwrap.TextWrapper(width=50, max_lines=5, initial_indent='"', subsequent_indent=' ')
    word_list = wrapper.wrap(text=article.extracted['plaintext'])
    text_sample = '\n'.join(word_list) + '"'

    print('-' * 20)
    print(f'"{title}"\n - by {author[0]} (WELT.DE)')
    print("\n" + text_sample + "\n")
    print(f'from: {article.url} (crawled {article.crawl_date})')
    print('-' * 20)
However, building this script I noticed some small issues:
- article.title instead of article.extracted['title'] would be preferable.
- article.extracted['plaintext'] is sometimes None.
- Article should know the proper name of its "newspaper". I am thinking of a field like article.source that just prints "Die WELT" if crawled from there.
- article.extracted['authors'] is sometimes the newspaper name.
This issue is about tracking progress with porting the crawlers from qse. A crawler is considered done after a PR that introduces it has been merged to main.
I just remembered that there are some more:
Some that I worked on locally:
@dobbersc pointed out that XPath and CSS expressions can be pre-compiled in lxml via the XPath and CSSSelector classes, respectively.
Imo, pre-compiling the selectors could have two major benefits:
- Functions like extract_article_body_with_selector() currently depend on a rather ugly mode parameter. This makes mixing XPath and CSS selectors impossible in the first place and ugly to use in the second. With pre-compiled selectors, functions like the one mentioned above no longer depend on an ambiguous string representation of the selector but on an actual object, allowing the function to differentiate automatically (see the sketch below).
The goal of this issue is to enforce a PR that implements the following:
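A minimal sketch of that differentiation, with illustrative selectors (not taken from the fundus code base):

from lxml.cssselect import CSSSelector
from lxml.etree import XPath
import lxml.html

# compiled once, reusable across documents
paragraph_selector = XPath("//div[@class='article-body']//p")
summary_selector = CSSSelector("div.article-summary > p")

def extract_text(doc: lxml.html.HtmlElement, selector) -> list:
    # XPath and CSSSelector instances are both callable on a document,
    # so no string-based 'mode' parameter is needed to tell them apart
    return [node.text_content().strip() for node in selector(doc)]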
It would be great to add some crawlers for English-language news to the library (once the guidelines are finalized).
For another research project on detecting political bias and reliability, we require crawlers for the following 12 sources:
The table on the front page, https://github.com/flairNLP/fundus#currently-supported-news-sources, is helpful but incomplete, since not all currently supported news sources are listed.
Task: Complete this table.
Do we want to set up flake8, mypy, and tox right from the start? Same question for ensuring that the package is pip-installable (for now only from GitHub).
If desired, I could add the same setup as in the quotemine repository.