flairnlp / fundus
A very simple news crawler with a funny name
License: MIT License
I did some work on qse when I noticed that some parsers did not work. The reason is that the publishers updated the layout of their websites. Some of these changes affected fundus (#162). These incidents raise the question of how such changes will be detected in the future. An automated approach would be best, but I have no solution at the moment.
The current code is written for Python 3.10, but that would make the library unattractive in projects that use older Python versions. The oldest currently supported Python version is 3.7, so it would make sense to refactor fundus for 3.7 (or 3.8 if we think 3.7 might be retired somewhat soon).
The current implementation differs from the state of fundus such that we would have to rewrite it a bit to be compatible again.
I tried writing a parser for https://freebeacon.com/, but our current code base fails to parse their LD:
(The first block is the LD that breaks; the traceback follows.)
{'@context': 'https://schema.org', '@graph': [{'@type': 'Article', '@id': 'https://freebeacon.com/biden-administration/senators-accuse-biden-admin-of-sneaking-left-wing-policies-into-semiconductor-bill/#article', 'isPartOf': {'@id': 'https://freebeacon.com/biden-administration/senators-accuse-biden-admin-of-sneaking-left-wing-policies-into-semiconductor-bill/'}, 'author': [{'@id': 'https://freebeacon.com/#/schema/person/d77e0840977af7e659efd282d1ef5122'}], 'headline': 'Senators Accuse Biden Admin of Sneaking Left-Wing Policies Into Semiconductor Bill', 'datePublished': '2023-03-29T21:50:55+00:00', 'dateModified': '2023-03-29T20:48:27+00:00', 'mainEntityOfPage': {'@id': 'https://freebeacon.com/biden-administration/senators-accuse-biden-admin-of-sneaking-left-wing-policies-into-semiconductor-bill/'}, 'wordCount': 342, 'publisher': {'@id': 'https://freebeacon.com/#organization'}, 'image': {'@id': 'https://freebeacon.com/biden-administration/senators-accuse-biden-admin-of-sneaking-left-wing-policies-into-semiconductor-bill/#primaryimage'}, 'thumbnailUrl': 'https://freebeacon.com/wp-content/uploads/2021/01/rsz_gettyimages-492468882.jpg', 'keywords': ['Biden Administration', 'Department of Commerce', 'Environmentalism', 'Gina Raimondo', 'Technology', 'Ted Cruz', 'woke'], 'articleSection': ['Biden Administration'], 'inLanguage': 'en-US'}, {'@type': 'WebPage', '@id': 'https://freebeacon.com/biden-administration/senators-accuse-biden-admin-of-sneaking-left-wing-policies-into-semiconductor-bill/', 'url': 'https://freebeacon.com/biden-administration/senators-accuse-biden-admin-of-sneaking-left-wing-policies-into-semiconductor-bill/', 'name': 'Senators Accuse Biden Admin of Sneaking Left-Wing Policies Into Semiconductor Bill', 'isPartOf': {'@id': 'https://freebeacon.com/#website'}, 'primaryImageOfPage': {'@id': 'https://freebeacon.com/biden-administration/senators-accuse-biden-admin-of-sneaking-left-wing-policies-into-semiconductor-bill/#primaryimage'}, 'image': {'@id': 'https://freebeacon.com/biden-administration/senators-accuse-biden-admin-of-sneaking-left-wing-policies-into-semiconductor-bill/#primaryimage'}, 'thumbnailUrl': 'https://freebeacon.com/wp-content/uploads/2021/01/rsz_gettyimages-492468882.jpg', 'datePublished': '2023-03-29T21:50:55+00:00', 'dateModified': '2023-03-29T20:48:27+00:00', 'description': 'Republican senators are accusing the Biden administration of pursuing liberal social policies through the implementation of the CHIPS Act, the bipartisan bill enacted last year to boost domestic semiconductor production.', 'breadcrumb': {'@id': 'https://freebeacon.com/biden-administration/senators-accuse-biden-admin-of-sneaking-left-wing-policies-into-semiconductor-bill/#breadcrumb'}, 'inLanguage': 'en-US', 'potentialAction': [{'@type': 'ReadAction', 'target': ['https://freebeacon.com/biden-administration/senators-accuse-biden-admin-of-sneaking-left-wing-policies-into-semiconductor-bill/']}]}, {'@type': 'ImageObject', 'inLanguage': 'en-US', '@id': 'https://freebeacon.com/biden-administration/senators-accuse-biden-admin-of-sneaking-left-wing-policies-into-semiconductor-bill/#primaryimage', 'url': 'https://freebeacon.com/wp-content/uploads/2021/01/rsz_gettyimages-492468882.jpg', 'contentUrl': 'https://freebeacon.com/wp-content/uploads/2021/01/rsz_gettyimages-492468882.jpg', 'width': 736, 'height': 514, 'caption': 'Commerce Secretary Gina Raimondo / Getty Images'}, {'@type': 'BreadcrumbList', '@id': 
'https://freebeacon.com/biden-administration/senators-accuse-biden-admin-of-sneaking-left-wing-policies-into-semiconductor-bill/#breadcrumb', 'itemListElement': [{'@type': 'ListItem', 'position': 1, 'name': 'Home', 'item': 'https://freebeacon.com/'}, {'@type': 'ListItem', 'position': 2, 'name': 'Senators Accuse Biden Admin of Sneaking Left-Wing Policies Into Semiconductor Bill'}]}, {'@type': 'WebSite', '@id': 'https://freebeacon.com/#website', 'url': 'https://freebeacon.com/', 'name': 'Washington Free Beacon', 'description': '', 'publisher': {'@id': 'https://freebeacon.com/#organization'}, 'potentialAction': [{'@type': 'SearchAction', 'target': {'@type': 'EntryPoint', 'urlTemplate': 'https://freebeacon.com/?s={search_term_string}'}, 'query-input': 'required name=search_term_string'}], 'inLanguage': 'en-US'}, {'@type': 'Organization', '@id': 'https://freebeacon.com/#organization', 'name': 'Washington Free Beacon', 'url': 'https://freebeacon.com/', 'logo': {'@type': 'ImageObject', 'inLanguage': 'en-US', '@id': 'https://freebeacon.com/#/schema/logo/image/', 'url': 'https://freebeacon.com/wp-content/uploads/2021/09/open-graph-fallback.png', 'contentUrl': 'https://freebeacon.com/wp-content/uploads/2021/09/open-graph-fallback.png', 'width': 323, 'height': 198, 'caption': 'Washington Free Beacon'}, 'image': {'@id': 'https://freebeacon.com/#/schema/logo/image/'}, 'sameAs': ['https://www.instagram.com/washingtonfreebeacon/', 'https://www.linkedin.com/company-beta/6343616/', 'https://www.youtube.com/user/WashingtonFreeBeacon', 'https://www.facebook.com/FreeBeacon/', 'https://twitter.com/FreeBeacon']}, {'@type': 'Person', '@id': 'https://freebeacon.com/#/schema/person/d77e0840977af7e659efd282d1ef5122', 'name': 'Claire Sprang', 'image': {'@type': 'ImageObject', 'inLanguage': 'en-US', '@id': 'https://freebeacon.com/#/schema/person/image/8786285c5f340898704e49c40c3b5f51', 'url': 'https://secure.gravatar.com/avatar/89b9be77282462fe9431c3568792f1bb?s=96&d=https%3A%2F%2Ffreebeacon.com%2Fwp-content%2Fthemes%2Ffreebeacon%2Fimages%2Fsmoking-man-100.jpg&r=g', 'contentUrl': 'https://secure.gravatar.com/avatar/89b9be77282462fe9431c3568792f1bb?s=96&d=https%3A%2F%2Ffreebeacon.com%2Fwp-content%2Fthemes%2Ffreebeacon%2Fimages%2Fsmoking-man-100.jpg&r=g', 'caption': 'Claire Sprang'}, 'url': 'https://freebeacon.com/author/claire-sprang/'}]}
Traceback (most recent call last):
File "/home/aaron/Code/Python/Fundus/examples/example_crawler.py", line 30, in <module>
for article in crawler.crawl(max_articles=5, error_handling="raise"):
File "/home/aaron/Code/Python/Fundus/src/scraping/pipeline.py", line 28, in run
yield next(robin)
File "/home/aaron/.conda/envs/fundus/lib/python3.8/site-packages/more_itertools/more.py", line 1111, in <genexpr>
return (x for x in i if x is not _marker)
File "/home/aaron/Code/Python/Fundus/src/scraping/scraper.py", line 22, in scrape
raise err
File "/home/aaron/Code/Python/Fundus/src/scraping/scraper.py", line 18, in scrape
data = self.parser.parse(article_source.html, error_handling)
File "/home/aaron/Code/Python/Fundus/src/parser/html_parser/base_parser.py", line 136, in parse
self._base_setup(html)
File "/home/aaron/Code/Python/Fundus/src/parser/html_parser/base_parser.py", line 132, in _base_setup
self.precomputed = Precomputed(html, doc, get_meta_content(doc), LinkedData(collapsed_lds))
File "/home/aaron/Code/Python/Fundus/src/parser/html_parser/data.py", line 26, in __init__
raise ValueError(f"Found no type for LD")
ValueError: Found no type for LD
Process finished with exit code 1
They are using https://schema.org/ as a baseline, so we should support this eventually. I will stop working on this particular parser until this is resolved.
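For reference, the LD that breaks has no top-level '@type' at all; the typed entities sit inside an '@graph' list. A minimal sketch of how such JSON-LD could be flattened before the type lookup (the function is illustrative, not the current fundus API):

from typing import Any, Dict, List

def flatten_graph(ld: Dict[str, Any]) -> List[Dict[str, Any]]:
    # schema.org pages often ship one JSON-LD object whose top level only
    # carries '@context' and an '@graph' list of typed entities
    if "@graph" in ld:
        return [entity for entity in ld["@graph"] if "@type" in entity]
    # fall back to treating the object itself as a single typed entity
    return [ld] if "@type" in ld else []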
Currently, AutoPipeline is the main access point to our project for users:
from src.library.collection import PublisherCollection
from src.scraping.pipeline import AutoPipeline

pipeline = AutoPipeline(PublisherCollection.de_de)  # e.g. all German-language publishers
for article in pipeline.run(max_articles=5):
    print(article)
As @alanakbik suggested, this naming schema (especially AutoPipeline) and import structure may have a deterrent effect on new users.
This issue's goal is to rethink the naming of:
- the AutoPipeline class
- PublisherCollection
- the former ArticlePipeline
https://requests.readthedocs.io/en/latest/user/quickstart/#binary-response-content suggests that it does, but we are not quite sure.
This may include:
- Links to the different types of contributions
- Best practices
- ...
@register_attribute
def authors(self) -> List[str]:
    raw_str = self.cache['doc'].xpath('normalize-space('
                                      '//ul[@class="smallList"]'
                                      '/li[strong[contains(text(), "Auto")]]'
                                      '/text()[last()]'
                                      ')')
    if raw_str:
        return raw_str.split(', ')
    return []
This is the current code for authors of the dw-parser. But what should happen if there are no authors? This issue appears with the other attributes as well. We should document our decision!
I am in favor of returning empty lists etc. and changing the type hints accordingly.
Currently, the Precomputed class consists of very overloaded names, while the class name itself yields little information about its nature:
class Precomputed:
    html: str
    doc: lxml.html.HtmlElement
    meta: Dict[str, str]
    ld: LinkedData
    cache: Dict[str, Any] = field(default_factory=dict)
The goal of this issue is to rename:
- Precomputed, as well as the precomputed attribute of BaseParser
- ld, doc, meta
The above list also includes renaming all instances of those names, and this issue should be closed with a PR.
We have to formulate some contribution guidelines, especially concerning the library section. E.g.:
- Every parser added to the repo library is only allowed to use attributes (name, semantics, return type) from the attribute guideline.
- How to propose new attributes?
- etc.
The current Focus Parser extracts authors like 'Von FOCUS-online-Redakteur Thomas Sabin'. Should we clean them?
All of these points are suggestions, discussion is encouraged.
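If we decide to clean author strings like the Focus example above, a minimal sketch (the prefix pattern is an assumption derived from that single example):

import re

# strips role prefixes like "Von FOCUS-online-Redakteur " from author strings
_AUTHOR_PREFIX = re.compile(r"^Von\s+(FOCUS-online-\w+\s+)?", flags=re.IGNORECASE)

def clean_author(raw: str) -> str:
    return _AUTHOR_PREFIX.sub("", raw).strip()

assert clean_author("Von FOCUS-online-Redakteur Thomas Sabin") == "Thomas Sabin"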
Article class for direct access
I would expect the source field of Article to contain information on the article source. I.e. if an article was crawled from welt.de, I would expect source to contain the value DieWelt or WELT. Similarly, if an article was crawled from FAZ, I would expect this field to contain the string FAZ.
However, when I run this code:
from src.library.collection import PublisherCollection
from src.scraping.pipeline import AutoPipeline

pipeline = AutoPipeline(PublisherCollection.de_de.FAZ)
for article in pipeline.run(max_articles=5):
    print(article.source)
It just prints:
<src.scraping.crawler.crawler.RSSCrawler object at 0x7f9f2699af10>
<src.scraping.crawler.crawler.RSSCrawler object at 0x7f9f2699af10>
<src.scraping.crawler.crawler.RSSCrawler object at 0x7f9f2699af10>
<src.scraping.crawler.crawler.RSSCrawler object at 0x7f9f2699af10>
<src.scraping.crawler.crawler.RSSCrawler object at 0x7f9f2699af10>
i.e. a reference to the crawler object.
Is this desired behavior? Is there any way for me to get from an Article the information about which source it is from (aside from parsing the url field)?
The current test cases for attribute annotations rely on a hard-coded mapping. This approach has two major issues.
To solve this, the mapping should be parsed from the guidelines.
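A sketch of what parsing that mapping could look like, assuming the guidelines are kept as a markdown table whose first and last columns are attribute name and return type (file path and layout are assumptions):

import re
from pathlib import Path
from typing import Dict

GUIDELINES = Path("docs/attribute_guidelines.md")  # hypothetical location
ROW = re.compile(r"^\|\s*(\w+)\s*\|.*\|\s*([^|]+?)\s*\|$")

def load_attribute_mapping() -> Dict[str, str]:
    # maps attribute name -> return type string, e.g. 'authors' -> 'List[str]'
    # (filtering out the table header row is omitted for brevity)
    mapping: Dict[str, str] = {}
    for line in GUIDELINES.read_text(encoding="utf-8").splitlines():
        match = ROW.match(line.strip())
        if match:
            attribute, return_type = match.groups()
            mapping[attribute] = return_type
    return mapping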
This code currently throws an error:
from src.library.collection import PublisherCollection
from src.scraping.pipeline import AutoPipeline

pipeline = AutoPipeline(PublisherCollection.de_de.Merkur)
for article in pipeline.run(max_articles=5):
    print(article)
    print()
It throws:
Traceback (most recent call last):
File "/home/alan/PycharmProjects/fundus/local_crawl.py", line 6, in <module>
for article in pipeline.run(max_articles=5):
File "/home/alan/PycharmProjects/fundus/src/scraping/pipeline.py", line 54, in run
yield next(robin)
File "/home/alan/.environments/fundus/lib/python3.8/site-packages/more_itertools/more.py", line 1111, in <genexpr>
return (x for x in i if x is not _marker)
File "/home/alan/PycharmProjects/fundus/src/scraping/scraper.py", line 28, in scrape
raise err
File "/home/alan/PycharmProjects/fundus/src/scraping/scraper.py", line 23, in scrape
data = self.parser.parse(article_source.html, error_handling)
File "/home/alan/PycharmProjects/fundus/src/parser/html_parser/base_parser.py", line 164, in parse
raise err
File "/home/alan/PycharmProjects/fundus/src/parser/html_parser/base_parser.py", line 161, in parse
parsed_data[attribute_name] = func()
File "/home/alan/PycharmProjects/fundus/src/parser/html_parser/base_parser.py", line 42, in __call__
return self.__func__(self.__self__, *args, *kwargs)
File "/home/alan/PycharmProjects/fundus/src/library/de_de/merkur_parser.py", line 20, in authors
return generic_author_parsing(self.precomputed.ld.bf_search("author"))
File "/home/alan/PycharmProjects/fundus/src/parser/html_parser/utility.py", line 114, in generic_author_parsing
return [name.strip() for name in authors]
File "/home/alan/PycharmProjects/fundus/src/parser/html_parser/utility.py", line 114, in <listcomp>
return [name.strip() for name in authors]
AttributeError: 'list' object has no attribute 'strip'
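For reference, the traceback shows generic_author_parsing receiving a nested list from bf_search("author"). A sketch of a more defensive normalization (not the actual fundus implementation) that flattens the common JSON-LD author shapes:

from typing import Any, List

def generic_author_parsing(value: Any) -> List[str]:
    if value is None:
        return []
    if isinstance(value, str):
        return [name.strip() for name in value.split(",")]
    if isinstance(value, dict):
        # JSON-LD authors are often objects like {'@type': 'Person', 'name': ...}
        name = value.get("name")
        return [name.strip()] if isinstance(name, str) else []
    if isinstance(value, list):
        # flatten nested lists, the case that triggered the AttributeError above
        names: List[str] = []
        for item in value:
            names.extend(generic_author_parsing(item))
        return names
    return []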
One way to test Fundus would be to execute a few example projects ourselves and see if Fundus is up to the task :)
Here is an idea for an example project: Make a corpus of funny German text
Steps:
Since there is currently some progress on attribute guidelines specifying attribute names, their semantic meaning, and Python return types, the question came up once again whether we should validate attribute return types at runtime. For example, if we have a specified attribute body with some description and ArticleBody as the specified return type, should we throw a warning/error if a particular parser implementation specifies the wrong return type, or none at all, as shown in this example:
class ExampleParser(BaseParser):
    @register_attribute
    def body(self) -> str:
        return 'body'
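A sketch of what such a runtime check could look like, comparing the registered attribute's return annotation against the guideline type (function name and hook point are assumptions, and register_attribute is assumed to leave the annotations accessible):

from typing import get_type_hints

def validate_attribute_annotation(parser_cls: type, attribute_name: str, expected: type) -> None:
    func = getattr(parser_cls, attribute_name)
    annotated = get_type_hints(func).get("return")
    if annotated is None:
        raise TypeError(f"{attribute_name!r} specifies no return type at all")
    if annotated != expected:
        raise TypeError(f"{attribute_name!r}: guideline expects {expected}, parser returns {annotated}")

With a guideline entry mapping body to ArticleBody, validate_attribute_annotation(ExampleParser, "body", ArticleBody) would raise for the str annotation above.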
Calling any code currently throws an error, since an import is incorrect.
For instance, calling
from src.library.collection import PublisherCollection
from src.scraping.pipeline import AutoPipeline

pipeline = AutoPipeline(PublisherCollection.de_de.BerlinerZeitung)
for article in pipeline.run(max_articles=5):
    print(article)
    print()
will throw:
ImportError: cannot import name 'extract_article_body_with_css' from 'src.parser.html_parser.utility' (/home/alan/PycharmProjects/fundus/src/parser/html_parser/utility.py)
Consisting of:
As #12 stated, it appears that the parser sometimes returns None for plaintext. I guess that in this case the plaintext actually is None, mostly due to the processed article type (video etc.), but this should be investigated and properly communicated.
I want to analyze the language used when discussing soccer.
Task: Create a corpus of soccer-related language using Fundus. At least 500 articles.
@register_attribute
def topics(self) -> List[str]:
    if keyword_str := self.meta().get('keywords'):
        return keyword_str.split(', ')
Quite a few pieces of code from the parsers work like this example. I was wondering why meta needs parentheses. The reason is that meta is exposed to the public as well:
@register_attribute
def meta(self) -> dict[str, Any]:
    return self.cache.get('meta')
This is a bad dependency: you may alter the return value of meta without considering its uses downstream.
Unfortunately, I have no good solution for this at the moment.
First of all, we have to agree on a place where to put them:
Then we have to formulate specific guidelines for every parser attribute added in the base version.
These guidelines should especially restrict names, as well as define the semantic meaning and the return type of an attribute. E.g. this project understands plaintext as the ordered accumulation of all paragraphs included in the actual article body. This excludes everything else (headline, section headlines, etc.).
There are currently test cases like this which include empty values for attributes. I would consider those bad test cases, since they cover less than they could. To avoid example HTML like this in the future, I would suggest adding an assertion to test_parsing() asserting bool(attr) if attr is not of type bool.
Update:
Points to a valid example JSON again. Thanks to @dobbersc for pointing this out.
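A sketch of the suggested assertion, assuming test_parsing() holds the parsed attributes in a dict (the variable name extraction is hypothetical):

for attribute_name, value in extraction.items():
    # booleans are exempt, since False is a legitimate parsed value
    if not isinstance(value, bool):
        assert bool(value), f"empty example value for attribute {attribute_name!r}"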
Besides a sitemap crawler, #58 also introduced automatic sitemap detection. While this seemed to be an enhancement at first, loading the sitemaps takes between 1 and 5 seconds, which really impacts the user experience in a bad way.
I think this feature should be exchanged for the former static one.
This is important, since most of the work in adding a parser consists of writing XPath/CSS selectors. We should ease this part of the contribution.
It would be very nice to include a short printout of the additional available attributes (supported and unsupported). I'd prefer this in a separate PR but wanted to mention it to not get forgotten.
Originally posted by @dobbersc in #155 (comment)
As @dobbersc pointed out in #148, the current resolution comes with some problems. To keep an implementation using absolute paths rather than relative ones, I would rework the current one as follows: introduce root, referencing the absolute path to the project root directory.
Update: @dobbersc pointed out in #148 that there wouldn't be any difference in using pathlib over os besides the advantages of the former, and thus proposed to use pathlib instead of os.
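A sketch of the proposed constant with pathlib, assuming it is defined in a module directly below the project root (e.g. a hypothetical src/resolve.py):

from pathlib import Path

# parents[0] is src/, parents[1] is the project root
root = Path(__file__).resolve().parents[1]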
Since we are using mypy 1.0, the Self type hint is supported and can be used in our repository. For our supported Python versions we can import it from typing_extensions with from typing_extensions import Self. For more information, see the corresponding PEP (PEP 673).
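For illustration, a generic example (not from the fundus code base) of where Self pays off:

from typing_extensions import Self

class BaseBuilder:
    def with_option(self, key: str, value: str) -> Self:
        # annotating Self instead of 'BaseBuilder' lets subclasses chain
        # calls without mypy downcasting the result to the base class
        return self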
This will help users with choosing the best crawlers for their purpose.
Should be tests/resources
instead of tests/ressources
;)
see https://github.com/flairNLP/fundus/tree/master/tests/ressources
The current documentation and guidelines lack crucial information and should be expanded with the following:
- BaseParser.parse() and its correlation with Precomputed (should be done after #94)
- Remove @function from the documentation, since it isn't in use yet.
- self.precomputed.ld.bf_search("datePublished")
- src/parser/html_parser/utils.py
https://choosealicense.com/ has a solid overview of the options.
It would be helpful if printing an Article object resulted in a nicely readable string representation of the article, without full information on all attributes.
I.e.:
crawler = Crawler(PublisherCollection.de_de)
for article in crawler.crawl(max_articles=5):
    print(article)
would print something like:
--------------------
"IT-Schule von VW und SAP: Der Staat scheitert am Fachkräftemangel"
- by Daniel Zwick (WELT.DE)
"Während die Bundesregierung ausländischen
Fachkräften den Zuzug erleichtern will, helfen
sich Unternehmen wie SAP und Volkswagen selbst.
Im WELT-Interview erklären die Personalchefs der
beiden Dax-Konzerne, warum sie ihren IT- [...]"
from: https://www.welt.de/wirtschaft/article242481623/IT-Schule-von-VW-und-SAP-Der-Staat-scheitert-am-Fachkraeftemangel.html (crawled 2022-12-06 08:09:44.059753)
--------------------
"Zweiter Weltkrieg: Polen will im Streit um Reparationen weiter eskalieren"
- by WELT (WELT.DE)
"Zu seinem Antrittsbesuch in Berlin reist Polens
neuer Vize-Außenminister Arkadiusz Mularczyk mit
einem Bericht im Gepäck. Darin fordert Warschau
von Deutschland Reparationen in Höhe von 1,3
Billionen Euro – und droht mit weiterer [...]"
from: https://www.welt.de/politik/ausland/article242511375/Zweiter-Weltkrieg-Polen-will-im-Streit-um-Reparationen-weiter-eskalieren.html (crawled 2022-12-06 08:09:43.338912)
--------------------
Here is my attempt at a script for this:
import textwrap

from src.library.collection import PublisherCollection
from src.scraping.crawler.crawler import Crawler  # assumed import path for Crawler

de_de = PublisherCollection.de_de
crawler = Crawler(de_de)
for article in crawler.crawl(max_articles=2, error_handling='raise'):
    title = article.extracted['title']
    author = article.extracted['authors']

    # Wrap this text.
    wrapper = textwrap.TextWrapper(width=50, max_lines=5, initial_indent='"', subsequent_indent=' ')
    word_list = wrapper.wrap(text=article.extracted['plaintext'])
    text_sample = '\n'.join(word_list) + '"'

    print('-' * 20)
    print(f'"{title}"\n - by {author[0]} (WELT.DE)')
    print("\n" + text_sample + "\n")
    print(f'from: {article.url} (crawled {article.crawl_date})')
    print('-' * 20)
However, building this script I noticed some small issues:
- article.title instead of article.extracted['title'] would be preferable.
- article.extracted['plaintext'] is sometimes None.
- Article should know the proper name of its "newspaper". I am thinking of a field like article.source that just prints "Die WELT" if crawled from there.
- article.extracted['authors'] is sometimes the newspaper name.
This issue is about tracking progress with porting the crawlers from qse. A crawler is considered done after a PR that introduces it has been merged to main.
I just remembered that there are some more:
Some that I worked on locally:
@dobbersc pointed out that XPath and CSS expressions can be pre-compiled in lxml via the XPath and CSSSelector classes, respectively.
Imo, pre-compiling the selectors could have two major benefits:
- Functions like extract_article_body_with_selector() currently depend on a rather ugly mode parameter. This makes mixing XPath and CSS selectors impossible in the first place and ugly to use in the second. With pre-compiled selectors, functions like the one mentioned above no longer depend on an ambiguous string representation of the selector but on an actual object, allowing the function to differentiate automatically (see the sketch below).
The goal of this issue is to enforce a PR that implements the following:
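A minimal sketch of that differentiation, with illustrative selectors (not taken from the fundus code base):

from lxml.cssselect import CSSSelector
from lxml.etree import XPath
import lxml.html

# compiled once, reusable across documents
paragraph_selector = XPath("//div[@class='article-body']//p")
summary_selector = CSSSelector("div.article-summary > p")

def extract_text(doc: lxml.html.HtmlElement, selector) -> list:
    # XPath and CSSSelector instances are both callable on a document,
    # so no string-based 'mode' parameter is needed to tell them apart
    return [node.text_content().strip() for node in selector(doc)]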
It would be great to add some crawlers for English-language news to the library (once the guidelines are finalized).
For another research project on detecting political bias and reliability, we require crawlers for the following 12 sources:
The table on the front page, https://github.com/flairNLP/fundus#currently-supported-news-sources, is helpful but incomplete, since not all currently supported news sources are listed.
Task: Complete this table.
Do we want to set up flake8, mypy, and tox right from the start? Same question for ensuring that the package is pip-installable (for now only from GitHub).
If desired, I could add the same setup as in the quotemine repository.