
fundus's People

Contributors

addie9800, alanakbik, annikathiele, boriskalika, brandjakhu, dkm1006, dobbersc, fabianhenning, jabbawukis, jannispoltier, lethalsnake1337, lsch0lz, lukasgarbas, martinknz, maxdall, mk2112, myoncee, screw-44, susannaruecker, weyaaron


fundus's Issues

How do we catch publisher changes that break extraction?

I did some work on qse when I noticed that some parsers did not work. The reason is that the publishers updated the layout of their websites. Some of these changes affected fundus (#162). These incidents raise the question of how such changes will be detected in the future. An automated approach would be best, but I have no solution at the moment.
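
One possible direction, sketched with the access points shown elsewhere in these issues (Crawler, PublisherCollection, article.extracted): crawl a few articles per publisher on a schedule and report attributes that come back empty. The Crawler import path, the attribute list, and the scheduling are assumptions, not an existing fundus feature.

from typing import Dict

from src.library.collection import PublisherCollection
from src.scraping.crawler.crawler import Crawler  # import path is an assumption

EXPECTED_ATTRIBUTES = ("title", "plaintext", "authors", "publishing_time")


def smoke_test(publisher, max_articles: int = 3) -> Dict[str, int]:
    # Crawl a handful of articles and count how often an expected attribute comes back empty,
    # assuming article.extracted behaves like a dict (as in the printout example further below).
    crawler = Crawler(publisher)
    missing = {attribute: 0 for attribute in EXPECTED_ATTRIBUTES}
    for article in crawler.crawl(max_articles=max_articles):
        for attribute in EXPECTED_ATTRIBUTES:
            if not article.extracted.get(attribute):
                missing[attribute] += 1
    return missing


# Run periodically (e.g. nightly, per publisher) and alert when values suddenly go missing.
print(smoke_test(PublisherCollection.de_de.FAZ))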

Refactor code for Python 3.7 or 3.8

The current code is written for Python 3.10, but that would make the library unattractive for projects that use older Python versions. The oldest currently supported Python version is 3.7, so it would make sense to refactor fundus for 3.7 (or 3.8 if we think 3.7 might be retired somewhat soon).
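
For illustration, a hedged sketch of the kind of change such a backport involves: built-in generics like the dict[str, Any] annotation quoted in a later issue require Python 3.9+, so they would need typing aliases. How widespread such constructs are in fundus is an assumption.

from typing import Any, Dict, Optional


class ExampleParser:
    # 3.10-style annotation (built-in generics):  def meta(self) -> dict[str, Any]: ...
    # 3.7/3.8-compatible equivalent using typing aliases:
    def meta(self) -> Optional[Dict[str, Any]]:
        ...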

Freebeacon uses another ld scheme which we cannot parse; how do we proceed?

I tried writing a parser for https://freebeacon.com/, but our current code base fails to parse their ld:
(The block directly below is the ld that breaks; the traceback follows.)

{'@context': 'https://schema.org', '@graph': [{'@type': 'Article', '@id': 'https://freebeacon.com/biden-administration/senators-accuse-biden-admin-of-sneaking-left-wing-policies-into-semiconductor-bill/#article', 'isPartOf': {'@id': 'https://freebeacon.com/biden-administration/senators-accuse-biden-admin-of-sneaking-left-wing-policies-into-semiconductor-bill/'}, 'author': [{'@id': 'https://freebeacon.com/#/schema/person/d77e0840977af7e659efd282d1ef5122'}], 'headline': 'Senators Accuse Biden Admin of Sneaking Left-Wing Policies Into Semiconductor Bill', 'datePublished': '2023-03-29T21:50:55+00:00', 'dateModified': '2023-03-29T20:48:27+00:00', 'mainEntityOfPage': {'@id': 'https://freebeacon.com/biden-administration/senators-accuse-biden-admin-of-sneaking-left-wing-policies-into-semiconductor-bill/'}, 'wordCount': 342, 'publisher': {'@id': 'https://freebeacon.com/#organization'}, 'image': {'@id': 'https://freebeacon.com/biden-administration/senators-accuse-biden-admin-of-sneaking-left-wing-policies-into-semiconductor-bill/#primaryimage'}, 'thumbnailUrl': 'https://freebeacon.com/wp-content/uploads/2021/01/rsz_gettyimages-492468882.jpg', 'keywords': ['Biden Administration', 'Department of Commerce', 'Environmentalism', 'Gina Raimondo', 'Technology', 'Ted Cruz', 'woke'], 'articleSection': ['Biden Administration'], 'inLanguage': 'en-US'}, {'@type': 'WebPage', '@id': 'https://freebeacon.com/biden-administration/senators-accuse-biden-admin-of-sneaking-left-wing-policies-into-semiconductor-bill/', 'url': 'https://freebeacon.com/biden-administration/senators-accuse-biden-admin-of-sneaking-left-wing-policies-into-semiconductor-bill/', 'name': 'Senators Accuse Biden Admin of Sneaking Left-Wing Policies Into Semiconductor Bill', 'isPartOf': {'@id': 'https://freebeacon.com/#website'}, 'primaryImageOfPage': {'@id': 'https://freebeacon.com/biden-administration/senators-accuse-biden-admin-of-sneaking-left-wing-policies-into-semiconductor-bill/#primaryimage'}, 'image': {'@id': 'https://freebeacon.com/biden-administration/senators-accuse-biden-admin-of-sneaking-left-wing-policies-into-semiconductor-bill/#primaryimage'}, 'thumbnailUrl': 'https://freebeacon.com/wp-content/uploads/2021/01/rsz_gettyimages-492468882.jpg', 'datePublished': '2023-03-29T21:50:55+00:00', 'dateModified': '2023-03-29T20:48:27+00:00', 'description': 'Republican senators are accusing the Biden administration of pursuing liberal social policies through the implementation of the CHIPS Act, the bipartisan bill enacted last year to boost domestic semiconductor production.', 'breadcrumb': {'@id': 'https://freebeacon.com/biden-administration/senators-accuse-biden-admin-of-sneaking-left-wing-policies-into-semiconductor-bill/#breadcrumb'}, 'inLanguage': 'en-US', 'potentialAction': [{'@type': 'ReadAction', 'target': ['https://freebeacon.com/biden-administration/senators-accuse-biden-admin-of-sneaking-left-wing-policies-into-semiconductor-bill/']}]}, {'@type': 'ImageObject', 'inLanguage': 'en-US', '@id': 'https://freebeacon.com/biden-administration/senators-accuse-biden-admin-of-sneaking-left-wing-policies-into-semiconductor-bill/#primaryimage', 'url': 'https://freebeacon.com/wp-content/uploads/2021/01/rsz_gettyimages-492468882.jpg', 'contentUrl': 'https://freebeacon.com/wp-content/uploads/2021/01/rsz_gettyimages-492468882.jpg', 'width': 736, 'height': 514, 'caption': 'Commerce Secretary Gina Raimondo / Getty Images'}, {'@type': 'BreadcrumbList', '@id': 
'https://freebeacon.com/biden-administration/senators-accuse-biden-admin-of-sneaking-left-wing-policies-into-semiconductor-bill/#breadcrumb', 'itemListElement': [{'@type': 'ListItem', 'position': 1, 'name': 'Home', 'item': 'https://freebeacon.com/'}, {'@type': 'ListItem', 'position': 2, 'name': 'Senators Accuse Biden Admin of Sneaking Left-Wing Policies Into Semiconductor Bill'}]}, {'@type': 'WebSite', '@id': 'https://freebeacon.com/#website', 'url': 'https://freebeacon.com/', 'name': 'Washington Free Beacon', 'description': '', 'publisher': {'@id': 'https://freebeacon.com/#organization'}, 'potentialAction': [{'@type': 'SearchAction', 'target': {'@type': 'EntryPoint', 'urlTemplate': 'https://freebeacon.com/?s={search_term_string}'}, 'query-input': 'required name=search_term_string'}], 'inLanguage': 'en-US'}, {'@type': 'Organization', '@id': 'https://freebeacon.com/#organization', 'name': 'Washington Free Beacon', 'url': 'https://freebeacon.com/', 'logo': {'@type': 'ImageObject', 'inLanguage': 'en-US', '@id': 'https://freebeacon.com/#/schema/logo/image/', 'url': 'https://freebeacon.com/wp-content/uploads/2021/09/open-graph-fallback.png', 'contentUrl': 'https://freebeacon.com/wp-content/uploads/2021/09/open-graph-fallback.png', 'width': 323, 'height': 198, 'caption': 'Washington Free Beacon'}, 'image': {'@id': 'https://freebeacon.com/#/schema/logo/image/'}, 'sameAs': ['https://www.instagram.com/washingtonfreebeacon/', 'https://www.linkedin.com/company-beta/6343616/', 'https://www.youtube.com/user/WashingtonFreeBeacon', 'https://www.facebook.com/FreeBeacon/', 'https://twitter.com/FreeBeacon']}, {'@type': 'Person', '@id': 'https://freebeacon.com/#/schema/person/d77e0840977af7e659efd282d1ef5122', 'name': 'Claire Sprang', 'image': {'@type': 'ImageObject', 'inLanguage': 'en-US', '@id': 'https://freebeacon.com/#/schema/person/image/8786285c5f340898704e49c40c3b5f51', 'url': 'https://secure.gravatar.com/avatar/89b9be77282462fe9431c3568792f1bb?s=96&d=https%3A%2F%2Ffreebeacon.com%2Fwp-content%2Fthemes%2Ffreebeacon%2Fimages%2Fsmoking-man-100.jpg&r=g', 'contentUrl': 'https://secure.gravatar.com/avatar/89b9be77282462fe9431c3568792f1bb?s=96&d=https%3A%2F%2Ffreebeacon.com%2Fwp-content%2Fthemes%2Ffreebeacon%2Fimages%2Fsmoking-man-100.jpg&r=g', 'caption': 'Claire Sprang'}, 'url': 'https://freebeacon.com/author/claire-sprang/'}]}
Traceback (most recent call last):
  File "/home/aaron/Code/Python/Fundus/examples/example_crawler.py", line 30, in <module>
    for article in crawler.crawl(max_articles=5, error_handling="raise"):
  File "/home/aaron/Code/Python/Fundus/src/scraping/pipeline.py", line 28, in run
    yield next(robin)
  File "/home/aaron/.conda/envs/fundus/lib/python3.8/site-packages/more_itertools/more.py", line 1111, in <genexpr>
    return (x for x in i if x is not _marker)
  File "/home/aaron/Code/Python/Fundus/src/scraping/scraper.py", line 22, in scrape
    raise err
  File "/home/aaron/Code/Python/Fundus/src/scraping/scraper.py", line 18, in scrape
    data = self.parser.parse(article_source.html, error_handling)
  File "/home/aaron/Code/Python/Fundus/src/parser/html_parser/base_parser.py", line 136, in parse
    self._base_setup(html)
  File "/home/aaron/Code/Python/Fundus/src/parser/html_parser/base_parser.py", line 132, in _base_setup
    self.precomputed = Precomputed(html, doc, get_meta_content(doc), LinkedData(collapsed_lds))
  File "/home/aaron/Code/Python/Fundus/src/parser/html_parser/data.py", line 26, in __init__
    raise ValueError(f"Found no type for LD")
ValueError: Found no type for LD

Process finished with exit code 1

They are using https://schema.org/ as a baseline, so we should support this eventually. I will stop working on this particular parser until this is resolved.
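
For reference, a minimal sketch of flattening such an '@graph' ld block into its typed nodes; how fundus' LinkedData class would actually consume this is an assumption.

from typing import Any, Dict, List


def collapse_graph(ld: Dict[str, Any]) -> List[Dict[str, Any]]:
    # schema.org allows a top-level "@graph" list of typed nodes instead of a single typed object.
    if isinstance(ld.get("@graph"), list):
        return [node for node in ld["@graph"] if isinstance(node, dict) and "@type" in node]
    return [ld]


# For the freebeacon ld above this yields the Article, WebPage, ImageObject, BreadcrumbList,
# WebSite, Organization, and Person nodes, each carrying its own "@type".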

[Discussion]: User friendly access points to fundus

Currently, AutoPipeline is the main access point to our project for users.

from src.library.collection import PublisherCollection
from src.scraping.pipeline import AutoPipeline

pipeline = AutoPipeline(PublisherCollection.de_de.FAZ)  # e.g. a single publisher from the collection

for article in pipeline.run(max_articles=5):
    print(article)

As @alanakbik suggested, this naming scheme (especially AutoPipeline) and import structure may have a deterrent effect on new users.

The goals of this issue are to:

  • Find a more inviting naming scheme for the class currently called AutoPipeline
  • Encourage PRs reducing the import length of PublisherCollection and the former ArticlePipeline

Ensure consistency between the type hints and return types of the parser attributes

@register_attribute
def authors(self) -> List[str]:
    raw_str = self.cache['doc'].xpath(
        'normalize-space('
        '//ul[@class="smallList"]'
        '/li[strong[contains(text(), "Auto")]]'
        '/text()[last()]'
        ')'
    )
    if raw_str:
        return raw_str.split(', ')
    return []

This is the current code for authors in the DW parser. But what should happen if there are no authors? This issue appears in the other attributes as well. We should document our decision!

I am in favor of returning empty lists etc. and changing the type hints accordingly.

More descriptive naming scheme for Precomputed

Currently, the Precomputed class consists of very overloaded names, while the class name itself reveals little about its nature:

@dataclass
class Precomputed:
    html: str
    doc: lxml.html.HtmlElement
    meta: Dict[str, str]
    ld: LinkedData
    cache: Dict[str, Any] = field(default_factory=dict)

The goal of this issue is:

  • Find a new name for class Precomputed as well as the precomputed attribute of BaseParser
  • Find more precise names for ld, doc, meta

The above list also includes renaming those instances; this issue should be closed with a PR.

Contribution guideline

We have to formulate some contribution guidelines, especially concerning the library section, e.g.:

  • Every parser added to the repo library is only allowed to use attributes (name, semantics, return type) from the attribute guideline.
  • How do we propose new attributes?
  • etc.

Clean Authors?

The current Focus Parser extracts authors like 'Von FOCUS-online-Redakteur Thomas Sabin'. Should we clean them?
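
A hedged sketch of one way such cleaning could look; the prefix pattern below is derived only from the single example above and is an assumption, not an exhaustive rule.

import re

# Strips role prefixes such as "Von FOCUS-online-Redakteur " (assumed pattern).
_AUTHOR_PREFIX = re.compile(r"^Von\s+(?:FOCUS-online-\S+\s+)?", re.IGNORECASE)


def clean_author(raw: str) -> str:
    # "Von FOCUS-online-Redakteur Thomas Sabin" -> "Thomas Sabin"
    return _AUTHOR_PREFIX.sub("", raw).strip()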

[Refactor] Improve file structure

  • Move the parser out of html_parser
  • Restructure the example crawler to remove the split of the example across the comment
  • Enforce a file name convention for the parser files
  • Choose a better name for the folder 'library' and move the files accordingly
  • Move the function 'listify' and remove utils
  • Fix inconsistency in the naming of the folders 'scraping' vs 'parser'

All of these points are suggestions, discussion is encouraged.

No meaningful value in Article source field

I would expect the source field of Article to contain information on the article source. E.g., if an article was crawled from welt.de, I would expect source to contain the value DieWelt or WELT. Similarly, if an article was crawled from FAZ, I would expect this field to contain the string FAZ.

However, when I run this code:

from src.library.collection import PublisherCollection
from src.scraping.pipeline import AutoPipeline

pipeline = AutoPipeline(PublisherCollection.de_de.FAZ)

for article in pipeline.run(max_articles=5):
    print(article.source)

It just prints:

<src.scraping.crawler.crawler.RSSCrawler object at 0x7f9f2699af10>
<src.scraping.crawler.crawler.RSSCrawler object at 0x7f9f2699af10>
<src.scraping.crawler.crawler.RSSCrawler object at 0x7f9f2699af10>
<src.scraping.crawler.crawler.RSSCrawler object at 0x7f9f2699af10>
<src.scraping.crawler.crawler.RSSCrawler object at 0x7f9f2699af10>

i.e. a reference to the crawler object.

Is this desired behavior? Is there any way for me to get from an Article the information which source it is from (aside from parsing the url field)?
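
Until such a field exists, a hedged workaround is to derive something source-like from the URL; article.url is taken from the printout example further below, everything else here is an assumption.

from urllib.parse import urlparse


def source_domain(url: str) -> str:
    # "https://www.faz.net/aktuell/..." -> "faz.net"
    netloc = urlparse(url).netloc
    return netloc[4:] if netloc.startswith("www.") else netloc


# e.g. print(source_domain(article.url)) inside the crawl loop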

[Feature Request]: Parse annotation mapping from the guidelines so it's read-only.

The current test cases for attribute annotations rely on a hard-coded mapping. This approach has two major issues.

  • Code duplication: The annotations are defined both in the guidelines and in the mapping. This is very error-prone.
  • Accessibility: This setup makes it possible to bypass the attribute annotation tests by adding an annotation manually to the mapping without defining it in the guidelines. This undermines the purpose of the annotation tests.

To solve this, the mapping should be parsed from the guidelines.
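
A hedged sketch of what such parsing could look like, assuming the guidelines list each attribute as a markdown table row of the form "| name | return type | description |"; the actual layout of the guidelines file is an assumption.

import re
from pathlib import Path
from typing import Dict


def load_attribute_annotations(guideline_file: Path) -> Dict[str, str]:
    # Build the name -> return-type mapping from the guideline table instead of hard-coding it.
    # Header and separator rows would need to be skipped in practice.
    annotations: Dict[str, str] = {}
    for line in guideline_file.read_text(encoding="utf-8").splitlines():
        match = re.match(r"\|\s*`?(\w+)`?\s*\|\s*`?([^|`]+?)`?\s*\|", line)
        if match:
            annotations[match.group(1)] = match.group(2).strip()
    return annotations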

Merkur parser not working

This code currently throws an error:

from src.library.collection import PublisherCollection
from src.scraping.pipeline import AutoPipeline

pipeline = AutoPipeline(PublisherCollection.de_de.Merkur)

for article in pipeline.run(max_articles=5):
    print(article)
    print()

It throws:

Traceback (most recent call last):
  File "/home/alan/PycharmProjects/fundus/local_crawl.py", line 6, in <module>
    for article in pipeline.run(max_articles=5):
  File "/home/alan/PycharmProjects/fundus/src/scraping/pipeline.py", line 54, in run
    yield next(robin)
  File "/home/alan/.environments/fundus/lib/python3.8/site-packages/more_itertools/more.py", line 1111, in <genexpr>
    return (x for x in i if x is not _marker)
  File "/home/alan/PycharmProjects/fundus/src/scraping/scraper.py", line 28, in scrape
    raise err
  File "/home/alan/PycharmProjects/fundus/src/scraping/scraper.py", line 23, in scrape
    data = self.parser.parse(article_source.html, error_handling)
  File "/home/alan/PycharmProjects/fundus/src/parser/html_parser/base_parser.py", line 164, in parse
    raise err
  File "/home/alan/PycharmProjects/fundus/src/parser/html_parser/base_parser.py", line 161, in parse
    parsed_data[attribute_name] = func()
  File "/home/alan/PycharmProjects/fundus/src/parser/html_parser/base_parser.py", line 42, in __call__
    return self.__func__(self.__self__, *args, *kwargs)
  File "/home/alan/PycharmProjects/fundus/src/library/de_de/merkur_parser.py", line 20, in authors
    return generic_author_parsing(self.precomputed.ld.bf_search("author"))
  File "/home/alan/PycharmProjects/fundus/src/parser/html_parser/utility.py", line 114, in generic_author_parsing
    return [name.strip() for name in authors]
  File "/home/alan/PycharmProjects/fundus/src/parser/html_parser/utility.py", line 114, in <listcomp>
    return [name.strip() for name in authors]
AttributeError: 'list' object has no attribute 'strip'
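
For context, a hedged sketch of a more defensive generic_author_parsing that also handles nested lists and dict entries as they appear in some ld+json payloads; this is not the current fundus implementation, just an illustration of the fix.

from typing import Any, List


def generic_author_parsing(value: Any) -> List[str]:
    # Accept the shapes commonly found under an ld "author" key: str, dict, list, or nested lists.
    if value is None:
        return []
    if isinstance(value, str):
        return [value.strip()]
    if isinstance(value, dict):
        name = value.get("name")
        return [name.strip()] if isinstance(name, str) else []
    if isinstance(value, list):
        authors: List[str] = []
        for item in value:
            authors.extend(generic_author_parsing(item))
        return authors
    return []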

Example project: A corpus of funny German

One way to test Fundus would be to execute a few example projects ourselves and see if Fundus is up to the task :)

Here is an idea for an example project: Make a corpus of funny German text

Steps:

  • select at least two sources (Titanic, Eulenspiegel etc.)
  • create parsers for these sources
  • create a corpus of at least 200 articles
  • save in json format for easy distribution

[Discussion] Should we validate parser attribute return types at run time?

Since there is currently some progress on attribute guidelines specifying attribute names, their semantic meaning, and their Python return types, the question once again came up whether we should validate attribute return types at runtime. For example, if we have a specified attribute body with some description and ArticleBody as the specified return type, should we throw a warning/error if a particular parser implementation specifies the wrong return type, or none at all, as shown in this example:

class ExampleParser(BaseParser):

    @register_attribute
    def body(self) -> str:
        return 'body'
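
A hedged sketch of such a runtime check; ArticleBody is only stubbed here, and both the GUIDELINE_TYPES mapping and the place where the check would be hooked into BaseParser.parse() are assumptions.

import warnings
from typing import Any, Dict, Type


class ArticleBody:  # stand-in for the real fundus type
    ...


GUIDELINE_TYPES: Dict[str, Type[Any]] = {"body": ArticleBody, "title": str}  # assumed mapping


def validate_attribute(parser_name: str, attribute_name: str, value: Any) -> None:
    expected = GUIDELINE_TYPES.get(attribute_name)
    if expected is not None and value is not None and not isinstance(value, expected):
        # For the ExampleParser above this would warn, since body returns a plain str.
        warnings.warn(
            f"{parser_name}.{attribute_name} returned {type(value).__name__}, "
            f"but the guidelines specify {expected.__name__}"
        )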

SPON parser incorrect import

Currently, calling any code throws an error, since an import is incorrect.

For instance, calling

from src.library.collection import PublisherCollection
from src.scraping.pipeline import AutoPipeline

pipeline = AutoPipeline(PublisherCollection.de_de.BerlinerZeitung)

for article in pipeline.run(max_articles=5):
    print(article)
    print()

will throw:

ImportError: cannot import name 'extract_article_body_with_css' from 'src.parser.html_parser.utility' (/home/alan/PycharmProjects/fundus/src/parser/html_parser/utility.py)

Setup CI

Consisting of:

  • pytest
  • mypy
  • black
  • flake8

Investigate: `plaintext` is sometimes None

As #12 stated, it appears that the parser sometimes returns None for plaintext. I suspect that in these cases the plaintext actually is None, mostly due to the processed article type (video etc.), but this should be investigated and properly communicated.

Example project: Soccer news

I want to analyze the language used when discussing soccer.

Task: Create a corpus of soccer-related language using Fundus. At least 500 articles.

Discourage the 'Schachtelung' (nesting) of attributes in the parser class


    @register_attribute
    def topics(self) -> List[str]:
        if keyword_str := self.meta().get('keywords'):
            return keyword_str.split(', ')

Quite a few pieces of code in the parsers work like this example.
I was wondering why meta needs parentheses. The reason is that meta is exposed publicly as well:

    @register_attribute
    def meta(self) -> dict[str, Any]:
        return self.cache.get('meta')

This is a bad dependency. You may alter the return value of meta without considering its uses downstream.

Unfortunately, I have no good solution for this at the moment.

Attribute guideline

First of all, we have to agree on a place to put them:

  • wiki
  • repo

Then we have to formulate specific guidelines for every parser attribute included in the base version:

  • title
  • plaintext
  • author (also see #38 )
  • publishing_time
  • topics

These guidelines should especially restrict names as well as define the semantic meaning and return type of each attribute. E.g., this project understands plaintext as the ordered accumulation of all paragraphs included in the actual article body. This excludes everything else (headline, section headlines, etc.).

[Discussion]: Enforce values in parser test cases.

There are currently test cases like this which include empty values for attributes. I would consider those bad test cases since they cover less than they could. To avoid example HTML like this in the future, I would suggest adding an assertion to test_parsing() asserting bool(attr) whenever attr is not of type bool (see the sketch below).

Update:
The link points to a valid example JSON again. Thanks to @dobbersc for pointing this out.
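
A minimal sketch of the proposed assertion; the variable names and how test_parsing() iterates over the extracted attributes are assumptions.

def assert_non_empty(extracted: dict) -> None:
    # Reject test cases that contain empty attribute values (booleans are exempt).
    for attribute_name, value in extracted.items():
        if not isinstance(value, bool):
            assert bool(value), f"Test case contains an empty value for attribute {attribute_name!r}"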

Remove automatic sitemap detection and go back to static values

Besides a sitemap crawler, #58 also introduced automatic sitemap detection. While this seemed to be an enhancement at first, loading the sitemaps takes between 1 and 5 seconds, which noticeably degrades the user experience.

I think this feature should be replaced with the former static one.

Add mypy

  • Configure and set up mypy for the project
  • Run mypy without errors

[ToDo] Rework path resolution

As @dobbersc pointed out in #148, the current resolution comes with some problems. To keep an implementation that uses absolute paths rather than relative ones, I would rework the current one as follows.

  • Create a fixed root anchor as a variable called root referencing the absolute path to the project root directory.
  • Change all existing paths to be relative to this anchor.

Update:
@dobbersc pointed out in #148 that there wouldn't be any difference between using pathlib and os besides the advantages of the former, and thus proposed to use pathlib instead of os (a rough sketch follows below).

  • Use pathlib instead of os.
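
A minimal sketch of the proposed anchor using pathlib; where exactly this module lives relative to the project root is an assumption.

from pathlib import Path

# Absolute path to the project root, assuming this file sits directly inside it.
root: Path = Path(__file__).parent.resolve()

# All other paths are then defined relative to this anchor, e.g.:
# test_resources = root / "tests" / "resources"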

Further enhance documentation.

The current documentation and guidelines lack crucial information and should be expanded with the following:

how_to_contribute.md

  • A best practice for contributing a parser (first add it to the publisher collection -> validation -> ...)
  • A simple rule on how to choose where to take the information from (meta vs. ld vs. layout)
  • An explanation of the functionality of BaseParser.parse() and its relation to Precomputed (should be done after #94)
  • Remove @function from the documentation since it isn't in use yet.

attribute_guidelines.md

  • Add commonly used extraction patterns to the attributes, e.g.:
self.precomputed.ld.bf_search("datePublished")

code documentation

  • Add precise documentation to src/parser/html_parser/utils.py
  • Change the parameter naming scheme

Discussion: Convenience attribute access and human-readable Article printouts

It would be helpful if printing an Article object resulted in a nicely readable string representation of the article, without full information on all attributes.

I.e.:

crawler = Crawler(PublisherCollection.de_de)

for article in crawler.crawl(max_articles=5):
    print(article)

would print something like:

--------------------
"IT-Schule von VW und SAP: Der Staat scheitert am Fachkräftemangel"
 - by Daniel Zwick (WELT.DE)

"Während die Bundesregierung ausländischen
 Fachkräften den Zuzug erleichtern will, helfen
 sich Unternehmen wie SAP und Volkswagen selbst.
 Im WELT-Interview erklären die Personalchefs der
 beiden Dax-Konzerne, warum sie ihren IT- [...]"

from: https://www.welt.de/wirtschaft/article242481623/IT-Schule-von-VW-und-SAP-Der-Staat-scheitert-am-Fachkraeftemangel.html (crawled 2022-12-06 08:09:44.059753)

--------------------
"Zweiter Weltkrieg: Polen will im Streit um Reparationen weiter eskalieren"
 - by WELT (WELT.DE)

"Zu seinem Antrittsbesuch in Berlin reist Polens
 neuer Vize-Außenminister Arkadiusz Mularczyk mit
 einem Bericht im Gepäck. Darin fordert Warschau
 von Deutschland Reparationen in Höhe von 1,3
 Billionen Euro – und droht mit weiterer [...]"

from: https://www.welt.de/politik/ausland/article242511375/Zweiter-Weltkrieg-Polen-will-im-Streit-um-Reparationen-weiter-eskalieren.html (crawled 2022-12-06 08:09:43.338912)
--------------------

Here is my attempt at a script for this:

import textwrap

de_de = PublisherCollection.de_de

crawler = Crawler(de_de)

for article in crawler.crawl(max_articles=2, error_handling='raise'):

    title = article.extracted['title']
    author = article.extracted['authors']

    # Wrap the article text into a short preview.
    wrapper = textwrap.TextWrapper(width=50, max_lines=5, initial_indent='"', subsequent_indent=' ')

    word_list = wrapper.wrap(text=article.extracted['plaintext'])
    text_sample = '\n'.join(word_list) + '"'

    print('-' * 20)
    print(f'"{title}"\n - by {author[0]} (WELT.DE)')
    print("\n" + text_sample + "\n")
    print(f'from: {article.url} (crawled {article.crawl_date})')
    print('-' * 20)

However, while building this script I noticed some small issues:

  • it would be nice to have direct access to some features we expect most articles to have, e.g. article.title instead of article.extracted['title']
  • article.extracted['plaintext'] is sometimes None.
  • the Article should know the proper name of its "newspaper". I am thinking of a field like article.source that just prints "Die WELT" if crawled from there.
  • article.extracted['authors'] is sometimes just the newspaper name

Port the Crawlers from QSE

This issue is about tracking progress with porting the crawlers from qse. A crawler is considered done after a PR that introduces it has been merged to main.

  • SZ
  • SPON
  • Welt
  • Zeit
  • BZ
  • DW
  • Focus
  • Merkur
  • Orf
  • Tagesschau
  • NDR
  • Taz
  • NTV(#132 )
  • Stern(#127)
  • FAZ
  • WAZ(#128)

I just remembered that there are some more:

  • Bild
  • Tagesspiegel

Some I worked on locally:

  • Frankfurter Rundschau
  • Freitag
  • MDR
  • T Online
  • Heise
  • Zdf

[Proposal]: Precompile xpath and CSS expressions

@dobbersc pointed out that xpath and CSS expressions can be pre-compiled in lxml via the XPath and CSSSelector classes, respectively.

Imo pre-compiling the selectors could have two major benefits:

  1. Speed: I haven't evaluated this yet (which we should definitely do before going live with these changes), but since most of the parsing time is spent on the selectors, this could be a huge time saver.
  2. Flexibility: Currently, functions like extract_article_body_with_selector() depend on a rather ugly mode parameter. This makes mixing xpath and CSS selectors impossible in the first place and ugly to use in the second. With pre-compiled selectors, such functions no longer depend on an ambiguous string representation of the selector but receive an actual selector object, allowing them to differentiate automatically.

The goal of this issue is a PR that implements the following (see the sketch after this list):

  • Adjust existing functions to use selector objects instead of strings.
  • Pre-compile existing selectors in the individual parser implementations.
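
A minimal sketch of what pre-compiled selectors could look like with lxml; the selector expressions and the function name are assumptions, only XPath and CSSSelector are actual lxml classes.

from typing import List, Union

import lxml.html
from lxml.cssselect import CSSSelector
from lxml.etree import XPath

# Compiled once at import time instead of re-parsing the expression for every article.
paragraph_selector = CSSSelector("article p")
summary_selector = XPath("//div[@class='summary']//p/text()")


def extract_with_selector(doc: lxml.html.HtmlElement, selector: Union[XPath, CSSSelector]) -> List:
    # Both XPath and CSSSelector instances are callables, so xpath and CSS
    # expressions can be mixed without an explicit mode parameter.
    return selector(doc)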

Example project: Corpus of political bias

It would be great to add some crawlers for English-language news to the library (once the guidelines are finalized).

For another research project on detecting political bias and reliability, we require crawlers for the following 12 sources:
