ruedigervoigt / exoskeleton

A Python framework to build polite, but tenacious crawlers / scrapers with a MariaDB backend

License: Apache License 2.0

Python 100.00%

crawler crawling-framework database machine-learning mariadb network python python-3 scraping

exoskeleton's Issues

Add ability to block domains

Allow the user to add domains to a list. If a URL points to a domain on that list, it should not be possible to add this URL to the queue.

Checking the list should not require a database query. Such a list should be short, so it can be kept in RAM. It should be read once at startup, and refreshed every time a domain is added or removed.
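A minimal sketch of such an in-memory list follows. The class and method names are hypothetical and not part of exoskeleton's API. Treating subdomains of a blocked domain as blocked is an assumption the issue does not state.

```python
from urllib.parse import urlsplit


class DomainBlocklist:
    """In-memory blocklist (illustrative names, not exoskeleton's API)."""

    def __init__(self, domains=()):
        # Read once at startup, e.g. from a database table.
        self._blocked = {d.strip().lower() for d in domains}

    def add(self, domain):
        # Refresh the in-memory copy whenever a domain is added.
        self._blocked.add(domain.strip().lower())

    def remove(self, domain):
        # Refresh the in-memory copy whenever a domain is removed.
        self._blocked.discard(domain.strip().lower())

    def is_blocked(self, url):
        """Check the URL's host without touching the database."""
        host = (urlsplit(url).hostname or "").lower()
        parts = host.split(".")
        # Assumption: a blocked domain also blocks all of its subdomains.
        return any(".".join(parts[i:]) in self._blocked
                   for i in range(len(parts)))
```

A crawler would call is_blocked() before inserting a URL into the queue and reject the URL if it returns True.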

If a file has already been processed, it is possible to add the same URL and task combination to the queue without enforcing a new version. This is because exoskeleton accidentally compares the storageTypeID with the actionID. Those were supposed to match in an early phase of the project, but they no longer do.

This does not occur if a crawler first collects all URLs and then works through them. It does occur if a crawler scans repeatedly for new ones.

This requires adding a database field and will be fixed with 0.9.2.

Avoid duplicate entries in queue / duplicate downloads

While scraping a large site, a specific URL might be referred to from many places. The framework should avoid processing it again.

Using a unique index on the URL is not suitable, as URLs are too long for an index with MariaDB default settings. It would require a setup change by the admin.

The solution might be a short hash value combined with a collision check.
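The idea can be sketched as follows. SHA-256 truncated to 16 hex characters is an assumption, as are all names. The dictionary stands in for an indexed hash column stored next to the full URL.

```python
import hashlib


def url_fingerprint(url, length=16):
    """Return a short, fixed-length hex digest that fits into an index."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:length]


class QueueIndex:
    """Stand-in for an indexed hash column plus the full URL in each row."""

    def __init__(self):
        self._by_hash = {}  # fingerprint -> set of full URLs

    def add(self, url):
        """Return True if the URL was new and has been added."""
        bucket = self._by_hash.setdefault(url_fingerprint(url), set())
        # Collision check: two different URLs may share a short hash,
        # so the full URL is compared before declaring a duplicate.
        if url in bucket:
            return False
        bucket.add(url)
        return True
```

In SQL, the equivalent would be a lookup on the indexed hash column followed by a comparison of the full URL on the few rows returned.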

Add Integration Test with MariaDB

As the code is a mixture of Python and SQL, the database part also has to be tested. At the moment there is a script that simulates a crawler, but the resulting database structure is checked manually.

It seems possible to spin up a MariaDB instance with GitHub Actions, so this can be automated.

Replace the requests with aiohttp or httpx

Parallelize tasks

The queue has to be modified in order to distribute the load evenly on multiple target hosts.
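One possible ordering, shown only as a sketch: take one URL per host in turn, so that no single host receives a burst of requests. The function name is hypothetical, and the issue does not prescribe round-robin.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit


def interleave_by_host(urls):
    """Order URLs round-robin across hosts (illustrative helper)."""
    per_host = defaultdict(deque)
    for url in urls:
        per_host[urlsplit(url).hostname].append(url)

    ordered = []
    queues = list(per_host.values())
    while queues:
        # Take one URL from every host that still has work left.
        for queue in queues:
            ordered.append(queue.popleft())
        queues = [queue for queue in queues if queue]
    return ordered
```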

PostgreSQL support

add statistics on host basis

add crawl delay in case of timeout

If the crawler runs into a timeout, it should wait for a certain time before it bothers the server again for the same item.
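The issue only asks for "a certain time". The sketch below assumes exponential backoff with an upper bound. The function name and the default values are illustrative.

```python
def delay_after_timeout(attempt, base_delay=30.0, max_delay=3600.0):
    """Seconds to wait before retrying an item that timed out.

    attempt is the number of timeouts seen so far for this item (>= 1).
    The delay doubles with every further timeout and is capped at max_delay.
    """
    if attempt < 1:
        raise ValueError("attempt must be at least 1")
    return min(base_delay * 2 ** (attempt - 1), max_delay)
```

With the defaults, the first timeout leads to a 30 second pause, the third to 120 seconds, and the pause never exceeds one hour.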

Cloud services as a storage option

colored log output

Maybe via https://github.com/borntyping/python-colorlog

Add to CI pipeline

MariaDB Service container broken by system update

Connecting to the MariaDB service container from within the Linux Test Container is impossible if apt update && apt upgrade has run.

Check for the existence of database functions

The package already checks that all expected stored procedures exist.
However, it also depends on some SQL functions.
Their existence should be checked as well.
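A sketch of such a check, assuming a DB-API cursor with the %s parameter style (as pymysql uses) and rows returned as tuples. The helper name is hypothetical. information_schema.ROUTINES is the standard catalog view for stored routines.

```python
def missing_functions(cursor, schema, expected):
    """Return the expected SQL function names that the schema lacks."""
    cursor.execute(
        "SELECT ROUTINE_NAME FROM information_schema.ROUTINES "
        "WHERE ROUTINE_SCHEMA = %s AND ROUTINE_TYPE = 'FUNCTION';",
        (schema,),
    )
    # Assumption: the cursor yields tuples, name in the first column.
    found = {row[0] for row in cursor.fetchall()}
    return sorted(set(expected) - found)
```

An empty result means every expected function exists. Otherwise the package could log the missing names and refuse to start.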

Create a Docker image

A Docker image would make it easier to start projects with exoskeleton.

A branch docker-image was created.

Python 3.8 Test

parse / correct page content before saving

Improve estimate of remaining time

The method estimate_remaining_time in TimeManager does not take rate limits and temporary problems into account.
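How the estimate could account for both factors is sketched below. This is not the actual TimeManager code. The parameters wait_seconds (the enforced pause between requests) and failure_rate (the share of attempts that fail temporarily and are retried) are assumptions.

```python
def estimate_remaining_seconds(tasks_left, avg_task_seconds,
                               wait_seconds=0.0, failure_rate=0.0):
    """Rough estimate that includes rate limits and retried failures.

    If each attempt fails independently with probability failure_rate,
    a task needs 1 / (1 - failure_rate) attempts on average.
    """
    if not 0.0 <= failure_rate < 1.0:
        raise ValueError("failure_rate must be in [0, 1)")
    seconds_per_attempt = avg_task_seconds + wait_seconds
    expected_attempts = 1.0 / (1.0 - failure_rate)
    return tasks_left * seconds_per_attempt * expected_attempts
```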

Auto-Increment in InnoDB not persistent before MariaDB 10.2.4

The auto-increment value is not persistent over server restarts. See:
https://jira.mariadb.org/browse/MDEV-6076

Any queue entry is deleted as soon as it is no longer needed. Assume the application has worked through the queue. Then the queue is empty, and after a server restart the first auto-increment value will be set to 1 (max-value + 1) by the server.

The problem is that the queue ID is used as the filename. So the application will start to overwrite existing files, and there will be multiple entries for the same filename in the system.

This will need changes to the database structure, as MariaDB versions without persistent auto-increment value are used by current Linux LTS versions.

Python 3.9 support

lxml is a dependency, but building lxml version 4.5.2 fails with Python 3.9 on Linux, Windows, and macOS.

There is already an open issue with the lxml project, so this is waiting for lxml's next release.

Add support for a remote Mailserver

Currently exoskeleton assumes there is a Mail Transfer Agent (like Sendmail, Exim, or Postfix) installed on localhost and will try to use that for sending mails.

Change that so a different machine can be used.
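With the standard library, sending through another machine could look like the sketch below. The function names, the default port 587, and the use of STARTTLS are assumptions and not exoskeleton's current behaviour.

```python
import smtplib
from email.message import EmailMessage


def build_message(sender, recipient, subject, body):
    """Create a plain-text mail message."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send_via_remote_server(msg, host, port=587,
                           username=None, password=None):
    """Send msg through the SMTP server at host instead of localhost."""
    with smtplib.SMTP(host, port, timeout=30) as server:
        server.starttls()  # assumption: the remote server offers STARTTLS
        if username is not None:
            server.login(username, password)
        server.send_message(msg)
```

Host, port, and credentials would come from the user's configuration, with localhost as the default to keep existing setups working.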

File Name Prefix

The default is to store a file with its database ID as its name. Allow the user to define a filename prefix.

Unit Tests

Add base URL to links

Many documents use relative links like overview.html instead of https://www.example.com/overview.html. It would be useful to convert those into absolute links before the page content is saved.

The first step would be to find each link and to determine if it is absolute or relative. This must cover other protocols besides http and https.
It is possible that a base URL was set in the document code which is not the URL of the page just crawled. This has to be detected and taken into account.
To capture cases like ../../foobar.html, the urllib.parse.urljoin function should be used.

The best place for this functionality seems to be an optional feature of the prettify_html function.
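The resolution step could look roughly like this. The helper name make_absolute and its parameters are hypothetical. Finding the links and the base element in the page is left out.

```python
from urllib.parse import urljoin, urlsplit


def make_absolute(link, page_url, base_href=None):
    """Resolve link against the document's base URL.

    page_url is the URL of the crawled page. base_href is the value of a
    base element found in the document, if any.
    """
    # Any scheme (http, https, mailto, ftp, ...) marks an absolute link.
    if urlsplit(link).scheme:
        return link
    # A base URL set in the document takes precedence over the page URL.
    base = urljoin(page_url, base_href) if base_href else page_url
    return urljoin(base, link)
```

For example, "../../foobar.html" on the page https://www.example.com/a/b/c.html resolves to https://www.example.com/foobar.html, while a mailto: link is returned unchanged.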

Switch from pymysql to mariadb and a connection pool

Currently exoskeleton uses pymysql to connect with MariaDB. Recently the MariaDB Connector/Python was published. See:

It would bring real connection pooling.

However, it relies on the C connector. A test on a system based on Ubuntu 18 showed that the installed version of the C connector is not compatible with the new Python connector. So a user would have to upgrade the mariadb-server to a version higher than the one supplied by the OS, and / or integrate additional repositories. Both can be quite a hurdle.

As pymysql is "good enough" the switch is postponed to exoskeleton 2.