ruedigervoigt / exoskeleton
A Python framework to build polite, but tenacious crawlers / scrapers with a MariaDB backend
License: Apache License 2.0
Allow the user to add domains to a list. If a URL points to a domain on that list, it should not be possible to add this URL to the queue.
Checking the list should not require a database query. Such a list should be short, so it can be kept in RAM. It should be read once at startup, and refreshed every time a domain is added or removed.
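A minimal sketch of such an in-memory list, assuming a hypothetical `DomainBlocklist` class that is not part of exoskeleton. The initial domains would be read from the database once at startup. Treating subdomains of a listed domain as blocked is an assumption of this sketch, not something the issue specifies.

```python
from urllib.parse import urlparse


class DomainBlocklist:
    """In-memory set of blocked domains (hypothetical helper, not exoskeleton's API)."""

    def __init__(self, domains=()):
        # In a real setup these would be loaded once from the database at startup.
        self._domains = {d.strip().lower() for d in domains if d.strip()}

    def add(self, domain):
        # Refresh the in-memory copy whenever a domain is added.
        self._domains.add(domain.strip().lower())

    def remove(self, domain):
        # Refresh the in-memory copy whenever a domain is removed.
        self._domains.discard(domain.strip().lower())

    def is_blocked(self, url):
        host = (urlparse(url).hostname or "").lower()
        if not host:
            return False
        # Assumption: a listed domain also blocks its subdomains,
        # so check the host and every parent domain against the set.
        parts = host.split(".")
        return any(".".join(parts[i:]) in self._domains for i in range(len(parts)))
```

Because the lookup only touches a Python set, checking a URL before it is queued needs no database query.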
If a file has already been processed, it is possible to add the same URL and task combination to the queue without enforcing a new version. This is because exoskeleton accidentally compares the storageTypeID with the actionID. Those were supposed to match in an early phase of the project, but they do not.
This does not occur if a crawler first collects all URLs and then works through them. It does occur if a crawler scans repeatedly for new ones.
This requires adding a database field and will be fixed with 0.9.2.
While scraping a large site, a specific URL might be referred to from many places. The framework should avoid processing it again.
Using a unique index on the URL is not suitable, as URLs are too long for an index with MariaDB default settings. It would require a setup change by the admin.
The solution might be a short hash value combined with a collision check.
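The idea can be sketched in memory as follows. The names `url_hash` and `SeenUrls`, the choice of SHA-256, and the 16-character length are assumptions for illustration. In the database, the dictionary would correspond to a non-unique index on a short hash column, with the full URL compared only among rows sharing that hash.

```python
import hashlib


def url_hash(url, length=16):
    """Return a short hex digest of the URL (algorithm and length are assumed values)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:length]


class SeenUrls:
    """In-memory stand-in for a table with a non-unique index on the hash column."""

    def __init__(self):
        self._by_hash = {}

    def add_if_new(self, url):
        """Record the URL and return True, or return False if it is already known."""
        bucket = self._by_hash.setdefault(url_hash(url), [])
        # Collision check: an equal hash does not imply an equal URL,
        # so the full URL is compared within the bucket.
        if url in bucket:
            return False
        bucket.append(url)
        return True
```

The short hash keeps the index small, while the comparison of the full URL guards against two different URLs sharing a hash.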
As the code is a mixture of Python and SQL, the database part also has to be tested. At the moment there is a script that simulates a crawler, but the resulting database structure is checked manually.
It seems possible to spin up a MariaDB instance with GitHub Actions, so this can be automated.
Parallelize tasks
The queue has to be modified in order to distribute the load evenly on multiple target hosts.
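One possible way to spread requests over hosts is to reorder the queue round-robin by hostname. This is only a sketch of that idea with a hypothetical function `interleave_by_host`; it does not reflect how exoskeleton's queue is actually stored or selected from.

```python
from collections import deque
from urllib.parse import urlparse


def interleave_by_host(urls):
    """Reorder URLs so that consecutive entries target different hosts where possible."""
    queues = {}
    for url in urls:
        host = urlparse(url).hostname or ""
        queues.setdefault(host, deque()).append(url)

    ordered = []
    # Take one URL per host in turn until every per-host queue is empty.
    while queues:
        for host in list(queues):
            ordered.append(queues[host].popleft())
            if not queues[host]:
                del queues[host]
    return ordered
```

Within each host the original order is preserved; only the interleaving between hosts changes.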
If the crawler runs into a timeout, it should wait for a certain time before it bothers the server again for the same item.
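The issue does not say how long to wait. As one assumed policy, the delay could double with every failed attempt up to a cap. The function name `retry_delay` and the default values below are hypothetical.

```python
def retry_delay(failed_attempts, base_seconds=60, max_seconds=3600):
    """Seconds to wait before retrying an item after `failed_attempts` timeouts.

    Assumed policy: exponential backoff (base, 2*base, 4*base, ...) capped at max_seconds.
    """
    if failed_attempts < 1:
        return 0
    return min(base_seconds * 2 ** (failed_attempts - 1), max_seconds)
```

Storing the time of the last attempt together with the number of failures would let the queue skip an item until its delay has passed.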
Connecting to the MariaDB service container from within the Linux Test Container is impossible if apt update && apt upgrade has run.
The package already checks whether all expected stored procedures exist.
However, there are also some SQL functions that are important for the package.
Their existence should be checked as well.
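A check for functions could look roughly like the following sketch. It assumes a DB-API cursor that returns tuples (as pymysql's default cursor does) and queries `information_schema.ROUTINES`. The helper name `missing_functions` is hypothetical, as are any function names passed to it.

```python
def missing_functions(cursor, schema, expected):
    """Return the names in `expected` that are not defined as SQL functions in `schema`."""
    cursor.execute(
        "SELECT ROUTINE_NAME FROM information_schema.ROUTINES "
        "WHERE ROUTINE_SCHEMA = %s AND ROUTINE_TYPE = 'FUNCTION';",
        (schema,),
    )
    # Each row is a one-element tuple holding the routine name.
    found = {row[0] for row in cursor.fetchall()}
    return sorted(set(expected) - found)
```

An empty result means every expected function exists; otherwise the returned names can be reported at startup.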
A Docker image would make it easier to start projects with exoskeleton.
A branch docker-image was created.
The method estimate_remaining_time in TimeManager does not take rate limits and temporary problems into account.
The auto-increment value is not persistent over server restarts. See:
https://jira.mariadb.org/browse/MDEV-6076
Any queue entry is deleted as soon as it is no longer needed. Once the application has worked through the queue, the queue is empty, and after a server restart the server sets the next auto-increment value to 1 (max-value + 1).
The problem is that the queue-id is used as the filename. So the application will start to overwrite existing files, and there will be multiple entries for the same filename in the system.
This will need changes to the database structure, as MariaDB versions without persistent auto-increment value are used by current Linux LTS versions.
lxml is a dependency, but building lxml version 4.5.2 fails with Python 3.9 on Linux/Windows/macOS.
There is already an issue opened with the lxml project, so this is waiting for lxml's next release.
Currently exoskeleton assumes there is a Mail Transfer Agent (like Sendmail, Exim, or Postfix) installed on localhost and will try to use that for sending mails.
This should be changed so that a mail server on a different machine can be used.
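With the standard library, making the mail host configurable could look like the sketch below. The function names `build_message` and `send_mail` and their parameters are hypothetical. Authentication and TLS, which a remote server will often require, are left out.

```python
import smtplib
from email.message import EmailMessage


def build_message(sender, recipient, subject, body):
    """Assemble a plain-text mail."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send_mail(msg, host="localhost", port=25):
    """Send via the given SMTP server; the defaults mirror the current localhost behavior."""
    # host and port are parameters here instead of a fixed local MTA.
    with smtplib.SMTP(host, port) as server:
        server.send_message(msg)
```

Keeping `localhost` as the default would preserve the existing behavior for users who do run a local Mail Transfer Agent.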
By default, a file is stored with its database id as its name. Allow the user to define a filename prefix.
Many documents use relative links like overview.html instead of https://www.example.com/overview.html. It would be useful to convert those into absolute links before the page content is saved.
Because relative links can also take forms such as ../../foobar.html, the urllib.parse.urljoin function should be used. The best place for this functionality seems to be an optional feature of the prettify_html function.
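The URL resolution itself can be illustrated with a few lines. The helper `absolutize` is hypothetical and only covers resolving link targets against the page URL, not parsing or rewriting the HTML.

```python
from urllib.parse import urljoin


def absolutize(page_url, hrefs):
    """Resolve each link target against the URL of the page it was found on."""
    return [urljoin(page_url, href) for href in hrefs]


links = absolutize(
    "https://www.example.com/docs/guide/index.html",
    ["overview.html", "../../foobar.html", "https://other.example/x"],
)
# links == [
#     "https://www.example.com/docs/guide/overview.html",
#     "https://www.example.com/foobar.html",
#     "https://other.example/x",
# ]
```

Links that are already absolute pass through unchanged, so the conversion can be applied to every link on a page.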
Currently exoskeleton uses pymysql to connect to MariaDB. Recently the MariaDB Connector/Python was published.
It would bring real connection pooling.
However, it relies on the C-Connector. A test with a system based on Ubuntu 18 showed that the installed version of this is not compatible with the new Python connector. So a user would have to upgrade the mariadb-server to a version higher than the one supplied by the OS, and / or add some repositories. Both can be quite a hurdle.
As pymysql is "good enough", the switch is postponed to exoskeleton 2.