exoskeleton's People

Contributors

clijsters, ruedigervoigt

Forkers

clijsters

exoskeleton's Issues

Add ability to block domains

Allow the user to add domains to a list. If a URL points to a domain on that list, it should not be possible to add this URL to the queue.

Checking the list should not require a database query. Such a list should be short, so it can be kept in RAM. It should be read once at startup, and refreshed every time a domain is added or removed.
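A minimal sketch of such an in-memory check, using only the standard library. The class name `BlockList` and its methods are hypothetical and not part of exoskeleton. Treating subdomains of a blocked domain as blocked is an assumption the issue does not state.

```python
from urllib.parse import urlsplit


class BlockList:
    """In-memory set of blocked domains (hypothetical helper)."""

    def __init__(self, domains=()):
        self._domains = set()
        self.refresh(domains)

    def refresh(self, domains):
        """Replace the cached list, e.g. at startup or after an add/remove."""
        self._domains = {d.strip().lower() for d in domains if d.strip()}

    def is_blocked(self, url):
        """True if the URL's host is a blocked domain or a subdomain of one.

        Matching subdomains is an assumption made for this sketch.
        """
        host = (urlsplit(url).hostname or "").lower()
        return any(host == d or host.endswith("." + d) for d in self._domains)
```

The caller would read the domains from the database once, pass them to `refresh`, and call `is_blocked` before adding a URL to the queue, so the check itself never touches the database.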

Duplicate tasks in the queue

If a file has already been processed, it is possible to add the same URL and task combination to the queue without enforcing a new version. This is because exoskeleton accidentally compares the storageTypeID with the actionID. Those were supposed to match in an early phase of the project, but do not.

This does not occur if a crawler first collects all URLs and then works through them. It does occur if a crawler scans repeatedly for new ones.

This requires adding a database field and will be fixed with 0.9.2.

Avoid duplicate entries in queue / duplicate downloads

While scraping a large site, a specific URL might be referred to from many places. The framework should avoid processing it again.

A unique index on the URL is not suitable, as URLs are too long for an index with MariaDB default settings. It would require a setup change by the admin.

The solution might be a short hash value combined with a collision check.
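The idea can be sketched as follows. The choice of SHA-256, the 16-character length, and the names `url_hash` and `SeenUrls` are assumptions for illustration; the dictionary stands in for a queue table with an index on the hash column.

```python
import hashlib


def url_hash(url):
    """Short fixed-length digest, small enough for an indexed column."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]


class SeenUrls:
    """Stand-in for a queue table indexed by the hash (hypothetical)."""

    def __init__(self):
        self._by_hash = {}  # hash -> list of full URLs sharing that hash

    def add(self, url):
        """Return True if the URL is new, False if it is already known."""
        bucket = self._by_hash.setdefault(url_hash(url), [])
        if url in bucket:  # collision check: compare the full URL
            return False
        bucket.append(url)
        return True
```

The hash lookup narrows the search to a handful of rows. Comparing the full URL afterwards makes sure that two different URLs that happen to share a hash are not mistaken for duplicates.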

Add Integration Test with MariaDB

As the code is a mixture of Python and SQL, the database part has to be tested as well. At the moment there is a script that simulates a crawler, but the resulting database structure is checked manually.

It seems possible to spin up a MariaDB instance with GitHub Actions, so this can be automated.

Check for the existence of database functions

The package already checks whether all expected stored procedures exist.
However, it also depends on some SQL functions.
Their existence should be checked as well.
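One possible way to do this is to query `information_schema.ROUTINES`, which lists both procedures and functions. The sketch below assumes a DB-API cursor that returns rows as tuples and uses the `%s` parameter style (as pymysql does by default). The function name `missing_functions` and the expected names passed to it are hypothetical.

```python
def missing_functions(cursor, schema, expected):
    """Return the expected SQL function names not found in the schema.

    Names are compared in lower case, an assumption made so that
    differences in capitalization do not count as missing.
    """
    cursor.execute(
        "SELECT ROUTINE_NAME FROM information_schema.ROUTINES "
        "WHERE ROUTINE_SCHEMA = %s AND ROUTINE_TYPE = 'FUNCTION'",
        (schema,),
    )
    found = {row[0].lower() for row in cursor.fetchall()}
    return sorted(name for name in set(expected) if name.lower() not in found)
```

An empty result means all functions are present. Otherwise the returned names can be reported in the error message at startup.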

Create a Docker image

A Docker image would make it easier to start projects with exoskeleton.

A branch docker-image was created.

Auto-Increment in InnoDB *not* persistent before MariaDB 10.2.4

The auto-increment value is not persistent over server restarts. See:
https://jira.mariadb.org/browse/MDEV-6076

Any queue entry is deleted as soon as it is no longer needed. Assume the application has worked through the queue. The queue is then empty, and after a server restart the server sets the next auto-increment value to 1 (max-value + 1).

The problem is that the queue id is used as the file name. So the application will start to overwrite existing files, and there will be multiple entries for the same file name in the system.

This will need changes to the database structure, as MariaDB versions without persistent auto-increment value are used by current Linux LTS versions.

Python 3.9 support

lxml is a dependency, but building lxml version 4.5.2 fails with Python 3.9 on Linux, Windows, and macOS.

An issue has already been opened with the lxml project, so this is waiting for lxml's next release.

Add support for a remote Mailserver

Currently exoskeleton assumes there is a Mail Transfer Agent (like Sendmail, Exim, or Postfix) installed on localhost and will try to use that for sending mails.

Change this so that a mail server on a different machine can be used.
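A minimal sketch of what a configurable mail host could look like with the standard library's `smtplib`. The function names and parameters (`host`, `port`, `user`, `password`, `use_starttls`) are assumptions for illustration, not exoskeleton's actual settings.

```python
import smtplib
from email.message import EmailMessage


def build_message(sender, recipient, subject, body):
    """Create a plain-text mail message."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send_message(msg, host="localhost", port=25,
                 user=None, password=None, use_starttls=False):
    """Send via the given SMTP server; the defaults keep the localhost behavior."""
    with smtplib.SMTP(host, port, timeout=30) as server:
        if use_starttls:
            server.starttls()  # upgrade the connection before logging in
        if user is not None:
            server.login(user, password)
        server.send_message(msg)
```

With the defaults, the behavior stays the same as today (a Mail Transfer Agent on localhost). A remote server would only require passing a different host, port, and credentials.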

File Name Prefix

By default, a file is stored with its database id as its name. Allow the user to define a file name prefix.

Add base URL to links

Many documents use relative links like overview.html instead of https://www.example.com/overview.html. It would be useful to convert those into absolute links before the page content is saved.

  • The first step would be to find each link and to determine if it is absolute or relative. This must cover other protocols besides http and https.
  • It is possible that a base URL was set in the document code which is not the URL of the page just crawled. This has to be detected and taken into account.
  • To capture cases like ../../foobar.html the urllib.parse.urljoin function should be used.
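The three steps above can be sketched with the standard library alone. The function name `make_absolute` and the `base_href` parameter (the value of a `<base href="...">` element, if the page has one) are hypothetical. Finding the links in the document is left out here.

```python
from urllib.parse import urljoin, urlsplit


def make_absolute(href, page_url, base_href=None):
    """Resolve a link against the document's base URL.

    base_href may itself be relative to page_url.
    """
    if urlsplit(href).scheme:
        return href  # already absolute (http, https, mailto, ftp, ...)
    base = urljoin(page_url, base_href) if base_href else page_url
    return urljoin(base, href)
```

For example, with the page URL `https://www.example.com/a/b/c.html`, the link `../../foobar.html` resolves to `https://www.example.com/foobar.html`, while a `mailto:` link is returned unchanged.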

The best place for this functionality seems to be an optional feature of the prettify_html function.

Switch from pymysql to mariadb and a connection pool

Currently exoskeleton uses pymysql to connect with MariaDB. Recently the MariaDB Connector/Python was published. See:

It would bring real connection pooling.

However, it relies on the C-Connector. A test with a system based on Ubuntu 18 showed that the installed version of this is not compatible with the new Python connector. So a user would have to upgrade the mariadb-server to a version higher than the one supplied by the OS, and / or integrate some repositories. Both can be quite a hurdle.

As pymysql is "good enough" the switch is postponed to exoskeleton 2.
