
crawlie's Introduction

Hi there 👋

  • Software engineering generalist, specializing in backend and Elixir
  • Check out rexbug, for all your Elixir interactive tracing needs!
  • Need to create a good-looking resume easily? Give markdown-resume a shot!
  • Trying to write more on https://nietaki.com/

I'm available for contract work! You can check out my CV HERE.

crawlie's People

Contributors

nietaki, tazsingh


crawlie's Issues

Fix the `:url_manager_timeout` logic

Even with longer (~5s) values of the url manager timeout, the UrlManager shuts down gracefully even though it is still processing new urls.

Either fix the logic or find an approach that doesn't depend on the timeout.

Tune the Flow parameters

parameters:

  • :min_demand
  • :max_demand
  • :stages

It should be possible to choose these parameters individually for the fetching phase and the processing phase. The defaults should be sensible (more stages for the fetching phase, a stage count proportional to the CPU count for the processing phase), and user options should be deep-merged with the defaults using a straightforward, tested function.
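A deep merge of user options over per-phase defaults could look roughly like the sketch below. All module, function, and default names here are illustrative, not Crawlie's actual API:

```elixir
defmodule OptionsSketch do
  # Hypothetical per-phase defaults: more stages for the IO-bound
  # fetching phase, CPU-proportional stages for the processing phase.
  def defaults do
    %{
      fetch: %{min_demand: 1, max_demand: 5, stages: 4 * System.schedulers_online()},
      process: %{min_demand: 1, max_demand: 10, stages: System.schedulers_online()}
    }
  end

  # Recursively merges `user` into `base`; user values win, and
  # nested maps are merged rather than replaced wholesale.
  def deep_merge(base, user) do
    Map.merge(base, user, fn
      _key, %{} = b, %{} = u -> deep_merge(b, u)
      _key, _b, u -> u
    end)
  end
end
```

For example, `OptionsSketch.deep_merge(OptionsSketch.defaults(), %{fetch: %{stages: 16}})` would override only the fetch stage count while keeping every other default intact.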

Do not add duplicate uris to the UrlManager State

Currently, checking for duplicates is performed when pulling the Pages out, which can lead to unnecessary memory consumption.

Do it when adding them instead, while keeping the retry logic correct.
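A minimal sketch of dedup-on-insertion, assuming a state shape with a pending queue plus a set of every uri ever accepted (the struct and function names are hypothetical, not the actual UrlManager code). Retried uris re-enter the queue through a separate path so the seen-set check doesn't reject them:

```elixir
defmodule UrlManagerStateSketch do
  # Hypothetical state: pending work queue + set of accepted uris,
  # checked on insertion rather than on pull.
  defstruct pending: :queue.new(), seen: MapSet.new()

  # Accepts a uri only if it hasn't been seen before.
  def add_new(%__MODULE__{} = state, uri) do
    if MapSet.member?(state.seen, uri) do
      state
    else
      %{state | pending: :queue.in(uri, state.pending), seen: MapSet.put(state.seen, uri)}
    end
  end

  # Re-enqueues a uri for retry without consulting the seen set,
  # since it is already a member.
  def add_retry(%__MODULE__{} = state, uri) do
    %{state | pending: :queue.in(uri, state.pending)}
  end
end
```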

Not compatible with the latest GenStage

The current mix file doesn't allow usage with the current version of GenStage (0.14.1), although, judging by the changelog, everything should work fine.
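Assuming the version requirement in mix.exs is what blocks the newer release, widening it might look like this (the exact bounds shown are illustrative, not the project's actual constraint):

```elixir
# mix.exs fragment (illustrative bounds): accept the 0.14.x line
# of gen_stage alongside the previously supported versions.
defp deps do
  [
    {:gen_stage, "~> 0.12 or ~> 0.14"}
  ]
end
```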

Provide the option of tracking crawling statistics

Implementation detail ideas:

  • off by default
  • separate API function for it, like crawl_with_stats(source, parser_logic, options \\ []) :: {flow, ref}
  • create a :simple_one_for_one StatsSupervisor as part of Crawlie app
  • use the ref (a {:stats, PID} tuple?) to send the stats updates from the Flow and for retrieving the stats from the Supervisor children.
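One way the pieces above could hang together, sketched with an Agent standing in for a supervised stats child (everything here is hypothetical, not Crawlie's actual implementation): a per-crawl stats process is started, the `{:stats, pid}` tuple serves as the ref, and the Flow stages send counter updates through it:

```elixir
defmodule StatsSketch do
  # Hypothetical: one Agent per crawl accumulates counters; in the
  # real proposal it would be a child of a StatsSupervisor.
  def start do
    {:ok, pid} = Agent.start_link(fn -> %{fetched: 0, failed: 0} end)
    {:stats, pid}
  end

  # Called from the Flow stages to record an event.
  def bump({:stats, pid}, key) do
    Agent.update(pid, &Map.update!(&1, key, fn n -> n + 1 end))
  end

  # Called by the user to retrieve the stats for the ref.
  def snapshot({:stats, pid}), do: Agent.get(pid, & &1)
end
```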

Eliminating duplicate urls

Make UrlManager reject urls it already processed.

Bear in mind that a url leaving the UrlManager doesn't mean it was processed; it can still be retried.
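The retry caveat suggests tracking two sets rather than one. A hypothetical sketch (names and state shape are illustrative): uris that have left the manager but may still be retried stay "in flight", and only uris that have finished for good cause rejection:

```elixir
defmodule DedupSketch do
  # Hypothetical bookkeeping: `in_flight` uris may still be retried;
  # `processed` uris have succeeded or exhausted their retries.
  defstruct in_flight: MapSet.new(), processed: MapSet.new()

  # A newly discovered uri is accepted only if it is neither
  # being worked on nor already done.
  def accept?(%__MODULE__{} = s, uri) do
    not MapSet.member?(s.in_flight, uri) and not MapSet.member?(s.processed, uri)
  end

  def start(%__MODULE__{} = s, uri), do: %{s | in_flight: MapSet.put(s.in_flight, uri)}

  # Called when the uri succeeds or exhausts its retries.
  def finish(%__MODULE__{} = s, uri) do
    %{s | in_flight: MapSet.delete(s.in_flight, uri), processed: MapSet.put(s.processed, uri)}
  end
end
```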

Rate Limiting

Just starting to wrap my head around this library, as it could do what I'm looking for, but I'm struggling to find any form of rate limiting: some way to limit the number of requests per second, or how quickly items get dealt with from the queue.

I could be entirely missing where that can be done, but thought I'd ask as there doesn't seem to be an issue about it.
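There doesn't appear to be a built-in option for this. As a workaround, requests could be funneled through a small pacing process; the sketch below (hypothetical, not part of Crawlie) is a GenServer that spaces out permits so callers are granted at most `per_second` of them a second:

```elixir
defmodule RateLimiterSketch do
  # Hypothetical pacing process: each fetch calls wait/1 before
  # issuing its request; permits are spaced by a fixed interval.
  use GenServer

  def start_link(per_second), do: GenServer.start_link(__MODULE__, per_second)

  def wait(pid), do: GenServer.call(pid, :permit, :infinity)

  @impl true
  def init(per_second), do: {:ok, %{interval_ms: div(1000, per_second), last: nil}}

  @impl true
  def handle_call(:permit, _from, %{interval_ms: interval, last: last} = s) do
    now = System.monotonic_time(:millisecond)
    # The next grant happens no sooner than `interval` ms after the last one.
    next = if last, do: max(now, last + interval), else: now
    Process.sleep(next - now)
    {:reply, :ok, %{s | last: next}}
  end
end
```

Sleeping inside `handle_call` deliberately serializes callers, which is exactly the throttling effect wanted here, though it makes this process a global bottleneck by design.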

Make it possible to pass some information from the parent page

Currently, each page is considered on its own. This isn't a flexible enough model when a page's data is important in the context of its parent page. For example, you're crawling a website to download all the images it links to, but want to know the anchor text for each of them.

The solution would be to have ParserLogic.extract_uris() return {tag :: term, uri :: URI.t} | URI.t, where the latter would be converted to {nil, uri}, and have the tag passed as an additional argument to ParserLogic.parse(). The user could then choose whether to include the parent tag in the parse_result for their convenience.

This will require some manual labour, but shouldn't be too bad ;)
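The normalization step described above could be as small as this (a sketch of the proposal, with an illustrative module name):

```elixir
defmodule TagSketch do
  # Hypothetical normalization: extract_uris may return either a
  # bare %URI{} or a {tag, uri} tuple; bare uris get a nil tag
  # before being passed on to parse.
  def normalize({tag, %URI{} = uri}), do: {tag, uri}
  def normalize(%URI{} = uri), do: {nil, uri}
end
```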
