
crawlie's Introduction

Hi there 👋

  • Software engineering generalist, specializing in backend and Elixir
  • Check out rexbug, for all your Elixir interactive tracing needs!
  • Need to create a good-looking resume easily? Give markdown-resume a shot!
  • Trying to write more on https://nietaki.com/

I'm available for contract work! You can check out my CV HERE.

crawlie's People

Contributors

nietaki, tazsingh


crawlie's Issues

Fix the `:url_manager_timeout` logic

Even with longer (~5s) values of the url manager timeout, the UrlManager shuts down gracefully even though it is still processing new urls.

Either fix the logic or find an approach that doesn't depend on the timeout.

Tune the Flow parameters

parameters:

  • :min_demand
  • :max_demand
  • :stages

It should be possible to choose these parameters individually for the fetching phase and the processing phase. The defaults should be sensible (more stages for the fetching phase, a stage count proportional to the CPU count for the processing phase), and user options should be deep-merged with the defaults using a straightforward, tested function.
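A deep merge of user options over per-phase defaults could look roughly like the sketch below. All module, function, and default names here are illustrative, not Crawlie's actual API:

```elixir
defmodule OptionsSketch do
  # Hypothetical per-phase defaults: more stages for the IO-bound
  # fetching phase, CPU-proportional stages for the processing phase.
  def defaults do
    %{
      fetch: %{min_demand: 1, max_demand: 5, stages: 4 * System.schedulers_online()},
      process: %{min_demand: 1, max_demand: 10, stages: System.schedulers_online()}
    }
  end

  # Recursively merges `user` into `base`; user values win, and
  # nested maps are merged rather than replaced wholesale.
  def deep_merge(base, user) do
    Map.merge(base, user, fn
      _key, %{} = b, %{} = u -> deep_merge(b, u)
      _key, _b, u -> u
    end)
  end
end
```

For example, `OptionsSketch.deep_merge(OptionsSketch.defaults(), %{fetch: %{stages: 16}})` would override only the fetch stage count while keeping every other default intact.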

Do not add duplicate uris to the UrlManager State

Currently, checking for duplicates is performed when pulling the Pages out, which can lead to unnecessary memory consumption.

Do it when adding them instead, while keeping the retry logic correct.
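A minimal sketch of dedup-on-insertion, assuming a state shape with a pending queue plus a set of every uri ever accepted (the struct and function names are hypothetical, not the actual UrlManager code). Retried uris re-enter the queue through a separate path so the seen-set check doesn't reject them:

```elixir
defmodule UrlManagerStateSketch do
  # Hypothetical state: pending work queue + set of accepted uris,
  # checked on insertion rather than on pull.
  defstruct pending: :queue.new(), seen: MapSet.new()

  # Accepts a uri only if it hasn't been seen before.
  def add_new(%__MODULE__{} = state, uri) do
    if MapSet.member?(state.seen, uri) do
      state
    else
      %{state | pending: :queue.in(uri, state.pending), seen: MapSet.put(state.seen, uri)}
    end
  end

  # Re-enqueues a uri for retry without consulting the seen set,
  # since it is already a member.
  def add_retry(%__MODULE__{} = state, uri) do
    %{state | pending: :queue.in(uri, state.pending)}
  end
end
```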

Not compatible with the latest GenStage

The current mix file doesn't allow usage with the current version of GenStage (0.14.1), although, judging by the changelog, everything should work fine.
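Assuming the version requirement in mix.exs is what blocks the newer release, widening it might look like this (the exact bounds shown are illustrative, not the project's actual constraint):

```elixir
# mix.exs fragment (illustrative bounds): accept the 0.14.x line
# of gen_stage alongside the previously supported versions.
defp deps do
  [
    {:gen_stage, "~> 0.12 or ~> 0.14"}
  ]
end
```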

Provide the option of tracking crawling statistics

Implementation detail ideas:

  • off by default
  • separate API function for it, like crawl_with_stats(source, parser_logic, options \\ []) :: {flow, ref}
  • create a :simple_one_for_one StatsSupervisor as part of Crawlie app
  • use the ref (a {:stats, PID} tuple?) to send the stats updates from the Flow and for retrieving the stats from the Supervisor children.
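One way the pieces above could hang together, sketched with an Agent standing in for a supervised stats child (everything here is hypothetical, not Crawlie's actual implementation): a per-crawl stats process is started, the `{:stats, pid}` tuple serves as the ref, and the Flow stages send counter updates through it:

```elixir
defmodule StatsSketch do
  # Hypothetical: one Agent per crawl accumulates counters; in the
  # real proposal it would be a child of a StatsSupervisor.
  def start do
    {:ok, pid} = Agent.start_link(fn -> %{fetched: 0, failed: 0} end)
    {:stats, pid}
  end

  # Called from the Flow stages to record an event.
  def bump({:stats, pid}, key) do
    Agent.update(pid, &Map.update!(&1, key, fn n -> n + 1 end))
  end

  # Called by the user to retrieve the stats for the ref.
  def snapshot({:stats, pid}), do: Agent.get(pid, & &1)
end
```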

Eliminating duplicate urls

Make UrlManager reject urls it already processed.

Bear in mind that a url leaving the UrlManager doesn't mean it was processed; it can still be retried.
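The retry caveat suggests tracking two sets rather than one. A hypothetical sketch (names and state shape are illustrative): uris that have left the manager but may still be retried stay "in flight", and only uris that have finished for good cause rejection:

```elixir
defmodule DedupSketch do
  # Hypothetical bookkeeping: `in_flight` uris may still be retried;
  # `processed` uris have succeeded or exhausted their retries.
  defstruct in_flight: MapSet.new(), processed: MapSet.new()

  # A newly discovered uri is accepted only if it is neither
  # being worked on nor already done.
  def accept?(%__MODULE__{} = s, uri) do
    not MapSet.member?(s.in_flight, uri) and not MapSet.member?(s.processed, uri)
  end

  def start(%__MODULE__{} = s, uri), do: %{s | in_flight: MapSet.put(s.in_flight, uri)}

  # Called when the uri succeeds or exhausts its retries.
  def finish(%__MODULE__{} = s, uri) do
    %{s | in_flight: MapSet.delete(s.in_flight, uri), processed: MapSet.put(s.processed, uri)}
  end
end
```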

Rate Limiting

Just starting to wrap my head around this library, as it could do what I'm looking for, but I'm struggling to find any form of rate limiting: some way to limit the number of requests per second, or how quickly items get dealt with from the queue.

I could be entirely missing where that can be done, but thought I'd ask as there doesn't seem to be an issue about it.
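There doesn't appear to be a built-in option for this. As a workaround, requests could be funneled through a small pacing process; the sketch below (hypothetical, not part of Crawlie) is a GenServer that spaces out permits so callers are granted at most `per_second` of them a second:

```elixir
defmodule RateLimiterSketch do
  # Hypothetical pacing process: each fetch calls wait/1 before
  # issuing its request; permits are spaced by a fixed interval.
  use GenServer

  def start_link(per_second), do: GenServer.start_link(__MODULE__, per_second)

  def wait(pid), do: GenServer.call(pid, :permit, :infinity)

  @impl true
  def init(per_second), do: {:ok, %{interval_ms: div(1000, per_second), last: nil}}

  @impl true
  def handle_call(:permit, _from, %{interval_ms: interval, last: last} = s) do
    now = System.monotonic_time(:millisecond)
    # The next grant happens no sooner than `interval` ms after the last one.
    next = if last, do: max(now, last + interval), else: now
    Process.sleep(next - now)
    {:reply, :ok, %{s | last: next}}
  end
end
```

Sleeping inside `handle_call` deliberately serializes callers, which is exactly the throttling effect wanted here, though it makes this process a global bottleneck by design.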

Make it possible to pass some information from the parent page

Currently, each page is considered on its own. This isn't a flexible enough model when a page's data is important in the context of its parent page. For example, you're crawling a website to download all the images it links to, but want to know the anchor text for each of them.

The solution would be to have ParserLogic.extract_uris() return {tag :: term, uri :: URI.t} | URI.t, where the latter would be converted to {nil, uri}, and have the tag passed as an additional argument to ParserLogic.parse(). The user could then choose whether to include the parent tag in the parse_result for their convenience.

This will require some manual labour, but shouldn't be too bad ;)
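The normalization step described above could be as small as this (a sketch of the proposal, with an illustrative module name):

```elixir
defmodule TagSketch do
  # Hypothetical normalization: extract_uris may return either a
  # bare %URI{} or a {tag, uri} tuple; bare uris get a nil tag
  # before being passed on to parse.
  def normalize({tag, %URI{} = uri}), do: {tag, uri}
  def normalize(%URI{} = uri), do: {nil, uri}
end
```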
