
Comments (3)

petermeissner commented on June 4, 2024

The package has now had a big overhaul that tackles the problems reported here in what I think is an adequate, robust, and backward-compatible fashion. @dmi3kno, if you find the time to have a look and check whether things work for you, that would be most appreciated.

from robotstxt.

petermeissner commented on June 4, 2024

version 0.7.x

While previous releases were concerned with implementing parsing and permission checking and with improving performance, the 0.7.x release is foremost about robots.txt retrieval. Retrieval was already implemented, but there are corner cases in the retrieval stage that can very well influence how the granted permissions are interpreted.

Features and Problems handled:

  • now handles corner cases of retrieving robots.txt files
  • e.g. if no robots.txt file is available, this basically means "you can scrape it all"
  • but there are further corner cases (what if there is a server error, what if redirection takes place, what if redirection takes place to a different domain, what if a file is returned but it is not parsable or is in HTML or JSON format, ...)
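
The "file is returned but is not really a robots.txt" case can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the package's implementation: the function name `guess_content_kind` and the heuristics it uses are hypothetical.

```python
import json


def guess_content_kind(text: str) -> str:
    """Roughly classify a response body as 'json', 'html', 'xml', or 'robots'.

    Hypothetical heuristic for illustration only; the real package may
    use different checks.
    """
    stripped = text.lstrip()
    head = stripped[:200].lower()

    # An XML declaration is the clearest sign of XML.
    if head.startswith("<?xml"):
        return "xml"
    # An HTML doctype or an <html> tag near the top suggests an HTML page,
    # e.g. a custom error page served instead of robots.txt.
    if head.startswith("<!doctype html") or "<html" in head:
        return "html"
    # Bodies that start like a JSON object or array and parse as JSON.
    if stripped[:1] in ("{", "["):
        try:
            json.loads(stripped)
            return "json"
        except ValueError:
            pass
    # Any other markup-looking body is treated as XML-like.
    if stripped.startswith("<"):
        return "xml"
    return "robots"
```

Anything other than `"robots"` would count as suspicious content, and it would then be up to the retrieval logic to decide how to react.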

Design Decisions

  1. the whole HTTP request-response-chain is checked for certain event/state types
    a. server error
    b. client error
    c. file not found (404)
    d. redirection
    e. redirection to another domain
  2. the content returned by the HTTP request is checked for
    a. mime type / file type specification mismatch
    b. suspicious content (file content seems to be JSON, HTML, or XML instead of robots.txt)
  3. state/event handlers define how these states and events are handled
  4. a handler of handlers executes the rules defined in the individual handlers
  5. handlers can be overwritten
  6. handler defaults are defined so that they should always do the right thing
  7. handlers can ...
    a. overwrite the content of a robots.txt file (e.g. allow/disallow all)
    b. modify how problems should be signaled: error, warning, message, none
    c. determine whether the retrieved robots.txt file should be cached or not
  8. problems (no matter how they were handled) are attached to the robots.txt as attributes, allowing for ...
    a. transparency
    b. reacting post-mortem to the problems that occurred
  9. all handlers (even the actual execution of the HTTP request) can be overwritten at runtime to inject user-defined behaviour beforehand
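
The design decisions above can be illustrated with a small sketch. It is written in Python purely for illustration and is not the package's code: the names `Handler`, `Result`, `detect_events`, `apply_handlers`, and the default rules are all assumptions. The only default taken from this thread is that a missing robots.txt means everything may be scraped.

```python
import warnings
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple
from urllib.parse import urlsplit

ALLOW_ALL = "User-agent: *\nAllow: /\n"
DISALLOW_ALL = "User-agent: *\nDisallow: /\n"


@dataclass
class Handler:
    signal: str = "warning"                 # "error", "warning", "message", "none"
    override_content: Optional[str] = None  # replace the robots.txt content
    cache: bool = True                      # whether the result may be cached


@dataclass
class Result:
    content: str
    cache: bool = True
    problems: List[str] = field(default_factory=list)  # kept for later inspection


# Assumed defaults for illustration; users can overwrite any of them.
DEFAULT_HANDLERS: Dict[str, Handler] = {
    "server_error": Handler("error", DISALLOW_ALL, cache=False),
    "client_error": Handler("warning", ALLOW_ALL),
    "not_found": Handler("none", ALLOW_ALL),
    "redirect": Handler("none"),
    "domain_change": Handler("warning", ALLOW_ALL),
}


def detect_events(chain: List[Tuple[int, str]]) -> List[str]:
    """Inspect a request-response chain given as (status, url) pairs."""
    events = []
    final_status, final_url = chain[-1]
    if len(chain) > 1:
        events.append("redirect")
        if urlsplit(final_url).hostname != urlsplit(chain[0][1]).hostname:
            events.append("domain_change")
    if final_status == 404:
        events.append("not_found")
    elif 400 <= final_status < 500:
        events.append("client_error")
    elif final_status >= 500:
        events.append("server_error")
    return events


def apply_handlers(content: str, events: List[str],
                   handlers: Optional[Dict[str, Handler]] = None) -> Result:
    """The 'handler of handlers': run the rule for each detected event."""
    rules = {**DEFAULT_HANDLERS, **(handlers or {})}
    result = Result(content=content)
    for event in events:
        rule = rules[event]
        result.problems.append(event)
        if rule.override_content is not None:
            result.content = rule.override_content
        result.cache = result.cache and rule.cache
        if rule.signal == "error":
            raise RuntimeError(f"robots.txt retrieval problem: {event}")
        if rule.signal == "warning":
            warnings.warn(f"robots.txt retrieval problem: {event}")
        elif rule.signal == "message":
            print(f"robots.txt retrieval problem: {event}")
    return result
```

Passing a dictionary such as `{"server_error": Handler("none", DISALLOW_ALL, cache=False)}` as `handlers` replaces only that rule and leaves the other defaults in place, which mirrors the idea that individual handlers can be overwritten at runtime.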


petermeissner commented on June 4, 2024

done

