Comments (3)
The package has now had a big overhaul that tackles the problems reported here in what I think is an adequate, robust, and backward-compatible fashion. @dmi3kno, if you find the time to have a look and check whether things work for you, that would be most appreciated.
from robotstxt.
version 0.7.x
While previous releases were concerned with implementing parsing and permission checking and with improving performance, the 0.7.x release is foremost about robots.txt retrieval. Retrieval was already implemented, but there are corner cases in the retrieval stage that may very well influence how the granted permissions are interpreted.
Features and Problems handled:
- now handles corner cases of retrieving robots.txt files
  - e.g. if no robots.txt file is available this basically means "you can scrape it all"
  - but there are further corner cases (what if there is a server error, what if redirection takes place, what if redirection takes place to a different domain, what if a file is returned but it is not parsable or is of format HTML or JSON, ...)
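To make the corner cases above concrete, here is a minimal sketch of how a request-response chain could be classified into events. It is written in Python purely for illustration; robotstxt itself is an R package, and the `Response` class, the `classify_chain` function, and the event names are hypothetical and are not the package's API. The content check is a deliberately crude assumption.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class Response:
    """One hop of a request-response chain (hypothetical, for illustration)."""
    url: str
    status: int
    content_type: str = "text/plain"
    body: str = ""


def classify_chain(chain):
    """Return the set of event names detected in a list of Response hops."""
    events = set()
    first, last = chain[0], chain[-1]

    # Status-based events are read from the final response.
    if last.status >= 500:
        events.add("server_error")
    elif last.status == 404:
        events.add("not_found")
    elif last.status >= 400:
        events.add("client_error")

    # More than one hop, or any 3xx status, means a redirect happened.
    if len(chain) > 1 or any(300 <= r.status < 400 for r in chain):
        events.add("redirect")
    if urlparse(first.url).hostname != urlparse(last.url).hostname:
        events.add("domain_change")

    # Content checks only make sense when a file was actually returned.
    if last.status == 200:
        if not last.content_type.startswith("text/plain"):
            events.add("file_type_mismatch")
        # Crude sniffing (an assumption): markup or JSON is not robots.txt.
        if last.body.lstrip().startswith(("<", "{", "[")):
            events.add("suspect_content")
    return events
```

A chain that redirects to another host and ends in an HTML page would yield several events at once, which is why the events are collected in a set rather than returned as a single state.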
Design Decisions
- the whole HTTP request-response chain is checked for certain event/state types
  a. server error
  b. client error
  c. file not found (404)
  d. redirection
  e. redirection to another domain
- the content returned by the HTTP request is checked against
  a. mime type / file type specification mismatch
  b. suspicious content (file content seems to be JSON, HTML, or XML instead of robots.txt)
- state/event handlers define how these states and events are handled
- a handler handler executes the rules defined in individual handlers
- handlers can be overwritten
- handler defaults are defined so that they should always do the right thing
- handlers can ...
  a. overwrite the content of a robots.txt file (e.g. allow/disallow all)
  b. modify how problems should be signaled: error, warning, message, none
  c. decide whether robots.txt file retrieval should be cached or not
- problems (no matter how they were handled) are attached to the retrieved robots.txt as attributes, allowing for ...
  a. transparency
  b. reacting post-mortem to the problems that occurred
- all handlers (even the actual execution of the HTTP request) can be overwritten at runtime to inject user-defined behaviour beforehand
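The handler design described above can be sketched roughly as follows. This is an illustration in Python, not the package's R implementation: the names `rule`, `emit`, `handle_events`, and `DEFAULT_RULES`, as well as the concrete default values and priorities, are assumptions made up for this example.

```python
import warnings

ALLOW_ALL = "User-agent: *\nAllow: /\n"
DISALLOW_ALL = "User-agent: *\nDisallow: /\n"


def rule(overwrite_with=None, signal="warning", cache=True, priority=0):
    """Build one handler rule: content override, signaling mode, caching."""
    return {"overwrite_with": overwrite_with, "signal": signal,
            "cache": cache, "priority": priority}


# Hypothetical defaults; the values are illustrative assumptions only.
DEFAULT_RULES = {
    "not_found": rule(ALLOW_ALL, "warning", True, 1),
    "redirect": rule(None, "none", True, 3),
    "domain_change": rule(ALLOW_ALL, "warning", True, 4),
    "file_type_mismatch": rule(ALLOW_ALL, "warning", True, 6),
    "suspect_content": rule(ALLOW_ALL, "warning", True, 7),
    "client_error": rule(ALLOW_ALL, "warning", True, 19),
    "server_error": rule(DISALLOW_ALL, "error", False, 20),
}


def emit(kind, text):
    """Signal a problem as error, warning, message, or not at all."""
    if kind == "error":
        raise RuntimeError(text)
    if kind == "warning":
        warnings.warn(text)
    elif kind == "message":
        print(text)
    # kind == "none": stay silent


def handle_events(content, events, overrides=None):
    """The 'handler handler': apply the rule of every detected event.

    Returns (content, problems, cache). Rules are applied in ascending
    priority, so the highest-priority rule has the final say on content.
    """
    rules = dict(DEFAULT_RULES)
    rules.update(overrides or {})  # user-supplied rules replace defaults
    problems, cache = {}, True
    for event in sorted(events, key=lambda e: rules[e]["priority"]):
        r = rules[event]
        problems[event] = r  # recorded no matter how it is signaled
        if r["overwrite_with"] is not None:
            content = r["overwrite_with"]
        cache = cache and r["cache"]
        emit(r["signal"], f"robots.txt retrieval event: {event}")
    return content, problems, cache
```

Returning `problems` alongside the content plays the role that attributes play in the description above: the caller can inspect what happened after the fact. Passing, say, `{"server_error": rule(DISALLOW_ALL, "none", False, 20)}` as `overrides` silences the error while keeping the conservative disallow-all content.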
done
Related Issues (20)
- problem info for on_domain_change is not informative enough
- improve test coverage
- test that caching works
- test that file overwrite works
- fine tune messaging and warnings
- Event on_redirect resulting in bad behaviour
- GOV.UK Crawl-delay
- paths_allowed gives error if www is included in URL
- Partial matching warnings
- Save cached/normal as attribute
- Case-sensitive robots.txt results in incorrect crawl delay
- Guess domain name with hyphen(s) correctly
- Improve validity check: treat error messages as invalid
- Parsing would fail for comment in last line
- New Maintainer Wanted :-)
- add r-cmd-check action
- remove unnecessary and legacy files
- CRAN package spiderbar and its reverse dependencies
- CRAN: Error(s) in re-building vignettes