scrapy / protego

A pure-Python robots.txt parser with support for modern conventions.

License: BSD 3-Clause "New" or "Revised" License

Topics: robots-txt, robots-parser, python, hacktoberfest

protego's Introduction

Protego


Protego is a pure-Python robots.txt parser with support for modern conventions.

Install

To install Protego, simply use pip:

pip install protego

Usage

>>> from protego import Protego
>>> robotstxt = """
... User-agent: *
... Disallow: /
... Allow: /about
... Allow: /account
... Disallow: /account/contact$
... Disallow: /account/*/profile
... Crawl-delay: 4
... Request-rate: 10/1m                 # 10 requests every 1 minute
...
... Sitemap: http://example.com/sitemap-index.xml
... Host: http://example.co.in
... """
>>> rp = Protego.parse(robotstxt)
>>> rp.can_fetch("http://example.com/profiles", "mybot")
False
>>> rp.can_fetch("http://example.com/about", "mybot")
True
>>> rp.can_fetch("http://example.com/account", "mybot")
True
>>> rp.can_fetch("http://example.com/account/myuser/profile", "mybot")
False
>>> rp.can_fetch("http://example.com/account/contact", "mybot")
False
>>> rp.crawl_delay("mybot")
4.0
>>> rp.request_rate("mybot")
RequestRate(requests=10, seconds=60, start_time=None, end_time=None)
>>> list(rp.sitemaps)
['http://example.com/sitemap-index.xml']
>>> rp.preferred_host
'http://example.co.in'

Using Protego with Requests:

>>> from protego import Protego
>>> import requests
>>> r = requests.get("https://google.com/robots.txt")
>>> rp = Protego.parse(r.text)
>>> rp.can_fetch("https://google.com/search", "mybot")
False
>>> rp.can_fetch("https://google.com/search/about", "mybot")
True
>>> list(rp.sitemaps)
['https://www.google.com/sitemap.xml']

Comparison

The following table compares Protego to the most popular robots.txt parsers implemented in Python or featuring Python bindings:

|                          | Protego | RobotFileParser             | Reppy  | Robotexclusionrulesparser |
|--------------------------|---------|-----------------------------|--------|---------------------------|
| Implementation language  | Python  | Python                      | C++    | Python                    |
| Reference specification  | Google  | Martijn Koster’s 1996 draft |        |                           |
| Wildcard support         | ✓       |                             | ✓      | ✓                         |
| Length-based precedence  | ✓       |                             | ✓      |                           |
| Performance              |         | +40%                        | +1300% | -25%                      |
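The performance figures appear to be relative to Protego (the pyre2 issue below quotes the same +40% figure for RobotFileParser). A rough way to run such a comparison yourself, sketched here with timeit; the robots.txt body and iteration count are arbitrary:

import timeit
from urllib.robotparser import RobotFileParser
from protego import Protego

body = "User-agent: *\nDisallow: /admin\nAllow: /admin/public\n"

def run_protego():
    rp = Protego.parse(body)
    rp.can_fetch("https://example.com/admin/secret", "mybot")

def run_stdlib():
    rp = RobotFileParser()
    rp.parse(body.splitlines())
    rp.can_fetch("mybot", "https://example.com/admin/secret")

print("Protego:        ", timeit.timeit(run_protego, number=10_000))
print("RobotFileParser:", timeit.timeit(run_stdlib, number=10_000))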

API Reference

Class protego.Protego:

Properties

  • sitemaps {list_iterator} A list of sitemaps specified in robots.txt.
  • preferred_host {string} Preferred host specified in robots.txt.

Methods

  • parse(robotstxt_body) Parse robots.txt and return a new instance of protego.Protego.
  • can_fetch(url, user_agent) Return True if the user agent can fetch the URL, otherwise return False.
  • crawl_delay(user_agent) Return the crawl delay specified for the user agent as a float. If nothing is specified, return None.
  • request_rate(user_agent) Return the request rate specified for the user agent as a named tuple RequestRate(requests, seconds, start_time, end_time). If nothing is specified, return None.
  • visit_time(user_agent) Return the visit time specified for the user agent as a named tuple VisitTime(start_time, end_time). If nothing is specified, return None.
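A brief sketch of visit_time(), the one method not shown in the Usage section above. The HHMM-HHMM directive format and the exact field types are assumptions, so treat this as illustrative only:

from protego import Protego

rp = Protego.parse("User-agent: *\nVisit-time: 0200-0630\n")
window = rp.visit_time("mybot")   # VisitTime(start_time, end_time), or None if unspecified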

protego's People

Contributors

akx, anubhavp28, baotlake, felixonmars, gallaecio, jeroenseegers, kmike, laerte, maramsumanth, noviluni, sseveran, tjlaboss, vmruiz, whalebot-helmsman, wrar


protego's Issues

Non-encoding of URL queries causing Regex robots.txt rule matches to be missed

Hi, we have recently switched from Reppy to Protego as our robots.txt parser. All seems fine, except that we noticed a few differences between Reppy and Protego in the URLs we are crawling: essentially, Protego appeared to be allowing access to URLs that should be blocked. Since Protego follows the Google specification and Reppy does not, some differences are to be expected. However, the official Google robots.txt Tester also blocks access to these URLs, so there seems to be an error here.

The rule in the robots.txt file that appeared to be ignored is /en-uk/*q=*relevance*, and an example of a URL that was not filtered by it is /en-uk/new-wildlife-range/c/4018?q=%3Arelevance%3Atype%3AHedgehog%2BFood. Google's robots.txt Tester reports that this URL should be blocked by that rule.

Having looked at the Protego code, we believe that we have found where this apparent error comes from. We also think we have a fix for it, and will happily submit the fix for your scrutiny as we'd like to know if there are unforeseen consequences from this change.

The problem involves the ASCII hex-encoding of the URL string. Protego splits the URL into parts, e.g.:

scheme='http', netloc='www.website.com', path='/en-uk/new-wildlife-range/c/4018', params='', query='q=%3Arelevance%3Atype%3AHedgehog%2BFood', fragment=''

It then encodes symbols in the "path" part, removes the "scheme" and "netloc" parts, and reassembles the URL to compare against all the rules in robots.txt. The issue we're seeing is that only the symbols in the "path" part are encoded; the "query" part is left alone.
We end up with this as the URL to be checked:
/en-uk/new-wildlife-range/c/4018?q=%3Arelevance%3Atype%3AHedgehog%2BFood
When a regex search is applied to it using the pattern /en-uk/.*?q%3D.*?relevance.*?, no match is found, because the = in the URL hasn't been encoded to %3D.
The fix we have is simple: it just encodes the "query" part in the same way as the "path" part. We then end up with this URL:
/en-uk/new-wildlife-range/c/4018?q%3D%3Arelevance%3Atype%3AHedgehog%2BFood
This URL matches the regex pattern correctly, and crawler access is blocked.
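A rough sketch of the change we have in mind (the helper name and exact quoting rules here are ours, for illustration, not Protego's internals):

from urllib.parse import urlparse, quote

def normalize_for_matching(url):
    parts = urlparse(url)
    # The path part is already percent-encoded before matching today;
    # the proposal is to give the query part the same treatment.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="%")   # '=' becomes %3D, existing %XX escapes are kept
    return path + ("?" + query if query else "")

normalize_for_matching(
    "http://www.website.com/en-uk/new-wildlife-range/c/4018"
    "?q=%3Arelevance%3Atype%3AHedgehog%2BFood"
)
# '/en-uk/new-wildlife-range/c/4018?q%3D%3Arelevance%3Atype%3AHedgehog%2BFood'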

Is this likely to cause any unforeseen issues?
Thanks

Use `pyre2` as an optional dependency for a regex speedup

Just throwing out a far-future idea.

I've seen that your lib is 40% slower than RobotFileParser on Python versions < 3.13. I suspect this is because of re module compilation and matching.

pyre2 is a drop-in replacement for re that is faster for simple patterns, which are exactly what robots.txt relies on. pyre2 falls back to re when a pattern uses regex features it doesn't support (such as lookarounds), but that won't be the case here.

My claims about the potential speedup should of course be tested against your lib, but I think the idea is worth considering.
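A minimal sketch of the optional-dependency pattern being proposed (module names assumed: the pyre2 package is imported as re2), falling back to the standard library when it isn't installed:

try:
    import re2 as re   # pyre2: RE2 bindings with an re-compatible API
except ImportError:
    import re          # stdlib fallback

# robots.txt rules compile to simple patterns like this one
rule = re.compile(r"/account/.*?/profile")
print(bool(rule.search("/account/myuser/profile")))   # True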

Disallowing / does not work when the target URL path is missing

>>> from protego import Protego
>>> robots_txt = "User-Agent: *\nDisallow: /\n"
>>> robots_txt_parser = Protego.parse(robots_txt)
>>> robots_txt_parser.can_fetch("http://example.com/", "mybot")
False
>>> robots_txt_parser.can_fetch("http://example.com", "mybot")
True
>>> 

Both calls should return False, since the / path is implicit if a URL has no path.
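A minimal sketch of the normalization this implies (the helper name is ours, for illustration only):

from urllib.parse import urlparse

def effective_path(url):
    # A URL with no path component implicitly requests "/".
    path = urlparse(url).path
    return path or "/"

effective_path("http://example.com")    # '/'
effective_path("http://example.com/")   # '/'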

python-protego fails to build with Python 3.12: AttributeError: 'TestProtego' object has no attribute 'assertEquals'.

Bug 2175156: python-protego fails to build with Python 3.12.0a5.

=================================== FAILURES ===================================
_____________________ TestProtego.test_sitemaps_come_first _____________________

self = <test_protego.TestProtego testMethod=test_sitemaps_come_first>

def test_sitemaps_come_first(self):
    """Some websites have sitemaps before any robots directives"""
    content = ("Sitemap: https://www.foo.bar/sitmap.xml\n"
               "User-Agent: FootBot\n"
               "Disallow: /something")
    rp = Protego.parse(content=content)
  self.assertEquals(list(rp.sitemaps), ["https://www.foo.bar/sitmap.xml"])

E AttributeError: 'TestProtego' object has no attribute 'assertEquals'. Did you mean: 'assertEqual'?

tests/test_protego.py:1055: AttributeError
=========================== short test summary info ============================
FAILED tests/test_protego.py::TestProtego::test_sitemaps_come_first - Attribu...
======================== 1 failed, 4337 passed in 5.62s ========================

This failure is caused by a change documented in the Python 3.12 changelog (linked below), which removed many old deprecated unittest features:

  • A number of TestCase method aliases:

    | Deprecated alias        | Method Name            | Deprecated in |
    |-------------------------|------------------------|---------------|
    | failUnless              | assertTrue()           | 3.1           |
    | failIf                  | assertFalse()          | 3.1           |
    | failUnlessEqual         | assertEqual()          | 3.1           |
    | failIfEqual             | assertNotEqual()       | 3.1           |
    | failUnlessAlmostEqual   | assertAlmostEqual()    | 3.1           |
    | failIfAlmostEqual       | assertNotAlmostEqual() | 3.1           |
    | failUnlessRaises        | assertRaises()         | 3.1           |
    | assert_                 | assertTrue()           | 3.2           |
    | assertEquals            | assertEqual()          | 3.2           |
    | assertNotEquals         | assertNotEqual()       | 3.2           |
    | assertAlmostEquals      | assertAlmostEqual()    | 3.2           |
    | assertNotAlmostEquals   | assertNotAlmostEqual() | 3.2           |
    | assertRegexpMatches     | assertRegex()          | 3.2           |
    | assertRaisesRegexp      | assertRaisesRegex()    | 3.2           |
    | assertNotRegexpMatches  | assertNotRegex()       | 3.5           |

You can use https://github.com/isidentical/teyit to automatically modernise your unit tests.

  • Undocumented and broken TestCase method assertDictContainsSubset (deprecated in Python 3.2).
  • Undocumented TestLoader.loadTestsFromModule parameter use_load_tests (deprecated and ignored since Python 3.2).
  • An alias of the TextTestResult class: _TextTestResult (deprecated in Python 3.2).

(Contributed by Serhiy Storchaka in bpo-45162.)
https://bugs.python.org/issue?@action=redirect&bpo=45162
https://docs.python.org/3.12/whatsnew/3.12.html
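On the Protego side, the fix is to switch the removed aliases to their current names, e.g. for the assertion from the traceback above:

# Before (alias removed in Python 3.12):
self.assertEquals(list(rp.sitemaps), ["https://www.foo.bar/sitmap.xml"])
# After:
self.assertEqual(list(rp.sitemaps), ["https://www.foo.bar/sitmap.xml"])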

For the build logs, see:
https://copr-be.cloud.fedoraproject.org/results/@python/python3.12/fedora-rawhide-x86_64/05576684-python-protego/

For all our attempts to build python-protego with Python 3.12, see:
https://copr.fedorainfracloud.org/coprs/g/python/python3.12/package/python-protego/

Testing and mass rebuild of packages is happening in copr. You can follow these instructions to test locally in mock whether your package builds with Python 3.12:
https://copr.fedorainfracloud.org/coprs/g/python/python3.12/

Let us know here if you have any questions.

Python 3.12 is planned to be included in Fedora 39. To make that update smoother, we're building Fedora packages with all pre-releases of Python 3.12.
A build failure prevents us from testing all dependent packages (transitive [Build]Requires), so if this package is required a lot, it's important for us to get it fixed soon.
We'd appreciate help from the people who know this package best, but if you don't want to work on this now, let us know so we can try to work around it on our side.

URLs starting with a double slash are misinterpreted

When analyzing the following robots.txt, Protego parses the directive Disallow: //debug/* as if it were /*

User-agent: *
Disallow: //debug/*

This is due to the following line of code:

parts = urlparse(pattern)

The problem is that urlparse does not parse the URL as expected (i.e. as a path) and takes "debug" as the hostname:

from urllib.parse import urlparse
print(urlparse("//debug/*"))
### result: ParseResult(scheme='', netloc='debug', path='/*', params='', query='', fragment='')

According to Google's official documentation, the Allow and Disallow directives must be followed by relative paths starting with a / character.

Therefore, I see two possible solutions:

  1. avoid using urlparse on directives' patterns
  2. replace the initial double slash with a single slash

Option 1
As is:

protego/src/protego.py

Lines 185 to 186 in 45e1948

parts = urlparse(pattern)
pattern = self._unquote(parts.path, ignore="/*$%")

To be:

pattern = self._unquote(pattern, ignore="/*$%")

Option 2
Add a re.sub at the beginning of the following method:

protego/src/protego.py

Lines 90 to 93 in 45e1948

def _prepare_pattern_for_regex(self, pattern):
    """Return equivalent regex pattern for the given URL pattern."""
    pattern = re.sub(r"\*+", "*", pattern)
    s = re.split(r"(\*|\$$)", pattern)

pattern = re.sub(r"^[/]{2,}", "*", pattern)

Accept robots.txt as bytes

>>> from protego import Protego
>>> robots_txt = b"User-Agent: *\nDisallow: /\n"
>>> robots_txt_parser = Protego.parse(robots_txt)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/adrian/temporal/venv/lib/python3.9/site-packages/protego.py", line 310, in parse
    o._parse_robotstxt(content)
  File "/home/adrian/temporal/venv/lib/python3.9/site-packages/protego.py", line 327, in _parse_robotstxt
    hash_pos = line.find('#')
TypeError: argument should be integer or bytes-like object, not 'str'
>>> robots_txt = "User-Agent: *\nDisallow: /\n"
>>> robots_txt_parser = Protego.parse(robots_txt)
>>>
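A workaround until bytes input is supported is to decode explicitly before parsing (a sketch, assuming UTF-8 content):

from protego import Protego

robots_txt = b"User-Agent: *\nDisallow: /\n"
robots_txt_parser = Protego.parse(robots_txt.decode("utf-8"))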

six usage

As you seem to have dropped Python 2 support, would you consider dropping the six usage as well?

Protego differs from reppy in handling of wildcards for GET-params

I'm looking to replace Reppy with something that is easier to install and maintain. We have some unit tests for our usage of Reppy, some of which check that wildcards are handled correctly (whatever 'correct' may mean here). One failing test checks the behaviour of wildcards in GET parameters: Reppy disallows that URL, while Protego allows it.

Could you shed some light on this? Is this something that should and can be fixed in Protego?

In [1]: from reppy.robots import Robots

In [2]: from protego import Protego

In [3]: robots_txt = """User-agent: *
   ...: Disallow: /*s=
   ...: """

In [4]: reppy = Robots.parse('', robots_txt)

In [5]: protego = Protego.parse(robots_txt)

In [6]: urls = ['https://mysite/', 'https://mysite/s/', 'https://mysite/?s=asd']

In [7]: [reppy.allowed(url, '*') for url in urls]
Out[7]: [True, True, False]

In [8]: [protego.can_fetch(url, '*') for url in urls]
Out[8]: [True, True, True]

Support for visit-time

I have a site using the visit-time attribute in its robots.txt. Any chance support for this could be implemented?
If not, are there any suggestions on how to work around this in Scrapy?

Colons in file names prevent installation on NTFS

Installing protego 0.1.16 via Conda produced the following error:

InvalidArchiveError("Error with archive /home/share/conda/miniconda3/pkgs/protego-0.1.16-py_0.tar.bz2.  
You probably need to delete and re-download or re-create this file.  Message from libarchive was:\n\nCan't create 'info/test/tests/test_data/www.weather.info:443'")

It turned out the file protego-0.1.16-py_0.tar.bz2 couldn't be extracted on NTFS due to the colons in the file names:

tar -xjf protego-0.1.16-py_0.tar.bz2 
tar: info/test/tests/test_data/www.weather.info\:443: Cannot open: Invalid argument
tar: info/test/tests/test_data/www.bmf.gv.at\:443: Cannot open: Invalid argument
tar: info/test/tests/test_data/www.nd.edu\:443: Cannot open: Invalid argument
...
tar: info/test/tests/test_data/www.airarabia.com\:443: Cannot open: Invalid argument
tar: info/test/tests/test_data/www.broadcom.com\:443: Cannot open: Invalid argument
tar: info/test/tests/test_data/www.pakwheels.com\:443: Cannot open: Invalid argument
tar: Exiting with failure status due to previous errors

No issues were encountered in a Conda environment on an ext4 filesystem, on the same machine.
