scrapy / protego

A pure-Python robots.txt parser with support for modern conventions.

License: BSD 3-Clause "New" or "Revised" License

Topics: robots-txt, robots-parser, python, hacktoberfest

protego's Introduction

Protego


Protego is a pure-Python robots.txt parser with support for modern conventions.

Install

To install Protego, simply use pip:

pip install protego

Usage

>>> from protego import Protego
>>> robotstxt = """
... User-agent: *
... Disallow: /
... Allow: /about
... Allow: /account
... Disallow: /account/contact$
... Disallow: /account/*/profile
... Crawl-delay: 4
... Request-rate: 10/1m                 # 10 requests every 1 minute
...
... Sitemap: http://example.com/sitemap-index.xml
... Host: http://example.co.in
... """
>>> rp = Protego.parse(robotstxt)
>>> rp.can_fetch("http://example.com/profiles", "mybot")
False
>>> rp.can_fetch("http://example.com/about", "mybot")
True
>>> rp.can_fetch("http://example.com/account", "mybot")
True
>>> rp.can_fetch("http://example.com/account/myuser/profile", "mybot")
False
>>> rp.can_fetch("http://example.com/account/contact", "mybot")
False
>>> rp.crawl_delay("mybot")
4.0
>>> rp.request_rate("mybot")
RequestRate(requests=10, seconds=60, start_time=None, end_time=None)
>>> list(rp.sitemaps)
['http://example.com/sitemap-index.xml']
>>> rp.preferred_host
'http://example.co.in'

Using Protego with Requests:

>>> from protego import Protego
>>> import requests
>>> r = requests.get("https://google.com/robots.txt")
>>> rp = Protego.parse(r.text)
>>> rp.can_fetch("https://google.com/search", "mybot")
False
>>> rp.can_fetch("https://google.com/search/about", "mybot")
True
>>> list(rp.sitemaps)
['https://www.google.com/sitemap.xml']

Comparison

The following table compares Protego to the most popular robots.txt parsers implemented in Python or featuring Python bindings:

|                          | Protego | RobotFileParser             | Reppy  | Robotexclusionrulesparser |
|--------------------------|---------|-----------------------------|--------|---------------------------|
| Implementation language  | Python  | Python                      | C++    | Python                    |
| Reference specification  | Google  | Martijn Koster’s 1996 draft |        |                           |
| Wildcard support         | ✓       |                             | ✓      | ✓                         |
| Length-based precedence  | ✓       |                             | ✓      |                           |
| Performance              |         | +40%                        | +1300% | -25%                      |
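The performance figures appear to be relative to Protego (the pyre2 issue below quotes the same +40% figure for RobotFileParser). A rough way to run such a comparison yourself, sketched here with timeit; the robots.txt body and iteration count are arbitrary:

import timeit
from urllib.robotparser import RobotFileParser
from protego import Protego

body = "User-agent: *\nDisallow: /admin\nAllow: /admin/public\n"

def run_protego():
    rp = Protego.parse(body)
    rp.can_fetch("https://example.com/admin/secret", "mybot")

def run_stdlib():
    rp = RobotFileParser()
    rp.parse(body.splitlines())
    rp.can_fetch("mybot", "https://example.com/admin/secret")

print("Protego:        ", timeit.timeit(run_protego, number=10_000))
print("RobotFileParser:", timeit.timeit(run_stdlib, number=10_000))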

API Reference

Class protego.Protego:

Properties

  • sitemaps {list_iterator} A list of sitemaps specified in robots.txt.
  • preferred_host {string} Preferred host specified in robots.txt.

Methods

  • parse(robotstxt_body) Parse robots.txt and return a new instance of protego.Protego.
  • can_fetch(url, user_agent) Return True if the user agent can fetch the URL, otherwise return False.
  • crawl_delay(user_agent) Return the crawl delay specified for the user agent as a float. If nothing is specified, return None.
  • request_rate(user_agent) Return the request rate specified for the user agent as a named tuple RequestRate(requests, seconds, start_time, end_time). If nothing is specified, return None.
  • visit_time(user_agent) Return the visit time specified for the user agent as a named tuple VisitTime(start_time, end_time). If nothing is specified, return None.
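A brief sketch of visit_time(), the one method not shown in the Usage section above. The HHMM-HHMM directive format and the exact field types are assumptions, so treat this as illustrative only:

from protego import Protego

rp = Protego.parse("User-agent: *\nVisit-time: 0200-0630\n")
window = rp.visit_time("mybot")   # VisitTime(start_time, end_time), or None if unspecified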

protego's People

Contributors

akx, anubhavp28, baotlake, felixonmars, gallaecio, jeroenseegers, kmike, laerte, maramsumanth, noviluni, sseveran, tjlaboss, vmruiz, whalebot-helmsman, wrar


protego's Issues

Non-encoding of URL queries causing Regex robots.txt rule matches to be missed

Hi, we have recently switched from Reppy to Protego as our robots.txt parser. All seems fine, except that we noticed a few differences between Reppy and Protego in the URLs we are crawling: essentially, Protego appeared to be allowing access to URLs that should be blocked. Since Protego follows the Google specification and Reppy does not, some differences are to be expected. However, the official Google robots.txt Tester also blocks access to these URLs, so there seems to be an error here.

The rule in the robots.txt file that appeared to be ignored is /en-uk/*q=*relevance*, and an example of a URL that was not filtered by it is /en-uk/new-wildlife-range/c/4018?q=%3Arelevance%3Atype%3AHedgehog%2BFood. Google's robots.txt Tester reports that this URL should be blocked by that rule.

Having looked at the Protego code, we believe that we have found where this apparent error comes from. We also think we have a fix for it, and will happily submit the fix for your scrutiny as we'd like to know if there are unforeseen consequences from this change.

The problem involves the ASCII hex-encoding of the URL string. Protego splits the URL into parts, e.g.:

scheme='http', netloc='www.website.com', path='/en-uk/new-wildlife-range/c/4018', params='', query='q=%3Arelevance%3Atype%3AHedgehog%2BFood', fragment=''

It then encodes symbols in the "path" part, removes the "scheme" and "netloc" parts, and reassembles the URL to compare against all the rules in robots.txt. The issue we're seeing is that only the symbols in the "path" part are encoded; the "query" part is left alone.
We end up with this as the URL to be checked:
/en-uk/new-wildlife-range/c/4018?q=%3Arelevance%3Atype%3AHedgehog%2BFood
When a regex search is applied to it using the pattern /en-uk/.*?q%3D.*?relevance.*?, no match is found, because the = in the URL hasn't been encoded to %3D.
The fix we have is simple: it just encodes the "query" part in the same way as the "path" part. We then end up with this URL:
/en-uk/new-wildlife-range/c/4018?q%3D%3Arelevance%3Atype%3AHedgehog%2BFood
This URL matches the regex pattern correctly, and crawler access is blocked.
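A rough sketch of the change we have in mind (the helper name and exact quoting rules here are ours, for illustration, not Protego's internals):

from urllib.parse import urlparse, quote

def normalize_for_matching(url):
    parts = urlparse(url)
    # The path part is already percent-encoded before matching today;
    # the proposal is to give the query part the same treatment.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="%")   # '=' becomes %3D, existing %XX escapes are kept
    return path + ("?" + query if query else "")

normalize_for_matching(
    "http://www.website.com/en-uk/new-wildlife-range/c/4018"
    "?q=%3Arelevance%3Atype%3AHedgehog%2BFood"
)
# '/en-uk/new-wildlife-range/c/4018?q%3D%3Arelevance%3Atype%3AHedgehog%2BFood'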

Is this likely to cause any unforeseen issues?
Thanks

Use `pyre2` as an optional dependency for a regex speedup

Just throwing out a far-future idea.

I've seen that your lib is 40% slower than RobotFileParser on Python versions < 3.13. I suspect this is because of re module compilation and matching.

pyre2 is a drop-in replacement for re that is faster for simple patterns, which are exactly what robots.txt relies on. pyre2 falls back to re when a pattern uses regex features it doesn't support (such as lookarounds), but that won't be the case here.

My claims about the potential speedup should of course be tested against your lib, but I think the idea is worth considering.
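A minimal sketch of the optional-dependency pattern being proposed (module names assumed: the pyre2 package is imported as re2), falling back to the standard library when it isn't installed:

try:
    import re2 as re   # pyre2: RE2 bindings with an re-compatible API
except ImportError:
    import re          # stdlib fallback

# robots.txt rules compile to simple patterns like this one
rule = re.compile(r"/account/.*?/profile")
print(bool(rule.search("/account/myuser/profile")))   # True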

Disallowing / does not work when the target URL path is missing

>>> from protego import Protego
>>> robots_txt = "User-Agent: *\nDisallow: /\n"
>>> robots_txt_parser = Protego.parse(robots_txt)
>>> robots_txt_parser.can_fetch("http://example.com/", "mybot")
False
>>> robots_txt_parser.can_fetch("http://example.com", "mybot")
True
>>> 

Both calls should return False, since the / path is implicit if a URL has no path.
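A minimal sketch of the normalization this implies (the helper name is ours, for illustration only):

from urllib.parse import urlparse

def effective_path(url):
    # A URL with no path component implicitly requests "/".
    path = urlparse(url).path
    return path or "/"

effective_path("http://example.com")    # '/'
effective_path("http://example.com/")   # '/'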

python-protego fails to build with Python 3.12: AttributeError: 'TestProtego' object has no attribute 'assertEquals'.

Bug 2175156: python-protego fails to build with Python 3.12.0a5.

=================================== FAILURES ===================================
_____________________ TestProtego.test_sitemaps_come_first _____________________

self = <test_protego.TestProtego testMethod=test_sitemaps_come_first>

def test_sitemaps_come_first(self):
    """Some websites have sitemaps before any robots directives"""
    content = ("Sitemap: https://www.foo.bar/sitmap.xml\n"
               "User-Agent: FootBot\n"
               "Disallow: /something")
    rp = Protego.parse(content=content)
  self.assertEquals(list(rp.sitemaps), ["https://www.foo.bar/sitmap.xml"])

E AttributeError: 'TestProtego' object has no attribute 'assertEquals'. Did you mean: 'assertEqual'?

tests/test_protego.py:1055: AttributeError
=========================== short test summary info ============================
FAILED tests/test_protego.py::TestProtego::test_sitemaps_come_first - Attribu...
======================== 1 failed, 4337 passed in 5.62s ========================

This failure is caused by a change documented in the Python 3.12 changelog (linked below), which removed many old deprecated unittest features:

  • A number of TestCase method aliases:

    | Deprecated alias        | Method Name            | Deprecated in |
    |-------------------------|------------------------|---------------|
    | failUnless              | assertTrue()           | 3.1           |
    | failIf                  | assertFalse()          | 3.1           |
    | failUnlessEqual         | assertEqual()          | 3.1           |
    | failIfEqual             | assertNotEqual()       | 3.1           |
    | failUnlessAlmostEqual   | assertAlmostEqual()    | 3.1           |
    | failIfAlmostEqual       | assertNotAlmostEqual() | 3.1           |
    | failUnlessRaises        | assertRaises()         | 3.1           |
    | assert_                 | assertTrue()           | 3.2           |
    | assertEquals            | assertEqual()          | 3.2           |
    | assertNotEquals         | assertNotEqual()       | 3.2           |
    | assertAlmostEquals      | assertAlmostEqual()    | 3.2           |
    | assertNotAlmostEquals   | assertNotAlmostEqual() | 3.2           |
    | assertRegexpMatches     | assertRegex()          | 3.2           |
    | assertRaisesRegexp      | assertRaisesRegex()    | 3.2           |
    | assertNotRegexpMatches  | assertNotRegex()       | 3.5           |

You can use https://github.com/isidentical/teyit to automatically modernise your unit tests.

  • Undocumented and broken TestCase method assertDictContainsSubset (deprecated in Python 3.2).
  • Undocumented TestLoader.loadTestsFromModule parameter use_load_tests (deprecated and ignored since Python 3.2).
  • An alias of the TextTestResult class: _TextTestResult (deprecated in Python 3.2).

(Contributed by Serhiy Storchaka in bpo-45162.)
https://bugs.python.org/issue?@action=redirect&bpo=45162
https://docs.python.org/3.12/whatsnew/3.12.html
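On the Protego side, the fix is to switch the removed aliases to their current names, e.g. for the assertion from the traceback above:

# Before (alias removed in Python 3.12):
self.assertEquals(list(rp.sitemaps), ["https://www.foo.bar/sitmap.xml"])
# After:
self.assertEqual(list(rp.sitemaps), ["https://www.foo.bar/sitmap.xml"])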

For the build logs, see:
https://copr-be.cloud.fedoraproject.org/results/@python/python3.12/fedora-rawhide-x86_64/05576684-python-protego/

For all our attempts to build python-protego with Python 3.12, see:
https://copr.fedorainfracloud.org/coprs/g/python/python3.12/package/python-protego/

Testing and mass rebuild of packages is happening in copr. You can follow these instructions to test locally in mock whether your package builds with Python 3.12:
https://copr.fedorainfracloud.org/coprs/g/python/python3.12/

Let us know here if you have any questions.

Python 3.12 is planned to be included in Fedora 39. To make that update smoother, we're building Fedora packages with all pre-releases of Python 3.12.
A build failure prevents us from testing all dependent packages (transitive [Build]Requires), so if this package is required a lot, it's important for us to get it fixed soon.
We'd appreciate help from the people who know this package best, but if you don't want to work on this now, let us know so we can try to work around it on our side.

URLs starting with a double slash are misinterpreted

When analyzing the following robots.txt, Protego parses the directive Disallow: //debug/* as if it were /*

User-agent: *
Disallow: //debug/*

This is due to the following line of code:

parts = urlparse(pattern)

The problem is that urlparse does not parse the URL as expected (i.e. as a path) and takes "debug" as the hostname:

from urllib.parse import urlparse
print(urlparse("//debug/*"))
### result: ParseResult(scheme='', netloc='debug', path='/*', params='', query='', fragment='')

According to Google's official documentation, the Allow and Disallow directives must be followed by relative paths starting with a / character.

Therefore, I see two possible solutions:

  1. avoid using urlparse on directives' patterns
  2. replace the initial double slash with a single slash

Option 1
As is:

protego/src/protego.py

Lines 185 to 186 in 45e1948

parts = urlparse(pattern)
pattern = self._unquote(parts.path, ignore="/*$%")

To be:

pattern = self._unquote(pattern, ignore="/*$%")

Option 2
Add a re.sub at the beginning of the following method:

protego/src/protego.py

Lines 90 to 93 in 45e1948

def _prepare_pattern_for_regex(self, pattern):
    """Return equivalent regex pattern for the given URL pattern."""
    pattern = re.sub(r"\*+", "*", pattern)
    s = re.split(r"(\*|\$$)", pattern)

pattern = re.sub(r"^[/]{2,}", "*", pattern)

Accept robots.txt as bytes

>>> from protego import Protego
>>> robots_txt = b"User-Agent: *\nDisallow: /\n"
>>> robots_txt_parser = Protego.parse(robots_txt)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/adrian/temporal/venv/lib/python3.9/site-packages/protego.py", line 310, in parse
    o._parse_robotstxt(content)
  File "/home/adrian/temporal/venv/lib/python3.9/site-packages/protego.py", line 327, in _parse_robotstxt
    hash_pos = line.find('#')
TypeError: argument should be integer or bytes-like object, not 'str'
>>> robots_txt = "User-Agent: *\nDisallow: /\n"
>>> robots_txt_parser = Protego.parse(robots_txt)
>>>
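A workaround until bytes input is supported is to decode explicitly before parsing (a sketch, assuming UTF-8 content):

from protego import Protego

robots_txt = b"User-Agent: *\nDisallow: /\n"
robots_txt_parser = Protego.parse(robots_txt.decode("utf-8"))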

six usage

As you seem to have dropped Python 2 support, would you consider dropping the six usage as well?

Protego differs from reppy in handling of wildcards for GET-params

I'm looking to replace Reppy with something that is easier to install and maintain. We have some unit tests for our usage of Reppy, some of which check that wildcards are handled correctly (whatever 'correct' may mean here). One failing test checks the behaviour of wildcards in GET parameters: Reppy disallows that URL, while Protego allows it.

Could you shed some light on this? Is this something that should and can be fixed in Protego?

In [1]: from reppy.robots import Robots

In [2]: from protego import Protego

In [3]: robots_txt = """User-agent: *
   ...: Disallow: /*s=
   ...: """

In [4]: reppy = Robots.parse('', robots_txt)

In [5]: protego = Protego.parse(robots_txt)

In [6]: urls = ['https://mysite/', 'https://mysite/s/', 'https://mysite/?s=asd']

In [7]: [reppy.allowed(url, '*') for url in urls]
Out[7]: [True, True, False]

In [8]: [protego.can_fetch(url, '*') for url in urls]
Out[8]: [True, True, True]

Support for visit-time

I have a site using the visit-time attribute in its robots.txt. Any chance support for this could be implemented?
If not, are there any suggestions on how to work around this in Scrapy?

Colons in file names prevent installation on NTFS

Installing protego 0.1.16 via Conda produced the following error:

InvalidArchiveError("Error with archive /home/share/conda/miniconda3/pkgs/protego-0.1.16-py_0.tar.bz2.  
You probably need to delete and re-download or re-create this file.  Message from libarchive was:\n\nCan't create 'info/test/tests/test_data/www.weather.info:443'")

It turned out the file protego-0.1.16-py_0.tar.bz2 couldn't be extracted on NTFS due to the colons in the file names:

tar -xjf protego-0.1.16-py_0.tar.bz2 
tar: info/test/tests/test_data/www.weather.info\:443: Cannot open: Invalid argument
tar: info/test/tests/test_data/www.bmf.gv.at\:443: Cannot open: Invalid argument
tar: info/test/tests/test_data/www.nd.edu\:443: Cannot open: Invalid argument
...
tar: info/test/tests/test_data/www.airarabia.com\:443: Cannot open: Invalid argument
tar: info/test/tests/test_data/www.broadcom.com\:443: Cannot open: Invalid argument
tar: info/test/tests/test_data/www.pakwheels.com\:443: Cannot open: Invalid argument
tar: Exiting with failure status due to previous errors

No issues were encountered in a Conda environment on an ext4 filesystem, on the same machine.
