Protego is a pure-Python robots.txt parser with support for modern
conventions.
To install Protego, simply use pip:
pip install protego
>>> from protego import Protego
>>> robotstxt = """
... User-agent: *
... Disallow: /
... Allow: /about
... Allow: /account
... Disallow: /account/contact$
... Disallow: /account/*/profile
... Crawl-delay: 4
... Request-rate: 10/1m # 10 requests every 1 minute
...
... Sitemap: http://example.com/sitemap-index.xml
... Host: http://example.co.in
... """
>>> rp = Protego.parse(robotstxt)
>>> rp.can_fetch("http://example.com/profiles", "mybot")
False
>>> rp.can_fetch("http://example.com/about", "mybot")
True
>>> rp.can_fetch("http://example.com/account", "mybot")
True
>>> rp.can_fetch("http://example.com/account/myuser/profile", "mybot")
False
>>> rp.can_fetch("http://example.com/account/contact", "mybot")
False
>>> rp.crawl_delay("mybot")
4.0
>>> rp.request_rate("mybot")
RequestRate(requests=10, seconds=60, start_time=None, end_time=None)
>>> list(rp.sitemaps)
['http://example.com/sitemap-index.xml']
>>> rp.preferred_host
'http://example.co.in'

Using Protego with Requests:

>>> from protego import Protego
>>> import requests
>>> r = requests.get("https://google.com/robots.txt")
>>> rp = Protego.parse(r.text)
>>> rp.can_fetch("https://google.com/search", "mybot")
False
>>> rp.can_fetch("https://google.com/search/about", "mybot")
True
>>> list(rp.sitemaps)
['https://www.google.com/sitemap.xml']
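
Putting the two pieces together, here is a minimal sketch of a polite fetch loop that checks `can_fetch` and honours `Crawl-delay` before each request. It is not part of Protego itself; the target URLs and the `mybot` user agent string are made up for illustration:

```python
import time

import requests
from protego import Protego

# Hypothetical user agent and site, for illustration only.
USER_AGENT = "mybot"
robots = requests.get("https://example.com/robots.txt").text
rp = Protego.parse(robots)

# Honour the crawl delay if the site specifies one; fall back to no delay.
delay = rp.crawl_delay(USER_AGENT) or 0

for url in ["https://example.com/about", "https://example.com/account/contact"]:
    if not rp.can_fetch(url, USER_AGENT):
        continue  # skip URLs that robots.txt disallows for this user agent
    requests.get(url, headers={"User-Agent": USER_AGENT})
    time.sleep(delay)  # wait between requests to respect Crawl-delay
```
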
The following table compares Protego to the most popular robots.txt parsers
implemented in Python or featuring Python bindings:

|                                    | Protego  | RobotFileParser             | Reppy  | Robotexclusionrulesparser |
|------------------------------------|----------|-----------------------------|--------|---------------------------|
| Implementation language            | Python   | Python                      | C++    | Python                    |
| Reference specification            | Google's | Martijn Koster’s 1996 draft |        |                           |
| Wildcard support                   | ✓        |                             | ✓      | ✓                         |
| Length-based precedence            | ✓        |                             | ✓      |                           |
| Performance (relative to Protego)  |          | +40%                        | +1300% | -25%                      |
Class `protego.Protego`:

Properties:

* `sitemaps` {list_iterator} A list of sitemaps specified in robots.txt.
* `preferred_host` {string} Preferred host specified in robots.txt.

Methods:

* `parse(robotstxt_body)` Parse robots.txt and return a new instance of `protego.Protego`.
* `can_fetch(url, user_agent)` Return True if the user agent can fetch the URL, otherwise return False.
* `crawl_delay(user_agent)` Return the crawl delay specified for the user agent as a float. If nothing is specified, return None.
* `request_rate(user_agent)` Return the request rate specified for the user agent as a named tuple `RequestRate(requests, seconds, start_time, end_time)`. If nothing is specified, return None.
* `visit_time(user_agent)` Return the visit time specified for the user agent as a named tuple `VisitTime(start_time, end_time)`. If nothing is specified, return None.
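
To show how these methods compose, here is a small hedged sketch that summarises the throttling hints a robots.txt gives a particular user agent. The `throttle_policy` helper and the sample robots.txt are illustrative assumptions, not part of Protego; the helper only wraps the documented `crawl_delay`, `request_rate` and `visit_time` methods and their None defaults:

```python
from protego import Protego

# A tiny robots.txt without Request-rate or Visit-time directives, so the
# corresponding methods return None (per the API described above).
rp = Protego.parse("User-agent: *\nDisallow: /private\nCrawl-delay: 2\n")

def throttle_policy(rp, user_agent):
    """Collect the throttling hints robots.txt gives a user agent."""
    return {
        "crawl_delay": rp.crawl_delay(user_agent),    # float or None
        "request_rate": rp.request_rate(user_agent),  # RequestRate or None
        "visit_time": rp.visit_time(user_agent),      # VisitTime or None
    }

print(throttle_policy(rp, "mybot"))
# Expected to print roughly:
# {'crawl_delay': 2.0, 'request_rate': None, 'visit_time': None}
```
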