Protego is a pure-Python robots.txt parser with support for modern
conventions.
To install Protego, simply use pip:
pip install protego
>>> from protego import Protego
>>> robotstxt = """
... User-agent: *
... Disallow: /
... Allow: /about
... Allow: /account
... Disallow: /account/contact$
... Disallow: /account/*/profile
... Crawl-delay: 4
... Request-rate: 10/1m # 10 requests every 1 minute
...
... Sitemap: http://example.com/sitemap-index.xml
... Host: http://example.co.in
... """
>>> rp = Protego.parse(robotstxt)
>>> rp.can_fetch("http://example.com/profiles", "mybot")
False
>>> rp.can_fetch("http://example.com/about", "mybot")
True
>>> rp.can_fetch("http://example.com/account", "mybot")
True
>>> rp.can_fetch("http://example.com/account/myuser/profile", "mybot")
False
>>> rp.can_fetch("http://example.com/account/contact", "mybot")
False
>>> rp.crawl_delay("mybot")
4.0
>>> rp.request_rate("mybot")
RequestRate(requests=10, seconds=60, start_time=None, end_time=None)
>>> list(rp.sitemaps)
['http://example.com/sitemap-index.xml']
>>> rp.preferred_host
'http://example.co.in'

Using Protego with Requests:

>>> from protego import Protego
>>> import requests
>>> r = requests.get("https://google.com/robots.txt")
>>> rp = Protego.parse(r.text)
>>> rp.can_fetch("https://google.com/search", "mybot")
False
>>> rp.can_fetch("https://google.com/search/about", "mybot")
True
>>> list(rp.sitemaps)
['https://www.google.com/sitemap.xml']
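
Putting the two pieces together, here is a minimal sketch of a polite fetch loop that checks `can_fetch` and honours `Crawl-delay` before each request. It is not part of Protego itself; the target URLs and the `mybot` user agent string are made up for illustration:

```python
import time

import requests
from protego import Protego

# Hypothetical user agent and site, for illustration only.
USER_AGENT = "mybot"
robots = requests.get("https://example.com/robots.txt").text
rp = Protego.parse(robots)

# Honour the crawl delay if the site specifies one; fall back to no delay.
delay = rp.crawl_delay(USER_AGENT) or 0

for url in ["https://example.com/about", "https://example.com/account/contact"]:
    if not rp.can_fetch(url, USER_AGENT):
        continue  # skip URLs that robots.txt disallows for this user agent
    requests.get(url, headers={"User-Agent": USER_AGENT})
    time.sleep(delay)  # wait between requests to respect Crawl-delay
```
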
The following table compares Protego to the most popular robots.txt parsers
implemented in Python or featuring Python bindings:

|                                    | Protego  | RobotFileParser             | Reppy  | Robotexclusionrulesparser |
|------------------------------------|----------|-----------------------------|--------|---------------------------|
| Implementation language            | Python   | Python                      | C++    | Python                    |
| Reference specification            | Google's | Martijn Koster’s 1996 draft |        |                           |
| Wildcard support                   | ✓        |                             | ✓      | ✓                         |
| Length-based precedence            | ✓        |                             | ✓      |                           |
| Performance (relative to Protego)  |          | +40%                        | +1300% | -25%                      |
Class `protego.Protego`:

Properties:

* `sitemaps` {list_iterator} A list of sitemaps specified in robots.txt.
* `preferred_host` {string} Preferred host specified in robots.txt.

Methods:

* `parse(robotstxt_body)` Parse robots.txt and return a new instance of `protego.Protego`.
* `can_fetch(url, user_agent)` Return True if the user agent can fetch the URL, otherwise return False.
* `crawl_delay(user_agent)` Return the crawl delay specified for the user agent as a float. If nothing is specified, return None.
* `request_rate(user_agent)` Return the request rate specified for the user agent as a named tuple `RequestRate(requests, seconds, start_time, end_time)`. If nothing is specified, return None.
* `visit_time(user_agent)` Return the visit time specified for the user agent as a named tuple `VisitTime(start_time, end_time)`. If nothing is specified, return None.
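
To show how these methods compose, here is a small hedged sketch that summarises the throttling hints a robots.txt gives a particular user agent. The `throttle_policy` helper and the sample robots.txt are illustrative assumptions, not part of Protego; the helper only wraps the documented `crawl_delay`, `request_rate` and `visit_time` methods and their None defaults:

```python
from protego import Protego

# A tiny robots.txt without Request-rate or Visit-time directives, so the
# corresponding methods return None (per the API described above).
rp = Protego.parse("User-agent: *\nDisallow: /private\nCrawl-delay: 2\n")

def throttle_policy(rp, user_agent):
    """Collect the throttling hints robots.txt gives a user agent."""
    return {
        "crawl_delay": rp.crawl_delay(user_agent),    # float or None
        "request_rate": rp.request_rate(user_agent),  # RequestRate or None
        "visit_time": rp.visit_time(user_agent),      # VisitTime or None
    }

print(throttle_policy(rp, "mybot"))
# Expected to print roughly:
# {'crawl_delay': 2.0, 'request_rate': None, 'visit_time': None}
```
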