Collection of tools to bulk-download a large corpus of domains, fetch and clean up their `robots.txt`, `ai.txt`, and `llms.txt`, analyze overlap and conflicts, and generate summary reports. The repository also includes an already gathered data set.
To use the data set, you first need to combine the three file parts. Run the following command in the directory with the split files:

```bash
cat 2025-07-05.tar.gz* > 2025-07-05.tar.gz
```

Afterwards, unpack the archive:

```bash
tar xf 2025-07-05.tar.gz
```

Now you can run your own analysis on the data.
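For example, here is a minimal sketch that counts how many domains ship which combination of the three files, assuming the archive unpacks into the same `files/split_*/<domain>/` layout the download scripts produce (adjust the root path to wherever you extracted it):

```python
from collections import Counter
from pathlib import Path

# Assumed layout: <root>/split_XXXXX/<domain>/{robots.txt,ai.txt,llms.txt}
root = Path("2025-07-05/files")  # adjust to the extracted location

counts = Counter()
for domain_dir in root.glob("split_*/*"):
    if not domain_dir.is_dir():
        continue
    present = tuple(sorted(
        name for name in ("robots.txt", "ai.txt", "llms.txt")
        if (domain_dir / name).is_file()
    ))
    counts[present] += 1

for combo, n in counts.most_common():
    print(f"{n:>8}  {'+'.join(combo) if combo else '(no files)'}")
```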
- Clone the repo:

  ```bash
  git clone https://github.com/TobiPeterG/robots-ai-permissions.git
  cd robots-ai-permissions
  ```

- Create & activate a virtualenv:

  ```bash
  python3 -m venv .venv
  source .venv/bin/activate
  ```

- Install Python dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- (For `fetch-all`) provide your ICANN CZDS credentials in `config.json` (see the sketch below):
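The exact key names expected by `01-fetch-all.py` are defined in the script itself; the snippet below only sketches the general shape, with hypothetical field names for the CZDS account e-mail and password:

```json
{
  "czds_username": "you@example.com",
  "czds_password": "your-czds-password"
}
```

Check the top of `01-fetch-all.py` for the keys it actually reads before relying on this.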
Purpose: Download the Tranco Top 1 Million list, extract public-suffix domains (PLDs), sort + dedupe, and split into 10,000-line chunks.

Usage:

```bash
./01-fetch-1M.py [--force]
```

- Inputs: none
- Flags: `--force` — ignore existing output and re-download/rebuild
- Outputs: under `txt_downloads/YYYY-MM-DD/`:
  - `domains_sorted.txt` — all unique, sorted PLDs
  - `splits/` — `split_00000.txt`, `split_00001.txt`, … (10,000 domains each)
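The "extract PLDs" step amounts to reducing every hostname to its registrable domain under the Public Suffix List. A minimal sketch of that reduction, using the `tldextract` package (the script may well use a different library or approach):

```python
import tldextract  # pip install tldextract

def to_pld(hostname: str) -> str | None:
    """Reduce a hostname to its registrable domain, e.g. 'a.b.example.co.uk' -> 'example.co.uk'."""
    ext = tldextract.extract(hostname.strip().lower())
    # registered_domain is empty for bare suffixes, IP addresses, or garbage lines
    return ext.registered_domain or None

plds = {to_pld(h) for h in ("www.example.com", "a.b.example.co.uk", "localhost")}
plds.discard(None)
print(sorted(plds))  # ['example.co.uk', 'example.com']
```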
Purpose: Fetch zone files from ICANN CZDS, CommonCrawl, Tranco and CitizenLab lists; extract PLDs; merge-sort + split.

Usage:

```bash
./01-fetch-all.py [--force]
```

- Requires `config.json` for CZDS credentials
- Inputs: none
- Outputs: under `txt_downloads/YYYY-MM-DD/`:
  - `zones/` — raw `.zone` files
  - `domains_by_zone/` — one `.txt` per zone file
  - `domains_sorted.txt` — merged unique PLDs
  - `splits/` — 10,000-line chunks
Purpose: For each `split_XXXXX.txt`, create a folder of per-domain subfolders and download `robots.txt`, `ai.txt`, and `llms.txt`.

Usage:

```bash
./02-download_splits.py [--force]
```

- Inputs: `txt_downloads/YYYY-MM-DD/splits/*.txt`
- Flags: `--force` — re-fetch even if the files already exist
- Outputs: under `txt_downloads/YYYY-MM-DD/files/`:

  ```
  split_00000/
    example.com/robots.txt
    example.com/ai.txt
    example.com/llms.txt
  split_00001/
  …
  ```
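At its core, the download step just requests three well-known paths per domain and stores whatever valid responses come back; a simplified sketch of that loop (the actual script will differ in details such as concurrency, retries, and error handling):

```python
from pathlib import Path
import requests  # pip install requests

FILES = ("robots.txt", "ai.txt", "llms.txt")

def fetch_domain(domain: str, out_root: Path) -> None:
    """Try to download robots.txt, ai.txt and llms.txt for a single domain."""
    out_dir = out_root / domain
    out_dir.mkdir(parents=True, exist_ok=True)
    for name in FILES:
        try:
            resp = requests.get(
                f"https://{domain}/{name}",
                timeout=10,
                headers={"User-Agent": "robots-ai-permissions-sketch"},  # illustrative UA string
            )
        except requests.RequestException:
            continue  # unreachable host, TLS error, timeout, ...
        if resp.status_code == 200 and resp.text.strip():
            (out_dir / name).write_text(resp.text, encoding="utf-8")

fetch_domain("example.com", Path("txt_downloads/sketch/files/split_00000"))
```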
Purpose: Validate and prune bad downloads: remove HTML pages masquerading as the requested file, `robots.txt`/`ai.txt` files without a `User-Agent` line, and `llms.txt` files that are not Markdown.

Usage:

```bash
./03-clean_downloads.py
```

- Inputs: downloaded files under `txt_downloads/YYYY-MM-DD/files/split_*/<domain>/{robots.txt,ai.txt,llms.txt}`
- Outputs: deletes invalid files in place and prints a summary of removals
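The three validity rules can be approximated in a few lines; a rough sketch (the checks in `03-clean_downloads.py` may be stricter or looser than this):

```python
import re

def looks_like_html(text: str) -> bool:
    """Heuristic: does the file start like an HTML page rather than a plain-text policy file?"""
    head = text.lstrip()[:200].lower()
    return head.startswith("<!doctype html") or head.startswith("<html") or "<head" in head

def is_valid_robots_or_ai(text: str) -> bool:
    # Must not be an HTML error page and must contain at least one User-agent line.
    if looks_like_html(text):
        return False
    return bool(re.search(r"^\s*user-agent\s*:", text, re.IGNORECASE | re.MULTILINE))

def is_valid_llms(text: str) -> bool:
    # llms.txt is expected to be Markdown; a heading is used here as a cheap proxy for that.
    return not looks_like_html(text) and bool(re.search(r"^#\s+\S", text, re.MULTILINE))

print(is_valid_robots_or_ai("User-agent: *\nDisallow: /private/\n"))      # True
print(is_valid_robots_or_ai("<!DOCTYPE html><html>404</html>"))           # False
print(is_valid_llms("# Example\n\n- [Docs](https://example.com/docs)\n")) # True
```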
Purpose: Scan all domain subfolders and record which of the three files exist for each domain.

Usage:

```bash
./04-analyze_downloads.py [--root txt_downloads] [--out analysis_output]
```

- Inputs: `--root`: download root (default `txt_downloads/`)
- Outputs: under `analysis_output/`:
  - `domain_files_map.csv` — mapping of `domain,files`
  - `plds_with_robots.txt`, `plds_with_ai_or_llms.txt`, `plds_with_no_files.txt`
Purpose: Summarize counts from `domain_files_map.csv`: how many domains have each combination of files.

Usage:

```bash
./05-summarize_counts.py [--analysis-dir analysis_output]
```

- Inputs: `analysis_output/domain_files_map.csv`
- Outputs: printed counts for each category
Purpose: Parse every domain’s `robots.txt` and `ai.txt` via `urllib.robotparser`, build a JSON map of per-UA “allow” and “disallow” lists.

Usage:

```bash
./06-map-permissions.py [--root txt_downloads] [--csv analysis_output/domain_files_map.csv] [--out permissions_map.json]
```

- Inputs: download root & CSV of domains with both files
- Outputs: `permissions_map.json` (per-domain, per-UA permission maps)
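For orientation, this is how an already downloaded `robots.txt` or `ai.txt` body can be fed into `urllib.robotparser` and queried per user agent; the documented interface answers "may this UA fetch this URL" questions, while extracting the raw allow/disallow lines additionally requires walking the file itself:

```python
from urllib.robotparser import RobotFileParser

def parse_robots(text: str) -> RobotFileParser:
    """Parse an already-downloaded robots.txt/ai.txt body instead of fetching it over the network."""
    rp = RobotFileParser()
    rp.parse(text.splitlines())
    return rp

# Illustrative content; in the pipeline the text comes from
# txt_downloads/YYYY-MM-DD/files/split_*/<domain>/robots.txt or ai.txt.
sample = "User-agent: GPTBot\nDisallow: /private/\n\nUser-agent: *\nAllow: /\n"
rp = parse_robots(sample)
for ua in ("GPTBot", "ClaudeBot", "*"):
    print(ua, "may fetch /private/docs:", rp.can_fetch(ua, "https://example.com/private/docs"))
```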
Purpose: Diff `robots.txt` vs `ai.txt` rules per UA, producing a JSON of shared vs unique rules.

Usage:

```bash
./07-diff-permissions.py
```

- Inputs: `permissions_map.json`
- Outputs: `permissions_diff.json`
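The diff itself is plain set arithmetic over the per-UA rule lists. A sketch of the comparison for a single domain, assuming a per-UA dict of `allow`/`disallow` path lists (check `permissions_map.json` for its exact shape):

```python
def diff_rules(robots_map: dict, ai_map: dict) -> dict:
    """Split per-UA allow/disallow paths into shared, robots-only and ai-only sets."""
    out = {}
    for ua in sorted(set(robots_map) | set(ai_map)):
        out[ua] = {}
        for kind in ("allow", "disallow"):
            r = set(robots_map.get(ua, {}).get(kind, []))
            a = set(ai_map.get(ua, {}).get(kind, []))
            out[ua][kind] = {
                "shared": sorted(r & a),
                "robots_only": sorted(r - a),
                "ai_only": sorted(a - r),
            }
    return out

print(diff_rules(
    {"GPTBot": {"allow": [], "disallow": ["/"]}},
    {"GPTBot": {"allow": ["/"], "disallow": []}},
))
```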
Purpose: Report line-level conflicts for known AI crawler UAs (e.g., GPTBot, ClaudeBot) between `robots.txt` and `ai.txt`.

Usage:

```bash
./08-find-ai-conflicts.py [--root txt_downloads] [--csv analysis_output/domain_files_map.csv] [--map permissions_map.json]
```

- Outputs: human-readable conflict report
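A conflict in this sense is a path that one file allows and the other blocks for the same crawler. One minimal way to express that check with `urllib.robotparser`, probing a few illustrative paths rather than reporting at line level as the script does:

```python
from urllib.robotparser import RobotFileParser

AI_UAS = ("GPTBot", "ClaudeBot")       # subset of the UAs the script knows about
TEST_PATHS = ("/", "/blog/", "/api/")  # illustrative probe paths

def parse_lines(text: str) -> RobotFileParser:
    rp = RobotFileParser()
    rp.parse(text.splitlines())
    return rp

def find_conflicts(robots_text: str, ai_text: str, domain: str):
    """Yield (UA, path, robots verdict, ai verdict) whenever the two files disagree."""
    robots, ai = parse_lines(robots_text), parse_lines(ai_text)
    for ua in AI_UAS:
        for path in TEST_PATHS:
            url = f"https://{domain}{path}"
            r_ok, a_ok = robots.can_fetch(ua, url), ai.can_fetch(ua, url)
            if r_ok != a_ok:
                yield ua, path, r_ok, a_ok

robots = "User-agent: GPTBot\nDisallow: /\n"
ai = "User-agent: GPTBot\nAllow: /\n"
for ua, path, r_ok, a_ok in find_conflicts(robots, ai, "example.com"):
    print(f"{ua} on {path}: robots.txt allows={r_ok}, ai.txt allows={a_ok}")
```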
Purpose: Scan for the experimental directives `DisallowAITraining:` or `Content-Usage:` in `robots.txt` and `ai.txt`.

Usage:

```bash
./09-find-exp-directives.py [--root txt_downloads] [--csv analysis_output/domain_files_map.csv]
```

- Outputs: table of domain, file, directive, value, and line
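Neither directive is part of the classic robots.txt grammar, so a plain line scan is enough to find them; roughly like this, with made-up sample values (the script's matching may differ in detail):

```python
import re

# Case-insensitive match for the two experimental directives and their values.
DIRECTIVE_RE = re.compile(r"^\s*(DisallowAITraining|Content-Usage)\s*:\s*(.*)$", re.IGNORECASE)

def scan_text(text: str):
    """Yield (line number, directive, value) for every experimental directive found."""
    for lineno, line in enumerate(text.splitlines(), 1):
        if m := DIRECTIVE_RE.match(line):
            yield lineno, m.group(1), m.group(2).strip()

sample = "User-agent: *\nDisallow: /private/\nDisallowAITraining: yes\nContent-Usage: ai=n\n"
for lineno, directive, value in scan_text(sample):
    print(f"line {lineno}: {directive} = {value!r}")
```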
Purpose: Check every `llms.txt` link target against `robots.txt` and `ai.txt` blocking rules.

Usage:

```bash
./10-compare-llms.py [--root txt_downloads] [--csv analysis_output/domain_files_map.csv]
```

- Outputs: list of conflicts (domain, line, URL, blocking file)
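The check boils down to pulling every link out of the Markdown and asking the robots/ai parser whether a crawler could fetch that URL. A compressed sketch, checking a single illustrative user agent and using a simple regex for link extraction (real Markdown can be messier):

```python
import re
from urllib.robotparser import RobotFileParser

MD_LINK_RE = re.compile(r"\[[^\]]*\]\((https?://[^)\s]+)\)")

def blocked_llms_links(llms_text: str, rules_text: str, ua: str = "GPTBot"):
    """Yield (line number, URL) for every llms.txt link the given rules file blocks for `ua`."""
    rp = RobotFileParser()
    rp.parse(rules_text.splitlines())
    for lineno, line in enumerate(llms_text.splitlines(), 1):
        for url in MD_LINK_RE.findall(line):
            if not rp.can_fetch(ua, url):
                yield lineno, url

llms = "# Docs\n- [API reference](https://example.com/private/api.md)\n"
robots = "User-agent: GPTBot\nDisallow: /private/\n"
for lineno, url in blocked_llms_links(llms, robots):
    print(f"line {lineno}: {url} is blocked")
```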
Purpose: Detect typos in UA strings and suggest corrections based on known AI crawler names.
Usage:

```bash
./11-typos.py [--csv analysis_output/domain_files_map.csv] [--map permissions_map.json]
```
- Outputs: table of domain, file, unknown UA, suggested correction
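Typo detection of this kind is essentially fuzzy matching against a list of known crawler names; the standard library's `difflib` is enough for a sketch (the known-UA list below is an illustrative subset, not the script's actual list):

```python
import difflib

KNOWN_AI_UAS = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot", "PerplexityBot"]

def suggest_correction(ua: str, cutoff: float = 0.8) -> str | None:
    """Return the closest known AI crawler name, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(ua, KNOWN_AI_UAS, n=1, cutoff=cutoff)
    return matches[0] if matches else None

for ua in ("GTPBot", "ClaudBot", "SomeRandomBot"):
    print(ua, "->", suggest_correction(ua))  # GPTBot, ClaudeBot, None
```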
Purpose: Aggregate counts of explicit UA allow/disallow occurrences and cross-file conflicts.
Usage:

```bash
./12-explicit-declarations.py [--csv analysis_output/domain_files_map.csv] [--map permissions_map.json]
```
- Outputs: summary table of UA counts and conflicts
Purpose: Enrich the analysis of domains with AI-specific files by fetching geolocation data (Whois, IP-API) and inferring industry from the TLD.

Usage:

```bash
python3 scripts/13-website-info.py [--csv analysis_output/domain_files_map.csv] [--workers N]
```

- Inputs: `analysis_output/domain_files_map.csv` with domains and file presence
- Optional `--workers` to set the number of concurrent lookup threads (default 16)
- Outputs:
  - Printed table with columns: Files, Domain, Country (Whois), Country (ip-api), Industry (TLD)
  - Country distribution summary by Whois
  - Industry distribution summary by TLD
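The IP-API half of the lookup is a single HTTP call per domain; a minimal sketch against the free `ip-api.com` JSON endpoint (which resolves hostnames itself, is HTTP-only on the free tier, and rate-limits unauthenticated use), with the Whois lookup and the TLD-to-industry mapping left out:

```python
import requests  # pip install requests

def country_from_ip_api(domain: str) -> str | None:
    """Look up a domain's country via ip-api.com (free tier: HTTP only, ~45 requests/min)."""
    resp = requests.get(
        f"http://ip-api.com/json/{domain}",
        params={"fields": "status,country,query"},
        timeout=10,
    )
    data = resp.json()
    return data.get("country") if data.get("status") == "success" else None

print(country_from_ip_api("example.com"))
```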
With this pipeline you can:

- Fetch large domain lists (Tranco or CZDS/CC/…)
- Download crawler directives and `llms.txt` hints
- Clean invalid downloads
- Analyze presence/absence
- Map and diff permissions
- Spot experimental AI-specific directives
- Check whether any `llms.txt` links into disallowed paths
- Catch typos in UA blocks
- Summarize explicit allow/disallow counts and conflicts
- Enrich domain metadata with country and industry information

Happy auditing!