Commit f0966d0

Merge branch 'r/1.2.0'
1 parent 9ac5c53 commit f0966d0

14 files changed, with 12601 additions and 253 deletions.

README.md

Lines changed: 42 additions & 22 deletions
@@ -1,4 +1,4 @@
-# archive wayback downloader
+# python wayback machine downloader

 [![PyPI](https://img.shields.io/pypi/v/pywaybackup)](https://pypi.org/project/pywaybackup/)
 [![PyPI - Downloads](https://img.shields.io/pypi/dm/pywaybackup)](https://pypi.org/project/pywaybackup/)
@@ -7,15 +7,17 @@

 Downloading archived web pages from the [Wayback Machine](https://archive.org/web/).

-Internet-archive is a nice source for several OSINT-information. This script is a work in progress to query and fetch archived web pages.
+The Internet Archive is a useful source of OSINT information. This tool is a work in progress for querying and fetching archived web pages.
+
+This tool allows you to download content from the Wayback Machine (archive.org). You can use it to download either the latest version or all versions of web page snapshots within a specified range.

 ## Installation

 ### Pip

 1. Install the package <br>
 ```pip install pywaybackup```
-2. Run the script <br>
+2. Run the tool <br>
 ```waybackup -h```

 ### Manual
@@ -26,30 +28,25 @@ Internet-archive is a nice source for several OSINT-information. This script is
 ```pip install .```
 - in a virtual env or use `--break-system-packages`

-## Usage
-
-This script allows you to download content from the Wayback Machine (archive.org). You can use it to download either the latest version or all versions of web page snapshots within a specified range.
-
-### Arguments
+## Arguments

 - `-h`, `--help`: Show the help message and exit.
-- `-a`, `--about`: Show information about the script and exit.
+- `-a`, `--about`: Show information about the tool and exit.

-#### Required Arguments
+### Required

 - `-u`, `--url`: The URL of the web page to download. This argument is required.

 #### Mode Selection (Choose One)
-
 - `-c`, `--current`: Download the latest version of each file snapshot. You will get a rebuild of the current website with all available files (but not any original state because new and old versions are mixed).
 - `-f`, `--full`: Download snapshots of all timestamps. You will get a folder per timestamp with the files available at that time.
 - `-s`, `--save`: Save a page to the Wayback Machine. (beta)

-#### Optional Arguments
+### Optional query parameters

 - `-l`, `--list`: Only print the snapshots available within the specified range. Does not download the snapshots.
 - `-e`, `--explicit`: Only download the explicit given url. No wildcard subdomains or paths. Use e.g. to get root-only snapshots.
-- `-o`, `--output`: The folder where downloaded files will be saved.
+- `-o`, `--output`: The folder where downloaded files will be saved. Defaults to `waybackup_snapshots` in the current directory.

 - **Range Selection:**<br>
 Specify the range in years or a specific timestamp either start, end or both. If you specify the `range` argument, the `start` and `end` arguments will be ignored. Format for timestamps: YYYYMMDDhhmmss. You can only give a year or increase specificity by going through the timestamp starting on the left.<br>
@@ -58,13 +55,36 @@ Specify the range in years or a specific timestamp either start, end or both. If
 - `--start`: Timestamp to start searching.
 - `--end`: Timestamp to end searching.
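
The left-to-right timestamp rule described for the range arguments can be illustrated with a small Python check. This is only a sketch: `is_partial_timestamp` and `describe_precision` are hypothetical helpers, not part of pywaybackup, and the assumption that precision grows one whole field at a time (year, month, day, hour, minute, second) is an interpretation of the `YYYYMMDDhhmmss` format, not confirmed behavior of the tool.

```python
# Hypothetical helpers illustrating the YYYYMMDDhhmmss prefix rule.
# Assumption: precision is added one whole field at a time, left to right.
FIELD_LENGTHS = (4, 6, 8, 10, 12, 14)  # year, +month, +day, +hour, +minute, +second
FIELD_NAMES = ("year", "month", "day", "hour", "minute", "second")


def is_partial_timestamp(value: str) -> bool:
    """Return True if value is a digit-only prefix ending on a field boundary."""
    return value.isascii() and value.isdigit() and len(value) in FIELD_LENGTHS


def describe_precision(value: str) -> str:
    """Name the most specific field that a partial timestamp specifies."""
    if not is_partial_timestamp(value):
        raise ValueError(f"not a valid partial timestamp: {value!r}")
    return FIELD_NAMES[FIELD_LENGTHS.index(len(value))]
```

Under these assumptions, `2019` and `20190315` would both be accepted, while `2019-03` or a nine-digit value would be rejected.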

-#### Additional
-
-- `--csv`: Save a csv file with the list of snapshots inside the output folder or a specified folder. If you set `--list` the csv will contain the cdx list of snapshots. If you set either `--current` or `--full` the csv will contain the downloaded files.
-- `--no-redirect`: Do not follow redirects of snapshots. Archive.org sometimes redirects to a different snapshot for several reasons. Downloading redirects may lead to timestamp-folders which contain some files with a different timestamp. This does not matter if you only want to download the latest version (`-c`).
-- `--verbosity`: Set the verbosity: json (print json response), progress (show progress bar).
-- `--retry`: Retry failed downloads. You can specify the number of retry attempts as an integer.
-- `--workers`: The number of workers to use for downloading (simultaneous downloads). Default is 1. A safe spot is about 10 workers. Beware: Using too many workers will lead into refused connections from the Wayback Machine. Duration about 1.5 minutes.
+### Additional behavior manipulation
+
+- **`--csv`** `<path>`:<br>
+Path defaults to output-dir. Saves a CSV file with the JSON response for successful downloads. If `--list` is set, the CSV contains the CDX list of snapshots. If `--current` or `--full` is set, the CSV contains the downloaded files. Named `waybackup_<sanitized_url>.csv`.
+
+- **`--skip`** `<path>`:<br>
+Path defaults to output-dir. Checks an existing `waybackup_<domain>.csv` for URLs whose download can be skipped. Useful for interrupted downloads. Files are checked by their root domain, ensuring consistency across queries. This means that if you download `http://example.com/subdir1/` and later `http://example.com`, the second query will skip the first path.
+
+- **`--no-redirect`**:<br>
+Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects.
+
+- **`--verbosity`** `<level>`:<br>
+Sets the verbosity level. Options are `json` (prints the JSON response) or `progress` (shows a progress bar).
+
+- **`--retry`** `<attempts>`:<br>
+Specifies the number of retry attempts for failed downloads.
+
+- **`--workers`** `<count>`:<br>
+Sets the number of simultaneous download workers. Default is 1; about 10 workers is a safe value. Be cautious, as too many workers may lead to refused connections from the Wayback Machine.
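
How `--retry` and `--workers` interact can be pictured with a short, self-contained sketch. It does not show pywaybackup's implementation: `call_with_retry`, `run_tasks`, and `FlakyTask` are hypothetical names, and a thread pool from the standard library stands in for whatever the tool uses internally.

```python
from concurrent.futures import ThreadPoolExecutor


def call_with_retry(task, attempts=0):
    """Run task(); on OSError retry up to `attempts` more times, then give up."""
    for _ in range(attempts + 1):
        try:
            return task()
        except OSError:
            continue
    return None  # every attempt failed


def run_tasks(tasks, workers=1, attempts=0):
    """Run callables on a pool of `workers` threads; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda task: call_with_retry(task, attempts), tasks))


class FlakyTask:
    """Stand-in for a download that is refused `failures` times before succeeding."""

    def __init__(self, failures, result):
        self.failures = failures
        self.result = result

    def __call__(self):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionRefusedError("simulated refused connection")
        return self.result
```

For example, `run_tasks([FlakyTask(2, "a"), FlakyTask(0, "b")], workers=2, attempts=2)` recovers both results, whereas a task that fails more often than `attempts` allows ends up as `None`.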
+
+**CDX Query Handling:**
+- **`--cdxbackup`** `<path>`:<br>
+Path defaults to output-dir. Saves the result of the CDX query as a file. Useful for downloading snapshots later and for working around connections refused by the CDX server after too many queries. Named `waybackup_<sanitized_url>.cdx`.
+
+- **`--cdxinject`** `<filepath>`:<br>
+Injects a CDX query file to download snapshots. Ensure the query matches the previous `--url` for correct folder structure.
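
The backup-then-inject workflow amounts to saving a query result once and reading it back later instead of querying the CDX server again. The sketch below shows that round trip under loud assumptions: the on-disk format (JSON), the sanitization rule in `sanitize_url`, and all function names are hypothetical and not taken from pywaybackup.

```python
import json
import os
import re
import tempfile


def sanitize_url(url: str) -> str:
    """Assumed sanitization: keep letters, digits, dots and dashes."""
    return re.sub(r"[^A-Za-z0-9.-]+", "_", url).strip("_")


def cdx_backup(rows, url, directory):
    """Write the CDX rows (header row first) to waybackup_<sanitized_url>.cdx."""
    path = os.path.join(directory, f"waybackup_{sanitize_url(url)}.cdx")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(rows, handle)  # assumption: the file holds plain JSON
    return path


def cdx_inject(path):
    """Read a previously saved CDX result back into a list of rows."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def round_trip(rows, url):
    """Save rows to a temporary directory and load them back."""
    with tempfile.TemporaryDirectory() as directory:
        return cdx_inject(cdx_backup(rows, url, directory))
```

The injected rows are only meaningful if they belong to the same `--url` as the original query, which is why the option description asks for matching queries.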
+
+### Debug
+
+- `--debug`: If set, the full traceback is printed when an error occurs. The full exception is written to `waybackup_error.log`.

 ### Examples

@@ -169,5 +189,5 @@ The csv contains the json response in a table format.

 ## Contributing

-I'm always happy for some feature requests to improve the usability of this script.
-Feel free to give suggestions and report issues. Project is still far from being perfect.
+I'm always happy for some feature requests to improve the usability of this tool.
+Feel free to give suggestions and report issues. Project is still far from being perfect.

dev/pip_build.sh

Lines changed: 2 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -4,11 +4,8 @@
44
SCRIPT_PATH="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
55
TARGET_PATH="$SCRIPT_PATH/.."
66

7-
# check if venv is activated
8-
if [ -z "$VIRTUAL_ENV" ]; then
9-
echo "Please activate your virtual environment"
10-
exit 1
11-
fi
7+
# install dependencies
8+
pip install twine wheel setuptools
129

1310
# build
1411
python $TARGET_PATH/setup.py sdist bdist_wheel --verbose

pywaybackup/Exception.py

Lines changed: 73 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,73 @@
1+
2+
import sys
3+
import os
4+
from datetime import datetime
5+
import linecache
6+
import traceback
7+
8+
class Exception:
9+
10+
new_debug = True
11+
debug = False
12+
output = None
13+
command = None
14+
15+
@classmethod
16+
def init(cls, debug=False, output=None, command=None):
17+
sys.excepthook = cls.exception_handler # set custom exception handler (uncaught exceptions)
18+
cls.output = output
19+
cls.command = command
20+
cls.debug = True if debug else False
21+
22+
@classmethod
23+
def exception(cls, message: str, e: Exception, tb=None):
24+
custom_tb = sys.exc_info()[-1] if tb is None else tb
25+
original_tb = "".join(traceback.format_exception(type(e), e, e.__traceback__))
26+
exception_message = (
27+
"-------------------------\n"
28+
f"!-- Exception: {message}\n"
29+
)
30+
if custom_tb is not None:
31+
while custom_tb.tb_next: # loop to last traceback frame
32+
custom_tb = custom_tb.tb_next
33+
tb_frame = custom_tb.tb_frame
34+
tb_line = custom_tb.tb_lineno
35+
func_name = tb_frame.f_code.co_name
36+
filename = tb_frame.f_code.co_filename
37+
codeline = linecache.getline(filename, tb_line).strip()
38+
exception_message += (
39+
f"!-- File: {filename}\n"
40+
f"!-- Function: {func_name}\n"
41+
f"!-- Line: {tb_line}\n"
42+
f"!-- Segment: {codeline}\n"
43+
)
44+
else:
45+
exception_message += "!-- Traceback is None\n"
46+
exception_message += (
47+
f"!-- Description: {e}\n"
48+
"-------------------------")
49+
print(exception_message)
50+
if cls.debug:
51+
debug_file = os.path.join(cls.output, "waybackup_error.log")
52+
print(f"Exception log: {debug_file}")
53+
print("-------------------------")
54+
print(f"Full traceback:\n{original_tb}")
55+
if cls.new_debug: # new run, overwrite file
56+
cls.new_debug = False
57+
f = open(debug_file, "w")
58+
f.write("-------------------------\n")
59+
f.write(f"Command: {cls.command}\n")
60+
f.write("-------------------------\n\n")
61+
else: # current run, append to file
62+
f = open(debug_file, "a")
63+
f.write(datetime.now().strftime("%Y-%m-%d %H:%M:%S") + "\n")
64+
f.write(exception_message + "\n")
65+
f.write(original_tb + "\n")
66+
67+
@staticmethod
68+
def exception_handler(exception_type, exception, traceback):
69+
if issubclass(exception_type, KeyboardInterrupt):
70+
sys.__excepthook__(exception_type, exception, traceback)
71+
return
72+
Exception.exception("UNCAUGHT EXCEPTION", exception, traceback) # uncaught exceptions also with custom scheme
73+
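
The pattern in `Exception.py` (summarize the innermost traceback frame, and route uncaught errors through `sys.excepthook`) can be tried in isolation with the sketch below. It is a simplified illustration, not the committed code: `describe_exception`, `handle_uncaught`, `failing_step`, and `demo` are hypothetical names, and `traceback.extract_tb` replaces the manual `tb_next` walk.

```python
import sys
import traceback


def describe_exception(message, exc):
    """Summarize an exception using the innermost traceback frame."""
    lines = ["-------------------------", f"!-- Exception: {message}"]
    frames = traceback.extract_tb(exc.__traceback__)
    if frames:
        last = frames[-1]  # innermost frame, where the error was raised
        lines += [
            f"!-- File: {last.filename}",
            f"!-- Function: {last.name}",
            f"!-- Line: {last.lineno}",
        ]
    else:
        lines.append("!-- Traceback is None")
    lines += [f"!-- Description: {exc}", "-------------------------"]
    return "\n".join(lines)


def handle_uncaught(exc_type, exc, tb):
    """Hook for sys.excepthook: keep default Ctrl+C behavior, summarize the rest."""
    if issubclass(exc_type, KeyboardInterrupt):
        sys.__excepthook__(exc_type, exc, tb)
        return
    print(describe_exception("UNCAUGHT EXCEPTION", exc))


def failing_step():
    """Hypothetical function that always fails, used only for the demo."""
    raise ValueError("bad input")


def demo():
    """Catch the demo error and return its summary."""
    try:
        failing_step()
    except ValueError as err:
        return describe_exception("demo failure", err)
```

Assigning `sys.excepthook = handle_uncaught` at startup would activate the hook; the sketch leaves that step out so that importing it has no side effects.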

pywaybackup/SnapshotCollection.py

Lines changed: 12 additions & 63 deletions
Original file line numberDiff line numberDiff line change
@@ -1,10 +1,10 @@
1-
from urllib.parse import urlparse
1+
from pywaybackup.helper import url_split
22
import os
33

44
class SnapshotCollection:
55

66
SNAPSHOT_COLLECTION = []
7-
MODE_CURRENT = 0
7+
MODE_CURRENT = 0
88

99
@classmethod
1010
def create_list(cls, cdxResult, mode):
@@ -15,7 +15,7 @@ def create_list(cls, cdxResult, mode):
1515
- mode `current`: Only the latest snapshot of each file is included.
1616
"""
1717
# creates a list of dictionaries for each snapshot entry
18-
cls.SNAPSHOT_COLLECTION = sorted([{"timestamp": snapshot[0], "digest": snapshot[1], "mimetype": snapshot[2], "status": snapshot[3], "url": snapshot[4]} for snapshot in cdxResult.json()[1:]], key=lambda k: k['timestamp'], reverse=True)
18+
cls.SNAPSHOT_COLLECTION = sorted([{"timestamp": snapshot[0], "digest": snapshot[1], "mimetype": snapshot[2], "status": snapshot[3], "url": snapshot[4]} for snapshot in cdxResult[1:]], key=lambda k: k['timestamp'], reverse=True)
1919
if mode == "current":
2020
cls.MODE_CURRENT = 1
2121
cdxResult_list_filtered = []
@@ -29,21 +29,23 @@ def create_list(cls, cdxResult, mode):
2929
# writes the index for each snapshot entry
3030
cls.SNAPSHOT_COLLECTION = [{"id": idx, **entry} for idx, entry in enumerate(cls.SNAPSHOT_COLLECTION)]
3131

32+
3233
@classmethod
3334
def count_list(cls):
3435
return len(cls.SNAPSHOT_COLLECTION)
3536

37+
3638
@classmethod
3739
def create_collection(cls):
3840
new_collection = []
3941
for idx, cdx_entry in enumerate(cls.SNAPSHOT_COLLECTION):
40-
timestamp, url = cdx_entry["timestamp"], cdx_entry["url"]
41-
url_archive = f"http://web.archive.org/web/{timestamp}{cls._url_get_filetype(url)}/{url}"
42+
timestamp, url_origin = cdx_entry["timestamp"], cdx_entry["url"]
43+
url_archive = f"https://web.archive.org/web/{timestamp}id_/{url_origin}"
4244
collection_entry = {
4345
"id": idx,
4446
"timestamp": timestamp,
4547
"url_archive": url_archive,
46-
"url_origin": url,
48+
"url_origin": url_origin,
4749
"redirect_url": False,
4850
"redirect_timestamp": False,
4951
"response": False,
@@ -52,27 +54,18 @@ def create_collection(cls):
5254
new_collection.append(collection_entry)
5355
cls.SNAPSHOT_COLLECTION = new_collection
5456

55-
@classmethod
56-
def snapshot_entry_create_output(cls, collection_entry: dict, output: str) -> str:
57-
"""
58-
Create the output path for a snapshot entry of the collection according to the mode.
59-
60-
Input:
61-
- collection_entry: A single snapshot entry of the collection (dict).
62-
- output: The output directory (str).
6357

64-
Output:
65-
- download_file: The output path for the snapshot entry (str) with filename.
66-
"""
67-
timestamp, url = collection_entry["timestamp"], collection_entry["url_origin"]
68-
domain, subdir, filename = cls.url_split(url, index=True)
58+
@classmethod
59+
def create_output(cls, url: str, timestamp: str, output: str):
60+
domain, subdir, filename = url_split(url.split("id_/")[1], index=True)
6961
if cls.MODE_CURRENT:
7062
download_dir = os.path.join(output, domain, subdir)
7163
else:
7264
download_dir = os.path.join(output, domain, timestamp, subdir)
7365
download_file = os.path.abspath(os.path.join(download_dir, filename))
7466
return download_file
7567

68+
7669
@classmethod
7770
def snapshot_entry_modify(cls, collection_entry: dict, key: str, value: str):
7871
"""
@@ -82,47 +75,3 @@ def snapshot_entry_modify(cls, collection_entry: dict, key: str, value: str):
8275
- Modify an existing key-value pair if the key exists.
8376
"""
8477
collection_entry[key] = value
85-
86-
@classmethod
87-
def url_get_timestamp(cls, url):
88-
"""
89-
Extract the timestamp from a wayback machine URL.
90-
"""
91-
timestamp = url.split("web.archive.org/web/")[1].split("/")[0]
92-
timestamp = ''.join([char for char in timestamp if char.isdigit()])
93-
return timestamp
94-
95-
@classmethod
96-
def _url_get_filetype(cls, url):
97-
file_extension = os.path.splitext(url)[1][1:]
98-
urltype_mapping = {
99-
"jpg": "im_",
100-
"jpeg": "im_",
101-
"png": "im_",
102-
"gif": "im_",
103-
"svg": "im_",
104-
"ico": "im_",
105-
"css": "cs_"
106-
#"js": "js_"
107-
}
108-
urltype = urltype_mapping.get(file_extension, "id_")
109-
return urltype
110-
111-
@classmethod
112-
def url_split(cls, url, index=False):
113-
"""
114-
Split a URL into domain, subdir and filename.
115-
"""
116-
if not urlparse(url).scheme:
117-
url = "http://" + url
118-
parsed_url = urlparse(url)
119-
domain = parsed_url.netloc.split("@")[-1].split(":")[0] # split mailto: and port
120-
path_parts = parsed_url.path.split("/")
121-
if not url.endswith("/") or "." in path_parts[-1]:
122-
filename = path_parts[-1]
123-
subdir = "/".join(path_parts[:-1]).strip("/")
124-
else:
125-
filename = "index.html" if index else ""
126-
subdir = "/".join(path_parts).strip("/")
127-
filename = filename.replace("%20", " ") # replace url encoded spaces
128-
return domain, subdir, filename
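
This hunk moves `url_split` out of the class into `pywaybackup.helper`, whose contents are not part of the diff shown here. The sketch below therefore assumes the helper keeps the logic of the removed method, and pairs it with the new `id_` URL scheme and the output-path mapping from `create_output`. `archive_url` and `output_path` are hypothetical names, and `os.path.abspath` is left out so the result stays relative.

```python
import os
from urllib.parse import urlparse


def url_split(url, index=False):
    """Split a URL into (domain, subdir, filename), as the removed method did."""
    if not urlparse(url).scheme:
        url = "http://" + url
    parsed = urlparse(url)
    domain = parsed.netloc.split("@")[-1].split(":")[0]  # drop credentials and port
    parts = parsed.path.split("/")
    if not url.endswith("/") or "." in parts[-1]:
        filename = parts[-1]
        subdir = "/".join(parts[:-1]).strip("/")
    else:
        filename = "index.html" if index else ""
        subdir = "/".join(parts).strip("/")
    return domain, subdir, filename.replace("%20", " ")


def archive_url(timestamp, url_origin):
    """Build the snapshot URL with the `id_` modifier used in the new code."""
    return f"https://web.archive.org/web/{timestamp}id_/{url_origin}"


def output_path(url_archive, timestamp, output, mode_current=False):
    """Map a snapshot URL to a local path: one tree, or one folder per timestamp."""
    domain, subdir, filename = url_split(url_archive.split("id_/")[1], index=True)
    if mode_current:
        return os.path.join(output, domain, subdir, filename)
    return os.path.join(output, domain, timestamp, subdir, filename)
```

Note that splitting on `id_/` relies on every archive URL carrying that modifier, which holds for URLs built by `archive_url` but is an assumption for any URL obtained elsewhere, such as a redirect target.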

pywaybackup/Verbosity.py

Lines changed: 15 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -2,44 +2,46 @@
22
import json
33
from pywaybackup.SnapshotCollection import SnapshotCollection as sc
44

5+
56
class Verbosity:
67

78
mode = None
89
args = None
910
pbar = None
1011

12+
new_debug = True
13+
debug = False
14+
output = None
15+
command = None
16+
1117
@classmethod
12-
def open(cls, args: list):
13-
cls.args = args
18+
def init(cls, v_args: list, debug=False, output=None, command=None):
19+
cls.args = v_args
20+
cls.output = output
21+
cls.command = command
1422
if cls.args == "progress":
1523
cls.mode = "progress"
1624
elif cls.args == "json":
1725
cls.mode = "json"
1826
else:
1927
cls.mode = "standard"
28+
cls.debug = True if debug else False
2029

2130
@classmethod
22-
def close(cls):
31+
def fini(cls):
2332
if cls.mode == "progress":
2433
if cls.pbar is not None: cls.pbar.close()
25-
if cls.mode == "progress" or cls.mode == "standard":
26-
successed = len([snapshot for snapshot in sc.SNAPSHOT_COLLECTION if "file" in snapshot and snapshot["file"]])
27-
failed = len([snapshot for snapshot in sc.SNAPSHOT_COLLECTION if "file" in snapshot and not snapshot["file"]])
28-
print(f"\nFiles downloaded: {successed}")
29-
print(f"Files missing: {failed}")
30-
print("")
3134
if cls.mode == "json":
3235
print(json.dumps(sc.SNAPSHOT_COLLECTION, indent=4, sort_keys=True))
3336

3437
@classmethod
3538
def write(cls, message: str = None, progress: int = None):
3639
if cls.mode == "progress":
37-
if progress == 0:
38-
print("")
40+
if cls.pbar is None and progress == 0:
3941
maxval = sc.count_list()
4042
cls.pbar = tqdm.tqdm(total=maxval, desc="Downloading", unit=" snapshot", ascii="░▒█")
41-
elif cls.pbar is not None and progress == 1:
42-
cls.pbar.update(1)
43+
if cls.pbar is not None and progress is not None and progress > 0 :
44+
cls.pbar.update(progress)
4345
cls.pbar.refresh()
4446
elif cls.mode == "json":
4547
pass

pywaybackup/__version__.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1 +1 @@
1-
__version__ = "1.0.3"
1+
__version__ = "1.2.0"
