Why v1.1?
The original v1.0 waited until every HMDB ID had been extracted
before starting the network crawl. On files ≫1 GB the script appeared
to hang and wasted memory. v1.1 switches to a streaming producer‑consumer
design: IDs are extracted from the XML and fed to worker threads
on the fly, so crawling begins immediately and RAM usage stays flat.
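
As a rough illustration of the streaming design described above, the sketch below parses the XML incrementally and hands each ID to a thread pool as soon as it is found. It is not the script's actual code: the function names (`iter_ids`, `crawl_streaming`), the `fetch` callback, and the element names `metabolite` and `accession` are assumptions made for the example. The semaphore caps the number of pending IDs at roughly `workers*2`.

```python
import threading
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor


def local_name(tag):
    """Drop a leading '{namespace}' from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]


def iter_ids(xml_source):
    """Producer: yield one primary ID per <metabolite> while parsing is still running.

    The element names are assumptions for this sketch.
    """
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if local_name(elem.tag) != "metabolite":
            continue
        for child in elem:  # direct children only, so nested accessions are skipped
            if local_name(child.tag) == "accession" and child.text:
                yield child.text.strip()
                break
        elem.clear()  # release the finished record's subtree


def crawl_streaming(xml_source, fetch, workers=4):
    """Consumer side: run fetch(id) on worker threads with at most workers*2 IDs pending."""
    slots = threading.BoundedSemaphore(workers * 2)
    results = []
    lock = threading.Lock()

    def task(hmdb_id):
        try:
            row = fetch(hmdb_id)  # hypothetical network call supplied by the caller
            with lock:
                results.append(row)
        finally:
            slots.release()  # free a slot even if fetch() raised (errors are ignored here)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for hmdb_id in iter_ids(xml_source):
            slots.acquire()  # blocks the producer while workers*2 IDs are pending
            pool.submit(task, hmdb_id)
    # Leaving the with-block waits for the remaining tasks to finish.
    return results
```

Calling `crawl_streaming("hmdb_metabolites.xml", fetch)` with your own `fetch` function would start issuing requests after the first record is parsed; the file name here is only a placeholder.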
Key Updates
- No memory blow‑up: we never store more than the executor queue
  size (≈`workers*2`) IDs at once.
- Visible progress from second 1: both XML parsing and crawl speeds
  are shown via tqdm (falls back to textual counters if tqdm is missing).
- Auto‑resume: identical `--resume` semantics, but now we also
  create a `.partial` checkpoint every 5 s to guard against abrupt
  power failures.
- Py≥3.7 compatible (dropped the 3.8‑only `{*}tag` XML shortcut).
- Graceful shutdown: Ctrl‑C or SIGTERM stops creating new tasks, but
  in‑flight requests finish and the partial TSV is flushed.
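
The checkpoint and shutdown behaviour listed above could be approximated as follows. This is a minimal sketch under stated assumptions, not the script's implementation: the class `PartialCheckpoint`, the helpers `install_stop_handlers` and `run`, and the choice to append rows to `<output>.partial` and fsync at most once per interval are all hypothetical.

```python
import os
import signal
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class PartialCheckpoint:
    """Append TSV rows to '<path>.partial', syncing to disk at most every `interval` s."""

    def __init__(self, path, interval=5.0):
        self.partial_path = path + ".partial"  # assumed naming scheme
        self.interval = interval
        self._fh = open(self.partial_path, "a", encoding="utf-8")
        self._last_sync = time.monotonic()
        self._lock = threading.Lock()

    def _sync(self):
        self._fh.flush()
        os.fsync(self._fh.fileno())  # push data to disk, not just to the OS cache
        self._last_sync = time.monotonic()

    def write_row(self, fields):
        with self._lock:
            self._fh.write("\t".join(fields) + "\n")
            if time.monotonic() - self._last_sync >= self.interval:
                self._sync()

    def close(self):
        with self._lock:
            self._sync()
            self._fh.close()


def install_stop_handlers(stop_event):
    """Make Ctrl-C (SIGINT) and SIGTERM set a flag; call this from the main thread."""
    def handler(signum, frame):
        stop_event.set()
    signal.signal(signal.SIGINT, handler)
    signal.signal(signal.SIGTERM, handler)


def run(ids, fetch, checkpoint, stop_event, workers=4):
    """Submit IDs until they run out or a stop is requested, then flush the checkpoint."""
    def task(hmdb_id):
        checkpoint.write_row(fetch(hmdb_id))  # fetch is a caller-supplied placeholder

    submitted = 0
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            for hmdb_id in ids:
                if stop_event.is_set():
                    break  # create no new tasks
                pool.submit(task, hmdb_id)
                submitted += 1
        # Leaving the with-block waits for tasks that were already submitted.
    finally:
        checkpoint.close()  # final flush and fsync of the partial TSV
    return submitted
```

A caller would create a `threading.Event`, pass it to `install_stop_handlers`, and hand the same event to `run`, so that a signal only stops the submission loop and the final flush still happens.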