
Conversation

Bill-hbrhbr
Contributor

@Bill-hbrhbr Bill-hbrhbr commented Oct 9, 2025

Description

Previously, each benchmark run tore down and rebuilt the Docker image for its associated engine, resulting in unnecessary rebuilds of identical images.

The new system leverages Task to store and compare checksums for each built image's JSON metadata. When an image rebuild is initiated via Task, it now checks whether the Dockerfile or the build scripts for that engine have changed. If neither has changed, the existing image is kept.
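
The up-to-date check described above can be sketched in Python. This is a minimal illustration only — the PR implements the comparison through Task itself, and the file names here are hypothetical:

```python
import hashlib
from pathlib import Path


def image_needs_rebuild(metadata_path: Path, checksum_path: Path) -> bool:
    """Return True when no checksum is stored yet, or when the stored checksum
    differs from the checksum of the image's current JSON metadata."""
    new_checksum = hashlib.sha256(metadata_path.read_bytes()).hexdigest()
    if not checksum_path.is_file():
        # First build: nothing to compare against.
        return True
    return checksum_path.read_text().strip() != new_checksum
```

After a successful build, the new checksum would be persisted to `checksum_path` so the next run can skip an identical rebuild.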

The presto_clp and presto_parquet images are temporarily disabled in this PR because their Dockerfiles are currently incorrect and result in failed builds. They will be re-enabled once their definitions are corrected in a future update.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and it has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • task docker-images:build builds the images and checksums them successfully.

Summary by CodeRabbit

  • New Features

    • Added commands and tooling to build and export Docker images per engine, including parallel builds, checksum checks and optional image metadata export.
    • Added small CLI utilities to locate tooling and compute image names.
  • Documentation

    • Rewrote setup to emphasise dataset download, Docker-based workflows and engine-focused run commands; added instructions for single/all image builds and running benchmarks by engine.
  • Chores

    • Removed obsolete container build contents and tightened static analysis config.


coderabbitai bot commented Oct 9, 2025

Walkthrough

Replaces venv-focused setup with dataset download and Docker-based workflows, adds Docker image build scripts and utilities, introduces Taskfile tasks for parallel per-engine image builds with checksum handling, removes contents from two Dockerfiles, and adjusts mypy settings to include explicit package bases and src on the path.

Changes

Cohort / File(s) / Summary

  • Documentation (README.md)
    Rewrote setup: dataset acquisition via download_all.py, Docker image build instructions (single and all), running benchmarks by engine ({engine_name}, {dataset_name}), Docker environment notes, and image-to-engine mapping.
  • Removed Dockerfile contents (assets/overhead_test/Dockerfile, assets/template/Dockerfile)
    Deleted existing Dockerfile contents (base image and apt installs removed), leaving no build instructions.
  • Type-check config (pyproject.toml)
    Added mypy options: explicit_package_bases = true and mypy_path = ["src"] (retains strict = true).
  • Docker image package (src/log_archival_bench/scripts/docker_images/__init__.py, src/log_archival_bench/scripts/docker_images/build.py, src/log_archival_bench/scripts/docker_images/utils.py)
    New package and scripts: utils.get_image_name(engine_name) builds the image tag using a $USER fallback; build.py validates the Dockerfile path, resolves docker via which, runs docker build -t, and optionally writes docker inspect JSON.
  • General utilities (src/log_archival_bench/utils/__init__.py, src/log_archival_bench/utils/path_utils.py)
    New utilities exposing the package root, build and config dirs, and which(binary_name), which locates an executable or raises RuntimeError.
  • Task runner integration (taskfile.yaml)
    Includes taskfiles/docker-images/main.yaml and adds G_PROJECT_SRC_DIR = "{{.ROOT_DIR}}/src/log_archival_bench".
  • Docker-images taskfile (taskfiles/docker-images/main.yaml)
    New Taskfile defining per-engine build-single-image and build-all-parallel tasks, parallel build execution, per-engine output dirs, pre-validate and post-checksum steps, and invocation of the Python build script.
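
The `$USER` fallback mentioned for `utils.get_image_name` can be sketched as follows; the exact tag format below is an assumption for illustration, not the repository's actual scheme:

```python
import os


def get_image_name(engine_name: str) -> str:
    """Build a per-engine Docker image tag with a deterministic user prefix."""
    # Fall back to "clp-user" when $USER is unset (e.g., in some CI containers).
    user = os.environ.get("USER", "clp-user")
    # Hypothetical tag format; the real project may compose the tag differently.
    return f"{user}/log-archival-bench-{engine_name}:latest"
```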

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Dev as Developer
  participant Task as Taskfile: docker-images
  participant Py as build.py
  participant Utils as path_utils.which()
  participant Docker as docker

  Dev->>Task: task docker-images:build
  Task->>Task: init (ensure checksums dir)
  par per-engine build
    Task->>Task: build-single-image(ENGINE_NAME)
    Task->>Py: python build.py --engine-name ENGINE_NAME --dump-config-path /out/docker_inspect.json
    Py->>Utils: which("docker")
    Utils-->>Py: /usr/bin/docker
    Py->>Docker: docker build -t <tag> -f <path/to/Dockerfile> .
    Docker-->>Py: image built
    alt --dump-config-path provided
      Py->>Docker: docker inspect <tag>
      Docker-->>Py: JSON metadata
      Py-->>Task: write docker_inspect.json
    end
    Task->>Task: compute checksum and persist
  end
  Task-->>Dev: per-engine outputs (inspect JSON, checksum)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): The title accurately captures the primary change of separating the Docker image build process from benchmark runs and introducing checksum-based rebuild detection, which aligns directly with the pull request's objectives and scope. It is concise, clear, and highlights the key feature added. A reviewer scanning the history would immediately understand the intent without needing additional context.
  • Docstring Coverage (✅ Passed): Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.

@Bill-hbrhbr Bill-hbrhbr changed the title feat(docker-images): Task manage docker images feat(docker-images): Decouple image build flow from benchmark runs and add checksum-based rebuild detection. Oct 9, 2025
@Bill-hbrhbr Bill-hbrhbr marked this pull request as ready for review October 9, 2025 19:10
@Bill-hbrhbr Bill-hbrhbr requested a review from a team as a code owner October 9, 2025 19:10

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b986094 and e9faabe.

📒 Files selected for processing (11)
  • README.md (1 hunks)
  • assets/overhead_test/Dockerfile (0 hunks)
  • assets/template/Dockerfile (0 hunks)
  • pyproject.toml (1 hunks)
  • src/log_archival_bench/scripts/docker_images/__init__.py (1 hunks)
  • src/log_archival_bench/scripts/docker_images/build.py (1 hunks)
  • src/log_archival_bench/scripts/docker_images/utils.py (1 hunks)
  • src/log_archival_bench/utils/__init__.py (1 hunks)
  • src/log_archival_bench/utils/path_utils.py (1 hunks)
  • taskfile.yaml (1 hunks)
  • taskfiles/docker-images/main.yaml (1 hunks)
💤 Files with no reviewable changes (2)
  • assets/template/Dockerfile
  • assets/overhead_test/Dockerfile
🧰 Additional context used
🧬 Code graph analysis (1)
src/log_archival_bench/scripts/docker_images/build.py (2)
src/log_archival_bench/scripts/docker_images/utils.py (1)
  • get_image_name (6-12)
src/log_archival_bench/utils/path_utils.py (3)
  • get_config_dir (24-26)
  • get_package_root (14-16)
  • which (29-41)
🪛 Ruff (0.13.3)
src/log_archival_bench/scripts/docker_images/build.py

53-53: subprocess call: check for execution of untrusted input

(S603)


65-65: subprocess call: check for execution of untrusted input

(S603)

🔇 Additional comments (10)
src/log_archival_bench/scripts/docker_images/__init__.py (1)

1-1: LGTM!

Standard package initializer with an appropriate docstring.

pyproject.toml (1)

27-28: LGTM!

The mypy configuration correctly supports the new package structure under src/. The explicit_package_bases and mypy_path settings enable proper type checking for the newly introduced modules.

src/log_archival_bench/utils/__init__.py (1)

1-1: LGTM!

Standard package initializer with an appropriate docstring.

taskfile.yaml (2)

8-8: LGTM!

Appropriately includes the new docker-images taskfile to enable the image build workflow.


12-12: LGTM!

The G_PROJECT_SRC_DIR variable correctly points to the source directory for the Docker image build context.

README.md (1)

10-48: LGTM!

The documentation updates clearly explain the new Docker-based workflow, including dataset download, image building (both concurrent and single-engine), and execution. The shift from virtual environment setup to Docker environments aligns well with the PR objectives.

src/log_archival_bench/scripts/docker_images/utils.py (1)

6-12: LGTM!

The get_image_name function provides a clean utility for generating consistent Docker image tags. The fallback to "clp-user" appropriately handles environments where the USER variable might not be set.

src/log_archival_bench/scripts/docker_images/build.py (1)

17-67: LGTM!

The build logic is well-structured:

  • Properly validates Dockerfile existence before building
  • Uses the which() helper to safely locate the Docker executable
  • Handles optional metadata dumping with appropriate error checking
  • The subprocess calls are safe despite Ruff S603 warnings (the script is designed for developer/CI use with controlled inputs)
src/log_archival_bench/utils/path_utils.py (2)

8-26: LGTM!

The path derivation logic is correct:

  • _PACKAGE_ROOT correctly resolves to the package directory via __file__.parent
  • _BUILD_DIR and _CONFIG_DIR are appropriately derived relative to the package root
  • The getter functions provide clean public interfaces to these paths

29-41: LGTM!

The which() helper provides robust executable location with clear error handling. Raising RuntimeError with a descriptive message when the binary is not found is appropriate for this use case.
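
A minimal version of such a helper, assuming it wraps the standard library's `shutil.which`, might look like this:

```python
import shutil


def which(binary_name: str) -> str:
    """Locate an executable on PATH, raising instead of returning None so
    callers fail fast with a clear message."""
    path = shutil.which(binary_name)
    if path is None:
        raise RuntimeError(f"Executable not found on PATH: {binary_name}")
    return path
```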


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e9faabe and a28735d.

📒 Files selected for processing (3)
  • src/log_archival_bench/scripts/docker_images/build.py (1 hunks)
  • src/log_archival_bench/utils/path_utils.py (1 hunks)
  • taskfiles/docker-images/main.yaml (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/log_archival_bench/scripts/docker_images/build.py (2)
src/log_archival_bench/scripts/docker_images/utils.py (1)
  • get_image_name (6-12)
src/log_archival_bench/utils/path_utils.py (3)
  • get_config_dir (24-26)
  • get_package_root (14-16)
  • which (29-41)
🪛 Ruff (0.13.3)
src/log_archival_bench/scripts/docker_images/build.py

53-53: subprocess call: check for execution of untrusted input

(S603)


65-65: subprocess call: check for execution of untrusted input

(S603)

🔇 Additional comments (4)
src/log_archival_bench/utils/path_utils.py (1)

1-41: LGTM! Clean implementation of path utilities.

The module provides well-structured path helpers with appropriate error handling. The use of Path objects and validation in which() ensures robust path resolution.

taskfiles/docker-images/main.yaml (1)

1-76: LGTM! Task configuration is well-structured.

The checksum validation has been correctly moved from sources to deps, resolving the previous parsing issue. The parallel build orchestration and incremental rebuild detection via checksums align well with the PR objectives.

src/log_archival_bench/scripts/docker_images/build.py (2)

43-53: Docker build implementation is correct.

The subprocess call is safe given that docker_bin is resolved via which() and other arguments are constructed from validated or trusted sources. The static analysis warning (S603) is a false positive in this context, assuming the engine_name validation suggested earlier is implemented.

Note: This approval assumes the engine_name input validation is added as suggested in the earlier comment.


55-65: LGTM! Config dump implementation follows best practices.

The optional JSON metadata dump correctly creates parent directories and uses a context manager for file operations. The subprocess call is safe given validated inputs.

Comment on lines +24 to +34
args_parser = argparse.ArgumentParser()
args_parser.add_argument(
    "--engine-name", required=True, help="The engine to be installed inside the Docker image."
)
args_parser.add_argument(
    "--dump-config-path", help="Path to the file to dump the Docker image JSON metadata."
)

parsed_args = args_parser.parse_args(argv[1:])
engine_name = parsed_args.engine_name
dump_config_path = parsed_args.dump_config_path

@coderabbitai coderabbitai bot Oct 10, 2025


⚠️ Potential issue | 🟠 Major

Validate engine_name to prevent path traversal.

The engine_name parameter is user-controlled and used to construct file paths (line 38) without validation. While the Dockerfile existence check (line 39) provides some protection, it doesn't prevent path traversal attempts that could access Dockerfiles outside the intended directory.

Add input validation before using engine_name:

     parsed_args = args_parser.parse_args(argv[1:])
     engine_name = parsed_args.engine_name
+    # Prevent path traversal
+    if "/" in engine_name or "\\" in engine_name or engine_name in (".", ".."):
+        err_msg = f"Invalid engine name: {engine_name}"
+        raise ValueError(err_msg)
     dump_config_path = parsed_args.dump_config_path
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

args_parser = argparse.ArgumentParser()
args_parser.add_argument(
    "--engine-name", required=True, help="The engine to be installed inside the Docker image."
)
args_parser.add_argument(
    "--dump-config-path", help="Path to the file to dump the Docker image JSON metadata."
)

parsed_args = args_parser.parse_args(argv[1:])
engine_name = parsed_args.engine_name
# Prevent path traversal
if "/" in engine_name or "\\" in engine_name or engine_name in (".", ".."):
    err_msg = f"Invalid engine name: {engine_name}"
    raise ValueError(err_msg)
dump_config_path = parsed_args.dump_config_path
🤖 Prompt for AI Agents
In src/log_archival_bench/scripts/docker_images/build.py around lines 24 to 34,
validate the user-controlled engine_name before using it to construct filesystem
paths to prevent path traversal: reject names containing path separators or
traversal components (e.g. ".."), enforce a tight whitelist or regex
(alphanumeric, hyphen, underscore) or map names to known filenames, then
construct the Dockerfile path with os.path.join against the intended directory
and verify the resolved absolute path is inside that directory (compare
realpath/abspath prefix) before proceeding; if validation fails, raise/exit with
a clear error.
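
The resolved-path containment check described above can be sketched as follows (the function and directory names are illustrative, not the PR's actual API):

```python
from pathlib import Path


def resolve_dockerfile(engine_name: str, images_dir: Path) -> Path:
    """Resolve the engine's Dockerfile path, rejecting any engine name whose
    resolved path escapes the docker-images directory."""
    candidate = (images_dir / engine_name / "Dockerfile").resolve()
    base = images_dir.resolve()
    # The resolved Dockerfile must live strictly inside the images directory.
    if base not in candidate.parents:
        raise ValueError(f"Invalid engine name: {engine_name}")
    return candidate
```

An allowlist check (as suggested below) can run before this; the containment check then serves as defense in depth.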

Contributor Author


the engine_name is validated along with docker_file_path. Do I need to iterdirs the docker-images dir and add an accepted list of values?


🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# View the complete implementation of the build.py file to understand the validation logic
cat src/log_archival_bench/scripts/docker_images/build.py

Length of output: 2032


🏁 Script executed:

#!/bin/bash
# Check the engines list defined in the Task configuration
grep -A 20 "G_DOCKER_IMAGE_ENGINES:" taskfiles/docker-images/main.yaml

Length of output: 333


@Bill-hbrhbr, yes, I recommend creating an accepted list of engine names rather than relying solely on the Dockerfile existence check. Here's why:

Security concern: The current validation (docker_file_path.is_file()) checks if a Dockerfile exists at the constructed path, but doesn't prevent path traversal. If someone passes engine_name="../../../malicious" and a Dockerfile happens to exist there, the validation would pass.

Recommended approach: Use an explicit allowlist of valid engine names. This is more secure and provides better error messages. You already have a canonical list in taskfiles/docker-images/main.yaml:

def main(argv: list[str]) -> int:
    # ... existing argparse code ...
    
    parsed_args = args_parser.parse_args(argv[1:])
    engine_name = parsed_args.engine_name
    
    # Validate against allowlist
    VALID_ENGINES = {"clickhouse", "clp", "elasticsearch", "sparksql", "zstandard"}
    if engine_name not in VALID_ENGINES:
        err_msg = f"Invalid engine name: {engine_name}. Valid engines: {', '.join(sorted(VALID_ENGINES))}"
        raise ValueError(err_msg)
    
    dump_config_path = parsed_args.dump_config_path
    # ... rest of code ...

Benefits:

  1. Security: Prevents path traversal entirely, regardless of filesystem state
  2. Better UX: Shows users the list of valid engines when they provide an invalid name
  3. Explicit contract: Makes it clear what engines are supported
  4. Fail-fast: Catches invalid input before any filesystem operations

You can keep the Dockerfile existence check as a secondary validation to catch configuration errors (e.g., if the Dockerfile was accidentally deleted).
