
Conversation

Bill-hbrhbr
Contributor

@Bill-hbrhbr Bill-hbrhbr commented Oct 9, 2025

Description

Previously, each benchmark run tore down and rebuilt the Docker image for its associated engine, resulting in unnecessary rebuilds of identical images.

The new system leverages Task to store and compare checksums for each built image's JSON metadata. When an image rebuild is initiated via Task, it now checks whether the Dockerfile or the build scripts for that engine have changed. If neither has changed, the existing image is kept.
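
The up-to-date check described above can be sketched in Python. This is a minimal illustration only — the PR implements the comparison through Task itself, and the file names here are hypothetical:

```python
import hashlib
from pathlib import Path


def image_needs_rebuild(metadata_path: Path, checksum_path: Path) -> bool:
    """Return True when no checksum is stored yet, or when the stored checksum
    differs from the checksum of the image's current JSON metadata."""
    new_checksum = hashlib.sha256(metadata_path.read_bytes()).hexdigest()
    if not checksum_path.is_file():
        # First build: nothing to compare against.
        return True
    return checksum_path.read_text().strip() != new_checksum
```

After a successful build, the new checksum would be persisted to `checksum_path` so the next run can skip an identical rebuild.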

The presto_clp and presto_parquet images are temporarily disabled in this PR because their Dockerfiles are currently incorrect and result in failed builds. They will be re-enabled once their definitions are corrected in a future update.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and it has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • task docker-images:build builds the images and checksums them successfully.

Summary by CodeRabbit

  • New Features

    • Added commands and tooling to build and export Docker images per engine, including parallel builds, checksum checks and optional image metadata export.
    • Added small CLI utilities to locate tooling and compute image names.
  • Documentation

    • Rewrote setup to emphasise dataset download, Docker-based workflows and engine-focused run commands; added instructions for single/all image builds and running benchmarks by engine.
  • Chores

    • Removed obsolete container build contents and tightened static analysis config.


coderabbitai bot commented Oct 9, 2025

Walkthrough

Replaces venv-focused setup with dataset download and Docker-based workflows, adds Docker image build scripts and utilities, introduces Taskfile tasks for parallel per-engine image builds with checksum handling, removes contents from two Dockerfiles, and adjusts mypy settings to include explicit package bases and src on the path.

Changes

Cohort / File(s) / Summary

  • Documentation (README.md)
    Rewrote setup: dataset acquisition via download_all.py, Docker image build instructions (single and all), running benchmarks by engine ({engine_name}, {dataset_name}), Docker environment notes, and image-to-engine mapping.
  • Removed Dockerfile contents (assets/overhead_test/Dockerfile, assets/template/Dockerfile)
    Deleted existing Dockerfile contents (base image and apt installs removed), leaving no build instructions.
  • Type-check config (pyproject.toml)
    Added mypy options: explicit_package_bases = true and mypy_path = ["src"] (retains strict = true).
  • Docker image package (src/log_archival_bench/scripts/docker_images/__init__.py, src/log_archival_bench/scripts/docker_images/build.py, src/log_archival_bench/scripts/docker_images/utils.py)
    New package and scripts: utils.get_image_name(engine_name) builds the image tag using a $USER fallback; build.py validates the Dockerfile path, resolves docker via which, runs docker build -t, and optionally writes docker inspect JSON.
  • General utilities (src/log_archival_bench/utils/__init__.py, src/log_archival_bench/utils/path_utils.py)
    New utilities exposing the package root, build and config dirs, and which(binary_name), which locates an executable or raises RuntimeError.
  • Task runner integration (taskfile.yaml)
    Includes taskfiles/docker-images/main.yaml and adds G_PROJECT_SRC_DIR = "{{.ROOT_DIR}}/src/log_archival_bench".
  • Docker-images taskfile (taskfiles/docker-images/main.yaml)
    New Taskfile defining per-engine build-single-image and build-all-parallel tasks, parallel build execution, per-engine output dirs, pre-validate and post-checksum steps, and invocation of the Python build script.
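
The `$USER` fallback mentioned for `utils.get_image_name` can be sketched as follows; the exact tag format below is an assumption for illustration, not the repository's actual scheme:

```python
import os


def get_image_name(engine_name: str) -> str:
    """Build a per-engine Docker image tag with a deterministic user prefix."""
    # Fall back to "clp-user" when $USER is unset (e.g., in some CI containers).
    user = os.environ.get("USER", "clp-user")
    # Hypothetical tag format; the real project may compose the tag differently.
    return f"{user}/log-archival-bench-{engine_name}:latest"
```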

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Dev as Developer
  participant Task as Taskfile: docker-images
  participant Py as build.py
  participant Utils as path_utils.which()
  participant Docker as docker

  Dev->>Task: task docker-images:build
  Task->>Task: init (ensure checksums dir)
  par per-engine build
    Task->>Task: build-single-image(ENGINE_NAME)
    Task->>Py: python build.py --engine-name ENGINE_NAME --dump-config-path /out/docker_inspect.json
    Py->>Utils: which("docker")
    Utils-->>Py: /usr/bin/docker
    Py->>Docker: docker build -t <tag> -f <path/to/Dockerfile> .
    Docker-->>Py: image built
    alt --dump-config-path provided
      Py->>Docker: docker inspect <tag>
      Docker-->>Py: JSON metadata
      Py-->>Task: write docker_inspect.json
    end
    Task->>Task: compute checksum and persist
  end
  Task-->>Dev: per-engine outputs (inspect JSON, checksum)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): The title accurately captures the primary change of separating the Docker image build process from benchmark runs and introducing checksum-based rebuild detection, which aligns directly with the pull request's objectives and scope. It is concise, clear, and highlights the key feature added. A reviewer scanning the history would immediately understand the intent without needing additional context.
  • Docstring Coverage (✅ Passed): Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.

@Bill-hbrhbr Bill-hbrhbr changed the title feat(docker-images): Task manage docker images feat(docker-images): Decouple image build flow from benchmark runs and add checksum-based rebuild detection. Oct 9, 2025
@Bill-hbrhbr Bill-hbrhbr marked this pull request as ready for review October 9, 2025 19:10
@Bill-hbrhbr Bill-hbrhbr requested a review from a team as a code owner October 9, 2025 19:10

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b986094 and e9faabe.

📒 Files selected for processing (11)
  • README.md (1 hunks)
  • assets/overhead_test/Dockerfile (0 hunks)
  • assets/template/Dockerfile (0 hunks)
  • pyproject.toml (1 hunks)
  • src/log_archival_bench/scripts/docker_images/__init__.py (1 hunks)
  • src/log_archival_bench/scripts/docker_images/build.py (1 hunks)
  • src/log_archival_bench/scripts/docker_images/utils.py (1 hunks)
  • src/log_archival_bench/utils/__init__.py (1 hunks)
  • src/log_archival_bench/utils/path_utils.py (1 hunks)
  • taskfile.yaml (1 hunks)
  • taskfiles/docker-images/main.yaml (1 hunks)
💤 Files with no reviewable changes (2)
  • assets/template/Dockerfile
  • assets/overhead_test/Dockerfile
🧰 Additional context used
🧬 Code graph analysis (1)
src/log_archival_bench/scripts/docker_images/build.py (2)
src/log_archival_bench/scripts/docker_images/utils.py (1)
  • get_image_name (6-12)
src/log_archival_bench/utils/path_utils.py (3)
  • get_config_dir (24-26)
  • get_package_root (14-16)
  • which (29-41)
🪛 Ruff (0.13.3)
src/log_archival_bench/scripts/docker_images/build.py

53-53: subprocess call: check for execution of untrusted input

(S603)


65-65: subprocess call: check for execution of untrusted input

(S603)

🔇 Additional comments (10)
src/log_archival_bench/scripts/docker_images/__init__.py (1)

1-1: LGTM!

Standard package initializer with an appropriate docstring.

pyproject.toml (1)

27-28: LGTM!

The mypy configuration correctly supports the new package structure under src/. The explicit_package_bases and mypy_path settings enable proper type checking for the newly introduced modules.

src/log_archival_bench/utils/__init__.py (1)

1-1: LGTM!

Standard package initializer with an appropriate docstring.

taskfile.yaml (2)

8-8: LGTM!

Appropriately includes the new docker-images taskfile to enable the image build workflow.


12-12: LGTM!

The G_PROJECT_SRC_DIR variable correctly points to the source directory for the Docker image build context.

README.md (1)

10-48: LGTM!

The documentation updates clearly explain the new Docker-based workflow, including dataset download, image building (both concurrent and single-engine), and execution. The shift from virtual environment setup to Docker environments aligns well with the PR objectives.

src/log_archival_bench/scripts/docker_images/utils.py (1)

6-12: LGTM!

The get_image_name function provides a clean utility for generating consistent Docker image tags. The fallback to "clp-user" appropriately handles environments where the USER variable might not be set.

src/log_archival_bench/scripts/docker_images/build.py (1)

17-67: LGTM!

The build logic is well-structured:

  • Properly validates Dockerfile existence before building
  • Uses the which() helper to safely locate the Docker executable
  • Handles optional metadata dumping with appropriate error checking
  • The subprocess calls are safe despite Ruff S603 warnings (the script is designed for developer/CI use with controlled inputs)
src/log_archival_bench/utils/path_utils.py (2)

8-26: LGTM!

The path derivation logic is correct:

  • _PACKAGE_ROOT correctly resolves to the package directory via __file__.parent
  • _BUILD_DIR and _CONFIG_DIR are appropriately derived relative to the package root
  • The getter functions provide clean public interfaces to these paths

29-41: LGTM!

The which() helper provides robust executable location with clear error handling. Raising RuntimeError with a descriptive message when the binary is not found is appropriate for this use case.
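
A minimal version of such a helper, assuming it wraps the standard library's `shutil.which`, might look like this:

```python
import shutil


def which(binary_name: str) -> str:
    """Locate an executable on PATH, raising instead of returning None so
    callers fail fast with a clear message."""
    path = shutil.which(binary_name)
    if path is None:
        raise RuntimeError(f"Executable not found on PATH: {binary_name}")
    return path
```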


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e9faabe and a28735d.

📒 Files selected for processing (3)
  • src/log_archival_bench/scripts/docker_images/build.py (1 hunks)
  • src/log_archival_bench/utils/path_utils.py (1 hunks)
  • taskfiles/docker-images/main.yaml (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/log_archival_bench/scripts/docker_images/build.py (2)
src/log_archival_bench/scripts/docker_images/utils.py (1)
  • get_image_name (6-12)
src/log_archival_bench/utils/path_utils.py (3)
  • get_config_dir (24-26)
  • get_package_root (14-16)
  • which (29-41)
🪛 Ruff (0.13.3)
src/log_archival_bench/scripts/docker_images/build.py

53-53: subprocess call: check for execution of untrusted input

(S603)


65-65: subprocess call: check for execution of untrusted input

(S603)

🔇 Additional comments (4)
src/log_archival_bench/utils/path_utils.py (1)

1-41: LGTM! Clean implementation of path utilities.

The module provides well-structured path helpers with appropriate error handling. The use of Path objects and validation in which() ensures robust path resolution.

taskfiles/docker-images/main.yaml (1)

1-76: LGTM! Task configuration is well-structured.

The checksum validation has been correctly moved from sources to deps, resolving the previous parsing issue. The parallel build orchestration and incremental rebuild detection via checksums align well with the PR objectives.

src/log_archival_bench/scripts/docker_images/build.py (2)

43-53: Docker build implementation is correct.

The subprocess call is safe given that docker_bin is resolved via which() and other arguments are constructed from validated or trusted sources. The static analysis warning (S603) is a false positive in this context, assuming the engine_name validation suggested earlier is implemented.

Note: This approval assumes the engine_name input validation is added as suggested in the earlier comment.


55-65: LGTM! Config dump implementation follows best practices.

The optional JSON metadata dump correctly creates parent directories and uses a context manager for file operations. The subprocess call is safe given validated inputs.

Comment on lines +24 to +34
args_parser = argparse.ArgumentParser()
args_parser.add_argument(
    "--engine-name", required=True, help="The engine to be installed inside the Docker image."
)
args_parser.add_argument(
    "--dump-config-path", help="Path to the file to dump the Docker image JSON metadata."
)

parsed_args = args_parser.parse_args(argv[1:])
engine_name = parsed_args.engine_name
dump_config_path = parsed_args.dump_config_path

@coderabbitai coderabbitai bot Oct 10, 2025


⚠️ Potential issue | 🟠 Major

Validate engine_name to prevent path traversal.

The engine_name parameter is user-controlled and used to construct file paths (line 38) without validation. While the Dockerfile existence check (line 39) provides some protection, it doesn't prevent path traversal attempts that could access Dockerfiles outside the intended directory.

Add input validation before using engine_name:

     parsed_args = args_parser.parse_args(argv[1:])
     engine_name = parsed_args.engine_name
+    # Prevent path traversal
+    if "/" in engine_name or "\\" in engine_name or engine_name in (".", ".."):
+        err_msg = f"Invalid engine name: {engine_name}"
+        raise ValueError(err_msg)
     dump_config_path = parsed_args.dump_config_path
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

args_parser = argparse.ArgumentParser()
args_parser.add_argument(
    "--engine-name", required=True, help="The engine to be installed inside the Docker image."
)
args_parser.add_argument(
    "--dump-config-path", help="Path to the file to dump the Docker image JSON metadata."
)

parsed_args = args_parser.parse_args(argv[1:])
engine_name = parsed_args.engine_name
# Prevent path traversal
if "/" in engine_name or "\\" in engine_name or engine_name in (".", ".."):
    err_msg = f"Invalid engine name: {engine_name}"
    raise ValueError(err_msg)
dump_config_path = parsed_args.dump_config_path
🤖 Prompt for AI Agents
In src/log_archival_bench/scripts/docker_images/build.py around lines 24 to 34,
validate the user-controlled engine_name before using it to construct filesystem
paths to prevent path traversal: reject names containing path separators or
traversal components (e.g. ".."), enforce a tight whitelist or regex
(alphanumeric, hyphen, underscore) or map names to known filenames, then
construct the Dockerfile path with os.path.join against the intended directory
and verify the resolved absolute path is inside that directory (compare
realpath/abspath prefix) before proceeding; if validation fails, raise/exit with
a clear error.
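
The resolved-path containment check described above can be sketched as follows (the function and directory names are illustrative, not the PR's actual API):

```python
from pathlib import Path


def resolve_dockerfile(engine_name: str, images_dir: Path) -> Path:
    """Resolve the engine's Dockerfile path, rejecting any engine name whose
    resolved path escapes the docker-images directory."""
    candidate = (images_dir / engine_name / "Dockerfile").resolve()
    base = images_dir.resolve()
    # The resolved Dockerfile must live strictly inside the images directory.
    if base not in candidate.parents:
        raise ValueError(f"Invalid engine name: {engine_name}")
    return candidate
```

An allowlist check (as suggested below) can run before this; the containment check then serves as defense in depth.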

Contributor Author


the engine_name is validated along with docker_file_path. Do I need to iterdirs the docker-images dir and add an accepted list of values?


🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# View the complete implementation of the build.py file to understand the validation logic
cat src/log_archival_bench/scripts/docker_images/build.py

Length of output: 2032


🏁 Script executed:

#!/bin/bash
# Check the engines list defined in the Task configuration
grep -A 20 "G_DOCKER_IMAGE_ENGINES:" taskfiles/docker-images/main.yaml

Length of output: 333


@Bill-hbrhbr, yes, I recommend creating an accepted list of engine names rather than relying solely on the Dockerfile existence check. Here's why:

Security concern: The current validation (docker_file_path.is_file()) checks if a Dockerfile exists at the constructed path, but doesn't prevent path traversal. If someone passes engine_name="../../../malicious" and a Dockerfile happens to exist there, the validation would pass.

Recommended approach: Use an explicit allowlist of valid engine names. This is more secure and provides better error messages. You already have a canonical list in taskfiles/docker-images/main.yaml:

def main(argv: list[str]) -> int:
    # ... existing argparse code ...
    
    parsed_args = args_parser.parse_args(argv[1:])
    engine_name = parsed_args.engine_name
    
    # Validate against allowlist
    VALID_ENGINES = {"clickhouse", "clp", "elasticsearch", "sparksql", "zstandard"}
    if engine_name not in VALID_ENGINES:
        err_msg = f"Invalid engine name: {engine_name}. Valid engines: {', '.join(sorted(VALID_ENGINES))}"
        raise ValueError(err_msg)
    
    dump_config_path = parsed_args.dump_config_path
    # ... rest of code ...

Benefits:

  1. Security: Prevents path traversal entirely, regardless of filesystem state
  2. Better UX: Shows users the list of valid engines when they provide an invalid name
  3. Explicit contract: Makes it clear what engines are supported
  4. Fail-fast: Catches invalid input before any filesystem operations

You can keep the Dockerfile existence check as a secondary validation to catch configuration errors (e.g., if the Dockerfile was accidentally deleted).
