feat(docker-images): Decouple image build flow from benchmark runs and add checksum-based rebuild detection. #17
base: main
Conversation
Walkthrough

Replaces venv-focused setup with dataset download and Docker-based workflows, adds Docker image build scripts and utilities, introduces Taskfile tasks for parallel per-engine image builds with checksum handling, removes contents from two Dockerfiles, and adjusts mypy settings to include explicit package bases and a mypy path.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor Dev as Developer
    participant Task as Taskfile: docker-images
    participant Py as build.py
    participant Utils as path_utils.which()
    participant Docker as docker

    Dev->>Task: task docker-images:build
    Task->>Task: init (ensure checksums dir)
    par per-engine build
        Task->>Task: build-single-image(ENGINE_NAME)
        Task->>Py: python build.py --engine-name ENGINE_NAME --dump-config-path /out/docker_inspect.json
        Py->>Utils: which("docker")
        Utils-->>Py: /usr/bin/docker
        Py->>Docker: docker build -t <tag> -f <path/to/Dockerfile> .
        Docker-->>Py: image built
        alt --dump-config-path provided
            Py->>Docker: docker inspect <tag>
            Docker-->>Py: JSON metadata
            Py-->>Task: write docker_inspect.json
        end
        Task->>Task: compute checksum and persist
    and
    end
    Task-->>Dev: per-engine outputs (inspect JSON, checksum)
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 3
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (11)
- README.md (1 hunks)
- assets/overhead_test/Dockerfile (0 hunks)
- assets/template/Dockerfile (0 hunks)
- pyproject.toml (1 hunks)
- src/log_archival_bench/scripts/docker_images/__init__.py (1 hunks)
- src/log_archival_bench/scripts/docker_images/build.py (1 hunks)
- src/log_archival_bench/scripts/docker_images/utils.py (1 hunks)
- src/log_archival_bench/utils/__init__.py (1 hunks)
- src/log_archival_bench/utils/path_utils.py (1 hunks)
- taskfile.yaml (1 hunks)
- taskfiles/docker-images/main.yaml (1 hunks)
💤 Files with no reviewable changes (2)
- assets/template/Dockerfile
- assets/overhead_test/Dockerfile
🧰 Additional context used
🧬 Code graph analysis (1)
src/log_archival_bench/scripts/docker_images/build.py (2)
- src/log_archival_bench/scripts/docker_images/utils.py (1)
  - get_image_name (6-12)
- src/log_archival_bench/utils/path_utils.py (3)
  - get_config_dir (24-26)
  - get_package_root (14-16)
  - which (29-41)
🪛 Ruff (0.13.3)
src/log_archival_bench/scripts/docker_images/build.py
- 53-53: subprocess call: check for execution of untrusted input (S603)
- 65-65: subprocess call: check for execution of untrusted input (S603)
🔇 Additional comments (10)
src/log_archival_bench/scripts/docker_images/__init__.py (1)
- 1-1: LGTM! Standard package initializer with an appropriate docstring.

pyproject.toml (1)
- 27-28: LGTM! The mypy configuration correctly supports the new package structure under src/. The explicit_package_bases and mypy_path settings enable proper type checking for the newly introduced modules.

src/log_archival_bench/utils/__init__.py (1)
- 1-1: LGTM! Standard package initializer with an appropriate docstring.
taskfile.yaml (2)
- 8-8: LGTM! Appropriately includes the new docker-images taskfile to enable the image build workflow.
- 12-12: LGTM! The G_PROJECT_SRC_DIR variable correctly points to the source directory for the Docker image build context.

README.md (1)
- 10-48: LGTM! The documentation updates clearly explain the new Docker-based workflow, including dataset download, image building (both concurrent and single-engine), and execution. The shift from virtual environment setup to Docker environments aligns well with the PR objectives.
src/log_archival_bench/scripts/docker_images/utils.py (1)
- 6-12: LGTM! The get_image_name function provides a clean utility for generating consistent Docker image tags. The fallback to "clp-user" appropriately handles environments where the USER variable might not be set.
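As an illustration of that pattern (a sketch only; the tag format below is an assumption, and only the "clp-user" fallback is taken from the comment above):

```python
import os


def get_image_name(engine_name: str) -> str:
    """Builds a consistent, per-user Docker image tag for an engine.

    The tag format here is illustrative; the USER fallback mirrors the reviewed behavior.
    """
    user = os.environ.get("USER", "clp-user")
    return f"{user}-log-archival-bench-{engine_name}"
```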
src/log_archival_bench/scripts/docker_images/build.py (1)
- 17-67: LGTM! The build logic is well-structured:
  - Properly validates Dockerfile existence before building
  - Uses the which() helper to safely locate the Docker executable
  - Handles optional metadata dumping with appropriate error checking
  - The subprocess calls are safe despite Ruff S603 warnings (the script is designed for developer/CI use with controlled inputs)
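For illustration, the points above correspond roughly to a pattern like the following sketch; the function name, parameters, and command layout are assumptions rather than the actual contents of build.py:

```python
import subprocess
from pathlib import Path


def build_image(docker_bin: Path, dockerfile: Path, image_tag: str, context_dir: Path) -> None:
    """Builds a Docker image after validating that the Dockerfile exists."""
    if not dockerfile.is_file():
        err_msg = f"Dockerfile not found: {dockerfile}"
        raise FileNotFoundError(err_msg)

    # docker_bin is expected to come from the which() helper rather than a bare "docker" string.
    cmd = [str(docker_bin), "build", "-t", image_tag, "-f", str(dockerfile), str(context_dir)]
    subprocess.run(cmd, check=True)  # check=True turns a failed build into an exception
```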
src/log_archival_bench/utils/path_utils.py (2)
- 8-26: LGTM! The path derivation logic is correct:
  - _PACKAGE_ROOT correctly resolves to the package directory via __file__.parent
  - _BUILD_DIR and _CONFIG_DIR are appropriately derived relative to the package root
  - The getter functions provide clean public interfaces to these paths
- 29-41: LGTM! The which() helper provides robust executable location with clear error handling. Raising RuntimeError with a descriptive message when the binary is not found is appropriate for this use case.
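A minimal sketch of such a helper, assuming it wraps shutil.which (the actual path_utils.which() may differ beyond what the comment above states):

```python
import shutil
from pathlib import Path


def which(binary_name: str) -> Path:
    """Resolves an executable on PATH, raising RuntimeError with a clear message if absent."""
    resolved = shutil.which(binary_name)
    if resolved is None:
        err_msg = f"Cannot find executable on PATH: {binary_name}"
        raise RuntimeError(err_msg)
    return Path(resolved)
```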
Actionable comments posted: 1
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (3)
- src/log_archival_bench/scripts/docker_images/build.py (1 hunks)
- src/log_archival_bench/utils/path_utils.py (1 hunks)
- taskfiles/docker-images/main.yaml (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/log_archival_bench/scripts/docker_images/build.py (2)
- src/log_archival_bench/scripts/docker_images/utils.py (1)
  - get_image_name (6-12)
- src/log_archival_bench/utils/path_utils.py (3)
  - get_config_dir (24-26)
  - get_package_root (14-16)
  - which (29-41)
🪛 Ruff (0.13.3)
src/log_archival_bench/scripts/docker_images/build.py
- 53-53: subprocess call: check for execution of untrusted input (S603)
- 65-65: subprocess call: check for execution of untrusted input (S603)
🔇 Additional comments (4)
src/log_archival_bench/utils/path_utils.py (1)
- 1-41: LGTM! Clean implementation of path utilities. The module provides well-structured path helpers with appropriate error handling. The use of Path objects and validation in which() ensures robust path resolution.

taskfiles/docker-images/main.yaml (1)
- 1-76: LGTM! Task configuration is well-structured. The checksum validation has been correctly moved from sources to deps, resolving the previous parsing issue. The parallel build orchestration and incremental rebuild detection via checksums align well with the PR objectives.

src/log_archival_bench/scripts/docker_images/build.py (2)
- 43-53: Docker build implementation is correct. The subprocess call is safe given that docker_bin is resolved via which() and other arguments are constructed from validated or trusted sources. The static analysis warning (S603) is a false positive in this context, assuming the engine_name validation suggested earlier is implemented. Note: This approval assumes the engine_name input validation is added as suggested in the earlier comment.
- 55-65: LGTM! Config dump implementation follows best practices. The optional JSON metadata dump correctly creates parent directories and uses a context manager for file operations. The subprocess call is safe given validated inputs.
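For illustration, the dump pattern described in that comment (parent-directory creation plus a context manager around the write) could be sketched as follows; the function name and signature are assumptions:

```python
import subprocess
from pathlib import Path


def dump_image_config(docker_bin: Path, image_tag: str, dump_config_path: Path) -> None:
    """Writes the image's docker inspect JSON metadata to the given path."""
    result = subprocess.run(
        [str(docker_bin), "inspect", image_tag],
        check=True,
        capture_output=True,
        text=True,
    )
    dump_config_path.parent.mkdir(parents=True, exist_ok=True)
    with dump_config_path.open("w") as dump_file:
        dump_file.write(result.stdout)
```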
```python
args_parser = argparse.ArgumentParser()
args_parser.add_argument(
    "--engine-name", required=True, help="The engine to be installed inside the Docker image."
)
args_parser.add_argument(
    "--dump-config-path", help="Path to the file to dump the Docker image JSON metadata."
)

parsed_args = args_parser.parse_args(argv[1:])
engine_name = parsed_args.engine_name
dump_config_path = parsed_args.dump_config_path
```
Validate engine_name to prevent path traversal.

The engine_name parameter is user-controlled and used to construct file paths (line 38) without validation. While the Dockerfile existence check (line 39) provides some protection, it doesn't prevent path traversal attempts that could access Dockerfiles outside the intended directory.

Add input validation before using engine_name:
```diff
 parsed_args = args_parser.parse_args(argv[1:])
 engine_name = parsed_args.engine_name
+# Prevent path traversal
+if "/" in engine_name or "\\" in engine_name or engine_name in (".", ".."):
+    err_msg = f"Invalid engine name: {engine_name}"
+    raise ValueError(err_msg)
 dump_config_path = parsed_args.dump_config_path
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
args_parser = argparse.ArgumentParser()
args_parser.add_argument(
    "--engine-name", required=True, help="The engine to be installed inside the Docker image."
)
args_parser.add_argument(
    "--dump-config-path", help="Path to the file to dump the Docker image JSON metadata."
)

parsed_args = args_parser.parse_args(argv[1:])
engine_name = parsed_args.engine_name
# Prevent path traversal
if "/" in engine_name or "\\" in engine_name or engine_name in (".", ".."):
    err_msg = f"Invalid engine name: {engine_name}"
    raise ValueError(err_msg)
dump_config_path = parsed_args.dump_config_path
```
🤖 Prompt for AI Agents
In src/log_archival_bench/scripts/docker_images/build.py around lines 24 to 34,
validate the user-controlled engine_name before using it to construct filesystem
paths to prevent path traversal: reject names containing path separators or
traversal components (e.g. ".."), enforce a tight whitelist or regex
(alphanumeric, hyphen, underscore) or map names to known filenames, then
construct the Dockerfile path with os.path.join against the intended directory
and verify the resolved absolute path is inside that directory (compare
realpath/abspath prefix) before proceeding; if validation fails, raise/exit with
a clear error.
The engine_name is validated along with docker_file_path. Do I need to iterdir the docker-images dir and add an accepted list of values?
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# View the complete implementation of the build.py file to understand the validation logic
cat src/log_archival_bench/scripts/docker_images/build.py
```
Length of output: 2032
🏁 Script executed:
```bash
#!/bin/bash
# Check the engines list defined in the Task configuration
grep -A 20 "G_DOCKER_IMAGE_ENGINES:" taskfiles/docker-images/main.yaml
```
Length of output: 333
@Bill-hbrhbr, yes, I recommend creating an accepted list of engine names rather than relying solely on the Dockerfile existence check. Here's why:
Security concern: The current validation (docker_file_path.is_file()) checks if a Dockerfile exists at the constructed path, but doesn't prevent path traversal. If someone passes engine_name="../../../malicious" and a Dockerfile happens to exist there, the validation would pass.

Recommended approach: Use an explicit allowlist of valid engine names. This is more secure and provides better error messages. You already have a canonical list in taskfiles/docker-images/main.yaml:
```python
def main(argv: list[str]) -> int:
    # ... existing argparse code ...
    parsed_args = args_parser.parse_args(argv[1:])
    engine_name = parsed_args.engine_name

    # Validate against allowlist
    VALID_ENGINES = {"clickhouse", "clp", "elasticsearch", "sparksql", "zstandard"}
    if engine_name not in VALID_ENGINES:
        err_msg = f"Invalid engine name: {engine_name}. Valid engines: {', '.join(sorted(VALID_ENGINES))}"
        raise ValueError(err_msg)

    dump_config_path = parsed_args.dump_config_path
    # ... rest of code ...
```
Benefits:
- Security: Prevents path traversal entirely, regardless of filesystem state
- Better UX: Shows users the list of valid engines when they provide an invalid name
- Explicit contract: Makes it clear what engines are supported
- Fail-fast: Catches invalid input before any filesystem operations
You can keep the Dockerfile existence check as a secondary validation to catch configuration errors (e.g., if the Dockerfile was accidentally deleted).
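If the allowlist should track the repository instead of being hard-coded, a hedged sketch of the iterdir idea raised above could look like this (the layout of one subdirectory per engine, each containing a Dockerfile, is an assumption):

```python
from pathlib import Path


def get_valid_engine_names(docker_images_dir: Path) -> set[str]:
    """Collects engine names from subdirectories that actually contain a Dockerfile."""
    return {
        child.name
        for child in docker_images_dir.iterdir()
        if child.is_dir() and (child / "Dockerfile").is_file()
    }
```

Either way, validating engine_name against a closed set before building any path from it removes the traversal concern, while the Dockerfile existence check remains a secondary configuration sanity check.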
Description
Previously, each benchmark run tore down and rebuilt the Docker image for its associated engine, resulting in unnecessary rebuilds of identical images.
The new system leverages Task to store and compare checksums for each built image's JSON metadata. When an image rebuild is initiated via Task, it now checks whether the Dockerfile or the build scripts for that engine have changed. If neither has changed, the existing image is kept.
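As a rough illustration of this rebuild-detection idea (the actual mechanism lives in the Taskfile and checksums each image's inspect JSON; the paths and hash choice below are assumptions), a content checksum over an engine's build inputs could be computed like this:

```python
import hashlib
from pathlib import Path


def compute_build_inputs_checksum(engine_dir: Path) -> str:
    """Hashes an engine's Dockerfile and build scripts so unchanged engines can skip rebuilds."""
    digest = hashlib.sha256()
    for path in sorted(engine_dir.rglob("*")):
        if path.is_file():
            # Include the relative path so renames also invalidate the checksum.
            digest.update(path.relative_to(engine_dir).as_posix().encode())
            digest.update(path.read_bytes())
    return digest.hexdigest()
```

If the stored checksum matches the freshly computed one, the existing image is kept; otherwise the image is rebuilt and the checksum is refreshed.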
The presto_clp and presto_parquet images are temporarily disabled in this PR because their Dockerfiles are currently incorrect and result in failed builds. They will be re-enabled once their definitions are corrected in a future update.

Checklist
breaking change.
Validation performed

task docker-images:build builds the images and checksums them successfully.