Two-Stage Token Optimization with Phase 1 UX Enhancements #283
Closed
WKassebaum wants to merge 18 commits into BeehiveInnovations:main from WKassebaum:token-optimization-v2
Changes from 15 commits
Commits
c0f3f94
docs: Initialize token optimization v2 rebase plan
WKassebaum 6a26dae
feat: Add two-stage token optimization architecture to main v8.0.0
WKassebaum 91679ff
docs: Add comprehensive token optimization documentation
WKassebaum ff94cc4
fix: Configure Docker for token optimization deployment
WKassebaum 2c1ed2a
fix: Remove conf volume mount to enable model registry loading
WKassebaum b2f6541
docs: Add PR description and token reduction metrics
WKassebaum 99a1653
feat: Add Phase 1 UX improvements to two-stage token optimization
WKassebaum 57bbc92
fix: Resolve critical schema validation mismatch bugs in two-stage to…
WKassebaum 2cb3724
docs: Update PR description with schema validation bug fixes
WKassebaum 14b0468
fix: Add enum constraints and codereview transformation discovered vi…
WKassebaum 6ec401a
fix: Fix smart stub mode forcing and add debug transformation
WKassebaum 33a4a79
fix: Add working_directory to ChatRequest in mode_executor
WKassebaum bf134c5
fix: Ensure refactor requests include non-empty relevant_files
WKassebaum 59bf078
feat: Add missing XAI Grok models and update PR with accurate model list
WKassebaum 0a47441
docs: Add detailed bug fix documentation and ignore test script
WKassebaum 67add5e
fix: Address code review feedback (Phase 1 fixes)
WKassebaum 3aceed6
docs: Add Phase 2 schema refactoring design document
WKassebaum 7661141
docs: Add Phase 2 resume guide for future implementation
WKassebaum
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,63 @@ | ||
| # Test directories - don't index test code | ||
| tests/ | ||
| simulator_tests/ | ||
| test_simulation_files/ | ||
| test-setup/ | ||
| test_output/ | ||
|
|
||
| # Test files | ||
| *.test.py | ||
| *_test.py | ||
| test_*.py | ||
|
|
||
| # Coverage and test artifacts | ||
| .coverage | ||
| htmlcov/ | ||
| coverage.xml | ||
| .pytest_cache/ | ||
| *.test.log | ||
|
|
||
| # Python cache | ||
| __pycache__/ | ||
| *.pyc | ||
| *.pyo | ||
| *.pyd | ||
|
|
||
| # Virtual environments | ||
| .venv/ | ||
| venv/ | ||
| env/ | ||
| .zen_venv/ | ||
|
|
||
| # Build artifacts | ||
| build/ | ||
| dist/ | ||
| *.egg-info/ | ||
|
|
||
| # Logs | ||
| logs/ | ||
| *.log | ||
|
|
||
| # Temporary files | ||
| tmp/ | ||
| /tmp/ | ||
| *.tmp | ||
| *.backup | ||
|
|
||
| # IDE | ||
| .idea/ | ||
| .vscode/ | ||
|
|
||
| # OS | ||
| .DS_Store | ||
| Thumbs.db | ||
|
|
||
| # Documentation build | ||
| docs/_build/ | ||
| site/ | ||
|
|
||
| # Environment files (may contain secrets) | ||
| .env | ||
| .env.* | ||
| *.key | ||
| *.pem |
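To make the effect of these patterns concrete, here is a minimal Python sketch of how a few of them could be evaluated against repository paths. It is a simplification, not the matcher the indexing tool actually uses: it assumes directory patterns (those ending in `/`) match any parent directory by name, and that all other patterns are matched against the file's basename with `fnmatch`. The function name `is_ignored` and the sample paths are hypothetical.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# A subset of the patterns from the ignore file above.
PATTERNS = [
    "tests/", "__pycache__/", ".venv/",
    "test_*.py", "*_test.py", "*.pyc",
    ".env", ".env.*", "*.pem",
]


def is_ignored(path: str, patterns=PATTERNS) -> bool:
    """Return True if `path` matches one of the simplified ignore patterns."""
    parts = PurePosixPath(path).parts
    if not parts:
        return False
    for pattern in patterns:
        if pattern.endswith("/"):
            # Directory pattern: ignore the file if any parent directory matches.
            directory = pattern.rstrip("/")
            if any(fnmatch(part, directory) for part in parts[:-1]):
                return True
        elif fnmatch(parts[-1], pattern):
            # File pattern: compare against the basename only.
            return True
    return False


for candidate in ["tests/test_server.py", "tools/mode_selector.py", "config/.env.local"]:
    print(candidate, is_ignored(candidate))
```

Real ignore-file semantics (anchored patterns such as `/tmp/`, negation, `**`) are richer than this sketch, so treat it only as an illustration of which kinds of files the list is meant to exclude.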
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -188,3 +188,4 @@ logs/ | ||
| /worktrees/ | ||
| test_simulation_files/ | ||
| .mcp.json | ||
| test_new_grok_models.py | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,152 @@ | ||
| # Claude Code CLI Test Commands for A/B Testing | ||
|
|
||
| ## Important: How to Run Zen Tools in Claude Code CLI | ||
|
|
||
| In Claude Code CLI, Zen MCP tools must be invoked through the MCP protocol, not as bash commands. | ||
|
|
||
| **Correct format**: Use the `mcp__zen__` prefix and proper parameter structure | ||
| **Incorrect format**: `zen analyze --model gemini-2.5-flash` (this won't work) | ||
|
|
||
| ## Baseline Test Commands (9 tests) | ||
|
|
||
| ### Test 1: Architecture Analysis (gemini-2.5-flash) | ||
| ``` | ||
| Use mcp__zen__analyze with these parameters: | ||
| - step: "Analyze the token optimization architecture in this codebase. Focus on the two-stage approach, mode selection logic, and telemetry system." | ||
| - step_number: 1 | ||
| - total_steps: 1 | ||
| - next_step_required: false | ||
| - findings: "Starting analysis of token optimization architecture" | ||
| - relevant_files: ["/app/server.py", "/app/tools/mode_selector.py", "/app/token_optimization_config.py"] | ||
| - model: "gemini-2.5-flash" | ||
| ``` | ||
|
|
||
| ### Test 2: Security Audit (grok-code-fast-1) | ||
| ``` | ||
| Use mcp__zen__secaudit with these parameters: | ||
| - step: "Perform comprehensive security audit of the MCP server focusing on: TCP transport security, Docker container isolation, API key handling, and input validation." | ||
| - step_number: 1 | ||
| - total_steps: 1 | ||
| - next_step_required: false | ||
| - findings: "Starting security audit" | ||
| - relevant_files: ["/app/server.py", "/app/providers", "/app/docker-compose.yml"] | ||
| - model: "grok-code-fast-1" | ||
| ``` | ||
|
|
||
| ### Test 3: Performance Debug (o3-mini) | ||
| ``` | ||
| Use mcp__zen__debug with these parameters: | ||
| - step: "Investigate potential performance bottlenecks in the token optimization system. Analyze the two-stage execution flow, Redis conversation memory, and provider selection logic." | ||
| - step_number: 1 | ||
| - total_steps: 1 | ||
| - next_step_required: false | ||
| - findings: "Starting performance investigation" | ||
| - confidence: "exploring" | ||
| - relevant_files: ["/app/token_optimization_config.py", "/app/tools/mode_selector.py", "/app/utils/conversation_memory.py"] | ||
| - model: "o3-mini" | ||
| ``` | ||
|
|
||
| ### Test 4: Code Review (gemini-2.5-flash) | ||
| ``` | ||
| Use mcp__zen__codereview with these parameters: | ||
| - step: "Review the token optimization implementation for code quality, maintainability, and best practices." | ||
| - step_number: 1 | ||
| - total_steps: 1 | ||
| - next_step_required: false | ||
| - findings: "Starting code review" | ||
| - relevant_files: ["/app/server_token_optimized.py", "/app/tools/mode_executor.py"] | ||
| - model: "gemini-2.5-flash" | ||
| ``` | ||
|
|
||
| ### Test 5: Refactoring Analysis (grok-code-fast-1) | ||
| ``` | ||
| Use mcp__zen__refactor with these parameters: | ||
| - step: "Suggest refactoring opportunities for the MCP server architecture to improve modularity, reduce coupling, and enhance testability. Consider the provider system and tool registration." | ||
| - step_number: 1 | ||
| - total_steps: 1 | ||
| - next_step_required: false | ||
| - findings: "Starting refactoring analysis" | ||
| - relevant_files: ["/app/server.py", "/app/providers/registry.py", "/app/tools/__init__.py"] | ||
| - model: "grok-code-fast-1" | ||
| ``` | ||
|
|
||
| ### Test 6: Test Generation (o3-mini) | ||
| ``` | ||
| Use mcp__zen__testgen with these parameters: | ||
| - step: "Generate comprehensive test strategy for token optimization feature including unit tests, integration tests, and A/B testing validation. Focus on edge cases and error scenarios." | ||
| - step_number: 1 | ||
| - total_steps: 1 | ||
| - next_step_required: false | ||
| - findings: "Starting test generation" | ||
| - relevant_files: ["/app/token_optimization_config.py", "/app/tools/mode_selector.py"] | ||
| - model: "o3-mini" | ||
| ``` | ||
|
|
||
| ### Test 7: Debug Docker Issue (gemini-2.5-flash) | ||
| ``` | ||
| Use mcp__zen__debug with these parameters: | ||
| - step: "Debug why the Docker dual-transport mode occasionally restarts. Analyze server.py transport logic, Docker configuration, and error handling patterns." | ||
| - step_number: 1 | ||
| - total_steps: 1 | ||
| - next_step_required: false | ||
| - findings: "Starting Docker transport investigation" | ||
| - confidence: "exploring" | ||
| - relevant_files: ["/app/server.py", "/app/docker-compose.yml"] | ||
| - model: "gemini-2.5-flash" | ||
| ``` | ||
|
|
||
| ### Test 8: Consensus on WebSocket (multiple models) | ||
| ``` | ||
| Use mcp__zen__consensus with these parameters: | ||
| - step: "Should we implement WebSocket transport in addition to TCP and stdio? Consider: performance implications, client complexity, Docker networking, and maintenance overhead." | ||
| - step_number: 1 | ||
| - total_steps: 3 | ||
| - next_step_required: true | ||
| - findings: "Starting consensus gathering" | ||
| - models: [{"model": "o3-mini"}, {"model": "gemini-2.5-flash"}, {"model": "grok-code-fast-1"}] | ||
| ``` | ||
|
|
||
| ### Test 9: Deep Investigation (grok-code-fast-1) | ||
| ``` | ||
| Use mcp__zen__thinkdeep with these parameters: | ||
| - step: "Investigate the optimal token budget allocation strategy for different model types. Consider context windows, pricing, response quality, and conversation threading requirements." | ||
| - step_number: 1 | ||
| - total_steps: 1 | ||
| - next_step_required: false | ||
| - findings: "Starting deep investigation" | ||
| - confidence: "high" | ||
| - relevant_files: ["/app/utils/token_utils.py", "/app/providers/base.py"] | ||
| - model: "grok-code-fast-1" | ||
| ``` | ||
|
|
||
| ## Test Protocol | ||
|
|
||
| ### Phase 1: Baseline Testing (current configuration) | ||
| 1. Verify `.env` has `ZEN_TOKEN_OPTIMIZATION=disabled` | ||
| 2. Container should already be running with baseline config | ||
| 3. Execute each test command above | ||
| 4. Monitor logs: `docker exec zen-mcp-server tail -f /app/logs/mcp_server.log` | ||
| 5. Check telemetry after each test | ||
|
|
||
| ### Phase 2: Optimized Testing | ||
| 1. Update `.env`: `ZEN_TOKEN_OPTIMIZATION=enabled` | ||
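| 2. Recreate the container so the updated `.env` is read (a plain `restart` may not pick up environment changes): `docker-compose up -d --force-recreate zen-mcp` | ||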
| 3. Restart Claude Code CLI connection | ||
| 4. Execute the same 9 tests | ||
| 5. Compare telemetry results | ||
|
|
||
| ## Monitoring Commands | ||
|
|
||
| Check execution logs: | ||
| ```bash | ||
| docker exec zen-mcp-server tail -50 /app/logs/mcp_activity.log | ||
| ``` | ||
|
|
||
| Check telemetry (when implemented): | ||
| ```bash | ||
| docker exec zen-mcp-server sh -c 'tail -n 5 ~/.zen_mcp/token_telemetry.jsonl' | ||
| ``` | ||
|
|
||
| ## Note on File Paths | ||
|
|
||
| All file paths must use Docker container paths (`/app/...`), not host paths (`/Users/wrk/...`), because the MCP server runs inside the Docker container. |
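As an illustration of the two points this guide makes (tools are invoked over the MCP protocol, and file paths must be container paths), the following Python sketch builds the JSON-RPC `tools/call` request that Test 1 would roughly correspond to. Several things here are assumptions rather than facts from the PR: the host checkout location `HOST_ROOT` is hypothetical, the helper names are invented for this example, and the server-side tool name is assumed to be `analyze` (with the `mcp__zen__` prefix added by the client). Only the `tools/call` method and its `name`/`arguments` parameters come from the MCP specification.

```python
import json
from pathlib import PurePosixPath

HOST_ROOT = PurePosixPath("/Users/example/zen-mcp-server")  # hypothetical host checkout
CONTAINER_ROOT = PurePosixPath("/app")  # container path used throughout the guide


def to_container_path(host_path: str) -> str:
    """Map a host path under HOST_ROOT to its /app/... equivalent.

    Raises ValueError if the path is not inside HOST_ROOT.
    """
    relative = PurePosixPath(host_path).relative_to(HOST_ROOT)
    return str(CONTAINER_ROOT / relative)


def build_tool_call(request_id: int, tool: str, arguments: dict) -> dict:
    """Build an MCP JSON-RPC 2.0 `tools/call` request as a plain dict."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


host_files = [
    "/Users/example/zen-mcp-server/server.py",
    "/Users/example/zen-mcp-server/tools/mode_selector.py",
]

request = build_tool_call(
    1,
    "analyze",  # assumed server-side name behind mcp__zen__analyze
    {
        "step": "Analyze the token optimization architecture in this codebase.",
        "step_number": 1,
        "total_steps": 1,
        "next_step_required": False,
        "findings": "Starting analysis of token optimization architecture",
        "relevant_files": [to_container_path(p) for p in host_files],
        "model": "gemini-2.5-flash",
    },
)

print(json.dumps(request, indent=2))
```

In practice Claude Code constructs and sends this request itself; the sketch only shows why a shell command such as `zen analyze ...` has no equivalent and why host paths need translating first.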
There's a discrepancy in the claimed token reduction across the documentation. Here it says 95% (43k → 800), but the PR description and other documents like `TOKEN_REDUCTION_METRICS.md` state it's an 82% reduction (43k → 7.8k) with compatibility stubs, and 96% for the core-only option.

To avoid confusion, it would be best to standardize these metrics across all documentation. I'd recommend using the more detailed breakdown:
This provides a clearer picture of the trade-offs.
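The figures quoted in this comment can be checked with a few lines of arithmetic. The sketch below assumes "43k", "7.8k" and "800" mean exactly 43,000, 7,800 and 800 tokens, which is an approximation of the rounded numbers in the docs; the function names are made up for this example.

```python
def reduction_percent(before: int, after: int) -> float:
    """Percentage reduction when going from `before` tokens to `after` tokens."""
    return (1 - after / before) * 100


def tokens_after(before: int, percent: float) -> int:
    """Tokens remaining after a `percent` reduction, rounded to the nearest token."""
    return round(before * (1 - percent / 100))


BASELINE = 43_000  # assumed exact value for "43k"

print(round(reduction_percent(BASELINE, 7_800), 1))  # 81.9
print(round(reduction_percent(BASELINE, 800), 1))    # 98.1
print(tokens_after(BASELINE, 95))                    # 2150
print(tokens_after(BASELINE, 96))                    # 1720
```

Under these assumptions, 43k → 7.8k is consistent with the stated 82%, while 43k → 800 works out to roughly 98% rather than 95%, and a 96% reduction would leave about 1.7k tokens. That is one more reason to state both the percentage and the absolute token counts for each configuration.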