@OtherVibes
Owner

Description

Brief description of the changes in this PR.

Type of Change

  • 🐛 Bug fix (non-breaking change which fixes an issue)
  • ✨ New feature (non-breaking change which adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📚 Documentation update
  • 🔧 Maintenance/refactoring
  • ⚡ Performance improvement
  • 🧪 Test improvement

Testing

  • Tests pass locally with my changes
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • I have checked my code and corrected any misspellings

Related Issues

Closes #(issue number)

Screenshots (if applicable)

Additional Notes

Any additional information that reviewers should know.

Zvi Fried added 25 commits August 29, 2025 02:18
…roject structure

- Move elicit models (ObstacleResolutionDecision, RequirementsClarification) from server.py to models.py
- Remove duplicate model definitions to follow DRY principle
- Update imports in server.py to use centralized models
- Remove PROJECT_SUMMARY.md file for cleaner project structure
- Improve code organization and maintainability
…itHub Container Registry

- Update README installation instructions to prioritize PyPI package over git clone
- Change primary installation method to use 'uv add mcp-as-a-judge' and 'pip install mcp-as-a-judge'
- Update Docker Compose to use pre-built images from GitHub Container Registry for production
- Separate development and production Docker configurations with profiles
- Ensure all Docker instructions reference ghcr.io/hepivax/mcp-as-a-judge
- Keep git clone only for development and source builds
- Improve user experience by making package installation the default path
… for MCP stdio communication

- Remove PORT and TRANSPORT build args from Dockerfile (MCP uses stdio, not HTTP)
- Remove EXPOSE directive and port mappings from Docker configurations
- Update docker-compose.yml to remove port mappings and add stdin_open/tty for stdio
- Remove nginx service (not needed for MCP servers)
- Update Docker run commands in README to use -it instead of port mappings
- Fix health check to use process check instead of HTTP endpoint
- Add note in README explaining MCP uses stdio communication
- Simplify Docker configuration for proper MCP server deployment
…ient requirements

- Add explanation that concept derives from LLM-as-a-Judge paradigm
- Specify MCP client requirements with official documentation links:
  - Sampling capability required for AI-powered code evaluation
  - Elicitation capability required for user decision prompts
- Link to official MCP docs for sampling and elicitation concepts
- Enhance features section to reference specific MCP capabilities
- Improve clarity on technical requirements for proper functionality
…eveloper-AI interface

- Add prominent section explaining core mission to enhance developer-AI collaboration
- Emphasize preventing poor AI decisions and involving humans in critical choices
- Update main description to highlight transformation of developer-AI experience
- Add focus on intelligent AI-human collaboration with clear boundaries
- Make it clear this is about improving the interface between developers and AI assistants
- Position as solution for better AI-human workflow in software development
…nding

- Replace 🚨 with ⚖️ in main title for better thematic representation
- Add ⚖️ to Main Purpose section header
- Update Five Powerful Tools to Five Powerful Judge Tools with ⚖️ icon
- Add ⚖️ to Concept section for consistent judge theme
- Improve visual identity and reinforce the 'judge' concept throughout README
- Create cohesive branding with scales of justice emoji
…n with AI-powered evaluation

- Replace hardcoded research validation logic with intelligent AI evaluation
- Embed research, plan, design, and user requirements into validation prompt
- Use LLM sampling to assess research comprehensiveness and design alignment
- Evaluate if design is properly based on research findings
- Check for exploration of existing solutions, alternatives, and best practices
- Validate research quality and actionable insights
- Provide detailed feedback on research gaps and design-research alignment
- Maintain obstacle resolution pattern for user involvement in decisions
- Improve validation accuracy and reduce false positives from static checks
…use Pydantic JSON schema

- Fix judge_code_change trigger: must be called BEFORE making any file changes
- Replace hardcoded JSON format with actual Pydantic model schema
- Use JudgeResponse.model_json_schema() for consistent response format
- Ensure proper validation timing: code review before file modification
- Improve prompt accuracy by using actual model schema instead of manual format
- Maintain consistency between expected response format and actual model structure
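
To illustrate the idea in this commit, here is a minimal sketch of building the response-format instructions from a Pydantic model's schema rather than a hand-written JSON example. It assumes Pydantic v2. The fields on `JudgeResponse` and the `build_format_instructions` helper are hypothetical and only stand in for whatever models.py and server.py actually define.

```python
import json

from pydantic import BaseModel, Field


class JudgeResponse(BaseModel):
    """Illustrative stand-in; the real model's fields may differ."""

    approved: bool = Field(description="Whether the submission passes review")
    required_improvements: list[str] = Field(
        default_factory=list, description="Changes needed before approval"
    )
    feedback: str = Field(description="Detailed reviewer feedback")


def build_format_instructions(model: type[BaseModel]) -> str:
    """Render the model's JSON schema as prompt text (hypothetical helper)."""
    schema = json.dumps(model.model_json_schema(), indent=2)
    return "Respond with JSON matching this schema:\n" + schema


instructions = build_format_instructions(JudgeResponse)
print(instructions)
```

Because the prompt text is generated from the model, adding or renaming a field changes the expected response format and the validation logic together.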
…s to evaluation criteria

- Integrate key concepts from The Pragmatic Programmer book into judge prompts
- Add DRY Principle, Orthogonality, and Design by Contract evaluations
- Include Defensive Programming, Fail Fast, and Broken Windows Theory
- Add Tracer Bullets, Reversibility, and Good Enough Software principles
- Enhance with Test Early/Test Often and Premature Optimization awareness
- Include Easy to Change, Refactoring Strategy, and Plain Text Power concepts
- Add Rubber Duck Debugging and 'Use the Source, Luke' references
- Improve evaluation guidelines with pragmatic context-driven approach
- Balance perfectionism with practical software delivery principles
- Create more comprehensive and industry-standard evaluation criteria
…ring best practices to evaluation criteria

- Integrate DRY Principle, Orthogonality, and Design by Contract evaluations
- Add Defensive Programming, Fail Fast, and Broken Windows Theory concepts
- Include Tracer Bullets, Reversibility, and Good Enough Software principles
- Enhance with Test Early/Test Often and Premature Optimization awareness
- Add Easy to Change, Refactoring Strategy, and Plain Text Power concepts
- Include Rubber Duck Debugging and authoritative source validation
- Improve evaluation guidelines with context-driven approach
- Balance perfectionism with practical software delivery principles
- Create more comprehensive and industry-standard evaluation criteria
- Focus on maintainable, working software over perfect solutions
… new file creation

- Make it explicit that judge_code_change must be called BEFORE creating ANY new files
- Add comprehensive list of file operations that require code review
- Include new Python files, configuration files, scripts, and modules
- Update parameter descriptions to clarify new file content vs modifications
- Change prompt language from 'code changes' to 'code content' for clarity
- Ensure all file operations involving code are properly validated
- Prevent creation of unreviewed code files in any format
…on impossible to miss

- Add prominent 🚨🚨🚨 alerts and visual emphasis for mandatory requirement
- Specify exact triggers: save-file, str-replace-editor, and other code-writing tools
- Add explicit consequences of not calling: SWE compliance violations, security risks
- Include clear example workflow showing proper usage timing
- Change from 'BEFORE' to 'IMMEDIATELY AFTER' for clarity on timing
- Add specific tool names that trigger the requirement
- Make file_path parameter required instead of optional
- Emphasize this is mandatory compliance, not optional review
- Use multiple warning levels and visual cues to prevent oversight
… imports and enhance pre-commit

- Replace 'from .models' with 'from mcp_as_a_judge.models' in server.py
- Replace 'from .server' with 'from mcp_as_a_judge.server' in __init__.py
- Add gitleaks security scanning to pre-commit hooks (first priority)
- Add additional pre-commit hooks for better code quality
- Ensure all imports are absolute for better maintainability
- Improve import clarity and avoid relative import issues
- Note: ruff already provides black, isort, and flake8 functionality
- Fix trailing whitespace in multiple files
- Fix end-of-file issues in docker-compose.yml
- Apply isort import sorting to all Python files
- Apply black code formatting to 9 Python files
- Fix prettier formatting for markdown and YAML files
- All security checks passed (gitleaks found no secrets)
- Pre-commit hooks are now working correctly and enforcing quality standards
- Remove poetry-check from pre-commit (we use uv, not poetry)
- Fix all flake8 D202 errors (blank lines after docstrings)
- Fix flake8 D400 error (missing period in docstring)
- Fix boolean comparison issues (== True/False -> direct boolean checks)
- Add missing return type annotations to all test functions
- Add missing docstrings to __init__ methods in conftest.py
- Extract research validation logic to reduce complexity (C901)
- Create _validate_research_quality helper function
- Replace duplicated research validation code with helper function call
- Improve code maintainability and reduce cyclomatic complexity
…rors

- Extract _evaluate_coding_plan helper function to reduce complexity
- Reduce judge_coding_plan complexity from 15 to under 10 (C901 resolved)
- Remove duplicated prompt code and use helper functions
- Fix final D202 error (blank line after docstring)
- All flake8 errors now resolved
- Improve code maintainability with better separation of concerns
- Helper functions make code more testable and reusable
- Black automatically reformatted server.py for consistent style
- All flake8 errors resolved ✅
- Gitleaks security scan passing ✅
- Code formatting and style checks passing ✅
- Only mypy type checking issues remain (expected for MCP project)
…rate blocking behavior

- Add pytest hook to run tests before every commit
- Configure pytest with verbose output and short traceback
- Fix test assertions to match actual server name format
- Demonstrate pre-commit blocking with multiple hook failures
- All hooks now properly validate code quality before commits
…th Jinja2 templating

✨ MAJOR REFACTORING: Externalized Prompts for Better Maintainability

🎯 **What Changed:**
- **Extracted all hardcoded prompts** to separate Markdown files in src/prompts/
- **Added Jinja2 templating** for dynamic variable substitution
- **Created PromptLoader utility** for loading and rendering templates
- **Comprehensive test coverage** for prompt loading functionality

📁 **New Structure:**
- src/prompts/judge_coding_plan.md - Main evaluation prompt
- src/prompts/judge_code_change.md - Code review prompt
- src/prompts/research_validation.md - Research quality validation
- src/mcp_as_a_judge/prompt_loader.py - Template loading utility
- tests/test_prompt_loader.py - Full test coverage

🚀 **Benefits:**
- **Easy editing**: Prompts now in readable Markdown format
- **Version control**: Track prompt changes separately from code
- **Maintainability**: No more giant f-strings in Python code
- **Flexibility**: Jinja2 templating for dynamic content
- **Testability**: Isolated prompt testing and validation
- **Collaboration**: Non-developers can edit prompts easily

✅ **Quality Assurance:**
- All existing tests pass (28/28)
- New comprehensive prompt loader tests
- Backward compatibility maintained
- No functional changes to evaluation logic

This refactoring makes the codebase much more maintainable and allows for easier prompt iteration and improvement! 🎉
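
For reviewers unfamiliar with the pattern, the following is a rough sketch of a Jinja2-backed prompt loader. It is not the code in prompt_loader.py: the class shape, the `render` method, the template text, and the variable names `file_path` and `code` are all assumptions made for illustration. The template is written to a temporary directory so the example runs on its own.

```python
import tempfile
from pathlib import Path

from jinja2 import Environment, FileSystemLoader, StrictUndefined


class PromptLoader:
    """Load Markdown prompt templates from a directory and render them."""

    def __init__(self, prompts_dir: Path) -> None:
        # StrictUndefined makes a missing template variable raise an error
        # instead of silently rendering as an empty string.
        self.env = Environment(
            loader=FileSystemLoader(str(prompts_dir)),
            undefined=StrictUndefined,
        )

    def render(self, template_name: str, **variables: object) -> str:
        return self.env.get_template(template_name).render(**variables)


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical template content, standing in for a file under src/prompts/.
    Path(tmp, "judge_code_change.md").write_text(
        "Review the code in {{ file_path }}:\n\n{{ code }}\n", encoding="utf-8"
    )
    loader = PromptLoader(Path(tmp))
    prompt = loader.render(
        "judge_code_change.md", file_path="app.py", code="print('hi')"
    )

print(prompt)
```

Using `StrictUndefined` is a design choice in this sketch, not something the commit states; it surfaces a forgotten variable as an exception at render time.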
…eparation and fix all mypy issues

- Reorganized prompts into system/ and user/ directories for clear separation
- System prompts contain behavioral instructions (HOW to evaluate)
- User prompts contain simple requests (WHAT to evaluate)
- Fixed all mypy type checking issues with proper annotations
- Updated pre-commit configuration for proper mypy integration
- Removed unused files (docker-compose.yml, example files)
- All tests passing (29/29) with full type safety
- Perfect separation of concerns in prompt architecture
…xception swallowing

- Add ResearchValidationResponse Pydantic model for proper validation
- Create robust _extract_json_from_response() function to handle:
  * Markdown code blocks
  * Plain JSON objects
  * JSON embedded in explanatory text
  * Proper error handling for malformed responses
- Replace manual json.loads() + dict.get() with Pydantic model_validate_json()
- Remove exception swallowing that masked real parsing errors
- Remove inappropriate raise_obstacle suggestions from parsing errors
- Apply consistent parsing pattern to all LLM sampling functions:
  * _validate_research_quality
  * _evaluate_workflow_guidance
  * _evaluate_coding_plan
  * judge_code_change
- Add comprehensive test suite (tests/test_json_extraction.py) with 8 test cases
- Fix context injection issues by using proper Context type annotations
- All 37 tests passing, mypy clean

Resolves the 'Invalid JSON expected value at line 1 column 1' error caused by LLMs returning JSON wrapped in markdown code blocks.
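
The three input shapes listed above (fenced block, plain object, JSON embedded in prose) can be handled along these lines. This is a standard-library sketch under assumptions, not the repository's `_extract_json_from_response()`: the function name, signature, and error type here are hypothetical.

```python
import json
import re

FENCE = "`" * 3  # a markdown code fence, built at runtime to keep this listing readable
_FENCED_BLOCK = re.compile(
    FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL | re.IGNORECASE
)


def extract_json_from_response(text: str) -> str:
    """Return the JSON object substring found in an LLM reply.

    Raises ValueError if no object is found or the candidate is malformed.
    """
    # Prefer the contents of a fenced code block when one is present.
    match = _FENCED_BLOCK.search(text)
    candidate = match.group(1) if match else text

    # Take the span from the first "{" to the last "}".
    start = candidate.find("{")
    end = candidate.rfind("}")
    if start == -1 or end < start:
        raise ValueError("No JSON object found in response")

    snippet = candidate[start : end + 1]
    json.loads(snippet)  # json.JSONDecodeError is a ValueError; let it propagate
    return snippet


fenced = f'Here is my verdict:\n{FENCE}json\n{{"approved": true}}\n{FENCE}'
print(extract_json_from_response(fenced))
print(extract_json_from_response('Sure. {"a": 1} Hope that helps.'))
```

The returned string can then be passed to a Pydantic model's `model_validate_json()`, so parsing failures surface as errors instead of being swallowed.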
- Remove Technical Prerequisites section
- Update AI assistants section to show only supported ones in clean table format
- Change Critical Requirements to MCP Client Prerequisites with bold formatting
- Convert Five Powerful Judge Tools to List of Tools with tools emoji
- Reorganize tools section as a clean table with tool names and descriptions
- Streamline documentation for better readability and focus
…ge configuration

- Upgrade Python version from 3.12 to 3.13.5 across all configurations:
  * Update .python-version, pyproject.toml, and all GitHub workflows
  * Update Dockerfile to use python:3.13-slim base images
  * Update README badge and CONTRIBUTING.md requirements
  * Regenerate uv.lock with Python 3.13 dependencies
- Add Python 3.13+ to system prerequisites in README
- Improve coverage configuration in pyproject.toml:
  * Add comprehensive source and omit patterns
  * Configure exclude_lines for better coverage reporting
  * Set XML output configuration
- Update CI workflow for better Codecov integration:
  * Set fail_ci_if_error to false for more reliable CI
  * Add verbose output for better debugging
  * Ensure CODECOV_TOKEN environment variable is properly set
- All 37 tests passing on Python 3.13.5
- mypy type checking clean with Python 3.13
@OtherVibes self-assigned this Aug 30, 2025
Zvi Fried added 2 commits August 30, 2025 07:41
…dio-only MCP configuration

- Remove HTTP/port-related configurations (PORT, TRANSPORT, EXPOSE)
- Keep Python 3.13-slim base images for latest Python version
- Maintain process-based health check using pgrep instead of HTTP curl
- Ensure MCP server remains stdio-only as intended for MCP protocol
- Resolve merge conflict with main branch while preserving Python 3.13 upgrade
…nd fix merge conflicts

- Replace hardcoded version '1.0.0' with dynamic VERSION build argument in Dockerfile
- Add VERSION build arg with 'latest' default for flexible versioning
- Update CI workflow to pass development version (dev-{commit-sha}) for test builds
- Update release workflow to pass actual tag version for production builds
- Remove HTTP/port configurations to keep MCP server stdio-only as intended
- Maintain Python 3.13-slim base images while resolving main branch conflicts
- Ensure proper version tracking across PyPI packages and Docker images
- Enable automatic versioning without manual Dockerfile updates
@OtherVibes merged commit a1718b2 into feat/initial-release-infrastructure Aug 30, 2025
1 check passed

2 participants