A TypeScript-based AI agent evaluation and optimization framework using YAML specifications to score task completion. Built for modern development workflows with first-class support for VS Code Copilot, Azure OpenAI, Anthropic, and Google Gemini.
This is the recommended method for users who want to use agentv as a command-line tool.
- Install via npm:
```bash
# Install globally
npm install -g agentv

# Or use npx to run without installing
npx agentv --help
```
- Verify the installation:
```bash
agentv --help
```

Follow these steps if you want to contribute to the agentv project itself. This workflow uses pnpm workspaces and an editable install for immediate feedback.
- Clone the repository and navigate into it:
```bash
git clone https://github.com/EntityProcess/agentv.git
cd agentv
```
- Install dependencies:
```bash
# Install pnpm if you don't have it
npm install -g pnpm

# Install all workspace dependencies
pnpm install
```
- Build the project:
```bash
pnpm build
```
- Run tests:
```bash
pnpm test
```

You are now ready to start development. The monorepo contains:
- `packages/core/` - Core evaluation engine
- `apps/cli/` - Command-line interface
- Configure environment variables:
  - Copy `.env.template` to `.env` in your project root
  - Fill in your API keys, endpoints, and other configuration values
- Set up targets:
  - Copy `targets.yaml` to `.agentv/targets.yaml`
  - Update the environment variable names in `targets.yaml` to match those defined in your `.env` file
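For example, if your `targets.yaml` references the Azure variable names shown later in this document, the corresponding `.env` entries might look like this (all values are placeholders):

```bash
# .env — placeholder values for illustration
AZURE_OPENAI_ENDPOINT=https://my-resource.openai.azure.com
AZURE_OPENAI_API_KEY=your-api-key-here
AZURE_DEPLOYMENT_NAME=gpt-4o
```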
Validate your eval and targets files before running them:
```bash
# Lint a single file
agentv lint evals/my-test.yaml

# Lint multiple files
agentv lint evals/test1.yaml evals/test2.yaml

# Lint an entire directory (recursively finds all YAML files)
agentv lint evals/

# Enable strict mode for additional checks
agentv lint --strict evals/

# Output results in JSON format
agentv lint --json evals/
```

Linter features:
- Validates that the `$schema` field is present and correct
- Checks required fields and structure for eval and targets files
- Validates file references exist and are accessible
- Provides clear error messages with file path and location context
- Exits with non-zero code on validation failures (CI-friendly)
- Supports strict mode for additional checks (e.g., non-empty file content)
File type detection:
All AgentV files must include a $schema field:
```yaml
# Eval files
$schema: agentv-eval-v2
evalcases:
  - id: test-1
    # ...
```

```yaml
# Targets files
$schema: agentv-targets-v2
targets:
  - name: default
    # ...
```

Files without a `$schema` field will be rejected with a clear error message.
Run eval (target auto-selected from test file or CLI override):
```bash
# If your test.yaml contains "target: azure_base", it will be used automatically
agentv eval "path/to/test.yaml"

# Override the test file's target with a CLI flag
agentv eval --target vscode_projectx "path/to/test.yaml"
```

Run a specific test case with a custom targets path:

```bash
agentv eval --target vscode_projectx --targets "path/to/targets.yaml" --test-id "my-test-case" "path/to/test.yaml"
```

Options:
- `test_file`: Path to the test YAML file (required, positional argument)
- `--target TARGET`: Execution target name from `targets.yaml` (overrides the target specified in the test file)
- `--targets TARGETS`: Path to the `targets.yaml` file (default: `./.agentv/targets.yaml`)
- `--test-id TEST_ID`: Run only the test case with this specific ID
- `--out OUTPUT_FILE`: Output file path (default: `results/{testname}_{timestamp}.jsonl`)
- `--format FORMAT`: Output format, `jsonl` or `yaml` (default: `jsonl`)
- `--dry-run`: Run with a mock model for testing
- `--agent-timeout SECONDS`: Timeout in seconds for agent response polling (default: 120)
- `--max-retries COUNT`: Maximum number of retries for timeout cases (default: 2)
- `--cache`: Enable caching of LLM responses (default: disabled)
- `--dump-prompts`: Save all prompts to the `.agentv/prompts/` directory
- `--verbose`: Verbose output
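The flags can be combined freely; for example (paths are illustrative):

```bash
# Illustrative paths; writes human-readable YAML results with response caching enabled
agentv eval evals/example.yaml \
  --target azure_base \
  --format yaml \
  --out results/example.yaml \
  --cache \
  --verbose
```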
The CLI determines which execution target to use with the following precedence:
1. CLI flag override: `--target my_target` (when provided and not `default`)
2. Test file specification: the `target: my_target` key in the `.test.yaml` file
3. Default fallback: uses the `default` target (original behavior)
This allows test files to specify their preferred target while still allowing command-line overrides for flexibility, and maintains backward compatibility with existing workflows.
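For instance, a test file that pins its own target, so that `agentv eval "path/to/test.yaml"` needs no flags, might look like this sketch:

```yaml
$schema: agentv-eval-v2
target: azure_base # used unless overridden with --target
evalcases:
  - id: test-1
    # ...
```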
Output goes to .agentv/results/{testname}_{timestamp}.jsonl (or .yaml) unless --out is provided.
Workspace Switching: The runner automatically switches to the target workspace when running evals. Make sure you're not actively using another VS Code instance, as this could cause prompts to be injected into the wrong workspace.
Recommended Models: Use Claude Sonnet 4.5 or Grok Code Fast 1 for best results, as these models are more consistent in following instruction chains.
- Node.js 20.0.0 or higher
- Environment variables for your chosen providers (configured via targets.yaml)
Environment keys (configured via targets.yaml):
- Azure OpenAI: set the environment variables specified in your target's `settings.endpoint`, `settings.api_key`, and `settings.model`
- Anthropic Claude: set the environment variables specified in your target's `settings.api_key` and `settings.model`
- Google Gemini: set the environment variables specified in your target's `settings.api_key` and optional `settings.model`
- VS Code: set the environment variable specified in your target's `settings.workspace_env` to a `.code-workspace` path
Execution targets in .agentv/targets.yaml decouple tests from providers/settings and provide flexible environment variable mapping.
Each target specifies:
- `name`: Unique identifier for the target
- `provider`: The model provider (`azure`, `anthropic`, `gemini`, `vscode`, `vscode-insiders`, or `mock`)
- `settings`: Environment variable names to use for this target
Azure OpenAI targets:
```yaml
- name: azure_base
  provider: azure
  settings:
    endpoint: "AZURE_OPENAI_ENDPOINT"
    api_key: "AZURE_OPENAI_API_KEY"
    model: "AZURE_DEPLOYMENT_NAME"
```

Anthropic targets:

```yaml
- name: anthropic_base
  provider: anthropic
  settings:
    api_key: "ANTHROPIC_API_KEY"
    model: "ANTHROPIC_MODEL"
```

Google Gemini targets:

```yaml
- name: gemini_base
  provider: gemini
  settings:
    api_key: "GOOGLE_API_KEY"
    model: "GOOGLE_GEMINI_MODEL" # Optional, defaults to gemini-2.0-flash-exp
```

VS Code targets:

```yaml
- name: vscode_projectx
  provider: vscode
  settings:
    workspace_env: "EVAL_PROJECTX_WORKSPACE_PATH"

- name: vscode_insiders_projectx
  provider: vscode-insiders
  settings:
    workspace_env: "EVAL_PROJECTX_WORKSPACE_PATH"
```

When using VS Code or other AI agents that may experience timeouts, the evaluator includes automatic retry functionality:
- Timeout detection: automatically detects when an agent times out
- Automatic retries: when a timeout occurs, the same test case is retried up to `--max-retries` times (default: 2)
- Retry behavior: only timeouts trigger retries; other errors proceed to the next test case
- Timeout configuration: use `--agent-timeout` to adjust how long to wait for agent responses
Example with custom timeout settings:
```bash
agentv eval evals/projectx/example.yaml --target vscode_projectx --agent-timeout 180 --max-retries 3
```

For each test case in a `.yaml` file:
- Parse YAML and collect user messages (inline text and referenced files)
- Extract code blocks from text for structured prompting
- Generate a candidate answer via the configured provider/model
- Score against the expected answer using AI-powered quality grading
- Output results in JSONL or YAML format with detailed metrics
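In TypeScript terms, the per-case loop looks roughly like this sketch (the types and helper signatures here are hypothetical, not the actual `@agentv/core` API):

```typescript
// Illustrative sketch only; these types and helpers are hypothetical, not the @agentv/core API.
interface EvalCase { id: string; prompt: string; expectedAnswer: string; }
interface GradeResult { score: number; hits: string[]; misses: string[]; }
type CaseResult = GradeResult & { test_id: string; model_answer: string };

type Provider = (prompt: string) => Promise<string>;      // configured provider/model
type Grader = (answer: string, expected: string) => Promise<GradeResult>;

async function runCases(cases: EvalCase[], provider: Provider, grader: Grader): Promise<CaseResult[]> {
  const results: CaseResult[] = [];
  for (const c of cases) {
    const answer = await provider(c.prompt);              // generate a candidate answer
    const grade = await grader(answer, c.expectedAnswer); // AI-powered quality grading
    results.push({ test_id: c.id, model_answer: answer, ...grade });
  }
  return results;                                         // serialized to JSONL or YAML by the CLI
}
```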
For VS Code targets, the evaluator:
- Opens your configured workspace and uses the `subagent` library to programmatically invoke VS Code Copilot
- Builds the prompt from the `.yaml` user content (task, files, code blocks)
- Instructs Copilot to complete the task within the workspace context
- Results are captured and scored automatically
Run with `--verbose` to print detailed information and stack traces on errors.
AgentV uses an AI-powered quality grader that:
- Extracts key aspects from the expected answer
- Compares model output against those aspects
- Provides detailed hit/miss analysis with reasoning
- Returns a normalized score (0.0 to 1.0)
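For example, if the grader extracts five key aspects and the candidate answer covers four of them, the normalized score would be 0.8 (assuming the score is the hit ratio, i.e., hits divided by `expected_aspect_count`).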
JSONL format (default):
- One JSON object per line (newline-delimited)
- Fields: `test_id`, `score`, `hits`, `misses`, `model_answer`, `expected_aspect_count`, `target`, `timestamp`, `reasoning`, `raw_request`, `grader_raw_request`
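An example result line might look like this (all values are illustrative):

```json
{"test_id": "test-1", "score": 0.8, "hits": ["..."], "misses": ["..."], "model_answer": "...", "expected_aspect_count": 5, "target": "azure_base", "timestamp": "2025-01-01T12:00:00Z", "reasoning": "...", "raw_request": "...", "grader_raw_request": "..."}
```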
YAML format (with --format yaml):
- Human-readable YAML documents
- Same fields as JSONL, properly formatted for readability
- Multi-line strings use literal block style
After running all test cases, AgentV displays:
- Mean, median, min, max scores
- Standard deviation
- Distribution histogram
- Total test count and execution time
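A minimal sketch of how such summary statistics can be computed from the per-case scores (illustrative only, not AgentV's internal code):

```typescript
// Illustrative only; not AgentV's internal implementation.
function summarize(scores: number[]) {
  const sorted = [...scores].sort((a, b) => a - b);
  const mean = scores.reduce((sum, x) => sum + x, 0) / scores.length;
  const mid = Math.floor(sorted.length / 2);
  const median = sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  const variance = scores.reduce((sum, x) => sum + (x - mean) ** 2, 0) / scores.length;
  return { mean, median, min: sorted[0], max: sorted[sorted.length - 1], stdDev: Math.sqrt(variance) };
}
```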
AgentV is built as a TypeScript monorepo using:
- pnpm workspaces: Efficient dependency management
- Turbo: Build system and task orchestration
- @ax-llm/ax: Unified LLM provider abstraction
- Vercel AI SDK: Streaming and tool use capabilities
- Zod: Runtime type validation
- Commander.js: CLI argument parsing
- Vitest: Testing framework
- `@agentv/core` - Core evaluation engine, providers, and grading logic
- `agentv` - Main package that bundles the CLI functionality
Problem: Package installation fails or command not found.
Solution:
```bash
# Clear the npm cache and reinstall
npm cache clean --force
npm uninstall -g agentv
npm install -g agentv

# Or use npx without installing
npx agentv@latest --help
```

Problem: VS Code workspace doesn't open or prompts aren't injected.
Solution:
- Ensure the `subagent` package is installed (this should be automatic)
- Verify that your workspace path in `.env` is correct and points to a `.code-workspace` file
- Close any other VS Code instances before running evals
- Use the `--verbose` flag to see detailed workspace-switching logs
Problem: API authentication errors or missing credentials.
Solution:
- Double-check the environment variables in your `.env` file
- Verify that the variable names in `targets.yaml` match your `.env` file
- Use `--dry-run` first to test without making API calls (see the example below)
- Check provider-specific documentation for required environment variables
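For example (the path is illustrative):

```bash
# Runs against the mock model; no API keys or network calls are needed
agentv eval --dry-run "path/to/test.yaml"
```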
MIT License - see LICENSE for details.