## Description
We have three Skills (igniteui-angular-components, igniteui-angular-grids, igniteui-angular-theming) that teach coding agents how to correctly select, configure, and compose Ignite UI for Angular components. As these skills grow in complexity and more developers rely on them, silent regressions become a real risk: rewording a step, reordering routing logic, or removing a "verify" clause can quietly degrade agent behavior with no signal until a user reports a wrong output.
This work item establishes a structured eval process for these skills, directly inspired by Minko Gechev's Skill Eval framework, and extended with patterns from Anthropic's agent eval research and the Skills Best Practices guide.
## Goals
- Produce a measurable, repeatable quality score for each skill.
- Detect regressions automatically when a skill file is modified in a PR.
- Provide a feedback loop during skill authoring (edit → eval → score delta).
- Establish pass/fail thresholds that gate merges to `main`.
## Approach
Tooling: Adopt the `skill-eval` TypeScript framework as the eval runner. It supports Docker-isolated agent execution, deterministic shell graders, LLM rubric graders, multi-trial runs, and JSON result persistence — all the properties needed here.
### Task Structure
Create an `evals/` directory at the repo root. Each eval task is a self-contained directory:
Example:
```
evals/
├── tasks/
│   ├── grid-basic-setup/
│   │   ├── task.toml               # timeouts, grader weights, trial count
│   │   ├── instruction.md          # what the agent is asked to do
│   │   ├── environment/Dockerfile  # clean Angular project baseline
│   │   ├── tests/test.sh           # deterministic grader (file checks, compile, lint)
│   │   ├── prompts/quality.md      # LLM rubric grader questions
│   │   ├── solution/solve.sh       # reference solution for baseline
│   │   └── skills/                 # symlinks or copies of the skills under test
│   │       └── igniteui-angular-grids/SKILL.md
│   ├── grid-sorting-remote-data/
│   ├── grid-hierarchical-setup/
│   ├── grid-pivot-config/
│   ├── component-combo-reactive-form/
│   ├── component-date-picker-validation/
│   ├── component-dialog-service/
│   ├── theming-palette-generation/
│   ├── theming-component-override/
│   └── skill-routing-intent-detection/  # tests the SKILL.md router logic itself
├── package.json
└── README.md
```
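For illustration, a `task.toml` for `grid-basic-setup` might look like the sketch below. The key names are assumptions chosen to match the fields described above (timeouts, grader weights, trial count), not the `skill-eval` framework's documented schema:

```toml
# Hypothetical task.toml sketch — key names are illustrative,
# not the skill-eval framework's documented schema.
[task]
id = "grid-basic-setup"
trials = 5
timeout_seconds = 600

[graders.deterministic]
script = "tests/test.sh"
weight = 0.6

[graders.rubric]
prompt = "prompts/quality.md"
weight = 0.4
```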
### Tasks to Implement (per Skill)
#### igniteui-angular-grids skill (highest priority — most complex routing)

| Task ID | Instruction given to agent | Deterministic check | LLM rubric check |
|---|---|---|---|
| `grid-basic-setup` | "Add a data grid showing employee data with sorting and pagination" | Project compiles; `<igx-grid>` present in template; correct module imported | Did agent choose IgxGrid (not Tree/Hierarchical) for flat data? Did it configure `[data]` binding correctly? |
| `grid-tree-vs-flat` | "Display department data with nested child rows" | `<igx-tree-grid>` present; `childDataKey` configured | Did skill routing correctly select Tree Grid over flat Grid? |
| `grid-hierarchical-setup` | "Build a master-detail grid where clicking a row expands child orders" | `<igx-hierarchical-grid>` + `<igx-row-island>` present | Did agent configure load-on-demand vs inline data correctly based on instructions? |
| `grid-remote-filtering` | "Add server-side filtering and sorting to the grid" | `[filterMode]="'externalFilterMode'"` set; remote service stub present | Did agent wire `onDataPreLoad`/`sortingExpressionsChange` instead of local filtering? |
| `grid-pivot-config` | "Create a pivot grid with row/column/value dimensions" | `<igx-pivot-grid>` + `IgxPivotConfiguration` present | Did agent define rows, columns, values correctly vs a flat grid with groupBy? |
| `grid-state-persistence` | "Persist grid sorting and filtering state to localStorage" | `IgxGridStateDirective` present; serialize/restore calls present | Did agent use the state directive vs manually serializing expressions? |
#### igniteui-angular-components skill

| Task ID | Instruction | Deterministic check | LLM rubric check |
|---|---|---|---|
| `component-combo-reactive-form` | "Add a multi-select combo bound to a reactive form control" | `<igx-combo>` present; `[formControlName]` wired; module imported | Did agent use IgxCombo (not IgxSelect or native `<select>`) for multi-select? |
| `component-date-picker-validation` | "Add a date picker with min/max date validation" | `<igx-date-picker>` present; `minValue`/`maxValue` inputs set | Did agent avoid using native `<input type=date>`? Did it correctly set validators? |
| `component-dialog-service` | "Show a confirmation dialog when the user clicks Delete" | `IgxDialogComponent` or service open call present | Did agent use the Dialog component/service vs a custom modal div? |
| `component-chart-selection` | "Display monthly sales as a bar chart" | `<igx-category-chart>` or `<igx-bar-chart>` present | Did agent pick the correct chart type (Bar vs Column vs Line) per the skill's intent detection? |
#### igniteui-angular-theming skill

| Task ID | Instruction | Deterministic check | LLM rubric check |
|---|---|---|---|
| `theming-palette-generation` | "Create a custom blue/orange branded theme" | `palette()` call with `$primary`/`$secondary`; `@include theme()` present | Did agent use `palette()` correctly vs hardcoding CSS variables? Did it call `core()` before `theme()`? |
| `theming-component-override` | "Change only the IgxButton background color without affecting the rest of the theme" | `button-theme()` mixin call present; scoped to component | Did agent use a component-level theme override vs overriding the global palette? |
| `theming-mcp-tool-invocation` | "Use the MCP server to generate a palette and scaffold a grid theme" | MCP tool call in transcript | Did agent invoke the MCP tool rather than writing SCSS manually? |
#### Cross-skill / routing tasks

| Task ID | Instruction | What's tested |
|---|---|---|
| `skill-routing-intent-detection` | Various ambiguous prompts ("add a table", "style my app", "show nested data") | Tests whether the SKILL.md router in each skill fires the correct sub-skill path rather than hallucinating a generic Angular solution |
## Grading Strategy
Deterministic grader (`tests/test.sh`) — runs after the agent finishes and checks:
- Project builds without errors (`ng build`)
- Correct Ignite UI selector is present in the generated template
- Required module or standalone import exists
- No use of forbidden alternatives (e.g., native `<table>` or `<select>` when the skill mandates an Ignite UI component)
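The checks above can be sketched as a small script in the spirit of `tests/test.sh`. The fixture directory and selectors below are illustrative stand-ins for the agent's actual output; a real grader would run `ng build` against the real project instead of a synthetic template:

```shell
#!/usr/bin/env bash
# Sketch of a deterministic grader. The fixture below stands in for the
# agent's generated project; a real grader would also run `ng build`.
set -u
app=$(mktemp -d)/src
mkdir -p "$app"
cat > "$app/app.component.html" <<'EOF'
<igx-grid [data]="employees"></igx-grid>
EOF

fails=0
require() { grep -rq -- "$1" "$app" || { echo "FAIL: missing $1"; fails=$((fails+1)); }; }
forbid()  { grep -rq -- "$1" "$app" && { echo "FAIL: forbidden $1"; fails=$((fails+1)); }; }

require "<igx-grid"   # correct Ignite UI selector present
forbid  "<table"      # native alternative the skill forbids
echo "fails=$fails"
```

Exit-code-style scoring (zero failed checks = pass) keeps the grader trivially composable with the rubric score.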
LLM rubric grader (`prompts/quality.md`) — evaluates the agent transcript for:
- Correct intent routing (did the skill's decision logic fire?)
- Idiomatic API usage (inputs, outputs, bindings as documented)
- Absence of hallucinated APIs (wrong input names, non-existent outputs)
- Following the skill's "prefer X over Y" guidance
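A rubric prompt for a task like `grid-basic-setup` can be as simple as a few yes/no questions the grading model answers from the transcript. The wording below is illustrative, not a prescribed format:

```markdown
Answer each question with YES or NO, citing the transcript.

1. Did the agent choose IgxGrid rather than Tree or Hierarchical Grid
   for this flat data set?
2. Did it bind [data] and enable sorting/pagination via documented inputs?
3. Did it avoid any API that does not exist in Ignite UI for Angular?

Score = fraction of YES answers.
```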
Combined score: each task uses a weighted average, e.g. 60% deterministic + 40% rubric. Weights are configurable per task via `task.toml`.
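As a concrete sketch of the weighted average (the 60/40 split mirrors the example above; the grader scores are made up):

```shell
# Combine grader scores with the example 60/40 weighting.
# det/rubric would come from the two graders; values here are made up.
det=0.8      # deterministic grader score in [0, 1]
rubric=0.5   # LLM rubric grader score in [0, 1]
combined=$(awk -v d="$det" -v r="$rubric" 'BEGIN { printf "%.2f", 0.6*d + 0.4*r }')
echo "combined=$combined"   # 0.6*0.8 + 0.4*0.5 = 0.68
```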
## Eval Execution & Pass/Fail Thresholds
Following Anthropic's recommendations on agent evals:
- Minimum 5 trials per task — agent behavior is non-deterministic; one run is meaningless.
- `pass@5 ≥ 80%` is the gate for merging skill changes (can the agent solve it at least once in 5 tries?).
- `pass^5 ≥ 60%` is tracked but not blocking — used to flag flaky skills that need clarification.
- A task scoring below `pass@5 = 60%` on a PR that touches the relevant skill blocks merge.
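The two metrics differ only in aggregation: pass@5 asks whether any of the five trials succeeded, pass^5 whether all of them did. A minimal sketch over a made-up trial vector:

```shell
# trials: 1 = task solved, 0 = failed (illustrative results, not real data)
trials=(1 0 1 1 0)
any=0; all=1
for t in "${trials[@]}"; do
  [ "$t" -eq 1 ] && any=1 || all=0
done
echo "pass@5=$any pass^5=$all"
```

For this vector the task counts toward pass@5 (at least one success) but not pass^5, which is exactly the flakiness signal the non-blocking metric is meant to surface.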
## CI Integration
Add a GitHub Actions workflow triggered on PRs that touch `skills/**`:

```yaml
name: Skill Eval
on:
  pull_request:
    paths:
      - 'skills/**'
      - 'evals/**'
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
      - run: npm install
        working-directory: evals
      - run: npm run eval -- --trials=5 --provider=docker
        working-directory: evals
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: skill-eval-results
          path: evals/results/
```

A result summary comment is posted on the PR showing per-task pass rates and any regressions relative to the `main` branch baseline.
## Acceptance Criteria
- `evals/` directory scaffolded with at least one task per skill (minimum 3 tasks total as a first pass).
- Each task has both a deterministic grader and an LLM rubric grader.
- All tasks pass `pass@5 ≥ 80%` on `main` at the time of merging the initial suite.
- GitHub Actions workflow runs on skill-touching PRs and posts a summary comment.
- `README.md` in `evals/` documents how to run evals locally and how to add a new task.
- Baseline results JSON is committed to the repo for regression comparison.
## Out of Scope (future work)
- Eval coverage for the `ng update` migration schematic that installs skills into consumer projects.
- Evals for the `igniteui-theming` MCP server tools themselves (separate harness needed).
- Multi-skill composition tasks (e.g., build a themed hierarchical grid with a custom palette) — tracked separately once per-skill coverage is stable.