Implement Automated Eval Test Suite for the Angular Skills #17001

@zdrawku

Description

We have three Skills (igniteui-angular-components, igniteui-angular-grids, igniteui-angular-theming) that teach coding agents how to correctly select, configure, and compose Ignite UI for Angular components. As these skills grow in complexity and more developers rely on them, silent regressions become a real risk: rewording a step, reordering routing logic, or removing a "verify" clause can quietly degrade agent behavior, with no signal until a user reports a wrong output.

This work item establishes a structured eval process for these skills, directly inspired by Minko Gechev's Skill Eval framework and extended with patterns from Anthropic's agent eval research and the Skills Best Practices guide.

Goals

  • Produce a measurable, repeatable quality score for each skill.
  • Detect regressions automatically when a skill file is modified in a PR.
  • Provide a feedback loop during skill authoring (edit → eval → score delta).
  • Establish pass/fail thresholds that gate merges to main.

Approach

Tooling: Adopt the skill-eval TypeScript framework as the eval runner. It supports Docker-isolated agent execution, deterministic shell graders, LLM rubric graders, multi-trial runs, and JSON result persistence — all the properties needed here.

Task Structure

Create an evals/ directory at the repo root. Each eval task is a self-contained directory:

Example:

```
evals/
├── tasks/
│   ├── grid-basic-setup/
│   │   ├── task.toml               # timeouts, grader weights, trial count
│   │   ├── instruction.md          # what the agent is asked to do
│   │   ├── environment/Dockerfile  # clean Angular project baseline
│   │   ├── tests/test.sh           # deterministic grader (file checks, compile, lint)
│   │   ├── prompts/quality.md      # LLM rubric grader questions
│   │   ├── solution/solve.sh       # reference solution for baseline
│   │   └── skills/                 # symlinks or copies of the skills under test
│   │       └── igniteui-angular-grids/SKILL.md
│   ├── grid-sorting-remote-data/
│   ├── grid-hierarchical-setup/
│   ├── grid-pivot-config/
│   ├── component-combo-reactive-form/
│   ├── component-date-picker-validation/
│   ├── component-dialog-service/
│   ├── theming-palette-generation/
│   ├── theming-component-override/
│   └── skill-routing-intent-detection/  # tests the SKILL.md router logic itself
├── package.json
└── README.md
```

Tasks to Implement (per Skill)

igniteui-angular-grids skill (highest priority — most complex routing)

| Task ID | Instruction given to agent | Deterministic check | LLM rubric check |
| --- | --- | --- | --- |
| grid-basic-setup | "Add a data grid showing employee data with sorting and pagination" | Project compiles; `<igx-grid>` present in template; correct module imported | Did agent choose IgxGrid (not Tree/Hierarchical) for flat data? Did it configure `[data]` binding correctly? |
| grid-tree-vs-flat | "Display department data with nested child rows" | `<igx-tree-grid>` present; `childDataKey` configured | Did skill routing correctly select Tree Grid over flat Grid? |
| grid-hierarchical-setup | "Build a master-detail grid where clicking a row expands child orders" | `<igx-hierarchical-grid>` + `<igx-row-island>` present | Did agent configure load-on-demand vs inline data correctly based on instructions? |
| grid-remote-filtering | "Add server-side filtering and sorting to the grid" | `[filterMode]="'externalFilterMode'"` set; remote service stub present | Did agent wire `onDataPreLoad`/`sortingExpressionsChange` instead of local filtering? |
| grid-pivot-config | "Create a pivot grid with row/column/value dimensions" | `<igx-pivot-grid>` + `IgxPivotConfiguration` present | Did agent define rows, columns, values correctly vs a flat grid with groupBy? |
| grid-state-persistence | "Persist grid sorting and filtering state to localStorage" | `IgxGridStateDirective` present; serialize/restore calls present | Did agent use the state directive vs manually serializing expressions? |

igniteui-angular-components skill

| Task ID | Instruction | Deterministic check | LLM rubric check |
| --- | --- | --- | --- |
| component-combo-reactive-form | "Add a multi-select combo bound to a reactive form control" | `<igx-combo>` present; `[formControlName]` wired; module imported | Did agent use IgxCombo (not IgxSelect or native `<select>`) for multi-select? |
| component-date-picker-validation | "Add a date picker with min/max date validation" | `<igx-date-picker>` present; `minValue`/`maxValue` inputs set | Did agent avoid using native `<input type=date>`? Did it correctly set validators? |
| component-dialog-service | "Show a confirmation dialog when the user clicks Delete" | `IgxDialogComponent` or service `open` call present | Did agent use the Dialog component/service vs a custom modal div? |
| component-chart-selection | "Display monthly sales as a bar chart" | `<igx-category-chart>` or `<igx-bar-chart>` present | Did agent pick the correct chart type (Bar vs Column vs Line) per the skill's intent detection? |

igniteui-angular-theming skill

| Task ID | Instruction | Deterministic check | LLM rubric check |
| --- | --- | --- | --- |
| theming-palette-generation | "Create a custom blue/orange branded theme" | `palette()` call with `$primary`/`$secondary`; `@include theme()` present | Did agent use `palette()` correctly vs hardcoding CSS variables? Did it call `core()` before `theme()`? |
| theming-component-override | "Change only the IgxButton background color without affecting the rest of the theme" | `button-theme()` mixin call present; scoped to component | Did agent use a component-level theme override vs overriding the global palette? |
| theming-mcp-tool-invocation | "Use the MCP server to generate a palette and scaffold a grid theme" | MCP tool call in transcript | Did agent invoke the MCP tool rather than writing SCSS manually? |

Cross-skill / routing tasks

| Task ID | Instruction | What's tested |
| --- | --- | --- |
| skill-routing-intent-detection | Various ambiguous prompts ("add a table", "style my app", "show nested data") | Tests whether the SKILL.md router in each skill fires the correct sub-skill path rather than hallucinating a generic Angular solution |

Grading Strategy

Deterministic grader (tests/test.sh) — runs after the agent finishes and checks:

  • Project builds without errors (`ng build`)
  • Correct Ignite UI selector is present in the generated template
  • Required module or standalone import exists
  • No use of forbidden alternatives (e.g., native `<table>` or `<select>` when the skill mandates an Ignite UI component)
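As a concrete sketch, a `tests/test.sh` for grid-basic-setup could look like the following. The `APP_DIR` variable, the grep patterns, and the `grade` entry point are illustrative assumptions, not the real harness contract:

```shell
#!/usr/bin/env sh
# Hypothetical deterministic grader for grid-basic-setup.
# APP_DIR and the patterns below are illustrative assumptions.
APP_DIR="${APP_DIR:-src/app}"

check_selector() {        # correct Ignite UI selector in a template
  grep -rq "<igx-grid" "$APP_DIR" 2>/dev/null
}
check_import() {          # required module or standalone import
  grep -rq "IgxGrid" "$APP_DIR" 2>/dev/null
}
check_no_native_table() { # forbidden alternative must be absent
  ! grep -rq "<table" "$APP_DIR" 2>/dev/null
}

grade() {
  # In the real grader, `ng build` would run first; omitted here for brevity.
  check_selector        || { echo "FAIL: igx-grid selector missing"; return 1; }
  check_import          || { echo "FAIL: IgxGrid import missing"; return 1; }
  check_no_native_table || { echo "FAIL: native <table> used"; return 1; }
  echo "PASS"
}

# grade   # invoked by the eval harness after the agent run
```

Keeping each check in its own function makes the script easy to extend per task and keeps the failure messages specific enough to read in CI logs.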

LLM rubric grader (prompts/quality.md) — evaluates the agent transcript for:

  • Correct intent routing (did the skill's decision logic fire?)
  • Idiomatic API usage (inputs, outputs, bindings as documented)
  • Absence of hallucinated APIs (wrong input names, non-existent outputs)
  • Following the skill's "prefer X over Y" guidance
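As an illustration, a `prompts/quality.md` rubric for grid-basic-setup could pose questions like these (wording, scoring scale, and output format are hypothetical):

```markdown
# Rubric: grid-basic-setup

1. Did the agent choose IgxGrid (not Tree or Hierarchical Grid) for flat data? (0–2)
2. Is the [data] binding configured as documented, with no hallucinated inputs or outputs? (0–2)
3. Did the agent follow the skill's "prefer Ignite UI over native elements" guidance? (0–1)

Return a JSON object: { "score": <0–5>, "justification": "<one sentence>" }
```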

Combined score: each task uses a weighted average, e.g. 60% deterministic + 40% rubric. Weights are configurable per task.toml.
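A task.toml might then carry the trial count and grader weights along these lines. All key names here are hypothetical; the actual schema is defined by the skill-eval framework:

```toml
# Hypothetical task.toml for grid-basic-setup; key names are illustrative.
[task]
id = "grid-basic-setup"
timeout_seconds = 600
trials = 5

[graders.deterministic]
script = "tests/test.sh"
weight = 0.6

[graders.rubric]
prompt = "prompts/quality.md"
weight = 0.4
```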

Eval Execution & Pass/Fail Thresholds

Following Anthropic's recommendations on agent evals:

  • Minimum 5 trials per task — agent behavior is non-deterministic; one run is meaningless.
  • pass@5 ≥ 80% is the gate for merging skill changes (can the agent solve it at least once in 5 tries?).
  • pass^5 ≥ 60% is tracked but not blocking — used to flag flaky skills that need clarification.
  • A task scoring below pass@5 = 60% on a PR that touches the relevant skill blocks merge.
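The two aggregate metrics can be sketched per task as a tiny shell helper (names are illustrative), where each trial contributes a 0/1 result:

```shell
# Sketch: pass@n is 1 if at least one trial succeeded,
# pass^n is 1 only if every trial succeeded.
passk() {
  any=0; all=1
  for t in "$@"; do
    [ "$t" -eq 1 ] && any=1 || all=0
  done
  echo "pass@n=$any pass^n=$all"
}

passk 1 0 1 1 0   # → pass@n=1 pass^n=0 (solved in 3 of 5 trials)
```

The gap between the two numbers is the useful signal: a task with high pass@5 but low pass^5 points at a flaky or ambiguous skill section rather than a broken one.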

CI Integration

Add a GitHub Actions workflow triggered on PRs that touch skills/**:

```yaml
name: Skill Eval
on:
  pull_request:
    paths:
      - 'skills/**'
      - 'evals/**'

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
      - run: npm install
        working-directory: evals
      - run: npm run eval -- --trials=5 --provider=docker
        working-directory: evals
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: skill-eval-results
          path: evals/results/
```

A result summary comment is posted on the PR showing per-task pass rates and any regressions relative to the main branch baseline.

Acceptance Criteria

  • evals/ directory scaffolded with at least one task per skill (minimum 3 tasks total as a first pass).
  • Each task has both a deterministic grader and an LLM rubric grader.
  • All tasks pass pass@5 ≥ 80% on main at the time of merging the initial suite.
  • GitHub Actions workflow runs on skill-touching PRs and posts a summary comment.
  • README.md in evals/ documents how to run evals locally and how to add a new task.
  • Baseline results JSON is committed to the repo for regression comparison.

Out of Scope (future work)

  • Eval coverage for the ng update migration schematic that installs skills into consumer projects.
  • Evals for the igniteui-theming MCP server tools themselves (separate harness needed).
  • Multi-skill composition tasks (e.g., build a themed hierarchical grid with a custom palette) — tracked separately once per-skill coverage is stable.
