Test Cases

This document describes how to define benchmark suites and individual cases.

Suite exports

A suite module must export a default suite value. The suite can be either:

an array of Case values
an object map of named Case values

import { assert, type Case } from "skillgym";

const suite: Case[] = [
  {
    id: "always-passes",
    prompt: "Say only: skillgym ready",
    assert(report, ctx) {
      assert.match(ctx.finalOutput(), /skillgym ready/);
    },
  },
];

export default suite;

import { assert, type Suite } from "skillgym";

const suite: Suite = {
  "always-passes": {
    id: "always-passes",
    prompt: "Say only: skillgym ready",
    assert(report, ctx) {
      assert.match(ctx.finalOutput(), /skillgym ready/);
    },
  },
};

export default suite;
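Because both export shapes carry the same Case values, a loader can treat them uniformly. The sketch below shows one way to flatten either shape into a list. `CaseLike`, `SuiteLike`, and `toCaseList` are hypothetical names used for illustration and are not part of skillgym's API.

```typescript
// Hypothetical stand-in for skillgym's Case type, trimmed to two fields.
interface CaseLike {
  id: string;
  prompt: string;
}

// A suite export is either an array of cases or an object map of named cases.
type SuiteLike = CaseLike[] | Record<string, CaseLike>;

// Flatten either export shape into a plain array (assumed loader behavior).
function toCaseList(suite: SuiteLike): CaseLike[] {
  return Array.isArray(suite) ? suite : Object.values(suite);
}

const fromArray = toCaseList([
  { id: "always-passes", prompt: "Say only: skillgym ready" },
]);
const fromMap = toCaseList({
  "always-passes": { id: "always-passes", prompt: "Say only: skillgym ready" },
});
```

Under this assumption, both calls yield a one-element list containing the same case.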

Case shape

skillgym exports this public shape:

export interface Case {
  id: string;
  prompt: string;
  tags?: string[];
  timeoutMs?: number;
  expectedFail?: boolean;
  classifyFailure?(result: RunnerResult): FailureClass | string | undefined;
  assert(report: SessionReport, ctx: AssertionContext): void | Promise<void>;
}

Field meanings:

id: stable identifier used in results and artifact paths
prompt: the exact prompt sent to the runner
tags: optional labels for selecting cases with --tag; multiple selected tags use OR matching
timeoutMs: optional per-case timeout override
expectedFail: when true, assertion failures count as an expected benchmark signal rather than suite-failing results
classifyFailure(result): optional post-processing hook for assigning or overriding structured failure classes
assert(report, ctx): pass or fail logic for each execution of the case

Case does not include runner selection. Each case runs against the selected configured runners.
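The OR matching described for tags can be illustrated with a small filter. This is a sketch of the selection rule only, not skillgym's implementation. `TaggedCase` and `selectByTags` are hypothetical names, and keeping every case when no tag is selected is an assumption.

```typescript
// Hypothetical stand-in for the Case fields that matter for tag selection.
interface TaggedCase {
  id: string;
  tags?: string[];
}

// OR matching: a case is selected when it carries at least one selected tag.
// Assumption: with no tags selected, every case is kept.
function selectByTags(cases: TaggedCase[], selected: string[]): TaggedCase[] {
  if (selected.length === 0) return cases;
  return cases.filter((c) =>
    (c.tags ?? []).some((tag) => selected.includes(tag)),
  );
}

const cases: TaggedCase[] = [
  { id: "a", tags: ["smoke"] },
  { id: "b", tags: ["expo", "skills"] },
  { id: "c" },
];

// "a" matches "smoke", "b" matches "expo", and "c" has no tags at all.
const picked = selectByTags(cases, ["smoke", "expo"]).map((c) => c.id);
```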

Assertions in a case

The assert function decides pass or fail:

if assert(report, ctx) completes normally, that execution passes
if it throws, that execution fails
if expectedFail: true is set, an assertion failure is reported as status: "expected-failed" and passed: true
if expectedFail: true is set and assertions pass, the execution is reported as status: "unexpected-passed" and passed: false
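The four rules above amount to a small decision table. The sketch below encodes it for assertion outcomes only. `Outcome` and `resolveOutcome` are hypothetical names, and the plain `passed` and `failed` status strings are taken from the status values listed under Pass/fail behavior.

```typescript
// Status values named in this document.
type Status = "passed" | "failed" | "expected-failed" | "unexpected-passed";

interface Outcome {
  status: Status;
  passed: boolean;
}

// Map "did assert throw?" plus the expectedFail flag to a reported outcome.
// Covers assertion outcomes only; crashes and timeouts are out of scope here.
function resolveOutcome(assertThrew: boolean, expectedFail = false): Outcome {
  if (expectedFail) {
    return assertThrew
      ? { status: "expected-failed", passed: true }
      : { status: "unexpected-passed", passed: false };
  }
  return assertThrew
    ? { status: "failed", passed: false }
    : { status: "passed", passed: true };
}

const normalPass = resolveOutcome(false);
const normalFail = resolveOutcome(true);
const knownGap = resolveOutcome(true, true);
const staleExpectation = resolveOutcome(false, true);
```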

You can use both:

Node strict assert helpers such as assert.ok, assert.equal, and assert.match
skillgym grouped helpers such as assert.skills.has and assert.commands.includes

import { assert, type Case } from "skillgym";

const suite: Case[] = [
  {
    id: "find-skills-expo",
    prompt: "Find a skill for upgrading Expo SDK and tell me how to install it.",
    assert(report) {
      assert.skills.has(report, "find-skills");
      assert.commands.includes(report, "npx skills find");
      assert.match(report.finalOutput, /upgrading-expo/i);
    },
  },
];

See assertions.md for the full assertion reference.

Failure classification hooks

Use failure classification when you want to group multiple failing executions under one shared cause, such as a pseudo command, wrong CLI alias, missing required flag, or wrong command family.

There are two integration points:

assert.classify(...) attaches a failure class directly where an assertion is made
classifyFailure(result) lets the case assign or override the final class after the result is available

Example:

import { assert, type Case } from "skillgym";

const suite: Case[] = [
  {
    id: "cursor-alias-check",
    prompt: 'Say you would run: cursr agent "open README.md".',
    classifyFailure(result) {
      return result.error?.message.includes("wrong Cursor CLI alias")
        ? { id: "wrong-cli-alias", label: "Wrong CLI alias" }
        : undefined;
    },
    assert(_report, ctx) {
      assert.classify({ id: "wrong-cli-alias", label: "Wrong CLI alias" }, () => {
        assert.doesNotMatch(
          ctx.finalOutput(),
          /\bcursr\s+agent\b/i,
          "wrong Cursor CLI alias in final output",
        );
      });
    },
  },
];

Notes:

assert.classify(...) is the smallest way to tag a single assertion failure
classifyFailure(result) is useful when several different assertion messages should collapse into one shared class
if both are used, classifyFailure(result) runs later and can override the attached class
built-in failure kinds still receive default classes such as Assertion failure, Timeout, Runner crash, or Max steps exceeded
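The precedence in these notes can be sketched as a simple fallback chain. This is an illustration under stated assumptions: a classifyFailure result of undefined is assumed to leave the attached class in place, and `FailureClassLike`, `finalFailureClass`, and the `assertion-failure` id are hypothetical.

```typescript
// Hypothetical stand-in for a structured failure class ({ id, label }).
interface FailureClassLike {
  id: string;
  label: string;
}

// Assumed precedence: classifyFailure result, then the class attached by
// assert.classify, then the built-in default for that failure kind.
function finalFailureClass(
  attached: FailureClassLike | undefined,
  override: FailureClassLike | undefined,
  fallback: FailureClassLike,
): FailureClassLike {
  return override ?? attached ?? fallback;
}

// The id below is made up; only the label "Assertion failure" is documented.
const builtIn: FailureClassLike = { id: "assertion-failure", label: "Assertion failure" };
const alias: FailureClassLike = { id: "wrong-cli-alias", label: "Wrong CLI alias" };
const family: FailureClassLike = { id: "wrong-command-family", label: "Wrong command family" };

const onlyDefault = finalFailureClass(undefined, undefined, builtIn);
const attachedWins = finalFailureClass(alias, undefined, builtIn);
const overrideWins = finalFailureClass(alias, family, builtIn);
```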

Expected failures

Use expectedFail: true for benchmark cases that intentionally capture a known model or agent gap. Expected failures only apply to assertion failures. Runner crashes, timeouts, workspace failures, collection failures, normalization failures, snapshot failures, and run.maxSteps failures still fail the suite because they indicate infrastructure or benchmark integrity problems.

import { assert, type Case } from "skillgym";

const suite: Case[] = [
  {
    id: "known-missing-skill-selection",
    prompt: "Use the correct installed skill before editing files.",
    expectedFail: true,
    assert(report) {
      assert.skills.has(report, "required-skill");
    },
  },
];

Expected assertion failures exit successfully and appear in results.json with passed: true and status: "expected-failed". Unexpected passes exit non-zero with passed: false and status: "unexpected-passed", which signals that the benchmark expectation may be stale or the agent improved.
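As a rough illustration of how these results could drive the process exit code, the sketch below reduces a list of entries to 0 or 1. `ResultEntry` is a hypothetical, trimmed shape (the real results.json format is not shown in this document), and using 1 as the non-zero code is an assumption.

```typescript
// Hypothetical, trimmed shape of one entry in results.json.
interface ResultEntry {
  caseId: string;
  status: "passed" | "failed" | "expected-failed" | "unexpected-passed";
  passed: boolean;
}

// Assumption: the process exits 0 only when every entry has passed: true.
function suiteExitCode(entries: ResultEntry[]): number {
  return entries.every((entry) => entry.passed) ? 0 : 1;
}

const knownGapEntry: ResultEntry = {
  caseId: "known-missing-skill-selection",
  status: "expected-failed",
  passed: true,
};
const staleEntry: ResultEntry = {
  ...knownGapEntry,
  status: "unexpected-passed",
  passed: false,
};

const okExit = suiteExitCode([knownGapEntry]);
const staleExit = suiteExitCode([knownGapEntry, staleEntry]);
```

Here `okExit` is 0 because the expected failure counts as passed, while `staleExit` is 1 because the unexpected pass does not.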

AssertionContext helpers

The second argument to assert is a convenience wrapper around the session report:

export interface AssertionContext {
  getCommands(): string[];
  getToolCalls(tool?: string): SessionEvent[];
  getFileReads(): string[];
  detectedSkills(): SkillDetection[];
  finalOutput(): string;
}

Example:

assert(report, ctx) {
  assert.ok(ctx.getCommands().length > 0);
  assert.match(ctx.finalOutput(), /ready/);
}

These helpers are convenience APIs only. The source of truth is always the SessionReport passed as the first argument.
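To make the "convenience wrapper" idea concrete, the sketch below derives two helpers from a report object. Everything here is hypothetical: `MiniReport` and `MiniEvent` are invented, heavily trimmed shapes, and the real SessionReport and SessionEvent types are richer and may differ.

```typescript
// Hypothetical, heavily trimmed report shape; not the real SessionReport.
interface MiniEvent {
  kind: "command" | "tool";
  text: string;
}

interface MiniReport {
  events: MiniEvent[];
  finalOutput: string;
}

// Subset of the AssertionContext helpers, derived purely from the report.
interface MiniContext {
  getCommands(): string[];
  finalOutput(): string;
}

// The context holds no state of its own; every helper reads from the report.
function makeContext(report: MiniReport): MiniContext {
  return {
    getCommands: () =>
      report.events.filter((e) => e.kind === "command").map((e) => e.text),
    finalOutput: () => report.finalOutput,
  };
}

const ctx = makeContext({
  events: [
    { kind: "command", text: "npx skills find expo" },
    { kind: "tool", text: "read_file" },
  ],
  finalOutput: "skillgym ready",
});
```

Because the helpers are plain projections, anything they return can also be computed directly from the report.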

Workspace export

A suite can also export a named workspace object to control where executions run.

import type { SuiteWorkspaceConfig, Case } from "skillgym";

export const workspace: SuiteWorkspaceConfig = {
  mode: "isolated",
  templateDir: "./fixtures/base-app",
  bootstrap: {
    command: "sh",
    args: ["./scripts/bootstrap-workspace.sh", "--seed", "demo"],
  },
};

const suite: Case[] = [
  {
    id: "workspace-check",
    prompt: "Describe the prepared workspace.",
    assert() {},
  },
];

export default suite;

SuiteWorkspaceConfig supports two modes:

export type SuiteWorkspaceConfig =
  | {
      mode: "shared";
      cwd?: string;
      templateDir?: string;
      bootstrap?: {
        command: string;
        args?: string[];
        timeoutMs?: number;
        env?: Record<string, string>;
      };
    }
  | {
      mode: "isolated";
      templateDir?: string;
      bootstrap?: {
        command: string;
        args?: string[];
        timeoutMs?: number;
        env?: Record<string, string>;
      };
    };

Rules:

shared mode supports cwd, templateDir, and bootstrap
isolated mode supports templateDir and bootstrap but not cwd
relative suite workspace paths resolve from the suite file directory
isolated workspaces start empty when templateDir is omitted
templateDir copies the full directory contents, including dotfiles and .git
failed isolated executions preserve their workspace under outputDir/workspaces

See workspaces.md for behavior, path resolution, cleanup, and bootstrap details.
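The mode rules above can be checked mechanically for configurations that arrive untyped (for example from plain JavaScript). The sketch below is an illustration only. `RawWorkspace`, `checkWorkspace`, and `startsEmpty` are hypothetical names, and skillgym's own validation may differ.

```typescript
// Loosely typed input, as it might look before validation (hypothetical).
interface RawWorkspace {
  mode: "shared" | "isolated";
  cwd?: string;
  templateDir?: string;
}

// Return a list of rule violations; an empty list means the config is acceptable.
function checkWorkspace(ws: RawWorkspace): string[] {
  const problems: string[] = [];
  if (ws.mode === "isolated" && ws.cwd !== undefined) {
    problems.push("isolated mode does not support cwd");
  }
  return problems;
}

// Isolated workspaces start empty when templateDir is omitted.
function startsEmpty(ws: RawWorkspace): boolean {
  return ws.mode === "isolated" && ws.templateDir === undefined;
}

const rejected = checkWorkspace({ mode: "isolated", cwd: "./app" });
const accepted = checkWorkspace({
  mode: "shared",
  cwd: "./app",
  templateDir: "./fixtures/base-app",
});
const emptyStart = startsEmpty({ mode: "isolated" });
```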

Pass/fail behavior

a case execution passes when its assert function completes without throwing
a case execution fails when assert throws
passed is expectation-aware at the suite level (an expected assertion failure reports passed: true, an unexpected pass reports passed: false); use status to distinguish passed, failed, expected-failed, and unexpected-passed
a case execution also fails when the runner crashes, times out, or exceeds run.maxSteps
run.maxSteps is a best-effort streamed model-round limit, not a hard portable turn cap
max-steps failures preserve raw stdout/stderr artifacts for debugging
max-steps failures do not produce a partial normalized session report
failure messages are preserved in the execution artifacts

Examples

../examples/basic-suite.ts
../examples/skill-selection-suite.ts
../examples/workspace-isolation-suite.ts
