This document describes how to define benchmark suites and individual cases.
A suite module must export a default suite value. The suite can be either:
- an array of `Case` values
- an object map of named `Case` values

```ts
import { assert, type Case } from "skillgym";
const suite: Case[] = [
  {
    id: "always-passes",
    prompt: "Say only: skillgym ready",
    assert(report, ctx) {
      assert.match(ctx.finalOutput(), /skillgym ready/);
    },
  },
];
export default suite;
```

```ts
import { assert, type Suite } from "skillgym";
const suite: Suite = {
  "always-passes": {
    id: "always-passes",
    prompt: "Say only: skillgym ready",
    assert(report, ctx) {
      assert.match(ctx.finalOutput(), /skillgym ready/);
    },
  },
};
export default suite;
```

skillgym exports this public shape:

```ts
export interface Case {
  id: string;
  prompt: string;
  tags?: string[];
  timeoutMs?: number;
  expectedFail?: boolean;
  classifyFailure?(result: RunnerResult): FailureClass | string | undefined;
  assert(report: SessionReport, ctx: AssertionContext): void | Promise<void>;
}
```

Field meanings:

- `id`: stable identifier used in results and artifact paths
- `prompt`: the exact prompt sent to the runner
- `tags`: optional labels for selecting cases with `--tag`; multiple selected tags use OR matching
- `timeoutMs`: optional per-case timeout override
- `expectedFail`: mark assertion failures as expected benchmark signal, not suite-failing results
- `classifyFailure(result)`: optional post-processing hook for assigning or overriding structured failure classes
- `assert(report, ctx)`: pass or fail logic for that execution
`Case` does not include runner selection. Each case runs against every configured runner selected for the run.
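For reference, a minimal sketch of a case that combines the common optional fields; the id, prompt, tag names, timeout value, and regex below are illustrative only (`expectedFail` and `classifyFailure` have dedicated examples later in this document):

```ts
import { assert, type Case } from "skillgym";

const exampleCase: Case = {
  id: "sdk-upgrade-hint",     // stable id used in results and artifact paths
  prompt: "Suggest the command to upgrade the project's SDK.",
  tags: ["smoke", "upgrade"], // selectable with --tag smoke or --tag upgrade
  timeoutMs: 60_000,          // per-case timeout override, in milliseconds
  assert(report, ctx) {
    assert.match(ctx.finalOutput(), /upgrade/i);
  },
};
```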
Tags let you run subsets of a suite without changing case order:
```ts
const suite: Case[] = [
  {
    id: "login-smoke",
    tags: ["smoke", "auth"],
    prompt: "Verify the login screen behavior.",
    assert() {},
  },
];
```

Run tagged cases with `--tag`. Repeated flags and comma-separated values are OR-matched, so a case runs when it has any selected tag:
```sh
skillgym run ./suite.ts --tag smoke
skillgym run ./suite.ts --tag smoke --tag auth
skillgym run ./suite.ts --tag smoke,auth
```

You can also set defaults in config with `run.tags: ["smoke"]`. CLI `--tag` values override config tags.
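The config file format is not covered here; as a hedged sketch only, assuming the config module exports a `run` section, the default tags might be declared like this:

```ts
// Hypothetical config sketch: the surrounding file name and shape are assumptions.
// Only the run.tags option itself is documented above.
export default {
  run: {
    tags: ["smoke"], // default tag selection; CLI --tag values take precedence
  },
};
```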
The `assert` function decides pass or fail:

- if `assert(report, ctx)` completes normally, that execution passes
- if it throws, that execution fails
- if `expectedFail: true` is set, an assertion failure is reported as `status: "expected-failed"` and `passed: true`
- if `expectedFail: true` is set and assertions pass, the execution is reported as `status: "unexpected-passed"` and `passed: false`
You can use both:

- Node strict assert helpers such as `assert.ok`, `assert.equal`, and `assert.match`
- skillgym grouped helpers such as `assert.skills.has` and `assert.commands.includes`
```ts
import { assert, type Case } from "skillgym";
const suite: Case[] = [
  {
    id: "find-skills-expo",
    prompt: "Find a skill for upgrading Expo SDK and tell me how to install it.",
    assert(report) {
      assert.skills.has(report, "find-skills");
      assert.commands.includes(report, "npx skills find");
      assert.match(report.finalOutput, /upgrading-expo/i);
    },
  },
];
```

See assertions.md for the full assertion reference.
Use failure classification when you want to group multiple failing executions under one shared cause, such as a pseudo command, wrong CLI alias, missing required flag, or wrong command family.
There are two integration points:
- `assert.classify(...)` attaches a failure class directly where an assertion is made
- `classifyFailure(result)` lets the case assign or override the final class after the result is available
Example:
```ts
import { assert, type Case } from "skillgym";
const suite: Case[] = [
  {
    id: "cursor-alias-check",
    prompt: 'Say you would run: cursr agent "open README.md".',
    classifyFailure(result) {
      return result.error?.message.includes("wrong Cursor CLI alias")
        ? { id: "wrong-cli-alias", label: "Wrong CLI alias" }
        : undefined;
    },
    assert(_report, ctx) {
      assert.classify({ id: "wrong-cli-alias", label: "Wrong CLI alias" }, () => {
        assert.doesNotMatch(
          ctx.finalOutput(),
          /\bcursr\s+agent\b/i,
          "wrong Cursor CLI alias in final output",
        );
      });
    },
  },
];
```

Notes:
- `assert.classify(...)` is the smallest way to tag a single assertion failure
- `classifyFailure(result)` is useful when several different assertion messages should collapse into one shared class
- if both are used, `classifyFailure(result)` runs later and can override the attached class
- built-in infrastructure failures still receive default classes such as `Assertion failure`, `Timeout`, `Runner crash`, or `Max steps exceeded`
Use `expectedFail: true` for benchmark cases that intentionally capture a known model or agent gap. Expected failures only apply to assertion failures. Runner crashes, timeouts, workspace failures, collection failures, normalization failures, snapshot failures, and `run.maxSteps` failures still fail the suite because they indicate infrastructure or benchmark integrity problems.
```ts
import { assert, type Case } from "skillgym";
const suite: Case[] = [
  {
    id: "known-missing-skill-selection",
    prompt: "Use the correct installed skill before editing files.",
    expectedFail: true,
    assert(report) {
      assert.skills.has(report, "required-skill");
    },
  },
];
```

Expected assertion failures exit successfully and appear in results.json with `passed: true` and `status: "expected-failed"`. Unexpected passes exit non-zero with `passed: false` and `status: "unexpected-passed"`, which signals that the benchmark expectation may be stale or the agent improved.
The second argument to `assert` is a convenience wrapper around the session report:

```ts
export interface AssertionContext {
  getCommands(): string[];
  getToolCalls(tool?: string): SessionEvent[];
  getFileReads(): string[];
  detectedSkills(): SkillDetection[];
  finalOutput(): string;
}
```

Examples:
```ts
assert(report, ctx) {
  assert.ok(ctx.getCommands().length > 0);
  assert.match(ctx.finalOutput(), /ready/);
}
```

These helpers are convenience APIs only. The source of truth is always the `SessionReport` passed as the first argument.
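For completeness, a sketch that exercises the remaining context helpers; the tool name, file path, and prompt are illustrative assumptions rather than documented values:

```ts
import { assert, type Case } from "skillgym";

const contextExample: Case = {
  id: "context-helpers-example",
  prompt: "Read README.md, then summarize it.",
  assert(_report, ctx) {
    // Tool calls can be filtered by tool name; "bash" is an illustrative name.
    assert.ok(ctx.getToolCalls("bash").length > 0);

    // File reads come back as plain paths; the README.md check is an example only.
    assert.ok(ctx.getFileReads().some((path) => path.endsWith("README.md")));

    // detectedSkills() returns SkillDetection records; here we only check that any skill was detected.
    assert.ok(ctx.detectedSkills().length > 0);
  },
};
```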
A suite can also export a named `workspace` object to control where executions run.

```ts
import type { SuiteWorkspaceConfig, Case } from "skillgym";

export const workspace: SuiteWorkspaceConfig = {
  mode: "isolated",
  templateDir: "./fixtures/base-app",
  bootstrap: {
    command: "sh",
    args: ["./scripts/bootstrap-workspace.sh", "--seed", "demo"],
  },
};

const suite: Case[] = [
  {
    id: "workspace-check",
    prompt: "Describe the prepared workspace.",
    assert() {},
  },
];

export default suite;
```

`SuiteWorkspaceConfig` supports two modes:
```ts
export type SuiteWorkspaceConfig =
  | {
      mode: "shared";
      cwd?: string;
      templateDir?: string;
      bootstrap?: {
        command: string;
        args?: string[];
        timeoutMs?: number;
        env?: Record<string, string>;
      };
    }
  | {
      mode: "isolated";
      templateDir?: string;
      bootstrap?: {
        command: string;
        args?: string[];
        timeoutMs?: number;
        env?: Record<string, string>;
      };
    };
```

Rules:
- `shared` mode supports `cwd`, `templateDir`, and `bootstrap`
- `isolated` mode supports `templateDir` and `bootstrap` but not `cwd`
- relative suite workspace paths resolve from the suite file directory
- isolated workspaces start empty when `templateDir` is omitted
- `templateDir` copies the full directory contents, including dotfiles and `.git`
- failed isolated executions preserve their workspace under `outputDir/workspaces`
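As a counterpart to the isolated example above, here is a sketch of a shared-mode `workspace` export; the cwd path, bootstrap command, and timeout are illustrative values, not defaults:

```ts
import type { SuiteWorkspaceConfig } from "skillgym";

// Shared mode keeps cwd available; the directory and install step shown here are illustrative.
export const workspace: SuiteWorkspaceConfig = {
  mode: "shared",
  cwd: "./fixtures/shared-app", // resolved relative to the suite file directory
  bootstrap: {
    command: "npm",
    args: ["install"],
    timeoutMs: 120_000,
  },
};
```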
See workspaces.md for behavior, path resolution, cleanup, and bootstrap details.
- a case execution passes when its `assert` function completes without throwing
- a case execution fails when `assert` throws
- `passed` is expectation-aware at the suite level; use `status` to distinguish `passed`, `failed`, `expected-failed`, and `unexpected-passed`
- a case execution also fails when the runner crashes, times out, or exceeds `run.maxSteps`
- `run.maxSteps` is a best-effort streamed model-round limit, not a hard portable turn cap
- `max-steps` failures preserve raw stdout/stderr artifacts for debugging
- `max-steps` failures do not produce a partial normalized session report
- failure messages are preserved in the execution artifacts
Example suites:

- ../examples/basic-suite.ts
- ../examples/skill-selection-suite.ts
- ../examples/workspace-isolation-suite.ts