vllm: arch-aware LayerAccessor + HF↔vLLM equivalence tests #47

Draft
Butanium wants to merge 1 commit into dev from claude/vllm-arch-aware-layer-accessor

Butanium commented May 6, 2026

Summary

  • Refactor LayerAccessor to be architecture-generic for vLLM (not just Llama). Replaces the misleading _is_vllm_layers_output predicate with explicit InputLayout / OutputLayout enums detected from forward signature + runtime structure + parent-source inspection.
  • Fix layers_input[i] and attentions_input[i] on vLLM Llama: both were returning the int64 positions tensor (the first positional arg of the decoder/attention forward) instead of hidden_states. They are now routed via module.inputs[0][1] (positional) or kwargs["hidden_states"] (vLLM Llama calls self.self_attn(positions=..., hidden_states=...) with kwargs).
  • Fix layer N>0 residual handling: _read_input returns hidden_states + residual to recover the combined stream. Numerical sanity: layers_output[i] == layers_input[i+1] exactly on vLLM Llama.
  • Move the dual-stream shape-mismatch check into _infer_output_layout: it now raises RenamingError early instead of being dead code in check_io, since layers_output[i] already crashes on a shape mismatch before the check_io block runs.
  • vLLM-only .clone() on single-stream reads in _read_input/_read_output and on token_embeddings. vLLM reuses inference-mode buffers across layers (layer N+1's fused RMSNorm mutates layer N's output buffer in-place); the saved reference surfaces the post-mutation value otherwise. nnsight's clone-on-save for inference-mode tensors doesn't reach every path here.
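
The signature-based half of the layout detection described above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: the enum member names, the `infer_input_layout` helper, and the three toy forward functions are hypothetical, and only the parameter names (positions, hidden_states, residual) come from the description.

```python
import enum
import inspect


class InputLayout(enum.Enum):
    """Hypothetical stand-in for the PR's enum; member names are assumed."""
    HIDDEN_FIRST = "hidden_first"        # forward(hidden_states, ...)
    POSITIONS_FIRST = "positions_first"  # forward(positions, hidden_states)
    POSITIONS_FIRST_WITH_RESIDUAL = "positions_first_with_residual"


def infer_input_layout(forward) -> InputLayout:
    """Classify a forward callable purely from its parameter names."""
    names = [n for n in inspect.signature(forward).parameters if n != "self"]
    if "hidden_states" not in names:
        raise ValueError(f"no 'hidden_states' parameter in {names}")
    if names[0] == "hidden_states":
        return InputLayout.HIDDEN_FIRST
    if "residual" in names:
        return InputLayout.POSITIONS_FIRST_WITH_RESIDUAL
    return InputLayout.POSITIONS_FIRST


def hidden_states_index(forward) -> int:
    """Positional index of hidden_states, ignoring self."""
    names = [n for n in inspect.signature(forward).parameters if n != "self"]
    return names.index("hidden_states")


# Toy signatures shaped like the cases in the description (illustrative only).
def hf_style_forward(self, hidden_states, attention_mask=None): ...
def vllm_attn_forward(self, positions, hidden_states): ...
def vllm_layer_forward(self, positions, hidden_states, residual): ...


for fn in (hf_style_forward, vllm_attn_forward, vllm_layer_forward):
    print(fn.__name__, infer_input_layout(fn).name, hidden_states_index(fn))
```

For the positions-first signatures the index comes out as 1, which is consistent with reading hidden_states from module.inputs[0][1] rather than from the first positional argument.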

Test plan

  • tests/test_layout_detection.py — CPU-only unit tests for _infer_input_layout / _infer_output_layout / _parent_calls_with_kwargs (12 cases). All pass.
  • tests/test_vllm_hf_equivalence.py — loads SmolLM2-135M under HF and vLLM, verifies all LayerAccessor accessors match within bf16 tolerance. All pass.
  • scratch/llama_residual_check.py (manual) — verified layers_output[i] == layers_input[i+1] to 0.0 max diff for i ∈ {1, 2, 5, 15} on vLLM Llama.
  • Existing test_vllm.py — 9/10 still pass (the one failure, test_vllm_logits, is a pre-existing silent-trace-fail pattern unrelated to this refactor).
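
As a rough idea of what a CPU-only layout-detection case might look like, the sketch below probes a fake layer output and checks that a dual-stream shape mismatch raises early. Everything here is an assumption made for illustration: `FakeTensor`, the local `RenamingError` class, the `SINGLE_STREAM` member name, and the body of `infer_output_layout` are stand-ins rather than the code under test.

```python
import enum
from dataclasses import dataclass


class RenamingError(Exception):
    """Local stand-in for the library's error type."""


class OutputLayout(enum.Enum):
    SINGLE_STREAM = "single_stream"  # output is the hidden states
    DUAL_STREAM = "dual_stream"      # output is (hidden_states, residual)


@dataclass
class FakeTensor:
    """Shape-only placeholder so the sketch runs without torch."""
    shape: tuple


def infer_output_layout(output) -> OutputLayout:
    """Classify a layer output captured at runtime (hypothetical logic)."""
    is_pair = (
        isinstance(output, tuple)
        and len(output) == 2
        and all(hasattr(t, "shape") for t in output)
    )
    if not is_pair:
        return OutputLayout.SINGLE_STREAM
    hidden, residual = output
    if hidden.shape != residual.shape:
        # Fail at detection time instead of at the first hidden + residual.
        raise RenamingError(
            f"dual-stream shape mismatch: {hidden.shape} vs {residual.shape}"
        )
    return OutputLayout.DUAL_STREAM


def raises_on_mismatch() -> bool:
    try:
        infer_output_layout((FakeTensor((4, 576)), FakeTensor((4, 64))))
    except RenamingError:
        return True
    return False


print(infer_output_layout(FakeTensor((4, 576))).name)
print(infer_output_layout((FakeTensor((4, 576)), FakeTensor((4, 576)))).name)
print(raises_on_mismatch())
```

The shapes are arbitrary; the point is only that the mismatch surfaces as a named error during detection.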

Notes

  • Bumps nnsight>=0.7 (with [tool.uv] exclude-newer-package to bypass the global age gate for nnsight).
  • ln_final.output is intentionally excluded from the equivalence test: vLLM Llama's fused RMSNorm returns (normalized, residual) while HF returns a single tensor — the standardization for ln_final is a separate piece of work.
  • Architecture-generic: signature/source inspection works on any nn.Module whose forward params are conventionally named (positions, hidden_states, residual). Verified on Llama (dual-stream) and GPT-2 (single-stream); should extend to Qwen2 etc. without code changes.
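
The source-inspection part can be illustrated with the standard `ast` module. The sketch below is a guess at the general approach, not the PR's `_parent_calls_with_kwargs`: the helper name `hidden_states_passed_by_keyword` is hypothetical, and `DECODER_FORWARD` is a simplified forward written for this example, not vLLM's actual source.

```python
import ast
import textwrap

# Simplified decoder forward, written for this example only.
DECODER_FORWARD = '''
    def forward(self, positions, hidden_states, residual):
        hidden_states = self.self_attn(positions=positions, hidden_states=hidden_states)
        hidden_states = self.mlp(hidden_states)
        return hidden_states, residual
'''


def hidden_states_passed_by_keyword(forward_source: str, submodule: str) -> bool:
    """True if the first `self.<submodule>(...)` call passes hidden_states as a kwarg.

    Raises LookupError when the source contains no such call.
    """
    tree = ast.parse(textwrap.dedent(forward_source))
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if (
            isinstance(func, ast.Attribute)
            and func.attr == submodule
            and isinstance(func.value, ast.Name)
            and func.value.id == "self"
        ):
            return any(kw.arg == "hidden_states" for kw in node.keywords)
    raise LookupError(f"no call to self.{submodule} found")


print(hidden_states_passed_by_keyword(DECODER_FORWARD, "self_attn"))
print(hidden_states_passed_by_keyword(DECODER_FORWARD, "mlp"))
```

With a live module, the source string could come from `inspect.getsource(type(layer).forward)`. That only works when the source file is available, so a real implementation would presumably need a fallback.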

🤖 Generated with Claude Code

Replaces the misleading `_is_vllm_layers_output` predicate with explicit
`InputLayout` / `OutputLayout` enums detected per accessor. Input layout
comes from forward-signature inspection; output layout from a runtime probe
of layer 0; source inspection of the parent decoder layer disambiguates
positional vs kwargs sub-module calls (vLLM Llama calls `self.self_attn(positions=...,
hidden_states=...)` but `self.mlp(hidden_states)` positionally).

Fixes & changes:
- `layers_input[i]` for vLLM Llama now returns hidden_states (was returning
  the int64 positions tensor — first positional arg of the decoder forward).
- `attentions_input[i]` for vLLM Llama: same fix via kwargs path.
- Layer N>0 residual handling: `_read_input` returns `hidden_states + residual`
  to recover the combined stream; layer 0 returns hidden_states alone (residual
  is None there). Numerical sanity: layers_output[i] == layers_input[i+1] exactly.
- Dual-stream shape-mismatch check moved into `_infer_output_layout` so it
  raises `RenamingError` early with a clear message instead of silently
  broadcasting at the use site.
- Setter for vLLM dual-stream output uses in-place index assignment per
  nnsight VLLM_GUIDE (whole-tuple replacement crashes the engine).
- vLLM-only `.clone()` on single-stream reads in `_read_input`/`_read_output`
  and on `token_embeddings`: vLLM reuses inference-mode buffers across layers
  (layer N+1's fused RMSNorm mutates layer N's output buffer in-place); the
  saved reference would surface the post-mutation value. nnsight has
  clone-on-save for inference-mode tensors but it doesn't catch every path
  exercised here.

Tests:
- `tests/test_layout_detection.py` — CPU-only unit tests for the layout
  inference helpers (12 cases incl. shape-mismatch raise).
- `tests/test_vllm_hf_equivalence.py` — load SmolLM2 under both backends,
  assert each LayerAccessor returns the same hidden states (within bf16
  tolerance), plus an HF setter smoke test.

Bumps `nnsight` to >=0.7 (pinned via `[tool.uv] exclude-newer-package`).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
return self.embed_tokens.output
"""Returns the token embeddings. Equivalent to self.embed_tokens.output.

Clones for vLLM: the embed_tokens output buffer is reused across

@JadenFiotto-Kaufman I'm pretty convinced this is real. Is this a bug that should be fixed upstream in nnsight instead?

Butanium commented May 6, 2026

Repro using pure nnsight (no nnterp) — saves the same proxy twice, once as a bare reference and once with .clone() chained before .save(). After the trace exits, the bare-ref save shows the post-mutation buffer state; the cloned save shows the actual computed value.

# scratch/repro_clone_needed.py
import gc, torch as th
from nnsight.modeling.vllm import VLLM

MODEL  = "HuggingFaceTB/SmolLM2-135M-Instruct"  # Llama-arch
PROMPT = "Hello world"

def main():
    model = VLLM(MODEL, tensor_parallel_size=1, gpu_memory_utilization=0.3,
                 dispatch=True, dtype="bfloat16")
    try:
        with model.trace(PROMPT):
            embed_ref    = model.model.embed_tokens.output.save()
            embed_clone  = model.model.embed_tokens.output.clone().save()

            l0_args, _   = model.model.layers[0].inputs
            l0_hs_ref    = l0_args[1].save()
            l0_hs_clone  = l0_args[1].clone().save()

            attn_ref     = model.model.layers[0].self_attn.output.save()
            attn_clone   = model.model.layers[0].self_attn.output.clone().save()

            mlp_ref      = model.model.layers[0].mlp.output.save()
            mlp_clone    = model.model.layers[0].mlp.output.clone().save()

        def diff(name, a, b):
            d = (a.float() - b.float()).abs().max().item()
            tag = "OK" if d < 1e-3 else "MUTATED"
            print(f"  {name:30s} max|ref - clone| = {d:.4f}   [{tag}]")
            print(f"    ref   stats: min={a.min().item():.4f} max={a.max().item():.4f} std={a.std().item():.4f}")
            print(f"    clone stats: min={b.min().item():.4f} max={b.max().item():.4f} std={b.std().item():.4f}")

        diff("embed_tokens.output",        embed_ref,   embed_clone)
        diff("layers[0].inputs args[1]",   l0_hs_ref,   l0_hs_clone)
        diff("layers[0].self_attn.output", attn_ref,    attn_clone)
        diff("layers[0].mlp.output",       mlp_ref,     mlp_clone)
    finally:
        if getattr(model, "vllm_entrypoint", None) is not None:
            model.vllm_entrypoint.llm_engine.engine_core.shutdown()
        VLLM._cleanup_distributed()
        del model; gc.collect()

if __name__ == "__main__":
    main()

Output (single L40, torch 2.9.0, vllm 0.15.1, nnsight 0.7.0)

embed_tokens.output            max|ref - clone| = 4064.2344   [MUTATED]
  ref   stats: min=-800.0000  max=4064.0000  std=163.0000
  clone stats: min=-0.4785    max=0.4141     std=0.1099

layers[0].inputs args[1]       max|ref - clone| = 4064.2344   [MUTATED]
  ref   stats: min=-800.0000  max=4064.0000  std=163.0000
  clone stats: min=-0.4785    max=0.4141     std=0.1099

layers[0].self_attn.output     max|ref - clone| = 0.9688      [MUTATED]
  ref   stats: min=-1.0938    max=0.3164     std=0.0986
  clone stats: min=-2.0625    max=0.4785     std=0.1094

layers[0].mlp.output           max|ref - clone| = 23.6173     [MUTATED]
  ref   stats: min=-0.7031    max=4.3438     std=0.2158
  clone stats: min=-23.6250   max=16.5000    std=2.2969

The clone stats for embed_tokens.output match what HF returns for the same prompt (std ≈ 0.11). Without the clone, the saved tensor reflects whatever downstream fused kernels wrote into the buffer: the ref stds (163 / 0.10 / 0.22) are wildly different from the actual computed values.

For DUAL_STREAM outputs (layers_output[i] on Llama-arch) the standardized read returns target[0] + target[1], which is a fresh tensor, so no clone is needed there.
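
The aliasing behaviour behind these numbers can be reproduced in miniature without a GPU. The sketch below uses plain Python lists as a stand-in for tensors, so it shows only the reference-versus-copy distinction, not vLLM's or nnsight's actual mechanics. `fused_norm_inplace` is a made-up placeholder for a kernel that overwrites its input buffer.

```python
def fused_norm_inplace(buffer: list, scale: float) -> None:
    """Stand-in for a fused kernel that overwrites its input buffer in place."""
    for i, value in enumerate(buffer):
        buffer[i] = value * scale


buffer = [0.5, -0.25, 0.125]   # "layer output" living in a reused buffer
residual = [1.0, 1.0, 1.0]

saved_ref = buffer             # bare reference: aliases the buffer
saved_clone = list(buffer)     # copy: analogous to .clone() before .save()
# The sum allocates new storage, like target[0] + target[1] for dual-stream reads.
combined = [h + r for h, r in zip(buffer, residual)]

fused_norm_inplace(buffer, 1024.0)  # a downstream step reuses the buffer

print("ref     ", saved_ref)    # follows the mutation
print("clone   ", saved_clone)  # still the value computed by the layer
print("combined", combined)     # unaffected, it never shared storage
```

The bare reference tracks the overwritten buffer, while the copy and the freshly computed sum keep the original values. This is the same split the repro shows between the ref and clone rows.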

The VLLM_GUIDE notes that .save() auto-clones inference-mode tensors, but as this repro shows, that fix doesn't reach all the call paths exercised by the standardized accessors here — likely because the proxies surfaced by module.output / module.inputs[0][N] aren't tagged inference-mode at save time, or the clone happens at a layer of the proxy graph that's already been pre-resolved.

🤖 Generated with Claude Code
