feat(dllm): DiffusionGemma backend via dllm.cpp with native gemma4 chat #10258

Open
localai-bot wants to merge 13 commits into master from feat/dllm-backend
Conversation

@localai-bot

Collaborator

Summary

  • Adds backend/go/dllm: a pure-Go (purego, no cgo) backend over dllm.cpp's 9-symbol C-ABI for DiffusionGemma block-diffusion models, implementing the rich gRPC interface (PredictRich/PredictStreamRich with ChatDelta streaming).
  • Chat is first-class without jinja: a hand-rolled gemma4 renderer (messages/tools/enable_thinking to prompt; fixtures verbatim from transformers' canonical decodes, validated byte-for-byte against the GGUF-embedded template with an independent jinja harness) and a fragment-safe streaming parser (thought channels to reasoning_content, <|tool_call> payloads decoded per vLLM's gemma4 grammar with ~25 of its test cases ported).
  • The engine's one-ctx-one-generate contract is structural: a per-model worker goroutine owns all C calls, cancel is the documented cross-thread exception, an RWMutex guards Free against in-flight requests, and UTF-8 holdback protects proto3 strings at diffusion-block boundaries (bug found by e2e, fixed with hold-back + sanitize).
  • pkg/grpc: request cancellation for Go backends. The llama.cpp C++ backend polls IsCancelled() in its result loops, but Go backends never observed context cancellation. New optional Cancellable capability + context.AfterFunc registration in Predict/PredictStream (after the locking block). dllm implements it: measured ~10ms cancel latency vs ~10s orphaned generation; follow-up requests no longer queue behind cancelled ones.
  • Fully registered: backend gallery (cpu amd64/arm64 with manifest merge, cuda13, l4t-cuda13 for GB10-class), CI matrix, bump-deps pin tracking, /backends/known (preference-only: GGUF autodetect would shadow llama-cpp), diffusiongemma-26b-a4b-it gallery model (sha256 verified against the HF LFS oid), user docs + .agents/dllm-backend.md.
  • Also: core/http test suite listen port is now configurable via LOCALAI_TEST_HTTP_PORT (~70 hardcoded 9090 call sites broke the pre-commit gate on machines where 9090 is taken).
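
The UTF-8 hold-back mentioned in the summary can be illustrated with a minimal sketch. It assumes the engine hands over raw byte fragments that may end in the middle of a multi-byte character; the `holdback` type and its methods are hypothetical names for illustration and are not taken from the backend's source.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// holdback is a hypothetical helper (not the backend's actual type). It keeps
// an incomplete trailing UTF-8 sequence back until the next fragment arrives.
type holdback struct {
	pending []byte
}

// Push appends a raw fragment and returns the part that is safe to emit.
func (h *holdback) Push(frag []byte) string {
	h.pending = append(h.pending, frag...)
	cut := len(h.pending)
	// A rune is at most utf8.UTFMax bytes, so an incomplete one must start
	// within the last UTFMax-1 bytes of the buffer.
	for i := 1; i < utf8.UTFMax && i <= len(h.pending); i++ {
		start := len(h.pending) - i
		if utf8.RuneStart(h.pending[start]) {
			if !utf8.FullRune(h.pending[start:]) {
				cut = start // hold the partial rune back
			}
			break
		}
	}
	out := string(h.pending[:cut])
	h.pending = append(h.pending[:0], h.pending[cut:]...)
	// Sanitize: genuinely invalid bytes become U+FFFD instead of reaching
	// a proto3 string field.
	return strings.ToValidUTF8(out, "\uFFFD")
}

// Flush emits whatever is left at end of stream, sanitized.
func (h *holdback) Flush() string {
	out := strings.ToValidUTF8(string(h.pending), "\uFFFD")
	h.pending = h.pending[:0]
	return out
}

func main() {
	var h holdback
	// "é" is 0xC3 0xA9; the fragment boundary falls between its two bytes.
	fmt.Printf("%q\n", h.Push([]byte{'a', 0xC3}))
	fmt.Printf("%q\n", h.Push([]byte{0xA9, 'b'}))
	fmt.Printf("%q\n", h.Flush())
}
```

The first `Push` returns only `"a"` and the second returns `"éb"`, so no emitted string ever ends in a partial character.
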

Validation

  • Renderer/parser/wiring unit suites run ungated in CI; C-ABI smoke and gRPC e2e are env-gated on a tiny fixture model (green here incl. -race); a 2-split invariance property covers every fragment boundary.
  • Real-model validation on DGX Spark (GB10, CUDA 13). BF16 (50GB): load 32.7s, 5.6s per denoise step, coherent long-form generation. Q4_K_M (17GB): quality holds (golden cosine 0.9862), but each step is ~5x slower than BF16 on this hardware (K-quant MoE dequant vs native BF16 tensor cores); documented as guidance.

Caveats

  • github.com/mudler/dllm.cpp is private until publication (planned at merge time): backend image builds and the nightly bump leg stay red until then. Re-trigger CI after publication.
  • Request cancellation requires backends to opt into Cancellable; existing backends are unaffected.
  • Real-26B e2e specs are env-gated (CUDA-13-class hardware required); throughput (~0.15 tok/s BF16) is bound by per-step full recompute until dllm.cpp's prefix-KV cache (P3) lands.
  • History note: commit 294c04a carries both the gemma4 renderer and parser (a transient lint race bundled them); content is complete and reviewed.

Test Plan

  • CI green after mudler/dllm.cpp publication (backend images + bump leg)
  • go test ./backend/go/dllm/... ungated + gated with -race (tiny fixture)
  • tests/e2e-backends dllm specs incl. cancellation functional-red proof
  • pkg/grpc cancellation specs; core/http/endpoints/localai known-backends spec
  • Real-model BF16 + Q4_K_M validation on GB10 (results in docs)

🤖 Generated with Claude Code

mudler added 10 commits June 11, 2026 14:28
The core/http specs hardcoded 127.0.0.1:9090 in ~70 call sites, so the
pre-commit coverage gate fails on any machine where an unrelated service
holds 9090. Centralize the address in the suite file behind
LOCALAI_TEST_HTTP_PORT (default unchanged: 9090).

Assisted-by: Claude Code (Fable 5)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
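
As a rough illustration of the centralization described in this commit, the sketch below resolves the listen address from `LOCALAI_TEST_HTTP_PORT` with 9090 as the fallback. The helper name `testHTTPAddr` and the injected `getenv` parameter are assumptions made for this example; the suite file may structure it differently.

```go
package main

import (
	"fmt"
	"net"
	"os"
)

// defaultTestHTTPPort mirrors the previously hardcoded port.
const defaultTestHTTPPort = "9090"

// testHTTPAddr is a hypothetical helper. It takes the environment lookup as a
// parameter so the fallback can be exercised without mutating the process env.
func testHTTPAddr(getenv func(string) string) string {
	port := getenv("LOCALAI_TEST_HTTP_PORT")
	if port == "" {
		port = defaultTestHTTPPort
	}
	return net.JoinHostPort("127.0.0.1", port)
}

func main() {
	// Specs would build URLs from this single value instead of repeating
	// "127.0.0.1:9090" at every call site.
	fmt.Println("http://" + testHTTPAddr(os.Getenv))
}
```
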
Binds the 9-symbol flat C-ABI of dllm.cpp (DiffusionGemma engine) via
purego: typed wrappers with correct string ownership (malloc'd returns
freed via dllm_capi_free_string, borrowed last_error never freed),
once-allocated stream-callback trampolines, and a gated Ginkgo binding
smoke against the tiny fixture model.

Assisted-by: Claude Code (Fable 5)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Fragment-safe state machine (content / channel header / thought /
tool-call / done) classifying model output into content,
reasoning_content and tool_calls deltas. Tool-call payload decoder is a
non-partial port of vLLM's gemma4 parser grammar; ~25 of its test cases
are ported with citations, plus a 2-split invariance property over
every byte position. Recursion depth-capped against model-generated
deep nesting; marker constants shared with the renderer.

Assisted-by: Claude Code (Fable 5)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
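
The 2-split invariance property described in this commit can be sketched on a toy parser. Everything below is illustrative: the markers `<t>` and `</t>` are placeholders rather than the real gemma4 markers, and the parser only separates content from reasoning (no channel headers or tool calls). The property itself is the same idea: feeding the input in one piece or split at any byte position must yield identical results.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical markers; the real gemma4 markers differ.
const (
	openTag  = "<t>"
	closeTag = "</t>"
)

// parser is a toy fragment-safe splitter of content vs. reasoning text.
type parser struct {
	buf       string
	inThought bool
	content   strings.Builder
	reasoning strings.Builder
}

func (p *parser) emit(s string) {
	if p.inThought {
		p.reasoning.WriteString(s)
	} else {
		p.content.WriteString(s)
	}
}

// Feed consumes one fragment. A trailing run that could still grow into the
// next expected marker is held back instead of being emitted.
func (p *parser) Feed(frag string) {
	p.buf += frag
	for {
		tag := openTag
		if p.inThought {
			tag = closeTag
		}
		if i := strings.Index(p.buf, tag); i >= 0 {
			p.emit(p.buf[:i])
			p.buf = p.buf[i+len(tag):]
			p.inThought = !p.inThought
			continue
		}
		keep := 0 // longest suffix of buf that is a proper prefix of tag
		for k := len(tag) - 1; k > 0; k-- {
			if strings.HasSuffix(p.buf, tag[:k]) {
				keep = k
				break
			}
		}
		p.emit(p.buf[:len(p.buf)-keep])
		p.buf = p.buf[len(p.buf)-keep:]
		return
	}
}

// Close flushes held-back bytes at end of stream.
func (p *parser) Close() { p.emit(p.buf); p.buf = "" }

func parse(frags ...string) (content, reasoning string) {
	var p parser
	for _, f := range frags {
		p.Feed(f)
	}
	p.Close()
	return p.content.String(), p.reasoning.String()
}

// twoSplitInvariant checks every split position against the one-shot parse.
func twoSplitInvariant(s string) bool {
	wantC, wantR := parse(s)
	for i := 0; i <= len(s); i++ {
		if c, r := parse(s[:i], s[i:]); c != wantC || r != wantR {
			return false
		}
	}
	return true
}

func main() {
	in := "a<t>b</t>c"
	c, r := parse(in)
	fmt.Printf("content=%q reasoning=%q invariant=%v\n", c, r, twoSplitInvariant(in))
}
```
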
Implements PredictRich/PredictStreamRich (legacy methods delegate),
TokenizeString, and Load over the purego binding. A single worker
goroutine serializes all C calls per the dllm.cpp one-generate-per-ctx
contract (cancel is the documented exception); an RWMutex guards Free
against in-flight requests. Under use_tokenizer_template the gemma4
renderer and streaming parser own templating and ChatDelta extraction;
raw-prompt mode passes through verbatim. enable_thinking is opt-in via
request metadata (the gemma4 template treats thinking as opt-in).

Assisted-by: Claude Code (Fable 5)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
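
The serialization scheme described in this commit can be sketched roughly as follows. This is an assumption-laden illustration, not the backend's code: `engine`, `Do`, and `Free` are hypothetical names, the closures stand in for C calls, and the cross-thread cancel exception is deliberately left out.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errFreed = errors.New("engine already freed")

// engine funnels every call through one worker goroutine, so at most one
// "generate" runs against the context at a time.
type engine struct {
	mu    sync.RWMutex // readers: in-flight requests; writer: Free
	freed bool
	jobs  chan func()
}

func newEngine() *engine {
	e := &engine{jobs: make(chan func())}
	go func() {
		for job := range e.jobs { // the only goroutine that runs jobs
			job()
		}
	}()
	return e
}

// Do runs fn on the worker and waits for it. The read lock is held for the
// whole request, so Free cannot proceed while a request is in flight.
func (e *engine) Do(fn func()) error {
	e.mu.RLock()
	defer e.mu.RUnlock()
	if e.freed {
		return errFreed
	}
	done := make(chan struct{})
	e.jobs <- func() {
		defer close(done)
		fn()
	}
	<-done
	return nil
}

// Free takes the write lock, which waits for in-flight requests to finish,
// then stops the worker. A real backend would release the C context here.
func (e *engine) Free() {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.freed {
		return
	}
	e.freed = true
	close(e.jobs)
}

func main() {
	e := newEngine()
	steps := 0 // touched only from the worker goroutine
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = e.Do(func() { steps++ })
		}()
	}
	wg.Wait()
	e.Free()
	fmt.Println(steps, e.Do(func() {}))
}
```

Holding the read lock across the whole request is what lets `Free` double as a drain: the write lock cannot be acquired until every in-flight `Do` has returned.
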
Registers the dllm backend across every surface: backend gallery index
(cpu amd64+arm64 with manifest merge, cuda13, l4t-cuda13 for GB10-class
hardware; no darwin per engine scope), top-level Makefile targets,
bump_deps pin tracking for DLLM_VERSION, and the curated known-backends
list for /backends/known (pref-only: auto-detecting on .gguf would
shadow llama-cpp). Note: image builds and the nightly bump leg stay red
until github.com/mudler/dllm.cpp is published (planned at merge time).

Assisted-by: Claude Code (Fable 5)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Gallery model diffusiongemma-26b-a4b-it (unsloth BF16 GGUF, sha256
verified against the HF LFS oid) with use_tokenizer_template and an
honest experimental/throughput description. e2e: BACKEND_BINARY-mode
specs boot the real gRPC backend with the tiny fixture model (templated
chat + streaming); real-26B specs are separately env-gated. Adds an
opt-in BACKEND_TEST_SEED knob so random-weight fixture models run the
generic specs deterministically.

Assisted-by: Claude Code (Fable 5)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
User docs: dllm section in text-generation (setup, eb_* options table,
n_predict canvas rounding, enable_thinking metadata, honest GB10
throughput numbers). Agents guide: .agents/dllm-backend.md covering the
purego C-ABI contract, serialization rules, template provenance, test
layers, and known limitations.

Assisted-by: Claude Code (Fable 5)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Drop the stray executable bit from the Go sources and Makefile (the
sibling Go backends commit them 644; only run.sh/package.sh are
executable), and correct two documentation claims found in the final
branch review: cuda13-dllm is built for amd64 only (arm64 CUDA ships as
the l4t flavor), and package.sh is the parakeet-cpp-style stub layout
with no ldd walk.

Assisted-by: Claude Code (Fable 5)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…capability

The llama.cpp C++ backend aborts generation when its gRPC context is
cancelled (grpc-server.cpp polls context->IsCancelled() in the result
loops), but Go backends served by pkg/grpc never observed context
cancellation: a disconnected client left the generation running to
completion. Add an optional Cancellable capability; the server registers
context.AfterFunc on the request/stream context (after the Locking block
so queued requests cannot abort the current owner) covering both rich
and legacy paths. dllm implements it: measured cancel latency ~10ms vs
~10s of orphaned generation, and follow-up requests no longer queue
behind cancelled ones (~220ms vs ~9s in the e2e proof).

Assisted-by: Claude Code (Fable 5)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
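
A minimal sketch of the opt-in cancellation pattern described in this commit, using the standard library's `context.AfterFunc` (Go 1.21+). The `Cancellable` method set, the `Backend` interface, and `slowBackend` are assumptions for illustration; the actual pkg/grpc interfaces and the locking block are not reproduced here.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// Cancellable is the optional capability; its method set is an assumption.
type Cancellable interface {
	Cancel()
}

// Backend is a stand-in for the generation interface.
type Backend interface {
	Predict(prompt string) string
}

// predict registers the backend's Cancel on the request context only when the
// backend opts in; other backends run exactly as before.
func predict(ctx context.Context, b Backend, prompt string) string {
	if c, ok := b.(Cancellable); ok {
		stop := context.AfterFunc(ctx, c.Cancel)
		defer stop() // unregister once the request has completed
	}
	return b.Predict(prompt)
}

// slowBackend simulates a multi-step generation that checks for cancellation
// between steps.
type slowBackend struct {
	once      sync.Once
	cancelled chan struct{}
}

func newSlowBackend() *slowBackend {
	return &slowBackend{cancelled: make(chan struct{})}
}

func (s *slowBackend) Cancel() { s.once.Do(func() { close(s.cancelled) }) }

func (s *slowBackend) Predict(string) string {
	for step := 0; step < 20; step++ {
		select {
		case <-s.cancelled:
			return "cancelled"
		case <-time.After(5 * time.Millisecond):
		}
	}
	return "done"
}

// run issues one request whose context expires after timeoutMs milliseconds.
func run(timeoutMs int) string {
	ctx, cancel := context.WithTimeout(context.Background(),
		time.Duration(timeoutMs)*time.Millisecond)
	defer cancel()
	return predict(ctx, newSlowBackend(), "hi")
}

func main() {
	fmt.Println(run(10))   // context expires mid-generation
	fmt.Println(run(5000)) // generation finishes first
}
```

Because the check is a type assertion, a backend that does not implement `Cancel` is left untouched, which matches the opt-in behaviour described above.
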
Q4_K_M validated on GB10: quality holds (cosine 0.9862, coherent
generation, 19/48 stopper exit) but a forward step is ~5x slower than
BF16 (27.5s vs 5.6s: native BF16 tensor cores vs K-quant MoE dequant).
Guidance: prefer BF16 when it fits; Q4_K_M is the memory-bound option.

Assisted-by: Claude Code (Fable 5)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Comment thread backend/go/dllm/capi.go
if cptr == 0 {
return ""
}
p := *(*unsafe.Pointer)(unsafe.Pointer(&cptr)) // C-owned buffer, not Go-GC memory (see doc above)
Comment thread backend/go/dllm/capi.go
for *(*byte)(unsafe.Add(p, n)) != 0 {
n++
}
return string(unsafe.Slice((*byte)(p), n))
mudler added 3 commits June 11, 2026 20:24
Q4_K_M (~17 GB, GB10-validated: cosine 0.9862, coherent generation) is
the friendlier default download than the 50 GB BF16; Q8_0 (~27 GB) is
the higher-fidelity middle ground. Both descriptions carry the measured
caveat that BF16 is ~5x faster per denoise step on BF16-native hardware,
with a pointer to fetch it manually when it fits. sha256 values are the
HF LFS oids.

Assisted-by: Claude Code (Fable 5)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Routes PredictOptions.Images (raw base64, the core convention) through
dllm.cpp's probed multimodal entry points as data: URIs; the gemma4
renderer appends one engine-side <image> marker per image after the
last user message (llama.cpp attachment convention; the template's
content-parts branch is unreachable through the flattened pb shape).
The engine expands markers to boi + soft*n + eoi and splices the
vision-tower embeddings. Older libdllm.so without the mm symbols fails
with an actionable error (Dlsym probe). DLLM_VERSION pin bumped to the
engine's vision-capable commit.

Assisted-by: Claude Code (Fable 5)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

3 participants