feat(dllm): DiffusionGemma backend via dllm.cpp with native gemma4 chat by localai-bot · Pull Request #10258 · mudler/LocalAI

localai-bot · 2026-06-11T20:11:44Z

Summary

Adds backend/go/dllm: a pure-Go (purego, no cgo) backend over dllm.cpp's 9-symbol C-ABI for DiffusionGemma block-diffusion models, implementing the rich gRPC interface (PredictRich/PredictStreamRich with ChatDelta streaming).
Chat is first-class without jinja: a hand-rolled gemma4 renderer (messages/tools/enable_thinking to prompt; fixtures verbatim from transformers' canonical decodes, validated byte-for-byte against the GGUF-embedded template with an independent jinja harness) and a fragment-safe streaming parser (thought channels to reasoning_content, <|tool_call> payloads decoded per vLLM's gemma4 grammar with ~25 of its test cases ported).
The engine's one-ctx-one-generate contract is structural: a per-model worker goroutine owns all C calls, cancel is the documented cross-thread exception, an RWMutex guards Free against in-flight requests, and UTF-8 holdback protects proto3 strings at diffusion-block boundaries (bug found by e2e, fixed with hold-back + sanitize).
pkg/grpc: request cancellation for Go backends. The llama.cpp C++ backend polls IsCancelled() in its result loops, but Go backends never observed context cancellation. New optional Cancellable capability + context.AfterFunc registration in Predict/PredictStream (after the locking block). dllm implements it: measured ~10ms cancel latency vs ~10s orphaned generation; follow-up requests no longer queue behind cancelled ones.
Fully registered: backend gallery (cpu amd64/arm64 with manifest merge, cuda13, l4t-cuda13 for GB10-class), CI matrix, bump-deps pin tracking, /backends/known (preference-only: GGUF autodetect would shadow llama-cpp), diffusiongemma-26b-a4b-it gallery model (sha256 verified against the HF LFS oid), user docs + .agents/dllm-backend.md.
Also: core/http test suite listen port is now configurable via LOCALAI_TEST_HTTP_PORT (~70 hardcoded 9090 call sites broke the pre-commit gate on machines where 9090 is taken).
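The UTF-8 hold-back mentioned in the summary can be sketched as follows. This is a minimal illustration, not the backend's code: the `holdback` type and its methods are hypothetical names, and the sketch assumes the fix works by withholding an incomplete trailing rune until the next fragment arrives and sanitizing whatever is still invalid.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// holdback (hypothetical name) keeps the bytes of an incomplete trailing
// UTF-8 sequence so that every emitted chunk is valid UTF-8 on its own.
type holdback struct {
	pending []byte
}

// Push appends a raw fragment and returns the part that is safe to emit.
func (h *holdback) Push(frag []byte) string {
	buf := append(h.pending, frag...)
	cut := len(buf)
	// Look back at most UTFMax-1 bytes for a rune start that is still incomplete.
	for i := 1; i < utf8.UTFMax && i <= len(buf); i++ {
		if utf8.RuneStart(buf[len(buf)-i]) {
			if !utf8.FullRune(buf[len(buf)-i:]) {
				cut = len(buf) - i
			}
			break
		}
	}
	h.pending = append([]byte(nil), buf[cut:]...)
	// Sanitize: bytes that are invalid rather than incomplete become U+FFFD.
	return strings.ToValidUTF8(string(buf[:cut]), "\uFFFD")
}

// Flush emits whatever is still held back at end of stream, sanitized.
func (h *holdback) Flush() string {
	out := strings.ToValidUTF8(string(h.pending), "\uFFFD")
	h.pending = nil
	return out
}

func main() {
	h := &holdback{}
	// "é" is 0xC3 0xA9; here a fragment boundary splits it in two.
	fmt.Printf("%q\n", h.Push([]byte{'a', 0xC3}))
	fmt.Printf("%q\n", h.Push([]byte{0xA9, 'b'}))
	fmt.Printf("%q\n", h.Flush())
}
```

The hold-back never delays more than three bytes, so streaming latency is unaffected while each proto3 string field stays valid.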

Validation

Renderer/parser/wiring unit suites run ungated in CI; C-ABI smoke and gRPC e2e are env-gated on a tiny fixture model (green here incl. -race); a 2-split invariance property covers every fragment boundary.
Real-model validation on DGX Spark (GB10, CUDA 13). BF16 (50GB): load 32.7s, 5.6s per denoise step, coherent long-form generation. Q4_K_M (17GB): quality holds (golden cosine 0.9862), but each step is ~5x slower than BF16 on this hardware (K-quant MoE dequant vs native BF16 tensor cores); this is documented as guidance.

Caveats

github.com/mudler/dllm.cpp is private until publication (planned at merge time): backend image builds and the nightly bump leg stay red until then. Re-trigger CI after publication.
Request cancellation requires backends to opt into Cancellable; existing backends are unaffected.
Real-26B e2e specs are env-gated (CUDA-13-class hardware required); throughput (~0.15 tok/s BF16) is bound by per-step full recompute until dllm.cpp's prefix-KV cache (P3) lands.
History note: commit 294c04a carries both the gemma4 renderer and parser (a transient lint race bundled them); content is complete and reviewed.

Test Plan

CI green after mudler/dllm.cpp publication (backend images + bump leg)
go test ./backend/go/dllm/... ungated + gated with -race (tiny fixture)
tests/e2e-backends dllm specs incl. cancellation functional-red proof
pkg/grpc cancellation specs; core/http/endpoints/localai known-backends spec
Real-model BF16 + Q4_K_M validation on GB10 (results in docs)

🤖 Generated with Claude Code

The core/http specs hardcoded 127.0.0.1:9090 in ~70 call sites, so the pre-commit coverage gate fails on any machine where an unrelated service holds 9090. Centralize the address in the suite file behind LOCALAI_TEST_HTTP_PORT (default unchanged: 9090). Assisted-by: Claude Code (Fable 5) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
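A minimal sketch of how such a centralized address helper could look. Only the variable name `LOCALAI_TEST_HTTP_PORT`, the host `127.0.0.1`, and the default `9090` come from the commit message; the function name `testHTTPAddr` and the injected lookup function are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"net"
	"os"
)

// testHTTPAddr (hypothetical name) resolves the suite's listen address once,
// so specs no longer hardcode it. The lookup function is injected to keep the
// helper testable; production callers would pass os.Getenv.
func testHTTPAddr(getenv func(string) string) string {
	port := getenv("LOCALAI_TEST_HTTP_PORT")
	if port == "" {
		port = "9090" // default unchanged
	}
	return net.JoinHostPort("127.0.0.1", port)
}

func main() {
	fmt.Println(testHTTPAddr(os.Getenv))
}
```

With a helper of this kind, a run on a machine where 9090 is taken would only need the variable set in the environment before invoking the suite.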

Binds the 9-symbol flat C-ABI of dllm.cpp (DiffusionGemma engine) via purego: typed wrappers with correct string ownership (malloc'd returns freed via dllm_capi_free_string, borrowed last_error never freed), once-allocated stream-callback trampolines, and a gated Ginkgo binding smoke against the tiny fixture model. Assisted-by: Claude Code (Fable 5) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

Fragment-safe state machine (content / channel header / thought / tool-call / done) classifying model output into content, reasoning_content and tool_calls deltas. Tool-call payload decoder is a non-partial port of vLLM's gemma4 parser grammar; ~25 of its test cases are ported with citations, plus a 2-split invariance property over every byte position. Recursion depth-capped against model-generated deep nesting; marker constants shared with the renderer. Assisted-by: Claude Code (Fable 5) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
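The 2-split invariance property can be illustrated on a toy parser. Everything here is hypothetical: the markers `<t>` and `</t>` stand in for the real gemma4 markers, and the two-state splitter is far simpler than the state machine described above. The point of the sketch is the property itself: feeding the input whole or split at any byte position must produce the same deltas.

```go
package main

import (
	"fmt"
	"strings"
)

// Toy markers for illustration only; the real gemma4 markers differ.
const (
	openTag  = "<t>"
	closeTag = "</t>"
)

type result struct{ content, reasoning string }

type splitter struct {
	buf       string
	inThought bool
	out       result
}

func (s *splitter) emit(text string) {
	if s.inThought {
		s.out.reasoning += text
	} else {
		s.out.content += text
	}
}

// partialSuffix returns the length of the longest proper prefix of tag that
// s ends with, i.e. how many bytes might still grow into a marker.
func partialSuffix(s, tag string) int {
	for n := len(tag) - 1; n > 0; n-- {
		if strings.HasSuffix(s, tag[:n]) {
			return n
		}
	}
	return 0
}

// Feed consumes one fragment, emitting everything that cannot be part of a marker.
func (s *splitter) Feed(frag string) {
	s.buf += frag
	for {
		tag := openTag
		if s.inThought {
			tag = closeTag
		}
		i := strings.Index(s.buf, tag)
		if i < 0 {
			keep := partialSuffix(s.buf, tag)
			s.emit(s.buf[:len(s.buf)-keep])
			s.buf = s.buf[len(s.buf)-keep:]
			return
		}
		s.emit(s.buf[:i])
		s.buf = s.buf[i+len(tag):]
		s.inThought = !s.inThought
	}
}

// Close flushes any held-back bytes and returns the accumulated result.
func (s *splitter) Close() result {
	s.emit(s.buf)
	s.buf = ""
	return s.out
}

func parse(frags ...string) result {
	var s splitter
	for _, f := range frags {
		s.Feed(f)
	}
	return s.Close()
}

// twoSplitInvariant checks every byte position as a fragment boundary.
func twoSplitInvariant(input string) bool {
	want := parse(input)
	for i := 0; i <= len(input); i++ {
		if parse(input[:i], input[i:]) != want {
			return false
		}
	}
	return true
}

func main() {
	in := "hi<t>why</t>ok"
	fmt.Printf("%+v invariant=%v\n", parse(in), twoSplitInvariant(in))
}
```

A property of this shape is cheap to run over every ported test input, which is presumably why it pairs well with a fixed set of grammar cases.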

Implements PredictRich/PredictStreamRich (legacy methods delegate), TokenizeString, and Load over the purego binding. A single worker goroutine serializes all C calls per the dllm.cpp one-generate-per-ctx contract (cancel is the documented exception); an RWMutex guards Free against in-flight requests. Under use_tokenizer_template the gemma4 renderer and streaming parser own templating and ChatDelta extraction; raw-prompt mode passes through verbatim. enable_thinking is opt-in via request metadata (the gemma4 template treats thinking as opt-in). Assisted-by: Claude Code (Fable 5) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
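The serialization scheme described in this commit can be sketched roughly as below. It is an illustration under stated assumptions: `model`, `job`, `Generate` and `Free` are hypothetical names, the closure stands in for a blocking C call, and the cross-thread cancel exception is left out.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errFreed = errors.New("model freed")

// job carries one unit of work to the worker; run stands in for a C call.
type job struct {
	run  func() string
	done chan string
}

// model (hypothetical) owns one worker goroutine that performs all calls.
type model struct {
	mu     sync.RWMutex // requests hold RLock; Free takes the write lock
	jobs   chan job
	closed bool
	wg     sync.WaitGroup
}

func newModel() *model {
	m := &model{jobs: make(chan job)}
	m.wg.Add(1)
	go func() {
		defer m.wg.Done()
		for j := range m.jobs { // one job at a time: one generate per ctx
			j.done <- j.run()
		}
	}()
	return m
}

// Generate queues work on the worker and waits for its result.
func (m *model) Generate(prompt string) (string, error) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	if m.closed {
		return "", errFreed
	}
	j := job{run: func() string { return "echo:" + prompt }, done: make(chan string, 1)}
	m.jobs <- j
	return <-j.done, nil
}

// Free waits for in-flight requests to drain, then stops the worker.
func (m *model) Free() {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.closed {
		return
	}
	m.closed = true
	close(m.jobs)
	m.wg.Wait()
}

func main() {
	m := newModel()
	out, err := m.Generate("hi")
	fmt.Println(out, err)
	m.Free()
	_, err = m.Generate("again")
	fmt.Println(err)
}
```

Because requests hold the read lock for their whole lifetime, `Free` cannot release the underlying context while a call is still running on the worker.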

Registers the dllm backend across every surface: backend gallery index (cpu amd64+arm64 with manifest merge, cuda13, l4t-cuda13 for GB10-class hardware; no darwin per engine scope), top-level Makefile targets, bump_deps pin tracking for DLLM_VERSION, and the curated known-backends list for /backends/known (pref-only: auto-detecting on .gguf would shadow llama-cpp). Note: image builds and the nightly bump leg stay red until github.com/mudler/dllm.cpp is published (planned at merge time). Assisted-by: Claude Code (Fable 5) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

Gallery model diffusiongemma-26b-a4b-it (unsloth BF16 GGUF, sha256 verified against the HF LFS oid) with use_tokenizer_template and an honest experimental/throughput description. e2e: BACKEND_BINARY-mode specs boot the real gRPC backend with the tiny fixture model (templated chat + streaming); real-26B specs are separately env-gated. Adds an opt-in BACKEND_TEST_SEED knob so random-weight fixture models run the generic specs deterministically. Assisted-by: Claude Code (Fable 5) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

User docs: dllm section in text-generation (setup, eb_* options table, n_predict canvas rounding, enable_thinking metadata, honest GB10 throughput numbers). Agents guide: .agents/dllm-backend.md covering the purego C-ABI contract, serialization rules, template provenance, test layers, and known limitations. Assisted-by: Claude Code (Fable 5) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

Drop the stray executable bit from the Go sources and Makefile (the sibling Go backends commit them 644; only run.sh/package.sh are executable), and correct two documentation claims found in the final branch review: cuda13-dllm is built for amd64 only (arm64 CUDA ships as the l4t flavor), and package.sh is the parakeet-cpp-style stub layout with no ldd walk. Assisted-by: Claude Code (Fable 5) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

…capability

The llama.cpp C++ backend aborts generation when its gRPC context is cancelled (grpc-server.cpp polls context->IsCancelled() in the result loops), but Go backends served by pkg/grpc never observed context cancellation: a disconnected client left the generation running to completion. Add an optional Cancellable capability; the server registers context.AfterFunc on the request/stream context (after the Locking block so queued requests cannot abort the current owner) covering both rich and legacy paths. dllm implements it: measured cancel latency ~10ms vs ~10s of orphaned generation, and follow-up requests no longer queue behind cancelled ones (~220ms vs ~9s in the e2e proof). Assisted-by: Claude Code (Fable 5) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
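A rough sketch of the opt-in pattern this commit describes. `context.AfterFunc` is the standard library call (Go 1.21+); the exact shape of the `Cancellable` interface, the `Backend` interface, and the `serve` and `demo` functions are assumptions for illustration, since the commit message names the capability but not its signature.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// Backend and Cancellable are hypothetical shapes for this sketch.
type Backend interface{ Predict(prompt string) string }

type Cancellable interface{ Cancel() }

// slowBackend simulates a long generation that polls a cancel flag per step.
type slowBackend struct{ cancelled atomic.Bool }

func (b *slowBackend) Predict(prompt string) string {
	for step := 0; step < 1000; step++ {
		if b.cancelled.Load() {
			return "cancelled"
		}
		time.Sleep(time.Millisecond)
	}
	return "done"
}

func (b *slowBackend) Cancel() { b.cancelled.Store(true) }

// serve registers the cancel hook only for backends that opt in, so
// backends without the capability behave exactly as before.
func serve(ctx context.Context, b Backend, prompt string) string {
	if c, ok := b.(Cancellable); ok {
		stop := context.AfterFunc(ctx, c.Cancel)
		defer stop() // unregister once the request finishes normally
	}
	return b.Predict(prompt)
}

// demo simulates a client that disconnects 20ms into the request.
func demo() string {
	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Millisecond)
	defer cancel()
	return serve(ctx, &slowBackend{}, "hello")
}

func main() {
	fmt.Println(demo())
}
```

Registering the hook after any request-level locking, as the commit notes, matters because a hook registered earlier could fire for a queued request and abort the generation that currently owns the engine.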

Q4_K_M validated on GB10: quality holds (cosine 0.9862, coherent generation, 19/48 stopper exit) but a forward step is ~5x slower than BF16 (27.5s vs 5.6s: native BF16 tensor cores vs K-quant MoE dequant). Guidance: prefer BF16 when it fits; Q4_K_M is the option for memory-constrained hosts. Assisted-by: Claude Code (Fable 5) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

+	if cptr == 0 {
+		return ""
+	}
+	p := *(*unsafe.Pointer)(unsafe.Pointer(&cptr)) // C-owned buffer, not Go-GC memory (see doc above)


+	for *(*byte)(unsafe.Add(p, n)) != 0 {
+		n++
+	}
+	return string(unsafe.Slice((*byte)(p), n))
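For readers unfamiliar with the pattern in the hunks above, here is a self-contained sketch of the NUL-terminated scan. It is not the file's code: `cString` and `fromBytes` are hypothetical names, and a Go byte slice stands in for the C-owned buffer, so the `uintptr` reinterpretation step from the diff is not needed here.

```go
package main

import (
	"fmt"
	"unsafe"
)

// cString copies a NUL-terminated byte sequence starting at p into a Go string.
// The caller remains responsible for the original buffer.
func cString(p unsafe.Pointer) string {
	if p == nil {
		return ""
	}
	n := 0
	for *(*byte)(unsafe.Add(p, n)) != 0 { // walk until the terminator
		n++
	}
	// string(...) copies, so the result does not alias the source buffer.
	return string(unsafe.Slice((*byte)(p), n))
}

// fromBytes is a test helper: b must contain a NUL terminator.
func fromBytes(b []byte) string {
	if len(b) == 0 {
		return ""
	}
	return cString(unsafe.Pointer(&b[0]))
}

func main() {
	fmt.Printf("%q\n", fromBytes([]byte("hello\x00ignored")))
}
```

Since the conversion copies the bytes, a malloc'd return value can be released immediately afterwards, which matches the ownership rule stated in the binding commit (free via dllm_capi_free_string, never free the borrowed last_error).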


Q4_K_M (~17 GB, GB10-validated: cosine 0.9862, coherent generation) is a friendlier default download than the 50 GB BF16; Q8_0 (~27 GB) is the higher-fidelity middle ground. Both descriptions carry the measured caveat that BF16 is ~5x faster per denoise step on BF16-native hardware, with a pointer to fetch it manually when it fits. sha256 values are the HF LFS oids. Assisted-by: Claude Code (Fable 5) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

Routes PredictOptions.Images (raw base64, the core convention) through dllm.cpp's probed multimodal entry points as data: URIs; the gemma4 renderer appends one engine-side <image> marker per image after the last user message (llama.cpp attachment convention; the template's content-parts branch is unreachable through the flattened pb shape). The engine expands markers to boi + soft*n + eoi and splices the vision-tower embeddings. Older libdllm.so without the mm symbols fails with an actionable error (Dlsym probe). DLLM_VERSION pin bumped to the engine's vision-capable commit. Assisted-by: Claude Code (Fable 5) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
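The image routing described here might look roughly like the sketch below. Both function names are hypothetical, and two details are assumptions not stated in the commit: that the MIME type is sniffed from the decoded bytes (done here with the standard `http.DetectContentType`), and that markers are concatenated with no separator.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"net/http"
	"strings"
)

// toDataURI wraps a raw base64 image payload in a data: URI.
// Assumption: the MIME type is detected from the decoded bytes.
func toDataURI(rawB64 string) (string, error) {
	data, err := base64.StdEncoding.DecodeString(rawB64)
	if err != nil {
		return "", fmt.Errorf("decode image: %w", err)
	}
	return "data:" + http.DetectContentType(data) + ";base64," + rawB64, nil
}

// withImageMarkers appends one <image> marker per attached image to the
// text of the last user message. Assumption: no separator between markers.
func withImageMarkers(lastUserMsg string, images int) string {
	return lastUserMsg + strings.Repeat("<image>", images)
}

func main() {
	// Base64 of the 8-byte PNG signature, enough for content sniffing.
	uri, err := toDataURI("iVBORw0KGgo=")
	fmt.Println(uri, err)
	fmt.Println(withImageMarkers("What is in these pictures?", 2))
}
```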

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

mudler added 10 commits June 11, 2026 14:28

github-advanced-security AI found potential problems Jun 11, 2026

Comment thread backend/go/dllm/capi.go

if cptr == 0 {
	return ""
}
p := *(*unsafe.Pointer)(unsafe.Pointer(&cptr)) // C-owned buffer, not Go-GC memory (see doc above)

Comment thread backend/go/dllm/capi.go

for *(*byte)(unsafe.Add(p, n)) != 0 {
	n++
}
return string(unsafe.Slice((*byte)(p), n))

mudler added 3 commits June 11, 2026 20:24

chore(dllm): bump dllm.cpp pin to P5 head

b75ab7c

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

localai-bot wants to merge 13 commits into master from feat/dllm-backend
