feat(dllm): DiffusionGemma backend via dllm.cpp with native gemma4 chat #10258
Open
localai-bot wants to merge 13 commits into
The core/http specs hardcoded 127.0.0.1:9090 in ~70 call sites, so the pre-commit coverage gate fails on any machine where an unrelated service holds 9090. Centralize the address in the suite file behind LOCALAI_TEST_HTTP_PORT (default unchanged: 9090). Assisted-by: Claude Code (Fable 5) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
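A minimal sketch of how the centralized address could be resolved, assuming a single helper in the suite file. The helper name `testHTTPAddr` and the injected `lookup` function are illustrative assumptions, not the PR's actual code; only the variable name `LOCALAI_TEST_HTTP_PORT`, the host and the default port come from the commit message.

```go
package main

import (
	"fmt"
	"net"
	"os"
)

// defaultTestPort is the port that was previously hardcoded in the specs.
const defaultTestPort = "9090"

// testHTTPAddr (hypothetical name) returns the listen address for the suite.
// lookup is injected (os.LookupEnv in real use) so the helper stays testable.
func testHTTPAddr(lookup func(string) (string, bool)) string {
	port := defaultTestPort
	if v, ok := lookup("LOCALAI_TEST_HTTP_PORT"); ok && v != "" {
		port = v // an explicit override wins over the default
	}
	return net.JoinHostPort("127.0.0.1", port)
}

func main() {
	// Prints the default address unless the environment variable is set.
	fmt.Println(testHTTPAddr(os.LookupEnv))
}
```

Call sites would then reference the helper's result instead of a literal address, so a busy port 9090 only requires exporting a different value before running the gate.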
Binds the 9-symbol flat C-ABI of dllm.cpp (DiffusionGemma engine) via purego: typed wrappers with correct string ownership (malloc'd returns freed via dllm_capi_free_string, borrowed last_error never freed), once-allocated stream-callback trampolines, and a gated Ginkgo binding smoke against the tiny fixture model. Assisted-by: Claude Code (Fable 5) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Fragment-safe state machine (content / channel header / thought / tool-call / done) classifying model output into content, reasoning_content and tool_calls deltas. Tool-call payload decoder is a non-partial port of vLLM's gemma4 parser grammar; ~25 of its test cases are ported with citations, plus a 2-split invariance property over every byte position. Recursion depth-capped against model-generated deep nesting; marker constants shared with the renderer. Assisted-by: Claude Code (Fable 5) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
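The fragment-safety idea and the 2-split invariance property can be illustrated with a much smaller parser than the one described. The sketch below is an assumption-laden toy: it knows only two states (content and thought) and uses the hypothetical markers `<t>` and `</t>` rather than the real gemma4 markers. It holds back any tail that could be the start of a marker, then checks that splitting the input at every byte position yields the same result as feeding it in one piece.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical markers for illustration only; not the gemma4 markers.
const (
	openMarker  = "<t>"
	closeMarker = "</t>"
)

// Parser classifies streamed text into content and reasoning.
type Parser struct {
	buf       string // unemitted tail that may be a partial marker
	inThought bool
	Content   strings.Builder
	Reasoning strings.Builder
}

// Feed consumes one fragment, emitting everything that is safe to emit.
func (p *Parser) Feed(fragment string) {
	p.buf += fragment
	for {
		marker := openMarker
		if p.inThought {
			marker = closeMarker
		}
		if i := strings.Index(p.buf, marker); i >= 0 {
			p.emit(p.buf[:i])
			p.buf = p.buf[i+len(marker):]
			p.inThought = !p.inThought
			continue
		}
		// No complete marker: hold back the longest tail that is a
		// proper prefix of the marker, emit the rest.
		keep := partialSuffix(p.buf, marker)
		p.emit(p.buf[:len(p.buf)-keep])
		p.buf = p.buf[len(p.buf)-keep:]
		return
	}
}

// Close flushes whatever was held back once the stream ends.
func (p *Parser) Close() { p.emit(p.buf); p.buf = "" }

func (p *Parser) emit(s string) {
	if p.inThought {
		p.Reasoning.WriteString(s)
	} else {
		p.Content.WriteString(s)
	}
}

// partialSuffix returns the length of the longest suffix of s that is a
// proper prefix of marker.
func partialSuffix(s, marker string) int {
	max := len(marker) - 1
	if len(s) < max {
		max = len(s)
	}
	for n := max; n > 0; n-- {
		if strings.HasSuffix(s, marker[:n]) {
			return n
		}
	}
	return 0
}

// run feeds the fragments in order and returns the classified output.
func run(fragments ...string) (content, reasoning string) {
	var p Parser
	for _, f := range fragments {
		p.Feed(f)
	}
	p.Close()
	return p.Content.String(), p.Reasoning.String()
}

func main() {
	in := "hi <t>plan</t> done"
	wantC, wantR := run(in)
	for i := 0; i <= len(in); i++ { // 2-split invariance over every byte position
		if c, r := run(in[:i], in[i:]); c != wantC || r != wantR {
			fmt.Printf("split at %d diverged\n", i)
			return
		}
	}
	fmt.Println("all 2-splits agree with the one-shot parse")
}
```

The property is cheap to run and catches the usual streaming bug, where a marker cut in half by a chunk boundary leaks into the content delta.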
Implements PredictRich/PredictStreamRich (legacy methods delegate), TokenizeString, and Load over the purego binding. A single worker goroutine serializes all C calls per the dllm.cpp one-generate-per-ctx contract (cancel is the documented exception); an RWMutex guards Free against in-flight requests. Under use_tokenizer_template the gemma4 renderer and streaming parser own templating and ChatDelta extraction; raw-prompt mode passes through verbatim. enable_thinking is opt-in via request metadata (the gemma4 template treats thinking as opt-in). Assisted-by: Claude Code (Fable 5) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
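The serialization rule can be sketched as a single goroutine that owns the engine and executes submitted jobs one at a time. This is a generic illustration under assumed names (`serialWorker`, `Do`, `maxConcurrency`); it does not reproduce the backend's types, the cancel exception, or the RWMutex around Free.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// job is one unit of work that must run on the worker goroutine.
type job struct {
	fn   func() string
	done chan string
}

// serialWorker (hypothetical) funnels every call through one goroutine,
// so at most one call is in flight at any time.
type serialWorker struct {
	jobs chan job
	wg   sync.WaitGroup
}

func newSerialWorker() *serialWorker {
	w := &serialWorker{jobs: make(chan job)}
	w.wg.Add(1)
	go func() {
		defer w.wg.Done()
		for j := range w.jobs {
			j.done <- j.fn() // runs strictly one job at a time
		}
	}()
	return w
}

// Do submits fn and blocks until the worker has run it.
func (w *serialWorker) Do(fn func() string) string {
	j := job{fn: fn, done: make(chan string, 1)}
	w.jobs <- j
	return <-j.done
}

// Close stops the worker after in-flight work has drained.
func (w *serialWorker) Close() {
	close(w.jobs)
	w.wg.Wait()
}

// maxConcurrency starts several concurrent callers and reports the highest
// number of jobs ever observed running at once.
func maxConcurrency(callers int) int32 {
	w := newSerialWorker()
	defer w.Close()
	var active, peak int32
	var wg sync.WaitGroup
	for i := 0; i < callers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			w.Do(func() string {
				n := atomic.AddInt32(&active, 1)
				if n > atomic.LoadInt32(&peak) {
					atomic.StoreInt32(&peak, n)
				}
				atomic.AddInt32(&active, -1)
				return ""
			})
		}()
	}
	wg.Wait()
	return atomic.LoadInt32(&peak)
}

func main() {
	fmt.Println("peak concurrent jobs:", maxConcurrency(8))
}
```

Routing calls through a channel rather than taking a mutex at each call site keeps the one-call-at-a-time rule in a single place, which is easier to audit when the callee is a C library.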
Registers the dllm backend across every surface: backend gallery index (cpu amd64+arm64 with manifest merge, cuda13, l4t-cuda13 for GB10-class hardware; no darwin per engine scope), top-level Makefile targets, bump_deps pin tracking for DLLM_VERSION, and the curated known-backends list for /backends/known (pref-only: auto-detecting on .gguf would shadow llama-cpp). Note: image builds and the nightly bump leg stay red until github.com/mudler/dllm.cpp is published (planned at merge time). Assisted-by: Claude Code (Fable 5) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Gallery model diffusiongemma-26b-a4b-it (unsloth BF16 GGUF, sha256 verified against the HF LFS oid) with use_tokenizer_template and an honest experimental/throughput description. e2e: BACKEND_BINARY-mode specs boot the real gRPC backend with the tiny fixture model (templated chat + streaming); real-26B specs are separately env-gated. Adds an opt-in BACKEND_TEST_SEED knob so random-weight fixture models run the generic specs deterministically. Assisted-by: Claude Code (Fable 5) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
User docs: dllm section in text-generation (setup, eb_* options table, n_predict canvas rounding, enable_thinking metadata, honest GB10 throughput numbers). Agents guide: .agents/dllm-backend.md covering the purego C-ABI contract, serialization rules, template provenance, test layers, and known limitations. Assisted-by: Claude Code (Fable 5) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Drop the stray executable bit from the Go sources and Makefile (the sibling Go backends commit them 644; only run.sh/package.sh are executable), and correct two documentation claims found in the final branch review: cuda13-dllm is built for amd64 only (arm64 CUDA ships as the l4t flavor), and package.sh is the parakeet-cpp-style stub layout with no ldd walk. Assisted-by: Claude Code (Fable 5) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…capability The llama.cpp C++ backend aborts generation when its gRPC context is cancelled (grpc-server.cpp polls context->IsCancelled() in the result loops), but Go backends served by pkg/grpc never observed context cancellation: a disconnected client left the generation running to completion. Add an optional Cancellable capability; the server registers context.AfterFunc on the request/stream context (after the Locking block so queued requests cannot abort the current owner) covering both rich and legacy paths. dllm implements it: measured cancel latency ~10ms vs ~10s of orphaned generation, and follow-up requests no longer queue behind cancelled ones (~220ms vs ~9s in the e2e proof). Assisted-by: Claude Code (Fable 5) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
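A rough sketch of the registration pattern, assuming Go 1.21+ for `context.AfterFunc`. The `Cancellable` method set, the `predictor` interface, `fakeBackend` and `serve` are illustrative guesses; the real capability in pkg/grpc may be shaped differently. The point shown is the ordering: the hook is registered only after the lock is held, so a request still waiting in the queue cannot cancel the generation that currently owns the backend.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// Cancellable is an assumed shape for the optional capability.
type Cancellable interface{ Cancel() }

// predictor is a stand-in for the backend's generation entry point.
type predictor interface{ Predict() string }

// fakeBackend simulates a long generation that can be aborted.
type fakeBackend struct {
	cancelled chan struct{}
	once      sync.Once
}

func newFakeBackend() *fakeBackend {
	return &fakeBackend{cancelled: make(chan struct{})}
}

func (b *fakeBackend) Cancel() { b.once.Do(func() { close(b.cancelled) }) }

func (b *fakeBackend) Predict() string {
	select {
	case <-b.cancelled:
		return "aborted"
	case <-time.After(2 * time.Second): // pretend generation time
		return "completed"
	}
}

// serve takes the lock first, then wires ctx cancellation to the backend.
func serve(ctx context.Context, mu *sync.Mutex, b predictor) string {
	mu.Lock()
	defer mu.Unlock()
	if c, ok := b.(Cancellable); ok {
		stop := context.AfterFunc(ctx, c.Cancel)
		defer stop() // unregister once the request has finished
	}
	return b.Predict()
}

// demo runs one request whose client goes away after 20ms.
func demo() string {
	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Millisecond)
	defer cancel()
	var mu sync.Mutex
	return serve(ctx, &mu, newFakeBackend())
}

func main() {
	fmt.Println(demo())
}
```

Backends that do not implement the interface simply skip the type assertion branch, which matches the stated goal of leaving existing backends unaffected.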
Q4_K_M validated on GB10: quality holds (cosine 0.9862, coherent generation, 19/48 stopper exit) but a forward step is ~5x slower than BF16 (27.5s vs 5.6s: native BF16 tensor cores vs K-quant MoE dequant). Guidance: prefer BF16 when it fits; Q4_K_M is the memory-bound option. Assisted-by: Claude Code (Fable 5) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
	if cptr == 0 {
		return ""
	}
	p := *(*unsafe.Pointer)(unsafe.Pointer(&cptr)) // C-owned buffer, not Go-GC memory (see doc above)
	n := 0 // byte count up to the NUL terminator (declaration assumed; not shown in the excerpt)
	for *(*byte)(unsafe.Add(p, n)) != 0 {
		n++
	}
	return string(unsafe.Slice((*byte)(p), n)) // string() copies, so the result does not alias the C buffer
Q4_K_M (~17 GB, GB10-validated: cosine 0.9862, coherent generation) is the friendlier default download than the 50 GB BF16; Q8_0 (~27 GB) is the higher-fidelity middle ground. Both descriptions carry the measured caveat that BF16 is ~5x faster per denoise step on BF16-native hardware, with a pointer to fetch it manually when it fits. sha256 values are the HF LFS oids. Assisted-by: Claude Code (Fable 5) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Routes PredictOptions.Images (raw base64, the core convention) through dllm.cpp's probed multimodal entry points as data: URIs; the gemma4 renderer appends one engine-side <image> marker per image after the last user message (llama.cpp attachment convention; the template's content-parts branch is unreachable through the flattened pb shape). The engine expands markers to boi + soft*n + eoi and splices the vision-tower embeddings. Older libdllm.so without the mm symbols fails with an actionable error (Dlsym probe). DLLM_VERSION pin bumped to the engine's vision-capable commit. Assisted-by: Claude Code (Fable 5) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
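The two steps described, wrapping raw base64 as a data: URI and adding one `<image>` marker per image after the last user message, might look roughly like the sketch below. Everything here is an assumption for illustration: the `message` type, both function names, the caller-supplied MIME type, and the choice to append the markers to the last user message's content. The actual renderer and engine entry points are not shown in the commit message.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"strings"
)

// message is a hypothetical flattened chat message.
type message struct{ Role, Content string }

// dataURI wraps a raw base64 payload as a data: URI after validating it.
// The MIME type is supplied by the caller; how the backend picks it is
// not stated in the source.
func dataURI(rawBase64, mime string) (string, error) {
	if _, err := base64.StdEncoding.DecodeString(rawBase64); err != nil {
		return "", fmt.Errorf("invalid base64 image: %w", err)
	}
	return "data:" + mime + ";base64," + rawBase64, nil
}

// attachImageMarkers returns a copy of msgs with one "<image>" marker per
// image appended to the last user message.
func attachImageMarkers(msgs []message, images int) []message {
	out := append([]message(nil), msgs...)
	for i := len(out) - 1; i >= 0; i-- {
		if out[i].Role == "user" {
			out[i].Content += strings.Repeat("<image>", images)
			break
		}
	}
	return out
}

func main() {
	uri, err := dataURI("aGk=", "image/png")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	msgs := attachImageMarkers([]message{
		{Role: "user", Content: "first"},
		{Role: "assistant", Content: "reply"},
		{Role: "user", Content: "describe this"},
	}, 1)
	fmt.Println(uri)
	fmt.Println(msgs[2].Content)
}
```

Validating the base64 up front means a malformed image fails in Go with a clear error instead of inside the engine.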
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Summary
- backend/go/dllm: a pure-Go (purego, no cgo) backend over dllm.cpp's 9-symbol C-ABI for DiffusionGemma block-diffusion models, implementing the rich gRPC interface (PredictRich/PredictStreamRich with ChatDelta streaming).
- … reasoning_content, <|tool_call> payloads decoded per vLLM's gemma4 grammar with ~25 of its test cases ported).
- … cancel is the documented cross-thread exception, an RWMutex guards Free against in-flight requests, and UTF-8 holdback protects proto3 strings at diffusion-block boundaries (bug found by e2e, fixed with hold-back + sanitize).
- pkg/grpc: request cancellation for Go backends. The llama.cpp C++ backend polls IsCancelled() in its result loops, but Go backends never observed context cancellation. New optional Cancellable capability + context.AfterFunc registration in Predict/PredictStream (after the locking block). dllm implements it: measured ~10ms cancel latency vs ~10s orphaned generation; follow-up requests no longer queue behind cancelled ones.
- … /backends/known (preference-only: GGUF autodetect would shadow llama-cpp), diffusiongemma-26b-a4b-it gallery model (sha256 verified against the HF LFS oid), user docs + .agents/dllm-backend.md.
- core/http test suite listen port is now configurable via LOCALAI_TEST_HTTP_PORT (~70 hardcoded 9090 call sites broke the pre-commit gate on machines where 9090 is taken).

Validation
- … -race); a 2-split invariance property covers every fragment boundary.

Caveats
- … Cancellable; existing backends are unaffected.

Test Plan
- go test ./backend/go/dllm/... ungated + gated with -race (tiny fixture)
- tests/e2e-backends dllm specs incl. cancellation functional-red proof
- pkg/grpc cancellation specs; core/http/endpoints/localai known-backends spec

🤖 Generated with Claude Code