Mistral 4 Small - an issue with MoE CPU layers #2045

@aoleg

Description

Testing https://huggingface.co/noctrex/Mistral-Small-4-119B-2603-MXFP4_MOE-GGUF


This is a MoE model, yet "MoE CPU Layers" setting in koboldcpp does not seem to be effective. When using llama.cpp, the correct value is "--n-cpu-moe 22" (tested, working with latest llama.cpp build), but regardless of the "MoE CPU Layers" setting in koboldcpp it fails to offload any MoE layers. Same result with 0, 22 or 99. This, for example, is what I get with "MoE CPU Layers" set to 999:

done_getting_tensors: tensor 'blk.0.ffn_down_exps.weight' (mxfp4) (and 35 others) cannot be used with preferred buffer type CUDA0, using CUDA_Host instead
ggml_cuda_host_malloc: failed to allocate 20608.00 MiB of pinned memory: out of memory
load_tensors: offloading output layer to GPU
load_tensors: offloading 35 repeating layers to GPU
load_tensors: offloaded 37/37 layers to GPU
load_tensors:        CUDA0 model buffer size = 43919.32 MiB
load_tensors:          CPU model buffer size = 20608.00 MiB
load_all_data: using async uploads for device CUDA0, buffer type CUDA0, backend CUDA0

And this is llama.cpp (--n-cpu-moe 22):

load_tensors: offloading output layer to GPU
load_tensors: offloading 35 repeating layers to GPU
load_tensors: offloaded 37/37 layers to GPU
load_tensors:        CUDA0 model buffer size = 27599.32 MiB
load_tensors:          CPU model buffer size = 36928.00 MiB
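A quick sanity check on the numbers (just arithmetic on the buffer sizes reported in the two logs above): with "--n-cpu-moe 22", llama.cpp moves exactly as much expert weight data out of the CUDA0 buffer as it adds to the CPU buffer, while koboldcpp leaves everything on the GPU side.

```python
# Arithmetic on the buffer sizes from the two load_tensors logs above.
kobold_gpu, kobold_cpu = 43919.32, 20608.00  # MiB, koboldcpp log
llama_gpu, llama_cpu = 27599.32, 36928.00    # MiB, llama.cpp (--n-cpu-moe 22) log

# Expert weights moved off the GPU should equal those added to the CPU buffer.
moved_off_gpu = kobold_gpu - llama_gpu
moved_onto_cpu = llama_cpu - kobold_cpu
print(round(moved_off_gpu, 2), round(moved_onto_cpu, 2))  # both ~16320 MiB
```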

koboldcpp reports the correct architecture: "The reported GGUF Arch is: mistral4 / Arch Category: 0"

Enabling "auto fit" produces the following override, after which the model loads correctly:

Attempting to use llama.cpp's automating fitting code. This will override all your layer configs, may or may not work!
Autofit Reserve Space: 1024 MB
Autofit Success: 1, Autofit Result: -c 32896 -ngl 22 -ot blk\.16\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.17\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.18\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.19\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.20\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.21\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.22\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.23\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.24\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.25\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.26\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.27\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.28\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.29\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.30\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.31\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.32\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.33\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.34\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.35\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.36\.ffn_(up|down|gate)_(ch|)exps=CPU
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 5090) (0000:01:00.0) - 30991 MiB free
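To illustrate what the autofit override is doing (a minimal sketch, using one pattern copied from the "-ot" list above): each pattern selects the MoE expert tensors of a single block for CPU placement, matching tensor names of the form seen in the load_tensors output (e.g. "blk.0.ffn_down_exps.weight"). Blocks not covered by any pattern stay on the GPU.

```python
import re

# One of the per-block override patterns from the autofit "-ot" output above.
pattern = re.compile(r"blk\.16\.ffn_(up|down|gate)_(ch|)exps")

# Expert tensor of block 16 -> matched, so it would be placed on the CPU.
print(bool(pattern.search("blk.16.ffn_up_exps.weight")))   # True
# Expert tensor of block 0 -> not matched, so it stays on the GPU.
print(bool(pattern.search("blk.0.ffn_down_exps.weight")))  # False
```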
