Description
Testing https://huggingface.co/noctrex/Mistral-Small-4-119B-2603-MXFP4_MOE-GGUF
This build: https://github.com/LostRuins/koboldcpp/actions/runs/23207558572
This is a MoE model, yet the "MoE CPU Layers" setting in koboldcpp does not seem to be effective. With llama.cpp, the correct value is "--n-cpu-moe 22" (tested and working with the latest llama.cpp build), but koboldcpp fails to offload any MoE layers regardless of the "MoE CPU Layers" value; the result is the same with 0, 22, or 99. This, for example, is what I get with "MoE CPU Layers" set to 999:
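For reference, the working upstream invocation looks roughly like this (the model filename and `-ngl` value are illustrative; only `--n-cpu-moe 22` is the tested value from above):

```shell
# Sketch of the working llama.cpp run; paths and -ngl are illustrative
./llama-server \
  -m Mistral-Small-4-119B-2603-MXFP4_MOE.gguf \
  -ngl 99 \
  --n-cpu-moe 22   # keep the expert tensors of the first 22 layers on CPU
```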
```
done_getting_tensors: tensor 'blk.0.ffn_down_exps.weight' (mxfp4) (and 35 others) cannot be used with preferred buffer type CUDA0, using CUDA_Host instead
ggml_cuda_host_malloc: failed to allocate 20608.00 MiB of pinned memory: out of memory
load_tensors: offloading output layer to GPU
load_tensors: offloading 35 repeating layers to GPU
load_tensors: offloaded 37/37 layers to GPU
load_tensors: CUDA0 model buffer size = 43919.32 MiB
load_tensors: CPU model buffer size = 20608.00 MiB
load_all_data: using async uploads for device CUDA0, buffer type CUDA0, backend CUDA0
```
And this is llama.cpp (--n-cpu-moe 22):
```
load_tensors: offloading output layer to GPU
load_tensors: offloading 35 repeating layers to GPU
load_tensors: offloaded 37/37 layers to GPU
load_tensors: CUDA0 model buffer size = 27599.32 MiB
load_tensors: CPU model buffer size = 36928.00 MiB
```
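As a sanity check on the two logs, the combined model size is identical in both runs; only the CUDA0/CPU split differs, which confirms the MoE expert tensors are being moved rather than duplicated:

```python
# Buffer sizes (MiB) taken from the two load_tensors logs above
koboldcpp_total = 43919.32 + 20608.00   # CUDA0 + CPU, MoE offload ignored
llamacpp_total = 27599.32 + 36928.00    # CUDA0 + CPU, --n-cpu-moe 22

# Both totals come out to 64527.32 MiB
print(koboldcpp_total, llamacpp_total)
```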
koboldcpp reports the correct architecture:

```
The reported GGUF Arch is: mistral4
Arch Category: 0
```
Enabling "auto fit" produces the following command, after which the model loads correctly:
```
Attempting to use llama.cpp's automating fitting code. This will override all your layer configs, may or may not work!
Autofit Reserve Space: 1024 MB
Autofit Success: 1, Autofit Result: -c 32896 -ngl 22 -ot blk\.16\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.17\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.18\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.19\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.20\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.21\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.22\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.23\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.24\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.25\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.26\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.27\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.28\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.29\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.30\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.31\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.32\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.33\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.34\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.35\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.36\.ffn_(up|down|gate)_(ch|)exps=CPU
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 5090) (0000:01:00.0) - 30991 MiB free
```
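Each `-ot` clause in that autofit result is a regex that pins the MoE expert tensors of one block to CPU. A quick check of one clause against representative tensor names (the tensor names here are illustrative, patterned on the `blk.0.ffn_down_exps.weight` line from the log):

```python
import re

# One clause from the autofit -ot result, for block 16; the (ch|)
# alternation also matches "chexps"-style tensor names on some arches
clause = re.compile(r"blk\.16\.ffn_(up|down|gate)_(ch|)exps")

# Matches an expert FFN tensor of block 16, but not an attention tensor
print(bool(clause.search("blk.16.ffn_down_exps.weight")))  # True
print(bool(clause.search("blk.16.attn_q.weight")))         # False
```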