Description
Testing https://huggingface.co/noctrex/Mistral-Small-4-119B-2603-MXFP4_MOE-GGUF
This build: https://github.com/LostRuins/koboldcpp/actions/runs/23207558572
This is a MoE model, yet the "MoE CPU Layers" setting in koboldcpp does not seem to be effective. With llama.cpp, the correct value is "--n-cpu-moe 22" (tested and working with the latest llama.cpp build), but koboldcpp fails to offload any MoE layers regardless of the "MoE CPU Layers" value; the result is the same with 0, 22, or 99. This, for example, is what I get with "MoE CPU Layers" set to 999:
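For reference, the working upstream invocation looks roughly like this (the model filename and `-ngl` value are illustrative; only `--n-cpu-moe 22` is the tested value from above):

```shell
# Sketch of the working llama.cpp run; paths and -ngl are illustrative
./llama-server \
  -m Mistral-Small-4-119B-2603-MXFP4_MOE.gguf \
  -ngl 99 \
  --n-cpu-moe 22   # keep the expert tensors of the first 22 layers on CPU
```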
```
done_getting_tensors: tensor 'blk.0.ffn_down_exps.weight' (mxfp4) (and 35 others) cannot be used with preferred buffer type CUDA0, using CUDA_Host instead
ggml_cuda_host_malloc: failed to allocate 20608.00 MiB of pinned memory: out of memory
load_tensors: offloading output layer to GPU
load_tensors: offloading 35 repeating layers to GPU
load_tensors: offloaded 37/37 layers to GPU
load_tensors: CUDA0 model buffer size = 43919.32 MiB
load_tensors: CPU model buffer size = 20608.00 MiB
load_all_data: using async uploads for device CUDA0, buffer type CUDA0, backend CUDA0
```
And this is llama.cpp (--n-cpu-moe 22):
```
load_tensors: offloading output layer to GPU
load_tensors: offloading 35 repeating layers to GPU
load_tensors: offloaded 37/37 layers to GPU
load_tensors: CUDA0 model buffer size = 27599.32 MiB
load_tensors: CPU model buffer size = 36928.00 MiB
```
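As a sanity check on the two logs, the combined model size is identical in both runs; only the CUDA0/CPU split differs, which confirms the MoE expert tensors are being moved rather than duplicated:

```python
# Buffer sizes (MiB) taken from the two load_tensors logs above
koboldcpp_total = 43919.32 + 20608.00   # CUDA0 + CPU, MoE offload ignored
llamacpp_total = 27599.32 + 36928.00    # CUDA0 + CPU, --n-cpu-moe 22

# Both totals come out to 64527.32 MiB
print(koboldcpp_total, llamacpp_total)
```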
koboldcpp reports the correct architecture:

```
The reported GGUF Arch is: mistral4
Arch Category: 0
```
Enabling "auto fit" produces the following command, after which the model loads correctly:
```
Attempting to use llama.cpp's automating fitting code. This will override all your layer configs, may or may not work!
Autofit Reserve Space: 1024 MB
Autofit Success: 1, Autofit Result: -c 32896 -ngl 22 -ot blk\.16\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.17\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.18\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.19\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.20\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.21\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.22\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.23\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.24\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.25\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.26\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.27\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.28\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.29\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.30\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.31\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.32\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.33\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.34\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.35\.ffn_(up|down|gate)_(ch|)exps=CPU,blk\.36\.ffn_(up|down|gate)_(ch|)exps=CPU
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 5090) (0000:01:00.0) - 30991 MiB free
```
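Each `-ot` clause in that autofit result is a regex that pins the MoE expert tensors of one block to CPU. A quick check of one clause against representative tensor names (the tensor names here are illustrative, patterned on the `blk.0.ffn_down_exps.weight` line from the log):

```python
import re

# One clause from the autofit -ot result, for block 16; the (ch|)
# alternation also matches "chexps"-style tensor names on some arches
clause = re.compile(r"blk\.16\.ffn_(up|down|gate)_(ch|)exps")

# Matches an expert FFN tensor of block 16, but not an attention tensor
print(bool(clause.search("blk.16.ffn_down_exps.weight")))  # True
print(bool(clause.search("blk.16.attn_q.weight")))         # False
```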