🐞 Describe the Bug
Converting a distributed checkpoint (from a training run with `tensor_parallel > 1`) to HF fails with missing-parameter errors. This only happens when calling `fast-llm convert` explicitly on the checkpoint; there is no error when checkpoints are exported during training.
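For context, the training-time export that works is driven by the trainer config rather than the CLI. A rough sketch of that block is below; the field names are my recollection of the Fast-LLM quick-start and are assumptions, not taken from this run:

```yaml
# Hypothetical training-time export block (assumption: field names follow the
# Fast-LLM quick-start; adjust if the actual schema differs).
training:
  export:
    format: mistral   # same HF format targeted by the failing convert command
    interval: 20      # export every N training iterations
```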
🔄 Steps to Reproduce
Steps to reproduce the behavior:
- Train a model with `tensor_parallel > 1`. Example model config:
```yaml
model:
  base_model:
    transformer:
      normalization:
        type: rms_norm
        epsilon: 1.0e-05
        zero_centered: false
      rotary:
        type: default
        theta: 1000000.0
        triton: true
        scale_factor: 8.0
        low_frequency_factor: 1.0
        high_frequency_factor: 4.0
        original_context_length: 8192
        attention_factor: null
        beta_fast: 32.0
        beta_slow: 1.0
      peft:
        type: none
      num_layers: 30
      hidden_size: 576
      num_attention_heads: 16
      head_groups: 4
      add_linear_biases: false
      ffn_hidden_size: 1536
      kv_channels: 128
      gated: true
      num_experts: 1
      num_shared_experts: 0
      num_experts_per_token: 1
      expert_routing_type: aux_loss
      activation_type: silu
      use_flash_attention: true
    max_position_embeddings: 2048
    vocab_size: 131072
    use_position_embeddings: false
    tie_word_embeddings: false
    prediction_heads: 1
    cross_entropy_impl: fused
  multi_stage:
    zero_stage: 3
  distributed:
    tensor_parallel: 2
    sequence_tensor_parallel: true
    world_size: 2
    rank: 0
    local_world_size: 2
    timeout: 3600.0
    training_dtype: bfloat16
```
- Convert an intermediate checkpoint to HF using `fast-llm convert gpt`:

```bash
fast-llm convert gpt input.format=distributed output.format=mistral input.path=/mnt/core_llm_large/slam/experiments/upcycle/smol_mistral_debug/checkpoint/40/ output.path=/mnt/core_llm_large/slam/experiments/upcycle/smol_mistral_debug/mistral/40
```

The conversion fails with the following output:

```
/mnt/core_llm/soham/Fast-LLM/fast_llm/engine/checkpoint/config.py:215: UserWarning: The default behaviour for model configuration loading has changed (May 2025).All model parameters are now loaded, not just the architecture parameters.Please make sure this doesn't lead to unexpected breaking changes.Suppress this warning by setting `load_config = model` explicitly.
warnings.warn(
2025-06-10 17:57:25,891 Command run:
/mnt/core_llm/soham/.envs/fast-llm/bin/fast-llm convert gpt input.format=distributed output.format=mistral input.path=/mnt/core_llm_large/slam/experiments/upcycle/smol_mistral_debug/checkpoint/40/ output.path=/mnt/core_llm_large/slam/experiments/upcycle/smol_mistral_debug/mistral/40
2025-06-10 17:57:25,892
----------- fast_llm.tools.convert.ConvertConfig -----------
input:
format: distributed
path: /mnt/core_llm_large/slam/experiments/upcycle/smol_mistral_debug/checkpoint/40
load_config: model
output:
format: mistral
path: /mnt/core_llm_large/slam/experiments/upcycle/smol_mistral_debug/mistral/40
model: gpt
--------------------------- end ----------------------------
2025-06-10 17:57:25,893
--------------------- fast_llm.tools.convert.ConvertConfig ---------------------
input:
format: distributed
path: /mnt/core_llm_large/slam/experiments/upcycle/smol_mistral_debug/checkpoint/40
load_config: model
output:
format: mistral
path: /mnt/core_llm_large/slam/experiments/upcycle/smol_mistral_debug/mistral/40
model: gpt
------------------------------------- end --------------------------------------
2025-06-10 17:57:31,965 Loading <class 'fast_llm.engine.checkpoint.config.DistributedCheckpointFormat'> checkpoint from /mnt/core_llm_large/slam/experiments/upcycle/smol_mistral_debug/checkpoint/40...
2025-06-10 17:57:32,193 Splitting the model into 32 stages...
2025-06-10 17:57:32,200 Total parameters: 319,129,920
2025-06-10 17:57:32,201 Weight buffer placement:
{0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 10: 10, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15, 16: 16, 17: 17, 18: 18, 19: 19, 20: 20, 21: 21, 22: 22, 23: 23, 24: 24, 25: 25, 26: 26, 27: 27, 28: 28, 29: 29, 30: 30, 31: 31}
2025-06-10 17:57:32,201 Grad buffer placement:
{0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 10: 10, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15, 16: 16, 17: 17, 18: 18, 19: 19, 20: 20, 21: 21, 22: 22, 23: 23, 24: 24, 25: 25, 26: 26, 27: 27, 28: 28, 29: 29, 30: 30, 31: 31}
2025-06-10 17:57:32,253 Setting random seeds...
2025-06-10 17:57:32,253 >>> Allocating 1 shards (1,217.38 MiB)
2025-06-10 17:57:32,508 Total allocated: 1,217.38 MiB
2025-06-10 17:57:32,518 Checkpoint format doesn't match, using safe load
2025-06-10 17:57:33,116 Loading from /mnt/core_llm_large/slam/experiments/upcycle/smol_mistral_debug/checkpoint/40/rank_0.safetensors
2025-06-10 17:57:33,768 Loading from /mnt/core_llm_large/slam/experiments/upcycle/smol_mistral_debug/checkpoint/40/rank_1.safetensors
2025-06-10 17:57:34,286 Loaded a total of 212,996,736, state entries, expected 319,129,920
2025-06-10 17:57:34,286 106,168,320 state entries failed to load or corrupted (local=106,168,320).
2025-06-10 17:57:34,286 589,824 values missing out of 589,824 for parameter layers.1.self_attn.key_value.weight in stage 1, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,286 1,179,648 values missing out of 1,179,648 for parameter layers.1.self_attn.dense.weight in stage 1, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,286 1,769,472 values missing out of 1,769,472 for parameter layers.1.mlp.layer_1.weight in stage 1, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,286 3,538,944 values missing out of 0 for padding in stage 1, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,286 589,824 values missing out of 589,824 for parameter layers.2.self_attn.key_value.weight in stage 2, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,286 1,179,648 values missing out of 1,179,648 for parameter layers.2.self_attn.dense.weight in stage 2, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,286 1,769,472 values missing out of 1,769,472 for parameter layers.2.mlp.layer_1.weight in stage 2, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,286 3,538,944 values missing out of 0 for padding in stage 2, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,286 589,824 values missing out of 589,824 for parameter layers.3.self_attn.key_value.weight in stage 3, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,286 1,179,648 values missing out of 1,179,648 for parameter layers.3.self_attn.dense.weight in stage 3, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,286 1,769,472 values missing out of 1,769,472 for parameter layers.3.mlp.layer_1.weight in stage 3, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,286 3,538,944 values missing out of 0 for padding in stage 3, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,286 589,824 values missing out of 589,824 for parameter layers.4.self_attn.key_value.weight in stage 4, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,286 1,179,648 values missing out of 1,179,648 for parameter layers.4.self_attn.dense.weight in stage 4, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,286 1,769,472 values missing out of 1,769,472 for parameter layers.4.mlp.layer_1.weight in stage 4, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,286 3,538,944 values missing out of 0 for padding in stage 4, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,286 589,824 values missing out of 589,824 for parameter layers.5.self_attn.key_value.weight in stage 5, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,286 1,179,648 values missing out of 1,179,648 for parameter layers.5.self_attn.dense.weight in stage 5, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,286 1,769,472 values missing out of 1,769,472 for parameter layers.5.mlp.layer_1.weight in stage 5, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,286 3,538,944 values missing out of 0 for padding in stage 5, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,286 589,824 values missing out of 589,824 for parameter layers.6.self_attn.key_value.weight in stage 6, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,286 1,179,648 values missing out of 1,179,648 for parameter layers.6.self_attn.dense.weight in stage 6, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,286 1,769,472 values missing out of 1,769,472 for parameter layers.6.mlp.layer_1.weight in stage 6, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,286 3,538,944 values missing out of 0 for padding in stage 6, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,286 589,824 values missing out of 589,824 for parameter layers.7.self_attn.key_value.weight in stage 7, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,286 1,179,648 values missing out of 1,179,648 for parameter layers.7.self_attn.dense.weight in stage 7, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,286 1,769,472 values missing out of 1,769,472 for parameter layers.7.mlp.layer_1.weight in stage 7, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,286 3,538,944 values missing out of 0 for padding in stage 7, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,286 589,824 values missing out of 589,824 for parameter layers.8.self_attn.key_value.weight in stage 8, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,286 1,179,648 values missing out of 1,179,648 for parameter layers.8.self_attn.dense.weight in stage 8, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,286 1,769,472 values missing out of 1,769,472 for parameter layers.8.mlp.layer_1.weight in stage 8, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,286 3,538,944 values missing out of 0 for padding in stage 8, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,286 589,824 values missing out of 589,824 for parameter layers.9.self_attn.key_value.weight in stage 9, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,286 1,179,648 values missing out of 1,179,648 for parameter layers.9.self_attn.dense.weight in stage 9, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,286 1,769,472 values missing out of 1,769,472 for parameter layers.9.mlp.layer_1.weight in stage 9, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,286 3,538,944 values missing out of 0 for padding in stage 9, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,286 589,824 values missing out of 589,824 for parameter layers.10.self_attn.key_value.weight in stage 10, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,286 1,179,648 values missing out of 1,179,648 for parameter layers.10.self_attn.dense.weight in stage 10, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,286 1,769,472 values missing out of 1,769,472 for parameter layers.10.mlp.layer_1.weight in stage 10, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,287 3,538,944 values missing out of 0 for padding in stage 10, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,287 589,824 values missing out of 589,824 for parameter layers.11.self_attn.key_value.weight in stage 11, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,287 1,179,648 values missing out of 1,179,648 for parameter layers.11.self_attn.dense.weight in stage 11, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,287 1,769,472 values missing out of 1,769,472 for parameter layers.11.mlp.layer_1.weight in stage 11, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,287 3,538,944 values missing out of 0 for padding in stage 11, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,287 589,824 values missing out of 589,824 for parameter layers.12.self_attn.key_value.weight in stage 12, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,287 1,179,648 values missing out of 1,179,648 for parameter layers.12.self_attn.dense.weight in stage 12, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,287 1,769,472 values missing out of 1,769,472 for parameter layers.12.mlp.layer_1.weight in stage 12, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,287 3,538,944 values missing out of 0 for padding in stage 12, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,287 589,824 values missing out of 589,824 for parameter layers.13.self_attn.key_value.weight in stage 13, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,287 1,179,648 values missing out of 1,179,648 for parameter layers.13.self_attn.dense.weight in stage 13, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,287 1,769,472 values missing out of 1,769,472 for parameter layers.13.mlp.layer_1.weight in stage 13, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,287 3,538,944 values missing out of 0 for padding in stage 13, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,287 589,824 values missing out of 589,824 for parameter layers.14.self_attn.key_value.weight in stage 14, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,287 1,179,648 values missing out of 1,179,648 for parameter layers.14.self_attn.dense.weight in stage 14, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,287 1,769,472 values missing out of 1,769,472 for parameter layers.14.mlp.layer_1.weight in stage 14, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,287 3,538,944 values missing out of 0 for padding in stage 14, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,287 589,824 values missing out of 589,824 for parameter layers.15.self_attn.key_value.weight in stage 15, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,287 1,179,648 values missing out of 1,179,648 for parameter layers.15.self_attn.dense.weight in stage 15, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,287 1,769,472 values missing out of 1,769,472 for parameter layers.15.mlp.layer_1.weight in stage 15, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,287 3,538,944 values missing out of 0 for padding in stage 15, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,287 589,824 values missing out of 589,824 for parameter layers.16.self_attn.key_value.weight in stage 16, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,287 1,179,648 values missing out of 1,179,648 for parameter layers.16.self_attn.dense.weight in stage 16, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,287 1,769,472 values missing out of 1,769,472 for parameter layers.16.mlp.layer_1.weight in stage 16, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,287 3,538,944 values missing out of 0 for padding in stage 16, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,287 589,824 values missing out of 589,824 for parameter layers.17.self_attn.key_value.weight in stage 17, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,287 1,179,648 values missing out of 1,179,648 for parameter layers.17.self_attn.dense.weight in stage 17, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,287 1,769,472 values missing out of 1,769,472 for parameter layers.17.mlp.layer_1.weight in stage 17, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,287 3,538,944 values missing out of 0 for padding in stage 17, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,287 589,824 values missing out of 589,824 for parameter layers.18.self_attn.key_value.weight in stage 18, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,287 1,179,648 values missing out of 1,179,648 for parameter layers.18.self_attn.dense.weight in stage 18, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,287 1,769,472 values missing out of 1,769,472 for parameter layers.18.mlp.layer_1.weight in stage 18, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,287 3,538,944 values missing out of 0 for padding in stage 18, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,287 589,824 values missing out of 589,824 for parameter layers.19.self_attn.key_value.weight in stage 19, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,287 1,179,648 values missing out of 1,179,648 for parameter layers.19.self_attn.dense.weight in stage 19, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,287 1,769,472 values missing out of 1,769,472 for parameter layers.19.mlp.layer_1.weight in stage 19, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,287 3,538,944 values missing out of 0 for padding in stage 19, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,287 589,824 values missing out of 589,824 for parameter layers.20.self_attn.key_value.weight in stage 20, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,287 1,179,648 values missing out of 1,179,648 for parameter layers.20.self_attn.dense.weight in stage 20, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,287 1,769,472 values missing out of 1,769,472 for parameter layers.20.mlp.layer_1.weight in stage 20, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,287 3,538,944 values missing out of 0 for padding in stage 20, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,287 589,824 values missing out of 589,824 for parameter layers.21.self_attn.key_value.weight in stage 21, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,287 1,179,648 values missing out of 1,179,648 for parameter layers.21.self_attn.dense.weight in stage 21, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,287 1,769,472 values missing out of 1,769,472 for parameter layers.21.mlp.layer_1.weight in stage 21, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,287 3,538,944 values missing out of 0 for padding in stage 21, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,287 589,824 values missing out of 589,824 for parameter layers.22.self_attn.key_value.weight in stage 22, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,287 1,179,648 values missing out of 1,179,648 for parameter layers.22.self_attn.dense.weight in stage 22, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,287 1,769,472 values missing out of 1,769,472 for parameter layers.22.mlp.layer_1.weight in stage 22, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,287 3,538,944 values missing out of 0 for padding in stage 22, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,287 589,824 values missing out of 589,824 for parameter layers.23.self_attn.key_value.weight in stage 23, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,287 1,179,648 values missing out of 1,179,648 for parameter layers.23.self_attn.dense.weight in stage 23, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,287 1,769,472 values missing out of 1,769,472 for parameter layers.23.mlp.layer_1.weight in stage 23, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,287 3,538,944 values missing out of 0 for padding in stage 23, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,287 589,824 values missing out of 589,824 for parameter layers.24.self_attn.key_value.weight in stage 24, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,287 1,179,648 values missing out of 1,179,648 for parameter layers.24.self_attn.dense.weight in stage 24, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,287 1,769,472 values missing out of 1,769,472 for parameter layers.24.mlp.layer_1.weight in stage 24, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,287 3,538,944 values missing out of 0 for padding in stage 24, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,287 589,824 values missing out of 589,824 for parameter layers.25.self_attn.key_value.weight in stage 25, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,288 1,179,648 values missing out of 1,179,648 for parameter layers.25.self_attn.dense.weight in stage 25, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,288 1,769,472 values missing out of 1,769,472 for parameter layers.25.mlp.layer_1.weight in stage 25, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,288 3,538,944 values missing out of 0 for padding in stage 25, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,288 589,824 values missing out of 589,824 for parameter layers.26.self_attn.key_value.weight in stage 26, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,288 1,179,648 values missing out of 1,179,648 for parameter layers.26.self_attn.dense.weight in stage 26, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,288 1,769,472 values missing out of 1,769,472 for parameter layers.26.mlp.layer_1.weight in stage 26, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,288 3,538,944 values missing out of 0 for padding in stage 26, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,288 589,824 values missing out of 589,824 for parameter layers.27.self_attn.key_value.weight in stage 27, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,288 1,179,648 values missing out of 1,179,648 for parameter layers.27.self_attn.dense.weight in stage 27, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,288 1,769,472 values missing out of 1,769,472 for parameter layers.27.mlp.layer_1.weight in stage 27, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,288 3,538,944 values missing out of 0 for padding in stage 27, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,288 589,824 values missing out of 589,824 for parameter layers.28.self_attn.key_value.weight in stage 28, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,288 1,179,648 values missing out of 1,179,648 for parameter layers.28.self_attn.dense.weight in stage 28, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,288 1,769,472 values missing out of 1,769,472 for parameter layers.28.mlp.layer_1.weight in stage 28, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,288 3,538,944 values missing out of 0 for padding in stage 28, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,288 589,824 values missing out of 589,824 for parameter layers.29.self_attn.key_value.weight in stage 29, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,288 1,179,648 values missing out of 1,179,648 for parameter layers.29.self_attn.dense.weight in stage 29, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,288 1,769,472 values missing out of 1,769,472 for parameter layers.29.mlp.layer_1.weight in stage 29, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,288 3,538,944 values missing out of 0 for padding in stage 29, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,288 589,824 values missing out of 589,824 for parameter layers.30.self_attn.key_value.weight in stage 30, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,288 1,179,648 values missing out of 1,179,648 for parameter layers.30.self_attn.dense.weight in stage 30, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,288 1,769,472 values missing out of 1,769,472 for parameter layers.30.mlp.layer_1.weight in stage 30, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,288 3,538,944 values missing out of 0 for padding in stage 30, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,288 Incorrect global breakdown of missing state entries (expected 106,168,320, got 212,336,640)
2025-06-10 17:57:34,292 Traceback (most recent call last):
File "/mnt/core_llm/soham/Fast-LLM/fast_llm/tools/cli.py", line 29, in fast_llm
Runnable.parse_and_run(unparsed)
File "/mnt/core_llm/soham/Fast-LLM/fast_llm/engine/config_utils/runnable.py", line 36, in parse_and_run
runnable()
File "/mnt/core_llm/soham/Fast-LLM/fast_llm/tools/convert.py", line 86, in run
self._convert_model_partial(model_class, self.output)
File "/mnt/core_llm/soham/Fast-LLM/fast_llm/tools/convert.py", line 59, in _convert_model_partial
model = model_class.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/core_llm/soham/Fast-LLM/fast_llm/engine/multi_stage/fast_llm_model.py", line 81, in from_pretrained
model.load_checkpoint(pretrained_config)
File "/mnt/core_llm/soham/Fast-LLM/fast_llm/engine/multi_stage/fast_llm_model.py", line 40, in load_checkpoint
metadata = converter.load(config)
^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/core_llm/soham/Fast-LLM/fast_llm/engine/checkpoint/distributed.py", line 89, in load
with SafeLoad(self._model, shard_names=shard_names, timeout=config.timeout) as context:
File "/mnt/core_llm/soham/Fast-LLM/fast_llm/engine/checkpoint/safe_load.py", line 49, in __exit__
self._validate()
File "/mnt/core_llm/soham/Fast-LLM/fast_llm/engine/checkpoint/safe_load.py", line 69, in _validate
raise RuntimeError("Model loading validation failed. See logs for details.")
RuntimeError: Model loading validation failed. See logs for details.
```
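Note that the per-parameter "values missing" counts above line up exactly with the full (unsharded) sizes of the tensor-parallel-sharded weights implied by the config, and their per-stage sum matches the per-stage "padding" count. A quick arithmetic cross-check (the shape formulas are my assumptions based on standard fused key/value and gated-MLP layouts, not taken from the Fast-LLM source):

```python
# Cross-check the "values missing" counts in the log against the model config.
# Shape formulas are assumptions (fused key/value projection, gated MLP).
hidden_size = 576
num_attention_heads = 16
head_groups = 4
kv_channels = 128
ffn_hidden_size = 1536
num_layers = 30

key_value = 2 * head_groups * kv_channels * hidden_size   # 589,824
dense = num_attention_heads * kv_channels * hidden_size   # 1,179,648
mlp_layer_1 = 2 * ffn_hidden_size * hidden_size           # 1,769,472 (gate + up projections)

per_stage = key_value + dense + mlp_layer_1               # 3,538,944, the per-stage "padding" count
print(per_stage)                   # 3538944
print(per_stage * num_layers)      # 106168320 -> "state entries failed to load"
print(per_stage * num_layers * 2)  # 212336640 -> "Incorrect global breakdown ... got 212,336,640"
```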
🎯 Expected Behavior
Conversion completes without errors.
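For reference, a minimal way to confirm a successful conversion (not part of the original report) would be to load the exported Mistral-format checkpoint with Hugging Face transformers:

```python
# Sanity check after conversion: load the exported checkpoint from output.path.
import torch
from transformers import AutoModelForCausalLM

path = "/mnt/core_llm_large/slam/experiments/upcycle/smol_mistral_debug/mistral/40"
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16)
# Expect roughly the 319,129,920 total parameters reported by Fast-LLM above.
print(sum(p.numel() for p in model.parameters()))
```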