
[bug] Conversion of distributed checkpoints to huggingface #293

Open

@sohamparikh
🐞 Describe the Bug

Converting a distributed checkpoint (from a training run with tensor_parallel > 1) to the Hugging Face format fails with missing-parameter errors. This only happens when calling fast-llm convert explicitly on the checkpoint; exporting checkpoints during training produces no errors.

🔄 Steps to Reproduce

Steps to reproduce the behavior:

  1. Train a model with tensor_parallel>1. Example model config:
model:
  base_model:
    transformer:
      normalization:
        type: rms_norm
        epsilon: 1.0e-05
        zero_centered: false
      rotary:
        type: default
        theta: 1000000.0
        triton: true
        scale_factor: 8.0
        low_frequency_factor: 1.0
        high_frequency_factor: 4.0
        original_context_length: 8192
        attention_factor: null
        beta_fast: 32.0
        beta_slow: 1.0
      peft:
        type: none
      num_layers: 30
      hidden_size: 576
      num_attention_heads: 16
      head_groups: 4
      add_linear_biases: false
      ffn_hidden_size: 1536
      kv_channels: 128
      gated: true
      num_experts: 1
      num_shared_experts: 0
      num_experts_per_token: 1
      expert_routing_type: aux_loss
      activation_type: silu
      use_flash_attention: true
    max_position_embeddings: 2048
    vocab_size: 131072
    use_position_embeddings: false
    tie_word_embeddings: false
    prediction_heads: 1
    cross_entropy_impl: fused
  multi_stage:
    zero_stage: 3
  distributed:
    tensor_parallel: 2
    sequence_tensor_parallel: true
    world_size: 2
    rank: 0
    local_world_size: 2
    timeout: 3600.0
    training_dtype: bfloat16
  2. Convert an intermediate checkpoint to HF using fast-llm convert gpt:
fast-llm convert gpt input.format=distributed output.format=mistral input.path=/mnt/core_llm_large/slam/experiments/upcycle/smol_mistral_debug/checkpoint/40/ output.path=/mnt/core_llm_large/slam/experiments/upcycle/smol_mistral_debug/mistral/40
/mnt/core_llm/soham/Fast-LLM/fast_llm/engine/checkpoint/config.py:215: UserWarning: The default behaviour for model configuration loading has changed (May 2025).All model parameters are now loaded, not just the architecture parameters.Please make sure this doesn't lead to unexpected breaking changes.Suppress this warning by setting `load_config = model` explicitly.
  warnings.warn(
2025-06-10 17:57:25,891 Command run:
/mnt/core_llm/soham/.envs/fast-llm/bin/fast-llm convert gpt input.format=distributed output.format=mistral input.path=/mnt/core_llm_large/slam/experiments/upcycle/smol_mistral_debug/checkpoint/40/ output.path=/mnt/core_llm_large/slam/experiments/upcycle/smol_mistral_debug/mistral/40
2025-06-10 17:57:25,892 
----------- fast_llm.tools.convert.ConvertConfig -----------
input:
  format: distributed
  path: /mnt/core_llm_large/slam/experiments/upcycle/smol_mistral_debug/checkpoint/40
  load_config: model
output:
  format: mistral
  path: /mnt/core_llm_large/slam/experiments/upcycle/smol_mistral_debug/mistral/40
model: gpt
--------------------------- end ----------------------------
2025-06-10 17:57:25,893 
--------------------- fast_llm.tools.convert.ConvertConfig ---------------------
input:
  format: distributed
  path: /mnt/core_llm_large/slam/experiments/upcycle/smol_mistral_debug/checkpoint/40
  load_config: model
output:
  format: mistral
  path: /mnt/core_llm_large/slam/experiments/upcycle/smol_mistral_debug/mistral/40
model: gpt
------------------------------------- end --------------------------------------
2025-06-10 17:57:31,965 Loading <class 'fast_llm.engine.checkpoint.config.DistributedCheckpointFormat'> checkpoint from /mnt/core_llm_large/slam/experiments/upcycle/smol_mistral_debug/checkpoint/40...
2025-06-10 17:57:32,193   Splitting the model into 32 stages...
2025-06-10 17:57:32,200   Total parameters: 319,129,920 
2025-06-10 17:57:32,201 Weight buffer placement:
{0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 10: 10, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15, 16: 16, 17: 17, 18: 18, 19: 19, 20: 20, 21: 21, 22: 22, 23: 23, 24: 24, 25: 25, 26: 26, 27: 27, 28: 28, 29: 29, 30: 30, 31: 31}
2025-06-10 17:57:32,201 Grad buffer placement:
{0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 10: 10, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15, 16: 16, 17: 17, 18: 18, 19: 19, 20: 20, 21: 21, 22: 22, 23: 23, 24: 24, 25: 25, 26: 26, 27: 27, 28: 28, 29: 29, 30: 30, 31: 31}
2025-06-10 17:57:32,253 Setting random seeds...
2025-06-10 17:57:32,253 >>> Allocating 1 shards (1,217.38 MiB)
2025-06-10 17:57:32,508 Total allocated: 1,217.38 MiB
2025-06-10 17:57:32,518 Checkpoint format doesn't match, using safe load
2025-06-10 17:57:33,116 Loading from /mnt/core_llm_large/slam/experiments/upcycle/smol_mistral_debug/checkpoint/40/rank_0.safetensors
2025-06-10 17:57:33,768 Loading from /mnt/core_llm_large/slam/experiments/upcycle/smol_mistral_debug/checkpoint/40/rank_1.safetensors
2025-06-10 17:57:34,286 Loaded a total of 212,996,736, state entries, expected 319,129,920
2025-06-10 17:57:34,286 106,168,320 state entries failed to load or corrupted (local=106,168,320).
2025-06-10 17:57:34,286 589,824 values missing out of 589,824 for parameter layers.1.self_attn.key_value.weight in stage 1, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,286 1,179,648 values missing out of 1,179,648 for parameter layers.1.self_attn.dense.weight in stage 1, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,286 1,769,472 values missing out of 1,769,472 for parameter layers.1.mlp.layer_1.weight in stage 1, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,286 3,538,944 values missing out of 0 for padding in stage 1, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,286 589,824 values missing out of 589,824 for parameter layers.2.self_attn.key_value.weight in stage 2, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,286 1,179,648 values missing out of 1,179,648 for parameter layers.2.self_attn.dense.weight in stage 2, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,286 1,769,472 values missing out of 1,769,472 for parameter layers.2.mlp.layer_1.weight in stage 2, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,286 3,538,944 values missing out of 0 for padding in stage 2, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,286 589,824 values missing out of 589,824 for parameter layers.3.self_attn.key_value.weight in stage 3, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,286 1,179,648 values missing out of 1,179,648 for parameter layers.3.self_attn.dense.weight in stage 3, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,286 1,769,472 values missing out of 1,769,472 for parameter layers.3.mlp.layer_1.weight in stage 3, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,286 3,538,944 values missing out of 0 for padding in stage 3, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,286 589,824 values missing out of 589,824 for parameter layers.4.self_attn.key_value.weight in stage 4, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,286 1,179,648 values missing out of 1,179,648 for parameter layers.4.self_attn.dense.weight in stage 4, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,286 1,769,472 values missing out of 1,769,472 for parameter layers.4.mlp.layer_1.weight in stage 4, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,286 3,538,944 values missing out of 0 for padding in stage 4, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,286 589,824 values missing out of 589,824 for parameter layers.5.self_attn.key_value.weight in stage 5, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,286 1,179,648 values missing out of 1,179,648 for parameter layers.5.self_attn.dense.weight in stage 5, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,286 1,769,472 values missing out of 1,769,472 for parameter layers.5.mlp.layer_1.weight in stage 5, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,286 3,538,944 values missing out of 0 for padding in stage 5, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,286 589,824 values missing out of 589,824 for parameter layers.6.self_attn.key_value.weight in stage 6, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,286 1,179,648 values missing out of 1,179,648 for parameter layers.6.self_attn.dense.weight in stage 6, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,286 1,769,472 values missing out of 1,769,472 for parameter layers.6.mlp.layer_1.weight in stage 6, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,286 3,538,944 values missing out of 0 for padding in stage 6, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,286 589,824 values missing out of 589,824 for parameter layers.7.self_attn.key_value.weight in stage 7, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,286 1,179,648 values missing out of 1,179,648 for parameter layers.7.self_attn.dense.weight in stage 7, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,286 1,769,472 values missing out of 1,769,472 for parameter layers.7.mlp.layer_1.weight in stage 7, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,286 3,538,944 values missing out of 0 for padding in stage 7, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,286 589,824 values missing out of 589,824 for parameter layers.8.self_attn.key_value.weight in stage 8, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,286 1,179,648 values missing out of 1,179,648 for parameter layers.8.self_attn.dense.weight in stage 8, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,286 1,769,472 values missing out of 1,769,472 for parameter layers.8.mlp.layer_1.weight in stage 8, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,286 3,538,944 values missing out of 0 for padding in stage 8, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,286 589,824 values missing out of 589,824 for parameter layers.9.self_attn.key_value.weight in stage 9, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,286 1,179,648 values missing out of 1,179,648 for parameter layers.9.self_attn.dense.weight in stage 9, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,286 1,769,472 values missing out of 1,769,472 for parameter layers.9.mlp.layer_1.weight in stage 9, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,286 3,538,944 values missing out of 0 for padding in stage 9, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,286 589,824 values missing out of 589,824 for parameter layers.10.self_attn.key_value.weight in stage 10, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,286 1,179,648 values missing out of 1,179,648 for parameter layers.10.self_attn.dense.weight in stage 10, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,286 1,769,472 values missing out of 1,769,472 for parameter layers.10.mlp.layer_1.weight in stage 10, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,287 3,538,944 values missing out of 0 for padding in stage 10, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,287 589,824 values missing out of 589,824 for parameter layers.11.self_attn.key_value.weight in stage 11, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,287 1,179,648 values missing out of 1,179,648 for parameter layers.11.self_attn.dense.weight in stage 11, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,287 1,769,472 values missing out of 1,769,472 for parameter layers.11.mlp.layer_1.weight in stage 11, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,287 3,538,944 values missing out of 0 for padding in stage 11, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,287 589,824 values missing out of 589,824 for parameter layers.12.self_attn.key_value.weight in stage 12, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,287 1,179,648 values missing out of 1,179,648 for parameter layers.12.self_attn.dense.weight in stage 12, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,287 1,769,472 values missing out of 1,769,472 for parameter layers.12.mlp.layer_1.weight in stage 12, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,287 3,538,944 values missing out of 0 for padding in stage 12, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,287 589,824 values missing out of 589,824 for parameter layers.13.self_attn.key_value.weight in stage 13, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,287 1,179,648 values missing out of 1,179,648 for parameter layers.13.self_attn.dense.weight in stage 13, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,287 1,769,472 values missing out of 1,769,472 for parameter layers.13.mlp.layer_1.weight in stage 13, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,287 3,538,944 values missing out of 0 for padding in stage 13, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,287 589,824 values missing out of 589,824 for parameter layers.14.self_attn.key_value.weight in stage 14, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,287 1,179,648 values missing out of 1,179,648 for parameter layers.14.self_attn.dense.weight in stage 14, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,287 1,769,472 values missing out of 1,769,472 for parameter layers.14.mlp.layer_1.weight in stage 14, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,287 3,538,944 values missing out of 0 for padding in stage 14, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,287 589,824 values missing out of 589,824 for parameter layers.15.self_attn.key_value.weight in stage 15, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,287 1,179,648 values missing out of 1,179,648 for parameter layers.15.self_attn.dense.weight in stage 15, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,287 1,769,472 values missing out of 1,769,472 for parameter layers.15.mlp.layer_1.weight in stage 15, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,287 3,538,944 values missing out of 0 for padding in stage 15, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,287 589,824 values missing out of 589,824 for parameter layers.16.self_attn.key_value.weight in stage 16, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,287 1,179,648 values missing out of 1,179,648 for parameter layers.16.self_attn.dense.weight in stage 16, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,287 1,769,472 values missing out of 1,769,472 for parameter layers.16.mlp.layer_1.weight in stage 16, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,287 3,538,944 values missing out of 0 for padding in stage 16, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,287 589,824 values missing out of 589,824 for parameter layers.17.self_attn.key_value.weight in stage 17, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,287 1,179,648 values missing out of 1,179,648 for parameter layers.17.self_attn.dense.weight in stage 17, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,287 1,769,472 values missing out of 1,769,472 for parameter layers.17.mlp.layer_1.weight in stage 17, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,287 3,538,944 values missing out of 0 for padding in stage 17, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,287 589,824 values missing out of 589,824 for parameter layers.18.self_attn.key_value.weight in stage 18, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,287 1,179,648 values missing out of 1,179,648 for parameter layers.18.self_attn.dense.weight in stage 18, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,287 1,769,472 values missing out of 1,769,472 for parameter layers.18.mlp.layer_1.weight in stage 18, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,287 3,538,944 values missing out of 0 for padding in stage 18, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,287 589,824 values missing out of 589,824 for parameter layers.19.self_attn.key_value.weight in stage 19, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,287 1,179,648 values missing out of 1,179,648 for parameter layers.19.self_attn.dense.weight in stage 19, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,287 1,769,472 values missing out of 1,769,472 for parameter layers.19.mlp.layer_1.weight in stage 19, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,287 3,538,944 values missing out of 0 for padding in stage 19, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,287 589,824 values missing out of 589,824 for parameter layers.20.self_attn.key_value.weight in stage 20, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,287 1,179,648 values missing out of 1,179,648 for parameter layers.20.self_attn.dense.weight in stage 20, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,287 1,769,472 values missing out of 1,769,472 for parameter layers.20.mlp.layer_1.weight in stage 20, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,287 3,538,944 values missing out of 0 for padding in stage 20, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,287 589,824 values missing out of 589,824 for parameter layers.21.self_attn.key_value.weight in stage 21, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,287 1,179,648 values missing out of 1,179,648 for parameter layers.21.self_attn.dense.weight in stage 21, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,287 1,769,472 values missing out of 1,769,472 for parameter layers.21.mlp.layer_1.weight in stage 21, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,287 3,538,944 values missing out of 0 for padding in stage 21, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,287 589,824 values missing out of 589,824 for parameter layers.22.self_attn.key_value.weight in stage 22, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,287 1,179,648 values missing out of 1,179,648 for parameter layers.22.self_attn.dense.weight in stage 22, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,287 1,769,472 values missing out of 1,769,472 for parameter layers.22.mlp.layer_1.weight in stage 22, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,287 3,538,944 values missing out of 0 for padding in stage 22, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,287 589,824 values missing out of 589,824 for parameter layers.23.self_attn.key_value.weight in stage 23, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,287 1,179,648 values missing out of 1,179,648 for parameter layers.23.self_attn.dense.weight in stage 23, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,287 1,769,472 values missing out of 1,769,472 for parameter layers.23.mlp.layer_1.weight in stage 23, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,287 3,538,944 values missing out of 0 for padding in stage 23, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,287 589,824 values missing out of 589,824 for parameter layers.24.self_attn.key_value.weight in stage 24, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,287 1,179,648 values missing out of 1,179,648 for parameter layers.24.self_attn.dense.weight in stage 24, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,287 1,769,472 values missing out of 1,769,472 for parameter layers.24.mlp.layer_1.weight in stage 24, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,287 3,538,944 values missing out of 0 for padding in stage 24, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,287 589,824 values missing out of 589,824 for parameter layers.25.self_attn.key_value.weight in stage 25, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,288 1,179,648 values missing out of 1,179,648 for parameter layers.25.self_attn.dense.weight in stage 25, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,288 1,769,472 values missing out of 1,769,472 for parameter layers.25.mlp.layer_1.weight in stage 25, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,288 3,538,944 values missing out of 0 for padding in stage 25, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,288 589,824 values missing out of 589,824 for parameter layers.26.self_attn.key_value.weight in stage 26, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,288 1,179,648 values missing out of 1,179,648 for parameter layers.26.self_attn.dense.weight in stage 26, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,288 1,769,472 values missing out of 1,769,472 for parameter layers.26.mlp.layer_1.weight in stage 26, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,288 3,538,944 values missing out of 0 for padding in stage 26, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,288 589,824 values missing out of 589,824 for parameter layers.27.self_attn.key_value.weight in stage 27, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,288 1,179,648 values missing out of 1,179,648 for parameter layers.27.self_attn.dense.weight in stage 27, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,288 1,769,472 values missing out of 1,769,472 for parameter layers.27.mlp.layer_1.weight in stage 27, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,288 3,538,944 values missing out of 0 for padding in stage 27, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,288 589,824 values missing out of 589,824 for parameter layers.28.self_attn.key_value.weight in stage 28, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,288 1,179,648 values missing out of 1,179,648 for parameter layers.28.self_attn.dense.weight in stage 28, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,288 1,769,472 values missing out of 1,769,472 for parameter layers.28.mlp.layer_1.weight in stage 28, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,288 3,538,944 values missing out of 0 for padding in stage 28, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,288 589,824 values missing out of 589,824 for parameter layers.29.self_attn.key_value.weight in stage 29, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,288 1,179,648 values missing out of 1,179,648 for parameter layers.29.self_attn.dense.weight in stage 29, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,288 1,769,472 values missing out of 1,769,472 for parameter layers.29.mlp.layer_1.weight in stage 29, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,288 3,538,944 values missing out of 0 for padding in stage 29, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,288 589,824 values missing out of 589,824 for parameter layers.30.self_attn.key_value.weight in stage 30, shard weights (locally 589,824 out of 589,824)
2025-06-10 17:57:34,288 1,179,648 values missing out of 1,179,648 for parameter layers.30.self_attn.dense.weight in stage 30, shard weights (locally 1,179,648 out of 1,179,648)
2025-06-10 17:57:34,288 1,769,472 values missing out of 1,769,472 for parameter layers.30.mlp.layer_1.weight in stage 30, shard weights (locally 1,769,472 out of 1,769,472)
2025-06-10 17:57:34,288 3,538,944 values missing out of 0 for padding in stage 30, shard weights (locally 0 out of 0)
2025-06-10 17:57:34,288 Incorrect global breakdown of missing state entries (expected 106,168,320, got 212,336,640)
2025-06-10 17:57:34,292 Traceback (most recent call last):
  File "/mnt/core_llm/soham/Fast-LLM/fast_llm/tools/cli.py", line 29, in fast_llm
    Runnable.parse_and_run(unparsed)
  File "/mnt/core_llm/soham/Fast-LLM/fast_llm/engine/config_utils/runnable.py", line 36, in parse_and_run
    runnable()
  File "/mnt/core_llm/soham/Fast-LLM/fast_llm/tools/convert.py", line 86, in run
    self._convert_model_partial(model_class, self.output)
  File "/mnt/core_llm/soham/Fast-LLM/fast_llm/tools/convert.py", line 59, in _convert_model_partial
    model = model_class.from_pretrained(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/core_llm/soham/Fast-LLM/fast_llm/engine/multi_stage/fast_llm_model.py", line 81, in from_pretrained
    model.load_checkpoint(pretrained_config)
  File "/mnt/core_llm/soham/Fast-LLM/fast_llm/engine/multi_stage/fast_llm_model.py", line 40, in load_checkpoint
    metadata = converter.load(config)
               ^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/core_llm/soham/Fast-LLM/fast_llm/engine/checkpoint/distributed.py", line 89, in load
    with SafeLoad(self._model, shard_names=shard_names, timeout=config.timeout) as context:
  File "/mnt/core_llm/soham/Fast-LLM/fast_llm/engine/checkpoint/safe_load.py", line 49, in __exit__
    self._validate()
  File "/mnt/core_llm/soham/Fast-LLM/fast_llm/engine/checkpoint/safe_load.py", line 69, in _validate
    raise RuntimeError("Model loading validation failed. See logs for details.")
RuntimeError: Model loading validation failed. See logs for details.
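
As a diagnostic, the tensors actually stored in each distributed shard can be listed with the safetensors library. This is a minimal sketch assuming the rank_*.safetensors layout visible in the log above; the key names inside the shard files are not confirmed here and may not match the parameter names reported by SafeLoad.

# Sketch: inspect what each distributed-checkpoint shard actually contains.
# Assumes the rank_<N>.safetensors files shown in the log above.
from pathlib import Path
from safetensors import safe_open

checkpoint_dir = Path(
    "/mnt/core_llm_large/slam/experiments/upcycle/smol_mistral_debug/checkpoint/40"
)

for shard in sorted(checkpoint_dir.glob("rank_*.safetensors")):
    with safe_open(str(shard), framework="pt") as f:
        keys = list(f.keys())
        total = sum(f.get_tensor(k).numel() for k in keys)
        print(f"{shard.name}: {len(keys)} entries, {total:,} values")
        for k in keys:
            print(f"  {k}: {tuple(f.get_tensor(k).shape)}")

Comparing the per-shard totals against the 319,129,920 parameters expected and the 212,996,736 that actually loaded may help narrow down which tensor-parallel shards the converter is skipping.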

🎯 Expected Behavior

The conversion completes without errors, matching the behavior of checkpoint exports during training.
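
Once the export succeeds, the Mistral-format output should load directly with Hugging Face Transformers. A minimal verification sketch (output path taken from the convert command above; dtype assumed to match training_dtype):

# Sketch: check that the converted checkpoint loads as a standard HF model.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "/mnt/core_llm_large/slam/experiments/upcycle/smol_mistral_debug/mistral/40",
    torch_dtype=torch.bfloat16,  # training_dtype from the config above
)
print(model.config)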

Labels

bug