
Cannot load Qwen-2.5-14B-Instruct on more than 2 cores #965

@vinayvarahabhotla

Description

System Info

Platform:

- Platform: Linux-6.8.0-1031-aws-x86_64-with-glibc2.35
- Python version: 3.11.13


Python packages:

- `optimum-neuron` version: 0.3.0
- `neuron-sdk` version: 2.24.0
- `optimum` version: 1.24.0
- `transformers` version: 4.51.3
- `huggingface_hub` version: 0.34.4
- `torch` version: 2.7.0+cu126
- `aws-neuronx-runtime-discovery` version: NA
- `libneuronxla` version: 2.2.4410.0+835a67fb
- `neuronx-cc` version: 2.19.8089.0+8ab9f450
- `neuronx-distributed` version: 0.13.14393+b8569585
- `neuronx-hwm` version: NA
- `torch-neuronx` version: 2.7.0.2.8.6734+ac864f72
- `torch-xla` version: 2.7.0
- `transformers-neuronx` version: NA


Neuron Driver:


aws-neuronx-collectives/unknown,now 2.27.34.0-ec8cd5e8b amd64 [installed]
aws-neuronx-dkms/unknown,now 2.23.9.0 all [installed]
aws-neuronx-oci-hook/unknown,now 2.11.42.0 amd64 [installed]
aws-neuronx-runtime-lib/unknown,now 2.27.23.0-8deec4dbf amd64 [installed]
aws-neuronx-tools/unknown,now 2.25.145.0 amd64 [installed]

Who can help?

@JingyaHuang @dacorvo

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction (minimal, reproducible, runnable)

I cannot load Qwen2.5-14B with a long context on more than 2 cores.

Main issue: I compiled a Qwen2.5-14B-Instruct model with the following command:

`optimum-cli export neuron --model Qwen/Qwen2.5-14B-Instruct --sequence_length 16384 --batch_size 1 --num_cores 6 qwen-compiled-16k`

  • The compilation completed successfully.
  • But when I load the model with `model = NeuronModelForCausalLM.from_pretrained('qwen-compiled-16k')`
  • I get the error `RuntimeError: expected shape torch.Size([896, 5120]) for layers.0.self_attn.qkv_proj.k_proj.weight but found torch.Size([4480, 5120])`

But the same model loads and runs fine when I restrict `sequence_length` to 4096 and `num_cores` to 2.
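
For completeness, here is a minimal sketch of the failing load step as a runnable script, assuming the compiled artifacts produced by the export command above live in `./qwen-compiled-16k` (the generation part at the end is only illustrative and is never reached, since the error is raised during `from_pretrained`):

```python
# Minimal reproduction sketch, assuming ./qwen-compiled-16k was produced by
# the `optimum-cli export neuron` command above (sequence_length 16384,
# batch_size 1, num_cores 6).
from transformers import AutoTokenizer
from optimum.neuron import NeuronModelForCausalLM

# This call raises:
#   RuntimeError: expected shape torch.Size([896, 5120]) for
#   layers.0.self_attn.qkv_proj.k_proj.weight but found torch.Size([4480, 5120])
model = NeuronModelForCausalLM.from_pretrained("qwen-compiled-16k")

# Illustrative only; never reached because the load above fails.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-14B-Instruct")
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Loading the 2-core / 4096-token export the same way works, so the mismatch only appears with the 6-core / 16k configuration.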

Expected behavior

The model compiled with `--sequence_length 16384` and `--num_cores 6` should load successfully via `NeuronModelForCausalLM.from_pretrained('qwen-compiled-16k')`, instead of raising the shape-mismatch `RuntimeError` shown above.

Labels

Stale, bug (Something isn't working)