Unexpected Performance: GPU slower than CPU for Qwen 8B inference #265

Issue Description
I'm seeing unexpected performance results: GPU inference is slower than CPU inference when running the Qwen 8B model with Distributed Llama.

Test Environment
- Model: Qwen 8B (Q40 quantization)
- Nodes: Jetson Orin Nano x 8
- Framework: Distributed Llama

CPU Performance:

Tokens/s: 7.94
ms/token: 125.90
Pred: 115 ms, Sync: 83 ms
Network: Sent 2254 kB, Recv 2661 kB

GPU Performance:

Tokens/s: 5.12
ms/token: 195.14
Pred: 45 ms, Sync: 75 ms
Network: Sent 2254 kB, Recv 2661 kB
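
As a quick sanity check on the figures above, here is a minimal Python sketch that recomputes tokens/s from ms/token and compares each run's Pred + Sync breakdown against its reported ms/token. The dictionary layout and the `summarize` helper are my own (hypothetical) and are not part of Distributed Llama. The sketch assumes the Pred/Sync line comes from a single token while ms/token is an average over the run; that is a guess, since the output does not say so.

```python
# Figures copied from the CPU and GPU result blocks above.
# The structure and helper below are illustrative only.
runs = {
    "CPU": {"tokens_per_s": 7.94, "ms_per_token": 125.90, "pred_ms": 115, "sync_ms": 83},
    "GPU": {"tokens_per_s": 5.12, "ms_per_token": 195.14, "pred_ms": 45, "sync_ms": 75},
}


def summarize(run):
    """Recompute throughput and compare the Pred+Sync breakdown to ms/token."""
    implied_tps = 1000.0 / run["ms_per_token"]         # tokens/s implied by ms/token
    pred_plus_sync = run["pred_ms"] + run["sync_ms"]   # time covered by the breakdown line
    gap = run["ms_per_token"] - pred_plus_sync         # positive: time the breakdown does not cover
    return {
        "implied_tokens_per_s": round(implied_tps, 2),
        "pred_plus_sync_ms": pred_plus_sync,
        "unaccounted_ms": round(gap, 2),
    }


summaries = {name: summarize(run) for name, run in runs.items()}
for name, summary in summaries.items():
    print(name, summary)
```

With these inputs the reported tokens/s values match 1000 / ms/token for both runs, but the breakdowns do not line up with the averages: Pred + Sync is 198 ms on CPU (above the 125.90 ms/token average) and 120 ms on GPU (below the 195.14 ms/token average). So the GPU breakdown alone looks faster than the CPU one even though the reported GPU throughput is lower, which may mean the Pred/Sync line is not representative of the whole run.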
