Issue Description
I'm seeing unexpected performance results: the GPU is slower than the CPU when running the Qwen 8B model with Distributed Llama.
Test Environment
- Model: Qwen 8B (Q40 quantization)
- Nodes: Jetson Orin Nano x 8
- Framework: Distributed Llama
CPU Performance:
Tokens/s: 7.94
ms/token: 125.90
Pred: 115 ms, Sync: 83 ms
Network: Sent 2254 kB, Recv 2661 kB
GPU Performance:
Tokens/s: 5.12
ms/token: 195.14
Pred: 45 ms, Sync: 75 ms
Network: Sent 2254 kB, Recv 2661 kB
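As a quick sanity check on the figures above, here is a minimal Python sketch that recomputes tokens/s from ms/token and compares ms/token with Pred + Sync for each run. It assumes that Pred and Sync are per-token times in milliseconds and that ms/token should be roughly their sum; that interpretation is my assumption, not something confirmed by the framework. The `runs` dictionary and the `summarize` helper are hypothetical names used only for this illustration.

```python
# Figures copied from the CPU and GPU results reported above.
runs = {
    "CPU": {"tokens_per_s": 7.94, "ms_per_token": 125.90, "pred_ms": 115, "sync_ms": 83},
    "GPU": {"tokens_per_s": 5.12, "ms_per_token": 195.14, "pred_ms": 45, "sync_ms": 75},
}


def summarize(run):
    """Return (derived tokens/s, Pred + Sync in ms, ms/token minus Pred + Sync)."""
    derived_tps = 1000.0 / run["ms_per_token"]        # tokens/s implied by ms/token
    pred_plus_sync = run["pred_ms"] + run["sync_ms"]  # assumed per-token cost
    gap = run["ms_per_token"] - pred_plus_sync        # time not explained by Pred + Sync
    return derived_tps, pred_plus_sync, gap


for name, run in runs.items():
    tps, total, gap = summarize(run)
    print(f"{name}: derived {tps:.2f} tok/s, Pred + Sync = {total} ms, gap = {gap:+.2f} ms")
```

With these numbers, tokens/s and ms/token agree with each other in both runs, but Pred + Sync does not match ms/token in either one: the GPU run sums to 120 ms against 195.14 ms/token, while the CPU run sums to 198 ms against 125.90 ms/token. If the assumption above holds, the GPU run spends about 75 ms per token outside of Pred and Sync, which may be where the slowdown comes from.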