
Unexpected Performance: GPU slower than CPU for Qwen 8B inference #265

@jason11489

Issue Description
I'm seeing unexpected performance: inference on the GPU is slower than on the CPU when running the Qwen 8B model with Distributed Llama.

Test Environment

  • Model: Qwen 8B (Q40 quantization)
  • Nodes: Jetson Orin Nano × 8
  • Framework: Distributed Llama

CPU Performance:

Tokens/s: 7.94
ms/token: 125.90
Pred: 115 ms, Sync: 83 ms
Network: Sent 2254 kB, Recv 2661 kB

GPU Performance:

Tokens/s: 5.12
ms/token: 195.14
Pred: 45 ms, Sync: 75 ms
Network: Sent 2254 kB, Recv 2661 kB
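As a quick consistency check on the figures above, this short Python snippet recomputes throughput from the reported ms/token and adds up the reported Pred and Sync times for each run. It only does arithmetic on the numbers copied from the two runs. It makes no assumption about how Distributed Llama defines or measures Pred and Sync internally.

```python
# Per-token metrics copied from the CPU and GPU runs reported above.
runs = {
    "CPU": {"tokens_per_s": 7.94, "ms_per_token": 125.90, "pred_ms": 115, "sync_ms": 83},
    "GPU": {"tokens_per_s": 5.12, "ms_per_token": 195.14, "pred_ms": 45, "sync_ms": 75},
}


def summarize(r):
    """Return (throughput implied by ms/token, Pred + Sync in ms)."""
    implied_tps = 1000.0 / r["ms_per_token"]  # 1000 ms divided by ms per token
    pred_plus_sync = r["pred_ms"] + r["sync_ms"]
    return implied_tps, pred_plus_sync


for name, r in runs.items():
    tps, total = summarize(r)
    print(
        f"{name}: implied {tps:.2f} tok/s (reported {r['tokens_per_s']}), "
        f"Pred + Sync = {total} ms vs ms/token = {r['ms_per_token']}"
    )
```

The tokens/s values agree with 1000 / ms/token for both runs, while Pred + Sync (198 ms on CPU, 120 ms on GPU) does not match the reported ms/token for either run.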
