
Slow Benchmark Result from Trained Model Using SparseML. #2361

Open
WayneSkywalker opened this issue Mar 3, 2025 · 1 comment

@WayneSkywalker

Hi, I have an issue with sparse transfer learning using SparseML, following the instructions in https://github.com/neuralmagic/sparseml/blob/main/integrations/ultralytics-yolov8/tutorials/sparse-transfer-learning.md.

More specifically, I trained with:

sparseml.ultralytics.train \
  --model "zoo:cv/detection/yolov8-m/pytorch/ultralytics/coco/pruned80-none" \
  --recipe "zoo:cv/detection/yolov8-m/pytorch/ultralytics/voc/pruned80_quant-none" \
  --data "coco128.yaml" \
  --batch 2

and then exported the trained model:

sparseml.ultralytics.export_onnx \
  --model ./runs/detect/train/weights/last.pt \
  --save_dir yolov8-m

and then ran a benchmark using DeepSparse:

>> deepsparse.benchmark /home/ubuntu/code/models/trained_model.onnx
2025-03-03 03:23:56 deepsparse.benchmark.helpers INFO     Thread pinning to cores enabled
DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.8.0 COMMUNITY | (e3778e93) (release) (optimized) (system=avx512_vnni, binary=avx512)
2025-03-03 03:23:56 deepsparse.benchmark.benchmark_model INFO     deepsparse.engine.Engine:
        onnx_file_path: /home/ubuntu/code/models/trained_model.onnx
        batch_size: 1
        num_cores: 4
        num_streams: 1
        scheduler: Scheduler.default
        fraction_of_supported_ops: 0.0
        cpu_avx_type: avx512
        cpu_vnni: True
2025-03-03 03:23:56 deepsparse.utils.onnx INFO     Generating input 'images', type = uint8, shape = [1, 3, 640, 640]
2025-03-03 03:23:56 deepsparse.benchmark.benchmark_model INFO     Starting 'singlestream' performance measurements for 10 seconds
Original Model Path: /home/ubuntu/code/models/trained_model.onnx
Batch Size: 1
Scenario: sync
Throughput (items/sec): 4.1084
Latency Mean (ms/batch): 243.3896
Latency Median (ms/batch): 240.5514
Latency Std (ms/batch): 10.9256
Iterations: 42

Here are the related dependencies and the training environment.
Libraries:

  • torch==2.5.1
  • sparseml==1.8.0
  • deepsparse==1.8.0
  • ultralytics==8.0.124
  • onnx==1.14.1
  • onnxruntime==1.17.0

Training Environment:

  • NVIDIA GeForce RTX 4070 Ti (12 GB VRAM)
  • Ubuntu 22.04

It is quite slow. I suspect the benchmark result is related to fraction_of_supported_ops: 0.0, because when I run the benchmark on the pretrained weights used in the training command above (downloaded from https://sparsezoo.neuralmagic.com/models/yolov8-m-coco-pruned80_quantized?hardware=deepsparse-c6i.12xlarge&comparison=yolov8-m-coco-base), the result is much faster:

>> deepsparse.benchmark /home/ubuntu/code/models/pretrained_model.onnx
2025-03-03 03:52:06 deepsparse.benchmark.helpers INFO     Thread pinning to cores enabled
DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.8.0 COMMUNITY | (e3778e93) (release) (optimized) (system=avx512_vnni, binary=avx512)
2025-03-03 03:52:07 deepsparse.benchmark.benchmark_model INFO     deepsparse.engine.Engine:
        onnx_file_path: /home/ubuntu/code/models/pretrained_model.onnx
        batch_size: 1
        num_cores: 4
        num_streams: 1
        scheduler: Scheduler.default
        fraction_of_supported_ops: 1.0
        cpu_avx_type: avx512
        cpu_vnni: True
2025-03-03 03:52:08 deepsparse.utils.onnx INFO     Generating input 'images', type = uint8, shape = [1, 3, 640, 640]
2025-03-03 03:52:08 deepsparse.benchmark.benchmark_model INFO     Starting 'singlestream' performance measurements for 10 seconds
Original Model Path: /home/ubuntu/code/models/pretrained_model.onnx
Batch Size: 1
Scenario: sync
Throughput (items/sec): 25.9231
Latency Mean (ms/batch): 38.5548
Latency Median (ms/batch): 38.2803
Latency Std (ms/batch): 1.4339
Iterations: 260

Here, fraction_of_supported_ops is 1.0.

I searched for more information and found that it relates to the optimized runtime, as described in https://github.com/neuralmagic/deepsparse/blob/36b92eeb730a74a787cea467c9132eaa1b78167f/src/deepsparse/engine.py#L417, but that is all I could find.
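
For reference, here is a minimal sketch of how the value could be checked outside of deepsparse.benchmark; it assumes the Engine object exposes a fraction_of_supported_ops property (as the engine.py link above suggests) and uses the uint8 [1, 3, 640, 640] input shape from the benchmark log:

from deepsparse import Engine
import numpy as np

# Compile the exported model the same way deepsparse.benchmark does (batch size 1).
engine = Engine(model="/home/ubuntu/code/models/trained_model.onnx", batch_size=1)

# Assumed property: fraction of operations running in the optimized runtime
# (0.0 means everything fell back to the slower path).
print("fraction_of_supported_ops:", engine.fraction_of_supported_ops)

# Sanity-check inference with a dummy input matching the benchmark log.
dummy = np.zeros((1, 3, 640, 640), dtype=np.uint8)
outputs = engine([dummy])
print([o.shape for o in outputs])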

I have some questions:

  1. What exactly is fraction_of_supported_ops?
  2. What can I do about fraction_of_supported_ops?
  3. How does fraction_of_supported_ops affect the benchmark result?

sriram-dsl commented Apr 8, 2025

I am facing the same issue.
Inconsistent Speed and High CPU Usage with DeepSparse/YOLOv5

Problem:

  1. Speed Changes Randomly:

    • Same model sometimes gives high speed (~240 FPS), sometimes low (~45 FPS) on the same machine.
    • Docs promise 240+ FPS, but I only get 45 FPS most of the time.
  2. Uses Too Many CPU Cores:

    • DeepSparse always uses 8 CPU cores, even when testing just one model.
    • A normal (non-sparse) model gives the same speed (~45 FPS) with just 1 core.

What I Tried:

  1. Two Training Methods:

    • Trained YOLOv5 normally → exported with SparseML.
    • Trained YOLOv5 with SparseML (pruned/quantized) → exported with SparseML:

    !sparseml.yolov5.train \
     --weights zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned75_quant-none?recipe_type=transfer_learn \
     --recipe zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned75_quant-none?recipe_type=transfer_learn \
     --recipe-args '{"num_epochs":10}' \
     --data "/kaggle/input/asdfghjklpouyy/coco.yaml" \
     --patience 0 \
     --cfg yolov5s.yaml \
     --hyp hyps/hyp.finetune.yaml \
     --imgsz 320 \
     --batch-size 8 \
     --device 0

    !sparseml.yolov5.export_onnx \
     --weights /kaggle/working/yolov5/yolov5_runs/train/exp/weights/best_pruned.pt \
     --batch-size 1 \
     --imgsz 320 320 \
     --int8 \
     --dynamic

    • Both give the same speed (~45 FPS).
  2. Testing Commands:

    deepsparse.benchmark model.onnx --batch_size 8 --num_cores 8

    • Still slow, and forcing 1 core (--num_cores 1) makes it even slower (see the single-core sketch after this list).
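
For the core count, a minimal sketch of pinning the engine to a single core through the Python API; the model path, the 320x320 float32 input, and the timing loop are placeholders for my setup, assuming the Engine constructor accepts a num_cores argument:

import time
import numpy as np
from deepsparse import Engine

MODEL_PATH = "model.onnx"  # placeholder path to the exported YOLOv5 model

# Ask DeepSparse for a single core instead of letting it take all 8.
engine = Engine(model=MODEL_PATH, batch_size=1, num_cores=1)

# 320x320 float32 dummy input; switch to uint8 if the export kept quantized inputs.
dummy = np.random.rand(1, 3, 320, 320).astype(np.float32)

# Crude timing loop to compare against the 8-core numbers.
n = 100
start = time.time()
for _ in range(n):
    engine([dummy])
print("FPS:", n / (time.time() - start))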

Expected vs. Reality:

What          Expected (Docs)   What Happens
Speed (FPS)   240+              45-80
CPU Cores     1-2               Always 8
Delay (ms)    <10               ~20

What’s Wrong?

  • Maybe DeepSparse isn’t using the CPU properly (VNNI/AVX-512).
  • Maybe the model isn’t really using sparsity/quantization (see the inspection sketch after this list).
  • Cloud CPU might be slowing down during testing.
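
A minimal sketch of how one could check both of these from the exported ONNX file, using only the onnx and numpy packages (the path is a placeholder; the op-type names and the rough 75-80% sparsity expectation come from the pruned/quantized recipes above):

import onnx
import numpy as np
from onnx import numpy_helper

MODEL_PATH = "model.onnx"  # placeholder path to the exported model
model = onnx.load(MODEL_PATH)

# Count quantization-related nodes; a quantized export should contain ops such as
# QLinearConv / QuantizeLinear / DequantizeLinear.
quant_ops = {"QLinearConv", "QLinearMatMul", "QuantizeLinear",
             "DequantizeLinear", "ConvInteger", "MatMulInteger"}
counts = {}
for node in model.graph.node:
    if node.op_type in quant_ops:
        counts[node.op_type] = counts.get(node.op_type, 0) + 1
print("quantized ops:", counts or "none found")

# Estimate overall weight sparsity: a pruned75/pruned80 model should show a large
# fraction of exactly-zero weight values.
total = zeros = 0
for init in model.graph.initializer:
    arr = numpy_helper.to_array(init)
    if arr.ndim >= 2:  # skip small 1-D tensors such as biases and scales
        total += arr.size
        zeros += int(np.count_nonzero(arr == 0))
print(f"weight sparsity: {zeros / max(total, 1):.2%}")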

Questions:

  1. How do I make DeepSparse use fewer cores without losing speed?
  2. What’s the best sparsity setting for YOLOv5 to hit 200+ FPS?
  3. How can I check if quantization is really working?
  4. Please provide the exact pipeline to follow to get high throughput.

Next Steps:

  • I’ll share my model and test logs if needed.
  • Let me know how to fix this!

