This repository was archived by the owner on Jun 3, 2025. It is now read-only.

Slow Benchmark Result from Trained Model Using SparseML #2361

Closed
Description

@WayneSkywalker

Hi, I have an issue with transfer learning using SparseML, following the instructions in https://github.com/neuralmagic/sparseml/blob/main/integrations/ultralytics-yolov8/tutorials/sparse-transfer-learning.md.

More specifically, I trained with:

sparseml.ultralytics.train \
  --model "zoo:cv/detection/yolov8-m/pytorch/ultralytics/coco/pruned80-none" \
  --recipe "zoo:cv/detection/yolov8-m/pytorch/ultralytics/voc/pruned80_quant-none" \
  --data "coco128.yaml" \
  --batch 2
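
To check whether the pruning recipe actually survived training, something like the sketch below could measure the zero-weight fraction of the checkpoint. It is only a rough sketch: it assumes the ultralytics checkpoint stores the model object under the "model" key (which may differ between versions) and that ultralytics is importable for unpickling.

# Rough sketch: estimate weight sparsity of the trained checkpoint.
import torch

ckpt = torch.load("./runs/detect/train/weights/last.pt", map_location="cpu")
model = ckpt["model"]  # assumption: ultralytics stores the nn.Module here

total = 0
zeros = 0
for _, param in model.named_parameters():
    total += param.numel()
    zeros += (param == 0).sum().item()

print(f"overall weight sparsity: {zeros / total:.2%}")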

and then exported the trained model:

sparseml.ultralytics.export_onnx \
  --model ./runs/detect/train/weights/last.pt \
  --save_dir yolov8-m
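
To check whether the sparsity and quantization actually made it into the exported graph, a small script like the following could inspect the ONNX file. This is just a sketch using the onnx package from the dependency list; the path matches the file benchmarked below.

# Rough sketch: inspect the exported ONNX graph for sparsity and quantization.
import onnx
from onnx import numpy_helper

m = onnx.load("/home/ubuntu/code/models/trained_model.onnx")

# Fraction of zero-valued weights across all initializers
total = 0
zeros = 0
for init in m.graph.initializer:
    arr = numpy_helper.to_array(init)
    total += arr.size
    zeros += int((arr == 0).sum())
print(f"initializer sparsity: {zeros / total:.2%}")

# Count quantization-related nodes (QuantizeLinear / DequantizeLinear / QLinear*)
quant_nodes = [n.op_type for n in m.graph.node
               if n.op_type in ("QuantizeLinear", "DequantizeLinear")
               or n.op_type.startswith("QLinear")]
print(f"quantization-related nodes: {len(quant_nodes)}")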

I then ran a benchmark using DeepSparse:

>> deepsparse.benchmark /home/ubuntu/code/models/trained_model.onnx
2025-03-03 03:23:56 deepsparse.benchmark.helpers INFO     Thread pinning to cores enabled
DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.8.0 COMMUNITY | (e3778e93) (release) (optimized) (system=avx512_vnni, binary=avx512)
2025-03-03 03:23:56 deepsparse.benchmark.benchmark_model INFO     deepsparse.engine.Engine:
        onnx_file_path: /home/ubuntu/code/models/trained_model.onnx
        batch_size: 1
        num_cores: 4
        num_streams: 1
        scheduler: Scheduler.default
        fraction_of_supported_ops: 0.0
        cpu_avx_type: avx512
        cpu_vnni: True
2025-03-03 03:23:56 deepsparse.utils.onnx INFO     Generating input 'images', type = uint8, shape = [1, 3, 640, 640]
2025-03-03 03:23:56 deepsparse.benchmark.benchmark_model INFO     Starting 'singlestream' performance measurements for 10 seconds
Original Model Path: /home/ubuntu/code/models/trained_model.onnx
Batch Size: 1
Scenario: sync
Throughput (items/sec): 4.1084
Latency Mean (ms/batch): 243.3896
Latency Median (ms/batch): 240.5514
Latency Std (ms/batch): 10.9256
Iterations: 42

Here are the related dependencies and training environment.
Libraries:

  • torch==2.5.1
  • sparseml==1.8.0
  • deepsparse==1.8.0
  • ultralytics==8.0.124
  • onnx==1.14.1
  • onnxruntime==1.17.0

Training Environment:

  • NVIDIA GeForce RTX 4070 Ti (12 GB VRAM)
  • Ubuntu 22.04

It is quite slow. I suspect that fraction_of_supported_ops: 0.0 is related to the poor benchmark result, because when I ran the benchmark on the pretrained weights used in the training command above (downloaded from https://sparsezoo.neuralmagic.com/models/yolov8-m-coco-pruned80_quantized?hardware=deepsparse-c6i.12xlarge&comparison=yolov8-m-coco-base), I got much better numbers:

>> deepsparse.benchmark /home/ubuntu/code/models/pretrained_model.onnx
2025-03-03 03:52:06 deepsparse.benchmark.helpers INFO     Thread pinning to cores enabled
DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.8.0 COMMUNITY | (e3778e93) (release) (optimized) (system=avx512_vnni, binary=avx512)
2025-03-03 03:52:07 deepsparse.benchmark.benchmark_model INFO     deepsparse.engine.Engine:
        onnx_file_path: /home/ubuntu/code/models/pretrained_model.onnx
        batch_size: 1
        num_cores: 4
        num_streams: 1
        scheduler: Scheduler.default
        fraction_of_supported_ops: 1.0
        cpu_avx_type: avx512
        cpu_vnni: True
2025-03-03 03:52:08 deepsparse.utils.onnx INFO     Generating input 'images', type = uint8, shape = [1, 3, 640, 640]
2025-03-03 03:52:08 deepsparse.benchmark.benchmark_model INFO     Starting 'singlestream' performance measurements for 10 seconds
Original Model Path: /home/ubuntu/code/models/pretrained_model.onnx
Batch Size: 1
Scenario: sync
Throughput (items/sec): 25.9231
Latency Mean (ms/batch): 38.5548
Latency Median (ms/batch): 38.2803
Latency Std (ms/batch): 1.4339
Iterations: 260

I found that fraction_of_supported_ops is 1.0 for the pretrained model.

I then searched for this and found that it relates to the optimized runtime, as described in https://github.com/neuralmagic/deepsparse/blob/36b92eeb730a74a787cea467c9132eaa1b78167f/src/deepsparse/engine.py#L417, but that's all I could find.
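
If I read that file correctly, the same value seems to be exposed as a property on the Engine object, so it could also be checked from Python directly. This is just a sketch; the constructor arguments and property name are taken from that engine.py file and I have not verified them beyond that.

# Sketch: compile the model with the DeepSparse Python API and print the same
# fraction_of_supported_ops value that deepsparse.benchmark reports.
from deepsparse import Engine

engine = Engine(model="/home/ubuntu/code/models/trained_model.onnx", batch_size=1)
print(engine.fraction_of_supported_ops)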

I have some questions:

  1. What exactly is fraction_of_supported_ops?
  2. What can I do about fraction_of_supported_ops?
  3. How does fraction_of_supported_ops affect the benchmark result?
