Slow Benchmark Result from Trained Model Using SparseML. #2361
Description
Hi, I have an issue with transfer learning using SparseML, following the instructions in https://github.com/neuralmagic/sparseml/blob/main/integrations/ultralytics-yolov8/tutorials/sparse-transfer-learning.md.
More specifically, I trained:
```bash
sparseml.ultralytics.train \
  --model "zoo:cv/detection/yolov8-m/pytorch/ultralytics/coco/pruned80-none" \
  --recipe "zoo:cv/detection/yolov8-m/pytorch/ultralytics/voc/pruned80_quant-none" \
  --data "coco128.yaml" \
  --batch 2
```
and then exported the trained model:
```bash
sparseml.ultralytics.export_onnx \
  --model ./runs/detect/train/weights/last.pt \
  --save_dir yolov8-m
```
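As a sanity check that the 80% sparsity survived training and export, the exported ONNX weights can be inspected directly. Below is a minimal sketch using the onnx and numpy packages; the exact output file path is an assumption based on `--save_dir` above.

```python
import onnx
from onnx import numpy_helper

# Assumed export location based on --save_dir above; adjust to the real file name.
model = onnx.load("yolov8-m/last.onnx")

total, zeros = 0, 0
for init in model.graph.initializer:
    arr = numpy_helper.to_array(init)
    # Only count weight-sized tensors; skip small biases and scalars.
    if arr.size > 1000:
        total += arr.size
        zeros += int((arr == 0).sum())

print(f"overall weight sparsity: {zeros / total:.2%}")
```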
I then ran a benchmark with DeepSparse:
```
>> deepsparse.benchmark /home/ubuntu/code/models/trained_model.onnx
2025-03-03 03:23:56 deepsparse.benchmark.helpers INFO Thread pinning to cores enabled
DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.8.0 COMMUNITY | (e3778e93) (release) (optimized) (system=avx512_vnni, binary=avx512)
2025-03-03 03:23:56 deepsparse.benchmark.benchmark_model INFO deepsparse.engine.Engine:
    onnx_file_path: /home/ubuntu/code/models/trained_model.onnx
    batch_size: 1
    num_cores: 4
    num_streams: 1
    scheduler: Scheduler.default
    fraction_of_supported_ops: 0.0
    cpu_avx_type: avx512
    cpu_vnni: True
2025-03-03 03:23:56 deepsparse.utils.onnx INFO Generating input 'images', type = uint8, shape = [1, 3, 640, 640]
2025-03-03 03:23:56 deepsparse.benchmark.benchmark_model INFO Starting 'singlestream' performance measurements for 10 seconds
Original Model Path: /home/ubuntu/code/models/trained_model.onnx
Batch Size: 1
Scenario: sync
Throughput (items/sec): 4.1084
Latency Mean (ms/batch): 243.3896
Latency Median (ms/batch): 240.5514
Latency Std (ms/batch): 10.9256
Iterations: 42
```
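For cross-checking, the same measurement can be reproduced from Python by compiling the model with deepsparse.Engine and timing it by hand. This is a minimal sketch: it assumes Engine accepts a model path and a list of numpy arrays, matching the uint8 [1, 3, 640, 640] input the benchmark generated above.

```python
import time

import numpy as np
from deepsparse import Engine

# Path matches the benchmark command above.
engine = Engine(model="/home/ubuntu/code/models/trained_model.onnx", batch_size=1)

# Dummy input matching the uint8 [1, 3, 640, 640] input the benchmark generated.
dummy = np.zeros((1, 3, 640, 640), dtype=np.uint8)

# Warm up, then time a handful of runs.
for _ in range(5):
    engine([dummy])

n = 50
start = time.perf_counter()
for _ in range(n):
    engine([dummy])
elapsed = time.perf_counter() - start

print(f"mean latency: {1000 * elapsed / n:.1f} ms/batch")
```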
Here are the related dependencies and the training environment:
Libraries:
- torch==2.5.1
- sparseml==1.8.0
- deepsparse==1.8.0
- ultralytics==8.0.124
- onnx==1.14.1
- onnxruntime==1.17.0
Training Environment:
- NVIDIA GeForce RTX 4070 Ti (12 GB VRAM)
- Ubuntu 22.04
The trained model is quite slow. I suspect the `fraction_of_supported_ops: 0.0` reported above is related to this result, because when I run the same benchmark on the pretrained weights used in the training command (obtained from https://sparsezoo.neuralmagic.com/models/yolov8-m-coco-pruned80_quantized?hardware=deepsparse-c6i.12xlarge&comparison=yolov8-m-coco-base), I get:
```
>> deepsparse.benchmark /home/ubuntu/code/models/pretrained_model.onnx
2025-03-03 03:52:06 deepsparse.benchmark.helpers INFO Thread pinning to cores enabled
DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.8.0 COMMUNITY | (e3778e93) (release) (optimized) (system=avx512_vnni, binary=avx512)
2025-03-03 03:52:07 deepsparse.benchmark.benchmark_model INFO deepsparse.engine.Engine:
    onnx_file_path: /home/ubuntu/code/models/pretrained_model.onnx
    batch_size: 1
    num_cores: 4
    num_streams: 1
    scheduler: Scheduler.default
    fraction_of_supported_ops: 1.0
    cpu_avx_type: avx512
    cpu_vnni: True
2025-03-03 03:52:08 deepsparse.utils.onnx INFO Generating input 'images', type = uint8, shape = [1, 3, 640, 640]
2025-03-03 03:52:08 deepsparse.benchmark.benchmark_model INFO Starting 'singlestream' performance measurements for 10 seconds
Original Model Path: /home/ubuntu/code/models/pretrained_model.onnx
Batch Size: 1
Scenario: sync
Throughput (items/sec): 25.9231
Latency Mean (ms/batch): 38.5548
Latency Median (ms/batch): 38.2803
Latency Std (ms/batch): 1.4339
Iterations: 260
```
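Since the fast SparseZoo model above is a pruned80_quantized checkpoint, one more comparison that seems worth doing is checking whether the two ONNX files contain quantized operators at all. Below is a minimal sketch with the onnx package; the paths match the benchmark commands above.

```python
from collections import Counter

import onnx

# Paths match the two benchmark commands above.
for path in (
    "/home/ubuntu/code/models/trained_model.onnx",
    "/home/ubuntu/code/models/pretrained_model.onnx",
):
    ops = Counter(node.op_type for node in onnx.load(path).graph.node)
    quant_ops = {
        op: n
        for op, n in ops.items()
        if op in ("QuantizeLinear", "DequantizeLinear") or op.startswith("QLinear") or "Integer" in op
    }
    print(path, "->", quant_ops or "no quantized ops found")
```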
I found that `fraction_of_supported_ops` is 1.0 for the pretrained model. I then searched for this setting and found that it relates to the optimized runtime, as described in https://github.com/neuralmagic/deepsparse/blob/36b92eeb730a74a787cea467c9132eaa1b78167f/src/deepsparse/engine.py#L417, but that's all I could find.
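If I read that engine.py code correctly, the compiled engine also exposes this value directly, so it could be inspected without running a full benchmark. This is a minimal sketch, assuming the property name matches the `fraction_of_supported_ops` field printed in the benchmark logs above.

```python
from deepsparse import Engine

# Path matches the first benchmark above.
engine = Engine(model="/home/ubuntu/code/models/trained_model.onnx", batch_size=1)

# Assumption: the Engine property has the same name as the field in the benchmark log.
print("fraction_of_supported_ops:", engine.fraction_of_supported_ops)
```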
I have some questions:
- What exactly is `fraction_of_supported_ops`?
- What can I do about `fraction_of_supported_ops`?
- How does `fraction_of_supported_ops` affect the benchmark result?