Skip to content

Fix overflow in CopyCpuTensor for sub-byte types#28171

Merged
tianleiwu merged 4 commits into
mainfrom
adrastogi/cpu-fix
Apr 23, 2026
Merged

Fix overflow in CopyCpuTensor for sub-byte types#28171
tianleiwu merged 4 commits into
mainfrom
adrastogi/cpu-fix

Conversation

@adrastogi
Copy link
Copy Markdown
Contributor

Description

When copying tensors of sub-byte types (e.g., int4, uint4), two code paths in the CPU EP's CopyCpuTensor function use the element count as the byte count during memcpy, but those types are only 4 bits (0.5 bytes). This causes ORT to copy the incorrect number of bytes, which can cause overflow / corruption.

The fix is to replace the math with Tensor::SizeInBytes(), which already correctly handles sub-byte packing. There is a second site in IdentityOp::Compute which has been refactored to call CopyCpuTensor directly, eliminating the duplicated logic.

The change also adds regression tests for different variations of this scenario (I was able to use them to reproduce the original issue without the proposed updates).

Motivation and Context

This fixes a problem in the CPU EP in the scenario described above.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a heap overflow/corruption risk in the CPU EP tensor copy path for sub-byte packed element types (e.g., int4/uint4, int2/uint2) by ensuring copy byte counts match the packed storage size.

Changes:

  • Update CopyCpuTensor to copy using Tensor::SizeInBytes() instead of Shape().Size() * DataType()->Size().
  • Refactor IdentityOp::Compute to reuse CopyCpuTensor, removing duplicated (and previously unsafe) memcpy logic.
  • Add CPU regression tests covering same-type casts and direct CopyCpuTensor copies for packed sub-byte types (odd/even sizes and larger shapes).
Show a summary per file
File Description
onnxruntime/test/providers/cpu/tensor/cast_op_test.cc Adds regression tests for packed sub-byte same-type cast and direct CopyCpuTensor memcpy behavior.
onnxruntime/core/providers/cpu/tensor/utils.h Fixes memcpy byte-count calculation by using src->SizeInBytes().
onnxruntime/core/providers/cpu/tensor/identity_op.h Replaces custom copy logic with CopyCpuTensor to avoid duplicated unsafe sizing.

Copilot's findings

Tip

Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

  • Files reviewed: 3/3 changed files
  • Comments generated: 1

Comment thread onnxruntime/test/providers/cpu/tensor/cast_op_test.cc Outdated
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can commit the suggested changes from lintrunner.

Comment thread onnxruntime/test/providers/cpu/tensor/cast_op_test.cc Outdated
Comment thread onnxruntime/test/providers/cpu/tensor/cast_op_test.cc Outdated
Copy link
Copy Markdown
Contributor

@tianleiwu tianleiwu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementation looks good overall: moving the non-string copy path to Tensor::SizeInBytes() fixes the packed sub-byte overflow in the shared helper, and reusing CopyCpuTensor from IdentityOp closes the duplicate site there too.

I left one suggestion on the regression coverage: the new direct-copy test checks the copied payload bytes, but it does not deterministically prove that the buggy implementation stayed in bounds on non-ASan runs.

Comment thread onnxruntime/test/providers/cpu/tensor/cast_op_test.cc
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@adrastogi adrastogi requested a review from tianleiwu April 23, 2026 02:57
Copy link
Copy Markdown
Contributor

@tianleiwu tianleiwu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current head looks good. Switching the shared CPU copy path to Tensor::SizeInBytes() fixes the packed sub-byte overflow at the right abstraction boundary, and routing IdentityOp through CopyCpuTensor removes the duplicated buggy byte-count logic.

The updated regression coverage also addresses the earlier test gap: the direct-copy test now uses distinct guarded buffers, so the old overflow is caught deterministically without relying on sanitizer behavior. I did not find any remaining correctness, performance, or maintainability issue in the changed paths on this head.

@adrastogi
Copy link
Copy Markdown
Contributor Author

adrastogi commented Apr 23, 2026

@tianleiwu thanks for the review! The Linux TensorRT CI seems to be consistently failing. I see errors like this:

2026-04-23T03:10:34.9340117Z 2: [----------] 153 tests from CastOpTest
2026-04-23T03:10:34.9340224Z 2: [ RUN      ] CastOpTest.NonStringTypes
2026-04-23T03:10:34.9340289Z 2: unknown file: Failure
2026-04-23T03:10:34.9342955Z 2: C++ exception with description "/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:122 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, SUCCTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; SUCCTYPE = cudaError; std::conditional_t<THRW, void, common::Status> = void] /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:114 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, SUCCTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; SUCCTYPE = cudaError; std::conditional_t<THRW, void, common::Status> = void] CUDA failure 100: no CUDA-capable device is detected ; GPU=-1 ; hostname=7f6d4e25f76b ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider.cc ; line=336 ; expr=cudaSetDevice(info_.device_id); 
2026-04-23T03:10:34.9343019Z 2: 
2026-04-23T03:10:34.9343088Z 2: " thrown in the test body.
2026-04-23T03:10:34.9343140Z 2: 
2026-04-23T03:10:34.9343238Z 2: [  FAILED  ] CastOpTest.NonStringTypes (0 ms)
2026-04-23T03:10:34.9343313Z 2: [ RUN      ] CastOpTest.FromString
2026-04-23T03:10:34.9343378Z 2: unknown file: Failure
2026-04-23T03:10:34.9345860Z 2: C++ exception with description "/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:122 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, SUCCTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; SUCCTYPE = cudaError; std::conditional_t<THRW, void, common::Status> = void] /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:114 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, SUCCTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; SUCCTYPE = cudaError; std::conditional_t<THRW, void, common::Status> = void] CUDA failure 100: no CUDA-capable device is detected ; GPU=-1 ; hostname=7f6d4e25f76b ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider.cc ; line=336 ; expr=cudaSetDevice(info_.device_id); 
2026-04-23T03:10:34.9346159Z 2: 
2026-04-23T03:10:34.9346228Z 2: " thrown in the test body.
2026-04-23T03:10:34.9346276Z 2: 
2026-04-23T03:10:34.9346359Z 2: [  FAILED  ] CastOpTest.FromString (0 ms)

The same error impacts the tests I've added as well. Is the CastOpTest suite not supposed to run on CUDA EP? (Seems like this would have been a pre-existing issue in that case, but I didn't see any issues logged.)

@tianleiwu
Copy link
Copy Markdown
Contributor

@tianleiwu thanks for the review! The Linux TensorRT CI seems to be consistently failing. I see errors like this:

Please merge latest main branch. There is a fix updated cuda driver for that.

@tianleiwu tianleiwu merged commit 4265122 into main Apr 23, 2026
88 of 89 checks passed
@tianleiwu tianleiwu deleted the adrastogi/cpu-fix branch April 23, 2026 15:55
Copy link
Copy Markdown
Contributor

@fdwr fdwr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

adrastogi added a commit that referenced this pull request Apr 29, 2026
This cherry-picks the following commits for the release:

#28171 Fix overflow in CopyCpuTensor for sub-byte types
@anskumar01
Copy link
Copy Markdown

Hi @adrastogi , will this PR be cherry picked to ORT1.25?

@adrastogi
Copy link
Copy Markdown
Contributor Author

@anskumar01 thanks for flagging. @devang-ml is there runway for cherry-picking this into 1.25.1? I am not sure of the schedule there.

adrastogi added a commit that referenced this pull request May 6, 2026
This cherry-picks the following commits for the release:

#28171 Fix overflow in CopyCpuTensor for sub-byte types
#27808 [VitisAI] pass base timestamp for vitisai profiling

---------

Signed-off-by: Andrew Luo <junpengl@amd.com>
Co-authored-by: andrew0917 <68576025+andrew0917@users.noreply.github.com>
Co-authored-by: Andrew Luo <junpengl@amd.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants