
More data types supported in cv::cuda::transpose() #3371


Open · wants to merge 9 commits into base: 4.x
Changes from 1 commit
135 changes: 124 additions & 11 deletions modules/cudaarithm/src/cuda/transpose.cu
@@ -60,34 +60,147 @@ void cv::cuda::transpose(InputArray _src, OutputArray _dst, Stream& stream)
{
GpuMat src = getInputMat(_src, stream);

const int srcType = src.type();
Review comment (Contributor):
Is this redundant?

const int srcDepth = src.depth();
const int srcCn = src.channels();
const size_t elemSize = src.elemSize();
const size_t elemSize1 = src.elemSize1();

CV_Assert( elemSize == 1 || elemSize == 4 || elemSize == 8 );
//CV_Assert( elemSize == 1 || elemSize == 4 || elemSize == 8 );
Review comment (Member):
What if elemSize == 5? What kind of error message would we show to the user?

Review comment (Contributor Author):
It is handled by

...
else
      CV_Error(Error::StsUnsupportedFormat, "");


GpuMat dst = getOutputMat(_dst, src.cols, src.rows, src.type(), stream);

if (elemSize == 1)
const bool isNppiNativelySupported =
(srcType == CV_8UC1) || (srcType == CV_8UC3) || (srcType == CV_8UC4) ||
(srcType == CV_16UC1) || (srcType == CV_16UC3) || (srcType == CV_16UC4) ||
(srcType == CV_16SC1) || (srcType == CV_16SC3) || (srcType == CV_16SC4) ||
(srcType == CV_32SC1) || (srcType == CV_32SC3) || (srcType == CV_32SC4) ||
(srcType == CV_32FC1) || (srcType == CV_32FC3) || (srcType == CV_32FC4);
const bool isElemSizeSupportedByNppi =
((elemSize != 0) && !(elemSize%1) && ((elemSize/1)<=4)) ||
Review comment (Contributor, @cudawarped, Nov 10, 2022):
Apologies if I have completely misunderstood the logic, but:
  • Isn't elemSize%1 == 0 always?
  • When can elemSize == 0?
  • Isn't this already taken care of by (srcType == CV_8UC1) || (srcType == CV_8UC3) || (srcType == CV_8UC4) || (srcType == CV_16UC1)?

Review comment (Contributor Author):
You are right, but it was on purpose, for code clarity/readability: elemSize%1 and elemSize/1 will be optimized out by the compiler; this is just to balance with the 2, 4, 8 cases.

OK to get rid of the (elemSize != 0) check.

Review comment (Contributor, @cudawarped, Nov 10, 2022):
For clarity I would use elemSize1 and channels?
For what CV data type will isNppiNativelySupported == false and isElemSizeSupportedByNppi == true or isElemSizeSupportedByGridTranspose == true?

Review comment (Contributor Author, @chacha21, Nov 10, 2022):
There is overlap between isNppiNativelySupported, isElemSizeSupportedByNppi and isElemSizeSupportedByGridTranspose. The idea is that if anything goes wrong, they can be disabled independently.
  • isNppiNativelySupported means that the data type is directly mapped to an nppiTranspose call.
  • isElemSizeSupportedByNppi means that we can cheat on the real data type and use another data type of the same size (assuming that NPP only performs memory copies, otherwise it would not work).
  • isElemSizeSupportedByGridTranspose is the fallback if we don't rely on the NPP implementation.

For instance, 16UC2 is not supported by NPP but can be mapped to 32SC1 (isNppiNativelySupported == false and isElemSizeSupportedByNppi == true).
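
As an aside, a minimal sketch of that "cheat" in the same setting as this file (NPP headers, nppSafeCall and cv::cuda available); the helper name is hypothetical and not part of the patch. A CV_16UC2 matrix has elemSize == 4, so it can be moved by the 32-bit single-channel NPP transpose, assuming nppiTranspose_32s_C1R performs pure memory movement:

// Hypothetical helper: transpose a CV_16UC2 GpuMat by treating each 4-byte
// element as one Npp32s value (same elemSize, so only memory layout matters).
static void transpose16UC2ViaNpp(const cv::cuda::GpuMat& src, cv::cuda::GpuMat& dst)
{
    CV_Assert( src.type() == CV_16UC2 );          // elemSize == 4 bytes
    dst.create(src.cols, src.rows, src.type());   // transposed dimensions

    NppiSize sz;
    sz.width  = src.cols;
    sz.height = src.rows;

    nppSafeCall( nppiTranspose_32s_C1R(src.ptr<Npp32s>(), static_cast<int>(src.step),
                                       dst.ptr<Npp32s>(), static_cast<int>(dst.step), sz) );
}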

Review comment (Contributor):
Is there a complete overlap, i.e. are isElemSizeSupportedByNppi and isElemSizeSupportedByGridTranspose redundant unless isNppiNativelySupported is manually set to false?

Review comment (Contributor Author, @chacha21, Nov 10, 2022):
> Is there a complete overlap, i.e. are isElemSizeSupportedByNppi and isElemSizeSupportedByGridTranspose redundant unless isNppiNativelySupported is manually set to false?

There are cases where (isNppiNativelySupported == false) and (isElemSizeSupportedByNppi == true), e.g. 16UC2.

But there are currently no cases where (isElemSizeSupportedByNppi == false) and (isElemSizeSupportedByGridTranspose == true). It is a fallback implementation in case:

  • a (future) OpenCV global option disables NPP
  • performance tests reveal that some NPP calls are slower than gridTranspose(); specific cases could then be excluded from isElemSizeSupportedByNppi.

Review comment (Contributor):
Is 16UC2 handled by the block below?

else if (!(elemSize%2) && ((elemSize/2)==2))
  nppSafeCall( nppiTranspose_8u_C4R(src.ptr<Npp8u>(), static_cast<int>(src.step),
    dst.ptr<Npp8u>(), static_cast<int>(dst.step), sz) );

If so, wouldn't it be better to have this under

else if (srcType == CV_16UC2)

Review comment (Contributor Author, @chacha21, Nov 10, 2022):
Exactly: this is not a native nppi call (isNppiNativelySupported == false), but it can be handled by cheating on the data type (isElemSizeSupportedByNppi == true). That's why the tests inside if (isElemSizeSupportedByNppi) explicitly focus on elemSize rather than srcType.

Review comment (Contributor):
I'm not sure if this makes it easier or harder for the next person who looks at it to see what's going on; what do you think @alalek?

Either way, I think having overlap between both makes it difficult to understand. Maybe if the "cheat" only included the cases where it was applicable, it would be more obvious?
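
A sketch of the non-overlapping formulation hinted at above (an assumption about intent, not code from this patch): the elemSize-based path would only claim types that NPP cannot handle natively, listing exactly the element sizes the "cheat" branch can map onto an NPP call.

// Hypothetical, non-overlapping variant: only fall back to the elemSize-based
// NPP calls when the type is not natively supported, for the sizes the
// "cheat" branch below actually handles (1, 2, 3, 4, 6, 8, 12, 16 bytes).
const bool isElemSizeSupportedByNppi = !isNppiNativelySupported &&
    (elemSize == 1 || elemSize == 2 || elemSize == 3 || elemSize == 4 ||
     elemSize == 6 || elemSize == 8 || elemSize == 12 || elemSize == 16);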

((elemSize != 0) && !(elemSize%2) && ((elemSize/2)<=4)) ||
((elemSize != 0) && !(elemSize%4) && ((elemSize/4)<=4)) ||
((elemSize != 0) && !(elemSize%8) && ((elemSize/8)<=2));
if (src.empty())
dst.release();
Review comment (Member):
Empty input should be an error, as nobody wants to process "nothing" in real use cases: opencv/opencv#8300

Review comment (Contributor Author):
OK, I thought it was better given "Allow empty matrices in most functions" (personally I prefer a no-op rather than an exception).
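
A one-line sketch of the check the reviewer is asking for (an assumption, not part of this commit):

    CV_Assert( !src.empty() );  // reject empty input up front instead of silently releasing dst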

Review comment (Contributor Author):
I will also add a fast path for size == (1, 1), where transpose is just a copy.
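
A sketch of what such a fast path could look like inside this function (an assumption, not in this commit):

    if (src.size() == cv::Size(1, 1))
        src.copyTo(dst, stream);   // a 1x1 matrix equals its transpose; a plain device copy suffices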

else if (isNppiNativelySupported)
{
NppStreamHandler h(StreamAccessor::getStream(stream));

NppiSize sz;
sz.width = src.cols;
sz.height = src.rows;

nppSafeCall( nppiTranspose_8u_C1R(src.ptr<Npp8u>(), static_cast<int>(src.step),
if (srcType == CV_8UC1)
nppSafeCall( nppiTranspose_8u_C1R(src.ptr<Npp8u>(), static_cast<int>(src.step),
dst.ptr<Npp8u>(), static_cast<int>(dst.step), sz) );
else if (srcType == CV_8UC3)
nppSafeCall( nppiTranspose_8u_C3R(src.ptr<Npp8u>(), static_cast<int>(src.step),
dst.ptr<Npp8u>(), static_cast<int>(dst.step), sz) );
else if (srcType == CV_8UC4)
nppSafeCall( nppiTranspose_8u_C4R(src.ptr<Npp8u>(), static_cast<int>(src.step),
dst.ptr<Npp8u>(), static_cast<int>(dst.step), sz) );
else if (srcType == CV_16UC1)
nppSafeCall( nppiTranspose_16u_C1R(src.ptr<Npp16u>(), static_cast<int>(src.step),
dst.ptr<Npp16u>(), static_cast<int>(dst.step), sz) );
else if (srcType == CV_16UC3)
nppSafeCall( nppiTranspose_16u_C3R(src.ptr<Npp16u>(), static_cast<int>(src.step),
dst.ptr<Npp16u>(), static_cast<int>(dst.step), sz) );
else if (srcType == CV_16UC4)
nppSafeCall( nppiTranspose_16u_C4R(src.ptr<Npp16u>(), static_cast<int>(src.step),
dst.ptr<Npp16u>(), static_cast<int>(dst.step), sz) );
else if (srcType == CV_16SC1)
nppSafeCall( nppiTranspose_16s_C1R(src.ptr<Npp16s>(), static_cast<int>(src.step),
dst.ptr<Npp16s>(), static_cast<int>(dst.step), sz) );
else if (srcType == CV_16SC3)
nppSafeCall( nppiTranspose_16s_C3R(src.ptr<Npp16s>(), static_cast<int>(src.step),
dst.ptr<Npp16s>(), static_cast<int>(dst.step), sz) );
else if (srcType == CV_16SC4)
nppSafeCall( nppiTranspose_16s_C4R(src.ptr<Npp16s>(), static_cast<int>(src.step),
dst.ptr<Npp16s>(), static_cast<int>(dst.step), sz) );
else if (srcType == CV_32SC1)
nppSafeCall( nppiTranspose_32s_C1R(src.ptr<Npp32s>(), static_cast<int>(src.step),
dst.ptr<Npp32s>(), static_cast<int>(dst.step), sz) );
else if (srcType == CV_32SC3)
nppSafeCall( nppiTranspose_32s_C3R(src.ptr<Npp32s>(), static_cast<int>(src.step),
dst.ptr<Npp32s>(), static_cast<int>(dst.step), sz) );
else if (srcType == CV_32SC4)
nppSafeCall( nppiTranspose_32s_C4R(src.ptr<Npp32s>(), static_cast<int>(src.step),
dst.ptr<Npp32s>(), static_cast<int>(dst.step), sz) );
else if (srcType == CV_32FC1)
nppSafeCall( nppiTranspose_32f_C1R(src.ptr<Npp32f>(), static_cast<int>(src.step),
dst.ptr<Npp32f>(), static_cast<int>(dst.step), sz) );
else if (srcType == CV_32FC3)
nppSafeCall( nppiTranspose_32f_C3R(src.ptr<Npp32f>(), static_cast<int>(src.step),
dst.ptr<Npp32f>(), static_cast<int>(dst.step), sz) );
else if (srcType == CV_32FC4)
nppSafeCall( nppiTranspose_32f_C4R(src.ptr<Npp32f>(), static_cast<int>(src.step),
dst.ptr<Npp32f>(), static_cast<int>(dst.step), sz) );

if (!stream)
CV_CUDEV_SAFE_CALL( cudaDeviceSynchronize() );
}
else if (elemSize == 4)
{
gridTranspose(globPtr<int>(src), globPtr<int>(dst), stream);
}
else // if (elemSize == 8)
}//end if (isNppiNativelySupported)
else if (isElemSizeSupportedByNppi)
{
gridTranspose(globPtr<double>(src), globPtr<double>(dst), stream);
}
NppStreamHandler h(StreamAccessor::getStream(stream));

NppiSize sz;
sz.width = src.cols;
sz.height = src.rows;

if (!(elemSize%1) && ((elemSize/1)==1))
nppSafeCall( nppiTranspose_8u_C1R(src.ptr<Npp8u>(), static_cast<int>(src.step),
dst.ptr<Npp8u>(), static_cast<int>(dst.step), sz) );
else if (!(elemSize%1) && ((elemSize/1)==2))
nppSafeCall( nppiTranspose_16u_C1R(src.ptr<Npp16u>(), static_cast<int>(src.step),
dst.ptr<Npp16u>(), static_cast<int>(dst.step), sz) );
else if (!(elemSize%1) && ((elemSize/1)==3))
nppSafeCall( nppiTranspose_8u_C3R(src.ptr<Npp8u>(), static_cast<int>(src.step),
dst.ptr<Npp8u>(), static_cast<int>(dst.step), sz) );
else if (!(elemSize%1) && ((elemSize/1)==4))
nppSafeCall( nppiTranspose_8u_C4R(src.ptr<Npp8u>(), static_cast<int>(src.step),
dst.ptr<Npp8u>(), static_cast<int>(dst.step), sz) );
else if (!(elemSize%2) && ((elemSize/2)==1))
nppSafeCall( nppiTranspose_16u_C1R(src.ptr<Npp16u>(), static_cast<int>(src.step),
dst.ptr<Npp16u>(), static_cast<int>(dst.step), sz) );
else if (!(elemSize%2) && ((elemSize/2)==2))
nppSafeCall( nppiTranspose_8u_C4R(src.ptr<Npp8u>(), static_cast<int>(src.step),
dst.ptr<Npp8u>(), static_cast<int>(dst.step), sz) );
else if (!(elemSize%2) && ((elemSize/2)==3))
nppSafeCall( nppiTranspose_16u_C3R(src.ptr<Npp16u>(), static_cast<int>(src.step),
dst.ptr<Npp16u>(), static_cast<int>(dst.step), sz) );
else if (!(elemSize%2) && ((elemSize/2)==4))
nppSafeCall( nppiTranspose_16u_C4R(src.ptr<Npp16u>(), static_cast<int>(src.step),
dst.ptr<Npp16u>(), static_cast<int>(dst.step), sz) );
else if (!(elemSize%4) && ((elemSize/4)==1))
nppSafeCall( nppiTranspose_32f_C1R(src.ptr<Npp32f>(), static_cast<int>(src.step),
dst.ptr<Npp32f>(), static_cast<int>(dst.step), sz) );
else if (!(elemSize%4) && ((elemSize/4)==2))
nppSafeCall( nppiTranspose_16u_C4R(src.ptr<Npp16u>(), static_cast<int>(src.step),
dst.ptr<Npp16u>(), static_cast<int>(dst.step), sz) );
else if (!(elemSize%4) && ((elemSize/4)==3))
nppSafeCall( nppiTranspose_32f_C3R(src.ptr<Npp32f>(), static_cast<int>(src.step),
dst.ptr<Npp32f>(), static_cast<int>(dst.step), sz) );
else if (!(elemSize%4) && ((elemSize/4)==4))
nppSafeCall( nppiTranspose_32f_C4R(src.ptr<Npp32f>(), static_cast<int>(src.step),
dst.ptr<Npp32f>(), static_cast<int>(dst.step), sz) );
else if (!(elemSize%8) && ((elemSize/8)==1))
nppSafeCall( nppiTranspose_16u_C4R(src.ptr<Npp16u>(), static_cast<int>(src.step),
dst.ptr<Npp16u>(), static_cast<int>(dst.step), sz) );
else if (!(elemSize%8) && ((elemSize/8)==2))
nppSafeCall( nppiTranspose_32f_C4R(src.ptr<Npp32f>(), static_cast<int>(src.step),
dst.ptr<Npp32f>(), static_cast<int>(dst.step), sz) );
}//end if (isElemSizeSupportedByNppi)
else if (elemSize == 1)
gridTranspose(globPtr<unsigned char>(src), globPtr<unsigned char>(dst), stream);
else if (elemSize == 2)
gridTranspose(globPtr<unsigned short>(src), globPtr<unsigned short>(dst), stream);
else if (elemSize == 4)
gridTranspose(globPtr<signed int>(src), globPtr<signed int>(dst), stream);
else if (elemSize == 8)
gridTranspose(globPtr<double>(src), globPtr<double>(dst), stream);
else
CV_Error(Error::StsUnsupportedFormat, "");

syncOutput(dst, _dst, stream);
}
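
For context, a minimal caller-side sketch of what this patch enables (the sizes and the CV_16UC2 type are arbitrary choices for illustration; the snippet is not part of the diff):

#include <opencv2/core.hpp>
#include <opencv2/cudaarithm.hpp>

void example()
{
    cv::cuda::GpuMat src(480, 640, CV_16UC2);   // a type previously rejected by cv::cuda::transpose
    cv::cuda::GpuMat dst;
    cv::cuda::Stream stream;

    cv::cuda::transpose(src, dst, stream);      // with this patch, handled via the elemSize == 4 path
    stream.waitForCompletion();

    CV_Assert( dst.size() == cv::Size(480, 640) );  // rows and cols are swapped
}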
21 changes: 20 additions & 1 deletion modules/cudaarithm/test/test_core.cpp
@@ -231,12 +231,31 @@ INSTANTIATE_TEST_CASE_P(CUDA_Arithm, Transpose, testing::Combine(
ALL_DEVICES,
DIFFERENT_SIZES,
testing::Values(MatType(CV_8UC1),
MatType(CV_8UC2),
MatType(CV_8UC3),
MatType(CV_8UC4),
MatType(CV_8SC1),
MatType(CV_8SC2),
MatType(CV_8SC3),
MatType(CV_8SC4),
MatType(CV_16UC1),
MatType(CV_16UC2),
MatType(CV_16UC3),
MatType(CV_16UC4),
MatType(CV_16SC1),
MatType(CV_16SC2),
MatType(CV_16SC3),
MatType(CV_16SC4),
MatType(CV_32SC1),
MatType(CV_32SC2),
MatType(CV_64FC1)),
MatType(CV_32SC3),
MatType(CV_32SC4),
MatType(CV_32FC1),
MatType(CV_32FC2),
MatType(CV_32FC3),
MatType(CV_32FC4),
MatType(CV_64FC1),
MatType(CV_64FC2)),
WHOLE_SUBMAT));

////////////////////////////////////////////////////////////////////////////////