Improving swizzles #1086
For the 'selecting the right kernel when performance allows it' part, does the approach used in https://godbolt.org/z/vz6GGz8nx help?
@serge-sans-paille well I'm not entirely sure? I don't imagine your goal was to show there'd be a runtime exception when the performance would be poor? If I disable the specialization, I get:

```asm
categorize_simd(xsimd::batch<unsigned char, xsimd::sse2>&, xsimd::batch<unsigned char, xsimd::sse2>&, xsimd::batch<unsigned char, xsimd::sse2>&, xsimd::batch<unsigned char, xsimd::sse2> const&):
        push    r14
        push    rbx
        push    rax
        mov     edi, 16
        call    __cxa_allocate_exception@PLT
        mov     rbx, rax
        lea     rsi, [rip + .L.str.1]
        mov     rdi, rax
        call    std::runtime_error::runtime_error(char const*)@PLT
        mov     rsi, qword ptr [rip + typeinfo for std::runtime_error@GOTPCREL]
        mov     rdx, qword ptr [rip + std::runtime_error::~runtime_error()@GOTPCREL]
        mov     rdi, rbx
        call    __cxa_throw@PLT
        mov     r14, rax
        mov     rdi, rbx
        call    __cxa_free_exception@PLT
        mov     rdi, r14
        call    _Unwind_Resume@PLT
```

I think what you're trying to show me is I should be able to specialize what happens when…
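That assembly is just a `std::runtime_error` being constructed and thrown. A minimal sketch of a generic fallback that would compile down to it, with hypothetical names (this is a reconstruction, not the code from the godbolt link):

```cpp
#include <stdexcept>
#include <xsimd/xsimd.hpp>

// Hypothetical generic fallback: if no efficient kernel exists for the
// current architecture, fail loudly at runtime rather than emit slow code.
template <typename T, typename Arch>
xsimd::batch<T, Arch> categorize_simd(xsimd::batch<T, Arch>&, xsimd::batch<T, Arch>&,
                                      xsimd::batch<T, Arch>&, const xsimd::batch<T, Arch>&)
{
    throw std::runtime_error("categorize_simd: no efficient kernel for this architecture");
}
```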
Well I appear to have managed to construct a similar output in assembly by doing the following: https://godbolt.org/z/9ceEWfvc6

```cpp
constexpr size_t mask_for_size(size_t sz)
{
    size_t m = ~0ull;
    while (m > sz)
    {
        m >>= 1;
    }
    return m;
}

template <typename T, typename Arch>
xsimd::batch<T, Arch> lookup(const xsimd::batch<T, Arch>& table, const xsimd::batch<T, Arch>& idxs, xsimd::kernel::requires_arch<xsimd::generic>)
{
    constexpr size_t sz = xsimd::batch<T, Arch>::size;
    constexpr size_t mask = mask_for_size(sz * sizeof(T));
    std::array<T, sz> tbl = {};
    std::array<T, sz> ind = {};
    table.store_unaligned(tbl.data());
    idxs.store_unaligned(ind.data());
    for (size_t i = 0; i < sz; i++)
    {
        ind[i] = tbl[ind[i] & mask]; // clamp each index into the table, then gather lane by lane
    }
    return xsimd::batch<T, Arch>::load_unaligned(ind.data());
}
```
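As a quick sanity check on the masking (a worked example, not from the comment above): for a 16-lane `uint8_t` batch, `sz * sizeof(T)` is 16, so the mask comes out as 15 and every index wraps into the table:

```cpp
static_assert(mask_for_size(16) == 15, "16-byte table: indices clamped to [0, 15]");
static_assert(mask_for_size(8) == 7, "8-byte table: indices clamped to [0, 7]");
```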
Found a variation that improves on the generated code:

```cpp
template <typename T, typename Arch>
xsimd::batch<T, Arch> lookup(const xsimd::batch<T, Arch>& table, const xsimd::batch<T, Arch>& idxs, xsimd::kernel::requires_arch<xsimd::sse2>)
{
    constexpr size_t sz = xsimd::batch<T, Arch>::size;
    constexpr size_t mask = mask_for_size(sz * sizeof(T));
    std::array<T, sz> tbl;
    std::array<T, sz> out;
    table.store_unaligned(tbl.data());
    xsimd::batch<T, Arch> idxs_masked = idxs & mask;
    for (size_t i = 0; i < sz; i++)
    {
        // unclear why accessing the bytes from the simd register vs the out array
        // generates better assembly... but it works
        out[i] = tbl[idxs_masked.data[i]];
    }
    return xsimd::batch<T, Arch>::load_unaligned(out.data());
}
```

May be worth using as a `uint8_t` swizzle.
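For what it's worth, here's a hedged sketch of how the two overloads above could be selected, following xsimd's convention of passing an architecture tag as the last argument (this assumes `Arch` is default-constructible and that `sse2` is more specific than `generic` in xsimd's architecture hierarchy):

```cpp
// Front-end that forwards to the most specific kernel overload for Arch:
// an sse2 batch picks the requires_arch<xsimd::sse2> version, everything
// else falls back to requires_arch<xsimd::generic>.
template <typename T, typename Arch = xsimd::default_arch>
xsimd::batch<T, Arch> lookup(const xsimd::batch<T, Arch>& table, const xsimd::batch<T, Arch>& idxs)
{
    return lookup(table, idxs, Arch {});
}
```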
Here's an example table-lookup problem program, with assembly, from godbolt:
https://godbolt.org/z/9WP19sfq8

Compare the assembly between the "simdless" versions and the result from the template: the "simdless" version not only produces fewer instructions to process the same amount of data, it really is just faster! Compiling with MSVC makes the difference worse.
However, modeling the roughly equivalent simd-swizzle and using it as a callback produces a similar effect. There's likely an optimization in not using SIMD except where instructions like `pshufb` are available (see the sketch after this comment). For context, here are some benchmarks using MSVC:
NOTE: The results I'm seeing from godbolt appear to indicate that this is the case when using `sse2`:

```asm
call    xsimd::batch<unsigned char, xsimd::sse2> categorize<xsimd::batch<unsigned char, xsimd::sse2>>(xsimd::batch<unsigned char, xsimd::sse2>&, xsimd::batch<unsigned char, xsimd::sse2>&, xsimd::batch<unsigned char, xsimd::sse2>&, xsimd::batch<unsigned char, xsimd::sse2> const&)
```
When `sse4.1` is enabled the performance is better. (I could not find a way to enable ssse3 with MSVC.)

| ns/op | op/s | err% | total | benchmark |
|------:|-----:|-----:|------:|:----------|
| 1,242.30 | 804,958.97 | 0.6% | 0.01 | `simd lookup (categorize)` |
I also wanted to benchmark against an emulated or generic batch for comparison but I'm not sure how to express that type.
Edit: using clang-cl I managed to enable what I think is ssse3; for reference, clang's performance is:
Take note that there's a massive performance difference between `sse2` and `ssse3` when using clang, such that it actually starts outperforming the byte lookup! (What we'd hope.)
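Where `pshufb` (SSSE3's `_mm_shuffle_epi8`) is available, the whole 16-entry byte lookup collapses to a single instruction. A hedged sketch of what an ssse3 overload of the `lookup` kernel from the earlier comments might look like (my code, not from the thread; it assumes `xsimd::batch` converts to and from the native `__m128i` register, and it only covers the 8-bit case):

```cpp
#include <cstdint>
#include <tmmintrin.h> // SSSE3: _mm_shuffle_epi8
#include <xsimd/xsimd.hpp>

// For 8-bit lanes, pshufb performs the masked table lookup directly: each
// result byte is table[idx & 0x0F] (or 0 when the index's high bit is set),
// which matches the mask_for_size(16) == 15 clamp used above.
template <typename Arch>
xsimd::batch<std::uint8_t, Arch> lookup(const xsimd::batch<std::uint8_t, Arch>& table,
                                        const xsimd::batch<std::uint8_t, Arch>& idxs,
                                        xsimd::kernel::requires_arch<xsimd::ssse3>)
{
    return _mm_shuffle_epi8(table, idxs);
}
```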