Improving swizzles #1086

Open · Andersama opened this issue Feb 11, 2025 · 4 comments

@Andersama

Andersama commented Feb 11, 2025

Here's an example program for a table-lookup problem, with assembly, from godbolt:

https://godbolt.org/z/9WP19sfq8

Compare the assembly between the "simdless" versions and the result from the template:

template<typename T>
T categorize(T& out, T& lo, T& hi, const T& in) {
    T l = in & 0xf;            // low nibble indices
    //l = xsimd::swizzle(lo,l);
    T h = in >> 4;             // high nibble indices
    //h = xsimd::swizzle(hi,h);
    // lookup() is the table-lookup helper from the godbolt link above
    return out = lookup(lo,l) & lookup(hi,h);
}

The "simdless" version not only produces less instructions to process the same amount of data, it really is just faster! Compiling with MSVC makes the difference worse.

However, modeling the roughly equivalent SIMD swizzle in scalar code and using it as a callback:

// Scalar stand-in for a 16-byte swizzle: each output byte picks the table entry
// selected by the low nibble of the corresponding index byte.
std::array<uint8_t,16> swizzle_like(const std::array<uint8_t,16> &table, const std::array<uint8_t,16>& idxs) {
    std::array<uint8_t,16> out;
    for (size_t i = 0; i < idxs.size(); i++) {
        out[i] = table[idxs[i]&0xf];
    }
    return out;
}

produces a similar effect.

There's likely an optimization to be had in not using SIMD here except where instructions like pshufb are available.
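For the record, the reason ssse3 matters so much here is that the entire 16-entry byte lookup collapses into a single pshufb. A rough illustration with raw intrinsics (my own sketch, not xsimd's kernel):

#include <tmmintrin.h>  // SSSE3: _mm_shuffle_epi8

// 16-entry byte-table lookup in one instruction. pshufb only uses the low 4 bits
// of each index byte (and zeroes any lane whose index has its high bit set), so
// for indices already masked to 0..15 this is exactly the swizzle_like() above.
__m128i lookup16_ssse3(__m128i table, __m128i idx) {
    return _mm_shuffle_epi8(table, idx);
}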

For context, here are some benchmarks using MSVC:

|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|              325.10 |        3,075,962.91 |    3.0% |      0.01 | `byte lookup (sizzzle_like_but_256_wide)`
|            2,456.13 |          407,144.73 |   13.7% |      0.01 | :wavy_dash: `simd lookup (categorize)` (Unstable with ~288.1 iters. Increase `minEpochIterations` to e.g. b41)

NOTE: The results I'm seeing from godbolt appear to indicate that this is the case when targeting sse2:

call xsimd::batch<unsigned char, xsimd::sse2> categorize<xsimd::batch<unsigned char, xsimd::sse2>>(xsimd::batch<unsigned char, xsimd::sse2>&, xsimd::batch<unsigned char, xsimd::sse2>&, xsimd::batch<unsigned char, xsimd::sse2>&, xsimd::batch<unsigned char, xsimd::sse2> const&)

When sse4.1 is enabled the performance is better. (I could not find a way to enable ssse3 with MSVC.)

|            1,242.30 |          804,958.97 |    0.6% |      0.01 | `simd lookup (categorize)`

I also wanted to benchmark against an emulated or generic batch for comparison, but I'm not sure how to express that type.
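(Side note for anyone reading later: I believe newer xsimd releases ship an "emulated" architecture intended for exactly this kind of scalar baseline; if that's present in the version being used, the type would presumably be spelled something like the snippet below. Treat the name and template parameter as an assumption on my part.)

#include <xsimd/xsimd.hpp>
#include <cstdint>

// Assumption: xsimd (13.1+) provides an emulated architecture tag that lowers
// every batch operation to plain scalar code; a 128-bit "no real SIMD" baseline
// batch for the benchmark would then look like this:
using emulated_batch = xsimd::batch<uint8_t, xsimd::emulated<128>>;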

Edit: using clang-cl I managed to enable what I think is ssse3; for reference, clang's performance is:

|               38.09 |       26,255,319.15 |    0.3% |      0.01 | `byte lookup (ssse3)`
|               40.52 |       24,681,447.96 |    1.5% |      0.01 | `byte lookup (sse2)`

|               13.57 |       73,690,033.38 |    0.0% |      0.01 | `simd lookup (ssse3)`
|              204.76 |        4,883,829.00 |    7.0% |      0.01 | :wavy_dash: `simd lookup (sse2)` (Unstable with ~5,035.6 iters. Increase `minEpochIterations` to e.g. c4b4)

Take note that there's a massive performance difference between sse2 and ssse3 when using clang, such that it actually starts outperforming the byte lookup! (Which is what we'd hope.)

@serge-sans-paille
Contributor

For 'selecting the right kernel when performance allows it', does the approach used in https://godbolt.org/z/vz6GGz8nx help?

@Andersama
Author

Andersama commented Feb 11, 2025

@serge-sans-paille well, I'm not entirely sure? I don't imagine your goal was to show that there'd be a runtime exception when the performance would be poor? If I disable the ssse3 compiler flag, it appears we end up calling into a function that just throws:

categorize_simd(xsimd::batch<unsigned char, xsimd::sse2>&, xsimd::batch<unsigned char, xsimd::sse2>&, xsimd::batch<unsigned char, xsimd::sse2>&, xsimd::batch<unsigned char, xsimd::sse2> const&):
        push    r14
        push    rbx
        push    rax
        mov     edi, 16
        call    __cxa_allocate_exception@PLT
        mov     rbx, rax
        lea     rsi, [rip + .L.str.1]
        mov     rdi, rax
        call    std::runtime_error::runtime_error(char const*)@PLT
        mov     rsi, qword ptr [rip + typeinfo for std::runtime_error@GOTPCREL]
        mov     rdx, qword ptr [rip + std::runtime_error::~runtime_error()@GOTPCREL]
        mov     rdi, rbx
        call    __cxa_throw@PLT
        mov     r14, rax
        mov     rdi, rbx
        call    __cxa_free_exception@PLT
        mov     rdi, r14
        call    _Unwind_Resume@PLT

I think what you're trying to show me is that I should be able to specialize what happens when sse2 or the generic architecture is used. Let me see what I can do from here.
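If I'm reading the link right, the missing piece is a two-argument entry point that forwards the batch's Arch tag, so that normal overload resolution picks the most specific per-arch overload and otherwise falls back to the xsimd::generic one. A minimal sketch of what I think that looks like (assuming xsimd's arch tags derive from xsimd::generic, which is what would make the fallback bind, and that the per-arch overloads below are declared before use):

#include <xsimd/xsimd.hpp>

// Dispatcher: forwards the architecture tag so that the per-arch overloads
// (written against xsimd::kernel::requires_arch<...>) are selected by normal
// overload resolution, with the xsimd::generic overload as the fallback.
template<typename T, typename Arch>
xsimd::batch<T, Arch> lookup(const xsimd::batch<T, Arch>& table, const xsimd::batch<T, Arch>& idxs)
{
    return lookup(table, idxs, Arch{});
}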

@Andersama
Author

Well, I appear to have managed to construct similar output in assembly by doing the following:

https://godbolt.org/z/9ceEWfvc6

// Largest value of the form 2^k - 1 that does not exceed sz,
// e.g. mask_for_size(16) == 0xf.
constexpr size_t mask_for_size(size_t sz) {
    size_t m = ~0ull;
    while (m > sz) {
        m >>= 1;
    }
    return m;
}

template<typename T, typename Arch>
xsimd::batch<T, Arch> lookup(const xsimd::batch<T, Arch>& table, const xsimd::batch<T, Arch>& idxs, xsimd::kernel::requires_arch<xsimd::generic>)
{
    constexpr size_t sz = xsimd::batch<T, Arch>::size;
    constexpr size_t mask = mask_for_size(sz*sizeof(T));
    std::array<T, sz> tbl = {};
    std::array<T, sz> ind = {};
    // spill both batches to arrays, do the lookup element by element, reload
    table.store_unaligned(tbl.data());
    idxs.store_unaligned(ind.data());

    for (size_t i = 0; i < sz; i++) {
        ind[i] = tbl[ind[i]&mask];
    }

    return xsimd::batch<T, Arch>::load_unaligned(ind.data());
}

@Andersama
Author

Andersama commented Feb 11, 2025

Found a variation that improves on the code generated for sse2. The odd part is that when the indices are stored into an array as above, the assembly is identical to the pessimized generic case (at least for clang).

template<typename T, typename Arch>
xsimd::batch<T, Arch> lookup(const xsimd::batch<T, Arch>& table, const xsimd::batch<T, Arch>& idxs, xsimd::kernel::requires_arch<xsimd::sse2>)
{
    constexpr size_t sz = xsimd::batch<T, Arch>::size;
    constexpr size_t mask = mask_for_size(sz*sizeof(T));
    std::array<T, sz> tbl;
    std::array<T, sz> out;
    table.store_unaligned(tbl.data());
    xsimd::batch<T, Arch> idxs_masked = idxs & mask;

    for (size_t i = 0; i < sz; i++) {
        // unclear why accessing the bytes from the simd register instead of a
        // stored array generates better assembly... but it works
        out[i] = tbl[idxs_masked.data[i]];
    }

    return xsimd::batch<T, Arch>::load_unaligned(out.data());
}

May be worth using as a uint8_t swizzle for sse2.
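For anyone wanting to try it: with a two-argument dispatcher like the one sketched earlier in scope, the original categorize() template from the top of the issue should pick this overload up unchanged (a sketch, assuming that dispatcher and the sse2 overload above are both visible):

#include <xsimd/xsimd.hpp>
#include <cstdint>

// With the dispatcher in scope, overload resolution lands in the sse2 lookup
// overload above; on ssse3 and up, a pshufb-based overload could be added the
// same way without touching this code.
using batch8 = xsimd::batch<uint8_t, xsimd::sse2>;

batch8 categorize_sse2(batch8& out, batch8& lo, batch8& hi, const batch8& in) {
    batch8 l = in & 0xf;   // low nibble indices
    batch8 h = in >> 4;    // high nibble indices
    return out = lookup(lo, l) & lookup(hi, h);
}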
