Improving swizzles #1086

Open · Andersama opened this issue Feb 11, 2025 · 4 comments

@Andersama

Andersama commented Feb 11, 2025

Here's an example program for a table-lookup problem, with assembly, from godbolt:

https://godbolt.org/z/9WP19sfq8

Compare the assembly between the "simdless" versions and the result from the template:

template<typename T>
T categorize(T& out, T& lo, T& hi, const T& in) {
    T l = in & 0xf;            // low nibble indices
    //l = xsimd::swizzle(lo,l);
    T h = in >> 4;             // high nibble indices
    //h = xsimd::swizzle(hi,h);
    // lookup() is the table-lookup helper from the godbolt link above
    return out = lookup(lo,l) & lookup(hi,h);
}

The "simdless" version not only produces less instructions to process the same amount of data, it really is just faster! Compiling with MSVC makes the difference worse.

However, modeling the roughly equivalent SIMD swizzle in scalar code and using it as a callback:

// Scalar stand-in for a 16-byte swizzle: each output byte picks the table entry
// selected by the low nibble of the corresponding index byte.
std::array<uint8_t,16> swizzle_like(const std::array<uint8_t,16> &table, const std::array<uint8_t,16>& idxs) {
    std::array<uint8_t,16> out;
    for (size_t i = 0; i < idxs.size(); i++) {
        out[i] = table[idxs[i]&0xf];
    }
    return out;
}

produces a similar effect.

There's likely an optimization to be had in not using SIMD here except where instructions like pshufb are available.
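For the record, the reason ssse3 matters so much here is that the entire 16-entry byte lookup collapses into a single pshufb. A rough illustration with raw intrinsics (my own sketch, not xsimd's kernel):

#include <tmmintrin.h>  // SSSE3: _mm_shuffle_epi8

// 16-entry byte-table lookup in one instruction. pshufb only uses the low 4 bits
// of each index byte (and zeroes any lane whose index has its high bit set), so
// for indices already masked to 0..15 this is exactly the swizzle_like() above.
__m128i lookup16_ssse3(__m128i table, __m128i idx) {
    return _mm_shuffle_epi8(table, idx);
}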

For context, here are some benchmarks using MSVC:

|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|              325.10 |        3,075,962.91 |    3.0% |      0.01 | `byte lookup (sizzzle_like_but_256_wide)`
|            2,456.13 |          407,144.73 |   13.7% |      0.01 | :wavy_dash: `simd lookup (categorize)` (Unstable with ~288.1 iters. Increase `minEpochIterations` to e.g. b41)

NOTE: The results I'm seeing from godbolt appear to indicate that this is the case when targeting sse2:

call xsimd::batch<unsigned char, xsimd::sse2> categorize<xsimd::batch<unsigned char, xsimd::sse2>>(xsimd::batch<unsigned char, xsimd::sse2>&, xsimd::batch<unsigned char, xsimd::sse2>&, xsimd::batch<unsigned char, xsimd::sse2>&, xsimd::batch<unsigned char, xsimd::sse2> const&)

When sse4.1 is enabled the performance is better. (I could not find a way to enable ssse3 with MSVC.)

|            1,242.30 |          804,958.97 |    0.6% |      0.01 | `simd lookup (categorize)`

I also wanted to benchmark against an emulated or generic batch for comparison, but I'm not sure how to express that type.
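(Side note for anyone reading later: I believe newer xsimd releases ship an "emulated" architecture intended for exactly this kind of scalar baseline; if that's present in the version being used, the type would presumably be spelled something like the snippet below. Treat the name and template parameter as an assumption on my part.)

#include <xsimd/xsimd.hpp>
#include <cstdint>

// Assumption: xsimd (13.1+) provides an emulated architecture tag that lowers
// every batch operation to plain scalar code; a 128-bit "no real SIMD" baseline
// batch for the benchmark would then look like this:
using emulated_batch = xsimd::batch<uint8_t, xsimd::emulated<128>>;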

Edit: using clang-cl I managed to enable what I think is ssse3; for reference, clang's performance is:

|               38.09 |       26,255,319.15 |    0.3% |      0.01 | `byte lookup (ssse3)`
|               40.52 |       24,681,447.96 |    1.5% |      0.01 | `byte lookup (sse2)`

|               13.57 |       73,690,033.38 |    0.0% |      0.01 | `simd lookup (ssse3)`
|              204.76 |        4,883,829.00 |    7.0% |      0.01 | :wavy_dash: `simd lookup (sse2)` (Unstable with ~5,035.6 iters. Increase `minEpochIterations` to e.g. c4b4)

Take note that there's a massive performance difference between sse2 and ssse3 when using clang, such that it actually starts outperforming the byte lookup! (Which is what we'd hope.)

@serge-sans-paille
Contributor

For 'selecting the right kernel when performance allows it', does the approach used in https://godbolt.org/z/vz6GGz8nx help?

@Andersama
Author

Andersama commented Feb 11, 2025

@serge-sans-paille well, I'm not entirely sure? I don't imagine your goal was to show that there'd be a runtime exception when the performance would be poor? If I disable the ssse3 compiler flag, it appears we end up calling into a function that just throws:

categorize_simd(xsimd::batch<unsigned char, xsimd::sse2>&, xsimd::batch<unsigned char, xsimd::sse2>&, xsimd::batch<unsigned char, xsimd::sse2>&, xsimd::batch<unsigned char, xsimd::sse2> const&):
        push    r14
        push    rbx
        push    rax
        mov     edi, 16
        call    __cxa_allocate_exception@PLT
        mov     rbx, rax
        lea     rsi, [rip + .L.str.1]
        mov     rdi, rax
        call    std::runtime_error::runtime_error(char const*)@PLT
        mov     rsi, qword ptr [rip + typeinfo for std::runtime_error@GOTPCREL]
        mov     rdx, qword ptr [rip + std::runtime_error::~runtime_error()@GOTPCREL]
        mov     rdi, rbx
        call    __cxa_throw@PLT
        mov     r14, rax
        mov     rdi, rbx
        call    __cxa_free_exception@PLT
        mov     rdi, r14
        call    _Unwind_Resume@PLT

I think what you're trying to show me is that I should be able to specialize what happens when sse2 or the generic architecture is used. Let me see what I can do from here.
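If I'm reading the link right, the missing piece is a two-argument entry point that forwards the batch's Arch tag, so that normal overload resolution picks the most specific per-arch overload and otherwise falls back to the xsimd::generic one. A minimal sketch of what I think that looks like (assuming xsimd's arch tags derive from xsimd::generic, which is what would make the fallback bind, and that the per-arch overloads below are declared before use):

#include <xsimd/xsimd.hpp>

// Dispatcher: forwards the architecture tag so that the per-arch overloads
// (written against xsimd::kernel::requires_arch<...>) are selected by normal
// overload resolution, with the xsimd::generic overload as the fallback.
template<typename T, typename Arch>
xsimd::batch<T, Arch> lookup(const xsimd::batch<T, Arch>& table, const xsimd::batch<T, Arch>& idxs)
{
    return lookup(table, idxs, Arch{});
}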

@Andersama
Author

Well, I appear to have managed to construct similar output in assembly by doing the following:

https://godbolt.org/z/9ceEWfvc6

// Largest value of the form 2^k - 1 that does not exceed sz,
// e.g. mask_for_size(16) == 0xf.
constexpr size_t mask_for_size(size_t sz) {
    size_t m = ~0ull;
    while (m > sz) {
        m >>= 1;
    }
    return m;
}

template<typename T, typename Arch>
xsimd::batch<T, Arch> lookup(const xsimd::batch<T, Arch>& table, const xsimd::batch<T, Arch>& idxs, xsimd::kernel::requires_arch<xsimd::generic>)
{
    constexpr size_t sz = xsimd::batch<T, Arch>::size;
    constexpr size_t mask = mask_for_size(sz*sizeof(T));
    std::array<T, sz> tbl = {};
    std::array<T, sz> ind = {};
    // spill both batches to arrays, do the lookup element by element, reload
    table.store_unaligned(tbl.data());
    idxs.store_unaligned(ind.data());

    for (size_t i = 0; i < sz; i++) {
        ind[i] = tbl[ind[i]&mask];
    }

    return xsimd::batch<T, Arch>::load_unaligned(ind.data());
}

@Andersama
Author

Andersama commented Feb 11, 2025

Found a variation that improves on the code generated for sse2. The odd part is that when the indices are stored into an array as above, the assembly is identical to the pessimized generic case (at least for clang).

template<typename T, typename Arch>
xsimd::batch<T, Arch> lookup(const xsimd::batch<T, Arch>& table, const xsimd::batch<T, Arch>& idxs, xsimd::kernel::requires_arch<xsimd::sse2>)
{
    constexpr size_t sz = xsimd::batch<T, Arch>::size;
    constexpr size_t mask = mask_for_size(sz*sizeof(T));
    std::array<T, sz> tbl;
    std::array<T, sz> out;
    table.store_unaligned(tbl.data());
    xsimd::batch<T, Arch> idxs_masked = idxs & mask;

    for (size_t i = 0; i < sz; i++) {
        // unclear why accessing the bytes from the simd register instead of a
        // stored array generates better assembly... but it works
        out[i] = tbl[idxs_masked.data[i]];
    }

    return xsimd::batch<T, Arch>::load_unaligned(out.data());
}

May be worth using as a uint8_t swizzle for sse2.
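For anyone wanting to try it: with a two-argument dispatcher like the one sketched earlier in scope, the original categorize() template from the top of the issue should pick this overload up unchanged (a sketch, assuming that dispatcher and the sse2 overload above are both visible):

#include <xsimd/xsimd.hpp>
#include <cstdint>

// With the dispatcher in scope, overload resolution lands in the sse2 lookup
// overload above; on ssse3 and up, a pshufb-based overload could be added the
// same way without touching this code.
using batch8 = xsimd::batch<uint8_t, xsimd::sse2>;

batch8 categorize_sse2(batch8& out, batch8& lo, batch8& hi, const batch8& in) {
    batch8 l = in & 0xf;   // low nibble indices
    batch8 h = in >> 4;    // high nibble indices
    return out = lookup(lo, l) & lookup(hi, h);
}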
