chore: enable SIMD optimizations for aarch64 (#5150)

Enable SIMD optimizations for ASCII pack and unpack on aarch64.
Also optimize scalar unpack for both x86 and aarch64.
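For context, the packing idea: ASCII bytes use only 7 significant bits, so 8 input characters fit into 7 output bytes. Below is a minimal scalar sketch assuming a little-endian host and a little-endian bit layout; `PackAscii8` is a hypothetical name for illustration, not necessarily Dragonfly's actual routine or bit order.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical sketch (not Dragonfly's exact code): pack 8 ASCII bytes
// (7 significant bits each) into 7 output bytes by concatenating the
// 7-bit values into a 56-bit word. Assumes every src[i] is <= 0x7F and
// a little-endian host.
inline void PackAscii8(const char* src, uint8_t* dest) {
  uint64_t word = 0;
  for (unsigned i = 0; i < 8; ++i) {
    word |= uint64_t(uint8_t(src[i])) << (7 * i);  // 7 bits per character
  }
  std::memcpy(dest, &word, 7);  // only the low 56 bits carry data
}
```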
The bug fix in #5140 forced us to load chunks of 7 bytes during unpacking,
which greatly degraded the performance of scalar unpack. We therefore
switched to the "naive" byte-by-byte implementation, which is actually
faster than 7-byte loads on both x86 and aarch64.
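A sketch of that byte-by-byte idea, under the same hypothetical bit layout as above (`UnpackAscii8` is an illustrative name, not Dragonfly's actual routine): each 7-bit group may straddle a byte boundary, and the loop reads individual bytes rather than doing a single multi-byte load, so it never reads past the 7 packed input bytes.

```cpp
#include <cstdint>

// Hypothetical sketch of the "naive" byte-by-byte unpack: rebuild 8 ASCII
// characters from 7 packed bytes without any multi-byte load, so no read
// goes past the end of the input.
inline void UnpackAscii8(const uint8_t* src, char* dest) {
  for (unsigned i = 0; i < 8; ++i) {
    unsigned bit = 7 * i;  // bit offset of the i-th 7-bit group
    unsigned byte = bit / 8, shift = bit % 8;
    unsigned v = src[byte] >> shift;
    if (shift > 1)  // group straddles into the next byte
      v |= unsigned(src[byte + 1]) << (8 - shift);
    dest[i] = char(v & 0x7F);
  }
}
```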
On c4a (aarch64):
Benchmarks before:
------------------------------------------------------------
Benchmark               Time             CPU   Iterations
------------------------------------------------------------
BM_PackNaive          222 ns          222 ns     18936335
BM_Pack               222 ns          222 ns     18956309
BM_Pack2              222 ns          222 ns     18951694
BM_PackSimd           220 ns          220 ns     19103906
BM_PackSimd2          223 ns          223 ns     18861252
BM_UnpackNaive        229 ns          229 ns     18228081
BM_Unpack             743 ns          743 ns      5643824
BM_UnpackSimd         744 ns          744 ns      5648469
Benchmarks after:
------------------------------------------------------------
Benchmark               Time             CPU   Iterations
------------------------------------------------------------
BM_PackNaive          221 ns          221 ns     18971332
BM_Pack               222 ns          221 ns     18963948
BM_PackSimd          97.2 ns         97.2 ns     43226095
BM_PackSimd2         96.6 ns         96.6 ns     43491371
BM_Unpack             228 ns          228 ns     18397585
BM_UnpackSimd         101 ns          101 ns     41733901
We improved scalar unpack roughly 3x (743 ns → 228 ns), vectorized unpack
roughly 7x (744 ns → 101 ns), and vectorized pack roughly 2x (220 ns → 97 ns).
Signed-off-by: Roman Gershman <roman@dragonflydb.io>