-
Notifications
You must be signed in to change notification settings - Fork 25.3k
libvec: unroll pragma and push stride down #107460
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Pinging @elastic/es-search (Team:Search) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per offline discussion just need some handling for gcc, otherwise LGTM.
@@ -35,6 +27,7 @@ EXPORT int32_t dot8s(int8_t* a, int8_t* b, size_t dims) { | |||
int32x4_t acc4 = vdupq_n_s32(0); | |||
|
|||
// Some unrolling gives around 50% performance improvement. | |||
#pragma clang loop unroll_count(2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps it might be worth tweaking the comment a bit, i.e. accumulating into multiple registers gives around 50%, and unroll directive gives around 5%. I think otherwise it is ambiguous.
int32_t res = 0; | ||
int i = 0; | ||
if (dims > DOT8_STRIDE_BYTES_LEN) { | ||
i += dims & ~(DOT8_STRIDE_BYTES_LEN - 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This only works if DOT8_STRIDE_BYTES_LEN is a power of 2. Of course it will be. Perhaps it is worth enforcing this with a static_assert somewhere so people don't accidentally break it though, i.e. static_assert((1031 & ~(DOT8_STRIDE_BYTES_LEN - 1)) == (1031 - 1031 % DOT8_STRIDE_BYTES_LEN), "Invalid DOT8_STRIDE_BYTES_LEN must be a power of 2");
. Note this can be anywhere in the source file, so you don't need to put in any actual function definition (although it shouldn't actually generate any code).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++
I think you can even make this a compile time error (not sure if you need to switch to c++ for that though)
Co-authored-by: Tom Veasey <tveasey@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
As you know, I like this approach of having the native code "self-contained", with the tail computed in the native part (and the stride as an internal implementation detail).
int32_t res = 0; | ||
int i = 0; | ||
if (dims > DOT8_STRIDE_BYTES_LEN) { | ||
i += dims & ~(DOT8_STRIDE_BYTES_LEN - 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++
I think you can even make this a compile time error (not sure if you need to switch to c++ for that though)
i += dims & ~(SQR8S_STRIDE_BYTES_LEN - 1); | ||
res = sqr8s_inner(a, b, i); | ||
} | ||
for (; i < dims; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you can try and unroll this loop too?
Pinging @elastic/es-search-relevance (Team:Search Relevance) |
This commit adds an unroll pragma and pushes stride down. Overall we squeeze about another 4-5% out of these native implementations. Pushing the stride down allows future implementations, namely AVX2 and AVX 512, to choose their own stride.