libvec: unroll pragma and push stride down #107460

Open

ChrisHegarty wants to merge 4 commits into elastic:main from ChrisHegarty:native_refactor

Contributor

ChrisHegarty commented Apr 15, 2024

This commit adds an unroll pragma and pushes the stride down into the native code. Overall we squeeze about another 4-5% out of these native implementations. Pushing the stride down allows future implementations, namely AVX2 and AVX-512, to choose their own stride.


          libvec: unroll pragma and push stride down

b9b7825

ChrisHegarty added >refactoring Team:Search :Search Relevance/Vectors labels

ChrisHegarty requested a review from tveasey

April 15, 2024 10:34

ChrisHegarty requested review from a team as code owners

April 15, 2024 10:34

elasticsearchmachine added the v8.14.0 label

Collaborator

elasticsearchmachine commented Apr 15, 2024

Pinging @elastic/es-search (Team:Search)

ChrisHegarty added 2 commits

April 15, 2024 11:37


          bump native lib version

dd3b9b5


          bump version

b86d1f3

ChrisHegarty added test-windows test-arm labels

tveasey approved these changes

Contributor

tveasey left a comment

As per offline discussion, this just needs some handling for gcc; otherwise LGTM.
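One possible shape for the gcc handling, as a sketch under assumptions rather than the change made in this PR: `#pragma clang loop unroll_count(2)` is Clang-specific, while GCC 8 and later accept `#pragma GCC unroll N`. The macro name `UNROLL_BY_2` and the scalar function `dot8s_scalar` below are hypothetical and used only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper macro: pick the unroll hint the current compiler understands.
// Clang also defines __GNUC__, so it is checked first.
#if defined(__clang__)
#define UNROLL_BY_2 _Pragma("clang loop unroll_count(2)")
#elif defined(__GNUC__) && __GNUC__ >= 8
#define UNROLL_BY_2 _Pragma("GCC unroll 2")
#else
#define UNROLL_BY_2 /* no unroll hint available */
#endif

// Scalar stand-in for the real kernel, only to show where the hint goes:
// the pragma must sit directly in front of the loop it applies to.
int32_t dot8s_scalar(const int8_t* a, const int8_t* b, size_t dims) {
    int32_t res = 0;
    UNROLL_BY_2
    for (size_t i = 0; i < dims; i++) {
        res += a[i] * b[i];
    }
    return res;
}
```

The hint only affects code generation, so the result is the same on every compiler, including those where the macro expands to nothing.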

libs/vec/native/src/vec/c/vec.c

@@ -35,6 +27,7 @@ EXPORT int32_t dot8s(int8_t* a, int8_t* b, size_t dims) {
                   int32x4_t acc4 = vdupq_n_s32(0);
                   // Some unrolling gives around 50% performance improvement.
+                  #pragma clang loop unroll_count(2)

Contributor

tveasey Apr 15, 2024

Perhaps it is worth tweaking the comment a bit, i.e. accumulating into multiple registers gives around 50%, and the unroll directive gives around 5%. Otherwise I think it is ambiguous.
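To illustrate the distinction drawn in this comment, here is a minimal scalar sketch of "accumulating into multiple registers". It assumes nothing about the actual NEON kernel beyond the `acc4` accumulator visible in the diff; the function name `dot8s_two_acc` is hypothetical. The point is that two independent accumulators break the dependency chain between iterations, which is a different optimization from asking the compiler to unroll a loop with a pragma.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical scalar illustration: two independent accumulators, so the
// add for element i + 1 does not have to wait for the add for element i.
int32_t dot8s_two_acc(const int8_t* a, const int8_t* b, size_t dims) {
    int32_t acc0 = 0;
    int32_t acc1 = 0;
    size_t i = 0;
    for (; i + 2 <= dims; i += 2) {
        acc0 += a[i] * b[i];
        acc1 += a[i + 1] * b[i + 1];
    }
    // Odd element count: fold the last element into the first accumulator.
    if (i < dims) {
        acc0 += a[i] * b[i];
    }
    return acc0 + acc1;
}
```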

libs/vec/native/src/vec/c/vec.c

+                  int32_t res = 0;
+                  int i = 0;
+                  if (dims > DOT8_STRIDE_BYTES_LEN) {
+                      i += dims & ~(DOT8_STRIDE_BYTES_LEN - 1);

Contributor

tveasey Apr 15, 2024

This only works if DOT8_STRIDE_BYTES_LEN is a power of 2. Of course it will be, but it may be worth enforcing this with a static_assert somewhere so people don't accidentally break it, e.g. static_assert((1031 & ~(DOT8_STRIDE_BYTES_LEN - 1)) == (1031 - 1031 % DOT8_STRIDE_BYTES_LEN), "Invalid DOT8_STRIDE_BYTES_LEN must be a power of 2");. Note that this can go anywhere in the source file, so you don't need to put it in any actual function definition (although it shouldn't actually generate any code).

Contributor

ldematte Apr 22, 2024

++
I think you can even make this a compile-time error (not sure if you need to switch to C++ for that though)
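For reference, a sketch of how the check could look in plain C: C11 provides `static_assert` through `<assert.h>` (as a macro for `_Static_assert`), and a failing assertion is already a compile-time error, so no switch to C++ is needed. The value 32 for `DOT8_STRIDE_BYTES_LEN` and the helper `round_down_to_stride` are assumptions made for illustration; the real value lives in vec.c.

```c
#include <assert.h>  // provides the static_assert macro in C11
#include <stddef.h>

#define DOT8_STRIDE_BYTES_LEN 32  // assumed value, for illustration only

// File-scope check: compilation fails if the stride is not a power of 2.
static_assert(DOT8_STRIDE_BYTES_LEN > 0 &&
              (DOT8_STRIDE_BYTES_LEN & (DOT8_STRIDE_BYTES_LEN - 1)) == 0,
              "DOT8_STRIDE_BYTES_LEN must be a power of 2");

// Hypothetical helper: the mask trick from the diff, which rounds dims down
// to a multiple of the stride only when the stride is a power of 2.
size_t round_down_to_stride(size_t dims) {
    return dims & ~(size_t)(DOT8_STRIDE_BYTES_LEN - 1);
}
```

The `x & (x - 1)` form tests the power-of-2 property directly, whereas the 1031-based expression in the comment above checks it indirectly through one sample value.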


          Update libs/vec/native/src/vec/c/vec.c

ce7c2f5

Co-authored-by: Tom Veasey <tveasey@users.noreply.github.com>

elasticsearchmachine added v8.15.0 and removed v8.14.0 labels

ldematte approved these changes

Contributor

ldematte left a comment

LGTM
As you know, I like this approach of having the native code "self-contained", with the tail computed in the native part (and the stride as an internal implementation detail).
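The "self-contained" shape described here can be sketched roughly as follows. This is an illustration under assumptions, not the code from the PR: the stride value 16 and the names `STRIDE_BYTES_LEN`, `dot8s_inner_sketch`, and `dot8s_sketch` are hypothetical, and the inner function is a scalar stand-in for the vectorized kernel. The caller passes any `dims`; the native side splits it into a stride-aligned bulk part and a scalar tail.

```c
#include <stddef.h>
#include <stdint.h>

#define STRIDE_BYTES_LEN 16  // hypothetical; must be a power of 2

// Stand-in for the vectorized kernel; expects n to be a multiple of the stride.
static int32_t dot8s_inner_sketch(const int8_t* a, const int8_t* b, size_t n) {
    int32_t res = 0;
    for (size_t i = 0; i < n; i++) {
        res += a[i] * b[i];
    }
    return res;
}

// Entry point: the stride is an internal detail, callers pass any dims.
int32_t dot8s_sketch(const int8_t* a, const int8_t* b, size_t dims) {
    int32_t res = 0;
    size_t i = 0;
    if (dims > STRIDE_BYTES_LEN) {
        // Round dims down to a multiple of the stride for the bulk part.
        i = dims & ~(size_t)(STRIDE_BYTES_LEN - 1);
        res = dot8s_inner_sketch(a, b, i);
    }
    // Scalar tail for the remaining elements.
    for (; i < dims; i++) {
        res += a[i] * b[i];
    }
    return res;
}
```

Because the stride never crosses the native boundary, a different implementation can pick a different stride without any change on the calling side.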

libs/vec/native/src/vec/c/vec.c

+                      i += dims & ~(SQR8S_STRIDE_BYTES_LEN - 1);
+                      res = sqr8s_inner(a, b, i);
+                  }
+                  for (; i < dims; i++) {

Contributor

ldematte Apr 22, 2024

Maybe you can try and unroll this loop too?

elasticsearchmachine added v8.16.0 and removed v8.15.0 labels

javanna added Team:Search Relevance and removed Team:Search labels

Collaborator

elasticsearchmachine commented Jul 16, 2024

Pinging @elastic/es-search-relevance (Team:Search Relevance)

mark-vieira added v9.0.0 and removed v8.16.0 labels

elasticsearchmachine added v9.1.0 and removed v9.0.0 labels

elasticsearchmachine added v9.2.0 and removed v9.1.0 labels

Labels

>refactoring :Search Relevance/Vectors Team:Search Relevance test-arm test-windows v9.2.0