Skip to content

libvec: unroll pragma and push stride down #107460

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

ChrisHegarty
Copy link
Contributor

This commit adds an unroll pragma and pushes stride down. Overall we squeeze about another 4-5% out of these native implementations. Pushing the stride down allows future implementations, namely AVX2 and AVX 512, to choose their own stride.

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search (Team:Search)

@ChrisHegarty ChrisHegarty added test-windows Trigger CI checks on Windows test-arm Pull Requests that should be tested against arm agents labels Apr 15, 2024
Copy link
Contributor

@tveasey tveasey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per offline discussion just need some handling for gcc, otherwise LGTM.

@@ -35,6 +27,7 @@ EXPORT int32_t dot8s(int8_t* a, int8_t* b, size_t dims) {
int32x4_t acc4 = vdupq_n_s32(0);

// Some unrolling gives around 50% performance improvement.
#pragma clang loop unroll_count(2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps it might be worth tweaking the comment a bit, i.e. accumulating into multiple registers gives around 50%, and unroll directive gives around 5%. I think otherwise it is ambiguous.

int32_t res = 0;
int i = 0;
if (dims > DOT8_STRIDE_BYTES_LEN) {
i += dims & ~(DOT8_STRIDE_BYTES_LEN - 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only works if DOT8_STRIDE_BYTES_LEN is a power of 2. Of course it will be. Perhaps it is worth enforcing this with a static_assert somewhere so people don't accidentally break it though, i.e. static_assert((1031 & ~(DOT8_STRIDE_BYTES_LEN - 1)) == (1031 - 1031 % DOT8_STRIDE_BYTES_LEN), "Invalid DOT8_STRIDE_BYTES_LEN must be a power of 2");. Note this can be anywhere in the source file, so you don't need to put in any actual function definition (although it shouldn't actually generate any code).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++
I think you can even make this a compile time error (not sure if you need to switch to c++ for that though)

Co-authored-by: Tom Veasey <tveasey@users.noreply.github.com>
Copy link
Contributor

@ldematte ldematte left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
As you know, I like this approach of having the native code "self-contained", with the tail computed in the native part (and the stride as an internal implementation detail).

int32_t res = 0;
int i = 0;
if (dims > DOT8_STRIDE_BYTES_LEN) {
i += dims & ~(DOT8_STRIDE_BYTES_LEN - 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++
I think you can even make this a compile time error (not sure if you need to switch to c++ for that though)

i += dims & ~(SQR8S_STRIDE_BYTES_LEN - 1);
res = sqr8s_inner(a, b, i);
}
for (; i < dims; i++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you can try and unroll this loop too?

@javanna javanna added Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch and removed Team:Search Meta label for search team labels Jul 16, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>refactoring :Search Relevance/Vectors Vector search Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch test-arm Pull Requests that should be tested against arm agents test-windows Trigger CI checks on Windows v9.2.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants