Term vector API on stateless search nodes #129902


Merged
merged 23 commits on Jul 22, 2025

Conversation


@kingherc kingherc commented Jun 24, 2025

Up to now, (m)term vector API real-time requests were being executed on the indexing nodes of serverless (see #94257). However, we would like to execute them on the search nodes, similar to real-time (m)GETs. This PR does that by introducing an intermediate action that brings a search node up to date with an indexing node with respect to the term vector API request, before executing it locally on the search node.

The new intermediate action searches for any of the requested document IDs in the shard's LiveVersionMap; if it finds any of them there, it means the search nodes need to be refreshed in order to capture the new document IDs before searching for them.
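The check described above can be sketched in miniature. This is an illustrative stand-in, not the real Elasticsearch classes: the class name, `index`/`refresh` methods, and the map-of-doc-IDs model of the LiveVersionMap are all simplifications for the purpose of showing the decision logic.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the intermediate action's core check.
public class LiveVersionMapCheck {
    // Stand-in for the shard's LiveVersionMap: doc ID -> version of a
    // document that has been indexed but not yet made searchable by a refresh.
    private final Map<String, Long> liveVersionMap = new ConcurrentHashMap<>();

    public void index(String docId, long version) {
        liveVersionMap.put(docId, version);
    }

    // A refresh makes pending docs searchable; the map entries are drained.
    public void refresh() {
        liveVersionMap.clear();
    }

    // If any requested doc ID is still in the live version map, the search
    // copies must be refreshed before the term vector request can be served
    // from a search node. Returns whether a refresh was triggered.
    public boolean ensureDocsSearchable(List<String> requestedDocIds) {
        boolean needsRefresh = requestedDocIds.stream().anyMatch(liveVersionMap::containsKey);
        if (needsRefresh) {
            refresh();
        }
        return needsRefresh;
    }

    public static void main(String[] args) {
        LiveVersionMapCheck shard = new LiveVersionMapCheck();
        shard.index("doc-1", 1L);
        // "doc-1" is pending a refresh, so the check triggers one.
        System.out.println(shard.ensureDocsSearchable(List.of("doc-1", "doc-2")));
        // Nothing is pending any more; no refresh is needed.
        System.out.println(shard.ensureDocsSearchable(List.of("doc-1")));
    }
}
```

The point of the check is that documents absent from the map are already searchable, so the common case (no recent writes to the requested IDs) avoids a refresh entirely.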

Relates ES-12112

@kingherc kingherc self-assigned this Jun 24, 2025
@kingherc kingherc added >non-issue :Search/Search Search-related issues that do not fall into other categories :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. Team:Search Meta label for search team Team:Distributed Indexing Meta label for Distributed Indexing team labels Jun 24, 2025
@elasticsearchmachine elasticsearchmachine added v9.1.0 serverless-linked Added by automation, don't add manually labels Jun 24, 2025
@kingherc kingherc force-pushed the non-issue/ES-12112-termvectors branch 3 times, most recently from 59876ce to dd19b4c Compare June 24, 2025 11:49
@kingherc kingherc force-pushed the non-issue/ES-12112-termvectors branch from dd19b4c to ab4e1ed Compare June 30, 2025 11:01
@kingherc kingherc force-pushed the non-issue/ES-12112-termvectors branch from 82434e3 to e3b6a4b Compare July 1, 2025 15:23
@kingherc kingherc marked this pull request as ready for review July 1, 2025 17:05
@elasticsearchmachine (Collaborator):

Pinging @elastic/es-search (Team:Search)

@elasticsearchmachine (Collaborator):

Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)

@kingherc kingherc requested review from fcofdez and tlrx July 1, 2025 17:06
);
);
if (iterator == null) {
return null;
Member:

So it will execute on the indexing shard in that case?

Contributor:

I think that it'll execute the request in the receiving node. I think that we should return an empty iterator instead.

Contributor Author:

I copied this from TransportGetAction#shards(), which does the same to send the requests to the searchable shards. And there's this old discussion which seems to have concluded to keep it.

However, I do see that in case it's null, the request will be executed locally, which may be on a node that does not contain the shard (and thus might cause an NPE / shard-not-found), or, worst case, on an indexing node (e.g., if all search nodes are down and the proxy sends it to an indexing node), which may try to execute it locally, meaning it may later search a possibly hollow indexing shard and hit a weird exception (a customer may see something like "cannot search a hollow shard").

So all in all, I agree we should rather return an empty iterator, meaning it will ultimately result in a shard-not-available exception (which seems better and more usual for end users).

I changed it, but the question remains for you whether the same should be done for real-time GETs?
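The null-vs-empty distinction discussed here can be illustrated with a small sketch. This is not the real TransportSingleShardAction code; the method and class names are illustrative, and the strings stand in for the actual dispatch and failure paths.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch contrasting a null shard iterator (falls through to
// local execution) with an empty iterator (fails fast with a clear error).
public class ShardRouting {

    // Old behavior being discussed: null means "run locally on the receiving node".
    static Iterator<String> shardsOrNull(List<String> searchCopies) {
        return searchCopies.isEmpty() ? null : searchCopies.iterator();
    }

    // Proposed behavior: an empty iterator means "no copy available",
    // so the request fails cleanly instead of running on the wrong node.
    static Iterator<String> shardsOrEmpty(List<String> searchCopies) {
        return searchCopies.isEmpty() ? Collections.emptyIterator() : searchCopies.iterator();
    }

    static String dispatch(Iterator<String> shards) {
        if (shards == null) {
            return "executed locally (may hit a hollow indexing shard)";
        }
        if (!shards.hasNext()) {
            return "NoShardAvailableActionException";
        }
        return "sent to " + shards.next();
    }

    public static void main(String[] args) {
        List<String> noCopies = List.of();
        System.out.println(dispatch(shardsOrNull(noCopies)));   // local fallback
        System.out.println(dispatch(shardsOrEmpty(noCopies)));  // clean failure
    }
}
```

The empty iterator converts a confusing internal error into the usual shard-not-available failure mode, which is the trade-off the comment argues for.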

ShardId shardId,
ActionListener<MultiTermVectorsShardResponse> listener
) throws IOException {
if (DiscoveryNode.isStateless(clusterService.getSettings())) {
Member:

I suppose we can capture this only once instead of reevaluating for every request?

Contributor Author:

I put it in a private variable evaluated in the constructor. Hopefully that's what you meant.
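The refactor described in this exchange (hoisting a per-request check into a constructor-initialized field) can be sketched as follows. The class name, setting key, and return strings are illustrative assumptions, not the real Elasticsearch code.

```java
import java.util.Map;

// Minimal sketch: capture a node-level setting once in the constructor
// instead of re-evaluating it on every request.
public class StatelessAwareAction {
    // Node settings don't change during the node's lifetime,
    // so this is safe to compute once.
    private final boolean stateless;

    public StatelessAwareAction(Map<String, String> nodeSettings) {
        this.stateless = Boolean.parseBoolean(nodeSettings.getOrDefault("stateless.enabled", "false"));
    }

    public String handleRequest() {
        // Per-request code just reads the precomputed field.
        return stateless
            ? "ensure-docs-searchable then execute on search shard"
            : "execute on searchable shard directly";
    }

    public static void main(String[] args) {
        System.out.println(new StatelessAwareAction(Map.of("stateless.enabled", "true")).handleRequest());
    }
}
```

Since the branch taken never changes for a given node, evaluating it once also makes the hot path trivially predictable.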

@fcofdez fcofdez (Contributor) left a comment:

Looks good! I left a few comments

@@ -7,11 +7,6 @@ routing:
settings:
index:
number_of_shards: 5
number_of_replicas: 0

- do:
Contributor:

is this removed intentionally?

Contributor Author:

Yes, because it won't work in serverless, as there'll be no search shard to execute the new term vectors API. In serverless we force having a search node, and we'd like a search shard.

So this way it works in both stateful and stateless.

Contributor:

I wonder if the test would become flaky now that we don't wait for the cluster to go green, that's why I was asking.

Contributor Author:

To wait until the search shard is definitely up and running, I incorporated the following piece of code

 - do:
     cluster.health:
       wait_for_no_initializing_shards: true

which is copied from #114641, which solved a similar issue to make the tests work in both ES and serverless.

I also ran it 10 times locally, both core ES and serverless, and it succeeds. Feel free to tell me if you have more feedback.

assert DiscoveryNode.hasRole(clusterService.getSettings(), DiscoveryNodeRole.INDEX_ROLE)
: ACTION_NAME + " should only be executed on a stateless indexing node";
logger.debug("received request with {} docs", request.docIds.length);
getExecutor(shardId).execute(() -> ActionListener.run(listener, l -> {
Contributor:

are we ok failing the request if the primary moved in the meantime?

Contributor Author:

I think so, because the previous state of the code also allows the request to fail in case shards move. Specifically:

  • In stateless, the action only goes to the primary shard. If we look, for example, at TransportTermVectorsAction#asyncShardOperation(), it accesses indicesService.indexServiceSafe(), which will throw if the primary has moved. TransportSingleShardAction will then fail the request with NoShardAvailableActionException.
  • In stateful, the action iterates over all searchable shards, including the primary and replicas. Each time a failure is encountered (e.g., if the shard moved), it will try the next shard. But it's possible with 1 primary and 1 replica that both move around the same time, and TransportSingleShardAction will then fail the request with NoShardAvailableActionException.

are we ok failing the request if the primary moved in the meantime?

So yes, requests already could fail. But this PR makes the stateless behavior a bit worse, because we're doubling the chances that a shard is not found (first the search shard may have moved, and then the primary shard may have moved). It does not "break" the premise that the request may fail, though. Do you think we should do something more, like a reroute phase, and should it be done only in serverless or also in stateful (if all shards move)? Maybe it can be an amendment ticket for the future.
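The stateful fallback behavior described above (try each shard copy in turn; fail only when every copy has failed) can be sketched like this. The names are illustrative, not the real TransportSingleShardAction implementation.

```java
import java.util.List;
import java.util.function.Function;

// Hedged sketch of iterating over shard copies with fallback: each failure
// (e.g., a shard that moved) advances to the next copy, and only once all
// copies are exhausted does the request fail with a "no shard available" error.
public class SingleShardFallback {

    static <T> T executeWithFallback(List<String> copies, Function<String, T> perCopy) {
        RuntimeException last = null;
        for (String copy : copies) {
            try {
                return perCopy.apply(copy);
            } catch (RuntimeException e) {
                last = e; // shard moved / not found: try the next copy
            }
        }
        throw new RuntimeException("NoShardAvailableActionException", last);
    }

    public static void main(String[] args) {
        // The primary has moved; the replica still serves the request.
        String result = executeWithFallback(List.of("primary", "replica"), copy -> {
            if (copy.equals("primary")) {
                throw new RuntimeException("shard moved");
            }
            return "served by " + copy;
        });
        System.out.println(result);
    }
}
```

This also makes the failure scenario in the comment concrete: with one primary and one replica, the request only fails outright when both copies throw during the same attempt.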

@kingherc kingherc (Contributor Author) left a comment:

Thanks @tlrx and @fcofdez for the comments! Finally incorporated them, so feel free to review again.


@kingherc kingherc requested review from tlrx and fcofdez July 18, 2025 10:17
@fcofdez fcofdez (Contributor) left a comment:

LGTM. My only outstanding comment is around the yaml tests.


@kingherc kingherc (Contributor Author) left a comment:

Thanks @fcofdez! Please take another look, and see the new origin string.


@@ -132,6 +133,7 @@ public static void switchUserBasedOnActionOriginAndExecute(
case SECURITY_PROFILE_ORIGIN:
securityContext.executeAsInternalUser(InternalUsers.SECURITY_PROFILE_USER, version, consumer);
break;
case ENSURE_DOCS_SEARCHABLE_ORIGIN:
Contributor Author:

Hi @fcofdez , please review this new addition to the PR. It follows a similar approach as is used in PostWriteRefresh.java to send unpromotable refreshes.

Contributor:

I'm not an expert on the auth/security, but this looks reasonable to me.

@kingherc kingherc merged commit 238b9e1 into elastic:main Jul 22, 2025
33 checks passed
szybia added a commit to szybia/elasticsearch that referenced this pull request Jul 22, 2025
…king

* upstream/main: (100 commits)
  Term vector API on stateless search nodes (elastic#129902)
  TEST Fix ThreadPoolMergeSchedulerStressTestIT testMergingFallsBehindAndThenCatchesUp (elastic#131636)
  Add inference.put_custom rest-api-spec (elastic#131660)
  ESQL: Fewer serverless docs in tests (elastic#131651)
  Skip search on indices with INDEX_REFRESH_BLOCK (elastic#129132)
  Mute org.elasticsearch.indices.cluster.RemoteSearchForceConnectTimeoutIT testTimeoutSetting elastic#131656
  [jdk] Resolve EA OpenJDK builds to our JDK archive (elastic#131237)
  Add optimized path for intermediate values aggregator (elastic#131390)
  Correctly handling download_database_on_pipeline_creation within a pipeline processor within a default or final pipeline (elastic#131236)
  Refresh potential lost connections at query start for `_search` (elastic#130463)
  Add template_id to patterned-text type (elastic#131401)
  Integrate LIKE/RLIKE LIST with ReplaceStringCasingWithInsensitiveRegexMatch rule (elastic#131531)
  [ES|QL] Add doc for the COMPLETION command (elastic#131010)
  ESQL: Add times to topn status (elastic#131555)
  ESQL: Add asynchronous pre-optimization step for logical plan (elastic#131440)
  ES|QL: Improve generative tests for FORK [130015] (elastic#131206)
  Update index mapping update privileges (elastic#130894)
  ESQL: Added Sample operator NamedWritable to plugin (elastic#131541)
  update `kibana_system` to grant it access to `.chat-*` system index (elastic#131419)
  Clarify heap size configuration (elastic#131607)
  ...
szybia added a commit to szybia/elasticsearch that referenced this pull request Jul 22, 2025
…-tracking

* upstream/main: (44 commits, same list as above)
Labels
:Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. >non-issue :Search/Search Search-related issues that do not fall into other categories serverless-linked Added by automation, don't add manually Team:Distributed Indexing Meta label for Distributed Indexing team Team:Search Meta label for search team v9.2.0