
Conversation

nagarajg17
Contributor

@nagarajg17 commented Jul 15, 2025

Description


  • Category: Bug Fix

  • Why are these changes required?
    In #5307, functionality was introduced to update the cache from the security index after the index was restored from a snapshot. However, the reload ran only on nodes holding the primary shard. This PR changes it to reload the cache on all nodes.

  • What is the old behavior before changes and new behavior after changes?
    The cache update ran only on nodes holding the primary shard of the security index. Now it runs on all nodes.

Issues Resolved

#5472

Is this a backport? If so, please add backport PR # and/or commits #, and remove backport-failed label from the original PR.

Do these changes introduce new permission(s) to be displayed in the static dropdown on the front-end? If so, please open a draft PR in the security dashboards plugin and link the draft PR here

Testing

Tested with a 2-node setup on a single machine (node configurations below).

  1. Added myuser1 with permissions to read index my-index-000001
  2. Took a snapshot of the security index
  3. Added myuser2 with permissions to read index my-index-000001
  4. Verified myuser2 was able to read the index from both nodes
  5. Deleted the security index
  6. Restored it from the snapshot
  7. Verified the cache was updated on both nodes
  8. Verified myuser2 was no longer able to read the index from either node
# Node 1
cluster.name: test-cluster
node.name: node-1
node.roles: [ master, data ]
network.host: 0.0.0.0
http.port: 9200
transport.port: 9300
discovery.seed_hosts: ["localhost:9300", "localhost:9301"]
cluster.initial_master_nodes: ["node-1"]
# Node 2
cluster.name: test-cluster
node.name: node-3
node.roles: [ data ]
network.host: 0.0.0.0
http.port: 9201
transport.port: 9301
discovery.seed_hosts: ["localhost:9300", "localhost:9301"]
cluster.initial_master_nodes: ["node-1"]

Check List

  • New functionality includes testing
  • New functionality has been documented
  • New Roles/Permissions have a corresponding security dashboards plugin PR
  • API changes companion pull request created
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.


codecov bot commented Jul 15, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.99%. Comparing base (16993b4) to head (9784c01).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files


@@            Coverage Diff             @@
##             main    #5478      +/-   ##
==========================================
- Coverage   73.02%   72.99%   -0.03%     
==========================================
  Files         408      408              
  Lines       25293    25314      +21     
  Branches     3854     3854              
==========================================
+ Hits        18469    18478       +9     
- Misses       4952     4963      +11     
- Partials     1872     1873       +1     
Files with missing lines Coverage Δ
...ecurity/configuration/ConfigurationRepository.java 83.82% <100.00%> (+0.67%) ⬆️

... and 3 files with indirect coverage changes


@nagarajg17 marked this pull request as ready for review July 15, 2025 14:48
@@ -681,11 +681,10 @@ public void afterIndexShardStarted(IndexShard indexShard) {

         // Check if this is a security index shard
         if (securityIndex.equals(index.getName())) {
-            // Only trigger on primary shard to avoid multiple reloads
-            if (indexShard.routingEntry() != null && indexShard.routingEntry().primary()) {
+            if (indexShard.routingEntry() != null) {
                 threadPool.generic().execute(() -> {
Member

Let's make sure this gets updated when we have a dedicated threadpool for security updates.
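
For reference, a minimal sketch of what registering such a dedicated threadpool could look like in the plugin; the pool name and sizing here are assumptions, not an agreed design:

    // Sketch only: a dedicated fixed-size threadpool for security config updates,
    // registered from the plugin. Pool name and sizing are assumptions.
    @Override
    public List<ExecutorBuilder<?>> getExecutorBuilders(Settings settings) {
        return List.of(
            new FixedExecutorBuilder(settings, "security_config_update", 1, 1000, "thread_pool.security_config_update")
        );
    }

    // The reload above would then use
    //   threadPool.executor("security_config_update").execute(...)
    // instead of threadPool.generic().execute(...).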

Contributor Author

Ack

cwperks previously approved these changes Jul 15, 2025
@nibix left a comment (Collaborator)

I have doubts that this will fix the issue.

Of course, I might be wrong here. However, in any case, a feature that relies on such specific cluster behavior should always come with an integration test that exercises it on a real cluster with nodes assuming different roles.

Just having a unit test that mocks dependencies won't be sufficient, as those mocks only encode the developer's assumptions, which might simply be wrong.

@willyborankin
Collaborator

@nagarajg17, @cwperks, and @nibix — there's a SnapshotRestoreHelper class that manages and provides information about the restoration process. I think it makes sense to use this class and reload the configuration once the restoration is complete.

@cwperks
Member

cwperks commented Jul 15, 2025

@willyborankin The SnapshotRestoreHelper is being used here to determine if the security index is being restored from a snapshot. See https://github.com/opensearch-project/security/blob/main/src/main/java/org/opensearch/security/support/SnapshotRestoreHelper.java#L96-L112

@willyborankin
Collaborator

> @willyborankin The SnapshotRestoreHelper is being used here to determine if the security index is being restored from a snapshot. See https://github.com/opensearch-project/security/blob/main/src/main/java/org/opensearch/security/support/SnapshotRestoreHelper.java#L96-L112

Ahh, I did not notice the static import :-).

Signed-off-by: Nagaraj G <narajg@amazon.com>
* Use isClusterPerm instead of requestedResolved.isLocalAll() to determine if action is a cluster action ([#5445](https://github.com/opensearch-project/security/pull/5445))
* Fix config update with deprecated config types failing in mixed clusters ([#5456](https://github.com/opensearch-project/security/pull/5456))
* Fix usage of jwt_clock_skew_tolerance_seconds in HTTPJwtAuthenticator ([#5506](https://github.com/opensearch-project/security/pull/5506))
* Fix partial cache update post snapshot restore[#5478](https://github.com/opensearch-project/security/pull/5478)
Member

@nagarajg17 can you fix the CHANGELOG here? Otherwise this looks good to me.

@nibix
Collaborator

nibix commented Sep 5, 2025

Thank you for the integration test!

However, apologies once again; I do not feel good about always being the one who spots issues here :-(

I am still not convinced that this works the way it is. I just tried the integration test. Additionally, I added some more logs to afterIndexShardStarted() in ConfigurationRepository:

    @Override
    public void afterIndexShardStarted(IndexShard indexShard) {
        final ShardId shardId = indexShard.shardId();
        final Index index = shardId.getIndex();

        // Check if this is a security index shard
        if (securityIndex.equals(index.getName())) {
            threadPool.generic().execute(() -> {
                LOGGER.info("Shard started on node " + clusterService.localNode().getName());
                if (isSecurityIndexRestoredFromSnapshot(clusterService, index, securityIndex)) {
                    LOGGER.info("Security index shard {} started - config reloading for snapshot restore; node: {}", shardId, clusterService.localNode().getName());
                    reloadConfiguration(CType.values());
                }
            });
        }
    }

The output of the test is as follows:

2025-09-05 18:47:30 opensearch[cluster_manager_1][clusterManagerService#updateTask][T#1] INFO  RepositoriesService:252 - put repository [test-snapshot-repository]
2025-09-05 18:47:30 opensearch[cluster_manager_1][clusterManagerService#updateTask][T#1] INFO  PluginsService:343 - PluginService:onIndexModule index:[my_index_001/vnGUzqpjR6yMbTm5l1WWBQ]
2025-09-05 18:47:30 opensearch[cluster_manager_1][clusterManagerService#updateTask][T#1] DEBUG SecurityFlsDlsIndexSearcherWrapper:103 - FLS/DLS org.opensearch.security.configuration.SecurityFlsDlsIndexSearcherWrapper@28ebbc7e enabled for index my_index_001
2025-09-05 18:47:30 opensearch[cluster_manager_1][clusterManagerService#updateTask][T#1] INFO  MetadataCreateIndexService:563 - [my_index_001] creating index, cause [api], templates [], shards [1]/[1]
2025-09-05 18:47:30 opensearch[data_0][clusterApplierService#updateTask][T#1] INFO  PluginsService:343 - PluginService:onIndexModule index:[my_index_001/vnGUzqpjR6yMbTm5l1WWBQ]
2025-09-05 18:47:30 opensearch[data_0][clusterApplierService#updateTask][T#1] DEBUG SecurityFlsDlsIndexSearcherWrapper:103 - FLS/DLS org.opensearch.security.configuration.SecurityFlsDlsIndexSearcherWrapper@3bff4c3c enabled for index my_index_001
2025-09-05 18:47:30 opensearch[data_1][clusterApplierService#updateTask][T#1] INFO  PluginsService:343 - PluginService:onIndexModule index:[my_index_001/vnGUzqpjR6yMbTm5l1WWBQ]
2025-09-05 18:47:30 opensearch[data_1][clusterApplierService#updateTask][T#1] DEBUG SecurityFlsDlsIndexSearcherWrapper:103 - FLS/DLS org.opensearch.security.configuration.SecurityFlsDlsIndexSearcherWrapper@3e9d578f enabled for index my_index_001
2025-09-05 18:47:30 opensearch[cluster_manager_1][clusterManagerService#updateTask][T#1] INFO  PluginsService:343 - PluginService:onIndexModule index:[my_index_001/vnGUzqpjR6yMbTm5l1WWBQ]
2025-09-05 18:47:30 opensearch[cluster_manager_1][clusterManagerService#updateTask][T#1] INFO  MetadataMappingService:335 - [my_index_001/vnGUzqpjR6yMbTm5l1WWBQ] create_mapping
2025-09-05 18:47:30 opensearch[cluster_manager_1][clusterManagerService#updateTask][T#1] INFO  PluginsService:343 - PluginService:onIndexModule index:[my_index_001/vnGUzqpjR6yMbTm5l1WWBQ]
2025-09-05 18:47:31 opensearch[data_0][generic][T#5] INFO  RecoverySourceHandler:916 - finalizing recovery took [62.3ms]
2025-09-05 18:47:31 opensearch[cluster_manager_1][clusterManagerService#updateTask][T#1] INFO  AllocationService:577 - Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[my_index_001][0]]]).
2025-09-05 18:47:31 opensearch[cluster_manager_1][clusterManagerService#updateTask][T#1] INFO  SnapshotsService:443 - snapshot [test-snapshot-repository:test-snap/JhJTMIK6T1CXyEndM73Zjg] started
2025-09-05 18:47:31 opensearch[cluster_manager_1][snapshot][T#3] INFO  SnapshotsService:2175 - snapshot [test-snapshot-repository:test-snap/JhJTMIK6T1CXyEndM73Zjg] completed with state [SUCCESS]
2025-09-05 18:47:34 opensearch[cluster_manager_1][clusterManagerService#updateTask][T#1] INFO  MetadataDeleteIndexService:166 - [.opendistro_security/ZDkcCrfES56szgkkYqcs6Q] deleting index
2025-09-05 18:47:35 opensearch[data_0][clusterApplierService#updateTask][T#1] INFO  PluginsService:343 - PluginService:onIndexModule index:[.opendistro_security/K0nnLcTYRrqlA4kl12KhKg]
2025-09-05 18:47:35 opensearch[data_0][clusterApplierService#updateTask][T#1] DEBUG SecurityFlsDlsIndexSearcherWrapper:103 - FLS/DLS org.opensearch.security.configuration.SecurityFlsDlsIndexSearcherWrapper@2125eceb enabled for index .opendistro_security
2025-09-05 18:47:35 opensearch[data_0][generic][T#2] INFO  ConfigurationRepository:685 - Shard started on node data_0
2025-09-05 18:47:35 opensearch[data_0][generic][T#2] INFO  ConfigurationRepository:687 - Security index shard [.opendistro_security][0] started - config reloading for snapshot restore; node: data_0
2025-09-05 18:47:35 opensearch[data_1][clusterApplierService#updateTask][T#1] INFO  PluginsService:343 - PluginService:onIndexModule index:[.opendistro_security/K0nnLcTYRrqlA4kl12KhKg]
2025-09-05 18:47:35 opensearch[data_1][clusterApplierService#updateTask][T#1] DEBUG SecurityFlsDlsIndexSearcherWrapper:103 - FLS/DLS org.opensearch.security.configuration.SecurityFlsDlsIndexSearcherWrapper@2755ec7f enabled for index .opendistro_security
2025-09-05 18:47:35 opensearch[data_0][generic][T#3] INFO  RecoverySourceHandler:916 - finalizing recovery took [9.3ms]
2025-09-05 18:47:35 opensearch[cluster_manager_1][clusterManagerService#updateTask][T#1] INFO  AllocationService:577 - Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.opendistro_security][0]]]).
2025-09-05 18:47:35 opensearch[data_1][generic][T#5] INFO  ConfigurationRepository:685 - Shard started on node data_1
2025-09-05 18:47:35 opensearch[cluster_manager_1][clusterManagerService#updateTask][T#1] INFO  MetadataDeleteIndexService:166 - [my_index_001/vnGUzqpjR6yMbTm5l1WWBQ] deleting index
2025-09-05 18:47:35 opensearch[cluster_manager_1][clusterManagerService#updateTask][T#1] INFO  RepositoriesService:364 - delete repository [test-snapshot-repository]
2025-09-05 18:47:36 ForkJoinPool.commonPool-worker-6 INFO  LocalOpenSearchCluster:505 - Stopping cluster_manager_0 RUNNING [47320, 47220]

You can see that the log message for afterIndexShardStarted() is only emitted for the nodes data_0 and data_1. The cluster is however configured as ClusterManager.THREE_CLUSTER_MANAGERS:

    THREE_CLUSTER_MANAGERS(
        new NodeSettings(NodeRole.CLUSTER_MANAGER),
        new NodeSettings(NodeRole.CLUSTER_MANAGER),
        new NodeSettings(NodeRole.CLUSTER_MANAGER),
        new NodeSettings(NodeRole.DATA),
        new NodeSettings(NodeRole.DATA)
    ),

So, we have two data nodes and three additional cluster manager nodes. The cluster manager nodes do not get any calls to afterIndexShardStarted().

Generally, the integration test should verify that each node has loaded the valid configuration. With the current test, you only verify that a single, undetermined node has reloaded the configuration.
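
For illustration, a per-node check could look roughly like the sketch below; the node endpoints, credentials, and TLS trust handling are hypothetical placeholders, not the plugin's integration test framework.

    // Sketch only: assert on every node (not just one) that the restored
    // security config is effective, i.e. myuser2 (created after the snapshot)
    // is rejected everywhere. Endpoints, credentials and TLS trust setup are
    // hypothetical placeholders.
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;
    import java.util.List;

    public class PerNodeConfigCheck {

        public static void main(String[] args) throws Exception {
            // One HTTP endpoint per cluster node (cluster managers and data nodes alike).
            List<String> nodeEndpoints = List.of("https://localhost:9200", "https://localhost:9201", "https://localhost:9202");
            String basicAuth = Base64.getEncoder().encodeToString("myuser2:password".getBytes(StandardCharsets.UTF_8));
            HttpClient client = HttpClient.newHttpClient();

            for (String endpoint : nodeEndpoints) {
                HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(endpoint + "/_plugins/_security/authinfo"))
                    .header("Authorization", "Basic " + basicAuth)
                    .GET()
                    .build();
                HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
                // After restoring the pre-myuser2 snapshot, every node should answer 401.
                if (response.statusCode() != 401) {
                    throw new AssertionError(endpoint + " still accepts myuser2 (HTTP " + response.statusCode() + ")");
                }
            }
            System.out.println("All nodes rejected myuser2 after the snapshot restore.");
        }
    }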

@cwperks
Member

cwperks commented Sep 5, 2025

@nibix Would it make sense to get rid of TransportConfigUpdateAction altogether and change ConfigurationRepository to an IndexOperationListener that triggers reloadConfiguration on the postIndex event?

I don't know if that would have the same problem of only triggering on nodes that hold shards.
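
For reference, a rough sketch of that idea, assuming the standard IndexingOperationListener extension point (registered via the plugin's onIndexModule); this is not what the PR implements, package locations vary between OpenSearch versions, and, as noted, it would still only fire on nodes that host a shard of the security index.

    // Sketch only, not what this PR implements: reload the security configuration
    // whenever a document is indexed into the security index. Registration would
    // happen in onIndexModule(...) via indexModule.addIndexOperationListener(...).
    import org.opensearch.core.index.shard.ShardId;
    import org.opensearch.index.engine.Engine;
    import org.opensearch.index.shard.IndexingOperationListener;
    import org.opensearch.security.configuration.ConfigurationRepository;
    import org.opensearch.security.securityconf.impl.CType;

    public class SecurityIndexOperationListener implements IndexingOperationListener {

        private final String securityIndex;
        private final ConfigurationRepository configurationRepository;

        public SecurityIndexOperationListener(String securityIndex, ConfigurationRepository configurationRepository) {
            this.securityIndex = securityIndex;
            this.configurationRepository = configurationRepository;
        }

        @Override
        public void postIndex(ShardId shardId, Engine.Index index, Engine.IndexResult result) {
            if (securityIndex.equals(shardId.getIndexName())) {
                // Caveat from the discussion: this still only fires on nodes that
                // actually host a shard of the security index.
                configurationRepository.reloadConfiguration(CType.values());
            }
        }
    }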

@nibix
Collaborator

nibix commented Sep 5, 2025

> Would it make sense to get rid of TransportConfigUpdateAction altogether and change ConfigurationRepository to an IndexOperationListener that triggers reloadConfiguration on the postIndex event?

Not sure, to be honest. The broadcast update action has the advantage that it is straightforward; its semantics are crystal clear. For other approaches, I'd first have to research under which conditions they are called.

@cwperks
Member

cwperks commented Sep 5, 2025

Got it, I suppose we can yank the code from AbstractApiAction to trigger a ConfigUpdateAction from the node with a primary shard: https://github.com/opensearch-project/security/blob/main/src/main/java/org/opensearch/security/dlic/rest/api/AbstractApiAction.java#L502-L503

@nibix
Collaborator

nibix commented Sep 5, 2025

yip, I think that's the way to go

@cwperks
Member

cwperks commented Sep 5, 2025

Actually the important section is the ConfigUpdatingActionListener here: https://github.com/opensearch-project/security/blob/main/src/main/java/org/opensearch/security/dlic/rest/api/AbstractApiAction.java#L551-L567

The index has already been restored from a snapshot, so we need to trigger ConfigUpdateAction using the node client for all ctypes.
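
For illustration, that broadcast could look roughly like the sketch below. ConfigUpdateAction and ConfigUpdateRequest are the transport types named above; the lcStringValues() helper, the package names, and the class wiring are assumptions rather than the merged change.

    // Sketch only, not the merged change: broadcast a reload of all security config
    // types to every node via the plugin's ConfigUpdateAction transport action.
    // Package names are approximate and vary between plugin versions.
    import org.apache.logging.log4j.LogManager;
    import org.apache.logging.log4j.Logger;
    import org.opensearch.client.Client;
    import org.opensearch.core.action.ActionListener;
    import org.opensearch.security.action.configupdate.ConfigUpdateAction;
    import org.opensearch.security.action.configupdate.ConfigUpdateRequest;
    import org.opensearch.security.securityconf.impl.CType;

    public class ConfigReloadBroadcaster {

        private static final Logger LOGGER = LogManager.getLogger(ConfigReloadBroadcaster.class);

        private final Client client; // node client injected by the plugin

        public ConfigReloadBroadcaster(Client client) {
            this.client = client;
        }

        /** Ask every node to reload all security config types, e.g. after a snapshot restore. */
        public void broadcastConfigReload() {
            String[] allConfigTypes = CType.lcStringValues().toArray(new String[0]);
            ConfigUpdateRequest request = new ConfigUpdateRequest(allConfigTypes);

            client.execute(ConfigUpdateAction.INSTANCE, request, ActionListener.wrap(response -> {
                if (response.hasFailures()) {
                    LOGGER.error("Config update after snapshot restore had node failures: {}", response.failures());
                } else {
                    LOGGER.info("Config update after snapshot restore reached {} nodes", response.getNodes().size());
                }
            }, e -> LOGGER.error("Config update after snapshot restore failed", e)));
        }
    }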

@nibix
Collaborator

nibix commented Sep 5, 2025

Yes. Still, I think we also have to keep timing in mind: when the primary shard has been restored, it's possible that not all shards are up to date yet. If we are unlucky, the update request might be faster than the restore of the remaining shards and trigger the reload too early.

I am not sure how to take care of this. Possibly a fixed delay? That would be a bit hacky and fragile as well.
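
Purely as an illustration of one possible alternative to a fixed delay (nothing decided in this thread): the broadcast could be gated on the security index reporting green, i.e. all primaries and replicas active. In the sketch below, client, LOGGER and broadcastConfigReload() refer to the sketch above, while securityIndex and the timeout are assumptions.

    // Sketch only: wait until all shards of the security index are active before
    // broadcasting the config update, instead of using a fixed delay.
    private void reloadWhenSecurityIndexIsGreen() {
        client.admin().cluster().prepareHealth(securityIndex)
            .setWaitForGreenStatus()
            .setTimeout(TimeValue.timeValueSeconds(30))   // don't wait forever
            .execute(ActionListener.wrap(health -> {
                if (!health.isTimedOut()) {
                    broadcastConfigReload();              // e.g. the ConfigUpdateAction sketch above
                } else {
                    LOGGER.warn("Timed out waiting for {} to turn green; skipping config reload broadcast", securityIndex);
                }
            }, e -> LOGGER.error("Cluster health check for {} failed", securityIndex, e)));
    }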
