
Conversation

nagarajg17
Contributor

@nagarajg17 commented Jul 15, 2025

Description


  • Category: Bug Fix

  • Why are these changes required?
    In #5307, functionality was introduced to update the cache from the security index after the index was restored from a snapshot. However, the reload ran only on nodes holding the primary shard. This PR changes it to reload the cache on all nodes.

  • What is the old behavior before changes and new behavior after changes?
    The cache update ran only on nodes holding the primary shard of the security index. Now it runs on all nodes.

Issues Resolved

#5472

Is this a backport? If so, please add backport PR # and/or commits #, and remove backport-failed label from the original PR.

Do these changes introduce new permission(s) to be displayed in the static dropdown on the front-end? If so, please open a draft PR in the security dashboards plugin and link the draft PR here

Testing

Tested with a 2-node setup on a single machine (node configurations below).

  1. Added myuser1 with permissions to read index my-index-000001
  2. Took a snapshot of the security index
  3. Added myuser2 with permissions to read index my-index-000001
  4. Verified myuser2 was able to read the index from both nodes
  5. Deleted the security index
  6. Restored it from the snapshot
  7. Verified the cache was updated on both nodes
  8. Verified myuser2 was no longer able to read the index from either node
# Node 1
cluster.name: test-cluster
node.name: node-1
node.roles: [ master, data ]
network.host: 0.0.0.0
http.port: 9200
transport.port: 9300
discovery.seed_hosts: ["localhost:9300", "localhost:9301"]
cluster.initial_master_nodes: ["node-1"]
# Node 2
cluster.name: test-cluster
node.name: node-3
node.roles: [ data ]
network.host: 0.0.0.0
http.port: 9201
transport.port: 9301
discovery.seed_hosts: ["localhost:9300", "localhost:9301"]
cluster.initial_master_nodes: ["node-1"]

Check List

  • New functionality includes testing
  • New functionality has been documented
  • New Roles/Permissions have a corresponding security dashboards plugin PR
  • API changes companion pull request created
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.


codecov bot commented Jul 15, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.99%. Comparing base (16993b4) to head (9784c01).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files


@@            Coverage Diff             @@
##             main    #5478      +/-   ##
==========================================
- Coverage   73.02%   72.99%   -0.03%     
==========================================
  Files         408      408              
  Lines       25293    25314      +21     
  Branches     3854     3854              
==========================================
+ Hits        18469    18478       +9     
- Misses       4952     4963      +11     
- Partials     1872     1873       +1     
Files with missing lines Coverage Δ
...ecurity/configuration/ConfigurationRepository.java 83.82% <100.00%> (+0.67%) ⬆️

... and 3 files with indirect coverage changes


@nagarajg17 marked this pull request as ready for review July 15, 2025 14:48
@@ -681,11 +681,10 @@ public void afterIndexShardStarted(IndexShard indexShard) {

         // Check if this is a security index shard
         if (securityIndex.equals(index.getName())) {
-            // Only trigger on primary shard to avoid multiple reloads
-            if (indexShard.routingEntry() != null && indexShard.routingEntry().primary()) {
+            if (indexShard.routingEntry() != null) {
                 threadPool.generic().execute(() -> {
Member

Let's make sure this gets updated when we have a dedicated threadpool for security updates.
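
For reference, a minimal sketch of what registering such a dedicated threadpool could look like in the plugin; the pool name and sizing here are assumptions, not an agreed design:

    // Sketch only: a dedicated fixed-size threadpool for security config updates,
    // registered from the plugin. Pool name and sizing are assumptions.
    @Override
    public List<ExecutorBuilder<?>> getExecutorBuilders(Settings settings) {
        return List.of(
            new FixedExecutorBuilder(settings, "security_config_update", 1, 1000, "thread_pool.security_config_update")
        );
    }

    // The reload above would then use
    //   threadPool.executor("security_config_update").execute(...)
    // instead of threadPool.generic().execute(...).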

Contributor Author

Ack

cwperks previously approved these changes Jul 15, 2025
@nibix left a comment (Collaborator)

I have doubts that this will fix the issue.

Of course, I might be wrong here. However, in any case, a feature that relies on such specific cluster behavior should always come with an integration test that exercises it on a real cluster with nodes assuming different roles.

Just having a unit test that mocks dependencies won't be sufficient, as those mocks only encode the developer's assumptions, which might simply be wrong.

@willyborankin
Collaborator

@nagarajg17, @cwperks, and @nibix — there's a SnapshotRestoreHelper class that manages and provides information about the restoration process. I think it makes sense to use this class and reload the configuration once the restoration is complete.

@cwperks
Member

cwperks commented Jul 15, 2025

@willyborankin The SnapshotRestoreHelper is being used here to determine if the security index is being restored from a snapshot. See https://github.com/opensearch-project/security/blob/main/src/main/java/org/opensearch/security/support/SnapshotRestoreHelper.java#L96-L112

@willyborankin
Collaborator

> @willyborankin The SnapshotRestoreHelper is being used here to determine if the security index is being restored from a snapshot. See https://github.com/opensearch-project/security/blob/main/src/main/java/org/opensearch/security/support/SnapshotRestoreHelper.java#L96-L112

Ahh, I did not notice the static import :-).

Signed-off-by: Nagaraj G <narajg@amazon.com>
* Use isClusterPerm instead of requestedResolved.isLocalAll() to determine if action is a cluster action ([#5445](https://github.com/opensearch-project/security/pull/5445))
* Fix config update with deprecated config types failing in mixed clusters ([#5456](https://github.com/opensearch-project/security/pull/5456))
* Fix usage of jwt_clock_skew_tolerance_seconds in HTTPJwtAuthenticator ([#5506](https://github.com/opensearch-project/security/pull/5506))
* Fix partial cache update post snapshot restore[#5478](https://github.com/opensearch-project/security/pull/5478)
Member

@nagarajg17 can you fix the CHANGELOG here? Otherwise this looks good to me.

@nibix
Collaborator

nibix commented Sep 5, 2025

Thank you for the integration test!

However, apologies once again; I do not feel good about always being the one who spots issues here :-(

I am still not convinced that this works the way it is. I just tried the integration test. Additionally, I added some more logs to afterIndexShardStarted() in ConfigurationRepository:

    @Override
    public void afterIndexShardStarted(IndexShard indexShard) {
        final ShardId shardId = indexShard.shardId();
        final Index index = shardId.getIndex();

        // Check if this is a security index shard
        if (securityIndex.equals(index.getName())) {
            threadPool.generic().execute(() -> {
                LOGGER.info("Shard started on node " + clusterService.localNode().getName());
                if (isSecurityIndexRestoredFromSnapshot(clusterService, index, securityIndex)) {
                    LOGGER.info("Security index shard {} started - config reloading for snapshot restore; node: {}", shardId, clusterService.localNode().getName());
                    reloadConfiguration(CType.values());
                }
            });
        }
    }

The output of the test is as follows:

2025-09-05 18:47:30 opensearch[cluster_manager_1][clusterManagerService#updateTask][T#1] INFO  RepositoriesService:252 - put repository [test-snapshot-repository]
2025-09-05 18:47:30 opensearch[cluster_manager_1][clusterManagerService#updateTask][T#1] INFO  PluginsService:343 - PluginService:onIndexModule index:[my_index_001/vnGUzqpjR6yMbTm5l1WWBQ]
2025-09-05 18:47:30 opensearch[cluster_manager_1][clusterManagerService#updateTask][T#1] DEBUG SecurityFlsDlsIndexSearcherWrapper:103 - FLS/DLS org.opensearch.security.configuration.SecurityFlsDlsIndexSearcherWrapper@28ebbc7e enabled for index my_index_001
2025-09-05 18:47:30 opensearch[cluster_manager_1][clusterManagerService#updateTask][T#1] INFO  MetadataCreateIndexService:563 - [my_index_001] creating index, cause [api], templates [], shards [1]/[1]
2025-09-05 18:47:30 opensearch[data_0][clusterApplierService#updateTask][T#1] INFO  PluginsService:343 - PluginService:onIndexModule index:[my_index_001/vnGUzqpjR6yMbTm5l1WWBQ]
2025-09-05 18:47:30 opensearch[data_0][clusterApplierService#updateTask][T#1] DEBUG SecurityFlsDlsIndexSearcherWrapper:103 - FLS/DLS org.opensearch.security.configuration.SecurityFlsDlsIndexSearcherWrapper@3bff4c3c enabled for index my_index_001
2025-09-05 18:47:30 opensearch[data_1][clusterApplierService#updateTask][T#1] INFO  PluginsService:343 - PluginService:onIndexModule index:[my_index_001/vnGUzqpjR6yMbTm5l1WWBQ]
2025-09-05 18:47:30 opensearch[data_1][clusterApplierService#updateTask][T#1] DEBUG SecurityFlsDlsIndexSearcherWrapper:103 - FLS/DLS org.opensearch.security.configuration.SecurityFlsDlsIndexSearcherWrapper@3e9d578f enabled for index my_index_001
2025-09-05 18:47:30 opensearch[cluster_manager_1][clusterManagerService#updateTask][T#1] INFO  PluginsService:343 - PluginService:onIndexModule index:[my_index_001/vnGUzqpjR6yMbTm5l1WWBQ]
2025-09-05 18:47:30 opensearch[cluster_manager_1][clusterManagerService#updateTask][T#1] INFO  MetadataMappingService:335 - [my_index_001/vnGUzqpjR6yMbTm5l1WWBQ] create_mapping
2025-09-05 18:47:30 opensearch[cluster_manager_1][clusterManagerService#updateTask][T#1] INFO  PluginsService:343 - PluginService:onIndexModule index:[my_index_001/vnGUzqpjR6yMbTm5l1WWBQ]
2025-09-05 18:47:31 opensearch[data_0][generic][T#5] INFO  RecoverySourceHandler:916 - finalizing recovery took [62.3ms]
2025-09-05 18:47:31 opensearch[cluster_manager_1][clusterManagerService#updateTask][T#1] INFO  AllocationService:577 - Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[my_index_001][0]]]).
2025-09-05 18:47:31 opensearch[cluster_manager_1][clusterManagerService#updateTask][T#1] INFO  SnapshotsService:443 - snapshot [test-snapshot-repository:test-snap/JhJTMIK6T1CXyEndM73Zjg] started
2025-09-05 18:47:31 opensearch[cluster_manager_1][snapshot][T#3] INFO  SnapshotsService:2175 - snapshot [test-snapshot-repository:test-snap/JhJTMIK6T1CXyEndM73Zjg] completed with state [SUCCESS]
2025-09-05 18:47:34 opensearch[cluster_manager_1][clusterManagerService#updateTask][T#1] INFO  MetadataDeleteIndexService:166 - [.opendistro_security/ZDkcCrfES56szgkkYqcs6Q] deleting index
2025-09-05 18:47:35 opensearch[data_0][clusterApplierService#updateTask][T#1] INFO  PluginsService:343 - PluginService:onIndexModule index:[.opendistro_security/K0nnLcTYRrqlA4kl12KhKg]
2025-09-05 18:47:35 opensearch[data_0][clusterApplierService#updateTask][T#1] DEBUG SecurityFlsDlsIndexSearcherWrapper:103 - FLS/DLS org.opensearch.security.configuration.SecurityFlsDlsIndexSearcherWrapper@2125eceb enabled for index .opendistro_security
2025-09-05 18:47:35 opensearch[data_0][generic][T#2] INFO  ConfigurationRepository:685 - Shard started on node data_0
2025-09-05 18:47:35 opensearch[data_0][generic][T#2] INFO  ConfigurationRepository:687 - Security index shard [.opendistro_security][0] started - config reloading for snapshot restore; node: data_0
2025-09-05 18:47:35 opensearch[data_1][clusterApplierService#updateTask][T#1] INFO  PluginsService:343 - PluginService:onIndexModule index:[.opendistro_security/K0nnLcTYRrqlA4kl12KhKg]
2025-09-05 18:47:35 opensearch[data_1][clusterApplierService#updateTask][T#1] DEBUG SecurityFlsDlsIndexSearcherWrapper:103 - FLS/DLS org.opensearch.security.configuration.SecurityFlsDlsIndexSearcherWrapper@2755ec7f enabled for index .opendistro_security
2025-09-05 18:47:35 opensearch[data_0][generic][T#3] INFO  RecoverySourceHandler:916 - finalizing recovery took [9.3ms]
2025-09-05 18:47:35 opensearch[cluster_manager_1][clusterManagerService#updateTask][T#1] INFO  AllocationService:577 - Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.opendistro_security][0]]]).
2025-09-05 18:47:35 opensearch[data_1][generic][T#5] INFO  ConfigurationRepository:685 - Shard started on node data_1
2025-09-05 18:47:35 opensearch[cluster_manager_1][clusterManagerService#updateTask][T#1] INFO  MetadataDeleteIndexService:166 - [my_index_001/vnGUzqpjR6yMbTm5l1WWBQ] deleting index
2025-09-05 18:47:35 opensearch[cluster_manager_1][clusterManagerService#updateTask][T#1] INFO  RepositoriesService:364 - delete repository [test-snapshot-repository]
2025-09-05 18:47:36 ForkJoinPool.commonPool-worker-6 INFO  LocalOpenSearchCluster:505 - Stopping cluster_manager_0 RUNNING [47320, 47220]

You can see that the log message for afterIndexShardStarted() is only emitted for the nodes data_0 and data_1. The cluster is however configured as ClusterManager.THREE_CLUSTER_MANAGERS:

    THREE_CLUSTER_MANAGERS(
        new NodeSettings(NodeRole.CLUSTER_MANAGER),
        new NodeSettings(NodeRole.CLUSTER_MANAGER),
        new NodeSettings(NodeRole.CLUSTER_MANAGER),
        new NodeSettings(NodeRole.DATA),
        new NodeSettings(NodeRole.DATA)
    ),

So, we have two data nodes and three additional cluster manager nodes. The cluster manager nodes do not get any calls to afterIndexShardStarted().

Generally, the integration test should verify that each node has loaded the valid configuration. With the current test, you only verify that a single, undetermined node has reloaded the configuration.
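
For illustration, a per-node check could look roughly like the sketch below; the node endpoints, credentials, and TLS trust handling are hypothetical placeholders, not the plugin's integration test framework.

    // Sketch only: assert on every node (not just one) that the restored
    // security config is effective, i.e. myuser2 (created after the snapshot)
    // is rejected everywhere. Endpoints, credentials and TLS trust setup are
    // hypothetical placeholders.
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;
    import java.util.List;

    public class PerNodeConfigCheck {

        public static void main(String[] args) throws Exception {
            // One HTTP endpoint per cluster node (cluster managers and data nodes alike).
            List<String> nodeEndpoints = List.of("https://localhost:9200", "https://localhost:9201", "https://localhost:9202");
            String basicAuth = Base64.getEncoder().encodeToString("myuser2:password".getBytes(StandardCharsets.UTF_8));
            HttpClient client = HttpClient.newHttpClient();

            for (String endpoint : nodeEndpoints) {
                HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(endpoint + "/_plugins/_security/authinfo"))
                    .header("Authorization", "Basic " + basicAuth)
                    .GET()
                    .build();
                HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
                // After restoring the pre-myuser2 snapshot, every node should answer 401.
                if (response.statusCode() != 401) {
                    throw new AssertionError(endpoint + " still accepts myuser2 (HTTP " + response.statusCode() + ")");
                }
            }
            System.out.println("All nodes rejected myuser2 after the snapshot restore.");
        }
    }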

@cwperks
Member

cwperks commented Sep 5, 2025

@nibix Would it make sense to get rid of TransportConfigUpdateAction altogether and change ConfigurationRepository to an IndexOperationListener that triggers reloadConfiguration on the postIndex event?

I don't know if that would have the same problem of only triggering on nodes that hold shards.
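
For reference, a rough sketch of that idea, assuming the standard IndexingOperationListener extension point (registered via the plugin's onIndexModule); this is not what the PR implements, package locations vary between OpenSearch versions, and, as noted, it would still only fire on nodes that host a shard of the security index.

    // Sketch only, not what this PR implements: reload the security configuration
    // whenever a document is indexed into the security index. Registration would
    // happen in onIndexModule(...) via indexModule.addIndexOperationListener(...).
    import org.opensearch.core.index.shard.ShardId;
    import org.opensearch.index.engine.Engine;
    import org.opensearch.index.shard.IndexingOperationListener;
    import org.opensearch.security.configuration.ConfigurationRepository;
    import org.opensearch.security.securityconf.impl.CType;

    public class SecurityIndexOperationListener implements IndexingOperationListener {

        private final String securityIndex;
        private final ConfigurationRepository configurationRepository;

        public SecurityIndexOperationListener(String securityIndex, ConfigurationRepository configurationRepository) {
            this.securityIndex = securityIndex;
            this.configurationRepository = configurationRepository;
        }

        @Override
        public void postIndex(ShardId shardId, Engine.Index index, Engine.IndexResult result) {
            if (securityIndex.equals(shardId.getIndexName())) {
                // Caveat from the discussion: this still only fires on nodes that
                // actually host a shard of the security index.
                configurationRepository.reloadConfiguration(CType.values());
            }
        }
    }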

@nibix
Collaborator

nibix commented Sep 5, 2025

> Would it make sense to get rid of TransportConfigUpdateAction altogether and change ConfigurationRepository to an IndexOperationListener that triggers reloadConfiguration on the postIndex event?

Not sure, to be honest. The broadcast update action has the advantage that it is straightforward; its semantics are crystal clear. For other approaches, I'd first have to research under which conditions they are called.

@cwperks
Member

cwperks commented Sep 5, 2025

Got it, I suppose we can yank the code from AbstractApiAction to trigger a ConfigUpdateAction from the node with a primary shard: https://github.com/opensearch-project/security/blob/main/src/main/java/org/opensearch/security/dlic/rest/api/AbstractApiAction.java#L502-L503

@nibix
Collaborator

nibix commented Sep 5, 2025

yip, I think that's the way to go

@cwperks
Member

cwperks commented Sep 5, 2025

Actually the important section is the ConfigUpdatingActionListener here: https://github.com/opensearch-project/security/blob/main/src/main/java/org/opensearch/security/dlic/rest/api/AbstractApiAction.java#L551-L567

The index has already been restored from a snapshot, so we need to trigger ConfigUpdateAction using the node client for all ctypes.
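
For illustration, that broadcast could look roughly like the sketch below. ConfigUpdateAction and ConfigUpdateRequest are the transport types named above; the lcStringValues() helper, the package names, and the class wiring are assumptions rather than the merged change.

    // Sketch only, not the merged change: broadcast a reload of all security config
    // types to every node via the plugin's ConfigUpdateAction transport action.
    // Package names are approximate and vary between plugin versions.
    import org.apache.logging.log4j.LogManager;
    import org.apache.logging.log4j.Logger;
    import org.opensearch.client.Client;
    import org.opensearch.core.action.ActionListener;
    import org.opensearch.security.action.configupdate.ConfigUpdateAction;
    import org.opensearch.security.action.configupdate.ConfigUpdateRequest;
    import org.opensearch.security.securityconf.impl.CType;

    public class ConfigReloadBroadcaster {

        private static final Logger LOGGER = LogManager.getLogger(ConfigReloadBroadcaster.class);

        private final Client client; // node client injected by the plugin

        public ConfigReloadBroadcaster(Client client) {
            this.client = client;
        }

        /** Ask every node to reload all security config types, e.g. after a snapshot restore. */
        public void broadcastConfigReload() {
            String[] allConfigTypes = CType.lcStringValues().toArray(new String[0]);
            ConfigUpdateRequest request = new ConfigUpdateRequest(allConfigTypes);

            client.execute(ConfigUpdateAction.INSTANCE, request, ActionListener.wrap(response -> {
                if (response.hasFailures()) {
                    LOGGER.error("Config update after snapshot restore had node failures: {}", response.failures());
                } else {
                    LOGGER.info("Config update after snapshot restore reached {} nodes", response.getNodes().size());
                }
            }, e -> LOGGER.error("Config update after snapshot restore failed", e)));
        }
    }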

@nibix
Collaborator

nibix commented Sep 5, 2025

Yes. Still, I think we also have to keep timing in mind: when the primary shard has been restored, it's possible that not all shards are up to date yet. If we are unlucky, the update request might be faster than the restore of the remaining shards and trigger the reload too early.

I am not sure how to take care of this. Possibly a fixed delay? That would be a bit hacky and fragile as well.
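
Purely as an illustration of one possible alternative to a fixed delay (nothing decided in this thread): the broadcast could be gated on the security index reporting green, i.e. all primaries and replicas active. In the sketch below, client, LOGGER and broadcastConfigReload() refer to the sketch above, while securityIndex and the timeout are assumptions.

    // Sketch only: wait until all shards of the security index are active before
    // broadcasting the config update, instead of using a fixed delay.
    private void reloadWhenSecurityIndexIsGreen() {
        client.admin().cluster().prepareHealth(securityIndex)
            .setWaitForGreenStatus()
            .setTimeout(TimeValue.timeValueSeconds(30))   // don't wait forever
            .execute(ActionListener.wrap(health -> {
                if (!health.isTimedOut()) {
                    broadcastConfigReload();              // e.g. the ConfigUpdateAction sketch above
                } else {
                    LOGGER.warn("Timed out waiting for {} to turn green; skipping config reload broadcast", securityIndex);
                }
            }, e -> LOGGER.error("Cluster health check for {} failed", securityIndex, e)));
    }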
