Bug fix: Fixing partial cache update post snapshot restore #5478
base: main
Conversation
Force-pushed 9597028 to f09c188
Codecov Report: ✅ All modified and coverable lines are covered by tests.

```
@@ Coverage Diff @@
##             main    #5478      +/-   ##
==========================================
- Coverage   73.02%   72.99%   -0.03%
==========================================
  Files         408      408
  Lines       25293    25314      +21
  Branches     3854     3854
==========================================
+ Hits        18469    18478       +9
- Misses       4952     4963      +11
- Partials     1872     1873       +1
```
```diff
@@ -681,11 +681,10 @@ public void afterIndexShardStarted(IndexShard indexShard) {
         // Check if this is a security index shard
         if (securityIndex.equals(index.getName())) {
-            // Only trigger on primary shard to avoid multiple reloads
-            if (indexShard.routingEntry() != null && indexShard.routingEntry().primary()) {
+            if (indexShard.routingEntry() != null) {
                 threadPool.generic().execute(() -> {
```
Let's make sure this gets updated when we have a dedicated threadpool for security updates.
Ack
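As an aside, the dedicated security threadpool suggested above would presumably be registered from the plugin via an executor builder. A minimal sketch, assuming a hypothetical pool name and sizing:

```java
// Minimal sketch of registering a dedicated security threadpool, assuming a
// hypothetical pool name and sizing; these builders would be returned from the
// plugin's getExecutorBuilders(Settings) override.
import java.util.List;

import org.opensearch.common.settings.Settings;
import org.opensearch.threadpool.ExecutorBuilder;
import org.opensearch.threadpool.FixedExecutorBuilder;

final class SecurityThreadPoolSketch {

    static List<ExecutorBuilder<?>> executorBuilders(Settings settings) {
        // One thread with a small queue is plenty for occasional config reloads; the
        // listener in the diff above would then submit work to
        // threadPool.executor("security_config_update") instead of the generic pool.
        return List.of(new FixedExecutorBuilder(settings, "security_config_update", 1, 10, "thread_pool.security_config_update"));
    }
}
```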
I have doubts that this will fix the issue.
Of course, I might be wrong here. However, in any case, a feature that relies on such specific cluster behavior should always come with an integration test that exercises it on a real cluster with nodes assuming different roles.
Just having a unit test that mocks dependencies won't be sufficient, as those mocks only encode the developer's assumptions, which might themselves be wrong.
src/main/java/org/opensearch/security/configuration/ConfigurationRepository.java
@nagarajg17, @cwperks, and @nibix — there's a …
@willyborankin The SnapshotRestoreHelper is being used here to determine whether the security index is being restored from a snapshot. See https://github.com/opensearch-project/security/blob/main/src/main/java/org/opensearch/security/support/SnapshotRestoreHelper.java#L96-L112
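Conceptually, such a check inspects the restore-in-progress section of the cluster state. The sketch below is only an illustration of that idea, not the plugin's actual SnapshotRestoreHelper; the class name and exact iteration details are assumptions:

```java
// Illustration only, not the plugin's actual SnapshotRestoreHelper: check whether
// a given index is part of an in-progress snapshot restore by inspecting the
// RestoreInProgress section of the cluster state.
import org.opensearch.cluster.ClusterState;
import org.opensearch.cluster.RestoreInProgress;

final class RestoreCheckSketch {

    static boolean isBeingRestored(ClusterState state, String indexName) {
        RestoreInProgress restores = state.custom(RestoreInProgress.TYPE);
        if (restores == null) {
            return false;        // no restore is running at all
        }
        for (RestoreInProgress.Entry entry : restores) {
            if (entry.indices().contains(indexName)) {
                return true;     // the index is being restored right now
            }
        }
        return false;
    }
}
```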
Ahhh, did not notice the static import :-)
Force-pushed 81282f4 to b76f389
Signed-off-by: Nagaraj G <narajg@amazon.com>
* Use isClusterPerm instead of requestedResolved.isLocalAll() to determine if action is a cluster action ([#5445](https://github.com/opensearch-project/security/pull/5445))
* Fix config update with deprecated config types failing in mixed clusters ([#5456](https://github.com/opensearch-project/security/pull/5456))
* Fix usage of jwt_clock_skew_tolerance_seconds in HTTPJwtAuthenticator ([#5506](https://github.com/opensearch-project/security/pull/5506))
* Fix partial cache update post snapshot restore[#5478](https://github.com/opensearch-project/security/pull/5478)
@nagarajg17 can you fix the CHANGELOG here? Otherwise this looks good to me.
Thank you for the integration test! However, apologies once again; I do not feel good about always being the one who spots issues here :-( I am still not convinced that this works the way it is. I just tried the integration test. Additionally, I added some more logs to …
The output of the test is as follows: …
You can see that the log message for …
So, we have two data nodes and three further cluster manager nodes. The cluster manager nodes do not get any calls to … Generally, the int test should verify that each node has loaded the valid configuration. With the current test, you only verify that a single, undetermined node has reloaded the configuration.
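To make that requirement concrete, the verification would have to touch every node individually, roughly along the following lines. The cluster and getRestClientForNode() helpers, the admin user, and the queried resource are hypothetical placeholders rather than the actual test framework API:

```java
// Hypothetical per-node assertion sketch; cluster, getRestClientForNode(), ADMIN_USER
// and the queried internal user are placeholder names, not the real test framework API.
for (String nodeName : cluster.nodeNames()) {
    try (TestRestClient client = cluster.getRestClientForNode(nodeName, ADMIN_USER)) {
        // Every node, including cluster manager nodes and nodes holding only replica
        // shards, must answer based on the configuration restored from the snapshot.
        TestRestClient.HttpResponse response = client.get("_plugins/_security/api/internalusers/user_from_snapshot");
        assertThat("node " + nodeName + " did not reload the restored configuration",
            response.getStatusCode(), equalTo(200));
    }
}
```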
@nibix Would it make sense to get rid of …? Idk if that would have the same problem of only triggering on nodes with shards.
Not sure, to be honest. The broadcast update action has the advantage that it is straightforward and its semantics are crystal clear. For other methods, I'd first have to research under which conditions they are called.
Got it. I suppose we can yank the code from AbstractApiAction to trigger a ConfigUpdateAction from the node with a primary shard: https://github.com/opensearch-project/security/blob/main/src/main/java/org/opensearch/security/dlic/rest/api/AbstractApiAction.java#L502-L503
Yip, I think that's the way to go.
Actually, the important section is the ConfigUpdatingActionListener here: https://github.com/opensearch-project/security/blob/main/src/main/java/org/opensearch/security/dlic/rest/api/AbstractApiAction.java#L551-L567 The index has already been restored from a snapshot, so we need to trigger …
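For illustration, a rough sketch of broadcasting such a config update from the node that restored the primary shard. It mirrors the ConfigUpdatingActionListener idea linked above, but the constructor arguments, the config type list, and the import packages are assumptions that would need to be aligned with the real classes:

```java
// Hedged sketch, not the plugin's actual code: broadcast a ConfigUpdateRequest so that
// every node reloads its in-memory security configuration after the security index has
// been restored. Constructor arguments, the config type list and the import packages
// are assumptions and may differ between OpenSearch versions.
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.opensearch.client.Client;
import org.opensearch.core.action.ActionListener;
import org.opensearch.security.action.configupdate.ConfigUpdateAction;
import org.opensearch.security.action.configupdate.ConfigUpdateRequest;
import org.opensearch.security.action.configupdate.ConfigUpdateResponse;

final class ConfigReloadSketch {

    private static final Logger log = LogManager.getLogger(ConfigReloadSketch.class);

    static void triggerClusterWideConfigReload(Client client) {
        // Illustrative subset of config types; the real code would derive these from CType.
        String[] configTypes = { "config", "roles", "rolesmapping", "internalusers", "actiongroups", "tenants" };
        client.execute(ConfigUpdateAction.INSTANCE, new ConfigUpdateRequest(configTypes), new ActionListener<ConfigUpdateResponse>() {
            @Override
            public void onResponse(ConfigUpdateResponse response) {
                // Each node in the response has acknowledged reloading its configuration cache.
                log.info("Config reload after snapshot restore acknowledged by {} nodes", response.getNodes().size());
            }

            @Override
            public void onFailure(Exception e) {
                log.error("Config reload after snapshot restore failed", e);
            }
        });
    }
}
```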
Yes. Still, I think we also have to keep timing in mind: when the primary shard has been restored, it's possible that not all shards are up to date. If we are unlucky, the update request might be faster than the restore of the remaining shards and trigger the reload too early. I am not sure how to take care of this. Possibly a fixed delay? That would be a bit hacky and fragile as well.
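If the fixed-delay workaround were chosen despite its fragility, it could look roughly like this; the five-second grace period is an arbitrary illustrative value and the reload helper is the hypothetical one from the previous sketch:

```java
// Hedged sketch of the fixed-delay option discussed above; the 5-second grace period is
// an arbitrary illustrative value and triggerClusterWideConfigReload() is the hypothetical
// helper from the previous sketch.
import org.opensearch.client.Client;
import org.opensearch.common.unit.TimeValue;
import org.opensearch.threadpool.ThreadPool;

final class DelayedReloadSketch {

    static void scheduleDelayedConfigReload(ThreadPool threadPool, Client client) {
        threadPool.schedule(
            () -> ConfigReloadSketch.triggerClusterWideConfigReload(client),
            TimeValue.timeValueSeconds(5),   // hope that the remaining shards have been restored by then
            ThreadPool.Names.GENERIC
        );
    }
}
```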
Description
Category: Bug Fix
Why are these changes required?
In #5307, functionality was introduced to update the cache from the security index after the index was restored from a snapshot. However, this was only done on nodes holding a primary shard of the security index. This PR fixes that by reloading the cache on all nodes.
What is the old behavior before changes and new behavior after changes?
The cache update was previously only performed on nodes holding a primary shard. Now it is performed on all nodes.
Issues Resolved
#5472
Is this a backport? If so, please add backport PR # and/or commits #, and remove backport-failed label from the original PR.
Do these changes introduce new permission(s) to be displayed in the static dropdown on the front-end? If so, please open a draft PR in the security dashboards plugin and link the draft PR here.
Testing
Tested with a 2-node setup on a single machine.
my-index-000001
my-index-000001
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.