What is the bug?
When a manual snapshot policy runs, it creates and deletes snapshots based on configured cron jobs. These actions update the state in a system index (.ism-config index). However, due to a race condition, this state update can fail. This occurs when a snapshot deletion is in progress and another snapshot creation starts while holding a lock on the system index. When the snapshot deletion completes, it fails to update the metadata in the system index.
log.error("Could not release lock [${lock.lockId}] for ${job.id}.")
}
Currently, a metadata update failure triggers a notification to users. This is a false alarm: the failure is an internal error, not a user-facing issue that the user can act on and fix.
How can one reproduce the bug?
Set up a manual snapshot policy with both creation and deletion operations.
Configure a notification channel. Run the policy and observe the notifications.
What is the expected behavior?
The system should not send false positive notifications to users for internal metadata update failures.
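One way to picture the expected behavior is a small decision function that separates internal, retryable failures from everything else. This is only a minimal sketch: `UpdateFailure`, `VersionConflict`, `Other`, and `shouldNotifyUser` are hypothetical names introduced for illustration and do not exist in the plugin, and which failures count as user-actionable is an assumption.

```kotlin
// Hypothetical failure categories; the plugin's real exception types are not modeled here.
sealed class UpdateFailure {
    // Another write changed the metadata document first (internal, retryable).
    object VersionConflict : UpdateFailure()

    // Any other failure; assumed here to be something the user may need to look into.
    data class Other(val reason: String) : UpdateFailure()
}

// Decide whether a metadata update failure should reach the notification channel.
fun shouldNotifyUser(failure: UpdateFailure): Boolean = when (failure) {
    is UpdateFailure.VersionConflict -> false
    is UpdateFailure.Other -> true
}
```

With a check like this in front of the notification call, a version conflict on the metadata document would be logged or retried internally instead of being sent to the user.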
Do you have any screenshots?
[2024-12-19T02:49:12,259][ERROR][c.o.i.s.e.SMStateMachine [xxxxxx]] [c15aefb119d1092fc32d73e9e5ef8c22] Failed to update metadata.
[.ism-config/QHnWuqpwS46e7r0qCLwuNQ][[.ism-config][4]] VersionConflictEngineException[[xxxxxx-sm-metadata]: version conflict, required seqNo [754565], primary term [1]. current document has seqNo [754720] and primary term [1]]
[2024-12-19T02:49:12,259][INFO ][o.o.n.s.SendMessageActionHelper] [c15aefb119d1092fc32d73e9e5ef8c22] notifications:getSingleConfig-get snapshot-error-notification
[2024-12-19T02:49:12,939][INFO ][o.o.n.s.SendMessageActionHelper] [c15aefb119d1092fc32d73e9e5ef8c22] notifications:sendMessage:statusCode=200, statusText=Success, message id: a3db63d7-295e-5608-b188-3c0aa2b6a1c2
[2024-12-19T02:49:12,941][WARN ][o.o.i.u.JobSchedulerUtils] [c15aefb119d1092fc32d73e9e5ef8c22] Could not release lock for job xxxxxx-sm-policy
[2024-12-19T02:49:12,941][ERROR][o.o.i.s.SMRunner ] [c15aefb119d1092fc32d73e9e5ef8c22] Could not release lock [.ism-config-xxxxxx-sm-policy] for xxxxxx-sm-policy.
@bowenlan-amzn Could you please review this and share your thoughts? I'm particularly interested in your perspective on the proposed changes to SMStateMachine.kt, as you were the original author of this file.
Specific areas where your feedback would be appreciated:
The accuracy of the bug description and its root cause
The potential impact of removing these notifications
seqNo counts the indexing operations (index, update, delete) on a shard.
The conflict can happen if other indexing operations occur between two metadata updates in one snapshot management run.
I think the easy fix is to handle the conflict exception gracefully: read the current seqNo from the exception and retry the update with it.
If multiple threads were updating the metadata document, we could run into out-of-order updates, but I don't think we use multiple threads for this.
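The retry idea can be sketched with a small in-memory stand-in for a versioned document. Everything here is hypothetical and not taken from the plugin: `InMemoryDocStore`, `Versioned`, `VersionConflictException`, and `updateWithRetry` are illustrative names, and the store only imitates the seqNo and primary term check seen in the log. The sketch also differs from the suggestion above in one respect: after a conflict it re-reads the document rather than taking seqNo from the exception, so the change is applied on top of the other writer's update instead of overwriting it.

```kotlin
// Hypothetical conflict error carrying the document's current version.
class VersionConflictException(val currentSeqNo: Long, val currentPrimaryTerm: Long) :
    Exception("version conflict, current document has seqNo [$currentSeqNo] and primary term [$currentPrimaryTerm]")

data class Versioned<T>(val value: T, val seqNo: Long, val primaryTerm: Long)

// Hypothetical single-document store that imitates a conditional (compare-and-set) write.
class InMemoryDocStore<T>(initial: T) {
    private var doc = Versioned(initial, seqNo = 0L, primaryTerm = 1L)

    fun get(): Versioned<T> = doc

    // Succeeds only if the caller's seqNo and primary term match the stored document.
    fun update(value: T, ifSeqNo: Long, ifPrimaryTerm: Long): Versioned<T> {
        if (ifSeqNo != doc.seqNo || ifPrimaryTerm != doc.primaryTerm) {
            throw VersionConflictException(doc.seqNo, doc.primaryTerm)
        }
        doc = Versioned(value, doc.seqNo + 1, doc.primaryTerm)
        return doc
    }
}

// Retry a conditional update. After each conflict the document is re-read and the
// change is re-applied to the latest value, up to maxRetries additional attempts.
fun <T> updateWithRetry(
    store: InMemoryDocStore<T>,
    seen: Versioned<T>,
    maxRetries: Int = 3,
    change: (T) -> T
): Versioned<T> {
    var base = seen
    var attempt = 0
    while (true) {
        try {
            return store.update(change(base.value), base.seqNo, base.primaryTerm)
        } catch (e: VersionConflictException) {
            if (attempt++ >= maxRetries) throw e
            base = store.get()
        }
    }
}
```

If a caller holds a stale copy of the document and another write lands first, `updateWithRetry` hits one conflict, re-reads, and succeeds on the second attempt. Bounding the retries keeps a persistently contended document from looping forever; once the limit is reached the conflict is rethrown for the caller to handle.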
Code referenced for the lock release failure: index-management/src/main/kotlin/org/opensearch/indexmanagement/snapshotmanagement/SMRunner.kt, lines 104 to 120 in eb6afa8
Code referenced for the notification on metadata update failure: index-management/src/main/kotlin/org/opensearch/indexmanagement/snapshotmanagement/engine/SMStateMachine.kt, lines 124 to 127 in eb6afa8