[BUG] Eliminate False Positive Notifications in Manual Snapshot Policy #1371


Open · skumawat2025 opened this issue Feb 12, 2025 · 3 comments
Labels: bug (Something isn't working)

skumawat2025 commented Feb 12, 2025

What is the bug?
When a manual snapshot policy runs, it creates and deletes snapshots based on the configured cron schedules. These actions update the policy's state in a system index (the .ism-config index). Due to a race condition, this state update can fail: while a snapshot deletion is in progress, a snapshot creation can start and write to the shared metadata document while a lock is still held on the system index. When the snapshot deletion completes, its metadata update in the system index fails with a version conflict.

```kotlin
// Creation and deletion workflows have to be executed sequentially,
// because they share the same metadata document.
SMStateMachine(client, job, metadata, settings, threadPool, indicesManager)
    .handlePolicyChange()
    .currentState(metadata.creation.currentState)
    .next(creationTransitions)
    .apply {
        val deleteMetadata = metadata.deletion
        if (deleteMetadata != null) {
            this.currentState(deleteMetadata.currentState)
                .next(deletionTransitions)
        }
    }
} finally {
    // The enclosing try block is elided in this excerpt.
    if (!releaseLockForScheduledJob(context, lock)) {
        log.error("Could not release lock [${lock.lockId}] for ${job.id}.")
    }
}
```
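
To make the race concrete, here is a minimal, self-contained sketch of the optimistic-concurrency failure (hypothetical types, not the plugin's code): both workflows read the metadata document at the same seqNo, the first write succeeds and bumps it, and the second write is rejected, which is what shows up as the VersionConflictEngineException in the logs below.

```kotlin
// Hypothetical stand-ins for the metadata document and the optimistic
// concurrency check; this is an illustration, not the plugin's real code.
data class MetadataDoc(var seqNo: Long, var body: String)

class VersionConflictException(message: String) : RuntimeException(message)

fun updateIfSeqNo(doc: MetadataDoc, expectedSeqNo: Long, newBody: String) {
    if (doc.seqNo != expectedSeqNo) {
        throw VersionConflictException(
            "version conflict, required seqNo [$expectedSeqNo], " +
                "current document has seqNo [${doc.seqNo}]",
        )
    }
    doc.body = newBody
    doc.seqNo++ // every successful write bumps the sequence number
}

fun main() {
    val doc = MetadataDoc(seqNo = 754565, body = "initial")

    // The deletion workflow has read the document at this seqNo and is still running.
    val seqNoSeenByDeletion = doc.seqNo

    // A snapshot creation writes to the shared metadata document in the meantime.
    updateIfSeqNo(doc, doc.seqNo, "creation state updated")

    // When the deletion completes, its own metadata update fails the concurrency check.
    try {
        updateIfSeqNo(doc, seqNoSeenByDeletion, "deletion state updated")
    } catch (e: VersionConflictException) {
        println("Failed to update metadata: ${e.message}")
    }
}
```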

Currently, we send a notification to users on metadata update failures. This is a false alarm, as it's an internal error rather than a user-facing issue that requires action.

```kotlin
} catch (ex: Exception) {
    val message = "There was an exception at ${now()} while executing Snapshot Management policy ${job.policyName}, please check logs."
    job.notificationConfig?.sendFailureNotification(client, job.policyName, message, job.user, log)
}
```
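
A minimal sketch of one way to avoid the false alarm, assuming the conflict can be recognized from the exception's cause chain; `shouldNotifyUser` is a hypothetical helper, not existing plugin code:

```kotlin
import org.opensearch.index.engine.VersionConflictEngineException

/**
 * Hypothetical helper: decide whether an exception from a snapshot management
 * run deserves a user-facing failure notification. A version conflict on the
 * metadata document is internal bookkeeping the policy owner cannot act on.
 */
fun shouldNotifyUser(ex: Throwable): Boolean {
    var cause: Throwable? = ex
    while (cause != null) {
        if (cause is VersionConflictEngineException) return false
        cause = cause.cause
    }
    return true
}
```

The `sendFailureNotification` call in the catch block above could then be guarded by `shouldNotifyUser(ex)`, logging the conflict instead of notifying the channel.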

How can one reproduce the bug?
1. Set up a manual snapshot policy with both creation and deletion operations.
2. Configure a notification channel.
3. Run the policy and observe the notifications.

What is the expected behavior?
The system should not send false positive notifications to users for internal metadata update failures.

Do you have any screenshots?

[2024-12-19T02:49:12,259][ERROR][c.o.i.s.e.SMStateMachine [xxxxxx]] [c15aefb119d1092fc32d73e9e5ef8c22] Failed to update metadata.
[.ism-config/QHnWuqpwS46e7r0qCLwuNQ][[.ism-config][4]] VersionConflictEngineException[[xxxxxx-sm-metadata]: version conflict, required seqNo [754565], primary term [1]. current document has seqNo [754720] and primary term [1]]
[2024-12-19T02:49:12,259][INFO ][o.o.n.s.SendMessageActionHelper] [c15aefb119d1092fc32d73e9e5ef8c22] notifications:getSingleConfig-get snapshot-error-notification
[2024-12-19T02:49:12,939][INFO ][o.o.n.s.SendMessageActionHelper] [c15aefb119d1092fc32d73e9e5ef8c22] notifications:sendMessage:statusCode=200, statusText=Success, message id: a3db63d7-295e-5608-b188-3c0aa2b6a1c2
[2024-12-19T02:49:12,941][WARN ][o.o.i.u.JobSchedulerUtils] [c15aefb119d1092fc32d73e9e5ef8c22] Could not release lock for job xxxxxx-sm-policy
[2024-12-19T02:49:12,941][ERROR][o.o.i.s.SMRunner         ] [c15aefb119d1092fc32d73e9e5ef8c22] Could not release lock [.ism-config-xxxxxx-sm-policy] for xxxxxx-sm-policy.

skumawat2025 added the bug and untriaged labels on Feb 12, 2025
skumawat2025 (Author) commented:

@bowenlan-amzn Could you please review this and share your thoughts? I'm particularly interested in your perspective on the proposed changes to SMStateMachine.kt, as you were the original author of this file.
Specific areas where your feedback would be appreciated:

  1. The accuracy of the bug description and its root cause
  2. The potential impact of removing these notifications
  3. Any alternative solutions you might suggest
  4. Any unintended consequences we should consider

bowenlan-amzn commented Feb 13, 2025

seqNo counts the indexing operations (index, update, delete) on a shard.
The conflict can happen if there are indexing operations between two metadata updates within one snapshot management run.
I feel the easy fix is to handle the conflict exception gracefully: read the current seqNo from the exception and retry the update with it.

For the metadata document we are updating, updating it from multiple threads could cause out-of-order updates, but I don't think we are using multiple threads.
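
A minimal sketch of that retry approach, assuming hypothetical `readLatest` and `writeWithConcurrencyControl` lambdas that wrap the plugin's real get and index calls, and re-reading the document for the latest seqNo/primaryTerm rather than parsing them out of the exception message:

```kotlin
import org.opensearch.index.engine.VersionConflictEngineException

/**
 * Retry a metadata document update after a version conflict: re-read the
 * document's latest (seqNo, primaryTerm) and re-apply the write. The two
 * lambdas are hypothetical stand-ins for the plugin's real GetRequest /
 * IndexRequest calls against the .ism-config index.
 */
fun <T> updateWithConflictRetry(
    readLatest: () -> Pair<Long, Long>, // (seqNo, primaryTerm) of the current document
    writeWithConcurrencyControl: (seqNo: Long, primaryTerm: Long) -> T,
    maxRetries: Int = 1,
): T {
    var attempt = 0
    var tokens = readLatest()
    while (true) {
        try {
            // Conditional write: rejected with a version conflict if the document
            // changed since these tokens were read.
            return writeWithConcurrencyControl(tokens.first, tokens.second)
        } catch (e: VersionConflictEngineException) {
            if (attempt++ >= maxRetries) throw e
            // Another update (e.g. the other workflow) bumped the seqNo;
            // refresh the tokens and retry.
            tokens = readLatest()
        }
    }
}
```

Wrapping the metadata index call in something like this could let the run absorb the extra indexing operation without ever reaching the failure-notification path.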

andrross commented Mar 3, 2025

Catch All Triage - 1 2 3
