[BUG] Eliminate False Positive Notifications in Manual Snapshot Policy #1371


Open · skumawat2025 opened this issue Feb 12, 2025 · 3 comments
Labels: bug (Something isn't working)

skumawat2025 commented Feb 12, 2025

What is the bug?
When a manual snapshot policy runs, it creates and deletes snapshots based on the configured cron schedules. These actions update the policy's state in a system index (the .ism-config index). Due to a race condition, this state update can fail: while a snapshot deletion is in progress, a snapshot creation can start and write to the shared metadata document while a lock is still held on the system index. When the snapshot deletion completes, its metadata update in the system index fails with a version conflict.

```kotlin
// Creation and deletion workflows have to be executed sequentially,
// because they share the same metadata document.
SMStateMachine(client, job, metadata, settings, threadPool, indicesManager)
    .handlePolicyChange()
    .currentState(metadata.creation.currentState)
    .next(creationTransitions)
    .apply {
        val deleteMetadata = metadata.deletion
        if (deleteMetadata != null) {
            this.currentState(deleteMetadata.currentState)
                .next(deletionTransitions)
        }
    }
} finally {
    // The enclosing try block is elided in this excerpt.
    if (!releaseLockForScheduledJob(context, lock)) {
        log.error("Could not release lock [${lock.lockId}] for ${job.id}.")
    }
}
```
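
To make the race concrete, here is a minimal, self-contained sketch of the optimistic-concurrency failure (hypothetical types, not the plugin's code): both workflows read the metadata document at the same seqNo, the first write succeeds and bumps it, and the second write is rejected, which is what shows up as the VersionConflictEngineException in the logs below.

```kotlin
// Hypothetical stand-ins for the metadata document and the optimistic
// concurrency check; this is an illustration, not the plugin's real code.
data class MetadataDoc(var seqNo: Long, var body: String)

class VersionConflictException(message: String) : RuntimeException(message)

fun updateIfSeqNo(doc: MetadataDoc, expectedSeqNo: Long, newBody: String) {
    if (doc.seqNo != expectedSeqNo) {
        throw VersionConflictException(
            "version conflict, required seqNo [$expectedSeqNo], " +
                "current document has seqNo [${doc.seqNo}]",
        )
    }
    doc.body = newBody
    doc.seqNo++ // every successful write bumps the sequence number
}

fun main() {
    val doc = MetadataDoc(seqNo = 754565, body = "initial")

    // The deletion workflow has read the document at this seqNo and is still running.
    val seqNoSeenByDeletion = doc.seqNo

    // A snapshot creation writes to the shared metadata document in the meantime.
    updateIfSeqNo(doc, doc.seqNo, "creation state updated")

    // When the deletion completes, its own metadata update fails the concurrency check.
    try {
        updateIfSeqNo(doc, seqNoSeenByDeletion, "deletion state updated")
    } catch (e: VersionConflictException) {
        println("Failed to update metadata: ${e.message}")
    }
}
```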

Currently, we send a notification to users on metadata update failures. This is a false alarm, as it's an internal error rather than a user-facing issue that requires action.

```kotlin
} catch (ex: Exception) {
    val message = "There was an exception at ${now()} while executing Snapshot Management policy ${job.policyName}, please check logs."
    job.notificationConfig?.sendFailureNotification(client, job.policyName, message, job.user, log)
}
```
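
A minimal sketch of one way to avoid the false alarm, assuming the conflict can be recognized from the exception's cause chain; `shouldNotifyUser` is a hypothetical helper, not existing plugin code:

```kotlin
import org.opensearch.index.engine.VersionConflictEngineException

/**
 * Hypothetical helper: decide whether an exception from a snapshot management
 * run deserves a user-facing failure notification. A version conflict on the
 * metadata document is internal bookkeeping the policy owner cannot act on.
 */
fun shouldNotifyUser(ex: Throwable): Boolean {
    var cause: Throwable? = ex
    while (cause != null) {
        if (cause is VersionConflictEngineException) return false
        cause = cause.cause
    }
    return true
}
```

The `sendFailureNotification` call in the catch block above could then be guarded by `shouldNotifyUser(ex)`, logging the conflict instead of notifying the channel.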

How can one reproduce the bug?
1. Set up a manual snapshot policy with both creation and deletion operations.
2. Configure a notification channel.
3. Run the policy and observe the notifications.

What is the expected behavior?
The system should not send false positive notifications to users for internal metadata update failures.

Do you have any screenshots?

[2024-12-19T02:49:12,259][ERROR][c.o.i.s.e.SMStateMachine [xxxxxx]] [c15aefb119d1092fc32d73e9e5ef8c22] Failed to update metadata.
[.ism-config/QHnWuqpwS46e7r0qCLwuNQ][[.ism-config][4]] VersionConflictEngineException[[xxxxxx-sm-metadata]: version conflict, required seqNo [754565], primary term [1]. current document has seqNo [754720] and primary term [1]]
[2024-12-19T02:49:12,259][INFO ][o.o.n.s.SendMessageActionHelper] [c15aefb119d1092fc32d73e9e5ef8c22] notifications:getSingleConfig-get snapshot-error-notification
[2024-12-19T02:49:12,939][INFO ][o.o.n.s.SendMessageActionHelper] [c15aefb119d1092fc32d73e9e5ef8c22] notifications:sendMessage:statusCode=200, statusText=Success, message id: a3db63d7-295e-5608-b188-3c0aa2b6a1c2
[2024-12-19T02:49:12,941][WARN ][o.o.i.u.JobSchedulerUtils] [c15aefb119d1092fc32d73e9e5ef8c22] Could not release lock for job xxxxxx-sm-policy
[2024-12-19T02:49:12,941][ERROR][o.o.i.s.SMRunner         ] [c15aefb119d1092fc32d73e9e5ef8c22] Could not release lock [.ism-config-xxxxxx-sm-policy] for xxxxxx-sm-policy.

skumawat2025 added the bug and untriaged labels on Feb 12, 2025
skumawat2025 (Author) commented:

@bowenlan-amzn Could you please review this and share your thoughts? I'm particularly interested in your perspective on the proposed changes to SMStateMachine.kt, as you were the original author of this file.
Specific areas where your feedback would be appreciated:

  1. The accuracy of the bug description and its root cause
  2. The potential impact of removing these notifications
  3. Any alternative solutions you might suggest
  4. Any unintended consequences we should consider

bowenlan-amzn commented Feb 13, 2025

seqNo counts the indexing operations (index, update, delete) on a shard.
The conflict can happen if there are indexing operations between two metadata updates within one snapshot management run.
I feel the easy fix is to handle the conflict exception gracefully: read the current seqNo from the exception and retry the update with it.

For the metadata document we are updating, updating it from multiple threads could cause out-of-order updates, but I don't think we are using multiple threads.
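
A minimal sketch of that retry approach, assuming hypothetical `readLatest` and `writeWithConcurrencyControl` lambdas that wrap the plugin's real get and index calls, and re-reading the document for the latest seqNo/primaryTerm rather than parsing them out of the exception message:

```kotlin
import org.opensearch.index.engine.VersionConflictEngineException

/**
 * Retry a metadata document update after a version conflict: re-read the
 * document's latest (seqNo, primaryTerm) and re-apply the write. The two
 * lambdas are hypothetical stand-ins for the plugin's real GetRequest /
 * IndexRequest calls against the .ism-config index.
 */
fun <T> updateWithConflictRetry(
    readLatest: () -> Pair<Long, Long>, // (seqNo, primaryTerm) of the current document
    writeWithConcurrencyControl: (seqNo: Long, primaryTerm: Long) -> T,
    maxRetries: Int = 1,
): T {
    var attempt = 0
    var tokens = readLatest()
    while (true) {
        try {
            // Conditional write: rejected with a version conflict if the document
            // changed since these tokens were read.
            return writeWithConcurrencyControl(tokens.first, tokens.second)
        } catch (e: VersionConflictEngineException) {
            if (attempt++ >= maxRetries) throw e
            // Another update (e.g. the other workflow) bumped the seqNo;
            // refresh the tokens and retry.
            tokens = readLatest()
        }
    }
}
```

Wrapping the metadata index call in something like this could let the run absorb the extra indexing operation without ever reaching the failure-notification path.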

andrross commented Mar 3, 2025

Catch All Triage - 1 2 3
