
[FIXED] workqueue reset to 0 when blk file is zero-sized due to unflushed data on crash #6882


Merged · 1 commit · May 12, 2025

Conversation

souravagrawal
Contributor

I encountered an issue where a Workqueue stream’s first and last sequence numbers were unexpectedly reset to 0 following an abrupt termination of the NATS server. Interestingly, the consumer remained fully caught up with messages and retained its expected state even after the crash, but the stream itself appeared to have been reset.

I was able to retrieve a backup of the data after the crash and debug it locally. During analysis, I found that new msgs had not been flushed to disk, which I believe resulted in a zero-sized blk file. As a result, during recovery, the stream state remained at zero and index.db could not be used to reconstruct the state.

Resolves: #6881

Signed-off-by: souravagrawal souravagrawal1111@gmail.com

@souravagrawal souravagrawal requested a review from a team as a code owner May 8, 2025 11:19
@@ -489,6 +489,11 @@ func newFileStoreWithCreated(fcfg FileStoreConfig, cfg StreamConfig, created tim
return nil, err
}
}
// Use prior state when our stream state could not be recovered from blk files.
if fs.state.FirstSeq|fs.state.LastSeq == 0 && prior.FirstSeq > 0 && prior.LastSeq > 0 {
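For context on the new condition: for unsigned integers, the bitwise OR of two values is zero only when both are zero, so fs.state.FirstSeq|fs.state.LastSeq == 0 checks that neither sequence was recovered, in a single comparison. A minimal standalone illustration (the sequence values here are made up):

package main

import "fmt"

func main() {
	// For unsigned integers, a|b == 0 holds only when both a and b are 0,
	// so one comparison checks that neither sequence was recovered.
	var first, last uint64 = 0, 0
	fmt.Println(first|last == 0) // true: no state recovered from blk files

	first, last = 0, 42
	fmt.Println(first|last == 0) // false: at least one sequence is set
}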
neilalexander
Member

Is this effectively just covered by taking the if fs.ld != nil off the above existing condition?

souravagrawal
Contributor Author

Yes, we can remove the fs.ld != nil, but I believe that when fs.ld != nil a new msgblock has to be created to write tombstones, so we can move the fs.ld != nil check to before creating the msgblock:

if fs.ld != nil {
	if _, err := fs.newMsgBlockForWrite(); err == nil {
		if err = fs.writeTombstone(prior.LastSeq, prior.LastTime.UnixNano()); err != nil {
			return nil, err
		}
	} else {
		return nil, err
	}
}

souravagrawal
Contributor Author

Hello @neilalexander, I have made the changes as you suggested, but I moved the fs.ld != nil check to before creating a new msgblock for tombstones; without that, the existing zero-sized msg block would just stay there forever and never get removed.

workqueue reset to 0 when blk file is zero-sized due to unflushed data on crash

Signed-off-by: souravagrawal <souravagrawal1111@gmail.com>
@neilalexander neilalexander (Member) left a comment

LGTM but would like @derekcollison to take a look too.

@derekcollison (Member) commented May 9, 2025

@souravagrawal can you confirm that the machine or VM or container crashed, not just the nats server here? We always flush through to the kernel once a message has been processed and before any publish ack. So the OS would have had to crash in our opinion from your description above.
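To illustrate the distinction being drawn here (a sketch of the general pattern, not the server's actual write path): a successful Write hands bytes to the kernel page cache, which survives a process crash; only an explicit Sync (fsync) forces them to stable storage so they also survive a kernel panic or power loss. The file name below is hypothetical:

package main

import (
	"log"
	"os"
)

func main() {
	// "example.blk" is a hypothetical file name, not the server's naming.
	f, err := os.Create("example.blk")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Write hands the bytes to the kernel page cache. If only the server
	// process dies after this point, the kernel still writes the data
	// out, so nothing is lost.
	if _, err := f.Write([]byte("message data")); err != nil {
		log.Fatal(err)
	}

	// Sync (fsync) forces the page cache to stable storage. Data that has
	// not reached stable storage is lost on a kernel panic or power cut,
	// which can leave a zero-sized file behind.
	if err := f.Sync(); err != nil {
		log.Fatal(err)
	}
}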

@souravagrawal
Contributor Author

> @souravagrawal can you confirm that the machine or VM or container crashed, not just the nats server here? We always flush through to the kernel once a message has been processed and before any publish ack. So the OS would have had to crash in our opinion from your description above.

Hello @derekcollison, yes, we have observed this issue with a VM crash, not just a NATS server crash.

@derekcollison
Member

@souravagrawal I am saying that what you describe above cannot happen without a kernel crash. For clarity, are you claiming you have seen this with just abnormal termination of the server while the kernel / OS remained running?

@souravagrawal
Contributor Author

> @souravagrawal I am saying that what you describe above cannot happen without a kernel crash. For clarity, are you claiming you have seen this with just abnormal termination of the server while the kernel / OS remained running?

We've observed this issue multiple times in the event of a VM crash, though it's unclear whether the kernel was also affected in those instances; we couldn't initially determine the root cause or reproduce the problem consistently.

We simulated a kernel panic by issuing the following command in one of the environments:

echo c > /proc/sysrq-trigger

This reliably reproduced the issue in almost every run.

To troubleshoot, we examined the NATS data directory before restarting NATS and found that only a single .blk file was present, with a file size of 0 bytes. We restored this data locally and found that the associated stream state was getting reset to 0.
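A small diagnostic sketch along these lines, scanning a stream's message directory for zero-sized .blk files; the directory path below is a placeholder, and the real location depends on the configured JetStream store directory:

package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
)

func main() {
	// Placeholder path; adjust to your JetStream store directory layout.
	dir := "/data/jetstream/$G/streams/MYSTREAM/msgs"

	blks, err := filepath.Glob(filepath.Join(dir, "*.blk"))
	if err != nil {
		log.Fatal(err)
	}
	for _, blk := range blks {
		info, err := os.Stat(blk)
		if err != nil {
			log.Fatal(err)
		}
		if info.Size() == 0 {
			// A zero-sized blk file after a crash is the symptom
			// described above: writes that never reached the disk.
			fmt.Println("zero-sized blk file:", blk)
		}
	}
}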

@derekcollison derekcollison (Member) left a comment

LGTM

@neilalexander neilalexander merged commit 7c06a4f into nats-io:main May 12, 2025
65 of 67 checks passed
Successfully merging this pull request may close these issues.

WQ Stream Sequence Reset to 0 on abrupt termination