
[FIXED] workqueue reset to 0 when blk file is zero-sized due to unflushed data on crash #6882


Merged · 1 commit · May 12, 2025

Conversation

souravagrawal
Contributor

I encountered an issue where a Workqueue stream’s first and last sequence numbers were unexpectedly reset to 0 following an abrupt termination of the NATS server. Interestingly, the consumer remained fully caught up with messages and retained its expected state even after the crash, but the stream itself appeared to have been reset.

I was able to retrieve a backup of the data after the crash and debug it locally. During analysis, I found that new msgs had not been flushed to disk, which I believe resulted in a zero-sized blk file. As a result, during recovery, the stream state remained at zero and index.db could not be used to reconstruct the state.

Resolves: #6881

Signed-off-by: souravagrawal souravagrawal1111@gmail.com

@souravagrawal souravagrawal requested a review from a team as a code owner May 8, 2025 11:19
@@ -489,6 +489,11 @@ func newFileStoreWithCreated(fcfg FileStoreConfig, cfg StreamConfig, created tim
return nil, err
}
}
// Use prior state when our stream state could not be recovered from blk files.
if fs.state.FirstSeq|fs.state.LastSeq == 0 && prior.FirstSeq > 0 && prior.LastSeq > 0 {
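For context on the new condition: for unsigned integers, the bitwise OR of two values is zero only when both are zero, so fs.state.FirstSeq|fs.state.LastSeq == 0 checks that neither sequence was recovered, in a single comparison. A minimal standalone illustration (the sequence values here are made up):

package main

import "fmt"

func main() {
	// For unsigned integers, a|b == 0 holds only when both a and b are 0,
	// so one comparison checks that neither sequence was recovered.
	var first, last uint64 = 0, 0
	fmt.Println(first|last == 0) // true: no state recovered from blk files

	first, last = 0, 42
	fmt.Println(first|last == 0) // false: at least one sequence is set
}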
neilalexander
Member

Is this effectively just covered by taking the if fs.ld != nil off the above existing condition?

souravagrawal
Contributor Author

Yes, we can remove the fs.ld != nil, but I believe that when fs.ld != nil a new msgblock has to be created to write tombstones, so we can move the fs.ld != nil check to before creating the msgblock:

if fs.ld != nil {
	if _, err := fs.newMsgBlockForWrite(); err == nil {
		if err = fs.writeTombstone(prior.LastSeq, prior.LastTime.UnixNano()); err != nil {
			return nil, err
		}
	} else {
		return nil, err
	}
}

souravagrawal
Contributor Author

Hello @neilalexander, I have made the changes as you suggested, but I moved the fs.ld != nil check to before creating a new msgblock for tombstones; without that, the existing zero-sized msg block would just stay there forever and never get removed.

workqueue reset to 0 when blk file is zero-sized due to unflushed data on crash

Signed-off-by: souravagrawal <souravagrawal1111@gmail.com>
@neilalexander neilalexander (Member) left a comment

LGTM but would like @derekcollison to take a look too.

@derekcollison (Member) commented May 9, 2025

@souravagrawal can you confirm that the machine or VM or container crashed, not just the nats server here? We always flush through to the kernel once a message has been processed and before any publish ack. So the OS would have had to crash in our opinion from your description above.
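To illustrate the distinction being drawn here (a sketch of the general pattern, not the server's actual write path): a successful Write hands bytes to the kernel page cache, which survives a process crash; only an explicit Sync (fsync) forces them to stable storage so they also survive a kernel panic or power loss. The file name below is hypothetical:

package main

import (
	"log"
	"os"
)

func main() {
	// "example.blk" is a hypothetical file name, not the server's naming.
	f, err := os.Create("example.blk")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Write hands the bytes to the kernel page cache. If only the server
	// process dies after this point, the kernel still writes the data
	// out, so nothing is lost.
	if _, err := f.Write([]byte("message data")); err != nil {
		log.Fatal(err)
	}

	// Sync (fsync) forces the page cache to stable storage. Data that has
	// not reached stable storage is lost on a kernel panic or power cut,
	// which can leave a zero-sized file behind.
	if err := f.Sync(); err != nil {
		log.Fatal(err)
	}
}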

@souravagrawal
Contributor Author

> @souravagrawal can you confirm that the machine or VM or container crashed, not just the nats server here? We always flush through to the kernel once a message has been processed and before any publish ack. So the OS would have had to crash in our opinion from your description above.

Hello @derekcollison, yes, we have observed this issue with a VM crash, not just a NATS server crash.

@derekcollison
Member

@souravagrawal I am saying that what you describe above cannot happen without a kernel crash. For clarity, are you claiming you have seen this with just abnormal termination of the server while the kernel / OS remained running?

@souravagrawal
Contributor Author

> @souravagrawal I am saying that what you describe above cannot happen without a kernel crash. For clarity, are you claiming you have seen this with just abnormal termination of the server while the kernel / OS remained running?

We've observed this issue multiple times in the event of a VM crash, though it's unclear whether the kernel was also affected in those instances; we couldn't initially determine the root cause or reproduce the problem consistently.

We simulated a kernel panic by issuing the following command in one of the environments:

echo c > /proc/sysrq-trigger

This reliably reproduced the issue in almost every run.

To troubleshoot, we examined the NATS data directory before restarting NATS and found that only a single .blk file was present, with a file size of 0 bytes. We restored this data locally and found that the associated stream state was getting reset to 0.
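A small diagnostic sketch along these lines, scanning a stream's message directory for zero-sized .blk files; the directory path below is a placeholder, and the real location depends on the configured JetStream store directory:

package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
)

func main() {
	// Placeholder path; adjust to your JetStream store directory layout.
	dir := "/data/jetstream/$G/streams/MYSTREAM/msgs"

	blks, err := filepath.Glob(filepath.Join(dir, "*.blk"))
	if err != nil {
		log.Fatal(err)
	}
	for _, blk := range blks {
		info, err := os.Stat(blk)
		if err != nil {
			log.Fatal(err)
		}
		if info.Size() == 0 {
			// A zero-sized blk file after a crash is the symptom
			// described above: writes that never reached the disk.
			fmt.Println("zero-sized blk file:", blk)
		}
	}
}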

@derekcollison derekcollison (Member) left a comment

LGTM

@neilalexander neilalexander merged commit 7c06a4f into nats-io:main May 12, 2025
65 of 67 checks passed
Successfully merging this pull request may close these issues.

WQ Stream Sequence Reset to 0 on abrupt termination