[FIXED] workqueue reset to 0 when blk file is zero-sized due to unflushed data on crash #6882
Conversation
server/filestore.go (outdated diff)
@@ -489,6 +489,11 @@ func newFileStoreWithCreated(fcfg FileStoreConfig, cfg StreamConfig, created tim
			return nil, err
		}
	}
	//Use prior state when our stream state could not be recovered from blk files.
	if fs.state.FirstSeq|fs.state.LastSeq == 0 && prior.FirstSeq > 0 && prior.LastSeq > 0 {
Is this effectively just covered by taking the if fs.ld != nil off the above existing condition?
Yes, we can remove the fs.ld != nil check, but I believe that when fs.ld != nil a new msgblock has to be created in order to write tombstones, so we can move the fs.ld != nil check before creating the msgblock:
if fs.ld != nil {
	if _, err := fs.newMsgBlockForWrite(); err == nil {
		if err = fs.writeTombstone(prior.LastSeq, prior.LastTime.UnixNano()); err != nil {
			return nil, err
		}
	} else {
		return nil, err
	}
}
Hello @neilalexander, I have made the changes as you suggested, but I moved the fs.ld != nil check before creating a new msgblock for tombstones; without it, the existing zero-sized msg block would just stay there forever and never get removed.
LGTM but would like @derekcollison to take a look too.
@souravagrawal can you confirm that the machine, VM, or container crashed, not just the nats-server process? We always flush through to the kernel once a message has been processed and before any publish ack, so from your description above the OS itself would have had to crash, in our opinion.
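(As an aside, here is a minimal standalone Go sketch of the two durability levels being discussed; it is not nats-server code, and the file name is made up. A plain Write hands data to the kernel page cache, which survives a crash of the writing process alone, while only a Sync/fsync makes it survive a kernel panic or machine crash.)

package main

import "os"

func main() {
	// Hypothetical file, standing in for a stream blk file.
	f, err := os.Create("example.blk")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Write hands the bytes to the kernel page cache. They survive an
	// abnormal termination of the writing process alone.
	if _, err := f.Write([]byte("message data")); err != nil {
		panic(err)
	}

	// Sync (fsync) pushes the data through to stable storage. This is what
	// is required for the bytes to survive a kernel panic or VM crash.
	if err := f.Sync(); err != nil {
		panic(err)
	}
}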
Hello @derekcollison, yes, we have observed this issue with a VM crash, not just a crash of the NATS server.
@souravagrawal I am saying that what you describe above cannot happen without a kernel crash. For clarity, are you claiming you have seen this with just an abnormal termination of the server while the kernel / OS remained running?
We've observed this issue multiple times in the event of a VM crash, though it's unclear whether the kernel was also affected in those instances; we couldn't initially determine the root cause or reproduce the problem consistently. We then simulated a kernel panic by issuing the following command in one of the environments:
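(The command itself is not reproduced in this conversation. Purely as an illustration, and as an assumption about the general approach rather than what was actually run: a kernel panic can be forced on a disposable Linux VM via the magic SysRq trigger. The Go sketch below simply writes to the corresponding /proc files; it requires root and will crash the machine immediately.)

package main

import "os"

func main() {
	// Enable all SysRq functions, then request an immediate crash.
	_ = os.WriteFile("/proc/sys/kernel/sysrq", []byte("1"), 0o644)
	_ = os.WriteFile("/proc/sysrq-trigger", []byte("c"), 0o200)
}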
LGTM
I encountered an issue where a Workqueue stream’s first and last sequence numbers were unexpectedly reset to 0 following an abrupt termination of the NATS server. Interestingly, the consumer remained fully caught up with messages and retained its expected state even after the crash, but the stream itself appeared to have been reset.
I was able to retrieve a backup of the data after the crash and debug it locally. During analysis, I found that new messages had not been flushed to disk, which I believe resulted in a zero-sized blk file. As a result, during recovery the stream state remained at zero and the index.db could not be used to reconstruct it.
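(Below is a standalone sketch of the fallback behaviour described above, using simplified types rather than nats-server internals; the helper name restoreFromPrior and the surrounding details are assumptions, not the exact patch. The condition mirrors the diff shown earlier in this thread: if recovery from the blk files produced an empty state but the prior state recorded in index.db is non-empty, adopt the prior state instead of silently resetting the stream to 0.)

package main

import (
	"fmt"
	"time"
)

// streamState mirrors only the fields relevant to this discussion.
type streamState struct {
	FirstSeq, LastSeq   uint64
	FirstTime, LastTime time.Time
}

// restoreFromPrior is a hypothetical helper, not a nats-server function.
func restoreFromPrior(recovered, prior streamState) streamState {
	// Both sequences at zero means the blk files yielded nothing, e.g. a
	// zero-sized blk file left behind by unflushed writes on a crash.
	if recovered.FirstSeq|recovered.LastSeq == 0 && prior.FirstSeq > 0 && prior.LastSeq > 0 {
		return prior
	}
	return recovered
}

func main() {
	prior := streamState{FirstSeq: 101, LastSeq: 250, LastTime: time.Now()}
	recovered := streamState{} // nothing recovered from the blk files
	fmt.Printf("%+v\n", restoreFromPrior(recovered, prior))
}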
Resolves: #6881
Signed-off-by: souravagrawal <souravagrawal1111@gmail.com>