fix error if 2 events occur simultaneously: decision progression and view stopping #622
Can you provide a full stack trace of the process? Just to make sure the test aligns with what you experience in the test benches. |
I don't see a stack trace in the file |
Do you mean the stack trace that is written when the program crashes? |
I don't understand what you need. There is no stack trace and there never has been. There were only logs showing that the view produced the decision but it was not delivered to Fabric, while at the same time the controller tried to stop the view and failed to do so. |
A stack trace that shows the stack of each goroutine. Can you please tell me which line of the file it is on? |
I'm afraid it doesn't exist. |
You can make the orderer output a stack trace into the logs. |
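For readers unfamiliar with goroutine dumps, here is a minimal Go sketch of how a process can capture the stack of every goroutine using the standard `runtime.Stack` call. The helper name `allGoroutineStacks` is hypothetical, and this is not claimed to be how the orderer itself produces its dump.

```go
package main

import (
	"fmt"
	"runtime"
)

// allGoroutineStacks (hypothetical helper) returns the stack traces
// of every goroutine in the current process as a single string.
func allGoroutineStacks() string {
	buf := make([]byte, 1<<16)
	for {
		// The second argument "true" asks for all goroutines,
		// not only the calling one.
		n := runtime.Stack(buf, true)
		if n < len(buf) {
			return string(buf[:n])
		}
		// The buffer was filled completely, so the dump may be
		// truncated: grow the buffer and retry.
		buf = make([]byte, 2*len(buf))
	}
}

func main() {
	// In a real node this string would go to the logger.
	fmt.Print(allGoroutineStacks())
}
```

A dump like this shows which goroutines are blocked and on which channel operation or lock, which is what makes a suspected deadlock visible.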
The situation rarely happens. I can't reproduce it on the bench right now. But I can reproduce the same situation in a SmartBFT test and output the stack trace. |
But how do you know that the test reproduces the problem that manifests in the benchmark environment without a stack trace dump? |
I reconstructed the sequence of events and then reproduced it in the test. |
|
steps to reproduce the error:
|
So the reason I asked to see the stack trace from the benchmark test is that I find it odd that we have a view change due to a timeout (which requires the node to be disconnected for a period of time):
But the node has just received the missing commit messages which allow it to commit the block. Setting that aside, would it not make sense to let the view finish committing the decision before tearing down the view? |
I'll describe what happened on the bench.
In the test, the first thing I did was reproduce this behavior through a heartbeat timeout.
In the test I needed to make these two events happen almost simultaneously. That's what I did. The order is not important. |
Let's go through the log file: each view decision
And only in the last case (line 1947) there is no such thing. Only one
Conclusion one: somewhere on the way from the view through the controller to Fabric, the decision was lost. It never reached its destination.
A few lines above (line 1501) we see that the controller has received a
Conclusion two: the view has not stopped.
Analyzing the code yields only one explanation: the view and the controller are trying to call each other at the same time. |
@yacovm I got it. It's fabric v3.0.0.
|
can you attach the entire stack trace of the node? |
Added |
…view stopping Signed-off-by: Fedor Partanskiy <fredprtnsk@gmail.com>
I don't get it. Why don't you want this channel? |
It seems to me it is not used anywhere |
Or maybe you mean something else. But I think it's being used. |
What I mean is that the channel seems unused to me because you never close it. |
Of course it is closed. You can remove it and try to run the last added test, TestDecideAndAbortViewAtSameTime. Without closing this channel, the test will terminate with an error. |
I am sorry, my mistake. I didn't notice that you are returning the already existing abort chan in the view. |
I purposely put it outside to bypass the deadlock between view and controller |
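As a rough sketch of why an abort channel helps here, the Go example below lets a blocked hand-off give up as soon as the abort channel is closed. It assumes the general pattern of selecting on an abort channel; the names `decideOrAbort`, `decisions`, `abortChan` and `result` are hypothetical, and the actual change in this PR may be structured differently.

```go
package main

import "fmt"

// decideOrAbort (hypothetical) models a view that tries to deliver a
// decision but also watches an abort channel, so that a controller
// that wants to stop the view is never blocked by the pending send.
func decideOrAbort(abort bool) string {
	decisions := make(chan string)  // view -> controller, unbuffered
	abortChan := make(chan struct{}) // closed by the controller to stop the view
	result := make(chan string, 1)   // how the view goroutine ended

	// View: either the decision is received or the abort is observed.
	go func() {
		select {
		case decisions <- "decision":
			result <- "delivered"
		case <-abortChan:
			result <- "aborted"
		}
	}()

	// Controller: either stops the view or consumes the decision.
	if abort {
		close(abortChan)
	} else {
		<-decisions
	}
	return <-result
}

func main() {
	fmt.Println(decideOrAbort(false))
	fmt.Println(decideOrAbort(true))
}
```

Closing the channel, rather than sending on it, means every goroutine that selects on it is released, so the view can exit even while it is in the middle of handing over a decision.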
On test benches, a node deadlock was detected when two events occur simultaneously: a decision and a view abort.
Added a test that shows this in the current version and a solution to the problem.