Conversation

pfi79 commented Apr 27, 2025

On the test benches, a node deadlock was detected, caused by two simultaneous events: decide and abort view.
Added a test that reproduces this on the current version, and a fix for the problem.

pfi79 requested review from C0rWin and yacovm, April 27, 2025 21:42
yacovm commented Apr 27, 2025

On the test benches, a node deadlock was detected, caused by two simultaneous events: decide and abort view.

Can you provide a full stack trace of the process? Just to make sure the test aligns with what you experience in the test benches.

yacovm commented Apr 27, 2025

I don't see a stack trace in the file

pfi79 commented Apr 27, 2025

I don't see a stack trace in the file

Do you mean the stack trace that is written when the program crashes?

pfi79 commented Apr 27, 2025

I just don't get it: what do you need?

But there is no stack trace, and there never has been.

There were only logs showing that the view produced a decision but it was not delivered to Fabric, while at the same time the controller was trying to stop the view and failing to do so.

yacovm commented Apr 27, 2025

I don't see a stack trace in the file

Do you mean the stack trace that is written when the program crashes?

A stack trace which shows the stack of each goroutine. Can you please tell me at which line in the file it resides?

pfi79 commented Apr 27, 2025

A stack trace which shows the stack of each goroutine. Can you please tell me at which line in the file it resides?

I'm afraid it doesn't exist.

yacovm commented Apr 27, 2025

I just don't get it: what do you need?

But there is no stack trace, and there never has been.

There were only logs showing that the view produced a decision but it was not delivered to Fabric, while at the same time the controller was trying to stop the view and failing to do so.

You can make the orderer output a stack trace into the logs.
Send a SIGUSR1 to the orderer process and it will dump the stack trace of all goroutines.
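
For reference, a minimal Go sketch of such a handler (illustrative only; not necessarily how the orderer implements it):

	package main

	import (
		"os"
		"os/signal"
		"runtime"
		"syscall"
	)

	// installStackDumper writes the stacks of all goroutines to stderr
	// whenever the process receives SIGUSR1.
	func installStackDumper() {
		c := make(chan os.Signal, 1)
		signal.Notify(c, syscall.SIGUSR1)
		go func() {
			for range c {
				buf := make([]byte, 1<<20) // 1 MiB; enough for all goroutine stacks here
				n := runtime.Stack(buf, true)
				os.Stderr.Write(buf[:n])
			}
		}()
	}

	func main() {
		installStackDumper()
		select {} // block forever; send SIGUSR1 to this process to get a dump
	}

Then kill -USR1 <pid> makes the process print all goroutine stacks.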

pfi79 commented Apr 27, 2025

You can make the orderer output a stack trace into the logs.
Send a SIGUSR1 to the orderer process and it will dump the stack trace of all goroutines.

The situation happens rarely; I can't reproduce it on the bench right now.

But I can reproduce the same situation in a SmartBFT test and output the stack trace.

yacovm commented Apr 27, 2025

You can make the orderer output a stack trace into the logs.
Send a SIGUSR1 to the orderer process and it will dump the stack trace of all goroutines.

The situation happens rarely; I can't reproduce it on the bench right now.

But I can reproduce the same situation in a SmartBFT test and output the stack trace.

But how do you know that the test reproduces the problem that manifests in the benchmark environment, without a stack trace dump?

pfi79 commented Apr 27, 2025

But how do you know that the test reproduces the problem that manifests in the benchmark environment, without a stack trace dump?

The same way I reconstructed it: from the logs.

And then I reproduced it in the test.

pfi79 commented Apr 27, 2025

goroutine 45 [select]:
github.com/hyperledger-labs/SmartBFT/internal/bft.(*Controller).Decide(0x140002b6248, {{0x1400002c240, 0x57, 0x57}, {0x0, 0x0, 0x0}, {0x140000b6100, 0x4, 0x4}, ...}, ...)
	/Users/f/go/src/github.com/hyperledger-labs/SmartBFT/internal/bft/controller.go:883 +0x128
github.com/hyperledger-labs/SmartBFT/internal/bft.(*View).decide(0x140002a0400, 0x140000a80f0, {0x14000718540, 0x3, 0x4}, {0x140001ec060, 0x1, 0x1})
	/Users/f/go/src/github.com/hyperledger-labs/SmartBFT/internal/bft/view.go:857 +0x2c4
github.com/hyperledger-labs/SmartBFT/internal/bft.(*View).prepared(0x140002a0400)
	/Users/f/go/src/github.com/hyperledger-labs/SmartBFT/internal/bft/view.go:347 +0x3c4
github.com/hyperledger-labs/SmartBFT/internal/bft.(*View).doPhase(0x140002a0400)
	/Users/f/go/src/github.com/hyperledger-labs/SmartBFT/internal/bft/view.go:289 +0x10c
github.com/hyperledger-labs/SmartBFT/internal/bft.(*View).run(0x140002a0400)
	/Users/f/go/src/github.com/hyperledger-labs/SmartBFT/internal/bft/view.go:277 +0x1b4
github.com/hyperledger-labs/SmartBFT/internal/bft.(*View).Start.func1()
	/Users/f/go/src/github.com/hyperledger-labs/SmartBFT/internal/bft/view.go:140 +0x24
created by github.com/hyperledger-labs/SmartBFT/internal/bft.(*View).Start in goroutine 5
	/Users/f/go/src/github.com/hyperledger-labs/SmartBFT/internal/bft/view.go:139 +0x20c

goroutine 47 [sync.WaitGroup.Wait]:
sync.runtime_SemacquireWaitGroup(0x140002a05d8?)
	/usr/local/go/src/runtime/sema.go:110 +0x2c
sync.(*WaitGroup).Wait(0x140002a05d8)
	/usr/local/go/src/sync/waitgroup.go:118 +0x98
github.com/hyperledger-labs/SmartBFT/internal/bft.(*View).Abort(0x140002a0400)
	/Users/f/go/src/github.com/hyperledger-labs/SmartBFT/internal/bft/view.go:1006 +0x34
github.com/hyperledger-labs/SmartBFT/internal/bft.(*Controller).abortView(0x140002b6248, 0x0)
	/Users/f/go/src/github.com/hyperledger-labs/SmartBFT/internal/bft/controller.go:452 +0x310
github.com/hyperledger-labs/SmartBFT/internal/bft.(*Controller).run(0x140002b6248)
	/Users/f/go/src/github.com/hyperledger-labs/SmartBFT/internal/bft/controller.go:514 +0x3e0
github.com/hyperledger-labs/SmartBFT/internal/bft.(*Controller).Start.func1()
	/Users/f/go/src/github.com/hyperledger-labs/SmartBFT/internal/bft/controller.go:819 +0x9c
created by github.com/hyperledger-labs/SmartBFT/internal/bft.(*Controller).Start in goroutine 5
	/Users/f/go/src/github.com/hyperledger-labs/SmartBFT/internal/bft/controller.go:817 +0x760

pfi79 force-pushed the decide-abort-view branch from a011d74 to 52ac023, April 28, 2025 00:01
pfi79 commented Apr 28, 2025

Steps to reproduce the error:

  1. Take the test I wrote here.

  2. Run it on the main branch.

yacovm commented Apr 28, 2025

So the reason I asked to see the stack trace from the benchmark test is that I find it odd that we have a view change due to a timeout (which requires the node to be disconnected for a period of time):

	// Emulation of node removal from the consensus in the absence of heartbeat
	nodes[3].Disconnect()
	<-leaderComplaneCh
	nodes[3].Connect()

But the node has just received the missing commit messages which allow it to commit the block.
Sure, it can be that the node is incredibly unlucky, but I find it very hard to believe, because why would the node receive the commit messages so late? Therefore I think seeing the stack trace can be beneficial to ensure nothing here is overlooked.

Setting that aside, would it not make sense to let the view finish committing the decision before tearing down the view?

pfi79 commented Apr 28, 2025

So the reason I asked to see the stack trace from the benchmark test is that I find it odd that we have a view change due to a timeout (which requires the node to be disconnected for a period of time):

	// Emulation of node removal from the consensus in the absence of heartbeat
	nodes[3].Disconnect()
	<-leaderComplaneCh
	nodes[3].Connect()

But the node has just received the missing commit messages which allow it to commit the block. Sure, it can be that the node is incredibly unlucky, but I find it very hard to believe, because why would the node receive the commit messages so late? Therefore I think seeing the stack trace can be beneficial to ensure nothing here is overlooked.

Setting that aside, would it not make sense to let the view finish committing the decision before tearing down the view?

I'll describe what happened on the bench.

  1. During the failure, this node considered the leader compromised, and no other node supported it.
    It kept sending out leader-change signals at a certain frequency and restarting the view every now and then.

In the test, the first thing I did was reproduce this behavior through a heartbeat timeout.

  2. Then the view continued to take part in processing transactions.
    And at some point these two events coincided:
  • the view sends a decision to the controller
  • the controller tries to abort the view
    They block each other, and the node stops making progress until it is restarted.

In the test I needed to make these two events happen almost simultaneously, which is what I did. The order is not important.
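
For illustration, a minimal Go sketch of this blocking pattern (the identifiers decisions and viewRunning are illustrative, not the SmartBFT code); running it makes the Go runtime report the deadlock:

	package main

	import "sync"

	func main() {
		decisions := make(chan struct{}) // the controller's decision channel (unbuffered)
		var viewRunning sync.WaitGroup

		viewRunning.Add(1)
		go func() { // the view's goroutine
			defer viewRunning.Done()
			decisions <- struct{}{} // "view sends a decision to the controller": blocks
		}()

		// "controller tries to abort the view": waits for the view goroutine,
		// which is stuck on the send above. Neither side can proceed.
		viewRunning.Wait()
		<-decisions // never reached
	}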

pfi79 commented Apr 28, 2025

orderer.txt

Let's go through the log file.

Each view decision
Sequence: 4870-->4871
is accompanied by Fabric receiving the block and writing it down:
Delivering proposal, writing block 4870 with 1 transactions and metadata
[channel: acl] Wrote block [4870]

Only in the last case (line 1947) is this missing; there is just a single Sequence: 4872-->4873 record.

Conclusion one: somewhere on the way from the view through the controller to Fabric, the decision was lost. It did not reach its destination.

A few lines above (line 1501) we see that the controller received a get msg from syncChan message and is busy processing it. That processing ends with an attempt to stop the view (line 2009): Aborting current view with number 42. After that, no new view starts until the end of the file.

Conclusion two: the view has not stopped.

Analyzing the code yields only one explanation: the view and the controller are trying to call each other at the same time.

pfi79 force-pushed the decide-abort-view branch from 52ac023 to 0a68060, April 28, 2025 06:46
pfi79 commented May 1, 2025

@yacovm I got it. It's Fabric v3.0.0.

goroutine 250 gp=0xc001749180 m=nil [semacquire, 69 minutes]:
runtime.gopark(0x54a4bd?, 0xc0060d3ca8?, 0x0?, 0x8c?, 0xdf75c5?)
	/usr/local/go/src/runtime/proc.go:424 +0xce fp=0xc007411c00 sp=0xc007411be0 pc=0x47616e
runtime.goparkunlock(...)
	/usr/local/go/src/runtime/proc.go:430
runtime.semacquire1(0xc002f2e1e0, 0x0, 0x1, 0x0, 0x12)
	/usr/local/go/src/runtime/sema.go:178 +0x225 fp=0xc007411c68 sp=0xc007411c00 pc=0x452e85
sync.runtime_Semacquire(0xc004bc53ff?)
	/usr/local/go/src/runtime/sema.go:71 +0x25 fp=0xc007411ca0 sp=0xc007411c68 pc=0x477525
sync.(*WaitGroup).Wait(0x1?)
	/usr/local/go/src/sync/waitgroup.go:118 +0x48 fp=0xc007411cc8 sp=0xc007411ca0 pc=0x48c9a8
github.com/hyperledger-labs/SmartBFT/internal/bft.(*View).Abort(0xc002f2e000)
	/vendor/github.com/hyperledger-labs/SmartBFT/internal/bft/view.go:1003 +0x2a fp=0xc007411ce0 sp=0xc007411cc8 pc=0xdf760a
github.com/hyperledger-labs/SmartBFT/internal/bft.(*Controller).abortView(0xc0016d2908, 0x3c)
	/vendor/github.com/hyperledger-labs/SmartBFT/internal/bft/controller.go:443 +0x251 fp=0xc007411d38 sp=0xc007411ce0 pc=0xddf831
github.com/hyperledger-labs/SmartBFT/internal/bft.(*Controller).changeView(0xc0016d2908, 0x3c, 0x18ca, 0x0)
	/vendor/github.com/hyperledger-labs/SmartBFT/internal/bft/controller.go:414 +0x1fd fp=0xc007411d98 sp=0xc007411d38 pc=0xddf4bd
github.com/hyperledger-labs/SmartBFT/internal/bft.(*Controller).run(0xc0016d2908)
	/vendor/github.com/hyperledger-labs/SmartBFT/internal/bft/controller.go:515 +0x312 fp=0xc007411fa8 sp=0xc007411d98 pc=0xddffb2
github.com/hyperledger-labs/SmartBFT/internal/bft.(*Controller).Start.func1()
	/vendor/github.com/hyperledger-labs/SmartBFT/internal/bft/controller.go:810 +0x4c fp=0xc007411fe0 sp=0xc007411fa8 pc=0xde282c
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc007411fe8 sp=0xc007411fe0 pc=0x47e321
created by github.com/hyperledger-labs/SmartBFT/internal/bft.(*Controller).Start in goroutine 1
	/vendor/github.com/hyperledger-labs/SmartBFT/internal/bft/controller.go:808 +0x689

goroutine 1338755 gp=0xc00739d880 m=nil [select, 69 minutes]:
runtime.gopark(0xc002bedc70?, 0x2?, 0x10?, 0x0?, 0xc002bedc64?)
	/usr/local/go/src/runtime/proc.go:424 +0xce fp=0xc0051d3b08 sp=0xc0051d3ae8 pc=0x47616e
runtime.selectgo(0xc0051d3c70, 0xc002bedc60, 0x0?, 0x1, 0xc000071508?, 0x1)
	/usr/local/go/src/runtime/select.go:335 +0x7a5 fp=0xc0051d3c30 sp=0xc0051d3b08 pc=0x451de5
github.com/hyperledger-labs/SmartBFT/internal/bft.(*Controller).Decide(0xc0016d2908, {{0xc000f76600, 0x12fd, 0x1300}, {0xc006ab3590, 0x4a, 0x50}, {0xc007397de8, 0x8, 0x8}, ...}, ...)
	/vendor/github.com/hyperledger-labs/SmartBFT/internal/bft/controller.go:874 +0xf0 fp=0xc0051d3d40 sp=0xc0051d3c30 pc=0xde2bb0
github.com/hyperledger-labs/SmartBFT/internal/bft.(*View).decide(0xc002f2e000, 0xc00689a2d0, {0xc005bf1dc0, 0x2, 0x2}, {0xc0065d74e0, 0x1, 0x1})
	/vendor/github.com/hyperledger-labs/SmartBFT/internal/bft/view.go:857 +0x324 fp=0xc0051d3e60 sp=0xc0051d3d40 pc=0xdf64a4
github.com/hyperledger-labs/SmartBFT/internal/bft.(*View).prepared(0xc002f2e000)
	/vendor/github.com/hyperledger-labs/SmartBFT/internal/bft/view.go:347 +0x29f fp=0xc0051d3ee0 sp=0xc0051d3e60 pc=0xdf193f
github.com/hyperledger-labs/SmartBFT/internal/bft.(*View).doPhase(0xc002f2e000)
	/vendor/github.com/hyperledger-labs/SmartBFT/internal/bft/view.go:289 +0x125 fp=0xc0051d3f20 sp=0xc0051d3ee0 pc=0xdf1325
github.com/hyperledger-labs/SmartBFT/internal/bft.(*View).run(0xc002f2e000)
	/vendor/github.com/hyperledger-labs/SmartBFT/internal/bft/view.go:277 +0x117 fp=0xc0051d3fc8 sp=0xc0051d3f20 pc=0xdf10b7
github.com/hyperledger-labs/SmartBFT/internal/bft.(*View).Start.func1()
	/vendor/github.com/hyperledger-labs/SmartBFT/internal/bft/view.go:140 +0x17 fp=0xc0051d3fe0 sp=0xc0051d3fc8 pc=0xdf03d7
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc0051d3fe8 sp=0xc0051d3fe0 pc=0x47e321
created by github.com/hyperledger-labs/SmartBFT/internal/bft.(*View).Start in goroutine 250
	/vendor/github.com/hyperledger-labs/SmartBFT/internal/bft/view.go:139 +0x1b8

yacovm commented May 1, 2025

can you attach the entire stack trace of the node?

pfi79 commented May 1, 2025

pfi79 force-pushed the decide-abort-view branch from 0a68060 to 9a693d4, May 1, 2025 10:58
pfi79 commented May 1, 2025

can you attach the entire stack trace of the node?

Added

…view stopping

Signed-off-by: Fedor Partanskiy <fredprtnsk@gmail.com>
pfi79 force-pushed the decide-abort-view branch from 9a693d4 to 728da84, May 1, 2025 21:24
yacovm merged commit eb005ee into main May 3, 2025
10 checks passed
pfi79 deleted the decide-abort-view branch May 4, 2025 05:34
HagarMeir commented

@yacovm @pfi79
I took a look at the code here and I don't understand the purpose of this abort channel
It seems like it is never used (or closed) so probably not needed
Maybe adding a buffer to the decision channel is enough?
If you agree I can make a fix

pfi79 commented May 6, 2025

I took a look at the code here and I don't understand the purpose of this abort channel
It seems like it is never used (or closed) so probably not needed
Maybe adding a buffer to the decision channel is enough?
If you agree I can make a fix

I don't get it. Why don't you want this channel?

HagarMeir commented

I took a look at the code here and I don't understand the purpose of this abort channel
It seems like it is never used (or closed) so probably not needed
Maybe adding a buffer to the decision channel is enough?
If you agree I can make a fix

I don't get it. Why don't you want this channel?

It seems to me it is not used anywhere

pfi79 commented May 6, 2025

It seems to me it is not used anywhere

Or maybe you mean something else. But I think it's being used.

HagarMeir commented

It seems to me it is not used anywhere

Or maybe you mean something else. But I think it's being used.

What I mean is it seems to me the channel is not used because you never close it
So adding a case to wait on this channel does not contribute anything

pfi79 commented May 6, 2025

It seems to me it is not used anywhere

Or maybe you mean something else. But I think it's being used.

What I mean is it seems to me the channel is not used because you never close it So adding a case to wait on this channel does not contribute anything

Of course it's closed. You can remove it and run the newly added test TestDecideAndAbortViewAtSameTime: without this channel being closed, the test terminates with an error.

HagarMeir commented

I am sorry...my mistake...I didn't notice you are returning the already existing abort chan in the view...
Never mind

pfi79 commented May 6, 2025

I purposely exposed it outside the view to bypass the deadlock between the view and the controller.
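
A minimal Go sketch of that idea (the identifiers decisions, abort, and viewRunning are illustrative, not the SmartBFT code): the decision send also selects on the view's abort channel, which Abort closes before waiting, so the view goroutine can always exit:

	package main

	import (
		"fmt"
		"sync"
	)

	func main() {
		decisions := make(chan struct{}) // still unbuffered
		abort := make(chan struct{})     // closed when the view is aborted
		var viewRunning sync.WaitGroup

		viewRunning.Add(1)
		go func() { // the view's goroutine
			defer viewRunning.Done()
			select {
			case decisions <- struct{}{}: // decision delivered normally
			case <-abort: // aborted mid-send: give up and exit
			}
		}()

		close(abort)       // Abort signals the view first...
		viewRunning.Wait() // ...then waits; the goroutine can now exit
		fmt.Println("view stopped cleanly")
	}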
