This issue is spun off from some findings we've seen with neutrino over time, either when we have a lot of active peers, or when some of the peers are connected over tor. This'll serve as a master tracking issue for a number of improvements we can make to this area of the code base. I also consider much of this to be preparatory work before we begin parallel block download in btcd in earnest. This was spun off discussions in lightningnetwork/lnd#8250.
Sampling Based Peer Ranking
Today the default peer ranking algorithm will just re-order the peers:
Lines 20 to 26 in 42a196f:

```go
// peerRanking is a struct that keeps history of peer's previous query success
// rate, and uses that to prioritise which peers to give the next queries to.
type peerRanking struct {
	// rank keeps track of the current set of peers and their score. A
	// lower score is better.
	rank map[string]uint64
}
```
As peers start to fill up with tasks, we'll allocate the next task to the next peer in line. In the wild, we've seen that this ends up allocating tasks to slow peers.
To get around this, we should look at using a sampling based method: map the rankings into a probability that the peer will respond promptly, and sample peers according to that distribution instead.
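As a rough illustration, one way to do this is weighted random sampling over the inverse of each peer's score (lower score is better, so it gets a larger weight). The function below is a hypothetical sketch rather than existing neutrino code; the package and names are illustrative, it just shows the general shape of sampling from the existing `rank` map:

```go
package query

import "math/rand"

// samplePeer picks a peer at random, weighted by the inverse of its score,
// so better-ranked peers are chosen more often while slow peers still get
// occasional work and a chance to improve their ranking.
func samplePeer(rank map[string]uint64) string {
	type weighted struct {
		addr   string
		weight float64
	}

	var (
		peers []weighted
		total float64
	)
	for addr, score := range rank {
		// Guard against a zero score by shifting by one; a lower score
		// still maps to a larger weight.
		w := 1.0 / float64(score+1)
		peers = append(peers, weighted{addr: addr, weight: w})
		total += w
	}
	if len(peers) == 0 {
		return ""
	}

	// Draw a point in [0, total) and walk the cumulative weights.
	r := rand.Float64() * total
	for _, p := range peers {
		if r < p.weight {
			return p.addr
		}
		r -= p.weight
	}

	// Floating point rounding can leave us here; fall back to the last peer.
	return peers[len(peers)-1].addr
}
```

With this approach, poorly ranked peers are de-prioritised rather than starved outright, so a peer whose conditions improve can still earn back a good score.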
Dynamic Ping Tuned Timeouts
Today we have a hard coded 2 second timeout for all requests. This works well for peers with good network connections, but for tor peers a fixed 2 seconds is often too short, as their responses routinely take longer.
Instead, we should use the ping latency to tune the timeout for a given peer, based on the extrapolated RTT. This can also feed into the peer ranking itself. We should also refresh this metric in the background, since network jitter can change perceived latency, especially over tor.
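As a sketch of what this could look like (the names, constants, and structure here are illustrative assumptions, not existing neutrino code), each peer could carry an exponentially weighted moving average of its ping RTT, refreshed in the background, with the per-request timeout derived as a clamped multiple of that estimate:

```go
package query

import (
	"sync"
	"time"
)

const (
	// rttAlpha is the EWMA smoothing factor for new RTT samples.
	rttAlpha = 0.2

	// timeoutMultiplier scales the smoothed RTT into a request timeout.
	timeoutMultiplier = 4

	// minTimeout and maxTimeout clamp the derived timeout so it never
	// becomes unreasonably small (clearnet) or large (flaky tor circuits).
	minTimeout = 2 * time.Second
	maxTimeout = 30 * time.Second
)

// peerLatency tracks a smoothed RTT estimate for a single peer.
type peerLatency struct {
	mu          sync.Mutex
	smoothedRTT time.Duration
}

// observeRTT folds a new ping sample into the smoothed estimate. This would
// be called periodically in the background to track network jitter.
func (p *peerLatency) observeRTT(sample time.Duration) {
	p.mu.Lock()
	defer p.mu.Unlock()

	if p.smoothedRTT == 0 {
		p.smoothedRTT = sample
		return
	}
	p.smoothedRTT = time.Duration(
		(1-rttAlpha)*float64(p.smoothedRTT) + rttAlpha*float64(sample),
	)
}

// queryTimeout derives the timeout to use for the next request to this peer.
func (p *peerLatency) queryTimeout() time.Duration {
	p.mu.Lock()
	defer p.mu.Unlock()

	timeout := p.smoothedRTT * timeoutMultiplier
	if timeout < minTimeout {
		return minTimeout
	}
	if timeout > maxTimeout {
		return maxTimeout
	}
	return timeout
}
```

The same smoothed RTT could then be folded into the ranking score, so that consistently slow peers also sink in priority.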
Work Stealing Workers
We should add work stealing to ensure that fast peers are able to dynamically take on more load, increasing overall throughput. With work stealing, a peer would check at the top of its loop whether another peer has work that has yet to be claimed, and if so add that work to its own queue. The Go scheduler itself uses work stealing to ensure all processors are utilized: https://rakyll.org/scheduler/. A rough sketch of the stealing step is below.
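A minimal sketch of that check, assuming each worker keeps its own per-peer queue (the types and names here are hypothetical, not existing neutrino structures):

```go
package query

import "sync"

// job represents a single query to send to a peer.
type job struct {
	id uint64
}

// worker owns a queue of jobs destined for one peer.
type worker struct {
	mu    sync.Mutex
	queue []*job
}

// pop removes a job from the front of this worker's own queue.
func (w *worker) pop() *job {
	w.mu.Lock()
	defer w.mu.Unlock()

	if len(w.queue) == 0 {
		return nil
	}
	j := w.queue[0]
	w.queue = w.queue[1:]
	return j
}

// steal removes a job from the back of another worker's queue, so the victim
// and the thief don't contend over the same end of the queue.
func (w *worker) steal() *job {
	w.mu.Lock()
	defer w.mu.Unlock()

	if len(w.queue) == 0 {
		return nil
	}
	j := w.queue[len(w.queue)-1]
	w.queue = w.queue[:len(w.queue)-1]
	return j
}

// nextJob is called at the top of a worker's loop: prefer local work, then
// attempt to steal unclaimed work from the other workers.
func nextJob(self *worker, others []*worker) *job {
	if j := self.pop(); j != nil {
		return j
	}
	for _, other := range others {
		if j := other.steal(); j != nil {
			return j
		}
	}
	return nil
}
```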
Along the way, we should modify the work queue to add better type safety via type parameters.
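For example, a queue parameterised with Go type parameters could look roughly like this (a hypothetical sketch, not the current query package API):

```go
package query

// workQueue is a minimal FIFO queue over any job type T, giving callers
// compile-time type safety instead of interface{} assertions.
type workQueue[T any] struct {
	items []T
}

// push appends a job to the queue.
func (q *workQueue[T]) push(item T) {
	q.items = append(q.items, item)
}

// pop removes and returns the oldest job, reporting whether one was present.
func (q *workQueue[T]) pop() (T, bool) {
	var zero T
	if len(q.items) == 0 {
		return zero, false
	}
	item := q.items[0]
	q.items = q.items[1:]
	return item, true
}
```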