query: implement likelihood sampling based peer scheduling and add work stealing across peers #292

@Roasbeef

Description

This issue is spun off from some findings we've seen with neutrino over time, either when we have a lot of active peers, or when some of the peers are connected over Tor. This'll serve as a master tracking issue for a number of improvements we can make to this area of the code base. I also consider much of this to be preparatory work before we begin parallel block download in btcd in earnest. This was spun off from discussions in lightningnetwork/lnd#8250.

Sampling Based Peer Ranking

Today the default peer ranking algorithm will just re-order the peers, keeping each peer's score in a map:

```go
// peerRanking is a struct that keeps history of peer's previous query success
// rate, and uses that to prioritise which peers to give the next queries to.
type peerRanking struct {
	// rank keeps track of the current set of peers and their score. A
	// lower score is better.
	rank map[string]uint64
}
```
Each time we go to schedule something new, we use the ranking to sort a slice of the peers, then go attempt to give the work to the next free peer: https://github.com/lightninglabs/neutrino/blob/master/query/workmanager.go#L226-L256.
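
For reference, the current flow boils down to something like the following simplified sketch; the names here are hypothetical, and the real logic lives in the workmanager code linked above:

```go
package sketch

import "sort"

// nextFreePeer sketches the existing sort-then-assign behaviour: order the
// peers by score (lower is better) and hand the work to the first one that
// is not already busy. This is a simplified illustration, not a copy of the
// workmanager code.
func nextFreePeer(rank map[string]uint64, busy map[string]bool) (string, bool) {
	peers := make([]string, 0, len(rank))
	for p := range rank {
		peers = append(peers, p)
	}

	// Best-ranked peers come first.
	sort.Slice(peers, func(i, j int) bool {
		return rank[peers[i]] < rank[peers[j]]
	})

	for _, p := range peers {
		if !busy[p] {
			return p, true
		}
	}
	return "", false
}
```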

As peers start to fill up, we'll allocate a task to the next peer in line. In the wild, we've seen that this ends up allocating tasks to slow peers.

To get around this, we should look at using a sampling based method: instead of always walking the sorted list, we'd map the rankings into a probability that the peer will respond properly, then sample the next peer from that distribution.
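
A minimal sketch of what that sampling could look like, assuming the weights are derived directly from the existing scores (the function and parameter names below are hypothetical, not existing neutrino APIs):

```go
package sketch

import "math/rand"

// samplePeer picks the next peer at random, weighted by an estimated success
// probability derived from its score (lower score is better). Hypothetical
// sketch only; a real version would likely derive weights from observed
// success rates and latencies.
func samplePeer(rng *rand.Rand, rank map[string]uint64) string {
	// Turn each score into a weight so lower-scored (better) peers get a
	// proportionally larger share of the probability mass.
	weights := make(map[string]float64, len(rank))
	var total float64
	for peer, score := range rank {
		w := 1.0 / float64(score+1)
		weights[peer] = w
		total += w
	}

	// Draw a point uniformly in [0, total) and walk the cumulative
	// distribution to find the selected peer.
	target := rng.Float64() * total
	var acc float64
	var last string
	for peer, w := range weights {
		acc += w
		last = peer
		if target < acc {
			return peer
		}
	}
	return last // Guard against floating point rounding at the boundary.
}
```

With something like this in place, the best-ranked peers still get most of the work, but slower peers keep receiving the occasional query, so their scores can recover rather than the scheduler always piling work onto whoever happens to be next in the sorted order.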

Dynamic Ping Tuned Timeouts

Today we have a hard coded 2 second timeout for all requests. This works in settings with very good network connections, but Tor nodes may be too slow to respond within it.

Instead, we should use the ping latency to tune the timeout for a given peer, based on the extrapolated RTT. This can also be fed into the peer ranking itself. We should also refresh this metric in the background (as network jitter can change perceived latency, especially on Tor).
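
One way to do this is a per-peer smoothed RTT estimate in the style of the classic TCP retransmission timeout calculation. The sketch below is hypothetical: the type name is made up, and the 2 second floor is an assumption that simply mirrors today's hard coded value:

```go
package sketch

import (
	"sync"
	"time"
)

// rttTracker keeps a smoothed RTT estimate for a single peer, updated from
// ping samples gathered in the background.
type rttTracker struct {
	mu       sync.Mutex
	smoothed time.Duration // EWMA of observed ping round trips.
	variance time.Duration // EWMA of the deviation from the mean.
}

// observe folds a new ping sample into the estimate, using the usual
// alpha=1/8 and beta=1/4 smoothing factors from the TCP RTO calculation.
func (r *rttTracker) observe(sample time.Duration) {
	r.mu.Lock()
	defer r.mu.Unlock()

	if r.smoothed == 0 {
		r.smoothed = sample
		r.variance = sample / 2
		return
	}

	diff := r.smoothed - sample
	if diff < 0 {
		diff = -diff
	}
	r.variance += (diff - r.variance) / 4
	r.smoothed += (sample - r.smoothed) / 8
}

// timeout derives a per-peer query timeout from the smoothed RTT, with a
// floor so well connected peers still get a sane minimum.
func (r *rttTracker) timeout() time.Duration {
	r.mu.Lock()
	defer r.mu.Unlock()

	t := r.smoothed + 4*r.variance
	if t < 2*time.Second {
		t = 2 * time.Second
	}
	return t
}
```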

Work Stealing Workers

We should add work stealing to ensure that fast peers are able to dynamically take on more load and increase overall throughput. With work stealing, a peer would check at the top of its loop whether another peer has work that has yet to be claimed, and if so add that work to its own queue (see the sketch below). The Go scheduler also uses work stealing itself to ensure all processors are utilized: https://rakyll.org/scheduler/.
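
A rough sketch of the stealing step, using hypothetical per-peer queue types rather than the actual work manager structures:

```go
package sketch

import "sync"

// task stands in for a single unit of query work handed to a peer.
type task struct {
	id int
}

// workQueues holds one pending-work slice per peer, guarded by a single lock
// for simplicity in this illustration.
type workQueues struct {
	mu    sync.Mutex
	peers map[string][]task
}

// next returns work for the given peer. If its own queue is empty, it tries
// to steal an unclaimed task from the back of another peer's queue.
func (w *workQueues) next(self string) (task, bool) {
	w.mu.Lock()
	defer w.mu.Unlock()

	// Fast path: take from our own queue first.
	if q := w.peers[self]; len(q) > 0 {
		t := q[0]
		w.peers[self] = q[1:]
		return t, true
	}

	// Steal path: scan other peers for unclaimed work, taking the most
	// recently queued task so the victim keeps its oldest work.
	for peer, q := range w.peers {
		if peer == self || len(q) == 0 {
			continue
		}
		t := q[len(q)-1]
		w.peers[peer] = q[:len(q)-1]
		return t, true
	}
	return task{}, false
}
```

The appeal of this shape is that no central rebalancing pass is needed: unclaimed work simply sits in a slow peer's queue until either that peer gets to it or a faster peer steals it.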

Along the way, we should modify the work queue to add better type safety via type parameters.
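
For example, a generic queue along these lines (a sketch, not an existing neutrino type) would let the compiler check the element type at every call site:

```go
package sketch

// workQueue is a minimal generic FIFO illustrating how type parameters could
// give the query work queue compile-time type safety.
type workQueue[T any] struct {
	items []T
}

// push appends a work item of the concrete element type T.
func (q *workQueue[T]) push(item T) {
	q.items = append(q.items, item)
}

// pop removes and returns the oldest item, reporting whether one existed.
func (q *workQueue[T]) pop() (T, bool) {
	var zero T
	if len(q.items) == 0 {
		return zero, false
	}
	item := q.items[0]
	q.items = q.items[1:]
	return item, true
}
```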
