-
Notifications
You must be signed in to change notification settings - Fork 926
Description
Description
We suspect that a race condition exists in the reprocessing queue. Consider the following events:
- An attestation A arrives for a block that is being processed (block B). The attestation is processed with an
UnknownHeadBlock
error and will be added to the reprocess queue. - Block B finishes being imported on another thread, and the
BlockImported
event is sent to the reprocess queue, which dequeues all existing attestations for B. Our attestation from A is not dequeued because it hasn't reached the reprocess queue yet. - Attestation A is added to the reprocess queue.
- Attestation A times out after 12s and is retried. It succeeds upon retry, because B was already processed.
This is suboptimal, as A is processed ~12s later than it could have been. This is also a counterexample to the simplification proposed here:
Version
Lighthouse v8.0.0-rc.0 or thereabouts
Steps to resolve
One way to fix this might be to make the reprocess queue aware of recently processed blocks (yuck?). It could immediately reprocess any messages it receives for recently processed blocks in order to guard against this race.
Another option is to just do nothing. We could try to observe how often this race happens in the wild before deciding whether an invasive code fix is worth it. Or we could wait until we are improving/rewriting the beacon processor.
Perhaps another option would be to use the new block status tracker to link the reprocessing of the attestation to the completion of the block import. We could attach a receiver for the status channel to the attestation when it initially fails processing, and use this to prompt reprocessing. However I think a similar race could still exist even if this case, if the block is completely unknown when attestation processing begins, but it starts and finishes importing between when the attestation is found to be invalid and when it is added to the reprocess queue (this is quite unlikely because this window of time is small, but not impossible from a concurrency correctness PoV).