Description
Our current vsock dialer implementation does exponential backoff from 100ms to 1.6s before giving up.
I encountered a real-world situation in which this timeout was too short and caused `ctr run` to fail unnecessarily. It happened when I attached strace to firecracker as it started (in order to debug a separate issue), which understandably slowed down VM startup significantly. I could see in the strace output that firecracker was still in the midst of copying the VM's rootfs when firecracker-containerd gave up dialing the vsock. When I increased the (currently hardcoded) limit to try one more time (6 retries instead of 5), `ctr run` completed successfully.
While I encountered this when attaching strace, it seems plausible the timeout could be hit in other real-world situations, such as hardware slower than an i3.metal and/or large VM rootfs images.
There are a few possible fixes here (not mutually exclusive):
- Just increase the timeout
- Make the timeout configurable
- See if checking the status of the VM (enum here) can help improve the logic here; i.e. should we first have a waiting period for the status to change to `Running` and then have a separate waiting period for trying to connect to the agent?