Improve VSock dialing timeout

Our current vsock dialer implementation does exponential backoff from 100ms to 1.6s before giving up.

I ran into a real-world situation in which this timeout was too short and caused `ctr run` to fail unnecessarily. I had attached strace to firecracker as it started up (in order to debug a separate issue), which understandably slowed down VM startup significantly. I could see in the strace output that firecracker was still in the midst of copying the VM's rootfs when firecracker-containerd gave up dialing the VSock. When I increased the (currently hardcoded) timeout to allow one more attempt (6 retries instead of 5), `ctr run` completed successfully.

While I encountered this when attaching strace, it seems plausible that the timeout could be hit in other real-world situations, such as hardware slower than an i3.metal and/or large VM rootfs images.

There are a few possible fixes here (not mutually exclusive):
1. Just increase the timeout
2. Make the timeout configurable
3. See if checking the status of the VM ([enum here](https://github.com/firecracker-microvm/firecracker/blob/3157cc5d94f8f827fcb4338e926fc9f0138e019d/api_server/swagger/firecracker.yaml#L424)) can help improve the logic here; i.e. should we first have a waiting period for the status to change to `Running` and then have a separate waiting period for trying to connect to the agent?

Improve VSock dialing timeout #133
