Improve VSock dialing timeout #133

Open
@sipsma

Description

Our current vsock dialer implementation does exponential backoff from 100ms to 1.6s before giving up.

I encountered a real-world situation in which this timeout was too short and caused ctr run to fail unnecessarily. I had attached strace to firecracker as it started up (in order to debug a separate issue), which understandably slowed down VM startup significantly. I could see in the strace output that firecracker was still in the midst of copying the VM's rootfs when firecracker-containerd gave up dialing the VSock. When I increased the (currently hardcoded) timeout to try one more time (6 retries instead of 5), ctr run completed successfully.

While I encountered this when attaching strace, it seems plausible that the timeout could be hit in other real-world situations, such as hardware slower than an i3.metal and/or large VM rootfs images.

There are a few possible fixes here (not mutually exclusive):

  1. Just increase the timeout
  2. Make the timeout configurable
  3. See if checking the status of the VM (enum here) can help improve the logic; i.e. should we first have a waiting period for the status to change to Running, and then a separate waiting period for trying to connect to the agent?
