
apt(-get) operations that require network: do not suppress stdout/err, timeout-control, retry #347

@jgehrcke


I think that we were bitten by a lack of a meaningful TCP connect() timeout in apt-get ... when we recently saw the driver container hang in places like

Installing Linux kernel headers...
Installing Linux kernel module files...

The pod would hang there for a long time and eventually restart (after about 20 minutes).

It took quite a bit of back and forth before we understood that this is probably networking-related and not filesystem-related. We would have understood that faster if we had seen the stdout of apt-get: it would have revealed that we're running apt-get in the first place (the log messages shown above don't show that), and that apt is trying to fetch data from remote infra.

Looking at the code, in many places we invoke apt-get with this pattern:

apt-get -qq install ... > /dev/null

Changes that we should make:

  • never suppress stdout
  • add an overall meaningful timeout around the command, and retry it (a good timeout constant could be: maximum expected time of execution on a slow system times ~5)
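
The two changes above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `run_with_retry` and `RETRY_DELAY_S` are hypothetical names, and it assumes the coreutils `timeout` command is available in the container image.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not existing project code): run a command under an
# overall timeout and retry it on failure, without hiding its output.
# Usage: run_with_retry <timeout_seconds> <max_attempts> <command> [args...]
run_with_retry() {
    local timeout_s="$1" max_attempts="$2" attempt=1 rc=0
    shift 2
    while [ "$attempt" -le "$max_attempts" ]; do
        # Log what is about to run, so a hang can be located from the logs.
        echo "attempt ${attempt}/${max_attempts} (timeout ${timeout_s}s): $*"
        rc=0
        # stdout/stderr are deliberately NOT redirected to /dev/null.
        timeout "$timeout_s" "$@" || rc=$?
        if [ "$rc" -eq 0 ]; then
            return 0
        elif [ "$rc" -eq 124 ]; then
            # coreutils timeout exits with 124 when the time limit was hit.
            echo "timed out after ${timeout_s}s" >&2
        else
            echo "failed with exit code ${rc}" >&2
        fi
        attempt=$((attempt + 1))
        if [ "$attempt" -le "$max_attempts" ]; then
            # Assumed default pause between attempts; override via RETRY_DELAY_S.
            sleep "${RETRY_DELAY_S:-5}"
        fi
    done
    # All attempts failed: propagate the exit code of the last one.
    return "$rc"
}
```

A call site might then look like `run_with_retry 300 3 apt-get install -y <packages>`, where 300 and 3 are placeholder values to be derived from the rule of thumb above. Note that `-qq` implies `-y`, so dropping `-qq` to get output back means passing `-y` explicitly.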

A bit of background / big picture about this type of problem: I deeply care about debuggability. That often means that we want to support humans in their debugging efforts, so that they don't spend unnecessary time.

One of the less fun, unnecessary time sinks during debugging is when there's no log output and something just hangs in an operation that

  1. could have been logged (so that one knows where something hangs)
  2. could have been timeout-controlled and retried (often leading to success, instead of hanging forever)
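
For apt specifically, point 2 might also be approached from inside apt, via the `Acquire::http::Timeout` and `Acquire::Retries` options documented in apt.conf(5). The sketch below is an assumption-laden illustration: `apt_get_net`, `APT_NET_TIMEOUT_S`, and `APT_RETRIES` are hypothetical names, the values are placeholders, and whether these settings would have bounded the hang described in this issue has not been verified.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not existing project code): pass apt's own network
# timeout and retry settings on the command line via -o.
# Usage: apt_get_net <apt-get arguments...>
apt_get_net() {
    # Timeout applies to apt's http(s) acquire methods; Retries tells apt
    # to retry failed downloads. Defaults here are placeholders.
    apt-get \
        -o Acquire::http::Timeout="${APT_NET_TIMEOUT_S:-30}" \
        -o Acquire::https::Timeout="${APT_NET_TIMEOUT_S:-30}" \
        -o Acquire::Retries="${APT_RETRIES:-3}" \
        "$@"
}
```

These options only cover the download phase, so an overall timeout around the whole command would still be useful as a backstop.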

Rather often that relates to deliberately setting a TCP connect() timeout. For the record, I have frequently talked about and written about the immense importance and power of combining i) timeout control, ii) retries, and iii) logging elsewhere, such as in
