apt(-get) operations that require network: do not suppress stdout/err, timeout-control, retry

I think that we were bitten by a lack of a meaningful TCP connect() timeout in `apt-get ...` when we recently saw the driver container hang in places like

```
Installing Linux kernel headers...
```
```
Installing Linux kernel module files...
```
The pod would hang there for a long time and eventually restart (after about 20 minutes).

It took quite a bit of back and forth before we understood that this is probably networking-related and not filesystem-related. We would have understood that faster if we had seen the stdout of `apt-get`: it would have revealed that we're running `apt-get` in the first place (the log msgs shown above don't show that), and that apt is _trying to_ fetch data from remote infra.

Looking at the code, we invoke `apt-get` with this pattern in many places:
```
apt-get -qq install ... > /dev/null
```

Changes that we should make:
- never suppress stdout (or stderr)
- add an overall meaningful timeout around the command, and retry it (a good timeout constant could be: maximum expected time of execution on a slow system times ~5)
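A minimal sketch of what the second point could look like, assuming GNU coreutils `timeout` is available in the container. The function name `run_with_timeout_and_retry` and the `RETRY_DELAY_S` variable are made up for illustration; they do not exist in the code base today.

```shell
# Hypothetical helper (not part of the current code base).
# Usage: run_with_timeout_and_retry <timeout_seconds> <max_attempts> <command> [args...]
run_with_timeout_and_retry() {
    local timeout_s="$1" max_attempts="$2"
    shift 2
    local attempt=1 rc=0

    while true; do
        # Log to stderr which command is about to run, so a hang is attributable.
        echo "attempt ${attempt}/${max_attempts} (timeout: ${timeout_s}s): $*" >&2

        # `timeout` (GNU coreutils) sends SIGTERM once the deadline is hit
        # and then exits with status 124.
        rc=0
        timeout "${timeout_s}" "$@" || rc=$?

        if [ "${rc}" -eq 0 ]; then
            return 0
        fi

        if [ "${rc}" -eq 124 ]; then
            echo "timed out after ${timeout_s}s: $*" >&2
        else
            echo "failed with exit code ${rc}: $*" >&2
        fi

        if [ "${attempt}" -ge "${max_attempts}" ]; then
            # Give up and hand the last exit code to the caller.
            return "${rc}"
        fi

        attempt=$((attempt + 1))
        # Assumed default: wait 5 seconds between attempts.
        sleep "${RETRY_DELAY_S:-5}"
    done
}
```

A call site could then read `run_with_timeout_and_retry 300 3 apt-get install ...`, without `-qq` and without the `> /dev/null` redirect, so that apt's output ends up in the container log. The `300` and `3` are placeholders; the actual timeout should follow the "maximum expected time on a slow system times ~5" rule of thumb from above.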

A bit of background / big picture about this type of problem: I deeply care about debuggability. That often means that we want to support humans in their debugging efforts, so that they don't spend unnecessary time.

One of the less fun, unnecessary time sinks during debugging is when there's no log output and something just hangs in an operation that
1) could have been logged (so that one knows where something hangs)
2) could have been timeout-controlled and retried (often leading to success, instead of hanging forever)

Rather often, that comes down to deliberately setting a TCP connect() timeout. For the record, I have frequently talked and written elsewhere about the immense importance and power of combining i) timeout control, ii) retries, and iii) logging, for example in
- https://github.com/conbench/conbench/issues/801
- https://github.com/python/cpython/issues/89953#issuecomment-1458261299
- https://github.com/astral-sh/uv/issues/8144
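For apt specifically, the network timeout can also be set on the acquire methods themselves, in addition to an overall timeout around the command. A sketch, assuming the `Acquire::http::Timeout`, `Acquire::https::Timeout` and `Acquire::Retries` options behave as described in apt.conf(5) (as I read it, the timeout applies to the connection as well as to data transfer). The wrapper name `apt_get_net`, the `APT_GET` override and the default values are hypothetical, and I have not verified that this alone would have prevented the hang we saw.

```shell
# Hypothetical wrapper (not part of the current code base): runs apt-get
# with explicit network timeout and retry settings.
# APT_GET can be pointed at a different binary (for example `echo`) for a dry run.
apt_get_net() {
    # Assumed defaults: 20 s per-connection timeout, 3 retries per download.
    "${APT_GET:-apt-get}" \
        -o Acquire::http::Timeout="${APT_NET_TIMEOUT_S:-20}" \
        -o Acquire::https::Timeout="${APT_NET_TIMEOUT_S:-20}" \
        -o Acquire::Retries="${APT_NET_RETRIES:-3}" \
        "$@"
}
```

With that in place, `apt_get_net install ...` would replace the `apt-get -qq install ... > /dev/null` pattern, and apt's output stays visible.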



apt(-get) operations that require network: do not suppress stdout/err, timeout-control, retry #347
