Remove a noisy log line #1125

djeebus · 2025-09-04T20:37:57Z

This line is 30% of the last 60 minutes of logs in production

packages/orchestrator/internal/sandbox/uffd/serve.go

jakubno

Not sure about removing the log before we solve this, we identified it because of these logs, without them we have 0 observability

djeebus · 2025-09-15T17:23:14Z

Not sure about removing the log before we solve this, we identified it because of these logs, without them we have 0 observability

I don't think there's an actual bug here, it's working as designed, we're just spamming a lot of logs in a happy path.

jakubno · 2025-09-15T19:07:57Z

I though you reported an issue where the start was slowed down and it correlated to seeing a lot of these logs

djeebus · 2025-09-15T22:21:07Z

I though you reported an issue where the start was slowed down and it correlated to seeing a lot of these logs

Pretty sure the slowdown ended up being the clock sync issue fixed here. Regardless, this was something I found while I was looking for errors, but it was not the error itself.

ValentaTomas · 2025-10-08T18:29:01Z

We can try the epoll to reduce this, but I think the reason for the EAGAIN is because of the race between the uffd request and page served in a goroutine.

@jakubno The correlation is not between the error itself, but probably between the length of how long it took to serve the page:

A different postcopy thread in the destination node listens with poll() to the userfaultfd in parallel. When a POLLIN event is generated after a userfault triggers, the postcopy thread read() from the userfaultfd and receives the fault address (or -EAGAIN in case the userfault was already resolved and waken by a UFFDIO_COPY|ZEROPAGE run by the parallel QEMU migration thread).

(https://docs.kernel.org/admin-guide/mm/userfaultfd.html)

ValentaTomas · 2025-10-08T19:36:46Z

We could potentially get rid of the EAGAIN by using a blocking uffd fd, but I'm not sure of the other implications there.

https://man7.org/linux/man-pages/man2/userfaultfd.2.html
https://man7.org/linux/man-pages/man2/open.2.html
https://github.com/firecracker-microvm/firecracker/blob/f335a0adf46f0680a141eb1e76fe31ac258918c5/src/vmm/src/persist.rs#L479
https://github.com/torvalds/linux/blob/master/fs/userfaultfd.c#L1072-L1075

(cc @tomassrnka)

ValentaTomas · 2025-10-08T20:36:05Z

The log/error can be triggered if you put a small sleep before serving the page here

infra/packages/orchestrator/internal/sandbox/uffd/serve.go

Line 151 in 31b90a4

You also need to enable the logs as zap is not logged during tests to userfaultfd_test.go:

// start of TestUffdMissing
	config := zap.NewDevelopmentConfig()
	logger, err := config.Build()
	require.NoError(t, err)
	zap.ReplaceGlobals(logger)
...
err := uffd.Serve(t.Context(), m, data, fdExit, logger)

ValentaTomas · 2025-10-08T21:18:55Z

Using epoll does not help—getting:

uffd: eagain error, going back to polling       {"error": "resource temporarily unavailable", "read_bytes": 1}

ValentaTomas · 2025-10-08T21:25:17Z

With the poll the error is slightly different (returning "read_bytes" -1 instead of 1):

uffd: eagain error, going back to polling       {"error": "resource temporarily unavailable", "read_bytes": -1}

ValentaTomas · 2025-10-08T22:11:10Z

Removing syscall.O_NONBLOCK blocks the whole serving (it is in a go routine and uses the same fd).
Seems to be stuck on uffd: no data in fd, going back to polling

tomassrnka · 2025-10-11T11:03:48Z

I’d recommend switching to count-based logging instead of logging every single occurrence. Aggregating the EAGAIN events lets us keep visibility into how often the queue is empty, without flooding the logs. Rather than spamming a line for each empty poll (which can be tens of thousands per minute), we emit a single summary every few seconds. That way, you still see the behavior but don’t drown out real errors or blow the log budget.

Sth like (not meant as production patch, just an idea):

diff --git a/packages/orchestrator/internal/sandbox/uffd/serve.go b/packages/orchestrator/internal/sandbox/uffd/serve.go
index c5866b52d..b10f0a42c 100644
--- a/packages/orchestrator/internal/sandbox/uffd/serve.go
+++ b/packages/orchestrator/internal/sandbox/uffd/serve.go
@@ -5,6 +5,7 @@ import (
        "errors"
        "fmt"
        "syscall"
+       "time"
        "unsafe"

        "go.uber.org/zap"
@@ -42,6 +43,12 @@ func Serve(
        var eg errgroup.Group

        missingPagesBeingHandled := map[int64]struct{}{}
+       // Track how frequently we hit non-blocking reads with no data so we can log them in batches.
+       var (
+               eagainCount     int
+               lastEagainLog   = time.Now()
+               eagainLogWindow = 5 * time.Second
+       )

 outerLoop:
        for {
@@ -104,18 +111,25 @@ outerLoop:
                                continue
                        }

-                       if err == nil {
-                               // There is no error so we can proceed.
-                               break
-                       }
-
                        if err == syscall.EAGAIN {
-                               logger.Debug("uffd: eagain error, going back to polling", zap.Error(err), zap.Int("read_bytes", n))
+                               elapsed := time.Since(lastEagainLog)
+                               eagainCount++
+                               if elapsed >= eagainLogWindow {
+                                       logger.Info("uffd: read returned no data yet", zap.Int("occurrences", eagainCount), zap.Duration("window", elapsed))
+                                       eagainCount = 0
+                                       lastEagainLog = time.Now()
+                               }

                                // Continue polling the fd.
                                continue outerLoop
                        }

+                       if err == nil {
+                               _ = n
+                               // There is no error so we can proceed.
+                               break
+                       }
+
                        logger.Error("uffd: read error", zap.Error(err))

                        return fmt.Errorf("failed to read: %w", err)

ValentaTomas · 2025-10-12T19:12:30Z

Just to be explicit for context—I don't think we can get rid of the EAGAIN.

ValentaTomas · 2025-10-12T19:25:53Z

Also two ideas here:

Use logging debounce
OR log the EAGAINs occurence at the end of the goroutine that serves memory as it seems the eagains starts being there when there is a delay in serving (happens even for a single fault) [added possible implementation]

this line is 30% of the last 60 minutes of logs

f5e7ca8

djeebus marked this pull request as ready for review September 4, 2025 20:38

djeebus requested review from ValentaTomas, dobrac and jakubno as code owners September 4, 2025 20:38

djeebus added 2 commits September 4, 2025 16:42

EAGAIN implies that we should wait a little

10ac311

remove unused variable

49d9204

jakubno requested changes Sep 5, 2025

View reviewed changes

packages/orchestrator/internal/sandbox/uffd/serve.go Outdated Show resolved Hide resolved

remove the sleeps, one thing at a time!

31b90a4

jakubno reviewed Sep 9, 2025

View reviewed changes

djeebus self-assigned this Sep 10, 2025

djeebus marked this pull request as draft September 10, 2025 00:42

djeebus added the improvement Improvement for current functionality label Sep 10, 2025

ValentaTomas unassigned djeebus Sep 23, 2025

ValentaTomas self-assigned this Oct 8, 2025

Distinguish the outer loop eagain error message

e58784e

[WIP] Possible copy goroutine eagain log

81777ea

Remove a noisy log line #1125

Are you sure you want to change the base?

Remove a noisy log line #1125

Uh oh!

Conversation

djeebus commented Sep 4, 2025

Uh oh!

Uh oh!

jakubno left a comment

Choose a reason for hiding this comment

Uh oh!

djeebus commented Sep 15, 2025

Uh oh!

jakubno commented Sep 15, 2025

Uh oh!

djeebus commented Sep 15, 2025

Uh oh!

ValentaTomas commented Oct 8, 2025

Uh oh!

ValentaTomas commented Oct 8, 2025

Uh oh!

ValentaTomas commented Oct 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ValentaTomas commented Oct 8, 2025

Uh oh!

ValentaTomas commented Oct 8, 2025

Uh oh!

ValentaTomas commented Oct 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tomassrnka commented Oct 11, 2025

Uh oh!

ValentaTomas commented Oct 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ValentaTomas commented Oct 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

ValentaTomas commented Oct 8, 2025 •

edited

Loading

ValentaTomas commented Oct 8, 2025 •

edited

Loading

ValentaTomas commented Oct 12, 2025 •

edited

Loading

ValentaTomas commented Oct 12, 2025 •

edited

Loading