Skip to content

Fix block and ref rootfs#572

Merged
urunc-bot[bot] merged 9 commits intomainfrom
fix_block_and_ref_rootfs
May 4, 2026
Merged

Fix block and ref rootfs#572
urunc-bot[bot] merged 9 commits intomainfrom
fix_block_and_ref_rootfs

Conversation

@cmainas
Copy link
Copy Markdown
Contributor

@cmainas cmainas commented Apr 23, 2026

Description

This PR takes care of various bugs we had with the handling of the rootfs, but all related to the related issue below. The main problem was that when the reexec process was unmounting the block-based snapshot which was mounted as the container;s rootfs in the host, the unmounting was not propagating to the other mount namespaces (especially the initial one where the mounting of the snapshot was taking place).

The thing is that usually the container;s rootfs in the host is mounted with MS_SHARED propagation flag. Therefore all mount / unmount events taking place in subsequent mount namespaces (or peer mount groups) should also propagate. In simple words the unmount in reexec should propagate, except if something was cutting of the propagation.

Well, indeed something was cutting off the propagation (See https://github.com/urunc-dev/urunc/blob/main/pkg/unikontainers/unikontainers.go#L341). As part of preparing the rootfs for the execution environment of the sandbox monitor, we had to set the propagation flags according to the spec we received for the new mount namespace and make sure that the parent of the monitor's rootfs in the host will be MS_PRIVATE. This is necessary for pivot to work.

All the above are necessary steps, but they cut off the mount propagation of mounts and when we perform the unmount later in https://github.com/urunc-dev/urunc/blob/main/pkg/unikontainers/unikontainers.go#L362 the unmount will never propagate. Subsequently the solution is simple: make sure the rootfs of the monitor process in the host does have MS_SHARED as propagation flag and unmount before cutting off the mount propagation.

Furthermore, to avoid any further surprises with block based mounts and rootfs, a simple check is added in getMountInfo to always exclude mount sources which appear more than once in /proc/self/mountinfo.

Due to the high cognitive complexity of Exec a quick fix would add extra complexity as shown in the 3rd commit. Since we are already planning to perform a major refactor for Exec and unikontainers, I brought parts of this refactor in this PR, in order to help with the further refactoring later.

At last, there was a bug in checking for a specific namespace since findNS was also checking for the path of the namespace in /proc and hence failing for new namespaces that the runtime needs to create. As a result, when we had to create a mount namespace we were not pivoting but chroot.

Related issues

How was this tested?

With the end-to-end tests and to make sure the unmounting propagates, after running a urunc container which can accept a block-based device as the sandbox's rootfs the mount command should not show this block image mounted.

LLM usage

N/A

Checklist

  • I have read the contribution guide.
  • The linter passes locally (make lint).
  • The e2e tests of at least one tool pass locally (make test_ctr, make test_nerdctl, make test_docker, make test_crictl).
  • If LLMs were used: I have read the llm policy.

@netlify
Copy link
Copy Markdown

netlify Bot commented Apr 23, 2026

Deploy Preview for urunc canceled.

Name Link
🔨 Latest commit f71e10a
🔍 Latest deploy log https://app.netlify.com/projects/urunc/deploys/69f879b3aa0f900008ed58ab

@cmainas cmainas force-pushed the fix_block_and_ref_rootfs branch 8 times, most recently from 394b124 to da92017 Compare April 24, 2026 09:58
@cmainas cmainas marked this pull request as ready for review April 24, 2026 11:34
@cmainas cmainas requested a review from ananos April 24, 2026 11:34
@ananos
Copy link
Copy Markdown
Contributor

ananos commented May 1, 2026

thanks @cmainas.

The MS_SHARED fix in InitialSetup is correct — verified that the block rootfs unmount now propagates and #562 is resolved.

However, the findNS change is doing the opposite of what the description says. Quick recap of the description:

"when we had to create a mount namespace we were not pivoting but chroot".

I built both branches (main & PR) with a probe at the call site and ran rumprun-hvt via ctr run:

  main:    findNS(mount)=("", err="the namespace does not exist")             
           withPivot=true  -> PIVOT_ROOT                            
  this PR: findNS(mount)=("", err=<nil>)                                      
           withPivot=false -> CHROOT

Same OCI spec in both runs:

namespaces=[{Type:pid} {Type:ipc} {Type:uts} {Type:mount} {Type:network}]

The standard containerd shape with empty Path on the mount ns.

main was already pivoting in that case; this PR changes it to chroot.

Cause: findNS no longer errors on empty Path, but the call site in Exec is unmodified:

    _, err = findNS(u.Spec.Linux.Namespaces, specs.MountNamespace)  
    withPivot := err != nil   // was a side-effect of the old findNS contract

The comment above this block was already inverted from the code on main, so
the original intent isn't fully clear. I think you want:

    nsPath, err := findNS(u.Spec.Linux.Namespaces, specs.MountNamespace)
    // own the ns (in spec, no path) -> pivot; joining an existing ns -> chroot                                                                               
    withPivot := err == nil && nsPath == ""                                   

Would it make sense to split this PR? I'd merge the MS_SHARED + getMountInfo + refactor
parts as-is, and address the findNS / pivot-vs-chroot question in a separate
PR with a test, since it's an isolation-level change rather than a bug fix. Maybe I'm missing something in the process, so feel free to ignore the findNS analysis. Regardless, let's try to expedite the merge!

@cmainas
Copy link
Copy Markdown
Contributor Author

cmainas commented May 4, 2026

Hello @ananos ,

thank you for the review. That was a good catch.

However, the findNS change is doing the opposite of what the description says. Quick recap of the description:

"when we had to create a mount namespace we were not pivoting but chroot".

I built both branches (main & PR) with a probe at the call site and ran rumprun-hvt via ctr run:

  main:    findNS(mount)=("", err="the namespace does not exist")             
           withPivot=true  -> PIVOT_ROOT                            
  this PR: findNS(mount)=("", err=<nil>)                                      
           withPivot=false -> CHROOT

Same OCI spec in both runs:

namespaces=[{Type:pid} {Type:ipc} {Type:uts} {Type:mount} {Type:network}]

The standard containerd shape with empty Path on the mount ns.

main was already pivoting in that case; this PR changes it to chroot.

Cause: findNS no longer errors on empty Path, but the call site in Exec is unmodified:

You are correct. My description was wrong. The problem is not when we create a new mount namespace, but when we join a new mount namespace. In that case, findNS returns a nil error and due tto ``withPivot := err != nil , withPivot` becomes false and hence we chroot, instead of pivot

I think you want:

    nsPath, err := findNS(u.Spec.Linux.Namespaces, specs.MountNamespace)
    // own the ns (in spec, no path) -> pivot; joining an existing ns -> chroot                                                                               
    withPivot := err == nil && nsPath == ""                                   

Yes, that is correct, I did not rebase correctly and over time I lost this change.

Would it make sense to split this PR? I'd merge the MS_SHARED + getMountInfo + refactor parts as-is, and address the findNS / pivot-vs-chroot question in a separate PR with a test, since it's an isolation-level change rather than a bug fix. Maybe I'm missing something in the process, so feel free to ignore the findNS analysis. Regardless, let's try to expedite the merge!

Sure, I can drop the findNS change and make a new PR for it with an issue.

@cmainas cmainas force-pushed the fix_block_and_ref_rootfs branch from da92017 to 8ecc61b Compare May 4, 2026 07:56
Copy link
Copy Markdown
Contributor

@ananos ananos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @cmainas!

cmainas added 9 commits May 4, 2026 10:49
To determine whether the container rootfs or a container mount is
block-based, urunc retrieves mount information from
/proc/self/mountinfo. The corresponding function (`getMountInfo`)
currently returns only the mount source, mount point, and filesystem
type. Based on this information, urunc checks whether the guest
supports the filesystem type, and if so, attaches the mount source
as a block device to the sandbox.

However, there are cases where multiple mount entries appear to share
the same source. An exmaple is bind mounts where the mount source of a
bind mount appears to be the underlying device of the source path.

To handle this, this commit extends `getMountInfo` with an additional
check. It collects the mount sources of all non-special filesystems into
a map. The source of the mount corresponding to the target path is
intentionally excluded from this map. If at the end of the searching the
source of the target mount does not exist in the map, it is considered
safe to attach as a block device. If it does exist, it is not safe to
use, since the same source is mounted elsewhere or it is not a
block-based mount.

PR: #572
Signed-off-by: Charalampos Mainas <cmainas@nubificus.co.uk>
Reviewed-by: Anastassios Nanos <ananos@nubificus.co.uk>
Approved-by: Anastassios Nanos <ananos@nubificus.co.uk>
As a first step for another refactor of Exec move the chooseRootfs
function to be a mthod of unikontainers. This allows us to call
chooseRootfs from other parts of the code so we can retrieve the rootfs
that will be used. Moreover, it will help us with the refactoring of the
Exec monolith and separate the various parts of the Exec.

PR: #572
Signed-off-by: Charalampos Mainas <cmainas@nubificus.co.uk>
Reviewed-by: Anastassios Nanos <ananos@nubificus.co.uk>
Approved-by: Anastassios Nanos <ananos@nubificus.co.uk>
Some actions in the preparation of the block based rootfs for the
sandbox must take place before the preparation of the rootfs for the
execution environment of the monitor. In particular, the unmount of the
block-based container rootfs in the host must take place before the
prepareRoot and prepareMonRootfs functions. This is the reason of the
bug we had where the unmount of the block-based contianer's rootfs in
the host was not propagating to other mount namespaces.

Given the opportunity, define a set of functions related to the handling
of the sandbox's rootfs preparation. This commit only focuses on
block-based ones.

PR: #572
Signed-off-by: Charalampos Mainas <cmainas@nubificus.co.uk>
Reviewed-by: Anastassios Nanos <ananos@nubificus.co.uk>
Approved-by: Anastassios Nanos <ananos@nubificus.co.uk>
Following the definition of specific functions for the prearation of the
block-based rootfs for the sandbox, implement the same functions for
initrd based rootfs.

PR: #572
Signed-off-by: Charalampos Mainas <cmainas@nubificus.co.uk>
Reviewed-by: Anastassios Nanos <ananos@nubificus.co.uk>
Approved-by: Anastassios Nanos <ananos@nubificus.co.uk>
Following the definition of specific functions for the prearation of the
block-based rootfs for the sandbox, implement the same functions for
shared-fs based rootfs.

PR: #572
Signed-off-by: Charalampos Mainas <cmainas@nubificus.co.uk>
Reviewed-by: Anastassios Nanos <ananos@nubificus.co.uk>
Approved-by: Anastassios Nanos <ananos@nubificus.co.uk>
Following the definition of specific functions for the preparation of the
block-based rootfs for the sandbox, implement the same functions for
the cases where the sandbox does not have a rootfs.

PR: #572
Signed-off-by: Charalampos Mainas <cmainas@nubificus.co.uk>
Reviewed-by: Anastassios Nanos <ananos@nubificus.co.uk>
Approved-by: Anastassios Nanos <ananos@nubificus.co.uk>
…otfs

Define a generic interface for the preparation of the sandbox's rootfs
and make all rootfs types to follow this interface.

PR: #572
Signed-off-by: Charalampos Mainas <cmainas@nubificus.co.uk>
Reviewed-by: Anastassios Nanos <ananos@nubificus.co.uk>
Approved-by: Anastassios Nanos <ananos@nubificus.co.uk>
Move the tmpfs creation in the rootfs for the execution environment of
the monitor process to the postSetup step of the rootfs handling
for all rootfs types.

PR: #572
Signed-off-by: Charalampos Mainas <cmainas@nubificus.co.uk>
Reviewed-by: Anastassios Nanos <ananos@nubificus.co.uk>
Approved-by: Anastassios Nanos <ananos@nubificus.co.uk>
Ensure that the container's rootfs has the MS_SHARED propagation flag.
THis is necessary in order to propagate any unmounts that might happen
later (e.g. in the case of block-based snapshots which are attached to
the sandbox) in the mount peer groups (i.e. in the initial
mount namespace).

THere is no problem to do that for every rootfs, because reexec will
later cut off the propagation during the preparation of the monitor;s
rootfs.

In the future though, we need to move this up in the shim.

PR: #572
Signed-off-by: Charalampos Mainas <cmainas@nubificus.co.uk>
Reviewed-by: Anastassios Nanos <ananos@nubificus.co.uk>
Approved-by: Anastassios Nanos <ananos@nubificus.co.uk>
@github-actions github-actions Bot force-pushed the fix_block_and_ref_rootfs branch from 8ecc61b to f71e10a Compare May 4, 2026 10:49
@urunc-bot urunc-bot Bot merged commit 8a917b1 into main May 4, 2026
4 of 6 checks passed
urunc-bot Bot pushed a commit that referenced this pull request May 4, 2026
To determine whether the container rootfs or a container mount is
block-based, urunc retrieves mount information from
/proc/self/mountinfo. The corresponding function (`getMountInfo`)
currently returns only the mount source, mount point, and filesystem
type. Based on this information, urunc checks whether the guest
supports the filesystem type, and if so, attaches the mount source
as a block device to the sandbox.

However, there are cases where multiple mount entries appear to share
the same source. An exmaple is bind mounts where the mount source of a
bind mount appears to be the underlying device of the source path.

To handle this, this commit extends `getMountInfo` with an additional
check. It collects the mount sources of all non-special filesystems into
a map. The source of the mount corresponding to the target path is
intentionally excluded from this map. If at the end of the searching the
source of the target mount does not exist in the map, it is considered
safe to attach as a block device. If it does exist, it is not safe to
use, since the same source is mounted elsewhere or it is not a
block-based mount.

PR: #572
Signed-off-by: Charalampos Mainas <cmainas@nubificus.co.uk>
Reviewed-by: Anastassios Nanos <ananos@nubificus.co.uk>
Approved-by: Anastassios Nanos <ananos@nubificus.co.uk>
urunc-bot Bot pushed a commit that referenced this pull request May 4, 2026
As a first step for another refactor of Exec move the chooseRootfs
function to be a mthod of unikontainers. This allows us to call
chooseRootfs from other parts of the code so we can retrieve the rootfs
that will be used. Moreover, it will help us with the refactoring of the
Exec monolith and separate the various parts of the Exec.

PR: #572
Signed-off-by: Charalampos Mainas <cmainas@nubificus.co.uk>
Reviewed-by: Anastassios Nanos <ananos@nubificus.co.uk>
Approved-by: Anastassios Nanos <ananos@nubificus.co.uk>
urunc-bot Bot pushed a commit that referenced this pull request May 4, 2026
Some actions in the preparation of the block based rootfs for the
sandbox must take place before the preparation of the rootfs for the
execution environment of the monitor. In particular, the unmount of the
block-based container rootfs in the host must take place before the
prepareRoot and prepareMonRootfs functions. This is the reason of the
bug we had where the unmount of the block-based contianer's rootfs in
the host was not propagating to other mount namespaces.

Given the opportunity, define a set of functions related to the handling
of the sandbox's rootfs preparation. This commit only focuses on
block-based ones.

PR: #572
Signed-off-by: Charalampos Mainas <cmainas@nubificus.co.uk>
Reviewed-by: Anastassios Nanos <ananos@nubificus.co.uk>
Approved-by: Anastassios Nanos <ananos@nubificus.co.uk>
urunc-bot Bot pushed a commit that referenced this pull request May 4, 2026
Following the definition of specific functions for the prearation of the
block-based rootfs for the sandbox, implement the same functions for
initrd based rootfs.

PR: #572
Signed-off-by: Charalampos Mainas <cmainas@nubificus.co.uk>
Reviewed-by: Anastassios Nanos <ananos@nubificus.co.uk>
Approved-by: Anastassios Nanos <ananos@nubificus.co.uk>
urunc-bot Bot pushed a commit that referenced this pull request May 4, 2026
Following the definition of specific functions for the prearation of the
block-based rootfs for the sandbox, implement the same functions for
shared-fs based rootfs.

PR: #572
Signed-off-by: Charalampos Mainas <cmainas@nubificus.co.uk>
Reviewed-by: Anastassios Nanos <ananos@nubificus.co.uk>
Approved-by: Anastassios Nanos <ananos@nubificus.co.uk>
urunc-bot Bot pushed a commit that referenced this pull request May 4, 2026
Following the definition of specific functions for the preparation of the
block-based rootfs for the sandbox, implement the same functions for
the cases where the sandbox does not have a rootfs.

PR: #572
Signed-off-by: Charalampos Mainas <cmainas@nubificus.co.uk>
Reviewed-by: Anastassios Nanos <ananos@nubificus.co.uk>
Approved-by: Anastassios Nanos <ananos@nubificus.co.uk>
urunc-bot Bot pushed a commit that referenced this pull request May 4, 2026
…otfs

Define a generic interface for the preparation of the sandbox's rootfs
and make all rootfs types to follow this interface.

PR: #572
Signed-off-by: Charalampos Mainas <cmainas@nubificus.co.uk>
Reviewed-by: Anastassios Nanos <ananos@nubificus.co.uk>
Approved-by: Anastassios Nanos <ananos@nubificus.co.uk>
urunc-bot Bot pushed a commit that referenced this pull request May 4, 2026
Move the tmpfs creation in the rootfs for the execution environment of
the monitor process to the postSetup step of the rootfs handling
for all rootfs types.

PR: #572
Signed-off-by: Charalampos Mainas <cmainas@nubificus.co.uk>
Reviewed-by: Anastassios Nanos <ananos@nubificus.co.uk>
Approved-by: Anastassios Nanos <ananos@nubificus.co.uk>
@ananos ananos deleted the fix_block_and_ref_rootfs branch May 4, 2026 10:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Block based rootfs is not unmounted from host before attaching to sandbox

2 participants