
Improve core::intrinsics::black_box output. #99899

Open
@thomcc


On Discord, the user kangalioo (unsure of their GitHub name) shared a custom version of the black_box (#64102) function that they use to improve the asm output of black_box and reduce the overhead of using it. It does this by passing small values in registers instead of by pointer.

// Warning, not sound. Do not use.
pub fn black_box<T>(x: T) -> T {
    use std::mem::{transmute_copy as t, forget as f};
    use std::arch::asm;
    unsafe { match std::mem::size_of::<T>() {
        1 => { let mut y: u8 = t(&x); f(x); asm!("/*{y}*/", y = inout(reg_byte) y, options(nostack)); t(&y) }
        2 => { let mut y: u16 = t(&x); f(x); asm!("/*{y}*/", y = inout(reg) y, options(nostack)); t(&y) }
        4 => { let mut y: u32 = t(&x); f(x); asm!("/*{y}*/", y = inout(reg) y, options(nostack)); t(&y) }
        8 => { let mut y: u64 = t(&x); f(x); asm!("/*{y}*/", y = inout(reg) y, options(nostack)); t(&y) }
        16 => { let [mut y, mut z]: [u64; 2] = t(&x); f(x); asm!("/*{y}{z}*/", y = inout(reg) y, z = inout(reg) z, options(nostack)); t(&[y, z]) }
        _ => { x },
    } }
}
pub fn example() {
    black_box(black_box(2) + black_box(3));
    extern "C" { fn print(_: &str); }
    unsafe { print(black_box("hello world :)")); }
}

This produces the following output:

example::example:
    mov     eax, 2
    mov     ecx, 3
    add     ecx, eax
    lea     rdi, [rip + .L__unnamed_1]
    mov     esi, 14
    jmp     qword ptr [rip + print@GOTPCREL]
.L__unnamed_1:
    .ascii  "hello world :)"

In comparison, the current black_box spills its argument to the stack in basically all cases. The equivalent output with the current black_box is as follows (a Godbolt link for all of this is available here: https://godbolt.org/z/a7evcEP6x):

example::example:
    sub     rsp, 24
    mov     dword ptr [rsp + 8], 2
    lea     rax, [rsp + 8]
    mov     ecx, dword ptr [rsp + 8]
    mov     dword ptr [rsp + 8], 3
    add     ecx, dword ptr [rsp + 8]
    mov     dword ptr [rsp + 8], ecx
    lea     rcx, [rip + .L__unnamed_1]
    mov     qword ptr [rsp + 8], rcx
    mov     qword ptr [rsp + 16], 14
    mov     rdi, qword ptr [rsp + 8]
    mov     rsi, qword ptr [rsp + 16]
    call    qword ptr [rip + print@GOTPCREL]
    add     rsp, 24
    ret

.L__unnamed_1:
    .ascii  "hello world :)"

I believe this is basically because we lower the intrinsic by passing a pointer to the value into an inline asm block, which forces the value to be spilled to the stack.
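As a rough illustration of that kind of lowering, here is a minimal sketch written as library code. It is not the actual rustc codegen, and the name `black_box_by_pointer` is made up for this example. It assumes a target where `asm!` and the `reg` register class are available (e.g. x86_64 or aarch64).

```rust
use std::arch::asm;

// Hypothetical stand-in for a pointer-based barrier; not the real intrinsic.
pub fn black_box_by_pointer<T>(mut x: T) -> T {
    // Taking the address of `x` means it has to live in memory.
    let p = &mut x as *mut T as *mut u8;
    unsafe {
        // The asm body is only a comment. Because neither `nomem` nor `pure`
        // is specified, the compiler must assume the block may read or write
        // memory reachable through `p`, so `x` is stored before the block
        // and reloaded after it.
        asm!("/* {0} */", in(reg) p, options(nostack, preserves_flags));
    }
    x
}
```

Since only the address is handed to the asm block, the value itself cannot stay in a register across the barrier, which matches the stores and reloads through `[rsp + 8]` in the output above.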

I don't believe this can be fixed by libs changes, as we are just calling into an intrinsic and need to keep doing so to support all targets (and cases like miri). Additionally, the version posted on Discord has a soundness hole: it is UB if T contains padding bytes, and this can't be fixed at the moment because passing MaybeUninit via registers isn't currently possible.
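To make the padding problem concrete, here is a small sketch with a made-up type, `Padded`, that would hit the 4-byte arm of the function above even though only 3 of its bytes are initialized field data. The type and the helper `padding_bytes` are assumptions for illustration only.

```rust
use std::mem::size_of;

// Hypothetical example type. With `repr(C)`, `a` sits at offset 0, one
// padding byte follows, and `b` sits at offset 2.
#[repr(C)]
pub struct Padded {
    pub a: u8,
    pub b: u16,
}

// Number of bytes in `Padded` that belong to no field.
pub fn padding_bytes() -> usize {
    size_of::<Padded>() - (size_of::<u8>() + size_of::<u16>())
}
```

Because `size_of::<Padded>()` is 4, the size-based match would `transmute_copy` the value into a `u32`, which reads the uninitialized padding byte as part of an integer.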

However, because we just pass the argument to an intrinsic, it seems likely that the compiler can lower it more efficiently, which also seems like a less error-prone way of handling this than doing it in library code.

Improving this output seems beneficial, since the whole point of this intrinsic is to be as close to zero cost as possible while still providing an optimization barrier. I think the basic idea behind the black_box provided above is a reasonable starting point for what good output would look like, but it's obviously not a requirement that it be lowered in that manner.

    Labels

    A-codegen (Area: Code generation), A-intrinsics (Area: Intrinsics), C-enhancement (Category: An issue proposing an enhancement or a PR with one.), T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue.)
