Description
On Discord, the user kangalioo (unsure of their GitHub name) shared a custom version of the black_box (#64102) function that they're using to improve the asm output of black_box and reduce the overhead of its use. It does this by passing small values in registers instead of by pointer.
// Warning, not sound. Do not use.
pub fn black_box<T>(x: T) -> T {
    use std::mem::{transmute_copy as t, forget as f};
    use std::arch::asm;
    unsafe { match std::mem::size_of::<T>() {
        1 => { let mut y: u8 = t(&x); f(x); asm!("/*{y}*/", y = inout(reg_byte) y, options(nostack)); t(&y) }
        2 => { let mut y: u16 = t(&x); f(x); asm!("/*{y}*/", y = inout(reg) y, options(nostack)); t(&y) }
        4 => { let mut y: u32 = t(&x); f(x); asm!("/*{y}*/", y = inout(reg) y, options(nostack)); t(&y) }
        8 => { let mut y: u64 = t(&x); f(x); asm!("/*{y}*/", y = inout(reg) y, options(nostack)); t(&y) }
        16 => { let [mut y, mut z]: [u64; 2] = t(&x); f(x); asm!("/*{y}{z}*/", y = inout(reg) y, z = inout(reg) z, options(nostack)); t(&[y, z]) }
        _ => { x },
    } }
}
pub fn example() {
    black_box(black_box(2) + black_box(3));
    extern "C" { fn print(_: &str); }
    unsafe { print(black_box("hello world :)")); }
}
This produces the following output:
example::example:
mov eax, 2
mov ecx, 3
add ecx, eax
lea rdi, [rip + .L__unnamed_1]
mov esi, 14
jmp qword ptr [rip + print@GOTPCREL]
.L__unnamed_1:
.ascii "hello world :)"
In comparison, the current black_box spills its value to the stack in basically all cases. The equivalent output with the current black_box is as follows (a Godbolt link for all of this is available at https://godbolt.org/z/a7evcEP6x):
example::example:
sub rsp, 24
mov dword ptr [rsp + 8], 2
lea rax, [rsp + 8]
mov ecx, dword ptr [rsp + 8]
mov dword ptr [rsp + 8], 3
add ecx, dword ptr [rsp + 8]
mov dword ptr [rsp + 8], ecx
lea rcx, [rip + .L__unnamed_1]
mov qword ptr [rsp + 8], rcx
mov qword ptr [rsp + 16], 14
mov rdi, qword ptr [rsp + 8]
mov rsi, qword ptr [rsp + 16]
call qword ptr [rip + print@GOTPCREL]
add rsp, 24
ret
.L__unnamed_1:
.ascii "hello world :)"
I believe this is basically because we just lower the intrinsic as passing a pointer to the value into an inline asm block, which forces the spilling.
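To make the spilling concrete, here is a rough user-level approximation of that pointer-based lowering. This is only a sketch: the name black_box_via_pointer is made up for illustration, it is not the actual compiler code, and it assumes a target where asm! is supported (e.g. x86_64 or aarch64). Because the asm block receives the address of the value and is not marked nomem, the compiler has to assume it may read or write through that pointer, so the value must live in memory.

```rust
use std::arch::asm;

// Hypothetical helper, for illustration only: approximates lowering
// black_box as "pass a pointer to the value into an inline asm block".
pub fn black_box_via_pointer<T>(mut x: T) -> T {
    unsafe {
        // The template is only an assembler comment, so no instruction is
        // emitted. The asm block is not marked `nomem`, so the compiler must
        // assume it can access memory through the pointer, which forces `x`
        // onto the stack.
        asm!("/* {0} */", in(reg) &mut x as *mut T, options(nostack));
    }
    x
}
```

The value comes back unchanged, but every use has to go through a stack slot, which matches the shape of the listing above.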
I don't believe this can be fixed by libs changes, as we are just calling into an intrinsic and need to keep doing so to support all targets (and cases like Miri). Additionally, the version posted on Discord has a soundness hole: it is UB if T contains padding bytes, since those bytes get read as part of an integer (and this can't be fixed at the moment, as passing MaybeUninit via registers isn't currently possible).
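As a small illustration of the padding problem, consider a type like the following. The Padded struct and the padding_bytes helper are made up for this example, and the numbers assume u32 has 4-byte alignment (true on the common targets). The struct is 8 bytes, so it would hit the u64 arm of the Discord version, yet 3 of those bytes are padding and may be uninitialized, so copying the whole thing into a u64 is where the UB comes from.

```rust
use std::mem::size_of;

// Hypothetical example type: with repr(C), `tag` sits at offset 0 and
// `value` at offset 4, leaving 3 padding bytes in between.
#[repr(C)]
pub struct Padded {
    pub tag: u8,
    pub value: u32,
}

// Number of bytes in `Padded` that belong to no field.
pub fn padding_bytes() -> usize {
    size_of::<Padded>() - (size_of::<u8>() + size_of::<u32>())
}
```

This only measures the layout; it deliberately does not perform the transmute_copy itself.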
However, because we just pass the argument to an intrinsic, it seems likely that the compiler can lower it in a more optimal way, which seems to be a less error-prone way of handling this anyway.
Improving this output seems beneficial, since the whole point of this intrinsic is to be as close to zero cost as possible while still providing an optimization barrier. I think the basic idea behind the black_box provided above is a reasonable starting point for what good output would look like, but it's obviously not a requirement that it's lowered in that manner.
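For reference, the register-passing idea can be written soundly when it is restricted to a type with no padding. The following is a minimal sketch for u64 only, assuming a 64-bit target such as x86_64 or aarch64; the name black_box_in_register is hypothetical, and this is not a proposal for how the intrinsic should actually be implemented.

```rust
use std::arch::asm;

// Hypothetical helper, for illustration only: an optimization barrier for
// u64 that keeps the value in a register instead of spilling it.
pub fn black_box_in_register(mut x: u64) -> u64 {
    unsafe {
        // `inout(reg)` tells the compiler the asm may read and change `x`,
        // so it cannot constant-fold through this point. The template is
        // only an assembler comment, so no instruction is emitted.
        asm!("/* {0} */", inout(reg) x, options(nostack));
    }
    x
}
```

Since every bit pattern of a u64 is initialized and valid, this avoids the padding issue, but it obviously doesn't generalize to arbitrary T the way the intrinsic has to.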