Skip to content

Conversation

tarcieri
Copy link
Member

No description provided.

@tarcieri tarcieri requested a review from bifurcation February 13, 2025 00:07
@tarcieri
Copy link
Member Author

Well that's odd...

https://github.com/RustCrypto/signatures/actions/runs/13297403714/job/37132349904

     Running tests/pkcs8.rs (target/debug/deps/pkcs8-f6bcf597688a96a2)

running 1 test

thread 'private_key_serialization' has overflowed its stack
fatal runtime error: stack overflow

@tarcieri
Copy link
Member Author

The failure seems to be happening inside MlDsa87::key_gen_internal cc @bifurcation

It only seems to happen in non-release builds, which makes me curious if it might legitimately be a case of stack overflow

Comment on lines +49 to +50
[profile.dev]
opt-level = 2
Copy link
Member Author

@tarcieri tarcieri Feb 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what caused the test failure: without it the stack seems to overflow when testing dev builds.

Might suggest a case for #[inline] or #[inline(always)]

@tarcieri
Copy link
Member Author

Hmm, the initial stack overflow was locally reproducible, but now I can't reproduce the issue at all and I have no idea why this simple update would've caused it in the first place

@bifurcation
Copy link
Contributor

First of all, thanks for doing this upgrade. I had a problem with another project having to translate between versions of rand because the Rust Crypto crates were behind.

Second, I have also had stack overflows with rand in another project. Unfortunately, I never fully debugged it before some refactoring made it moot. I think it was something to do with multiple calls to rand::rng() in the same stack?

In any case, I just pulled the branch and confirmed that cargo test does not stack overflow for me. Seems like it might have just been a cosmic ray or something :)

@tarcieri
Copy link
Member Author

@bifurcation the initial stack overflow (because I split the crate out of the workspace and didn't propagate the opt-level) was reproducible.

I can't reproduce the latest failures, however they seem to reproducibly break in CI.

@bifurcation
Copy link
Contributor

Sorry, I misunderstood. I can reproduce the overflow on macOS / Apple Silicon by commenting out that opt-level line. The stack trace I'm seeing in lldb doesn't look suspicious at first glance:

(lldb) bt
* thread #4, name = 'mldsa87_round_trip', stop reason = EXC_BAD_ACCESS (code=2, address=0x17021b900)
  * frame #0: 0x000000010002a1b4 proptests-563fb1afb6d10e8a`proptest::test_runner::runner::TestRunner::gen_and_run_case::h0ade747d3580794b(self=0x6ac310277db02f2f, strategy=0xaac72c3685d80168, f=0x102b3b7c2f92abbb, replay_from_fork=0x0000000000000000, result_cache=&mut dyn proptest::test_runner::result_cache::ResultCache @ 0x00000001702958d0, fork_output=0x0000000153809400, is_from_persisted_seed=false) at runner.rs:656
    frame #1: 0x000000010002f710 proptests-563fb1afb6d10e8a`proptest::test_runner::runner::TestRunner::run_in_process_with_replay::h428da222f118da8f(self=0x00000001703c71a8, strategy=0x00000001703f2d48, test={closure_env#1} @ 0x0000000170295a0b, replay_from_fork=IntoIter<core::result::Result<(), proptest::test_runner::errors::TestCaseError>, alloc::alloc::Global> @ 0x00000001703c6f30, fork_output=ForkOutput @ 0x0000000170295a0c) at runner.rs:604:13
    frame #2: 0x0000000100029ff0 proptests-563fb1afb6d10e8a`proptest::test_runner::runner::TestRunner::run_in_process::hbf2c17047ed06a16(self=0x00000001703c71a8, strategy=0x00000001703f2d48, test={closure_env#1} @ 0x00000001703c6f8f) at runner.rs:574:9
    frame #3: 0x0000000100030260 proptests-563fb1afb6d10e8a`proptest::test_runner::runner::TestRunner::run::h05bf44c50897ed89(self=0x00000001703c71a8, strategy=0x00000001703f2d48, test={closure_env#1} @ 0x00000001703c6fef) at runner.rs:417:13
    frame #4: 0x0000000100025a30 proptests-563fb1afb6d10e8a`proptests::mldsa87_round_trip::h1a426a6e8b484058 at sugar.rs:163:17
    frame #5: 0x0000000100084694 proptests-563fb1afb6d10e8a`proptests::mldsa87_round_trip::_$u7b$$u7b$closure$u7d$$u7d$::h57707c1ac8e0259b((null)=0x000000017041e7ae) at sugar.rs:159:28
    frame #6: 0x000000010003ece8 proptests-563fb1afb6d10e8a`core::ops::function::FnOnce::call_once::hf4ea7a6e4734f0e6((null)={closure_env#0} @ 0x000000017041e7ae, (null)=<unavailable>) at function.rs:250:5
    frame #7: 0x00000001000b7c88 proptests-563fb1afb6d10e8a`test::__rust_begin_short_backtrace::h25df369aba12f4e4 [inlined] core::ops::function::FnOnce::call_once::h1bf0e66f927215c5 at function.rs:250:5 [opt]
    frame #8: 0x00000001000b7c80 proptests-563fb1afb6d10e8a`test::__rust_begin_short_backtrace::h25df369aba12f4e4 at lib.rs:632:18 [opt]
    frame #9: 0x00000001000b7714 proptests-563fb1afb6d10e8a`test::run_test::_$u7b$$u7b$closure$u7d$$u7d$::hd06e476d9dc12405 [inlined] test::run_test_in_process::_$u7b$$u7b$closure$u7d$$u7d$::haf604da89e7ba9eb at lib.rs:655:60 [opt]
    frame #10: 0x00000001000b7708 proptests-563fb1afb6d10e8a`test::run_test::_$u7b$$u7b$closure$u7d$$u7d$::hd06e476d9dc12405 [inlined] _$LT$core..panic..unwind_safe..AssertUnwindSafe$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$$LP$$RP$$GT$$GT$::call_once::hc34e3cd6fe896fb2 at unwind_safe.rs:272:9 [opt]
    frame #11: 0x00000001000b7708 proptests-563fb1afb6d10e8a`test::run_test::_$u7b$$u7b$closure$u7d$$u7d$::hd06e476d9dc12405 [inlined] std::panicking::try::do_call::h3a64a87b3158e4f8 at panicking.rs:557:40 [opt]
    frame #12: 0x00000001000b7704 proptests-563fb1afb6d10e8a`test::run_test::_$u7b$$u7b$closure$u7d$$u7d$::hd06e476d9dc12405 [inlined] std::panicking::try::h067b23bb25f32cea at panicking.rs:520:19 [opt]
    frame #13: 0x00000001000b7704 proptests-563fb1afb6d10e8a`test::run_test::_$u7b$$u7b$closure$u7d$$u7d$::hd06e476d9dc12405 [inlined] std::panic::catch_unwind::hd304b1ccafab4de9 at panic.rs:358:14 [opt]
    frame #14: 0x00000001000b7704 proptests-563fb1afb6d10e8a`test::run_test::_$u7b$$u7b$closure$u7d$$u7d$::hd06e476d9dc12405 [inlined] test::run_test_in_process::hb219402bbada6669 at lib.rs:655:27 [opt]
    frame #15: 0x00000001000b7648 proptests-563fb1afb6d10e8a`test::run_test::_$u7b$$u7b$closure$u7d$$u7d$::hd06e476d9dc12405 at lib.rs:576:43 [opt]
    frame #16: 0x00000001000860c0 proptests-563fb1afb6d10e8a`std::sys::backtrace::__rust_begin_short_backtrace::hdc9db1cebb5cc810 [inlined] test::run_test::_$u7b$$u7b$closure$u7d$$u7d$::h9d19938a5543f04d at lib.rs:606:41 [opt]
    frame #17: 0x0000000100086058 proptests-563fb1afb6d10e8a`std::sys::backtrace::__rust_begin_short_backtrace::hdc9db1cebb5cc810 at backtrace.rs:154:18 [opt]
    frame #18: 0x00000001000892a8 proptests-563fb1afb6d10e8a`core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h7a2e04e2071e73db [inlined] std::thread::Builder::spawn_unchecked_::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::h814fae876ada20ce at mod.rs:561:17 [opt]
    frame #19: 0x0000000100089298 proptests-563fb1afb6d10e8a`core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h7a2e04e2071e73db [inlined] _$LT$core..panic..unwind_safe..AssertUnwindSafe$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$$LP$$RP$$GT$$GT$::call_once::h5428d38fb0afc9c6 at unwind_safe.rs:272:9 [opt]
    frame #20: 0x0000000100089298 proptests-563fb1afb6d10e8a`core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h7a2e04e2071e73db [inlined] std::panicking::try::do_call::h3d965f4085d59f35 at panicking.rs:557:40 [opt]
    frame #21: 0x0000000100089294 proptests-563fb1afb6d10e8a`core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h7a2e04e2071e73db [inlined] std::panicking::try::h694cb6fae536645d at panicking.rs:520:19 [opt]
    frame #22: 0x0000000100089294 proptests-563fb1afb6d10e8a`core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h7a2e04e2071e73db [inlined] std::panic::catch_unwind::hc6d1afbf5c741097 at panic.rs:358:14 [opt]
    frame #23: 0x0000000100089294 proptests-563fb1afb6d10e8a`core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h7a2e04e2071e73db [inlined] std::thread::Builder::spawn_unchecked_::_$u7b$$u7b$closure$u7d$$u7d$::h6051d10fa7d84cab at mod.rs:559:30 [opt]
    frame #24: 0x0000000100089234 proptests-563fb1afb6d10e8a`core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h7a2e04e2071e73db at function.rs:250:5 [opt]
    frame #25: 0x0000000100158784 proptests-563fb1afb6d10e8a`std::sys::pal::unix::thread::Thread::new::thread_start::ha1530855ce6ee203 [inlined] _$LT$alloc..boxed..Box$LT$F$C$A$GT$$u20$as$u20$core..ops..function..FnOnce$LT$Args$GT$$GT$::call_once::h6ec079b34637540d at boxed.rs:1972:9 [opt]
    frame #26: 0x0000000100158778 proptests-563fb1afb6d10e8a`std::sys::pal::unix::thread::Thread::new::thread_start::ha1530855ce6ee203 [inlined] _$LT$alloc..boxed..Box$LT$F$C$A$GT$$u20$as$u20$core..ops..function..FnOnce$LT$Args$GT$$GT$::call_once::h19f78070ef0330b0 at boxed.rs:1972:9 [opt]
    frame #27: 0x0000000100158774 proptests-563fb1afb6d10e8a`std::sys::pal::unix::thread::Thread::new::thread_start::ha1530855ce6ee203 at thread.rs:105:17 [opt]
    frame #28: 0x000000019d1d82e4 libsystem_pthread.dylib`_pthread_start + 136

It's not ridiculously deep, for example. Which I suppose means that one of these stack frames is huge for some reason. But they're all within proptest or the runtime, so I'm not sure what there really is to do about it anyway. I'll see if changing the body of the test changes anything.

One thing I noticed that might be relevant: It looks like proptest is using version 0.8 of rand.

@bifurcation
Copy link
Contributor

Notes trying to dissect this:

The stack overflow happens even with the body of round_trip_test commented out, so the overflow is in the _keypair() generators.

mldsa65_round_trip and mldsa87_round_trip both overflow without any code in round_trip_test. mldsa44_round_trip works fine with the whole of round_trip_test commented out, but overflows if you uncomment the first line. Seems like a similar problem.

Adding a line #![proptest_config(ProptestConfig::with_cases(1))] doesn't help

The following prop_compose avoids overflow, which seems to indicate that it's not the computation of key pairs that's the problem.

    fn mldsa87_keypair()(seed_bytes in any::<[u8; 32]>()) -> Vec<u8> {
        let _unused = MlDsa87::key_gen_internal(seed_bytes.as_ref());
        seed_bytes.len()
    }

The following prop_compose does cause an overflow, where 178432 is core::mem::size_of::<KeyPair<MlDsa87>>(). This seems to indicate that proptest is storing enough of these values on the stack to overflow.

    fn mldsa87_keypair()(seed_bytes in any::<[u8; 32]>()) -> [u8; 178432] {
        [0; 178432]
    }

Looking at the source code for gen_and_run_case, which is where the overflow happens, the most likely culprit seems to be the call to new_tree, which seems like it at least sets things up to store a bunch of outputs, though the actual invocation appears to be deferred.

It is possible to allocate more stack to tests with the RUST_MIN_STACK environment variable. I found that increasing the stack to 6MB (from the default of 2MB) allowed the tests to run on my machine.

RUST_MIN_STACK=6291456 cargo test

@bifurcation
Copy link
Contributor

Net of all that:

  • It's still not clear why changing the version of rand led to this.
  • It makes sense why opt-level=2 would fix it on one platform and not another.
  • It might be most expeditious just to update the CI script to use a bigger stack.

@tarcieri
Copy link
Member Author

Increasing the stack size "fixed" the issue. I guess we'll go with that for now.

@tarcieri tarcieri merged commit bd3d94c into master Feb 13, 2025
47 checks passed
@tarcieri tarcieri deleted the ml-dsa/rand-v0.9 branch February 13, 2025 19:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants