Skip to content

Bug fixes: ungraceful test crash fixes #637

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 39 commits into
base: master
Choose a base branch
from

Conversation

miningexperiments
Copy link
Contributor

Bug fix

  • When running cargo tests the testing is crashed ungracefully leaving temp databases and such in tens of gigabytes, if they are run a few times, and still crashing.
  • This mainly due errors not propagating to the test functions, but panicking early.
  • tokio was underflowing UtxoChangeSubscription,

What I've done

  • Propagated errors from temp dir creation, create_temp_db macro, unwrapper errors in test functions.
  • New test functions with consensus.init() were not shutdown, and they were Arc'd so could possibly stay haunting after test is over?
  • connection_handler.rs serve function was forcefully panicked, I changed this to error
  • added math check to UtxoChangeSubscription

Additional thoughts

  • protocol/p2p/src/core/connection_handler.rs This kept going on for 15 tries, and then shut. Perhaps another check should be added so it can close earlier in the test.
    • It is common for servers and e.g. VPNs to not use ipv6, so graceful error propagation is needed in testing (not panic)
  • cargo test without --release is still complaining in current master branch (before any pr changes). This is not tested in ci.yaml.
error: 2 targets failed:
    `-p kaspa-testing-integration --lib`
    `-p kaspa-wallet-core --lib`
    ```

Changed functions to use std::time::Instant which is monotonic, to avoid Rust panics with SystemTime. 

Replaced some unwraps with an expect.

Removed redundant brackets, and secp256k1::
Changed fn clean_old_pending_outpoints to retain keys that are younger than an hour, instead of collecting older than an hour ones as a vector, and then using a new for loop to deleting them.

linting with cargo fmt
# Optimization
flow.rs ; async sync_missing_relay_past_headers

# Fix
- Save memory; changed jobs to pass an iterator to try_join_all instead of collecting a vector and then passing it.

- Changed error check to return an explicit error, instead of implicit
create_temp_db now returns a Result
create_test_db returns a Result
fixes cargo test crashes (ungraceful termination)
create_temp_db now returns Result
to fix test crashing ungracefully
create_temp_db returns result to prevent test crashing ungracefully, so we need to add error handling
tokio-runtime-worker subtracted below zero and caused overflow during cargo test, and ungraceful termination
serve_result causes testing to crash ungracefully using panic! when error recieved. Now error is propagated
create temp db returns Result
create_temp_db returns Result, add error handling
…eads to hitting File Descriptor Limit and failing tests due to panic in get_kaspa_tempdir func in database/src/utils.rs
… reaches System File Descriptor limit and crashes testing ungracefully leaving temp files in system
@miningexperiments
Copy link
Contributor Author

this is fixed in PR: test daemon_integration_tests::daemon_utxos_propagation_test ... ok

The other test failing is the

 thread 'tx::generator::test::test_generator_inputs_250k_outputs_2_sweep' has overflowed its stack
fatal runtime error: stack overflow

I think it might be just that the vector with 250k elements is in stack, so it overflows. You could allocate it in heap with fixed size, but then you are fixing the test, and not sure how it reflects to testing the making of utxo_entries in wallet/core/src/tx/generator/generator.rs @michaelsutton @coderofstuff @someone235

Copy link
Collaborator

@coderofstuff coderofstuff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial review. I'll probably on be able to get back to this after the next few weeks but please take a look.

self
);
// Prevent underflow; [ERROR] thread 'tokio-runtime-worker' panicked at notify/src/subscription/single.rs:388:13: attempt to subtract with overflow
let _ = UTXOS_CHANGED_SUBSCRIPTIONS.fetch_update(Ordering::SeqCst, Ordering::SeqCst, |count| {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we just did:

UTXOS_CHANGED_SUBSCRIPTIONS.fetch_sub(1, Ordering::SeqCst).saturating_sub(1)

Also, I wonder what the root cause of UTXOS_CHANGED_SUBSCRIPTIONS being 0 is. The above fixes the underflow, but the incorrect usage that allowed the situation still needs to be determined. the expectation here is that UTXOS_CHANGED_SUBSCRIPTIONS should always be above 0 when drop occurs OR document the situations when it can be 0.

Copy link
Contributor Author

@miningexperiments miningexperiments Feb 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

afaik saturating_sub only applies to old value from fetch_sub not the atomic itself, so it doesnt prevent actual atomic variable from becoming negative. Suggested fetch_update solution only applies if the value is greater than zero

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image
btw this happens frequently in testing

@@ -113,6 +113,9 @@ jobs:
target/
key: ${{ runner.os }}-cargo-${{ hashFiles('**/Cargo.lock') }}

- name: Run cargo tests non--release
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We moved away from test and switched to using nextest. But I do wonder why nextest doesn't capture the errors that test is displaying here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cargo test still shows tests failing that nextest does not, so what is a recommended solution?

match serve_result {
Ok(_) => info!("P2P Server stopped: {}", serve_address),
Err(err) => panic!("P2P, Server {serve_address} stopped with error: {err:?}"),
Err(err) => log::error!("P2P, Server {serve_address} stopped with error: {err:?}"),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add error in the use at line 8 and just use that here.

@@ -1820,7 +1821,7 @@ async fn run_kip10_activation_test() {
let mut genesis_multiset = MuHash::new();
consensus.append_imported_pruning_point_utxos(&initial_utxo_collection, &mut genesis_multiset);
consensus.import_pruning_point_utxo_set(config.genesis.hash, genesis_multiset).unwrap();
consensus.init();
let wait_handles = consensus.init();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add comments on why this is necessary

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was done in previous tests as well.
Here

  pub fn init(&self) -> Vec<JoinHandle<()>> {
        self.consensus.run_processors()
    }

and

    pub fn shutdown(&mut self) {
        self.core.shutdown();
        self.join();
    }

where core is wrapped in Arc, so its reference counted. Without it some background threads might stay running, thought it might be related to getting System File Descriptor problems and not getting /tmp/rusty-kaspa cleanups early.

@@ -1872,6 +1873,8 @@ async fn run_kip10_activation_test() {
let status = consensus.add_utxo_valid_block_with_parents((index + 1).into(), vec![index.into()], vec![spending_tx.clone()]).await;
assert!(matches!(status, Ok(BlockStatus::StatusUTXOValid)));
assert!(consensus.lkg_virtual_state.load().accepted_tx_ids.contains(&tx_id));
consensus.shutdown(wait_handles);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here and the rest of similar changes: Add comments on why this is necessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants