Skip to content

Conversation

@duongcongtoai
Copy link

@duongcongtoai duongcongtoai commented Oct 19, 2025

Which issue does this PR close?

Rationale for this change

Implement custom interleaving logic for nested data types that have dictionary. Current main branch uses interleave_fallback which will concat the dictionaries even if they are logically equal (but pointing to different address). There are 2 approaches to work around this:

  • make MutableArrayData handle dictionary overflow error by best effort merging the dictionary, which introduce overhead, but since this falls into a rare case (where previous implementation would have panic anyway), it shouldn't affect performance of other flow
  • implement custom interleaving logic for these data types (The approach chosen by this PR)

The first solution may provides wider support for any nested data types which contains dictionary (i.e list<struct<list>> )

What changes are included in this PR?

Nested data types are interleaved using function interleave_fallback, which create a MutableArray, and it catch if there exists DictionaryKeyOverFlowError happen, then try merging the children dictionary arrays recursively.

Are these changes tested?

  • test in interleave.rs with complex nested data types containing dictionary

Since this logic only touches the error path, it should not affect performance of happy path

@github-actions github-actions bot added the arrow Changes to the arrow crate label Oct 19, 2025
@duongcongtoai duongcongtoai changed the title fix: try merging list of dict if possible fix: try merging on interleaving list of dict Oct 19, 2025
@duongcongtoai duongcongtoai force-pushed the fix-overflow-on-interleave-list-of-dict branch from 5c3ee8c to d18f6f0 Compare October 25, 2025 13:46
@duongcongtoai duongcongtoai marked this pull request as ready for review October 25, 2025 14:49
@valkum
Copy link

valkum commented Oct 27, 2025

Thanks for this! I am wondering if it makes sense to try to fix this for all cases as well.
As you worked on this and wrote the other option would be "make interleave_fallback returns error in case dictionary overflow, and then try merging the dictionary and retry."
Do you have a rough estimate of the work needed and the potential impact of the other solution?

@duongcongtoai
Copy link
Author

I'll try adding the general purpose solution later, may takes 2 days

@duongcongtoai
Copy link
Author

Looks like the implementation of dictionary merge (Interner) does not guarantee unique values post-merge, meaning if we merge two value arrays with 256 items each and with key type UInt8, keyoverflow error can still happen, we need a different merge logic for this

@duongcongtoai
Copy link
Author

I decided to add general purpose dictionary overflow handling and remove case specific interleave logic (because it does not add any performance improvement and complicates the codebase)

@duongcongtoai
Copy link
Author

New finding: if we add merging logic just inside interleave_fallback it won't protect other places that use MutableArrayData from panic.

thread 'tokio-runtime-worker' (4045685) panicked at /home/toai/proj/rust/arrow-rs/arrow-data/src/transform/mod.rs:730:14:
MutableArrayData::with_capacities is infallible: DictionaryKeyOverflowError
stack backtrace:
   0:     0x7cb3581295b2 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::hcf59a46293aa5eb1
   1:     0x7cb35543bc0f - core::fmt::write::h8dc485dd906e4442
   2:     0x7cb35812906f - std::io::Write::write_fmt::hba8247aec2c11ae5
   3:     0x7cb358129413 - std::sys::backtrace::BacktraceLock::print::h854838da61675cc9
   4:     0x7cb358128a6f - std::panicking::panic_with_hook::had394b1c5c651ceb
   5:     0x7cb3581523a8 - std::panicking::panic_handler::{{closure}}::h28bb7215e5f9ceb3
   6:     0x7cb358152309 - std::sys::backtrace::__rust_end_short_backtrace::h08b455234703e3cf
   7:     0x7cb3581522fc - __rustc[46daf8625d84f73a]::rust_begin_unwind
   8:     0x7cb355436fff - core::panicking::panic_fmt::hae136728593523c5
   9:     0x7cb355443b35 - core::result::unwrap_failed::h70a521a76fd4e9a4
  10:     0x7cb354f2e30f - arrow_data::transform::MutableArrayData::with_capacities::hf363caf9918199ea
  11:     0x7cb35677b9cf - <datafusion_functions_nested::extract::ArrayElement as datafusion_expr::udf::ScalarUDFImpl>::invoke_with_args::h07994f269aa3720c
  12:     0x7cb356de2a0f - <datafusion_physical_expr::scalar_function::ScalarFunctionExpr as datafusion_physical_expr_common::physical_expr::PhysicalExpr>::evaluate::h2d325768d5b0df5c
  13:     0x7cb356de5c52 - <datafusion_physical_expr::expressions::is_not_null::IsNotNullExpr as datafusion_physical_expr_common::physical_expr::PhysicalExpr>::evaluate::hdbe4f93b738522c1

I recompiled with the recent commit (without datatype case specific dictionary merge), and the error still happen somewhere else. I think MutableArrayData must own the dictionary merge logic itself.

@duongcongtoai duongcongtoai changed the title fix: try merging on interleaving list of dict fix: try merging dictionary as a fallback on overflow error Nov 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

DictionaryKeyOverflowError on interleave with nested type containing dictionary

2 participants