Skip to content

Conversation

@kosiew
Copy link
Contributor

@kosiew kosiew commented Nov 19, 2025

Which issue does this PR close?

Rationale for this change

Self-referential INTERSECT and EXCEPT queries (where both sides originate from the same table) failed during Substrait round‑trip consumption with the error:

"Schema contains duplicate qualified field name"

This happened because the join-based implementation of set operations attempted to merge two identical schemas without requalification, resulting in duplicate or ambiguous field names. By ensuring both sides are requalified when needed, DataFusion can correctly construct valid logical plans for these operations.

Before

❯ cargo test --test sqllogictests -- --substrait-round-trip intersection.slt:33
    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.24s
     Running bin/sqllogictests.rs (target/debug/deps/sqllogictests-917e139464eeea33)
Completed 1 test files in 0 seconds                                              External error: 1 errors in file /Users/kosiew/GitHub/datafusion/datafusion/sqllogictest/test_files/intersection.slt

1. query failed: DataFusion error: Schema error: Schema contains duplicate qualified field name alltypes_plain.int_col
...

After

❯ cargo test --test sqllogictests -- --substrait-round-trip intersection.slt:33
    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.64s
     Running bin/sqllogictests.rs (target/debug/deps/sqllogictests-917e139464eeea33)
Completed 1 test files in 0 seconds

What changes are included in this PR?

  • Added a requalification step (requalify_sides_if_needed) inside intersect_or_except to avoid duplicate or ambiguous field names.

  • Improved conflict detection logic in requalify_sides_if_needed to handle:

    1. Duplicate qualified fields
    2. Duplicate unqualified fields
    3. Ambiguous references (qualified vs. unqualified collisions)
  • Updated optimizer tests to reflect correct aliasing (left, right).

  • Added new Substrait round‑trip tests for:

    • INTERSECT and EXCEPT (both DISTINCT and ALL variants)
    • Self-referential queries that previously failed
  • Minor formatting and consistency improvements in Substrait consumer code.

Are these changes tested?

Yes. The PR includes comprehensive tests that:

  • Reproduce the original failure modes.
  • Validate that requalification produces stable and correct logical plans.
  • Confirm correct behavior across INTERSECT, EXCEPT, ALL, and DISTINCT cases.

Are there any user-facing changes?

No user-facing behavior changes.
This is a correctness improvement ensuring that valid SQL queries—previously failing only in Substrait round‑trip mode—now work without error.

LLM-generated code disclosure

This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and validated.

Add automatic requalification logic to LogicalPlanBuilder::intersect_or_except
to handle conflicting column qualifiers. Wrap input plans and use
temporary aliases when conflicts are detected. Update the Substrait
SetRel consumer to apply this logic for intersection and except
operations. Add integration tests to verify the functionality with
self-referential queries.
Improve the logic in builder.rs to detect conflicts
across all three error cases. Return early with
requalification as soon as a conflict is found, while
preserving the original plan structure when no
conflicts exist.
Revise intersect test snapshot to reflect correct behavior
with requalification. The query now properly triggers
requalification for the inner INTERSECT when both sides
reference the same test source.
@github-actions github-actions bot added logical-expr Logical plan and expressions optimizer Optimizer rules substrait Changes to the substrait crate labels Nov 19, 2025
Comment on lines -273 to -276
LeftSemi Join: test.col_int32 = test.col_int32, test.col_utf8 = test.col_utf8
Aggregate: groupBy=[[test.col_int32, test.col_utf8]], aggr=[[]]
LeftSemi Join: test.col_int32 = test.col_int32, test.col_utf8 = test.col_utf8
Aggregate: groupBy=[[test.col_int32, test.col_utf8]], aggr=[[]]
Copy link
Contributor Author

@kosiew kosiew Nov 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old snapshot passed in main but was not properly distinguishing left from right

test.col_int32 = test.col_int32, test.col_utf8 = test.col_utf8

// is optimized away, resulting in just the LeftAnti join
assert_expected_plan(
"SELECT a FROM data WHERE a > 0 EXCEPT SELECT a FROM data WHERE a < 5",
"LeftAnti Join: left.a = right.a\
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a difference between the plans for INTERSECT (self_referential_intersect) and EXCEPT (self_referential_except) ?
I don't see any.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The expected plans look almost identical in the test assertions, which is confusing. The key difference is actually in the join type, not the overall structure:

  • self_referential_intersect produces: **LeftSemi** Join: left.a = right.a
  • self_referential_except produces: **LeftAnti** Join: left.a = right.a

The rest of the plan structure is identical because:

  1. Both operate on the same table (data) with similar filters
  2. Both include the DISTINCT operation (via Aggregate: groupBy=[[data.a]]) because neither uses ALL
  3. Both get requalified to left and right aliases due to the duplicate field name issue

// 2. Duplicate unqualified fields: both sides have same unqualified name
// 3. Ambiguous reference: one side qualified, other unqualified, same name
for l in &left_cols {
for r in &right_cols {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here the complexity is O(n*m).
You could optimize it to O(n+m) by iterating over left_cols (O(n)) and storing them in a HashMap<ColumnName, Column>, then while iterating over right_cols (O(m)) lookup by name in the hashmap (O(1)) and do the checks when there is an entry for that name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent observation on the algorithmic complexity. You're correct that the current nested loop is O(n*m), and this can be optimized to O(n+m) using a HashMap.

Analysis:

Here are some reasons for the current implementation:

  1. Schema size is typically small: For example, the TPC-H benchmark schemas range from 3-16 columns (median ~8). Even the largest table (lineitem with 16 columns) would only result in 256 comparisons worst-case for this function., which is negligible for modern CPUs.

  2. Early return on conflict: The function returns immediately upon finding the first conflict, so in the common case where conflicts exist (which is when this function matters), we often exit very early in the iteration.

  3. Simple conflict detection logic: The current implementation is straightforward and easy to reason about. The match statement clearly shows all conflict scenarios.

  4. Called infrequently: This function is only called during logical plan construction, not during execution. It's not in a hot path that runs millions of times.

Trade-offs of HashMap approach:

Pros:

  • O(n+m) vs O(n*m) complexity
  • Scales better for schemas with hundreds of columns

Cons:

  • More memory allocation overhead for the HashMap
  • More complex code that's slightly harder to understand
  • HashMap construction and hashing overhead may not pay off for small schemas
  • Need to handle the case where multiple columns have the same name in one schema (which can happen with different qualifiers)

If you feel strongly about this or if we anticipate very wide schemas (hundreds of columns), I'm happy to implement the HashMap-based optimization.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow! Such an answer!
Next time just tell me "We can improve it if it ever shows up in the profiler" 😄
Thank you!

@kosiew kosiew force-pushed the duplicate-schema-16295 branch from 1f7d563 to a1e2ca3 Compare November 19, 2025 13:56
@kosiew kosiew force-pushed the duplicate-schema-16295 branch from a1e2ca3 to fbde854 Compare November 19, 2025 13:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

logical-expr Logical plan and expressions optimizer Optimizer rules substrait Changes to the substrait crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[substrait] [sqllogictest] Schema contains duplicate qualified field name

2 participants