Upsert merge strategy for iceberg #2671

Open · wants to merge 6 commits into devel from feat/2549-upsert-filesystem-iceberg

Conversation


@anuunchin anuunchin commented May 22, 2025

Description

This PR enables the "upsert" merge write disposition for Iceberg. The functionality mirrors the existing Delta implementation; under the hood, Iceberg's table upsert is used.
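Conceptually, an upsert merge means: rows whose key matches an existing row replace it, and all other rows are inserted. A minimal, dependency-free sketch of these semantics (plain dicts standing in for table rows; this is an illustration, not the actual pyiceberg call):

```python
def upsert_rows(target, incoming, key):
    """Merge `incoming` rows into `target`: matching keys update, new keys insert."""
    merged = {row[key]: row for row in target}
    for row in incoming:
        merged[row[key]] = row  # update on key match, insert otherwise
    return list(merged.values())

target = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
incoming = [{"id": 2, "v": "B"}, {"id": 3, "v": "c"}]
merged = upsert_rows(target, incoming, "id")
# id 2 is updated in place, id 3 is appended
```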

Related Issues

Additional Context (for failing tests)

  1. The test test_merge_on_keys_in_schema_nested_hints would likely fail with a maximum recursion error, as Iceberg's upsert has a limitation mentioned here. This is apparently resolved in pyiceberg 0.9.1, but I'm still getting a SIGILL error with 0.9.1 and a maximum recursion error with 0.9.0; I'm unsure of the reason.

    • Possible solution: The issue can be addressed by batching the input tables into chunks of 1000 rows each.
  2. The test test_table_format_schema_evolution is failing with Data type struct<a: int64> is not supported in join non-key field json, because Iceberg's upsert uses Arrow's hash join, which doesn't handle nested types. This was addressed with this PR in pyiceberg 0.9.1, but with 0.9.1 I get a different error: Unexpected physical type FIXED_LEN_BYTE_ARRAY for decimal(6, 4), expected INT32. The latter issue was also addressed in this PR in pyiceberg, but that fix is not yet released, so the issue should go away with pyiceberg's next release. (I tested with the newest code and it works 👀 )

    • Possible solution: We wait for the next pyiceberg release and upgrade to the latest version.
  3. The tests test_pipeline_load_parquet and test_table_format_child_tables fail because Iceberg's upsert strictly forbids duplicates in both the input data and the target table: if the key columns are not unique inside the source, it raises an error. Delta doesn't even seem to handle this properly 👀 .

    • Possible solution: We state the uniqueness requirement explicitly in the docs, leave it to the user to enforce it, and adjust the failing tests.
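The two workarounds proposed above (batching, and source-side key uniqueness) need no Iceberg dependency to illustrate. A hedged sketch — the batch size of 1000 comes from the discussion, the helper names and keep-last dedup policy are my own assumptions:

```python
def dedupe_on_key(rows, key):
    """Keep only the last row per key value, satisfying upsert's
    requirement that source keys be unique (illustrative policy)."""
    unique = {row[key]: row for row in rows}
    return list(unique.values())

def batches(rows, size=1000):
    """Yield fixed-size chunks so each upsert call processes a bounded
    number of rows (the suggested workaround for the recursion limit)."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

rows = [{"id": 1, "v": "old"}, {"id": 1, "v": "new"}, {"id": 2, "v": "x"}]
deduped = dedupe_on_key(rows, "id")       # duplicate id 1 collapses to the last row
chunks = list(batches(deduped, size=2))   # one chunk of 2 rows
```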


netlify bot commented May 22, 2025

Deploy Preview for dlt-hub-docs canceled.

🔨 Latest commit: b8f57be
🔍 Latest deploy log: https://app.netlify.com/projects/dlt-hub-docs/deploys/68415425965d740008263ca6

@anuunchin anuunchin self-assigned this May 22, 2025
@anuunchin anuunchin requested a review from rudolfix May 22, 2025 15:29
Collaborator

@sh-rp sh-rp left a comment


Thanks for working on this! About your questions:

  1. You can use the suggested code and document the behavior. I would also update to 0.9.1, try to isolate the SIGILL in a very small code block (I get it too), and open an issue in pyiceberg; they seem to have some kind of bug.
  2. If you update to pyiceberg 0.9.1, you'll see the tests for append and replace also start failing, so they changed something there that is unrelated to merging. I would also try to isolate this and see if you can figure out what provokes this error and whether this is a problem on our part or theirs.
  3. Yes, please document (you already did) and fix the tests accordingly. What happens in delta? Do we need a code or docs update there?

with table.update_schema() as update:
    update.union_by_name(ensure_iceberg_compatible_arrow_schema(data.schema))

if "parent" in schema:
Collaborator

I'm not sure about this child table loading strategy. I know you took it from delta, but it seems to me that the first unique column will be the _dlt_id, which will always be new since it is generated in the normalize step, so the merge condition is never met and we could just append. But maybe let's leave it like this for now.

Collaborator

upsert uses a deterministic row_key that is computed from the primary_key of the root table
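A deterministic row key of this kind can be sketched as a hash over the primary-key column values, so re-loading the same logical row always yields the same key and the merge condition can match. This is an illustrative sketch, not dlt's actual implementation; the function name, separator, and digest length are my own assumptions:

```python
import hashlib

def row_key(row, primary_key):
    """Derive a stable key from the primary-key columns: the same
    primary-key values always hash to the same row key."""
    # join key values with a separator unlikely to appear in data
    material = "\x1f".join(str(row[col]) for col in primary_key)
    return hashlib.sha256(material.encode("utf-8")).hexdigest()[:16]

k1 = row_key({"id": 7, "name": "a"}, ["id"])
k2 = row_key({"id": 7, "name": "b"}, ["id"])
assert k1 == k2  # non-key columns don't affect the row key
```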

@anuunchin anuunchin force-pushed the feat/2549-upsert-filesystem-iceberg branch 2 times, most recently from 8cd57d7 to 93cbfee Compare May 30, 2025 13:44
Collaborator

@rudolfix rudolfix left a comment


LGTM! We should document batching and use the PyPI version of pyiceberg.

with table.update_schema() as update:
    update.union_by_name(ensure_iceberg_compatible_arrow_schema(data.schema))

if "parent" in schema:
Collaborator

upsert uses a deterministic row_key that is computed from the primary_key of the root table

@anuunchin anuunchin force-pushed the feat/2549-upsert-filesystem-iceberg branch 2 times, most recently from 811ed93 to 7dfec68 Compare June 4, 2025 12:19
rudolfix
rudolfix previously approved these changes Jun 4, 2025
Collaborator

@rudolfix rudolfix left a comment


LGTM! But you need to resolve the conflict and merge devel...

@anuunchin anuunchin force-pushed the feat/2549-upsert-filesystem-iceberg branch from 7dfec68 to c17d230 Compare June 5, 2025 07:31
@anuunchin anuunchin force-pushed the feat/2549-upsert-filesystem-iceberg branch from c17d230 to b8f57be Compare June 5, 2025 08:24
Development

Successfully merging this pull request may close these issues.

enable upsert for filesystem / iceberg destination
3 participants