Conversation

stephen-up
Hi, when dbt runs an incremental+append model it always drops the temporary tables that are used for full refreshes, even if it is not going to use them.

In my system, I have scheduled incremental runs of dbt that run frequently, updating incremental+append tables. Whenever I do a full refresh concurrently, the scheduled run drops the tables belonging to the full refresh, causing it to crash.

This PR stops that by making incremental+append runs skip dropping those tables.

I see there have been a few other issues similar to this, but this seems like an easy, minimal way to fix the problem.

Thanks

@CLAassistant
CLAassistant commented Oct 9, 2024

CLA assistant check
All committers have signed the CLA.

@stephen-up
Author

I signed the CLA, by the way. I don't know how long it takes to update.

@BentsiLeviav
Contributor

Hi @stephen-up,

Thank you for your contribution! I appreciate the effort you’ve put into this.

I'm hesitant to approve this, as it could leave a hanging table behind. While I understand that it would be dropped in the next run, it’s still something to consider.

Additionally, this solution seems to address a specific edge case rather than providing a more comprehensive approach. A broader solution for handling parallel runs would be preferable.

I'll review the suggested alternative, though it currently applies only to temporary tables and not to relations created via make_intermediate_relation. Perhaps we could override make_intermediate_relation to include the invocation_id, similar to what was done in #150 and #353?
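A minimal sketch of that override, assuming the adapter dispatches to a `clickhouse__`-prefixed macro and that `invocation_id` is available in the Jinja context (the file name and macro body here are illustrative, not the merged implementation):

```jinja
{# macros/make_intermediate_relation.sql — hypothetical sketch #}
{% macro clickhouse__make_intermediate_relation(base_relation, suffix) %}
  {# Suffix the intermediate relation with this run's invocation_id so
     concurrent runs never target the same shadow table. #}
  {% set unique_suffix = suffix ~ '_' ~ (invocation_id | replace('-', '_')) %}
  {{ return(default__make_intermediate_relation(base_relation, unique_suffix)) }}
{% endmacro %}
```

The trade-off discussed below still applies: unique names prevent collisions between concurrent runs, but a crashed run leaves its uniquely named table behind, since no later run will target the same name to drop it.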

We could also introduce a feature flag for this macro, defaulting to false (disabled). What do you think?

@stephen-up

Hi Bentsi.

Leaving hanging tables behind does sound like something we want to avoid. Putting random IDs in the names, as #150 and #353 do, would ensure better concurrency and support multiple concurrent runs, but it increases the risk and cost of hanging tables.

Simply not dropping the shadow tables that aren't even used during an incremental+append run seems simpler and more desirable to me. I'm happy to add a feature flag for it.

@stephen-up

I've added the feature flag. Is there a way to set the feature flag for every model in a dbt project? The way I've implemented it, it looks like it needs to be set in every model, like:

```jinja
{{ config(
  order_by='project_id, igid',
  engine='ReplacingMergeTree(version)',
  materialized='incremental',
  incremental_strategy='append',
  incremental_append_skip_drop_table='true'
) }}
```
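If I understand dbt's config precedence correctly, a model-level setting like this can usually also be applied once for the whole project in `dbt_project.yml` using a `+`-prefixed key, assuming the flag is read through `config.get()` like any other model config (the project name below is hypothetical):

```yaml
# dbt_project.yml — sketch, not tested against this PR
models:
  my_project:                                  # your project's name
    +incremental_append_skip_drop_table: true  # applies to every model underneath
```

Individual models can still override the project-wide value in their own `config()` block, since model-level config takes precedence in dbt.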

@BentsiLeviav

The feature flag I referred to in my previous reply was for the suggested `make_intermediate_relation` override (my mistake if that wasn't clear).
I created a new issue, #420, to gather the discussion in one place, because multiple PRs and users are trying to solve this problem. Would you mind adding your thoughts there?

@stephen-up

Hi @BentsiLeviav, I've added some comments to #420. When thinking about race conditions, it gets tricky.

Do you want to support concurrent full refreshes, or a pair of a full refresh and an incremental run? If you just want the latter, as I do, I think this PR is perfect.

The incremental + full-refresh combination seems like a common pattern. We've set up a pipeline with an incremental run every half hour plus occasional full refreshes that take hours, which easily leads to a clash.
