[BUG] ma_sat_v0 macro is not working properly #331


Closed
JulioGarciaVW opened this issue Mar 13, 2025 · 8 comments
Labels
bug Something isn't working

Comments


JulioGarciaVW commented Mar 13, 2025

Describe the bug

We have found a bug in the ma_sat_v0 macro.

The first run works as expected; the bug appears on incremental runs.

In our case, the staging data does not contain the record's creation date, so we use a sysdate function (as Data Vault recommends when no date is available). As a result, every day some hashdiffs appear to have changed, because dv_load_date is always newer.
The ma_sat primary key is parent_hashkey + src_ma_key, but the macro source code checks only parent_hashkey.

{%- if is_incremental() %}
latest_entries_in_sat AS (

    SELECT
        {{ parent_hashkey }},
        {{ ns.hdiff_alias }}
    FROM 
        {{ this }}
    QUALIFY ROW_NUMBER() OVER(PARTITION BY {{ parent_hashkey|lower }} ORDER BY {{ src_ldts }} DESC) = 1  
),
{%- endif %}

So, in our case we have two records per parent_hashkey, but the query above keeps only one of them, because it partitions by parent_hashkey alone rather than by parent_hashkey + src_ma_key.

Combined with the query below, this means that every day one of the two records gets re-inserted (even when nothing has changed), and the table never holds the true latest version of both: each day a different one shows up as the latest.

records_to_insert AS (

    SELECT
        deduped_rows.{{ parent_hashkey }},
        deduped_rows.{{ ns.hdiff_alias }},
        {{ datavault4dbt.alias_all(columns=source_cols, prefix='deduped_rows') }}
    FROM deduped_rows
    {%- if is_incremental() %}
    WHERE NOT EXISTS (
        SELECT 1
        FROM latest_entries_in_sat
        WHERE {{ datavault4dbt.multikey(parent_hashkey, prefix=['latest_entries_in_sat', 'deduped_rows'], condition='=') }}
            AND {{ datavault4dbt.multikey(ns.hdiff_alias, prefix=['latest_entries_in_sat', 'deduped_rows'], condition='=') }} 
            )
    {%- endif %}

    )
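For illustration, here is a sketch of the dedup CTE we would expect, partitioning by the full multi-active key (column names taken from the reproduction below; this is a sketch of the expected behavior, not the macro's actual code):

```sql
latest_entries_in_sat AS (
    SELECT
        parent_hashkey,
        src_ma_attribute,
        src_hashdiff
    FROM ma_sat_v0
    -- Partition by the full multi-active primary key, not just the hashkey,
    -- so the latest hashdiff is tracked per (parent_hashkey, src_ma_key) pair.
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY parent_hashkey, src_ma_attribute
        ORDER BY dv_load_date DESC
    ) = 1
)
```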

Environment

  • dbt version: core, 1.9.0
  • datavault4dbt version: 1.9.0
  • Database/Platform: Redshift (AWS)

To Reproduce

SQL example_model:

select
  1 as parent_hashkey, 
  'a' as src_ma_attribute,
  'comment' as payload_field,
  'abc' as src_hashdiff,
  sysdate as dv_load_date,
  'source1' as dv_source_system
union all
select
  1 as parent_hashkey, 
  'b' as src_ma_attribute,
  'comment2' as payload_field,
  'bcd' as src_hashdiff,
  sysdate as dv_load_date,
  'source1' as dv_source_system

ma_sat_model:

{{ config(materialized='incremental') }}

{%- set yaml_metadata -%} 
source_model: 'example_model'
parent_hashkey: 'parent_hashkey'
src_hashdiff: 'src_hashdiff'
src_ma_key: 'src_ma_attribute'
src_payload: 
    - payload_field
{%- endset -%}

{%- set metadata_dict = fromyaml(yaml_metadata) -%}

{{ datavault4dbt.ma_sat_v0(source_model=metadata_dict.get('source_model'),
                        parent_hashkey=metadata_dict.get('parent_hashkey'),
                        src_hashdiff=metadata_dict.get('src_hashdiff'),
                        src_ma_key=metadata_dict.get('src_ma_key'),
                        src_payload=metadata_dict.get('src_payload')) }}
  1. Run the example_model.
  2. Run ma_sat_v0.
  3. Run the example_model again.
  4. Run ma_sat_v0 again.
  5. Check the ma_sat_v0 results.

Expected behavior

Each parent_hashkey + src_ma_key are treated separately.

Screenshots

[Screenshot: ma_sat_v0 query results]

In this screenshot I have filtered on a single parent_hashkey (first column).
The second column is the hashdiff; you can see it does not change within the same primary key (parent_hashkey + nation).
The fourth column (nation) is the src_ma_key.


@JulioGarciaVW JulioGarciaVW added the bug Something isn't working label Mar 13, 2025
@JulioGarciaVW (Author)

Could you take a look?

@tkirschke (Member)

Hi @JulioGarciaVW, and sorry for the late reply.

The latest_entries_in_sat CTE exists only to get the latest hashdiff value for each parent_hashkey. In theory, the hashdiff for all members of one multi-active group (several rows sharing one hashkey + ldts with different ma_key values) should be identical, because a block hashdiff is used in the stage model.

This is what's causing the loading error on your side: your stage does not seem to be configured for multi-activity, so it does not calculate this multi-active block hashdiff. To do so, see Example 3 on the staging docs page.

You basically have to add this config to your stage yaml:

multi_active_config:
    multi_active_key: 'src_ma_attribute'
    main_hashkey_column: 'parent_hashkey'

Let me know if the delta loading process now works!

@JulioGarciaVW (Author)

Understood.
Our problem is that we are not using the staging macros, because we have the same problem as posted here:
we have data with leading and trailing spaces that are genuinely significant, and the staging macro trims everything.

Are you working on issue 319? If you add a flag to keep the spaces, we can switch to the staging macros.

@JulioGarciaVW (Author)

I have also seen that there can be a problem with the SHA256 algorithm on Redshift.

It looks like the function name is set to SHA256, but that function does not exist in Redshift; the correct one is SHA2.

{%- macro default__hash_default_values(hash_function, hash_datatype) -%}

    {%- set dict_result = {} -%}
    {%- set hash_alg = '' -%}
    {%- set unknown_key = '' -%}
    {%- set error_key = '' -%}

    {%- if hash_function == 'MD5' -%}
        {%- set hash_alg = 'MD5' -%}
        {%- set unknown_key = '!00000000000000000000000000000000' -%}
        {%- set error_key = '!ffffffffffffffffffffffffffffffff' -%}
    {%- elif hash_function == 'SHA' or hash_function == 'SHA1' -%}
        {%- set hash_alg = 'SHA1' -%}
        {%- set unknown_key = '!0000000000000000000000000000000000000000' -%}
        {%- set error_key = '!ffffffffffffffffffffffffffffffffffffffff' -%}
    {%- elif hash_function == 'SHA2' or hash_function == 'SHA256' -%}
        {%- set hash_alg = 'SHA256' -%}
        {%- set unknown_key = '!0000000000000000000000000000000000000000000000000000000000000000' -%}
        {%- set error_key = '!ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff' -%}
    {%- endif -%}

    {%- do dict_result.update({"hash_alg": hash_alg, "unknown_key": unknown_key, "error_key": error_key }) -%}

    {{ return(dict_result | tojson ) }}

{%- endmacro -%}

This is the macro that applies, because there is no Redshift-specific one. I guess hash_alg would have to be 'SHA2' on Redshift.
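For illustration, an adapter-specific override following dbt's dispatch naming convention might look like this. The macro name prefix and the assumption that only hash_alg needs to change are mine, not the library's; note also that Redshift's function is SHA2(expression, bits), so the calling code would additionally need to pass the bit length:

```sql
{# Hypothetical Redshift override, following dbt's redshift__ dispatch prefix.
   Only the SHA-256 branch is sketched here; the other branches would mirror
   the default macro shown above. Sketch only, not the library's code. #}
{%- macro redshift__hash_default_values(hash_function, hash_datatype) -%}

    {%- set dict_result = {} -%}

    {%- if hash_function == 'SHA2' or hash_function == 'SHA256' -%}
        {# Redshift exposes SHA-256 as SHA2(expr, 256), not SHA256(expr). #}
        {%- set hash_alg = 'SHA2' -%}
        {%- set unknown_key = '!0000000000000000000000000000000000000000000000000000000000000000' -%}
        {%- set error_key = '!ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff' -%}
    {%- endif -%}

    {%- do dict_result.update({"hash_alg": hash_alg, "unknown_key": unknown_key, "error_key": error_key }) -%}

    {{ return(dict_result | tojson ) }}

{%- endmacro -%}
```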

@tkirschke (Member)

> We have data with leading and trailing spaces that are genuinely significant, and the staging macro trims everything.
>
> Are you working on issue 319? If you add a flag to keep the spaces, we can switch to the staging macros.

Okay, understood. This is an existing feature request, for which active development is not currently planned.

However, I described a workaround in this issue, which uses derived columns in the staging model to replace those whitespace characters with alternative characters. If you implement it like that, you can use the block hashdiff, and the multi-active satellite should work fine.

If you do not want to do that, you would need to modify your current hashing script to use a LISTAGG / STRING_AGG for the hash calculation. More details can be found here under the keyword "delta check on".
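As an illustration of that approach (column names are taken from the reproduction above; the delimiter, aggregation order, and hash function are assumptions, not the library's actual implementation):

```sql
-- Sketch: compute one block hashdiff per parent_hashkey by aggregating all
-- rows of the multi-active group into a single string before hashing.
-- A deterministic ORDER BY inside LISTAGG is essential so that the same
-- group contents always produce the same hash.
SELECT
    parent_hashkey,
    SHA2(
        LISTAGG(src_ma_attribute || '|' || payload_field, '||')
            WITHIN GROUP (ORDER BY src_ma_attribute),
        256
    ) AS block_hashdiff
FROM example_model
GROUP BY parent_hashkey
```

With this, every row of a multi-active group carries the same hashdiff, so deduplicating on parent_hashkey alone (as the macro does) is sufficient for the delta check.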

@JulioGarciaVW (Author)

OK, thanks for your quick answer @tkirschke!

We will try the second option, and once you add the feature request about leading and trailing spaces, we may refactor the solution to that approach.

Thanks for opening the Redshift issue as well!

@tkirschke (Member)

> OK, thanks for your quick answer @tkirschke!
>
> We will try the second option, and once you add the feature request about leading and trailing spaces, we may refactor the solution to that approach.
>
> Thanks for opening the Redshift issue as well!

Great, let me know whether the satellite works as intended then. If so, I would appreciate it if you close this bug! :)

@JulioGarciaVW (Author)

Solved!
