[Feature] Incremental strategies for delete+insert and microbatch cause unnecessary cross joins #1198

SystemOfaDrow · 2024-10-08T09:22:24Z

Is this your first time submitting a feature request?

I have read the expectations for open source contributors
I have searched the existing issues, and I could not find an existing issue for this feature
I am requesting a straightforward extension of existing dbt-snowflake functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Currently, the delete+insert incremental strategy requires a "unique_key", but nothing actually enforces that this key is unique. I've used a non-unique key here before, and the problem is that the USING clause creates a Cartesian join between all the records with matching (non-unique) unique_key values. The new microbatch strategy does something similar: if you don't have an additional predicate, the USING clause is unnecessary. In both cases, the join explosion can consume a significant amount of warehouse resources.
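To make the fan-out concrete, here is a minimal sketch that uses SQLite (from Python's standard library) as a stand-in for Snowflake. The `target` and `source` tables and the column `k` are hypothetical, and Snowflake's planner may behave differently; the point is only to compare how many rows a join on a non-unique key produces with how many rows actually need to be identified for deletion.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table target (k integer, payload text);
    create table source (k integer, payload text);
""")

# 3 target rows and 4 source rows all share the same non-unique key value.
conn.executemany("insert into target values (?, ?)", [(1, f"t{i}") for i in range(3)])
conn.executemany("insert into source values (?, ?)", [(1, f"s{i}") for i in range(4)])

# Rows a join-based delete has to materialise: every target/source pair per key.
join_rows = conn.execute(
    "select count(*) from target join source on target.k = source.k"
).fetchone()[0]

# Rows a semi-join (EXISTS) has to identify: each target row at most once.
exists_rows = conn.execute(
    "select count(*) from target where exists "
    "(select 1 from source where source.k = target.k)"
).fetchone()[0]

print(join_rows, exists_rows)  # 12 3
```

With n target rows and m source rows per key value, the join grows as n × m, while the semi-join stays at n.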

Describe alternatives you've considered

For the delete+insert strategy, I would propose renaming the "unique_key" config to "incremental_key" and modifying the query to use a correlated EXISTS subquery:

delete from {{ target }}
where exists (
    select 1
    from {{ source }}
    where
    {% if incremental_key is sequence and incremental_key is not string %}
        {% for key in incremental_key %}
            {{ source }}.{{ key }} = {{ target }}.{{ key }}
            {{ "and " if not loop.last}}
        {% endfor %}
    {% else %}
        {{ source }}.{{ incremental_key }} = {{ target}}.{{ incremental_key }}
    {% endif %}
)
{% if incremental_predicates %}
    {% for predicate in incremental_predicates %}
        and {{ predicate }}
    {% endfor %}
{% endif %};
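As a rough check of the proposed shape, the sketch below mirrors the template's logic in plain Python and runs the resulting statement against SQLite. `build_delete` is a hypothetical helper written for this illustration, not part of dbt, and the table and column names are made up. It assumes identifiers and predicates come from trusted model config, since they are interpolated directly into the SQL.

```python
import sqlite3

def build_delete(target, source, incremental_key, incremental_predicates=()):
    """Hypothetical helper: build a correlated-EXISTS delete like the template above."""
    # Accept a single column name or a list of column names.
    keys = [incremental_key] if isinstance(incremental_key, str) else list(incremental_key)
    match = " and ".join(f"{source}.{k} = {target}.{k}" for k in keys)
    sql = f"delete from {target} where exists (select 1 from {source} where {match})"
    # Each extra predicate is ANDed onto the outer WHERE clause.
    for predicate in incremental_predicates:
        sql += f" and {predicate}"
    return sql

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table target (id integer, day text, v text);
    create table source (id integer, day text, v text);
    insert into target values
        (1, 'a', 'old'), (1, 'a', 'old2'), (2, 'a', 'old'), (3, 'b', 'keep');
    insert into source values
        (1, 'a', 'new'), (1, 'a', 'new2'), (2, 'a', 'new');
""")

# Composite, non-unique key: duplicates on both sides still delete each row once.
sql = build_delete("target", "source", ["id", "day"])
deleted = conn.execute(sql).rowcount
remaining = conn.execute("select id, day, v from target").fetchall()
print(deleted, remaining)  # 3 [(3, 'b', 'keep')]
```

Every target row whose key appears in the source is removed exactly once, regardless of how many source rows share that key.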

The using {{ source }} line in the microbatch strategy should simply be deleted.
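A small sketch of why the clause is redundant in that case, again using SQLite as a stand-in. The `event_time` column, the date window, and the table contents are assumptions made for illustration. SQLite has no DELETE ... USING, so the unreferenced source is emulated with an equivalent SELECT to count the row pairs the join would produce.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table target (id integer, event_time text);
    create table source (id integer, event_time text);
    insert into target values (1, '2024-10-01'), (2, '2024-10-02'), (3, '2024-10-03');
    insert into source values
        (7, '2024-10-02'), (8, '2024-10-02'), (9, '2024-10-02'), (10, '2024-10-02');
""")

# The batch window only references the target table.
window = "target.event_time >= '2024-10-02' and target.event_time < '2024-10-03'"

# With the source in scope but never referenced, every matching target row
# pairs with every source row.
pairs = conn.execute(f"select count(*) from target, source where {window}").fetchone()[0]

# Without the source, the same predicate selects exactly the rows to delete.
deleted = conn.execute(f"delete from target where {window}").rowcount
remaining = [row[0] for row in conn.execute("select id from target order by id")]
print(pairs, deleted, remaining)  # 4 1 [1, 3]
```

One behavioural difference that seems worth testing before removing the clause: with an empty source, a join-based delete would match nothing, whereas the plain delete still clears the window.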

Who will this benefit?

This would benefit everyone who uses these incremental strategies by significantly lowering warehouse load (and therefore costs) for these types of queries.

Are you interested in contributing this feature?

I think the code above should work. It may take me a while before I have time to do all the requisite testing before creating a PR.

Anything else?

I ran one query like DELETE FROM target USING source with no additional predicates, where the target table had 10 million records and the source table had a billion. The query profile shows how expensive that Cartesian join is relative to the delete itself:

Most Expensive Nodes
CartesianJoin 47.3%
Delete 27.0%
TableScan 1.3%


graciegoheen · 2024-10-23T20:56:12Z

This is related to:

[Bug] Cartesian Join based deletion is causing performance problems when it hits a certain scale for microbatch models #1228
[Feature] new configuration to run tests on only the "new" data for snapshots and incremental models dbt-core#10877

SystemOfaDrow added enhancement New feature or request triage labels Oct 8, 2024

ugmuka linked a pull request Oct 18, 2024 that will close this issue

fix: [microbatch] delete redundant using clause #1214

Open

4 tasks

amychen1776 added incremental and removed triage labels Oct 23, 2024

graciegoheen mentioned this issue Oct 23, 2024

[Feature] new configuration to run tests on only the "new" data for snapshots and incremental models dbt-labs/dbt-core#10877

Open
