Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add transient configs #777

Closed
wants to merge 3 commits into from
Closed

Conversation

jeremyyeo
Copy link
Contributor

resolves #776
docs dbt-labs/docs.getdbt.com/#

Problem

Python models don't respect transient config.

Solution

A snowpark table created via save_as_table can be transient via setting the param table_type = 'transient' (https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/latest/api/snowflake.snowpark.DataFrameWriter.save_as_table).

  1. Pull out the transient config setting in the CTAS (snowflake__create_table_as()) from being exclusive to SQL models.
  2. Now that that config is available for Python models too - use that to determine the right value to set for the table_type arg.

There's probably something else to address.

  1. Assuming users don't set +transient config anywhere (i.e. profile, project, model level).
  2. By default, SQL tables are created as transient.
  3. By default, Python tables are created as non-transient.

This change would make Python tables also be created as transient - in line with (2).

Checklist

  • I have read the contributing guide and understand what's expected of me
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • This PR has no interface changes (e.g. macros, cli, logs, json artifacts, config files, adapter interface, etc) or this PR has already received feedback and approval from Product or DX

@jeremyyeo jeremyyeo requested a review from a team as a code owner September 17, 2023 23:04
@cla-bot cla-bot bot added the cla:yes label Sep 17, 2023
@dbeatty10
Copy link
Contributor

User experience change

There's probably something else to address.

  1. Assuming users don't set +transient config anywhere (i.e. profile, project, model level).
  2. By default, SQL tables are created as transient.
  3. By default, Python tables are created as non-transient.

This change would make Python tables also be created as transient - in line with (2).

Changing the default behavior for Python tables to be created as transient instead of non-transient seems reasonable to me since it would bring it in-line with the default value for SQL tables. It makes sense for them to have the same default.

The difference

The key difference is that transient tables do not have a Fail-safe period (whereas permanent tables do). So a user can trade data protection and recovery for lower cost.

The consequences

If this change is adopted as-is, then users will need to explicitly add transient: false / dbt.config(transient=False) if they want to keep the Fail-safe period's data protection and recovery abilities (along the associated costs).

Documentation

It would be nice to add some documentation in docs.getdbt.com how to configure the transient boolean (including explanation of the default value).

I don't know if this is the best place or if there's a better place.

We'd probably want to show the non-default config:

  1. Project setup:
# dbt_project.yml
name: my_dbt_project
profile: all
config-version: 2
version: 1.0

models:
  my_dbt_project:
    +materialized: table
    +transient: false
# models/snek.py
import pandas as pd
def model(dbt, session):
    dbt.config(transient=False)
    return pd.DataFrame({"id": [1]})

Testing

We'd probably want to add some test case to make sure that both transient true & false behave as intended.

Would probably be accomplished by using config like shown above and then checking the value in the information schema afterwards.

Would you be able to add those tests @jeremyyeo ? Or would you be looking for further guidance or assistance from the Adapters team on the testing portion?

@jeremyyeo
Copy link
Contributor Author

Would probably be accomplished by using config like shown above and then checking the value in the information schema afterwards.
Would you be able to add those tests @jeremyyeo ? Or would you be looking for further guidance or assistance from the Adapters team on the testing portion?

Certainly would love some guidance added those - couldn't find any examples in the repo that does this (but could just be me essentially having a lack of testing knowledge).

@dbeatty10 dbeatty10 added the ready_for_review Externally contributed PR has functional approval, ready for code review from Core engineering label Sep 19, 2023
@dbeatty10
Copy link
Contributor

Certainly would love some guidance added those - couldn't find any examples in the repo that does this (but could just be me essentially having a lack of testing knowledge).

@jeremyyeo We'll need to get insight from someone on the Adapters team of what kind of tests they'd like for this.

I didn't see any tests here that seemed like a perfect match:

@colin-rogers-dbt
Copy link
Contributor

@jeremyyeo this is a great change

To add some tests:

  1. Create an example python model as a string like here. You should be able to just copy the models__simple_python_model fixture and add the transient config to it
  2. Add a new test class like:
class TestTransientModelSnowflake:
    @pytest.fixture(scope="class")
    def models(self):
        return {"transient_python_model.py": models__transient_python_model}

    def test_transient_model(self, project):
        run_dbt(["run"])

@@ -38,7 +38,7 @@

{% endmaterialization %}

{% macro py_write_table(compiled_code, target_relation, temporary=False) %}
{% macro py_write_table(compiled_code, target_relation, temporary=False, table_type='transient') %}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd rather pass transient as a boolean into this macro and then handle the translation here for the call to df.write.mode(); the translation is specific to this call and we assume this is a boolean in the user config. Right now we have the translation in the create macro, which maps to "transient" or "". Also, is empty string correct here? What happens if we submit a call to df.write.mode() with table_type="" (versus omitting the kwarg table_type entirely)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, is empty string correct here? What happens if we submit a call to df.write.mode() with table_type="" (versus omitting the kwarg table_type entirely)?

@mikealfare that's a wise question 🧠

I didn't test this out personally, but the comments in the snowpark source code describe that case here.

So the way it is implemented in this PR currently, the string value of table_type just becomes a pass-through in line 55 below:

df.write.mode("overwrite").save_as_table('{{ target_relation_name }}', create_temp_table={{temporary}}, table_type='{{table_type}}')

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From Snowflake docs (and what Doug linked to):

table_type – The table type of table to be created. The supported values are: temp, temporary, and transient. An empty string means to create a permanent table. Learn more about table types here.

So effectively:

# Table is transient.
save_as_table(..., table_type='transient')

# Table is non-transient.
save_as_table(..., table_type='')

# Table is non-transient (so default of `table_type` seems to be empty string).
save_as_table(...)

Could swap it out to (according to Mike's suggestion):

-- dbt/include/snowflake/macros/materializations/table.sql
{% macro py_write_table(compiled_code, target_relation, temporary=False, transient=False) %}
...
{% set table_type = 'transient' if transient else '' %}
df.write.mode("overwrite").save_as_table('{{ target_relation_name }}', create_temp_table={{temporary}}, table_type='{{table_type}}')

Not too sure if it's worth writing additional logic to omit (or not) the table_type arg when it is an empty string (or not) - i.e.

-- dbt/include/snowflake/macros/materializations/table.sql
{% macro py_write_table(compiled_code, target_relation, temporary=False, transient=False) %}
...
{% if transient %}
df.write.mode("overwrite").save_as_table('{{ target_relation_name }}', create_temp_table={{temporary}}, table_type='transient')
{% else %}
df.write.mode("overwrite").save_as_table('{{ target_relation_name }}', create_temp_table={{temporary}})

Happy to follow a suggested pattern @mikealfare

Copy link
Contributor Author

@jeremyyeo jeremyyeo Sep 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ftr, the docs linked above point out that the kwarg create_temp_table is deprecated:

create_temp_table – (Deprecated) The to-be-created table will be temporary if this is set to True.

Which of course we still use in our save_as_table() method calls - so potentially would be a good time to replace that with table_type kwarg entirely.

Go from (status quo):

df.write.mode("overwrite").save_as_table('{{ target_relation_name }}', create_temp_table={{temporary}})

Or (this PR's original intended change):

df.write.mode("overwrite").save_as_table('{{ target_relation_name }}', create_temp_table={{temporary}}, table_type='{{table_type}}')

To just:

df.write.mode("overwrite").save_as_table('{{ target_relation_name }}', table_type='{{table_type}}')

And then - we'll have all types via:

# Table is transient.
df.write.mode("overwrite").save_as_table('{{ target_relation_name }}', table_type='transient')

# Table is non-transient.
df.write.mode("overwrite").save_as_table('{{ target_relation_name }}', table_type='')

# Table is non-transient. Same as above - so pick a style we want I guess.
df.write.mode("overwrite").save_as_table('{{ target_relation_name }}')

# Table is temporary.
df.write.mode("overwrite").save_as_table('{{ target_relation_name }}', table_type='temporary')

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay, just getting around to plowing through my GH notifications. All of this context helps a lot.

Given that SF is deprecating create_temp_table in favor of table_type, I agree with using table_type in the signature, as a string. And I would default it to empty string in alignment with the docs. This also keeps config parsing out of a macro that otherwise does not care about the jinja global config that's floating around.

I wouldn't worry about removing the table_type argument if table_type is not in the call. However, I think we need to be cognizant of backwards compatibility for the temporary argument in the py_write_table macro. So that means we need to take both arguments in the py_write_table macro and combine them. @dbeatty10, correct me if I'm wrong here, but I think that amounts to something like this:

{% if temporary == True %}
{% set table_type = "temporary" %}  -- hence override the value of `table_type` that was passed in
{% else %}
-- this else is not needed, but communicates the concept
-- keep the value of `table_type` that was passed in (which could be the default empty string)
{% endif %}

Then we update the call to save_as_table to exclude create_temp_table in alignment with your third option above:

df.write.mode("overwrite").save_as_table('{{ target_relation_name }}', table_type='{{table_type}}')

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds about right @mikealfare.

Logic

To support both backwards-compatibility as well as forward-facing use cases, I'd suggest that the new table_type parameter takes precedence over temporary (but only when it is specified).

So like this (in alignment with the Snowflake docs here):

  • use table_type when specified
  • use "temporary" as the table type only when temporary is True and table_type is not specified
  • default the table type to "transient" if all else fails

e.g., something similar to what you wrote, but with the precedence flipped:

An untested implementation

{% macro py_write_table(compiled_code, target_relation, temporary=False, table_type=none) %}

...

{% if table_type is none and temporary %}
    {% set table_type = "temporary" %}
{% elif table_type is none %}
    {# Default to "transient", just like dbt SQL tables in Snowflake #}
    {% set table_type = "transient" %}
{% elif table_type == "permanent" %}
    {# Snowflake treats "" as meaning "permanent" #}
    {% set table_type = "" %}
{% endif %}

...

df.write.mode("overwrite").save_as_table('{{ target_relation_name }}', table_type='{{table_type}}')

Unnecessarily complicated?

This code might appear unnecessarily complicated on first blush. But it's a four-fold consequence of:

  1. Goal of giving precedence to table_type when a consumer uses the 3-parameter version of the py_write_table macro.
  2. Goal of being backwards-compatible when a consumer uses the 2-parameter version of the py_write_table macro.
  3. Goal of using "transient" as the default value (unless overridden)
  4. Snowflake docs say: "The supported values of table_type are: temp, temporary, and transient. An empty string means to create a permanent table."

Copy link
Contributor

@mikealfare mikealfare Oct 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with this, but I have one suggestion.

Since we're mimicking the behavior of py_write_table and we want folks to use table_type moving forward, we want to support table_type="transient", table_type="temp", and table_type="temporary". This logic does actually do that, but I read it multiple times before I realized that happens because we're updating the parameter that comes in, hence it passes straight through. For the sake of readability, especially because it's jinja, I'd like to add an else clause to the if block that just contains the comment "otherwise leave table_type as it is" (or something along those lines). It would save folks some time in the future.

{# Snowflake treats "" as meaning "permanent" #}

I completely misread the docs. I conflated this with the default value of "transient". It's kind of wild that empty string is a valid value that means something and is also not the default.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wild indeed! 🤠

Your suggestion works for me 👍

Here's what it would look like after adding that piece:

{% macro py_write_table(compiled_code, target_relation, temporary=False, table_type=none) %}

...

{% if table_type is none and temporary %}
    {% set table_type = "temporary" %}
{% elif table_type is none %}
    {#- Default to "transient", just like dbt SQL tables in Snowflake -#}
    {% set table_type = "transient" %}
{% elif table_type == "permanent" %}
    {#- Snowflake treats "" as meaning "permanent" -#}
    {% set table_type = "" %}
{% else %}
    {#- Otherwise leave table_type as it is -#}
{% endif %}

...

df.write.mode("overwrite").save_as_table('{{ target_relation_name }}', table_type='{{table_type}}')

@@ -46,7 +46,8 @@
{%- endif -%}

{%- elif language == 'python' -%}
{{ py_write_table(compiled_code=compiled_code, target_relation=relation, temporary=temporary) }}
{%- set table_type = 'transient' if transient else '' -%}
{{ py_write_table(compiled_code=compiled_code, target_relation=relation, temporary=temporary, table_type=table_type) }}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change

{%- set table_type = 'transient' if transient else '' -%}
{{ py_write_table(compiled_code=compiled_code, target_relation=relation, temporary=temporary, table_type=table_type) }}

to

{{ py_write_table(compiled_code=compiled_code, target_relation=relation, temporary=temporary, transient=transient) }}

and handle within py_write_table instead if we want to proceed with Mike's suggestion above (https://github.com/dbt-labs/dbt-snowflake/pull/777/files#r1330527575).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would do a few things here in alignment with my comment above. I would update the definition of table_type here to consider temporary as discussed above. And then I would not pass temporary into the py_write_table call at all. That tidies up all the config parsing. You would still need to keep that if block in the py_write_table macro in the even that other folks are using it; but then when we eventually retire the temporary argument in that macro, we don't need to remember to come back here and deal with it here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dbeatty10 I think this still aligns with what you're saying above, correct? We're effectively forcing the use of table_type here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mikealfare there's one more thing to handle here:

Which brings us to three-valued logic ...

There's three-valued logic to handle for the transient boolean config:

  • None (pass-through and let py_write_table decide what to do)
  • True (use transient)
  • False (use permanent?)

How 'bout this?

So I think we'd need handle the null case first to be sure the table_type is set (or not set!) correctly:

{% if transient is none %}
    {% set table_type = none %}
{% elif transient %}
    {% set table_type = "transient" %}
{% else %}
    {% set table_type = "permanent" %}
{% endif %}

{{ py_write_table(compiled_code=compiled_code, target_relation=relation, table_type=table_type) }}

This is assuming an implementation of py_write_table like our most recent iteration.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If transient is None, then wouldn't we set table_table based on temporary since we're not passing temporary into py_write_table anymore? Put another way, I don't think you would be able to create a table with table_type="temporary" in your logic flow without also passing temporary into the call to py_write_table. And I think we don't want to do that from what we said.

{% if transient is none and temporary %}
    {% set table_type = "temporary" %}
{% elif transient is none %}
    {% set table_type = none %}
{% elif transient %}
    {% set table_type = "transient" %}
{% else %}
    {% set table_type = "permanent" %}
{% endif %}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, sharp eyes -- you read that right!

After examining the logic for sql tables in dbt-snowflake it became clear that it prioritizes the temporary config over the transient config.

So I flipped-flopped from our earlier discussions and switched the implementation to standardize on the behavior of sql tables.

Specifically, I just re-factored the code so that this logic applies to both sql and python tables.

Here is the relevant portion of the diff:

image

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, but then if someone calls the py_write_table macro directly, then we want the override to go in the other direction because then we're only in the scope of python tables, not sql tables. If that's right, then let's get this updated with what you have and up for review. I'm working on getting another thing over the line, but can help push this along when it's ready.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mikealfare Currently failing CI, but draft PR is up here: #802

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put a comment there. You're missing the close braces on the sets, so the code quality failed. Will we be moving forward with 802 then as the primary PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, let's consider #802 the primary PR. It preserves @jeremyyeo's commits history and authorship credit.

Copy link
Contributor

@mikealfare mikealfare left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I answered all of your questions. Let me know if I missed something. Thanks again for all of the write up; it really helps.

@@ -38,7 +38,7 @@

{% endmaterialization %}

{% macro py_write_table(compiled_code, target_relation, temporary=False) %}
{% macro py_write_table(compiled_code, target_relation, temporary=False, table_type='transient') %}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay, just getting around to plowing through my GH notifications. All of this context helps a lot.

Given that SF is deprecating create_temp_table in favor of table_type, I agree with using table_type in the signature, as a string. And I would default it to empty string in alignment with the docs. This also keeps config parsing out of a macro that otherwise does not care about the jinja global config that's floating around.

I wouldn't worry about removing the table_type argument if table_type is not in the call. However, I think we need to be cognizant of backwards compatibility for the temporary argument in the py_write_table macro. So that means we need to take both arguments in the py_write_table macro and combine them. @dbeatty10, correct me if I'm wrong here, but I think that amounts to something like this:

{% if temporary == True %}
{% set table_type = "temporary" %}  -- hence override the value of `table_type` that was passed in
{% else %}
-- this else is not needed, but communicates the concept
-- keep the value of `table_type` that was passed in (which could be the default empty string)
{% endif %}

Then we update the call to save_as_table to exclude create_temp_table in alignment with your third option above:

df.write.mode("overwrite").save_as_table('{{ target_relation_name }}', table_type='{{table_type}}')

@@ -46,7 +46,8 @@
{%- endif -%}

{%- elif language == 'python' -%}
{{ py_write_table(compiled_code=compiled_code, target_relation=relation, temporary=temporary) }}
{%- set table_type = 'transient' if transient else '' -%}
{{ py_write_table(compiled_code=compiled_code, target_relation=relation, temporary=temporary, table_type=table_type) }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would do a few things here in alignment with my comment above. I would update the definition of table_type here to consider temporary as discussed above. And then I would not pass temporary into the py_write_table call at all. That tidies up all the config parsing. You would still need to keep that if block in the py_write_table macro in the even that other folks are using it; but then when we eventually retire the temporary argument in that macro, we don't need to remember to come back here and deal with it here.

@dbeatty10
Copy link
Contributor

Closing in favor of #802

Thanks for all your work on this @jeremyyeo between finding the issue, explaining it so thoroughly (including links to Snowflake docs), and even fixing it! 🏆

Your authorship and commits are preserved within #802 -- @mikealfare and I just added some tests and tweaked the logic slightly.

@dbeatty10 dbeatty10 closed this Oct 12, 2023
@mikealfare mikealfare deleted the fix-776/transient-python-models branch July 17, 2024 23:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla:yes ready_for_review Externally contributed PR has functional approval, ready for code review from Core engineering
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[ADAP-901] [Bug] Not possible to set transient config on Python models
4 participants