Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: set work_group in dbt_project.yml #377

Closed
SoumayaMauthoorMOJ opened this issue Aug 20, 2023 · 14 comments · Fixed by #412
Closed

Feature: set work_group in dbt_project.yml #377

SoumayaMauthoorMOJ opened this issue Aug 20, 2023 · 14 comments · Fixed by #412

Comments

@SoumayaMauthoorMOJ
Copy link
Contributor

Would it be possible to set work_group in dbt_project.yml? This means we could assign different queries to different workgroups within a single dbt run, for example to separate the curation queries vs the derivation queries. I would be happy to do the work if someone could point me in the right direction. I'm guessing I would need to modify https://github.com/dbt-athena/dbt-athena/blob/main/dbt/adapters/athena/impl.py?

@SoumayaMauthoorMOJ SoumayaMauthoorMOJ changed the title set work_group in dbt_project.yml Feature: set work_group in dbt_project.yml Aug 20, 2023
@nicor88
Copy link
Contributor

nicor88 commented Aug 21, 2023

@SoumayaMauthoorMOJ did you try this

name: 'my_project'
version: '1.0.0'
config-version: 2


profile: 'my_project'

model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

target-path: "target"  # directory which will store compiled SQL files
clean-targets:         # directories to be removed by `dbt clean`
  - "target"
  - "dbt_packages"

models:
  my_project:
    # Config indicated by + and applies to all files under models/example/
    # except if they are explicitly overwritten
    curation:
       work_group: "curation_workgroup"
    derivation:
       work_group: "derivation_workgroup"

I didn't test this - therefore I'm not sure that it works - but I expect that it should pick the workgroup per folders in your models.

@SoumayaMauthoorMOJ
Copy link
Contributor Author

SoumayaMauthoorMOJ commented Aug 22, 2023

Thanks @nicor88 for getting back to me! I should have said that I already tried modifying dbt_project.yml with no luck. I added some additional tags to make sure I was using the right structure:

models:
  probation:
    +materialized: table

    curated:
      delius:
        work_group: delius_sandbox # doesn't pick it up
        +meta:
          owner: delius
      oasys:
        work_group: oasys_sandbox # doesn't pick it up
        +meta:
          owner: oasys

If I do a dbt run all models use the work_group set in the profiles.yml

If I then run dbt ls --select result:success --state ./target --output json --output-keys name meta I get:

{"name": "delius__store_sales", "meta": {"owner": "delius"}}
{"name": "oasys__call_center", "meta": {"owner": "oasys"}}
{"name": "oasys__catalog_page", "meta": {"owner": "oasys"}}

@nicor88
Copy link
Contributor

nicor88 commented Aug 22, 2023

@SoumayaMauthoorMOJ Thanks for checking that. I believe that the issue is due to this: https://github.com/dbt-athena/dbt-athena/blob/main/dbt/adapters/athena/connections.py#L46 the workgroup is pick for run - as the connection is created once - Not really sure that we can do this.

@SoumayaMauthoorMOJ
Copy link
Contributor Author

Thanks @nicor88 this is more of a nice to have so I'll park this thread for now but might come back to it when I have more time

@nicor88
Copy link
Contributor

nicor88 commented Aug 22, 2023

As a possible workaround @SoumayaMauthoorMOJ you can consider to run your curation models separately - overwrite the workgroup with env variable - same for your derived models (using a env variable to override the profile workgroup)

I've a setup where I run silver models independently from gold models - different names but similar meaning to what you describe as curation and derivation, and in my setup I could achieve this quite easily.

@SoumayaMauthoorMOJ
Copy link
Contributor Author

Yes that's the approach I'm going with for now. It would be nice to run the silver and gold models in one dbt build because dbt then tracks table dependencies , but I could have a global dependency on the silver pipeline passing successfully instead.

@nicor88
Copy link
Contributor

nicor88 commented Aug 22, 2023

@SoumayaMauthoorMOJ indeed your right - it will be nice to overwrite that parameter per group of models

@SoumayaMauthoorMOJ
Copy link
Contributor Author

SoumayaMauthoorMOJ commented Sep 9, 2023

Hey @nicor88 we've decided to try using a single dbt build so we can have a single DAG across all silver and gold models. I've done some investigation and it's possible with pyathena to change the work_group for different executions of a connection see https://pypi.org/project/pyathena/:

from pyathena import connect

cursor = connect(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
                 region_name="us-west-2").cursor()
cursor.execute("SELECT * FROM one_row",
               work_group="YOUR_WORK_GROUP",
               result_reuse_enable=True,
               result_reuse_minutes=60)

The AthenaCursor execute() function accepts a work_group parameter see: https://github.com/dbt-athena/dbt-athena/blob/6ef6c970134abbbf52afb58110545825fde3a36d/dbt/adapters/athena/connections.py#L134

Hence I "think" it should be possible to change the workgroup, as long as I can pass in model config values to overwrite the connection values.

Any ideas? Perhaps something similar to run_query_with_partitions_limit_catching() see: https://github.com/dbt-athena/dbt-athena/blob/6ef6c970134abbbf52afb58110545825fde3a36d/dbt/adapters/athena/impl.py#L988

which is then used in the materializations see: https://github.com/dbt-athena/dbt-athena/blob/6ef6c970134abbbf52afb58110545825fde3a36d/dbt/include/athena/macros/materializations/models/table/create_table_as.sql#L150

@nicor88
Copy link
Contributor

nicor88 commented Sep 11, 2023

good findings - sure consider to propose a PR with your changes, I believe that it make sense.

If you use configs means that the workgroup can be set like that in dbt_project exactly as you proposed here #377 (comment)

@SoumayaMauthoorMOJ
Copy link
Contributor Author

I have raised issue with dbt-core to pass model configuration to execute(): dbt-labs/dbt-adapters#226

@nicor88
Copy link
Contributor

nicor88 commented Sep 15, 2023

@svdimchenko there is still some work to do here. Unfortunately we cannot still full pass workgroup to specifi model. I'm re-opening now ;)

@svdimchenko
Copy link
Contributor

Ahhh, yes. I believe it was closed automatically🚆

@SoumayaMauthoorMOJ
Copy link
Contributor Author

@nicor88 @svdimchenko I want to propose closing this ticket until dbt-labs/dbt-adapters#226 is completed. Modifying the various materizaliations to pass in wprk_group manually will require touching a lot of the code, not just run_query_with_partitions_limit_catching() , which will be complex to maintain in the future. Instead I'm planning to track costs per database through some other methods (which is I why I was keen on this feature). Any thoughts?

@nicor88
Copy link
Contributor

nicor88 commented Sep 18, 2023

Sure, happy to close it and then re-consider to open again if necessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants