Activity Registry #25
@tnightengale looking for the following feedback:
I think the materialization is an abuse of the abstraction in dbt, in order to add attributes to the graph. By overloading that abstraction, we would take away users' choice to use existing materializations they may need, like `incremental`. Therefore, what I did when implementing this with other data teams is just create a macro that returns a dict, e.g.:
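A minimal sketch of such a macro-as-dict (the activity names and keys are illustrative, not from the original):

```sql
{% macro activities() %}
    {# Central dict of activities and their feature_json columns.
       All names below are illustrative examples. #}
    {{ return({
        "sign_up": {
            "name": "sign up",
            "feature_json": ["feat_1", "feat_2"]
        },
        "bought_something": {
            "name": "bought something",
            "feature_json": ["order_id", "total_amount"]
        }
    }) }}
{% endmacro %}
```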
Then folks just access it as an attribute. For example, in a model to create an activity:
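Assuming a hypothetical `activities()` macro that returns such a dict, access in an activity model might look roughly like this (table and column names are illustrative; JSON packing is warehouse-specific):

```sql
-- models/activity__sign_up.sql (illustrative)
select
    user_id as customer,
    created_at as ts,
    '{{ activities()["sign_up"]["name"] }}' as activity,
    -- object_construct is Snowflake syntax; other warehouses differ
    object_construct('feat_1', feat_1, 'feat_2', feat_2) as feature_json
from {{ ref('stg_users') }}
```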
One limitation of this approach is that users must implement this macro themselves; there's no "registration" interface. An upside, however, is that it's easy to grok all the activities and their feature_json columns in one location. Ideally, I'd like a solution that allows for a registration interface, automatic schema checks on "activity" models, and most of all:
So it feels like activity names and feature_json should be registered in yml, either in a `meta` tag or as vars in the `dbt_project.yml`. Perhaps the registration key could be the model identifier:

```yml
# some yml
activities:
  activity__sign_up:
    text: "sign up"
    feature_json:
      - feat_1
      - feat_2
  activity__bought_something:
    ...
```

And we should provide macros like the ones above to look up the activity in the registry yml and pack json. By including the key in the lookup, we should be able to apply schema checks to those models. Thoughts?
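Such a lookup macro could be sketched as follows, assuming the registry lives under an `activities` var in `dbt_project.yml` (the macro name and error message are illustrative):

```sql
{% macro get_activity(model_key) %}
    {# Look up an activity's registered metadata by model identifier,
       failing compilation if the key was never registered. #}
    {% set registry = var("activities", {}) %}
    {% if model_key not in registry %}
        {{ exceptions.raise_compiler_error("Activity not registered: " ~ model_key) }}
    {% endif %}
    {{ return(registry[model_key]) }}
{% endmacro %}
```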
@tnightengale nice! Very different approach. The implementation makes sense, so I'm focusing this reply on areas where we disagree.
My thought here is that all activity dbt models will be persisted as tables, and that the …

One note on the above: the example code would require an override of the …. Then, for the query itself, the final select statement for every activity should look roughly like the following to enforce schema consistency:
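Following the column conventions discussed in this thread, a schema-consistent final select might look roughly like this (the column set and upstream names shown are illustrative, not a spec):

```sql
-- illustrative final select for an activity model
select
    activity_id,
    customer,
    ts,
    '{{ activity_name }}' as activity,  -- hypothetical jinja variable holding the registered name
    feature_json
from prepared  -- hypothetical upstream CTE
```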
With the materialization approach, this could be appended in one of two ways:
For this to work, users would also need to specify:
Then, the registry interface would be a macro that no dbt developer needs to touch:
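A sketch of what such a registry macro could look like, assuming activity models use a dedicated materialization and carry stream/feature config (the `graph.nodes` traversal is standard dbt Jinja; the config keys shown are assumptions):

```sql
{% macro activity_registry(stream) %}
    {# Build a dict of activity metadata by searching the dbt graph
       for models using the (hypothetical) "activity" materialization. #}
    {% set registry = {} %}
    {% for node in graph.nodes.values()
           if node.resource_type == "model"
           and node.config.get("materialized") == "activity"
           and node.config.get("stream") == stream %}
        {% do registry.update({node.name: {
            "activity": node.config.get("activity_name"),
            "features": node.config.get("features", [])
        }}) %}
    {% endfor %}
    {{ return(registry) }}
{% endmacro %}
```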
The returned value is a dictionary of dictionaries, where each top-level key is the model name of an activity model that feeds into the stream. And activities could be accessed like so:
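For example, with a hypothetical `activity_registry` macro returning that structure, access could look like:

```sql
{# stream and model names are illustrative #}
{% set registry = activity_registry("my_activity_stream") %}
{% set sign_up = registry["activity__sign_up"]["activity"] %}
{% set features = registry["activity__sign_up"]["features"] %}
```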
The end result (i.e. how activities and features are referenced in datasets) is similar to yours, but I personally prefer an approach like this for a few reasons:
To clarify, if we were to go with the yaml-centric approach, this would get resolved by having users define their activity metadata in the yaml, and the macro could parse that yaml without users needing to register again in the macro, right? Or would they need to do both? If it's the latter, then I'd advocate away from this approach, as that redundant effort will prove to be tedious during development.
Yes, the macros would just parse the yml. I like this approach because it allows folks to be flexible about how they want to list their activities: they could do it all in one yml, or in a yml for each activity model. The …

We also inherit all the hierarchy that dbt does with paths and configs. For those reasons, I can't support the …. Schema verification is easy: just check the config of registered activities, and run a singular test, which can be included in the package, against each of those models. I have a private macro that already does this on a client's project. Finally, I do like the convenience function:
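The singular schema test mentioned above could look roughly like this, checking one hard-coded model for brevity (a packaged version would loop over all registered activity models; the expected column list is illustrative):

```sql
-- tests/assert_activity_schema.sql (illustrative; a singular test fails on any returned row)
{% set expected = ["activity_id", "customer", "ts", "activity", "feature_json"] %}
{% set actual = adapter.get_columns_in_relation(ref("activity__sign_up"))
                | map(attribute="name") | map("lower") | list %}
{% set missing = expected | reject("in", actual) | list %}

{% if missing %}
{% for col in missing %}
select '{{ col }}' as missing_column
{% if not loop.last %}union all{% endif %}
{% endfor %}
{% else %}
select 1 as missing_column where false
{% endif %}
```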
I'm fine to concede on the custom materialization as the solution to this - I think I was conflating this work with a personal disdain for the ergonomics of having to maintain relevant metadata for a single model across multiple files and file types. But that's a problem that needs to be solved by dbt, not by a dbt package. (Plus I know there are open tickets to discuss moving model config into the corresponding sql file, so I can wait for that to be implemented.)

For the registry syntax, I agree that the key should be the model identifier. We'll also need users to specify the data type registered for each feature, which will be used when unpacking json during dataset compilation. We should also require users to specify which activity stream/schema is associated with the activity. (I'm working on issues for how to support multiple Activity Schemas in a single project and a feature for building Activity Streams, and this tag will be beneficial for those implementations.)

For schema verification, I'd personally prefer for the model to fail to materialize rather than run a test after the fact. Maybe I'm thinking about this wrong, but most dbt DAG runs I've seen materialize all models and then run all tests, so having the model fail to materialize prevents data that violates the schema contract from being made available downstream (e.g. in the stream itself or in datasets). This will be especially pertinent for developers who want to leverage incremental models, as those developers will need to re-run the entire pipeline if a schema test fails. I think this can effectively be solved simply by using the …
Overview

The `activity` API takes an `activity_name` as an argument, and in its current state, users can pass any arbitrary string to it and the query will execute successfully against the warehouse. The following shortcomings exist with this interface:

- users must supply the exact `activity_name` value in the activity stream, which leaves no room for error (e.g. typos)
- nothing enforces consistent `activity_name` values (e.g. `visited page` vs `Visited Page` vs `visited_page` vs `visitedPage`)

Proposal
To address these shortcomings, I'm proposing an Activity Registry, or in other words, a mechanism that puts guardrails in place to ensure that users can interface with individual activities in a fluid manner when creating datasets. This solution should contain the following features:
Optional features include:
Implementation

In order to achieve the above functionality, under the hood, dbt (specifically the graph) needs to be made aware of all of the activities that exist in a given project. Assuming a 1:1 mapping from activities to dbt models, there are multiple approaches to achieve this outcome:

- an `activity` materialization - most robust, as it can allow for enforcement of custom model config parameters and easy retrieval from the dbt graph via materialization-type search, but most laborious to implement
- a `meta` attribute in the model yaml for each activity model - easiest to implement but most cumbersome to maintain in production due to context switching between YML and SQL files during development

Then, when creating dbt models with the `dataset` macro, users should be able to reference some object that is aware of the names of each activity for the activity stream being referenced in the dataset, and that object should allow them to browse and tab-complete activity names. This will likely require leveraging new/existing VS Code extensions - specifics tbd.
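For illustration, the `meta`-attribute variant could be collected from the graph with a macro along these lines (the `activity` meta key is an assumption, not an established convention):

```sql
{% macro activities_from_meta() %}
    {# Collect every model that declares an "activity" key in its meta.
       Note: meta may live at node.meta rather than node.config.meta
       depending on dbt version. #}
    {% set registry = {} %}
    {% for node in graph.nodes.values()
           if node.resource_type == "model"
           and node.config.meta.get("activity") is not none %}
        {% do registry.update({node.name: node.config.meta}) %}
    {% endfor %}
    {{ return(registry) }}
{% endmacro %}
```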