
Activity Registry #25

Open
bcodell opened this issue Mar 23, 2023 · 5 comments
Labels
enhancement New feature or request

Comments

bcodell (Collaborator) commented Mar 23, 2023

Overview

The activity API takes an activity_name as an argument, and in its current state, users can pass any arbitrary string to it and the query will execute successfully against the warehouse. The following shortcomings exist with this interface:

  • it expects users to know the precise string values of each distinct activity_name value in the activity stream, which leaves no room for error (e.g. typos)
  • it allows for users to have non-standard naming conventions for activity_name values (e.g. visited page vs Visited Page vs visited_page vs visitedPage)
  • it allows users to pass the name of a non-existent activity in the activity stream, and the query will run and generate a table with 0 rows in it
  • it gives users no hints as to which activities are available to choose from for a given activity stream

Proposal

To address these shortcomings, I'm proposing an Activity Registry, or in other words, a mechanism that puts guardrails in place to ensure that users can interface with individual activities in a fluid manner when creating datasets. This solution should contain the following features:

  • automated enforcement of activity naming conventions
  • errors or warnings during development that a user specified a non-existent activity name
  • hints for which activities exist within a given activity stream so that users can quickly browse and select the activities they need on the fly
  • maintenance of accurate lineage within the dbt project

Optional features include:

  • limited manipulation of yaml files (the worst part of dbt development in my opinion)

Implementation

In order to achieve the above functionality, under the hood, dbt (specifically the graph) needs to be made aware of all of the activities that exist in a given project. Assuming a 1:1 mapping from activities to dbt models, there are multiple approaches to achieve this outcome:

  • custom activity materialization - most robust as it can allow for enforcement of custom model config parameters and easy retrieval from the dbt graph via materialization type search, but most laborious to implement
  • custom sub-attributes in the meta attribute in the model yaml for each activity model - easiest to implement but most cumbersome to maintain in production due to context switching between YML and SQL files during development
  • inferring activities based on the upstream dependencies of the activity stream being used in the dataset - potentially clever hack that requires limited additional implementation and doesn't change the dev experience, but it assumes that all activity schema implementations will materialize an activity stream as a dbt model (which is optional per the Activity Schema v2 spec), and it makes downstream hints potentially more challenging since this information will only be available after the project is parsed

Then, when creating dbt models with the dataset macro, users should be able to reference some object that is aware of the names of each activity for the activity stream being referenced in the dataset, and that object should allow them to browse and tab-complete activity names. This will likely require leveraging new/existing VS Code extensions - specifics tbd.
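For illustration, here's a minimal sketch of how the meta-based option (second bullet above) could expose such an object: a macro that scans the dbt graph for models carrying an activity-related meta key. The meta key (activity_stream) and macro name (get_activities_in_stream) are placeholders, not part of the current package:

{% macro get_activities_in_stream(stream_name) %}
{#
    Sketch (placeholder names): collect the names of all models whose meta
    marks them as activities for the given stream. Assumes each activity
    model sets meta: {activity_stream: "<stream>"} in its yml.
#}
    {% set activity_names = [] %}
    {% if execute %}
        {% for node in graph.nodes.values() %}
            {% if node.resource_type == 'model'
                  and node.config.get('meta', {}).get('activity_stream') == stream_name %}
                {% do activity_names.append(node.name) %}
            {% endif %}
        {% endfor %}
    {% endif %}
    {% do return(activity_names) %}
{% endmacro %}

A list like this could also feed editor hints or a dbt operation that prints the available activity names for a given stream during development.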

bcodell added the enhancement (New feature or request) label on Mar 23, 2023
bcodell (Collaborator, Author) commented Mar 23, 2023

@tnightengale looking for the following feedback:

  • is this something worth pursuing?
  • what other features should this solution include?
  • what other implementation options exist?

tnightengale (Owner) commented Mar 24, 2023

@bcodell

  1. Absolutely I think this is worth pursuing 👍
  2. Overall I think you've covered the problem well.
  3. My initial thoughts on implementations are:

I think the materialization is an abuse of the abstraction in dbt, in order to add attributes to the graph. By overloading that abstraction, we would take away users' choice to use existing materializations they may need like "incremental". Therefore I think meta or tags config is a better option to assign properties to the node in the graph.

What I did when implementing this with other data teams is just create a macro that acts as a dict, e.g.:

{% macro get_activity(activity, attr="text") %}

{#
    Define a map of activities to ensure that the text for each activity
    is identical everywhere an activity is used. Can also define other
    features of an activity, for example a list of columns to include in
    the json_metadata.
#}

{% set defined_activities = dict(
        signed_up={
            "text": "signed up",
            "json_metadata": json_metadata([
                "page_id"
            ])
        },
        bought_something={
            "text": "bought something,
            "json_metadata": json_metadata([
                "item_id",
                "category_id",
                "promo_code"
            ])
        }
    )
%}

{% do return(defined_activities[activity][attr]) %}

{% endmacro %}

Then folks just look the values up through that macro. For example, in a model to create an activity:

select
    customer,
    ts,
    {{ get_activity("signed_up") }} as activity,
    {{ pack_json(get_activity("signed_up", "json_metadata")) }},

    ...

One limitation of this approach is that the users must implement this macro themselves; there's no "registration" interface.

An upside however is that it's easy to grok all the activities, and their feature json columns in one location.

Ideally, I'd like to have a solution that allows for a registration interface, automatic schema checks on "activity" models, and most of all:

inferring activities based on the upstream dependencies of the activity stream being used in the dataset - potentially clever hack that requires limited additional implementation and doesn't change the dev experience, but it assumes that all activity schema implementations will materialize an activity stream as a dbt model (which is optional per the Activity Schema v2 spec), and it makes downstream hints potentially more challenging since this information will only be available after the project is parsed

So it feels like perhaps activity names and feature_json should be registered in a yml, either in a meta tag, or as vars in the dbt_project.yml.

Perhaps the registration key could be the model identifier:

# some yml
activities:
    activity__sign_up:
        text: "sign up"
        feature_json:
            - feat_1
            - feat_2
    activity__bought_something:
        ...

And we should provide macros like the ones above, to look up the activity in the registry yml, and pack json. By including the key in the look up, we should be able to apply schema checks to those models.
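For the vars-in-dbt_project.yml flavor, the lookup macro might be as simple as the sketch below (lookup_activity is a placeholder name, and the error-on-typo behavior is one way to get the development-time guardrail from the original issue):

{% macro lookup_activity(model_identifier, attr="text") %}
{#
    Sketch: read the registry from a project-level "activities" var shaped
    like the yml above, and fail loudly on an unregistered activity name
    instead of silently compiling a query that returns zero rows.
#}
    {% set registry = var("activities", {}) %}
    {% if model_identifier not in registry %}
        {{ exceptions.raise_compiler_error(
            "Unknown activity '" ~ model_identifier ~ "'. Registered activities: "
            ~ (registry.keys() | join(", "))
        ) }}
    {% endif %}
    {% do return(registry[model_identifier][attr]) %}
{% endmacro %}

Usage would then look like {{ lookup_activity('activity__sign_up') }} for the text and {{ lookup_activity('activity__sign_up', 'feature_json') }} for the feature list.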

Thoughts?

bcodell (Collaborator, Author) commented Mar 26, 2023

@tnightengale nice! Very different approach. The implementation makes sense, so I'm focusing this reply on areas where we disagree.

I think the materialization is an abuse of the abstraction in dbt, in order to add attributes to the graph. By overloading that abstraction, we would take away users' choice to use existing materializations they may need like "incremental".

My thought here is that all activity dbt models will be persisted as tables, and that the activity materialization could have a config parameter called materialization_strategy that defaults to table (or something more explicit like replace) but users could optionally specify incremental. My expectation is that incremental loads for activity models should follow fairly standard patterns (i.e. a self-reference to get the latest timestamp, which can then be referenced arbitrarily in the query via a macro). Then the materialization could have standard logic implemented based on whether the model should be built incrementally or not. It would look something like this:

activity.sql

{# text is optional - we can also add some string processing based on the model name #}
{# we'll likely need an interface for a stream registry as well to support multiple streams in a single project (saving that for a separate issue), but the stream should be explicitly stated in the config to support cases where a project has multiple Activity Schemas (e.g. customers, users) but doesn't materialize the streams #}
{# materialization_strategy can be 'table' or 'incremental' #}
{{ config(
    materialized='activity',
    text='activity name',
    stream=streams().stream_name,
    materialization_strategy='table',
    features={
        'feature_1': {'data_type': dbt.type_string()},
        'feature_2': {'data_type': dbt.type_int()},
        'feature_3': {'data_type': dbt.type_timestamp()},
        'feature_4': {'data_type': 'boolean'},
    }
) }}

with base_cte as (
    select
        id as entity_id,
        created_at as ts,
        a as feature_1,
        b as feature_2,
        c as feature_3,
        d as feature_4
    from {{ ref('stg_model') }}
    {% if is_incremental() %}
    where created_at > {{ max_activity_ts() }}
    {% endif %}
)

One note on the above - the example code would require an override of the is_incremental macro, but the code for it hasn't changed in over 2 years, so porting the logic for backwards compatibility and adding some additional logic to accommodate this use case seems straightforward and stable enough.
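For reference, a rough sketch of what that override plus the max_activity_ts() helper could look like, assuming the materialization_strategy config from the example above (this is an adaptation, not dbt's exact built-in code):

{% macro is_incremental() %}
{#
    Sketch: same spirit as dbt's built-in check, but keyed off the
    hypothetical materialization_strategy config instead of
    materialized='incremental'.
#}
    {% if not execute %}
        {% do return(false) %}
    {% endif %}
    {% set existing_relation = adapter.get_relation(this.database, this.schema, this.identifier) %}
    {% do return(
        existing_relation is not none
        and config.get('materialization_strategy') == 'incremental'
        and not should_full_refresh()
    ) %}
{% endmacro %}

{% macro max_activity_ts() %}
{# Sketch: self-reference so the model only scans rows newer than what's already loaded. #}
    {% do return("(select max(ts) from " ~ this ~ ")") %}
{% endmacro %}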

Then for the query itself, the final select statement for every activity should look roughly like the following to enforce schema consistency:

select
    -- port surrogate_key from dbt_utils to reduce dependency on other packages
    cast({{ dbt_activity_schema.surrogate_key(surrogate_key_fields) }} as {{ dbt.type_string() }}) as activity_id
    , cast({{ project_config.entity_id }} as {{ dbt.type_string() }}) as {{ entity_id }}
    , cast('{{ config.get("text") }}' as {{ dbt.type_string() }}) as activity_name
    , ts
    , {{ dbt_activity_schema.pack_json(config.get("features").keys()) }} as feature_json
from {{ cte }}

With the materialization approach, this could be appended in one of two ways:

  1. injected into the sql context variable in the materialization logic itself
  2. users call a macro like {{ dbt_activity_schema.build_activity() }} as the last line of each activity model query (and with an activity materialization, we can actually enforce usage of the build_activity macro as a method of schema enforcement) - a rough sketch of this macro follows below

For this to work, users would also need to specify:

  • the CTE name to inject into the from clause in the final select statement
  • any additional columns that should be used to ensure uniqueness of the activity_id field
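Putting those pieces together, a rough sketch of what the build_activity() convenience macro could look like (argument names are placeholders, and surrogate_key/pack_json are the helpers discussed above that the package would need to provide):

{% macro build_activity(cte='base_cte', extra_unique_columns=[]) %}
{#
    Sketch: render the standardized final select for an activity model,
    pulling the activity text and feature columns from the model's config().
#}
    {% set surrogate_key_fields = ['entity_id', 'ts'] + extra_unique_columns %}
    select
        cast({{ dbt_activity_schema.surrogate_key(surrogate_key_fields) }} as {{ dbt.type_string() }}) as activity_id
        , cast(entity_id as {{ dbt.type_string() }}) as entity_id
        , cast('{{ config.get("text") }}' as {{ dbt.type_string() }}) as activity_name
        , ts
        , {{ dbt_activity_schema.pack_json(config.get("features").keys()) }} as feature_json
    from {{ cte }}
{% endmacro %}

Here the entity column is hard-coded to entity_id from the base_cte example; in practice that column name would come from the project-level config mentioned above.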

Then, the registry interface would be a macro that no dbt developer needs to touch:

{% macro activity_registry(stream) %}
{% if execute %}
{% set activities = graph.nodes.values()
    | selectattr("config.materialized", "equalto", "activity")
    | selectattr("config.stream", "equalto", stream)
    | list %}
{# some preprocessing... #}
{% do return(preprocessed_activities) %}
{% endif %}
{% endmacro %}

The returned value is a dictionary of dictionaries, where each top-level key is the model name of an activity model that feeds into the stream. And activities could be accessed like so:

-- specify an activity
{{ activity_registry(stream).activity_name }}
-- specify a feature
{{ activity_registry(stream).activity_name.features.feature_1 }}

The end result (i.e. how activities and features are referenced in datasets) is similar to yours, but I personally prefer an approach like this for a few reasons:

  • It makes a given activity easier to grok by putting everything (code and config) for the activity in a single file
  • It abstracts away the redundant aspects of activity modeling (defining activity metadata and enforcing schema requirements) while still giving developers the flexibility to customize where needed (e.g. incremental)
  • It makes development faster by keeping all development for a given activity in a single file

One limitation of this approach is that the users must implement this macro themselves; there's no "registration" interface.

To clarify, if we were to go with the yaml-centric approach, this would get resolved by having users define their activity metadata in the yaml and the macro could parse that yaml without users needing to register again in the macro, right? Or would they need to do both? If it's the latter, then I'd advocate away from this approach, as that redundant effort will prove to be tedious during development.

tnightengale (Owner) commented Mar 31, 2023

To clarify, if we were to go with the yaml-centric approach, this would get resolved by having users define their activity metadata in the yaml and the macro could parse that yaml without users needing to register again in the macro, right? Or would they need to do both? If it's the latter, then I'd advocate away from this approach, as that redundant effort will prove to be tedious during development.

Yes the macros would just parse the yml. I like this approach because it allows folks to be flexible about how they want to list their activities: they could do it all in one yml or in yml for each activity model.

The config() arg in a model is just another ingress to the same data that can be supplied via yml. So really, if we implement the macro to fetch the config, based on the meta field, then we allow users any of the following options to configure their activities:

  1. one large yml
  2. many yml
  3. model specific config()

We also inherit all the hierarchy that dbt does with paths and configs. For those reasons, I can't support the materialization approach. I think we can gain all the benefits by just using the meta keyword instead, without the complexity of creating a new materialization class, which actually isn't a materialization at all. It's just a way to pass configs.

Schema verification is easy: just check the config of registered activities, and run a singular test, which can be included in the package, against each of those models. I have a private macro that already does this on a client's project.
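For illustration (this is not the private macro referenced above), a generic test along these lines could compare a registered activity model against the canonical stream columns; the entity column is left out here since its name would come from project config:

{% test has_activity_schema(model) %}
{#
    Illustrative sketch only: emit one failing row per canonical activity
    stream column that is missing from the tested model.
#}
    {% set required = ['activity_id', 'activity_name', 'ts', 'feature_json'] %}
    {% set missing = [] %}
    {% if execute %}
        {% set actual = adapter.get_columns_in_relation(model) | map(attribute='name') | map('lower') | list %}
        {% for col in required %}
            {% if col not in actual %}
                {% do missing.append(col) %}
            {% endif %}
        {% endfor %}
    {% endif %}
    {% if missing %}
        {% for col in missing %}
        select '{{ col }}' as missing_column
        {% if not loop.last %}union all{% endif %}
        {% endfor %}
    {% else %}
        select null as missing_column limit 0
    {% endif %}
{% endtest %}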

Finally, I do like the convenience function: {{ dbt_activity_schema.build_activity() }} - but this can just be added explicitly at the bottom of the model. I don't think we need to have it be injected via a materialization. It will be awesome because it will just call to this and render everything as needed, based on the config().

bcodell (Collaborator, Author) commented Mar 31, 2023

I'm fine to concede on the custom materialization as the solution to this - I think I was conflating this work with a personal disdain for the ergonomics of having to maintain relevant metadata for a single model across multiple files and file types. But that's a problem that needs to be solved by dbt, not by a dbt package. (Plus I know there are open tickets to discuss moving model config into the corresponding sql file, so I can wait for that to be implemented.)

For the registry syntax, I agree that the key should be the model identifier. We'll also need users to specify the data type registered for each feature, which will be used when unpacking json during dataset compilation. We should also require users to specify which activity stream/schema the activity is associated with. (I'm working on issues for how to support multiple Activity Schemas in a single project and a feature for building Activity Streams, and this tag will be beneficial for those implementations.)
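Building on the earlier yml sketch, that could look something like this (key names are illustrative):

# some yml
activities:
    activity__sign_up:
        text: "sign up"
        stream: customer_stream
        features:
            feat_1:
                data_type: string
            feat_2:
                data_type: timestamp
    activity__bought_something:
        ...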

For schema verification, I'd personally prefer for the model to fail to materialize rather than run a test after the fact. Maybe I'm thinking about this wrong, but most dbt dag runs I've seen materialize all models and then run all tests, so having the model fail to materialize prevents data that violates the schema contract from being made available downstream (e.g. in the stream itself or in datasets). This will be especially pertinent for developers who want to leverage incremental models, as those developers will need to re-run the entire pipeline if a schema test fails. I think this can effectively be solved simply by using the {{ dbt_activity_schema.build_activity() }} convenience function. But would you mind sharing the private macro you have here? It'd be helpful to see a concrete example of how you're thinking about this part of the implementation.
