diff --git a/docs/modeling-your-data/modeling-your-data-with-dbt/dbt-quickstart/unified/index.md b/docs/modeling-your-data/modeling-your-data-with-dbt/dbt-quickstart/unified/index.md index 6366225032..9d37e6c45d 100644 --- a/docs/modeling-your-data/modeling-your-data-with-dbt/dbt-quickstart/unified/index.md +++ b/docs/modeling-your-data/modeling-your-data-with-dbt/dbt-quickstart/unified/index.md @@ -4,193 +4,6 @@ sidebar_position: 10 title: "Unified Digital Quickstart" --- +:::info -## Requirements - -In addition to [dbt](https://github.com/dbt-labs/dbt) being installed: - -
-To model web events - -- web events dataset being available in your database -- [Snowplow Javascript tracker](/docs/collecting-data/collecting-from-own-applications/javascript-trackers/index.md) version 2 or later implemented. -- Web Page context [enabled](/docs/collecting-data/collecting-from-own-applications/javascript-trackers/web-tracker/tracker-setup/initialization-options/index.md#webpage-context) (enabled by default in [v3+](/docs/collecting-data/collecting-from-own-applications/javascript-trackers/web-tracker/tracker-setup/initialization-options/index.md#webpage-context)). -- [Page view events](/docs/collecting-data/collecting-from-own-applications/javascript-trackers/web-tracker/tracking-events/index.md#page-views) implemented. - -
- -
-To model mobile events - -- mobile events dataset being available in your database -- Snowplow [Android](/docs/collecting-data/collecting-from-own-applications/mobile-trackers/previous-versions/android-tracker/index.md), [iOS](/docs/collecting-data/collecting-from-own-applications/mobile-trackers/previous-versions/objective-c-tracker/index.md) mobile tracker version 1.1.0 (or later) or [React Native tracker](https://docs.snowplow.io/docs/collecting-data/collecting-from-own-applications/react-native-tracker/) implemented -- Mobile session context enabled ([ios](/docs/collecting-data/collecting-from-own-applications/mobile-trackers/previous-versions/objective-c-tracker/ios-tracker-1-7-0/index.md#session-context) or [android](/docs/collecting-data/collecting-from-own-applications/mobile-trackers/previous-versions/android-tracker/android-1-7-0/index.md#session-tracking)). -- Screen view events enabled ([ios](/docs/collecting-data/collecting-from-own-applications/mobile-trackers/previous-versions/objective-c-tracker/ios-tracker-1-7-0/index.md#tracking-features) or [android](/docs/collecting-data/collecting-from-own-applications/mobile-trackers/previous-versions/android-tracker/android-1-7-0/index.md#tracking-features)). - -
- -## Installation - -```mdx-code-block -import DbtPackageInstallation from "@site/docs/reusable/dbt-package-installation/_index.md" - - -``` - -## Setup - -### 1. Override the dispatch order in your project -To take advantage of the optimized upsert that the Snowplow packages offer you need to ensure that certain macros are called from `snowplow_utils` first before `dbt-core`. This can be achieved by adding the following to the top level of your `dbt_project.yml` file: - -```yml title="dbt_project.yml" -dispatch: - - macro_namespace: dbt - search_order: ['snowplow_utils', 'dbt'] -``` - -If you do not do this the package will still work, but the incremental upserts will become more costly over time. - -### 2. Adding the `selectors.yml` file - -Within the packages we have provided a suite of suggested selectors to run and test the models within the package together with the Unified Digital Model. This leverages dbt's [selector flag](https://docs.getdbt.com/reference/node-selection/syntax). You can find out more about each selector in the [YAML Selectors](/docs/modeling-your-data/modeling-your-data-with-dbt/dbt-operation/index.md#yaml-selectors) section. - -These are defined in the `selectors.yml` file ([source](https://github.com/snowplow/dbt-snowplow-unified/blob/main/selectors.yml)) within the package, however in order to use these selections you will need to copy this file into your own dbt project directory. This is a top-level file and therefore should sit alongside your `dbt_project.yml` file. If you are using multiple packages in your project you will need to combine the contents of these into a single file. - -### 3. Check source data - -This package will by default assume your Snowplow events data is contained in the `atomic` schema of your [target.database](https://docs.getdbt.com/docs/running-a-dbt-project/using-the-command-line-interface/configure-your-profile). 
In order to change this, please add the following to your `dbt_project.yml` file: - -```yml title="dbt_project.yml" -vars: - snowplow_unified: - snowplow__atomic_schema: schema_with_snowplow_events - snowplow__database: database_with_snowplow_events -``` -:::info Databricks only - -Please note that your `target.database` is NULL if using Databricks. In Databricks, schemas and databases are used interchangeably and in the dbt implementation of Databricks therefore we always use the schema value, so adjust your `snowplow__atomic_schema` value if you need to. - -::: - -Next, Unified Digital assumes you are modeling both web and mobile events and expects certain fields to exist based on this. If you are only tracking and modeling e.g. web data, you can disable the other as below: - -```yml title="dbt_project.yml" -vars: - snowplow_unified: - snowplow__enable_mobile: false - snowplow__enable_web: true -``` -Note these are both `true` by default so you only need to add the one you wish to disable. - -### 4. Enabled desired contexts - -The Unified Digital Model has the option to join in data from the following Snowplow enrichments and out-of-the-box context entities: - -- [IAB enrichment](/docs/enriching-your-data/available-enrichments/iab-enrichment/index.md) -- [UA Parser enrichment](/docs/enriching-your-data/available-enrichments/ua-parser-enrichment/index.md) -- [YAUAA enrichment](/docs/enriching-your-data/available-enrichments/yauaa-enrichment/index.md) -- Browser context -- Mobile context -- Geolocation context -- App context -- Screen context -- Deep Link context -- App Error context -- Core Web Vitals -- Consent (Preferences & cmp visible) -- Mobile screen summary (used for screen engagement calculation) - -By default these are **all disabled** in the Unified Digital Model. 
Assuming you have the enrichments turned on in your Snowplow pipeline, to enable the contexts within the package please add the following to your `dbt_project.yml` file: - -```yml title="dbt_project.yml" -vars: - snowplow_unified: - snowplow__enable_iab: true - snowplow__enable_ua: true - snowplow__enable_yauaa: true - snowplow__enable_browser_context: true - snowplow__enable_mobile_context: true - snowplow__enable_geolocation_context: true - snowplow__enable_application_context: true - snowplow__enable_screen_context: true - snowplow__enable_deep_link_context: true - snowplow__enable_consent: true - snowplow__enable_cwv: true - snowplow__enable_app_errors: true - snowplow__enable_screen_summary_context: true -``` - -### 5. Filter your data set - -You can specify both `start_date` at which to start processing events and the `app_id`'s to filter for. By default the `start_date` is set to `2020-01-01` and all `app_id`'s are selected. To change this please add the following to your `dbt_project.yml` file: - -```yml title="dbt_project.yml" -vars: - snowplow_unified: - snowplow__start_date: 'yyyy-mm-dd' - snowplow__app_id: ['my_app_1','my_app_2'] -``` - -### 6. Verify page ping variables - -The Unified Digital Model processes page ping events to calculate web page engagement times. If your [tracker configuration](/docs/collecting-data/collecting-from-own-applications/javascript-trackers/web-tracker/tracking-events/index.md#activity-tracking-page-pings) for `min_visit_length` (default 5) and `heartbeat` (default 10) differs from the defaults provided in this package, you can override by adding to your `dbt_project.yml`: - -```yml title="dbt_project.yml" -vars: - snowplow_unified: - snowplow__min_visit_length: 5 # Default value - snowplow__heartbeat: 10 # Default value -``` - -### 7. Additional vendor specific configuration - -:::info BigQuery Only -Verify which column your events table is partitioned on. 
It will likely be partitioned on `collector_tstamp` or `derived_tstamp`. If it is partitioned on `collector_tstamp` you should set `snowplow__derived_tstamp_partitioned` to `false`. This will ensure only the `collector_tstamp` column is used for partition pruning when querying the events table: - -```yml title="dbt_project.yml" -vars: - snowplow_unified: - snowplow__derived_tstamp_partitioned: false -``` -::: - -:::info Databricks only - setting the databricks_catalog - -Add the following variable to your dbt project's `dbt_project.yml` file - -```yml title="dbt_project.yml" -vars: - snowplow_unified: - snowplow__databricks_catalog: 'hive_metastore' -``` -Depending on the use case it should either be the catalog (for Unity Catalog users from databricks connector 1.1.1 onwards, defaulted to 'hive_metastore') or the same value as your `snowplow__atomic_schema` (unless changed it should be 'atomic'). This is needed to handle the database property within `models/base/src_base.yml`. - -**A more detailed explanation for how to set up your Databricks configuration properly can be found in [Unity Catalog support](/docs/modeling-your-data/modeling-your-data-with-dbt/dbt-configuration/index.md#unity-catalog-support).** - -::: - -### 8. Optimize your project - -There are ways how you can deal with [high volume optimizations](/docs/modeling-your-data/modeling-your-data-with-dbt/ddbt-custom-models/high-volume-optimizations/) at a later stage, if needed, but you can do a lot upfront by selecting carefully which variable to use for `snowplow__session_timestamp`, which helps identify the timestamp column used for sessionization. This timestamp column should ideally be set to the column your event table is partitioned on. It is defaulted to `collector_tstamp` but depending on your loader it can be the `load_tstamp` as the sensible value to use: - -```yml title="dbt_project.yml" -vars: - snowplow_unified: - snowplow__session_timestamp: 'load_tstamp' -``` - -### 9. 
Verify your variables using our Config guides (Optional) - -If you are unsure whether the default values set are good enough in your case or you would already like to maximize the potential of your models, you can dive deeper into the meaning behind our variables on our [Config](/docs/modeling-your-data/modeling-your-data-with-dbt/dbt-configuration/unified/) page. It includes a [Config Generator](/docs/modeling-your-data/modeling-your-data-with-dbt/dbt-configuration/unified/#Generator) to help you create all your variable configurations, if necessary. - -### 10. Run your model - -You can run your models for the first time by running the below command (see the [operation](/docs/modeling-your-data/modeling-your-data-with-dbt/dbt-operation/index.md) page for more information on operation of the package). As this package contains some seed files, you will need to seed these first - -```bash -dbt seed --select snowplow_unified --full-refresh -dbt run --selector snowplow_unified -``` - -### 11. Enable extras -The package comes with additional modules and functionality that you can enable, for more information see the [consent tracking](/docs/modeling-your-data/modeling-your-data-with-dbt/dbt-models/dbt-unified-data-model/consent-module/index.md), [conversions](/docs/modeling-your-data/modeling-your-data-with-dbt/dbt-models/dbt-unified-data-model/conversions/index.md), and [core web vitals](/docs/modeling-your-data/modeling-your-data-with-dbt/dbt-models/dbt-unified-data-model/core-web-vitals-module/index.md) documentation. +We are gradually moving tutorials like the QuickStart Guides from our standard Documentation to the **Tutorials & Guides** section. You can now find the latest tutorial on how to get started with the Unified Digital package [here](/tutorials/unified-digital/intro). 
diff --git a/tutorials/unified-digital/intro.md b/tutorials/unified-digital/intro.md index f19fad9026..0273c9cc08 100644 --- a/tutorials/unified-digital/intro.md +++ b/tutorials/unified-digital/intro.md @@ -7,7 +7,7 @@ This tutorial walks you through the process of setting up our Unified Digital DB ### Prerequisites -- DBT installed +- [DBT](https://github.com/dbt-labs/dbt) installed - Connection to a warehouse - for web: - web events dataset being available in your database diff --git a/tutorials/unified-digital/setting-up-locally.md b/tutorials/unified-digital/setting-up-locally.md index b0760863a1..7c7e08bde7 100644 --- a/tutorials/unified-digital/setting-up-locally.md +++ b/tutorials/unified-digital/setting-up-locally.md @@ -50,8 +50,7 @@ position: 2 Now we’ll get to using our variables, which is how you enable the parts of the model that are relevant to your use-case. -1. Define the location of your source data within your `vars` block where your raw events are being loaded into. -If you're using Databricks you might need to also manage the [Hive catalogue](https://docs.snowplow.io/docs/modeling-your-data/modeling-your-data-with-dbt/dbt-quickstart/unified/#7-additional-vendor-specific-configuration). Make sure to update these with your actual table names! +1. Within your `vars` block, define the location of your source data, i.e. where your raw events are loaded. Make sure to update these with your actual table names! ```yaml vars: @@ -59,17 +58,35 @@ If you're using Databricks you might need to also manage the [Hive catalogue](ht snowplow__atomic_schema: schema_with_snowplow_events snowplow__database: database_with_snowplow_events ``` + +:::info Databricks only -2. Choose web and/or mobile data. In many cases you’ll only be tracking web data. +Please note that your `target.database` is NULL if using Databricks. 
In Databricks, schemas and databases are used interchangeably, so in the dbt implementation of Databricks we always use the schema value; adjust your `snowplow__atomic_schema` value if you need to. - ```yaml - vars: - snowplow_unified: - snowplow__enable_mobile: false - snowplow__enable_web: true - ``` +Add the following variable to your dbt project's `dbt_project.yml` file: + +```yml title="dbt_project.yml" +vars: + snowplow_unified: + snowplow__databricks_catalog: 'hive_metastore' +``` +Depending on your use case, this should be either the catalog (for Unity Catalog users from Databricks connector 1.1.1 onwards; the default is 'hive_metastore') or the same value as your `snowplow__atomic_schema` ('atomic' unless you have changed it). This is needed to handle the database property within `models/base/src_base.yml`. + +A more detailed explanation of how to set up your Databricks configuration properly can be found in [Unity Catalog support](/docs/modeling-your-data/modeling-your-data-with-dbt/dbt-configuration/#unity-catalog-support). -3. Enable contexts (also known as [entities](https://docs.snowplow.io/docs/understanding-your-pipeline/entities/)) to make sure that they're processed within the package - this means they will be un-nested from the atomic columns and made available in the derived tables. Make sure to only enable the ones you need. +::: + + +2. Unified Digital assumes you are modeling both web and mobile events and expects certain fields to exist based on this. If you are only tracking and modeling one of them (e.g. web data), you can disable the other as shown below: + +```yml title="dbt_project.yml" +vars: + snowplow_unified: + snowplow__enable_mobile: false + snowplow__enable_web: true +``` + +3. Enable [entities](https://docs.snowplow.io/docs/understanding-your-pipeline/entities/) to make sure that they're processed within the package - this means they will be un-nested from the atomic columns and made available in the derived tables. 
Make sure to only enable the ones you need. ```yaml vars: @@ -96,9 +113,64 @@ If you're using Databricks you might need to also manage the [Hive catalogue](ht snowplow_unified: snowplow__start_date: 'yyyy-mm-dd' ``` + +5. Optimize your data processing + +You can apply [high volume optimizations](/docs/modeling-your-data/modeling-your-data-with-dbt/ddbt-custom-models/high-volume-optimizations/) at a later stage if needed, but you can do a lot upfront by carefully choosing the value of `snowplow__session_timestamp`, which identifies the timestamp column used for sessionization. Ideally, set it to the column your events table is partitioned on. It defaults to `collector_tstamp`, but depending on your loader `load_tstamp` may be the more sensible value: + +```yml title="dbt_project.yml" +vars: + snowplow_unified: + snowplow__session_timestamp: 'load_tstamp' +``` + +:::info BigQuery Only +Verify which column your events table is partitioned on. It will likely be partitioned on `collector_tstamp` or `derived_tstamp`. If it is partitioned on `collector_tstamp` you should set `snowplow__derived_tstamp_partitioned` to `false`. This will ensure only the `collector_tstamp` column is used for partition pruning when querying the events table: + +```yml title="dbt_project.yml" +vars: + snowplow_unified: + snowplow__derived_tstamp_partitioned: false +``` +::: + +6. Configure more vars as necessary, although in principle this is all you need to get started. If you are unsure whether the default values are good enough for your case, or you would like to get the most out of your models straight away, you can dive deeper into the meaning behind our variables on our [Config](/docs/modeling-your-data/modeling-your-data-with-dbt/dbt-configuration/unified/) page. 
It includes a [Config Generator](/docs/modeling-your-data/modeling-your-data-with-dbt/dbt-configuration/unified/#Generator) to help you create all your variable configurations, if necessary. + +7. Filter your data set + +You can specify both the `start_date` at which to start processing events and the `app_id`s to filter for. By default the `start_date` is set to `2020-01-01` and all `app_id`s are selected. To change this, please add the following to your `dbt_project.yml` file: + +```yml title="dbt_project.yml" +vars: + snowplow_unified: + snowplow__start_date: 'yyyy-mm-dd' + snowplow__app_id: ['my_app_1','my_app_2'] +``` + +Below we list a few more options that might be of interest depending on your setup or modeling needs: + +- Enable extras +The package comes with additional modules and functionality that you can enable. For more information, see the [consent tracking](/docs/modeling-your-data/modeling-your-data-with-dbt/dbt-models/dbt-unified-data-model/consent-module), [conversions](/docs/modeling-your-data/modeling-your-data-with-dbt/dbt-models/dbt-unified-data-model/conversions/), and [core web vitals](/docs/modeling-your-data/modeling-your-data-with-dbt/dbt-models/dbt-unified-data-model/core-web-vitals-module) documentation. + +- Adjust page ping variables, if needed +The Unified Digital Model processes page ping events to calculate web page engagement times. 
If your [tracker configuration](/docs/collecting-data/collecting-from-own-applications/javascript-trackers/web-tracker/tracking-events/#activity-tracking-page-pings) for `min_visit_length` (default 5) and `heartbeat` (default 10) differs from the defaults provided in this package, you can override them by adding the following to your `dbt_project.yml`: + +```yml title="dbt_project.yml" +vars: + snowplow_unified: + snowplow__min_visit_length: 5 # Default value + snowplow__heartbeat: 10 # Default value +``` + +### Adding the `selectors.yml` file + +Within the packages we have provided a suite of suggested selectors to run and test the models within the package together with the Unified Digital Model. This leverages dbt's [selector flag](https://docs.getdbt.com/reference/node-selection/syntax). You can find out more about each selector in the [YAML Selectors](/docs/modeling-your-data/modeling-your-data-with-dbt/dbt-operation/#yaml-selectors) section. + +These are defined in the `selectors.yml` file ([source](https://github.com/snowplow/dbt-snowplow-unified/blob/main/selectors.yml)) within the package; however, to use these selectors you will need to copy this file into your own dbt project directory. This is a top-level file and therefore should sit alongside your `dbt_project.yml` file. If you are using multiple packages in your project you will need to combine the contents of these into a single file. + +### Running the package -5. Configure more vars as necessary - theoretically this is all you need just to get started. -6. Run dbt_seed to make sure you see some data, so we have some seeds in our packages, run that in there, and then run the actual model. +This package contains some seed files, so seed these first with `dbt seed`, and then run the actual model. 
```yaml dbt seed --select snowplow_unified --full-refresh diff --git a/tutorials/unified-digital/setting-up-via-console.md b/tutorials/unified-digital/setting-up-via-console.md index 193e867499..d63784e751 100644 --- a/tutorials/unified-digital/setting-up-via-console.md +++ b/tutorials/unified-digital/setting-up-via-console.md @@ -16,8 +16,8 @@ Snowplow provides a fully managed service for running data models. We recommend 3. Set a name, warehouse connection and owner who will receive failure alerts. ![](./screenshots/Screenshot_2024-07-04_at_17.41.51.png) -4. Edit your configuration variables, pay particular attention to web and mobile data, and the core enrichments e.g. IAB, YAUAA. This means they will be un-nested from the atomic columns and made available in the derived tables. -![](./screenshots//Screenshot_2024-07-04_at_17.43.22.png) +4. Edit your configuration variables. Pay particular attention to the web and mobile data settings and to the core enrichments (e.g. IAB, YAUAA): once enabled, they will be un-nested from the atomic columns and made available in the derived tables. +![](./screenshots//Screenshot_2024-07-04_at_17.43.22.png) For a more detailed guide, check out the [Setting Variables](/tutorials/unified-digital/setting-up-locally/#setting-variables) section of the local setup part of this tutorial. 5. Set a schedule - use a CRON editor if necessary ![](./screenshots/Screenshot_2024-07-04_at_17.44.04.png)
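
For the schedule step above, a few standard five-field cron expressions may help as a starting point. This is only a sketch: the run times are illustrative assumptions rather than values from the tutorial, the trailing annotations are explanatory and may not be accepted as part of the expression, and the timezone the schedule is evaluated in should be confirmed in Console.

```text
# field order: minute  hour  day-of-month  month  day-of-week
0 6 * * *       # every day at 06:00
0 */4 * * *     # every 4 hours, on the hour
30 2 * * 1      # every Monday at 02:30
```

A schedule that fires less often than the model takes to run keeps consecutive runs from overlapping.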