Merge unified quickstart into tutorial #1052

Open · wants to merge 1 commit into base: main
@@ -4,193 +4,6 @@ sidebar_position: 10
title: "Unified Digital Quickstart"
---

:::info

## Requirements

In addition to having [dbt](https://github.com/dbt-labs/dbt) installed, you will need:

<details>
<summary>To model web events</summary>

- A web events dataset available in your database
- [Snowplow Javascript tracker](/docs/collecting-data/collecting-from-own-applications/javascript-trackers/index.md) version 2 or later implemented.
- Web Page context [enabled](/docs/collecting-data/collecting-from-own-applications/javascript-trackers/web-tracker/tracker-setup/initialization-options/index.md#webpage-context) (enabled by default in [v3+](/docs/collecting-data/collecting-from-own-applications/javascript-trackers/web-tracker/tracker-setup/initialization-options/index.md#webpage-context)).
- [Page view events](/docs/collecting-data/collecting-from-own-applications/javascript-trackers/web-tracker/tracking-events/index.md#page-views) implemented.

</details>

<details>
<summary>To model mobile events</summary>

- A mobile events dataset available in your database
- Snowplow [Android](/docs/collecting-data/collecting-from-own-applications/mobile-trackers/previous-versions/android-tracker/index.md), [iOS](/docs/collecting-data/collecting-from-own-applications/mobile-trackers/previous-versions/objective-c-tracker/index.md) mobile tracker version 1.1.0 (or later) or [React Native tracker](https://docs.snowplow.io/docs/collecting-data/collecting-from-own-applications/react-native-tracker/) implemented
- Mobile session context enabled ([ios](/docs/collecting-data/collecting-from-own-applications/mobile-trackers/previous-versions/objective-c-tracker/ios-tracker-1-7-0/index.md#session-context) or [android](/docs/collecting-data/collecting-from-own-applications/mobile-trackers/previous-versions/android-tracker/android-1-7-0/index.md#session-tracking)).
- Screen view events enabled ([ios](/docs/collecting-data/collecting-from-own-applications/mobile-trackers/previous-versions/objective-c-tracker/ios-tracker-1-7-0/index.md#tracking-features) or [android](/docs/collecting-data/collecting-from-own-applications/mobile-trackers/previous-versions/android-tracker/android-1-7-0/index.md#tracking-features)).

</details>

## Installation

```mdx-code-block
import DbtPackageInstallation from "@site/docs/reusable/dbt-package-installation/_index.md"

<DbtPackageInstallation package='unified' fullname='dbtSnowplowUnified'/>
```

## Setup

### 1. Override the dispatch order in your project
To take advantage of the optimized upsert that the Snowplow packages offer, you need to ensure that certain macros are called from `snowplow_utils` before `dbt-core`. This can be achieved by adding the following to the top level of your `dbt_project.yml` file:

```yml title="dbt_project.yml"
dispatch:
  - macro_namespace: dbt
    search_order: ['snowplow_utils', 'dbt']
```

If you do not do this the package will still work, but the incremental upserts will become more costly over time.

### 2. Adding the `selectors.yml` file

Within the packages we have provided a suite of suggested selectors to run and test the models within the package together with the Unified Digital Model. This leverages dbt's [selector flag](https://docs.getdbt.com/reference/node-selection/syntax). You can find out more about each selector in the [YAML Selectors](/docs/modeling-your-data/modeling-your-data-with-dbt/dbt-operation/index.md#yaml-selectors) section.

These are defined in the `selectors.yml` file ([source](https://github.com/snowplow/dbt-snowplow-unified/blob/main/selectors.yml)) within the package, however in order to use these selections you will need to copy this file into your own dbt project directory. This is a top-level file and therefore should sit alongside your `dbt_project.yml` file. If you are using multiple packages in your project you will need to combine the contents of these into a single file.
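
For illustration, a selector entry in that file has the following shape (this is a sketch of the format only, not the package's exact file — copy the [source](https://github.com/snowplow/dbt-snowplow-unified/blob/main/selectors.yml) rather than writing your own):

```yml title="selectors.yml"
selectors:
  - name: snowplow_unified
    description: Runs the models within the snowplow_unified package
    definition:
      method: package
      value: snowplow_unified
```

With the file in place, `dbt run --selector snowplow_unified` selects everything the definition matches.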

### 3. Check source data

This package will by default assume your Snowplow events data is contained in the `atomic` schema of your [target.database](https://docs.getdbt.com/docs/running-a-dbt-project/using-the-command-line-interface/configure-your-profile). In order to change this, please add the following to your `dbt_project.yml` file:

```yml title="dbt_project.yml"
vars:
  snowplow_unified:
    snowplow__atomic_schema: schema_with_snowplow_events
    snowplow__database: database_with_snowplow_events
```
:::info Databricks only

Please note that your `target.database` is NULL when using Databricks. Databricks uses schemas and databases interchangeably, so the dbt implementation for Databricks always uses the schema value; adjust your `snowplow__atomic_schema` value if you need to.

:::
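
To make this concrete, a Databricks connection profile typically defines only a schema, with no separate database. The sketch below uses illustrative values — your project name, `host`, `http_path`, and credentials will differ:

```yml title="profiles.yml"
my_snowplow_project:
  target: databricks
  outputs:
    databricks:
      type: databricks
      schema: atomic            # used in place of a database value on Databricks
      host: my-workspace.cloud.databricks.com
      http_path: /sql/1.0/warehouses/abc123
      token: "{{ env_var('DATABRICKS_TOKEN') }}"
```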

Next, Unified Digital assumes you are modeling both web and mobile events and expects certain fields to exist based on this. If you are only tracking and modeling one of these, e.g. web data, you can disable the other as below:

```yml title="dbt_project.yml"
vars:
  snowplow_unified:
    snowplow__enable_mobile: false
    snowplow__enable_web: true
```
Note these are both `true` by default, so you only need to add the one you wish to disable.

### 4. Enable desired contexts

The Unified Digital Model has the option to join in data from the following Snowplow enrichments and out-of-the-box context entities:

- [IAB enrichment](/docs/enriching-your-data/available-enrichments/iab-enrichment/index.md)
- [UA Parser enrichment](/docs/enriching-your-data/available-enrichments/ua-parser-enrichment/index.md)
- [YAUAA enrichment](/docs/enriching-your-data/available-enrichments/yauaa-enrichment/index.md)
- Browser context
- Mobile context
- Geolocation context
- App context
- Screen context
- Deep Link context
- App Error context
- Core Web Vitals
- Consent (Preferences & cmp visible)
- Mobile screen summary (used for screen engagement calculation)

By default these are **all disabled** in the Unified Digital Model. Assuming you have the enrichments turned on in your Snowplow pipeline, to enable the contexts within the package please add the following to your `dbt_project.yml` file:

```yml title="dbt_project.yml"
vars:
  snowplow_unified:
    snowplow__enable_iab: true
    snowplow__enable_ua: true
    snowplow__enable_yauaa: true
    snowplow__enable_browser_context: true
    snowplow__enable_mobile_context: true
    snowplow__enable_geolocation_context: true
    snowplow__enable_application_context: true
    snowplow__enable_screen_context: true
    snowplow__enable_deep_link_context: true
    snowplow__enable_consent: true
    snowplow__enable_cwv: true
    snowplow__enable_app_errors: true
    snowplow__enable_screen_summary_context: true
```

### 5. Filter your data set

You can specify both the `start_date` at which to start processing events and the `app_id`s to filter for. By default the `start_date` is set to `2020-01-01` and all `app_id`s are selected. To change this, add the following to your `dbt_project.yml` file:

```yml title="dbt_project.yml"
vars:
  snowplow_unified:
    snowplow__start_date: 'yyyy-mm-dd'
    snowplow__app_id: ['my_app_1','my_app_2']
```

### 6. Verify page ping variables

The Unified Digital Model processes page ping events to calculate web page engagement times. If your [tracker configuration](/docs/collecting-data/collecting-from-own-applications/javascript-trackers/web-tracker/tracking-events/index.md#activity-tracking-page-pings) for `min_visit_length` (default 5) and `heartbeat` (default 10) differs from the defaults provided in this package, you can override by adding to your `dbt_project.yml`:

```yml title="dbt_project.yml"
vars:
  snowplow_unified:
    snowplow__min_visit_length: 5 # Default value
    snowplow__heartbeat: 10 # Default value
```
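
To see why these two values matter, here is a rough back-of-the-envelope sketch — our own approximation for intuition, not the package's actual SQL — of how engaged time relates to the number of page pings received:

```python
def approx_engaged_seconds(n_pings: int,
                           min_visit_length: int = 5,
                           heartbeat: int = 10) -> int:
    """Approximate the engaged time implied by n page pings.

    The first ping fires after min_visit_length seconds, and each
    subsequent ping fires every heartbeat seconds after that.
    """
    if n_pings <= 0:
        return 0
    return min_visit_length + (n_pings - 1) * heartbeat


# With the defaults, 3 pings imply roughly 5 + 2 * 10 = 25 engaged seconds.
```

If the tracker is configured with different values than the package vars, estimates like this are skewed for every page view — which is why this step asks you to verify them.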

### 7. Additional vendor specific configuration

:::info BigQuery Only
Verify which column your events table is partitioned on. It will likely be partitioned on `collector_tstamp` or `derived_tstamp`. If it is partitioned on `collector_tstamp` you should set `snowplow__derived_tstamp_partitioned` to `false`. This will ensure only the `collector_tstamp` column is used for partition pruning when querying the events table:

```yml title="dbt_project.yml"
vars:
  snowplow_unified:
    snowplow__derived_tstamp_partitioned: false
```
:::

:::info Databricks only - setting the databricks_catalog

Add the following variable to your dbt project's `dbt_project.yml` file:

```yml title="dbt_project.yml"
vars:
  snowplow_unified:
    snowplow__databricks_catalog: 'hive_metastore'
```
Depending on your use case, this should be either the catalog (for Unity Catalog users on the Databricks connector 1.1.1 onwards; it defaults to 'hive_metastore') or the same value as your `snowplow__atomic_schema` ('atomic' unless changed). This is needed to handle the database property within `models/base/src_base.yml`.

**A more detailed explanation for how to set up your Databricks configuration properly can be found in [Unity Catalog support](/docs/modeling-your-data/modeling-your-data-with-dbt/dbt-configuration/index.md#unity-catalog-support).**

:::

### 8. Optimize your project

You can address [high volume optimizations](/docs/modeling-your-data/modeling-your-data-with-dbt/dbt-custom-models/high-volume-optimizations/) at a later stage if needed, but you can do a lot upfront by carefully choosing the value of `snowplow__session_timestamp`, which identifies the timestamp column used for sessionization. Ideally this should be the column your events table is partitioned on. It defaults to `collector_tstamp`, but depending on your loader, `load_tstamp` may be the more sensible value:

```yml title="dbt_project.yml"
vars:
  snowplow_unified:
    snowplow__session_timestamp: 'load_tstamp'
```

### 9. Verify your variables using our Config guides (Optional)

If you are unsure whether the default values are good enough in your case, or you would already like to maximize the potential of your models, you can dive deeper into the meaning behind our variables on our [Config](/docs/modeling-your-data/modeling-your-data-with-dbt/dbt-configuration/unified/) page. It includes a [Config Generator](/docs/modeling-your-data/modeling-your-data-with-dbt/dbt-configuration/unified/#Generator) to help you create all your variable configurations, if necessary.

### 10. Run your model

You can run your models for the first time by running the below commands (see the [operation](/docs/modeling-your-data/modeling-your-data-with-dbt/dbt-operation/index.md) page for more information on operating the package). As this package contains some seed files, you will need to seed these first:

```bash
dbt seed --select snowplow_unified --full-refresh
dbt run --selector snowplow_unified
```

### 11. Enable extras
The package comes with additional modules and functionality that you can enable, for more information see the [consent tracking](/docs/modeling-your-data/modeling-your-data-with-dbt/dbt-models/dbt-unified-data-model/consent-module/index.md), [conversions](/docs/modeling-your-data/modeling-your-data-with-dbt/dbt-models/dbt-unified-data-model/conversions/index.md), and [core web vitals](/docs/modeling-your-data/modeling-your-data-with-dbt/dbt-models/dbt-unified-data-model/core-web-vitals-module/index.md) documentation.
We are gradually moving tutorials like the QuickStart Guides from our standard Documentation to the **Tutorials & Guides** section. You can now find the latest tutorial on how to get started with the Unified Digital package [here](/tutorials/unified-digital/intro).

:::
2 changes: 1 addition & 1 deletion tutorials/unified-digital/intro.md
@@ -7,7 +7,7 @@ This tutorial walks you through the process of setting up our Unified Digital DBT

### Prerequisites

- DBT installed
- [DBT](https://github.com/dbt-labs/dbt) installed
- Connection to a warehouse
- for web:
- web events dataset being available in your database
96 changes: 84 additions & 12 deletions tutorials/unified-digital/setting-up-locally.md
@@ -50,26 +50,43 @@ position: 2

Now we’ll get to using our variables, which is how you enable the parts of the model that are relevant to your use-case.

1. Define the location of your source data within your `vars` block where your raw events are being loaded into. Make sure to update these with your actual table names!

```yaml
vars:
snowplow_unified:
snowplow__atomic_schema: schema_with_snowplow_events
snowplow__database: database_with_snowplow_events
```

:::info Databricks only

Please note that your `target.database` is NULL if using Databricks. In Databricks, schemas and databases are used interchangeably and in the dbt implementation of Databricks therefore we always use the schema value, so adjust your `snowplow__atomic_schema` value if you need to.

Add the following variable to your dbt project's `dbt_project.yml` file:

```yml title="dbt_project.yml"
vars:
  snowplow_unified:
    snowplow__databricks_catalog: 'hive_metastore'
```
Depending on your use case, this should be either the catalog (for Unity Catalog users on the Databricks connector 1.1.1 onwards; it defaults to 'hive_metastore') or the same value as your `snowplow__atomic_schema` ('atomic' unless changed). This is needed to handle the database property within `models/base/src_base.yml`.

A more detailed explanation for how to set up your Databricks configuration properly can be found in [Unity Catalog support](/docs/modeling-your-data/modeling-your-data-with-dbt/dbt-configuration/#unity-catalog-support).

:::


2. Unified Digital assumes you are modeling both web and mobile events and expects certain fields to exist based on this. If you are only tracking and modeling one of these, e.g. web data, you can disable the other as below:

```yml title="dbt_project.yml"
vars:
  snowplow_unified:
    snowplow__enable_mobile: false
    snowplow__enable_web: true
```

3. Enable [entities](https://docs.snowplow.io/docs/understanding-your-pipeline/entities/) to make sure that they're processed within the package - this means they will be un-nested from the atomic columns and made available in the derived tables. Make sure to only enable the ones you need.

```yaml
vars:
  snowplow_unified:
    snowplow__start_date: 'yyyy-mm-dd'
```

5. Optimize your data processing

You can address [high volume optimizations](/docs/modeling-your-data/modeling-your-data-with-dbt/dbt-custom-models/high-volume-optimizations/) at a later stage if needed, but you can do a lot upfront by carefully choosing the value of `snowplow__session_timestamp`, which identifies the timestamp column used for sessionization. Ideally this should be the column your events table is partitioned on. It defaults to `collector_tstamp`, but depending on your loader, `load_tstamp` may be the more sensible value:

```yml title="dbt_project.yml"
vars:
  snowplow_unified:
    snowplow__session_timestamp: 'load_tstamp'
```

:::info BigQuery Only
Verify which column your events table is partitioned on. It will likely be partitioned on `collector_tstamp` or `derived_tstamp`. If it is partitioned on `collector_tstamp` you should set `snowplow__derived_tstamp_partitioned` to `false`. This will ensure only the `collector_tstamp` column is used for partition pruning when querying the events table:

```yml title="dbt_project.yml"
vars:
  snowplow_unified:
    snowplow__derived_tstamp_partitioned: false
```
:::

6. Configure more vars as necessary, though in principle this is all you need to get started. If you are unsure whether the default values are good enough in your case, or you would already like to maximize the potential of your models, you can dive deeper into the meaning behind our variables on our [Config](/docs/modeling-your-data/modeling-your-data-with-dbt/dbt-configuration/unified/) page. It includes a [Config Generator](/docs/modeling-your-data/modeling-your-data-with-dbt/dbt-configuration/unified/#Generator) to help you create all your variable configurations, if necessary.

7. Filter your data set

You can specify both the `start_date` at which to start processing events and the `app_id`s to filter for. By default the `start_date` is set to `2020-01-01` and all `app_id`s are selected. To change this, add the following to your `dbt_project.yml` file:

```yml title="dbt_project.yml"
vars:
  snowplow_unified:
    snowplow__start_date: 'yyyy-mm-dd'
    snowplow__app_id: ['my_app_1','my_app_2']
```

Below we list a few more that might be of interest depending on your setup or modelling needs:

- Enable extras
The package comes with additional modules and functionality that you can enable, for more information see the [consent tracking](/docs/modeling-your-data/modeling-your-data-with-dbt/dbt-models/dbt-unified-data-model/consent-module), [conversions](/docs/modeling-your-data/modeling-your-data-with-dbt/dbt-models/dbt-unified-data-model/conversions/), and [core web vitals](/docs/modeling-your-data/modeling-your-data-with-dbt/dbt-models/dbt-unified-data-model/core-web-vitals-module) documentation.

- Adjust page ping variables, if needed
The Unified Digital Model processes page ping events to calculate web page engagement times. If your [tracker configuration](/docs/collecting-data/collecting-from-own-applications/javascript-trackers/web-tracker/tracking-events/#activity-tracking-page-pings) for `min_visit_length` (default 5) and `heartbeat` (default 10) differs from the defaults provided in this package, you can override by adding to your `dbt_project.yml`:

```yml title="dbt_project.yml"
vars:
snowplow_unified:
snowplow__min_visit_length: 5 # Default value
snowplow__heartbeat: 10 # Default value
```

### Adding the `selectors.yml` file

Within the packages we have provided a suite of suggested selectors to run and test the models within the package together with the Unified Digital Model. This leverages dbt's [selector flag](https://docs.getdbt.com/reference/node-selection/syntax). You can find out more about each selector in the [YAML Selectors](/docs/modeling-your-data/modeling-your-data-with-dbt/dbt-operation/#yaml-selectors) section.

These are defined in the `selectors.yml` file ([source](https://github.com/snowplow/dbt-snowplow-unified/blob/main/selectors.yml)) within the package, however in order to use these selections you will need to copy this file into your own dbt project directory. This is a top-level file and therefore should sit alongside your `dbt_project.yml` file. If you are using multiple packages in your project you will need to combine the contents of these into a single file.

### Running the package

Run `dbt seed` first to load the seed files included in the package, then run the actual model:

```bash
dbt seed --select snowplow_unified --full-refresh
dbt run --selector snowplow_unified
```