Chore: Added stage tables for portal pageviews #1267
base: master
No changes here :)
@@ -0,0 +1,5 @@
{% set rudder_relations = dbt_utils.get_relations_by_prefix(schema="PORTAL_PROD", database="RAW", prefix="PAGEVIEW_") %}

{{ dbt_utils.union_relations(
Are all pageview columns required?
Unioning the tables to create a staging "tracks" model from events does not really follow dbt's suggested project structure, but it is definitely necessary. One thing that can make this more predictable is to explicitly define which columns to keep. In this case, those should be the columns shared between the event tables. This way, whenever a new property is added to any `pageview_` table, the current model won't be affected.
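A minimal sketch of what pinning the columns could look like, using the `include` argument of `dbt_utils.union_relations`. The four column names are an assumption, taken from the columns the downstream staging model selects; whether every `PAGEVIEW_` table really shares them would need checking. They are upper-cased on the assumption that Snowflake reports column names that way and that `include` is matched case-sensitively.

```sql
{% set rudder_relations = dbt_utils.get_relations_by_prefix(schema="PORTAL_PROD", database="RAW", prefix="PAGEVIEW_") %}

{# Keep only the columns assumed to be shared by every pageview table. #}
{# A property added to a single PAGEVIEW_ table is then ignored here.  #}
{{ dbt_utils.union_relations(
    relations=rudder_relations,
    include=["ID", "USER_ID", "EVENT", "RECEIVED_AT"]
) }}
```

By default `union_relations` also adds a `_dbt_source_relation` column, so the source table of each row stays available.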
WITH identifies as(
    SELECT
        {{ dbt_utils.star(source('portal_prod', 'identifies')) }}
    FROM
        {{ source('portal_prod', 'identifies') }}
)

SELECT
    user_id,
    portal_customer_id,
    context_traits_portal_customer_id,
    received_at
FROM
    identifies
This introduces a new pattern for staging models. Let's follow the pattern described in dbt's documentation. dbt codegen generates it automatically, so it also takes less effort.
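For reference, a sketch of the `source` / `renamed` layout that dbt's staging guidance and the codegen base-model output follow, applied to the columns selected above. This is an illustration, not the exact generated file.

```sql
with source as (

    -- Pull everything from the raw source once.
    select * from {{ source('portal_prod', 'identifies') }}

),

renamed as (

    -- List (and rename, if needed) only the columns the project uses.
    select
        user_id,
        portal_customer_id,
        context_traits_portal_customer_id,
        received_at
    from source

)

select * from renamed
```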
with pageviews as(
    SELECT
        {{ dbt_utils.star(ref('base_portal_prod__tracks')) }}
    FROM
        {{ ref ('base_portal_prod__tracks') }}
)

select
    id as pageview_id,
    user_id,
    event as event_table,
    received_at
from
    pageviews
Not sure if the CTE is needed.
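A sketch of the same model without the CTE, selecting straight from the base model. It assumes nothing else in the file depends on the `pageviews` alias.

```sql
-- Same four output columns, read directly from the base model.
select
    id as pageview_id,
    user_id,
    event as event_table,
    received_at
from
    {{ ref('base_portal_prod__tracks') }}
```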
FROM
    {{ ref('stg_portal_prod__identifies') }}
WHERE
    coalesce(portal_customer_id, context_traits_portal_customer_id) IS NOT NULL and received_at >= '2023-04-04'
Why `2023-04-04`? Any chance this is a leftover to make things run faster?
    false AS workspace_created
FROM
    pageviews
    JOIN (select * from identifies where row_number = 1) identifies
Why not use `qualify` in the `identifies` CTE?
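A sketch of how the deduplication could move into the CTE with Snowflake's `qualify`, so the join no longer needs a subquery filtering on `row_number = 1`. The `where` clause is an assumption based on the filter shown in the other hunk of this model.

```sql
with identifies as (
    select
        user_id,
        coalesce(portal_customer_id, context_traits_portal_customer_id) as portal_customer_id
    from
        {{ ref('stg_portal_prod__identifies') }}
    where
        coalesce(portal_customer_id, context_traits_portal_customer_id) is not null
    -- Keep only the earliest identify call per user.
    qualify row_number() over (partition by user_id order by received_at) = 1
)

select * from identifies
```

With this, the join can reference `identifies` directly.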
{{
    config({
        "materialized": "table",
        "incremental_strategy": "merge",
Incremental strategy is defined, but the model is not incremental.
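One way to resolve this is to drop `incremental_strategy` and keep `"materialized": "table"`. The other is to make the model incremental, roughly as sketched below. The `unique_key` and the `received_at` watermark are assumptions for illustration only; the right choices depend on the grain of the model.

```sql
{{
    config({
        "materialized": "incremental",
        "incremental_strategy": "merge",
        "unique_key": "user_id"
    })
}}

select
    user_id,
    portal_customer_id,
    received_at
from
    {{ ref('stg_portal_prod__identifies') }}
{% if is_incremental() %}
-- On incremental runs, only process rows newer than what is already loaded.
where received_at > (select max(received_at) from {{ this }})
{% endif %}
```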
    -- Email is verified and user is redirected to `pageview_create_workspace` screen.
    MAX(CASE WHEN pageviews.event_table = 'pageview_create_workspace' THEN true ELSE false END) AS email_verified,
    -- Setting to false as we consider `workspace_installation_id` from stripe as source of truth
    false AS workspace_created
Why is this always false? If this information will come from a different source, why add it here?
transform/mattermost-analytics/models/intermediate/data_eng/signup/_int_signup__models.yml
@@ -0,0 +1,45 @@
{{
If I understand correctly, this model captures the users that have completed signups. The only aggregation seems to be there to achieve some deduplication (?). Perhaps the name of the model could be `int_user_signups`.
WITH identifies as (
    SELECT user_id,
        coalesce(portal_customer_id, context_traits_portal_customer_id) as portal_customer_id,
        ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY RECEIVED_AT) AS row_number
Why not use `qualify` in the `identifies` CTE?
If I understand correctly, the idea is to create a mapping between `user_id` and `portal_customer_id`. For this case, the logic is rather simple, as for each `user_id` a single `portal_customer_id` must exist and vice versa. This can be moved to a separate intermediate model because:
- it's extremely likely that it will be reused,
- this way tests on the 1:1 mapping can be added.
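As a sketch of the second point, a dbt singular test could assert the 1:1 mapping in both directions. It assumes the mapping lives in a model named `int_rudder_portal_user_mapping` with `user_id` and `portal_customer_id` columns; the test file name is hypothetical.

```sql
-- tests/assert_user_portal_customer_mapping_is_one_to_one.sql (hypothetical name)
-- A singular test fails if it returns any rows.

-- Users mapped to more than one portal customer.
select
    'user_id' as key_column,
    cast(user_id as varchar) as key_value,
    count(distinct portal_customer_id) as mapped_values
from {{ ref('int_rudder_portal_user_mapping') }}
group by user_id
having count(distinct portal_customer_id) > 1

union all

-- Portal customers mapped to more than one user.
select
    'portal_customer_id' as key_column,
    cast(portal_customer_id as varchar) as key_value,
    count(distinct user_id) as mapped_values
from {{ ref('int_rudder_portal_user_mapping') }}
group by portal_customer_id
having count(distinct user_id) > 1
```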
In fact, it's a common challenge when using data from CDP platforms. Here are a few examples of how it's handled using dbt:
@@ -0,0 +1,17 @@
version: 2
A better place for signup-related models might be under the name of the team responsible for this flow. Placing them under `data_eng` feels a bit strange.
.../mattermost-analytics/models/intermediate/data_eng/signup/int_rudder_portal_user_mapping.sql
Outdated
transform/mattermost-analytics/models/staging/portal_prod/_portal_prod__models.yml
Outdated
transform/mattermost-analytics/models/intermediate/data_eng/signup/_int_signup__models.yml
    WHEN pageview_create_workspace.pageview_id IS NOT NULL
    THEN TRUE
    ELSE FALSE
END AS email_verified,
Perhaps `is_email_verified`.
Adding the `is_` prefix.
    account_created,
    email_verified
from user_signup_stages
qualify row_number() over (partition by portal_customer_id order by timestamp) = 1
Not very sure what the `email_verified` specification is.
Here's an example with two customers:
| portal_customer_id | email_verified | timestamp |
|--------------------|----------------|------------|
| 1 | false | 2023-05-01 |
| 1 | true | 2023-05-02 |
| 1 | false | 2023-05-03 |
| 2 | true | 2023-05-02 |
| 2 | false | 2023-05-03 |
- With `qualify`, the resulting `email_verified` for customer 1 will be `false`, since only the earliest row is kept.
- With, e.g., `group by portal_customer_id` and `max(email_verified)`, `email_verified` will be `true` for both customers.

Which one should be used?
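The difference can be reproduced with the rows from the table above. This is a Snowflake-style sketch: the inline `values` list stands in for `user_signup_stages`, and the dates are kept as ISO strings, which sort the same way as dates.

```sql
with user_signup_stages as (
    select *
    from (values
        (1, false, '2023-05-01'),
        (1, true,  '2023-05-02'),
        (1, false, '2023-05-03'),
        (2, true,  '2023-05-02'),
        (2, false, '2023-05-03')
    ) as t (portal_customer_id, email_verified, timestamp)
)

select
    portal_customer_id,
    -- Value on the earliest row, which is what the qualify filter keeps.
    email_verified as first_row_email_verified,
    -- True if the customer was ever verified, like group by + max().
    max(email_verified) over (partition by portal_customer_id) as ever_email_verified
from user_signup_stages
qualify row_number() over (partition by portal_customer_id order by timestamp) = 1
order by portal_customer_id
-- customer 1: first_row_email_verified = false, ever_email_verified = true
-- customer 2: first_row_email_verified = true,  ever_email_verified = true
```

The two columns disagree for customer 1, which is the case the specification needs to settle.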
FROM
    {{ ref('stg_portal_prod__identifies') }}
WHERE
    portal_customer_id IS NOT NULL
nit: use the same ordering as in the SELECT.
Summary
`pageview*` tables to capture cloud signups.
select count(*)
from ANALYTICS.DBT_CLOUD_PR_226810_1267.INT_PORTAL_PROD_SIGNUPS_AGGREGATED_TO_USERS signups
join analytics.stripe.customers c on c.cws_customer = signups.portal_customer_id
where account_created and c.id is not null; -- equal to unique customers in identify.
Ticket Link
Required for changes to PR -> #1261