Is your feature request related to a problem? Please describe.
When using normalize from within BI tools (we use Looker), we want to use derived_tstamp instead of collector_tstamp, as it better represents when the events actually happened.
When we allow the user to filter on time using only derived_tstamp, all partitions are scanned, as BigQuery has no way of knowing how those timestamps are related. To combat this we've created a rubber band filter that filters on collector_tstamp within ±1 day of the derived_tstamp. But this has the downside that it doesn't catch late-arriving events. See SQL examples below.
Furthermore, I imagine the use of collector_tstamp is based on the assumption that the atomic table is partitioned on this field. With the new loader structure for BigQuery (my_schema_1, not my_schema_1_2_3), our events table is partitioned on load_tstamp, which I assume also has implications for the data scanned by the actual runs of the normalize package, as those runs likewise need to scan all partitions.
Describe the solution you'd like
One option (and probably the easiest) would be to use start_tstamp, similar to what the unified models use. From my understanding this is mostly the same as derived_tstamp, and it would thus utilize the partitions when the user filters on derived_tstamp.
Another option would be to allow the user to set the partition key, which would then be reflected in the generated code. Currently, collector_tstamp is hard-coded into the model_gen.py script as well as the normalize_events macro.
I imagine this could be part of the config section at the top of the normalize_config.json file (and it should probably be the same for all models, as otherwise complexity would have to be added to the models).
A variable might also need to be set in dbt_project.yml stating which partition key is used for the atomic table, as I imagine this can take a number of different values. This would probably also need to be considered for the manifest to work best and to reduce the data processed(?)
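To make the second option concrete, a hypothetical config section in normalize_config.json might look like the sketch below. The `partition_key` field name is invented here purely for illustration; it does not exist in the package today:

```json
{
  "config": {
    "partition_key": "derived_tstamp"
  },
  "events": []
}
```

The model_gen.py script would then substitute this value wherever collector_tstamp is currently hard-coded.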
Describe alternatives you've considered
As mentioned above, we have created a filter that is always applied so that we don't query all available data every time. Furthermore, we always enforce date filters so that users don't query too much data just because they can.
Example query without the rubber band filters applied:
```sql
SELECT
  app_id
FROM
  `warehouse.snowplow_normalize.my_events`
WHERE
  TIMESTAMP_TRUNC(derived_tstamp, DAY) >= TIMESTAMP("2024-08-01")
```
This query processes 6.05 GB of data, despite querying only 5 days (as of 5 August).
With an added filter, we can reduce the amount processed many times over:
```sql
SELECT
  app_id
FROM
  `warehouse.snowplow_normalize.my_events`
WHERE
  TIMESTAMP_TRUNC(derived_tstamp, DAY) >= TIMESTAMP("2024-08-01")
  AND TIMESTAMP_TRUNC(collector_tstamp, DAY) >= TIMESTAMP("2024-08-01")
```
which processes only 0.7 GB.
In other queries we have tried, this lowered the data processed from 65 GB to around 2 GB, and we currently have only around a month of data. The problem will only grow over time if we don't enforce a collector_tstamp filter as well.
As mentioned above, this has the downside that it excludes potentially late-arriving events.
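For reference, a version of the rubber band filter described above, widening the collector_tstamp bound by one day, could look like the following sketch (same hypothetical table as the examples above):

```sql
SELECT
  app_id
FROM
  `warehouse.snowplow_normalize.my_events`
WHERE
  TIMESTAMP_TRUNC(derived_tstamp, DAY) >= TIMESTAMP("2024-08-01")
  -- prune partitions: allow collector_tstamp to deviate by up to 1 day
  AND TIMESTAMP_TRUNC(collector_tstamp, DAY) >=
      TIMESTAMP_SUB(TIMESTAMP("2024-08-01"), INTERVAL 1 DAY)
```

Events whose collector_tstamp falls outside the one-day band are still missed, which is the limitation noted above.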
Additional context
N/A
Are you interested in contributing towards this feature?
Yes, my team at TV 2 Norway and I like the normalize package and are willing to contribute.
Hi @tv2-tjebsen, thank you for the detailed explanation. We understand the need for flexibility in partitioning/filtering for optimization. Regarding optimizing the querying of the events table: we already support changing the incremental timestamp field via the variable snowplow__session_timestamp, which in your case should ideally be set to load_tstamp, as that is what your atomic events table is partitioned on. The lookback_window_hours could also be reduced in this case.
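For anyone reading along, setting that variable would look something like the sketch below in dbt_project.yml. The project scope name and the snowplow__lookback_window_hours name and value are illustrative; check the package documentation for your version:

```yaml
# dbt_project.yml (sketch)
vars:
  snowplow_normalize:
    snowplow__session_timestamp: load_tstamp  # match the atomic table's partition column
    snowplow__lookback_window_hours: 3        # illustrative reduced value
```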
As for partitioning the normalized derived table itself: currently the additional flat columns (event_columns in the config file) taken from the events table are optional. derived_tstamp would have to be hardcoded into the package (perhaps renamed to start_tstamp for consistency with other packages, as you said) and then possibly added as a default partitioning field. However, this would be a breaking change, so we need to think about how best to support it. Alternatively, we could tie the derived table's partitioning field universally to a variable, but then things can break easily, so there should be checks at model-creation time that the column exists for all the event specifications within the config. I will raise this with the team as a feature request; until then, the only option I see is to overwrite the partition_by field in the model config for all the derived tables as a "custom" solution.
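As a sketch of that custom workaround, dbt-bigquery's standard partition_by model config could be set on each derived model, assuming derived_tstamp is included among the model's event_columns (model path and name here are illustrative):

```sql
-- models/snowplow_normalize/my_events.sql (sketch)
{{
  config(
    partition_by={
      'field': 'derived_tstamp',
      'data_type': 'timestamp',
      'granularity': 'day'
    }
  )
}}
```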
Thanks for your great reply, @agnessnowplow! I was not aware of the snowplow__session_timestamp variable. Thanks for pointing it out; it's now implemented in our instance of normalize.
I agree with the arguments about the partitioning part, and agree that the fastest path to a solution for us would probably be a more custom approach. I guess it could be solved by changing the model_gen script slightly. I do think it would be a great addition to the package to have it more aligned with the unified setup (or driven by a variable). Totally understandable that you can't introduce breaking changes and need to prioritize ✌️