large data sets #1751
mtoy-googly-moogly started this conversation in What's Next: Language Proposals
Replies: 1 comment, 1 reply
-
"Default wheres" would be great, and going further, they could be set for specific fields. Fact tables are essentially time-based, and very often only a short time window is analyzed. Given this, handing some users Malloy as a playground may lead to an expensive query if they forget to filter on time. This is not a Malloy-specific problem, but Malloy could certainly help by preventing it.
-
Problem
Many data sets are so large that giving unfiltered query access creates the possibility of very expensive queries being constructed.
Solutions
Malloy does cost analysis of queries
Perhaps there could be metadata on a model or on a query indicating how many resources the user is willing to allocate to a query.
Pre-filtering on large tables
This approach provides no guarantee that query cost will be bounded, but it does provide a mechanism for controlling one of the most common ways a query can explode.
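As a rough sketch of the idea (the connection, table, and field names here are invented, and the exact Malloy syntax varies by version):

```malloy
// A pre-filtered source: every query written against `recent_events`
// inherits the time-window filter, so an ad-hoc query cannot
// accidentally scan the entire fact table.
source: recent_events is db.table('analytics.events') extend {
  where: event_time > @2024-01-01
}
```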
Another way to express a pre-filtered table would be to use the planned, but not yet implemented, source parameterization.
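A hypothetical sketch of what that parameterization might look like (this syntax is invented, since the feature was not implemented when this was written, and all names are illustrative):

```malloy
// Hypothetical: the caller must supply a start date, so there is
// no way to express an unfiltered query against the fact table.
source: events(start is @2024-01-01) is db.table('analytics.events') extend {
  where: event_time > start
}

// A query would then supply its own window:
run: events(start is @2024-06-01) -> { aggregate: event_count is count() }
```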
The open question with pre-filtering is whether there are places where making the pre-filter an explicit property of a source would allow better code than treating the source as something more like a source template. Operations like joining a pre-filtered source are where I wonder whether this distinction matters.
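For instance (again with invented names, and Malloy syntax that may differ by version), a join against a pre-filtered source would carry the filter along with it:

```malloy
// Illustrative only: `db`, `analytics.events`, and `analytics.users`
// are hypothetical. The filter baked into `recent_events` travels
// with it into the join, so queries through `users` also avoid
// scanning the full fact table.
source: recent_events is db.table('analytics.events') extend {
  where: event_time > @2024-01-01
}

source: users is db.table('analytics.users') extend {
  join_many: recent_events on id = recent_events.user_id
}
```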