Dagster + SQLMesh Metrics: Run sqlmesh python models on trino #2444

ravenac95 · 2024-11-04T05:23:38Z

What is it?

This is the second part of a set of issues for getting Dagster + SQLMesh Metrics running. The goal is to get sqlmesh running the python models on trino. If this turns out to be performant enough as is, without the additional work of building the special pre-warmed duckdb cache + rolling query runner, then we can still accomplish all the metrics work without spending as much time developing a more complex option.

ravenac95 · 2024-11-05T19:56:08Z

Sadly, after spending quite a few hours on this (there were some setup things to fix and then bugs in the python models), running this with just the python models as is on trino is still not performant enough. Some observations when running:

- We can't seem to saturate the requests to trino.
- Queries for each rolling day take on the order of 1-10s each. At one query per day, a 10 year period (~3,650 days) would take up to ~10 hours.
  - This is vastly slower than our duckdb pre-warmed cache implementation.
- Trino has some limitations on query text size. This might cause issues with the dataframe writing.
  - We can adjust settings here up to a point, so this is less of an immediate problem.
- Still running into periodic errors.
  - I think this is due to the scaling mechanisms, so I may disable those for now until we get Prometheus and KEDA set up to scale based on additional dimensions like HTTP requests.
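
For reference, the rolling-day estimate above can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes one query per rolling day, 365 days per year, a single metric, and (by default) fully sequential execution. The `backfill_hours` helper and its `concurrency` parameter are hypothetical names used for illustration, not part of the metrics code.

```python
SECONDS_PER_HOUR = 3600


def backfill_hours(years: float, seconds_per_query: float, concurrency: int = 1) -> float:
    """Estimate wall-clock hours for one rolling query per day.

    Assumes 365 days per year and that `concurrency` queries run
    perfectly in parallel (an idealization, not a measurement).
    """
    days = years * 365
    return days * seconds_per_query / concurrency / SECONDS_PER_HOUR


# Latency range observed in this issue: roughly 1-10s per rolling day.
for latency in (1, 10):
    print(f"{latency:>2}s per query -> {backfill_hours(10, latency):.1f} h")
```

At 10s per query this lands at about 10 hours for a 10 year period, and at about 1 hour for 1s per query, which is why getting more queries in flight at once matters so much here.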

Places we will go from here to explore (still not using duckdb):

- We should try to see if we can saturate the requests to trino's workers. It would be nice to see if it's possible to make the workers actually queue queries. At least then we will know it's functioning at its limit.
- We should use an external process to handle this test, as adding this directly as part of the sqlmesh python model adds some complication. My current thought is to use an "Arrow Flight Protocol" compliant server that will stream table results to the caller. This is an open protocol used by Arrow, so it's something we could easily use in other places. We would call the service from the python model, stream the rows in, and periodically write the results out.
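
One way to approach the saturation test is a small external script that fires the same set of queries at increasing concurrency and watches where throughput stops improving. The sketch below is only an illustration under assumptions: `measure_throughput` and `fake_run_query` are hypothetical names, and `fake_run_query` just sleeps to mimic query latency. For a real test it would be swapped for a function that submits SQL through a Trino client.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def measure_throughput(
    run_query: Callable[[str], object],
    queries: Sequence[str],
    workers: int,
) -> float:
    """Run all queries on `workers` threads and return queries per second."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() drains the iterator so any query exception is raised here.
        list(pool.map(run_query, queries))
    elapsed = time.perf_counter() - start
    return len(queries) / elapsed


def fake_run_query(sql: str) -> None:
    # Hypothetical stand-in for a real Trino call; sleeps to mimic latency.
    time.sleep(0.05)


queries = [f"SELECT {i}" for i in range(16)]
for workers in (1, 4, 16):
    qps = measure_throughput(fake_run_query, queries, workers)
    print(f"workers={workers:>2} throughput={qps:.1f} q/s")
```

If throughput keeps climbing as `workers` grows, the client is likely still the bottleneck. A plateau, ideally together with queries visibly queueing on the trino side, would suggest the cluster is at its limit.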

github-project-automation bot added this to OSO Nov 4, 2024

github-project-automation bot moved this to Backlog in OSO Nov 4, 2024

ravenac95 self-assigned this Nov 4, 2024

ravenac95 moved this from Backlog to In Progress in OSO Nov 4, 2024
