DM-46479: add low-level support for union queries over multiple dataset types #1104
Conversation
I think this is good to merge as-is so we can unblock Andy and I can start working with it. The refactors all look good and the new code seems to be mostly segregated from existing code paths.
The only comment that I think is a major issue is the `deepcopy` thing, but that code is not currently being executed so it's not immediately harmful. I can fix it up when I go in there to start using this new functionality.
We'll have to see what happens with performance -- I'd guess that Postgres's query planner is going to treat each member of the union as an independent query. I'm also a little concerned that these queries are going to end up complex and difficult to debug.
```python
        callback and keys.
        """
        result = NonemptyMapping[_K, _V](self._default_factory)
        result._mapping = copy.deepcopy(self._mapping)
```
This `deepcopy` seems hazardous if the inner value type is non-trivial, e.g. this gets called on a `NonemptyMapping[str, list[sqlalchemy.ColumnElement]]`.

I dunno what the behavior of calling `deepcopy` on a sqlalchemy object is and I'm not sure I want to find out :D The author of sqlalchemy has been quoted as saying:

> I think deepcopy is kind of crazy to use ever.

and I can't find any documentation saying it's intended to work.

Maybe we should add more-concrete subclasses for `NonemptyMapping[T, list[S]]`, or just shallow-copy the values to limit the blast radius.
I think we could get away with versions of `NonemptyMapping` whose values either have a `.copy()` (which should be used) or do not (in which case we assume the result is immutable). Might be cleaner overall to just dispatch on that `hasattr` check than have subclasses.

I figured the worst-case scenario would be unnecessary copies, and I'd have hoped that library authors whose objects would suffer more severe consequences would just implement `__deepcopy__` or `__reduce__` to fix it. But it sounds like that's not the case.
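For illustration, here's the kind of hook being discussed, as a generic sketch (not anything sqlalchemy actually implements):

```python
import copy

class SharedResource:
    """Stand-in for an object holding a socket, cache, or other shared state."""

    def __deepcopy__(self, memo: dict) -> "SharedResource":
        # Opt out of duplication: deepcopy hands back the original object,
        # so the underlying resource is never double-owned.
        memo[id(self)] = self
        return self

r = SharedResource()
assert copy.deepcopy(r) is r
```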
The actual worst-case scenario is more like double-freeing of pointers, accidental sharing of sockets, duplicate copies of large caches that are meant to be shared, or spooky action at a distance involving accidental sharing of state between separate copies.
I think the average library author isn't that interested in exploring all the possible interactions of their system with different weird Python double-underscore features -- if everyone dealt with every possible thing that anyone could do to their objects in Python, every library would be 10x bigger. Same reason you wouldn't assume you can subclass an arbitrary library type or use it from multiple threads simultaneously.
Turns out this was even easier than I thought: we're already only using it with `set`, `list`, or `dict` values, and I think it's only useful when the value type is a mutable container. So I just added a `Protocol` bound on the value type that requires it to have a `copy` method, and used that.
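A minimal sketch of that shape (the real `NonemptyMapping` lives elsewhere in the codebase; the internals here are assumptions based on the diff quoted above):

```python
from typing import Callable, Generic, Protocol, TypeVar

class SupportsCopy(Protocol):
    """Bound for value types; set, list, and dict all satisfy it."""

    def copy(self) -> "SupportsCopy": ...

_K = TypeVar("_K")
_V = TypeVar("_V", bound=SupportsCopy)

class NonemptyMapping(Generic[_K, _V]):
    def __init__(self, default_factory: Callable[[], _V]) -> None:
        self._default_factory = default_factory
        self._mapping: dict[_K, _V] = {}

    def __getitem__(self, key: _K) -> _V:
        # Create values on first access, which is what makes the mapping
        # "nonempty" from the caller's point of view.
        if key not in self._mapping:
            self._mapping[key] = self._default_factory()
        return self._mapping[key]

    def copy(self) -> "NonemptyMapping[_K, _V]":
        result = NonemptyMapping[_K, _V](self._default_factory)
        # Copy each container via its own copy() method: one level deep, so
        # contents (e.g. sqlalchemy column objects) are shared, not cloned.
        result._mapping = {k: v.copy() for k, v in self._mapping.items()}
        return result
```

The worst case is then an extra container allocation per key, which matches the "unnecessary copies" ceiling discussed above.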
"""A struct describing a dataset search joined into a query, after | ||
resolving its collection search path. | ||
""" | ||
|
||
name: str | ||
"""Name of the dataset type.""" | ||
name: _T |
It took me a while to figure out which types this could have, and under which circumstances it was which, and how that interacted with the function overloads associated with this generic.

I wonder if this could just be `list[str]` instead of needing the generic? Or maybe two classes like:

```python
class SingleResolvedDatasetSearch:
    name: str
    search: ResolvedDatasetSearch

class UnionResolvedDatasetSearch:
    names: list[str]
    search: ResolvedDatasetSearch
```

since it seems like usually we want to know which of the generic cases we have, and when we don't we can pass in the `search` object instead of the whole thing.
There might be a case where `T` can be `...`, too; I'm not sure. I don't really have much of an opinion in the abstract here; it'd come down to what the options actually look like in practice.
Punting this one to future tickets.
```python
# Gather the filtered collection search path for each union dataset
# type.
collections_by_dataset_type = defaultdict[str, list[str]](list)
for collection_record, collection_summary in collection_analysis.summaries_by_dataset_type[...]:
    for dataset_type in collection_summary.dataset_types:
        if dataset_type.dimensions == tree.any_dataset.dimensions:
            collections_by_dataset_type[dataset_type.name].append(collection_record.name)
# Reverse the lookup order on the mapping we just made to group
# dataset types by their collection search path. Each such group
# yields an output plan.
dataset_searches_by_collections: dict[tuple[str, ...], ResolvedDatasetSearch[list[str]]] = {}
for dataset_type_name, collection_path in collections_by_dataset_type.items():
    key = tuple(collection_path)
    if (resolved_search := dataset_searches_by_collections.get(key)) is None:
        resolved_search = ResolvedDatasetSearch[list[str]](
            [],
            dimensions=tree.any_dataset.dimensions,
            collection_records=[
                collection_analysis.collection_records[collection_name]
                for collection_name in collection_path
            ],
            messages=[],
        )
        resolved_search.is_calibration_search = any(
            r.type is CollectionType.CALIBRATION for r in resolved_search.collection_records
        )
        dataset_searches_by_collections[key] = resolved_search
    resolved_search.name.append(dataset_type_name)
```
It can help readability to pull out independent chunks like this to a separate function. I find that it makes it easier to see the high-level structure of the logic without getting bogged down in the details. I like to use Right click -> Refactor -> Extract Method in VSCode to experimentally check candidate chunks -- in this case it comes out cleanly as a function of 2 parameters.
Yeah, it's mostly the number of parameters (especially modified-in-place parameters) that makes me lean towards doing this a bit less than you, but if this one really is just two parameters then I agree it's a good idea.
Done.
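For the record, the extraction comes out shaped roughly like this (the helper name and parameter types are guesses, not the actual signature):

```python
def _group_dataset_types_by_collection_path(
    tree: QueryTree, collection_analysis: QueryCollectionAnalysis
) -> dict[tuple[str, ...], ResolvedDatasetSearch[list[str]]]:
    """Group union dataset types by their filtered collection search path,
    yielding one resolved search per distinct path.
    """
    # Body is the block quoted above, unchanged; only `tree` and
    # `collection_analysis` need to come in from the enclosing scope.
    ...
```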
```python
@dataclasses.dataclass(kw_only=True)
class SqlColumns:
```
I like the refactor to pull this stuff out into its own class -- these are logically separate from some of the things they were mixed in with before. I'd like it even better if SqlJoinsBuilder consumed it by composition instead of inheritance. (Looks like maybe you were moving that direction and didn't get that far?)
The need to pull this out happened pretty late in the refactor and I just wasn't looking for further refinements at that stage. I agree composition at least intuitively feels a little better, though I could also imagine it just adding a lot of `.columns` visual noise in practice, since the usage of the base class on its own is pretty niche.
Punting this one to future tickets.
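To make the trade-off concrete, a sketch of the two shapes being weighed (attribute contents elided):

```python
import dataclasses

@dataclasses.dataclass(kw_only=True)
class SqlColumns:
    ...

# Inheritance, as in this PR: every SqlJoinsBuilder *is* a SqlColumns,
# so callers reach column attributes directly on the builder.
@dataclasses.dataclass(kw_only=True)
class SqlJoinsBuilder(SqlColumns):
    ...

# Composition, as suggested: the builder *has* columns, which keeps the
# roles separate at the cost of a `.columns` prefix at every use site.
@dataclasses.dataclass(kw_only=True)
class SqlJoinsBuilderComposed:
    columns: SqlColumns
```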
```python
        raise NotImplementedError()


class SingleSelectQueryBuilder(QueryBuilder):
```
It seems like it might be possible to unify more of this near-duplicate code with `UnionQueryBuilder` by treating this as a union with only one term, though there are a few corners that will make that difficult.
I can't claim I tried super hard to make that work, and I may have been overly concerned about pessimizing the common case, but there were indeed a lot of small differences that seemed hard to iron out.
Punting this one to future tickets.
tests/test_pydantic_utils.py
```python
json_roundtripped = adapter.validate_json(adapter.dump_json(...))
self.assertIs(json_roundtripped, ...)
python_roundtripped = adapter.validate_python(adapter.dump_python(...))
self.assertIs(python_roundtripped, ...)
```
There's a missing test case here for `adapter.validate_python(...)`, which I think would fail. This will make it hard to instantiate models using `SerializableEllipsis` from inside Python rather than from JSON.
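Spelled out in the style of the surrounding assertions, the missing case validates the raw Python value directly, with no round-trip:

```python
python_direct = adapter.validate_python(...)
self.assertIs(python_direct, ...)
```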
Oh, bleh.

I tried rewriting this with `__get_pydantic_core_schema__` (since I understand that better), but any solution that fixes this problem seems to fail to work with the union. I'm inclined to go replace ellipsis usage here with a single-variant `enum`; I had that idea later in the ticket and wondered if it might be better but didn't see a need, but I think this problem puts me on the other side of that fence.
Yeah the enum thing will probably work, or one thing that I often see in JSON is using an object like `{"special": "any_type"}`, which can't be confused with a string. (Better if you know you need it up front so you end up with a discriminated union instead of an awkward union of string and object, but `str | object` ends up happening a lot.)
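A sketch of that pattern with Pydantic v2 (model and field names invented for illustration):

```python
from typing import Literal

from pydantic import BaseModel, TypeAdapter

class AnyTypeMarker(BaseModel):
    special: Literal["any_type"] = "any_type"

# The object form {"special": "any_type"} can never collide with a plain
# string, so the union validates unambiguously in both directions.
adapter = TypeAdapter(str | AnyTypeMarker)
assert adapter.validate_json(b'{"special": "any_type"}') == AnyTypeMarker()
assert adapter.validate_json(b'"bias"') == "bias"
```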
Redone with an enum. Turns out it's important to set the enum value to something other than a string (I used `-1`), since the value is what Pydantic uses for serialization, and it's not able to handle unions whose members overlap in JSON space.
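A minimal sketch of how that plays out, assuming Pydantic v2 (the enum name is invented; the non-string value is the load-bearing detail):

```python
import enum

from pydantic import TypeAdapter

class DatasetTypeWildcard(enum.Enum):
    # Pydantic serializes an enum as its value, so a non-string value keeps
    # the enum's JSON form from overlapping with the `str` union member.
    ANY = -1

adapter = TypeAdapter(str | DatasetTypeWildcard)
assert adapter.validate_python(DatasetTypeWildcard.ANY) is DatasetTypeWildcard.ANY
assert adapter.validate_json(adapter.dump_json(DatasetTypeWildcard.ANY)) is DatasetTypeWildcard.ANY
assert adapter.validate_json(b'"bias"') == "bias"
```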
This is setup work for associating each QueryPlan with multiple QueryBuilders (to be unioned together).
This obfuscates exactly which part of the plan each method uses and modifies, but that's about to change dramatically, and in the end each method will have to see a lot more of the whole anyway. And in exchange the method signatures get a lot simpler.
This is a major refactor of the many classes involved in translating butler queries to SQL, including some renaming to reflect new roles.

- The low-level SqlBuilder and SqlJoiner classes have been renamed to SqlSelectBuilder and SqlJoinsBuilder, and some of SqlJoinsBuilder has been factored out into a base class, SqlColumns.
- The QueryPlan objects have been split up into "analysis" objects that are still mostly plan-like, and QueryBuilder objects that hold both that planning information and one or more SqlSelectBuilder objects and a Postprocessing.
- The new QueryBuilder objects are a hierarchy (sketched below): there's a QueryBuilder abstract base class and two derived classes. SingleSelectQueryBuilder is a refactoring of the code path we had before, while UnionQueryBuilder is a UNION ALL over dataset types.
- DirectQueryDriver.build_query is still the main entry point, and it's now where the overview docs for the system live. It delegates to methods on the QueryBuilder objects to handle the differences between the single-select and union cases, and those delegate back to other DirectQueryDriver methods for logic that's shared between the two.
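A skeletal view of that hierarchy (the docstrings paraphrase the summary above; the real classes carry much more state and behavior):

```python
from abc import ABC

class QueryBuilder(ABC):
    """Planning information plus one or more SqlSelectBuilder objects and
    a Postprocessing instance.
    """

class SingleSelectQueryBuilder(QueryBuilder):
    """The refactored pre-existing code path: a single SELECT."""

class UnionQueryBuilder(QueryBuilder):
    """A UNION ALL over dataset types, with one select builder per term."""
```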
Turns out the Pydantic adapters were buggy in a hard-to-fix way, and while the enum is a little more verbose, it's more self-describing.
We're only using it with list, dict, and set values, and it's really only useful when the value type is a mutable container anyway.