
DM-46479: add low-level support for union queries over multiple dataset types #1104

Merged: 10 commits from tickets/DM-46479 into main, Oct 28, 2024

Conversation

@TallJimbo (Member) commented Oct 22, 2024

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes
  • (if changing dimensions.yaml) make a copy of dimensions.yaml in configs/old_dimensions

codecov bot commented Oct 22, 2024

Codecov Report

Attention: Patch coverage is 77.68240% with 156 lines in your changes missing coverage. Please review.

Project coverage is 89.47%. Comparing base (84471a8) to head (4649edc).
Report is 11 commits behind head on main.

Files with missing lines                                  Patch %   Lines
...t/daf/butler/direct_query_driver/_query_builder.py     61.78%    69 Missing and 4 partials ⚠️
...hon/lsst/daf/butler/direct_query_driver/_driver.py     74.70%    30 Missing and 13 partials ⚠️
python/lsst/daf/butler/queries/tree/_query_tree.py        47.36%    7 Missing and 3 partials ⚠️
...st/daf/butler/direct_query_driver/_sql_builders.py     94.61%    4 Missing and 5 partials ⚠️
python/lsst/daf/butler/pydantic_utils.py                  36.36%    7 Missing ⚠️
python/lsst/daf/butler/nonempty_mapping.py                44.44%    4 Missing and 1 partial ⚠️
.../daf/butler/direct_query_driver/_query_analysis.py     94.59%    1 Missing and 1 partial ⚠️
python/lsst/daf/butler/queries/_query.py                  0.00%     1 Missing and 1 partial ⚠️
python/lsst/daf/butler/queries/tree/_base.py              75.00%    2 Missing ⚠️
.../lsst/daf/butler/queries/tree/_column_reference.py     66.66%    1 Missing and 1 partial ⚠️
... and 1 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1104      +/-   ##
==========================================
- Coverage   89.71%   89.47%   -0.24%     
==========================================
  Files         360      361       +1     
  Lines       47451    47741     +290     
  Branches     5734     5794      +60     
==========================================
+ Hits        42571    42717     +146     
- Misses       3518     3640     +122     
- Partials     1362     1384      +22     


@dhirving (Contributor) left a comment

I think this is good to merge as-is so we can unblock Andy and I can start working with it. The refactors all look good and the new code seems to be mostly segregated from existing code paths.

The only comment that I think is a major issue is the deepcopy thing, but that code is not currently being executed so it's not immediately harmful. I can fix it up when I go in there to start using this new functionality.

We'll have to see what happens with performance -- I'd guess that Postgres's query planner is going to treat each member of the union as an independent query. I'm also a little concerned that these queries are going to end up complex and difficult to debug.
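
For a sense of the query shape under discussion, here is a rough SQLAlchemy sketch: one SELECT per dataset type, combined with UNION ALL. The table and column names are illustrative, not the actual butler schema.

import sqlalchemy

# Illustrative schema, not the real butler tables: one tags table per
# dataset type, each queried with the same collection constraint.
metadata = sqlalchemy.MetaData()
tables = {
    name: sqlalchemy.Table(
        f"dataset_tags_{name}",
        metadata,
        sqlalchemy.Column("dataset_id", sqlalchemy.Integer),
        sqlalchemy.Column("collection", sqlalchemy.String),
    )
    for name in ("raw", "calexp")
}
# One SELECT per dataset type, tagged with a literal column so each row
# records which type it came from, then combined with UNION ALL.
selects = [
    sqlalchemy.select(
        sqlalchemy.literal(name).label("dataset_type"),
        table.c.dataset_id,
    ).where(table.c.collection == "run1")
    for name, table in tables.items()
]
union = sqlalchemy.union_all(*selects)
print(union.compile())  # SELECT ... UNION ALL SELECT ...

Each union member is an independent SELECT, which is consistent with the expectation that the planner treats them as independent queries.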

callback and keys.
"""
result = NonemptyMapping[_K, _V](self._default_factory)
result._mapping = copy.deepcopy(self._mapping)

@dhirving (Contributor) commented:

This deepcopy seems hazardous if the inner value type is non-trivial. e.g. this gets called on a NonemptyMapping[str, list[sqlalchemy.ColumnElement]].

I dunno what the behavior of calling deepcopy on a sqlalchemy object is and I'm not sure I want to find out :D The author of sqlalchemy has been quoted as saying:

I think deepcopy is kind of crazy to use ever.

and I can't find any documentation saying it's intended to work.

Maybe we should add more-concrete subclasses for NonemptyMapping[T, list[S]], or just shallow-copy the values to limit the blast radius.

@TallJimbo (Member, Author) commented:

I think we could get away with versions of NonemptyMapping whose values either have a .copy() method (which should be used) or do not (in which case we assume the value is immutable). It might be cleaner overall to just dispatch on that hasattr check than to have subclasses.

I figured the worst-case scenario would be unnecessary copies, and I'd have hoped that library authors for whom the consequences would be more severe would implement __deepcopy__ or __reduce__ to fix it. But it sounds like that's not the case.

@dhirving (Contributor) commented:

The actual worst-case scenario is more like double-freeing of pointers, accidental sharing of sockets, duplicate copies of large caches that are meant to be shared, or spooky action at a distance involving accidental sharing of state between separate copies.

I think the average library author isn't that interested in exploring all the possible interactions of their system with different weird Python double-underscore features -- if everyone dealt with every possible thing that anyone could do to their objects in Python, every library would be 10x bigger. Same reason you wouldn't assume you can subclass an arbitrary library type or use it from multiple threads simultaneously.

@TallJimbo (Member, Author) commented:

Turns out this was even easier than I thought: we're already only using it with set, list, or dict values, and I think it's only useful when the value type is a mutable container anyway. So I just added a Protocol bound on the value type that requires it to have a copy method, and used that.
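
A minimal sketch of what that Protocol bound could look like; the NonemptyMapping internals below are reconstructed from the snippet quoted above rather than copied from the real implementation, and typing.Self assumes Python 3.11+.

from __future__ import annotations

from collections.abc import Callable, Hashable
from typing import Generic, Protocol, Self, TypeVar


class Copyable(Protocol):
    """Values with a copy() method; list, set, and dict all qualify."""

    def copy(self) -> Self: ...


_K = TypeVar("_K", bound=Hashable)
_V = TypeVar("_V", bound=Copyable)


class NonemptyMapping(Generic[_K, _V]):
    def __init__(self, default_factory: Callable[[], _V]) -> None:
        self._default_factory = default_factory
        self._mapping: dict[_K, _V] = {}

    def __getitem__(self, key: _K) -> _V:
        # Create a value on first access, like collections.defaultdict.
        if (value := self._mapping.get(key)) is None:
            value = self._default_factory()
            self._mapping[key] = value
        return value

    def copy(self) -> NonemptyMapping[_K, _V]:
        result = NonemptyMapping[_K, _V](self._default_factory)
        # Shallow-copy each value via its own copy() method instead of
        # deepcopying, so objects held inside (e.g. SQLAlchemy columns)
        # are shared rather than recursively duplicated.
        result._mapping = {k: v.copy() for k, v in self._mapping.items()}
        return result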

"""A struct describing a dataset search joined into a query, after
resolving its collection search path.
"""

name: str
"""Name of the dataset type."""
name: _T

@dhirving (Contributor) commented:

It took me a while to figure out which types this could have, and under which circumstances it was which, and how that interacted with the function overloads associated with this generic.

I wonder if this could just be list[str] instead of needing the generic?

Or maybe two classes like:

class SingleResolvedDatasetSearch:
    name: str
    search: ResolvedDatasetSearch

class UnionResolvedDatasetSearch:
    names: list[str]
    search: ResolvedDatasetSearch

since it seems like we usually want to know which of the generic cases we have, and when we don't, we can pass in the search object instead of the whole thing.

@TallJimbo (Member, Author) commented:

There might be a case where T can be ..., too; I'm not sure. I don't really have much of an opinion in the abstract here; it'd come down to what the options actually look like in practice.

@TallJimbo (Member, Author) commented:

Punting this one to future tickets.

Comment on lines 694 to 749
# Gather the filtered collection search path for each union dataset
# type.
collections_by_dataset_type = defaultdict[str, list[str]](list)
for collection_record, collection_summary in collection_analysis.summaries_by_dataset_type[...]:
    for dataset_type in collection_summary.dataset_types:
        if dataset_type.dimensions == tree.any_dataset.dimensions:
            collections_by_dataset_type[dataset_type.name].append(collection_record.name)
# Reverse the lookup order on the mapping we just made to group
# dataset types by their collection search path. Each such group
# yields an output plan.
dataset_searches_by_collections: dict[tuple[str, ...], ResolvedDatasetSearch[list[str]]] = {}
for dataset_type_name, collection_path in collections_by_dataset_type.items():
    key = tuple(collection_path)
    if (resolved_search := dataset_searches_by_collections.get(key)) is None:
        resolved_search = ResolvedDatasetSearch[list[str]](
            [],
            dimensions=tree.any_dataset.dimensions,
            collection_records=[
                collection_analysis.collection_records[collection_name]
                for collection_name in collection_path
            ],
            messages=[],
        )
        resolved_search.is_calibration_search = any(
            r.type is CollectionType.CALIBRATION for r in resolved_search.collection_records
        )
        dataset_searches_by_collections[key] = resolved_search
    resolved_search.name.append(dataset_type_name)

@dhirving (Contributor) commented:

It can help readability to pull out independent chunks like this to a separate function. I find that it makes it easier to see the high-level structure of the logic without getting bogged down in the details. I like to use Right click -> Refactor -> Extract Method in VSCode to experimentally check candidate chunks -- in this case it comes out cleanly as a function of 2 parameters.
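
For reference, the invert-the-mapping core of that chunk stands alone nicely; here is a self-contained sketch of the grouping pattern with made-up data.

from collections import defaultdict

# Made-up inputs: each dataset type mapped to its filtered collection
# search path.
collections_by_dataset_type = {
    "raw": ["run1", "run2"],
    "calexp": ["run1", "run2"],
    "bias": ["calib"],
}
# Invert the mapping: dataset types sharing a search path are grouped
# together, keyed by the (hashable) tuple form of that path.
dataset_types_by_collections: defaultdict[tuple[str, ...], list[str]] = defaultdict(list)
for dataset_type_name, collection_path in collections_by_dataset_type.items():
    dataset_types_by_collections[tuple(collection_path)].append(dataset_type_name)

assert dataset_types_by_collections == {
    ("run1", "run2"): ["raw", "calexp"],
    ("calib",): ["bias"],
}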

@TallJimbo (Member, Author) commented:

Yeah, it's mostly the number of parameters (especially modified-in-place parameters) that makes me lean towards doing this a bit less than you, but if this one really is just two parameters then I agree it's a good idea.

@TallJimbo (Member, Author) commented:

Done.



@dataclasses.dataclass(kw_only=True)
class SqlColumns:

@dhirving (Contributor) commented:

I like the refactor to pull this stuff out into its own class -- these are logically separate from some of the things they were mixed in with before. I'd like it even better if SqlJoinsBuilder consumed it by composition instead of inheritance. (Looks like maybe you were moving that direction and didn't get that far?)
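
A tiny sketch of the composition alternative; the class names come from the PR, but the field and its type here are hypothetical.

import dataclasses


@dataclasses.dataclass(kw_only=True)
class SqlColumns:
    # Hypothetical field standing in for the real column bookkeeping.
    dimension_keys: dict[str, str] = dataclasses.field(default_factory=dict)


@dataclasses.dataclass(kw_only=True)
class SqlJoinsBuilder:
    # Composition: hold a SqlColumns instead of inheriting from it, at the
    # cost of a `.columns` hop at every call site.
    columns: SqlColumns = dataclasses.field(default_factory=SqlColumns)


builder = SqlJoinsBuilder()
builder.columns.dimension_keys["instrument"] = "instrument"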

@TallJimbo (Member, Author) commented:

The need to pull this out happened pretty late in the refactor and I just wasn't looking for further refinements at that stage. I agree composition at least intuitively feels a little better, though I could also imagine it just adding a lot of .columns visual noise in practice, since the usage of the base class on its own is pretty niche.

@TallJimbo (Member, Author) commented:

Punting this one to future tickets.

raise NotImplementedError()


class SingleSelectQueryBuilder(QueryBuilder):

@dhirving (Contributor) commented:

It seems like it might be possible to unify more of this near-duplicate code with UnionQueryBuilder by treating this as a union with only one term, though there are a few corners that will make that difficult.

@TallJimbo (Member, Author) commented:

I can't claim I tried super hard to make that work, and I may have been overly concerned about pessimizing the common case, but there were indeed a lot of small differences that seemed hard to iron out.

@TallJimbo (Member, Author) commented:

Punting this one to future tickets.

json_roundtripped = adapter.validate_json(adapter.dump_json(...))
self.assertIs(json_roundtripped, ...)
python_roundtripped = adapter.validate_python(adapter.dump_python(...))
self.assertIs(python_roundtripped, ...)

@dhirving (Contributor) commented:

There's a missing test case here for adapter.validate_python(...) which I think would fail. This will make it hard to instantiate models using SerializableEllipsis from inside Python rather than from JSON.

@TallJimbo (Member, Author) commented:

Oh, bleh.

I tried rewriting this with __get_pydantic_core_schema__ (since I understand that better), but any solution that fixes this problem seems to fail to work with the union. I'm inclined to go replace ellipsis usage here with a single-variant enum; I had that idea later in the ticket and wondered if it might be better but didn't see a need, but I think this problem puts me on the other side of that fence.

@dhirving (Contributor) commented:

Yeah the enum thing will probably work, or one thing that I often see in JSON is using an object like { "special": "any_type" }, which can't be confused with a string. (Better if you know you need it up front so you end up with a discriminated union instead of an awkward union of string and object, but str | object ends up happening a lot.)

@TallJimbo (Member, Author) commented:

Redone with an enum. Turns out it's important to set the enum value to something that is not a string (I used -1), since that's what Pydantic uses for serialization, and it's not able to handle unions whose members overlap in the JSON space.
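
A minimal sketch of that arrangement; the model and member names are hypothetical, but the mechanics match the description: the enum's value is an int, so it cannot collide with the string members of the union.

import enum

import pydantic


class AnyDatasetType(enum.Enum):
    # Single-variant enum standing in for `...`; the value -1 is an int,
    # so its JSON form cannot be mistaken for a dataset type name.
    ANY = -1


class DatasetSearch(pydantic.BaseModel):
    dataset_type: str | AnyDatasetType


search = DatasetSearch(dataset_type=AnyDatasetType.ANY)
dumped = search.model_dump_json()  # '{"dataset_type":-1}'
assert DatasetSearch.model_validate_json(dumped).dataset_type is AnyDatasetType.ANY
assert DatasetSearch(dataset_type="raw").dataset_type == "raw"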

This is setup work for associating each QueryPlan with multiple
QueryBuilders (to be unioned together).

This obfuscates exactly which part of the plan each method uses and
modifies, but that's about to change dramatically, and in the end
each method will have to see a lot more of the whole anyway. And in
exchange the method signatures get a lot simpler.

This is a major refactor of the many classes involved in translating
butler queries to SQL, including some renaming to reflect new roles.

- The low-level SqlBuilder and SqlJoiner classes have been renamed to
  SqlSelectBuilder and SqlJoinsBuilder, and some of SqlJoinsBuilder
  has been factored out into a base class, SqlColumns.

- The QueryPlan objects have been split up into "analysis" objects
  that are still mostly plan-like, and QueryBuilder objects that have
  both that planning information and one or more SqlSelectBuilder
  objects and a Postprocessing inside them.

- The new QueryBuilder objects are a hierarchy (sketched below):
  there's a QueryBuilder abstract base class and two derived classes:
  SingleSelectQueryBuilder is a refactoring of the code path we had
  before, while UnionQueryBuilder is a UNION ALL over dataset types.

- DirectQueryDriver.build_query is still the main entry point, and
  it's now where the overview docs for the system live.  It delegates
  to methods on the QueryBuilder objects to handle the differences in
  the single-select vs. union cases, and those delegate back to other
  DirectQueryDriver methods for logic that's the same between the two
  cases.
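
A skeletal sketch of that hierarchy; the build method and its return type are illustrative stand-ins, since the real classes carry the analysis objects, select builders, and postprocessing described above.

from abc import ABC, abstractmethod


class QueryBuilder(ABC):
    """Abstract base for turning an analyzed butler query into SQL."""

    @abstractmethod
    def build(self) -> str:
        raise NotImplementedError()


class SingleSelectQueryBuilder(QueryBuilder):
    def build(self) -> str:
        # The pre-existing code path: a single SELECT statement.
        return "SELECT ..."


class UnionQueryBuilder(QueryBuilder):
    def build(self) -> str:
        # One SELECT per dataset type, combined with UNION ALL.
        return " UNION ALL ".join(["SELECT ...", "SELECT ..."])
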
Turns out the Pydantic adapters were buggy in a hard-to-fix way, and
while the enum is a little more verbose, it's more self-describing.

We're only using it with list, dict, and set values, and it's really
only useful when the value type is a mutable container anyway.
@TallJimbo merged commit 6333457 into main on Oct 28, 2024
16 of 18 checks passed
@TallJimbo deleted the tickets/DM-46479 branch on October 28, 2024 at 12:58