
DM-46479: add low-level support for union queries over multiple dataset types #1104

Merged: 10 commits from tickets/DM-46479 into main, Oct 28, 2024

Conversation

@TallJimbo (Member) commented Oct 22, 2024

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes
  • (if changing dimensions.yaml) make a copy of dimensions.yaml in configs/old_dimensions

codecov bot commented Oct 22, 2024

Codecov Report

Attention: Patch coverage is 77.68240% with 156 lines in your changes missing coverage. Please review.

Project coverage is 89.47%. Comparing base (84471a8) to head (4649edc).
Report is 11 commits behind head on main.

Files with missing lines                                  Patch %   Lines
...t/daf/butler/direct_query_driver/_query_builder.py     61.78%    69 Missing and 4 partials ⚠️
...hon/lsst/daf/butler/direct_query_driver/_driver.py     74.70%    30 Missing and 13 partials ⚠️
python/lsst/daf/butler/queries/tree/_query_tree.py        47.36%    7 Missing and 3 partials ⚠️
...st/daf/butler/direct_query_driver/_sql_builders.py     94.61%    4 Missing and 5 partials ⚠️
python/lsst/daf/butler/pydantic_utils.py                  36.36%    7 Missing ⚠️
python/lsst/daf/butler/nonempty_mapping.py                44.44%    4 Missing and 1 partial ⚠️
.../daf/butler/direct_query_driver/_query_analysis.py     94.59%    1 Missing and 1 partial ⚠️
python/lsst/daf/butler/queries/_query.py                  0.00%     1 Missing and 1 partial ⚠️
python/lsst/daf/butler/queries/tree/_base.py              75.00%    2 Missing ⚠️
.../lsst/daf/butler/queries/tree/_column_reference.py     66.66%    1 Missing and 1 partial ⚠️
... and 1 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1104      +/-   ##
==========================================
- Coverage   89.71%   89.47%   -0.24%     
==========================================
  Files         360      361       +1     
  Lines       47451    47741     +290     
  Branches     5734     5794      +60     
==========================================
+ Hits        42571    42717     +146     
- Misses       3518     3640     +122     
- Partials     1362     1384      +22     


@dhirving (Contributor) left a comment

I think this is good to merge as-is so we can unblock Andy and I can start working with it. The refactors all look good and the new code seems to be mostly segregated from existing code paths.

The only comment that I think is a major issue is the deepcopy thing, but that code is not currently being executed so it's not immediately harmful. I can fix it up when I go in there to start using this new functionality.

We'll have to see what happens with performance -- I'd guess that Postgres's query planner is going to treat each member of the union as an independent query. I'm also a little concerned that these queries are going to end up complex and difficult to debug.
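
For a sense of the query shape under discussion, here is a rough SQLAlchemy sketch: one SELECT per dataset type, combined with UNION ALL. The table and column names are illustrative, not the actual butler schema.

import sqlalchemy

# Illustrative schema, not the real butler tables: one tags table per
# dataset type, each queried with the same collection constraint.
metadata = sqlalchemy.MetaData()
tables = {
    name: sqlalchemy.Table(
        f"dataset_tags_{name}",
        metadata,
        sqlalchemy.Column("dataset_id", sqlalchemy.Integer),
        sqlalchemy.Column("collection", sqlalchemy.String),
    )
    for name in ("raw", "calexp")
}
# One SELECT per dataset type, tagged with a literal column so each row
# records which type it came from, then combined with UNION ALL.
selects = [
    sqlalchemy.select(
        sqlalchemy.literal(name).label("dataset_type"),
        table.c.dataset_id,
    ).where(table.c.collection == "run1")
    for name, table in tables.items()
]
union = sqlalchemy.union_all(*selects)
print(union.compile())  # SELECT ... UNION ALL SELECT ...

Each union member is an independent SELECT, which is consistent with the expectation that the planner treats them as independent queries.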

callback and keys.
"""
result = NonemptyMapping[_K, _V](self._default_factory)
result._mapping = copy.deepcopy(self._mapping)

@dhirving (Contributor) commented:

This deepcopy seems hazardous if the inner value type is non-trivial. e.g. this gets called on a NonemptyMapping[str, list[sqlalchemy.ColumnElement]].

I dunno what the behavior of calling deepcopy on a sqlalchemy object is and I'm not sure I want to find out :D The author of sqlalchemy has been quoted as saying:

I think deepcopy is kind of crazy to use ever.

and I can't find any documentation saying it's intended to work.

Maybe we should add more-concrete subclasses for NonemptyMapping[T, list[S]], or just shallow-copy the values to limit the blast radius.

@TallJimbo (Member, Author) commented:

I think we could get away with versions of NonemptyMapping whose values either have a .copy() method (which should be used) or do not (in which case we assume the value is immutable). It might be cleaner overall to just dispatch on that hasattr check than to have subclasses.

I figured the worst-case scenario would be unnecessary copies, and I'd have hoped that library authors for whom the consequences would be more severe would implement __deepcopy__ or __reduce__ to fix it. But it sounds like that's not the case.

@dhirving (Contributor) commented:

The actual worst-case scenario is more like double-freeing of pointers, accidental sharing of sockets, duplicate copies of large caches that are meant to be shared, or spooky action at a distance involving accidental sharing of state between separate copies.

I think the average library author isn't that interested in exploring all the possible interactions of their system with different weird Python double-underscore features -- if everyone dealt with every possible thing that anyone could do to their objects in Python, every library would be 10x bigger. Same reason you wouldn't assume you can subclass an arbitrary library type or use it from multiple threads simultaneously.

@TallJimbo (Member, Author) commented:

Turns out this was even easier than I thought: we're already only using it with set, list, or dict values, and I think it's only useful when the value type is a mutable container anyway. So I just added a Protocol bound on the value type that requires it to have a copy method, and used that.
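
A minimal sketch of what that Protocol bound could look like; the NonemptyMapping internals below are reconstructed from the snippet quoted above rather than copied from the real implementation, and typing.Self assumes Python 3.11+.

from __future__ import annotations

from collections.abc import Callable, Hashable
from typing import Generic, Protocol, Self, TypeVar


class Copyable(Protocol):
    """Values with a copy() method; list, set, and dict all qualify."""

    def copy(self) -> Self: ...


_K = TypeVar("_K", bound=Hashable)
_V = TypeVar("_V", bound=Copyable)


class NonemptyMapping(Generic[_K, _V]):
    def __init__(self, default_factory: Callable[[], _V]) -> None:
        self._default_factory = default_factory
        self._mapping: dict[_K, _V] = {}

    def __getitem__(self, key: _K) -> _V:
        # Create a value on first access, like collections.defaultdict.
        if (value := self._mapping.get(key)) is None:
            value = self._default_factory()
            self._mapping[key] = value
        return value

    def copy(self) -> NonemptyMapping[_K, _V]:
        result = NonemptyMapping[_K, _V](self._default_factory)
        # Shallow-copy each value via its own copy() method instead of
        # deepcopying, so objects held inside (e.g. SQLAlchemy columns)
        # are shared rather than recursively duplicated.
        result._mapping = {k: v.copy() for k, v in self._mapping.items()}
        return result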

"""A struct describing a dataset search joined into a query, after
resolving its collection search path.
"""

name: str
"""Name of the dataset type."""
name: _T

@dhirving (Contributor) commented:

It took me a while to figure out which types this could have, and under which circumstances it was which, and how that interacted with the function overloads associated with this generic.

I wonder if this could just be list[str] instead of needing the generic?

Or maybe two classes like:

class SingleResolvedDatasetSearch:
    name: str
    search: ResolvedDatasetSearch

class UnionResolvedDatasetSearch:
    names: list[str]
    search: ResolvedDatasetSearch

since it seems like we usually want to know which of the generic cases we have, and when we don't, we can pass in the search object instead of the whole thing.

@TallJimbo (Member, Author) commented:

There might be a case where T can be ..., too; I'm not sure. I don't really have much of an opinion in the abstract here; it'd come down to what the options actually look like in practice.

@TallJimbo (Member, Author) commented:

Punting this one to future tickets.

Comment on lines 694 to 749
# Gather the filtered collection search path for each union dataset
# type.
collections_by_dataset_type = defaultdict[str, list[str]](list)
for collection_record, collection_summary in collection_analysis.summaries_by_dataset_type[...]:
    for dataset_type in collection_summary.dataset_types:
        if dataset_type.dimensions == tree.any_dataset.dimensions:
            collections_by_dataset_type[dataset_type.name].append(collection_record.name)
# Reverse the lookup order on the mapping we just made to group
# dataset types by their collection search path. Each such group
# yields an output plan.
dataset_searches_by_collections: dict[tuple[str, ...], ResolvedDatasetSearch[list[str]]] = {}
for dataset_type_name, collection_path in collections_by_dataset_type.items():
    key = tuple(collection_path)
    if (resolved_search := dataset_searches_by_collections.get(key)) is None:
        resolved_search = ResolvedDatasetSearch[list[str]](
            [],
            dimensions=tree.any_dataset.dimensions,
            collection_records=[
                collection_analysis.collection_records[collection_name]
                for collection_name in collection_path
            ],
            messages=[],
        )
        resolved_search.is_calibration_search = any(
            r.type is CollectionType.CALIBRATION for r in resolved_search.collection_records
        )
        dataset_searches_by_collections[key] = resolved_search
    resolved_search.name.append(dataset_type_name)

@dhirving (Contributor) commented:

It can help readability to pull out independent chunks like this to a separate function. I find that it makes it easier to see the high-level structure of the logic without getting bogged down in the details. I like to use Right click -> Refactor -> Extract Method in VSCode to experimentally check candidate chunks -- in this case it comes out cleanly as a function of 2 parameters.
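
For reference, the invert-the-mapping core of that chunk stands alone nicely; here is a self-contained sketch of the grouping pattern with made-up data.

from collections import defaultdict

# Made-up inputs: each dataset type mapped to its filtered collection
# search path.
collections_by_dataset_type = {
    "raw": ["run1", "run2"],
    "calexp": ["run1", "run2"],
    "bias": ["calib"],
}
# Invert the mapping: dataset types sharing a search path are grouped
# together, keyed by the (hashable) tuple form of that path.
dataset_types_by_collections: defaultdict[tuple[str, ...], list[str]] = defaultdict(list)
for dataset_type_name, collection_path in collections_by_dataset_type.items():
    dataset_types_by_collections[tuple(collection_path)].append(dataset_type_name)

assert dataset_types_by_collections == {
    ("run1", "run2"): ["raw", "calexp"],
    ("calib",): ["bias"],
}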

@TallJimbo (Member, Author) commented:

Yeah, it's mostly the number of parameters (especially modified-in-place parameters) that makes me lean towards doing this a bit less than you, but if this one really is just two parameters then I agree it's a good idea.

@TallJimbo (Member, Author) commented:

Done.



@dataclasses.dataclass(kw_only=True)
class SqlColumns:

@dhirving (Contributor) commented:

I like the refactor to pull this stuff out into its own class -- these are logically separate from some of the things they were mixed in with before. I'd like it even better if SqlJoinsBuilder consumed it by composition instead of inheritance. (Looks like maybe you were moving that direction and didn't get that far?)
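
A tiny sketch of the composition alternative; the class names come from the PR, but the field and its type here are hypothetical.

import dataclasses


@dataclasses.dataclass(kw_only=True)
class SqlColumns:
    # Hypothetical field standing in for the real column bookkeeping.
    dimension_keys: dict[str, str] = dataclasses.field(default_factory=dict)


@dataclasses.dataclass(kw_only=True)
class SqlJoinsBuilder:
    # Composition: hold a SqlColumns instead of inheriting from it, at the
    # cost of a `.columns` hop at every call site.
    columns: SqlColumns = dataclasses.field(default_factory=SqlColumns)


builder = SqlJoinsBuilder()
builder.columns.dimension_keys["instrument"] = "instrument"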

@TallJimbo (Member, Author) commented:

The need to pull this out happened pretty late in the refactor and I just wasn't looking for further refinements at that stage. I agree composition at least intuitively feels a little better, though I could also imagine it just adding a lot of .columns visual noise in practice, since the usage of the base class on its own is pretty niche.

@TallJimbo (Member, Author) commented:

Punting this one to future tickets.

raise NotImplementedError()


class SingleSelectQueryBuilder(QueryBuilder):

@dhirving (Contributor) commented:

It seems like it might be possible to unify more of this near-duplicate code with UnionQueryBuilder by treating this as a union with only one term, though there are a few corners that will make that difficult.

@TallJimbo (Member, Author) commented:

I can't claim I tried super hard to make that work, and I may have been overly concerned about pessimizing the common case, but there were indeed a lot of small differences that seemed hard to iron out.

@TallJimbo (Member, Author) commented:

Punting this one to future tickets.

json_roundtripped = adapter.validate_json(adapter.dump_json(...))
self.assertIs(json_roundtripped, ...)
python_roundtripped = adapter.validate_python(adapter.dump_python(...))
self.assertIs(python_roundtripped, ...)

@dhirving (Contributor) commented:

There's a missing test case here for adapter.validate_python(...) which I think would fail. This will make it hard to instantiate models using SerializableEllipsis from inside Python rather than from JSON.

@TallJimbo (Member, Author) commented:

Oh, bleh.

I tried rewriting this with __get_pydantic_core_schema__ (since I understand that better), but any solution that fixes this problem seems to fail to work with the union. I'm inclined to go replace ellipsis usage here with a single-variant enum; I had that idea later in the ticket and wondered if it might be better but didn't see a need, but I think this problem puts me on the other side of that fence.

@dhirving (Contributor) commented:

Yeah the enum thing will probably work, or one thing that I often see in JSON is using an object like { "special": "any_type" }, which can't be confused with a string. (Better if you know you need it up front so you end up with a discriminated union instead of an awkward union of string and object, but str | object ends up happening a lot.)

@TallJimbo (Member, Author) commented:

Redone with an enum. Turns out it's important to set the enum value to something that is not a string (I used -1), since that's what Pydantic uses for serialization, and it's not able to handle unions whose members overlap in the JSON space.
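
A minimal sketch of that arrangement; the model and member names are hypothetical, but the mechanics match the description: the enum's value is an int, so it cannot collide with the string members of the union.

import enum

import pydantic


class AnyDatasetType(enum.Enum):
    # Single-variant enum standing in for `...`; the value -1 is an int,
    # so its JSON form cannot be mistaken for a dataset type name.
    ANY = -1


class DatasetSearch(pydantic.BaseModel):
    dataset_type: str | AnyDatasetType


search = DatasetSearch(dataset_type=AnyDatasetType.ANY)
dumped = search.model_dump_json()  # '{"dataset_type":-1}'
assert DatasetSearch.model_validate_json(dumped).dataset_type is AnyDatasetType.ANY
assert DatasetSearch(dataset_type="raw").dataset_type == "raw"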

This is setup work for associating each QueryPlan with multiple
QueryBuilders (to be unioned together).

This obfuscates exactly which part of the plan each method uses and
modifies, but that's about to change dramatically, and in the end
each method will have to see a lot more of the whole anyway. And in
exchange the method signatures get a lot simpler.

This is a major refactor of the many classes involved in translating
butler queries to SQL, including some renaming to reflect new roles.

- The low-level SqlBuilder and SqlJoiner classes have been renamed to
  SqlSelectBuilder and SqlJoinsBuilder, and some of SqlJoinsBuilder
  has been factored out into a base class, SqlColumns.

- The QueryPlan objects have been split up into "analysis" objects
  that are still mostly plan-like, and QueryBuilder objects that have
  both that planning information and one or more SqlSelectBuilder
  objects and a Postprocessing inside them.

- The new QueryBuilder objects are a hierarchy (sketched below):
  there's a QueryBuilder abstract base class and two derived classes:
  SingleSelectQueryBuilder is a refactoring of the code path we had
  before, while UnionQueryBuilder is a UNION ALL over dataset types.

- DirectQueryDriver.build_query is still the main entry point, and
  it's now where the overview docs for the system live.  It delegates
  to methods on the QueryBuilder objects to handle the differences in
  the single-select vs. union cases, and those delegate back to other
  DirectQueryDriver methods for logic that's the same between the two
  cases.
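
A skeletal sketch of that hierarchy; the build method and its return type are illustrative stand-ins, since the real classes carry the analysis objects, select builders, and postprocessing described above.

from abc import ABC, abstractmethod


class QueryBuilder(ABC):
    """Abstract base for turning an analyzed butler query into SQL."""

    @abstractmethod
    def build(self) -> str:
        raise NotImplementedError()


class SingleSelectQueryBuilder(QueryBuilder):
    def build(self) -> str:
        # The pre-existing code path: a single SELECT statement.
        return "SELECT ..."


class UnionQueryBuilder(QueryBuilder):
    def build(self) -> str:
        # One SELECT per dataset type, combined with UNION ALL.
        return " UNION ALL ".join(["SELECT ...", "SELECT ..."])
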
Turns out the Pydantic adapters were buggy in a hard-to-fix way, and
while the enum is a little more verbose, it's more self-describing.

We're only using it with list, dict, and set values, and it's really
only useful when the value type is a mutable container anyway.
@TallJimbo merged commit 6333457 into main on Oct 28, 2024
16 of 18 checks passed
@TallJimbo deleted the tickets/DM-46479 branch on October 28, 2024 at 12:58