Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support array types #191

Merged
merged 14 commits into from
Dec 1, 2023
Merged

Support array types #191

merged 14 commits into from
Dec 1, 2023

Conversation

plcplc
Copy link
Contributor

@plcplc plcplc commented Nov 28, 2023

What

We now recognize array types in several places:

  • Database introspection columns and comparison operators of array type
  • Columns of array type can be queried
  • Arguments of array type can be input
  • Arrays work in tables and native queries alike

Notable limitations, due in part to the modest expressiveness of the ndc-spec in its current state:

  • There is no way to filter array fields
  • There is no way to project fields from the elements of an array field
  • We only support monomorphic comparison operators

How

The introspection query has had its types sub-query split into scalar_types and array_types subqueries. We benefit from the fact that postgres only supports one-dimensional array types (i.e., array dimensionality is a dynamic aspect of an array value), so we don't have to recurse to construct the json-representation of array types.

Most of the rust-language changes of this PR follow mechanically from changing metadata::ScalarType to the new metadata::Type, which also models array types.

However, since our field types now include arrays we need to observe that certain parts (e.g. ComparisonOperators) only accept base scalar types (e.g. in version1.rs, filtering.rs).

In defining the new type for column types we also take care to maintain backwards compatibility, which means deserialization gets a bit more verbose than just the automatically derived implementations we used to get by with (see metadata/database.rs).

Since we also support array-typed arguments we have to be able to output SQL array constructors (see values.rs, sql/convert.rs).

We introduce two new tests:

  • One test which just verifies that we can output an array column from a native query
  • Another which verifies that we can take an array-valued argument to a native query.

Update

Altering the representation of column types in the configuration constitues a breaking change. Even if we are still able to ingest the historic configuration formats these are not described by the published json schema, and we would only ever output in the newest format.

In order to address the issue of versioning the original contents of this PR is being split in two:

  • The first part (this PR) contains the logic for dealing with array types, and includes a new level of indirection between the api transport types and those used internally.
  • A second follow-up PR will introduce a version 2 of the configuration format, update introspection to include array types, and add tests of the same.

An interesting consequence of introducing the distinction between api transport types and internal business types is that the various tables.json files of the query-engine translation tests contain json-serialized data of business types.

Worth highlighting is this means we cannot simply copy a subsection of a deployment configuration into a tables.json file like we used to, as the business types do not necessarily correspond to any api-version.

While this may seem confusing it also has some benefits:

  • Test files have no version drift. They are always representative of the current internal business data model
  • The test remains coupled with the query-engine crate rather than the connector crate.

Updating test files can become a chore too. There is no tool that can upgrade them automatically like we do for deployments, and it would probably not be worth the effort to try and make one.

What I ended up doing in the particular case of this PR was a small shell script:

for f in $(fd -t f 'tables\.json')
do
  jq 'walk(if type == "object" and has("type") then .type |= {scalarType: .} else . end)' < "$f" \
  | sponge "$f"
done

The jq script matches any object that has a type key and wraps its value with a {"scalarType": ..} object. Happily this covered the full breadth of my change.

sponge is a tool I just discovered which buffers all the stdin it receives and writes everything into the file it's given. Essentially this endows our command with the ability to make in-place edits with shell pipe filters.

@plcplc plcplc disabled auto-merge November 29, 2023 11:32
@plcplc plcplc enabled auto-merge December 1, 2023 12:09
RuntimeConfiguration {
metadata: &self.config.metadata,
metadata: metadata_to_current(&self.config.metadata),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do this mapping on every request don't we? I wonder if in future we do it whenever RawConfiguration becomes Configuration (one of Connector trait functions run by the config server IIRC), and keep RuntimeConfiguration inside Configuration.

Not in this PR, mind. Probably worth benchmarking it too to check a speculative speed up becomes a real one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this makes sense.

IIUC, what the user will consider as being "their configuration metadata" is the json-serialized Configuration, not just the RawConfiguration, meaning that the whole thing is what needs to be able to exist in different versions.

It's still a bit nebulous to me though, so I could be wrong.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All I mean is, currently we take the whole Configuration and derive RuntimeConfiguration from it on each request, surely we could just do this once?

It's not really relevant to this PR, mind, just something I noticed because you made a change near the code that does this.

Copy link
Contributor

@danieljharvey danieljharvey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all seems sensible. Left a couple of questions, but they could probably both be answered in a couple of follow-up tidy ups or optimisation passes at some later point.

@plcplc plcplc added this pull request to the merge queue Dec 1, 2023
Merged via the queue into main with commit addc37f Dec 1, 2023
26 checks passed
@plcplc plcplc deleted the plc/issues/NDAT-950 branch December 1, 2023 15:22
github-merge-queue bot pushed a commit that referenced this pull request Dec 12, 2023
### What

This PR adds a new version (`"2"`) of the deployment configuration data
format.

This version of the configuration is capable of expressing array types
in collections and arguments.

Since this is the first time a new version is introduced there are a lot
of changes the only purpose of which is to distinguish between versions.

Only the infrastructure-related shell of the connector is aware of
different versions of deployment configurations existing. The core of
the connector only works with a single internal version.

This PR is also the one to introduce tests of array types. In hindsight
this ought to have been possible in the previous PR that introduced the
internal types and transformations (#191).

Note that there is not yet any automated way to upgrade a configuration
to a newer version, but this will be introduced shortly.

This PR also adds a changelog entry.

### How

The file `version2.rs` is a duplicate of `version1.rs`, which has been
adapted to use the new data types (incidentally these are just the ones
of the internal model).

`configuration.sql` now exists as `version1.sql` and `version2.sql`
respectively, since these have different capabilities. This is because
`version2.sql` introduces the ability to introspect array types.

`configuration.rs` now exposes `RawConfiguration` and `Configuration`
types which are enums of all the supported versions (currently 1 and
"2").
One big wart on the implementation is that serde and schemars are unable
to derive trait implementations for these types correctly, since they
only support strings as enum tags, and we used a number literal for
version 1.

Once we drop support of version 1 completely we can remove the manually
implemented instances.

The various `Connector` trait implementations now explicitly work on the
internal representation of a configuration, `RuntimeConfiguration`.

---------

Co-authored-by: Daniel Harvey <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants