Schema inference: `Shape::widen()` #1126

jshearer · 2023-07-26T22:04:29Z

Implement `Shape::widen`, which widens a `Shape` to fit a provided `AsNode`. This lays the groundwork for performantly keeping track of a running inferred schema in the combiner.

Of note, the schemas "inferred" by widen() are maximally strict:

* By default, newly inferred objects have `additionalProperties: false`.
* Object fields initially have `required: true` until we encounter a document missing that field, at which point the field is downgraded to `required: false`. Note: This isn't required AFAICT. Should we really do this?
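
A minimal, self-contained sketch of the `required` rule described above. It does not use the real `Shape` / `ObjShape` types from `crates/doc`: `ToyObjShape`, `ToyField`, and `widen_keys` are hypothetical names, and only field presence is modeled (no types, no nesting).

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for one inferred object property.
#[derive(Debug, Clone, PartialEq)]
struct ToyField {
    is_required: bool,
}

/// Hypothetical stand-in for an inferred object shape.
/// New objects are implicitly `additionalProperties: false`.
#[derive(Debug, Clone, PartialEq, Default)]
struct ToyObjShape {
    /// Property name to field, kept sorted by the map.
    properties: BTreeMap<String, ToyField>,
    /// Number of documents folded into this shape so far.
    docs_seen: usize,
}

impl ToyObjShape {
    /// Widen the shape so that an object with exactly these keys fits.
    fn widen_keys(&mut self, keys: &[&str]) {
        // A known field that this document lacks can no longer be required.
        for (name, field) in self.properties.iter_mut() {
            if !keys.contains(&name.as_str()) {
                field.is_required = false;
            }
        }
        // A new field is required only if no earlier document could have
        // omitted it, which in this sketch means this is the first document.
        let first_doc = self.docs_seen == 0;
        for key in keys {
            self.properties
                .entry(key.to_string())
                .or_insert(ToyField { is_required: first_doc });
        }
        self.docs_seen += 1;
    }
}
```

In this sketch, widening with `["id", "name"]` and then `["id", "email"]` leaves only `id` required: `name` is downgraded because the second document lacks it, and `email` is never required because the first document lacked it.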

The next piece of work is implementing the stubbed-out `enforce_field_count_limits`. With that, we should have everything we need to implement the running inferred schema and emit it to the ops logs. Also of note is that the `reduce: flow-inferred-schema-merge` reduction annotation implementation can and should also use `enforce_field_count_limits`, as it's possible for intra-transaction documents to not exceed the limits while inter-transaction documents do, and we care about limiting both of those cases.


crates/doc/Cargo.toml

crates/doc/src/inference.rs

psFried · 2023-07-27T16:29:05Z

What's the plan for determining whether a shape was actually modified by calls to widen? Are we keeping two copies and doing an equality comparison? It'd be nice if widen could return a bool indicating whether or not it was actually mutated by the operation, though I'm also recognizing that that approach might add to an already considerable level of complexity in that function.

We need this in order to determine whether or not to actually emit the new schema after draining the combiner. Totally fine to implement it in a subsequent PR, IMO, as long as we have a plan for it.

jshearer · 2023-07-27T17:57:18Z

> What's the plan for determining whether a shape was actually modified by calls to widen?

I'm inclined to keep it simple for the moment and give each transaction a copy of the running shape, which we can then compare after the transaction commits. Yes, it's less efficient than keeping track of whether the shape was modified or not, but since the comparison happens once per transaction rather than once per document, I don't think it'd be that painful performance-wise.
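
A rough sketch of the copy-and-compare idea, assuming the shape type implements `Clone` and `PartialEq`. `ToyShape` and `run_transaction` are hypothetical names for illustration, not the combiner's actual API, and the toy shape only tracks a maximum string length.

```rust
/// Hypothetical stand-in for the running inferred shape.
#[derive(Debug, Clone, PartialEq, Default)]
struct ToyShape {
    max_len: usize,
}

impl ToyShape {
    /// Widen the shape so that `doc` fits.
    fn widen(&mut self, doc: &str) {
        self.max_len = self.max_len.max(doc.len());
    }
}

/// Widen a per-transaction copy for every document, then compare once at
/// commit. Returns true if the running shape changed and should be emitted.
fn run_transaction(running: &mut ToyShape, docs: &[&str]) -> bool {
    let mut candidate = running.clone();
    for doc in docs {
        candidate.widen(doc);
    }
    // One equality check per transaction, not one per document.
    let changed = candidate != *running;
    if changed {
        *running = candidate;
    }
    changed
}
```

The trade-off is one clone and one comparison per transaction in exchange for keeping change tracking out of `widen` itself.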

jgraettinger

More feedback for you

crates/doc/Cargo.toml

crates/doc/src/inference.rs

jgraettinger

LGTM % comments below

We can land this, but the next big question is: what does this do to performance? This adds more overhead in an area that's already on the critical path, so we'll have to accept some regression, but we do need to quantify "how much?".

There's a combiner benchmark that models a parameterized mix of real-world documents and can be further extended to drive widen and quantify some of this impact.

jgraettinger · 2023-08-03T14:08:03Z

crates/doc/src/inference.rs

+        N: AsNode,
+    {
+        use crate::{Field, Fields};
+        // If a particular location defines `additionalProperties` as a subschema


~~nit:~~ this comment could be a bit crisper. For example:

Two possible cases:

* This schema has individual `properties` as well as `additionalProperties: false`, meaning no other properties are possible. We recursively widen each of the input `fields` into their respective `properties`, adding new ones as required.
* This schema has `additionalProperties` _other_ than `false`. In this case, we apply each of `fields` to widen matched `properties`, and otherwise widen `additionalProperties` where not matched.

Right, and this is the subtle power of good comments. Writing this out made me realize that our behavior is incorrect in the general case, where we're widening an ObjShape that may have started as an actual schema.

To be fully correct, we would need to look for matched properties and even patternProperties to widen before we resort to widening additionalProperties.

In your shoes I would correct the implementation now -- because I'll otherwise forget about it -- but I'm okay punting given the time pressure for this feature. If we do punt, please add a good comment here and issue on this defect.

Fixed and tested that:

* We widen explicitly named properties even if `additionalProperties` is non-false
* We widen matching `patternProperties` before widening `additionalProperties`
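
The resolution order discussed in this thread can be sketched as follows. This is an illustration only and not the code in `inference.rs`: `ToyObjShape`, `Target`, and `target_for` are hypothetical, and a plain prefix match stands in for real `patternProperties` regular expressions.

```rust
/// Which part of an object schema a field should widen.
#[derive(Debug, PartialEq)]
enum Target {
    /// Index of an explicitly named property.
    Property(usize),
    /// Index of the first matching pattern.
    Pattern(usize),
    /// The `additionalProperties` subschema.
    Additional,
    /// Nothing matched and `additionalProperties` is false: add a new property.
    NewProperty,
}

/// Hypothetical stand-in for an object shape.
struct ToyObjShape {
    /// Explicit property names, kept sorted.
    properties: Vec<String>,
    /// Prefixes standing in for `patternProperties` regexes (an assumption).
    pattern_prefixes: Vec<String>,
    /// True if `additionalProperties` is a subschema other than `false`.
    has_additional: bool,
}

impl ToyObjShape {
    /// Decide where `field` should be widened into.
    fn target_for(&self, field: &str) -> Target {
        // 1. Explicitly named properties always win.
        if let Ok(i) = self.properties.binary_search_by(|p| p.as_str().cmp(field)) {
            return Target::Property(i);
        }
        // 2. Then the first matching pattern.
        if let Some(i) = self
            .pattern_prefixes
            .iter()
            .position(|p| field.starts_with(p.as_str()))
        {
            return Target::Pattern(i);
        }
        // 3. Only then fall back to `additionalProperties`, if it is non-false.
        if self.has_additional {
            Target::Additional
        } else {
            Target::NewProperty
        }
    }
}
```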

jgraettinger · 2023-08-03T14:21:17Z

crates/doc/src/inference.rs

+impl Shape {
+    // Widen a Shape to make the provided AsNode value fit.
+    // Returns a hint if some locations might exceed their maximum allowable size.
+    // NOTE: If a particular location defines `additionalProperties` as a subschema, don't


nit: this comment / NOTE doesn't make sense here anymore.

I actually decided to move the docstring from ObjShape::widen here instead since this is a public function so it'll be more visible and explains what's going on better.


* Infer string formats following similar logic to `is_required`: the first string gets inferred, subsequent ones get checked, and after a non-conforming string causes the format to drop off, don't re-infer it
* Reduce some nesting in `ObjShape::widen`
* Widen array min and max length
* Correctly detect `integer` and `fractional` types
* Tests
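
The string-format rule in the first bullet can be read as a small state machine. The sketch below is an assumption about how that might look, not the actual implementation: `FormatState`, `Format`, `detect`, and `widen_format` are hypothetical names, and only a single integer-like format is detected.

```rust
/// Hypothetical set of detectable string formats (just one here).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Format {
    Integer,
}

/// Hypothetical tri-state for the inferred format of a string location.
#[derive(Debug, Clone, PartialEq)]
enum FormatState {
    /// No string has been seen yet.
    Unseen,
    /// Every string seen so far conforms to this format.
    Inferred(Format),
    /// A non-conforming string was seen; the format is never re-inferred.
    Dropped,
}

/// Detect the format of a single string, if any.
fn detect(s: &str) -> Option<Format> {
    if s.parse::<i64>().is_ok() {
        Some(Format::Integer)
    } else {
        None
    }
}

/// Fold one more string into the running format state.
fn widen_format(state: FormatState, s: &str) -> FormatState {
    match state {
        // The first string decides the initial format.
        FormatState::Unseen => match detect(s) {
            Some(f) => FormatState::Inferred(f),
            None => FormatState::Dropped,
        },
        // Later strings are only checked against the inferred format.
        FormatState::Inferred(f) if detect(s) == Some(f) => FormatState::Inferred(f),
        // Any mismatch drops the format for good.
        _ => FormatState::Dropped,
    }
}
```

Keeping `Dropped` distinct from `Unseen` is what prevents a later conforming string from re-inferring a format that earlier documents already ruled out.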


jshearer requested review from jgraettinger and psFried July 26, 2023 22:04

jgraettinger reviewed Jul 26, 2023

jshearer requested a review from jgraettinger July 28, 2023 15:46

jgraettinger reviewed Jul 28, 2023

jshearer mentioned this pull request Jul 28, 2023

feature: Teach the combiner to keep track of a running inferred schema and log whenever it changes. #1128

Merged

jgraettinger approved these changes Aug 3, 2023

jshearer force-pushed the feature/shape_widening branch from d66e69e to 04eba63 Compare August 4, 2023 16:19

jshearer added 11 commits August 4, 2023 12:20

PR review feedback:

8e2e6ce

* Factor out `ObjShape::widen`
* Refactor to be zero-cost by default
* Ensure `ObjShape.properties` stays sorted
* Get rid of the helper function `Shape::widen_inner`
* Set `is_required` for new fields properly based on whether they have always been present or not

Implement Shape::enforce_field_count_limits, and test it

80e620a

fix: Make Shape::widen and Shape::enforce_field_count_limits public

db38eec

fix: Limit field counts for objects inside arrays

a26b89d

fix: Recur in Shape::enforce_field_count_limits() even when we're already in a location that's getting squashed because we also need to ensure that the newly-squashed `additionalProperties` isn't _also_ excessively large

3d5bf01

fix: forgot to commit Set::for_number

e53ec23

fix: Support widening string formats correctly

612df07

fix: alphabetize pretty_assertions in Cargo.toml

9d636ee

Update logic to handle patternProperties, and to widen explicit properties first, even if `additionalProperties` is defined.

c895a0e

jshearer force-pushed the feature/shape_widening branch from 04eba63 to c895a0e Compare August 4, 2023 16:20

jshearer merged commit ccf622e into master Aug 4, 2023
4 checks passed

jshearer deleted the feature/shape_widening branch August 4, 2023 16:36
