Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Schema inference: Shape::widen() #1126

Merged
merged 11 commits into from
Aug 4, 2023
Merged

Schema inference: Shape::widen() #1126

merged 11 commits into from
Aug 4, 2023

Conversation

jshearer
Copy link
Contributor

@jshearer jshearer commented Jul 26, 2023

Implement Shape::widen which enables widening a Shape to fit a provided AsNode. This sets the groundwork for permanently keeping track of a running inferred schema in the combiner.

Of note, the schemas "inferred" by widen() are maximally strict:

  • By default, newly inferred objects have have additionalProperties: false.
  • Object fields initially have required: true until we encounter a document missing that field, at which point it will be downgraded to required: false. Note: This isn't required AFAICT. Should we really do this?

The next piece of work is implementing the stubbed-out enforce_field_count_limits. With that, we should have everything we need to implement the running inferred schema and emit it to the ops logs. Also of note is that the reduce: flow-inferred-schema-merge reduction annotation implementation can and should also use enforce_field_count_limits, as it's possible for itra-transaction documents to not exceed the limits while inter-transaction documents do, and we care about limiting both of those cases.


This change is Reviewable

crates/doc/Cargo.toml Outdated Show resolved Hide resolved
crates/doc/src/inference.rs Outdated Show resolved Hide resolved
crates/doc/src/inference.rs Outdated Show resolved Hide resolved
crates/doc/src/inference.rs Outdated Show resolved Hide resolved
crates/doc/src/inference.rs Show resolved Hide resolved
crates/doc/src/inference.rs Outdated Show resolved Hide resolved
crates/doc/src/inference.rs Outdated Show resolved Hide resolved
crates/doc/src/inference.rs Outdated Show resolved Hide resolved
crates/doc/src/inference.rs Outdated Show resolved Hide resolved
crates/doc/src/inference.rs Outdated Show resolved Hide resolved
@psFried
Copy link
Member

psFried commented Jul 27, 2023

What's the plan for determining whether a shape was actually modified by calls to widen? Are we keeping two copies and doing an equality comparison? It'd be nice if widen could return a bool indicating whether or not it was actually mutated by the operation, though I'm also recognizing that that approach might add to an already considerable level of complexity in that function.

We need this in order to determine whether or not to actually emit the new schema after draining the combiner. Totally fine to implement it in a subsequent PR, IMO, as long as we have a plan for it.

@jshearer
Copy link
Contributor Author

What's the plan for determining whether a shape was actually modified by calls to widen?

I'm inclined to keep it simple for the moment and give each transaction a copy of the running shape, which we can then compare after the transaction commits. Yes it's less efficient than keeping track of whether the shape was modified or not, but since it only happens once every transaction vs once for every document, I don't think it'd be that painful performance-wise.

Copy link
Member

@jgraettinger jgraettinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More feedback for you

crates/doc/Cargo.toml Outdated Show resolved Hide resolved
crates/doc/src/inference.rs Outdated Show resolved Hide resolved
crates/doc/src/inference.rs Outdated Show resolved Hide resolved
crates/doc/src/inference.rs Outdated Show resolved Hide resolved
crates/doc/src/inference.rs Outdated Show resolved Hide resolved
crates/doc/src/inference.rs Show resolved Hide resolved
crates/doc/src/inference.rs Outdated Show resolved Hide resolved
crates/doc/src/inference.rs Outdated Show resolved Hide resolved
crates/doc/src/inference.rs Show resolved Hide resolved
crates/doc/src/inference.rs Outdated Show resolved Hide resolved
Copy link
Member

@jgraettinger jgraettinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM % comments below

We can land this, but the next big question is, what does this do to performance? This is adding more overhead in an area that's already critical path, so we'll have to accept some regression, but do need to quantify "how much?".

There's a combiner benchmark that models a parameterized mix of real-world documents and can be further extended to drive widen and quantify some of this impact.

N: AsNode,
{
use crate::{Field, Fields};
// If a particular location defines `additionalProperties` as a subschema
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this comment could be a bit crisper. For example:

Two possible cases:
* This schema has individual `properties` as well as `additionalProperties: false`,
   meaning no other properties are possible. We recursively widen each of the input
  `fields` into their respective `properties`, adding new ones as required.
* This schema has `additionalProperties` _other_ than `false`.
   In this case, we apply each of `fields` to widen matched `properties`,
   and otherwise widen `additionalProperties` where not matched.

Right, and this is the subtle power of good comments. Writing this out made me realize that our behavior is incorrect in the general case, where we're widening an ObjShape that may have started as an actual schema.

To be fully correct, we would need to look for matched properties and even patternProperties to widen before we resort to widening additionalProperties.

In your shoes I would correct the implementation now -- because I'll otherwise forget about it -- but I'm okay punting given the time pressure for this feature. If we do punt, please add a good comment here and issue on this defect.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed and tested that:

  • We widen explicitly named properties even if additionalProperties is non-false
  • We widen matching patternProperties before widening additionalProperties

impl Shape {
// Widen a Shape to make the provided AsNode value fit.
// Returns a hint if some locations might exceed their maximum allowable size.
// NOTE: If a particular location defines `additionalProperties` as a subschema, don't
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this comment / NOTE doesn't make sense here anymore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually decided to move the docstring from ObjShape::widen here instead since this is a public function so it'll be more visible and explains what's going on better.

crates/doc/src/inference.rs Outdated Show resolved Hide resolved
crates/doc/src/inference.rs Show resolved Hide resolved
… fit a provided `AsNode`. This sets the groundwork for performantly keeping track of a running inferred schema in the combiner.

Of note, the schemas "inferred" by `widen()` are maximally strict:
* By default, newly inferred objects have have `additionalProperties: false`.
* Object fields initially have `required: true` until we encounter a document missing that field, at which point it will be downgraded to `required: false`

The next piece of work is implementing the stubbed-out `enforce_field_count_limits`. With that, we should have everything we need to implement the running inferred schema and emit it to the ops logs. Also of note is that the `reduce: flow-inferred-schema-merge` reduction annotation implementation can and should also use `enforce_field_count_limits`, as it's possible for itra-transaction documents to not exceed the limits while inter-transaction documents do, and we care about limiting both of those cases.
* Factor out `ObjShape::widen`
* Refactor to be zero-cost by default
* Ensure ObjShape.properties stays sorted
* Get rid of the helper function `Shape::widen_inner`
* Set `is_required` for new fields properly based on whether they have always been present or not
* Infer string formats following similar logic to `is_required`: the first string gets inferred, subsequent ones get checked, and after a non-conforming string causes the format to drop off, don't re-infer it
* Reduce some nesting in `ObjShape::widen`
* Widen array min and max length
* Correctly detect `integer`, and `fractional` types
* Tests
…lready in a location that's getting squashed because we also need to ensure that the newly-squashed `additionalProperties` isn't _also_ excessively large
…perties first, even if `additionalProperties` is defined.
@jshearer jshearer merged commit ccf622e into master Aug 4, 2023
4 checks passed
@jshearer jshearer deleted the feature/shape_widening branch August 4, 2023 16:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants