-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Schema inference: Shape::widen()
#1126
Conversation
What's the plan for determining whether a shape was actually modified by calls to We need this in order to determine whether or not to actually emit the new schema after draining the combiner. Totally fine to implement it in a subsequent PR, IMO, as long as we have a plan for it. |
I'm inclined to keep it simple for the moment and give each transaction a copy of the running shape, which we can then compare after the transaction commits. Yes it's less efficient than keeping track of whether the shape was modified or not, but since it only happens once every transaction vs once for every document, I don't think it'd be that painful performance-wise. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More feedback for you
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM % comments below
We can land this, but the next big question is, what does this do to performance? This is adding more overhead in an area that's already critical path, so we'll have to accept some regression, but do need to quantify "how much?".
There's a combiner benchmark that models a parameterized mix of real-world documents and can be further extended to drive widen
and quantify some of this impact.
crates/doc/src/inference.rs
Outdated
N: AsNode, | ||
{ | ||
use crate::{Field, Fields}; | ||
// If a particular location defines `additionalProperties` as a subschema |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this comment could be a bit crisper. For example:
Two possible cases:
* This schema has individual `properties` as well as `additionalProperties: false`,
meaning no other properties are possible. We recursively widen each of the input
`fields` into their respective `properties`, adding new ones as required.
* This schema has `additionalProperties` _other_ than `false`.
In this case, we apply each of `fields` to widen matched `properties`,
and otherwise widen `additionalProperties` where not matched.
Right, and this is the subtle power of good comments. Writing this out made me realize that our behavior is incorrect in the general case, where we're widening an ObjShape
that may have started as an actual schema.
To be fully correct, we would need to look for matched properties
and even patternProperties
to widen before we resort to widening additionalProperties
.
In your shoes I would correct the implementation now -- because I'll otherwise forget about it -- but I'm okay punting given the time pressure for this feature. If we do punt, please add a good comment here and issue on this defect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed and tested that:
- We widen explicitly named properties even if
additionalProperties
is non-false - We widen matching
patternProperties
before wideningadditionalProperties
crates/doc/src/inference.rs
Outdated
impl Shape { | ||
// Widen a Shape to make the provided AsNode value fit. | ||
// Returns a hint if some locations might exceed their maximum allowable size. | ||
// NOTE: If a particular location defines `additionalProperties` as a subschema, don't |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this comment / NOTE doesn't make sense here anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually decided to move the docstring from ObjShape::widen
here instead since this is a public function so it'll be more visible and explains what's going on better.
d66e69e
to
04eba63
Compare
… fit a provided `AsNode`. This sets the groundwork for performantly keeping track of a running inferred schema in the combiner. Of note, the schemas "inferred" by `widen()` are maximally strict: * By default, newly inferred objects have have `additionalProperties: false`. * Object fields initially have `required: true` until we encounter a document missing that field, at which point it will be downgraded to `required: false` The next piece of work is implementing the stubbed-out `enforce_field_count_limits`. With that, we should have everything we need to implement the running inferred schema and emit it to the ops logs. Also of note is that the `reduce: flow-inferred-schema-merge` reduction annotation implementation can and should also use `enforce_field_count_limits`, as it's possible for itra-transaction documents to not exceed the limits while inter-transaction documents do, and we care about limiting both of those cases.
* Factor out `ObjShape::widen` * Refactor to be zero-cost by default * Ensure ObjShape.properties stays sorted * Get rid of the helper function `Shape::widen_inner` * Set `is_required` for new fields properly based on whether they have always been present or not
* Infer string formats following similar logic to `is_required`: the first string gets inferred, subsequent ones get checked, and after a non-conforming string causes the format to drop off, don't re-infer it * Reduce some nesting in `ObjShape::widen` * Widen array min and max length * Correctly detect `integer`, and `fractional` types * Tests
…lready in a location that's getting squashed because we also need to ensure that the newly-squashed `additionalProperties` isn't _also_ excessively large
…perties first, even if `additionalProperties` is defined.
04eba63
to
c895a0e
Compare
Implement
Shape::widen
which enables widening aShape
to fit a providedAsNode
. This sets the groundwork for permanently keeping track of a running inferred schema in the combiner.Of note, the schemas "inferred" by
widen()
are maximally strict:additionalProperties: false
.required: true
until we encounter a document missing that field, at which point it will be downgraded torequired: false
. Note: This isn't required AFAICT. Should we really do this?The next piece of work is implementing the stubbed-out
enforce_field_count_limits
. With that, we should have everything we need to implement the running inferred schema and emit it to the ops logs. Also of note is that thereduce: flow-inferred-schema-merge
reduction annotation implementation can and should also useenforce_field_count_limits
, as it's possible for itra-transaction documents to not exceed the limits while inter-transaction documents do, and we care about limiting both of those cases.This change is