Binding Behavior Updates #1219
-
Thinking through some of the implications of this:
-
I like the idea of adding an explicit
-
Meta-comment: while we discuss the proposal in the comments, would you please edit the proposal with updates and outcomes so that it stays evergreen?
A few observations:
-
Given the addition of
-
I have updated the original write-up per our discussion here and other VC discussions:
There are a couple of areas IMO that need further definition and discussion:
-
A couple of additional updates, based on further discussion:
-
A few more updates per separate discussions:
-
High-level issue for tracking the work discussed here: #1276
-
Tracking issue for implementation: #1276
At a high level, these changes support situations where a way to easily "reset" a binding for a capture or materialization is needed. An example for captures is a captured table undergoing an alteration that changes the types & values of a column, such that the table needs to be re-backfilled in order to capture the altered values. For materializations, an example is a field type changing in an incompatible way due to widening from schema inference, e.g. `{ type: integer }` -> `{ type: [integer, object] }`,
requiring a new column type & re-materialization of the current table for the materialization of that collection.

The new capabilities the changes described here will enable are:

- Re-starting an individual capture binding from scratch with a single publication.
- Re-starting an individual materialization binding into the same destination table, re-creating that table as needed.
- Re-starting a derivation transform from the beginning of its source collection.
A materialization can currently re-materialize a collection from the beginning into a new destination table, and that capability will remain unchanged.
Note: The word "table" is commonly used when describing the source of data for a capture binding and the destination for data of a materialization binding, but these changes are not strictly limited to table-based systems. For example, this would also apply to an ElasticSearch materialization that uses an "index" instead of a table, or a Kafka capture that reads from "topics".
There will be additional scope and design needed to expose these capabilities to the user, e.g. via the UI, that is not discussed in detail here.
Current Behavior
It is technically possible today to reset a capture binding for ~most captures by removing the binding from the spec, publishing, re-adding the binding, and publishing again. Each connector must implement logic to identify bindings that have been removed by comparing the spec sent to the connector with a previously persisted checkpoint, and then remove the state for the removed binding. Not all captures implement this feature and there is not strict consistency in precisely how it is implemented. It is not implemented by any materialization connectors, and there isn't a straightforward way for it to be implemented for materializations.
Some relevant details of how connectors manage state: a capture's driver checkpoint is effectively a map of `<table> -> <state>`, where `<table>` is some unique identifier for the source within the captured system that corresponds to the `ResourcePath`, for example `namespace.tablename`, or maybe just `tablename`. Captures have full control over their state persisted in these checkpoints.
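For illustration, a current-style driver checkpoint might look something like this sketch; the state values (`cursor`, `backfillComplete`) are hypothetical, since each connector chooses its own state shape:

```json
{
  "mySchema.myTable": { "cursor": "0/1A2B3C4" },
  "mySchema.otherTable": { "backfillComplete": true }
}
```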
Proposed Changes
Capture Checkpoints
These changes are largely motivated by the desire to make resetting a binding easier (click a single button to reset), and to enforce consistency in how it works across all captures. Since it is technically possible for a capture connector to reset a binding by examining its previous driver checkpoint and the provided spec, these changes aren't absolutely necessary. But it is difficult to require every capture connector to always implement this in a consistent way.
Change 1: Re-starting a capture binding should be possible with a single publication of an updated spec. There must be a way to convey the desire for the connector to forget everything it knew about the binding previously and start from scratch on a go-forward basis. A distinct `backfillVersion` property of the binding's specification that is incorporated into the key for the binding in the capture's state could accomplish this: to re-capture the binding, change the binding's `backfillVersion` and publish the spec. The `backfillVersion` will be a monotonically increasing integer value.

Change 2: The Flow runtime should prune state checkpoints so that each capture connector doesn't have to. This will ensure consistency across all capture connectors, including those implementing the native Flow protocol written in Go or some other language, as well as connectors built to other protocols through an appropriate translation layer (ATF, etc... maybe). For this to work, there needs to be a common structure to a capture checkpoint that can be used by both the runtime and the capture connector.
A binding key is the proposed mechanism for this: it will be formed by joining the resource path of the binding with the `backfillVersion` of the binding to form a unique identifier for the binding, so changing the `backfillVersion` will force a fresh start of the binding. As a sketch, the checkpoint might be a serialized JSON object that includes a key of `bindingStateV1`, which is itself an object having keys like `<resourcePath>/<backfillVersion>`. As part of sending the `Open` message to the connector, the runtime could then inspect the last committed checkpoint and remove any keys in `bindingStateV1` that aren't part of the current specification. The checkpoint could contain other keys/values as the connector requires, but the `bindingStateV1` property is specifically for automatically pruned binding states.

The binding key will be added to the connector protocol and computed by the runtime. It will be a string to allow for flexibility in how it is provided.
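As a rough sketch of that pruning step (function and variable names here are hypothetical, not the actual runtime implementation):

```go
package main

import "encoding/json"

// pruneBindingStates drops entries of bindingStateV1 whose binding key is no
// longer part of the current specification. A sketch only: reading the last
// committed checkpoint and sending the Open message are omitted.
func pruneBindingStates(checkpoint map[string]json.RawMessage, activeBindingKeys map[string]bool) error {
	raw, ok := checkpoint["bindingStateV1"]
	if !ok {
		return nil // the connector hasn't opted in to runtime-managed pruning
	}
	var states map[string]json.RawMessage
	if err := json.Unmarshal(raw, &states); err != nil {
		return err
	}
	for key := range states {
		// A key disappears when its binding is removed, or when a changed
		// backfillVersion produces a different binding key.
		if !activeBindingKeys[key] {
			delete(states, key)
		}
	}
	pruned, err := json.Marshal(states)
	if err != nil {
		return err
	}
	checkpoint["bindingStateV1"] = pruned
	return nil
}
```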
A snippet from an example specification with the new `backfillVersion` field is shown below:
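This is a sketch assuming a Postgres-like capture; the `resource` fields (`namespace`, `stream`) and names are hypothetical and vary by connector:

```yaml
bindings:
  - resource:
      namespace: mySchema
      stream: myTable
    backfillVersion: 1
    target: acmeCo/myTable
```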
The driver checkpoint for a capture that is fully opted in to runtime-managed pruning could look like this:
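Again a sketch; the state under each binding key (`cursor` here) and the contents of `otherStuff` are hypothetical, being whatever the connector chooses to persist:

```json
{
  "bindingStateV1": {
    "mySchema.myTable/1": { "cursor": "0/1A2B3C4" }
  },
  "otherStuff": { "connectorSpecific": true }
}
```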
For the capture to access the state of `mySchema.myTable`, it would use the `bindingKey` from the runtime (`mySchema.myTable/1`, composed of the resource path and `backfillVersion`) and get the state from `bindingStateV1`. Note the other top-level property `"otherStuff"`, which could be anything, even including binding state if the capture has not opted in to runtime-managed state pruning.
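A connector-side sketch of that lookup (the struct and function names are hypothetical):

```go
package main

import "encoding/json"

// driverCheckpoint models only the runtime-managed portion of the checkpoint;
// other top-level properties (like "otherStuff") are ignored by this lookup.
type driverCheckpoint struct {
	BindingStateV1 map[string]json.RawMessage `json:"bindingStateV1"`
}

// bindingState returns the persisted state for a binding key such as
// "mySchema.myTable/1", or nil if there is no prior state and the binding
// should start from scratch.
func bindingState(checkpointJSON []byte, bindingKey string) (json.RawMessage, error) {
	var cp driverCheckpoint
	if err := json.Unmarshal(checkpointJSON, &cp); err != nil {
		return nil, err
	}
	return cp.BindingStateV1[bindingKey], nil
}
```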
Materialization Checkpoints
Change 3: Materialization bindings will also have a `backfillVersion` field to allow for re-starting a binding from the same collection to the same table. Removing a binding from a materialization should cause the state to be pruned from the checkpoint. This should also happen in the runtime, especially since materializations are oblivious to the contents of their checkpoints. The `backfillVersion` and `bindingKey` fields should be added to the materialization part of the connector protocol as well, so that a materialization connector knows if it needs to re-initialize a table based on a change in the `backfillVersion` of the binding (see below).

Management of Materialized Tables
To properly re-start a materialization of a binding, the table in the endpoint needs to be truncated, and its columns probably need to be changed to allow for different types. Effectively this will entail a re-creation of the table, by dropping & re-creating, or a `CREATE OR REPLACE TABLE ...` type of operation. This is tricky for at least a couple of reasons: 1) What happens with existing shard(s) of the materialization when their target table is dropped out from under them while they are running? For example, how can a race be prevented where a prior materialization shard commits data to a newly created table, registering that progress on a to-be-reset checkpoint update? And 2) How do we avoid inadvertently dropping a table pre-created by a user that they don't want dropped?

Change 4: When applying materialization changes, first disable materialization shards and wait for their primary loop to exit prior to applying the changes, and then re-enable the shards after the changes have been applied. This is to ensure (as best we can) that there are no concurrently running shards of the materialization that will sneak in a write to any re-created tables that result from the publication. It also ensures that the materialization's transactions won't hold locks on the tables to be altered that would block the actions apply needs to take. This can actually already be an issue, such as the case of a running materialization locking a table for minutes+ and blocking a table alteration to add a column.
Where possible, materializations using transactional endpoints should increment their fence while applying these changes. The disabling of shards will happen in the `activate` command.

Change 5: Re-create materialized tables (destructive action) only under the following conditions:
- The `Apply` request includes bindings that produce `Unsatisfiable` constraints. By extension, this means that the control plane will need some work done to handle the case of a `Validate` response that has `Unsatisfiable` constraints, but still wanting to go through with the `Apply` anyway.
- The `Apply` request includes bindings with a different `backfillVersion` than prior bindings materializing to the same table.

Re-creating tables only under these circumstances means that just removing a binding from a materialization will not drop any tables, which sort-of addresses the concern about inadvertently dropping a user table that they didn't want dropped.
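A minimal sketch of that decision logic, with hypothetical names (the surrounding protocol plumbing that surfaces constraints and previously-applied specs is omitted):

```go
// shouldRecreateTable mirrors the two conditions of Change 5: the table is
// re-created only for an Unsatisfiable constraint, or for a backfillVersion
// that differs from the one previously applied for the same table.
func shouldRecreateTable(hasUnsatisfiableConstraint bool, specBackfillVersion, appliedBackfillVersion int) bool {
	return hasUnsatisfiableConstraint || specBackfillVersion != appliedBackfillVersion
}
```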
Derivations
Derivations do already have a `name` for `transforms` which could be changed to re-derive from a source collection in a pinch, but for consistency with captures & materializations the `backfillVersion` property will be added to the specification for derivations as well. Admittedly I am less familiar with how derivations use checkpoints and state, so this is a little vague at the moment: derivations will use the `backfillVersion` property of their specs (to be added) and the resulting `bindingKey` in a similar way as captures & materializations. A change in the `backfillVersion` will indicate that the derivation should start over from the beginning of its source collection, and forget any state that it had previously persisted for that binding.

Migration
Captures will be able to opt in to the new form of checkpoints by emitting driver checkpoints with state in the `bindingStateV1` object, keyed with the `bindingKey`. There will be a transition period where captures must understand states in both forms. They should endeavor to convert to the new form by emitting an updated, converted checkpoint on startup, if needed. There is an edge case where, if a capture had previously been disabled for a long time and has the `backfillVersion` changed as part of publishing it to be re-enabled for the first time, it won't actually restart the capture for that binding from scratch. This could easily be remedied by changing the `backfillVersion` again to re-capture the binding.

Materializations currently have checkpoints structured like this:
"some/journal/pivot=00;materialize/some/materialization/resource%2Fpath":{"read_through": <...>}
The `backfillVersion` field will change the `materialize/{materialization}/{encoded-resource-path}` part of this into something like `materialize/some/materialization/resource%2Fpath%2F1` (the trailing `1` is the `backfillVersion`). To enable a non-breaking transition, an omitted or 0 value for `backfillVersion` will not be appended to the resource path in general, for both materialization checkpoints and the binding keys provided to both captures and materializations.
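As a sketch of that zero-value rule (hypothetical names; how resource path components are actually joined or encoded into the key is an assumption here):

```go
package main

import (
	"fmt"
	"strings"
)

// bindingKey joins a resource path with a backfillVersion, omitting an
// omitted/0 version so that existing checkpoints keep their current keys.
func bindingKey(resourcePath []string, backfillVersion int) string {
	key := strings.Join(resourcePath, ".") // e.g. "mySchema.myTable"; the separator is an assumption
	if backfillVersion == 0 {
		return key // non-breaking: no suffix for an omitted or 0 value
	}
	return fmt.Sprintf("%s/%d", key, backfillVersion)
}
```

For example, `bindingKey([]string{"mySchema", "myTable"}, 1)` yields `mySchema.myTable/1`, matching the capture example above, while a `backfillVersion` of 0 yields the unsuffixed `mySchema.myTable`.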