
agent: defer handling of publications that require blocking locks #1390

Merged
merged 1 commit into master on Feb 27, 2024

Conversation

@psFried (Member) commented Feb 21, 2024

Description:

There's an issue related to locking that's been plaguing publications handling. We sometimes see deadlocks due to multiple publications that affect the same live_specs rows.

To start from the beginning: for each individual live_specs row, we need to guarantee a sensible order of operations. For example, let's say we have two different captures into two separate collections, and a single materialization of both collections. Auto-discovers run for the captures, and result in two separate publications, one for each of the materialized collections. Each of those publications will result in an update to the built_spec of the materialization, and it's critical the modifications from one publication don't clobber those of the other.

We solve this issue today by implicitly locking the "expanded" live_specs rows when we update the built_specs. This is why we occasionally see deadlocks that cause the agent to crash. We'd like to find another way to ensure a sensible ordering of operations that doesn't result in deadlocks and agent crashes.
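To make the deadlock concrete: row-level deadlocks arise when two transactions lock overlapping rows in different orders, so each ends up waiting on a lock the other already holds. Below is a minimal, hypothetical sketch of that pattern (assuming sqlx, which the snippets elsewhere in this PR suggest; `lock_rows_in_order` and the catalog names are purely illustrative, not the agent's actual queries):

```rust
use sqlx::PgPool;

// Locks the named live_specs rows one at a time, in the order given, and
// holds the locks until the transaction commits.
async fn lock_rows_in_order(pool: &PgPool, names: &[&str]) -> anyhow::Result<()> {
    let mut txn = pool.begin().await?;
    for name in names {
        let _row = sqlx::query("select id from live_specs where catalog_name = $1 for update")
            .bind(*name)
            .fetch_optional(&mut *txn)
            .await?;
    }
    // ... update built_specs, etc. ...
    txn.commit().await?;
    Ok(())
}

// Run concurrently, these two calls can deadlock: each acquires its first
// lock, then blocks waiting on the row the other already holds, and postgres
// aborts one of the transactions (crashing the agent today).
//
//   lock_rows_in_order(&pool, &["acme/collection-a", "acme/materialization"])
//   lock_rows_in_order(&pool, &["acme/materialization", "acme/collection-a"])
```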

It's tempting to think of an approach where we first determine the set of specs that would be affected by a given publication, and use that knowledge to process publications in an order that avoids simultaneous processing of publications that would affect the same live_specs. But the set of specs that's affected by a given publication is dependent on the current state of the drafted and live specs, and is subject to change. We need to do a decent amount of parsing and validation before we can even identify which "expanded" rows might be affected by a publication. As long as that's the case, we can really only answer the question, "can we continue working on this publication right now?". So the basic idea is to do just that. If we're unable to continue working on the publication, then just defer it and try a different job in the meantime.

To identify publications that affect overlapping live_specs rows, we can rely on postgres row-level locks, as we do today. We just need to add `for update of live_specs nowait` to the SQL query for the expanded specifications. If the rows can't be locked immediately, the query returns a specific error code, which we can handle in the agent. When we see an error with that code, we handle it specially and update the job's job_status to `{"type": "queued", "backoff": "3s"}`. When we dequeue jobs, we add a clause so that a job is only picked up once `updated_at + coalesce((job_status->>'backoff')::interval, '0s') < now()`. There are a few other details, but that's largely it.
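
A rough sketch of what that could look like in the agent, assuming sqlx (the function name, selected columns, and caller-side handling are illustrative; the real query lives in `resolve_expanded_rows` and selects considerably more). SQLSTATE `55P03` is postgres's `lock_not_available` code, which is what `nowait` raises when the locks can't be taken immediately:

```rust
use sqlx::postgres::{PgConnection, PgRow};

/// Tries to lock the expanded `live_specs` rows without blocking. `Ok(None)`
/// means another publication currently holds one of the locks, and the caller
/// should re-queue the job with a backoff instead of waiting (and possibly
/// deadlocking).
async fn try_lock_expanded(
    txn: &mut PgConnection, // a connection with an open transaction
    catalog_names: &[String],
) -> anyhow::Result<Option<Vec<PgRow>>> {
    let result = sqlx::query(
        r#"select id, catalog_name, built_spec
           from live_specs
           where catalog_name = any($1)
           for update of live_specs nowait"#,
    )
    .bind(catalog_names)
    .fetch_all(txn)
    .await;

    match result {
        Ok(rows) => Ok(Some(rows)),
        // With `nowait`, postgres fails immediately with lock_not_available
        // (SQLSTATE 55P03) rather than queueing behind the conflicting lock.
        Err(sqlx::Error::Database(err)) if err.code().as_deref() == Some("55P03") => Ok(None),
        Err(other) => Err(other.into()),
    }
}
```

On the contended path, the agent would set the `{"type": "queued", "backoff": "3s"}` status described above and move on to other work, letting the dequeue filter pick the job back up once the backoff has elapsed.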

The end result is that the agent should now defer publications that previously might have deadlocked. It will try to work on something else in the meantime.

Notes for reviewers:

This is a bit of a rough cut, because I'd like to get some feedback on the overall approach before polishing it up.



@psFried force-pushed the phil/agent-deadlocks branch 3 times, most recently from 212ae01 to 306fc38, on February 27, 2024 16:14
@psFried marked this pull request as ready for review on February 27, 2024 16:21
@jgraettinger (Member) left a comment:

LGTM, couple comments

@@ -55,8 +58,8 @@ pub struct EvolvedCollection {
     pub updated_captures: Vec<String>,
 }
 
-fn error_status(err: impl Into<String>) -> anyhow::Result<JobStatus> {
-    Ok(JobStatus::EvolutionFailed { error: err.into() })
+fn error_status(err: impl Into<String>) -> anyhow::Result<Result<JobStatus, CannotAcquireLock>> {
@jgraettinger (Member):

nit: could you do a dynamic downcast to CannotAcquireLock rather than nesting Result ?

@psFried (Member Author):

done
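
For reference, the downcast-based shape being suggested looks roughly like this (a hypothetical sketch, not the code that landed; `CannotAcquireLock` is the PR's marker type, everything else is illustrative):

```rust
use anyhow::Result;

/// Marker error for "the rows couldn't be locked immediately".
#[derive(Debug)]
struct CannotAcquireLock;

impl std::fmt::Display for CannotAcquireLock {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        write!(f, "cannot immediately acquire row locks")
    }
}
impl std::error::Error for CannotAcquireLock {}

/// Runs a job to completion, or returns `Ok(None)` if it hit lock contention
/// and should be retried later. The contention case is detected by downcasting
/// the `anyhow::Error` rather than by nesting `Result` in the return type.
fn run_job<T>(job: impl FnOnce() -> Result<T>) -> Result<Option<T>> {
    match job() {
        Ok(status) => Ok(Some(status)),
        Err(err) if err.downcast_ref::<CannotAcquireLock>().is_some() => Ok(None),
        Err(err) => Err(err),
    }
}
```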

// try again. Evolutions jobs will fail _quite_ quickly in this scenario, so
// wait a whole second before re-trying.
txn.rollback().await?;
tokio::time::sleep(std::time::Duration::from_secs(1)).await;
@jgraettinger (Member):

nit: why is this one second, while the publications handler backoff is 500 ms?

I'm thinking of this as just a back-off so we don't spam the DB in a busy loop, right?

It should avoid deadlock because we've rolled back and released all locks, so progress is likely being made, though I could imagine racy scenarios where two jobs keep on detecting potential deadlock (without waiting), backing off, and retrying at precisely the time that causes the condition to happen again.

Perhaps this should be jittered to make sure that ☝️ doesn't happen indefinitely?

@psFried (Member Author):

Yes, this is really just so we don't spam the DB in a busy loop. I arrived at these values via the very scientific 😉 process of reproducing failures using a couple of different values and squinting at the logs in my terminal. Evolutions tend to fail much faster (single digit ms) than publications, which have a lot more back and forths with the database before attempting to lock expanded spec rows. In testing, it's been much more common for another agent process to pick up the job while this one is sleeping, which is why I didn't see a need for jitter. All agents process the job queue in the same order, so the next time any agent polls the handler, it should get this same job, since we've released the lock on the job row.

☝️ This clearly should have been a code comment, so I'll add that.

agent: defer handling of publications that require blocking locks

There's an issue related to locking that's been plaguing publications handling.
We sometimes see deadlocks due to multiple publications that affect the same
`live_specs` rows.

To start from the beginning: for each individual `live_specs` row, we need to
guarantee a sensible order of operations. For example, let's say we have two
different captures into two separate collections, and a single materialization
of both collections. Auto-discovers run for the captures, and result in two
separate publications, one for each of the materialized collections. Each of
those publications will result in an update to the `built_spec` of the
materialization, and it's critical the modifications from one publication don't
clobber those of the other.

We solve this issue today by implicitly locking the "expanded" `live_specs`
rows when we update the `built_spec`s. This is why we occasionally see
deadlocks that cause the agent to crash. We'd like to find another way to
ensure a sensible ordering of operations that doesn't result in deadlocks and
agent crashes.

It's tempting to think of an approach where we first determine the set of specs
that would be affected by a given publication, and use that knowledge to
process publications in an order that avoids simultaneous processing of
publications that would affect the same `live_specs`. But the set of specs
that's affected by a given publication is dependent on the current state of the
drafted and live specs, and is subject to change. We need to do a decent amount
of parsing and validation before we can even identify which "expanded" rows
might be affected by a publication. As long as that's the case, we can really
only answer the question, "can we continue working on this publication right
now?". So the basic idea is to do just that. If we're unable to continue
working on the publication, then just defer it and try a different job in the
meantime.

To identify publications that affect overlapping `live_specs` rows, we can
rely on postgres row-level locks, as we do today. We just need to add `for
update of live_specs nowait` to the SQL that fetches `live_specs` rows.
If the rows can't be locked immediately, then it'll return a specific error
code, which we can handle in the `agent`. When we see an error with that code,
we handle it specially and leave its `job_status` as `{"type": "queued" }`. It
will be retried after a short delay.

The end result is that the agent should now defer publications that previously
might have deadlocked. It will _not_ try to work on something else in the meantime,
but will instead continue to re-try until the job can run to completion.

The change to `evolutions` started out as accidental, due to the evolutions
handler using the same `resolve_expanded_rows` function. But it turns out that
it's really helpful to prevent background publication failures. Without it,
`evolutions` have tended to create `publications` having stale `expect_pub_id`s
in scenarios where there are multiple captures feeding into one materialization.
Those stale `expect_pub_id`s mean that the publication will fail, and we'll
have to wait for the next auto-discover in order to try again. But deferring
evolutions that affect the same `live_specs` as an in-progress publication
allows the evolution to result in a publication that's much more likely to
succeed the first time.
@psFried merged commit 104c9e9 into master on Feb 27, 2024
5 checks passed