Non-stop ConditionalCheckFailedException error #30
Comments
I also get those errors when new consumers start.
Has anyone had any luck on this? We are hitting this as well. Otherwise, we're starting our deep-dive investigation.
I have seen this but have never been able to successfully reproduce it. If you find any strong reproducible steps, that would really be helpful. Typically we see it so rarely that just restarting the process gets past it, but I am very aware that this is not ideal.
Hey @garethlewin, question for you about
Maybe this is related? Or should it just be replaced with OwnerName, or does the original comment no longer apply?
I believe that is a mistake in checkpoints.go: OwnerID should be in the block of used variables and OwnerName should be in the "debug only" bit. I will verify this and update the code. From my quick check, ownerName is never checked but ownerID is checked.
Just checking in to see if a solution was ever found for this issue. I'm also experiencing this. @jney, @Sytten, @garethlewin, @cb-anthony-merrill
No. Still have never managed to get a reproducible situation, just a "sometimes it happens." I've read through the code and debugged it multiple times and I cannot find a reason for this. I do not want to point at AWS/DDB, so my assumption is there is a bug in kinsumer; I just can't find it. If you can reproduce it or add any kind of diagnostic information beyond what has been shared, I'd love to fix this.
Ah, fair enough, @garethlewin! Thanks for the quick response! In terms of reproduction, what I've noticed is that it happens whenever I boot up new containers in ECS and the older containers are attempting to
Also, I think this is an issue when you have more containers than shards for a stream (e.g. 3 containers consuming from 2 shards). In the case where all 3 containers deregister together, 1 of those containers will not be the owner of a shard (which is to be expected) and thus will fail the DynamoDB query in

Sadly, I tested the theory above just now and it still seemed to show up once when I brought up 2 new containers and deregistered 2 old ones. The odd thing is, kinsumer seems to still properly consume from my stream, so I'm not actually sure if this really is an issue I should be too worried about.
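To illustrate why a non-owner would hit this, here's a minimal sketch of a conditional DynamoDB release (this is not kinsumer's actual code; the table name, key, and attribute names are assumptions for illustration): the write only succeeds when the stored OwnerID matches the caller, so a client that never owned the checkpoint gets ConditionalCheckFailedException back.

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/dynamodb"
)

// releaseCheckpoint tries to clear ownership of a shard's checkpoint. The
// ConditionExpression only lets the write through if the stored OwnerID is
// ours; otherwise DynamoDB returns ConditionalCheckFailedException.
// Table, key, and attribute names here are illustrative, not kinsumer's.
func releaseCheckpoint(db *dynamodb.DynamoDB, table, shardID, clientID string) error {
	_, err := db.UpdateItem(&dynamodb.UpdateItemInput{
		TableName: aws.String(table),
		Key: map[string]*dynamodb.AttributeValue{
			"Shard": {S: aws.String(shardID)},
		},
		ConditionExpression: aws.String("OwnerID = :ownerID"),
		UpdateExpression:    aws.String("REMOVE OwnerID"),
		ExpressionAttributeValues: map[string]*dynamodb.AttributeValue{
			":ownerID": {S: aws.String(clientID)},
		},
	})
	if awsErr, ok := err.(awserr.Error); ok &&
		awsErr.Code() == dynamodb.ErrCodeConditionalCheckFailedException {
		return fmt.Errorf("client %s is not the current owner of shard %s: %w", clientID, shardID, err)
	}
	return err
}

func main() {
	db := dynamodb.New(session.Must(session.NewSession()))
	_ = releaseCheckpoint(db, "my-app-checkpoints", "shard-0", "client-a")
}
```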
So after quite a bit of testing with ECS, I think I can provide a bit more context into what is happening, @garethlewin. My test includes 2 containers that are initially running in
Whenever I deploy new containers to
So, when ECS attempts to scale down to the desired 2 containers and deregister

It's important to note that I can't seem to simulate this locally running multiple instances of my service, but I can easily reproduce it in

Any help is really appreciated 😃
Sorry for all the messages, @garethlewin, but I think I have some additional information that may be useful as well. I'm wondering if sorting the clients by

It looks like

So, with that being said, if the list is returned in a "new" order, it potentially leads to new clients being promoted and older clients no longer having their ID associated with a checkpoint when they attempt to "release". It seems a bit odd to be sorting on a "random" UUID since it will lead to different outcomes each time (if I'm understanding correctly).

Let me know what you think of this and if there's any other information I can provide 😄
No worries about the comments, but I no longer maintain this package, so I might not be super fast and responsive, but I'll do my best. I haven't read (or written) this code in several years so I need to knock the rust off, but sorting by OwnerID is the primary idea that I built this entire library around.

By having all clients sort by the same field, we get a consistent understanding between all the clients of which shard gets allocated to which client. We all agree that shard 0 goes to the first client, etc. When new clients are added to the system, it is correct that a reallocation of shards will happen. But everyone is supposed to take a break and then move to the newly allocated shards.

Your ECS vs. local example is worrying to me. I wonder if there is a timestamp issue here. The older workers are supposed to give up consuming their shards, and it seems like you are finding they are not giving them up.

Here is where everyone is supposed to check what shards they should consume: https://github.com/twitchscience/kinsumer/blob/master/kinsumer.go#L421

When we capture a shard, we are supposed to not take one until the existing reader has said they are done with it: https://github.com/twitchscience/kinsumer/blob/master/shard_consumer.go#L75

I wonder if there is a race condition between releasing a shard and shutting down.
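To make the consensus idea concrete, here's a tiny sketch (the function and the exact assignment rule are illustrative, not kinsumer's real implementation): because every client sorts the same client list by the same field, each one independently computes the same answer to "which shards are mine?", including the answer "none" when there are more clients than shards.

```go
package main

import (
	"fmt"
	"sort"
)

// assignShards returns the shards that belong to myID once all clients agree
// on the same sorted ordering. Every client runs the same computation over
// the same data, so no coordination beyond the shared client list is needed.
func assignShards(shardIDs, clientIDs []string, myID string) []string {
	sort.Strings(clientIDs) // every client sorts by the same key
	me := sort.SearchStrings(clientIDs, myID)

	var mine []string
	for i, shard := range shardIDs {
		if i%len(clientIDs) == me {
			mine = append(mine, shard)
		}
	}
	return mine
}

func main() {
	shards := []string{"shard-0", "shard-1"}
	clients := []string{"client-b", "client-a", "client-c"}
	fmt.Println(assignShards(shards, clients, "client-a")) // [shard-0]
	fmt.Println(assignShards(shards, clients, "client-c")) // [] (more clients than shards)
}
```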
Thanks for all the details, @garethlewin!

"I couldn't find that I was the Owner of a checkpoint, so I can't clear my ownership"

Given that you no longer maintain the project, I really appreciate you going out of your way to help me 😄!
No, the sorting isn't to mix things up. It's so that all the clients have consensus over who owns a shard, and who doesn't. You should DEFINITELY call kinsumer.Stop() on shutdown or else your clients will hold onto the shards for longer than is needed. |
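As a usage note on that point, something along these lines handles it in ECS (a sketch; only the Stop() call comes from kinsumer's actual API, the rest is generic signal handling): trap SIGTERM, which ECS sends when it deregisters a task, and call Stop() so the client releases its shards instead of holding onto them.

```go
package main

import (
	"os"
	"os/signal"
	"syscall"
)

// stopper is the slice of the kinsumer API this sketch relies on; in real
// code you would pass your *kinsumer.Kinsumer value.
type stopper interface {
	Stop()
}

// stopOnSignal blocks until SIGTERM or SIGINT arrives (ECS sends SIGTERM on
// task deregistration), then calls Stop() so the shards are released promptly.
func stopOnSignal(k stopper) {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)
	<-sigs
	k.Stop()
}

// fakeClient stands in for a real consumer so the sketch compiles on its own.
type fakeClient struct{}

func (fakeClient) Stop() {}

func main() {
	stopOnSignal(fakeClient{})
}
```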
Oh, I see! Sorry for the confusion! Just to clarify, I wonder if it's possible that

Like in the scenario above, if there are 4 consumers and only 2 shards, we are guaranteed to have at least 2 consumers that don't own a checkpoint. If I'm understanding correctly, there's nothing stopping the 2 consumers that don't have ownership from exiting, and if so, it isn't an error to clear their ownership since they don't have any. I wonder if the improved logic for
What do you think of that, @garethlewin? 😄
I think you might be on to something; there is an error for "No shard assigned", but I am not sure what happens in that case if the consumer had a shard before.
That's good to hear! For the "No shard assigned" case, I think this section here https://github.com/twitchscience/kinsumer/blob/master/kinsumer.go#L205-L207 covers the scenario where we have more consumers than shards, which is why we wouldn't hit that particular error.

I think I'll look into creating a simple check for ownership and putting it in a PR for review. Once again, since you're no longer the maintainer of the project, don't feel you have to review, etc. 😄 As I was saying earlier, now that we have the knowledge above about when the scenario occurs and how to safely clean up in this scenario, it's more about just removing the error so that others in the future aren't led in the wrong direction.
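For reference, here's roughly the shape of the check being proposed (the checkpointer type and its fields are hypothetical, not kinsumer's actual structs): a client that never owned the checkpoint treats release as a no-op instead of surfacing ConditionalCheckFailedException.

```go
package main

import "fmt"

// checkpointer is a hypothetical stand-in for per-shard checkpoint state.
type checkpointer struct {
	clientID string // this client's ID
	ownerID  string // owner recorded for the shard; empty or another ID if not ours
}

// release clears ownership only when we actually hold it; otherwise it is a
// no-op rather than an error, which covers the "more clients than shards" case.
func (c *checkpointer) release() error {
	if c.ownerID != c.clientID {
		return nil // nothing to release; not an error
	}
	// In the real library this would be the conditional DynamoDB update that
	// removes ownership; here we just clear it locally.
	c.ownerID = ""
	return nil
}

func main() {
	unowned := &checkpointer{clientID: "client-c"} // never owned a shard
	fmt.Println(unowned.release()) // <nil>
}
```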
Hello there, I'm now always getting the following error:
Any idea what's happened and how to fix that?
By the way, I noticed ticket #27.
Thank you