Non-stop ConditionalCheckFailedException error #30
Comments
I also get those errors when new consumers start.
Has anyone had any luck on this? We are hitting this as well. Otherwise, we're starting our deep-dive investigation.
I have seen this but have never been able to successfully reproduce it. If you find any strong reproducible steps, that would really be helpful. Typically we see it so rarely that just restarting the process gets past it, but I am very aware that this is not ideal.
Hey @garethlewin, question for you about
Maybe this is related? Or should it just be replaced with OwnerName, or does the original comment no longer apply?
I believe that is a mistake in checkpoints.go: OwnerID should be in the block of used variables and OwnerName should be in the "debug only" bit. I will verify this and update the code. From my quick check, ownerName is never checked but ownerID is checked.
Just checking in to see if a solution was ever found for this issue. I'm also experiencing this. @jney, @Sytten, @garethlewin, @cb-anthony-merrill
No. Still have never managed to get a reproducible situation, just a "sometimes it happens." I've read through the code and debugged it multiple times and I cannot find a reason for this. I do not want to point at AWS/DDB, so my assumption is there is a bug in kinsumer; I just can't find it. If you can reproduce it or add any kind of diagnostic information beyond what has been shared, I'd love to fix this.
Ah, fair enough, @garethlewin! Thanks for the quick response! In terms of reproduction, what I've noticed is that it happens whenever I boot up new containers in ECS and the older containers are attempting to
Also, I think this is an issue when you have more containers than shards for a stream (e.g. 3 containers consuming from 2 shards). In the case where all 3 containers deregister together, 1 of those containers will not be the owner of a shard (which is to be expected) and thus will fail the DynamoDB query in

Sadly, I tested the theory above just now and it still seemed to show up once when I brought up 2 new containers and deregistered 2 old ones. The odd thing is, kinsumer seems to still properly consume from my stream, so I'm not actually sure if this really is an issue I should be too worried about.
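To illustrate why a non-owner would hit this, here's a minimal sketch of a conditional DynamoDB release (this is not kinsumer's actual code; the table name, key, and attribute names are assumptions for illustration): the write only succeeds when the stored OwnerID matches the caller, so a client that never owned the checkpoint gets ConditionalCheckFailedException back.

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/dynamodb"
)

// releaseCheckpoint tries to clear ownership of a shard's checkpoint. The
// ConditionExpression only lets the write through if the stored OwnerID is
// ours; otherwise DynamoDB returns ConditionalCheckFailedException.
// Table, key, and attribute names here are illustrative, not kinsumer's.
func releaseCheckpoint(db *dynamodb.DynamoDB, table, shardID, clientID string) error {
	_, err := db.UpdateItem(&dynamodb.UpdateItemInput{
		TableName: aws.String(table),
		Key: map[string]*dynamodb.AttributeValue{
			"Shard": {S: aws.String(shardID)},
		},
		ConditionExpression: aws.String("OwnerID = :ownerID"),
		UpdateExpression:    aws.String("REMOVE OwnerID"),
		ExpressionAttributeValues: map[string]*dynamodb.AttributeValue{
			":ownerID": {S: aws.String(clientID)},
		},
	})
	if awsErr, ok := err.(awserr.Error); ok &&
		awsErr.Code() == dynamodb.ErrCodeConditionalCheckFailedException {
		return fmt.Errorf("client %s is not the current owner of shard %s: %w", clientID, shardID, err)
	}
	return err
}

func main() {
	db := dynamodb.New(session.Must(session.NewSession()))
	_ = releaseCheckpoint(db, "my-app-checkpoints", "shard-0", "client-a")
}
```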
So after quite a bit of testing with ECS, I think I can provide a bit more context into what is happening, @garethlewin. My test includes 2 containers that are initially running in
Whenever I deploy new containers to
So, when ECS attempts to scale down to the desired 2 containers and deregister

It's important to note that I can't seem to simulate this locally running multiple instances of my service, but I can easily reproduce it in

Any help is really appreciated 😃
Sorry for all the messages, @garethlewin, but I think I have some additional information that may be useful as well. I'm wondering if sorting the clients by

It looks like

So, with that being said, if the list is returned in a "new" order, it potentially leads to new clients being promoted and older clients no longer having their ID associated with a checkpoint when they attempt to "release". It seems a bit odd to be sorting on a "random" UUID since it will lead to different outcomes each time (if I'm understanding correctly).

Let me know what you think of this and if there's any other information I can provide 😄
No worries about the comments, but I no longer maintain this package, so I might not be super fast and responsive, but I'll do my best. I haven't read (or written) this code in several years so I need to knock the rust off, but sorting by OwnerID is the primary idea that I built this entire library around.

By having all clients sort by the same field, we get a consistent understanding between all the clients of which shard gets allocated to which client. We all agree that shard 0 goes to the first client, etc. When new clients are added to the system, it is correct that a reallocation of shards will happen. But everyone is supposed to take a break and then move to the newly allocated shards.

Your ECS vs. local example is worrying to me. I wonder if there is a timestamp issue here. The older workers are supposed to give up consuming their shards, and it seems like you are finding they are not giving them up.

Here is where everyone is supposed to check what shards they should consume: https://github.com/twitchscience/kinsumer/blob/master/kinsumer.go#L421

When we capture a shard, we are supposed to not take one until the existing reader has said they are done with it: https://github.com/twitchscience/kinsumer/blob/master/shard_consumer.go#L75

I wonder if there is a race condition between releasing a shard and shutting down.
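To make the consensus idea concrete, here's a tiny sketch (the function and the exact assignment rule are illustrative, not kinsumer's real implementation): because every client sorts the same client list by the same field, each one independently computes the same answer to "which shards are mine?", including the answer "none" when there are more clients than shards.

```go
package main

import (
	"fmt"
	"sort"
)

// assignShards returns the shards that belong to myID once all clients agree
// on the same sorted ordering. Every client runs the same computation over
// the same data, so no coordination beyond the shared client list is needed.
func assignShards(shardIDs, clientIDs []string, myID string) []string {
	sort.Strings(clientIDs) // every client sorts by the same key
	me := sort.SearchStrings(clientIDs, myID)

	var mine []string
	for i, shard := range shardIDs {
		if i%len(clientIDs) == me {
			mine = append(mine, shard)
		}
	}
	return mine
}

func main() {
	shards := []string{"shard-0", "shard-1"}
	clients := []string{"client-b", "client-a", "client-c"}
	fmt.Println(assignShards(shards, clients, "client-a")) // [shard-0]
	fmt.Println(assignShards(shards, clients, "client-c")) // [] (more clients than shards)
}
```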
Thanks for all the details, @garethlewin!

"I couldn't find that I was the Owner of a checkpoint, so I can't clear my ownership"

Given that you no longer maintain the project, I really appreciate you going out of your way to help me 😄!
No, the sorting isn't to mix things up. It's so that all the clients have consensus over who owns a shard, and who doesn't. You should DEFINITELY call kinsumer.Stop() on shutdown or else your clients will hold onto the shards for longer than is needed. |
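As a usage note on that point, something along these lines handles it in ECS (a sketch; only the Stop() call comes from kinsumer's actual API, the rest is generic signal handling): trap SIGTERM, which ECS sends when it deregisters a task, and call Stop() so the client releases its shards instead of holding onto them.

```go
package main

import (
	"os"
	"os/signal"
	"syscall"
)

// stopper is the slice of the kinsumer API this sketch relies on; in real
// code you would pass your *kinsumer.Kinsumer value.
type stopper interface {
	Stop()
}

// stopOnSignal blocks until SIGTERM or SIGINT arrives (ECS sends SIGTERM on
// task deregistration), then calls Stop() so the shards are released promptly.
func stopOnSignal(k stopper) {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)
	<-sigs
	k.Stop()
}

// fakeClient stands in for a real consumer so the sketch compiles on its own.
type fakeClient struct{}

func (fakeClient) Stop() {}

func main() {
	stopOnSignal(fakeClient{})
}
```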
Oh, I see! Sorry for the confusion! Just to clarify, I wonder if it's possible that

Like in the scenario above, if there are 4 consumers and only 2 shards, we are guaranteed to have at least 2 consumers that don't own a checkpoint. If I'm understanding correctly, there's nothing stopping the 2 consumers that don't have ownership from exiting, and if so, it isn't an error to clear their ownership since they don't have any. I wonder if the improved logic for
What do you think of that, @garethlewin? 😄
I think you might be on to something; there is an error for "No shard assigned", but I am not sure what happens in that case if the consumer had a shard before.
That's good to hear! For the "No shard assigned" case, I think this section here https://github.com/twitchscience/kinsumer/blob/master/kinsumer.go#L205-L207 covers the scenario where we have more consumers than shards, which is why we wouldn't hit that particular error.

I think I'll look into creating a simple check for ownership and putting it in a PR for review. Once again, since you're no longer the maintainer of the project, don't feel you have to review, etc. 😄 As I was saying earlier, now that we have the knowledge above about when the scenario occurs and how to safely clean up in this scenario, it's more about just removing the error so that others in the future aren't led in the wrong direction.
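For reference, here's roughly the shape of the check being proposed (the checkpointer type and its fields are hypothetical, not kinsumer's actual structs): a client that never owned the checkpoint treats release as a no-op instead of surfacing ConditionalCheckFailedException.

```go
package main

import "fmt"

// checkpointer is a hypothetical stand-in for per-shard checkpoint state.
type checkpointer struct {
	clientID string // this client's ID
	ownerID  string // owner recorded for the shard; empty or another ID if not ours
}

// release clears ownership only when we actually hold it; otherwise it is a
// no-op rather than an error, which covers the "more clients than shards" case.
func (c *checkpointer) release() error {
	if c.ownerID != c.clientID {
		return nil // nothing to release; not an error
	}
	// In the real library this would be the conditional DynamoDB update that
	// removes ownership; here we just clear it locally.
	c.ownerID = ""
	return nil
}

func main() {
	unowned := &checkpointer{clientID: "client-c"} // never owned a shard
	fmt.Println(unowned.release()) // <nil>
}
```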
Hello there, I'm now always getting the following error:
Any idea what's happened and how to fix that?
By the way, I noticed ticket #27.
Thank you