This repository has been archived by the owner on Oct 10, 2023. It is now read-only.

Event Streams access controller loops on failure causing other clients to be disconnected #158

Open
EmmaHumber opened this issue Apr 21, 2021 · 0 comments
Labels
bug Something isn't working

Comments

@EmmaHumber
Contributor

Issue Description

After an application using an invalid API key was introduced to the environment, consumers connected to Event Streams 2019.N saw frequent consumer group rebalances caused by heartbeat expiration.

Client-side logs at trace level show multi-second gaps between a heartbeat being sent to the Kafka broker and the response to that heartbeat being received.

A delay is also seen between the client sending the heartbeat and the Kafka broker log line showing that the heartbeat has arrived at the broker.

This caused the client heartbeat to expire, because the broker did not receive it within the time specified by session.timeout.ms.

Similar errors were seen when the poll interval timed out.

Issue Resolution

Kafka has a number of Processor threads, which are used for processing work arriving at the broker.

The same thread that processed heartbeats was also being used for authorization checks.

The custom Kafka authorizer calls out to the Event Streams Access Controller, which in turn calls out to IAM. If IAM returns an error, the Access Controller retries twice more, with a 3-second gap between retries.

The processing thread blocked on the many failing authorization calls. This delayed the processing of heartbeats, which led to heartbeat expiration and slow responses to the client.
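
To make the effect of the blocking concrete, the following is a rough back-of-the-envelope sketch, not code from Event Streams. It assumes that a thread serves requests one after another, that each failing authorization holds the thread for at least the two 3-second retry gaps, and that IAM call latency is ignored. The class and method names are hypothetical, and the 10-second session timeout in the usage note is only an illustrative value.

```java
// Hypothetical estimate of how long a heartbeat waits behind failing
// authorization calls on the same processing thread.
final class BlockingEstimate {
    static final int RETRIES = 2;            // "retries twice more"
    static final long RETRY_GAP_MS = 3_000;  // 3-second gap between retries

    /** Minimum time one failing authorization holds the thread (IAM latency ignored). */
    static long minBlockMs() {
        return RETRIES * RETRY_GAP_MS;
    }

    /** Minimum wait for a heartbeat queued behind the given number of failing calls. */
    static long heartbeatDelayMs(int queuedFailures) {
        return queuedFailures * minBlockMs();
    }

    /** True if that wait alone exceeds the consumer's session timeout. */
    static boolean sessionExpires(int queuedFailures, long sessionTimeoutMs) {
        return heartbeatDelayMs(queuedFailures) > sessionTimeoutMs;
    }
}
```

Under these assumptions, a single failing call blocks the thread for at least 6 seconds, and two failing calls queued ahead of a heartbeat delay it by at least 12 seconds, which would exceed an illustrative session.timeout.ms of 10000.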

The fix was to update the Access Controller so that it does not retry when the returned error says that the API key is invalid.
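
The retry behaviour with the fix applied can be sketched as below. This is a minimal illustration and not the Access Controller's actual code: the IamResult enum, the Sleeper interface, and the authorize method are all hypothetical names. The only details taken from this issue are the three attempts in total, the 3-second gap, and the decision not to retry on an invalid API key.

```java
import java.time.Duration;
import java.util.function.Supplier;

// Hypothetical sketch; none of these names come from the Access Controller.
final class AuthRetrySketch {

    /** Possible outcomes of a (hypothetical) IAM lookup. */
    enum IamResult { ALLOWED, DENIED, INVALID_API_KEY, TRANSIENT_ERROR }

    static final int MAX_ATTEMPTS = 3;                        // one call plus two retries
    static final Duration RETRY_GAP = Duration.ofSeconds(3);  // gap described in the issue

    /** Abstraction over sleeping so the policy can be exercised without waiting. */
    interface Sleeper { void sleep(Duration d); }

    /** A real sleeper that blocks the calling thread. */
    static void blockingSleep(Duration d) {
        try {
            Thread.sleep(d.toMillis());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
        }
    }

    static IamResult authorize(Supplier<IamResult> iamCall, Sleeper sleeper) {
        IamResult result = IamResult.TRANSIENT_ERROR;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            result = iamCall.get();
            // Only transient errors are retried. An invalid API key cannot
            // become valid on retry, so it is returned immediately.
            if (result != IamResult.TRANSIENT_ERROR) {
                return result;
            }
            if (attempt < MAX_ATTEMPTS) {
                sleeper.sleep(RETRY_GAP);
            }
        }
        return result;
    }
}
```

In this sketch a caller would pass `AuthRetrySketch::blockingSleep` as the sleeper in production and a recording stub in tests. With an invalid key the method returns after one IAM call and no sleep, so the calling thread is no longer held for the retry gaps.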

Workaround

Ensure that all API keys in use are valid. This prevents the retry looping that causes the delay.
OR
Increase session.timeout.ms to a larger value, so that the heartbeat does not expire.

It may also help to increase max.poll.interval.ms, because this can also time out if it was previously set to a value smaller than the default.
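
As an illustration of the second workaround, the consumer settings could be raised along these lines. The class name and the specific values are assumptions chosen for the example, not recommendations from this issue. The property keys are the standard Kafka consumer configuration names, written as plain strings so that the sketch has no dependency on the Kafka client library.

```java
import java.util.Properties;

// Hypothetical helper; the values below are illustrative only.
final class ConsumerTimeoutSketch {

    static Properties timeoutOverrides() {
        Properties props = new Properties();
        // Allow more time for a delayed heartbeat to reach the broker.
        props.setProperty("session.timeout.ms", "30000");
        // Restore the poll interval if it was lowered below the default of 5 minutes.
        props.setProperty("max.poll.interval.ms", "300000");
        return props;
    }
}
```

These overrides would be merged into the properties passed when the consumer is created. The chosen session.timeout.ms must fall within the range the broker permits (group.min.session.timeout.ms to group.max.session.timeout.ms), and a larger value also means the broker takes longer to detect a consumer that has genuinely failed.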

Fix details

IBM Internal Issue Number - 6780
Fix target - Not yet available

@EmmaHumber EmmaHumber added the bug Something isn't working label Apr 21, 2021