
[QUERY] Managing event consumption from EventHub in batch mode with multiple consumers #47654

Open
davidedg87 opened this issue Dec 24, 2024 · 6 comments
Labels
- Client: This issue points to a problem in the data-plane of the library.
- customer-reported: Issues that are reported by GitHub users external to the Azure organization.
- Event Hubs
- needs-author-feedback: Workflow: More information is needed from author to address the issue.
- question: The issue doesn't require a change to the product in order to be resolved. Most issues start as that.

Comments

@davidedg87

Library name and version

Azure.Messaging.EventHubs.Processor 5.11.5

Query/Question

In my usage scenario, I need to consume events from an Event Hub with 4 partitions.
The goal is to process events in batches, optimizing the underlying business logic by aggregating queries to handle groups of events rather than individual ones.
To achieve this, I created a CustomEventProcessor by extending the EventProcessor<TPartition> class, because EventProcessorClient does not allow processing multiple events at once.
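
For reference, the skeleton looks roughly like this (a minimal sketch assuming the PluggableCheckpointStoreEventProcessor<TPartition> base class from Azure.Messaging.EventHubs.Primitives with Blob-based checkpoints; names are illustrative, not the exact code in the attachment):

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Azure.Messaging.EventHubs;
using Azure.Messaging.EventHubs.Primitives;
using Azure.Storage.Blobs;

// Batch-oriented processor: the pluggable-checkpoint-store base class supplies
// ownership and checkpoint persistence, leaving only the batch handler and the
// error handler to override.
public class CustomEventProcessor : PluggableCheckpointStoreEventProcessor<EventProcessorPartition>
{
    public CustomEventProcessor(
        BlobContainerClient checkpointContainer,
        int eventBatchMaximumCount,
        string consumerGroup,
        string connectionString,
        string eventHubName,
        EventProcessorOptions options = default)
        : base(new BlobCheckpointStore(checkpointContainer),
               eventBatchMaximumCount,
               consumerGroup,
               connectionString,
               eventHubName,
               options)
    {
    }

    protected override Task OnProcessingEventBatchAsync(
        IEnumerable<EventData> events,
        EventProcessorPartition partition,
        CancellationToken cancellationToken)
    {
        // Business logic for the whole batch goes here; checkpoint only events
        // that were actually processed (discussed further below).
        return Task.CompletedTask;
    }

    protected override Task OnProcessingErrorAsync(
        Exception exception,
        EventProcessorPartition partition,
        string operationDescription,
        CancellationToken cancellationToken)
    {
        // Infrastructure-level failures surface here; never let this throw.
        return Task.CompletedTask;
    }
}
```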

The issues I am currently facing are as follows:

When I kill a consumer and restart it, there is a window during which multiple consumers seem to read the same data from the same partition. This forces me to check whether each event has already been processed.
[Currently, this check is done in the database by verifying a unique property of the message; a minimal sketch of it follows this list.] Is this approach correct, or am I managing it incorrectly?

When the situation described in point 1 occurs, I skip the already-processed events and handle only the remaining ones. At the end, I update the checkpoint using UpdateCheckpointAsync for the [non-skipped] event with the highest offset.
This approach seems to work, but occasionally I have noticed that some events are 'lost', as if a checkpoint for a later offset had been written, causing some events to never be processed.
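
The duplicate check is essentially the following (an illustrative sketch; the ProcessedEvents table, the MessageId column, and the SQL Server error codes are assumptions, not the exact production code):

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

public static class EventDeduplication
{
    // Relies on a UNIQUE constraint on ProcessedEvents.MessageId: inserting the
    // same id a second time fails, which identifies a duplicate delivery.
    public static async Task<bool> TryMarkProcessedAsync(
        SqlConnection connection,
        string messageId,
        CancellationToken cancellationToken)
    {
        const string sql = "INSERT INTO ProcessedEvents (MessageId) VALUES (@MessageId);";

        using var command = new SqlCommand(sql, connection);
        command.Parameters.AddWithValue("@MessageId", messageId);

        try
        {
            await command.ExecuteNonQueryAsync(cancellationToken);
            return true;   // first delivery: safe to process
        }
        catch (SqlException ex) when (ex.Number is 2601 or 2627)
        {
            return false;  // unique-key violation: duplicate delivery, skip
        }
    }
}
```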

This is the code of the consumer:

EventBatchProcessor.txt

The goal is to create a robust consumer that can scale to multiple replicas (always at most as many as the number of partitions) and that correctly manages checkpoints and duplicate detection.

Thanks in advance for your help.

Environment

No response

@github-actions github-actions bot added the Client, customer-reported, Event Hubs, and needs-team-attention labels Dec 24, 2024

Thank you for your feedback. Tagging and routing to the team member best able to assist.

@github-actions github-actions bot added the question label Dec 24, 2024
@jsquire
Member

jsquire commented Dec 24, 2024

@davidedg87: Thanks for reaching out and we regret that you're experiencing difficulties. Event Hubs has an at-least-once delivery guarantee; your application must be tolerant of duplicates and idempotent in its processing.

General information

When you "kill" a processor - whether by graceful stop or terminating the process - any partitions that it had owned will be claimed by other processors in the group and restart processing from the last recorded checkpoint. When a partition moves between processors, there is a potential for 1-2 batches of overlap in which the new owner has taken control, but the old owner is not yet aware and has a batch held in memory that it is being dispatched for processing. When old owner's load balancing loop runs (every 30 seconds by default) or it attempts to read the next batch from the partition, it recognizes the new owner and updates its state to reflect the change.

During this period of overlap, there will be two processor instances processing data from the same partition. Either the old or new owner may write a checkpoint during this time.
Your application must be tolerant of this whenever the number of processors in the group changes, such as during scaling operations or when a node crashes.
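
If the size of that overlap window matters for your workload, the load balancing cadence can be tuned through the processor options (a sketch; the interval values are illustrative, and shorter intervals trade a smaller window for more traffic against the checkpoint store):

```csharp
using System;
using Azure.Messaging.EventHubs.Primitives;

// Tighter load balancing shortens the period in which a former owner can still
// dispatch a stale batch for a partition it no longer owns.
var processorOptions = new EventProcessorOptions
{
    LoadBalancingUpdateInterval = TimeSpan.FromSeconds(10),
    PartitionOwnershipExpirationInterval = TimeSpan.FromSeconds(30)
};
```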

Checkpointing guidance

I update the checkpoint using UpdateCheckpointAsync for the event [not skipped] with the highest offset.

Offsets are opaque data and do not change in a predictable pattern from one event to the next. You cannot safely reason that a higher offset is later in the stream than a lower one; it is quite possible that the opposite is true. Sequence numbers, on the other hand, do change in a predictable pattern, and it is safe to infer a relationship between them.
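
In code, selecting the checkpoint candidate would look something like this (a sketch of a helper inside a custom processor like the one above; assumes .NET 6+ for MaxBy and the CheckpointPosition-based UpdateCheckpointAsync overload available in current 5.11.x releases):

```csharp
// Called at the end of OnProcessingEventBatchAsync with only the events that
// were handled successfully; picks the checkpoint candidate by sequence
// number, never by offset.
private async Task CheckpointLatestAsync(
    IReadOnlyList<EventData> processedEvents,
    EventProcessorPartition partition,
    CancellationToken cancellationToken)
{
    EventData latestProcessed = processedEvents.MaxBy(e => e.SequenceNumber);

    if (latestProcessed is not null)
    {
        await UpdateCheckpointAsync(
            partition.PartitionId,
            CheckpointPosition.FromEvent(latestProcessed),
            cancellationToken);
    }
}
```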

Snippet analysis

I have noticed that some events are 'lost,' as if a checkpoint for a subsequent offset was performed, causing some events to not be correctly processed.

The Event Hubs clients prioritize data integrity above all else; they will always resend events unless your application has explicitly created a checkpoint. There is no client scenario where data loss takes place. Unless your application is creating checkpoints for events that it has not yet processed or has not fully processed, the stream will always rewind for ownership changes or recovery, and you would see duplicates.

In the snippet that you shared, the error handling during event processing looks problematic. On L119, you're catching all exceptions as you process the payload of an event; if an exception occurs, you ignore it and move on. You don't account for this later in the logic, so you create a checkpoint for the batch (L139) and a cache entry for every event in the batch, including those that failed (L148-152). As a result, you will skip any events that triggered an exception, and they will never appear in your database.

Likewise, when an event does not have a payload, you explicitly skip over it (L106) yet still write the checkpoint and cache entry; those events will also not appear in your database.
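
A safer shape for the batch handler stops advancing the checkpoint at the first failure, so a checkpoint is never written past an unprocessed event (a sketch under the same assumptions as above; HandleEventAsync is a hypothetical stand-in for the business logic):

```csharp
// Inside OnProcessingEventBatchAsync: process in order and track the last
// event that was fully handled.
EventData lastFullyProcessed = null;

foreach (EventData eventData in events)
{
    try
    {
        await HandleEventAsync(eventData, cancellationToken);
        lastFullyProcessed = eventData;
    }
    catch (Exception)
    {
        // Decide deliberately what a failure means: retry, dead-letter, or
        // rethrow. Whatever the choice, do not checkpoint past this event.
        break;
    }
}

if (lastFullyProcessed is not null)
{
    await UpdateCheckpointAsync(
        partition.PartitionId,
        CheckpointPosition.FromEvent(lastFullyProcessed),
        cancellationToken);
}
```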

I'm going to assume that your publisher is explicitly setting the MessageId property on events, as that would otherwise be null and cause additional issues with the cache logic.

Next steps

There's not much else that I can offer with the available context. If you'd like to collect a +/- 5-minute slice of Azure SDK logs around the behavior that you're asking about, we'd be happy to take a look and offer thoughts. You'll want to capture logs at the "Verbose" level and filter to the ids discussed in this section of the Troubleshooting Guide. Discussion of capturing logs can be found in this sample and in the article Logging with the Azure SDK for .NET.
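
Capturing verbose SDK logs to the console can be as simple as the following (a sketch using AzureEventSourceListener from Azure.Core.Diagnostics):

```csharp
using System.Diagnostics.Tracing;
using Azure.Core.Diagnostics;

// Streams all Azure SDK events at Verbose level to the console; keep the
// listener alive for the +/- 5-minute window around the behavior of interest.
using AzureEventSourceListener listener =
    AzureEventSourceListener.CreateConsoleLogger(EventLevel.Verbose);
```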

I'm going to mark this as addressed. If you'd like us to take a look at the logs, please unresolve once they are available.

@jsquire jsquire added the issue-addressed label Dec 24, 2024
@github-actions github-actions bot removed the needs-team-attention label Dec 24, 2024

Hi @davidedg87. Thank you for opening this issue and giving us the opportunity to assist. We believe that this has been addressed. If you feel that further discussion is needed, please add a comment with the text "/unresolve" to remove the "issue-addressed" label and continue the conversation.

@davidedg87
Author

/unresolve

@github-actions github-actions bot added the needs-team-attention label and removed the issue-addressed label Dec 24, 2024
@davidedg87
Author

Hi @jsquire, first of all, thank you for the time dedicated to my issue.
If I've understood correctly, you suggest keeping my logic for removing already-processed events, but determining the effective last event to checkpoint by its sequence number rather than by its offset.
Furthermore, I need to handle exceptions better so that the checkpoint is always updated except for internal errors related to consuming the event, as opposed to failures in the specific business logic, right? For example, in the case of an empty payload, the event must still be checkpointed. I will take a look at the exception guidelines you described above.
Regarding the logs, I will try to capture them for the range you suggested and send them as soon as possible.
Thanks

@jsquire jsquire added the needs-author-feedback label and removed the needs-team-attention label Dec 24, 2024

Hi @davidedg87. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario please provide a response to the question asked above or the information requested above. This will help us more accurately address your issue.
