GCP PubSub Source Flow Control #12990
-
Hey @bruceg, I have a question regarding the flow control of messages pulled in by this PubSub source. I am trying to figure out the reason for this backlog and have come up with one as-yet-untested hypothesis: the messages themselves are not being pulled from the subscription fast enough, causing the backlog to build up in the subscription. Vector receives messages at a very low rate, so it never needs to scale its CPU usage up to what is being requested. Hence my question: is there a way to identify where the bottleneck is via Vector internal metrics or some debugging magic? Any pro-tip would be super helpful.
-
Hi @atibdialpad! I'd suggest taking a look at the internal metrics published by that source. cc/ @bruceg for any other thoughts since he worked on that source.
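For reference, a minimal sketch of exposing those internal metrics so the source's counters can be inspected, assuming a Prometheus-style scrape setup; the component names and listen address below are placeholders:

```toml
# Emit Vector's own telemetry (per-component received/sent/error counters).
[sources.vector_internal]
type = "internal_metrics"

# Expose the telemetry on a scrapeable endpoint; any metrics sink would work here.
[sinks.vector_prometheus]
type = "prometheus_exporter"
inputs = ["vector_internal"]
address = "0.0.0.0:9598"
```

With that in place, counters such as component_received_events_total for the gcp_pubsub source give a first indication of whether messages are arriving slowly or stalling after ingestion.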
-
The PubSub source uses a gRPC streaming client to receive the data from GCP, which should minimize the bottlenecks on the Vector side. I would look at the…
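For context, a bare-bones configuration of the source under discussion might look like the following; the project, subscription, and credentials path are placeholders:

```toml
# Streaming-pull consumer for a Pub/Sub subscription.
[sources.pubsub_in]
type = "gcp_pubsub"
project = "my-gcp-project"                              # placeholder
subscription = "my-subscription"                        # placeholder
credentials_path = "/etc/vector/gcp-credentials.json"   # placeholder
```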
-
Thanks folks for the suggestions. I will try them out today and report back.
-
New observation: I see a ton of error logs in the google_pubsub source trying to pull messages from the subscription. Exact error:
which is basically this: vector/src/internal_events/gcp_pubsub.rs, line 59 in a4ff8b3. These errors could be why the subscription <-> Vector path is bottlenecked, right? @bruceg @jszwedko Need to check the error further.
-
I might be hitting this: #12660. The error and throughput observations seem identical. I will try with the latest image tomorrow and report back.
-
[Update] I see this error 3-4 times an hour (far less frequent than the earlier invalid-argument error). It doesn't look particularly harmful.
However, the backlog and under-utilisation still persist.
-
I did a bit of a deep dive into this issue and also compared the performance with a Logstash GKE deployment (same cluster, same node types, same resource requests). We found that the Logstash gcp_pubsub_input plugin performs just fine and keeps the number of backlogged messages well under limits.

Looking at the GCP PubSub subscription monitoring metrics, it appears that the poor performance in this setup has to do with acknowledgement request latencies (subscription/ack_latencies): ~3-5 minutes for Vector versus ~60 ms for Logstash. @bruceg could the fact that Vector uses streaming pull while Logstash uses normal pull play a role here? (Need to see if streaming pull involves ack'ing in bulk.)

The subscription/ack_message_count is almost the same for both, but looking at how many ack requests timed out (subscription/expired_ack_deadlines_count), almost 50/sec ack requests timed out for Vector versus ~0 for Logstash. Further, looking at the raw number of ack requests (subscription/pull_ack_request_count), the Vector number is very high at ~200K/s while Logstash is at ~10/s.

So the root cause at the moment looks like the high ack latency. I will have to check if we can play around with https://vector.dev/docs/reference/configuration/sources/gcp_pubsub/#ack_deadline_seconds, for example as sketched below. @jszwedko @bruceg any suggestions here? Another question?
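As a concrete illustration, a sketch of tuning that setting on the source; the component, project, and subscription names are placeholders and the value is illustrative rather than a tuned recommendation:

```toml
[sources.pubsub_in]
type = "gcp_pubsub"
project = "my-gcp-project"           # placeholder
subscription = "my-subscription"     # placeholder
# Give Vector longer to ack each message before Pub/Sub considers the
# ack deadline expired and redelivers it; 600 s is the maximum Pub/Sub allows.
ack_deadline_seconds = 600
```

If the expired-ack-deadline count drops after this change, it would support the theory that slow acks, rather than slow pulls, are driving the backlog.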
-
@bruceg Ping!! :-)
-
Thanks @jszwedko!
-
We just merged a PR that should at least improve the handling of this problem. Could you try out the next nightly after tonight to see if it improves the situation for you?
-
Sure @bruceg, I will try the latest nightly. Thanks for looking into this.
-
I have been running the Vector deployment for the last 2 days and the fix looks super solid! Thanks @bruceg once again.