GCP PubSub Source Flow Control #12990
-
Hey @bruceg, I have a question regarding the flow control of messages pulled in by this PubSub source. I am trying to figure out the reason for this backlog and have come up with one as-yet-untested hypothesis: the messages themselves are not being pulled from the subscription fast enough, causing the backlog to build up in the subscription. Vector receives messages at a very low rate, so it never needs to scale its CPU usage up to what is being requested. Hence my question: is there a way to identify where the bottleneck is via Vector internal metrics or some debugging magic? Any pro-tip would be super helpful.
-
Hi @atibdialpad! I'd suggest taking a look at the internal metrics published by that source. cc/ @bruceg for any other thoughts since he worked on that source.
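For reference, a minimal sketch of exposing those internal metrics so the source's counters can be inspected, assuming a Prometheus-style scrape setup; the component names and listen address below are placeholders:

```toml
# Emit Vector's own telemetry (per-component received/sent/error counters).
[sources.vector_internal]
type = "internal_metrics"

# Expose the telemetry on a scrapeable endpoint; any metrics sink would work here.
[sinks.vector_prometheus]
type = "prometheus_exporter"
inputs = ["vector_internal"]
address = "0.0.0.0:9598"
```

With that in place, counters such as component_received_events_total for the gcp_pubsub source give a first indication of whether messages are arriving slowly or stalling after ingestion.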
-
The PubSub source uses a gRPC streaming client to receive the data from GCP, which should minimize the bottlenecks on the Vector side. I would look at the…
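For context, a bare-bones configuration of the source under discussion might look like the following; the project, subscription, and credentials path are placeholders:

```toml
# Streaming-pull consumer for a Pub/Sub subscription.
[sources.pubsub_in]
type = "gcp_pubsub"
project = "my-gcp-project"                              # placeholder
subscription = "my-subscription"                        # placeholder
credentials_path = "/etc/vector/gcp-credentials.json"   # placeholder
```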
-
Thanks folks for the suggestions. I will try them out today and report back.
-
New observation: I see a ton of error logs in the google_pubsub source trying to pull messages from the subscription. Exact error:
which is basically this: vector/src/internal_events/gcp_pubsub.rs, line 59 in a4ff8b3. These errors could be why the subscription <-> Vector path is bottlenecked, right? @bruceg @jszwedko Need to check the error further.
-
I might be hitting this: #12660. The error and throughput observations seem identical. I will try with the latest image tomorrow and report back.
-
[Update] I see this error 3-4 times an hour (far less frequent than the earlier invalid-argument error). It doesn't look particularly harmful.
However, the backlog and under-utilisation still persist.
-
I did a bit of a deep dive into this issue and also compared the performance with a Logstash GKE deployment (same cluster, same node types, same resource requests). We found that the Logstash gcp_pubsub_input plugin performs just fine and keeps the number of backlogged messages well under limits.

Looking at the GCP PubSub subscription monitoring metrics, it appears that the poor performance in this setup has to do with acknowledgement request latencies (subscription/ack_latencies): ~3-5 minutes for Vector versus ~60 ms for Logstash. @bruceg could the fact that Vector uses streaming pull while Logstash uses normal pull play a role here? (Need to see if streaming pull involves ack'ing in bulk.)

The subscription/ack_message_count is almost the same for both, but looking at how many ack requests timed out (subscription/expired_ack_deadlines_count), almost 50/sec ack requests timed out for Vector versus ~0 for Logstash. Further, looking at the raw number of ack requests (subscription/pull_ack_request_count), the Vector number is very high at ~200K/s while Logstash is at ~10/s.

So the root cause at the moment looks like the high ack latency. I will have to check if we can play around with https://vector.dev/docs/reference/configuration/sources/gcp_pubsub/#ack_deadline_seconds, for example as sketched below. @jszwedko @bruceg any suggestions here? Another question?
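As a concrete illustration, a sketch of tuning that setting on the source; the component, project, and subscription names are placeholders and the value is illustrative rather than a tuned recommendation:

```toml
[sources.pubsub_in]
type = "gcp_pubsub"
project = "my-gcp-project"           # placeholder
subscription = "my-subscription"     # placeholder
# Give Vector longer to ack each message before Pub/Sub considers the
# ack deadline expired and redelivers it; 600 s is the maximum Pub/Sub allows.
ack_deadline_seconds = 600
```

If the expired-ack-deadline count drops after this change, it would support the theory that slow acks, rather than slow pulls, are driving the backlog.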
-
@bruceg Ping!! :-)
-
Thanks @jszwedko!
-
We just merged a PR that should at least improve the handling of this problem. Could you try out the next nightly after tonight to see if it improves the situation for you?
-
Sure @bruceg, I will try the latest nightly. Thanks for looking into this.
-
I have been running the Vector deployment for the last 2 days and the fix looks super solid! Thanks @bruceg once again.