Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

async option will be dropped. #1522

Closed
st0012 opened this issue Aug 1, 2021 · 25 comments · Fixed by #1894
Closed

async option will be dropped. #1522

st0012 opened this issue Aug 1, 2021 · 25 comments · Fixed by #1894
Assignees
Milestone

Comments

@st0012
Copy link
Collaborator

st0012 commented Aug 1, 2021

The async option is unique to the Ruby SDK. It was designed to help send events asynchronously through different backends (e.g. a Ruby thread, Sidekiq worker...etc.). Depends on the backend, it can pose thread to the system due to its additional extra memory consumption. So it's an option with some trade-offs.

But since version 4.1, the SDK now has its own background worker managed (implemented with the famous concurrent-ruby library). It can handle most of the async

The async Option Approach

  1. The SDK serializes the event and event hint into json-compatible Ruby hashes.
  2. It passes the event payload and hint to the block.
  3. In general, the block would enqueue a background job with the above data.
    • Some earlier apps use a new Ruby thread to send the data. This is unrecommended.
    • With background job libraries like Sidekiq or Resque, this means adding objects into Redis.
    • With delayed_job, this means adding a new delayed_job record.
  4. A background worker (e.g. Sidekiq worker) then picks the event and hint and send it.

Pros

Users can customize their event sending logic. But generally it's just a worker with Sentry.send_event(event, hint).

Cons

  • The event payload (usually dozens of kbs) could be copied twice: first copied to the medium storage and then allocates the background worker process.
  • When there is an event spike, it can flood the medium storage (Redis) and take down the entire system.

The Background Worker

  1. The SDK passes the event and its hint to the background worker (a pool of threads managed by concurrent-ruby).
  2. A worker then picks the event, serializes it, and sents it.

Pros

  • It doesn't allocate extra memory other than the original event payload.
  • It's faster.
  • It doesn't require any user code.
  • The background worker doesn't queue more than 30 events. So even when there's a spike, it's unlikely to consume all the memory.

Cons

  • Unsent events will die with the process. Generally speaking, the queue time in background worker is very low. And the chance of missing events due to this reason is small in web apps. But for script programs, the process often leaves before the worker is able to send the event. This is why hint: { background: false } is required in rake integrations.
    • However, I don't think this problem can be solved with the async option.

This drawback has been addressed in #1617.

Missing Events During A Spike Because of Queue Limit

I know many users have concern about the background worker's 30 events queue limit will make them lose events during a spike. But as the maintainer and a user of this SDK, I don't worry about it because:

  1. The spike is likely to be an urgent case, and that'll probably be fixed in a short time. So not seeing a few instances of other errors should not affect the overall coverage.
  2. Given these characteristics of the SDK's background worker:
    • The default number of background workers are determined by the number of process cores on your machine.
    • They're a lot faster than the using the async approach with a sidekiq/resque...etc. worker due to the reason I described in the issue.
    • A 30-event queue is only shared within the process/web instance, depends on the concurrency model you have. Not at a global level.
      If there's a spike big enough to overflow the SDK's queue and drop some events, it'll probably overflow your background job queue with the async option too and/or pose a greater damage to your system.
  3. Sentry has a rate-limiting mechanism to prevent overflow on the platform side, which works by both rejecting new events and telling the SDK not to send new events with a 429 response. When the SDK receives a 429 response from Sentry during a spike, it'll stop sending "all events" for a given period of time.

What I'm trying to say is, it's not possible to expect Sentry to accept "all events" during a big spike regardless which approach you use. But when a spike happens, async is more likely to become another bottleneck and/or cause other problems in your system.

My Opinion

The async option seems redundant now and it could sometimes cause more harm. So I think we should drop it in version 5.0.

Questions

The above analysis is only based on my personal usage of the SDK and a few cases I helped debug with. So if you're willing to share your experience, I'd like to know

Even though the decision has been made, we still would like to hear feedback about it:

  • Do you use the async option in your apps?
    • If you do, what's the motivation? Will you still use it after reading the above description?
    • If you don't, is it an intentional decision? If it is, what's the reason behind it?
  • Do you disable the background workers with the background_worker_threads config option?
    • If you do, why?
  • Or any feedback related to this topic.
@st0012 st0012 pinned this issue Aug 1, 2021
@st0012 st0012 added this to the 5.0.0 milestone Aug 1, 2021
@josh-m-sharpe
Copy link

Long time user of sentry-ruby and sentry-raven. I didn't know this feature existed until a current migration of a large rails app to sentry. I noticed it while reading through the docs and attempted to set it up but immediately ran into issues and elected to defer in order to simplify the migration. In addition, we have a large amount of complex queues. Injecting this option into our system would likely require a bit of thought so as not to bowl over important queues while keeping error reporting timely.

I suppose this is a vote to drop it?

@louim
Copy link
Contributor

louim commented Aug 20, 2021

Hey! We currently use the async option. Mostly because it was recommended in the docs when we setup the app a long time ago (not sure the background worker option was even a thing at that time 👴🏼 ).

I'm curious about that part:

The background worker doesn't queue more than 30 events. So even when there's a spike, it's unlikely to consume all the memory.

What would happen if there is a spike in events, let's say from a noisy error filling the queue. Would another error happening at the same time be silently dropped because the queue is full? Or am I misunderstanding how it works?

I'd like to switch away from async because json serialization from the async version mean we have to check two version of the payload when doing custom processing in before_send, ex:

data = event.to_hash
errors = data.dig(:exception, :values) if data[:exception]
errors = data.dig("exception", "values") if data["exception"]

@st0012
Copy link
Collaborator Author

st0012 commented Aug 21, 2021

not sure the background worker option was even a thing at that time 👴🏼

Background worker was added since v4.1.0. So it's a new thing for most users I think 🙂

What would happen if there is a spike in events, let's say from a noisy error filling the queue. Would another error happening at the same time be silently dropped because the queue is full? Or am I misunderstanding how it works?

As of now, the queue doesn't distinguish events. So if there's a spike of a particular error, other errors may not make it into the queue. But personally I'm not worry about this because:

  1. The spike is likely to be an urgent case, and that'll probably be fixed in a short time. So not seeing a few instances of other errors should not affect the overall coverage.
  2. Given these characteristics of the SDK's background worker:
    • The default number of background workers are determined by the number of process cores on your machine.
    • They're a lot faster than the using the async approach with a sidekiq/resque...etc. worker due to the reason I described in the issue.
    • A 30-event queue is only shared within the process/web instance, depends on the concurrency model you have. Not at a global level.

If there's a spike big enough to overflow the SDK's queue and drop some events, it'll probably overflow your background job queue with the async option too and/or pose a greater damage to your system.

@kzaitsev
Copy link
Contributor

At my job, we are working fine with the async because we already have the sidekiq in our stack. I can't understand why async can't be an optional way to deliver events without async deprecation? As a solution, you can highlight possible issues with async in the documentation.

Maybe I can't understand the problem because using sentry only for exceptions without APM.

@st0012
Copy link
Collaborator Author

st0012 commented Sep 27, 2021

@Bugagazavr

There are 2 main cost of having this option around:

  1. Among all Sentry SDKs, sentry-ruby is the only one that supports such option. This means we always need to consider this extra condition when making changes to the event sending logic. And it'll make future SDK alignment harder. (It surely made sentry-raven -> sentry-ruby conversion harder)
  2. We need to spend additional effort maintaining the code for this option, and sometimes the result isn't pretty:

def dispatch_async_event(async_block, event, hint)
# We have to convert to a JSON-like hash, because background job
# processors (esp ActiveJob) may not like weird types in the event hash
event_hash = event.to_json_compatible
if async_block.arity == 2
hint = JSON.parse(JSON.generate(hint))
async_block.call(event_hash, hint)
else
async_block.call(event_hash)
end
rescue => e
loggable_event_type = event_hash["type"] || "event"
log_error("Async #{loggable_event_type} sending failed", e, debug: configuration.debug)
send_event(event, hint)
end

if defined?(ActiveJob)
module Sentry
parent_job =
if defined?(::ApplicationJob) && ::ApplicationJob.ancestors.include?(::ActiveJob::Base)
::ApplicationJob
else
::ActiveJob::Base
end
class SendEventJob < parent_job
# the event argument is usually large and creates noise
self.log_arguments = false if respond_to?(:log_arguments=)
# this will prevent infinite loop when there's an issue deserializing SentryJob
if respond_to?(:discard_on)
discard_on ActiveJob::DeserializationError
else
# mimic what discard_on does for Rails 5.0
rescue_from ActiveJob::DeserializationError do
logger.error "Discarded #{self.class} due to a #{exception}. The original exception was #{error.cause.inspect}."
end
end
def perform(event, hint = {})
Sentry.send_event(event, hint)
end
end
end
else
module Sentry
class SendEventJob; end
end
end

If the upside is high, these cost wouldn't be an issue. That's why we have had it for many years. But since we already have a better solution for the problem (background worker) that has much less downside, I don't think it's worth it now.

rsanheim added a commit to simpledotorg/simple-server that referenced this issue Sep 28, 2021
rsanheim added a commit to simpledotorg/simple-server that referenced this issue Sep 28, 2021
…rker (#2951)

* Drop sentry async - use Sentry's builtin background worker

See getsentry/sentry-ruby#1522 for details and
why we want to do this change.

We suspect this also may be a cause of https://app.shortcut.com/simpledotorg/story/5282/redis-down-in-production

* Explicitly turn off Sentry tracing

We are using Datadog for tracing currently, and having two tools
capturing trace info is not necessary and may add some overhead we don't
want..

* Revert "Explicitly turn off Sentry tracing"

This reverts commit e1eae44.
@github-actions
Copy link

This issue has gone three weeks without activity. In another week, I will close it.

But! If you comment or otherwise update it, I will reset the clock, and if you label it Status: Backlog or Status: In Progress, I will leave it alone ... forever!


"A weed is but an unloved flower." ― Ella Wheeler Wilcox 🥀

@benoittgt
Copy link

We were using a custom sidekiq task in an async block. But multiple times, we had the issue of

When there is an event spike, it can flood the medium storage (Redis) and take down the entire system.

And it was a nightmare.

We just switched to default by removing async and use the BackgroundWorker that use concurrent-ruby. Work perfectly for us for the moment. I will post a message on the next spike. 😄

@trevorturk
Copy link

I think removing async seems fine. I was using it to queue up a small custom ActiveJob that just called Sentry, but I'm happy to just use the built in method. I agree event spikes are fine to cap at X events since Sentry would drop them anyway and you don't want to overwhelm your system anyway. You might consider in the docs telling users that they can just remove the async line and things should work fine. I'm testing that now to be sure, but I believe that's the recommendation?

@st0012 st0012 modified the milestones: 5.0.0, 6.0.0 Jan 7, 2022
@st0012
Copy link
Collaborator Author

st0012 commented Jan 15, 2022

@trevorturk Thanks for the feedback. The current documents on Sentry and RDoc both suggests that it's a deprecated option and points user to this issue for more info 😄

I believe that's the recommendation?

Yes you should be able to delete it directly without any issue. But if you do, please report it here and I'll fix it ASAP 😉

@trevorturk
Copy link

It all worked fine when I just deleted it, thank you!

@dillonwelch
Copy link

But for script programs, the process often leaves before the worker is able to send the event. This is why hint: { background: false } is required in rake integrations.

Can you elaborate on why this is and why this wouldn't be in https://github.com/getsentry/sentry-ruby/blob/efcf170b5f6dd65c3b047825bddd8fde87fc6b7b/sentry-ruby/lib/sentry/rake.rb?

@st0012
Copy link
Collaborator Author

st0012 commented Jan 27, 2022

@dillonwelch sorry that I forgot to update the description. that issue has been addressed in #1617 and don't exist anymore 🙂

@jordan-brough
Copy link

jordan-brough commented Feb 11, 2022

For instrumenting AWS Lambda, it seems like it'd be ideal to use Lambda Extensions to deliver error reports. At first glance it looks like the async option would be the way to integrate with Lambda Extensions. Do any of the Sentry libraries for other languages hook into Lambda Extensions? Would there be another way to do that with the Sentry Ruby SDK?

@jordan-brough
Copy link

@sl0thentr0py thanks 👍 I've commented over in that issue.

@jordan-brough
Copy link

@st0012 how Lambda works is AWS spins up an instance of some code you've written and invokes "handler" method with an "event" payload (e.g. a web request). And then:

After the function ... [has] completed, Lambda maintains the execution environment for some time in anticipation of another function invocation. In effect, Lambda freezes the execution environment. When the function is invoked again, Lambda thaws the environment for reuse

https://docs.aws.amazon.com/lambda/latest/dg/runtimes-context.html

So you have some code that runs and then gets "frozen" and then maybe gets "unfrozen" again in the future to handle future events (e.g. web requests). But if you deploy new code, or if there is too long of delay between events then Lambda discards the frozen execution environment.

So if you have some async code you may end up in this situation:

  • Run the main "handler" (application) code
  • Start an async process along the way (e.g. sending events to Sentry)
  • Finish the handler code before the async code finishes (or perhaps even starts)
  • Lambda gets frozen

And then your async code may never run if Lambda ends up discarding the frozen execution environment. And Sentry events would get lost.

So currently the way to make Lambda work reliably with Sentry would be to make Sentry operate synchronously. But then you have Sentry possibly slowing down your application code and/or potentially affecting it if there were Sentry bugs/outages etc (even though I'm sure that would never happen! 😉). Which I assume is one reason why Sentry runs asynchronously in the first place.

Last year AWS released a solution to this issue called "Lambda Extensions". You can use Lambda extensions to allow a Lambda function to handle application code synchronously while also enqueuing asynchronous events to "extensions" (e.g. a Sentry extension) which don't block the main application code.

A configuration option like the async config discussed in this PR might be a good way to integrate w/ that, though I haven't looked into the code here or the details on that in depth.

@hogelog
Copy link

hogelog commented Mar 13, 2022

I agree dropping the async option.

I'm developing rails app that uses config.async with Sidekiq, but I will disable this config and use background worker or send synchronously.

I was originally aware of this issue and was considering disabling config.async. However, before disabling it, an error spike caused sidekiq redis down.

I plan to disable config.async and use background worker.
I am a little concerned about the queue limit. However, I checked latency of our rails app annd error reporting http request (to sentry.io), queue overflow will probably not occur.

It would be more reassuring to be able to check the queue overflow.
sentry-ruby records queue overflow event and send that client report.
https://github.com/getsentry/sentry-ruby/blob/5.2.0/sentry-ruby/lib/sentry/client.rb#L63
I would be pleasure if these queue overflow errors visualized on sentry.io's UI.

It doesn't look bad approach sending errors synchronously too.
Our rails app's APM reports http request's to sentry.io latency is almost within 300ms. I feel that it is not that much of a problem for rack workers to consume that amount of time when an error occurs.

@sl0thentr0py
Copy link
Member

sl0thentr0py commented Mar 14, 2022

I would be pleasure if these queue overflow errors visualized on sentry.io's UI.

@hogelog this is on the product road map, but I can't give you an ETA on when it will be shipped yet. But we certainly want to expose these statistics to the user eventually.

@st0012
Copy link
Collaborator Author

st0012 commented Mar 14, 2022

It doesn't look bad approach sending errors synchronously too.

@hogelog I should also mention that if you have tracing enabled, transaction events should also be taken into consideration as well. so I won't recommend sending errors synchronously if you do.

@hogelog
Copy link

hogelog commented Mar 15, 2022

I would be pleasure if these queue overflow errors visualized on sentry.io's UI.

@hogelog this is on the product road map, but I can't give you an ETA on when it will be shipped yet. But we certainly want to expose these statistics to the user eventually.

I'm looking forward to it!

@hogelog I should also mention that if you have tracing enabled, transaction events should also be taken into consideration as well. so I won't recommend sending errors synchronously if you do.

I’m not using tracing, so I was not aware. Considering the future, it may not be good to send events directly. thanks!

@ariccio
Copy link

ariccio commented May 24, 2022

I greatly appreciate this extra warning message! It's a good warning. Can you tell me exactly what I should do? Should I simply remove this code from my codebase, or do I need to add something else?

  config.async = lambda do |event, hint|
    ::Sentry::SendEventJob.perform_later(event, hint)
  end

@st0012
Copy link
Collaborator Author

st0012 commented May 28, 2022

@ariccio You can simply delete it 😉

@vadviktor
Copy link

We at my company have been plagued by the async feature for 1-2 years now; hitting the limit of the payload and the rate limit made us scratch our heads on how to safeguard against those (before Sentry internally remedied them). How we've caught these issues? By seeing Sidekiq being brought to its knees and important messages not processed.

Right now, we have decided to end the async reign, and I was glad to see this feature being deprecated since our last version update. 🙌

ragesoss added a commit to WikiEducationFoundation/WikiEduDashboard that referenced this issue Aug 8, 2022
This feature will be removed in a future version of Sentry, and it might be a source of some Redis memory problems (which have been and maybe still are a problem for P&E Dashboard). See getsentry/sentry-ruby#1522
shanamatthews pushed a commit to getsentry/sentry-docs that referenced this issue Aug 2, 2023
@sl0thentr0py
Copy link
Member

async has been deprecated for long enough and is removed in the next 6.0 major branch so we can close this placeholder issue now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.