delayed job integration: payload reached its size limit #1603

Closed
lacco opened this issue Nov 5, 2021 · 23 comments

@lacco

lacco commented Nov 5, 2021

Issue Description

We recently switched from sentry-raven to the new version of the SDK.
Everything seems to work for Rails HTTP requests, but exception tracking for DelayedJob causes some issues.

From time to time, we see the following log entries:

Transaction sending failed: the server responded with status 413
body: {"detail":"failed to read request body","causes":["payload reached its size limit"]}

Event sending failed: the server responded with status 413
body: {"detail":"failed to read request body","causes":["payload reached its size limit"]}

This only happens for tracing/exception tracking in our delayed jobs.

I tried to narrow down the issue in two ways:

  1. config.debug = true didn't yield any additional information
  2. Rails.logger.error(event.inspect) in config.before_send never logs anything

Any suggestions about what I can do to make sure that none of our exceptions are getting lost?

Reproduction Steps

Hard to reproduce on our non-production systems

Expected Behavior

All exceptions that happen should be sent to Sentry

Actual Behavior

Some exceptions (and probably traces) are missing

Ruby Version

2.6.3

SDK Version

sentry-delayed_job=4.7.3, sentry-ruby=4.7.3

Integration and Its Version

Rails, DelayedJob

Sentry Config

Sentry.init do |config|
  config.breadcrumbs_logger = [:active_support_logger]

  filter = ActiveSupport::ParameterFilter.new(Rails.application.config.filter_parameters)
  config.before_send = lambda do |event, _hint|
    filter.filter(event.to_hash)
  end

  config.traces_sample_rate = 0.2
end
@st0012
Collaborator

st0012 commented Nov 5, 2021

@lacco it's probably because of the breadcrumbs generated for the delayed job queries. Can you try disabling breadcrumbs_logger (remove the line)? (I'm not suggesting this as a solution, only as a way to identify the cause.)

@lacco
Author

lacco commented Nov 5, 2021

@st0012 Breadcrumbs are disabled now; exceptions no longer show them.

But the issue is still there. Is there something else I can try?

@st0012
Collaborator

st0012 commented Nov 6, 2021

@lacco can you try using the log-event-in-debug-mode branch with debug mode? It should print events in JSON format (note: it may be verbose) to help you collect the related event. You can then paste a processed version here or send it to me privately on Discord.

@lacco
Author

lacco commented Nov 8, 2021

@st0012 I tried to get more useful information; it looks like the event is even too big to be logged properly.

Also, I needed to turn off performance monitoring (for a different reason than the one we are discussing here).
This means the error will now take much longer to reproduce.

I want to try a different thing for the moment:

  config.before_send = lambda do |event, _hint|
    Rails.logger.error("Sentry before_send")
    Rails.logger.error(event.to_json_compatible.transform_values { |v| JSON.generate(v).bytesize }.inspect)

    ...
  end

This is safe enough to put into production, and it should bring us a little closer to the actual issue.

I will let you know as soon as I have results (which might take more than one day).

@lacco
Author

lacco commented Nov 18, 2021

@st0012 After turning off performance monitoring, we were no longer able to reproduce this error.
This is very strange, but it also means we can close this ticket for the time being.

Thank you for helping out, I will let you know if we see something like that again.

@lacco lacco closed this as completed Nov 18, 2021
@st0012
Collaborator

st0012 commented Nov 18, 2021

@lacco do you know if the events that were lost were transactions (performance) or normal events?
If they were normal events, it would be really strange for the problem to be fixed by turning off tracing.
But if they were transactions, it probably means the tracing feature recorded so much data in your app that the payload became too large. For example, it may have recorded hundreds or even 1,000+ queries.

@lacco
Author

lacco commented Nov 19, 2021

@st0012 When performance monitoring was enabled, I saw those types of errors:

Transaction sending failed: the server responded with status 413
body: {"detail":"failed to read request body","causes":["payload reached its size limit"]}

Event sending failed: the server responded with status 413
body: {"detail":"failed to read request body","causes":["payload reached its size limit"]}

I guess the second one is related to exceptions, right?
It only happened for exceptions in the context of delayed jobs...

@lacco
Author

lacco commented Dec 14, 2021

@st0012 The problem started to happen again.

I was able to log the size of the payload; it turned out that in certain cases we were sending ~2.5 MB of breadcrumbs:

{"event_id"=>34, "level"=>7, "timestamp"=>22, "release"=>42, "environment"=>12, "server_name"=>13, "modules"=>4572, "message"=>2, "user"=>2, "tags"=>63, "contexts"=>1577, "extra"=>2, "fingerprint"=>2, "breadcrumbs"=>2493137, "transaction"=>22, "platform"=>6, "sdk"=>52, "exception"=>57003, "threads"=>87}

Turning off breadcrumbs solved the issue.
This is not ideal, but we really rely on Sentry for exception reporting, so I will keep breadcrumbs turned off for the time being.

@st0012
Collaborator

st0012 commented Dec 19, 2021

@lacco thanks for the update. I suspect it's caused by breadcrumbs from updating/loading delayed job records.
I think this config should help you filter out those breadcrumbs:

  config.before_breadcrumb = lambda do |crumb, hint|
    # Drop breadcrumbs generated by Delayed Job's own ActiveRecord queries
    if crumb.data.is_a?(Hash) && (name = crumb.data[:name])
      # `next` returns nil from the lambda, which discards the breadcrumb
      next if name.match?(/Delayed::Backend::ActiveRecord::Job/)
    end

    crumb
  end

Let me know if you can give it a try? If it works, I'll bake it into sentry-delayed_job.

@st0012 st0012 reopened this Dec 19, 2021
@github-actions

This issue has gone three weeks without activity. In another week, I will close it.

But! If you comment or otherwise update it, I will reset the clock, and if you label it Status: Backlog or Status: In Progress, I will leave it alone ... forever!


"A weed is but an unloved flower." ― Ella Wheeler Wilcox 🥀

@bgvo

bgvo commented Jan 13, 2022

I have a similar issue but in Sidekiq. The job in charge of sending the exception to Sentry (Sentry::SendEventJob) is failing. Error from executing the job:

the server responded with status 413
body: {"detail":"failed to read request body","causes":["A payload reached size limit."]}"

The failing job (JSON string taken directly from the Redis "retry" zset key):

{
   "retry":true,
   "queue":"default",
   "class":"ActiveJob::QueueAdapters::SidekiqAdapter::JobWrapper",
   "wrapped":"Sentry::SendEventJob",
   "args":[
      {
        "job_class":"Sentry::SendEventJob",
        "job_id":"739d6c0b-8ba7-4ff1-bb16-dcace6feaf11",
        "provider_job_id":null,
        "queue_name":"default",
        "priority":null,
        "arguments":[
           {...},// This first argument contains the breadcrumbs, which among other things contain base64-encoded images (millions of characters long).
           {
              "integration":"rails",
              "exception":"undefined method `category' for nil:NilClass",
              "_aj_symbol_keys":[]
           }
        ],
        "jid":"41e7c1518a0be9666f2111",
        "created_at":1642036959.9971983,
        "newrelic":{},
        "enqueued_at":1642078065.571963,
        "error_message":"the server responded with status 413\nbody: {\"detail\":\"failed to read request body\",\"causes\":[\"A payload reached size limit.\"]}",
        "error_class":"Sentry::ExternalError",
        "failed_at":1642036960.3461637,
        "retry_count":12,
        "retried_at":1642078066.0670938
      }
    ]
  }

As mentioned in the comment in the payload above, I think the problem comes from breadcrumbs that are too long. One solution that comes to mind (a rough sketch follows):

  • Filter out / truncate base64-encoded data from params when including them in the breadcrumbs
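
For what it's worth, a rough sketch of that idea using the config-level before_breadcrumb hook (the regexp and the 1 KB cutoff are illustrative choices, not SDK defaults):

  # Truncate suspiciously long, base64-looking string values in breadcrumb
  # data before the crumb is attached to the event. Purely illustrative,
  # not something the SDK does out of the box.
  BASE64ISH = %r{\A[A-Za-z0-9+/=\r\n]{1024,}\z}

  config.before_breadcrumb = lambda do |crumb, _hint|
    if crumb.data.is_a?(Hash)
      crumb.data.each do |key, value|
        next unless value.is_a?(String) && value.match?(BASE64ISH)

        # Keep a short prefix so the breadcrumb stays recognizable.
        crumb.data[key] = "#{value[0, 64]}... [truncated]"
      end
    end

    crumb
  end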

I'd like to avoid having to disable breadcrumbs altogether. For us, that's one of Sentry's strengths.

cc @lacco

@lacco
Author

lacco commented Jan 15, 2022

@st0012 Sorry for the late answer.
Since I am only able to reproduce the issue in the production environment, I am a little bit hesitant to try it out.
Once we re-enable breadcrumbs, we risk losing important exceptions.
I could try to set up extra alerting based on the logs, but I won't have the time in January to look into it.

I would feel much more comfortable experimenting if the Sentry gem had a way to deal with bigger payloads.
For example, it could check the payload size of the breadcrumbs and, if it's too big, just send everything else (roughly sketched below).
This way we would not lose exceptions; some exceptions would just arrive without breadcrumbs.
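
A minimal sketch of what that could look like in user code today, assuming an illustrative 200 KB threshold and using JSON serialization to approximate the payload size:

  MAX_EVENT_BYTES = 200 * 1024 # illustrative threshold, not an official limit

  config.before_send = lambda do |event, _hint|
    payload = event.to_hash

    # to_json_compatible (used in the logging snippet above) gives a safely
    # serializable copy of the event to measure against the threshold.
    if JSON.generate(event.to_json_compatible).bytesize > MAX_EVENT_BYTES
      # Keep the exception but drop the breadcrumbs; the key may be a symbol
      # or a string depending on the SDK version, so remove both.
      payload.delete(:breadcrumbs)
      payload.delete("breadcrumbs")
    end

    payload
  end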

@st0012
Collaborator

st0012 commented Jan 15, 2022

@bgvo @lacco Thanks for your feedback. I think removing breadcrumbs altogether seems to be a solution; I vaguely remember that sentry-go does this. But ideally we should preserve as many breadcrumbs as possible, so deleting the earliest breadcrumbs until the size is acceptable seems better.

If we decide to take this data-trimming approach, I think another question is who (client vs. server) should do this task. Performing such work in the client's app, with their resources, seems wrong to me, and duplicating the logic in different SDKs is also not good.
I believe Sentry has a processing pipeline for events. If that's the case, trimming oversized data should be performed on the server side. @sl0thentr0py do you have any idea whether this is possible?

@st0012
Collaborator

st0012 commented Jan 20, 2022

ping @sl0thentr0py

@sl0thentr0py
Member

sl0thentr0py commented Jan 20, 2022

Hey @st0012, sorry, I wanted to check this out properly before responding.
I know that the Python SDK does a lot more client-side logic while serializing to ensure reasonable size behavior before sending events off.

We could try to replicate some of that in Ruby too. The server generally does do size logic as well, but I'm not sure exactly what it covers. We cannot rely solely on the server, though, because the SDKs should also do at least some sanity checking before sending huge payloads over the wire.

I'll research this properly when I have some time and add more info here.

@bgvo

bgvo commented Jan 20, 2022

Thanks @st0012 @sl0thentr0py for your replies.

I agree with @sl0thentr0py that the client needs to at least do some basic processing to make sure nothing too wild is sent over the wire.

If we decide to take this data-trimming approach, I think another question is who (client vs. server) should do this task. Performing such work in the client's app, with their resources, seems wrong to me, and duplicating the logic in different SDKs is also not good.

I'm not sure how the trimming would do any harm, since the process is done on a worker, and besides, running some very simple regexp matching on the beginning of some attributes can do the trick.

Thanks again.

@st0012
Collaborator

st0012 commented Jan 20, 2022

We cannot rely solely on the server, though, because the SDKs should also do at least some sanity checking before sending huge payloads over the wire.

We already apply size checks (actually, trimming) on message-like attributes, such as the event message and breadcrumb messages. The problem is that trimming non-string data based on its size is not easy. For example, the API we could use to check object size is not universally available. ObjectSpace.memsize_of(obj) is a possible tool (see the comparison sketch after this list), but:

  • It only works with CRuby. We also support JRuby, so we would need to find an alternative.
  • It's not 100% reliable (see its documentation).
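
For context, a quick comparison of the two measurement options (the sample data below is made up):

  require "objspace"
  require "json"

  event_like = { breadcrumbs: Array.new(1_000) { { message: "x" * 200 } } }

  # CRuby-only, and shallow: this reports the memory used by the Hash
  # structure itself, not the nested strings, so it underestimates deeply
  # nested event data.
  ObjectSpace.memsize_of(event_like)

  # Portable and closer to what the server actually rejects: the size of the
  # serialized payload on the wire.
  JSON.generate(event_like).bytesize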

My argument is that, since the server side has more resources, more fine-grained data trimming could be done there.

I'm not sure how the trimming would do any harm, since the process is done on a worker, and besides, running some very simple regexp matching on the beginning of some attributes can do the trick.

That'll work in your case, but not every oversized breadcrumb is caused by a base64 payload. I've also seen oversized events caused by overly large ActiveJob arguments.

I'm going to provide some more information on this issue. These points are worth considering regardless of whether we do the trimming on the client or the server side.

Potential Causes

There are many potential causes that can make an event oversized:

  • A single breadcrumb's data payload (a non-string structure)
  • The overall breadcrumbs (many mid-sized breadcrumbs collectively oversize the payload)
  • Stacktrace data (e.g. context line = 5 with 100 app frames)
  • Oversized Rack environment (likely request body)
  • Oversized contextual data added by the user
  • Oversized contextual data added by an integration (in some cases, ActiveJob's arguments are huge).

They may individually or collectively make the event oversized.

Possible Approaches

Simpler Ones

  • Only target breadcrumbs:
    1. Check if the event payload is oversized
    2. If it is, remove breadcrumbs
    3. Regenerate the event payload
  • Trim data in a fixed order, regardless of each individual component's size, e.g.:
    • oversized -> remove breadcrumbs -> still oversized -> remove part of the stacktrace -> still oversized -> remove request body.
  • Give each component a size quota and trim the oversized ones.
    • We would need to define a quota for each component.

More Sophisticated Ones

  • Instead of removing an entire section of data, remove individual items based on order or size.
    • For example, remove the oldest breadcrumbs until the event fits the size limit (see the sketch after this list).
  • Replace oversized objects with a placeholder value, similar to [filtered] but something like [oversized].
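
To illustrate the breadcrumb-specific variant, a sketch that drops the oldest breadcrumbs until the event fits (assuming the serialized hash keeps breadcrumbs under breadcrumbs[:values] and an illustrative 200 KB limit):

  MAX_EVENT_BYTES = 200 * 1024 # illustrative limit

  def trim_oldest_breadcrumbs(payload)
    crumbs = payload.dig(:breadcrumbs, :values)
    return payload unless crumbs.is_a?(Array)

    # Re-serializing on every iteration is wasteful but keeps the sketch
    # simple; drop crumbs from the oldest end until the payload fits.
    crumbs.shift while crumbs.any? && JSON.generate(payload).bytesize > MAX_EVENT_BYTES

    payload
  end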

My Opinion

If we just want to make the oversized events pass the size check, we can target the breadcrumbs section as described above. I think that would solve many cases.

However, I think oversized data should be treated as a problematic attribute. Sentry should be able to tell the user:

  • The event reached its size limit.
  • What exceeded the size limit.

Since this requires a more sophisticated process, I want to do most of it on the server side because:

  • On the client side, any bug will directly affect the user's app.
  • Because the process happens asynchronously, issues will be hard to spot and report.
  • If anything fails on the client side, no one gets the payload (unless we log the oversized data in the user's log files); we already have this problem in this issue. But if we do it on the server side, we can still get the payload.

@sl0thentr0py
Member

Thanks @st0012, that's a great breakdown of the problem space!

Yes, there are a lot of issues/corner cases with a 'magical' truncation solution, which is why I wanted to fully grok the Python serializer: it does do conditional truncation based on the object's type (for instance, depth-based trimming of big hashes/dicts).

I pinged people internally, and this has also been discussed previously for other SDKs. As one data point, Java doesn't do client-side magic while Python does, so there is already some inconsistency across SDKs here.

Getting the server to support something like this would be a longer-term goal/project that would need consensus and then implementation, so at least in the short term we cannot count on it. If we don't want to do truncation logic right now, the only other option is specific before_breadcrumb workarounds that we can help users with.

@st0012
Collaborator

st0012 commented Jan 22, 2022

@sl0thentr0py Thanks for the information. It's great to know that the long-term goal is to do sophisticated processing on the server side.

My current plan is:

  • Do an ad hoc client-side check (roughly sketched below) that:
    • Examines the event's payload size (against a 200 KB limit)
    • If the event is oversized, removes all breadcrumbs and inserts a placeholder crumb that explains this process.
    • Examines the event's payload size again
    • If it's still oversized for whatever reason, computes and logs information like this:
      • "Event oversized without breadcrumbs: { message: 1024, request: 1000, ....etc. }"
      • The idea is that we can at least have more information for further debugging.
      • This may not be feasible due to the way we serialize events, but I'll give it a try first.
    • Drops the event.

And in the meantime, I'd hope the server side can start recording size information about the oversized events. It can still reject those events immediately, but it would be helpful to see the size breakdown of the different components regardless of what we do next.
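
Roughly, that ad hoc check could look like the sketch below (illustrative only, not the actual SDK change; the 200 KB limit and the placeholder crumb are assumptions):

  MAX_EVENT_BYTES = 200 * 1024 # assumed limit

  def check_event_size(payload, logger)
    return payload if JSON.generate(payload).bytesize <= MAX_EVENT_BYTES

    # Replace all breadcrumbs with a single placeholder crumb that explains
    # why they were removed.
    payload[:breadcrumbs] = {
      values: [{ category: "sentry", message: "breadcrumbs removed: event exceeded the size limit" }]
    }

    return payload if JSON.generate(payload).bytesize <= MAX_EVENT_BYTES

    # Still oversized: log a per-section size breakdown for debugging,
    # then drop the event.
    sizes = payload.transform_values { |value| JSON.generate(value).bytesize }
    logger.debug("Event oversized without breadcrumbs: #{sizes}")

    nil
  end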

@st0012 st0012 added this to the 5.1.0 milestone Jan 27, 2022
@st0012 st0012 modified the milestones: 5.1.0, 5.2.0 Feb 8, 2022
st0012 added a commit that referenced this issue Feb 25, 2022
Because we now need to check if a serialized envelope item is oversized
(see
#1603 (comment)),
envelope serialization is better performed by Transport to have more
control to filter oversized items.
@st0012 st0012 modified the milestones: 5.2.0, 5.3.0 Mar 1, 2022
st0012 added a commit that referenced this issue Mar 7, 2022
* Add Transport#serialize_envelope to control envelope serialization

Because we now need to check if a serialized envelope item is oversized
(see
#1603 (comment)),
envelope serialization is better performed by Transport to have more
control to filter oversized items.

* Avoid mutating the envelope's items

* Update changelog

* Resize changelog image

* Update changelog

* Update CHANGELOG.md
@ragesoss

ragesoss commented Mar 8, 2022

My experience with this problem is as follows:

  • Using Sentry with Sidekiq
  • An error with a too-large payload happens. (The bulk of the payload is stack frames in threads, from what I can tell by inspecting the Sidekiq RetrySet job.)
  • Now Sentry wants to send a new exception, with all the details from the original wrapped into contexts, so this one has an even larger payload.
  • Upon each retry, the payload gets larger. Eventually, Redis uses too much memory and can't persist to disk... bringing my whole system down.

So, that was exciting!

(I guess the origin of this was upgrading our self-hosted Sentry installation, as the payload errors only started after that.)

ragesoss added a commit to WikiEducationFoundation/WikiEduDashboard that referenced this issue Mar 8, 2022
This will help somewhat with the Sentry too-large-payload problem that bloats Sidekiq memory and brought down Programs & Events Dashboard. It looks like there's a much better solution coming via the Sentry client. See getsentry/sentry-ruby#1603
@st0012
Collaborator

st0012 commented Mar 20, 2022

@ragesoss the case you described seems to be exactly why I'm deprecating the async option, which is described in #1522.

I think you can avoid this problem by simply not using the async option and just letting the SDK's background worker send those events.
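
For reference, the kind of configuration to remove is an async hook along these lines (your app's hook may look slightly different):

  Sentry.init do |config|
    # Removing (or never setting) an async hook like the one below lets the
    # SDK's own background worker deliver events, instead of re-enqueueing a
    # Sentry::SendEventJob that drags the whole payload through Redis:
    #
    # config.async = lambda do |event, hint|
    #   Sentry::SendEventJob.perform_later(event, hint)
    # end
  end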

@ragesoss

@ragesoss the case you described seems to be exactly why I'm deprecating the async option, which is described in #1522.

I think you can avoid this problem by simply not using the async option and just letting the SDK's background worker send those events.

Interesting, thank you! Upgrading to the latest version already fixed it, I think, but it sounds like there's no good reason to continue using async anyway.

@st0012
Collaborator

st0012 commented Apr 9, 2022

Since the envelope size check was introduced in 5.2.0 (#1747), I'm closing this now. If you run into any new issue related to the payload size limit, please open a new one, thx 🙂

@st0012 st0012 modified the milestones: 5.3.0, 5.2.0 Apr 9, 2022
@st0012 st0012 closed this as completed Apr 9, 2022