delayed job integration: payload reached its size limit #1603
@lacco it's probably because of the breadcrumbs generated for the delayed job queries. Can you try disabling breadcrumbs?
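For reference, a minimal sketch of what turning breadcrumbs off could look like, assuming they were enabled through `config.breadcrumbs_logger` (the option sentry-ruby uses to opt into breadcrumb collection):

```ruby
Sentry.init do |config|
  config.dsn = ENV["SENTRY_DSN"] # placeholder

  # An empty logger list means no breadcrumbs are recorded, which is a quick
  # way to test whether breadcrumbs are what pushes the payload over the limit.
  config.breadcrumbs_logger = []
end
```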
@st0012 Breadcrumbs are disabled now and exceptions no longer show them, but the issue is still there. Is there something else that I can try?
@lacco can you try using the…
@st0012 I tried to get more useful information; it looks like the event is even too big to be logged properly. Also, I needed to turn off performance monitoring (for a different reason than the one we are discussing here). I want to try a different thing for the moment:
This is safe enough to put into production, and it should bring us a little closer to the actual issue. I will let you know as soon as I have results (which might take more than one day).
@st0012 After turning off the performance monitoring, we were not able to reproduce this error anymore. Thank you for helping out, I will let you know if we see something like that again.
@lacco do you know if the events that were lost are transactions (performance) or normal events?
@st0012 When performance monitoring was enabled, I saw those types of errors:
I guess the second one is related to exceptions, right?
@st0012 The problem started to happen again. I was able to log the size of the payload, it turned out that in certain cases we are sending ~2.5 MB of breadcrumbs:
Turning off breadcrumbs solved the issue.
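A minimal sketch of how such size logging could be done, assuming `Event#to_hash` is available inside `before_send` (the logging itself is purely diagnostic, not an official API):

```ruby
Sentry.init do |config|
  config.before_send = lambda do |event, _hint|
    payload = event.to_hash

    # Log the total serialized size and the breadcrumbs' share of it,
    # so the oversized component stands out in the logs.
    Rails.logger.warn("sentry event size: #{payload.to_json.bytesize} bytes")
    if payload[:breadcrumbs]
      Rails.logger.warn("breadcrumbs size: #{payload[:breadcrumbs].to_json.bytesize} bytes")
    end

    event
  end
end
```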
@lacco thanks for the update. I suspect it's caused by breadcrumbs from updating/loading delayed job records:

```ruby
config.before_breadcrumb = lambda do |crumb, hint|
  if crumb.data.is_a?(Hash) && name = crumb.data[:name]
    next if name.match?(/Delayed::Backend::ActiveRecord::Job/)
  end
  crumb
end
```

Let me know if you'd give it a try? If it works, I'll bake it into the integration.
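For anyone trying this: `before_breadcrumb` drops a breadcrumb when the callback returns nil, so the bare `next` inside the lambda (which returns nil) is what filters out the Delayed Job records, while returning `crumb` keeps everything else.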
I have a similar issue but in Sidekiq. The job in charge of sending the exception to Sentry (Sentry::SendEventJob) is failing. Error from executing the job:
The failing job (JSON string taken directly from the Redis "retry" zset key):

```json
{
  "retry": true,
  "queue": "default",
  "class": "ActiveJob::QueueAdapters::SidekiqAdapter::JobWrapper",
  "wrapped": "Sentry::SendEventJob",
  "args": [
    {
      "job_class": "Sentry::SendEventJob",
      "job_id": "739d6c0b-8ba7-4ff1-bb16-dcace6feaf11",
      "provider_job_id": null,
      "queue_name": "default",
      "priority": null,
      "arguments": [
        {...}, // This first argument contains the breadcrumbs, which in turn (among other things) contain base64-encoded images (millions of characters long).
        {
          "integration": "rails",
          "exception": "undefined method `category' for nil:NilClass",
          "_aj_symbol_keys": []
        }
      ],
      "jid": "41e7c1518a0be9666f2111",
      "created_at": 1642036959.9971983,
      "newrelic": {},
      "enqueued_at": 1642078065.571963,
      "error_message": "the server responded with status 413\nbody: {\"detail\":\"failed to read request body\",\"causes\":[\"A payload reached size limit.\"]}",
      "error_class": "Sentry::ExternalError",
      "failed_at": 1642036960.3461637,
      "retry_count": 12,
      "retried_at": 1642078066.0670938
    }
  ]
}
```

As mentioned in the comment inside the JSON, I think the problem comes from breadcrumbs that are too long. One solution that comes to mind:
I'd like to avoid having to disable breadcrumbs altogether. For us, that's one of Sentry's strengths. cc @lacco
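As one possible middle ground (a sketch only; `MAX_DATA_BYTES` and the placeholder text are arbitrary choices, not Sentry settings), large string values could be capped as each breadcrumb is recorded, so breadcrumbs stay enabled but a single base64 blob can no longer blow up the event:

```ruby
MAX_DATA_BYTES = 5_000 # illustrative limit, not a Sentry setting

Sentry.init do |config|
  config.before_breadcrumb = lambda do |crumb, _hint|
    if crumb.data.is_a?(Hash)
      crumb.data.each do |key, value|
        # Replace any huge string value (e.g. a base64-encoded image that was
        # logged as part of job arguments) with a short placeholder.
        crumb.data[key] = "[TRUNCATED]" if value.is_a?(String) && value.bytesize > MAX_DATA_BYTES
      end
    end
    crumb
  end
end
```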
@st0012 Sorry for the late answer. I would feel much more comfortable experimenting if the Sentry gem had a way to deal with bigger payloads.
@bgvo @lacco Thanks for your feedback. I think removing breadcrumbs altogether seems to be a solution. I vaguely remember… If we decided to take this data-trimming approach, I think another question is who (client vs. server) should do this task. Performing such tasks in the client's app, with their resources, seems wrong to me. Also, duplicating the logic across different SDKs is not good.
ping @sl0thentr0py
hey @st0012, sorry, I wanted to check this out properly before responding. We could try to replicate some of that in Ruby too. The server generally does do size logic as well, but I'm not sure exactly what it does. We cannot rely solely on the server, though, because the SDKs should definitely do at least some sanity checking before sending huge payloads off on the wire. I'll research this properly when I have some time and add more info here.
Thanks @st0012 @sl0thentr0py for your replies. I agree with @sl0thentr0py that the client needs to at least do some basic processing to make sure nothing too wild is sent over the wire.
I'm not sure how the trimming would do any harm, since the process is done on a worker; besides, running some very simple regexp matching on the beginning of some attributes can do the trick. Thanks again.
We already apply size checks (actually, trimming) on message-ish attributes, like the event message and breadcrumb messages. The problem is that trimming non-string data based on size is not easy. For example, the API we can use to check object size is not universally available. ObjectSpace.memsize_of(obj) is a possible tool, but:
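One concrete limitation, as a toy illustration (exact numbers vary by Ruby version and object layout): `ObjectSpace.memsize_of` only reports the shallow size of the receiver, so a breadcrumb hash that merely references a huge string still looks tiny:

```ruby
require "objspace"

big_crumb = { message: "query", data: { blob: "a" * 1_000_000 } }

# The hash itself is small; the 1 MB string it references is not counted.
ObjectSpace.memsize_of(big_crumb)               # => a few hundred bytes
ObjectSpace.memsize_of(big_crumb[:data][:blob]) # => roughly 1_000_000 bytes
```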
My argument is that, since the server side has more resources, more fine-grained data trimming could be done there.
That'll work in your case, but not every oversized breadcrumb is caused by a base64 payload. I've also seen oversized events caused by ActiveJob arguments that are too big. I'm going to provide some more information on this issue; it's something to consider regardless of whether we do it on the client or the server side.

Potential Causes

There are many potential causes that can make the event oversized:
They may individually or collectively make the event oversized.

Possible Approaches

Simpler Ones
More Sophisticated Ones
My Opinion

If we just want to make oversized events pass the size check, we can target the breadcrumbs section like I described above. I think that would solve many cases. However, I think oversized data should be treated as problematic attributes. Sentry should be able to tell the user that:
Since this requires a more sophisticated process, I want to do most of it on the server side because:
Thanks @st0012, that's a great breakdown of the problem space! Yes, there are a lot of issues/corner cases with a "magical" truncation solution, which is why I wanted to fully grok the Python serializer: it does do conditional truncation based on the object's type (for instance, depth-based trimming of big hashes/dicts). I pinged people internally, and this has also been discussed previously in other SDKs. As one data point, Java doesn't do client-side magic while Python does, so there is already some inconsistency across SDKs here. Getting the server to support something like this would be a longer-term goal/project that would need consensus and then implementation, so at least for the short term we cannot count on it. So if we don't want to do truncation logic right now, the only other solution is specific…
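To make the "depth-based trimming" idea concrete, here is a toy Ruby sketch of the concept (not the Python SDK's actual algorithm and not sentry-ruby code); `MAX_DEPTH` and `MAX_STRING` are arbitrary illustrative limits:

```ruby
MAX_DEPTH  = 4
MAX_STRING = 1_000

# Recursively walk a value and cut off anything that could grow without bound:
# structures nested too deep become a placeholder, long strings get truncated.
def trim(value, depth = 0)
  return "[FILTERED: too deep]" if depth > MAX_DEPTH

  case value
  when Hash   then value.transform_values { |v| trim(v, depth + 1) }
  when Array  then value.map { |v| trim(v, depth + 1) }
  when String then value.bytesize > MAX_STRING ? "#{value[0, MAX_STRING]}…" : value
  else value
  end
end

trim({ data: { blob: "x" * 50_000 } })
# => { data: { blob: "xxx…" } }  (blob truncated to MAX_STRING characters plus an ellipsis)
```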
@sl0thentr0py Thanks for the information. It's great to know that the long-term goal is to do sophisticated processing on the server side. My current plan is:
And in the meantime, I hope the server side can start recording size information about the oversized events. It can still reject those events immediately, but it'd be helpful to see the size breakdown of the different components regardless of what we do next.
* Add Transport#serialize_envelope to control envelope serialization

  Because we now need to check if a serialized envelope item is oversized (see #1603 (comment)), envelope serialization is better performed by Transport to have more control to filter oversized items.

* Avoid mutating the envelope's items
* Update changelog
* Resize changelog image
* Update changelog
* Update CHANGELOG.md
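A conceptual sketch of that idea (not the actual sentry-ruby Transport code; the item structure and the limit are assumptions): serialize each envelope item on its own and drop the ones whose JSON is too large, instead of letting one oversized item sink the whole envelope:

```ruby
require "json"

MAX_ITEM_BYTES = 1_000_000 # illustrative per-item limit, not Sentry's real one

# `items` is assumed to be an array of already-built envelope item hashes.
# Returns the serialized items that fit; warns about the ones that don't.
def filter_oversized_items(items)
  items.map { |item| JSON.generate(item) }
       .reject do |json|
         too_big = json.bytesize > MAX_ITEM_BYTES
         warn "dropping oversized envelope item (#{json.bytesize} bytes)" if too_big
         too_big
       end
end
```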
My experience with this problem is as follows:
So, that was exciting! (I guess the origin of this was upgrading our self-hosted Sentry installation, as the payload errors only started after that.)
This will help somewhat with the Sentry too-large-payload problem that bloats Sidekiq memory and brought down Programs & Events Dashboard. It looks like there's a much better solution coming via the Sentry client. See getsentry/sentry-ruby#1603
Interesting, thank you! Upgrading to the latest version already fixed it, I think, but it sounds like there's no good reason to continue using…
Since the envelope size check has been introduced in…
Issue Description
We recently switched from sentry-raven to the new version of the SDK. Everything seems to work for Rails HTTP requests, but exception tracking for DelayedJob causes some issues.
From time to time, we see the following log entries:
This only happens for tracing/exception tracking in our delayed jobs.
I tried to narrow down the issue in two ways (see the sketch below):
- config.debug = true didn't produce any additional information
- Rails.logger.error(event.inspect) in config.before_send is never logging anything

Any suggestions about what I can do to make sure that none of our exceptions are getting lost?
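For context, a sketch of roughly what those two attempts could look like in the initializer (the DSN is a placeholder; this is my reading of the description above, not the reporter's actual config):

```ruby
Sentry.init do |config|
  config.dsn = ENV["SENTRY_DSN"] # placeholder
  config.debug = true            # attempt 1: verbose SDK logging

  # Attempt 2: log every event right before it would be sent.
  config.before_send = lambda do |event, _hint|
    Rails.logger.error(event.inspect)
    event
  end
end
```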
Reproduction Steps
Hard to reproduce on our non-production systems
Expected Behavior
All exceptions that occur should be sent to Sentry.
Actual Behavior
Some exceptions (and probably traces) are missing
Ruby Version
2.6.3
SDK Version
sentry-delayed_job=4.7.3, sentry-ruby=4.7.3
Integration and Its Version
Rails, DelayedJob
Sentry Config