-
-
Notifications
You must be signed in to change notification settings - Fork 376
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Getting FormatException: Failed to parse message headers due to a buffering bug #991
Comments
The change to IsMboxMarker is wallpapering over the real problem. Somehow buffering state is getting messed up such that it is skipping over the real mbox marker and a ton of headers and getting confused, thinking that the mbox marker is in the middle of the DKIM header value. This could be caused by a few things:
Have you tried the ExperimentalMimeParser? At the very least it should be faster. |
Thanks for the quick feedback!
I'm using defaults options, so RespectContentLength is set to false. Also, I tried recreating the email context with several messages before and after, but that didn't reproduce the problem so far.
Persistent was the default (false) in the test (see posted code snippet) and I'm not messing with the stream.
That's my assumption as well, but I'm not sure how to tweak the code to force a reproduction of the issue.
I just did and it runs flawlessly! |
Yea, if it's a buffering issue (and based on the fact that you are using the defaults and not doing anything with the stream, it sounds like it is), it'll be difficult to construct an mbox that will reproduce the issue because it'll likely be a corner case where some token or another needs to cross a read boundary at some stage in parsing, causing things to get out of whack. IIRC, the ExperimentalMimeParser is a little more robust in that sense (if my memory is correct, it's been a while) because it tends to consume everything as it goes whereas the old MimeParser implementation has a few areas where it will abort an attempt to consume something if it can't get the whole token, call ReadAhead to refill the buffer (which moves any remaining buffer to the "prebuffer") and then tries again. This is most likely the cause of the issue, but I obviously can't be 100% certain (and there are several places that do that, so narrowing that down would be a pain). I think the most likely culprit would be the StepHeaders() (or related) logic in MimeParser.cs. That code has gotten a bit scary, even for me, which is one of the reasons it got rewritten in ExperimentalMimeParser. |
FWIW, my plan is to make ExperimentalMimeParser the default in v5.0 and rename the current MimeParser to LegacyMimeParser and more-or-less I meant to do that for v4.0 but forgot, haha. |
Describe the bug
I get a "Failed to parse message headers." exception when parsing my ~11GB gmail mbox. This happens reliably, always on the same email. When the exception is raised, MimeParser.MboxMarker contains the string 'From : Subject : Date : ', from the middle of the email's DKIM-Signature header (see attachements). Mimekit has no problem parsing the offending email when it's extracted to its own file. So far, extracting 1MB of data around the email in an attempt to reproduce the problem has not been successful. A previous complete gmail mbox (now lost), that contained the same email, also had no parsing issue. I'm guessing it's related to the ReadAheadSize somehow?
Platform:
To Reproduce
Result
The location of the error state origin is already lost when the exception is raised, but in this particular case, it comes from line 965 in StepHeaders(...):
Expected behavior
No exception
Code Snippets
Additional context
Workaround
If I modify IsMboxMarker() to allow 'From ', but ignore 'From :', the exception is not raised. However, I'm not familiar with RFC standards or the performance impact to suggest this as a permanent fix (also, my test case to prove the fix includes a very private ~11GB file):
The text was updated successfully, but these errors were encountered: