[TIKA-4303] Handle OneNotePropertyEnum.CachedTitleString as RichEditTextUnicode #2098
Thank you for opening this and looking deeply into the code. Would you be able to add a unit test with the file that was contributed on JIRA? Thank you, again, and yes, fixes to the other issues would be great.
Of course, I can try generating a OneNote file containing multiple languages to test this issue. However, for subsequent modifications, should I continue using [TIKA-4303], or will there be a new issue on JIRA?
Thank you! I'd prefer separate issues, but whatever it takes. 🤣
Sure, if you create a new issue in JIRA (I don’t have an account, so I can’t create one 🤣), I’d be happy to work on resolving the issue with reading OneNote content.
Sorry. We have a bunch of spam on our JIRA. If you re-request a JIRA account, I'll approve it. And, thank you.
You can apply for an account. The secret to getting an account is to enter a short, meaningful, and specific text, e.g. mention this PR. https://selfserve.apache.org/jira-account.html
Thank you for the reminder, I have submitted an account request.
@sunluman is this ready to go?
@sunluman can you produce an example or unit test?
need a unit test to prove it works.
Sorry, I overlooked this unit test. I will add it right away.
@tballison @nddipiazza The unit test is ready. This test only performs a simple check for garbled characters in the title.
@nddipiazza wdyt?
Fixes https://issues.apache.org/jira/browse/TIKA-4303
The issue of garbled text is caused by OneNotePropertyEnum.CachedTitleString not being correctly parsed. It should be parsed using handleRichEditTextUnicode.
As for why versions 2.7.0 and earlier did not encounter garbled text, I believe it was due to a previously erroneous line of code:
This line caused the parsing of OneNote files to never append the parsed content of OneNotePropertyEnum.ImageFilename, OneNotePropertyEnum.Author, and OneNotePropertyEnum.CachedTitleString to the xhtml.
However, when parsing OneNotePropertyEnum.RichEditTextUnicode, the logic for only parsing the latest version’s content was not added. As a result, the files appeared to be successfully parsed and without garbled text, but in reality, CachedTitleString was never parsed.
I only fixed the bug in the issue where the title in the uploaded file was not parsed. During the testing process, I also discovered the following issues:
I am not sure whether to create a new issue before proceeding with these fixes, so these issues have not been addressed in this PR.
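For readers unfamiliar with this kind of bug, here is a minimal, standalone sketch of how a title can come out garbled when its bytes are decoded with the wrong charset. It is not the Tika code from this PR. It assumes that the Unicode handler decodes the property bytes as UTF-16LE, and that the garbled output came from decoding the same bytes with a single-byte charset (ISO-8859-1 is used here purely for illustration). The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of Tika's OneNote parser.
final class TitleDecodingSketch {

    // Assumption: a RichEditTextUnicode-style property stores its text as UTF-16LE bytes.
    static String decodeAsUtf16Le(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_16LE);
    }

    // Assumption: the garbled path treated each byte as one character.
    static String decodeAsLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // A mixed CJK/ASCII title, written with Unicode escapes to avoid source-encoding issues.
        String title = "\u6807\u9898 Title";
        byte[] raw = title.getBytes(StandardCharsets.UTF_16LE);

        // Decoding with the matching charset round-trips the title.
        System.out.println(decodeAsUtf16Le(raw).equals(title));

        // Decoding byte-by-byte yields one character per byte, so the text no longer matches.
        System.out.println(decodeAsLatin1(raw).equals(title));
    }
}
```

A unit test along the lines of the one requested in this thread could assert that the extracted title equals the expected string, rather than only checking that no replacement characters appear.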