[TIKA-4303] Handle OneNotePropertyEnum.CachedTitleString as RichEditTextUnicode #2098

Merged
merged 1 commit into from
Jan 27, 2025

sunluman
Contributor

Fixes https://issues.apache.org/jira/browse/TIKA-4303

The garbled text occurs because OneNotePropertyEnum.CachedTitleString is not parsed correctly. It should be parsed with handleRichEditTextUnicode.
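To illustrate the kind of garbling involved, here is a minimal sketch, assuming the title bytes are stored as UTF-16LE (an assumption about the property's encoding, not something stated in this thread). The class and method names are hypothetical and are not part of Tika; the sketch only shows that the same bytes read one byte per character come out as a different, unreadable string.

```java
import java.nio.charset.StandardCharsets;

class TitleDecodingSketch {
    // Decode the raw property bytes as UTF-16LE (the assumed storage encoding).
    static String decodeAsUtf16Le(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_16LE);
    }

    // Decode the same bytes one byte per character; for UTF-16 data this
    // interleaves NUL characters and splits non-ASCII characters in two.
    static String decodeAsSingleByte(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Two Chinese characters followed by " Title", written with escapes
        // so the source file encoding does not matter.
        String title = "\u6807\u9898 Title";
        byte[] raw = title.getBytes(StandardCharsets.UTF_16LE);

        System.out.println(decodeAsUtf16Le(raw).equals(title));    // round-trips
        System.out.println(decodeAsSingleByte(raw).equals(title)); // does not
    }
}
```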

As for why versions 2.7.0 and earlier did not produce garbled text, I believe it was because of a line of code that was erroneous at the time:

if (options.getUtf16PropertiesToPrint().contains(propertyValue.propertyId))

Because of this line, the OneNote parser never appended the parsed content of OneNotePropertyEnum.ImageFilename, OneNotePropertyEnum.Author, or OneNotePropertyEnum.CachedTitleString to the XHTML output.

However, when parsing OneNotePropertyEnum.RichEditTextUnicode, the logic for parsing only the latest version's content was not added. As a result, the files appeared to parse successfully and without garbled text, but in reality CachedTitleString was never parsed.
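The thread does not spell out why the quoted condition was erroneous. One possible explanation, stated here purely as an assumption, is a type mismatch: if getUtf16PropertiesToPrint() returns a set of enum constants while propertyValue.propertyId is a wrapper object around the enum, then contains compiles (it accepts any Object) but can never return true. The sketch below reproduces that behaviour with hypothetical stand-in types (PropertyEnum, PropertyId, UTF16_PROPERTIES); none of them are the real Tika classes.

```java
import java.util.EnumSet;
import java.util.Set;

class ContainsMismatchSketch {
    // Hypothetical stand-in for the property enum.
    enum PropertyEnum { IMAGE_FILENAME, AUTHOR, CACHED_TITLE_STRING, OTHER }

    // Hypothetical stand-in for a property id that wraps the enum.
    static final class PropertyId {
        final PropertyEnum propertyEnum;

        PropertyId(PropertyEnum propertyEnum) {
            this.propertyEnum = propertyEnum;
        }
    }

    // Hypothetical stand-in for the configured set of UTF-16 properties.
    static final Set<PropertyEnum> UTF16_PROPERTIES = EnumSet.of(
            PropertyEnum.IMAGE_FILENAME,
            PropertyEnum.AUTHOR,
            PropertyEnum.CACHED_TITLE_STRING);

    // Compiles because Set.contains takes Object, but a PropertyId is never
    // equal to an enum constant, so this is always false.
    static boolean checkWithWrapper(PropertyId id) {
        return UTF16_PROPERTIES.contains(id);
    }

    // Comparing the wrapped enum constant gives the intended membership test.
    static boolean checkWithEnum(PropertyId id) {
        return UTF16_PROPERTIES.contains(id.propertyEnum);
    }

    public static void main(String[] args) {
        PropertyId title = new PropertyId(PropertyEnum.CACHED_TITLE_STRING);
        System.out.println(checkWithWrapper(title));
        System.out.println(checkWithEnum(title));
    }
}
```

Under this assumption, the branch guarded by the wrapper check would be dead code, which would match the observation that the three properties were never appended.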

This PR fixes only the bug reported in the issue, where the title in the uploaded file was not parsed. While testing, I also found the following issues:

  • Non-rich text content is not checked for the latest version, so when the content is TextExtendedAscii, it is still parsed repeatedly.
  • Dates are not parsed.
  • Chinese characters (and possibly other non-ASCII characters; I'm not sure) in the content are not parsed.

I am not sure whether to create a new issue before proceeding with these fixes, so these issues have not been addressed in this PR.

@sunluman sunluman force-pushed the main branch 3 times, most recently from 1e96b94 to bc6fd9d Compare January 14, 2025 13:14
@tballison
Contributor

tballison commented Jan 14, 2025

Thank you for opening this and looking deeply into the code. Would you be able to add a unit test with the file that was contributed on JIRA?

Thank you, again, and yes, fixes to the other issues would be great.

@sunluman
Contributor Author

Of course, I can try generating a OneNote file containing multiple languages to test this issue. However, for subsequent modifications, should I continue using [TIKA-4303], or will there be a new issue on JIRA?

@tballison
Contributor

Thank you! I'd prefer separate issues, but whatever it takes. 🤣

@sunluman
Contributor Author

Sure, if you create a new issue in JIRA (I don’t have an account, so I can’t create one🤣), I’d be happy to work on resolving the issue with reading OneNote content.

@tballison
Contributor

Sorry. We have a bunch of spam on our JIRA. If you re-request a JIRA account, I'll approve it. And, thank you.

@THausherr
Contributor

You can apply for an account. The secret to getting an account is to enter a short, meaningful, and specific text, e.g. mention this PR. https://selfserve.apache.org/jira-account.html

@sunluman
Contributor Author

Thank you for the reminder. I have submitted an account application.

@tballison
Contributor

@sunluman is this ready to go?

@nddipiazza
Contributor

@sunluman can you produce an example or unit test?

Contributor

@nddipiazza nddipiazza left a comment

need a unit test to prove it works.

@sunluman
Contributor Author

Sorry, I overlooked this unit test. I will add it right away.

@sunluman
Contributor Author

@tballison @nddipiazza The unit test is ready. This test only performs a simple check for garbled characters in the title.
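The test added in the PR is not shown in this thread. As a rough illustration only, a "simple check for garbled characters" could look like the sketch below, assuming that garbling shows up as Unicode replacement characters (U+FFFD) or as stray non-whitespace control characters such as NUL. The helper looksGarbled is hypothetical and is not part of Tika or of the actual test.

```java
class GarbledTitleCheckSketch {
    // Heuristic, hypothetical helper: flags text that contains the Unicode
    // replacement character or a control character that is not whitespace.
    static boolean looksGarbled(String text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean replacementChar = c == '\uFFFD';
            boolean strayControl = Character.isISOControl(c) && !Character.isWhitespace(c);
            if (replacementChar || strayControl) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A clean title with non-ASCII characters should pass the check.
        System.out.println(looksGarbled("Notes \u6807\u9898"));
        // A title with interleaved NUL characters should be flagged.
        System.out.println(looksGarbled("T" + (char) 0 + "i" + (char) 0));
    }
}
```

A real test would run the parser on the sample file from the JIRA issue and apply a check like this to the extracted title; that wiring is omitted because it depends on the project's test utilities.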

@tballison
Contributor

@nddipiazza wdyt?

@tballison tballison merged commit 7f94520 into apache:main Jan 27, 2025
1 check passed
tballison pushed a commit that referenced this pull request Jan 27, 2025

4 participants