Tika-2421 : About the encoding of HTML #338

Open
wants to merge 2 commits into
base: main
from

Conversation

PeterAlfredLee
Member

It seems we can use charsetdetector.StandardHtmlEncodingDetector for HTML charset detection. I'm wondering why we are not using it?

I also stopped treating ISO-8859-1 as Windows-1252.

@tballison
Contributor

Inertia... I never got around to doing a bakeoff between the two, and, unless there's evidence of improvement, I'm hesitant to make the change as the default detector.

@PeterAlfredLee
Member Author

As TIKA-2421 says, according to the W3C description, we should check the HTML's byte order mark (BOM) first.
If there is no BOM, the document is assumed to be ASCII-compatible, so we can read the HTML's meta tag as ASCII and get the charset from it.

HtmlEncodingDetector does not check for a BOM first; it assumes the HTML's meta tag is ASCII-compatible.
StandardHtmlEncodingDetector checks for a BOM first, then reads the metadata if there is no BOM, then reads the meta tag if the metadata has no charset.
So I think using StandardHtmlEncodingDetector is more compliant with the W3C standard.
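
To make the order described above concrete, here is a minimal sketch of "BOM first, then metadata, then meta tag". It is not Tika's code: the class `HtmlCharsetSketch`, its methods, the `declaredCharset` parameter (standing in for a charset hint taken from the metadata), and the simplified regex are all assumptions made for illustration. The real prescan algorithm in the HTML specification is considerably stricter.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration only; not the StandardHtmlEncodingDetector implementation. */
final class HtmlCharsetSketch {

    // Simplified pattern (an assumption): finds charset=... inside a <meta ...> tag.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Step 1: return the charset implied by a leading BOM, or null if there is none. */
    static Charset fromBom(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        return null;
    }

    /** Resolve a charset name, returning null for missing, illegal, or unsupported names. */
    static Charset forNameOrNull(String name) {
        if (name == null) {
            return null;
        }
        try {
            return Charset.forName(name);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    /** Step 3: decode the first 1024 bytes as ASCII and look for a meta charset. */
    static Charset fromMeta(byte[] b) {
        String head = new String(b, 0, Math.min(b.length, 1024), StandardCharsets.US_ASCII);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? forNameOrNull(m.group(1)) : null;
    }

    /** BOM first, then the declared (metadata) charset, then the meta tag; null if all fail. */
    static Charset detect(byte[] b, String declaredCharset) {
        Charset bom = fromBom(b);
        if (bom != null) {
            return bom;
        }
        Charset declared = forNameOrNull(declaredCharset);
        return declared != null ? declared : fromMeta(b);
    }
}
```

With this sketch, `HtmlCharsetSketch.detect(bytes, null)` falls through to the meta tag only when the bytes carry no BOM, which is the behavioural difference from a detector that reads the meta tag directly.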

The only problem I can see is that StandardHtmlEncodingDetector treats ISO-8859-1 as Windows-1252; I have changed that in this PR.
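
For context on why that mapping matters, the two charsets differ only in the byte range 0x80 to 0x9F: Windows-1252 assigns printable characters to most of those bytes, while ISO-8859-1 maps them to C1 control characters. The small sketch below shows the difference for a single byte; the class name `Latin1VsCp1252` and its `decode` helper are made up for illustration and are not part of Tika.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration of how one byte decodes under the two charsets. */
final class Latin1VsCp1252 {

    /** Decode raw bytes with the given charset. */
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        byte[] euroByte = {(byte) 0x80};
        String asCp1252 = decode(euroByte, Charset.forName("windows-1252"));
        String asLatin1 = decode(euroByte, StandardCharsets.ISO_8859_1);
        // Print the code points so the control character is visible.
        System.out.printf("windows-1252: U+%04X%n", (int) asCp1252.charAt(0));
        System.out.printf("ISO-8859-1:   U+%04X%n", (int) asLatin1.charAt(0));
    }
}
```

Under Windows-1252 the byte 0x80 decodes to the euro sign (U+20AC); under ISO-8859-1 it decodes to the control character U+0080. Whether a detector should report the declared label or the charset a browser would actually use is the design question this PR touches.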

So I think we can make StandardHtmlEncodingDetector the default detector.
Or we can modify HtmlEncodingDetector to comply with the W3C standard. WDYT?

@tballison
Contributor

Wait, it turns out I did get around to doing this study...

https://github.com/tballison/share/blob/main/slides/Tika_charset_detector_study_201909.docx

Let me read it and remember what I found... 🤣

Replace HtmlEncodingDetector to StandardHtmlEncodingDetector
Adjust some test case

2 participants