
Recognize txt #836

Open
wants to merge 3 commits into
base: master
from

Conversation

christian-intra2net
Contributor

olevba's heuristic for detecting plain text (no \x00 in the binary data) does not work with many Unicode encodings such as UTF-16. Improve on that heuristic and move it to ftguess.py, so we can at least deal with harmless text encoded as UTF-8, Latin-1, or UTF-16 (with or without BOMs). This is far from perfect and ignores popular Asian encodings, but according to Wikipedia UTF-8 is by far the most popular encoding used in software. If we need something better still, I'd recommend not reinventing the wheel here but using libmagic or another specialized library.
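To make the idea concrete, here is a minimal sketch of such a heuristic. It is not the code added to ftguess.py: every name in it (`guess_text_encoding` and the helpers) is hypothetical, and the "at least half ASCII" rule for BOM-less UTF-16 is an assumption that only suits mostly Latin-script text.

```python
import codecs
import unicodedata

# Hypothetical names throughout; this is not the code added to ftguess.py.
BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)
ALLOWED_CONTROLS = "\t\n\r\f"


def _decode(data, encoding):
    """Strictly decode data, returning None instead of raising."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


def _is_clean(text):
    """True if text has no control characters besides common whitespace."""
    return all(
        ch in ALLOWED_CONTROLS or unicodedata.category(ch) != "Cc"
        for ch in text
    )


def _mostly_ascii(text):
    """True if at least half of the characters are ASCII (an assumed threshold)."""
    return 2 * sum(ch < "\x80" for ch in text) >= len(text)


def guess_text_encoding(data):
    """Return a codec name if data looks like text, else None."""
    # 1. A BOM names the encoding outright; the rest must still decode cleanly.
    for bom, encoding in BOMS:
        if data.startswith(bom):
            text = _decode(data[len(bom):], encoding)
            return encoding if text is not None and _is_clean(text) else None
    # 2. UTF-8 first, then Latin-1. Typical European UTF-16 text fails both,
    #    because its NUL bytes decode to control characters.
    for encoding in ("utf-8", "latin-1"):
        text = _decode(data, encoding)
        if text is not None and _is_clean(text):
            return encoding
    # 3. BOM-less UTF-16: accept the byte order that yields mostly ASCII text.
    for encoding in ("utf-16-le", "utf-16-be"):
        text = _decode(data, encoding)
        if text is not None and _is_clean(text) and _mostly_ascii(text):
            return encoding
    return None
```

Latin-1 is tried before UTF-16 because every byte string decodes as Latin-1; only the control-character check keeps binary data from passing there.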

I created sample files for all the encodings used, along with unit tests to check them.
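As a rough illustration of why such samples are needed, the sketch below builds in-memory stand-ins for them and applies the old "no \x00" rule. The sample names, the text, and `no_nul_heuristic` are all hypothetical; the real sample files in this pull request may differ.

```python
import codecs

TEXT = "Grüße aus Europa\n"

# Hypothetical in-memory stand-ins for the sample files mentioned above.
SAMPLES = {
    "utf8": TEXT.encode("utf-8"),
    "utf8_bom": codecs.BOM_UTF8 + TEXT.encode("utf-8"),
    "latin1": TEXT.encode("latin-1"),
    "utf16_le": TEXT.encode("utf-16-le"),
    "utf16_be": TEXT.encode("utf-16-be"),
    "utf16_le_bom": codecs.BOM_UTF16_LE + TEXT.encode("utf-16-le"),
    "utf16_be_bom": codecs.BOM_UTF16_BE + TEXT.encode("utf-16-be"),
}


def no_nul_heuristic(data):
    """The old rule as described above: data without a NUL byte is text."""
    return b"\x00" not in data


# Samples that the old rule would wrongly treat as binary.
misclassified = sorted(
    name for name, data in SAMPLES.items() if not no_nul_heuristic(data)
)
```

Every UTF-16 variant ends up in `misclassified`, since each ASCII character is stored with a zero byte next to it.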

Test-driven development: we want ftguess to correctly detect these samples as text.
The tests already use the future ftguess text type.

While we're at it: slightly improve the unit test output.
This is not so simple since various text encodings can look rather
"binary", but a few simple heuristics will deal with many text types (at
least those encountered here in Europe).

Of course, all xml is text as well, so use checks for "is this text" only
after more specialized tests like "is this xml".
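The ordering matters because a generic text check would also match XML. A minimal sketch of a "most specific first" dispatch follows; the function names and the deliberately simplified checks are hypothetical and do not reflect how ftguess is actually structured.

```python
def looks_like_xml(data):
    """Very rough XML check: an XML declaration at the start."""
    return data.lstrip().startswith(b"<?xml")


def looks_like_text(data):
    """Simplified text check: valid UTF-8 without NUL bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return b"\x00" not in data


# Order matters: specialized checks come first, the generic text check last.
CHECKS = (
    ("xml", looks_like_xml),
    ("text", looks_like_text),
)


def guess_type(data):
    """Return the name of the first matching check, or "unknown"."""
    for name, check in CHECKS:
        if check(data):
            return name
    return "unknown"
```

If the two entries in `CHECKS` were swapped, an XML document would be reported as plain text, because it passes the text check as well.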

2 participants