Extraction fails on plain text files #114

Open
bogdanionitabp opened this issue Sep 5, 2023 · 0 comments


Some webpages (such as https://github.com/JeremySkinner/WebMatrix.Data.StronglyTyped/blob/master/License.txt ) detect whether they are loaded from a browser or from another tool and serve a different content type accordingly. The one in the example serves HTML when loaded from a browser, but plain text when fetched via wget.
Thus:

wget https://github.com/JeremySkinner/WebMatrix.Data.StronglyTyped/blob/master/License.txt | unfluff
will yield no content:
{"title":"","softTitle":"","date":null,"author":[],"publisher":null,"copyright":null,"lang":null,"tags":[],"image":null,"videos":[],"links":[],"text":""}

because the tool fails to detect plain text.

This makes it unreliable for parsing webpages fetched via GET requests from non-browser tools.
As a workaround in my own code, I can search the text for HTML tags before calling unfluff, call it only if I find any, and otherwise assume the input is already plain text. It would be nice if the tool could do that automatically.
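
For anyone hitting the same problem, the workaround described above might look roughly like the sketch below. The names `looksLikeHtml` and `extractText` are hypothetical helpers, not part of unfluff, and the extractor is passed in as a parameter on the assumption that it takes an HTML string and returns an object with a `text` field (matching the JSON output shown earlier). The tag check is only a heuristic: plain text containing placeholders such as `<year>` would be misclassified as HTML.

```typescript
// Heuristic pattern for an opening, closing, or self-closing HTML tag,
// e.g. <p>, </div>, <br/>, <a href="x">. A "<" followed by a space or a
// digit (as in "a < b") does not match.
const HTML_TAG_PATTERN = /<\/?[a-zA-Z][a-zA-Z0-9-]*(\s[^<>]*)?\/?>/;

// Hypothetical helper: returns true if the input appears to contain HTML markup.
function looksLikeHtml(input: string): boolean {
  return HTML_TAG_PATTERN.test(input);
}

// Minimal shape assumed for the extractor's result; only `text` is used here.
interface Extracted {
  text: string;
}

// Hypothetical wrapper: run the HTML extractor only when the input looks
// like HTML; otherwise treat the input as plain text and return it trimmed.
// `extract` stands in for the real extractor (assumed signature).
function extractText(
  input: string,
  extract: (html: string) => Extracted
): string {
  if (looksLikeHtml(input)) {
    return extract(input).text;
  }
  return input.trim();
}
```

With this in place, a caller would pass the fetched body and the extractor function to `extractText` instead of calling the extractor directly, so plain-text responses are returned as-is rather than producing an empty `text`.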
