error: input contains invalid UTF-8 around byte XXXX #42

Open
Sripaad opened this issue Aug 20, 2021 · 3 comments


Sripaad commented Aug 20, 2021

I am trying to extract keywords from the amazon_reviews dataset. When I use it for Spanish, I encounter this error, which I am unable to resolve.

STACK TRACE
/python3.8/site-packages/multi_rake/algorithm.py in apply(self, text, text_for_stopwords)
     60 
     61         else:
---> 62             language_code = detect_language(text, self.lang_detect_threshold)
     63 
     64             if language_code is not None and language_code in STOPWORDS:

/opt/conda/lib/python3.8/site-packages/multi_rake/utils.py in detect_language(text, proba_threshold)
     12 
     13 def detect_language(text, proba_threshold):
---> 14     _, _, details = pycld2.detect(text)
     15 
     16     language_code = details[0][1]

error: input contains invalid UTF-8 around byte 2094 (of 5341)

Is there a workaround, such as manually specifying the language code?


7homasSutter commented Nov 10, 2021

This seems to be a problem with detecting the language code correctly. I didn't go too deep into checking what exactly happens, but either the input text got modified in a bad way or there is a bug in pycld2.detect(text).

A workaround that works is to provide the language code when the Rake() object is initialised:

from multi_rake import Rake

rake = Rake(language_code='en', stopwords=stopwords, max_words=3)
keywords = rake.apply(text)
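
If language detection still has to run, another option is to clean the text before passing it on. The following is only a sketch under an assumption the thread has not confirmed: that the failure is triggered by control characters, lone surrogates, or similar code points left in the string. The helper name `strip_unprintable` is hypothetical and is not part of multi_rake or pycld2.

```python
import unicodedata

# Whitespace control characters that are kept so line structure survives.
KEEP = {"\n", "\r", "\t"}


def strip_unprintable(text: str) -> str:
    """Drop code points in the Unicode "C" categories (control, format,
    surrogate, private use, unassigned), except the whitespace in KEEP.

    Assumption: such characters are what the language detector rejects.
    """
    return "".join(
        ch
        for ch in text
        if ch in KEEP or not unicodedata.category(ch).startswith("C")
    )
```

The cleaned string can then be passed to rake.apply() in place of the raw text. Note that format characters such as zero-width joiners are also removed, which may matter for some scripts.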

@vgrabovets
Owner

Could you please provide the text that causes this error?

@7homasSutter

I tested a bit, and it's not that simple to reproduce because I'm using text extracted from PDF files. If I just copy and paste the text here or into a new text file, the error seems to disappear. However, I set up a small Python gist with example code that triggers the bug: https://gist.github.com/7homasSutter/45c4fe43283c67feb1caff3175876baa
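
Since copying and pasting loses whatever triggers the error, it may help to look at the raw bytes near the offset reported in the message ("around byte 2094"). This is a rough sketch, assuming the reported offset refers to the UTF-8 encoding of the input string; `context_around_byte` is a hypothetical helper, not something from multi_rake or pycld2.

```python
def context_around_byte(text: str, byte_offset: int, width: int = 20) -> str:
    """Return an ASCII-safe view of the UTF-8 bytes around byte_offset.

    Assumption: the offset in the error message indexes into the UTF-8
    encoding of the text.
    """
    # surrogatepass encodes lone surrogates instead of raising an error.
    data = text.encode("utf-8", errors="surrogatepass")
    start = max(0, byte_offset - width)
    end = min(len(data), byte_offset + width)
    # repr() of a bytes object shows every non-ASCII or control byte as
    # an escape, so the result can be pasted into an issue unchanged.
    return repr(data[start:end])
```

For example, print(context_around_byte(text, 2094)) shows up to 40 bytes surrounding the reported position, which survives copy and paste because the output is plain ASCII.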

By default, the gist script assumes the PDF file './test/ChopChop.pdf'. Here is the example PDF: PDF-Download

I hope that helps you to reproduce the bug.
