error: input contains invalid UTF-8 around byte XXXX #42

Sripaad · 2021-08-20T14:45:54Z

I am trying to extract keywords from the amazon_reviews dataset. When I run it on the Spanish reviews, I hit the error below and have not been able to resolve it.

STACK TRACE
/python3.8/site-packages/multi_rake/algorithm.py in apply(self, text, text_for_stopwords)
     60 
     61         else:
---> 62             language_code = detect_language(text, self.lang_detect_threshold)
     63 
     64             if language_code is not None and language_code in STOPWORDS:

/opt/conda/lib/python3.8/site-packages/multi_rake/utils.py in detect_language(text, proba_threshold)
     12 
     13 def detect_language(text, proba_threshold):
---> 14     _, _, details = pycld2.detect(text)
     15 
     16     language_code = details[0][1]

error: input contains invalid UTF-8 around byte 2094 (of 5341)

Is there a workaround, such as specifying the language code manually?

7homasSutter · 2021-11-10T11:58:36Z

This seems to be a problem with language detection. I didn't dig too deep into what exactly happens, but either the input text was corrupted somewhere along the way or there is a bug in pycld2.detect(text).

A workaround that works for me is to pass the language code when the Rake() object is initialised:

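from multi_rake import Rake

# `stopwords` and `text` are assumed to be defined by the caller.
# Use the code that matches your data, e.g. 'es' for Spanish text.
rake = Rake(language_code='en', stopwords=stopwords, max_words=3)
keywords = rake.apply(text)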

vgrabovets · 2021-11-10T12:53:13Z

Could you please provide the text that causes this error?
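One way to narrow down a reproducible sample is to look at the text around the byte offset reported in the error message (2094 in the trace above). The helper below is only a sketch: `context_around_byte` is a hypothetical name, not part of multi_rake or pycld2, and it assumes the reported offset refers to the UTF-8 encoding of the string passed to `rake.apply`.

```python
def context_around_byte(text, byte_offset, radius=40):
    """Return a repr of the text surrounding a UTF-8 byte offset.

    Hypothetical diagnostic helper; assumes the offset in the error
    message counts bytes of the UTF-8 encoded input.
    """
    # 'surrogatepass' lets lone surrogates through so offsets stay aligned
    # even if the string itself is not cleanly encodable.
    raw = text.encode("utf-8", errors="surrogatepass")
    start = max(0, byte_offset - radius)
    end = min(len(raw), byte_offset + radius)
    # Decode leniently: a slice may cut a multi-byte character in half.
    snippet = raw[start:end].decode("utf-8", errors="replace")
    # repr() makes control characters visible as escapes such as \x00.
    return repr(snippet)
```

Calling `print(context_around_byte(text, 2094))` on the failing input should show any control or otherwise unusual characters near the reported position as escape sequences, which is easier to paste into an issue than the raw text.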

7homasSutter · 2021-11-17T10:45:26Z

I tested this a bit, and it's not that simple to reproduce because I'm using text extracted from PDF files. If I just copy and paste the text here or into a new text file, the error seems to disappear. However, I set up a small Python gist with example code that triggers the bug: https://gist.github.com/7homasSutter/45c4fe43283c67feb1caff3175876baa

By default the gist script expects the PDF file at './test/ChopChop.pdf'. Here is the example PDF: PDF-Download

I hope that helps you to reproduce the bug.
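Since the error disappears once the text is copied and pasted, one possible explanation is that PDF extraction leaves control or other non-printable characters in the string. The thread does not confirm this, so the following is only a sketch under that assumption: `strip_unprintable` is a hypothetical helper (not part of multi_rake) that drops such characters before the text is handed to the library.

```python
import unicodedata


def strip_unprintable(text):
    """Drop characters in the Unicode 'Other' categories (Cc, Cf, Cs, Co, Cn).

    Hypothetical pre-processing step; assumes the failure is caused by
    control or similar characters left over from PDF extraction.
    """
    # Keep ordinary whitespace that happens to be classified as control.
    keep = {"\n", "\t", "\r"}
    return "".join(
        ch for ch in text
        if ch in keep or unicodedata.category(ch)[0] != "C"
    )
```

With this in place, the call would become `rake.apply(strip_unprintable(text))`. Accented letters such as 'ñ' are untouched, because only the "Other" categories are filtered.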
