MRPC and CoLA Dataset UnicodeDecodeError #1405

Open
ZhuHouYi opened this issue Apr 17, 2024 · 0 comments

Error message: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd5 in position 147: invalid continuation byte

I can't train properly after loading these two datasets. The error is still reported after switching the encoding to "ISO-8859-1" or "latin-1".
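For context, here is a minimal sketch of how the two encodings treat the byte named in the error. In Python, "latin-1" (an alias of "ISO-8859-1") maps every byte value to a character, so decoding with it should never raise UnicodeDecodeError. If the error persists after the switch, one possibility is that some other read in the pipeline still uses utf-8. The sample bytes and the `try_decode` helper are made up for illustration and are not taken from the dataset or the training code.

```python
# Hypothetical sample containing 0xd5, the byte named in the error message.
sample = b"caf\xd5 au lait"


def try_decode(data: bytes, encoding: str):
    """Return the decoded text, or None if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# utf-8 treats 0xd5 as the start of a two-byte sequence; the following
# space is not a continuation byte, so decoding fails.
utf8_result = try_decode(sample, "utf-8")

# latin-1 assigns a character to every byte value, so this always succeeds.
latin1_result = try_decode(sample, "latin-1")

print("utf-8  :", utf8_result)
print("latin-1:", latin1_result)
```

If the second call succeeds on the real files too, the encoding argument is probably not reaching the code path that raises the error.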

After checking the train.txt file of the MRPC dataset, I found that the offending byte corresponds to the character "é". I then modified train.txt and test.txt and ran preprocessing again to regenerate train.tsv and test.tsv (I also checked that the regenerated files do not contain the character "é"). Training still fails with an error.
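A rough sketch for checking which bytes actually fail to decode, rather than searching for one character. Note that 0xd5 is not how "é" is stored in either UTF-8 (0xc3 0xa9) or Latin-1 (0xe9), so it may be worth confirming what sits at the reported position. The helper names `find_invalid_utf8` and `scan_files` and the `glue_data/MRPC` directory are assumptions for illustration; only the file names train.tsv and test.tsv come from the description above.

```python
from pathlib import Path


def find_invalid_utf8(data: bytes, context: int = 20):
    """Return (offset, bad_bytes, surrounding_bytes) for each undecodable run."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the rest of the data is valid UTF-8
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice that was decoded.
            start = pos + err.start
            end = pos + err.end
            lo = max(0, start - context)
            problems.append((start, data[start:end], data[lo:end + context]))
            pos = end  # resume scanning after the bad bytes
    return problems


def scan_files(paths):
    """Map each existing file path to its list of undecodable byte runs."""
    report = {}
    for p in paths:
        path = Path(p)
        if path.exists():
            report[str(path)] = find_invalid_utf8(path.read_bytes())
    return report


if __name__ == "__main__":
    # Hypothetical location; adjust to wherever the preprocessed files live.
    data_dir = Path("glue_data/MRPC")
    for name, found in scan_files([data_dir / "train.tsv", data_dir / "test.tsv"]).items():
        print(name, "->", len(found), "undecodable byte run(s)")
        for offset, bad, around in found[:5]:
            print("  offset", offset, "bytes", bad, "context", around)
```

If both files come back clean, the failing read is likely against a different file, for example a cached copy or the dev split.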
