Dataset encoding format #20

Open
foreverlove944 opened this issue Apr 3, 2024 · 1 comment

Comments

@foreverlove944

What encoding is used for the dataset you provided? I opened it as UTF-8: English characters display correctly, but Russian and other non-English text comes out garbled.
[Screenshot 2024-04-03 205121]

@HarshTrivedi
Member

The contexts/paragraphs were taken from the original source datasets. However, I did apply ftfy at runtime; see commaqa/inference/dataset_readers.py for an example. You might want to give it a try.
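
For anyone trying this, here is a minimal sketch of what applying ftfy to a string might look like. It is not the code from commaqa/inference/dataset_readers.py. It assumes the garbling is ordinary mojibake (UTF-8 bytes that were decoded with a single-byte codec somewhere upstream), which this thread does not confirm. `ftfy.fix_text` is ftfy's real entry point; `repair_text` and `undo_latin1_mojibake` are hypothetical helper names used only for illustration.

```python
# Sketch only: assumes the dataset text is mojibake, i.e. UTF-8 bytes that
# were decoded as Latin-1 at some point. That cause is an assumption.
try:
    import ftfy  # third-party package: pip install ftfy
except ImportError:
    ftfy = None


def undo_latin1_mojibake(text: str) -> str:
    """Reverse one round of 'UTF-8 read as Latin-1' (hypothetical helper)."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Text was already clean, or was damaged in some other way.
        return text


def repair_text(text: str) -> str:
    """Use ftfy when it is installed, else the simple fallback above."""
    if ftfy is not None:
        return ftfy.fix_text(text)
    return undo_latin1_mojibake(text)


# Simulate the damage on a Russian word, then repair it.
original = "Привет"
garbled = original.encode("utf-8").decode("latin-1")
repaired = repair_text(garbled)
```

If the assumption holds, `repaired` should compare equal to `original`. ftfy covers many more damage patterns than the one-line fallback, so it is the better choice when it is available.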
