Non-NllbTokenizers (non-fast) aren't able to save and load new language codes #44

Open
mshannon-sil opened this issue Oct 13, 2023 · 4 comments
Labels: bug (Something isn't working)

@mshannon-sil (Collaborator)

I was working on some tests with the stas/tiny-m2m_100 model, and I ran into an issue where new language codes added to the tokenizer do not seem to be saved properly. When the tokenizer is loaded from the saved tokenizer files, it does not recognize the new language codes and crashes. This is the error message I'm getting:

```
    def get_lang_token(self, lang: str) -> str:
>       return self.lang_code_to_token[lang]
E       KeyError: 'src_Lang'

.venv/lib/python3.8/site-packages/transformers/models/m2m_100/tokenization_m2m_100.py:378: KeyError
```
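For reference, a rough sketch of the round trip that triggers this. The registration step here is an assumption (`add_special_tokens` with an additional special token); the actual test code may also patch the tokenizer's in-memory language maps, which is why the code works before the round trip but not after:

```python
from transformers import M2M100Tokenizer

# Load the slow (non-fast) tokenizer for the tiny test model.
tokenizer = M2M100Tokenizer.from_pretrained("stas/tiny-m2m_100")

# Hypothetical new language code (anything not in the built-in fairseq list).
new_code = "src_Lang"

# Register the M2M100-style language token (assumption: added as an
# additional special token; internal lang maps may also need patching).
tokenizer.add_special_tokens({"additional_special_tokens": [f"__{new_code}__"]})

# Round-trip through disk.
tokenizer.save_pretrained("/tmp/tiny-m2m_100-custom")
reloaded = M2M100Tokenizer.from_pretrained("/tmp/tiny-m2m_100-custom")

# The reloaded tokenizer rebuilds lang_code_to_token from its built-in
# language list only, so the new code is missing and this raises
# KeyError: 'src_Lang' inside get_lang_token().
reloaded.src_lang = new_code
```

Setting `src_lang` calls `set_src_lang_special_tokens`, which goes through `get_lang_token`, matching the traceback above.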
mshannon-sil self-assigned this Oct 13, 2023
mshannon-sil added the bug (Something isn't working) label Oct 13, 2023
@mshannon-sil (Collaborator, Author)

It's failing for MBartTokenizer and M2M100Tokenizer, and succeeding for NllbTokenizer, NllbTokenizerFast, and MBartTokenizerFast. The failures don't split cleanly along PreTrainedTokenizer vs. PreTrainedTokenizerFast lines, so I'm not sure what the differentiator is yet.

@ddaspit (Contributor)

ddaspit commented Oct 18, 2023

This sounds like a bug in Huggingface transformers or tokenizers. If we can figure out exactly what the issue is, we could submit an issue to Huggingface. Until then, we should just work around it.

@mshannon-sil (Collaborator, Author)

Since this appears to affect only a subset of PreTrainedTokenizer types and none of the Fast tokenizers, it might just be that certain slow tokenizer types haven't been updated in a while to stay compatible with later huggingface releases. This shouldn't be an issue for us as long as we use Fast tokenizers. @ddaspit Should we still submit an issue to huggingface for this?
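For reference, a minimal sketch of the Fast-tokenizer path, assuming the new code is registered as an additional special token and using a hypothetical `xyz_Latn` code; the key difference is that NllbTokenizerFast resolves language codes through the vocabulary rather than a fixed built-in map:

```python
from transformers import NllbTokenizerFast

# Load the fast NLLB tokenizer (model name here is just an example).
tokenizer = NllbTokenizerFast.from_pretrained("facebook/nllb-200-distilled-600M")

# Hypothetical new language code in NLLB's lang_Script style.
new_code = "xyz_Latn"
tokenizer.add_special_tokens({"additional_special_tokens": [new_code]})

# Round-trip through disk; added special tokens are written to the saved
# tokenizer files and restored on load.
tokenizer.save_pretrained("/tmp/nllb-custom")
reloaded = NllbTokenizerFast.from_pretrained("/tmp/nllb-custom")

# The fast tokenizer looks the code up via convert_tokens_to_ids when setting
# src_lang, so the added token is found after the reload.
reloaded.src_lang = new_code
print(reloaded("Hello world").input_ids)
```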

@ddaspit (Contributor)

ddaspit commented Nov 6, 2023

I wouldn't worry about submitting an issue to Huggingface. We should leave this issue open to capture the inability to work with non-fast tokenizers. Once we put in a workaround for these tokenizers, we can close this issue.

johnml1135 changed the title from "Non-NllbTokenizers aren't able to save and load new language codes" to "Non-NllbTokenizers (non-fast) aren't able to save and load new language codes" on Dec 1, 2023