Non-NllbTokenizers (non-fast) aren't able to save and load new language codes #44

Open
mshannon-sil opened this issue Oct 13, 2023 · 4 comments
Labels: bug (Something isn't working)

@mshannon-sil (Collaborator)

I was working on some tests with the stas/tiny-m2m_100 model, and I ran into an issue where new language codes added to the tokenizer do not seem to be saved properly. When the tokenizer is loaded from the saved tokenizer files, it does not recognize the new language codes and crashes. This is the error message I'm getting:

```
    def get_lang_token(self, lang: str) -> str:
>       return self.lang_code_to_token[lang]
E       KeyError: 'src_Lang'

.venv/lib/python3.8/site-packages/transformers/models/m2m_100/tokenization_m2m_100.py:378: KeyError
```
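For reference, a rough sketch of the round trip that triggers this. The registration step here is an assumption (`add_special_tokens` with an additional special token); the actual test code may also patch the tokenizer's in-memory language maps, which is why the code works before the round trip but not after:

```python
from transformers import M2M100Tokenizer

# Load the slow (non-fast) tokenizer for the tiny test model.
tokenizer = M2M100Tokenizer.from_pretrained("stas/tiny-m2m_100")

# Hypothetical new language code (anything not in the built-in fairseq list).
new_code = "src_Lang"

# Register the M2M100-style language token (assumption: added as an
# additional special token; internal lang maps may also need patching).
tokenizer.add_special_tokens({"additional_special_tokens": [f"__{new_code}__"]})

# Round-trip through disk.
tokenizer.save_pretrained("/tmp/tiny-m2m_100-custom")
reloaded = M2M100Tokenizer.from_pretrained("/tmp/tiny-m2m_100-custom")

# The reloaded tokenizer rebuilds lang_code_to_token from its built-in
# language list only, so the new code is missing and this raises
# KeyError: 'src_Lang' inside get_lang_token().
reloaded.src_lang = new_code
```

Setting `src_lang` calls `set_src_lang_special_tokens`, which goes through `get_lang_token`, matching the traceback above.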
mshannon-sil self-assigned this Oct 13, 2023
mshannon-sil added the bug (Something isn't working) label Oct 13, 2023
@mshannon-sil (Collaborator, Author)

It's failing for MBartTokenizer and M2M100Tokenizer, and succeeding for NllbTokenizer, NllbTokenizerFast, and MBartTokenizerFast. The failures don't split cleanly along PreTrainedTokenizer vs. PreTrainedTokenizerFast lines, so I'm not sure what the differentiator is yet.

@ddaspit (Contributor)

ddaspit commented Oct 18, 2023

This sounds like a bug in Huggingface transformers or tokenizers. If we can figure out exactly what the issue is, we could submit an issue to Huggingface. Until then, we should just work around it.

@mshannon-sil (Collaborator, Author)

Since this appears to affect only a subset of PreTrainedTokenizer types and none of the Fast tokenizers, it might just be that certain slow tokenizer types haven't been updated in a while to stay compatible with later huggingface releases. This shouldn't be an issue for us as long as we use Fast tokenizers. @ddaspit Should we still submit an issue to huggingface for this?
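For reference, a minimal sketch of the Fast-tokenizer path, assuming the new code is registered as an additional special token and using a hypothetical `xyz_Latn` code; the key difference is that NllbTokenizerFast resolves language codes through the vocabulary rather than a fixed built-in map:

```python
from transformers import NllbTokenizerFast

# Load the fast NLLB tokenizer (model name here is just an example).
tokenizer = NllbTokenizerFast.from_pretrained("facebook/nllb-200-distilled-600M")

# Hypothetical new language code in NLLB's lang_Script style.
new_code = "xyz_Latn"
tokenizer.add_special_tokens({"additional_special_tokens": [new_code]})

# Round-trip through disk; added special tokens are written to the saved
# tokenizer files and restored on load.
tokenizer.save_pretrained("/tmp/nllb-custom")
reloaded = NllbTokenizerFast.from_pretrained("/tmp/nllb-custom")

# The fast tokenizer looks the code up via convert_tokens_to_ids when setting
# src_lang, so the added token is found after the reload.
reloaded.src_lang = new_code
print(reloaded("Hello world").input_ids)
```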

@ddaspit (Contributor)

ddaspit commented Nov 6, 2023

I wouldn't worry about submitting an issue to Huggingface. We should leave this issue open to capture the inability to work with non-fast tokenizers. Once we put in a workaround for these tokenizers, we can close this issue.

johnml1135 changed the title from "Non-NllbTokenizers aren't able to save and load new language codes" to "Non-NllbTokenizers (non-fast) aren't able to save and load new language codes" on Dec 1, 2023