I was working on some tests with the stas/tiny-m2m_100 model and ran into an issue where new language codes added to the tokenizer are not saved properly. When the tokenizer is loaded from the saved tokenizer files, it does not recognize the new language codes and crashes. This is the error message I'm getting:
It's failing for MBartTokenizer and M2M100Tokenizer, and succeeding for NllbTokenizer, NllbTokenizerFast, and MBartTokenizerFast. The split does not fall cleanly along the PreTrainedTokenizer/PreTrainedTokenizerFast line, so I'm not sure what the differentiator is yet.
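For reference, here is a minimal sketch of the failing round trip (the token string `__xyz__` and the round-trip check are my own illustration, not the actual test code):

```python
# Minimal repro sketch, assuming transformers is installed.
# "__xyz__" is a hypothetical new language code token.
import tempfile

from transformers import M2M100Tokenizer

tokenizer = M2M100Tokenizer.from_pretrained("stas/tiny-m2m_100")

# Register the new code as an additional special token.
new_code = "__xyz__"
tokenizer.add_special_tokens({"additional_special_tokens": [new_code]})
id_before = tokenizer.convert_tokens_to_ids(new_code)

with tempfile.TemporaryDirectory() as tmp_dir:
    tokenizer.save_pretrained(tmp_dir)
    reloaded = M2M100Tokenizer.from_pretrained(tmp_dir)

# For the slow tokenizers above, the reloaded tokenizer no longer
# recognizes the new code; fast tokenizers round-trip it correctly.
id_after = reloaded.convert_tokens_to_ids(new_code)
print(id_before, id_after, id_after == reloaded.unk_token_id)
```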
This sounds like a bug in Hugging Face transformers or tokenizers. If we can figure out exactly what the issue is, we could submit an issue to Hugging Face. Until then, we should just work around it.
Since this appears to affect only a subset of PreTrainedTokenizer types and none of the Fast tokenizers, it might just be that certain slow tokenizer types haven't been updated in a while to stay compatible with later Hugging Face releases. This shouldn't be an issue for us as long as we use Fast tokenizers. @ddaspit Should we still submit an issue to Hugging Face for this?
I wouldn't worry about submitting an issue to Hugging Face. We should leave this issue open to capture the inability to work with non-fast tokenizers. Once we put in a workaround for these tokenizers, we can close this issue.
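One possible shape for that workaround (the guard below is a suggestion, not committed code) is to load tokenizers through AutoTokenizer and fail fast if a slow tokenizer comes back:

```python
# Sketch of the workaround: insist on fast tokenizers, which handled
# added language codes correctly in the tests above. The model id is
# just an example.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M", use_fast=True
)
if not tokenizer.is_fast:
    # AutoTokenizer can silently fall back to a slow tokenizer when no
    # fast implementation is available, so guard explicitly.
    raise RuntimeError(
        "Fast tokenizer required to save/load new language codes"
    )
```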
johnml1135 changed the title from "Non-NllbTokenizers aren't able to save and load new language codes" to "Non-NllbTokenizers (non-fast) aren't able to save and load new language codes" on Dec 1, 2023.