FLORES‐200 Language Code Resolution for NMT Engine

John Lambert edited this page Jan 30, 2024 · 6 revisions

The NMT engine in Serval is based on the NLLB-200 model. NLLB-200 identifies each language by a FLORES-200 code of the form {language}_{script}, where the language is an ISO 639-3 code and the script is an ISO 15924 code. To maximize the chance of matching a language code that NLLB-200 uses, Serval attempts to convert the IETF language tags specified for an engine to FLORES-200 codes.
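As a rough illustration of the {language}_{script} shape, the sketch below checks whether a string looks like a FLORES-200 code, for example `eng_Latn`. The function name and the pattern are assumptions made for this example; the check covers the form only and does not confirm that NLLB-200 actually supports the code.

```python
import re

# Three lowercase letters (ISO 639-3 language), an underscore, then a
# four-letter title-case script (ISO 15924). Shape check only.
FLORES_200_PATTERN = re.compile(r"[a-z]{3}_[A-Z][a-z]{3}")


def looks_like_flores_200(code: str) -> bool:
    """Return True if `code` has the {language}_{script} shape (hypothetical helper)."""
    return FLORES_200_PATTERN.fullmatch(code) is not None
```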

The language tag is converted to a FLORES-200 code using the following algorithm:

  1. Extract the language and script subtags from the language tag.
  2. Find the correct ISO 639-3 language code:
    1. If the language subtag is already an ISO 639-3 code, then use it as-is.
    2. If the language subtag is a macrolanguage, then convert it to the closest ISO 639-3 code according to the following mapping:
      • ar -> arb
      • ms -> zsm
      • lv -> lvs
      • ne -> npi
      • sw -> swh
    3. If the language subtag is cmn (Mandarin Chinese), then convert to zho.
    4. If the language subtag is an ISO 639-1 code, then convert it to the corresponding ISO 639-3 code.
  3. Find the correct ISO 15924 script code:
    1. If the script subtag is specified, then use as-is.
    2. If no script subtag is specified, then find the default script for the language tag by searching the SLDR langtags.json file. If the language tag or language subtag matches a tag in the tags field of a language entry, then use the corresponding script field.
    3. If the script is Kore, then convert to Hang.
  4. If either the ISO 639-3 code or the ISO 15924 code cannot be found, then use the original language tag, unchanged, as the FLORES-200 code.
  5. Otherwise, construct the FLORES-200 code from the ISO 639-3 language code and ISO 15924 script code: {language}_{script}.
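
The steps above can be sketched in Python as follows. This is a minimal illustration, not Serval's actual implementation. The function names are hypothetical, `ISO_639_1_TO_3` is a small hand-picked subset, and `LANGTAGS_SAMPLE` is a hand-written stand-in for the SLDR langtags.json data that models only the tags and script fields mentioned above. The sketch also assumes that the cmn and macrolanguage rules are checked before the "already ISO 639-3" rule, that any three-letter subtag counts as an ISO 639-3 code, and that the Kore to Hang conversion applies to an explicitly specified script as well.

```python
from typing import Optional, Tuple

# Macrolanguage mapping taken from the list above.
MACROLANGUAGES = {"ar": "arb", "ms": "zsm", "lv": "lvs", "ne": "npi", "sw": "swh"}

# Assumption: illustrative subset of the ISO 639-1 to ISO 639-3 table.
ISO_639_1_TO_3 = {"en": "eng", "es": "spa", "fr": "fra", "ko": "kor", "zh": "zho"}

# Assumption: hand-written stand-in for langtags.json entries.
LANGTAGS_SAMPLE = [
    {"tags": ["en", "en-US"], "script": "Latn"},
    {"tags": ["es"], "script": "Latn"},
    {"tags": ["ar", "arb"], "script": "Arab"},
    {"tags": ["ko", "ko-KR"], "script": "Kore"},
    {"tags": ["zh", "zh-CN", "cmn"], "script": "Hans"},
]


def split_tag(tag: str) -> Tuple[str, Optional[str]]:
    """Step 1: extract the language subtag and the script subtag, if any."""
    parts = tag.replace("_", "-").split("-")
    language = parts[0].lower()
    # A script subtag is a four-letter alphabetic subtag after the language.
    script = next((p.title() for p in parts[1:] if len(p) == 4 and p.isalpha()), None)
    return language, script


def resolve_language(subtag: str) -> Optional[str]:
    """Step 2: find the ISO 639-3 code, or None if it cannot be found."""
    if subtag == "cmn":
        return "zho"
    if subtag in MACROLANGUAGES:
        return MACROLANGUAGES[subtag]
    if len(subtag) == 3 and subtag.isalpha():
        return subtag  # treated as already being an ISO 639-3 code
    return ISO_639_1_TO_3.get(subtag)


def default_script(tag: str, language_subtag: str) -> Optional[str]:
    """Step 3.2: look up the default script by full tag, then by language subtag."""
    for candidate in (tag, language_subtag):
        for entry in LANGTAGS_SAMPLE:
            if candidate in entry["tags"]:
                return entry["script"]
    return None


def to_flores_200(tag: str) -> str:
    """Convert an IETF language tag to a FLORES-200 code, falling back to the tag."""
    language_subtag, script = split_tag(tag)
    language = resolve_language(language_subtag)
    if script is None:
        script = default_script(tag, language_subtag)
    if script == "Kore":
        script = "Hang"  # step 3.3
    if language is None or script is None:
        return tag  # step 4: fall back to the original tag
    return f"{language}_{script}"  # step 5
```

With the sample data above, `to_flores_200("ko")` yields `kor_Hang`, `to_flores_200("zh-Hant")` yields `zho_Hant`, and an unknown tag such as `xx` is returned unchanged.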