FLORES-200 Language Code Resolution for NMT Engine

The NMT engine in Serval is based on the NLLB-200 model. NLLB-200 identifies each language by a FLORES-200 code of the form {language}_{script}, where {language} is an ISO 639-3 code and {script} is an ISO 15924 code. To maximize the chance of matching a language that NLLB-200 supports, Serval attempts to convert each IETF language tag specified for an engine to a FLORES-200 code.
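As a rough illustration of the {language}_{script} shape, the following Python sketch splits a code into its two parts. The pattern and the function name `split_flores200` are assumptions made for this example and are not part of Serval; the pattern only checks the shape (three lowercase letters, an underscore, a four-letter script in title case), not whether NLLB-200 actually supports the code.

```python
import re

# Assumed shape: ISO 639-3 language (three lowercase letters), an underscore,
# then an ISO 15924 script (four letters, first one uppercase).
FLORES200_PATTERN = re.compile(r"([a-z]{3})_([A-Z][a-z]{3})")


def split_flores200(code):
    """Return (language, script) if the code has the FLORES-200 shape, else None."""
    match = FLORES200_PATTERN.fullmatch(code)
    return match.groups() if match else None


print(split_flores200("eng_Latn"))  # ('eng', 'Latn')
print(split_flores200("en-US"))     # None
```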

The language tag is converted to a FLORES-200 code using the following algorithm:

1. Extract the language and script subtags from the language tag.
2. Find the correct ISO 639-3 language code:
   1. If the language subtag is already an ISO 639-3 code, use it as-is.
   2. If the language subtag is a macrolanguage, convert it to the closest ISO 639-3 code according to the following mapping:
      - ar -> arb
      - ms -> zsm
      - lv -> lvs
      - ne -> npi
      - sw -> swh
   3. If the language subtag is cmn (Mandarin Chinese), convert it to zho.
   4. If the language subtag is an ISO 639-1 code, convert it to the corresponding ISO 639-3 code.
3. Find the correct ISO 15924 script code:
   1. If the script subtag is specified, use it as-is.
   2. Otherwise, find the default script for the language tag by searching the SLDR langtags.json file. If the language tag or language subtag matches a tag in the tags field of a language entry, use the corresponding script field.
   3. If the script is Kore, convert it to Hang.
4. If an ISO 639-3 code or ISO 15924 code cannot be found, use the language tag as the FLORES-200 code.
5. Otherwise, construct the FLORES-200 code from the ISO 639-3 language code and ISO 15924 script code: {language}_{script}.
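
The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Serval's actual implementation: all function names are hypothetical, `ISO639_1_TO_3` is a small illustrative subset of the real ISO 639-1 to ISO 639-3 table, `LANGTAGS` is a hand-written stand-in for entries loaded from the SLDR langtags.json file, and any three-letter language subtag is treated as an ISO 639-3 code without validation.

```python
import re

# Macrolanguage mapping taken from the steps above.
MACROLANGUAGE_TO_ISO639_3 = {
    "ar": "arb", "ms": "zsm", "lv": "lvs", "ne": "npi", "sw": "swh",
}

# Assumption: illustrative subset of the ISO 639-1 to ISO 639-3 table.
ISO639_1_TO_3 = {"en": "eng", "es": "spa", "fr": "fra", "ko": "kor", "zh": "zho"}

# Assumption: hand-written stand-in for langtags.json entries; only the
# "tags" and "script" fields mentioned above are modeled here.
LANGTAGS = [
    {"tags": ["en", "en-US"], "script": "Latn"},
    {"tags": ["ar", "ar-EG"], "script": "Arab"},
    {"tags": ["ko", "ko-KR"], "script": "Kore"},
    {"tags": ["zh", "zh-CN"], "script": "Hans"},
]


def split_tag(tag):
    """Extract the language subtag and the optional script subtag."""
    parts = re.split(r"[-_]", tag)
    language = parts[0].lower()
    # A script subtag is four letters (e.g. Latn); region subtags are shorter.
    for part in parts[1:]:
        if len(part) == 4 and part.isalpha():
            return language, part.capitalize()
    return language, None


def resolve_language(language):
    """Return an ISO 639-3 code, or None if no conversion is known."""
    if language in MACROLANGUAGE_TO_ISO639_3:
        return MACROLANGUAGE_TO_ISO639_3[language]
    if language == "cmn":
        return "zho"
    if len(language) == 3:
        return language  # assumed to be ISO 639-3 already (not validated)
    return ISO639_1_TO_3.get(language)


def resolve_script(tag, language, script):
    """Return an ISO 15924 code, or None if no default script is known."""
    if script is None:
        for entry in LANGTAGS:
            if tag in entry["tags"] or language in entry["tags"]:
                script = entry["script"]
                break
    # The Kore -> Hang conversion from the steps above.
    return "Hang" if script == "Kore" else script


def to_flores200(tag):
    """Convert a language tag to a FLORES-200 code, falling back to the tag."""
    language, script = split_tag(tag)
    iso_language = resolve_language(language)
    iso_script = resolve_script(tag, language, script)
    if iso_language is None or iso_script is None:
        return tag
    return f"{iso_language}_{iso_script}"


for tag in ["en", "ar", "ko", "zh-CN", "cmn-Hant", "xx"]:
    print(tag, "->", to_flores200(tag))
```

With the sample data above, `en` resolves to `eng_Latn`, `ko` to `kor_Hang` (through the Kore to Hang conversion), and `cmn-Hant` to `zho_Hant`, while the unknown tag `xx` is returned unchanged because neither a language code nor a script can be found for it.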
