FLORES‐200 Language Code Resolution for NMT Engine
John Lambert edited this page Jan 30, 2024 · 6 revisions
The NMT engine in Serval is based on the NLLB-200 model. In the NLLB model, languages are identified by a FLORES-200 code of the form `{language}_{script}`, where `language` is an ISO 639-3 code and `script` is an ISO 15924 code. In order to have the best chance of matching a language code used in NLLB-200, Serval attempts to convert the IETF language tags specified for an engine to FLORES-200 codes.
The language tag is converted to a FLORES-200 code using the following algorithm:
- Extract the language and script subtags from the language tag.
- Find the correct ISO 639-3 language code:
  - If the language subtag is already an ISO 639-3 code, then use it as-is.
  - If the language subtag is a macrolanguage, then convert it to the closest ISO 639-3 code according to the following mapping:
    - `ar` -> `arb`
    - `ms` -> `zsm`
    - `lv` -> `lvs`
    - `ne` -> `npi`
    - `sw` -> `swh`
  - If the language subtag is `cmn` (Mandarin Chinese), then convert it to `zho`.
  - If the language subtag is an ISO 639-1 code, then convert it to the corresponding ISO 639-3 code.
- Find the correct ISO 15924 script code:
  - If the script subtag is specified, then use it as-is.
  - Otherwise, find the default script for the language tag by searching the SLDR langtags.json file. If the language tag or language subtag matches a tag in the `tags` field of a language entry, then use the corresponding `script` field.
  - If the script is `Kore`, then convert it to `Hang`.
- If an ISO 639-3 code or ISO 15924 code cannot be found, then use the language tag as the FLORES-200 code.
- Construct the FLORES-200 code from the ISO 639-3 language code and ISO 15924 script code: `{language}_{script}`.
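The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Serval's actual implementation: every function name is hypothetical, `ISO639_1_TO_3` is a tiny sample of the full ISO 639-1 table, and `LANGTAGS` is made-up stand-in data for entries loaded from the SLDR langtags.json file (only the `tags` and `script` fields mentioned above are modeled). The sketch also assumes that any three-letter language subtag is an ISO 639-3 code, that the `cmn` rule takes precedence over the "use as-is" rule, and that the `Kore` to `Hang` conversion applies whether the script was given explicitly or looked up.

```python
from typing import Optional, Tuple

# Macrolanguage mapping taken from the list above.
MACROLANGUAGES = {"ar": "arb", "ms": "zsm", "lv": "lvs", "ne": "npi", "sw": "swh"}

# Hypothetical sample; a real table would cover every ISO 639-1 code.
ISO639_1_TO_3 = {"en": "eng", "es": "spa", "ko": "kor"}

# Hypothetical stand-in for entries from the SLDR langtags.json file.
LANGTAGS = [
    {"tags": ["en", "en-US"], "script": "Latn"},
    {"tags": ["ar", "arb"], "script": "Arab"},
    {"tags": ["ko", "kor"], "script": "Kore"},
    {"tags": ["sw", "swh"], "script": "Latn"},
]


def split_tag(tag: str) -> Tuple[str, Optional[str]]:
    """Extract the language subtag and, if present, the four-letter script subtag."""
    subtags = tag.replace("_", "-").split("-")
    language = subtags[0].lower()
    for subtag in subtags[1:]:
        if len(subtag) == 4 and subtag.isalpha():
            return language, subtag.title()
    return language, None


def resolve_language(language: str) -> Optional[str]:
    """Map a language subtag to an ISO 639-3 code, or None if unknown."""
    if language == "cmn":  # Mandarin Chinese is mapped to zho
        return "zho"
    if language in MACROLANGUAGES:
        return MACROLANGUAGES[language]
    if len(language) == 3:  # assumed to already be ISO 639-3
        return language
    return ISO639_1_TO_3.get(language)


def resolve_script(tag: str, language: str, script: Optional[str]) -> Optional[str]:
    """Use the given script, or look up the default script in LANGTAGS."""
    if script is None:
        for entry in LANGTAGS:
            if tag in entry["tags"] or language in entry["tags"]:
                script = entry["script"]
                break
    return "Hang" if script == "Kore" else script


def to_flores_200(tag: str) -> str:
    """Convert an IETF language tag to a FLORES-200 code, falling back to the tag."""
    language, script = split_tag(tag)
    iso639_3 = resolve_language(language)
    iso15924 = resolve_script(tag, language, script)
    if iso639_3 is None or iso15924 is None:
        return tag
    return f"{iso639_3}_{iso15924}"


for tag in ["en", "ar", "ko", "cmn-Hans", "xx"]:
    print(tag, "->", to_flores_200(tag))
```

With the sample data above, `en` resolves to `eng_Latn`, `ar` to `arb_Arab`, `ko` to `kor_Hang`, and `cmn-Hans` to `zho_Hans`, while the unknown tag `xx` is returned unchanged by the fallback step.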