
Translation with glossary and target "EN-GB" loses some words #111

Open
EnricoPicci opened this issue Jun 26, 2024 · 2 comments

Comments

@EnricoPicci

I have a text to translate from Italian to English, this one:
text_to_translate = "| \\_VOEMI | Data emissione operazione | Deve essere maggiore o uguale alla data di emissione della polizza e minore o uguale alla data di sistema. |"

I also have a glossary I want to use:

import deepl

translator = deepl.Translator(...)  # authenticated client, auth key elided

entries = {"Fattore": "Variable", "Data emissione": "Issuance date"}
my_glossary = translator.create_glossary(
    "My glossary",
    source_lang="IT",
    target_lang="EN",
    entries=entries,
)

If I translate the text with target "EN-GB", I get this result:
| Issuance date | Must be greater than or equal to the policy issue date and less than or equal to the system date. |
The issue here is that the first cell, | \\_VOEMI, gets lost.

However, if I specify "EN-US" as the target language, I get this correct result:
| | \_VOEMI | Issuance date transaction | Must be greater than or equal to the policy issue date and less than or equal to the system date. |
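For reference, a minimal sketch of the two calls I am describing, assuming the glossary and translator created above; only target_lang changes between the runs:

# Sketch: same text, same glossary, only the target variant differs
result_gb = translator.translate_text(
    text_to_translate,
    source_lang="IT",
    target_lang="EN-GB",
    glossary=my_glossary,
)
result_us = translator.translate_text(
    text_to_translate,
    source_lang="IT",
    target_lang="EN-US",
    glossary=my_glossary,
)
print(result_gb.text)  # drops the | \_VOEMI cell
print(result_us.text)  # keeps it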

@JanEbbing
Member

I'm not 100% sure what your use case is, but you will get the highest possible translation quality by parsing structured data like this before feeding it into the API. For example, in your case:

import deepl

text_to_translate = "| \\_VOEMI            | Data emissione operazione | Deve essere maggiore o uguale alla data di emissione della polizza e minore o uguale alla data di sistema.     |"
special_tokens = ["\\_"]
delimiter = "|"
translator = deepl.Translator(...)

translated_texts = []
for text in text_to_translate.split(delimiter):
    # Pass empty cells and cells containing special tokens through unchanged
    if (not text.strip()) or any(tok in text for tok in special_tokens):
        translated_texts.append(text)
    else:
        # you might want to strip the whitespace here as well with text.strip(), and maybe
        # fill up the missing whitespace when appending to translated_texts, as this looks like a table
        translated_texts.append(translator.translate_text(text, ...).text)
output = delimiter.join(translated_texts)

Due to the nature of ML models, we otherwise cannot guarantee that the output is stable or that it preserves these kinds of tokens. You can also take a look at ignore tags as another option; a sketch follows.
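A minimal sketch of the ignore-tag approach, using XML tag handling; the tag name "keep" here is arbitrary:

# Wrap the token that must survive in a tag, then exclude that tag from translation
text = "| <keep>\\_VOEMI</keep> | Data emissione operazione | Deve essere maggiore o uguale alla data di emissione della polizza e minore o uguale alla data di sistema. |"
result = translator.translate_text(
    text,
    source_lang="IT",
    target_lang="EN-GB",
    tag_handling="xml",
    ignore_tags=["keep"],
)
print(result.text)  # the <keep>...</keep> content is passed through unchanged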

@EnricoPicci
Author

Jan, thanks for your prompt response. I will implement your suggestions. At the same time, the different behaviour between "EN-GB" and "EN-US" is interesting.
