Consider adding more control to the way that translation is performed on a per-marker basis. #560

Open
davidbaines opened this issue Oct 11, 2024 · 5 comments
Labels
enhancement New feature or request pipeline 6: infer Issue related to using a trained model to translate. research Research topics

Comments

@davidbaines
Collaborator

It would be very helpful to have finer control over the way that SILNLP produces translations. It would be ideal to be able to specify what should happen with the data in each marker or group of markers. The translate_config.yml file might be a good place to configure these settings.

The following actions could be supported for Paragraph style markers:
Delete: Ignore the marker and its data, and omit both from the output.
Translate: Copy the marker to the output along with the translation of its content.
Copy: Copy the marker and text to the output verbatim, without attempting to translate. This is useful for references, for example.

Further actions are possible for Character style markers:
Translate without marker: Extract the text from the marker and translate it, but don't add the marker to the output.
Translate and move marker: Extract the text from the marker and translate it. Add the marker and end marker to the output.
This would have the option of adding the marker and end marker together at either the beginning or the end of the paragraph, or of adding the marker at the beginning of the paragraph and the end marker at the end.

Every marker has a \StyleType which is one of: Paragraph, Character, Milestone or Note. It might (or might not) be useful to be able to apply one action to all those markers with a specific \StyleType. Although this would likely not be very useful for the Paragraph or Character Styles which are widely used, it could be useful as a way to decide what should happen with Notes and Milestones.

Most, but not all, markers have a \TextType, which is one of: Title, ChapterNumber, VerseNumber, VerseText, Other, NoteText, Section.
It might be useful to be able to apply one action to all markers with a specific \TextType.

According to the USFM Reference, markers that are used within a broader text "environment" are named using a reserved initial letter rather than an opening and closing tag.
In other words, the markers beginning with \i form the introduction, all those beginning with \f belong to a given footnote, etc.

Ideally, we would be able to specify what happens to these as a group without having to specify what happens to each individual marker within the group.

\i - Introductions
\f - Footnotes
\x - Cross references
\e - Explanatory (study) material
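
One way the lookup could work is sketched below: a rule for a specific marker wins over a group rule, which wins over a \StyleType rule, which wins over a default. This is only an illustration under stated assumptions. The `Action` enum, the `CONFIG` layout and `resolve_action` are hypothetical names, not existing SILNLP settings, and the configured values are arbitrary. Note that a bare initial-letter rule is too greedy: \id also begins with "i" but is not part of the introduction, so exceptions like that would need their own entries.

```python
from enum import Enum


class Action(Enum):
    DELETE = "delete"
    TRANSLATE = "translate"
    COPY = "copy"


# Hypothetical settings, e.g. as they might be loaded from translate_config.yml.
CONFIG = {
    # Explicit per-marker rules; \id is listed so the "i" group rule does not catch it.
    "markers": {"id": Action.COPY, "r": Action.COPY, "ip": Action.TRANSLATE},
    # Group rules keyed on the reserved initial letter (\i..., \f..., \x..., \e...).
    "groups": {"i": Action.DELETE, "f": Action.DELETE, "x": Action.DELETE},
    # Rules keyed on \StyleType.
    "style_types": {"Milestone": Action.COPY},
    "default": Action.TRANSLATE,
}


def resolve_action(marker, style_type=None, config=CONFIG):
    """Return the action for a marker; the most specific matching rule wins."""
    name = marker.lstrip("\\")
    if name in config["markers"]:
        return config["markers"][name]
    if name[:1] in config["groups"]:
        return config["groups"][name[:1]]
    if style_type in config["style_types"]:
        return config["style_types"][style_type]
    return config["default"]


for marker, style in [("\\ip", "Paragraph"), ("\\is1", "Paragraph"), ("\\id", "Paragraph"), ("\\p", "Paragraph")]:
    print(marker, resolve_action(marker, style).value)
```

A \TextType rule could be added as one more level in the same chain.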

@davidbaines davidbaines added enhancement New feature or request pipeline 6: infer Issue related to using a trained model to translate. research Research topics labels Oct 11, 2024
@davidbaines
Collaborator Author

Issue #306 is an earlier version of this request. Since this one has more detail, let's keep this one.

@davidbaines
Collaborator Author

Here are some ideas of how we could add more flexibility to the AI drafting.

  1. Add an option to transfer paragraph markers to the draft. (Some teams may not want this.)
    Provide a way to specify what should happen to each paragraph marker. So far the options are transfer or don't transfer; I'll ask for other ideas.

  2. Provide options for how each inline marker and its content should be treated.
    Options are:

    Omit the marker and its content from the draft.

    Retain the marker and place it:

        At the beginning, middle or end of the paragraph.
        At the beginning, middle or end of the verse.
        Where word alignments suggest it belongs in the paragraph.
        Where word alignments suggest it belongs in the verse.

    These are the options for what to do with the content of the marker:

        Remove the content.
        Leave the content unchanged.
        Translate the content.
        Apply one or more regular expressions to the content.
        Send the content and marker to a Python function that returns a replacement marker and content.
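
The content options could be dispatched roughly as in the sketch below. It is a rough illustration, not a proposed API. The function name `process_inline`, the mode strings and the `(marker, content)` return shape are all assumptions made for the example, and the sample \w attribute value is invented.

```python
import re


def process_inline(marker, content, mode, translate=None, patterns=(), handler=None):
    """Apply one content option to an inline marker; returns (marker, content)."""
    if mode == "remove":
        return marker, ""
    if mode == "keep":
        return marker, content
    if mode == "translate":
        # `translate` stands in for whatever callable produces the draft text.
        return marker, translate(content)
    if mode == "regex":
        # Patterns are applied in order, each to the result of the previous one.
        for pattern, replacement in patterns:
            content = re.sub(pattern, replacement, content)
        return marker, content
    if mode == "function":
        # The handler receives marker and content and may replace both.
        return handler(marker, content)
    raise ValueError(f"unknown mode: {mode}")


# Strip word-level attributes from a \w marker, keeping only the word itself.
print(process_inline("w", 'grace|lemma="charis"', "regex", patterns=[(r"\|.*$", "")]))
# A handler may also swap the marker, here \nd for \sc.
print(process_inline("nd", "Lord", "function", handler=lambda m, c: ("sc", c.upper())))
```

Where the marker ends up (beginning, middle or end of the paragraph or verse) is a separate decision and is deliberately left out here.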

@davidbaines
Collaborator Author

davidbaines commented Oct 25, 2024

For paragraph markers we could calculate the position of the marker in the source verse and reinsert it in the same relative position (to the nearest word) in the translation.
We could count words, characters or maybe even tokens as three ways of estimating the relative position. Each of those may give different results, and teams could choose whichever works best in their case.

It's possible that this may be a more effective solution than using word alignments, since it is easy to understand and has no chance of mixing up the order of the markers in the original.
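
A minimal sketch of the relative-position idea follows, assuming the source verse has already been split into the text pieces that sit between its paragraph markers. The function names are hypothetical. As a simplification, `measure` only changes how the source fractions are computed (words by default, characters with `len`); the break in the translation is always snapped to a word boundary.

```python
def word_count(text):
    return len(text.split())


def split_by_relative_position(source_segments, translation, measure=word_count):
    """Split `translation` into as many pieces as there are source segments,
    placing each break at the same relative position as in the source."""
    sizes = [measure(segment) for segment in source_segments]
    total = sum(sizes)
    if total == 0:
        raise ValueError("source verse has no measurable text")
    words = translation.split()
    pieces, start, seen = [], 0, 0
    for size in sizes[:-1]:
        seen += size
        # Relative position in the source, mapped to the nearest target word.
        cut = max(start, round(seen / total * len(words)))
        pieces.append(" ".join(words[start:cut]))
        start = cut
    pieces.append(" ".join(words[start:]))
    return pieces


source = [
    "Abraham was the father of Isaac,",
    "Isaac the father of Jacob,",
    "Jacob the father of Judah and his brothers.",
]
translation = (
    "Abraham eut pour descendant Isaac, Isaac eut pour descendant Jacob, "
    "Jacob eut pour descendant Juda et ses frères."
)
for piece in split_by_relative_position(source, translation):
    print(piece)
```

Because breaks are only ever placed left to right, the marker order cannot change. On this sample verse the first break lands one word late (after the second "Isaac"), which shows the kind of small drift this method can produce.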

@davidbaines
Collaborator Author

Here's another option for how we deal with Paragraph level markers. It might make a good default, though there will be some challenges in implementing it.
It would save our EITL team quite a few emails asking why there are empty markers at the end of the verse text, which is what our current system produces.

Many verses that are split over multiple paragraph style markers also have punctuation that precedes the marker in the original AND in the translation.
Say we have this in the source:

\v 2 Abraham was the father of Isaac,
\li1 Isaac the father of Jacob,
\li1 Jacob the father of Judah and his brothers.

And the translated version looks like this:
\v 2 Abraham eut pour descendant Isaac, Isaac eut pour descendant Jacob, Jacob eut pour descendant Juda et ses frères.

In this case the punctuation matches exactly and we should have great confidence in producing:
\v 2 Abraham eut pour descendant Isaac,
\li1 Isaac eut pour descendant Jacob,
\li1 Jacob eut pour descendant Juda et ses frères.
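
The exact-match case could be checked along these lines. This is a sketch under assumptions: the function name is hypothetical, only a small set of Latin script punctuation is considered, and it refuses to split (returns `None`) unless the sequence of punctuation marks is identical on both sides. It makes no attempt to handle quotation marks or other scripts.

```python
import re

# Latin script punctuation only; other scripts would need their own equivalents.
PUNCT_RE = re.compile(r"[,;:.!?]")


def split_on_matching_punctuation(source_segments, translation):
    """Return the translation split at the punctuation that closes each source
    segment, or None when the punctuation does not line up exactly."""
    source_marks = [PUNCT_RE.findall(segment) for segment in source_segments]
    target_matches = list(PUNCT_RE.finditer(translation))
    flat_source = [mark for marks in source_marks for mark in marks]
    if flat_source != [match.group() for match in target_matches]:
        return None
    pieces, start, used = [], 0, 0
    for segment, marks in zip(source_segments[:-1], source_marks):
        # A break is only trusted where the source segment ends in punctuation.
        if not marks or not segment.rstrip().endswith(marks[-1]):
            return None
        used += len(marks)
        end = target_matches[used - 1].end()
        pieces.append(translation[start:end].strip())
        start = end
    pieces.append(translation[start:].strip())
    return pieces


source = [
    "Abraham was the father of Isaac,",
    "Isaac the father of Jacob,",
    "Jacob the father of Judah and his brothers.",
]
translation = (
    "Abraham eut pour descendant Isaac, Isaac eut pour descendant Jacob, "
    "Jacob eut pour descendant Juda et ses frères."
)
for marker, piece in zip(["\\v 2", "\\li1", "\\li1"], split_on_matching_punctuation(source, translation)):
    print(marker, piece)
```

Returning `None` instead of guessing leaves room for a different placement method when the match fails.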

One challenge is coping with translation across different scripts. For the example below we need to know how Arabic script punctuation relates to Latin script punctuation, and we will need the same information for every script pair. (Ulf's URoman may have this information.)

\v 2 إِبْرَاهِيمُ أَنْجَبَ إِسْحَاقَ. وَإِسْحاقُ أَنْجَبَ يَعْقُوبَ. وَيَعْقُوبُ أَنْجَبَ يَهُوذَا وَإِخْوَتَهُ.

Another challenge is ensuring that quotation marks are not split off from the quotation they belong to, even where punctuation sits just inside the closing quotation mark.
\v 6 He says to himself, “Nothing will ever shake me.”
\q2 He swears, “No one will ever do me harm.”
\b
\q1

There are also quotations that are split over multiple paragraphs:

\v 5 “Because the poor are plundered and the needy groan,
\q2 I will now arise,” says the \nd Lord\nd*.
\q2 “I will protect them from those who malign them.”
\q1

Although this looks like a simple mechanical fix, it's much more difficult to achieve than it first appears.

@davidbaines
Collaborator Author

davidbaines commented Oct 29, 2024

Two translators have indicated that matching on punctuation would be good to try first, falling back to counting words or characters and putting markers back in the same relative position in cases where we can't match on punctuation.

"... assuming that would get it right most of the time. Moving an occasional marker over a word or two isn't a big deal but inserting all the paragraph and quote marks manually is a significant amount of time."

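The "punctuation first, then fall back" order could be expressed as a chain of strategies, each returning `None` when it cannot place the markers. The sketch below is illustrative only. All function names are hypothetical and both strategies are deliberately crude stand-ins. If every strategy fails, the whole translation stays in the first piece and the remaining markers are left empty.

```python
import re


def by_punctuation(source_segments, translation):
    """Split after punctuation; succeed only if the piece count matches."""
    parts = re.split(r"(?<=[,;:.!?])\s+", translation.strip())
    return parts if len(parts) == len(source_segments) else None


def by_word_share(source_segments, translation):
    """Fallback: break at the same relative word position as in the source."""
    counts = [len(segment.split()) for segment in source_segments]
    total = sum(counts)
    if total == 0:
        return None
    words = translation.split()
    pieces, start, seen = [], 0, 0
    for count in counts[:-1]:
        seen += count
        cut = max(start, round(seen / total * len(words)))
        pieces.append(" ".join(words[start:cut]))
        start = cut
    pieces.append(" ".join(words[start:]))
    return pieces


def place_markers(source_segments, translation, strategies):
    """Try each (name, strategy) pair in order and report which one was used."""
    for name, strategy in strategies:
        pieces = strategy(source_segments, translation)
        if pieces is not None and len(pieces) == len(source_segments):
            return name, pieces
    # Nothing worked: keep the text together and leave later markers empty.
    return "unsplit", [translation] + [""] * (len(source_segments) - 1)


STRATEGIES = [("punctuation", by_punctuation), ("word_share", by_word_share)]
print(place_markers(["one two three,", "four five six."], "un deux trois, quatre cinq six.", STRATEGIES))
print(place_markers(["one two three,", "four five six."], "un deux trois quatre cinq six.", STRATEGIES))
```

Reporting the strategy name alongside the result would let a team see how often the fallback was needed.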