Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SF-2900 Mixed Source: Selecting books from each source #2825

Open
wants to merge 13 commits into
base: master
Choose a base branch
from

Conversation

pmachapman
Copy link
Collaborator

@pmachapman pmachapman commented Oct 28, 2024

This PR adds support for parallel corpora in Serval, and backend support for selecting books from each source.

This is a key requirement for the mixed source logic, which is currently behind a feature flag.

The frontend does not yet allow selection of different books for each source.

Blockers for QA


This change is Reviewable

@pmachapman pmachapman marked this pull request as ready for review October 28, 2024 19:32
@pmachapman pmachapman marked this pull request as draft October 28, 2024 19:42
Copy link

codecov bot commented Oct 28, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 79.77%. Comparing base (e75b87f) to head (edb948a).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2825      +/-   ##
==========================================
+ Coverage   79.21%   79.77%   +0.55%     
==========================================
  Files         531      532       +1     
  Lines       30846    30816      -30     
  Branches     5032     5010      -22     
==========================================
+ Hits        24436    24583     +147     
+ Misses       5627     5482     -145     
+ Partials      783      751      -32     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@pmachapman pmachapman marked this pull request as ready for review October 28, 2024 23:14
@pmachapman pmachapman marked this pull request as draft October 28, 2024 23:34
@pmachapman pmachapman force-pushed the feature/SF-2900 branch 2 times, most recently from 48b32dd to 012546d Compare October 29, 2024 00:55
@pmachapman pmachapman added the will require testing PR should not be merged until testers confirm testing is complete label Oct 29, 2024
@pmachapman pmachapman marked this pull request as ready for review October 29, 2024 00:57
@pmachapman pmachapman changed the title WIP: SF-2900 Mixed Source: Selecting books from each source SF-2900 Mixed Source: Selecting books from each source Oct 29, 2024
@kylebuss kylebuss self-assigned this Oct 29, 2024
Copy link
Collaborator

@kylebuss kylebuss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! I can definitely see what you meant when you mentioned waiting on SF-2525 for these changes.

Reviewed 6 of 9 files at r1, 23 of 23 files at r2, all commit messages.
Reviewable status: all files reviewed, 9 unresolved discussions (waiting on @pmachapman)


src/SIL.XForge.Scripture/Models/ServalAdditionalTrainingData.cs line 7 at r2 (raw file):

/// <summary>
/// Serval Configuration for Additional Training Data.
/// </summary>

Is there a link or class in Serval for reference to this?

Code quote:

/// </summary>

src/SIL.XForge.Scripture/Services/MachineApiService.cs line 729 at r2 (raw file):

            );
        }

Can you add a comment here to specify Translation section and above (line 702) for Training Section? The code is so similar I thought it was duplicated.


src/SIL.XForge.Scripture/Services/MachineProjectService.cs line 394 at r2 (raw file):

        // If there is an alternate source, ensure that writing system and RTL is correct
        if (projectDoc.Data.TranslateConfig.DraftConfig.AlternateSource is not null)

We might consider updating the settings.FullName, as that can also be changed in the Paratext registry.

Code quote:

if (projectDoc.Data.TranslateConfig.DraftConfig.AlternateSource is not null)

src/SIL.XForge.Scripture/Services/MachineProjectService.cs line 408 at r2 (raw file):

                        alternateSourceSettings.IsRightToLeft
                    );
                    if (alternateSourceSettings.LanguageTag is not null)

While unlikely, should we verify the primary source and alternate/additional source language tags match and error if they do not?

Code quote:

if (alternateSourceSettings.LanguageTag is not null)

src/SIL.XForge.Scripture/Services/MachineProjectService.cs line 587 at r2 (raw file):

    /// <returns>The parallel corpus identifier.</returns>
    /// <remarks>This can be mocked in unit tests.</remarks>
    protected internal virtual async Task<string> CreateOrUpdateParallelCorpusAsync(

Curious to the reasoning behind making this method's virtual?

Code quote:

internal virtual async

src/SIL.XForge.Scripture/Services/MachineProjectService.cs line 860 at r2 (raw file):

                }

                // Update the source writing system tag

It might be an unlikely scenario, but I wonder if we may want to check if RTL has changed for the source?

Code quote:

// Update the source writing system tag

src/SIL.XForge.Scripture/Services/MachineProjectService.cs line 1162 at r2 (raw file):

                cancellationToken
            );
            bool recreateTranslationEngine = false;

If RTL has changed should we also recreate the translation engine?

Code quote:

recreateTranslationEngine

src/SIL.XForge.Scripture/Services/MachineProjectService.cs line 1346 at r2 (raw file):

            // Upload the source texts
            List<ServalCorpusFile> sourceCorpusFiles = [.. additionalTrainingData.CorpusFiles];

Seen the Range operator used a few times with this PR, what are some of the benefits to using this over simply having = additionalTrainingData.CorpusFiles;?

Code quote:

[.. additionalTrainingData.CorpusFiles]

src/SIL.XForge.Scripture/Services/MachineProjectService.cs line 1701 at r2 (raw file):

                translationParallelCorpusId
            );
        }

Do we need the if/else here?

I wonder if we can differentiate between NMT and SMT in the few places like this:

string sourceProjectId = preTranslate ? hasAlternateSource
        ? project.TranslateConfig.DraftConfig.AlternateSource.ProjectRef
        : project.TranslateConfig.Source.ProjectRef
        : project.TranslateConfig.Source.ProjectRef;

And then put the alternate/additional sources in a single if(preTranslate) { ...}?

Code quote:

        else
        {
            // Set up the parallel corpus for SMT translation
            string sourceProjectId = project.TranslateConfig.Source.ProjectRef;
            List<ServalCorpusFile> sourceCorpora = [servalCorpusFiles.Single(f => f.ProjectId == sourceProjectId)];
            List<ServalCorpusFile> targetCorpora = [servalCorpusFiles.Single(f => f.ProjectId == project.Id)];
            List<string> sourceCorpusIds = sourceCorpora.Select(f => f.CorpusId).ToList();
            List<string> targetCorpusIds = targetCorpora.Select(f => f.CorpusId).ToList();

            // Get the pre-translate parallel corpus id (might be null)
            translationParallelCorpusId = projectSecret.ServalData.ParallelCorpusIdForSmt;

            // Create or update the pre-translate parallel corpora
            translationParallelCorpusId = await CreateOrUpdateParallelCorpusAsync(
                translationEngineId,
                translationParallelCorpusId,
                name: "SmtTranslation",
                sourceCorpusIds,
                targetCorpusIds,
                cancellationToken
            );

            // Record the corpus sync info
            corporaSyncInfo = RecordServalCorpusSyncInfo(
                corporaSyncInfo,
                sourceCorpora,
                targetCorpora,
                translationParallelCorpusId
            );
        }

Copy link
Collaborator Author

@pmachapman pmachapman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 25 of 30 files reviewed, 8 unresolved discussions (waiting on @kylebuss)


src/SIL.XForge.Scripture/Models/ServalAdditionalTrainingData.cs line 7 at r2 (raw file):

Previously, kylebuss (Kyle Buss) wrote…

Is there a link or class in Serval for reference to this?

No, because this is the class where we keep our configuration data for specifying the additional training data on Serval. I have updated the comment to be clearer.


src/SIL.XForge.Scripture/Services/MachineApiService.cs line 729 at r2 (raw file):

Previously, kylebuss (Kyle Buss) wrote…

Can you add a comment here to specify Translation section and above (line 702) for Training Section? The code is so similar I thought it was duplicated.

Done.


src/SIL.XForge.Scripture/Services/MachineProjectService.cs line 394 at r2 (raw file):

Previously, kylebuss (Kyle Buss) wrote…

We might consider updating the settings.FullName, as that can also be changed in the Paratext registry.

Done. Good point.


src/SIL.XForge.Scripture/Services/MachineProjectService.cs line 408 at r2 (raw file):

Previously, kylebuss (Kyle Buss) wrote…

While unlikely, should we verify the primary source and alternate/additional source language tags match and error if they do not?

I think our checks before we start are sufficient, and Serval will return if they are not compatible.

There is also the complexity of en_GB vs en_NZ - both are English, and so we may incorrectly mark as not equal. I think it's best not to do the checking here, and let Serval notify us if there is something wrong.


src/SIL.XForge.Scripture/Services/MachineProjectService.cs line 587 at r2 (raw file):

Previously, kylebuss (Kyle Buss) wrote…

Curious to the reasoning behind making this method's virtual?

So it can be overridden in unit tests.


src/SIL.XForge.Scripture/Services/MachineProjectService.cs line 860 at r2 (raw file):

Previously, kylebuss (Kyle Buss) wrote…

It might be an unlikely scenario, but I wonder if we may want to check if RTL has changed for the source?

This code is for a specific bug with back translations not being in the registry (see the big comment above the block). IsRighttoLeft was not affected by it, as far as I am aware. I'll make a note to investigate further if it is, as it would affect other areas of Scripture Forge if it is applicable.


src/SIL.XForge.Scripture/Services/MachineProjectService.cs line 1162 at r2 (raw file):

Previously, kylebuss (Kyle Buss) wrote…

If RTL has changed should we also recreate the translation engine?

No. We don't send RTL to Serval.


src/SIL.XForge.Scripture/Services/MachineProjectService.cs line 1346 at r2 (raw file):

Previously, kylebuss (Kyle Buss) wrote…

Seen the Range operator used a few times with this PR, what are some of the benefits to using this over simply having = additionalTrainingData.CorpusFiles;?

This syntax passes the items in the list, not a reference to the list. The older (pre-.NET 8.0) syntax would be:

List<ServalCorpusFile> sourceCorpusFiles = new List<ServalCorpusFile>(additionalTrainingData.CorpusFiles);

Passing a reference to the list would cause the original list to get trashed and all sorts of other bugs.


src/SIL.XForge.Scripture/Services/MachineProjectService.cs line 1701 at r2 (raw file):

Previously, kylebuss (Kyle Buss) wrote…

Do we need the if/else here?

I wonder if we can differentiate between NMT and SMT in the few places like this:

string sourceProjectId = preTranslate ? hasAlternateSource
        ? project.TranslateConfig.DraftConfig.AlternateSource.ProjectRef
        : project.TranslateConfig.Source.ProjectRef
        : project.TranslateConfig.Source.ProjectRef;

And then put the alternate/additional sources in a single if(preTranslate) { ...}?

Done.

Copy link
Collaborator

@kylebuss kylebuss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Thanks for the explanations and clean up. I especially like the GetPreTranslationParametersAsync that was added!

Reviewed 5 of 5 files at r3, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @pmachapman)


src/SIL.XForge.Scripture/Services/MachineProjectService.cs line 587 at r2 (raw file):

Previously, pmachapman (Peter Chapman) wrote…

So it can be overridden in unit tests.

Thanks for clarifying. Looks like Raymond had it correct in our discussion!


src/SIL.XForge.Scripture/Services/MachineProjectService.cs line 1346 at r2 (raw file):

Previously, pmachapman (Peter Chapman) wrote…

This syntax passes the items in the list, not a reference to the list. The older (pre-.NET 8.0) syntax would be:

List<ServalCorpusFile> sourceCorpusFiles = new List<ServalCorpusFile>(additionalTrainingData.CorpusFiles);

Passing a reference to the list would cause the original list to get trashed and all sorts of other bugs.

Raymond was right again! And thanks for the explanation, I had never seen the [.. ] before this.


src/SIL.XForge.Scripture/Services/MachineProjectService.cs line 1701 at r2 (raw file):

Previously, pmachapman (Peter Chapman) wrote…

Done.

Thanks!

Copy link
Collaborator

@RaymondLuong3 RaymondLuong3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! I have a few minor suggestions and a couple blocking comments. Can you add acceptance tests to the Jira issue? It may be a big list of scenarios so I almost don't want to ask hah

Reviewed 5 of 9 files at r1, 20 of 23 files at r2, 5 of 5 files at r3, all commit messages.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @pmachapman)


src/SIL.XForge.Scripture/Services/MachineProjectService.cs line 639 at r3 (raw file):

        else
        {
            await translationEnginesClient.UpdateParallelCorpusAsync(

Calling UpdateParallelCorpusAsync will update the parallelCorpusId? It doesn't seem like a good idea to do that in an async method. Can you leave a comment stating this?

Code quote:

            await translationEnginesClient.UpdateParallelCorpusAsync(

src/SIL.XForge.Scripture/Services/MachineProjectService.cs line 838 at r3 (raw file):

        string translationEngineId = preTranslate
            ? projectSecret.ServalData?.PreTranslationEngineId
            : projectSecret.ServalData?.TranslationEngineId;

I have seen this a number of times. Perhaps we would be worth creating a helper method to get the translation engine Id and keep it more readable.

Code quote:

        string translationEngineId = preTranslate
            ? projectSecret.ServalData?.PreTranslationEngineId
            : projectSecret.ServalData?.TranslationEngineId;

src/SIL.XForge.Scripture/Services/MachineProjectService.cs line 932 at r3 (raw file):

        if (string.IsNullOrWhiteSpace(translationEngineId))
        {
            throw new DataNotFoundException("The translation engine is not specified.");

A better message would be "Failed to create a translation engine"

Code quote:

            throw new DataNotFoundException("The translation engine is not specified.");

src/SIL.XForge.Scripture/Services/MachineProjectService.cs line 1335 at r3 (raw file):

    /// <returns>The updated <paramref name="additionalTrainingData"/>.</returns>
    /// <remarks>This can be mocked in unit tests.</remarks>
    protected internal virtual async Task<ServalAdditionalTrainingData?> SyncAdditionalTrainingData(

It seems that this method is really doing two separate things. If the buildConfig.TrainingDataFiles exists, then we get it and update the parallel corpus. However, if it trainingDataFiles is empty, then we delete the parallel corpus and return null. Would it be more clear to split this into two methods have the caller can decide which one to call?

Code quote:

    protected internal virtual async Task<ServalAdditionalTrainingData?> SyncAdditionalTrainingData(

test/SIL.XForge.Scripture.Tests/Services/MachineProjectServiceTests.cs line 129 at r3 (raw file):

        Assert.IsNull(env.ProjectSecrets.Get(Project02).ServalData!.PreTranslationErrorMessage);
        Assert.IsNull(env.ProjectSecrets.Get(Project01).ServalData!.PreTranslationJobId);
        Assert.IsNull(env.ProjectSecrets.Get(Project02).ServalData!.PreTranslationQueuedAt);

Why are we getting the project secrets for both Project01 and Project02 here? I was expecting just Project01.

Code quote:

        Assert.IsNull(env.ProjectSecrets.Get(Project02).ServalData!.PreTranslationErrorMessage);
        Assert.IsNull(env.ProjectSecrets.Get(Project01).ServalData!.PreTranslationJobId);
        Assert.IsNull(env.ProjectSecrets.Get(Project02).ServalData!.PreTranslationQueuedAt);

@pmachapman pmachapman force-pushed the feature/SF-2900 branch 2 times, most recently from d6e0a16 to 3ebcd2b Compare November 3, 2024 22:28
Copy link
Collaborator Author

@pmachapman pmachapman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add acceptance tests to the Jira issue?

I will when the changes to Serval are finalized (and before I send to QA). I still have a few questions I am investigating on the full scope of these changes across all systems (i.e. Serval, SF Backend, previous versions of SF Frontend).

Dismissed @Github-advanced-security[bot] from 2 discussions.
Reviewable status: 23 of 30 files reviewed, 3 unresolved discussions (waiting on @kylebuss and @RaymondLuong3)


src/SIL.XForge.Scripture/Services/MachineProjectService.cs line 639 at r3 (raw file):

Previously, RaymondLuong3 (Raymond Luong) wrote…

Calling UpdateParallelCorpusAsync will update the parallelCorpusId? It doesn't seem like a good idea to do that in an async method. Can you leave a comment stating this?

I am unsure what you mean? I have updated the code documentation to be a bit more descriptive.


src/SIL.XForge.Scripture/Services/MachineProjectService.cs line 838 at r3 (raw file):

Previously, RaymondLuong3 (Raymond Luong) wrote…

I have seen this a number of times. Perhaps we would be worth creating a helper method to get the translation engine Id and keep it more readable.

Done.


src/SIL.XForge.Scripture/Services/MachineProjectService.cs line 932 at r3 (raw file):

Previously, RaymondLuong3 (Raymond Luong) wrote…

A better message would be "Failed to create a translation engine"

Done.


src/SIL.XForge.Scripture/Services/MachineProjectService.cs line 1335 at r3 (raw file):

Previously, RaymondLuong3 (Raymond Luong) wrote…

It seems that this method is really doing two separate things. If the buildConfig.TrainingDataFiles exists, then we get it and update the parallel corpus. However, if it trainingDataFiles is empty, then we delete the parallel corpus and return null. Would it be more clear to split this into two methods have the caller can decide which one to call?

My thinking is whether it was creating, updating, or deleting, it would be keeping the data in sync between Serval, and so named this function "SyncAdditionalTrainingData". I have updated the method code documentation to make this a bit clearer.


test/SIL.XForge.Scripture.Tests/Services/MachineProjectServiceTests.cs line 129 at r3 (raw file):

Previously, RaymondLuong3 (Raymond Luong) wrote…

Why are we getting the project secrets for both Project01 and Project02 here? I was expecting just Project01.

Fixed. Good spotting - thank you!

Copy link
Collaborator

@RaymondLuong3 RaymondLuong3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. I will wait for the acceptance tests before giving the full approval.

Reviewed 4 of 7 files at r4, 4 of 4 files at r5, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @pmachapman)


src/SIL.XForge.Scripture/Services/MachineProjectService.cs line 639 at r3 (raw file):

Previously, pmachapman (Peter Chapman) wrote…

I am unsure what you mean? I have updated the code documentation to be a bit more descriptive.

Right, what I mean is that we can run into problems when we use asynchronous code to update parameters provided to the method since other threads could depend on the value of that parameter if my understanding is correct. Since this is awaited, it should be fine. Thanks for the comments.


src/SIL.XForge.Scripture/Services/MachineProjectService.cs line 1335 at r3 (raw file):

Previously, pmachapman (Peter Chapman) wrote…

My thinking is whether it was creating, updating, or deleting, it would be keeping the data in sync between Serval, and so named this function "SyncAdditionalTrainingData". I have updated the method code documentation to make this a bit clearer.

Right, that does make sense when you think of Sync that way. Thanks for the additional comment.

Copy link
Collaborator Author

@pmachapman pmachapman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @pmachapman)


src/SIL.XForge.Scripture/Services/MachineProjectService.cs line 639 at r3 (raw file):

Right, what I mean is that we can run into problems when we use asynchronous code to update parameters provided to the method since other threads could depend on the value of that parameter if my understanding is correct.

Yes, your understanding is correct. This is why, for example, you cannot have a ref parameter in an async method. Good thinking!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
will require testing PR should not be merged until testers confirm testing is complete
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants