Percent complete #52

Enkidu93 · 2023-08-09T13:16:37Z

Note that this should be merged in parallel with the matching PR in Serval: sillsdev/serval#78

90% at 100% of training Working percent complete end-to-end --ECL

johnml1135 · 2023-08-10T01:21:07Z

johnml1135 · 2023-08-10T20:06:21Z

src/SIL.Machine.AspNetCore/Services/ClearMLService.cs line 236 at r1 (raw file):

        return status;
    }

The code looks fine - see comments in sillsdev/serval#44 primarily about supporting inferencing. We need a plan that can work with it very shortly - and LastIteration won't work well.

johnml1135

Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93)

ddaspit

Reviewed 3 of 4 files at r1, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @Enkidu93)

src/SIL.Machine.AspNetCore/Services/ClearMLNmtEngineBuildJob.cs line 116 at r1 (raw file):

                        if (status is not null)
                        {
                            await _platformService.UpdateBuildStatusAsync(buildId, (ProgressStatus)status!);

It shouldn't be necessary to cast status or use the null-forgiving operator (!) here.
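
One way to avoid both the cast and the null-forgiving operator is a type pattern, which checks for null and binds a non-nullable variable in a single step. The sketch below is only an illustration: the definition of ProgressStatus is not shown in this thread, so a nullable int and the hypothetical helper Describe stand in for it.

```csharp
using System;

// Hypothetical stand-in: int? plays the role of a nullable status value.
string Describe(int? step)
{
    // The type pattern tests for null and binds a non-nullable value,
    // so no cast and no '!' is needed inside the branch.
    if (step is int value)
        return $"step {value}";
    return "no status";
}

Console.WriteLine(Describe(3));
Console.WriteLine(Describe(null));
```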

src/SIL.Machine.AspNetCore/Services/ClearMLService.cs line 222 at r1 (raw file):

        JsonArray entries = (JsonArray)entriesNode;
        int numTasksAheadInQueue = 0;
        foreach (var entry in entries)

var should only be used when the type is explicit elsewhere in the line of code. This helps to make the code more readable.
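
A small sketch of that guideline, using made-up task names: var is kept where the type is spelled out on the right-hand side, and the element type is written out in the foreach because nothing else in that line names it.

```csharp
using System;
using System.Text.Json.Nodes;

// The type is named on the right-hand side, so var stays readable.
var entries = new JsonArray(JsonValue.Create("task-a"), JsonValue.Create("task-b"));

int count = 0;
// The element type is not visible elsewhere in this line, so it is written out.
foreach (JsonNode? entry in entries)
{
    if (entry is not null)
        count++;
}

Console.WriteLine(count);
```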

src/SIL.Machine.AspNetCore/Services/ClearMLService.cs line 228 at r1 (raw file):

            numTasksAheadInQueue++;
        }
        ProgressStatus status =

It is the responsibility of the ClearMLNmtEngineBuildJob class to construct the ProgressStatus object and status message. I think it would be better to move the construction of the ProgressStatus object to that class. This class is only responsible for abstracting calls to the ClearML REST API.

ddaspit

I realized that getting the current position in the queue is not as straightforward as querying the position from ClearML. There are actually two queues: the Hangfire queue and the ClearML queue. A job is first placed in the Hangfire queue. Once it is actually running on the job server, the job is placed in the ClearML queue. The main purpose of the Hangfire queue is to limit the number of jobs that Machine needs to monitor on ClearML at the same time.

Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @Enkidu93)

johnml1135 · 2023-08-11T12:59:39Z

@ddaspit - Should we put them all on the ClearML queue? How many can it handle? Should they be in both queues? What is the best mechanism for this? If we have 50 items in the ClearML queue, can we make one call to see everything in the queue instead of 50 calls? What about multiple instances of Serval using the same ClearML instance (QA, prod)? Should they have different project names, such as Serval-<random 6 digit Hex ID>, or should they all be in the same project but have the experiment name be -? Note that Machine does not know which instance it is actually in...

One potential solution could be to query ClearML once every 5 seconds for all queued items that match a specific pattern (corresponding to the specific instance of Serval) and to query MongoDB once for all current builds. If there is an update to make in MongoDB, make it. If any jobs have been cancelled, aborted, or failed in one (ClearML or MongoDB), update the other.

johnml1135 · 2023-08-11T17:26:15Z

Update after talking with @Enkidu93 - Let's make the following updates:

Number to dequeue = number of ClearML agents on the queue * 2
Number in the queue is a combination of the Hangfire queue and the ClearML queue
Percent complete can be a 90/10 split between training and inference; inference progress can be posted by logging scalar values in ClearML and read back through the API.
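
The first two points can be sketched as follows. This is one reading of the proposal, not the implemented logic: the helper names JobsToDequeue and JobsAhead are hypothetical, and it assumes that a job still waiting in Hangfire sits behind everything already queued in ClearML.

```csharp
using System;

// Two jobs are handed to ClearML per available agent.
int JobsToDequeue(int clearMLAgentCount) => clearMLAgentCount * 2;

// Jobs ahead of a given job, counting both queues.
// Assumption: a job still in the Hangfire queue waits behind the whole ClearML queue.
int JobsAhead(bool inHangfireQueue, int aheadInHangfire, int clearMLQueueLength, int aheadInClearML) =>
    inHangfireQueue ? clearMLQueueLength + aheadInHangfire : aheadInClearML;

Console.WriteLine(JobsToDequeue(3));
Console.WriteLine(JobsAhead(true, 2, 6, 0));
Console.WriteLine(JobsAhead(false, 0, 6, 4));
```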

ddaspit

This sounds good to me.

Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @Enkidu93)

johnml1135 · 2023-08-14T20:04:55Z

@ddaspit - we made some more updates to the proposed algorithm here (see the last comment): sillsdev/serval#44. What do you think?

johnml1135 · 2023-08-22T18:20:43Z

src/SIL.Machine.AspNetCore/Services/ClearMLNmtEngineBuildJob.cs line 138 at r2 (raw file):

                        }
                        previousClearMLTasks = currentClearMLTasks;
                        goto case ClearMLTaskStatus.InProgress;

Let's remove the queue logic for right now and commit it to a branch to be dealt with later.

johnml1135 · 2023-08-22T18:21:54Z

src/SIL.Machine.AspNetCore/Services/ClearMLNmtEngineBuildJob.cs line 152 at r2 (raw file):

                                clearMLTask.LastIteration,
                                (clearMLTask.LastIteration / (float)_options.CurrentValue.MaxSteps) * 90.0
                                    + (inferencePercentComplete / 100.0f) * 10.0,

To account for potential early stopping, if inferencePercentComplete > 0, assume training is done and just put in the 90%.
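
A minimal sketch of that adjustment, assuming maxSteps is greater than zero and that inferencePercentComplete is reported on a 0 to 100 scale. The function name PercentComplete and the clamp on the training fraction are additions for illustration.

```csharp
using System;

// Overall percent complete with a 90/10 training/inference split.
double PercentComplete(int lastIteration, int maxSteps, double inferencePercentComplete)
{
    // Once inference reports any progress, training must have finished
    // (possibly through early stopping), so it counts as fully done.
    double trainingFraction = inferencePercentComplete > 0
        ? 1.0
        : Math.Clamp(lastIteration / (double)maxSteps, 0.0, 1.0);

    return trainingFraction * 90.0 + inferencePercentComplete / 100.0 * 10.0;
}

// Halfway through training, no inference yet.
Console.WriteLine(PercentComplete(2500, 5000, 0));
// Training stopped early at step 3000, inference half done.
Console.WriteLine(PercentComplete(3000, 5000, 50));
```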

johnml1135 · 2023-08-22T18:25:23Z

src/SIL.Machine.AspNetCore/Services/ClearMLService.cs line 205 at r2 (raw file):

    )
    {
        var body = new JsonObject { ["id"] = taskId };

Can you add some comments as to why hashes are used and how the data is extracted?

johnml1135 · 2023-08-22T18:26:40Z

src/SIL.Machine.AspNetCore/Services/ClearMLService.cs line 198 at r2 (raw file):

    }

    private async Task<string?> GetMetricAsync(

Rename to GetTaskMetricAsync.

johnml1135 · 2023-08-22T18:28:18Z

src/SIL.Machine.AspNetCore/Services/ClearMLService.cs line 267 at r2 (raw file):

                tasksAheadInQueue.Add(name);
        }
        return tasksAheadInQueue;

Remove this queue logic as specified above.

johnml1135 · 2023-08-22T18:30:14Z

src/SIL.Machine.AspNetCore/Services/ClearMLService.cs line 236 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

The code looks fine - see comments in sillsdev/serval#44 primarily about supporting inferencing. We need a plan that can work with it very shortly - and LastIteration won't work well.

It should work for now - when inferencing is added, we can just have inferencing take all 100% of the progress.

johnml1135 · 2023-08-22T18:30:35Z

src/SIL.Machine.AspNetCore/Services/IClearMLService.cs line 35 at r2 (raw file):

        CancellationToken cancellationToken = default
    );

Remove.

johnml1135

Reviewed 3 of 3 files at r2, all commit messages.
Dismissed @ddaspit from 2 discussions.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @Enkidu93)

Enkidu93 · 2023-08-29T15:26:12Z

@johnml1135 I've removed the queue logic and, I believe, responded to the other reviews. Let me know if anything needs to be changed as you make alterations in machine.py.

ddaspit

Reviewed 3 of 3 files at r3, all commit messages.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @Enkidu93)

src/SIL.Machine.AspNetCore/Services/ClearMLNmtEngineBuildJob.cs line 112 at r3 (raw file):

                {
                    case ClearMLTaskStatus.InProgress:
                        float inferencePercentComplete = await _clearMLService.GetInferencePercentCompleteAsync(

Do we need a separate request to get this information? Can we get it in the GetTaskAsync call?

Enkidu93 · 2023-09-01T13:23:54Z

src/SIL.Machine.AspNetCore/Services/ClearMLNmtEngineBuildJob.cs line 112 at r3 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Do we need a separate request to get this information? Can we get it in the GetTaskAsync call?

(Sorry for the delay). We could - if you're OK with me changing the ClearMLTask data class. I'm not sure if it would deserialize properly without some help since the metric is stored in a serialized hashmap of hashmaps. (Thus, MD5 popping up in the ClearML service). I can pursue that if you like; just give me a thumbs up.
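
As a rough illustration of the lookup described above, the sketch below hashes a metric name and a variant name with MD5 and uses the lowercase hex digests as keys into a nested JSON object. The JSON shape (a "value" field at the innermost level), the helper names Md5Hex and ReadMetric, and the metric name are assumptions for illustration, not the code in this PR.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;
using System.Text.Json.Nodes;

// Lowercase hex MD5 digest of a string, used as a dictionary key.
string Md5Hex(string text) =>
    Convert.ToHexString(MD5.HashData(Encoding.UTF8.GetBytes(text))).ToLowerInvariant();

// Reads metrics[md5(metric)][md5(variant)]["value"]; returns null if any level is missing.
double? ReadMetric(JsonNode? metrics, string metric, string variant)
{
    JsonNode? valueNode = metrics?[Md5Hex(metric)]?[Md5Hex(variant)]?["value"];
    return valueNode?.GetValue<double>();
}

// Assumed shape: a hashmap of hashmaps keyed by the hashed names.
var metrics = new JsonObject
{
    [Md5Hex("inference_percent_complete")] = new JsonObject
    {
        [Md5Hex("inference_percent_complete")] = new JsonObject { ["value"] = 42.5 }
    }
};

Console.WriteLine(ReadMetric(metrics, "inference_percent_complete", "inference_percent_complete"));
```

Returning null when a key is absent lets the caller treat a task that has not started inference the same way as one that has logged no scalar yet.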

johnml1135

Reviewed all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @ddaspit and @Enkidu93)

ddaspit

Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @Enkidu93)

src/SIL.Machine.AspNetCore/Services/ClearMLNmtEngineBuildJob.cs line 112 at r3 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

(Sorry for the delay). We could - if you're OK with me changing the ClearMLTask data class. I'm not sure if it would deserialize properly without some help since the metric is stored in a serialized hashmap of hashmaps. (Thus, MD5 popping up in the ClearML service). I can pursue that if you like; just give me a thumbs up.

I think it is worth checking out. We want to minimize the number of requests we make to ClearML as much as possible.

Enkidu93 · 2023-09-05T13:43:19Z

Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @Enkidu93)

src/SIL.Machine.AspNetCore/Services/ClearMLNmtEngineBuildJob.cs line 112 at r3 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I think it is worth checking out. We want to minimize the number of requests we make to ClearML as much as possible.

I'll check it out.

Enkidu93 · 2023-09-05T13:43:40Z

@johnml1135 Are you still working on the machine.py end of this? Any updates?

johnml1135 · 2023-09-05T14:09:56Z

@Enkidu93 - yes I am. I have been pulled onto a few other things, but am still working on it.

Enkidu93 · 2023-10-13T20:37:44Z

Finished here.

Enkidu93 added 3 commits August 3, 2023 11:50

Init commit % complete --ECL

734c126

Percent complete implementation for NMT #44 --ECL

30ef860

Fixes #44

5ef722d

90% at 100% of training Working percent complete end-to-end --ECL

Enkidu93 mentioned this pull request Aug 9, 2023

Percent complete sillsdev/serval#78

Merged

johnml1135 reviewed Aug 10, 2023

ddaspit requested changes Aug 10, 2023

ddaspit reviewed Aug 10, 2023

ddaspit reviewed Aug 14, 2023

Added inference & temporary queue depth in message --ECL

cd720b8

johnml1135 reviewed Aug 22, 2023

Removed queue logic --ECL

3a0ab7b

ddaspit requested changes Aug 29, 2023

johnml1135 approved these changes Sep 1, 2023

ddaspit reviewed Sep 1, 2023

Enkidu93 closed this Oct 13, 2023

Enkidu93 deleted the percent_complete branch October 27, 2023 14:33
