Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Percent complete #52

Closed
wants to merge 5 commits into from
Closed

Percent complete #52

wants to merge 5 commits into from

Conversation

Enkidu93
Copy link
Collaborator

@Enkidu93 Enkidu93 commented Aug 9, 2023

Note that this should be merged in parallel with matching PR in serval: sillsdev/serval#78


This change is Reviewable

90% at 100% of training
Working percent complete end-to-end

--ECL
@johnml1135
Copy link
Collaborator

This implements sillsdev/serval#44.

@johnml1135
Copy link
Collaborator

src/SIL.Machine.AspNetCore/Services/ClearMLService.cs line 236 at r1 (raw file):

        return status;
    }

The code looks fine - see comments in sillsdev/serval#44 primarily about supporting inferencing. We need a plan that can work with it very shortly - and LastIteration won't work well.

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93)

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 4 files at r1, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @Enkidu93)


src/SIL.Machine.AspNetCore/Services/ClearMLNmtEngineBuildJob.cs line 116 at r1 (raw file):

                        if (status is not null)
                        {
                            await _platformService.UpdateBuildStatusAsync(buildId, (ProgressStatus)status!);

It shouldn't be necessary to cast status or use the null-forgiving operator (!) here.


src/SIL.Machine.AspNetCore/Services/ClearMLService.cs line 222 at r1 (raw file):

        JsonArray entries = (JsonArray)entriesNode;
        int numTasksAheadInQueue = 0;
        foreach (var entry in entries)

var should only be used when the type is explicit elsewhere in the line of code. This helps to make the code more readable.


src/SIL.Machine.AspNetCore/Services/ClearMLService.cs line 228 at r1 (raw file):

            numTasksAheadInQueue++;
        }
        ProgressStatus status =

It is the responsibility of the ClearMLNmtEngineBuildJob class to construct the ProgressStatus object and status message. I think it would be better to move the construction of the ProgressStatus object to that class. This class is only responsible for abstracting calls to the ClearML REST API.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realized that getting the current position in the queue is actually not as straightforward as querying the position from ClearML. There are actually two queues: the Hangfire queue and the ClearML queue. A job is first placed in the Hangfire queue. Once, it is actually running on the job server, then the job will be placed in the ClearML queue. The main purpose of the Hangfire queue is to put a limit on the number of jobs that Machine needs to monitor on ClearML at the same time.

Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @Enkidu93)

@johnml1135
Copy link
Collaborator

@ddaspit - Should we put them all on the ClearML queue? How many can it handle? Should the be in both queues? What is the best mechanism for this? If we have 50 items in the ClearML queue, can we make one call to see everything in the queue instead of 50 calls? What about multiple instances of Serval using the same ClearML instance (QA, prod)? Should they have different project names such as Serval-<random 6 digit Hex ID> or, have them all in the same project but have the experiment name be -? Note that machine does not know what instance it is actually in...

One potential solution could be to query ClearML every 5 seconds once for all queued items that match a specific pattern (that corresponds with the specific instance of Serval) and query MongoDB once for all current builds - and if there is an update to make in MongoDB, make it. If any jobs have been cancelled, aborted or failed in one (ClearML or MongoDB), update the other.

@johnml1135
Copy link
Collaborator

Update after talking with @Enkidu93 - Let's make the following updates:

  • number to dequeue = number of clearML agents on the queue * 2
  • Number in the queue is a combination of Hangfire queue and ClearML queue
  • % complete can be 90/10 split training/inference - and inference progress can be posted through logging scalar values in ClearML and called from the API.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds good to me.

Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @Enkidu93)

@johnml1135
Copy link
Collaborator

@ddaspit - we made some more updates to the proposed algorithm here (see the last comment): sillsdev/serval#44. What do you think?

@johnml1135
Copy link
Collaborator

src/SIL.Machine.AspNetCore/Services/ClearMLNmtEngineBuildJob.cs line 138 at r2 (raw file):

                        }
                        previousClearMLTasks = currentClearMLTasks;
                        goto case ClearMLTaskStatus.InProgress;

Let's remove the queue logic for right now and commit it to a branch to be dealt with later.

@johnml1135
Copy link
Collaborator

src/SIL.Machine.AspNetCore/Services/ClearMLNmtEngineBuildJob.cs line 152 at r2 (raw file):

                                clearMLTask.LastIteration,
                                (clearMLTask.LastIteration / (float)_options.CurrentValue.MaxSteps) * 90.0
                                    + (inferencePercentComplete / 100.0f) * 10.0,

To account for potential early stopping, if inferencePercentComplete > 0, assume training is done and just put in the 90%.

@johnml1135
Copy link
Collaborator

src/SIL.Machine.AspNetCore/Services/ClearMLService.cs line 205 at r2 (raw file):

    )
    {
        var body = new JsonObject { ["id"] = taskId };

Can you add some comments as to why hashes are used and how the data is extracted?

@johnml1135
Copy link
Collaborator

src/SIL.Machine.AspNetCore/Services/ClearMLService.cs line 198 at r2 (raw file):

    }

    private async Task<string?> GetMetricAsync(

rename to GetTaskMetricAsyc

@johnml1135
Copy link
Collaborator

src/SIL.Machine.AspNetCore/Services/ClearMLService.cs line 267 at r2 (raw file):

                tasksAheadInQueue.Add(name);
        }
        return tasksAheadInQueue;

Remove this queue logic as specified above.

@johnml1135
Copy link
Collaborator

src/SIL.Machine.AspNetCore/Services/ClearMLService.cs line 236 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

The code looks fine - see comments in sillsdev/serval#44 primarily about supporting inferencing. We need a plan that can work with it very shortly - and LastIteration won't work well.

It should work for now - when inferencing is added, we can just have inferencing take all 100% of the progress.

@johnml1135
Copy link
Collaborator

src/SIL.Machine.AspNetCore/Services/IClearMLService.cs line 35 at r2 (raw file):

        CancellationToken cancellationToken = default
    );

Remove.

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r2, all commit messages.
Dismissed @ddaspit from 2 discussions.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @Enkidu93)

@Enkidu93
Copy link
Collaborator Author

@johnml1135 I've removed the queue logic and, I believe, responded to the other reviews. Let me know if anything needs to be changed as you make alterations in machine.py.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r3, all commit messages.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @Enkidu93)


src/SIL.Machine.AspNetCore/Services/ClearMLNmtEngineBuildJob.cs line 112 at r3 (raw file):

                {
                    case ClearMLTaskStatus.InProgress:
                        float inferencePercentComplete = await _clearMLService.GetInferencePercentCompleteAsync(

Do we need a separate request to get this information? Can we get it in the GetTaskAsync call?

@Enkidu93
Copy link
Collaborator Author

Enkidu93 commented Sep 1, 2023

src/SIL.Machine.AspNetCore/Services/ClearMLNmtEngineBuildJob.cs line 112 at r3 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Do we need a separate request to get this information? Can we get it in the GetTaskAsync call?

(Sorry for the delay). We could - if you're OK with me changing the ClearMLTask data class. I'm not sure if it would deserialize properly without some help since the metric is stored in a serialized hashmap of hashmaps. (Thus, MD5 popping up in the ClearML service). I can pursue that if you like; just give me a thumbs up.

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @ddaspit and @Enkidu93)

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @Enkidu93)


src/SIL.Machine.AspNetCore/Services/ClearMLNmtEngineBuildJob.cs line 112 at r3 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

(Sorry for the delay). We could - if you're OK with me changing the ClearMLTask data class. I'm not sure if it would deserialize properly without some help since the metric is stored in a serialized hashmap of hashmaps. (Thus, MD5 popping up in the ClearML service). I can pursue that if you like; just give me a thumbs up.

I think it is worth checking out. We want to minimize the number of requests we make to ClearML as much as possible.

@Enkidu93
Copy link
Collaborator Author

Enkidu93 commented Sep 5, 2023

Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @Enkidu93)

src/SIL.Machine.AspNetCore/Services/ClearMLNmtEngineBuildJob.cs line 112 at r3 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…

I think it is worth checking out. We want to minimize the number of requests we make to ClearML as much as possible.

I'll check it out.

@Enkidu93
Copy link
Collaborator Author

Enkidu93 commented Sep 5, 2023

@johnml1135 Are you still working on the machine.py end of this? Any updates?

@johnml1135
Copy link
Collaborator

@Enkidu93 - yes I am. I have been pulled onto a few other things, but am still working on it.

@Enkidu93
Copy link
Collaborator Author

Finished here.

@Enkidu93 Enkidu93 closed this Oct 13, 2023
@Enkidu93 Enkidu93 deleted the percent_complete branch October 27, 2023 14:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants