-
-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Percent complete #52
Percent complete #52
Conversation
This implements sillsdev/serval#44. |
The code looks fine - see comments in sillsdev/serval#44 primarily about supporting inferencing. We need a plan that can work with it very shortly - and LastIteration won't work well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 4 files at r1, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @Enkidu93)
src/SIL.Machine.AspNetCore/Services/ClearMLNmtEngineBuildJob.cs
line 116 at r1 (raw file):
if (status is not null) { await _platformService.UpdateBuildStatusAsync(buildId, (ProgressStatus)status!);
It shouldn't be necessary to cast status
or use the null-forgiving operator (!) here.
src/SIL.Machine.AspNetCore/Services/ClearMLService.cs
line 222 at r1 (raw file):
JsonArray entries = (JsonArray)entriesNode; int numTasksAheadInQueue = 0; foreach (var entry in entries)
var
should only be used when the type is explicit elsewhere in the line of code. This helps to make the code more readable.
src/SIL.Machine.AspNetCore/Services/ClearMLService.cs
line 228 at r1 (raw file):
numTasksAheadInQueue++; } ProgressStatus status =
It is the responsibility of the ClearMLNmtEngineBuildJob
class to construct the ProgressStatus
object and status message. I think it would be better to move the construction of the ProgressStatus
object to that class. This class is only responsible for abstracting calls to the ClearML REST API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realized that getting the current position in the queue is actually not as straightforward as querying the position from ClearML. There are actually two queues: the Hangfire queue and the ClearML queue. A job is first placed in the Hangfire queue. Once, it is actually running on the job server, then the job will be placed in the ClearML queue. The main purpose of the Hangfire queue is to put a limit on the number of jobs that Machine needs to monitor on ClearML at the same time.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @Enkidu93)
@ddaspit - Should we put them all on the ClearML queue? How many can it handle? Should the be in both queues? What is the best mechanism for this? If we have 50 items in the ClearML queue, can we make one call to see everything in the queue instead of 50 calls? What about multiple instances of Serval using the same ClearML instance (QA, prod)? Should they have different project names such as One potential solution could be to query ClearML every 5 seconds once for all queued items that match a specific pattern (that corresponds with the specific instance of Serval) and query MongoDB once for all current builds - and if there is an update to make in MongoDB, make it. If any jobs have been cancelled, aborted or failed in one (ClearML or MongoDB), update the other. |
Update after talking with @Enkidu93 - Let's make the following updates:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds good to me.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @Enkidu93)
@ddaspit - we made some more updates to the proposed algorithm here (see the last comment): sillsdev/serval#44. What do you think? |
Let's remove the queue logic for right now and commit it to a branch to be dealt with later. |
To account for potential early stopping, if inferencePercentComplete > 0, assume training is done and just put in the 90%. |
Can you add some comments as to why hashes are used and how the data is extracted? |
rename to GetTaskMetricAsyc |
Remove this queue logic as specified above. |
Previously, johnml1135 (John Lambert) wrote…
It should work for now - when inferencing is added, we can just have inferencing take all 100% of the progress. |
Remove. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 3 files at r2, all commit messages.
Dismissed @ddaspit from 2 discussions.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @Enkidu93)
@johnml1135 I've removed the queue logic and, I believe, responded to the other reviews. Let me know if anything needs to be changed as you make alterations in machine.py. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 3 files at r3, all commit messages.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @Enkidu93)
src/SIL.Machine.AspNetCore/Services/ClearMLNmtEngineBuildJob.cs
line 112 at r3 (raw file):
{ case ClearMLTaskStatus.InProgress: float inferencePercentComplete = await _clearMLService.GetInferencePercentCompleteAsync(
Do we need a separate request to get this information? Can we get it in the GetTaskAsync
call?
Previously, ddaspit (Damien Daspit) wrote…
(Sorry for the delay). We could - if you're OK with me changing the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @ddaspit and @Enkidu93)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @Enkidu93)
src/SIL.Machine.AspNetCore/Services/ClearMLNmtEngineBuildJob.cs
line 112 at r3 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
(Sorry for the delay). We could - if you're OK with me changing the
ClearMLTask
data class. I'm not sure if it would deserialize properly without some help since the metric is stored in a serialized hashmap of hashmaps. (Thus, MD5 popping up in the ClearML service). I can pursue that if you like; just give me a thumbs up.
I think it is worth checking out. We want to minimize the number of requests we make to ClearML as much as possible.
I'll check it out. |
@johnml1135 Are you still working on the machine.py end of this? Any updates? |
@Enkidu93 - yes I am. I have been pulled onto a few other things, but am still working on it. |
Finished here. |
Note that this should be merged in parallel with matching PR in serval: sillsdev/serval#78
This change is