Today, run_training prints messages to stdout and has a return type of None. This has worked OK when called by the CLI, but it isn't ideal. Generally, libraries should leave the display of output to the client/caller. Even with the CLI, if the CLI wanted to make format changes, translations, or any other change to what the user sees, it couldn't do so today for training. If the caller were a REST API, this would be a bigger issue, since the API would need to return the result/state to its own caller. Suggested general rules to follow:
The training library shouldn't use print. In some cases, logging at the appropriate level should be used for helpful info/debug output; in other cases, the results should be returned to the caller, who decides what to do with them
Note: INFO logging should be user-friendly and not so verbose that a user wouldn't want to leave it on all the time
run_training should return the result rather than print it
Longer term, a callback could be useful for reporting the interim status of training
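The suggestions above could be sketched roughly as follows. This is only an illustration, not the library's actual API: `TrainingResult`, the `on_progress` callback, and the field names are all hypothetical.

```python
import logging
from dataclasses import dataclass, field
from typing import Callable, Optional

# Library-style module logger: the library logs, the caller decides
# whether/how to configure handlers and display the output.
logger = logging.getLogger(__name__)


@dataclass
class TrainingResult:
    """Hypothetical result object returned to the caller instead of printed."""
    final_loss: float
    epochs_completed: int
    metrics: dict = field(default_factory=dict)


def run_training(
    epochs: int,
    on_progress: Optional[Callable[[int, float], None]] = None,
) -> TrainingResult:
    """Sketch: log at INFO, report interim status via callback, return the result."""
    losses = []
    for epoch in range(epochs):
        loss = 1.0 / (epoch + 1)  # placeholder for a real training step
        losses.append(loss)
        logger.info("epoch %d/%d complete, loss=%.4f", epoch + 1, epochs, loss)
        if on_progress is not None:
            # Caller decides how (or whether) to display interim status.
            on_progress(epoch + 1, loss)
    return TrainingResult(
        final_loss=losses[-1],
        epochs_completed=epochs,
        metrics={"loss_history": losses},
    )


# The CLI (or a REST API) decides what to show and in what format:
result = run_training(3, on_progress=lambda epoch, loss: None)
print(f"final loss: {result.final_loss:.4f}")
```

With this shape, the CLI can format, translate, or suppress output freely, and a REST API can serialize the returned result for its own callers.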
This was discussed at a refinement meeting today. Question from Oleg: what kind of info does the training library need to pass to which consumers?
Mustafa:
State of the job. Is it broken or still running?
Want access to the logs of the job.
Want some form of consumable metrics, similar to what we do in JSON output.
These are the things you want, as a consumer of the library, to be able to directly access and display to the user so you can make informed decisions. This is especially nice with a distributed training job: the user or library has a central point to call for the status/logs/metrics of a running job, which makes things easier down the line. Earlier on, in the OpenShift AI work, when we tried to set up distributed training we relied on Ray pretty heavily because of the state management it provided: a single centralized point while the job was running, exposed to users via a Python library on their side.
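The three consumer needs above (job state, logs, consumable metrics) could hang off one central object, something like the sketch below. `TrainingJob`, `JobState`, and the method names are illustrative assumptions, not the library's actual interface.

```python
import enum
from dataclasses import dataclass, field


class JobState(enum.Enum):
    """Hypothetical job states a consumer could poll; names are illustrative."""
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass
class TrainingJob:
    """Sketch of a single central point for status/logs/metrics of a job."""
    state: JobState = JobState.RUNNING
    metrics: dict = field(default_factory=dict)  # consumable, JSON-friendly metrics
    _log_lines: list = field(default_factory=list)

    def record(self, line: str) -> None:
        """Capture a log line so consumers can fetch it later."""
        self._log_lines.append(line)

    def logs(self, tail: int = 10) -> list:
        """Return the last `tail` log lines for the consumer to display."""
        return self._log_lines[-tail:]


# Any consumer (CLI, REST API, notebook) polls the same object:
job = TrainingJob()
job.record("epoch 1 started")
job.metrics["loss"] = 0.42
job.state = JobState.SUCCEEDED
```

In a distributed setting, this object would be the one place workers report into and consumers read from, which is the role Ray's state management played in the earlier OpenShift AI work.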
Additional design discussion is needed with the eng/runtime team and the model/eval (training) team. To be discussed during the eng/runtime design meeting, or in a one-off meeting for this.