Today, run_training prints messages to stdout and has a return type of None. This has worked OK when called by the CLI, but it isn't ideal. Generally, libraries should leave the display of output to the client/caller. Even with the CLI, if the CLI wanted to make format changes, translations, or any other change to what the user sees, it couldn't do so today for training. If the caller were a REST API, this would be a bigger issue, since the API would need to return the result/state to its own caller. Suggested general rules to follow:
The training library shouldn't use print. In some cases, logging at the appropriate level should be used for helpful info/debug output; in other cases, the results should be returned to the caller, who decides what to do with them
Note: INFO logging should be user-friendly and not so verbose that a user wouldn't want to leave it on all the time
run_training should return the result rather than print it
Longer term, a callback could be useful for reporting the interim status of training
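The suggestions above could be sketched roughly as follows. This is only an illustration, not the library's actual API: `TrainingResult`, the `on_progress` callback, and the field names are all hypothetical.

```python
import logging
from dataclasses import dataclass, field
from typing import Callable, Optional

# Library-style module logger: the library logs, the caller decides
# whether/how to configure handlers and display the output.
logger = logging.getLogger(__name__)


@dataclass
class TrainingResult:
    """Hypothetical result object returned to the caller instead of printed."""
    final_loss: float
    epochs_completed: int
    metrics: dict = field(default_factory=dict)


def run_training(
    epochs: int,
    on_progress: Optional[Callable[[int, float], None]] = None,
) -> TrainingResult:
    """Sketch: log at INFO, report interim status via callback, return the result."""
    losses = []
    for epoch in range(epochs):
        loss = 1.0 / (epoch + 1)  # placeholder for a real training step
        losses.append(loss)
        logger.info("epoch %d/%d complete, loss=%.4f", epoch + 1, epochs, loss)
        if on_progress is not None:
            # Caller decides how (or whether) to display interim status.
            on_progress(epoch + 1, loss)
    return TrainingResult(
        final_loss=losses[-1],
        epochs_completed=epochs,
        metrics={"loss_history": losses},
    )


# The CLI (or a REST API) decides what to show and in what format:
result = run_training(3, on_progress=lambda epoch, loss: None)
print(f"final loss: {result.final_loss:.4f}")
```

With this shape, the CLI can format, translate, or suppress output freely, and a REST API can serialize the returned result for its own callers.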
This was discussed at a refinement meeting today. Question from Oleg: what kind of info does the training library need to pass to which consumers?
Mustafa:
State of the job. Is it broken or still running?
Want access to the logs of the job.
Want some form of consumable metrics, similar to what we do in JSON output.
These are the things you want, as a consumer of the library, to be able to directly access and display to the user so you can make informed decisions. This is especially nice with a distributed training job: the user or library has a central point to call for the status/logs/metrics of a running job, which makes things easier down the line. Earlier on, in the OpenShift AI work, when we tried to set up distributed training we relied on Ray pretty heavily because of the state management it provided: a single centralized point while the job was running, exposed to users via a Python library on their side.
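The three consumer needs above (job state, logs, consumable metrics) could hang off one central object, something like the sketch below. `TrainingJob`, `JobState`, and the method names are illustrative assumptions, not the library's actual interface.

```python
import enum
from dataclasses import dataclass, field


class JobState(enum.Enum):
    """Hypothetical job states a consumer could poll; names are illustrative."""
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass
class TrainingJob:
    """Sketch of a single central point for status/logs/metrics of a job."""
    state: JobState = JobState.RUNNING
    metrics: dict = field(default_factory=dict)  # consumable, JSON-friendly metrics
    _log_lines: list = field(default_factory=list)

    def record(self, line: str) -> None:
        """Capture a log line so consumers can fetch it later."""
        self._log_lines.append(line)

    def logs(self, tail: int = 10) -> list:
        """Return the last `tail` log lines for the consumer to display."""
        return self._log_lines[-tail:]


# Any consumer (CLI, REST API, notebook) polls the same object:
job = TrainingJob()
job.record("epoch 1 started")
job.metrics["loss"] = 0.42
job.state = JobState.SUCCEEDED
```

In a distributed setting, this object would be the one place workers report into and consumers read from, which is the role Ray's state management played in the earlier OpenShift AI work.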
Additional design discussion is needed with the eng/runtime team and the model/eval (training) team. To be discussed during the eng/runtime design meeting, or in a one-off meeting for this.