OEP: Enhance the initial estimate of model training time. #1033
base: main
Conversation
Need feedback from @codyrancher before approval
Within the /model/train endpoint of the AIOps plugin, when it is time to train a new Deep Learning model, set the remaining training time to -1 instead of 3600, the value previously passed in to represent 1 hour. The UI service will also need to be updated to interpret the -1 value as an indication that the estimate is currently being calculated.
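A minimal sketch of the sentinel handling described above, assuming a simple JSON response shape; the function names and the `remainingTime` field are illustrative, not the plugin's actual API:

```python
# Hypothetical sketch of the -1 sentinel; names and fields are illustrative.
TIME_UNKNOWN = -1  # replaces the old hard-coded 3600 (1 hour) placeholder


def train_response(training_requested: bool) -> dict:
    """Build the /model/train response body when a new model must be trained."""
    if training_requested:
        # No real estimate yet: signal "still calculating" instead of guessing 1 hour.
        return {"remainingTime": TIME_UNKNOWN}
    return {"remainingTime": 0}


def format_remaining_time(remaining: int) -> str:
    """Mirror of the parsing the UI service would need (shown in Python for brevity)."""
    if remaining == TIME_UNKNOWN:
        return "Calculating time estimate..."
    minutes = max(1, remaining // 60)
    return f"Approximately {minutes} minute(s) remaining"
```

The key point is that -1 never reaches the user as a literal duration; the UI maps it to a "calculating" message.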
Can we have a different approach here? I think it's better to have two separate statuses for the two phases of a watchlist update (fetch data, train model):
- Preparing data for watchlist update (data is being dumped): no time estimate given
- Updating AI models for watchlist (model is training): provide a time estimate
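One way the two statuses could be modeled; the enum and payload names below are hypothetical, not existing Opni code:

```python
from enum import Enum
from typing import Optional


class WatchlistUpdatePhase(Enum):
    # Phase 1: log data is being dumped from Opensearch; no estimate yet.
    PREPARING_DATA = "Preparing data for watchlist update"
    # Phase 2: the Deep Learning model is training; an estimate is available.
    TRAINING_MODEL = "Updating AI models for watchlist"


def status_payload(phase: WatchlistUpdatePhase, remaining_seconds: Optional[int] = None) -> dict:
    """Only the training phase carries a time estimate."""
    payload = {"status": phase.value}
    if phase is WatchlistUpdatePhase.TRAINING_MODEL and remaining_seconds is not None:
        payload["remainingTime"] = remaining_seconds
    return payload
```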
Once the AIOps gateway sends the workloads watchlist through NATS, the training controller service will count the total number of log messages it will fetch from Opensearch and then estimate the time to train in the following manner.
Let B represent the time it takes to process one batch of 32 training logs.
Have we tried other batch sizes? @tybalex
That result will be sent to the /model/statistics endpoint.
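The exact formula appears to have been cut off in the summary, so the following is only an assumed reconstruction of the estimate described here: the total log count divided into batches of 32, scaled by the measured per-batch time B.

```python
import math

BATCH_SIZE = 32  # batch size referenced above


def estimate_training_seconds(total_log_count: int, seconds_per_batch: float) -> float:
    """Assumed reconstruction: number of batches times the per-batch time B."""
    num_batches = math.ceil(total_log_count / BATCH_SIZE)
    return num_batches * seconds_per_batch


# Example: 1,000,000 logs at 0.05 s per batch comes to roughly 26 minutes;
# the resulting figure would then be reported to /model/statistics.
print(estimate_training_seconds(1_000_000, 0.05))  # 1562.5
```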
## Acceptance criteria:
* When it is time to train a new model, initially there should be no estimate provided for when the model will be trained; the first estimate should be based on the number of logs present within the last hour and the average amount of time it takes to process 100 batches of log messages.
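A sketch of how the 100-batch average in this criterion might be measured; `train_step` and the timing loop are hypothetical, not the training controller's actual code:

```python
import time
from typing import Callable, Iterable


def average_seconds_per_batch(batches: Iterable, train_step: Callable, sample_size: int = 100) -> float:
    """Average wall-clock time of train_step over the first sample_size batches."""
    durations = []
    for i, batch in enumerate(batches):
        start = time.monotonic()
        train_step(batch)
        durations.append(time.monotonic() - start)
        if i + 1 >= sample_size:
            break
    if not durations:
        return 0.0
    return sum(durations) / len(durations)
```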
What should the status messages that users see be? Let's determine that here before accepting this OEP.
User Story:
As a user of Opni, I would like to receive an accurate initial estimate of how long it will take for AIOps insights to be ready for workload logs.
Additional user stories - some may need separate OEPs:
- As a user of Opni, I want to see the time estimate appear as quickly as possible.
- As a user of Opni, I want to know whether I am allowed to update a watchlist.
  - What happens if a user hits update watchlist but no NVIDIA GPU is attached?
- As a user of Opni, I want to know whether a model is being trained or the data is being prepared before training.
- As a user of Opni, I want to be alerted when a model has been trained successfully.
- As a user of Opni AIOps, I want to be alerted when anything related to update watchlist goes wrong.

For the last two, @alexandreLamarre, what do you think? Can you and Amartya put this in your backlog?
It looks like the summary was cut off, so that needs to be finished.
@AmartC and I talked about this yesterday. We decided to produce a more accurate estimate from the beginning of a watchlist update. This means there will be minimal impact on the UI, and overall the UX should be improved.
I don't feel too strongly about delineating between fetching and training. If we do that, I'll update the messaging in the banner.