OEP: Enhance initial prediction of estimate for time of model training. #1033

Open

AmartC wants to merge 4 commits into main from oep-improve-training-estimate

Conversation

Contributor

@AmartC commented Feb 7, 2023

No description provided.

Contributor

@sanjay920 left a comment


Need feedback from @codyrancher before approval

Comment on lines 19 to 20
Within the /model/train endpoint of the AIOps plugin, when it is time to train a new Deep Learning model, set the remaining time for training a new model to -1 instead of 3600, the previous value passed in to represent 1 hour. The UI service will also need to be updated to parse the -1 value as an indication that the estimate is currently being calculated.
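
As a rough illustration of the sentinel value described above (the payload field and function names below are hypothetical, not taken from the Opni codebase), the training side would report -1 until a real estimate exists, and a consumer such as the UI service would treat any negative value as "estimate still being calculated":

```python
# Hypothetical sketch of the -1 sentinel; field and function names are illustrative only.
CALCULATING = -1  # previously a hard-coded 3600 (one hour) was passed in

def initial_training_status() -> dict:
    """Status payload returned when a new model has just been queued for training."""
    return {"remainingTime": CALCULATING}

def format_remaining_time(remaining_time: int) -> str:
    """How a consumer (e.g. the UI service) might render the value."""
    if remaining_time < 0:
        return "Calculating time estimate..."
    return f"Approximately {remaining_time} seconds remaining"
```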

Contributor


Can we have a different approach here? I think it's better to have 2 separate statuses for the 2 phases of a watchlist update (fetch data, train model):

  1. Preparing data for watchlist update (data is being dumped)
    a. No time estimate given
  2. Updating AI models for watchlist (model is training)
    a. Provide time estimate
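
A rough sketch of what such a two-phase status might look like (the names and messages below are illustrative assumptions, not the actual Opni status values):

```python
from enum import Enum
from typing import Optional

class WatchlistUpdatePhase(Enum):
    # Phase 1: data is being dumped from OpenSearch; no time estimate yet.
    PREPARING_DATA = "Preparing data for watchlist update"
    # Phase 2: the model is training; a time estimate can be provided.
    TRAINING_MODEL = "Updating AI models for watchlist"

def status_message(phase: WatchlistUpdatePhase, remaining_seconds: Optional[int] = None) -> str:
    """Compose a user-facing status line for the current phase."""
    if phase is WatchlistUpdatePhase.PREPARING_DATA or remaining_seconds is None:
        return phase.value
    return f"{phase.value} (about {remaining_seconds} seconds remaining)"
```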


Once the AIOps gateway sends the workloads watchlist through NATS, the training controller service will count the total number of log messages it will be fetching from OpenSearch and then estimate the time to train in the following manner.

Let B represent the time it takes to process one batch of 32 training logs.
Contributor


have we tried other batch sizes? @tybalex

That result will be sent to the /model/statistics endpoint.
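
A minimal sketch of that calculation, assuming B is the measured (or averaged) time per batch of 32 logs; the function name and signature below are placeholders rather than the actual training controller code:

```python
import math

BATCH_SIZE = 32  # batch size referenced in the OEP

def estimate_training_time(total_log_count: int, seconds_per_batch: float) -> float:
    """Estimated training time in seconds: number of batches times B (seconds per batch)."""
    num_batches = math.ceil(total_log_count / BATCH_SIZE)
    return num_batches * seconds_per_batch

# The resulting estimate is what would be reported to the /model/statistics endpoint.
```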

## Acceptance criteria:
* When it is time to train a new model, there should initially be no estimate provided for when the model will be trained; the initial estimate should then be based on the number of logs present within the last hour and the average amount of time it takes to process 100 batches of log messages.
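
As a worked example with purely hypothetical numbers: if the last hour contains 64,000 log messages and 100 batches of 32 logs take 40 seconds on average (B = 0.4 s per batch), the initial estimate would be 64,000 / 32 × 0.4 s = 800 s, roughly 13 minutes.
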
Contributor


What should the status messages that users see be? Let's determine that here before accepting this OEP.

Comment on lines +36 to +65
User Story:
As a user of Opni, I would like to receive an accurate initial estimate on how long it will take for AIOps insights to be ready for workload logs.
Contributor


Additional user stories - some may need separate OEPs:

  • As a user of Opni, I want to see the time estimate appear as fast as possible
  • As a user of Opni, I want to know if I am allowed to update a watchlist
    • What happens if a user hits update watchlist but no NVIDIA GPU is attached?
  • As a user of Opni, I want to know whether a model is being trained or the data is being prepared before training
  • As a user of Opni, I want to be alerted when a model has been trained successfully
  • As a user of Opni AIOps, I want to be alerted when anything related to update watchlist goes wrong

For the last 2, @alexandreLamarre, what do you think? Can you and Amartya put this in your backlog?

@AmartC force-pushed the oep-improve-training-estimate branch from e868258 to 6fd88cd on February 7, 2023 at 23:14
codyrancher previously approved these changes Feb 8, 2023
Contributor

@codyrancher left a comment


It looks like the summary was cut off, so that needs to be finished.

@AmartC and I talked about this yesterday. We decided on providing a more accurate estimate from the beginning of a watchlist update. This means there will be minimal impact on the UI, and overall the UX should be improved.

I don't feel too strongly about delineating between fetching and training. If we do that, I'll update the messaging in the banner.

@AmartC force-pushed the oep-improve-training-estimate branch from 29495cc to 5d53e55 on February 8, 2023 at 21:34