OEP: Enhance the initial estimate of model training time. #1033
base: main
Conversation
Need feedback from @codyrancher before approval
Within the /model/train endpoint of the AIOps plugin, when it is time to train a new Deep Learning model, set the remaining training time to -1 instead of 3600, the value previously passed in to represent 1 hour. The UI service will also need to be updated to interpret the -1 value as an indication that the estimate is currently being calculated.
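A minimal sketch of the sentinel handling described above, assuming a simple JSON response shape; the function names and the `remainingTime` field are illustrative, not the plugin's actual API:

```python
# Hypothetical sketch of the -1 sentinel; names and fields are illustrative.
TIME_UNKNOWN = -1  # replaces the old hard-coded 3600 (1 hour) placeholder


def train_response(training_requested: bool) -> dict:
    """Build the /model/train response body when a new model must be trained."""
    if training_requested:
        # No real estimate yet: signal "still calculating" instead of guessing 1 hour.
        return {"remainingTime": TIME_UNKNOWN}
    return {"remainingTime": 0}


def format_remaining_time(remaining: int) -> str:
    """Mirror of the parsing the UI service would need (shown in Python for brevity)."""
    if remaining == TIME_UNKNOWN:
        return "Calculating time estimate..."
    minutes = max(1, remaining // 60)
    return f"Approximately {minutes} minute(s) remaining"
```

The key point is that -1 never reaches the user as a literal duration; the UI maps it to a "calculating" message.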
Can we have a different approach here? I think it's better to have two separate statuses for the two phases of a watchlist update (fetch data, train model):
- Preparing data for watchlist update (data is being dumped): no time estimate given
- Updating AI models for watchlist (model is training): provide a time estimate
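One way the two statuses could be modeled; the enum and payload names below are hypothetical, not existing Opni code:

```python
from enum import Enum
from typing import Optional


class WatchlistUpdatePhase(Enum):
    # Phase 1: log data is being dumped from Opensearch; no estimate yet.
    PREPARING_DATA = "Preparing data for watchlist update"
    # Phase 2: the Deep Learning model is training; an estimate is available.
    TRAINING_MODEL = "Updating AI models for watchlist"


def status_payload(phase: WatchlistUpdatePhase, remaining_seconds: Optional[int] = None) -> dict:
    """Only the training phase carries a time estimate."""
    payload = {"status": phase.value}
    if phase is WatchlistUpdatePhase.TRAINING_MODEL and remaining_seconds is not None:
        payload["remainingTime"] = remaining_seconds
    return payload
```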
Once the AIOps gateway sends the workloads watchlist through NATS, the training controller service will count the total number of log messages it will fetch from Opensearch and then estimate the time to train in the following manner.
Let B represent the time it takes to process one batch of 32 training logs.
Have we tried other batch sizes? @tybalex
That result will be sent to the /model/statistics endpoint.
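The exact formula appears to have been cut off in the summary, so the following is only an assumed reconstruction of the estimate described here: the total log count divided into batches of 32, scaled by the measured per-batch time B.

```python
import math

BATCH_SIZE = 32  # batch size referenced above


def estimate_training_seconds(total_log_count: int, seconds_per_batch: float) -> float:
    """Assumed reconstruction: number of batches times the per-batch time B."""
    num_batches = math.ceil(total_log_count / BATCH_SIZE)
    return num_batches * seconds_per_batch


# Example: 1,000,000 logs at 0.05 s per batch comes to roughly 26 minutes;
# the resulting figure would then be reported to /model/statistics.
print(estimate_training_seconds(1_000_000, 0.05))  # 1562.5
```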
## Acceptance criteria:
* When it is time to train a new model, initially there should be no estimate provided for when the model will be trained; the first estimate should be based on the number of logs present within the last hour and the average amount of time it takes to process 100 batches of log messages.
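A sketch of how the 100-batch average in this criterion might be measured; `train_step` and the timing loop are hypothetical, not the training controller's actual code:

```python
import time
from typing import Callable, Iterable


def average_seconds_per_batch(batches: Iterable, train_step: Callable, sample_size: int = 100) -> float:
    """Average wall-clock time of train_step over the first sample_size batches."""
    durations = []
    for i, batch in enumerate(batches):
        start = time.monotonic()
        train_step(batch)
        durations.append(time.monotonic() - start)
        if i + 1 >= sample_size:
            break
    if not durations:
        return 0.0
    return sum(durations) / len(durations)
```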
What should the status messages that users see be? Let's determine that here before accepting this OEP.
User Story:
As a user of Opni, I would like to receive an accurate initial estimate of how long it will take for AIOps insights to be ready for workload logs.
Additional user stories - some may need separate OEPs:
- As a user of Opni, I want to see the time estimate appear as quickly as possible.
- As a user of Opni, I want to know whether I am allowed to update a watchlist.
  - What happens if a user hits update watchlist but no NVIDIA GPU is attached?
- As a user of Opni, I want to know whether a model is being trained or the data is being prepared before training.
- As a user of Opni, I want to be alerted when a model has been trained successfully.
- As a user of Opni AIOps, I want to be alerted when anything related to update watchlist goes wrong.

For the last two, @alexandreLamarre, what do you think? Can you and Amartya put this in your backlog?
It looks like the summary was cut off, so that needs to be finished.
@AmartC and I talked about this yesterday. We decided to produce a more accurate estimate from the beginning of a watchlist update. This means there will be minimal impact on the UI, and overall the UX should be improved.
I don't feel too strongly about delineating between fetching and training. If we do that, I'll update the messaging in the banner.