fix(ai): update ai-video selection suspension #3033
base: master
Conversation
Feel like I don't have context to officially approve this, but left some comments. Only nits though; the implementation makes sense given the PR description.
-	// as well
+	// as well. Since AISessionManager re-uses the pools the suspension
+	// penalty needs to consider the current suspender count to set the penalty
+	last_count, ok := pool.suspender.list[sess.Transcoder()]
nit/lint: vars in Go should be camelCase, e.g. `lastCount`.
	// Refresh if the # of sessions across warm and cold pools falls below the smaller of the maxRefreshSessionsThreshold and
	// 1/2 the total # of orchs that can be queried during discovery
This comment seems out of place now, can you move it closer to L247?
@@ -222,7 +233,17 @@ func (sel *AISessionSelector) Select(ctx context.Context) *AISession {
 	shouldRefreshSelector := func() bool {
 		// Refresh if the # of sessions across warm and cold pools falls below the smaller of the maxRefreshSessionsThreshold and
 		// 1/2 the total # of orchs that can be queried during discovery
-		discoveryPoolSize := sel.node.OrchestratorPool.Size()
+		discoveryPoolSize := int(math.Min(float64(sel.node.OrchestratorPool.Size()), float64(sel.initialPoolSize)))
Why do we need this `Min` now? Can the pool grow from its initial size?
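For context, a minimal, runnable sketch of the refresh check after this change. It is an illustration only: `maxRefreshSessionsThreshold` is given an assumed value, and `shouldRefresh` and its arguments are hypothetical stand-ins for the selector's fields shown in the hunk above.

```go
package main

import (
	"fmt"
	"math"
)

// Assumed value for illustration; not taken from the actual code.
const maxRefreshSessionsThreshold = 8.0

// shouldRefresh is a hypothetical stand-in for shouldRefreshSelector, with
// the selector's fields passed in as arguments.
func shouldRefresh(orchPoolSize, initialPoolSize, sessionCount int) bool {
	// Cap the discovery pool size at the size recorded on the last refresh so
	// a large default pool (e.g. 100) doesn't force a refresh on every request.
	discoveryPoolSize := int(math.Min(float64(orchPoolSize), float64(initialPoolSize)))
	// Refresh when the session count falls below the smaller of the threshold
	// and half of the (capped) discovery pool size.
	threshold := int(math.Min(maxRefreshSessionsThreshold, math.Ceil(float64(discoveryPoolSize)/2.0)))
	return sessionCount < threshold
}

func main() {
	// With only 2 orchs serving the model, the cap keeps the threshold at
	// ceil(2/2) = 1 instead of min(8, 100/2) = 8.
	fmt.Println(shouldRefresh(100, 2, 1)) // false: 1 session is not below 1
}
```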
	// penalty needs to consider the current suspender count to set the penalty
	last_count, ok := pool.suspender.list[sess.Transcoder()]
	if ok {
		penalty = pool.suspender.count - last_count + pool.penalty
I'm a little lost with this suspension logic. So, I see that:

- `pool.suspender.count` is increased every time `signalRefresh()` is called
- `pool.penalty` is always set to `3`
- `last_count` is always set to `suspender.count + 3`

So, that logic would mean that we're not taking the suspended orchestrator until `signalRefresh()` is called 3 times. Is this the idea of this suspension mechanism? That we don't allow the given O to get selected in the 3 refresh sessions?
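To make that counting concrete, a simplified model of the suspender as described in this comment. This is illustrative only; field and method names mirror the snippets above, but it is not the actual implementation.

```go
package main

import "fmt"

// Simplified model of the suspender discussed above (not the actual code).
type suspender struct {
	count int            // incremented on every signalRefresh()
	list  map[string]int // orch -> count at which it is released
}

func (s *suspender) signalRefresh() { s.count++ }

// suspend records the orch with a penalty of 3 refreshes: it is released
// once count has advanced 3 past the value recorded here.
func (s *suspender) suspend(orch string) { s.list[orch] = s.count + 3 }

func (s *suspender) suspended(orch string) bool {
	last, ok := s.list[orch]
	return ok && last > s.count
}

func main() {
	s := &suspender{list: map[string]int{}}
	s.suspend("orchA")
	for i := 0; i < 4; i++ {
		fmt.Println(i, s.suspended("orchA")) // true for 3 refreshes, then false
		s.signalRefresh()
	}
}
```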
	// if there are no orchestrators in the pools
	clog.Infof(ctx, "refreshing sessions, no orchestrators in pools")
	for i := 0; i < sel.penalty; i++ {
		sel.suspender.signalRefresh()
> release all orchestrators

Shouldn't we then just remove them from the `suspender.list` rather than calling `signalRefresh()`? My understanding is that if `penalty = 3`, then we would need to call `signalRefresh()` 3 times in order to "release all orchestrators from suspension".
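The two approaches being compared here, sketched against the same simplified suspender model as above (again, an illustration, not the actual go-livepeer code):

```go
package main

import "fmt"

// Minimal suspender model, same shape as the earlier sketch.
type suspender struct {
	count int
	list  map[string]int // orch -> count at which it is released
}

func (s *suspender) signalRefresh() { s.count++ }

// Option A (what the PR does): advance the refresh counter past every
// recorded release point by signaling `penalty` times.
func releaseBySignaling(s *suspender, penalty int) {
	for i := 0; i < penalty; i++ {
		s.signalRefresh()
	}
}

// Option B (the reviewer's suggestion): drop the suspensions outright.
func releaseByClearing(s *suspender) {
	s.list = map[string]int{}
}

func main() {
	s := &suspender{count: 5, list: map[string]int{"orchA": 8}}
	releaseBySignaling(s, 3) // count becomes 8, so orchA (released at 8) is free
	fmt.Println(s.count, len(s.list))
	releaseByClearing(s) // alternatively, just forget the suspensions
	fmt.Println(len(s.list))
}
```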
@leszko Are we still planning to merge these fixes? They help alleviate some of the orchestrator stickiness issues we've observed over the past few months.
Well...if it helps Orchestrators, then yes. Why not? 🙃 Please address the PR review comments and re-request review :)
What does this pull request do? Explain your changes. (required)
Draft of ai-video selection algo fix.
Suspension was not working because the `penalty` was always `3`. This logic was a carryover from transcoding, where the suspender always started at a refresh count of 0 because a new session manager was created for each stream. For AI, we are reusing the session manager and the suspender, so the refresh count does not reset between requests. The fix is to take the suspender's current refresh count into account when calculating the penalty, so that it is 3 more than the current refresh count.

There was also an issue where the `discoveryPoolSize` was always 100; with a limited number of orchestrators providing models, sessions were being refreshed on every request. I added an `initialPoolSize` field to track the pool size at the last refresh and use it in the `shouldRefreshSessions` logic instead of 100. This stabilizes the suspender and allows more orchestrators to be tried with each `Select` call.

The last update moves the suspender's `signalRefresh()` call, which increments the suspender's refresh counter, into the `Refresh` function. This makes the behavior more predictable: every time we refresh sessions, the suspender's refresh count advances.

Happy to segregate some of these changes into separate PRs. The suspension fixes can be added separately, without a dependency on the ai-worker PR.
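To illustrate the last two changes together, a minimal, runnable sketch of a selector whose `Refresh` advances the suspender's refresh counter and records the pool size for the later refresh check. The names (`selector`, `Refresh`, `initialPoolSize`) are illustrative stand-ins, not the actual go-livepeer types:

```go
package main

import "fmt"

type suspender struct{ count int }

func (s *suspender) signalRefresh() { s.count++ }

type selector struct {
	susp            *suspender
	initialPoolSize int
}

// Refresh re-discovers sessions. Per the description above, the suspender's
// refresh counter is advanced here, so every session refresh ages out
// suspensions by exactly one tick.
func (sel *selector) Refresh(poolSize int) {
	sel.susp.signalRefresh()
	// Record the pool size seen at this refresh for shouldRefreshSessions.
	sel.initialPoolSize = poolSize
}

func main() {
	sel := &selector{susp: &suspender{}}
	sel.Refresh(2)
	sel.Refresh(2)
	fmt.Println(sel.susp.count, sel.initialPoolSize) // 2 2
}
```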
Specific updates (required)
How did you test each of these updates (required)
I have been running these updates on my gateway. Tested 1-200 requests with 5-10 workers sending to the gateway; all completed with 1-2 orchestrators providing the Bytedance model.
Does this pull request close any open issues?
Checklist:
- `make` runs successfully
- All tests in `./test.sh` pass