allocator: make assignment much much faster #351
Conversation
When used with a large number of journals, certain graph constructions can be quite slow, particularly during member scale-down.

First, update `benchmark_test.go` to separately model the scale-up and scale-down phases of a simulated rolling deployment, which replicates the conditions we've observed.

Next, implement the label "gap" heuristic for the push/relabel algorithm by tracking the number of nodes at each height < len(nodes), and watching for counts that drop to zero as an indication that a gap has been created. Respond to gaps by immediately re-labeling the affected nodes to a larger height that reconnects them to the network. On my machine, with the current (toy) parameters, the original algorithm had a max-flow round which required >4.5s to complete; the same network now completes in 27ms. This difference scales up combinatorially with graph complexity.

Update some tests to use "testing" instead of go-check. Except where tests have been extended, all scenario tests are essentially unchanged.

Finally, bound the number of items that participate in a single max-flow network, and compose the final desired assignment solution by solving multiple independent max-flow problems, each over a disjoint subset of items and all members. This makes the big-O runtime of assignment linear in the number of items, which generally dominates over members. Tested up to 1MM items / 3MM assignments and 600 members.
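For readers unfamiliar with the gap heuristic, a minimal sketch of the idea follows. This is an illustration only, not the PR's implementation; the `heights`, `counts`, and `relabelWithGapCheck` names are assumed for the example.

```go
// relabelWithGapCheck raises a node's height and applies the gap heuristic.
// counts[h] tracks how many nodes currently sit at height h < len(heights).
// When a relabel empties a height bucket, every node strictly between that
// height and n can no longer reach the sink, so it is lifted past n at once
// and its excess is routed back toward the source instead.
func relabelWithGapCheck(heights, counts []int, node, newHeight int) {
	var n = len(heights)
	var old = heights[node]

	if old < n {
		counts[old]--
	}
	heights[node] = newHeight
	if newHeight < n {
		counts[newHeight]++
	}

	// Gap detected: no node remains at height |old|.
	if old < n && counts[old] == 0 {
		for i := range heights {
			if heights[i] > old && heights[i] < n {
				counts[heights[i]]--
				heights[i] = n + 1
			}
		}
	}
}
```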
Force-pushed from 741cbd2 to 1c877f4: "To make CI run a little faster."
This was interesting, and required a few trips to the push-relabel Wikipedia page before I felt like I understood what was going on here. I'm sure that I still don't fully understand all of this, and I left a few questions just to make sure I'm not completely lost.
The gap heuristic makes sense, though, and so does the splitting of the problem into smaller groups of items. And FWIW I didn't notice anything that seemed wrong.
```go
// instead of combinatorial over members and items.
const itemsPerNetwork = 10000

// NOTE(johnny): this could trivially be parallelized if needed.
```
Just for my own understanding: I don't understand how this could be trivially parallelized while still respecting the item limit for each member. Wouldn't each process need to know the current number of assignments for each member in order to be able to respect their item limits?
Answered below.

Then re: parallelization, `State` is unchanged when building / evaluating the max flow network, so multiple goroutines can work in parallel. I didn't do it yet because it's extra SLOC we don't appear to need, and I didn't know for sure that the memory impact would be negligible.
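For illustration, a hypothetical sketch of that parallelization (not implemented in this PR). It assumes a `solveSubNetwork` callback that builds and solves the max-flow network for one disjoint chunk of items, reading the shared `State` without mutating it.

```go
package sketch // hypothetical package, for illustration only

import "sync"

// solveChunksInParallel solves each disjoint item chunk in its own goroutine
// and waits for all of them. Each chunk only reads shared state and writes
// chunk-local results, so no additional synchronization is needed.
func solveChunksInParallel(numChunks int, solveSubNetwork func(chunk int)) {
	var wg sync.WaitGroup
	for chunk := 0; chunk != numChunks; chunk++ {
		wg.Add(1)
		go func(chunk int) {
			defer wg.Done()
			solveSubNetwork(chunk)
		}(chunk)
	}
	wg.Wait()
}
```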
```go
func (fs *sparseFlowNetwork) buildMemberArc(mf *pr.MaxFlow, id pr.NodeID, member int) []pr.Arc {
	var c = memberAt(fs.Members, member).ItemLimit()
	// Constrain to the scaled ItemLimit for our portion of the global assignment problem.
	c = scaleAndRound(c, len(fs.myItems), len(fs.Items))
```
Another check for understanding: I think this line addresses my previous question about respecting item limits when treating this as several smaller networks, yeah?
That's right. Each sub-problem gets a corresponding fraction of the total member capacity.
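As a guess at what that scaling looks like (the actual `scaleAndRound` may round differently; this is only an illustrative sketch):

```go
// scaleAndRound, sketched: scale a member's ItemLimit by this sub-network's
// share of all items, rounding to the nearest whole assignment. For example,
// an ItemLimit of 100 over a sub-network holding 10,000 of 40,000 total items
// yields a capacity of 25 within this sub-network.
func scaleAndRound(limit, myItems, totalItems int) int {
	return (limit*myItems + totalItems/2) / totalItems
}
```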