[CELEBORN-1811] Update default value for `celeborn.master.slot.assign.extraSlots`

### What changes were proposed in this pull request?
Change the default value of `celeborn.master.slot.assign.extraSlots` from `2` to `100` to avoid possible worker load skew for stages with very few reducers.

### Why are the changes needed?
If a stage has very few reducers and skewed partitions, the previous default value of `2` can lead to serious worker load imbalance, leaving some workers unable to handle the shuffle data.

### Does this PR introduce _any_ user-facing change?
Yes
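
The user-facing effect is the new default of `100` for `celeborn.master.slot.assign.extraSlots`. As a hedged sketch, a deployment that wants to keep the previous behavior could pin the old value explicitly. This assumes the master loads its settings from `conf/celeborn-defaults.conf` in whitespace-separated key/value form; the file location is an assumption about the deployment, not something this commit states.

```properties
# Assumed location: conf/celeborn-defaults.conf on the master node.
# Pins the pre-0.6.0 default instead of the new default of 100.
celeborn.master.slot.assign.extraSlots 2
```

Leaving the key unset picks up the new default, which per the migration note involves more workers in offering slots.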

### How was this patch tested?
GA and cluster test.

Closes #3039 from FMX/1811.

Authored-by: mingji <[email protected]>
Signed-off-by: SteNicholas <[email protected]>
FMX authored and SteNicholas committed Dec 31, 2024
1 parent 56019c7 commit 4ec0228
Showing 3 changed files with 4 additions and 2 deletions.
@@ -2883,7 +2883,7 @@ object CelebornConf extends Logging {
.version("0.3.0")
.doc("Extra slots number when master assign slots.")
.intConf
- .createWithDefault(2)
+ .createWithDefault(100)

val MASTER_SLOT_ASSIGN_MAX_WORKERS: ConfigEntry[Int] =
buildConf("celeborn.master.slot.assign.maxWorkers")
2 changes: 1 addition & 1 deletion docs/configuration/master.md
@@ -72,7 +72,7 @@ license: |
| celeborn.master.port | 9097 | false | Port for master to bind. | 0.2.0 | |
| celeborn.master.rackResolver.refresh.interval | 30s | false | Interval for refreshing the node rack information periodically. | 0.5.0 | |
| celeborn.master.send.applicationMeta.threads | 8 | false | Number of threads used by the Master to send ApplicationMeta to Workers. | 0.5.0 | |
- | celeborn.master.slot.assign.extraSlots | 2 | false | Extra slots number when master assign slots. | 0.3.0 | celeborn.slots.assign.extraSlots |
+ | celeborn.master.slot.assign.extraSlots | 100 | false | Extra slots number when master assign slots. | 0.3.0 | celeborn.slots.assign.extraSlots |
| celeborn.master.slot.assign.loadAware.diskGroupGradient | 0.1 | false | This value means how many more workload will be placed into a faster disk group than a slower group. | 0.3.0 | celeborn.slots.assign.loadAware.diskGroupGradient |
| celeborn.master.slot.assign.loadAware.fetchTimeWeight | 1.0 | false | Weight of average fetch time when calculating ordering in load-aware assignment strategy | 0.3.0 | celeborn.slots.assign.loadAware.fetchTimeWeight |
| celeborn.master.slot.assign.loadAware.flushTimeWeight | 0.0 | false | Weight of average flush time when calculating ordering in load-aware assignment strategy | 0.3.0 | celeborn.slots.assign.loadAware.flushTimeWeight |
2 changes: 2 additions & 0 deletions docs/migration.md
@@ -23,6 +23,8 @@ license: |

# Upgrading from 0.5 to 0.6

+ - Since 0.6.0, Celeborn changed the default value of `celeborn.master.slot.assign.extraSlots` from `2` to `100`, which means Celeborn will involve more workers in offering slots.
+
- Since 0.6.0, Celeborn deprecate `celeborn.worker.congestionControl.low.watermark`. Please use `celeborn.worker.congestionControl.diskBuffer.low.watermark` instead.

- Since 0.6.0, Celeborn deprecate `celeborn.worker.congestionControl.high.watermark`. Please use `celeborn.worker.congestionControl.diskBuffer.high.watermark` instead.