[CELEBORN-1774][FOLLOWUP] Change celeborn.<module>.io.mode optional to explain default behavior in description

### What changes were proposed in this pull request?

Make `celeborn.<module>.io.mode` optional and explain its default behavior in the option's description.

### Why are the changes needed?

The default value of `celeborn.<module>.io.mode` shown in the generated documentation depends on whether epoll mode is available on the operating system where the documentation is built, so it can differ from one OS to another. Therefore, `celeborn.<module>.io.mode` should be made optional, with its default behavior explained in the option's description.
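To illustrate why a function-computed default is a problem for generated documentation, here is a simplified, hypothetical Scala model. The names `DocDefaultSketch`, `WithDefaultFunction`, `OptionalEntry`, `renderedDefault`, and the `epollAvailable` flag are illustrative assumptions, not Celeborn's real classes. The sketch assumes the docs generator renders a default by evaluating it on the build machine, and renders optional entries as `<undefined>`, as the updated `network.md` row does.

```scala
// Hypothetical, simplified model of how a docs generator might render defaults.
// These are NOT Celeborn's actual ConfigEntry classes.
object DocDefaultSketch {
  sealed trait Entry {
    def key: String
    def renderedDefault: String
  }

  // Default computed by a function: the generator evaluates it on the build machine.
  final case class WithDefaultFunction(key: String, default: () => String) extends Entry {
    def renderedDefault: String = default()
  }

  // Optional entry: there is no default value to render.
  final case class OptionalEntry(key: String) extends Entry {
    def renderedDefault: String = "<undefined>"
  }

  // `epollAvailable` stands in for Netty's Epoll.isAvailable on the build machine.
  def ioModeEntry(optional: Boolean, epollAvailable: Boolean): Entry =
    if (optional) OptionalEntry("celeborn.<module>.io.mode")
    else WithDefaultFunction(
      "celeborn.<module>.io.mode",
      () => if (epollAvailable) "EPOLL" else "NIO")

  def main(args: Array[String]): Unit = {
    // Before: the rendered default depends on where the docs are built.
    println(ioModeEntry(optional = false, epollAvailable = true).renderedDefault)  // EPOLL
    println(ioModeEntry(optional = false, epollAvailable = false).renderedDefault) // NIO
    // After: the rendered default is the same everywhere.
    println(ioModeEntry(optional = true, epollAvailable = true).renderedDefault)   // <undefined>
  }
}
```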

Follow-up to #3039 (comment).

### Does this PR introduce _any_ user-facing change?

`celeborn.<module>.io.mode` is now optional, and its description explains the default behavior.
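Even though the entry is optional, an unset key still resolves to a concrete mode at runtime. The sketch below mirrors the `networkIoMode(module)` change in this commit. It assumes a plain `Map[String, String]` in place of `CelebornConf` and a boolean `epollAvailable` in place of Netty's `Epoll.isAvailable`; `IoModeResolution` is a hypothetical name.

```scala
// Sketch mirroring networkIoMode(module) from this commit:
// substitute the module name into the key, then fall back to a default
// chosen from epoll availability when the key is unset.
// `settings` and `epollAvailable` are hypothetical stand-ins for CelebornConf and Epoll.isAvailable.
object IoModeResolution {
  val IoModeKey = "celeborn.<module>.io.mode"

  def networkIoMode(settings: Map[String, String], module: String, epollAvailable: Boolean): String =
    settings.getOrElse(
      IoModeKey.replace("<module>", module),
      if (epollAvailable) "EPOLL" else "NIO")

  def main(args: Array[String]): Unit = {
    val settings = Map("celeborn.data.io.mode" -> "NIO")
    println(networkIoMode(settings, "data", epollAvailable = true))   // NIO (explicitly set)
    println(networkIoMode(settings, "fetch", epollAvailable = true))  // EPOLL (unset, epoll available)
    println(networkIoMode(settings, "fetch", epollAvailable = false)) // NIO (unset, fallback)
  }
}
```

An explicitly configured value always wins, so only the fallback depends on the host OS.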

### How was this patch tested?

CI.

Closes #3044 from SteNicholas/CELEBORN-1774.

Authored-by: SteNicholas <[email protected]>
Signed-off-by: SteNicholas <[email protected]>
SteNicholas committed Jan 2, 2025
1 parent a318eb4 commit 16762c6
Showing 3 changed files with 7 additions and 6 deletions.
@@ -539,7 +539,9 @@ class CelebornConf(loadDefaults: Boolean) extends Cloneable with Logging with Se
   def rpcDumpIntervalMs(): Long = get(RPC_SUMMARY_DUMP_INTERVAL)
 
   def networkIoMode(module: String): String = {
-    getTransportConf(module, NETWORK_IO_MODE)
+    get(
+      NETWORK_IO_MODE.key.replace("<module>", module),
+      if (Epoll.isAvailable) IOMode.EPOLL.name() else IOMode.NIO.name())
   }
 
   def networkIoPreferDirectBufs(module: String): Boolean = {
@@ -1931,15 +1933,14 @@ object CelebornConf extends Logging {
       .timeConf(TimeUnit.MILLISECONDS)
       .createWithDefaultString("60s")
 
-  val NETWORK_IO_MODE: ConfigEntry[String] =
+  val NETWORK_IO_MODE: OptionalConfigEntry[String] =
     buildConf("celeborn.<module>.io.mode")
       .categories("network")
       .doc("Netty EventLoopGroup backend, available options: NIO, EPOLL. If epoll mode is available, the default IO mode is EPOLL; otherwise, the default is NIO.")
       .stringConf
       .transform(_.toUpperCase)
       .checkValues(Set(IOMode.NIO.name(), IOMode.EPOLL.name()))
-      .createWithDefaultFunction(() =>
-        if (Epoll.isAvailable) IOMode.EPOLL.name() else IOMode.NIO.name())
+      .createOptional
 
   val NETWORK_IO_PREFER_DIRECT_BUFS: ConfigEntry[Boolean] =
     buildConf("celeborn.<module>.io.preferDirectBufs")
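With `.createOptional`, reading the entry yields an `Option`: empty when the key is unset, otherwise the normalized, validated value. Below is a minimal sketch of that read path, assuming a `Map[String, String]` settings source. `OptionalIoModeEntry.read` is a hypothetical helper that mimics the `.transform(_.toUpperCase)` and `.checkValues(...)` steps of the builder chain in the hunk above; it is not Celeborn's `ConfigEntry` implementation.

```scala
// Hypothetical sketch of reading an optional, validated config entry.
// Mimics the builder chain: .transform(_.toUpperCase) -> .checkValues(...) -> .createOptional
object OptionalIoModeEntry {
  val ValidModes: Set[String] = Set("NIO", "EPOLL")

  def read(settings: Map[String, String], key: String): Option[String] =
    settings.get(key).map { raw =>
      val normalized = raw.toUpperCase          // mirrors .transform(_.toUpperCase)
      require(ValidModes.contains(normalized),  // mirrors .checkValues(...); throws IllegalArgumentException
        s"$key should be one of ${ValidModes.mkString(", ")}, but was $raw")
      normalized
    }
}
```

For example, `read(Map("celeborn.data.io.mode" -> "epoll"), "celeborn.data.io.mode")` returns `Some("EPOLL")`, an unset key returns `None`, and a value such as `kqueue` fails the `require` check with an `IllegalArgumentException`.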
2 changes: 1 addition & 1 deletion docs/configuration/network.md
@@ -29,7 +29,7 @@ license: |
| celeborn.&lt;module&gt;.io.enableVerboseMetrics | false | false | Whether to track Netty memory detailed metrics. If true, the detailed metrics of Netty PoolByteBufAllocator will be gotten, otherwise only general memory usage will be tracked. | | |
| celeborn.&lt;module&gt;.io.lazyFD | true | false | Whether to initialize FileDescriptor lazily or not. If true, file descriptors are created only when data is going to be transferred. This can reduce the number of open files. If setting <module> to `fetch`, it works for worker fetch server. | | |
| celeborn.&lt;module&gt;.io.maxRetries | 3 | false | Max number of times we will try IO exceptions (such as connection timeouts) per request. If set to 0, we will not do any retries. If setting <module> to `data`, it works for shuffle client push and fetch data. If setting <module> to `replicate`, it works for replicate client of worker replicating data to peer worker. If setting <module> to `push`, it works for Flink shuffle client push data. | | |
-| celeborn.&lt;module&gt;.io.mode | EPOLL | false | Netty EventLoopGroup backend, available options: NIO, EPOLL. If epoll mode is available, the default IO mode is EPOLL; otherwise, the default is NIO. | | |
+| celeborn.&lt;module&gt;.io.mode | &lt;undefined&gt; | false | Netty EventLoopGroup backend, available options: NIO, EPOLL. If epoll mode is available, the default IO mode is EPOLL; otherwise, the default is NIO. | | |
| celeborn.&lt;module&gt;.io.numConnectionsPerPeer | 1 | false | Number of concurrent connections between two nodes. If setting <module> to `rpc_app`, works for shuffle client. If setting <module> to `rpc_service`, works for master or worker. If setting <module> to `data`, it works for shuffle client push and fetch data. If setting <module> to `replicate`, it works for replicate client of worker replicating data to peer worker. | | |
| celeborn.&lt;module&gt;.io.preferDirectBufs | true | false | If true, we will prefer allocating off-heap byte buffers within Netty. If setting <module> to `rpc_app`, works for shuffle client. If setting <module> to `rpc_service`, works for master or worker. If setting <module> to `data`, it works for shuffle client push and fetch data. If setting <module> to `push`, it works for worker receiving push data. If setting <module> to `replicate`, it works for replicate server or client of worker replicating data to peer worker. If setting <module> to `fetch`, it works for worker fetch server. | | |
| celeborn.&lt;module&gt;.io.receiveBuffer | 0b | false | Receive buffer size (SO_RCVBUF). Note: the optimal size for receive buffer and send buffer should be latency * network_bandwidth. Assuming latency = 1ms, network_bandwidth = 10Gbps buffer size should be ~ 1.25MB. If setting <module> to `rpc_app`, works for shuffle client. If setting <module> to `rpc_service`, works for master or worker. If setting <module> to `data`, it works for shuffle client push and fetch data. If setting <module> to `push`, it works for worker receiving push data. If setting <module> to `replicate`, it works for replicate server or client of worker replicating data to peer worker. If setting <module> to `fetch`, it works for worker fetch server. | 0.2.0 | |
2 changes: 1 addition & 1 deletion docs/migration.md
@@ -31,7 +31,7 @@ license: |

- Since 0.6.0, Celeborn changed the default value of `celeborn.client.spark.fetch.throwsFetchFailure` from `false` to `true`, which means Celeborn will enable spark stage rerun at default.

-- Since 0.6.0, Celeborn changed the default value of `celeborn.<module>.io.mode` from `NIO` to `EPOLL` if epoll mode is available, falling back to `NIO` otherwise.
+- Since 0.6.0, Celeborn changed `celeborn.<module>.io.mode` optional, of which the default value changed from `NIO` to `EPOLL` if epoll mode is available, falling back to `NIO` otherwise.

- Since 0.6.0, Celeborn has introduced a new RESTful API namespace: /api/v1, which uses the application/json media type for requests and responses.
The `celeborn-openapi-client` SDK is also available to help users interact with the new RESTful APIs.