[CELEBORN-1846] Fix the StreamHandler usage in fetching chunk when task attempt is odd #3079

onebox-li · 2025-01-21T12:51:06Z

What changes were proposed in this pull request?

The streams opened in the streamCreatorPool thread pool are all based on the primary locations. When the task attempt number is odd, the task starts fetching chunks from the replica location first, so the wrong streamHandler is used to fetch the data. To keep the logic simple, we now always start fetching from the primary location; when switching to the peer, we close the stream and use a null streamHandler to fetch from the peer.
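The selection rule described above can be illustrated with a minimal, self-contained sketch. All names here (`FetchSketch`, `Location`, `StreamHandler`, `Attempt`, `initial`, `afterFailure`) are hypothetical stand-ins, not the classes in `CelebornInputStream`. The sketch only models which location and handler are paired, not the actual closing of the stream, and retrying the same location when there is no peer is an assumption of the sketch.

```java
// Hypothetical, simplified stand-ins for illustration; not Celeborn's real classes.
final class FetchSketch {
  /** A partition location with an optional peer (primary <-> replica). */
  static final class Location {
    final String name;
    Location peer; // null when there is no replica
    Location(String name) { this.name = name; }
    boolean hasPeer() { return peer != null; }
  }

  /** Handle of a stream that was opened on one specific location. */
  static final class StreamHandler {
    final Location openedOn;
    StreamHandler(Location openedOn) { this.openedOn = openedOn; }
  }

  /** The location to read from, plus the handler to pass to the reader. */
  static final class Attempt {
    final Location location;
    final StreamHandler handler; // null means no pre-opened stream for this location
    Attempt(Location location, StreamHandler handler) {
      this.location = location;
      this.handler = handler;
    }
  }

  /** First fetch: always the primary with its pre-opened handler, whatever the attempt number. */
  static Attempt initial(Location primary, StreamHandler preOpened, int attemptNumber) {
    return new Attempt(primary, preOpened);
  }

  /** After a failed fetch: switch to the peer, if any, and drop the old handler. */
  static Attempt afterFailure(Attempt failed) {
    if (!failed.location.hasPeer()) {
      return failed; // assumption of this sketch: no peer, so retry the same location
    }
    // The old handler belongs to the failed location, so it must not be reused.
    return new Attempt(failed.location.peer, null);
  }
}
```

With this pairing, an odd attempt number can no longer combine the replica location with a handler that was opened on the primary. The sketch assumes that a null handler leads the reader to open a fresh stream on the new location.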

Why are the changes needed?

Avoid tasks being slowed down by NPEs, and avoid potential data problems.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Manual test.

RexXiong · 2025-01-21T13:42:31Z

Nice catch...

RexXiong · 2025-01-22T03:25:29Z

client/src/main/java/org/apache/celeborn/client/read/CelebornInputStream.java

-      if (fetchChunkRetryCnt == 0 && attemptNumber % 2 == 1 && location.hasPeer()) {
-        location = location.getPeer();
-        logger.debug("Read peer {} for attempt {}.", location, attemptNumber);
-      }
      Exception lastException = null;


IMO, it would be better to keep this: switching peers based on the attemptNumber may avoid a PartitionLocation that failed previously.

onebox-li · Jan 22, 2025

If there is a problem with the primary location, then it is likely that the previous task attempt already switched to the peer and still failed. In this case, it matters little which location the new task attempt starts fetching from.
If there is no problem with the primary location and the task is retried for other reasons, the choice matters even less.
So I think we could always start fetching chunks from the primary location. WDYT?

FMX · Jan 22, 2025

Maybe we should open streams for both primary and replica locations?

RexXiong · Jan 22, 2025

> If there is a problem with the primary location, then it is likely that the previous task attempt already switched to the peer and still failed. In this case, it matters little which location the new task attempt starts fetching from. If there is no problem with the primary location and the task is retried for other reasons, the choice matters even less. So I think we could always start fetching chunks from the primary location. WDYT?

Sounds reasonable. Also, changing the location to the peer here would make pbStreamHandler and location inconsistent in createReader, which may cause issues in some shuffle scenarios.

onebox-li · Jan 22, 2025

> Maybe we should open streams for both primary and replica locations?

This would be a bit wasteful, because most tasks do not need to switch to the replica location if the cluster is stable.

FMX · Jan 22, 2025

> Maybe we should open streams for both primary and replica locations?
>
> This would be a bit wasteful, because most tasks do not need to switch to the replica location if the cluster is stable.

Sounds reasonable.

zwangsheng · 2025-01-22T03:30:13Z

client/src/main/java/org/apache/celeborn/client/read/CelebornInputStream.java

+                    clientFactory.createClient(location.getHost(), location.getFetchPort());
+                TransportMessage bufferStreamEnd =
+                    new TransportMessage(
+                        MessageType.BUFFER_STREAM_END,


Why do we need to send BUFFER_STREAM_END to the replica location?

onebox-li · Jan 22, 2025

Thanks, fixed.

zwangsheng · Jan 22, 2025

I wonder what will happen if we send a buffer stream end message to a Celeborn worker that did not open the stream, because we may reach this point when the location is excluded, without having opened a stream first.
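The thread does not say how a worker reacts to an end-of-stream message for a stream it never opened. One defensive option on the client side, sketched here under that uncertainty, is to record which workers a stream was actually opened on and notify only those. The class and method names (`StreamEndGuard`, `markOpened`, `endIfOpened`) and the `host:port` key format are hypothetical and not part of the Celeborn client.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; names are illustrative, not Celeborn's client API.
final class StreamEndGuard {
  // Workers (keyed by "host:port") on which this client has an open stream.
  private final Set<String> openedOn = new HashSet<>();

  /** Record that a stream was successfully opened on the given worker. */
  void markOpened(String hostAndPort) {
    openedOn.add(hostAndPort);
  }

  /**
   * Returns true exactly when an end-of-stream message should be sent:
   * a stream was opened on this worker and has not been ended yet.
   */
  boolean endIfOpened(String hostAndPort) {
    // Set.remove returns true only if the element was present.
    return openedOn.remove(hostAndPort);
  }
}
```

A caller would send the message only when `endIfOpened` returns true, so an excluded location with no open stream is skipped and a second end message for the same worker is not sent.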

RexXiong

LGTM

fix stream handler

c766a25

RexXiong reviewed Jan 22, 2025

zwangsheng reviewed Jan 22, 2025

fix

62eca03

RexXiong approved these changes Jan 22, 2025
