HFB MPI fix hang on write-out #238

Merged
rikvl merged 2 commits into master from rvl/hfb-mpi-fix
Jun 13, 2024

Conversation


@rikvl rikvl commented Feb 27, 2024

Workaround for #237

@rikvl rikvl marked this pull request as ready for review February 27, 2024 19:02
@rikvl rikvl requested review from ketiltrout and ljgray June 13, 2024 18:20

rikvl commented Jun 13, 2024

To summarize what we learned last February (see comments in #237): when the HFB blind-search pipeline runs on multiple nodes, it hangs in a file write-out stage if the number of MPI processes exceeds the number of NS beams, possibly because of an upstream bug. This PR is a workaround that avoids the failure condition.
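
For readers who want to see the failure condition concretely, here is a minimal sketch. It is not the change made in this PR. It assumes, hypothetically, that the data are split across MPI processes along the NS beam axis, so that having more processes than beams leaves some processes with nothing to write. The helper names `split_axis`, `idle_ranks`, and `check_rank_count` are made up for illustration.

```python
def split_axis(n_items, n_ranks):
    """Return (start, end) index bounds per rank, spreading items as evenly as possible.

    Hypothetical helper: illustrates one common way to split an axis over ranks.
    """
    base, extra = divmod(n_items, n_ranks)
    bounds = []
    start = 0
    for rank in range(n_ranks):
        # The first `extra` ranks take one additional item each.
        count = base + (1 if rank < extra else 0)
        bounds.append((start, start + count))
        start += count
    return bounds


def idle_ranks(n_beams, n_ranks):
    """List the ranks that would hold no beams under the split above."""
    return [
        rank
        for rank, (start, end) in enumerate(split_axis(n_beams, n_ranks))
        if start == end
    ]


def check_rank_count(n_beams, n_ranks):
    """Fail early instead of reaching write-out in the reported failure condition.

    Assumption: raising here is only one possible guard, not what the PR does.
    """
    if n_ranks > n_beams:
        raise ValueError(
            f"{n_ranks} MPI processes for {n_beams} NS beams: "
            f"ranks {idle_ranks(n_beams, n_ranks)} would hold no beams"
        )


if __name__ == "__main__":
    # Four beams over six processes: the last two processes get empty sections.
    print(split_axis(4, 6))
    print(idle_ranks(4, 6))
```

In a real MPI job the process count would come from the communicator (for example `MPI.COMM_WORLD.Get_size()` in mpi4py) rather than being passed in by hand.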


@ljgray ljgray left a comment


This workaround looks fine. I think that this issue is probably the same as the one discussed in radiocosmology/caput#165, which had a lot of investigation but was never resolved.

@rikvl rikvl force-pushed the rvl/hfb-mpi-fix branch from e016c17 to f777f86 Compare June 13, 2024 18:36
@rikvl rikvl merged commit e85ef19 into master Jun 13, 2024
2 checks passed
@rikvl rikvl deleted the rvl/hfb-mpi-fix branch June 13, 2024 18:42
2 participants