BANE in a container sometimes hangs #187

Closed
tjgalvin opened this issue Nov 12, 2024 · 2 comments
@tjgalvin
Owner

I have noticed that occasionally a BANE process inside a singularity container will hang for an unreasonably long amount of time. What's strange to me is that whenever it does, there is this message:

59073:WARNING The points in dimension 0 must be strictly ascending or descending

My hunch is that it has something to do with the use of shared memory to minimise the memory footprint. I think it would be pretty straightforward to switch to using a fork of aegeantools that incorporates either:

  • @AlecThomson's implementation of the FFT approach, or
  • rewriting the multiprocessing pool approach to distribute the stripe indices and have the workers open the appropriate slice via memmap (see the sketch below). We can also write out the background map as soon as it has been derived, using the same memmap trick. There might be a hit from the increased I/O, but it eliminates the shared memory.
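A minimal sketch of the memmap idea in the second bullet, assuming a raw float32 image already on disk; the paths, shape, stripe count, and the median stand-in for BANE's sliding-box estimate are all illustrative, not the real algorithm:

```python
# Workers receive only a stripe index, open the image via np.memmap themselves,
# and write their background estimate straight to an on-disk output map.
# No shared memory is involved.
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

import numpy as np

IMAGE_PATH = Path("image.dat")     # assumed raw float32 image on disk
BKG_PATH = Path("background.dat")  # output background map, same shape
SHAPE = (4096, 4096)               # assumed image dimensions
N_STRIPES = 8


def process_stripe(stripe_idx: int) -> int:
    """Estimate the background for a single horizontal stripe."""
    rows_per_stripe = SHAPE[0] // N_STRIPES
    start = stripe_idx * rows_per_stripe
    stop = SHAPE[0] if stripe_idx == N_STRIPES - 1 else start + rows_per_stripe

    # Each worker maps the input read-only and the output read-write;
    # only the pages it actually touches are paged into memory.
    image = np.memmap(IMAGE_PATH, dtype=np.float32, mode="r", shape=SHAPE)
    background = np.memmap(BKG_PATH, dtype=np.float32, mode="r+", shape=SHAPE)

    stripe = image[start:stop]
    # Stand-in for the real sliding-box background estimate BANE performs.
    background[start:stop] = np.median(stripe)
    background.flush()  # write the stripe out as soon as it is derived
    return stripe_idx


if __name__ == "__main__":
    # Pre-allocate the output file so workers can open it with mode="r+".
    np.memmap(BKG_PATH, dtype=np.float32, mode="w+", shape=SHAPE).flush()
    with ProcessPoolExecutor(max_workers=4) as pool:
        list(pool.map(process_stripe, range(N_STRIPES)))
```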

From memory, the FFT approach has a lingering bug where the output map is blanked and shifted by the kernel shape.

I believe I mocked up some other modes for BANE in a separate branch somewhere, but I can't remember the specifics of those.

@tjgalvin
Owner Author

I attempted to fix this with #200. In testing it worked as expected. However, it seems that when dask workers execute code they do so in a thread other than the main thread, so the signal-based handling does not work as expected:

/scratch3/gal16b/packages/flint_main/flint/flint/prefect/common/imaging.py", line 214, in task_run_bane_and_aegean
    with timelimit_on_context(timelimit_seconds=timelimit_seconds):
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/datasets/work/jones-storage/work/miniconda/miniforge3/envs/flint_main/lib/python3.12/contextlib.py", line 137, in __enter__
    return next(self.gen)
           ^^^^^^^^^^^^^^
  File "/scratch3/gal16b/packages/flint_main/flint/flint/utils.py", line 52, in timelimit_on_context
    signal.signal(signal.SIGALRM, _signal_timelimit_handler)
  File "/datasets/work/jones-storage/work/miniconda/miniforge3/envs/flint_main/lib/python3.12/signal.py", line 58, in signal
    handler = _signal.signal(_enum_to_int(signalnum), _enum_to_int(handler))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: signal only works in main thread of the main interpreter
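For reference, here is a simplified stand-in for the `timelimit_on_context` helper in `flint/utils.py`, alongside one possible thread-safe workaround; the `run_with_timelimit` helper is an assumption for illustration, not flint code:

```python
import signal
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager


@contextmanager
def timelimit_on_context(timelimit_seconds: int):
    """SIGALRM-based limit: only valid in the main thread of the main interpreter."""

    def _signal_timelimit_handler(signum, frame):
        raise TimeoutError(f"Task exceeded {timelimit_seconds} seconds")

    # This is the call that raises ValueError when entered from a dask worker thread.
    signal.signal(signal.SIGALRM, _signal_timelimit_handler)
    signal.alarm(timelimit_seconds)
    try:
        yield
    finally:
        signal.alarm(0)


def run_with_timelimit(func, timelimit_seconds: int, *args, **kwargs):
    """Thread-safe alternative: bound the wait rather than interrupt the work."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(func, *args, **kwargs)
        # Raises concurrent.futures.TimeoutError if the call runs too long;
        # the worker thread is not killed, only the caller is unblocked.
        return future.result(timeout=timelimit_seconds)
    finally:
        pool.shutdown(wait=False)
```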

I suppose the right thing to do now is to either:

  • fix BANE
  • ensure prefect's timeout_seconds behaves as I thought it would initially.

I have no idea why simply rerunning BANE after such an error works. Is it contention with other BANE processes running at the same time? The problem only seems to happen under load. Fixing it there (with a fork or otherwise) might be possible.

Sigh. Sad.

@tjgalvin
Owner Author

tjgalvin commented Jan 7, 2025

I have largely fixed the hanging behavior by raising an appropriate error in the callback log handler, which retriggers the singularity call.

The timeout seems to be a known issue in prefect. The real fix is going to be in BANE itself.
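For context, a hypothetical sketch of that retry mechanism, assuming the container output can be streamed line by line; the helper names here are illustrative and not taken from the flint codebase:

```python
import subprocess


class AttemptRerunException(Exception):
    """Signal that the singularity call should be attempted again."""


def _bane_log_callback(line: str) -> None:
    # The warning quoted at the top of this issue accompanies the hang.
    if "must be strictly ascending or descending" in line:
        raise AttemptRerunException("BANE emitted the known hang warning")


def run_singularity_command(command: list[str], stream_callback) -> None:
    """Run the container command, feeding each output line to the callback."""
    proc = subprocess.Popen(
        command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    try:
        for line in proc.stdout:
            stream_callback(line)
        proc.wait()
    except Exception:
        proc.kill()  # do not leave a hung BANE behind before retrying
        raise


def run_bane_with_retries(command: list[str], max_attempts: int = 3) -> None:
    for attempt in range(1, max_attempts + 1):
        try:
            run_singularity_command(command, stream_callback=_bane_log_callback)
            return
        except AttemptRerunException:
            if attempt == max_attempts:
                raise
```

Usage would look something like `run_bane_with_retries(["singularity", "exec", "container.sif", "BANE", "image.fits"])`, with the container and image names standing in for the real inputs.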

tjgalvin closed this as completed Jan 7, 2025