BANE in a container sometimes hangs #187

Closed
tjgalvin opened this issue Nov 12, 2024 · 2 comments
@tjgalvin
Owner

I have noticed that occasionally a BANE process inside a singularity container will hang for an unreasonably long amount of time. What's strange to me is that whenever it does, there is this message:

59073:WARNING The points in dimension 0 must be strictly ascending or descending

My hunch is that it has something to do with the use of shared memory to minimise the memory footprint. I think it would be pretty straightforward to switch to using a fork of aegeantools that incorporates either:

  • @AlecThomson's implementation of the FFT approach, or
  • rewriting the multiprocessing pool approach to distribute the stripe indices and have the workers open the appropriate slice via memmap (see the sketch below). We can also write out the background map as soon as it has been derived, using the same memmap trick. There might be a hit from the increased I/O, but it eliminates the shared memory.
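A minimal sketch of the memmap idea in the second bullet, assuming a raw float32 image already on disk; the paths, shape, stripe count, and the median stand-in for BANE's sliding-box estimate are all illustrative, not the real algorithm:

```python
# Workers receive only a stripe index, open the image via np.memmap themselves,
# and write their background estimate straight to an on-disk output map.
# No shared memory is involved.
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

import numpy as np

IMAGE_PATH = Path("image.dat")     # assumed raw float32 image on disk
BKG_PATH = Path("background.dat")  # output background map, same shape
SHAPE = (4096, 4096)               # assumed image dimensions
N_STRIPES = 8


def process_stripe(stripe_idx: int) -> int:
    """Estimate the background for a single horizontal stripe."""
    rows_per_stripe = SHAPE[0] // N_STRIPES
    start = stripe_idx * rows_per_stripe
    stop = SHAPE[0] if stripe_idx == N_STRIPES - 1 else start + rows_per_stripe

    # Each worker maps the input read-only and the output read-write;
    # only the pages it actually touches are paged into memory.
    image = np.memmap(IMAGE_PATH, dtype=np.float32, mode="r", shape=SHAPE)
    background = np.memmap(BKG_PATH, dtype=np.float32, mode="r+", shape=SHAPE)

    stripe = image[start:stop]
    # Stand-in for the real sliding-box background estimate BANE performs.
    background[start:stop] = np.median(stripe)
    background.flush()  # write the stripe out as soon as it is derived
    return stripe_idx


if __name__ == "__main__":
    # Pre-allocate the output file so workers can open it with mode="r+".
    np.memmap(BKG_PATH, dtype=np.float32, mode="w+", shape=SHAPE).flush()
    with ProcessPoolExecutor(max_workers=4) as pool:
        list(pool.map(process_stripe, range(N_STRIPES)))
```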

From memory, the FFT approach has a lingering bug where the output map is blanked and shifted by the kernel shape.

I believe I mocked up some other modes for BANE in a separate branch somewhere, but I can't remember the specifics of those.

@tjgalvin
Owner Author

I attempted to fix this with #200. In testing it worked as expected. However, it seems that when dask workers execute code they do so in a thread other than the main thread, so the signal-based handling does not work as expected:

/scratch3/gal16b/packages/flint_main/flint/flint/prefect/common/imaging.py", line 214, in task_run_bane_and_aegean
    with timelimit_on_context(timelimit_seconds=timelimit_seconds):
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/datasets/work/jones-storage/work/miniconda/miniforge3/envs/flint_main/lib/python3.12/contextlib.py", line 137, in __enter__
    return next(self.gen)
           ^^^^^^^^^^^^^^
  File "/scratch3/gal16b/packages/flint_main/flint/flint/utils.py", line 52, in timelimit_on_context
    signal.signal(signal.SIGALRM, _signal_timelimit_handler)
  File "/datasets/work/jones-storage/work/miniconda/miniforge3/envs/flint_main/lib/python3.12/signal.py", line 58, in signal
    handler = _signal.signal(_enum_to_int(signalnum), _enum_to_int(handler))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: signal only works in main thread of the main interpreter
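For reference, here is a simplified stand-in for the `timelimit_on_context` helper in `flint/utils.py`, alongside one possible thread-safe workaround; the `run_with_timelimit` helper is an assumption for illustration, not flint code:

```python
import signal
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager


@contextmanager
def timelimit_on_context(timelimit_seconds: int):
    """SIGALRM-based limit: only valid in the main thread of the main interpreter."""

    def _signal_timelimit_handler(signum, frame):
        raise TimeoutError(f"Task exceeded {timelimit_seconds} seconds")

    # This is the call that raises ValueError when entered from a dask worker thread.
    signal.signal(signal.SIGALRM, _signal_timelimit_handler)
    signal.alarm(timelimit_seconds)
    try:
        yield
    finally:
        signal.alarm(0)


def run_with_timelimit(func, timelimit_seconds: int, *args, **kwargs):
    """Thread-safe alternative: bound the wait rather than interrupt the work."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(func, *args, **kwargs)
        # Raises concurrent.futures.TimeoutError if the call runs too long;
        # the worker thread is not killed, only the caller is unblocked.
        return future.result(timeout=timelimit_seconds)
    finally:
        pool.shutdown(wait=False)
```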

I suppose the right thing to do now is to either:

  • fix BANE
  • ensure prefect's timeout_seconds behaves as I thought it would initially.

I have no idea why simply rerunning BANE after such an error works. Is it contention with other BANE processes running at the same time? The problem only seems to happen under load. Fixing it there (with a fork or otherwise) might be possible.

Sigh. Sad.

@tjgalvin
Owner Author

tjgalvin commented Jan 7, 2025

I have largely fixed the hanging behavior by raising an appropriate error in the callback log handler, which retriggers the singularity call.

The timeout seems to be a known issue in prefect. The real fix is going to be in BANE itself.
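For context, a hypothetical sketch of that retry mechanism, assuming the container output can be streamed line by line; the helper names here are illustrative and not taken from the flint codebase:

```python
import subprocess


class AttemptRerunException(Exception):
    """Signal that the singularity call should be attempted again."""


def _bane_log_callback(line: str) -> None:
    # The warning quoted at the top of this issue accompanies the hang.
    if "must be strictly ascending or descending" in line:
        raise AttemptRerunException("BANE emitted the known hang warning")


def run_singularity_command(command: list[str], stream_callback) -> None:
    """Run the container command, feeding each output line to the callback."""
    proc = subprocess.Popen(
        command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    try:
        for line in proc.stdout:
            stream_callback(line)
        proc.wait()
    except Exception:
        proc.kill()  # do not leave a hung BANE behind before retrying
        raise


def run_bane_with_retries(command: list[str], max_attempts: int = 3) -> None:
    for attempt in range(1, max_attempts + 1):
        try:
            run_singularity_command(command, stream_callback=_bane_log_callback)
            return
        except AttemptRerunException:
            if attempt == max_attempts:
                raise
```

Usage would look something like `run_bane_with_retries(["singularity", "exec", "container.sif", "BANE", "image.fits"])`, with the container and image names standing in for the real inputs.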

tjgalvin closed this as completed Jan 7, 2025