Add user-pluggable block error handling and a new sliding-window error handler #2858

yadudoc · 2023-08-10T19:22:34Z

Description

This PR addresses the limited ability of the current simple_error_handler to stop the Parsl runtime when there are repeated failures. The current system only fails if all jobs fail, which is indicative only of configuration errors or a problem with the batch scheduler. This PR changes the existing block_error_handler variable, previously a bool, so that it can also take a custom error handler. In addition, there's a new block_error_threshold that can be used to configure these callbacks.

from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.jobs.simple_error_handler import simple_error_handler, windowed_error_handler

config = Config(
    executors=[
        HighThroughputExecutor(
            ...
            block_error_handler=<simple_error_handler / windowed_error_handler>,
            block_error_threshold=<N>,
        )
    ]
)

Fixes # (issue)

Type of change

Choose which options apply, and delete the ones which do not apply.

New feature (non-breaking change that adds functionality)
Code maintenance/cleanup

* For backward compatibility the block_error_handler is disabled if set to False
* A new block_error_threshold is added for easier user configurability
* Fixed broken logic in handle_errors

* windowed_error_handler shuts down the executor if and only if the last N jobs all failed, where N is configured by block_error_threshold

* Added tests

* Minor mypy fix: `block_error_threshold` is only used by the error_handler, so it is removed from the Executor definitions
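The windowed rule in the list above can be sketched as follows. This is a minimal, self-contained illustration rather than the PR's code: `JobRecord`, `last_n_all_failed`, and the numeric `submit_order` field used to decide which jobs are "last" are all hypothetical stand-ins, and the real handler works on Parsl's `JobStatus` objects.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class JobRecord:
    """Hypothetical stand-in for a block's status; not a Parsl class."""
    submit_order: int  # assumed: larger means submitted more recently
    failed: bool


def last_n_all_failed(status: Dict[str, JobRecord], threshold: int = 3) -> bool:
    """Return True only if the `threshold` most recent jobs all failed."""
    if threshold <= 0 or len(status) < threshold:
        # Not enough history yet to fill the window.
        return False
    # Sort numerically; sorting string ids would put "10" before "2".
    recent = sorted(status.values(), key=lambda j: j.submit_order)[-threshold:]
    return all(job.failed for job in recent)


# Old failures followed by a recent success: keep running.
history = {str(i): JobRecord(i, failed=(i < 4)) for i in range(5)}
keep_running = not last_n_all_failed(history, threshold=3)

# The three most recent jobs failed: the executor would be shut down.
history.update({str(i): JobRecord(i, failed=True) for i in range(5, 8)})
should_shut_down = last_n_all_failed(history, threshold=3)
```

Unlike an "all jobs failed" check, a window like this can still trip late in a long-running workflow, after earlier blocks have succeeded.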

benclifford · 2023-08-11T06:23:59Z

Three of the four CI runs hung in the --config local tests at different points (and the fourth was subsequently auto-cancelled), with the last commit b963e39

Those tests are quite hangy (I think from race conditions to do with threads and zmq and multiprocessing) so it's not unusual for a --config local to hang. For three to hang in one run seems suspicious, though.

benclifford · 2023-08-11T06:49:58Z

parsl/executors/status_handling.py

@@ -163,12 +161,8 @@ def handle_errors(self, status: Dict[str, JobStatus]) -> None:
        """
        if not self.block_error_handler:
            return
-        init_blocks = 3
-        if hasattr(self.provider, 'init_blocks'):


Removing this changes the behaviour of the simple error handler, right? It will now always wait for the first three blocks, rather than for init_blocks blocks. If we care about this PR adding new stuff without changing old behaviour, then this test could move into simple_error_handler, because it has access to executor.provider.init_blocks there.
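A minimal sketch of what moving that check into the handler could look like. It assumes the removed lines went on to read `provider.init_blocks`, which the truncated hunk does not show. The helper names and the `SimpleNamespace` providers are hypothetical, not Parsl code.

```python
from types import SimpleNamespace


def startup_threshold(provider, default: int = 3) -> int:
    """Use provider.init_blocks when it exists, otherwise fall back to `default`."""
    return getattr(provider, "init_blocks", default)


def all_startup_jobs_failed(provider, total_jobs: int, failed_jobs: int) -> bool:
    """True when at least `threshold` jobs exist and every one of them failed."""
    threshold = startup_threshold(provider)
    return total_jobs >= threshold and failed_jobs == total_jobs


# Hypothetical providers: one exposes init_blocks, one does not.
with_init = SimpleNamespace(init_blocks=5)
without_init = SimpleNamespace()

# Three failures out of three is not yet conclusive when init_blocks is 5 ...
waits_for_five = all_startup_jobs_failed(with_init, total_jobs=3, failed_jobs=3)
# ... but it is when the default of 3 applies.
uses_default = all_startup_jobs_failed(without_init, total_jobs=3, failed_jobs=3)
```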

benclifford · 2023-08-11T06:55:19Z

parsl/tests/test_scaling/test_block_error_handler.py

+    assert htex.block_error_handler is handler_mock
+
+    bad_state_mock = Mock()
+    htex.set_bad_state_and_fail_all = bad_state_mock


does this mock do anything in this test?
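One way to give the second mock a purpose is to assert on it. The sketch below is self-contained and uses a hypothetical `FakeExecutor` in place of the executor from the PR's test, so it only illustrates the pattern.

```python
from unittest.mock import Mock


class FakeExecutor:
    """Hypothetical stand-in for the executor under test; not a Parsl class."""

    def __init__(self, block_error_handler):
        self.block_error_handler = block_error_handler

    def set_bad_state_and_fail_all(self, exception):
        raise AssertionError("real method should not run in this test")

    def handle_errors(self, status):
        # Mirrors the guard shown in the diff: a falsy handler disables handling.
        if not self.block_error_handler:
            return
        self.block_error_handler(self, status)


handler_mock = Mock()
executor = FakeExecutor(block_error_handler=handler_mock)

bad_state_mock = Mock()
executor.set_bad_state_and_fail_all = bad_state_mock

executor.handle_errors({"0": "FAILED"})

# The handler is the thing under test, so assert on it ...
handler_mock.assert_called_once_with(executor, {"0": "FAILED"})
# ... and use the second mock to check that nothing else shut the executor down.
bad_state_mock.assert_not_called()
```

Without an assertion like the last line, replacing the method only prevents side effects and could be dropped if the handler is itself a mock.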

benclifford · 2023-08-11T07:01:40Z

this is pretty much how I imagined this would be implemented, after a conversation with Yadu elsewhere

benclifford · 2023-08-11T07:53:56Z

I was also wondering, because this is a user-pluggable interface, whether that code needs more protective error handling around the call. However, it looks like the timer code already catches, reports, and continues to time, like this:

1691736331.298730 2023-08-11 06:45:31 MainProcess-9203 JobStatusPoller-Timer-Thread-140360055682320-140360045541056 parsl.utils:349 make_callback ERROR: Callback threw an exception - logging and proceeding anyway

and I think that's probably enough error handling around the error handler.
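The catch, report, and continue behaviour described above can be illustrated with a small wrapper. This is a sketch of the general pattern only; `guarded` and `faulty_handler` are hypothetical names, and it is not a copy of the code in parsl.utils.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def guarded(callback: Callable[[], None]) -> Callable[[], None]:
    """Wrap `callback` so an exception is logged instead of propagating."""
    def wrapper() -> None:
        try:
            callback()
        except Exception:
            # Log the traceback and return normally so the timer keeps running.
            logger.exception("Callback threw an exception - logging and proceeding anyway")
    return wrapper


calls = []


def faulty_handler() -> None:
    """Hypothetical user-supplied handler that always raises."""
    calls.append("invoked")
    raise RuntimeError("bug in user handler")


tick = guarded(faulty_handler)
tick()  # first timer tick: the exception is logged, not raised
tick()  # second tick still runs
```

With a guard like this at the timer level, a buggy user handler costs one log entry per poll rather than killing the polling thread.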

benclifford · 2023-08-11T15:05:27Z

parsl/jobs/simple_error_handler.py

    (total_jobs, failed_jobs) = _count_jobs(status)
    if total_jobs >= threshold and failed_jobs == total_jobs:
        executor.set_bad_state_and_fail_all(_get_error(status))


+def windowed_error_handler(executor: status_handling.BlockProviderExecutor, status: Dict[str, JobStatus], threshold: int = 3):


probably rename this file to error_handlers.py or something like that, now that it's more than simple_error_handler.

* Type cleanups
* Updating tests
* Fixing a sorting error
* Adding a `noop_error_handler`
* Minor fixes to simple_error_handler to match previous logic

yadudoc · 2023-08-11T22:10:50Z

@benclifford I believe I've addressed all your comments. Let me know if you see any further issues. I see that tests were also passing earlier.

This TODO was implemented in PR #2858

yadudoc added 3 commits August 10, 2023 14:13

Update BlockProviderExecutor to take a callback for block_error_handler

9b87df7

* For backward compatibility the block_error_handler is disabled if set to False
* A new block_error_threshold is added for easier user configurability
* Fixed broken logic in handle_errors

New windowed_error_handler to better handle long running workflows

329950b

* windowed_error_handler shuts down the executor if and only if the last N jobs all failed, where N is configured by block_error_threshold

* Update executor to work with changes to BlockProviderExecutor

52386a6

* Added tests

yadudoc requested a review from benclifford August 10, 2023 19:22

Removing block_error_threshold

d20ee05

* Minor mypy fix: `block_error_threshold` is only used by the error_handler, so it is removed from the Executor definitions

yadudoc force-pushed the fix_scale_out branch from b68010c to d20ee05 Compare August 10, 2023 20:00