Add user-pluggable block error handling and a new sliding-window error handler #2858

Merged
merged 9 commits into from
Aug 12, 2023

Conversation

yadudoc
Member

@yadudoc yadudoc commented Aug 10, 2023

Description

This PR addresses the limited ability of the current simple_error_handler to stop the Parsl runtime when there are repeated failures. The current system fails only if all jobs fail, which only indicates configuration errors or a problem with the batch scheduler. This PR changes the existing block_error_handler option, previously a bool, so that it can also take a custom error handler. In addition, there is a new block_error_threshold that can be used to configure these callbacks.

from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.jobs.simple_error_handler import simple_error_handler, windowed_error_handler

config = Config(
    executors=[
        HighThroughputExecutor(
            ...
            block_error_handler=<simple_error_handler / windowed_error_handler>,
            block_error_threshold,
        )
    ]
)
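As a rough illustration of what "bool or custom handler" could mean in practice, here is a self-contained sketch that does not import Parsl. Everything in it (`FakeExecutor`, `resolve_handler`, `all_failed_handler`, `noop_handler`, and the use of plain strings instead of `JobStatus`) is a hypothetical stand-in, not the code in this PR. It assumes a handler is any callable taking the executor and the status dict.

```python
from functools import partial
from typing import Callable, Dict, Union

# Stand-in for Dict[str, JobStatus]: plain strings replace JobStatus here.
Status = Dict[str, str]
Handler = Callable[..., None]


def noop_handler(executor, status: Status) -> None:
    """Do nothing; used in this sketch when block_error_handler=False."""


def all_failed_handler(executor, status: Status, threshold: int = 3) -> None:
    """Flag the executor once at least `threshold` blocks exist and all failed."""
    failed = sum(1 for state in status.values() if state == "FAILED")
    if len(status) >= threshold and failed == len(status):
        executor.bad_state = True


def resolve_handler(option: Union[bool, Handler]) -> Handler:
    """Map the legacy bool values onto callables; pass callables through."""
    if option is True:
        return all_failed_handler
    if option is False:
        return noop_handler
    return option


class FakeExecutor:
    """Hypothetical executor that only models the error-handler hook."""

    def __init__(self, block_error_handler: Union[bool, Handler] = True):
        self.block_error_handler = resolve_handler(block_error_handler)
        self.bad_state = False

    def handle_errors(self, status: Status) -> None:
        self.block_error_handler(self, status)


# A user-chosen threshold can be bound with functools.partial.
strict = FakeExecutor(block_error_handler=partial(all_failed_handler, threshold=2))
strict.handle_errors({"0": "FAILED", "1": "FAILED"})
```

Binding the threshold with `partial` keeps the handler signature at two arguments, which is one way a threshold could be configured without the executor knowing about it.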

Fixes # (issue)

Type of change

Choose which options apply, and delete the ones which do not apply.

  • New feature (non-breaking change that adds functionality)
  • Code maintenance/cleanup

* For backward compatibility the block_error_handler is disabled if set to False
* A new block_error_threshold is added for easier user configurability
* Fixed broken logic in handle_errors
* Windowed_error_handler shuts down the executor if and only if the last N jobs all failed, where N is configured by block_error_threshold
* Minor mypy fix

`block_error_threshold` is only used by the error_handler, so it is removed from the Executor definitions
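The sliding-window rule described in the commit notes above ("the last N jobs all failed") can be sketched as a small pure function. This is a minimal illustration under stated assumptions, not the merged implementation: `last_n_all_failed` is a hypothetical name, block ids are assumed to be integer-like strings, and plain strings stand in for `JobStatus`.

```python
from typing import Dict


def last_n_all_failed(status: Dict[str, str], threshold: int = 3) -> bool:
    """Return True only if the `threshold` most recent blocks all failed."""
    # Assumption: block ids are integer-like strings. Sorting them as
    # integers keeps "10" after "9"; a plain string sort would not.
    ordered = sorted(status, key=int)
    window = ordered[-threshold:]
    if len(window) < threshold:
        # Not enough blocks yet to fill the window.
        return False
    return all(status[block_id] == "FAILED" for block_id in window)


# Eight early successes followed by three consecutive failures.
history = {str(i): "COMPLETED" for i in range(8)}
history.update({"8": "FAILED", "9": "FAILED", "10": "FAILED"})
should_stop = last_n_all_failed(history, threshold=3)
```

Unlike an "all jobs failed" check, this fires even after earlier blocks succeeded, which is the case the PR description says the old handler missed.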
@benclifford
Collaborator

Three of the four CI runs hung in the --config local tests at different points (and the fourth was subsequently auto-cancelled), with the last commit b963e39

Those tests are quite hangy (I think from race conditions to do with threads and zmq and multiprocessing) so it's not unusual for a --config local to hang. For three to hang in one run seems suspicious, though.

@@ -163,12 +161,8 @@ def handle_errors(self, status: Dict[str, JobStatus]) -> None:
         """
         if not self.block_error_handler:
             return
         init_blocks = 3
         if hasattr(self.provider, 'init_blocks'):
Collaborator

removing this changes the behaviour of the simple error handler, right? so that it will always wait for the first three blocks, rather than the init_blocks number? If we care about this PR adding in new stuff, not changing old behaviour, then this test could move into simple_error_handler, because it has access to executor.provider.init_blocks there.

Member Author

Fixed.
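One way to read the suggestion in this thread is that the handler itself looks up the provider's init_blocks and falls back to 3 when the provider does not define it. The sketch below shows only that lookup; `startup_threshold` is a hypothetical helper and the objects are `SimpleNamespace` stand-ins, so it should not be taken as the fix that was actually committed.

```python
from types import SimpleNamespace


def startup_threshold(executor, default: int = 3) -> int:
    """Use provider.init_blocks when present, otherwise fall back to `default`."""
    provider = getattr(executor, "provider", None)
    return getattr(provider, "init_blocks", default)


# Stand-in executors: one whose provider sets init_blocks, one that does not.
with_init = SimpleNamespace(provider=SimpleNamespace(init_blocks=5))
without_init = SimpleNamespace(provider=SimpleNamespace())

threshold_a = startup_threshold(with_init)
threshold_b = startup_threshold(without_init)
```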

assert htex.block_error_handler is handler_mock

bad_state_mock = Mock()
htex.set_bad_state_and_fail_all = bad_state_mock
Collaborator

does this mock do anything in this test?

@benclifford
Collaborator

this is pretty much how I imagined this would be implemented, after a conversation with Yadu elsewhere

@benclifford
Collaborator

benclifford commented Aug 11, 2023

I was also wondering, because this is a user-pluggable interface, whether that code needs more protective error handling around the call. However, it looks like the timer code already catches, reports, and continues to time, like this:

1691736331.298730 2023-08-11 06:45:31 MainProcess-9203 JobStatusPoller-Timer-Thread-140360055682320-140360045541056 parsl.utils:349 make_callback ERROR: Callback threw an exception - logging and proceeding anyway

and I think that's probably enough error handling around the error handler.
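The catch, report, and continue behaviour described here can be sketched generically as a wrapper around a callback. This is not the parsl.utils code referenced in the log line; `guarded` is a hypothetical name, and only the log message text is borrowed from the output above.

```python
import logging

logger = logging.getLogger(__name__)


def guarded(callback):
    """Wrap `callback` so an exception is logged instead of propagating."""

    def wrapper(*args, **kwargs):
        try:
            callback(*args, **kwargs)
        except Exception:
            # logger.exception records the traceback at ERROR level.
            logger.exception(
                "Callback threw an exception - logging and proceeding anyway"
            )

    return wrapper


calls = []


def faulty_handler():
    """Stand-in for a user-supplied error handler that raises."""
    calls.append("called")
    raise RuntimeError("user handler bug")


safe_handler = guarded(faulty_handler)
safe_handler()  # logged, not raised
safe_handler()  # a later tick still runs the callback
```

With a wrapper like this around the periodic call, a buggy user handler would produce log noise on every tick but would not stop the polling loop.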

@benclifford benclifford changed the title Better error handling for Batch job failures Add user-pluggable block error handling and a new sliding-window error handler Aug 11, 2023
    (total_jobs, failed_jobs) = _count_jobs(status)
    if total_jobs >= threshold and failed_jobs == total_jobs:
        executor.set_bad_state_and_fail_all(_get_error(status))


def windowed_error_handler(executor: status_handling.BlockProviderExecutor, status: Dict[str, JobStatus], threshold: int = 3):
Collaborator

Probably rename this file to error_handlers.py or something like that, now that it's more than simple_error_handler.

* Type cleanups
* Updating tests
* Fixing a sorting error
* Adding a `noop_error_handler`
* Minor fixes to simple_error_handler to match previous logic
@yadudoc yadudoc marked this pull request as ready for review August 11, 2023 22:05
@yadudoc
Member Author

yadudoc commented Aug 11, 2023

@benclifford I believe I've addressed all your comments. Let me know if you see any further issues. I see that tests were also passing earlier.

@benclifford benclifford merged commit 322f01b into master Aug 12, 2023
4 checks passed
@benclifford benclifford deleted the fix_scale_out branch August 12, 2023 14:20
benclifford added a commit that referenced this pull request May 3, 2024
benclifford added a commit that referenced this pull request May 4, 2024
2 participants