-
Notifications
You must be signed in to change notification settings - Fork 195
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add user-pluggable block error handling and a new sliding-window error handler #2858
Conversation
* For backward compatibility the block_error_handler is disabled if set to False * A new block_error_threshold is added for easier user configurability * Fixed broken logic in handle_errors
* Windowed_error_handler shutsdown the executor IFF the last N jobs all failed where N is configured by block_error_threshold
* Minor mypy fix `block_error_threshold` is only used by the error_handler, so it is removed from the Executor definitions
Three of the four CI runs hung in the Those tests are quite hangy (I think from race conditions to do with threads and zmq and multiprocessing) so it's not unusual for a |
@@ -163,12 +161,8 @@ def handle_errors(self, status: Dict[str, JobStatus]) -> None: | |||
""" | |||
if not self.block_error_handler: | |||
return | |||
init_blocks = 3 | |||
if hasattr(self.provider, 'init_blocks'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removing this changes the behaviour of the simple error handler, right? so that it will always wait for the first three blocks, rather than the init_blocks number? If we care about this PR adding in new stuff, not changing old behaviour, then this test could move into simple_error_handler, because it has access to executor.provider.init_blocks there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
assert htex.block_error_handler is handler_mock | ||
|
||
bad_state_mock = Mock() | ||
htex.set_bad_state_and_fail_all = bad_state_mock |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this mock do anything in this test?
this is pretty much how I imagined this would be implemented, after a conversation with Yadu elsewhere |
I was also wondering about, because this is a user pluggable interface, if that code needs more protective error handling around the call - however, it looks like the timer code already catches, reports, and continues to time, like this:
and i think that's probably enough error handing around the error handler. |
(total_jobs, failed_jobs) = _count_jobs(status) | ||
if total_jobs >= threshold and failed_jobs == total_jobs: | ||
executor.set_bad_state_and_fail_all(_get_error(status)) | ||
|
||
|
||
def windowed_error_handler(executor: status_handling.BlockProviderExecutor, status: Dict[str, JobStatus], threshold: int = 3): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably rename this file to error_handlers.py or something like that, now its more than simple_error_handler.
* Type cleanups * Updating tests * Fixing a sorting error * Adding a `noop_error_handler` * Minor fixes to simple_error_handler to match previous logic
@benclifford I believe I've addressed all your comments. Let me know if you see any further issues. I see that tests were also passing earlier. |
This TODO was implemented in PR #2858
This TODO was implemented in PR #2858
Description
This PR aims to address the limited capabilities of the current
simple_error_handler
at stopping the Parsl runtime when there are repeated failures. The current system only fails if all jobs fail, which is only indicative of configuration errors or problem with the batch scheduler. This PR adds new behavior that updates the existingblock_error_handler
bool variable to take a custom error handler. In addition there's a newblock_error_threshold
that can be used to configure these callbacks.Fixes # (issue)
Type of change
Choose which options apply, and delete the ones which do not apply.