handle time limit better #37

Open
smoors opened this issue May 22, 2023 · 1 comment
Comments

@smoors (Collaborator) commented May 22, 2023

  • use more appropriate time limits
  • show better error messages

copying the discussion from #28:

from @casparvl

For the 1_core, 2_core and 4_core CPU tests, the walltime is too short.

Point number 2 makes me think. First of all, the error is quite non-descriptive:

FAILURE INFO for GROMACS_EESSI_093
  * Expanded name: GROMACS_EESSI %benchmark_info=HECBioSim/hEGFRDimer %nb_impl=cpu %scale=4_cores %module_name=GROMACS/2021.6-foss-2022a
  * Description: GROMACS HECBioSim/hEGFRDimer benchmark (NB: cpu)
  * System partition: snellius:thin
  * Environment: default
  * Stage directory: /scratch-shared/casparl/reframe_output/staging/snellius/thin/default/GROMACS_EESSI_1dfdd606
  * Node list:
  * Job type: batch job (id=2773366)
  * Dependencies (conceptual): []
  * Dependencies (actual): []
  * Maintainers: []
  * Failing phase: sanity
  * Rerun with '-n /1dfdd606 -p default --system snellius:thin -r'
  * Reason: sanity error: pattern 'Finished mdrun' not found in 'md.log'

Maybe we should add a standard sanity check that verifies the job output does not contain something like

slurmstepd: error: *** JOB 2773368 ON tcn509 CANCELLED AT 2023-05-19T20:40:10 DUE TO TIME LIMIT ***

I'm not sure how this generalizes to other systems (I'm assuming all SLURM-based systems print this by default), but even so: it doesn't hurt to check. At least there is a better chance of getting a clear error message.
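
For illustration, such a check could be expressed as a ReFrame sanity function roughly like this (a minimal sketch, not the actual EESSI test: the class name, executable, options and the exact Slurm wording are placeholders, and I'm assuming the message lands in the job's stderr file):

```python
import reframe as rfm
import reframe.utility.sanity as sn


@rfm.simple_test
class gromacs_sanity_sketch(rfm.RunOnlyRegressionTest):
    """Illustrative only; not the actual EESSI GROMACS test."""

    valid_systems = ['*']
    valid_prog_environs = ['*']
    executable = 'gmx_mpi'                               # placeholder
    executable_opts = ['mdrun', '-s', 'benchmark.tpr']   # placeholder
    time_limit = '30m'

    @sanity_function
    def assert_run_completed(self):
        return sn.all([
            # Turn a Slurm time-limit kill into a readable sanity error
            sn.assert_not_found(r'DUE TO TIME LIMIT', self.stderr,
                                msg='job was cancelled due to the Slurm time limit'),
            # The existing check: GROMACS reports a finished mdrun
            sn.assert_found(r'Finished mdrun', 'md.log'),
        ])
```

With something like that in place, a time-limit kill fails the sanity stage with an explicit message instead of the generic "pattern 'Finished mdrun' not found".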

Secondly, how do we make sure we don't run out of walltime? Sure, we could just specify a very long time, but that can be problematic as well (not satisfying max walltimes on a queue, and in our case, jobs <1h actually get backfilled behind a floating reservation that we have in order to reserve some dedicated nodes for short jobs). Should we scale the max walltime based on the amount of resources?

from @smoors

about the walltime issue:

  • we cannot know on which cpu archs this test will run, so it's difficult to guess an appropriate time limit for all users, even if we scale it. even more difficult to guess is how fast this will run on future computers. the 4_cores CPU test actually succeeded for me on a skylake node.
  • alternatively, as we agreed that the tests should be short, we could filter out those scales that we expect to take longer than 30 minutes to run. or at least print a warning.
  • ideally we should also check the walltime of the other benchmarks instead of only the first one (HECBioSim/hEGFRDimer).

about checking for the exceeded time limit message:
i think that's a good idea. we could expand that to also check for an out-of-memory message, and maybe other slurm messages that i don't know of. even better would be to check that the job error file is empty, but Gromacs prints some stuff to stderr; it would be nice if there were a way to force Gromacs to not do that.
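
As a rough sketch (the pattern list below is made up; the exact wording depends on the Slurm version and site configuration), the screening could be factored out into a helper that a test's sanity function then combines with the existing 'Finished mdrun' check:

```python
import reframe.utility.sanity as sn

# Hypothetical list of Slurm failure messages to screen for; the exact
# wording differs between Slurm versions and site configurations.
SLURM_FAILURE_PATTERNS = [
    r'DUE TO TIME LIMIT',
    r'oom-kill event',
    r'Exceeded job memory limit',
]


def assert_no_slurm_failures(stderr_file):
    """Deferred assertion that none of the known Slurm failure
    messages appear in the job's stderr file."""
    return sn.all([
        sn.assert_not_found(patt, stderr_file, msg=f'Slurm reported: {patt}')
        for patt in SLURM_FAILURE_PATTERNS
    ])
```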

@smoors changed the title from "handle exceeded time limit better" to "handle time limit better" on May 22, 2023
@casparvl (Collaborator) commented May 22, 2023

we cannot know on which cpu archs this test will run, so it's difficult to guess an appropriate time limit for all users, even if we scale it. even more difficult to guess is how fast this will run on future computers. the 4_cores CPU test actually succeeded for me on a skylake node.

Agreed. But core-to-core speed is also not that different. I'd say if it runs in 10 minutes on an AVX2-based system, a walltime of 30 mins should be enough for everyone (famous last words, I know... "640K ought to be enough for anyone"). Especially if the error is clear enough (we can include a suggestion to just override with --setvar=time_limit, for example). If people run it on a local laptop (i.e. with the local spawner, not SLURM), are time limits even enforced by ReFrame? If not, at least the slowest 'hardware' won't run into this.

Admittedly, GPU-to-GPU times might vary a lot more. But a liberal limit is less of a problem here: I'm not so interested in whether it runs in 2 mins or 5; I'd personally be happy with a walltime of 30 mins on each of those tests (as is the case now).

We could implement some (very liberal) time limits based on a look-up table, or some basic logic, to at least allow higher walltime for the slowest tests (< 16 cores or so).
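
A possible starting point for such a look-up, as a sketch only (the scale names, the limits and the `scale` parameter are assumptions, not attributes of the current test):

```python
import reframe as rfm

# Hypothetical (and deliberately liberal) per-scale time limits;
# the scale names and values are made up for illustration.
TIME_LIMITS_BY_SCALE = {
    '1_core': '2h',
    '2_cores': '1h30m',
    '4_cores': '1h',
}


class scaled_time_limit_sketch(rfm.RunOnlyRegressionTest):
    # Assumed scale parameter; the real test may expose this differently
    scale = parameter(list(TIME_LIMITS_BY_SCALE.keys()))

    @run_after('init')
    def set_time_limit(self):
        # Fall back to 30 minutes for scales not in the table
        self.time_limit = TIME_LIMITS_BY_SCALE.get(self.scale, '30m')
```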

alternatively, as we agreed that the tests should be short, we could filter out those scales that we expect to take longer than 30 minutes to run. or at least print a warning.

We could. At this point, my first approach would be to use variant-specific time limits. I think it's easier to implement.

ideally we should also check the walltime of the other benchmarks instead of only the first one (HECBioSim/hEGFRDimer).

Ugh, true. Or decide we don't support those (at least for now) :P
