It was noticed earlier this month that the jwang SLURM account had usage on the GPU cluster exceeding 100% of the awarded amount in its current proposal, but was not locked by the daily update status mechanism.
Looking through the bank log, locking appeared to happen on Jan 13, and then again the next day (with no indication of a manual unlock in between).
I haven't confirmed this yet, but my guess is that the condition for including a SLURM account in update status is too broad (being "unlocked" on any cluster), while the locking itself is not broad enough (if I remember correctly, the only clusters considered for locking are those with SUs in the active proposal).
In a configuration where a new cluster (Teach, for example) is unlocked but we don't explicitly add an allocation for it when awarding SUs, that cluster will never get locked and will always trip the conditional that qualifies the SLURM account for the locking check.
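I haven't dug into the code yet, but the mismatch I have in mind looks roughly like this (a minimal sketch; the cluster names, `Account` fields, and helper functions are my own, not the actual bank implementation):

```python
# Illustrative sketch only -- names and data layout are assumptions.
ALL_CLUSTERS = ["smp", "gpu", "teach"]

class Account:
    def __init__(self, locked, usage, awarded):
        self.locked = locked    # per-cluster lock state, e.g. {"gpu": True}
        self.usage = usage      # per-cluster SU usage from SLURM accounting
        self.awarded = awarded  # per-cluster SUs in the active proposal

# Inclusion check (too broad): unlocked on *any* cluster qualifies the
# account, even a cluster with no allocation such as teach.
def needs_status_update(acct):
    return any(not acct.locked.get(c, False) for c in ALL_CLUSTERS)

# Locking check (not broad enough): only clusters with SUs in the active
# proposal are ever locked, so teach stays unlocked forever and keeps
# re-qualifying the account on every run.
def clusters_to_lock(acct):
    return [c for c in acct.awarded if acct.usage.get(c, 0) >= acct.awarded[c]]

acct = Account(
    locked={"gpu": True},                # gpu already locked on a prior run
    usage={"gpu": 14_000, "teach": 0},
    awarded={"gpu": 10_000},             # no explicit teach allocation
)
print(needs_status_update(acct))  # True  -- teach is unlocked
print(clusters_to_lock(acct))     # ['gpu'] -- teach can never be locked
```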
As for getting above 100% usage in the first place, I think this may be related to how, in the previous instance of the bank, people could use SUs from other clusters to cover usage on a different cluster. The previous bank could have been reporting 100% usage on a cluster when SLURM's accounting, which we now rely on for usage, would indicate a much higher value.
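For illustration, the same made-up numbers viewed both ways (the values and the capping behaviour are assumptions about the old bank, not something I've verified):

```python
# Made-up numbers; awarded/used values are purely illustrative.
awarded = {"gpu": 10_000, "smp": 50_000}   # SUs in the proposal, per cluster
used    = {"gpu": 14_000, "smp": 20_000}   # raw usage from SLURM accounting

# Old bank (as I understand it): unused SUs on other clusters could cover
# the overage, so the account still displayed as 100% on gpu.
overall_ok = sum(used.values()) <= sum(awarded.values())   # True here
old_gpu_view = "100%" if overall_ok else f"{used['gpu'] / awarded['gpu']:.0%}"

# Current bank: per-cluster usage comes straight from SLURM, so the same
# numbers show up as 140% on gpu.
new_gpu_view = f"{used['gpu'] / awarded['gpu']:.0%}"

print(old_gpu_view, new_gpu_view)   # 100% 140%
```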
For now, we may just want to update the logging to provide more detail about why the account was considered for locking, what the SU values were when locking occurred, etc., especially if this starts to impact more accounts before we are able to switch to Keystone.
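Something along these lines, as a starting point (the helper name and message wording are just a suggestion, not existing bank code):

```python
import logging

logger = logging.getLogger("bank")

def log_lock_decision(account, cluster, used, awarded, unlocked_clusters):
    """Suggested extra detail at the point a lock decision is made."""
    logger.info(
        "Locking %s on %s: usage %s of %s awarded SUs (%.0f%%); "
        "account qualified for the check because it is unlocked on: %s",
        account, cluster, used, awarded,
        100 * used / awarded,
        ", ".join(unlocked_clusters),
    )

# Example with made-up values:
logging.basicConfig(level=logging.INFO)
log_lock_decision("jwang", "gpu", 14_000, 10_000, ["gpu", "teach"])
```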