Job Priority Plugin
Every job submitted to the Flux scheduler will have a job priority value assigned. A job's priority determines the order in which the scheduler selects resources for submitted jobs. The scheduler visits the job with the highest priority value first, then each remaining job in order of decreasing priority.
The most basic job priority algorithm is to assign the first job a very high priority value. Every job submitted after the first receives a priority value one less than that of the job before it. This creates a "first come, first served" (or FCFS) prioritized job queue.
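The FCFS scheme above can be sketched in a few lines. This is only an illustration: the class name `FcfsPriority` and the starting value `FIRST_PRIORITY` are hypothetical, since the text only calls for "a very high priority value".

```python
import itertools

# Hypothetical starting value; the text only says "a very high priority value".
FIRST_PRIORITY = 1_000_000


class FcfsPriority:
    """Hand out strictly decreasing priority values in submission order."""

    def __init__(self, first=FIRST_PRIORITY):
        # Each call to next() yields a value one less than the previous one.
        self._counter = itertools.count(first, -1)

    def next_priority(self):
        return next(self._counter)


fcfs = FcfsPriority()
queue = [(job, fcfs.next_priority()) for job in ("job-a", "job-b", "job-c")]

# Visiting jobs from highest to lowest priority reproduces submission order.
ordered = [job for job, _ in sorted(queue, key=lambda item: item[1], reverse=True)]
```

Sorting by descending priority returns the jobs in the order they were submitted, which is the FCFS behavior described.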
The Flux scheduler is designed to be used at every level in the job hierarchy. A top-most job can be instantiated by the flux user across the entire cluster. Below that, first-level jobs could be created to mimic node partitions, each serving its respective job queue. Jobs could be submitted to the top-level job which would then pass the job into one of its child jobs, depending on the "queue" the job requests. This would be the way to deliver S4 and create the policies and batch system behavior with which users are familiar.
Jobs running closer to the leaves of the job hierarchy tree will not need a scheduler with the full-blown scheduling abilities of a traditional batch scheduler. For these schedulers, simple FCFS scheduling may be all that is required. The discussion that follows describes the abilities needed to achieve the S4 requirements, i.e., a scheduler with more complex job prioritizing abilities.
Backfill scheduling refers to the scheduling behavior that allows a job scheduling order that breaks with the strict FCFS order. Backfill scheduling is orthogonal to job priority assignment. Backfill scheduling can apply to any prioritized job queue, no matter what algorithm is used to determine job priority. It simply allows the scheduler to schedule lower priority jobs on resources that would otherwise remain idle while the top priority job waits for all of the resources it requires to become available.
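To make the idea concrete, here is a minimal backfill sketch that works on any priority-ordered queue. It is not a description of the Flux scheduler's implementation. It assumes each job declares a node count and a wall clock limit, and it uses one common, deliberately conservative rule: a lower priority job may start only if it fits in the idle nodes and will finish before the blocked top priority job could start. The function name `backfill` and the tuple layouts are hypothetical.

```python
def backfill(free_nodes, running, pending, now=0):
    """Choose which pending jobs to start now.

    running: list of (end_time, nodes) for jobs already running.
    pending: list of (name, nodes, walltime), highest priority first.
    Returns the names of the jobs to start, in start order.
    """
    started = []
    pending = list(pending)

    # Start jobs in strict priority order for as long as they fit.
    while pending and pending[0][1] <= free_nodes:
        name, nodes, walltime = pending.pop(0)
        free_nodes -= nodes
        running = running + [(now + walltime, nodes)]
        started.append(name)
    if not pending:
        return started

    # The top job is blocked. Find the earliest time at which enough
    # running jobs will have ended for it to start (the "shadow time").
    top_nodes = pending[0][1]
    available = free_nodes
    shadow_time = None
    for end_time, nodes in sorted(running):
        available += nodes
        if available >= top_nodes:
            shadow_time = end_time
            break
    if shadow_time is None:
        # The top job can never fit; this sketch then does no backfilling.
        return started

    # Backfill: start lower priority jobs that fit in the idle nodes and
    # finish by the shadow time, so the top job is not delayed.
    for name, nodes, walltime in pending[1:]:
        if nodes <= free_nodes and now + walltime <= shadow_time:
            free_nodes -= nodes
            started.append(name)
    return started


to_start = backfill(
    free_nodes=4,
    running=[(100, 6)],
    pending=[("big", 8, 50), ("small", 2, 60), ("long", 2, 200)],
)
```

In this example `big` cannot start until time 100, so `small` (which ends at 60) is backfilled onto the idle nodes, while `long` is held back because it would still be running when `big` becomes eligible.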
The S4 replacement goals call for a multi-factor job priority algorithm in addition to the FCFS algorithm. Flux could eventually support other algorithms for determining job priority, but only the multi-factor algorithm need be available to address S4.
The multi-factor algorithm is a simple sum of six independent components, where each component is the product of a configurable weight and a normalized factor.
job priority = ( wait_time_weight * wait_time_factor ) +
( fair_share_weight * fair_share_factor ) +
( qos_weight * qos_factor ) +
( queue_weight * queue_factor ) +
( job_size_weight * job_size_factor ) +
( user_weight * user_factor )
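The formula translates directly into code. The sketch below assumes each factor has already been normalized to the range 0.0 to 1.0; the dictionary layout, the function name `job_priority`, and the example weights are illustrative and not part of the design.

```python
FACTOR_NAMES = ("wait_time", "fair_share", "qos", "queue", "job_size", "user")


def job_priority(weights, factors):
    """Return the sum of weight * factor over the six components."""
    total = 0.0
    for name in FACTOR_NAMES:
        factor = factors[name]
        if not 0.0 <= factor <= 1.0:
            raise ValueError(f"{name} factor {factor} is outside [0.0, 1.0]")
        total += weights[name] * factor
    return total


# Hypothetical site configuration: a weight of 0 disables the queue component.
weights = {"wait_time": 1000, "fair_share": 5000, "qos": 10000,
           "queue": 0, "job_size": 500, "user": 100}
factors = {"wait_time": 0.25, "fair_share": 0.5, "qos": 0.5,
           "queue": 1.0, "job_size": 0.5, "user": 1.0}

priority = job_priority(weights, factors)
```

With these made-up numbers the components contribute 250, 2500, 5000, 0, 250, and 100, for a total of 8100.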
The priority of every job is periodically updated using the most recent time-related (wait time and fair-share) factors. The factors themselves are float values normalized to range from 0.0 to 1.0. This allows the associated weights to be compared at face value. The weight value can be set to zero if the contribution from the associated factor is not desired.
Priority Factor Descriptions
The Wait Time factor is proportional to the time that has elapsed since the job was submitted. When used, it raises a job's priority the longer the job remains pending in the queue. Like all the factors, the wait time factor needs to be normalized. A maximum wait time must be configured to become the denominator in the calculation of the wait time quotient: (measured wait time) / (maximum wait time). Jobs that wait longer than the maximum wait time accrue no further priority increases.
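A minimal sketch of the wait time quotient, with the cap at the maximum wait time made explicit. The function and parameter names are hypothetical; times are assumed to be in seconds.

```python
def wait_time_factor(submit_time, now, max_wait):
    """Return (measured wait time) / (maximum wait time), clamped to [0.0, 1.0]."""
    if max_wait <= 0:
        raise ValueError("max_wait must be positive")
    waited = max(0.0, now - submit_time)
    # Jobs that have waited longer than max_wait stop accruing priority.
    return min(waited / max_wait, 1.0)
```

For example, with a two hour maximum (`max_wait=7200`), a job that has waited one hour gets a factor of 0.5, and any job that has waited two hours or more gets 1.0.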
The Fair-Share factor represents the difference between allotted shares of computing resources and historical usage. It aims to grant higher priority to users' jobs that are under-serviced. Allocated cpu-seconds is the traditional measure of computing resource usage and remains the minimum requirement. Newer versions extend the resources tracked to include memory, GPUs and power. The proposed design for the fair-share component shall support the eventual inclusion of these additional resources.
The Quality of Service (QoS) factor represents a priority adjustment to meet service goals. This factor aims to place jobs further up or further down the queue based on their level of importance. The three levels of importance are traditionally called: expedite, normal, and standby. The ability to request an expedite QoS is conferred by management on those users or project groups who are doing work considered to be of high importance. The standby QoS is usually available to everyone and results in lowered job priority in exchange for being exempt from size or run-time limits that would otherwise be imposed.
The Queue factor is assigned to each job queue which is mapped to a node partition. Its intended use is to allow jobs submitted to one queue to have a higher priority than jobs submitted to another queue, especially when those queues are mapped to the same node partition. If Flux's job hierarchy is configured to emulate disjoint job queues like batch and debug, then this component will be unnecessary. The design should include it for completeness, but its actual application will be uncertain.
The Job Size factor confers a priority boost based on job size (nominally the number of nodes). This has traditionally been employed to favor larger jobs on large parallel clusters. Conversely, it can and has been used to favor smaller jobs on clusters devoted to high throughput computing.
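One simple way to normalize job size is as a fraction of the cluster's nodes, inverted when small jobs should be favored. The text does not prescribe a formula, so this linear mapping and the `favor_large` switch are assumptions made for illustration.

```python
def job_size_factor(job_nodes, cluster_nodes, favor_large=True):
    """Map a job's node count onto [0.0, 1.0].

    favor_large=True  -> larger jobs get a larger factor.
    favor_large=False -> smaller jobs get a larger factor.
    """
    if cluster_nodes <= 0:
        raise ValueError("cluster_nodes must be positive")
    # Clamp so that an oversized or negative request stays within range.
    fraction = min(max(job_nodes, 0), cluster_nodes) / cluster_nodes
    return fraction if favor_large else 1.0 - fraction
```

On a 256 node cluster, a 64 node job gets 0.25 when large jobs are favored and 0.75 when small jobs are favored.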
The User factor gives the users the ability to order their jobs in the queue where they would otherwise be assigned the same priority. It is a seldom used feature, but handy nonetheless when a user wants to re-order a bunch of jobs days after they were submitted. The user can assign a user factor that can only lower a job's priority, not raise it.
The job priority plugin will be optional - when not loaded, the priority assigned when the job was created will remain fixed. When loaded, the job priority plugin will assign a priority value to every submitted job. The plugin will periodically update the priority of every job in the queue waiting to run. The job priority plugin requires some or all of the following inputs:
- time the job was submitted
- requested job size and wall clock run time
- requested resources
- user for whom job runs
- charge account associated with job
- requested quality of service (QoS)
- requested job "queue" (where applicable)
- requested "user" priority
- weights configured for each of the above six factors
- historical usage for each chargeable resource, tracked at user/charge_account/cluster granularity
- configured period of job priority update calculations
The job priority plugin will use this input to compute a job priority using the above formula and assign it to each job record at the time the job is submitted and periodically thereafter at the configured update period until the job is scheduled to run.
With the substantial investment required for procuring HPC resources, there is typically more demand for computing resources than can be supplied. The old system of creating quotas to limit users to just their promised shares of the computing cycles proved to be inefficient. If any of the users failed to submit enough jobs to consume their quota for the month, precious computing cycles would go idle.
The solution to this was fair-share scheduling. Instead of hard quotas, users are granted shares of the cluster's computing cycles. The scheduler using the fair-share factor prioritizes jobs so as to achieve usage that matches the allotted shares. If some users fail to submit enough jobs to utilize their allotments, other users will be allowed to consume the otherwise idle cycles.
The fair-share factor represents the degree to which a user's charge account is over or under serviced. A value of 0.5 represents a perfectly serviced account. 0.0 is over-serviced; 1.0 is under-serviced.
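The text does not give a formula for this factor. One function with exactly the stated properties is 2 raised to the power of -(usage / shares), where both quantities are expressed as fractions of the whole cluster. Treat it as an assumption chosen for illustration, not as the plugin's definition.

```python
def fair_share_factor(normalized_usage, normalized_shares):
    """Return a value in (0.0, 1.0]: 1.0 with no usage, 0.5 when usage equals shares.

    normalized_usage:  the account's fraction of all charged usage (0.0 to 1.0).
    normalized_shares: the account's fraction of all allotted shares (0.0 to 1.0).
    """
    if normalized_shares <= 0:
        # Assumption: an account with no shares is treated as over-serviced.
        return 0.0
    return 2.0 ** (-normalized_usage / normalized_shares)
```

An account that has used exactly its allotment gets 0.5, an account with no usage gets 1.0, and the factor approaches 0.0 as usage grows past the allotment.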
The fair-share computation relies on a well defined hierarchy of shared allotments. These shares can be considered claims on the cluster's resources. As stated above, "resources" has traditionally been defined as allocated cpu-seconds.
The hierarchy of shares allows a hierarchy of accounts. At the top level, the root account owns 100% of the cluster. The direct child accounts of root are each assigned shares. Child accounts of root's children further divide the shares of their parent. Once all of the accounts are created with associated shares of their parent's shares, users are added to the accounts they do work for.
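The way shares subdivide down the hierarchy can be sketched as follows. Each account's fraction of the cluster is its parent's fraction multiplied by its share relative to its siblings. The nested dictionary layout and the account names are hypothetical; users could be modeled as leaf entries in the same structure.

```python
def effective_shares(tree, fraction=1.0, prefix=""):
    """Map each account path to its fraction of the whole cluster.

    tree: {account_name: (shares, children)} where children has the same
    shape. The root account is implicit and owns fraction 1.0.
    """
    result = {}
    total = sum(shares for shares, _ in tree.values())
    if total <= 0:
        return result  # no shares assigned at this level
    for name, (shares, children) in tree.items():
        path = f"{prefix}/{name}"
        own = fraction * shares / total
        result[path] = own
        # Child accounts divide up their parent's fraction.
        result.update(effective_shares(children, own, path))
    return result


accounts = {
    "physics": (60, {"fusion": (1, {}), "lasers": (3, {})}),
    "biology": (40, {}),
}
shares = effective_shares(accounts)
```

Here `physics` holds 60% of the cluster, and its children `fusion` and `lasers` split that 60% in a 1:3 ratio, giving them roughly 15% and 45% of the cluster.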
The shares of each account in the hierarchy are needed to create the fair-share factor. The other component required is a tally of actual computing cycles charged to each user under each account. The fair-share factor is derived such that for accounts at the same level, the fair-share factor will be proportional to the degree of being over or under serviced.
As a refinement to the fair-share scheduling, the notion of a half-life decay was introduced to the usage component. This addressed the concern that a person who accrued a bunch of usage weeks ago should receive slightly higher priority than the person who accrued the same usage the day before.
Through experimentation, it was determined that the sweet spot was to decay everyone's usage by half over the course of a week. However, the half-life value must be configurable to accommodate a variety of site needs.
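Half-life decay is a standard exponential: usage is multiplied by 0.5 for every half-life that has elapsed. The sketch below uses the one week value mentioned above as its default and keeps it configurable; the function and parameter names are hypothetical.

```python
ONE_WEEK_SECONDS = 7 * 24 * 3600


def decayed_usage(usage, age_seconds, half_life_seconds=ONE_WEEK_SECONDS):
    """Return the usage that still counts after age_seconds have elapsed."""
    if half_life_seconds <= 0:
        raise ValueError("half_life_seconds must be positive")
    return usage * 0.5 ** (age_seconds / half_life_seconds)
```

With the default half-life, 1000 cpu-seconds charged one week ago count as 500 today and as 250 after two weeks, so older usage weighs less against a user than the same usage accrued yesterday.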