Job Lifecycle
Since jobq builds on top of the Kueue job queuing system for scheduling, the lifecycle of a job is very similar to the lifecycle of a workload in Kueue.
The remainder of this document uses the terms job and workload interchangeably.
A workload roughly goes through three phases after its submission: queuing and scheduling, execution, and completion.
Queuing and scheduling
After its submission, a workload is in the Submitted state, where it competes with other workloads for available resource quotas.
Once Kueue reserves quota for it in a cluster queue, it enters the Pending state, where it waits for its admission checks to pass.
Alternatively, if the local queue or cluster queue selected for the workload is stopped or does not exist, the workload enters the Inadmissible state until the condition is resolved.
Execution
After all admission checks for the workload have passed, it enters the Admitted state and is now eligible for execution by the cluster.
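The classification described so far can be summarized in a few lines of Python. This is a minimal sketch, not jobq's implementation: the `WorkloadFacts` fields and the `derive_state` function are hypothetical names chosen for illustration, and the boolean inputs stand in for whatever conditions Kueue reports on the workload.

```python
from dataclasses import dataclass
from enum import Enum


class JobState(Enum):
    SUBMITTED = "Submitted"
    INADMISSIBLE = "Inadmissible"
    PENDING = "Pending"
    ADMITTED = "Admitted"


@dataclass(frozen=True)
class WorkloadFacts:
    """Hypothetical summary of a workload that has not finished yet."""

    queues_available: bool  # local and cluster queue exist and are not stopped
    quota_reserved: bool  # quota has been reserved in a cluster queue
    admission_checks_passed: bool  # all admission checks have passed


def derive_state(facts: WorkloadFacts) -> JobState:
    """Map the facts above to one of the non-terminal lifecycle states."""
    if not facts.queues_available:
        return JobState.INADMISSIBLE
    if facts.quota_reserved and facts.admission_checks_passed:
        return JobState.ADMITTED
    if facts.quota_reserved:
        return JobState.PENDING
    return JobState.SUBMITTED


if __name__ == "__main__":
    waiting = WorkloadFacts(True, True, False)
    print(derive_state(waiting).value)
```

The order of the checks matters in this sketch: a missing or stopped queue takes precedence, because no quota can be reserved until that condition is resolved.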
Completion
When the workload terminates successfully, it enters the terminal Succeeded state.
If an unrecoverable error occurs during execution, the workload enters the terminal Failed state. This does not necessarily happen on the first abnormal termination of a pod: it depends on the type of workload and other factors, such as the retry limit in a batch/v1/Job.
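To make the retry limit concrete: in a `batch/v1` Job it is the `spec.backoffLimit` field, which Kubernetes defaults to 6 when omitted. The sketch below builds such a manifest as a plain Python dict. The `make_job_manifest` helper, the job name, image, command, and queue name are all hypothetical; the `kueue.x-k8s.io/queue-name` label is how Kueue associates a Job with a local queue.

```python
def make_job_manifest(name: str, image: str, queue: str, backoff_limit: int = 3) -> dict:
    """Build a batch/v1 Job manifest (hypothetical helper, for illustration only)."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": name,
            # Label read by Kueue to pick the local queue for this Job.
            "labels": {"kueue.x-k8s.io/queue-name": queue},
        },
        "spec": {
            # Number of pod failures tolerated before the Job is marked failed.
            "backoffLimit": backoff_limit,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {"name": "main", "image": image, "command": ["python", "train.py"]}
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    manifest = make_job_manifest("example-job", "python:3.12-slim", "user-queue")
    print(manifest["spec"]["backoffLimit"])
```

With `backoff_limit=3`, individual pod failures are retried, and the workload would only reach Failed once the Job itself is marked as failed.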
A currently executing workload may be preempted by another workload (e.g., by a newly submitted workload with a higher priority). In this case, Kueue will terminate any pods associated with the preempted workload and either requeue it for later execution or evict it from the cluster queue.
State Diagram
stateDiagram-v2
direction LR
[*] --> Submitted
Submitted --> Pending: quotaReserved
Submitted --> Inadmissible
Inadmissible --> Submitted
Pending --> Admitted: admitted
Admitted --> Succeeded: success
Admitted --> Failed: error
Admitted --> Submitted: evicted
Admitted --> Pending: requeued
Succeeded --> [*]
Failed --> [*]
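The diagram can also be read as a transition table. The following Python sketch encodes it that way, for instance to validate a sequence of observed events. It is an illustration only and not part of jobq: the event names `quotaReserved`, `admitted`, `success`, `error`, `evicted`, and `requeued` come from the diagram, while `queueUnavailable` and `queueAvailable` are hypothetical labels for the two unlabeled edges, and `advance` and `replay` are hypothetical helpers.

```python
from enum import Enum


class State(Enum):
    SUBMITTED = "Submitted"
    INADMISSIBLE = "Inadmissible"
    PENDING = "Pending"
    ADMITTED = "Admitted"
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"


# (current state, event) -> next state, mirroring the edges of the diagram.
TRANSITIONS = {
    (State.SUBMITTED, "quotaReserved"): State.PENDING,
    (State.SUBMITTED, "queueUnavailable"): State.INADMISSIBLE,  # hypothetical label
    (State.INADMISSIBLE, "queueAvailable"): State.SUBMITTED,  # hypothetical label
    (State.PENDING, "admitted"): State.ADMITTED,
    (State.ADMITTED, "success"): State.SUCCEEDED,
    (State.ADMITTED, "error"): State.FAILED,
    (State.ADMITTED, "evicted"): State.SUBMITTED,
    (State.ADMITTED, "requeued"): State.PENDING,
}

# Terminal states have no outgoing edges.
TERMINAL = frozenset({State.SUCCEEDED, State.FAILED})


def advance(state: State, event: str) -> State:
    """Return the next state, or raise ValueError if the diagram has no such edge."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state.value} on {event!r}") from None


def replay(events, start: State = State.SUBMITTED) -> State:
    """Apply a sequence of events to a freshly submitted workload."""
    state = start
    for event in events:
        state = advance(state, event)
    return state


if __name__ == "__main__":
    # A workload that is preempted once, requeued, and then succeeds.
    history = ["quotaReserved", "admitted", "requeued", "admitted", "success"]
    print(replay(history).value)
```

Keeping the edges in a single dictionary makes it easy to compare the table against the diagram line by line, and any event that the diagram does not allow surfaces as a `ValueError` rather than a silent state change.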