Preempt a PyTorch Training Job
Arena supports priority and preemption for PyTorch jobs. The following steps show how to use this feature.
1. Create a YAML file that defines the PriorityClass objects. Two priorities are defined here: critical and medium.
➜ cat priorityClass.yaml
apiVersion: scheduling.k8s.io/v1beta1
description: Used for the critical app
kind: PriorityClass
metadata:
  name: critical
value: 1100000
---
apiVersion: scheduling.k8s.io/v1beta1
description: Used for the medium app
kind: PriorityClass
metadata:
  name: medium
value: 1000000
Submit the PriorityClass objects with kubectl.
➜ kubectl create -f priorityClass.yaml
priorityclass.scheduling.k8s.io/critical created
priorityclass.scheduling.k8s.io/medium created
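Kubernetes treats a larger `value` as a higher priority, and user-defined PriorityClass values may not exceed 1,000,000,000. As a minimal sketch, the two values above can be sanity-checked in Python before they are submitted. The `PRIORITY_CLASSES` dictionary and both helper functions are illustrative assumptions; they are not part of Arena or kubectl.

```python
# Hypothetical helper data: mirrors the two values in priorityClass.yaml.
PRIORITY_CLASSES = {"critical": 1100000, "medium": 1000000}

# Kubernetes reserves values above one billion for system priority classes.
MAX_USER_PRIORITY = 1_000_000_000


def validate(classes):
    """Raise ValueError if any value exceeds the user-definable maximum."""
    for name, value in classes.items():
        if value > MAX_USER_PRIORITY:
            raise ValueError(f"{name}: {value} exceeds {MAX_USER_PRIORITY}")


def outranks(a, b, classes=PRIORITY_CLASSES):
    """Return True if class `a` has strictly higher priority than class `b`."""
    return classes[a] > classes[b]


validate(PRIORITY_CLASSES)
print(outranks("critical", "medium"))  # True
print(outranks("medium", "critical"))  # False
```

Only a strictly higher value allows preemption, so two jobs in the same class never preempt each other.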
2. Check the available resources. There are 3 GPU nodes in total, and each node has 4 GPU cards.
➜ arena top node
NAME                       IPADDRESS     ROLE    STATUS  GPU(Total)  GPU(Allocated)
cn-huhehaote.172.16.0.205  172.16.0.205  master  ready   0           0
cn-huhehaote.172.16.0.206  172.16.0.206  master  ready   0           0
cn-huhehaote.172.16.0.207  172.16.0.207  master  ready   0           0
cn-huhehaote.172.16.0.208  172.16.0.208  <none>  ready   4           0
cn-huhehaote.172.16.0.209  172.16.0.209  <none>  ready   4           0
cn-huhehaote.172.16.0.210  172.16.0.210  <none>  ready   4           0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/12 (0%)
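If the GPU totals need to be checked from a script rather than by eye, the node table can be parsed as whitespace-separated columns. The sketch below assumes the six-column layout shown above, which may differ between Arena versions; `SAMPLE` is copied from this walkthrough and `parse_top_node` is a hypothetical helper, not an Arena API.

```python
# Sample copied from the `arena top node` output shown above.
SAMPLE = """\
NAME IPADDRESS ROLE STATUS GPU(Total) GPU(Allocated)
cn-huhehaote.172.16.0.205 172.16.0.205 master ready 0 0
cn-huhehaote.172.16.0.206 172.16.0.206 master ready 0 0
cn-huhehaote.172.16.0.207 172.16.0.207 master ready 0 0
cn-huhehaote.172.16.0.208 172.16.0.208 <none> ready 4 0
cn-huhehaote.172.16.0.209 172.16.0.209 <none> ready 4 0
cn-huhehaote.172.16.0.210 172.16.0.210 <none> ready 4 0
"""


def parse_top_node(text):
    """Map node name -> (total GPUs, allocated GPUs), skipping non-data lines."""
    nodes = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have six columns, the last two being integers.
        if len(fields) == 6 and fields[4].isdigit() and fields[5].isdigit():
            nodes[fields[0]] = (int(fields[4]), int(fields[5]))
    return nodes


nodes = parse_top_node(SAMPLE)
total = sum(t for t, _ in nodes.values())
allocated = sum(a for _, a in nodes.values())
print(f"{allocated}/{total} GPUs allocated")  # 0/12 GPUs allocated
```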
3. Submit a PyTorch job with medium priority that uses 3 nodes with 4 GPUs each, which occupies all of the cluster's GPU resources. To make the effect easier to observe, increase the number of training epochs so that the job runs longer.
➜ arena --loglevel info submit pytorch \
--name=pytorch-priority-medium \
--gpus=4 \
--workers=3 \
--image=registry.cn-beijing.aliyuncs.com/ai-samples/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \
--sync-mode=git \
--sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
--priority=medium \
"python /root/code/mnist-pytorch/mnist.py --backend gloo --epochs 200"
configmap/pytorch-priority-medium-pytorchjob created
configmap/pytorch-priority-medium-pytorchjob labeled
pytorchjob.kubeflow.org/pytorch-priority-medium created
INFO[0000] The Job pytorch-priority-medium has been submitted successfully
INFO[0000] You can run `arena get pytorch-priority-medium --type pytorchjob` to check the job status
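The job above requests 3 replicas with 4 GPUs each, that is 12 GPUs, which equals the cluster total. A rough first-fit model makes it clear why nothing is left for a later job. This is a simplified sketch under the assumption that each replica must fit on a single node; `first_fit` is a hypothetical function and does not reproduce the real Kubernetes scheduler.

```python
def first_fit(free_per_node, replicas, gpus_each):
    """Place each replica on the first node with enough free GPUs.

    Returns the list of chosen nodes, or None if any replica cannot be placed.
    """
    free = dict(free_per_node)  # copy so the caller's view is not mutated
    placement = []
    for _ in range(replicas):
        node = next((n for n, f in free.items() if f >= gpus_each), None)
        if node is None:
            return None
        free[node] -= gpus_each
        placement.append(node)
    return placement


# Free GPUs per node before the job is submitted (from step 2).
free = {"172.16.0.208": 4, "172.16.0.209": 4, "172.16.0.210": 4}

print(3 * 4 == sum(free.values()))  # True: the job requests every GPU
print(first_fit(free, replicas=3, gpus_each=4))
# ['172.16.0.208', '172.16.0.209', '172.16.0.210']
```

Once all three nodes are full, a further call such as `first_fit({"172.16.0.208": 0}, 1, 1)` returns `None`, which is the situation a higher-priority job resolves through preemption.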
4. Get the details of this job. You can see that it is running.
➜ arena get pytorch-priority-medium
STATUS: RUNNING
NAMESPACE: default
PRIORITY: medium
TRAINING DURATION: 3m
NAME                     STATUS   TRAINER     AGE  INSTANCE                          NODE
pytorch-priority-medium  RUNNING  PYTORCHJOB  3m   pytorch-priority-medium-master-0  172.16.0.208
pytorch-priority-medium  RUNNING  PYTORCHJOB  3m   pytorch-priority-medium-worker-0  172.16.0.210
pytorch-priority-medium  RUNNING  PYTORCHJOB  3m   pytorch-priority-medium-worker-1  172.16.0.209
5. Check the GPU usage. All GPUs are occupied.
➜ arena top node
NAME                       IPADDRESS     ROLE    STATUS  GPU(Total)  GPU(Allocated)
cn-huhehaote.172.16.0.205  172.16.0.205  master  ready   0           0
cn-huhehaote.172.16.0.206  172.16.0.206  master  ready   0           0
cn-huhehaote.172.16.0.207  172.16.0.207  master  ready   0           0
cn-huhehaote.172.16.0.208  172.16.0.208  <none>  ready   4           4
cn-huhehaote.172.16.0.209  172.16.0.209  <none>  ready   4           4
cn-huhehaote.172.16.0.210  172.16.0.210  <none>  ready   4           4
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
12/12 (100%)
6. Submit a job with critical priority to trigger preemption.
➜ arena --loglevel info submit pytorch \
--name=pytorch-priority-critical \
--gpus=1 \
--image=registry.cn-beijing.aliyuncs.com/ai-samples/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \
--sync-mode=git \
--sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
--priority=critical \
"python /root/code/mnist-pytorch/mnist.py --backend gloo --epochs 50"
configmap/pytorch-priority-critical-pytorchjob created
configmap/pytorch-priority-critical-pytorchjob labeled
pytorchjob.kubeflow.org/pytorch-priority-critical created
INFO[0000] The Job pytorch-priority-critical has been submitted successfully
INFO[0000] You can run `arena get pytorch-priority-critical --type pytorchjob` to check the job status
7. Get the details of this job.
➜ arena get pytorch-priority-critical
STATUS: RUNNING
NAMESPACE: default
PRIORITY: critical
TRAINING DURATION: 22s
NAME                       STATUS   TRAINER     AGE  INSTANCE                            NODE
pytorch-priority-critical  RUNNING  PYTORCHJOB  22s  pytorch-priority-critical-master-0  172.16.0.208
8. Check the status of the medium-priority job. It has become FAILED, and one instance has been deleted due to preemption.
➜ arena get pytorch-priority-medium
STATUS: FAILED
NAMESPACE: default
PRIORITY: medium
TRAINING DURATION: 1m
NAME                     STATUS  TRAINER     AGE  INSTANCE                          NODE
pytorch-priority-medium  FAILED  PYTORCHJOB  2m   pytorch-priority-medium-master-0  172.16.0.210
pytorch-priority-medium  FAILED  PYTORCHJOB  2m   pytorch-priority-medium-worker-0  172.16.0.209
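The preemption observed in steps 6 to 8 can be modeled roughly as follows: when no node has a free GPU, the incoming pod may displace pods whose priority is strictly lower. This is a deliberately simplified sketch, not the kube-scheduler algorithm. The `Pod` class and `pick_victims` function are hypothetical, and the sketch takes the first viable node, whereas the real scheduler weighs further criteria, so the node and instance it evicts can differ.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    node: str
    gpus: int
    priority: int


def pick_victims(running, node_capacity, incoming_gpus, incoming_priority):
    """Return (node, victims) for the first node that can host the incoming
    pod after evicting lower-priority pods, or None if no node qualifies."""
    for node, capacity in node_capacity.items():
        on_node = [p for p in running if p.node == node]
        free = capacity - sum(p.gpus for p in on_node)
        if free >= incoming_gpus:
            return node, []  # fits without preemption
        # Only strictly lower-priority pods may be evicted, lowest first.
        candidates = sorted(
            (p for p in on_node if p.priority < incoming_priority),
            key=lambda p: p.priority,
        )
        victims = []
        for pod in candidates:
            victims.append(pod)
            free += pod.gpus
            if free >= incoming_gpus:
                return node, victims
    return None


capacity = {"172.16.0.208": 4, "172.16.0.209": 4, "172.16.0.210": 4}
running = [
    Pod("pytorch-priority-medium-master-0", "172.16.0.208", 4, 1000000),
    Pod("pytorch-priority-medium-worker-0", "172.16.0.210", 4, 1000000),
    Pod("pytorch-priority-medium-worker-1", "172.16.0.209", 4, 1000000),
]

# The critical job asks for 1 GPU with priority 1100000.
node, victims = pick_victims(running, capacity, 1, 1100000)
print(node, [p.name for p in victims])
# 172.16.0.208 ['pytorch-priority-medium-master-0']
```

Because a distributed training job needs all of its instances, losing a single instance is enough for the whole medium-priority job to be reported as FAILED.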
9. Check the events of pytorch-priority-medium. You can see that its pytorch-priority-medium-worker-1 instance has been evicted. The reason is that pytorch-priority-critical-master-0 also requested resources on that node, and the node had no spare GPUs, so the lower-priority job was preempted by the higher-priority job.
➜ kubectl get events --field-selector involvedObject.name=pytorch-priority-medium-worker-1