# Preempt the MPI job

Arena supports priority and preemption for `MPIJob`. The following steps show how to use this feature.
1. Create two `PriorityClass` objects with the YAML below.
```yaml
apiVersion: scheduling.k8s.io/v1beta1
kind: PriorityClass
metadata:
  name: critical
value: 1100000
description: Used for the critical app
---
apiVersion: scheduling.k8s.io/v1beta1
kind: PriorityClass
metadata:
  name: medium
value: 1000000
description: Used for the medium app
```
> Note: `PriorityClass` requires a Kubernetes version greater than 1.11.
Save the template above in a file named `pc.yaml`, and use `kubectl` to create the `PriorityClass` objects:
```bash
kubectl create -f pc.yaml
```
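You can verify that both objects were created, for example by listing them with `kubectl`:

```bash
# Both PriorityClass objects should be listed with their values.
kubectl get priorityclass critical medium
```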
2. Check that there is only one GPU available in the Kubernetes cluster:
```
$ arena top node
NAME          IPADDRESS     ROLE    GPU(Total)  GPU(Allocated)
192.168.0.20  192.168.0.20  master  0           0
192.168.0.21  192.168.0.21  master  0           0
192.168.0.22  192.168.0.22  master  0           0
192.168.0.23  192.168.0.23  <none>  1           0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/1 (0%)
```
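If you want to cross-check the GPU capacity that `arena` reports, a query like the sketch below should work, assuming the node advertises its GPU through the standard `nvidia.com/gpu` resource of the NVIDIA device plugin:

```bash
# Show the GPU capacity and allocation the node reports to Kubernetes
# (assumes the NVIDIA device plugin's nvidia.com/gpu resource).
kubectl describe node 192.168.0.23 | grep -i "nvidia.com/gpu"
```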
3. Run an MPI training job with `medium` priority. The following command is an example:
```
$ arena submit mpi \
    --name=medium \
    --priority=medium \
    --gpus=1 \
    --workers=1 \
    --image=registry.aliyuncs.com/tensorflow-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
    "mpirun tail -f /dev/null"
configmap/medium-mpijob created
configmap/medium-mpijob labeled
mpijob.kubeflow.org/medium created
INFO[0000] The Job medium has been submitted successfully
INFO[0000] You can run `arena get medium --type mpijob` to check the job status
```
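The `--priority` flag should end up as the `priorityClassName` of the job's pods; one way to confirm this on the worker pod created in this example:

```bash
# Print the PriorityClass assigned to the worker pod; expected output: medium
kubectl get pod medium-worker-0 -o jsonpath='{.spec.priorityClassName}'
```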
4. Get the details of the job:
```
$ arena get medium
STATUS: RUNNING
NAMESPACE: default
PRIORITY: medium
TRAINING DURATION: 58s

NAME    STATUS   TRAINER  AGE  INSTANCE               NODE
medium  RUNNING  MPIJOB   58s  medium-launcher-sz5xj  192.168.0.23
medium  RUNNING  MPIJOB   58s  medium-worker-0        192.168.0.23
```
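If your version of `arena` provides the `arena logs` command, you can also tail the training output of the job:

```bash
# Stream the logs of the job's launcher; the sample job only runs
# "tail -f /dev/null", so little output is expected here.
arena logs medium
```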
5. The only GPU in the cluster is used by the MPI training job `medium`:
```
$ arena top node -d

NAME:       cn-hangzhou.192.168.0.23
IPADDRESS:  192.168.0.23
ROLE:       <none>

NAMESPACE  NAME             GPU REQUESTS  GPU LIMITS
default    medium-worker-0  1             1

Total GPUs In Node cn-hangzhou.192.168.0.23:      1
Allocated GPUs In Node cn-hangzhou.192.168.0.23:  1 (100%)
-------------------------------------------------------------------------
Allocated/Total GPUs In Cluster: 1/1 (100%)
```
6. Run another MPI training job, this time with `critical` priority:
```
$ arena submit mpi \
    --name=critical \
    --priority=critical \
    --gpus=1 \
    --workers=1 \
    --image=registry.aliyuncs.com/tensorflow-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
    "mpirun tail -f /dev/null"
```
7. Check the events of the MPI training job `medium`, and find that it has been preempted by `critical-worker-0`:
```
$ kubectl get events --field-selector involvedObject.name=medium-worker-0
LAST SEEN  TYPE    REASON     OBJECT               MESSAGE
15m        Normal  Scheduled  pod/medium-worker-0  Successfully assigned default/medium-worker-0 to 192.168.0.23
14m        Normal  Pulled     pod/medium-worker-0  Container image "registry.aliyuncs.com/tensorflow-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5" already present on machine
14m        Normal  Created    pod/medium-worker-0  Created container mpi
14m        Normal  Started    pod/medium-worker-0  Started container mpi
2m32s      Normal  Preempted  pod/medium-worker-0  by default/critical-worker-0 on node 192.168.0.23
2m32s      Normal  Killing    pod/medium-worker-0  Stopping container mpi
```
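For the other side of the preemption, you can inspect the events of the pod that triggered it; they should show it being scheduled onto the freed node:

```bash
# Events of the preempting pod.
kubectl get events --field-selector involvedObject.name=critical-worker-0
```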
8. Check the details of the MPI training job `medium`; its status has turned to `FAILED`:
```
$ arena get medium
STATUS: FAILED
NAMESPACE: default
PRIORITY: medium
TRAINING DURATION: 12m

NAME    STATUS  TRAINER  AGE  INSTANCE               NODE
medium  FAILED  MPIJOB   20m  medium-launcher-sz5xj  192.168.0.23
```
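If the failed job is no longer needed, it can be removed with `arena delete` (assuming your arena version provides it):

```bash
# Delete the failed job and its associated resources.
arena delete medium
```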
9. Check the details of the MPI training job `critical`; it is running:
```
$ arena get critical
STATUS: RUNNING
NAMESPACE: default
PRIORITY: critical
TRAINING DURATION: 10m

NAME      STATUS   TRAINER  AGE  INSTANCE                 NODE
critical  RUNNING  MPIJOB   10m  critical-launcher-mfffs  192.168.0.23
critical  RUNNING  MPIJOB   10m  critical-worker-0        192.168.0.23
```
10. Confirm that the only GPU is now used by the MPI training job `critical`:
```
$ arena top node -d

NAME:       cn-hangzhou.192.168.0.23
IPADDRESS:  192.168.0.23
ROLE:       <none>

NAMESPACE  NAME               GPU REQUESTS  GPU LIMITS
default    critical-worker-0  1             1

Total GPUs In Node cn-hangzhou.192.168.0.23:      1
Allocated GPUs In Node cn-hangzhou.192.168.0.23:  1 (100%)
----------------------------------------------------------------------
```
Congratulations! You have successfully run jobs with priorities and preemption using `arena`.
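When you are done experimenting, you may want to clean up the remaining job and the `PriorityClass` objects:

```bash
# Remove the running job and the PriorityClass objects created in step 1.
arena delete critical
kubectl delete -f pc.yaml
```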