Preempt the MPI job

Arena supports priority and preemption for MPIJob. The following steps show how to use this feature.

1. Create two PriorityClasses with the YAML below.

apiVersion: scheduling.k8s.io/v1beta1
kind: PriorityClass
metadata:
  name: critical
value: 1100000
description: Used for the critical app

---

apiVersion: scheduling.k8s.io/v1beta1
kind: PriorityClass
metadata:
  name: medium
value: 1000000
description: Used for the medium app

Note

  • The Kubernetes version must be 1.11 or later.

Save the template above in a file named pc.yaml, and use kubectl to create the PriorityClasses.

kubectl create -f pc.yaml
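
To confirm that both PriorityClasses were created before submitting any jobs, you can list them by name (a quick sanity check, not part of the original steps):

$ kubectl get priorityclass critical medium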

2. Check the GPUs in the Kubernetes cluster. In this example, only 1 GPU is available.

$ arena top node
NAME          IPADDRESS     ROLE    GPU(Total)  GPU(Allocated)
192.168.0.20  192.168.0.20  master  0           0
192.168.0.21  192.168.0.21  master  0           0
192.168.0.22  192.168.0.22  master  0           0
192.168.0.23  192.168.0.23  <none>  1           0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:  0/1 (0%)
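
If you want to see where that GPU actually lives, you can inspect the node's resources with kubectl. This assumes the NVIDIA device plugin exposes the GPU as the nvidia.com/gpu resource; substitute your node's name (it may carry a region prefix such as cn-hangzhou.192.168.0.23, as in later output):

$ kubectl describe node 192.168.0.23 | grep -i "nvidia.com/gpu"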

3. Submit an MPI training job with medium priority. The following command is an example.

$ arena submit mpi \
  --name=medium   \
  --priority=medium  \
  --gpus=1 \
  --workers=1 \
  --image=registry.aliyuncs.com/tensorflow-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
  "mpirun tail -f /dev/null"

configmap/medium-mpijob created
configmap/medium-mpijob labeled
mpijob.kubeflow.org/medium created
INFO[0000] The Job medium has been submitted successfully
INFO[0000] You can run `arena get medium --type mpijob` to check the job status

4. Get the details of the submitted job.

$ arena get medium
STATUS: RUNNING
NAMESPACE: default
PRIORITY: medium
TRAINING DURATION: 58s

NAME    STATUS   TRAINER  AGE  INSTANCE               NODE
medium  RUNNING  MPIJOB   58s  medium-launcher-sz5xj  192.168.0.23
medium  RUNNING  MPIJOB   58s  medium-worker-0        192.168.0.23
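
The PRIORITY field above comes from the PriorityClass that Arena attaches to the job's pods. As an optional cross-check, you can read the priorityClassName directly from the worker pod listed above; it should print medium:

$ kubectl get pod medium-worker-0 -o jsonpath='{.spec.priorityClassName}'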

5. The only GPU in the cluster is now used by the MPI training job medium.

$ arena top node -d

NAME:       cn-hangzhou.192.168.0.23
IPADDRESS:  192.168.0.23
ROLE:       <none>

NAMESPACE  NAME             GPU REQUESTS  GPU LIMITS
default    medium-worker-0  1             1

Total GPUs In Node cn-hangzhou.192.168.0.23:      1
Allocated GPUs In Node cn-hangzhou.192.168.0.23:  1 (100%)
-------------------------------------------------------------------------

Allocated/Total GPUs In Cluster:  1/1 (100%)

6. Submit an MPI training job with critical priority.

$ arena submit mpi  \
  --name=critical \
  --priority=critical \
  --gpus=1  \
  --workers=1 \
  --image=registry.aliyuncs.com/tensorflow-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
  "mpirun tail -f /dev/null"

7. Check the events of the MPI training job medium, and find that its worker was preempted by critical-worker-0.

$ kubectl get events --field-selector involvedObject.name=medium-worker-0
LAST SEEN   TYPE     REASON      OBJECT                MESSAGE
15m         Normal   Scheduled   pod/medium-worker-0   Successfully assigned default/medium-worker-0 to 192.168.0.23
14m         Normal   Pulled      pod/medium-worker-0   Container image "registry.aliyuncs.com/tensorflow-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5" already present on machine
14m         Normal   Created     pod/medium-worker-0   Created container mpi
14m         Normal   Started     pod/medium-worker-0   Started container mpi
2m32s       Normal   Preempted   pod/medium-worker-0   by default/critical-worker-0 on node 192.168.0.23
2m32s       Normal   Killing     pod/medium-worker-0   Stopping container mpi

8. Check the details of the MPI training job medium; it has failed.

$ arena get medium
STATUS: FAILED
NAMESPACE: default
PRIORITY: medium
TRAINING DURATION: 12m

NAME    STATUS  TRAINER  AGE  INSTANCE               NODE
medium  FAILED  MPIJOB   20m  medium-launcher-sz5xj  192.168.0.23

9. Check the details of the MPI training job critical; it is running.

$ arena get critical
STATUS: RUNNING
NAMESPACE: default
PRIORITY: critical
TRAINING DURATION: 10m

NAME      STATUS   TRAINER  AGE  INSTANCE                 NODE
critical  RUNNING  MPIJOB   10m  critical-launcher-mfffs  192.168.0.23
critical  RUNNING  MPIJOB   10m  critical-worker-0        192.168.0.23

10. The only GPU is now used by the MPI training job critical.

$ arena top node -d
NAME:       cn-hangzhou.192.168.0.23
IPADDRESS:  192.168.0.23
ROLE:       <none>

NAMESPACE  NAME               GPU REQUESTS  GPU LIMITS
default    critical-worker-0  1             1

Total GPUs In Node cn-hangzhou.192.168.0.23:      1
Allocated GPUs In Node cn-hangzhou.192.168.0.23:  1 (100%)
----------------------------------------------------------------------
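
When you are done with the experiment, you can delete both jobs to release the GPU (a suggested cleanup step, not part of the original walkthrough):

$ arena delete critical
$ arena delete medium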

Congratulations! You've successfully run MPI jobs with priorities and preemption using Arena.