PyTorch Training Job with specified node tolerations

Arena supports submitting a PyTorch job that tolerates tainted k8s nodes, so the job can be scheduled onto nodes whose taints would otherwise keep it off.

1. Get k8s cluster information:

➜ kubectl get node
NAME                        STATUS   ROLES    AGE     VERSION
cn-huhehaote.172.16.0.205   Ready    master   5h13m   v1.16.9-aliyun.1
cn-huhehaote.172.16.0.206   Ready    master   5h12m   v1.16.9-aliyun.1
cn-huhehaote.172.16.0.207   Ready    master   5h11m   v1.16.9-aliyun.1
cn-huhehaote.172.16.0.208   Ready    <none>   5h7m    v1.16.9-aliyun.1
cn-huhehaote.172.16.0.209   Ready    <none>   5h7m    v1.16.9-aliyun.1
cn-huhehaote.172.16.0.210   Ready    <none>   5h7m    v1.16.9-aliyun.1

2. Add taints to some of the k8s nodes, for example:

# taint --> gpu_node
➜  kubectl taint nodes cn-huhehaote.172.16.0.208 gpu_node=invalid:NoSchedule
node/cn-huhehaote.172.16.0.208 tainted

➜  kubectl taint nodes cn-huhehaote.172.16.0.209 gpu_node=invalid:NoSchedule
node/cn-huhehaote.172.16.0.209 tainted

# taint --> ssd_node
➜  kubectl taint nodes cn-huhehaote.172.16.0.210 ssd_node=invalid:NoSchedule
node/cn-huhehaote.172.16.0.210 tainted
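
Each taint above has the form key=value:effect, and only the key is later passed to --toleration. As a small illustration, the sketch below splits such a string with plain shell parameter expansion. The helper name parse_taint is hypothetical and is not part of kubectl or Arena.

```shell
# Hypothetical helper: split a taint spec "key=value:effect" into its parts.
parse_taint() {
  spec="$1"
  key="${spec%%=*}"      # everything before the first "="
  rest="${spec#*=}"      # everything after the first "="
  value="${rest%%:*}"    # part before the ":"
  effect="${rest##*:}"   # part after the last ":"
  printf '%s %s %s\n' "$key" "$value" "$effect"
}

parse_taint "gpu_node=invalid:NoSchedule"
# prints: gpu_node invalid NoSchedule
```

With the NoSchedule effect, Kubernetes does not place new pods on the node unless they tolerate the taint.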

3. If you tainted the wrong node, or want to make a node schedulable again, remove the taint by appending - to the taint key:

➜ kubectl taint nodes cn-huhehaote.172.16.0.208 gpu_node-
node/cn-huhehaote.172.16.0.208 untainted
➜ kubectl taint nodes cn-huhehaote.172.16.0.209 gpu_node-
node/cn-huhehaote.172.16.0.209 untainted
➜ kubectl taint nodes cn-huhehaote.172.16.0.210 ssd_node-
node/cn-huhehaote.172.16.0.210 untainted

4. When submitting a job, add the --toleration option to tolerate nodes that carry a given taint key, for example --toleration=gpu_node. The option can be used multiple times with different taint keys.

➜ arena --loglevel info submit pytorch \
    --name=pytorch-toleration \
    --gpus=1 \
    --workers=2 \
    --image=registry.cn-beijing.aliyuncs.com/ai-samples/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \
    --sync-mode=git \
    --sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
    --tensorboard \
    --logdir=/root/logs \
    --toleration gpu_node \
    "python /root/code/mnist-pytorch/mnist.py --epochs 50 --backend gloo --dir /root/logs"

configmap/pytorch-toleration-pytorchjob created
configmap/pytorch-toleration-pytorchjob labeled
service/pytorch-toleration-tensorboard created
deployment.apps/pytorch-toleration-tensorboard created
pytorchjob.kubeflow.org/pytorch-toleration created
INFO[0000] The Job pytorch-toleration has been submitted successfully
INFO[0000] You can run `arena get pytorch-toleration --type pytorchjob` to check the job status
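
Step 4 notes that --toleration can be repeated. The sketch below builds the repeated flags from a list of taint keys, which may be handy when the keys come from a script. The helper name toleration_flags is hypothetical.

```shell
# Hypothetical helper: turn a list of taint keys into repeated
# "--toleration <key>" flags, separated by single spaces.
toleration_flags() {
  flags=""
  for key in "$@"; do
    # ${flags:+$flags } adds a separating space only when flags is non-empty
    flags="${flags:+$flags }--toleration $key"
  done
  printf '%s\n' "$flags"
}

toleration_flags gpu_node ssd_node
# prints: --toleration gpu_node --toleration ssd_node
```

The printed flags could then be pasted into, or substituted unquoted into, the arena submit pytorch command shown above, assuming the taint keys contain no whitespace.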

5. Get the details of this job.

➜ arena get pytorch-toleration
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 2m

NAME                STATUS   TRAINER     AGE  INSTANCE                     NODE
pytorch-toleration  RUNNING  PYTORCHJOB  2m   pytorch-toleration-master-0  172.16.0.209
pytorch-toleration  RUNNING  PYTORCHJOB  2m   pytorch-toleration-worker-0  172.16.0.209

Your tensorboard will be available on:
http://172.16.0.205:32091
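
To confirm that the instances landed on the tainted nodes, the NODE column can be pulled out of the arena get output. This is a minimal sketch that assumes the table layout shown above (a header row starting with NAME, with INSTANCE and NODE as the last two columns). The helper name instance_nodes is hypothetical.

```shell
# Hypothetical helper: read `arena get` output on stdin and print
# "instance node" pairs from the instance table.
instance_nodes() {
  awk '
    $1 == "NAME" { in_table = 1; next }    # header row starts the table
    in_table && NF == 0 { in_table = 0 }   # a blank line ends it
    in_table { print $(NF-1), $NF }        # last two columns: INSTANCE, NODE
  '
}

# Usage (requires a cluster with the job from step 4):
#   arena get pytorch-toleration | instance_nodes
```

In the run above, both instances report 172.16.0.209, one of the two nodes tainted with gpu_node.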

6. You can use --toleration all to tolerate all node taints.

➜ arena --loglevel info submit pytorch \
    --name=pytorch-toleration-all \
    --gpus=1 \
    --image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \
    --sync-mode=git \
    --sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
    --toleration all \
    "python /root/code/mnist-pytorch/mnist.py --epochs 10 --backend gloo"

configmap/pytorch-toleration-all-pytorchjob created
configmap/pytorch-toleration-all-pytorchjob labeled
pytorchjob.kubeflow.org/pytorch-toleration-all created
INFO[0000] The Job pytorch-toleration-all has been submitted successfully
INFO[0000] You can run `arena get pytorch-toleration-all --type pytorchjob` to check the job status

7. Get the details of this job.

➜ arena get pytorch-toleration-all
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 33s

NAME                    STATUS   TRAINER     AGE  INSTANCE                         NODE
pytorch-toleration-all  RUNNING  PYTORCHJOB  33s  pytorch-toleration-all-master-0  172.16.0.210
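
To see which tolerations a job's pods actually received, you can read them from the pod spec. This is a minimal sketch, assuming the pod name matches the INSTANCE column above and that the namespace defaults to default. The helper name pod_tolerations is hypothetical. The exact toleration fields Arena writes are not shown in this document, so inspect the output rather than assuming a particular shape.

```shell
# Hypothetical helper: print the tolerations of a pod as JSON.
# $1 = pod name, $2 = namespace (optional, falls back to "default").
pod_tolerations() {
  kubectl get pod "$1" -n "${2:-default}" -o jsonpath='{.spec.tolerations}'
}

# Usage (requires access to the cluster):
#   pod_tolerations pytorch-toleration-all-master-0
```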