Skip to content

Pytorch Training Job with specified node selectors

Arena supports specifying pytorch jobs are running on some k8s particular nodes.

1. Get k8s cluster information.

➜ kubectl get nodes
NAME                        STATUS   ROLES    AGE     VERSION
cn-huhehaote.172.16.0.205   Ready    master   4h19m   v1.16.9-aliyun.1
cn-huhehaote.172.16.0.206   Ready    master   4h18m   v1.16.9-aliyun.1
cn-huhehaote.172.16.0.207   Ready    master   4h17m   v1.16.9-aliyun.1
cn-huhehaote.172.16.0.208   Ready    <none>   4h13m   v1.16.9-aliyun.1
cn-huhehaote.172.16.0.209   Ready    <none>   4h13m   v1.16.9-aliyun.1
cn-huhehaote.172.16.0.210   Ready    <none>   4h13m   v1.16.9-aliyun.1

2. Give a label to nodes,for example:

# 172.16.0.208 label gpu_node=true
➜ kubectl label nodes cn-huhehaote.172.16.0.208 gpu_node=true
node/cn-huhehaote.172.16.0.208 labeled

# 172.16.0.209 label gpu_node=ture
➜ kubectl label nodes cn-huhehaote.172.16.0.209 gpu_node=true
node/cn-huhehaote.172.16.0.209 labeled

# 172.16.0.210 label ssd_node=true
➜ kubectl label nodes cn-huhehaote.172.16.0.210 ssd_node=true
node/cn-huhehaote.172.16.0.210 labeled

3. When submitting a pytorch job, you can use the --selector to specify which nodes to accept the job.

➜ arena --loglevel info submit pytorch \
    --name=pytorch-selector \
    --gpus=1 \
    --workers=2 \
    --selector gpu_node=true \
    --image=registry.cn-beijing.aliyuncs.com/ai-samples/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \
    --sync-mode=git \
    --sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
    "python /root/code/mnist-pytorch/mnist.py --backend gloo"

configmap/pytorch-selector-pytorchjob created
configmap/pytorch-selector-pytorchjob labeled
pytorchjob.kubeflow.org/pytorch-selector created
INFO[0000] The Job pytorch-selector has been submitted successfully
INFO[0000] You can run `arena get pytorch-selector --type pytorchjob` to check the job status

4. Get the job details, you can see that the job only runs on this node with IP 172.16.0.209 and label gpu_node=true.

➜ arena get pytorch-selector
STATUS: PENDING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 14s

NAME              STATUS   TRAINER     AGE  INSTANCE                   NODE
pytorch-selector  PENDING  PYTORCHJOB  14s  pytorch-selector-master-0  172.16.0.209
pytorch-selector  PENDING  PYTORCHJOB  14s  pytorch-selector-worker-0  172.16.0.209