Skip to content

Submit Tensorflow Job with specified node selectors

Arena supports assigning jobs to some k8s particular nodes(Currently only support mpi job and tf job), the following steps will show how to use this feature.

Label the nodes

1. query k8s cluster information.

$ kubectl get nodes
NAME                       STATUS   ROLES    AGE     VERSION
cn-beijing.192.168.3.225   Ready    master   2d23h   v1.12.6-aliyun.1
cn-beijing.192.168.3.226   Ready    master   2d23h   v1.12.6-aliyun.1
cn-beijing.192.168.3.227   Ready    master   2d23h   v1.12.6-aliyun.1
cn-beijing.192.168.3.228   Ready    <none>   2d22h   v1.12.6-aliyun.1
cn-beijing.192.168.3.229   Ready    <none>   2d22h   v1.12.6-aliyun.1
cn-beijing.192.168.3.230   Ready    <none>   2d22h   v1.12.6-aliyun.1

2. label the nodes,for example: label node cn-beijing.192.168.3.228" and node cn-beijing.192.168.3.229 with gpu_node=true ,label node cn-beijing.192.168.3.230 with ssd_node=true.

$ kubectl label nodes cn-beijing.192.168.3.228 gpu_node=true
node/cn-beijing.192.168.3.228 labeled
$ kubectl label nodes cn-beijing.192.168.3.229 gpu_node=true
node/cn-beijing.192.168.3.229 labeled
$ kubectl label nodes cn-beijing.192.168.3.230 ssd_node=true
node/cn-beijing.192.168.3.230 labeled

Roles are running with the same node selectors

3. because there is four roles("PS","Worker","Evaluator","Chief") in tf job,you can use --selector to assgin nodes, it is effective for all roles. for example:

$ arena submit tfjob \
    --name=tfjob-with-selector \
    --gpus=1              \
    --workers=1              \
    --selector ssd_node=true \
    --worker-image=cheyang/tf-mnist-distributed:gpu \
    --ps-image=cheyang/tf-mnist-distributed:cpu \
    --ps=1              \
    --tensorboard \
    --loglevel debug \
    "python /app/main.py"

4. check the job status.

$ arena get tfjob-with-selector                                                                                                                                       
STATUS: PENDING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 24s

NAME                   STATUS   TRAINER  AGE  INSTANCE     NODE
tfjob-with-selector    RUNNING  TFJOB    24s  tf-ps-0      192.168.3.230
tfjob-with-selector    PENDING  TFJOB    24s  tf-worker-0  192.168.3.230

Your tensorboard will be available on:
http://192.168.3.230:31867

the job(includes "PS" and "Worker") have been running on cn-beijing.192.168.3.230(ip is 192.168.3.230,label is ssd_node=true).

Roles are running with the different node selectors

5. you also can assign node to run single role job,for example: if you want to run a job whose role is "PS" on nodes which own label ssd_node=true and run "Worker" job on nodes which own label gpu_node=true,you can use option --ps-selector and --worker-selector.

$ arena submit tfjob \
    --name=tfjob-with-selector \
    --gpus=1 \
    --workers=1 \
    --ps-selector ssd_node=true \
    --worker-selector gpu_node=true \
    --worker-image=cheyang/tf-mnist-distributed:gpu \
    --ps-image=cheyang/tf-mnist-distributed:cpu \
    --ps=1              \
    --tensorboard \
    --loglevel debug \
    "python /app/main.py"

6. check the jobs's status.

$ arena get tf

STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 23s

NAME                   STATUS   TRAINER  AGE  INSTANCE     NODE
tfjob-with-selector    RUNNING  TFJOB    23s  tf-ps-0      192.168.3.230
tfjob-with-selector    RUNNING  TFJOB    23s  tf-worker-0  192.168.3.228

Your tensorboard will be available on:
http://192.168.3.225:30162

the "PS" job is running on cn-beijing.192.168.3.230(ip is 192.168.3.230,label is ssd_node=true ) and the "Worker" job is running on cn-beijing.192.168.3.228(ip is 192.168.3.228,label is gpu_node=true ).

7. if you use --selector in arena submit tf command and also use --ps-selector (or --worker-selector , --evaluator-selector , chief-selector ),the value of --ps-selector would cover value of --selector ,for example:

$ arena submit tfjob \
    --name=tfjob-with-selector \
    --gpus=1              \
    --workers=1              \
    --ps-selector ssd_node=true \
    --selector gpu_node=true \
    --worker-image=cheyang/tf-mnist-distributed:gpu \
    --ps-image=cheyang/tf-mnist-distributed:cpu \
    --ps=1              \
    --tensorboard \
    --loglevel debug \
    "python /app/main.py"

"PS" job will be running on nodes whose label is ssd_node=true ,other jobs will be running on nodes whose label is gpu_node=true . now verify our conclusions,use follow command to check job status.

$ arena get tfjob-with-selector                                                                                                                                      
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 39s

NAME                   STATUS   TRAINER  AGE  INSTANCE     NODE
tfjob-with-selector    RUNNING  TFJOB    39s  tf-ps-0      192.168.3.230
tfjob-with-selector    RUNNING  TFJOB    39s  tf-worker-0  192.168.3.228

Your tensorboard will be available on:
http://192.168.3.225:32105

As you can see, "PS" job is running on nodes which own label ssd_node=true ,other jobs are running on nodes which own label gpu_node=true.