TensorFlow job with gang scheduling enabled
Arena supports distributed TensorFlow training with gang scheduling by using scheduler-plugins. To enable gang scheduling, add the --gang parameter. Learn more: https://help.aliyun.com/document_detail/178169.html
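When running distributed TensorFlow, it is best to schedule the instances of a job on an all-or-nothing basis: either every instance starts, or none does. Otherwise, a partially scheduled job holds resources while it waits for its remaining instances. Gang scheduling handles this case.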
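To make the all-or-nothing idea concrete, here is a toy sketch that compares the two admission policies. It is only an illustration under simplified assumptions (a single pool of free GPUs, identical workers, a positive GPU count per worker) and is not how scheduler-plugins is implemented; the function names greedy_admit and gang_admit are hypothetical.

```shell
#!/usr/bin/env bash
# Toy model of admission decisions; NOT the scheduler-plugins implementation.
# Arguments: free GPUs in the cluster, number of workers, GPUs per worker (> 0).

# Without gang scheduling: start as many workers as currently fit.
greedy_admit() {
  local free="$1" workers="$2" per_worker="$3"
  local fit=$(( free / per_worker ))
  if [ "$fit" -gt "$workers" ]; then
    fit="$workers"
  fi
  echo "$fit"
}

# With gang scheduling: start all workers or none of them.
gang_admit() {
  local free="$1" workers="$2" per_worker="$3"
  if [ $(( workers * per_worker )) -le "$free" ]; then
    echo "$workers"
  else
    echo 0
  fi
}
```

With 1 free GPU and 2 workers that need 1 GPU each, greedy_admit 1 2 1 prints 1 (one worker runs and waits for its peer), while gang_admit 1 2 1 prints 0 (nothing starts until both fit).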
Warning
Limitation: when using gang scheduling, the tensorboard feature doesn't work well.
The following command is an example. It defines 2 workers and 1 PS, and each worker has 1 GPU. The source code for the workers and the PS is pulled from a Git repository, and TensorBoard is enabled.
$ arena submit tf \
--name=tf-dist-git \
--gpus=1 \
--workers=2 \
--worker-image=tensorflow/tensorflow:1.5.0-devel-gpu \
--sync-mode=git \
--sync-source=https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git \
--ps=1 \
--ps-image=tensorflow/tensorflow:1.5.0-devel \
--gang \
--tensorboard \
"python code/tensorflow-sample-code/tfjob/docker/v1alpha2/distributed-mnist/main.py --log_dir=/training_logs --data_dir=code/tensorflow-sample-code/data"
service/tf-dist-git-tensorboard created
deployment.apps/tf-dist-git-tensorboard created
tfjob.kubeflow.org/tf-dist-git created
INFO[0002] The Job tf-dist-git has been submitted successfully
INFO[0002] You can run `arena get tf-dist-git --type tfjob` to check the job status
If there are not enough resources, all instances of the job stay PENDING. Without gang scheduling, you would instead see some pods RUNNING and others PENDING.
$ arena get tf-dist-git
Name: tf-dist-git
Status: PENDING
Namespace: default
Priority: N/A
Trainer: TFJOB
Duration: 4s
Instances:
NAME                  STATUS   AGE  IS_CHIEF  GPU(Requested)  NODE
----                  ------   ---  --------  --------------  ----
tf-dist-git-ps-0      Pending  4s   false     0               N/A
tf-dist-git-worker-0  Pending  4s   true      1               N/A
tf-dist-git-worker-1  Pending  4s   false     1               N/A
Tensorboard:
Your tensorboard will be available on:
http://10.0.0.80:31029
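One way to confirm the all-or-nothing behavior is to check that every instance reports the same status. The sketch below reads `arena get` output from stdin and prints the distinct instance statuses. It assumes the layout shown in the sample output above (instance rows begin with the job name followed by a hyphen, and the status is the second column); the helper name instance_statuses is hypothetical and not part of Arena.

```shell
#!/usr/bin/env bash
# Print the distinct instance statuses found in `arena get` output (stdin).
# Assumption: instance rows start with "<job-name>-" and the status is in
# column 2, as in the sample output in this document.
instance_statuses() {
  local job="$1"
  awk -v prefix="${job}-" 'index($1, prefix) == 1 { print $2 }' | sort -u
}

# Example (requires a submitted job):
#   arena get tf-dist-git | instance_statuses tf-dist-git
```

With gang scheduling, this should print a single status while the job is waiting for resources. More than one line of output means the instances are in mixed states.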
When there are enough resources, the instances become RUNNING.
$ arena get tf-dist-git
Name: tf-dist-git
Status: RUNNING
Namespace: default
Priority: N/A
Trainer: TFJOB
Duration: 50s
Instances:
NAME                  STATUS   AGE  IS_CHIEF  GPU(Requested)  NODE
----                  ------   ---  --------  --------------  ----
tf-dist-git-ps-0      Running  0s   false     0               cn-beijing.10.0.0.84
tf-dist-git-worker-0  Running  50s  true      1               cn-beijing.10.0.0.83
tf-dist-git-worker-1  Running  50s  false     1               cn-beijing.10.0.0.85
Tensorboard:
Your tensorboard will be available on:
http://10.0.0.80:31029
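Rather than rerunning `arena get` by hand, the transition from PENDING to RUNNING can be polled. The following is a minimal sketch, assuming the "Status:" line has the form shown in the output above; the helpers job_status and wait_for_running, and the default timeout and interval values, are hypothetical and not part of Arena. It does not special-case any status other than RUNNING.

```shell
#!/usr/bin/env bash
# Extract the job-level status from `arena get` output read from stdin.
# Assumption: the output contains a line of the form "Status: <STATE>".
job_status() {
  awk '$1 == "Status:" { print $2; exit }'
}

# Poll `arena get <job>` until the job is RUNNING or the timeout expires.
# Arguments: job name, timeout in seconds (default 600),
# polling interval in seconds (default 10).
wait_for_running() {
  local job="$1" timeout="${2:-600}" interval="${3:-10}"
  local elapsed=0 status
  while [ "$elapsed" -lt "$timeout" ]; do
    status="$(arena get "$job" | job_status)"
    if [ "$status" = "RUNNING" ]; then
      echo "job $job is RUNNING after about ${elapsed}s"
      return 0
    fi
    sleep "$interval"
    elapsed=$(( elapsed + interval ))
  done
  echo "job $job is still ${status:-unknown} after ${timeout}s" >&2
  return 1
}
```

For example, wait_for_running tf-dist-git 300 5 checks every 5 seconds for up to 5 minutes and returns a non-zero exit code if the job never reaches RUNNING in that time.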