Submit an MPI job with GPU topology scheduling
Arena supports GPU topology scheduling for distributed training. To enable it, add the --gputopology parameter when submitting the job (set to true in the examples below).
Learn more: https://help.aliyun.com/document_detail/190482.html
VGG16
Enable GPU topology scheduling
- Submit a TensorFlow training job with --gputopology=true
$ arena submit mpi \
--name=tensorflow-topo-4-vgg16 \
--gpus=1 \
--workers=4 \
--gputopology=true \
--image=registry.cn-hangzhou.aliyuncs.com/kubernetes-image-hub/tensorflow-benchmark:tf2.3.0-py3.7-cuda10.1 \
"mpirun --allow-run-as-root -np "4" -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=eth0 -x LD_LIBRARY_PATH -x PATH --mca pml ob1 --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 --mca orte_keep_fqdn_hostnames t --mca btl ^openib python /tensorflow/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model=vgg16 --batch_size=64 --variable_update=horovod"
- Get the running status of the current job
$ arena get tensorflow-topo-4-vgg16 --type mpijob
Name: tensorflow-topo-4-vgg16
Status: RUNNING
Namespace: default
Priority: N/A
Trainer: MPIJOB
Duration: 2m
Instances:
NAME STATUS AGE IS_CHIEF GPU(Requested) NODE
---- ------ --- -------- -------------- ----
tensorflow-topo-4-vgg16-launcher-lmhjl Running 2m true 0 cn-shanghai.192.168.16.172
tensorflow-topo-4-vgg16-worker-0 Running 2m false 1 cn-shanghai.192.168.16.173
tensorflow-topo-4-vgg16-worker-1 Running 2m false 1 cn-shanghai.192.168.16.173
tensorflow-topo-4-vgg16-worker-2 Running 2m false 1 cn-shanghai.192.168.16.173
tensorflow-topo-4-vgg16-worker-3 Running 2m false 1 cn-shanghai.192.168.16.173
- Get current log information
$ arena logs -f tensorflow-topo-4-vgg16
----------------------------------------------------------------
total images/sec: 991.92
----------------------------------------------------------------
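To compare runs later, the throughput figure can be pulled out of the log with a script. The following is a minimal sketch, assuming the benchmark prints a line of the form total images/sec: <value> as in the output above. The name extract_throughput is a hypothetical helper, not part of Arena.

```shell
# Hypothetical helper (not part of Arena): read benchmark log text on stdin
# and print the value from the last "total images/sec:" line, if any.
extract_throughput() {
  awk '/total images\/sec:/ { value = $NF } END { if (value != "") print value }'
}

# Demo on a sample copied from the log output above.
printf '%s\n' \
  '----------------------------------------------------------------' \
  'total images/sec: 991.92' \
  '----------------------------------------------------------------' |
  extract_throughput   # prints 991.92
```

Once the job has finished, the same function should also work on live output, for example arena logs tensorflow-topo-4-vgg16 | extract_throughput (without -f, so that the command exits after printing).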
Disable GPU topology scheduling
- Submit a TensorFlow training job without --gputopology
$ arena submit mpi \
--name=tensorflow-4-vgg16 \
--gpus=1 \
--workers=4 \
--image=registry.cn-hangzhou.aliyuncs.com/kubernetes-image-hub/tensorflow-benchmark:tf2.3.0-py3.7-cuda10.1 \
"mpirun --allow-run-as-root -np "4" -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=eth0 -x LD_LIBRARY_PATH -x PATH --mca pml ob1 --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 --mca orte_keep_fqdn_hostnames t --mca btl ^openib python /tensorflow/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model=vgg16 --batch_size=64 --variable_update=horovod"
- Get the running status of the current job
$ arena get tensorflow-4-vgg16 --type mpijob
Name: tensorflow-4-vgg16
Status: RUNNING
Namespace: default
Priority: N/A
Trainer: MPIJOB
Duration: 9s
Instances:
NAME STATUS AGE IS_CHIEF GPU(Requested) NODE
---- ------ --- -------- -------------- ----
tensorflow-4-vgg16-launcher-xc28k Running 9s true 0 cn-shanghai.192.168.16.172
tensorflow-4-vgg16-worker-0 Running 9s false 1 cn-shanghai.192.168.16.172
tensorflow-4-vgg16-worker-1 Running 9s false 1 cn-shanghai.192.168.16.173
tensorflow-4-vgg16-worker-2 Running 9s false 1 cn-shanghai.192.168.16.172
tensorflow-4-vgg16-worker-3 Running 9s false 1 cn-shanghai.192.168.16.173
- Get current log information
$ arena logs -f tensorflow-4-vgg16
----------------------------------------------------------------
total images/sec: 200.47
----------------------------------------------------------------
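The NODE column in the two status outputs above shows the placement difference: with --gputopology=true all four workers ran on one node, while without it they were split across two. A minimal sketch for summarizing that column, assuming the instance rows look like the ones above. The name workers_per_node is a hypothetical helper, not part of Arena.

```shell
# Hypothetical helper (not part of Arena): read `arena get` output on stdin and
# print "<node> <number of workers>" per node. Worker rows are recognized by
# the "-worker-<index>" suffix in the NAME column; the node is the last column.
workers_per_node() {
  awk '$1 ~ /-worker-[0-9]+$/ { count[$NF]++ }
       END { for (node in count) print node, count[node] }' | sort
}

# Demo on the instance rows of the job submitted without --gputopology.
workers_per_node <<'EOF'
tensorflow-4-vgg16-launcher-xc28k Running 9s true 0 cn-shanghai.192.168.16.172
tensorflow-4-vgg16-worker-0 Running 9s false 1 cn-shanghai.192.168.16.172
tensorflow-4-vgg16-worker-1 Running 9s false 1 cn-shanghai.192.168.16.173
tensorflow-4-vgg16-worker-2 Running 9s false 1 cn-shanghai.192.168.16.172
tensorflow-4-vgg16-worker-3 Running 9s false 1 cn-shanghai.192.168.16.173
EOF
# prints:
# cn-shanghai.192.168.16.172 2
# cn-shanghai.192.168.16.173 2
```

The launcher row is deliberately ignored, since it requests no GPU.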
ResNet50
Enable GPU topology scheduling
- Submit a TensorFlow training job with --gputopology=true
$ arena submit mpi \
--name=tensorflow-topo-4-resnet50 \
--gpus=1 \
--workers=4 \
--gputopology=true \
--image=registry.cn-hangzhou.aliyuncs.com/kubernetes-image-hub/tensorflow-benchmark:tf2.3.0-py3.7-cuda10.1 \
"mpirun --allow-run-as-root -np "4" -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=eth0 -x LD_LIBRARY_PATH -x PATH --mca pml ob1 --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 --mca orte_keep_fqdn_hostnames t --mca btl ^openib python /tensorflow/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model=resnet50 --batch_size=64 --variable_update=horovod"
- Get the running status of the current job
$ arena get tensorflow-topo-4-resnet50 --type mpijob
Name: tensorflow-topo-4-resnet50
Status: RUNNING
Namespace: default
Priority: N/A
Trainer: MPIJOB
Duration: 8s
Instances:
NAME STATUS AGE IS_CHIEF GPU(Requested) NODE
---- ------ --- -------- -------------- ----
tensorflow-topo-4-resnet50-launcher-7ln8j Running 8s true 0 cn-shanghai.192.168.16.172
tensorflow-topo-4-resnet50-worker-0 Running 8s false 1 cn-shanghai.192.168.16.173
tensorflow-topo-4-resnet50-worker-1 Running 8s false 1 cn-shanghai.192.168.16.173
tensorflow-topo-4-resnet50-worker-2 Running 8s false 1 cn-shanghai.192.168.16.173
tensorflow-topo-4-resnet50-worker-3 Running 8s false 1 cn-shanghai.192.168.16.173
- Get current log information
$ arena logs -f tensorflow-topo-4-resnet50
----------------------------------------------------------------
total images/sec: 1471.55
----------------------------------------------------------------
Disable GPU topology scheduling
- Submit a TensorFlow training job without --gputopology
$ arena submit mpi \
--name=tensorflow-4-resnet50 \
--gpus=1 \
--workers=4 \
--image=registry.cn-hangzhou.aliyuncs.com/kubernetes-image-hub/tensorflow-benchmark:tf2.3.0-py3.7-cuda10.1 \
"mpirun --allow-run-as-root -np "4" -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=eth0 -x LD_LIBRARY_PATH -x PATH --mca pml ob1 --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 --mca orte_keep_fqdn_hostnames t --mca btl ^openib python /tensorflow/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model=resnet50 --batch_size=64 --variable_update=horovod"
- Get the running status of the current job
$ arena get tensorflow-4-resnet50 --type mpijob
Name: tensorflow-4-resnet50
Status: RUNNING
Namespace: default
Priority: N/A
Trainer: MPIJOB
Duration: 9s
Instances:
NAME STATUS AGE IS_CHIEF GPU(Requested) NODE
---- ------ --- -------- -------------- ----
tensorflow-4-resnet50-launcher-q24hv Running 9s true 0 cn-shanghai.192.168.16.172
tensorflow-4-resnet50-worker-0 Running 9s false 1 cn-shanghai.192.168.16.172
tensorflow-4-resnet50-worker-1 Running 9s false 1 cn-shanghai.192.168.16.173
tensorflow-4-resnet50-worker-2 Running 9s false 1 cn-shanghai.192.168.16.172
tensorflow-4-resnet50-worker-3 Running 9s false 1 cn-shanghai.192.168.16.173
- Get current log information
$ arena logs -f tensorflow-4-resnet50
----------------------------------------------------------------
total images/sec: 745.38
----------------------------------------------------------------
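The four submissions above differ only in the job name, the --model value, and whether --gputopology=true is present. A minimal sketch of a wrapper that makes this explicit follows. The function submit_case and the DRY_RUN variable are hypothetical, and -np 4 is written without the inner quotes, which the shell strips anyway. All flags and values are copied from the commands above.

```shell
# Hypothetical wrapper (not part of Arena): submit one benchmark case.
# Usage: submit_case <model> <true|false>
# Set DRY_RUN=1 to print the command instead of running it.
submit_case() {
  model=$1
  topo=$2
  image=registry.cn-hangzhou.aliyuncs.com/kubernetes-image-hub/tensorflow-benchmark:tf2.3.0-py3.7-cuda10.1
  train_cmd="mpirun --allow-run-as-root -np 4 -bind-to none -map-by slot"
  train_cmd="$train_cmd -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=eth0 -x LD_LIBRARY_PATH -x PATH"
  train_cmd="$train_cmd --mca pml ob1 --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0"
  train_cmd="$train_cmd --mca orte_keep_fqdn_hostnames t --mca btl ^openib"
  train_cmd="$train_cmd python /tensorflow/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py"
  train_cmd="$train_cmd --model=$model --batch_size=64 --variable_update=horovod"

  # The only scheduling-related difference between the cases is this flag.
  if [ "$topo" = "true" ]; then
    set -- --name="tensorflow-topo-4-$model" --gputopology=true
  else
    set -- --name="tensorflow-4-$model"
  fi
  set -- arena submit mpi "$@" --gpus=1 --workers=4 --image="$image" "$train_cmd"

  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"   # print only; quoting of the last argument is not shown
  else
    "$@"
  fi
}
```

With DRY_RUN=1 set, submit_case vgg16 true prints the command for inspection. Without it, the function invokes arena and therefore needs a configured cluster.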
Performance Comparison
As the figure above shows, the four test cases indicate that GPU topology scheduling substantially improves TensorFlow distributed training throughput in this environment: VGG16 rises from 200.47 to 991.92 images/sec (about 4.95x), and ResNet50 from 745.38 to 1471.55 images/sec (about 1.97x). In the enabled runs all four workers were placed on the same node (cn-shanghai.192.168.16.173), whereas in the disabled runs they were split across two nodes. Note: the gain from GPU topology-aware scheduling depends on the model and the cluster environment. Use the examples above as a template for evaluating your own model.
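The improvement can be expressed as a ratio of the total images/sec values reported in the logs above. A minimal sketch of that arithmetic, where speedup is a hypothetical helper name and the inputs are the four figures from this page:

```shell
# Hypothetical helper: print throughput with topology scheduling divided by
# throughput without it, rounded to two decimal places.
speedup() {
  awk -v enabled="$1" -v disabled="$2" \
    'BEGIN { printf "%.2fx\n", enabled / disabled }'
}

speedup 991.92 200.47    # VGG16:    prints 4.95x
speedup 1471.55 745.38   # ResNet50: prints 1.97x
```

These ratios describe only the single runs shown here. Repeating each case several times would give a more reliable estimate for a given cluster.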