Skip to content

Submit a distributed MPI job

Arena supports and simplifies distributed TensorFlow Training (MPI mode).

1. To run a distributed Training with MPI support, you need to specify:

  • GPUs of each worker (only for GPU workload)
  • The number of workers (required)
  • The docker image of MPI worker (required)

The following command is an example. In this example, it defines 2 workers, and each worker has 1 GPU. The tensorboard are enabled.

$ arena submit mpi 
    --name=mpi-dist \
    --gpus=1 \
    --workers=2 \
    --image=uber/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
    --env=GIT_SYNC_BRANCH=cnn_tf_v1.9_compatible \
    --sync-mode=git \
    --sync-source=https://github.com/tensorflow/benchmarks.git \
    --tensorboard \
    "mpirun python code/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod --train_dir=/training_logs --summary_verbosity=3 --save_summaries_steps=10"

2. Get the details of the specific job.

$ arena get mpi-dist
NAME      STATUS   TRAINER  AGE  INSTANCE                        NODE
mpi-dist  RUNNING  MPIJOB   1d   mpi-dist-mpijob-launcher-ndnw8  192.168.1.120
mpi-dist  RUNNING  MPIJOB   1d   mpi-dist-mpijob-worker-0        192.168.1.119
mpi-dist  RUNNING  MPIJOB   1d   mpi-dist-mpijob-worker-1        192.168.1.120

Your tensorboard will be available on:
192.168.1.117:32559

3. Check the tensorboard.

image

4. Get the MPI dashboard.

$ arena logviewer mpi-dist
Your LogViewer will be available on:
192.168.1.119:9090/#!/log/default/mpi-dist-mpijob-launcher-ndnw8/mpi?namespace=default

image

Congratulations! You've run the distributed MPI training job with arena successfully.