Distributed PyTorch Training Job
This example shows how to use Arena to submit a distributed PyTorch job. The source code is downloaded from a Git URL.
1. The first step is to check the available resources.
➜ arena top node
NAME                       IPADDRESS     ROLE    STATUS  GPU(Total)  GPU(Allocated)
cn-huhehaote.172.16.0.205  172.16.0.205  master  ready   0           0
cn-huhehaote.172.16.0.206  172.16.0.206  master  ready   0           0
cn-huhehaote.172.16.0.207  172.16.0.207  master  ready   0           0
cn-huhehaote.172.16.0.208  172.16.0.208  <none>  ready   4           0
cn-huhehaote.172.16.0.209  172.16.0.209  <none>  ready   4           0
cn-huhehaote.172.16.0.210  172.16.0.210  <none>  ready   4           0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/12 (0%)
As the output shows, 3 nodes with GPUs (4 GPUs each, 12 in total, none allocated) are available for running training jobs.
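The cluster summary line can be cross-checked against the table by hand. The following is a minimal Python sketch of that arithmetic, with the rows copied from the output above; the tuple layout and variable names are illustrative only and are not part of Arena.

```python
# Rows copied from the `arena top node` output above:
# (node name, role, total GPUs, allocated GPUs).
nodes = [
    ("cn-huhehaote.172.16.0.205", "master", 0, 0),
    ("cn-huhehaote.172.16.0.206", "master", 0, 0),
    ("cn-huhehaote.172.16.0.207", "master", 0, 0),
    ("cn-huhehaote.172.16.0.208", "<none>", 4, 0),
    ("cn-huhehaote.172.16.0.209", "<none>", 4, 0),
    ("cn-huhehaote.172.16.0.210", "<none>", 4, 0),
]

# Nodes that can host GPU training instances.
gpu_nodes = [name for name, _role, total, _alloc in nodes if total > 0]

# Cluster-wide totals, formatted like the summary line in the output.
total = sum(n[2] for n in nodes)
allocated = sum(n[3] for n in nodes)
percent = 100 * allocated // total if total else 0
summary = f"{allocated}/{total} ({percent}%)"

print(len(gpu_nodes), "GPU nodes,", summary)
```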
2. Submit a distributed PyTorch training job with 2 nodes and one GPU card each. This example downloads the source code from Alibaba Cloud Code.
➜ arena --loglevel info submit pytorch \
--name=pytorch-dist-git \
--gpus=1 \
--workers=2 \
--image=registry.cn-beijing.aliyuncs.com/ai-samples/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \
--sync-mode=git \
--sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
"python /root/code/mnist-pytorch/mnist.py --backend gloo"
configmap/pytorch-dist-git-pytorchjob created
configmap/pytorch-dist-git-pytorchjob labeled
pytorchjob.kubeflow.org/pytorch-dist-git created
INFO[0000] The Job pytorch-dist-git has been submitted successfully
INFO[0000] You can run `arena get pytorch-dist-git --type pytorchjob` to check the job status
Note
- The source code is downloaded and extracted to the code/ directory under the working directory. The default working directory is /root; you can specify a different one with --workingDir.
- workers is the total number of nodes participating in the training (it must be an integer greater than or equal to 1), including the rank0 node used to establish communication (corresponding to the master node in pytorch-operator). The default value is 1, so the parameter can be left unset, in which case the job runs as a stand-alone job.
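The two notes above can be made concrete with a small Python sketch. Both helper functions are hypothetical and only illustrate the rules as described here; they are not Arena code. The assumption that the repository lands in a subdirectory named after the Git URL is inferred from the script path used in the submit command (/root/code/mnist-pytorch/mnist.py).

```python
import posixpath


def synced_code_dir(sync_source, working_dir="/root"):
    """Hypothetical helper: where the synced repository is expected to land.

    Assumes the repository is placed under <working_dir>/code/<repo name>,
    as the script path in the submit command suggests.
    """
    repo = posixpath.basename(sync_source.rstrip("/"))
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]
    return posixpath.join(working_dir, "code", repo)


def replica_counts(workers=1):
    """Hypothetical helper: split --workers into master and worker instances.

    --workers counts every node in the job, including the rank0 master.
    """
    if workers < 1:
        raise ValueError("workers must be an integer >= 1")
    return {"master": 1, "worker": workers - 1}


print(synced_code_dir("https://code.aliyun.com/370272561/mnist-pytorch.git"))
print(replica_counts(2))
```

With --workers=2, as in this example, the sketch yields one master instance and one worker instance, which matches the two instances listed later by arena get.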
3. List all the jobs.
➜ arena list
NAME              STATUS     TRAINER     AGE  NODE
pytorch-dist-git  SUCCEEDED  PYTORCHJOB  23h  N/A
4. Get the details of this job. There are 2 instances of this job, and instance pytorch-dist-git-master-0 is rank0. Arena simplifies the process of submitting distributed jobs with PyTorch-Operator.
In PyTorch-Operator, a Service is created for the master instance so that the other nodes can reach it through the Service name, and the following environment variables are injected into each instance: MASTER_PORT, MASTER_ADDR, WORLD_SIZE, and RANK. They are used to initialize the PyTorch distributed process group (dist.init_process_group):
- MASTER_PORT is assigned automatically.
- MASTER_ADDR is localhost in the master instance and the Service name of the master in the other instances.
- WORLD_SIZE is the total number of instances.
- RANK is the index of the current compute node: 0 for the master, and the index in the instance name suffix plus one for a Worker instance. In this example, the RANK of instance pytorch-dist-git-worker-0 is 0 + 1 = 1.
In Arena, the value passed to --workers includes the one master instance, because the master also takes part in training.
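The rank rule and the injected variables described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code of mnist.py or of PyTorch-Operator: the function names are hypothetical, and the values in example_env (in particular the port and the Service name) are made up for the demonstration.

```python
import os
import re


def rank_from_instance_name(instance):
    """Hypothetical helper: derive RANK from an instance name.

    The master is rank 0; a worker's rank is its name suffix index plus one.
    """
    match = re.search(r"-(master|worker)-(\d+)$", instance)
    if match is None:
        raise ValueError(f"unexpected instance name: {instance!r}")
    role, index = match.group(1), int(match.group(2))
    return 0 if role == "master" else index + 1


def read_dist_env(env=os.environ):
    """Read the four variables that are injected into each instance."""
    return {
        "master_addr": env["MASTER_ADDR"],
        "master_port": int(env["MASTER_PORT"]),
        "world_size": int(env["WORLD_SIZE"]),
        "rank": int(env["RANK"]),
    }


# Illustrative values for the worker instance of this job (assumed, not
# taken from a real run).
example_env = {
    "MASTER_ADDR": "pytorch-dist-git-master-0",
    "MASTER_PORT": "23456",
    "WORLD_SIZE": "2",
    "RANK": str(rank_from_instance_name("pytorch-dist-git-worker-0")),
}
print(read_dist_env(example_env))

# In a training script with PyTorch installed, the default env:// init
# method reads the same four variables, so no explicit arguments are needed:
#   import torch.distributed as dist
#   dist.init_process_group(backend="gloo")
```

Keeping the PyTorch call in a comment lets the sketch run without PyTorch installed; inside the container it would be executed directly.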
➜ arena get pytorch-dist-git
STATUS: SUCCEEDED
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 1m
NAME              STATUS     TRAINER     AGE  INSTANCE                   NODE
pytorch-dist-git  SUCCEEDED  PYTORCHJOB  23h  pytorch-dist-git-master-0  172.16.0.210
pytorch-dist-git  SUCCEEDED  PYTORCHJOB  23h  pytorch-dist-git-worker-0  172.16.0.210
5. Check the job logs.
➜ arena logs pytorch-dist-git
WORLD_SIZE: 2, CURRENT_RANK: 0
args: Namespace(backend='gloo', batch_size=64, data='/root/code/mnist-pytorch', dir='/root/code/mnist-pytorch/logs', epochs=1, log_interval=10, lr=0.01, momentum=0.5, no_cuda=False, save_model=False, seed=1, test_batch_size=1000)
Using CUDA
Using distributed PyTorch with gloo backend
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
Train Epoch: 1 [0/60000 (0%)] loss=2.3000
Train Epoch: 1 [640/60000 (1%)] loss=2.2135
Train Epoch: 1 [1280/60000 (2%)] loss=2.1705
Train Epoch: 1 [1920/60000 (3%)] loss=2.0767
Train Epoch: 1 [2560/60000 (4%)] loss=1.8681
Train Epoch: 1 [3200/60000 (5%)] loss=1.4142
Train Epoch: 1 [3840/60000 (6%)] loss=1.0009
...
Note
- For a distributed job with multiple instances, the default output is the log of rank0 (the master instance). To view the log of a specific instance, pass -i followed by the instance name, for example:
➜ arena logs pytorch-dist-git -i pytorch-dist-git-worker-0
- In addition, you can view only the last few lines of the log by passing -t followed by the number of lines, for example:
➜ arena logs pytorch-dist-git -i pytorch-dist-git-worker-0 -t 5
Train Epoch: 1 [58880/60000 (98%)] loss=0.2048
Train Epoch: 1 [59520/60000 (99%)] loss=0.0646
accuracy=0.9661
- For more parameters, see arena logs --help