Attach a Running Training Job
Sometimes you need to enter a container of a training job and run commands in it, much as you would with kubectl exec. Arena covers this case with the arena attach command.
Warning
The arena attach command is only valid while the training job status is RUNNING.
The help output of the arena attach command is shown below:
$ arena attach -h
Attach a training job and execute some commands
Usage:
arena attach JOB [-i INSTANCE] [-c CONTAINER] [flags]
Flags:
-c, --container string Container name. If omitted, the first container in the instance will be chosen
-h, --help help for attach
-i, --instance string Job instance name
-T, --type string The training type to get, the possible option is tf(Tensorflow),mpi(MPI),py(Pytorch),horovod(Horovod),volcano(Volcano),et(ElasticTraining),spark(Spark). (optional)
Global Flags:
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job
--pprof enable cpu profile
--trace enable trace
1. Make sure the training job is running.
$ arena list
NAME STATUS TRAINER DURATION GPU(Requested) GPU(Allocated) NODE
dawnbench-1x1-v6 RUNNING MPIJOB 15m 2 2 192.168.1.137
tf-distributed-test FAILED TFJOB 11m 0 N/A N/A
tf-git SUCCEEDED TFJOB 14m 0 N/A N/A
mpi-test SUCCEEDED MPIJOB 12h 0 N/A N/A
elastic-training SUCCEEDED ETJOB 40d 0 N/A N/A
horovod-resnet50-v2-4x8-fluid SUCCEEDED MPIJOB 1h 0 N/A N/A
horovod-resnet50-v2-4x8-nfs SUCCEEDED MPIJOB 2h 0 N/A N/A
As shown above, the training job dawnbench-1x1-v6 is running. Next, get the details of the training job:
$ arena get dawnbench-1x1-v6
Name: dawnbench-1x1-v6
Status: RUNNING
Namespace: default
Priority: N/A
Trainer: MPIJOB
Duration: 18m
Instances:
NAME STATUS AGE IS_CHIEF GPU(Requested) NODE
---- ------ --- -------- -------------- ----
dawnbench-1x1-v6-launcher-7hshj Running 18m true 0 cn-beijing.192.168.1.137
dawnbench-1x1-v6-worker-0 Running 18m false 1 cn-beijing.192.168.1.137
dawnbench-1x1-v6-worker-1 Running 18m false 1 cn-beijing.192.168.1.138
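The status check in step 1 can also be scripted. The following is a minimal sketch, assuming the `arena list` output keeps the column layout shown above (job name in the first column, status in the second). The helper name `job_is_running` is hypothetical and not part of Arena.

```shell
# job_is_running JOB
# Reads `arena list` output on stdin and exits with status 0 only if a row
# whose first column is JOB has the value RUNNING in its second column.
# Assumption: NAME is column 1 and STATUS is column 2, as in the listing above.
job_is_running() {
    awk -v job="$1" '
        $1 == job && $2 == "RUNNING" { found = 1 }
        END { exit !found }
    '
}
```

A possible use is `arena list | job_is_running dawnbench-1x1-v6 && arena attach dawnbench-1x1-v6`, which only tries to attach when the job is reported as RUNNING.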
2. Attach the training job.
$ arena attach dawnbench-1x1-v6
Hello! Arena attach the container mpi of instance dawnbench-1x1-v6-launcher-7hshj
#
Then execute the command ls in the container:
# ls
README.md cmd.sh launch-example.sh perseus-tf-vm-demo.ipynb start.sh
benchmarks config-fp16-tf.sh.orig login.sh run_dist_example.sh
clean_caches.sh hurun_dist_example.sh perseus-tf-env.sh run_local_2gpu_example.sh
Note
- Use the '-i' option to specify the instance you want to attach to.
- Use the '-c' option to specify which container of that instance you want to attach to.
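To choose a value for '-i', the instance names can be taken from the `arena get` output. The sketch below assumes the output has an "Instances:" line followed by a header row, a dashed separator row, and one row per instance, as in the example for step 1. The helper name `instance_names` is hypothetical and not part of Arena.

```shell
# instance_names
# Reads `arena get JOB` output on stdin and prints one instance name per line.
# Assumption: instance rows follow an "Instances:" line, after a header row
# starting with NAME and a separator row made of dashes.
instance_names() {
    awk '
        /^Instances:/       { in_table = 1; next }   # start of the instance table
        in_table && NF == 0 { exit }                 # a blank line ends the table
        in_table && $1 != "NAME" && $1 !~ /^-+$/ { print $1 }
    '
}
```

For example, `arena get dawnbench-1x1-v6 | instance_names` would list the launcher and worker instances, and any of the printed names can then be passed as `arena attach dawnbench-1x1-v6 -i <INSTANCE>`.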
3. If the container of the training job cannot execute the 'sh' command but can execute 'bash', attach to the container with the following command:
$ arena attach <JOB_NAME> bash
4. If you do not need an interactive session and only want to execute a single command in the container, use a command of this form:
$ arena attach <JOB_NAME> -- <COMMAND>
For example:
$ arena attach dawnbench-1x1-v6 -- mkdir /tmpdir
Hello! Arena attach the container mpi of instance dawnbench-1x1-v6-launcher-7hshj
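The one-off form can be repeated for several instances of the same job. The following is a minimal sketch, assuming the '-i' option can be combined with the `-- <COMMAND>` form as the usage line suggests; this combination is not demonstrated in the text above. The helper name `run_in_instances` is hypothetical, and `hostname` in the usage note is only an illustrative command.

```shell
# run_in_instances JOB "INSTANCE1 INSTANCE2 ..." COMMAND [ARGS...]
# Runs COMMAND once in each listed instance, one `arena attach` call per instance.
# Assumption: `arena attach JOB -i INSTANCE -- COMMAND` is accepted by the CLI.
run_in_instances() {
    job="$1"
    instances="$2"
    shift 2
    # $instances is left unquoted on purpose so that it splits on whitespace.
    for instance in $instances; do
        arena attach "$job" -i "$instance" -- "$@"
    done
}
```

For example, `run_in_instances dawnbench-1x1-v6 "dawnbench-1x1-v6-worker-0 dawnbench-1x1-v6-worker-1" hostname` would run `hostname` in both worker instances listed in step 1.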