
Submit a model profile job

This guide walks through the steps required to profile a PyTorch TorchScript module.

1. The first step is to check the available resources:

$ arena top node

NAME                       IPADDRESS      ROLE    STATUS  GPU(Total)  GPU(Allocated)
cn-shenzhen.192.168.1.209  192.168.1.209  <none>  Ready   1           0
cn-shenzhen.192.168.1.210  192.168.1.210  <none>  Ready   1           0
cn-shenzhen.192.168.1.211  192.168.1.211  <none>  Ready   1           0
---------------------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/3 (0.0%)

There are 3 nodes with available GPUs for running the model profile job.

2. Prepare the model to profile and its configuration.

In this example, we will profile a PyTorch resnet18 model. First, we need to save the resnet18 model as a TorchScript module.

import torch
import torchvision

# An instance of your model.
model = torchvision.models.resnet18()

# An example input you would normally provide to your model's forward() method.
dummy_input = torch.rand(1, 3, 224, 224)

# Use torch.jit.trace to generate a torch.jit.ScriptModule via tracing.
traced_script_module = torch.jit.trace(model, dummy_input)

torch.jit.save(traced_script_module, "resnet18.pt")

Then create a profile configuration file named config.json, like the one below.

{
  "model_name": "resnet18",
  "model_platform": "torchscript",
  "model_path": "/data/models/resnet18/resnet18.pt",
  "inputs": [
    {
      "name": "input",
      "data_type": "float32",
      "shape": [1, 3, 224, 224]
    }
  ],
  "outputs": [
    {
      "name": "output",
      "data_type": "float32",
      "shape": [ 1000 ]
    }
  ]
}

3. Submit a model profile job.

$ arena model analyze profile \
    --name=resnet18-profile \
    --namespace=default \
    --image=registry.cn-beijing.aliyuncs.com/kube-ai/easy-inference:1.0.0 \
    --image-pull-policy=Always \
    --gpus=1 \
    --loglevel=debug \
    --data=model-pvc:/data \
    --model-config-file=/data/models/resnet18/config.json \
    --report-path=/data/models/resnet18/log \
    --tensorboard \
    --tensorboard-image=registry.cn-beijing.aliyuncs.com/kube-ai/easy-inference:1.0.0

service/resnet18-profile-tensorboard created
deployment.apps/resnet18-profile-tensorboard created
job.batch/resnet18-profile created
INFO[0001] The model profile job resnet18-profile has been submitted successfully
INFO[0001] You can run `arena model analyze get resnet18-profile` to check the job status

4. List all the profile jobs.

$ arena model analyze list

NAMESPACE  NAME              STATUS   TYPE     DURATION  AGE  GPU(Requested)
default    resnet18-profile  RUNNING  Profile  34s       34s  1

5. Get the model profile job's detailed info.

$ arena model analyze get resnet18-profile
Name:       resnet18-profile
Namespace:  default
Type:       Profile
Status:     RUNNING
Duration:   57s
Age:        57s
Parameters:
  --model-config-file  /data/models/resnet18/config.json
  --report-path        /data/models/resnet18/log


Instances:
  NAME                    STATUS             AGE  READY  RESTARTS  NODE
  ----                    ------             ---  -----  --------  ----
  resnet18-profile-z4zvr  ContainerCreating  57s  0/1    0         cn-shenzhen.192.168.1.210

6. Use TensorBoard to view the profile result.

$ kubectl get svc
resnet18-profile-tensorboard     NodePort    172.16.37.74   <none>        6006:32744/TCP   14m

$ kubectl port-forward svc/resnet18-profile-tensorboard 6006:6006
Forwarding from 127.0.0.1:6006 -> 6006
Forwarding from [::1]:6006 -> 6006

(Screenshot: the TensorBoard UI at http://localhost:6006 showing the profiling result.)