
Display GPU Usage For Nodes

The arena top node command displays GPU resource consumption for the nodes in your cluster.

Supported GPU Modes

The arena top node command can display node details for nodes in different GPU modes. Four GPU modes are currently supported:

  • none: the node has no GPUs.
  • exclusive: the node has GPUs and owns the Kubernetes extended resource "nvidia.com/gpu".
  • share: the node has GPUs and owns the Kubernetes extended resource "aliyun.com/gpu-mem".
  • topology: the node has GPUs and owns the Kubernetes extended resource "aliyun.com/gpu".
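
Which mode a node is in can be inferred from the extended resources it advertises. If you have kubectl access to the cluster, you can inspect them directly (<node-name> is a placeholder):

$ kubectl get node <node-name> -o jsonpath='{.status.allocatable}'

Exclusive-mode nodes list "nvidia.com/gpu" in the result, while share-mode nodes list "aliyun.com/gpu-mem".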

Usage

1. The following command displays GPU resource consumption for all nodes:

$ arena top node
NAME                      IPADDRESS      ROLE    STATUS    GPU(Total)  GPU(Allocated)
cn-beijing.192.168.8.10   192.168.8.10   <none>  Ready     0           0
virtual-kubelet           172.27.5.28    agent   Ready     0           0
cn-beijing.192.168.1.135  192.168.1.135  <none>  NotReady  1           0
cn-beijing.192.168.1.136  192.168.1.136  <none>  NotReady  1           0
cn-beijing.192.168.1.137  192.168.1.137  <none>  Ready     1           0
cn-beijing.192.168.8.3    192.168.8.3    <none>  Ready     1           1
---------------------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
1/4 (25.0%)
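
To cross-check these totals against what Kubernetes itself reports, a standard kubectl query works; this sketch assumes exclusive-mode nodes, and nodes without the resource show <none>:

$ kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'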

2. If you only care about GPU resource consumption for certain nodes, you can specify them by name:

$ arena top node cn-beijing.192.168.8.10 cn-beijing.192.168.1.136
NAME                      IPADDRESS      ROLE    STATUS    GPU(Total)  GPU(Allocated)
cn-beijing.192.168.8.10   192.168.8.10   <none>  Ready     0           0
cn-beijing.192.168.1.136  192.168.1.136  <none>  NotReady  1           0

3. If you want to display GPU resource consumption only for nodes with a specific GPU mode, you can filter with '-m':

$ arena top node -m e
NAME                      IPADDRESS      ROLE    STATUS    GPU(Total)  GPU(Allocated)
cn-beijing.192.168.1.135  192.168.1.135  <none>  NotReady  1           0
cn-beijing.192.168.1.136  192.168.1.136  <none>  NotReady  1           0
cn-beijing.192.168.1.137  192.168.1.137  <none>  Ready     1           0
cn-beijing.192.168.8.3    192.168.8.3    <none>  Ready     1           1
---------------------------------------------------------------------------------------------------
Allocated/Total GPUs of nodes which own resource nvidia.com/gpu In Cluster:
1/4 (25.0%)

"e" represents "exclusive", This command only display node which owns kubernetes resource "nvidia.com/gpu". The following command can help you to get the supported gpu modes:

$ arena top node  -h | grep mode
  -m, --gpu-mode string   Display node information with following gpu mode:[n(none)|e(exclusive)|t(topology)|s(share)]
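
For example, to list only the share-mode nodes (those that own "aliyun.com/gpu-mem"), run the following; the output is omitted here because it depends on your cluster:

$ arena top node -m s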

4. If you want more detailed information about a node, "-d" is required:

$ arena top node -d cn-beijing.192.168.8.3

Name:    cn-beijing.192.168.8.3
Status:  Ready
Role:    <none>
Type:    GPUExclusive
Address: 192.168.8.3
Description:
  1.This node is enabled gpu exclusive mode.
  2.Pods can request resource 'nvidia.com/gpu' to use gpu exclusive feature on this node
Instances:
  NAMESPACE  NAME                                                       STATUS   GPU(Requested)
  ---------  ----                                                       ------   --------------
  default    fast-style-transfer-alpha-custom-serving-856dbcdbcb-j2vv4  Running  1
GPU Summary:
  Total GPUs:     1
  Allocated GPUs: 1
  Unhealthy GPUs: 0
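
As the Description above notes, pods consume GPUs on an exclusive-mode node by requesting 'nvidia.com/gpu'. A minimal sketch of such a pod follows; the pod name and image are placeholders, not something Arena provides:

$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test                                 # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.0.3-base-ubuntu20.04   # any CUDA-capable image works
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1                        # request one whole GPU
EOF

Once the pod is scheduled onto the node, it shows up in the Instances table and GPU(Allocated) increases by one.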

5. If you need to monitor nodes in real time, use "-r" (it must be combined with "-d"); in the example below the display refreshes about every two seconds:

$ arena top node  cn-beijing.192.168.8.3 -r -d

Name:    cn-beijing.192.168.8.3
Status:  Ready
Role:    <none>
Type:    GPUExclusive
Address: 192.168.8.3
Description:
  1.This node is enabled gpu exclusive mode.
  2.Pods can request resource 'nvidia.com/gpu' to use gpu exclusive feature on this node
Instances:
  NAMESPACE  NAME                                                       STATUS   GPU(Requested)
  ---------  ----                                                       ------   --------------
  default    fast-style-transfer-alpha-custom-serving-856dbcdbcb-j2vv4  Running  1
GPU Summary:
  Total GPUs:     1
  Allocated GPUs: 1
  Unhealthy GPUs: 0

------------------------- 2021-02-22 17:26:29 -------------------------------------

Name:    cn-beijing.192.168.8.3
Status:  Ready
Role:    <none>
Type:    GPUExclusive
Address: 192.168.8.3
Description:
  1.This node is enabled gpu exclusive mode.
  2.Pods can request resource 'nvidia.com/gpu' to use gpu exclusive feature on this node
Instances:
  NAMESPACE  NAME                                                       STATUS   GPU(Requested)
  ---------  ----                                                       ------   --------------
  default    fast-style-transfer-alpha-custom-serving-856dbcdbcb-j2vv4  Running  1
GPU Summary:
  Total GPUs:     1
  Allocated GPUs: 1
  Unhealthy GPUs: 0

------------------------- 2021-02-22 17:26:31 -------------------------------------

6. Arena can also show GPU metrics for nodes when "--metric" is enabled. This feature requires that Prometheus and the gpu-exporter are already deployed in the cluster (a quick way to check for them follows the example output).

$ arena top node cn-beijing.192.168.8.3 --metric -d

Name:    cn-beijing.192.168.8.3
Status:  Ready
Role:    <none>
Type:    GPUExclusive
Address: 192.168.8.3
Description:
  1.This node is enabled gpu exclusive mode.
  2.Pods can request resource 'nvidia.com/gpu' to use gpu exclusive feature on this node
Instances:
  NAMESPACE  NAME                                                       STATUS   GPU(Requested)  GPU(Allocated)
  ---------  ----                                                       ------   --------------  --------------
  default    fast-style-transfer-alpha-custom-serving-856dbcdbcb-j2vv4  Running  1               gpu0
GPUs:
  INDEX  MEMORY(Total)  MEMORY(Allocated)  MEMORY(Used)  DUTY_CYCLE
  -----  -------------  -----------------  ------------  ----------
  0      14.7 GiB       14.7 GiB           0.0 GiB       0.0%
GPU Summary:
  Total GPUs:           1
  Allocated GPUs:       1
  Unhealthy GPUs:       0
  Total GPU Memory:     14.7 GiB
  Allocated GPU Memory: 14.7 GiB
  Used GPU Memory:      0.0 GiB
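
Before enabling "--metric", you can verify that both components are present. A rough check, assuming they run in the kube-system namespace (deployment names and namespaces vary by installation):

$ kubectl get pods -n kube-system | grep -Ei 'prometheus|gpu'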