Resource Monitoring

Resource monitoring covers CPU, memory, disk, and network I/O usage.

Container Metrics

These metrics report the utilization of the CPU, memory, network, and disk resources assigned to a container or pod.
Note (CPU, memory, and network): In the absence of limits and reservations on containers or pods, every container and pod can use all available resources of the node on which it is deployed and running.

Container CPU Metrics

Captured metrics reflect the percentage of CPU used by the container or pod, the share used by user and kernel processes, and a breakdown of usage per core.

CPU metrics example:
{
  "time": 1554194126,
  "message": {
    "cpu_p": 41.88333333333333,
    "user_p": 25.033333333333335,
    "system_p": 16.85,
    "cpu0.p_cpu": 41.88333333333333,
    "cpu0.p_user": 25.033333333333335,
    "cpu0.p_system": 16.85,
    "ingestion_time": "2019-04-02T08:35:26+00:00",
    "tag": "tml-nosql.6c680d874676.metrics.cpu"
  }
}

| Metric Name | Field Name | Units | Data Type | Notes |
| --- | --- | --- | --- | --- |
| Total CPU consumption | cpu_p | % | Number | Total CPU usage across all cores assigned to the container, including user and kernel processes. If the container can use 4 cores, usage can reach 400%. |
| CPU consumption by user processes | user_p | % | Number | Total CPU used by user processes across all cores. |
| CPU consumption by kernel processes | system_p | % | Number | Total CPU used by kernel processes across all cores. |
| Total usage of core N | cpuN.p_cpu | % | Number | Usage of core N by user and kernel processes. |
| User processes usage of core N | cpuN.p_user | % | Number | Usage of core N by user processes. |
| Kernel processes usage of core N | cpuN.p_system | % | Number | Usage of core N by kernel processes. |
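
To illustrate how the CPU fields relate to each other, the following Python sketch groups the per-core fields of the sample record and checks that user and kernel usage add up to the total. The helper name `per_core_usage` is hypothetical, and the rule that `cpu_p` equals `user_p + system_p` is an assumption inferred from the sample values, not a documented guarantee.

```python
import re

# Fields copied from the CPU metrics example in this section.
message = {
    "cpu_p": 41.88333333333333,
    "user_p": 25.033333333333335,
    "system_p": 16.85,
    "cpu0.p_cpu": 41.88333333333333,
    "cpu0.p_user": 25.033333333333335,
    "cpu0.p_system": 16.85,
}

# Matches per-core fields such as "cpu0.p_user"; "cpu_p" does not match.
CORE_FIELD = re.compile(r"^cpu(\d+)\.p_(cpu|user|system)$")


def per_core_usage(message):
    """Group cpuN.p_* fields by core index (hypothetical helper)."""
    cores = {}
    for key, value in message.items():
        match = CORE_FIELD.match(key)
        if match:
            core, kind = int(match.group(1)), match.group(2)
            cores.setdefault(core, {})[kind] = value
    return cores


cores = per_core_usage(message)

# With N cores available, cpu_p can reach N * 100 percent.
max_cpu_percent = 100 * len(cores)

# Assumption: total usage is the sum of user and kernel usage.
total_matches = abs(message["user_p"] + message["system_p"] - message["cpu_p"]) < 1e-9
```

The sample record reports a single core, so `max_cpu_percent` is 100 here; a container with 4 cores would report `cpu0` through `cpu3` fields and a ceiling of 400.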

Container Memory

Pod/container memory metrics example:
{
  "time": 1554198840,
  "message": {
    "Mem.total": 4045520,
    "Mem.used": 3932664,
    "Mem.free": 112856,
    "Swap.total": 1928204,
    "Swap.used": 1483436,
    "Swap.free": 444768,
    "ingestion_time": "2019-04-02T09:54:00+00:00",
    "tag": "tml-log.6a6873b34d5e.metrics.mem"
  }
}

| Metric Name | Field Name | Units | Data Type | Notes |
| --- | --- | --- | --- | --- |
| Total memory (RAM) | Mem.total | bytes | Number | Total memory available to the container or pod, in bytes |
| Used memory (RAM) | Mem.used | bytes | Number | Memory used by the container, in bytes |
| Free memory (RAM) | Mem.free | bytes | Number | Available free RAM, in bytes |
| Total swap space | Swap.total | bytes | Number | Total swap space |
| Used swap space | Swap.used | bytes | Number | Used swap space |
| Free swap space | Swap.free | bytes | Number | Free swap space |
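
As a minimal sketch of how these fields might be consumed, the Python snippet below turns the sample record into RAM and swap utilization ratios. The function name `memory_utilization` is hypothetical; only the field names come from the example above.

```python
def memory_utilization(message):
    """Return (ram_ratio, swap_ratio) as fractions between 0 and 1."""
    ram_ratio = message["Mem.used"] / message["Mem.total"]
    # Guard against containers that have no swap configured.
    swap_total = message["Swap.total"]
    swap_ratio = message["Swap.used"] / swap_total if swap_total else 0.0
    return ram_ratio, swap_ratio


# Fields copied from the memory metrics example in this section.
sample = {
    "Mem.total": 4045520,
    "Mem.used": 3932664,
    "Mem.free": 112856,
    "Swap.total": 1928204,
    "Swap.used": 1483436,
    "Swap.free": 444768,
}

ram_ratio, swap_ratio = memory_utilization(sample)

# In the sample, used and free add up exactly to the total.
ram_is_consistent = sample["Mem.used"] + sample["Mem.free"] == sample["Mem.total"]
```

For the sample record this gives roughly 97% RAM utilization and roughly 77% swap utilization.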

Container Disk

Captured metrics reflect the number of bytes read and written as of the time of capture.

Pod/container disk metrics example:
{
  "time": 1554193560,
  "message": {
    "read_size": 7029587968,
    "write_size": 14102749184,
    "ingestion_time": "2019-04-02T08:26:00+00:00",
    "tag": "tml-log.6a6873b34d5e.metrics.disk"
  }
}

| Metric Name | Field Name | Units | Data Type |
| --- | --- | --- | --- |
| Total bytes read from disk | read_size | bytes | Number |
| Total bytes written to disk | write_size | bytes | Number |
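
If `read_size` and `write_size` are running totals, as the "Total bytes" wording suggests (an assumption, not something the example confirms), throughput can be estimated from two consecutive samples. The sketch below uses a hypothetical helper, `disk_throughput`, and a made-up second sample purely for illustration.

```python
def disk_throughput(previous, current):
    """Return (read_bytes_per_s, write_bytes_per_s) between two samples."""
    elapsed = current["time"] - previous["time"]
    if elapsed <= 0:
        raise ValueError("samples must be in chronological order")
    prev_msg, curr_msg = previous["message"], current["message"]
    read_rate = (curr_msg["read_size"] - prev_msg["read_size"]) / elapsed
    write_rate = (curr_msg["write_size"] - prev_msg["write_size"]) / elapsed
    return read_rate, write_rate


# First sample: values copied from the disk metrics example in this section.
first = {
    "time": 1554193560,
    "message": {"read_size": 7029587968, "write_size": 14102749184},
}

# Hypothetical second sample taken 60 seconds later; the increments are
# invented for illustration and do not come from a real system.
second = {
    "time": 1554193620,
    "message": {
        "read_size": 7029587968 + 6_000_000,
        "write_size": 14102749184 + 12_000_000,
    },
}

read_rate, write_rate = disk_throughput(first, second)
```

If the counters reset, for example after a container restart, the difference becomes negative and the sample pair should be discarded rather than reported.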

Container Network

Network metrics are available per network interface, such as eth0 or lo. The captured metrics reflect the transmit and receive sizes as of the time of capture.

Pod/container network metrics example:
{
  "time": 1554199020,
  "message": {
    "eth0.rx.bytes": 516319,
    "eth0.rx.packets": 1062,
    "eth0.rx.errors": 0,
    "eth0.tx.bytes": 61578,
    "eth0.tx.packets": 893,
    "eth0.tx.errors": 0,
    "ingestion_time": "2019-04-02T09:57:00+00:00",
    "tag": "tml-log.6a6873b34d5e.metrics.netif"
  }
}

| Metric Name | Field Name | Units | Data Type | Notes |
| --- | --- | --- | --- | --- |
| Bytes transmitted on netif_name | netif_name.tx.bytes | bytes | Number | Total bytes transmitted on the network interface. |
| Packets transmitted on netif_name | netif_name.tx.packets | packets | Number | Total packets transmitted on the network interface. |
| Errors transmitting packets on netif_name | netif_name.tx.errors | packets | Number | Number of packets that failed to be transmitted on the network interface due to window, carrier, aborted, or heartbeat errors. |
| Bytes received on netif_name | netif_name.rx.bytes | bytes | Number | Total bytes received on the network interface. |
| Packets received on netif_name | netif_name.rx.packets | packets | Number | Total packets received on the network interface. |
| Errors receiving packets on netif_name | netif_name.rx.errors | packets | Number | Number of packets dropped. |
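
Because every field name embeds the interface name, a consumer usually needs to regroup the flat record per interface. The following Python sketch shows one way to do that; `group_by_interface` is a hypothetical helper, and the `<interface>.<rx|tx>.<counter>` layout is inferred from the example above.

```python
def group_by_interface(message):
    """Regroup flat "<netif>.<rx|tx>.<counter>" fields per interface."""
    interfaces = {}
    for key, value in message.items():
        # Split from the right so interface names containing dots survive.
        parts = key.rsplit(".", 2)
        if len(parts) != 3 or parts[1] not in ("rx", "tx"):
            continue  # skips ingestion_time, tag, and unrelated fields
        name, direction, counter = parts
        interfaces.setdefault(name, {}).setdefault(direction, {})[counter] = value
    return interfaces


# Fields copied from the network metrics example in this section.
message = {
    "eth0.rx.bytes": 516319,
    "eth0.rx.packets": 1062,
    "eth0.rx.errors": 0,
    "eth0.tx.bytes": 61578,
    "eth0.tx.packets": 893,
    "eth0.tx.errors": 0,
    "ingestion_time": "2019-04-02T09:57:00+00:00",
    "tag": "tml-log.6a6873b34d5e.metrics.netif",
}

stats = group_by_interface(message)
```

After regrouping, `stats["eth0"]["rx"]["bytes"]` holds the received byte count, and error counters can be compared against packet counters per interface.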

Common Process Metrics

{
  "time": 1554199440,
  "message": {
    "alive": true,
    "proc_name": "td-agent-bit",
    "pid": 2156,
    "mem.VmPeak": 83856000,
    "mem.VmSize": 83852000,
    "mem.VmLck": 0,
    "mem.VmHWM": 7416000,
    "mem.VmRSS": 3412000,
    "mem.VmData": 31028000,
    "mem.VmStk": 132000,
    "mem.VmExe": 4184000,
    "mem.VmLib": 5352000,
    "mem.VmPTE": 140000,
    "mem.VmSwap": 2040000,
    "fd": 65,
    "ingestion_time": "2019-04-02T10:04:00+00:00",
    "tag": "tml-log.6a6873b34d5e.metrics.proc.td-agent-bit"
  }
}

| Metric Name | Field Name | Unit | Data Type | Notes |
| --- | --- | --- | --- | --- |
| Process status | alive | | Boolean | Is the process running? |
| Process name | proc_name | | String | Name of the process as identified by /proc/pid/cmd |
| Peak virtual memory usage | mem.VmPeak | bytes | Number | Maximum memory used by this process so far |
| Virtual memory size | mem.VmSize | bytes | Number | |
| Current mlocked memory | mem.VmLck | bytes | Number | Amount of memory locked by the process. This memory is released after the process exits. |
| Peak RAM used | mem.VmHWM | bytes | Number | |
| Current RAM being used | mem.VmRSS | bytes | Number | |
| Size of "data" segment | mem.VmData | bytes | Number | |
| Size of stack | mem.VmStk | bytes | Number | |
| Size of "text" segment | mem.VmExe | bytes | Number | |
| Shared library memory usage | mem.VmLib | bytes | Number | |
| Current swap space used | mem.VmSwap | bytes | Number | |
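
As a rough sketch of how a process record might be condensed for a dashboard, the Python snippet below compares current RAM usage with its peak and flags swap usage. The helper name `summarize_process` and the shape of the summary are hypothetical; the field names are taken from the example above.

```python
def summarize_process(message):
    """Condense a process metrics record into a few health indicators."""
    peak_rss = message["mem.VmHWM"]
    return {
        "name": message["proc_name"],
        "alive": message["alive"],
        # Current RAM relative to the highest RAM usage seen so far.
        "rss_to_peak": message["mem.VmRSS"] / peak_rss if peak_rss else 0.0,
        # True when part of the process has been moved to swap.
        "swapped": message["mem.VmSwap"] > 0,
    }


# Fields copied from the process metrics example in this section.
message = {
    "alive": True,
    "proc_name": "td-agent-bit",
    "pid": 2156,
    "mem.VmHWM": 7416000,
    "mem.VmRSS": 3412000,
    "mem.VmSwap": 2040000,
}

summary = summarize_process(message)
```

For the sample record, the process is alive, currently holds a little under half of its peak RAM, and has some memory in swap.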

Process List

Processes on all containers

| Process Name | Description |
| --- | --- |
| Containeragent | The Mashery Local container agent, which manages all processes running inside a container. |
| td-agent-bit | The log and metrics forwarder. It forwards all logs to the Log service. |
| syslog-ng | Supervisor + worker. |

Per Container Processes

| Container Name | Process Name | Description |
| --- | --- | --- |
| TM | proxy | Traffic Manager Proxy (embedded Jetty) |
| Sql | Jetty | On-Prem Loader, which syncs with MOM in tethered mode |
| Sql | mysqld | Service for MySQL |
| NoSql (seed and non-seed) | Cassandra | |
| NoSql (only on non-seed) | Jetty | Jetty server hosting the ML Registry Java webapp |
| Cache | Memcached | 6 processes, 1 each for pools 11211, 11212, 11213, 11214, 11215, and 11216 |
| Cache | pxrt | The memcache loader, which keeps memcache up to date with changes to service definitions, packages, and so on. |
| API | lighttpd | CGI server supporting PHP CGI |
| API | memcached | 2 processes, 1 each for pools 11211 and 11214 |
| API | pxrt | Embedded Jetty server hosting the V3 API |
| API | php-cgi | ~20 php-cgi processes, which are workers that execute a V2 API request |
| CM | Jetty | Jetty server hosting the certificate manager Java webapp. |
| logservice | td-agent | Log collector and forwarder. Grabs logs from other containers and forwards them to a user-chosen destination. 1 supervisor + 9 workers |
| logservice | java | Process which syncs access logs to TIBCO Cloud Mashery in "tethered" mode. |

Diagnostic Recipe / Alerts

Metric: Is Process Alive
Field Name / Computation: alive
Notes: The process status metrics are captured every minute.
  • Low water mark: the first time `alive=false` is reported.
  • High water mark: the condition continues for the next 5 minutes, that is, `alive=false` for the next 5 times the process metrics are gathered.

Metric: Continuous high memory usage
Field Name / Computation: mem.VmHWM / ? > .8
Notes (Process / Expected usage / Water mark):
  • memcached 11214 / This memcached pool takes up more memory than the other pools. / low water mark
  • td-agent-bit / Memory utilization should be on the order of MBs. / low water mark
  • Traffic Manager (javaproxy) / Memory utilization shows spikes and troughs, but the average utilization will be mid-range to low water mark. / low water mark
  • MySqld /
  • Cassandra / During replication cycles, Cassandra uses more memory, for example > low water mark
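
The process-alive rule above can be sketched in Python as follows. This is one possible reading, not the product's implementation: the first `alive=false` sample raises the low water mark, and the alert escalates to the high water mark once the failure has persisted for 5 further consecutive samples. The function name `classify_alive_samples` and the `high_after` parameter are hypothetical.

```python
def classify_alive_samples(samples, high_after=5):
    """Return 'ok', 'low', or 'high' for each alive sample, in order."""
    consecutive_down = 0
    levels = []
    for alive in samples:
        if alive:
            # Any successful sample clears the alert.
            consecutive_down = 0
            levels.append("ok")
            continue
        consecutive_down += 1
        # The first failure is the low water mark; it escalates once the
        # failure has persisted for `high_after` further samples.
        levels.append("high" if consecutive_down > high_after else "low")
    return levels


# One sample per minute: the process goes down, stays down, then recovers.
samples = [True, False, False, False, False, False, False, True]
levels = classify_alive_samples(samples)
```

Resetting the counter on any `alive=true` sample means a process that flaps between states never reaches the high water mark; whether that is desirable depends on the alerting policy.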