Cluster Sizing

Cluster sizing recommends the number of pods of each type required to handle different rates of Queries Per Second (QPS).

Pod or Container Sizing

Pod sizing covers two aspects: the resources required by a pod, and the number of pods required for a given QPS and a given number of OAuth refresh token requests.

The sizing guidelines are generic; actual resource requirements vary with factors such as the average payload size per traffic request and response, the size of the configuration, and the number of OAuth requests (refresh and create tokens).

Because of the upstream feature of td-agent-bit, the number of Log pods is close to the number of TM pods.

Pod Characteristics

Pod Type | Memory Usage | CPU Usage | Storage Usage | Network Usage
TML-NoSQL | Normal to High (in case of large OAuth traffic) | Normal | High in case of large OAuth traffic | High, due to registry activity
TML-CM | High (hosts 4 services) | Normal | Low | Low
TML-LOG | High | High | Very high | Very high
TML-SQL | Normal | Normal | Normal | Normal
TML-Cache | High | Normal | Low | High (depending on traffic calls)
TML-TM | Normal (will see a lot of G1GC young GC activity) | High | Very low | High
TML-Reporting | High | High | Very high | High

Limits and Requests

Requests are the initial allocation of resources, and limits define the maximum memory or CPU a pod can utilise.

You can define limits for CPU/CPU time and for memory. When defining limits, a general recommendation is to set the value of requests to half of the limit. A general rule of thumb is to allow roughly 20% headroom for memory and CPU and to ensure that the application is not paging.
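As a minimal illustration of this rule, requests and limits can be applied to a deployment with kubectl. The deployment name tm-deploy-0 is taken from the HPA examples later in this section, and the values shown are illustrative only, not a sizing recommendation:

# Set requests to half of the limits, per the rule of thumb above (illustrative values)
kubectl set resources deployment tm-deploy-0 \
  --requests=cpu=500m,memory=500Mi \
  --limits=cpu=1000m,memory=1000Mi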

By defining resource limits, you gain the following benefits:
  • Prevents resource starvation: pods and containers consume resources, and one pod can otherwise consume more than its share, leaving other pods starved; a starved pod is restarted.
  • Contains memory leaks in the application, which would otherwise drain nodes of memory.
  • Optimizes the use of resources instead of over-provisioning.
  • Allows automatic horizontal scaling of Cache, Log, and Traffic Manager (TM) pods.

For more information about units of resources, see the Kubernetes documentation on resource units for pods and containers.

Kubernetes Pod Sizing

Caution:
  1. The following data is indicative and is provided for guidance only.
  2. You can choose to vertically scale by providing higher CPU and memory requests and limits.
Idle state utilization observed
Service | Number of Pods | Expected Memory per Pod | Expected CPU per Pod
Cluster manager | 1 | 1GB | 500m (0.5 CPU)
SQL | 1 | 1GB | 100m (0.1 CPU)
NoSQL | 1 | 2GB | 500m (0.5 CPU)
Cache | 1 | 500MB | 100m (0.1 CPU)
Log | 1 | 1GB | 500m (0.5 CPU)
TM | 1 | 500MB | 500m (0.5 CPU)
Reporting | 1 | 1GB | 100m (0.1 CPU)
500 QPS
Service | Number of Pods | Observed Memory per Pod | Observed CPU per Pod | Memory Request | Memory Limit | CPU Request | CPU Limit
Cluster manager | 1 | 1GB | 500m (0.5 CPU) | 1000Mi | 1250Mi | 500m | 625m
SQL | 1 | 1GB | 500m (0.5 CPU) | 1000Mi | 1250Mi | 500m | 625m
NoSQL | 1 | 1GB | 1200m (1.2 CPU) | 1000Mi | 1250Mi | 1000m | 1500m
Cache | 1 | 500MB | 500m (0.5 CPU) | 500Mi | 625Mi | 500m | 625m
Log | 1 | 2GB | 1200m (1.2 CPU) | 1000Mi | 2500Mi | 1000m | 1500m
TM | 1 | 1.5GB | 1500m (1.5 CPU) | 1000Mi | 1900Mi | 1000m | 1900m
Reporting | 1 | 1GB | 500m (0.5 CPU) | 1000Mi | 1250Mi | 500m | 625m
1500 QPS. In this scenario, 50% of the traffic is protected by OAuth, and an additional 750 QPS consists of refresh token requests.
Service | Number of Pods | Observed Memory per Pod | Observed CPU per Pod | Memory Request | Memory Limit | CPU Request | CPU Limit
Cluster manager | 1 | 1GB | 500m (0.5 CPU) | 1000Mi | 1250Mi | 500m | 625m
SQL | 1 | 1GB | 500m (0.5 CPU) | 1000Mi | 1250Mi | 500m | 625m
NoSQL | 3 | 1GB | 1200m (1.2 CPU) | 1000Mi | 1250Mi | 1000m | 1500m
Cache | 2 | 1GB | 500m (0.5 CPU) | 1000Mi | 1250Mi | 500m | 625m
Log | 4 | 2GB | 1200m (1.2 CPU) | 1000Mi | 2500Mi | 1000m | 1500m
TM | 5 | 1.2GB | 1200m (1.2 CPU) | 1000Mi | 1500Mi | 1000m | 1500m
Reporting | 1 | 1GB | 500m (0.5 CPU) | 1000Mi | 1250Mi | 500m | 625m
3000 QPS. In this scenario, 50% of the traffic is protected by OAuth, and an additional 1500 QPS consists of refresh token requests.
Service | Number of Pods | Observed Memory per Pod | Observed CPU per Pod | Memory Request | Memory Limit | CPU Request | CPU Limit
Cluster manager | 1 | 1GB | 500m (0.5 CPU) | 1000Mi | 1250Mi | 500m | 625m
SQL | 1 | 1GB | 500m (0.5 CPU) | 1000Mi | 1250Mi | 500m | 625m
NoSQL | 3 | 1GB | 1200m (1.2 CPU) | 1000Mi | 1250Mi | 1000m | 1500m
Cache | 2 | 1GB | 500m (0.5 CPU) | 500Mi | 1250Mi | 500m | 625m
Log | 8 | 2.2GB | 1200m (1.2 CPU) | 1000Mi | 2750Mi | 1000m | 1500m
TM | 10 | 1.2GB | 1200m (1.2 CPU) | 1000Mi | 1500Mi | 1000m | 1500m
Reporting | 1 | 1.5GB | 3000m (3 CPU) | 1000Mi | 1900Mi | 500m | 3750m
3000 QPS with TM vertically scaled and varied payload sizes.
Caution:
  • All traffic was open, that is, not authenticated via OAuth, but it was controlled.
  • Request and response payloads were randomly varied between 1 KB and 1 MB.
  • Traffic Manager pods were provided with 2000m CPU time instead of 1000m (1 CPU).
  • Log pods required higher memory because disk writes did not scale with the larger payloads, and RAM was used to hold log events waiting to be written.
Service | Number of Pods | Observed Memory per Pod | Observed CPU per Pod | Memory Request | Memory Limit | CPU Request | CPU Limit
Cluster manager | 1 | 1GB | 500m (0.5 CPU) | 1000Mi | 1250Mi | 500m | 625m
SQL | 1 | 1GB | 500m (0.5 CPU) | 1000Mi | 1250Mi | 500m | 625m
NoSQL | 3 | 1GB | 1200m (1.2 CPU) | 1000Mi | 1250Mi | 1000m | 1500m
Cache | 2 | 1GB | 500m (0.5 CPU) | 500Mi | 1250Mi | 500m | 625m
Log | 8 | 3GB | 1200m (1.2 CPU) | 1000Mi | 3750Mi | 1000m | 1500m
TM | 10 | 1.2GB | 1200m (1.2 CPU) | 1000Mi | 1500Mi | 1000m | 2500m
Reporting | 1 | 1.5GB | 3000m (3 CPU) | 1000Mi | 1900Mi | 500m | 3750m
Caution: Pod performance depends on factors such as the network in most cases, and on storage speed, especially for Log pods. Therefore, provide higher resources to pods based on their characteristics.

The requirements shown here for different QPS levels are based on actual observation, and the number of pods for TM, Cache, and Log was determined by the following Horizontal Pod Autoscaler (HPA) rules:

kubectl autoscale deployment tm-deploy-0 --min=1 --max=5 --cpu-percent=80
kubectl autoscale statefulsets log-set-0 --min=1 --max=5 --cpu-percent=80
kubectl autoscale statefulsets cache-set-0 --min=1 --max=5 --cpu-percent=80
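After the autoscalers are created, their CPU targets and current replica counts can be verified with:

# Lists each Horizontal Pod Autoscaler with its target utilization and current replicas
kubectl get hpa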

Pod Placement on Nodes

If you prefer not to set resource requests and limits, the resource usage characteristics of each pod should dictate the placement of workloads on nodes. A high availability deployment also requires that pods of the same type do not end up on the same node.
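As a sketch of how such a spread can be enforced, assuming the TM deployment is named tm-deploy-0 (as in the HPA examples above) and that its pods carry an app: tm label (an assumption; use the labels your deployment actually sets), a pod anti-affinity rule can be patched into the pod template so that no two TM pods schedule onto the same node:

# "app: tm" is an assumed pod label; adjust it to match your deployment's labels
kubectl patch deployment tm-deploy-0 --patch '
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: tm
            topologyKey: kubernetes.io/hostname'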

Network Considerations

Local Edition is compatible with all CNCF-certified CNIs.

Make sure the pod network is initialized with a unique CIDR range. Services should be properly deployed with unique service IPs for pod-to-pod communication.
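As an example only, and assuming the cluster is bootstrapped with kubeadm (an assumption; Local Edition does not require a particular bootstrap tool), unique pod and service CIDRs can be specified at initialization time. The CIDR values below are placeholders and must not overlap with your node or service networks:

# Placeholder CIDRs; choose ranges that do not clash with existing networks
kubeadm init --pod-network-cidr=10.244.0.0/16 --service-cidr=10.96.0.0/12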

Storage Considerations

To enable workloads to be shifted to other nodes and to allow the addition of new pods, use dynamic provisioning with storage provisioners instead of manually creating persistent volumes.
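As a minimal sketch, assuming a provisioner-backed storage class named standard exists in the cluster (the class name, claim name log-data, and size below are placeholders), a persistent volume claim can rely on dynamic provisioning by referencing the storage class rather than a manually created persistent volume:

# List the storage classes available in the cluster
kubectl get storageclass

# "log-data" and "standard" are placeholder names; the requested size is illustrative
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: log-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: standard
  resources:
    requests:
      storage: 10Gi
EOF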