FAQs

How do I get Grafana's URL?
For k8s cluster:
  1. Run:
    kubectl get svc
    This lists all the services along with the external IPs of their Load Balancers. Select the Load Balancer IP of the reporting app named "reporting-app-0".
  2. Access the Grafana dashboard: http://<External_IP_Of_Loadbalancer_Of_ReportingApp>:3000
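  For illustration only, the output of kubectl get svc might look similar to the following (the service addresses, ports, and age shown here are placeholder values, not actual output):
    kubectl get svc
    NAME              TYPE           CLUSTER-IP    EXTERNAL-IP    PORT(S)          AGE
    reporting-app-0   LoadBalancer   10.0.12.34    203.0.113.25   3000:31234/TCP   2d
  With this hypothetical output, the Grafana dashboard would be reachable at http://203.0.113.25:3000.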
For OpenShift cluster:
  1. Run:
    oc get svc
    This lists all the services along with the external IPs of their Load Balancers. Select the Load Balancer IP of the reporting app named "reporting-app-0".
  2. Access the Grafana dashboard: http://<External_IP_Of_Loadbalancer_Of_ReportingApp>:3000
For Swarm cluster:
  1. Access the Grafana dashboard using the Swarm master's public IP: http://<SWARM_MASTER_IP>:3000
How do I label a node for reporting in a k8s or OpenShift cluster?
For k8s cluster:
kubectl label nodes <nodename> node-name=reporting
For OpenShift cluster:
oc label nodes <nodename> node-name=reporting
Note: Also refer to Prerequisites for more information.
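For example, assuming the node chosen to host reporting is named worker-node-1 (a placeholder name), the label would be applied on a k8s cluster as follows:
kubectl label nodes worker-node-1 node-name=reporting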
How do I see the labels on a node?
To see the labels on the nodes for k8s cluster:
kubectl get nodes --show-labels
To see labels on the nodes for OpenShift cluster:
oc get nodes --show-labels
Verify that one of the nodes is labelled as "node-name=reporting".
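As a hypothetical example, a correctly labelled node would appear in the output with the label attached (the node name, age, version, and the other labels below are placeholders):
kubectl get nodes --show-labels
NAME            STATUS   ROLES    AGE   VERSION   LABELS
worker-node-1   Ready    <none>   12d   v1.21.3   node-name=reporting,kubernetes.io/hostname=worker-node-1,...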
How do I find the node running the reporting pod/container?
To see which node the reporting pod is running on in a k8s cluster:
kubectl get pods -o wide
To see which node the reporting pod is running on in an OpenShift cluster:
oc get pods -o wide
To see which node the reporting container is running on in a Swarm cluster, run the following command on the node that has a placement constraint for the reporting container:
docker ps
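For illustration, on a k8s cluster the NODE column of the output identifies where the reporting pod is scheduled (the node name, IP, and age below are placeholders):
kubectl get pods -o wide
NAME                READY   STATUS    RESTARTS   AGE   IP           NODE
reporting-set-0-0   1/1     Running   0          2d    10.42.0.15   worker-node-1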
Where can logs for different services in the reporting-pod be checked?
There are four different services running on the reporting pod/container:
  • Grafana's logs can be checked at /var/log/grafana/
  • Prometheus's logs can be checked at /var/log/prometheus/
  • Loki's logs can be checked at /var/log/loki/
  • Fluentd's logs can be checked at /var/log/fluentd/

You can also check the entry point logs at /var/log/reporting_entrypoint.log to see where the configuration for each service is picked up from.
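For example, on a k8s cluster you can follow one of these logs without opening a shell in the pod; the exact log file name under each directory may differ, so the Grafana file name below is only an assumption:
kubectl exec -it reporting-set-0-0 -- tail -f /var/log/grafana/grafana.log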

What is the retention period of different types of data in the reporting container?
  • Retention for metrics related to traffic and access logs is 1 year.
  • Retention for all other metrics is 4 days.
  • Retention for verbose metadata is 1 day.
  • Retention for the container's log data is 4 days.
How do I change the retention of metrics in Prometheus?
Create a file cleanup_prometheus_data.ini and add the following details:
[DELETION_PERIOD]
# Deletion period of Prometheus data, in days (the number of days for which data is kept)
VERBOSE_METRICS_DATA_DELETION={number_of_days}
PROCESS_METRICS_DATA_DELETION={number_of_days}
Save this file and follow instructions in the TML-Reporting Configuration topic to apply the changes during deployment.
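As an illustrative example only, a cleanup_prometheus_data.ini that keeps verbose metrics for 2 days and all other (process) metrics for 7 days would look like this (the values are placeholders, not recommendations):
[DELETION_PERIOD]
# Deletion period of Prometheus data, in days (the number of days for which data is kept)
VERBOSE_METRICS_DATA_DELETION=2
PROCESS_METRICS_DATA_DELETION=7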
How do I change the retention of metrics in Loki?
Create a file loki-docker-config.yaml and paste the following content. Replace each "{hours e.g. 96h}" placeholder with the number of hours for which application logs should be kept:
auth_enabled: false
 
server:
  http_listen_port: 3100
 
ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  chunk_idle_period: 5m
  chunk_retain_period: 30s
  max_transfer_retries: 0
  chunk_target_size: 1536000
 
schema_config:
  configs:
    - from: 2020-07-15
      store: boltdb
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: {hours e.g. 96h}
 
storage_config:
  boltdb:
    directory: /mnt/data/loki/index
  filesystem:
    directory: /mnt/data/loki/chunks
 
limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: {hours e.g. 96h}
  ingestion_rate_mb: 16
  ingestion_burst_size_mb: 16
 
chunk_store_config:
  max_look_back_period: 0s
 
table_manager:
  retention_deletes_enabled: true
  retention_period: {hours e.g. 96h}
Save this file and follow instructions in the Loki Configuration topic to apply the changes during deployment.
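For example, to keep application logs for 96 hours, replace each "{hours e.g. 96h}" placeholder in the template above with 96h, so that the last block of the file reads:
table_manager:
  retention_deletes_enabled: true
  retention_period: 96h
The same 96h value would also be used for the index period and for reject_old_samples_max_age, as shown in the template.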
In which zone will the reporting-pod be deployed in a multi-zone environment?

In a multi-zone environment, the first zone value (in the array of zone names) given in the manifest file needs to be used for labelling. This is the default zone in which reporting would be deployed, provided the node in that zone is properly labelled, as specified in the Prerequisites section.

Will the reporting-pod be displayed in the "cluster manager ls components" command?

No. The reporting pod/container is not managed by TML-cluster, so it is not listed by the Cluster Manager's "ls components" command.

How do I check the reporting-pod's status?
For k8s cluster:
kubectl get pods | grep -i reporting-set-0-0
For OpenShift cluster:
oc get pods | grep -i reporting-set-0-0
For Swarm cluster: Navigate to the node which hosts the reporting pod, then run the command:
docker ps -a | grep -i reporting
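A healthy deployment would show the pod in the Running state; for example, on a k8s cluster the output might look like this (restart count and age are placeholders):
kubectl get pods | grep -i reporting-set-0-0
reporting-set-0-0   1/1   Running   0   2d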
How is the QPS rate calculated?
Traffic/QPS Rate Computation.

QPS is the rate of traffic calls per second. It is a rate function calculated over a sliding window (default 5 minutes). In a sliding window of 5 minutes (300 seconds), the QPS is computed as `total_traffic_in_5_mins/(5*60)`. Under low traffic conditions, the actual QPS is near zero and shows up as fractional or decimal values on the graph. For example, for 60 calls in 5 minutes, the QPS would be 60/(5*60), i.e. 1/5 or 0.20. For 18K calls in 5 minutes, the QPS would be 18000/(5*60), i.e. 60 QPS. For finer granularity, reduce the window size to a smaller value (minimum 1 minute).
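The same arithmetic can be reproduced on any shell with bc; for example, for the 18K-calls case from the example above:
echo "scale=2; 18000 / (5 * 60)" | bc
60.00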

Why is CPU percentage going beyond 100%?

%CPU (CPU usage) is the percentage of your CPU that is being used by the process. By default, top displays this as a percentage of a single CPU, so on multi-core systems the percentages can be greater than 100%. For example, if 3 cores are each at 60% use, top shows a CPU usage of 180% for that process. The Total CPU Usage graph is the sum of the CPU usage of the individual processes running on that pod/container.

Why is there a different value for uptime metrics if the time range is changed?

The interval used to fetch data for a graph in Grafana changes for queries with a time range greater than 24 hours. When the selected time range is greater than 1 day, the step interval changes from the default of 1 minute to a higher interval of aggregated data, which reduces the number of data points fetched from Prometheus and thus the turnaround time for a query. Because of the change in step interval, the data fetched might be an older data point, which displays different values in the graph depending on the step interval. For example, if the step interval is 2 minutes, the last data point fetched would be from minute n-2, where n is the current minute, and the single stats panel would display only that data point. This varies depending on the time range selected in the dashboard.