KAPPA-Automate IT Infrastructure

Monitoring your Kubernetes Cluster (via Grafana Dashboards)

This dashboard provides a view of both the health and the resource utilization of your Kubernetes cluster. With the Kubernetes Overview (both K8s and VMs), you can monitor deployments and identify potential problems.

Filtering Dashboards

The dashboard includes several filtering options to help you customize your view:

1.png

(1) Data Source

(2) Node(s)

(3) Namespace

(4) Share

(5) Time Range

(6) Time Range Zoom Out

(7) Refresh

(8) Auto Refresh

In the following sections, you can learn more about different Grafana dashboards and monitoring solutions available for your Kubernetes cluster.

Nodes with disk pressure

Nodes with disk pressure – identifies nodes that are running low on disk space. It can be used to identify the following issues:

  • Failed storage devices

  • Poorly configured persistent volumes

  • Pod eviction

  • Scheduling issues

Solutions for disk pressure issues:

  • Remove unused files, logs, and temporary data from the affected node

  • Verify that the affected pod has adequate resource requests and limits

  • Resize persistent volumes

  • Use node affinity rules to distribute storage-heavy workloads across nodes
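As a cross-check on this panel, Kubernetes also reports disk pressure as a node condition. The sketch below assumes that the output of `kubectl get nodes -o json` has been saved and parsed into a dictionary. The function name `nodes_with_condition` and the sample data are illustrative only and are not part of KAPPA-Automate or Kubernetes tooling.

```python
import json


def nodes_with_condition(node_list: dict, condition_type: str = "DiskPressure") -> list:
    """Return the names of nodes whose given condition currently has status "True"."""
    flagged = []
    for node in node_list.get("items", []):
        for cond in node.get("status", {}).get("conditions", []):
            if cond.get("type") == condition_type and cond.get("status") == "True":
                flagged.append(node["metadata"]["name"])
                break
    return flagged


# Hypothetical sample shaped like `kubectl get nodes -o json` output.
sample_nodes = json.loads("""
{"items": [
  {"metadata": {"name": "node-a"},
   "status": {"conditions": [{"type": "DiskPressure", "status": "True"},
                             {"type": "Ready", "status": "True"}]}},
  {"metadata": {"name": "node-b"},
   "status": {"conditions": [{"type": "DiskPressure", "status": "False"},
                             {"type": "Ready", "status": "True"}]}}
]}
""")

print(nodes_with_condition(sample_nodes))
```

Passing `"MemoryPressure"` as the second argument applies the same check to memory pressure.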

2.png

The ellipsis on the dashboard provides additional options to interact with and manage dashboards:

3.png

(1) View – maximize dashboard

(2) Share – share options

(3) Inspect – export data, JSON, query request, query performance, raw data

(4) More – hide legend

Nodes with memory pressure

Nodes with memory pressure – identifies nodes that are running low on memory. It can be used to identify the following issues:

  • Insufficient application performance

  • Latency

  • Pod eviction

  • Unstable nodes

Solutions for memory pressure issues:

  • Verify that the affected pod has adequate resource requests and limits

  • Delete and recreate the pod

  • Increase node memory

  • Scale cluster by adding more nodes
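One way to verify requests and limits is to inspect the pod specification directly. The sketch below assumes a pod object obtained from `kubectl get pod <name> -o json`. The helper name `containers_missing_resources` and the sample pod are hypothetical.

```python
def containers_missing_resources(pod: dict, resource: str = "memory") -> list:
    """List (container name, "requests" or "limits") pairs that are not set for `resource`."""
    missing = []
    for container in pod.get("spec", {}).get("containers", []):
        resources = container.get("resources") or {}
        for kind in ("requests", "limits"):
            if resource not in (resources.get(kind) or {}):
                missing.append((container["name"], kind))
    return missing


# Hypothetical pod: "app" sets only a memory request, "sidecar" sets nothing.
sample_pod = {
    "spec": {
        "containers": [
            {"name": "app", "resources": {"requests": {"memory": "256Mi"}}},
            {"name": "sidecar"},
        ]
    }
}

print(containers_missing_resources(sample_pod))
```

A container that appears in the result has no memory request or no memory limit, so its behavior under memory pressure is harder to predict.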

4.png

The ellipsis on the dashboard provides additional options to interact with and manage dashboards:

5.png

(1) View – maximize dashboard

(2) Share – share options

(3) Inspect – export data, JSON, query request, query performance, raw data

(4) More – hide legend

Disk usage

Disk usage – shows the disk space used by each node and application. It can be used to identify the following:

  • Available disk space

  • Disk pressure

  • Excessive data storage

  • Premature disk fill

Solutions for disk usage issues:

  • Verify that the affected pod has adequate resource requests and limits

  • Add more disk space

  • Set a resource quota to limit the storage used by containers within a namespace
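A namespace storage quota can be expressed as a `ResourceQuota` object. The sketch below builds such a manifest as a dictionary and prints it as JSON, which `kubectl apply -f` accepts. The quota name, namespace, and values are placeholders chosen for illustration, and the selection of quota keys is an assumption about what you want to limit.

```python
import json


def storage_quota_manifest(namespace: str, total_storage: str,
                           max_claims: int, ephemeral_limit: str) -> dict:
    """Build a ResourceQuota manifest that caps storage use within a namespace."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "storage-quota", "namespace": namespace},
        "spec": {
            "hard": {
                # Sum of storage requested by all persistent volume claims.
                "requests.storage": total_storage,
                # Maximum number of persistent volume claims.
                "persistentvolumeclaims": str(max_claims),
                # Sum of ephemeral storage limits across all pods.
                "limits.ephemeral-storage": ephemeral_limit,
            }
        },
    }


# Placeholder values; adjust to your own namespace and capacity.
quota = storage_quota_manifest("team-a", "100Gi", 10, "20Gi")
print(json.dumps(quota, indent=2))
```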

6.png

The ellipsis on the dashboard provides additional options to interact with and manage dashboards:

7.png

(1) View – maximize dashboard

(2) Share – share options

(3) Inspect – export data, JSON, query request, query performance, raw data

Cluster node status

Cluster node status – shows the overall health of nodes and metrics. The color indicates the following states:

  • Gray – the metric or node is in an unknown state and/or is not reporting

  • Green – the node and its metrics are healthy

  • Yellow/Orange – warning; the node is experiencing an issue that is not critical

  • Red – the node is in a critical state and requires immediate attention
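The color states above can be related to the node conditions that Kubernetes reports. The rules the dashboard actually uses are not documented here, so the mapping below is purely an assumption made for illustration: an unknown `Ready` condition maps to gray, a node that is not ready maps to red, a ready node with any pressure condition maps to yellow, and everything else maps to green.

```python
def node_status_color(conditions: list) -> str:
    """Map a node's conditions to a color using an ASSUMED rule set (see lead-in)."""
    by_type = {c["type"]: c["status"] for c in conditions}
    ready = by_type.get("Ready", "Unknown")
    if ready == "Unknown":
        return "gray"      # node is not reporting or its state is unknown
    if ready == "False":
        return "red"       # node is not ready: treat as critical
    pressure_types = ("DiskPressure", "MemoryPressure", "PIDPressure")
    if any(by_type.get(t) == "True" for t in pressure_types):
        return "yellow"    # node is ready but under pressure: warning
    return "green"


print(node_status_color([{"type": "Ready", "status": "True"},
                         {"type": "MemoryPressure", "status": "True"}]))
```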

8.png

The ellipsis on the dashboard provides additional options to interact with and manage dashboards:

9.png

CPU

CPU – monitors CPU usage for each node, along with load average, throttling, and idle time metrics. It can be used to identify the following:

  • Anomaly detection

  • Resource optimization

  • CPU throttling

  • CPU spikes

Solutions for CPU issues:

  • Check metrics, logs, and traces to identify patterns

  • Increase resource requests and limits for containers

  • Identify under-utilized and over-utilized nodes, then reallocate resources between them
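Identifying under-utilized and over-utilized nodes can be sketched as a simple classification of per-node CPU usage. The 20% and 80% thresholds and the node names below are hypothetical; choose values that match your own capacity planning.

```python
def classify_node_utilization(cpu_usage: dict, low: float = 0.20, high: float = 0.80) -> dict:
    """Group nodes by CPU usage, given as a fraction of capacity (0.0 to 1.0)."""
    groups = {"under": [], "balanced": [], "over": []}
    for node, fraction in sorted(cpu_usage.items()):
        if fraction < low:
            groups["under"].append(node)
        elif fraction > high:
            groups["over"].append(node)
        else:
            groups["balanced"].append(node)
    return groups


# Hypothetical readings taken from the CPU panel.
usage = {"node-a": 0.92, "node-b": 0.11, "node-c": 0.55}
print(classify_node_utilization(usage))
```

Workloads on nodes in the `over` group are candidates for moving to nodes in the `under` group.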

10.png

Network I/O pressure

Network I/O pressure – monitors the health and performance of the network. It can be used to identify the following:

  • Anomalies (spikes and/or drops)

  • Bottlenecks

  • Network congestion

  • Overutilization

Solutions for Network I/O issues:

  • Redistribute workload to alleviate pressure

  • Implement traffic shaping to prioritize critical network traffic

  • Increase bandwidth or add more nodes to distribute load
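Spikes and drops can be flagged by comparing each sample with a baseline taken from the samples just before it. The sketch below uses the median of a short preceding window as that baseline. The window size, the factor of 2, and the sample values are assumptions for illustration and do not reflect how the dashboard itself evaluates traffic.

```python
from statistics import median


def find_spikes_and_drops(samples: list, window: int = 5, factor: float = 2.0) -> list:
    """Return (index, "spike" or "drop") for samples far from the preceding median."""
    events = []
    for i in range(window, len(samples)):
        baseline = median(samples[i - window:i])
        if baseline <= 0:
            continue  # no meaningful baseline to compare against
        if samples[i] > baseline * factor:
            events.append((i, "spike"))
        elif samples[i] < baseline / factor:
            events.append((i, "drop"))
    return events


# Hypothetical throughput samples (for example, MB/s per scrape interval).
throughput = [100, 102, 98, 101, 99, 100, 400, 101, 20, 100]
print(find_spikes_and_drops(throughput))
```

The median is used instead of the mean so that one earlier outlier in the window does not hide the next one.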

11.png

Pods CPU usage

Pods CPU usage – monitors the CPU usage of all pods and services running on the nodes. It can be used to identify the following:

  • Anomalies (spikes and/or drops)

  • CPU throttling

  • Scaling needs

  • Resource utilization

Solutions for Pod CPU issues:

  • Set adequate CPU limits and requests

  • Implement vertical pod autoscaling (VPA) or horizontal pod autoscaling (HPA)

  • Monitor CPU usage with threshold-based alerts
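To estimate how horizontal scaling would respond to a given CPU reading, the replica calculation documented for the Kubernetes Horizontal Pod Autoscaler can be reproduced by hand. The sketch below is a simplification: it applies the basic formula and the default 10% tolerance, and it ignores minimum and maximum replica bounds, stabilization windows, and pods that are not ready. The function name is illustrative.

```python
import math


def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float, tolerance: float = 0.1) -> int:
    """Simplified HPA rule: ceil(current_replicas * current_value / target_value)."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to the target: no scaling
    return math.ceil(current_replicas * ratio)


# 4 replicas averaging 90% CPU against a 60% target.
print(desired_replicas(4, 90, 60))
```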

12.png

Pod memory usage

Pod memory usage – provides real-time metrics of the memory consumed by pods on each node. It can be used to identify the following:

  • Bottlenecks

  • Memory leaks

  • Performance

  • Resource inefficiencies

Solutions for Pod memory issues:

  • Set adequate memory limits and requests

  • Implement vertical pod autoscaling (VPA)

  • Monitor memory usage with threshold-based alerts
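A memory leak usually appears in this panel as steady growth rather than as a single high reading. One way to quantify that growth is to fit a least-squares line through evenly spaced samples and look at its slope. The function names, the sample values, and the growth threshold in the sketch below are all hypothetical.

```python
def growth_per_sample(values: list) -> float:
    """Least-squares slope of evenly spaced samples (units per sample interval)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    covariance = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    variance = sum((i - mean_x) ** 2 for i in range(n))
    return covariance / variance


def looks_like_leak(values: list, min_growth: float) -> bool:
    """Flag a series whose fitted growth per sample reaches `min_growth`."""
    return growth_per_sample(values) >= min_growth


# Hypothetical memory readings in MiB, one per scrape interval.
growing = [100, 110, 120, 130, 140]
stable = [200, 198, 203, 199, 201]
print(looks_like_leak(growing, min_growth=5), looks_like_leak(stable, min_growth=5))
```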

13.png

Pods network I/O

Pods network I/O – monitors the amount of inbound and outbound network traffic of each pod. It can be used to identify the following:

  • Congestion

  • Data transmission delays

  • DNS failures

  • Intermittent timeouts

Solutions for pod network I/O issues:

  • Set adequate CPU/memory requests to avoid throttling

  • Implement traffic shaping to prioritize critical network traffic

  • Check DNS, service endpoints, and network policies

  • Check for node pressure or pod restarts
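Checking service endpoints can be done by looking at which pod addresses back a service. The sketch below assumes an Endpoints object obtained from `kubectl get endpoints <service> -o json`. The helper name and the sample addresses are illustrative.

```python
def endpoint_addresses(endpoints: dict) -> tuple:
    """Return (ready IPs, not-ready IPs) listed in an Endpoints object."""
    ready, not_ready = [], []
    for subset in endpoints.get("subsets") or []:
        ready += [a["ip"] for a in subset.get("addresses") or []]
        not_ready += [a["ip"] for a in subset.get("notReadyAddresses") or []]
    return ready, not_ready


# Hypothetical Endpoints object for a service named "web".
sample_endpoints = {
    "metadata": {"name": "web"},
    "subsets": [
        {"addresses": [{"ip": "10.0.0.5"}],
         "notReadyAddresses": [{"ip": "10.0.0.6"}]}
    ],
}

print(endpoint_addresses(sample_endpoints))
```

An empty ready list means that no pod is currently serving the service, which typically points to a selector mismatch or failing readiness checks.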

14.png

Container CPU usage

Container CPU usage – shows CPU metrics for the containers running on the nodes. It can be used to identify the following:

  • High CPU usage

  • CPU throttling

  • Over- or under-provisioning

Solutions for Container CPU usage issues:

  • Set adequate CPU limits and requests

  • Implement vertical pod autoscaling (VPA) or horizontal pod autoscaling (HPA)

  • Monitor CPU usage with threshold-based alerts
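CPU throttling is commonly quantified as the share of scheduler periods in which a container was throttled. The sketch below assumes you have two readings of the cAdvisor counters `container_cpu_cfs_periods_total` and `container_cpu_cfs_throttled_periods_total`. Whether this dashboard is built on those counters is an assumption, and the function name and sample values are illustrative.

```python
def throttling_ratio(periods_start: float, periods_end: float,
                     throttled_start: float, throttled_end: float) -> float:
    """Fraction of CFS periods that were throttled between two counter readings."""
    periods = periods_end - periods_start
    throttled = throttled_end - throttled_start
    if periods <= 0:
        return 0.0  # no elapsed periods (or a counter reset): nothing to report
    return throttled / periods


# Hypothetical readings taken a few minutes apart.
print(throttling_ratio(1000, 1600, 50, 200))
```

A ratio that stays high over time suggests that the CPU limit is too low for the workload.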

15.png

Container memory usage

Container memory usage – displays the memory used by the containers running on the nodes. It can be used to identify the following:

  • Memory leaks

  • Memory pressure

  • Intermittent container restarts

  • Poor application performance

Solutions for Container memory usage issues:

  • Set adequate memory limits and requests

  • Implement vertical pod autoscaling (VPA)

  • Monitor memory usage with threshold-based alerts

  • Monitor to identify trends and sustained high usage
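Sustained high usage differs from a brief peak in that the threshold is exceeded for several consecutive samples. The sketch below shows that check. The 90% threshold, the required run length, and the sample values are hypothetical.

```python
def sustained_breach(samples: list, threshold: float, min_consecutive: int) -> bool:
    """True if `threshold` is exceeded for at least `min_consecutive` samples in a row."""
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


# Hypothetical memory usage as a fraction of the container's limit.
usage_fraction = [0.70, 0.92, 0.95, 0.60, 0.91, 0.93, 0.94, 0.96]
print(sustained_breach(usage_fraction, threshold=0.9, min_consecutive=3))
```

Requiring several consecutive samples keeps a short burst from raising an alert while still catching usage that stays near the limit.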

16.png

Container network I/O

Container network I/O – visualizes the network input and output metrics of containers. It can be used to identify the following:

  • Latency

  • Data flow restrictions

  • Traffic patterns

Solutions for container network I/O issues:

  • Set adequate CPU/memory requests to avoid throttling

  • Set bandwidth limits

  • Check DNS, service endpoints, and network policies

  • Check for node pressure or pod restarts
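Bandwidth limits are typically set per pod through the `kubernetes.io/ingress-bandwidth` and `kubernetes.io/egress-bandwidth` annotations. These take effect only if the cluster's network plugin supports traffic shaping, which is an assumption here. The helper name, pod name, and values below are illustrative.

```python
import copy


def with_bandwidth_limits(pod_manifest: dict, ingress: str, egress: str) -> dict:
    """Return a copy of the manifest with bandwidth annotations added."""
    updated = copy.deepcopy(pod_manifest)
    annotations = updated.setdefault("metadata", {}).setdefault("annotations", {})
    annotations["kubernetes.io/ingress-bandwidth"] = ingress
    annotations["kubernetes.io/egress-bandwidth"] = egress
    return updated


# Hypothetical pod manifest without any annotations.
base_pod = {"apiVersion": "v1", "kind": "Pod", "metadata": {"name": "uploader"}}
limited_pod = with_bandwidth_limits(base_pod, ingress="10M", egress="5M")
print(limited_pod["metadata"]["annotations"])
```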

17.png

Number of restarts per pod

Number of restarts per pod – displays the number of restarts for each pod. It can be used to identify the following issues:

  • Container fails to start successfully

  • Memory limit exceeded

  • Network timeouts

  • Unhandled exceptions or failed dependencies

Solutions for pod restarts:

  • Modify CPU/memory requests and limits

  • Check Pod events and logs

  • Check DNS, service endpoints, and network policies

  • Verify dependencies are reachable
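When a pod shows a high restart count, the pod status records why each container last terminated. The sketch below assumes a pod object obtained from `kubectl get pod <name> -o json`. The function name and the sample status are hypothetical.

```python
def restart_report(pod: dict) -> list:
    """Summarize restart counts and the last termination reason per container."""
    report = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        terminated = (status.get("lastState") or {}).get("terminated") or {}
        report.append({
            "container": status["name"],
            "restarts": status.get("restartCount", 0),
            "last_reason": terminated.get("reason"),
            "last_exit_code": terminated.get("exitCode"),
        })
    # Containers with the most restarts come first.
    return sorted(report, key=lambda row: row["restarts"], reverse=True)


# Hypothetical status: "app" was killed for exceeding its memory limit.
restarting_pod = {
    "status": {
        "containerStatuses": [
            {"name": "sidecar", "restartCount": 0, "lastState": {}},
            {"name": "app", "restartCount": 7,
             "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}},
        ]
    }
}

for row in restart_report(restarting_pod):
    print(row)
```

A last reason of `OOMKilled` points to the memory limit, while other reasons and exit codes are best followed up in the pod events and logs.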

18.png

Scenarios

Common Kubernetes Scenarios

Kubernetes issues that may occur, along with the course of action to resolve them.

Table 5. Common Kubernetes Scenarios

| Scenario | Course of Action |
|---|---|
| Developer can't access the cluster | Check kubeconfig and RBAC roles |
| Pod is stuck in CrashLoopBackOff | View logs, restart pod, escalate if needed |
| Service is unreachable | Verify service and pod endpoints |
| Namespace quota exceeded | Adjust limits or advise user |
| App deployment failed | Check Helm release status and logs |



Common Pod Networking Scenarios

Pod networking issues that may occur, along with the course of action to resolve them.

Table 6. Common Pod Networking Scenarios

| Scenario | Course of Action |
|---|---|
| Pod can't reach service | Check DNS, service endpoints, and network policies |
| High latency | Monitor bandwidth, check node load, inspect CNI |
| DNS failures | Restart CoreDNS, check config maps |
| Intermittent timeouts | Look for node pressure, CNI restarts, or pod restarts |



Common Container CPU usage Scenarios

Container CPU usage issues that may occur, along with the course of action to resolve them.

Table 7. Common Container CPU usage Scenarios

| Scenario | Course of Action |
|---|---|
| High CPU usage | Profile app, increase CPU requests, use HPA |
| CPU throttling | Increase CPU limits or remove them |
| Pod evicted due to CPU | Spread workloads, use node affinity |
| Uneven CPU usage | Balance requests across pods |



Common Container memory usage Scenarios

Container memory usage issues that may occur, along with the course of action to resolve them.

Table 8. Common Container memory usage Scenarios

| Scenario | Course of Action |
|---|---|
| Pod is OOMKilled | Increase memory limit, check logs for cause |
| Memory usage keeps growing | Profile app for leaks, restart periodically |
| Node memory pressure | Spread pods across nodes, use affinity rules |
| App crashes under load | Use HPA/VPA, increase memory requests |