Teacher Services Cloud - Monitoring
Details of cluster and subscription monitoring.
Subscription monitoring
Service health alerts are created manually for each s189 subscription, named:
s189[d|t|p]-service-health-alert
They trigger on the events below for the UK South or Global regions, and send an email to the TS infra team:
- service issues
- planned maintenance
- health advisories
- security advisories
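As a hedged illustration (not part of the Terraform code), the manually created alerts can be inspected with the Azure CLI; the subscription reference below is a placeholder.

```sh
# Service health alerts are a type of activity log alert, so they appear under
# the activity-log alert commands. Substitute the relevant s189 subscription.
az account set --subscription <s189d-subscription-id>
az monitor activity-log alert list --output table
```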
Cluster statuscake alerts
Terraform-created StatusCake monitoring exists for the permanent clusters.
These checks monitor https://status.${cluster}/healthz for each cluster, and will email and page the TS infra team on failure.
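For a quick manual check of the same endpoint, something like the following works; the cluster domain is a placeholder for whatever replaces ${cluster} in the monitored URL.

```sh
# Manual version of the StatusCake check against one cluster's status endpoint.
cluster="<cluster-domain>"   # placeholder cluster domain
if curl -fsS --max-time 10 "https://status.${cluster}/healthz" > /dev/null; then
  echo "${cluster}: healthy"
else
  echo "${cluster}: health check FAILED"
fi
```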
AKS Cluster Authentication
An AKS cluster authentication smoke test runs as a GitHub workflow on a cron schedule every 5 minutes, accessing all clusters. It authenticates to Azure via OIDC and runs a simple Kubernetes command to verify all is well. If the script fails, it triggers a Slack webhook to the #infra-alert-public channel.
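The workflow itself lives in the repo; the following is only a rough sketch of the kind of check it performs, assuming Azure login via OIDC has already happened (e.g. via the azure/login GitHub action). RESOURCE_GROUP, CLUSTER_NAME and SLACK_WEBHOOK are placeholders, not the workflow's real names.

```sh
# Fetch cluster credentials and run a simple k8s command; on failure, post to
# the Slack incoming webhook for #infra-alert-public.
if az aks get-credentials --resource-group "$RESOURCE_GROUP" --name "$CLUSTER_NAME" --overwrite-existing \
   && kubectl get nodes --request-timeout=30s > /dev/null; then
  echo "$CLUSTER_NAME: authentication OK"
else
  curl -X POST -H 'Content-type: application/json' \
    --data "{\"text\":\"AKS auth smoke test failed for ${CLUSTER_NAME}\"}" \
    "$SLACK_WEBHOOK"
  exit 1
fi
```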
Prometheus
Prometheus monitoring is enabled for a cluster by default.
The default prometheus version is hardcoded in the kubernetes variables.tf. It can be overridden for a cluster by adding prometheus_version to the env.tfvars.json file (see the sketch after the variable list below).
There are several other variables that can be changed depending on env requirements.
- prometheus_app_mem - app memory limit (default 1G)
- prometheus_app_cpu - app cpu requests (default 100m)
- prometheus_tsdb_retention_time - local storage retention period (default 6h)
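As a hedged example of an override, the variables can be merged into the cluster's env tfvars JSON file; the file path and values shown are illustrative, not the repo's actual layout.

```sh
# Pin a non-default Prometheus version and local retention for one cluster.
TFVARS="config/development.tfvars.json"   # placeholder path for the cluster's env.tfvars.json
jq '. + {"prometheus_version": "v2.49.1", "prometheus_tsdb_retention_time": "12h"}' \
  "$TFVARS" > "$TFVARS.tmp" && mv "$TFVARS.tmp" "$TFVARS"
```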
Prometheus rules and yml config files are loaded from the terraform_kubernetes/config/prometheus directory. Each file is prefixed with the cluster env. e.g. development.prometheus.rules and development.prometheus.yml
Currently a restart/reload of the prometheus process is required if changes are made to these files.
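One way to do that restart, assuming a deployment named prometheus in a monitoring namespace (both assumptions; check the cluster for the actual names):

```sh
# Restart Prometheus so it picks up the updated rules/config files, then wait
# for the rollout to complete.
kubectl -n monitoring rollout restart deployment/prometheus
kubectl -n monitoring rollout status deployment/prometheus --timeout=120s
```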
Thanos
Prometheus is configured to use Thanos for backend storage.
Thanos runs as a sidecar within the prometheus deployment. After two hours, it copies the data collected by prometheus to an Azure storage container.
There are also three separate Thanos services
- thanos-querier
- thanos-store-gateway
- thanos-compactor
All are running as single replica deployments.
The default thanos version is hardcoded in the kubernetes variables.tf. It can be overridden for a cluster by adding thanos_version to the env.tfvars.json file.
There are several other variables that can be changed depending on env requirements.
- thanos_app_mem - sidecar memory limit (default 1G)
- thanos_app_cpu - thanos cpu requests (default 100m)
- thanos_querier_mem - app memory limit for the thanos querier (default 1G)
- thanos_compactor_mem - app memory limit for the thanos compactor (default 1G)
- thanos_store_mem - app memory limit for the thanos store gateway (default 1G)
- thanos_retention_raw - Thanos retention period for raw samples (default 30d)
- thanos_retention_5m - Thanos retention period for 5m samples (default 60d)
- thanos_retention_1h - Thanos retention period for 1h samples (default 90d)
Metrics Retention
Metrics retention is based on sampling:
- Raw data (the actual data captured) is retained for 30 days. This is data as it is captured by prometheus.
- 5m down samples are retained for 60 days. These are data points for a metric 5 minutes apart.
- 1hr down samples are retained for 90 days. These are data points for a metric 1 hour apart.
More information on down sampling is available at this link
Down sampling reduces storage costs, as all the raw data does not need to be stored for longer duration charting.
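A rough back-of-the-envelope comparison of what is kept per series at each resolution (the 30s raw scrape interval is an assumption, not necessarily the cluster's setting):

```sh
# Approximate data points stored per series at each resolution.
echo "raw, 30 days @ 30s scrape interval: $(( 30 * 24 * 3600 / 30 )) points"   # 86400
echo "5m downsample, 60 days:             $(( 60 * 24 * 60 / 5 )) points"      # 17280
echo "1h downsample, 90 days:             $(( 90 * 24 )) points"               # 2160
```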
Thanos UI
Metrics can be queried/charted using the thanos UI. While charting metrics in thanos, the following should be noted:
- Change the data source to prometheus or thanos. See this image
Raw Data Sampling
- Thanos UI allows for querying raw data. However, raw data is retained for only 30 days. Raw data queries can be created by selecting "Only raw data" as below. If more than 30 days of raw data is queried, the charts will not show data older than 30 days. See image
5m down sample
Beyond 30 days, thanos down samples the data. The 5m down sample stores samples for 60 days. See image
1hr down sample
1hr down sample stores metric samples for 90 days. See image
Auto down sample
This option is used by grafana during charting/visualisation. Where charts cover a long period of time, grafana adopts the most appropriate down sampling for the data.
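The same resolution choice can also be made outside the UI via the querier's HTTP API. The sketch below assumes a monitoring namespace, a thanos-querier service exposing the default Thanos HTTP port (10902), and an illustrative 7-day range.

```sh
# Query thanos-querier directly, selecting the downsampling resolution with
# max_source_resolution (e.g. 0s for raw only, 5m, or 1h).
kubectl -n monitoring port-forward svc/thanos-querier 10902:10902 &
sleep 2
start=$(( $(date +%s) - 7 * 24 * 3600 ))
curl -sG 'http://localhost:10902/api/v1/query_range' \
  --data-urlencode 'query=up' \
  --data-urlencode "start=${start}" \
  --data-urlencode "end=$(date +%s)" \
  --data-urlencode 'step=1h' \
  --data-urlencode 'max_source_resolution=1h'
```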
Thanos logging-level
This is currently set to "info" for all three thanos components, but can be amended to any of "--log.level=error|warn|info|debug" in thanos.tf
Grafana
Grafana provides a visual interface for monitoring logs and metrics. It can be configured with different datasources, including prometheus and thanos (as in this case). Grafana dashboards can be configured as required to provide different forms of visualisation, including charts, graphs etc.
The default grafana version is hardcoded in the kubernetes variables.tf. It can be overridden for a cluster by adding grafana_version to the env.tfvars.json file.
There are several other variables that can be changed depending on env requirements.
- grafana_app_mem - app memory limit (default 1Gi)
- grafana_app_cpu - app requests cpu (default 500m)
kube state metrics
Kube-state-metrics is a listening service that generates metrics about the state of Kubernetes objects by leveraging the Kubernetes API; it focuses on object health rather than component health.
The default kube-state-metrics version is hardcoded (currently v2.8.2) via kube_state_metrics_version in variables.tf.
The deployment is configured with the following (a quick check of these ports is sketched after this list):
- requests - cpu of 100m and memory of 128Mi
- limits - cpu of 300m and memory of 256Mi
- liveness_probe - endpoint /healthz on port 8080
- readiness_probe - endpoint / on port 8081
- telemetry - telemetry data is accessed via port 8081
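A quick way to check both ports locally, assuming the usual service name and a monitoring namespace (both assumptions):

```sh
# Port-forward kube-state-metrics and hit the health, metrics and telemetry endpoints.
kubectl -n monitoring port-forward svc/kube-state-metrics 8080:8080 8081:8081 &
sleep 2
curl -s http://localhost:8080/healthz           # liveness endpoint
curl -s http://localhost:8080/metrics | head    # object state metrics
curl -s http://localhost:8081/metrics | head    # telemetry (self) metrics
```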
Alertmanager
Alertmanager handles alerts sent by client applications such as the Prometheus server. It takes care of deduplicating, grouping, and routing them to the correct receiver integration such as email, Slack, or other notification mechanisms.
Alertmanager service is running on NodePort 9093.
Alertmanager is a single replica deployment.
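To confirm it is responding, something like the following works; the service name and namespace are assumptions, and port-forwarding is used here rather than hitting the NodePort directly.

```sh
# Check Alertmanager status and count current alerts via its v2 API.
kubectl -n monitoring port-forward svc/alertmanager 9093:9093 &
sleep 2
curl -s http://localhost:9093/api/v2/status | jq .cluster.status
curl -s http://localhost:9093/api/v2/alerts | jq length
```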
The default alertmanager version is hardcoded in the kubernetes variables.tf. It can be overridden for a cluster by adding alertmanager_image_version to the env.tfvars.json file.
There are several other variables that can be changed depending on env requirements.
- alertmanager_app_mem - app memory limit (default 1G)
- alertmanager_app_cpu - app cpu requests (default 1)
Node Exporter
The node exporter exposes OS and hardware metrics for each node.
It's deployed as a daemonset, which creates a node-exporter pod on each node in the cluster. Prometheus then scrapes port 9100 on each of these pods.
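To spot-check one of the pods (the namespace and label selector are assumptions):

```sh
# Scrape a node-exporter pod directly on port 9100, as Prometheus does.
POD=$(kubectl -n monitoring get pods -l app=node-exporter -o jsonpath='{.items[0].metadata.name}')
kubectl -n monitoring port-forward "$POD" 9100:9100 &
sleep 2
curl -s http://localhost:9100/metrics | grep '^node_load1'
```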
The default node exporter version is hardcoded in the kubernetes variables.tf. It can be overridden for a cluster by adding node_exporter_version to the env.tfvars.json file.
Prometheus, Alertmanager and Thanos Auth Key Generation
For auth key generation, run the shell script scripts/hash_password.sh, passing a username and password, then save the generated key into the Azure key vault as a secret (see the sketch after the list below). The user and password will be stored as clear text in PROMETHEUS-AUTH-CLEAR, ALERTMANAGER-AUTH-CLEAR and THANOS-AUTH-CLEAR.
The following auth keys need to be stored in the Azure key vault as secrets:
- PROMETHEUS-AUTH
- ALERTMANAGER-AUTH
- THANOS-AUTH
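A hedged sketch of the flow for Prometheus; the key vault name is a placeholder, and the script's exact arguments and the clear-text secret format are assumptions to be confirmed against scripts/hash_password.sh.

```sh
# Generate the hashed auth key and store both the hashed and clear-text secrets.
HASH=$(./scripts/hash_password.sh myuser 'S3cretPassw0rd')
az keyvault secret set --vault-name <key-vault-name> --name PROMETHEUS-AUTH --value "$HASH"
az keyvault secret set --vault-name <key-vault-name> --name PROMETHEUS-AUTH-CLEAR --value 'myuser:S3cretPassw0rd'
```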
Azure Monitor Alerting
Azure Monitor is used to track the health and performance of the AKS clusters. The monitoring is configured through Terraform in the azure_metric_alerts.tf file.
Node Availability Monitoring
A metric alert is configured to monitor the availability of nodes in the AKS cluster:
- Alert Name: [resource-prefix]-tsc-[environment]-nodes-capacity
- Metric: kube_node_status_condition
- Evaluation: Every 1 minute over a 5-minute window
- Threshold: Triggers when the number of available nodes with "Ready" status exceeds the configured threshold
- Action: Notifications are sent to the configured Azure Monitor Action Group
The alert helps ensure the cluster maintains sufficient node capacity for workloads. The action group is configured to notify the appropriate team members when node availability issues are detected.
Configuration is managed through Terraform variables:
- The monitoring resource group and action group are defined in the cluster configuration
- The action group name follows the format [resource-prefix]-tsc
- Alert thresholds can be customized per environment
- The metric namespace used is microsoft.containerservice/managedclusters
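The deployed alerts and action group can be reviewed from the CLI; the resource group name below is a placeholder.

```sh
# List the Terraform-managed metric alerts and the action group they notify.
az monitor metrics alert list --resource-group <monitoring-resource-group> --output table
az monitor action-group list --resource-group <monitoring-resource-group> --output table
```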
High Port Usage
AKS uses an Azure load balancer for inbound and outbound connections, and this can lead to port exhaustion if a node makes a lot of network requests.
If port usage goes over a threshold we alert on this as a warning so we can take pre-emptive action.
Port Exhaustion
If connections start failing because of port exhaustion we alert on this as an error.
Troubleshooting Port Exhaustion
Unfortunately we can't tell from the alert which kubernetes service is using a high number of ports, so this is a troubleshooting exercise, following:
Troubleshoot SNAT port exhaustion on Azure Kubernetes Service nodes
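As a hedged starting point before working through that guide, a socket summary on a suspect node can show how many outbound connections are in use; the node name and debug image below are placeholders.

```sh
# Start a node debug pod and run ss against the host to summarise open sockets.
kubectl debug node/<node-name> -it --image=mcr.microsoft.com/cbl-mariner/busybox:2.0 -- chroot /host ss -s
```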