Teacher Services Cloud - Service Requests Dashboard
Overview
The Service Requests Dashboard provides a comprehensive view of HTTP request metrics for all services running in the AKS cluster. It replaces the previous PaaS dashboard functionality by showing request counts grouped by HTTP status codes (2xx, 4xx, 5xx).
Features
- Aggregate View: Overall request metrics across all services grouped by status code
- Per-Service Panels: Individual graphs for each service showing request breakdown
- Status Code Grouping: Color-coded visualization:
  - 🟢 Green: 2xx (Success)
  - 🟡 Yellow: 4xx (Client Errors)
  - 🔴 Red: 5xx (Server Errors)
- Time Range: Default last 1 hour, easily adjustable
- Auto-refresh: 30-second refresh interval
- Filtering: Filter by service/ingress and namespace
Dashboard Location
- File: cluster/terraform_kubernetes/config/dashboards/service-requests-dashboard.json
- Grafana URL: https://grafana.{cluster-domain}/
- Dashboard Title: "Service Requests Dashboard"
Dashboard Components
1. All Services Overview Panel
Shows aggregate request rate across all services with HTTP status code breakdown. Useful for:
- Identifying overall system health
- Detecting widespread issues
- Monitoring total traffic patterns
2. Per-Service Panels
Dynamically generated panels, one per service/ingress, showing:
- Request rate (requests per second)
- HTTP status code distribution
- Individual service health trends
Variables/Filters
Service Filter
- Variable: $service
- Type: Multi-select dropdown
- Source: Auto-populated from nginx_ingress_controller_requests metric labels
- Default: All services
- Usage: Select specific services to focus the dashboard view
Namespace Filter
- Variable: $namespace
- Type: Multi-select dropdown
- Source: Auto-populated from nginx_ingress_controller_requests metric labels
- Default: All namespaces
- Usage: Filter by Kubernetes namespace (e.g., BAT, CPD, GIT services)
Prometheus Queries
All Services Overview
sum(rate(nginx_ingress_controller_requests{namespace!=""}[$__rate_interval])) by (status)
Per-Service Metrics
sum(rate(nginx_ingress_controller_requests{ingress="$service"}[$__rate_interval])) by (status)
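To sanity-check these queries outside Grafana, they can be sent to the Prometheus HTTP API. The sketch below is an illustration only: PROM_URL and both helper names are assumptions that do not come from this repository. Grafana's $__rate_interval is replaced with a fixed window, because that variable is only resolved by Grafana.

```shell
# Hypothetical helpers. PROM_URL is an assumed Prometheus (or Thanos Query) base URL.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Print the per-service query with a fixed window instead of $__rate_interval.
per_service_query() {
  service="$1"
  window="${2:-5m}"
  printf 'sum(rate(nginx_ingress_controller_requests{ingress="%s"}[%s])) by (status)\n' \
    "$service" "$window"
}

# Send an instant query to the Prometheus HTTP API (/api/v1/query).
run_query() {
  curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$1"
}

# Example (requires network access to Prometheus; "my-ingress" is a placeholder):
#   run_query "$(per_service_query my-ingress 5m)"
```

The JSON response contains one series per status code, which should match what the corresponding per-service panel plots.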
Use Cases
Monitoring Service Health
- Quick identification of services with high error rates (4xx/5xx)
- Trend analysis for request patterns
- Capacity planning based on request volumes
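For spotting services with high error rates, a ratio is often easier to read than raw request rates. The following is a sketch of one possible PromQL expression, wrapped in a hypothetical shell helper. It is not a query taken from the dashboard JSON, and the 5m default window is an assumption.

```shell
# Print a PromQL expression for the share of 5xx responses per ingress.
# The result is a value between 0 and 1 for each ingress that has 5xx traffic.
error_ratio_query() {
  window="${1:-5m}"
  printf 'sum by (ingress) (rate(nginx_ingress_controller_requests{status=~"5.."}[%s])) / sum by (ingress) (rate(nginx_ingress_controller_requests[%s]))\n' \
    "$window" "$window"
}

error_ratio_query 5m
```

Ingresses with no 5xx responses in the window have no series on the left-hand side, so they are absent from the result rather than reported as 0.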
Incident Response
- Rapid identification of affected services
- Correlation of errors across multiple services
- Historical comparison during incidents
Service Line Analysis
If the dashboard becomes resource-intensive, you can:
- Use the namespace filter to focus on specific service lines (BAT, CPD, GIT)
- Create separate dashboards per service line by duplicating and modifying the dashboard
- Adjust the time range to reduce query load
Deployment
The dashboard is automatically deployed via Terraform:
- Automatic Provisioning: The dashboard JSON file in config/dashboards/ is automatically picked up by the Terraform configuration
- ConfigMap: Loaded into the grafana-dashboards ConfigMap in the monitoring namespace
- Grafana: Auto-provisioned when the Grafana pod starts
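To confirm that the dashboard actually reached the cluster, the ConfigMap contents can be inspected. This is a minimal sketch; it assumes that each ConfigMap key is the dashboard file name, which is not stated above, and both helper names are hypothetical.

```shell
# List the keys stored in the dashboards ConfigMap, one per line.
list_dashboard_keys() {
  kubectl get configmap grafana-dashboards -n monitoring \
    -o go-template='{{range $k, $v := .data}}{{$k}}{{"\n"}}{{end}}'
}

# Succeed if the given key appears in a newline-separated key list on stdin.
has_dashboard_key() {
  grep -Fxq "$1"
}

# Example (requires cluster access):
#   list_dashboard_keys | has_dashboard_key service-requests-dashboard.json \
#     && echo "dashboard present in ConfigMap"
```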
Manual Deployment Steps
If you need to manually update the dashboard:
# 1. Edit the dashboard JSON file
vim cluster/terraform_kubernetes/config/dashboards/service-requests-dashboard.json
# 2. Apply Terraform changes
make <environment> terraform-plan
make <environment> terraform-apply
# 3. Restart Grafana pod (optional, for immediate reload)
kubectl rollout restart deployment/grafana -n monitoring
Performance Considerations
Resource Usage
- Light Load: Single panel showing all services aggregated
- Moderate Load: ~10-20 service panels
- Heavy Load: >20 service panels may impact Grafana performance
Optimization Strategies
If performance becomes an issue:
- Increase Refresh Interval: Change from 30s to 1m or higher
- Reduce Time Range: Use shorter default time ranges (30m instead of 1h)
- Split by Service Line: Create separate dashboards:
  - service-requests-bat.json (BAT services only)
  - service-requests-cpd.json (CPD services only)
  - service-requests-git.json (GIT services only)
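The first two strategies above amount to small changes in the dashboard JSON. The sketch below assumes the file uses the standard Grafana fields "refresh" and "time.from" with the values "30s" and "now-1h"; check the actual file before relying on it. The helper name is hypothetical.

```shell
# Print a copy of the dashboard JSON with a 1m refresh and a 30m default range.
# Only matches the exact values "30s" and "now-1h"; other values are left as-is.
relax_dashboard_load() {
  sed -e 's/"refresh": *"30s"/"refresh": "1m"/' \
      -e 's/"from": *"now-1h"/"from": "now-30m"/' "$1"
}

# Example: write the result to a scratch file and review the diff before committing.
#   relax_dashboard_load service-requests-dashboard.json > /tmp/dashboard.json
#   diff service-requests-dashboard.json /tmp/dashboard.json
```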
To create service-line specific dashboards:
# Copy the base dashboard
cp service-requests-dashboard.json service-requests-bat.json
# Edit the JSON and add namespace filter to queries:
# Change: nginx_ingress_controller_requests{ingress="$service"}
# To: nginx_ingress_controller_requests{ingress="$service", namespace=~".*bat.*"}
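The manual edit described above could be scripted. This is a sketch under one assumption: inside the dashboard JSON the query is stored as a string, so the selector appears with escaped quotes as ingress=\"$service\". The function name is hypothetical.

```shell
# Copy a dashboard and add a namespace regex matcher to every per-service selector.
#   $1 = source JSON, $2 = destination JSON, $3 = service line (e.g. bat)
make_service_line_dashboard() {
  src="$1"
  dest="$2"
  line="$3"
  # \\" matches the escaped quote (\") used inside JSON strings.
  sed 's/ingress=\\"\$service\\"/ingress=\\"$service\\", namespace=~\\".*'"$line"'.*\\"/g' \
    "$src" > "$dest"
}

# Example:
#   make_service_line_dashboard service-requests-dashboard.json service-requests-bat.json bat
```

The copy would most likely also need its own title and uid so that Grafana treats it as a separate dashboard; that part is left as a manual edit here.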
Troubleshooting
No Data Displayed
- Verify Prometheus is scraping nginx-ingress-controller metrics
- Check that services have active traffic
- Confirm nginx-ingress-controller is deployed and running
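The first check can be done by asking Prometheus whether the metric has any series at all. A rough sketch follows; PROM_URL and the helper names are assumptions, and the response check relies on the compact JSON that the Prometheus HTTP API normally returns.

```shell
# Ask Prometheus how many series exist for the metric (instant query).
check_metric() {
  curl -sG "${PROM_URL:-http://localhost:9090}/api/v1/query" \
    --data-urlencode 'query=count(nginx_ingress_controller_requests)'
}

# Succeed when the API response on stdin is successful and has a non-empty result.
has_series() {
  response="$(cat)"
  case "$response" in
    *'"status":"success"'*) ;;
    *) return 1 ;;
  esac
  case "$response" in
    *'"result":[]'*) return 1 ;;
  esac
  return 0
}

# Example (requires network access to Prometheus):
#   check_metric | has_series && echo "metric is being scraped"
```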
Missing Services
- Services appear in the dropdown only if they have received requests
- Check the nginx-ingress-controller is properly configured for the service
- Verify the service has an Ingress resource defined
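To find which Ingress resources have not yet produced metrics, the list of Ingress names can be compared with the values of the metric's ingress label. The helper below is a hypothetical sketch that works on two plain-text files; the commands in the comments for producing them assume jq is installed and that PROM_URL points at Prometheus.

```shell
# Print names from the first file (Ingress resources) that are absent from the
# second file (values of the metric's ingress label), one name per line.
missing_from_metrics() {
  grep -Fxv -f "$2" "$1"
}

# The inputs could be produced like this (requires cluster and Prometheus access):
#   kubectl get ingress -A --no-headers -o custom-columns=NAME:.metadata.name > ingresses.txt
#   curl -s "$PROM_URL/api/v1/label/ingress/values" | jq -r '.data[]' > metric-ingresses.txt
#   missing_from_metrics ingresses.txt metric-ingresses.txt
```

Any name printed is an Ingress that exists in the cluster but has no series yet, which is consistent with a service that has not received requests.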
Performance Issues
- Reduce the number of selected services
- Increase refresh interval
- Shorten time range
- Consider splitting into service-line specific dashboards
Related Documentation
- Monitoring - Overall monitoring strategy
- Nginx Ingress Controller Dashboard - Original ingress-level dashboard
- Grafana Configuration - Terraform configuration
Metrics Reference
The dashboard uses the nginx_ingress_controller_requests metric which provides:
- Labels:
  - ingress: Name of the ingress/service
  - namespace: Kubernetes namespace
  - status: HTTP status code (200, 404, 500, etc.)
  - method: HTTP method (GET, POST, etc.)
  - host: Request host header
- Metric Type: Counter
- Collection: Prometheus via nginx-ingress-controller exporter
- Retention: Subject to Thanos retention policy (see Monitoring)
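Because the metric is a counter, its raw value only ever increases, which is why the dashboard queries wrap it in rate(). The arithmetic behind a per-second rate can be illustrated with two samples. The numbers below are made up, and the helper is a simplification: the real rate() function also handles counter resets and extrapolates to the window boundaries.

```shell
# Average per-second rate between two counter samples: (v2 - v1) / (t2 - t1).
#   $1 = first value, $2 = first timestamp (s), $3 = second value, $4 = second timestamp (s)
counter_rate() {
  awk -v v1="$1" -v t1="$2" -v v2="$3" -v t2="$4" \
    'BEGIN { printf "%.2f\n", (v2 - v1) / (t2 - t1) }'
}

# A counter that grows from 1200 to 1500 over 60 seconds:
counter_rate 1200 0 1500 60   # prints 5.00 (requests per second)
```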
Future Enhancements
Potential improvements to consider:
- Service Line Organization: Add service line labels/tags for better filtering
- SLO Integration: Add SLO violation indicators
- Alert Integration: Link to related alerts for each service
- Latency Metrics: Add P95/P99 latency alongside request counts
- Rate Change Detection: Highlight sudden changes in request patterns