📊 Cluster Health Dashboard - Setup Guide
Complete setup guide for the Kubernetes Cluster Health Dashboard Jenkins pipeline.
🎯 What This Dashboard Does
Collects:
- ✅ Cluster information (version, nodes, namespaces, pods)
- ✅ Resource metrics from Prometheus (CPU, Memory, Network)
- ✅ Pod status across all namespaces
- ✅ Node capacity and usage
- ✅ Cost estimation (monthly)
- ✅ Health checks and issue detection
Generates:
- 📊 Interactive HTML dashboard
- 📄 JSON report with all metrics
- 📱 Telegram summary notification
- 📧 Optional email report
📋 Prerequisites
Required:
- ✅ Jenkins with Kubernetes plugin
- ✅ HTML Publisher plugin (renders the dashboard page; see Issue 5)
- ✅ kubectl configured with cluster access
- ✅ Prometheus running in cluster (for metrics)
- ✅ jq installed on Jenkins agent
- ✅ curl installed on Jenkins agent
Optional:
- ⚙️ Telegram bot (for notifications)
- ⚙️ Email configured in Jenkins
- ⚙️ Grafana (referenced in dashboard)
🚀 Setup Steps
Step 1: Install Required Tools on Jenkins Agent
# SSH to your Jenkins agent or use Jenkins shell
# Install jq (JSON processor)
sudo apt-get update
sudo apt-get install -y jq
# Verify installations
jq --version
kubectl version --client
curl --version
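To make the pipeline stop early when a tool is missing, the checks above can be scripted. This is a minimal sketch, assuming a bash shell on the agent. `check_tools` is a hypothetical helper name and is not part of the Jenkinsfile:

```bash
#!/usr/bin/env bash
# check_tools: report every command in the argument list that is not on PATH.
# Hypothetical helper for illustration; returns 1 if anything is missing.
check_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage: the three tools this guide requires on the agent
# check_tools jq kubectl curl && echo "all tools present"
```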
Step 2: Configure Prometheus Access
Option A: If Prometheus is in your cluster (recommended)
Check if Prometheus is accessible:
# From Jenkins agent or any pod in cluster
kubectl get svc -n monitoring
# Should see something like:
# prometheus-server ClusterIP 10.43.xxx.xxx <none> 80/TCP
Option B: If Prometheus is external
Update Jenkinsfile environment variables:
environment {
PROMETHEUS_URL = 'http://your-prometheus-url:9090'
}
Test Prometheus access:
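# From a Jenkins agent running inside the cluster (for an external Prometheus, use your PROMETHEUS_URL instead)
curl "http://prometheus-server.monitoring.svc.cluster.local/api/v1/query?query=up"
# Should return JSON starting with {"status":"success"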
Step 3: Set Up Telegram Notifications (Optional)
If you already have a bot from a previous setup, skip this step!
A. Create Bot (if not done)
- Open Telegram → @BotFather
- Send /newbot and follow the prompts
- Copy the token it returns, which looks like:
1234567890:ABC...
B. Get Chat ID
- Open Telegram → @userinfobot
- It replies with your numeric ID, for example:
123456789
C. Add to Jenkins Credentials
Jenkins → Manage Jenkins → Manage Credentials → Add:
Credential 1:
Kind: Secret text
Secret: <your bot token from step A>
ID: telegram-bot-token
Description: Telegram Bot Token
Credential 2:
Kind: Secret text
Secret: <your chat ID from step B>
ID: telegram-chat-id
Description: Telegram Chat ID
Step 4: Adjust Cost Estimates
Edit Jenkinsfile to match your actual cloud costs:
environment {
// Adjust these to your actual pricing
CPU_PRICE_PER_HOUR = '0.04' // $0.04 per vCPU-hour
MEMORY_PRICE_PER_GB_HOUR = '0.005' // $0.005 per GB-hour
}
Common pricing reference:
- AWS t3.medium: ~$0.0416/hour for the whole instance (2 vCPU, 4 GB RAM), so split it between the CPU and memory rates above
- DigitalOcean: $0.06/hour per vCPU, $0.007/GB RAM
- Local/Bare metal: $0 (or electricity cost)
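To sanity-check the monthly figure the dashboard reports, it can be recomputed by hand. This is a sketch, assuming the pipeline multiplies total capacity by the hourly rates and uses about 730 hours per month (24 × 365 / 12); the Jenkinsfile may use a different constant, so treat the result as an approximation. `monthly_cost` is a hypothetical helper:

```bash
#!/usr/bin/env bash
# monthly_cost CPU_CORES MEMORY_GB CPU_PRICE_PER_HOUR MEMORY_PRICE_PER_GB_HOUR
# Hypothetical helper. Assumes ~730 hours per month; prints USD with 2 decimals.
monthly_cost() {
  awk -v cores="$1" -v mem="$2" -v cpu_rate="$3" -v mem_rate="$4" 'BEGIN {
    hours = 730
    cpu_cost = cores * cpu_rate * hours
    mem_cost = mem * mem_rate * hours
    printf "%.2f\n", cpu_cost + mem_cost
  }'
}

# Example: 12 cores and 48 GB at the default rates above
monthly_cost 12 48 0.04 0.005   # prints 525.60
```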
Step 5: Create Jenkins Pipeline
A. Create New Pipeline Job
- Jenkins → New Item
- Name: cluster-health-dashboard
- Type: Pipeline
- OK
B. Configure Pipeline
- Description: Daily cluster health monitoring and reporting. Generates a dashboard with metrics, costs, and health checks.
- Build Triggers:
  - ☑️ Build periodically
  - Schedule: 0 8 * * 1-5 (8 AM on weekdays)
- Pipeline:
  - Definition: Pipeline script from SCM
  - SCM: Git
  - Repository URL: http://gitea-http.gitea.svc.cluster.local:3000/admin/k3s-gitops
  - Credentials: gitea-credentials
  - Branch: */main
  - Script Path: apps/cluster-health-dashboard/Jenkinsfile
C. Or use Pipeline Script Directly
If you want to test first without Git:
- Definition: Pipeline script
- Copy entire Jenkinsfile content into the script box
- Save
Step 6: Add to GitOps Repository
# On your local machine
cd ~/projects/k3s-gitops
# Create directory
mkdir -p apps/cluster-health-dashboard
# Copy Jenkinsfile
cp /path/to/Jenkinsfile apps/cluster-health-dashboard/
# Commit
git add apps/cluster-health-dashboard/
git commit -m "feat: add cluster health dashboard pipeline"
git push origin main
🧪 Testing
Test 1: Manual Run (First Time)
- Jenkins → cluster-health-dashboard → Build with Parameters
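- Set:
  - REPORT_PERIOD: 24h
  - SEND_EMAIL: false (for the first test)
  - SEND_TELEGRAM: true
- Click Build
(If Jenkins only offers Build Now, run it once with the defaults; parameters declared in the Jenkinsfile appear after the first build.)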
Watch Console Output:
🚀 Starting Cluster Health Dashboard generation...
📋 Collecting cluster information...
Cluster version: v1.28.0
Nodes: 3
Namespaces: 14
Pods: 67
📈 Querying Prometheus for metrics...
✅ Dashboard generated
Test 2: Check Generated Dashboard
After build completes:
- Jenkins → cluster-health-dashboard → Build #1
- Click "Cluster Health Dashboard" (left sidebar)
- You should see the HTML dashboard! 🎨
Test 3: Check Telegram Notification
You should receive:
📊 Cluster Health Report
━━━━━━━━━━━━━━━━━━━━━━
📋 Cluster Info
Version: v1.28.0
Nodes: 3
Namespaces: 14
Total Pods: 67
━━━━━━━━━━━━━━━━━━━━━━
💻 Resources
CPU Cores: 12
Memory: 48 GB
Avg CPU Usage: 23.5%
...
Test 4: Check Artifacts
- Build #1 → Artifacts
- Should see:
  - dashboard.html
  - report.json
  - namespace-stats.json
  - all-pods.json
  - node-resources.json
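The same check can be scripted, for example as a final `sh` step in the pipeline or against downloaded artifacts. This is a sketch; `check_artifacts` is a hypothetical helper, and the directory it inspects is an assumption (wherever the pipeline writes its output):

```bash
#!/usr/bin/env bash
# check_artifacts DIR: verify the five expected report files exist and are non-empty.
# Hypothetical helper; DIR is assumed to be the pipeline's output directory
# (OUTPUT_DIR in the Jenkinsfile) or a folder of downloaded artifacts.
check_artifacts() {
  local dir="$1" f rc=0
  for f in dashboard.html report.json namespace-stats.json all-pods.json node-resources.json; do
    if [ ! -s "$dir/$f" ]; then
      echo "missing or empty: $dir/$f" >&2
      rc=1
    fi
  done
  return "$rc"
}

# Usage: check_artifacts ./output && echo "all artifacts present"
```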
🔧 Troubleshooting
Issue 1: "Failed to query Prometheus"
Symptoms:
⚠️ Failed to query Prometheus: Connection refused
Fix:
# Check if Prometheus is running
kubectl get pods -n monitoring
# Check service
kubectl get svc -n monitoring
# Test connection from Jenkins pod
kubectl exec -it jenkins-0 -n jenkins -- \
curl "http://prometheus-server.monitoring.svc.cluster.local/api/v1/query?query=up"
If Prometheus is in a different namespace:
Update Jenkinsfile:
PROMETHEUS_URL = 'http://prometheus-server.YOUR_NAMESPACE.svc.cluster.local'
Issue 2: "jq: command not found"
Fix:
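# Install jq in the Jenkins container (works only if the container runs as root;
# the official jenkins/jenkins image runs as user jenkins, and changes made
# this way are lost when the pod restarts)
kubectl exec -it jenkins-0 -n jenkins -- apt-get update
kubectl exec -it jenkins-0 -n jenkins -- apt-get install -y jq
# Durable fix: add it to the Jenkins Dockerfile:
# USER root
# RUN apt-get update && apt-get install -y jq
# USER jenkins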
Issue 3: "kubectl: command not found"
Fix:
Jenkins needs kubectl. Check installation:
kubectl exec -it jenkins-0 -n jenkins -- kubectl version --client
# If not installed, add it to the Jenkins image, or install it inside the container as root
# (replace amd64 with arm64 on ARM nodes):
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl
mv kubectl /usr/local/bin/
Issue 4: Dashboard shows "0" for all metrics
Possible causes:
- Prometheus not accessible
- Wrong Prometheus URL
- No metrics in Prometheus
Debug:
# Test Prometheus query manually
curl "http://prometheus-server.monitoring.svc.cluster.local/api/v1/query?query=up"
# Check if metrics exist
curl "http://prometheus-server.monitoring.svc.cluster.local/api/v1/query?query=container_cpu_usage_seconds_total"
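If a query helper falls back to 0 on errors, a failed request and a genuine zero look identical on the dashboard. This is a sketch of a query function that fails loudly instead, assuming bash, curl and jq on the agent and PROMETHEUS_URL set as in Step 2. `prom_query` is hypothetical and is not the Jenkinsfile's own `queryPrometheus`. It sends the PromQL through `--data-urlencode`, because expressions such as `rate(x[5m])` contain brackets and spaces that must be URL-encoded (raw brackets in a URL also trigger curl's globbing):

```bash
#!/usr/bin/env bash
# prom_query QUERY: print the first sample value of a Prometheus instant query.
# Hypothetical helper. Exit codes: 1 = request failed, 2 = query returned no series.
# Assumes PROMETHEUS_URL, e.g. http://prometheus-server.monitoring.svc.cluster.local
prom_query() {
  local query="$1" resp value
  # -G turns --data-urlencode into a URL-encoded GET query string; -f fails on HTTP errors
  if ! resp=$(curl -sfG --max-time 10 \
      --data-urlencode "query=${query}" \
      "${PROMETHEUS_URL:?PROMETHEUS_URL is not set}/api/v1/query"); then
    echo "prom_query: request failed: ${query}" >&2
    return 1
  fi
  # .data.result is an array of series; each .value is [timestamp, "value as string"]
  value=$(printf '%s' "$resp" | jq -r '.data.result[0].value[1] // empty')
  if [ -z "$value" ]; then
    echo "prom_query: no data for: ${query}" >&2
    return 2
  fi
  printf '%s\n' "$value"
}

# Usage:
# prom_query 'sum(rate(container_cpu_usage_seconds_total[5m]))'
```

Exit code 1 points to causes 1 and 2 above (unreachable or wrong URL), while exit code 2 points to cause 3 (the metric is simply not being scraped).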
Issue 5: HTML Dashboard not showing
Check:
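# Verify the HTML Publisher plugin is installed:
# Jenkins → Manage Jenkins → Manage Plugins → Installed
# Look for: HTML Publisher Plugin
# If not installed:
# Manage Plugins → Available → Search "HTML Publisher" → Install
If the page opens but looks unstyled, Jenkins' default Content Security Policy is blocking the report's inline CSS; it is controlled by the hudson.model.DirectoryBrowserSupport.CSP system property.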
Issue 6: Telegram notifications not sending
Check credentials:
# Verify credentials exist:
# Jenkins → Manage Jenkins → Manage Credentials
# Should see:
# - telegram-bot-token
# - telegram-chat-id
# Test manually:
BOT_TOKEN="<your bot token>"
CHAT_ID="<your chat ID>"
curl -X POST "https://api.telegram.org/bot${BOT_TOKEN}/sendMessage" \
-d chat_id="${CHAT_ID}" \
-d text="Test from terminal"
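The manual test prints Telegram's JSON reply: the Bot API answers with `"ok":true` on success and `"ok":false` plus a `description` on failure (for example `Unauthorized` for a bad token, or `Bad Request: chat not found` for a wrong chat ID). This is a sketch that turns that reply into an exit code, assuming bash and curl; `tg_send` is a hypothetical helper name:

```bash
#!/usr/bin/env bash
# tg_send BOT_TOKEN CHAT_ID TEXT: send a message and fail if Telegram rejects it.
# Hypothetical helper for debugging notifications.
tg_send() {
  local token="$1" chat_id="$2" text="$3" resp
  # --data-urlencode keeps newlines, '&' and emoji in TEXT intact
  resp=$(curl -s -X POST "https://api.telegram.org/bot${token}/sendMessage" \
    --data-urlencode "chat_id=${chat_id}" \
    --data-urlencode "text=${text}") || true   # network errors fall through to the check below
  if printf '%s' "$resp" | grep -q '"ok":true'; then
    return 0
  fi
  # On failure, show Telegram's reply (it includes a "description" field)
  echo "Telegram API error: ${resp:-no response}" >&2
  return 1
}

# Usage (values from the Jenkins credentials in Step 3):
# tg_send "$BOT_TOKEN" "$CHAT_ID" "Test from terminal"
```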
📊 Understanding the Dashboard
Metrics Explained:
Cluster Information:
- Kubernetes Version: Your K8s version
- Nodes: Number of worker nodes
- Namespaces: Total namespaces
- Total Pods: All pods across cluster
Resource Capacity:
- Total CPU Cores: Sum of all node CPUs
- Total Memory: Sum of all node RAM
- Avg CPU Usage: Average CPU across containers
- Progress Bar: Visual CPU usage
Pod Status:
- Running: Healthy pods ✅
- Pending: Pods waiting to start ⏳
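- Failed: Pods whose containers exited with errors (phase Failed) ❌; crash-looping pods usually stay Running and show up as restarts instead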
- Total Restarts: Container restarts (high = problem)
Monthly Costs:
- Based on CPU cores and Memory GB
- Calculated using rates you configured
- Estimates infrastructure cost
Health Checks:
- High restart count (>10)
- Failed pods (>0)
- Pending pods (>5)
- High CPU usage (>80%)
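The four thresholds can be expressed as a small check. This is a sketch that mirrors the documented thresholds, assuming the four numbers are already collected (for example from report.json); `health_issues` is hypothetical and not the Jenkinsfile's own code:

```bash
#!/usr/bin/env bash
# health_issues RESTARTS FAILED PENDING CPU_PERCENT
# Hypothetical helper mirroring the documented thresholds.
# Prints one line per breached threshold; prints nothing when healthy.
health_issues() {
  awk -v restarts="$1" -v failed="$2" -v pending="$3" -v cpu="$4" 'BEGIN {
    if (restarts + 0 > 10) printf "High restart count: %d\n", restarts
    if (failed + 0 > 0)    printf "Failed pods: %d\n", failed
    if (pending + 0 > 5)   printf "Pending pods: %d\n", pending
    if (cpu + 0 > 80)      printf "High CPU usage: %.1f%%\n", cpu
  }'
}

# Example matching the Health Monitoring scenario later in this guide:
# 15 restarts, 5 failed pods, 0 pending, 23.5% CPU
health_issues 15 5 0 23.5
```

The comparisons are strict (`>`), so exactly 10 restarts or exactly 80% CPU does not raise an issue, matching the wording above.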
Resources by Namespace:
- Table showing pod/container count per namespace
- Sorted by pod count (highest first)
🎨 Customization
Change Schedule
Edit cron trigger in Jenkinsfile:
triggers {
cron('0 8 * * 1-5') // Weekdays 8 AM
// Examples:
// cron('0 */6 * * *') // Every 6 hours
// cron('0 9 * * 1') // Mondays 9 AM
// cron('0 0 * * *') // Daily midnight
}
Add More Metrics
Add to Prometheus queries section:
// Disk I/O
env.DISK_READ_MB = queryPrometheus(
"sum(rate(container_fs_reads_bytes_total[5m])) / 1024 / 1024"
)
// HTTP Requests (if you have metrics)
env.HTTP_REQUESTS_PER_SEC = queryPrometheus(
"sum(rate(http_requests_total[5m]))"
)
Then add to HTML dashboard:
<div class="metric">
<span class="metric-label">Disk Read</span>
<span class="metric-value">${env.DISK_READ_MB} MB/s</span>
</div>
Change Colors/Styling
Edit CSS in generateDashboardHTML():
/* Change main gradient */
background: linear-gradient(135deg, #YOUR_COLOR1 0%, #YOUR_COLOR2 100%);
/* Change card colors */
.card h2 {
color: #YOUR_COLOR;
}
Add Email Recipients
Add to Jenkinsfile:
post {
success {
emailext (
to: 'devops-team@company.com',
subject: "Cluster Health Report - ${new Date().format('yyyy-MM-dd')}",
body: '''
<h2>Daily Cluster Health Report</h2>
<p>Please see attached dashboard.</p>
''',
mimeType: 'text/html',
attachmentsPattern: '**/dashboard.html'
)
}
}
📈 Usage Examples
Weekly Review
Monday 8 AM → Dashboard generated
Review:
- Are costs increasing? Why?
- Any failed pods? Investigate
- CPU usage trending up? Scale?
- Restarts increasing? Bug in app?
Cost Tracking
Week 1: $150/month
Week 2: $180/month ⚠️ (+20%)
→ Check namespace-stats.json
→ Which namespace grew?
→ Review pod counts
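The week-over-week change is (new - old) / old × 100. This is a sketch, assuming the two monthly totals come from consecutive report.json files; `pct_change` is a hypothetical helper:

```bash
#!/usr/bin/env bash
# pct_change OLD NEW: signed percentage change from OLD to NEW, one decimal.
# Hypothetical helper for comparing monthly cost estimates between runs.
pct_change() {
  awk -v prev="$1" -v curr="$2" 'BEGIN {
    if (prev + 0 == 0) { print "n/a"; exit }   # avoid division by zero
    d = (curr - prev) / prev * 100
    printf "%s%.1f%%\n", (d >= 0 ? "+" : ""), d
  }'
}

pct_change 150 180   # prints +20.0% (the Week 1 -> Week 2 example above)
```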
Capacity Planning
Current: 12 CPU cores, 23.5% usage
If usage > 70% for 7 days:
→ Time to add nodes
→ Compare successive reports to spot the trend
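The "above 70% for 7 days" rule is easy to automate once one CPU reading per day is kept somewhere, for example `avg_cpu_usage_percent` from each day's report.json. This is a sketch under that assumption; `sustained_high_cpu` is hypothetical:

```bash
#!/usr/bin/env bash
# sustained_high_cpu THRESHOLD VALUE...: succeed if the last 7 values all exceed THRESHOLD.
# Hypothetical helper. Values are daily average CPU percentages, oldest first.
sustained_high_cpu() {
  local threshold="$1"; shift
  [ "$#" -ge 7 ] || return 1          # need a full week of data
  shift $(( $# - 7 ))                 # keep only the last 7 readings
  printf '%s\n' "$@" |
    awk -v t="$threshold" '$1 + 0 <= t + 0 { low = 1 } END { exit (low ? 1 : 0) }'
}

if sustained_high_cpu 70 65 72 75 81 78 74 90 88; then
  echo "Time to add nodes"
fi
```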
Health Monitoring
Dashboard shows:
❌ 5 pods in Failed state
⚠️ 15 container restarts
→ Click artifact → all-pods.json
→ Find which pods
→ kubectl logs <pod>
→ Fix issue
🔗 Integration with Other Tools
Export to Grafana
Use report.json:
# Download report.json from Jenkins artifact
# Import to Grafana via JSON API datasource
# Create time-series dashboard
Send to Slack
Add Slack webhook:
post {
success {
sh """
curl -X POST "${SLACK_WEBHOOK_URL}" \
-H 'Content-Type: application/json' \
-d '{
"text": "Daily Cluster Report: ${env.MONTHLY_TOTAL_COST} USD/month",
"attachments": [{
"color": "good",
"fields": [
{"title": "Nodes", "value": "${env.NODE_COUNT}", "short": true},
{"title": "Pods", "value": "${env.POD_COUNT}", "short": true}
]
}]
}'
"""
}
}
Store in Database
Parse JSON and insert:
stage('Store in Database') {
steps {
script {
def report = readJSON file: "${OUTPUT_DIR}/report.json"
sh """
psql -h postgres -U metrics -d cluster_metrics -c "
INSERT INTO daily_reports (date, cpu_usage, pod_count, cost)
VALUES ('${report.generated_at}', ${report.resources.avg_cpu_usage_percent},
${report.cluster.total_pods}, ${report.costs.monthly_total_usd})
"
"""
}
}
}
✅ Verification Checklist
After setup, verify:
- Jenkins job created
- First build succeeds
- HTML dashboard accessible
- Metrics show real data (not zeros)
- Telegram notification received
- Costs calculated correctly
- JSON report generated
- Namespace table populated
- Health checks working
- Schedule triggers correctly
📚 Next Steps
Enhancements:
- Historical Tracking - Store reports in Git or database
- Alerts - Trigger alerts on threshold breaches
- Comparison - Compare week-over-week trends
- Recommendations - Auto-suggest optimizations
- Deep Dive - Per-namespace detailed reports
Related Pipelines:
- Security Scanning (scan images from this report)
- Cleanup Pipeline (remove resources shown as unused)
- Backup Pipeline (backup based on importance shown here)
You're all set! 🎉
Run your first build and enjoy your cluster health dashboard! 📊✨