# 📊 Cluster Health Dashboard - Setup Guide

Complete setup guide for the Kubernetes Cluster Health Dashboard Jenkins pipeline.

---

## 🎯 What This Dashboard Does

### Collects:
- ✅ Cluster information (version, nodes, namespaces, pods)
- ✅ Resource metrics from Prometheus (CPU, Memory, Network)
- ✅ Pod status across all namespaces
- ✅ Node capacity and usage
- ✅ Cost estimation (monthly)
- ✅ Health checks and issue detection

### Generates:
- 📊 Interactive HTML dashboard
- 📄 JSON report with all metrics
- 📱 Telegram summary notification
- 📧 Optional email report

---

## 📋 Prerequisites

### Required:
- ✅ Jenkins with Kubernetes plugin
- ✅ kubectl configured with cluster access
- ✅ Prometheus running in cluster (for metrics)
- ✅ jq installed on Jenkins agent
- ✅ curl installed on Jenkins agent

### Optional:
- ⚙️ Telegram bot (for notifications)
- ⚙️ Email configured in Jenkins
- ⚙️ Grafana (referenced in dashboard)

---

## 🚀 Setup Steps

### Step 1: Install Required Tools on Jenkins Agent

```bash
# SSH to your Jenkins agent or use Jenkins shell

# Install jq (JSON processor)
sudo apt-get update
sudo apt-get install -y jq

# Verify installations
jq --version
kubectl version --client
curl --version
```

### Step 2: Configure Prometheus Access

**Option A: If Prometheus is in your cluster (recommended)**

Check if Prometheus is accessible:

```bash
# From Jenkins agent or any pod in cluster
kubectl get svc -n monitoring

# Should see something like:
# prometheus-server   ClusterIP   10.43.xxx.xxx   <none>   80/TCP
```

**Option B: If Prometheus is external**

Update Jenkinsfile environment variables:

```groovy
environment {
    PROMETHEUS_URL = 'http://your-prometheus-url:9090'
}
```

**Test Prometheus access:**

```bash
# From Jenkins agent
curl "http://prometheus-server.monitoring.svc.cluster.local/api/v1/query?query=up"

# Should return JSON with metrics
```

### Step 3: Set Up Telegram Notifications (Optional)

If you already have a bot from a
previous setup, skip this step.

**A. Create Bot (if not done)**

1. Open Telegram → @BotFather
2. Send `/newbot`
3. Get token: `1234567890:ABC...`

**B. Get Chat ID**

1. Telegram → @userinfobot
2. Get your ID: `904518516`

**C. Add to Jenkins Credentials**

Jenkins → Manage Jenkins → Manage Credentials → Add:

**Credential 1:**
```
Kind: Secret text
Secret: YOUR_BOT_TOKEN
ID: telegram-bot-token
Description: Telegram Bot Token
```

**Credential 2:**
```
Kind: Secret text
Secret: 904518516
ID: telegram-chat-id
Description: Telegram Chat ID
```

Use the token and chat ID from steps A and B. Never commit a real bot token to documentation or Git.

### Step 4: Adjust Cost Estimates

Edit Jenkinsfile to match your actual cloud costs:

```groovy
environment {
    // Adjust these to your actual pricing
    CPU_PRICE_PER_HOUR = '0.04'           // $0.04 per vCPU/hour
    MEMORY_PRICE_PER_GB_HOUR = '0.005'    // $0.005 per GB/hour
}
```

**Common pricing reference:**
- AWS t3.medium: ~$0.0416/hour (2 vCPU, 4GB RAM)
- DigitalOcean: $0.06/hour per vCPU, $0.007/GB RAM
- Local/Bare metal: $0 (or electricity cost)

### Step 5: Create Jenkins Pipeline

**A. Create New Pipeline Job**

1. Jenkins → New Item
2. Name: `cluster-health-dashboard`
3. Type: Pipeline
4. OK

**B. Configure Pipeline**

1. **Description:**
   ```
   Daily cluster health monitoring and reporting.
   Generates dashboard with metrics, costs, and health checks.
   ```

2. **Build Triggers:**
   - ☑️ Build periodically
   - Schedule: `0 8 * * 1-5` (8 AM weekdays)

3. **Pipeline:**
   - Definition: Pipeline script from SCM
   - SCM: Git
   - Repository URL: `http://gitea-http.gitea.svc.cluster.local:3000/admin/k3s-gitops`
   - Credentials: `gitea-credentials`
   - Branch: `*/main`
   - Script Path: `apps/cluster-health-dashboard/Jenkinsfile`

**C. Or use Pipeline Script Directly**

If you want to test first without Git:

1. Definition: Pipeline script
2. Copy entire Jenkinsfile content into the script box
3. Save

### Step 6: Add to GitOps Repository

```bash
# On your local machine
cd ~/projects/k3s-gitops

# Create directory
mkdir -p apps/cluster-health-dashboard

# Copy Jenkinsfile
cp /path/to/Jenkinsfile apps/cluster-health-dashboard/

# Commit
git add apps/cluster-health-dashboard/
git commit -m "feat: add cluster health dashboard pipeline"
git push origin main
```

---

## 🧪 Testing

### Test 1: Manual Run (First Time)

1. Jenkins → cluster-health-dashboard → Build with Parameters
2. Set:
   - REPORT_PERIOD: `24h`
   - SEND_EMAIL: `false` (for first test)
   - SEND_TELEGRAM: `true`
3. Click Build

**Watch Console Output:**
```
🚀 Starting Cluster Health Dashboard generation...
📋 Collecting cluster information...
   Cluster version: v1.28.0
   Nodes: 3
   Namespaces: 14
   Pods: 67
📈 Querying Prometheus for metrics...
✅ Dashboard generated
```

### Test 2: Check Generated Dashboard

After build completes:

1. Jenkins → cluster-health-dashboard → Build #1
2. Click "Cluster Health Dashboard" (left sidebar)
3. You should see the rendered HTML dashboard! 🎨

### Test 3: Check Telegram Notification

You should receive:
```
📊 Cluster Health Report
━━━━━━━━━━━━━━━━━━━━━━
📋 Cluster Info
Version: v1.28.0
Nodes: 3
Namespaces: 14
Total Pods: 67
━━━━━━━━━━━━━━━━━━━━━━
💻 Resources
CPU Cores: 12
Memory: 48 GB
Avg CPU Usage: 23.5%
...
```

### Test 4: Check Artifacts

1. Build #1 → Artifacts
2. Should see:
   - `dashboard.html`
   - `report.json`
   - `namespace-stats.json`
   - `all-pods.json`
   - `node-resources.json`

---

## 🔧 Troubleshooting

### Issue 1: "Failed to query Prometheus"

**Symptoms:**
```
⚠️ Failed to query Prometheus: Connection refused
```

**Fix:**
```bash
# Check if Prometheus is running
kubectl get pods -n monitoring

# Check service
kubectl get svc -n monitoring

# Test connection from Jenkins pod
kubectl exec -it jenkins-0 -n jenkins -- \
  curl "http://prometheus-server.monitoring.svc.cluster.local/api/v1/query?query=up"
```

**If Prometheus is in a different namespace:**

Update Jenkinsfile:
```groovy
PROMETHEUS_URL = 'http://prometheus-server.YOUR_NAMESPACE.svc.cluster.local'
```

### Issue 2: "jq: command not found"

**Fix:**
```bash
# Install jq on Jenkins agent
kubectl exec -it jenkins-0 -n jenkins -- apt-get update
kubectl exec -it jenkins-0 -n jenkins -- apt-get install -y jq

# Or add to Jenkins Dockerfile:
# RUN apt-get update && apt-get install -y jq
```

### Issue 3: "kubectl: command not found"

**Fix:**

Jenkins needs kubectl. Check installation:
```bash
kubectl exec -it jenkins-0 -n jenkins -- kubectl version --client

# If not installed, add to Jenkins image or install:
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl
mv kubectl /usr/local/bin/
```

### Issue 4: Dashboard shows "0" for all metrics

**Possible causes:**
1. Prometheus not accessible
2. Wrong Prometheus URL
3. No metrics in Prometheus

**Debug:**
```bash
# Test Prometheus query manually
curl "http://prometheus-server.monitoring.svc.cluster.local/api/v1/query?query=up"

# Check if metrics exist
curl "http://prometheus-server.monitoring.svc.cluster.local/api/v1/query?query=container_cpu_usage_seconds_total"
```

### Issue 5: HTML Dashboard not showing

**Check:**
```
# Verify the HTML Publisher plugin is installed
Jenkins → Manage Jenkins → Manage Plugins → Installed
# Look for: HTML Publisher Plugin

# If not installed:
# Manage Plugins → Available → Search "HTML Publisher" → Install
```

### Issue 6: Telegram notifications not sending

**Check credentials:**
```bash
# Verify credentials exist in Jenkins:
#   Jenkins → Manage Jenkins → Manage Credentials
# Should see:
#   - telegram-bot-token
#   - telegram-chat-id

# Test manually (substitute your own values):
BOT_TOKEN="YOUR_BOT_TOKEN"
CHAT_ID="YOUR_CHAT_ID"

curl -X POST "https://api.telegram.org/bot${BOT_TOKEN}/sendMessage" \
  -d chat_id="${CHAT_ID}" \
  -d text="Test from terminal"
```

---

## 📊 Understanding the Dashboard

### Metrics Explained:

**Cluster Information:**
- **Kubernetes Version:** Your K8s version
- **Nodes:** Number of worker nodes
- **Namespaces:** Total namespaces
- **Total Pods:** All pods across cluster

**Resource Capacity:**
- **Total CPU Cores:** Sum of all node CPUs
- **Total Memory:** Sum of all node RAM
- **Avg CPU Usage:** Average CPU across containers
- **Progress Bar:** Visual CPU usage

**Pod Status:**
- **Running:** Healthy pods ✅
- **Pending:** Pods waiting to start ⏳
- **Failed:** Crashed pods ❌
- **Total Restarts:** Container restarts (high = problem)

**Monthly Costs:**
- Based on CPU cores and Memory GB
- Calculated using rates you configured
- Estimates infrastructure cost

**Health Checks:**
- High restart count (>10)
- Failed pods (>0)
- Pending pods (>5)
- High CPU usage (>80%)

**Resources by Namespace:**
- Table showing pod/container count per namespace
- Sorted by pod count (highest first)

---

## 🎨 Customization

### Change Schedule

Edit cron trigger in Jenkinsfile:

```groovy
triggers {
    cron('0 8 * * 1-5')  // Weekdays 8 AM

    // Examples:
    // cron('0 */6 * * *')  // Every 6 hours
    // cron('0 9 * * 1')    // Mondays 9 AM
    // cron('0 0 * * *')    // Daily midnight
}
```

### Add More Metrics

Add to Prometheus queries section:

```groovy
// Disk I/O
env.DISK_READ_MB = queryPrometheus(
    "sum(rate(container_fs_reads_bytes_total[5m])) / 1024 / 1024"
)

// HTTP Requests (if you have metrics)
env.HTTP_REQUESTS_PER_SEC = queryPrometheus(
    "sum(rate(http_requests_total[5m]))"
)
```

Then add to HTML dashboard:

```html
Disk Read ${env.DISK_READ_MB} MB/s
```

### Change Colors/Styling

Edit CSS in `generateDashboardHTML()`:

```css
/* Change main gradient */
background: linear-gradient(135deg, #YOUR_COLOR1 0%, #YOUR_COLOR2 100%);

/* Change card colors */
.card h2 {
    color: #YOUR_COLOR;
}
```

### Add Email Recipients

Add to Jenkinsfile:

```groovy
post {
    success {
        emailext (
            to: 'devops-team@company.com',
            subject: "Cluster Health Report - ${new Date().format('yyyy-MM-dd')}",
            body: '''

Daily Cluster Health Report

Please see attached dashboard.

            ''',
            mimeType: 'text/html',
            attachmentsPattern: '**/dashboard.html'
        )
    }
}
```

---

## 📈 Usage Examples

### Weekly Review

```
Monday 8 AM → Dashboard generated

Review:
- Are costs increasing? Why?
- Any failed pods? Investigate
- CPU usage trending up? Scale?
- Restarts increasing? Bug in app?
```

### Cost Tracking

```
Week 1: $150/month
Week 2: $180/month ⚠️ (+20%)

→ Check namespace-stats.json
→ Which namespace grew?
→ Review pod counts
```

### Capacity Planning

```
Current: 12 CPU cores, 23.5% usage

If usage > 70% for 7 days:
→ Time to add nodes
→ Dashboard shows trend
```

### Health Monitoring

```
Dashboard shows:
❌ 5 pods in Failed state
⚠️ 15 container restarts

→ Click artifact → all-pods.json
→ Find which pods
→ kubectl logs
→ Fix issue
```

---

## 🔗 Integration with Other Tools

### Export to Grafana

Use `report.json`:

```bash
# Download report.json from Jenkins artifact
# Import to Grafana via JSON API datasource
# Create time-series dashboard
```

### Send to Slack

Add Slack webhook:

```groovy
post {
    success {
        sh """
            curl -X POST ${SLACK_WEBHOOK_URL} \
              -H 'Content-Type: application/json' \
              -d '{
                "text": "Daily Cluster Report: ${env.MONTHLY_TOTAL_COST} USD/month",
                "attachments": [{
                  "color": "good",
                  "fields": [
                    {"title": "Nodes", "value": "${env.NODE_COUNT}", "short": true},
                    {"title": "Pods", "value": "${env.POD_COUNT}", "short": true}
                  ]
                }]
              }'
        """
    }
}
```

### Store in Database

Parse JSON and insert:

```groovy
stage('Store in Database') {
    steps {
        script {
            def report = readJSON file: "${OUTPUT_DIR}/report.json"

            sh """
                psql -h postgres -U metrics -d cluster_metrics -c "
                    INSERT INTO daily_reports (date, cpu_usage, pod_count, cost)
                    VALUES ('${report.generated_at}',
                            ${report.resources.avg_cpu_usage_percent},
                            ${report.cluster.total_pods},
                            ${report.costs.monthly_total_usd})
                "
            """
        }
    }
}
```

---

## ✅ Verification Checklist

After setup, verify:

- [ ] Jenkins job created
- [ ] First build succeeds
- [ ] HTML dashboard accessible
- [ ] Metrics show real data (not zeros)
- [ ] Telegram notification received
- [ ] Costs calculated correctly
- [ ] JSON report generated
- [ ] Namespace table populated
- [ ] Health checks working
- [ ] Schedule triggers correctly

---

## 📚 Next Steps

### Enhancements:

1. **Historical Tracking** - Store reports in Git or database
2. **Alerts** - Trigger alerts on threshold breaches
3. **Comparison** - Compare week-over-week trends
4. **Recommendations** - Auto-suggest optimizations
5. **Deep Dive** - Per-namespace detailed reports

### Related Pipelines:

- Security Scanning (scan images from this report)
- Cleanup Pipeline (remove resources shown as unused)
- Backup Pipeline (backup based on importance shown here)

---

**You're all set! 🎉**

Run your first build and enjoy your cluster health dashboard! 📊✨
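
---

To sanity-check the "Costs calculated correctly" item, it can help to reproduce the monthly estimate by hand. The sketch below is one plausible reading of the **Monthly Costs** description (CPU cores and memory GB multiplied by the hourly rates from Step 4); it is not taken from the Jenkinsfile. The function name `estimate_monthly_cost` and the 730 hours-per-month default are assumptions made for illustration, so compare the result against what your pipeline actually reports.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of the pipeline's Jenkinsfile.
# Usage: estimate_monthly_cost CPU_CORES MEMORY_GB CPU_PRICE_PER_HOUR MEMORY_PRICE_PER_GB_HOUR [HOURS_PER_MONTH]
estimate_monthly_cost() {
  local cores="$1" mem_gb="$2" cpu_rate="$3" mem_rate="$4"
  local hours="${5:-730}"   # assumption: about 730 hours in an average month

  # awk handles the floating-point arithmetic that plain bash lacks.
  awk -v c="$cores" -v m="$mem_gb" -v cr="$cpu_rate" -v mr="$mem_rate" -v h="$hours" \
    'BEGIN { printf "%.2f\n", (c * cr + m * mr) * h }'
}

# Sample cluster from this guide (12 cores, 48 GB) with the Step 4 default rates.
estimate_monthly_cost 12 48 0.04 0.005
```

With the sample values this prints `525.60`. If the dashboard shows a noticeably different figure for the same inputs, the pipeline probably uses a different hours-per-month constant or bases the estimate on usage rather than capacity.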