Add apps/cluster-health-dashboard/readme.md

2026-01-07 08:26:35 +00:00
parent 332ef81a0a
commit 88e348d129
1 changed files with 634 additions and 0 deletions
--- a/apps/cluster-health-dashboard/readme.md
+++ b/apps/cluster-health-dashboard/readme.md
@@ -0,0 +1,634 @@
+# 📊 Cluster Health Dashboard - Setup Guide
+
+Complete setup guide for the Kubernetes Cluster Health Dashboard Jenkins pipeline.
+
+---
+
+## 🎯 What This Dashboard Does
+
+### Collects:
+- ✅ Cluster information (version, nodes, namespaces, pods)
+- ✅ Resource metrics from Prometheus (CPU, Memory, Network)
+- ✅ Pod status across all namespaces
+- ✅ Node capacity and usage
+- ✅ Cost estimation (monthly)
+- ✅ Health checks and issue detection
+
+### Generates:
+- 📊 Interactive HTML dashboard
+- 📄 JSON report with all metrics
+- 📱 Telegram summary notification
+- 📧 Optional email report
+
+---
+
+## 📋 Prerequisites
+
+### Required:
+- ✅ Jenkins with Kubernetes plugin
+- ✅ HTML Publisher plugin for Jenkins (publishes the HTML dashboard)
+- ✅ kubectl configured with cluster access
+- ✅ Prometheus running in cluster (for metrics)
+- ✅ jq installed on Jenkins agent
+- ✅ curl installed on Jenkins agent
+
+### Optional:
+- ⚙️ Telegram bot (for notifications)
+- ⚙️ Email configured in Jenkins
+- ⚙️ Grafana (referenced in dashboard)
+
+---
+
+## 🚀 Setup Steps
+
+### Step 1: Install Required Tools on Jenkins Agent
+
+```bash
+# SSH to your Jenkins agent or use Jenkins shell
+
+# Install jq (JSON processor) and curl
+sudo apt-get update
+sudo apt-get install -y jq curl
+
+# Verify installations
+jq --version
+kubectl version --client
+curl --version
+```
+
+### Step 2: Configure Prometheus Access
+
+**Option A: If Prometheus is in your cluster (recommended)**
+
+Check if Prometheus is accessible:
+
+```bash
+# From Jenkins agent or any pod in cluster
+kubectl get svc -n monitoring
+
+# Should see something like:
+# prometheus-server   ClusterIP   10.43.xxx.xxx   <none>   80/TCP
+```
+
+**Option B: If Prometheus is external**
+
+Update Jenkinsfile environment variables:
+
+```groovy
+environment {
+    PROMETHEUS_URL = 'http://your-prometheus-url:9090'
+}
+```
+
+**Test Prometheus access:**
+
+```bash
+# From Jenkins agent
+curl "http://prometheus-server.monitoring.svc.cluster.local/api/v1/query?query=up"
+
+# Should return JSON with metrics
+```
+
+### Step 3: Set Up Telegram Notifications (Optional)
+
+If you already have a bot from a previous setup, skip this step.
+
+**A. Create Bot (if not done)**
+1. Open Telegram and start a chat with @BotFather
+2. Send `/newbot` and follow the prompts
+3. Copy the token it returns: `1234567890:ABC...`
+
+**B. Get Chat ID**
+1. In Telegram, start a chat with @userinfobot
+2. Copy the ID it replies with: `904518516`
+
+**C. Add to Jenkins Credentials**
+
+Jenkins → Manage Jenkins → Manage Credentials → Add:
+
+**Credential 1:**
+```
+Kind: Secret text
+Secret: <bot token from step A>
+ID: telegram-bot-token
+Description: Telegram Bot Token
+```
+
+**Credential 2:**
+```
+Kind: Secret text
+Secret: 904518516
+ID: telegram-chat-id
+Description: Telegram Chat ID
+```
+
+### Step 4: Adjust Cost Estimates
+
+Edit Jenkinsfile to match your actual cloud costs:
+
+```groovy
+environment {
+    // Adjust these to your actual pricing
+    CPU_PRICE_PER_HOUR = '0.04'        // $0.04 per vCPU/hour
+    MEMORY_PRICE_PER_GB_HOUR = '0.005' // $0.005 per GB/hour
+}
+```
+
+**Common pricing reference (approximate; check your provider's current rates):**
+- AWS t3.medium: ~$0.0416/hour per instance (2 vCPU, 4 GB RAM)
+- DigitalOcean: ~$0.06/hour per vCPU, ~$0.007/hour per GB RAM
+- Local/bare metal: $0 (or electricity cost)
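+
+To sanity-check the rates before the first run, the estimate can be reproduced by hand. The sketch below assumes the cost is total capacity multiplied by the hourly rate and by 730 hours per month. Both the `monthly_cost` helper and the 730-hour month are illustrative assumptions, not taken from the Jenkinsfile.
+
+```bash
+# Hypothetical helper: estimate monthly cost from capacity and hourly rates.
+# Usage: monthly_cost <cpu_cores> <memory_gb> <cpu_price_per_hour> <memory_price_per_gb_hour>
+monthly_cost() {
+    awk -v cores="$1" -v mem="$2" -v cpu_price="$3" -v mem_price="$4" 'BEGIN {
+        hours = 730  # assumed average number of hours in a month
+        printf "%.2f\n", (cores * cpu_price + mem * mem_price) * hours
+    }'
+}
+
+# 12 cores and 48 GB of memory at the default rates shown above
+monthly_cost 12 48 0.04 0.005
+```
+
+With the default rates this prints `525.60`. If the dashboard reports a very different figure for the same capacity, the pipeline is probably using a different formula or month length.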
+
+### Step 5: Create Jenkins Pipeline
+
+**A. Create New Pipeline Job**
+
+1. Jenkins → New Item
+2. Name: `cluster-health-dashboard`
+3. Type: Pipeline
+4. OK
+
+**B. Configure Pipeline**
+
+1. **Description:**
+   ```
+   Daily cluster health monitoring and reporting. 
+   Generates dashboard with metrics, costs, and health checks.
+   ```
+
+2. **Build Triggers:**
+   - ☑️ Build periodically
+   - Schedule: `0 8 * * 1-5` (8 AM weekdays)
+
+3. **Pipeline:**
+   - Definition: Pipeline script from SCM
+   - SCM: Git
+   - Repository URL: `http://gitea-http.gitea.svc.cluster.local:3000/admin/k3s-gitops`
+   - Credentials: `gitea-credentials`
+   - Branch: `*/main`
+   - Script Path: `apps/cluster-health-dashboard/Jenkinsfile`
+
+**C. Or use Pipeline Script Directly**
+
+If you want to test first without Git:
+1. Definition: Pipeline script
+2. Copy entire Jenkinsfile content into the script box
+3. Save
+
+### Step 6: Add to GitOps Repository
+
+```bash
+# On your local machine
+cd ~/projects/k3s-gitops
+
+# Create directory
+mkdir -p apps/cluster-health-dashboard
+
+# Copy Jenkinsfile
+cp /path/to/Jenkinsfile apps/cluster-health-dashboard/
+
+# Commit
+git add apps/cluster-health-dashboard/
+git commit -m "feat: add cluster health dashboard pipeline"
+git push origin main
+```
+
+---
+
+## 🧪 Testing
+
+### Test 1: Manual Run (First Time)
+
+1. Jenkins → cluster-health-dashboard → Build with Parameters
+2. Set:
+   - REPORT_PERIOD: `24h`
+   - SEND_EMAIL: `false` (for first test)
+   - SEND_TELEGRAM: `true`
+3. Click **Build**
+
+**Watch Console Output:**
+```
+🚀 Starting Cluster Health Dashboard generation...
+📋 Collecting cluster information...
+Cluster version: v1.28.0
+Nodes: 3
+Namespaces: 14
+Pods: 67
+📈 Querying Prometheus for metrics...
+✅ Dashboard generated
+```
+
+### Test 2: Check Generated Dashboard
+
+After build completes:
+
+1. Jenkins → cluster-health-dashboard → Build #1
+2. Click "Cluster Health Dashboard" (left sidebar)
+3. You should see the HTML dashboard 🎨
+
+### Test 3: Check Telegram Notification
+
+You should receive:
+```
+📊 Cluster Health Report
+
+━━━━━━━━━━━━━━━━━━━━━━
+📋 Cluster Info
+Version: v1.28.0
+Nodes: 3
+Namespaces: 14
+Total Pods: 67
+
+━━━━━━━━━━━━━━━━━━━━━━
+💻 Resources
+CPU Cores: 12
+Memory: 48 GB
+Avg CPU Usage: 23.5%
+...
+```
+
+### Test 4: Check Artifacts
+
+1. Build #1 → Artifacts
+2. Should see:
+   - `dashboard.html`
+   - `report.json`
+   - `namespace-stats.json`
+   - `all-pods.json`
+   - `node-resources.json`
+
+---
+
+## 🔧 Troubleshooting
+
+### Issue 1: "Failed to query Prometheus"
+
+**Symptoms:**
+```
+⚠️ Failed to query Prometheus: Connection refused
+```
+
+**Fix:**
+
+```bash
+# Check if Prometheus is running
+kubectl get pods -n monitoring
+
+# Check service
+kubectl get svc -n monitoring
+
+# Test connection from Jenkins pod
+kubectl exec -it jenkins-0 -n jenkins -- \
+  curl "http://prometheus-server.monitoring.svc.cluster.local/api/v1/query?query=up"
+```
+
+**If Prometheus is in a different namespace:**
+
+Update Jenkinsfile:
+```groovy
+PROMETHEUS_URL = 'http://prometheus-server.YOUR_NAMESPACE.svc.cluster.local'
+```
+
+### Issue 2: "jq: command not found"
+
+**Fix:**
+
+```bash
+# Install jq in the running Jenkins pod (requires root; lost when the pod restarts)
+kubectl exec -it jenkins-0 -n jenkins -- apt-get update
+kubectl exec -it jenkins-0 -n jenkins -- apt-get install -y jq
+
+# For a permanent fix, add to the Jenkins Dockerfile:
+# RUN apt-get update && apt-get install -y jq
+```
+
+### Issue 3: "kubectl: command not found"
+
+**Fix:**
+
+Jenkins needs kubectl. Check installation:
+
+```bash
+kubectl exec -it jenkins-0 -n jenkins -- kubectl version --client
+
+# If not installed, add to Jenkins image or install:
+curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
+chmod +x kubectl
+mv kubectl /usr/local/bin/
+```
+
+### Issue 4: Dashboard shows "0" for all metrics
+
+**Possible causes:**
+1. Prometheus not accessible
+2. Wrong Prometheus URL
+3. No metrics in Prometheus
+
+**Debug:**
+
+```bash
+# Test Prometheus query manually
+curl "http://prometheus-server.monitoring.svc.cluster.local/api/v1/query?query=up"
+
+# Check if metrics exist
+curl "http://prometheus-server.monitoring.svc.cluster.local/api/v1/query?query=container_cpu_usage_seconds_total"
+```
+
+### Issue 5: HTML Dashboard not showing
+
+**Check:**
+
+1. Go to Jenkins → Manage Jenkins → Manage Plugins → Installed
+2. Look for **HTML Publisher Plugin**
+3. If it is not installed: Manage Plugins → Available → search "HTML Publisher" → Install
+
+### Issue 6: Telegram notifications not sending
+
+**Check credentials:**
+
+```bash
+# Verify credentials exist in the Jenkins UI:
+#   Jenkins → Manage Jenkins → Manage Credentials
+#
+# Should see:
+# - telegram-bot-token
+# - telegram-chat-id
+
+# Test manually (substitute your own values):
+BOT_TOKEN="<your-bot-token>"
+CHAT_ID="<your-chat-id>"
+
+curl -X POST "https://api.telegram.org/bot${BOT_TOKEN}/sendMessage" \
+    -d chat_id="${CHAT_ID}" \
+    -d text="Test from terminal"
+```
+
+---
+
+## 📊 Understanding the Dashboard
+
+### Metrics Explained:
+
+**Cluster Information:**
+- **Kubernetes Version:** Your K8s version
+- **Nodes:** Number of worker nodes
+- **Namespaces:** Total namespaces
+- **Total Pods:** All pods across cluster
+
+**Resource Capacity:**
+- **Total CPU Cores:** Sum of all node CPUs
+- **Total Memory:** Sum of all node RAM
+- **Avg CPU Usage:** Average CPU across containers
+- **Progress Bar:** Visual CPU usage
+
+**Pod Status:**
+- **Running:** Healthy pods ✅
+- **Pending:** Pods waiting to start ⏳
+- **Failed:** Crashed pods ❌
+- **Total Restarts:** Container restarts (high = problem)
+
+**Monthly Costs:**
+- Based on CPU cores and Memory GB
+- Calculated using rates you configured
+- Estimates infrastructure cost
+
+**Health Checks:**
+- High restart count (>10)
+- Failed pods (>0)
+- Pending pods (>5)
+- High CPU usage (>80%)
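+
+The thresholds above can be reproduced outside Jenkins to see which checks a given set of numbers would trip. This is a minimal sketch: `health_issues` and its message wording are hypothetical, and only the four thresholds come from this guide.
+
+```bash
+# Hypothetical helper mirroring the four thresholds listed above.
+# Usage: health_issues <total_restarts> <failed_pods> <pending_pods> <avg_cpu_percent>
+health_issues() {
+    local restarts=$1 failed=$2 pending=$3 cpu=$4
+    [ "$restarts" -gt 10 ] && echo "High restart count: $restarts"
+    [ "$failed" -gt 0 ] && echo "Failed pods: $failed"
+    [ "$pending" -gt 5 ] && echo "Pending pods: $pending"
+    # CPU usage may be fractional (e.g. 23.5), so compare it with awk
+    awk -v cpu="$cpu" 'BEGIN { exit !(cpu > 80) }' && echo "High CPU usage: ${cpu}%"
+    return 0
+}
+
+# 15 restarts, 5 failed pods, 0 pending pods, 23.5% CPU
+health_issues 15 5 0 23.5
+```
+
+The example call reports the restart count and the failed pods, but not pending pods or CPU, since those stay below their thresholds.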
+
+**Resources by Namespace:**
+- Table showing pod/container count per namespace
+- Sorted by pod count (highest first)
+
+---
+
+## 🎨 Customization
+
+### Change Schedule
+
+Edit cron trigger in Jenkinsfile:
+
+```groovy
+triggers {
+    cron('0 8 * * 1-5')  // Weekdays 8 AM
+    
+    // Examples:
+    // cron('0 */6 * * *')     // Every 6 hours
+    // cron('0 9 * * MON')     // Mondays 9 AM
+    // cron('0 0 * * *')       // Daily midnight
+}
+```
+
+### Add More Metrics
+
+Add to Prometheus queries section:
+
+```groovy
+// Disk I/O
+env.DISK_READ_MB = queryPrometheus(
+    "sum(rate(container_fs_reads_bytes_total[5m])) / 1024 / 1024"
+)
+
+// HTTP Requests (if you have metrics)
+env.HTTP_REQUESTS_PER_SEC = queryPrometheus(
+    "sum(rate(http_requests_total[5m]))"
+)
+```
+
+Then add to HTML dashboard:
+
+```html
+<div class="metric">
+    <span class="metric-label">Disk Read</span>
+    <span class="metric-value">${env.DISK_READ_MB} MB/s</span>
+</div>
+```
+
+### Change Colors/Styling
+
+Edit CSS in `generateDashboardHTML()`:
+
+```css
+/* Change main gradient */
+background: linear-gradient(135deg, #YOUR_COLOR1 0%, #YOUR_COLOR2 100%);
+
+/* Change card colors */
+.card h2 {
+    color: #YOUR_COLOR;
+}
+```
+
+### Add Email Recipients
+
+Add to Jenkinsfile:
+
+```groovy
+post {
+    success {
+        emailext (
+            to: 'devops-team@company.com',
+            subject: "Cluster Health Report - ${new Date().format('yyyy-MM-dd')}",
+            body: '''
+                <h2>Daily Cluster Health Report</h2>
+                <p>Please see attached dashboard.</p>
+            ''',
+            mimeType: 'text/html',
+            attachmentsPattern: '**/dashboard.html'
+        )
+    }
+}
+```
+
+---
+
+## 📈 Usage Examples
+
+### Weekly Review
+
+```
+Monday 8 AM → Dashboard generated
+Review:
+- Are costs increasing? Why?
+- Any failed pods? Investigate
+- CPU usage trending up? Scale?
+- Restarts increasing? Bug in app?
+```
+
+### Cost Tracking
+
+```
+Week 1: $150/month
+Week 2: $180/month ⚠️  (+20%)
+→ Check namespace-stats.json
+→ Which namespace grew?
+→ Review pod counts
+```
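+
+The week-over-week change in the example is simple arithmetic. The sketch below uses a hypothetical `pct_change` helper; the inputs would be the monthly cost figures from two successive reports.
+
+```bash
+# Hypothetical helper: percentage change between two cost estimates.
+# Usage: pct_change <previous> <current>
+pct_change() {
+    awk -v prev="$1" -v cur="$2" 'BEGIN {
+        if (prev == 0) { print "n/a"; exit }  # avoid division by zero
+        printf "%.1f\n", (cur - prev) / prev * 100
+    }'
+}
+
+# Week 1: $150/month, Week 2: $180/month
+pct_change 150 180
+```
+
+For the figures above this prints `20.0`, matching the +20% in the example.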
+
+### Capacity Planning
+
+```
+Current: 12 CPU cores, 23.5% usage
+If usage > 70% for 7 days:
+→ Time to add nodes
+→ Compare successive reports to see the trend
+```
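+
+The "above 70% for 7 days" rule can be checked against a list of daily readings, for example the `avg_cpu_usage_percent` values from successive `report.json` files. This is a sketch under that assumption: `sustained_high_cpu` is a hypothetical helper, and collecting the daily values is left to you.
+
+```bash
+# Hypothetical helper: succeeds if CPU usage exceeded the threshold on each
+# of the last <days> readings (oldest reading first, newest last).
+# Usage: sustained_high_cpu <threshold> <days> <reading>...
+sustained_high_cpu() {
+    local threshold=$1 days=$2
+    shift 2
+    [ "$#" -ge "$days" ] || return 1   # not enough history yet
+    printf '%s\n' "$@" | tail -n "$days" |
+        awk -v t="$threshold" '$1 <= t { low = 1 } END { exit (low ? 1 : 0) }'
+}
+
+# Seven daily readings, all above 70%
+sustained_high_cpu 70 7 72 75 71 80 78 74 73 && echo "Time to add nodes"
+```
+
+A single reading at or below the threshold within the window makes the check fail, so a short spike does not trigger a scale-up.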
+
+### Health Monitoring
+
+```
+Dashboard shows:
+❌ 5 pods in Failed state
+⚠️ 15 container restarts
+
+→ Click artifact → all-pods.json
+→ Find which pods
+→ kubectl logs <pod>
+→ Fix issue
+```
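+
+Finding the failing pods in `all-pods.json` is quicker with `jq` than by reading the file. The sketch below assumes the artifact is a standard PodList, as produced by `kubectl get pods --all-namespaces -o json`. Both that assumption and the `failed_pods` helper are illustrative, so check them against your Jenkinsfile.
+
+```bash
+# Hypothetical helper: list pods in the Failed phase as "namespace/name".
+# Assumes stdin is a PodList (the output of `kubectl get pods -A -o json`).
+failed_pods() {
+    jq -r '.items[]
+        | select(.status.phase == "Failed")
+        | "\(.metadata.namespace)/\(.metadata.name)"'
+}
+
+# Example:
+#   failed_pods < all-pods.json
+```
+
+Each line of output gives the namespace and pod name to pass to `kubectl logs <pod> -n <namespace>`.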
+
+---
+
+## 🔗 Integration with Other Tools
+
+### Export to Grafana
+
+Use `report.json`:
+
+```bash
+# Download report.json from Jenkins artifact
+# Import to Grafana via JSON API datasource
+# Create time-series dashboard
+```
+
+### Send to Slack
+
+Add Slack webhook:
+
+```groovy
+post {
+    success {
+        sh """
+            curl -X POST ${SLACK_WEBHOOK_URL} \
+                -H 'Content-Type: application/json' \
+                -d '{
+                    "text": "Daily Cluster Report: ${env.MONTHLY_TOTAL_COST} USD/month",
+                    "attachments": [{
+                        "color": "good",
+                        "fields": [
+                            {"title": "Nodes", "value": "${env.NODE_COUNT}", "short": true},
+                            {"title": "Pods", "value": "${env.POD_COUNT}", "short": true}
+                        ]
+                    }]
+                }'
+        """
+    }
+}
+```
+
+### Store in Database
+
+Parse JSON and insert:
+
+```groovy
+stage('Store in Database') {
+    steps {
+        script {
+            def report = readJSON file: "${OUTPUT_DIR}/report.json"
+            
+            sh """
+                psql -h postgres -U metrics -d cluster_metrics -c "
+                    INSERT INTO daily_reports (date, cpu_usage, pod_count, cost)
+                    VALUES ('${report.generated_at}', ${report.resources.avg_cpu_usage_percent}, 
+                            ${report.cluster.total_pods}, ${report.costs.monthly_total_usd})
+                "
+            """
+        }
+    }
+}
+```
+
+---
+
+## ✅ Verification Checklist
+
+After setup, verify:
+
+- [ ] Jenkins job created
+- [ ] First build succeeds
+- [ ] HTML dashboard accessible
+- [ ] Metrics show real data (not zeros)
+- [ ] Telegram notification received
+- [ ] Costs calculated correctly
+- [ ] JSON report generated
+- [ ] Namespace table populated
+- [ ] Health checks working
+- [ ] Schedule triggers correctly
+
+---
+
+## 📚 Next Steps
+
+### Enhancements:
+1. **Historical Tracking** - Store reports in Git or database
+2. **Alerts** - Trigger alerts on threshold breaches
+3. **Comparison** - Compare week-over-week trends
+4. **Recommendations** - Auto-suggest optimizations
+5. **Deep Dive** - Per-namespace detailed reports
+
+### Related Pipelines:
+- Security Scanning (scan images from this report)
+- Cleanup Pipeline (remove resources shown as unused)
+- Backup Pipeline (backup based on importance shown here)
+
+---
+
+**You're all set! 🎉**
+
+Run your first build and enjoy your cluster health dashboard! 📊✨