Add apps/cluster-health-dashboard/readme.md

2026-01-07 08:26:35 +00:00
parent 332ef81a0a
commit 88e348d129
1 changed files with 634 additions and 0 deletions
--- a/apps/cluster-health-dashboard/readme.md
+++ b/apps/cluster-health-dashboard/readme.md
@@ -0,0 +1,634 @@
 # 📊 Cluster Health Dashboard - Setup Guide
 Complete setup guide for the Kubernetes Cluster Health Dashboard Jenkins pipeline.
 ---
 ## 🎯 What This Dashboard Does
 ### Collects:
 - ✅ Cluster information (version, nodes, namespaces, pods)
 - ✅ Resource metrics from Prometheus (CPU, Memory, Network)
 - ✅ Pod status across all namespaces
 - ✅ Node capacity and usage
 - ✅ Cost estimation (monthly)
 - ✅ Health checks and issues detection
 ### Generates:
 - 📊 Interactive HTML dashboard
 - 📄 JSON report with all metrics
 - 📱 Telegram summary notification
 - 📧 Optional email report
 ---
 ## 📋 Prerequisites
 ### Required:
 - ✅ Jenkins with Kubernetes plugin
 - ✅ kubectl configured with cluster access
 - ✅ Prometheus running in cluster (for metrics)
 - ✅ jq installed on Jenkins agent
 - ✅ curl installed on Jenkins agent
 ### Optional:
 - ⚙️ Telegram bot (for notifications)
 - ⚙️ Email configured in Jenkins
 - ⚙️ Grafana (referenced in dashboard)
 ---
 ## 🚀 Setup Steps
 ### Step 1: Install Required Tools on Jenkins Agent
 ```bash
 # SSH to your Jenkins agent or use Jenkins shell
 # Install jq (JSON processor)
 sudo apt-get update
 sudo apt-get install -y jq
 # Verify installations
 jq --version
 kubectl version --client
 curl --version
 ```
 ### Step 2: Configure Prometheus Access
 **Option A: If Prometheus is in your cluster (recommended)**
 Check if Prometheus is accessible:
 ```bash
 # From Jenkins agent or any pod in cluster
 kubectl get svc -n monitoring
 # Should see something like:
 # prometheus-server   ClusterIP   10.43.xxx.xxx   <none>   80/TCP
 ```
 **Option B: If Prometheus is external**
 Update Jenkinsfile environment variables:
 ```groovy
 environment {
    PROMETHEUS_URL = 'http://your-prometheus-url:9090'
 }
 ```
 **Test Prometheus access:**
 ```bash
 # From Jenkins agent
 curl "http://prometheus-server.monitoring.svc.cluster.local/api/v1/query?query=up"
 # Should return JSON with metrics
 ```
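To make the check above scriptable, the response can be tested for a successful status. This is a minimal sketch, assuming the standard Prometheus HTTP API envelope in which a successful query carries `"status":"success"`. The helper name `prom_query_ok` is hypothetical, and the sample responses are illustrative rather than captured output.

```bash
#!/usr/bin/env bash
# prom_query_ok: hypothetical helper. Reads a Prometheus API response on stdin
# and succeeds only if the envelope reports status "success".
prom_query_ok() {
  # Tolerate optional whitespace around the colon.
  grep -Eq '"status"[[:space:]]*:[[:space:]]*"success"'
}

# Illustrative sample responses (not taken from a real server).
sample_ok='{"status":"success","data":{"resultType":"vector","result":[]}}'
sample_err='{"status":"error","errorType":"bad_data","error":"parse error"}'

if printf '%s' "$sample_ok" | prom_query_ok; then
  echo "prometheus: query OK"
fi
if ! printf '%s' "$sample_err" | prom_query_ok; then
  echo "prometheus: query failed"
fi
```

Against a live server this would be used as `curl -s "$PROMETHEUS_URL/api/v1/query?query=up" | prom_query_ok`.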
 ### Step 3: Set Up Telegram Notifications (Optional)
 If you already have a bot from a previous setup, skip this step!
 **A. Create Bot (if not done)**
 1. Open Telegram → @BotFather
 2. `/newbot`
 3. Get token: `1234567890:ABC...`
 **B. Get Chat ID**
 1. Telegram → @userinfobot
 2. Get your ID: `904518516`
 **C. Add to Jenkins Credentials**
 Jenkins → Manage Jenkins → Manage Credentials → Add:
 **Credential 1:**
 ```
 Kind: Secret text
 Secret: <your-bot-token>
 ID: telegram-bot-token
 Description: Telegram Bot Token
 ```
 **Credential 2:**
 ```
 Kind: Secret text
 Secret: 904518516
 ID: telegram-chat-id
 Description: Telegram Chat ID
 ```
 ### Step 4: Adjust Cost Estimates
 Edit Jenkinsfile to match your actual cloud costs:
 ```groovy
 environment {
    // Adjust these to your actual pricing
    CPU_PRICE_PER_HOUR = '0.04'        // $0.04 per vCPU/hour
    MEMORY_PRICE_PER_GB_HOUR = '0.005' // $0.005 per GB/hour
 }
 ```
 **Common pricing reference:**
 - AWS t3.medium: ~$0.0416/hour (2 vCPU, 4GB RAM)
 - DigitalOcean: $0.06/hour per vCPU, $0.007/GB RAM
 - Local/Bare metal: $0 (or electricity cost)
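To see how hourly rates turn into a monthly figure, here is a rough worked example. It assumes the estimate is (cores × CPU rate + GB × memory rate) × 730 hours per month. The Jenkinsfile's actual formula is not shown in this guide, so the helper name `monthly_cost` and the 730-hour month are assumptions. The inputs reuse the 12 cores and 48 GB from the sample notification in this guide.

```bash
#!/usr/bin/env bash
# monthly_cost CORES MEM_GB CPU_PRICE MEM_PRICE
# Hypothetical helper: hourly cost multiplied by an assumed 730 hours/month.
monthly_cost() {
  awk -v cores="$1" -v mem="$2" -v cpu_price="$3" -v mem_price="$4" \
    'BEGIN { printf "%.2f\n", (cores * cpu_price + mem * mem_price) * 730 }'
}

# 12 vCPU and 48 GB at the rates from the environment block above:
# (12 * 0.04 + 48 * 0.005) * 730 = 0.72 * 730
monthly_cost 12 48 0.04 0.005   # prints 525.60
```

If the number on your dashboard differs a lot from a hand calculation like this, check the two rates first.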
 ### Step 5: Create Jenkins Pipeline
 **A. Create New Pipeline Job**
 1. Jenkins → New Item
 2. Name: `cluster-health-dashboard`
 3. Type: Pipeline
 4. OK
 **B. Configure Pipeline**
 1. **Description:**
   ```
   Daily cluster health monitoring and reporting. 
   Generates dashboard with metrics, costs, and health checks.
   ```
 2. **Build Triggers:**
   - ☑️ Build periodically
   - Schedule: `0 8 * * 1-5` (8 AM weekdays)
 3. **Pipeline:**
   - Definition: Pipeline script from SCM
   - SCM: Git
   - Repository URL: `http://gitea-http.gitea.svc.cluster.local:3000/admin/k3s-gitops`
   - Credentials: `gitea-credentials`
   - Branch: `*/main`
   - Script Path: `apps/cluster-health-dashboard/Jenkinsfile`
 **C. Or use Pipeline Script Directly**
 If you want to test first without Git:
 1. Definition: Pipeline script
 2. Copy entire Jenkinsfile content into the script box
 3. Save
 ### Step 6: Add to GitOps Repository
 ```bash
 # On your local machine
 cd ~/projects/k3s-gitops
 # Create directory
 mkdir -p apps/cluster-health-dashboard
 # Copy Jenkinsfile
 cp /path/to/Jenkinsfile apps/cluster-health-dashboard/
 # Commit
 git add apps/cluster-health-dashboard/
 git commit -m "feat: add cluster health dashboard pipeline"
 git push origin main
 ```
 ---
 ## 🧪 Testing
 ### Test 1: Manual Run (First Time)
 1. Jenkins → cluster-health-dashboard → Build with Parameters
 2. Set:
   - REPORT_PERIOD: `24h`
   - SEND_EMAIL: `false` (for first test)
   - SEND_TELEGRAM: `true`
 3. Click Build
 **Watch Console Output:**
 ```
 🚀 Starting Cluster Health Dashboard generation...
 📋 Collecting cluster information...
 Cluster version: v1.28.0
 Nodes: 3
 Namespaces: 14
 Pods: 67
 📈 Querying Prometheus for metrics...
 ✅ Dashboard generated
 ```
 ### Test 2: Check Generated Dashboard
 After build completes:
 1. Jenkins → cluster-health-dashboard → Build #1
 2. Click "Cluster Health Dashboard" (left sidebar)
 3. You should see the rendered HTML dashboard! 🎨
 ### Test 3: Check Telegram Notification
 You should receive:
 ```
 📊 Cluster Health Report
 ━━━━━━━━━━━━━━━━━━━━━━
 📋 Cluster Info
 Version: v1.28.0
 Nodes: 3
 Namespaces: 14
 Total Pods: 67
 ━━━━━━━━━━━━━━━━━━━━━━
 💻 Resources
 CPU Cores: 12
 Memory: 48 GB
 Avg CPU Usage: 23.5%
 ...
 ```
 ### Test 4: Check Artifacts
 1. Build #1 → Artifacts
 2. Should see:
   - `dashboard.html`
   - `report.json`
   - `namespace-stats.json`
   - `all-pods.json`
   - `node-resources.json`
 ---
 ## 🔧 Troubleshooting
 ### Issue 1: "Failed to query Prometheus"
 **Symptoms:**
 ```
 ⚠️ Failed to query Prometheus: Connection refused
 ```
 **Fix:**
 ```bash
 # Check if Prometheus is running
 kubectl get pods -n monitoring
 # Check service
 kubectl get svc -n monitoring
 # Test connection from Jenkins pod
 kubectl exec -it jenkins-0 -n jenkins -- \
  curl "http://prometheus-server.monitoring.svc.cluster.local/api/v1/query?query=up"
 ```
 **If Prometheus is in different namespace:**
 Update Jenkinsfile:
 ```groovy
 PROMETHEUS_URL = 'http://prometheus-server.YOUR_NAMESPACE.svc.cluster.local'
 ```
 ### Issue 2: "jq: command not found"
 **Fix:**
 ```bash
 # Install jq on Jenkins agent
 kubectl exec -it jenkins-0 -n jenkins -- apt-get update
 kubectl exec -it jenkins-0 -n jenkins -- apt-get install -y jq
 # Packages installed this way are lost when the pod restarts.
 # For a permanent fix, add to the Jenkins Dockerfile:
 # RUN apt-get update && apt-get install -y jq
 ```
 ### Issue 3: "kubectl: command not found"
 **Fix:**
 Jenkins needs kubectl. Check installation:
 ```bash
 kubectl exec -it jenkins-0 -n jenkins -- kubectl version --client
 # If not installed, add to Jenkins image or install:
 curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
 chmod +x kubectl
 mv kubectl /usr/local/bin/
 ```
 ### Issue 4: Dashboard shows "0" for all metrics
 **Possible causes:**
 1. Prometheus not accessible
 2. Wrong Prometheus URL
 3. No metrics in Prometheus
 **Debug:**
 ```bash
 # Test Prometheus query manually
 curl "http://prometheus-server.monitoring.svc.cluster.local/api/v1/query?query=up"
 # Check if metrics exist
 curl "http://prometheus-server.monitoring.svc.cluster.local/api/v1/query?query=container_cpu_usage_seconds_total"
 ```
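A query can succeed and still return no samples, which would also leave the dashboard at zero. The sketch below distinguishes the two cases. It assumes the standard Prometheus response shape, where an empty vector appears as `"result":[]` in compact JSON. `prom_has_samples` is a hypothetical name and the sample responses are illustrative.

```bash
#!/usr/bin/env bash
# prom_has_samples: hypothetical helper. Reads a Prometheus query response on
# stdin and succeeds only if the result array holds at least one entry.
# Assumes compact JSON (no whitespace after "result":).
prom_has_samples() {
  grep -q '"result":\[[^]]'
}

# Illustrative responses: one empty vector, one vector with a single sample.
empty='{"status":"success","data":{"resultType":"vector","result":[]}}'
full='{"status":"success","data":{"resultType":"vector","result":[{"metric":{"job":"prometheus"},"value":[1435781451.781,"1"]}]}}'

printf '%s' "$empty" | prom_has_samples || echo "no samples: metric missing or not scraped"
printf '%s' "$full" | prom_has_samples && echo "samples present"
```

An empty result for `container_cpu_usage_seconds_total` usually points at scraping rather than at the dashboard itself.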
 ### Issue 5: HTML Dashboard not showing
 **Check:**
 1. Jenkins → Manage Jenkins → Manage Plugins → Installed
 2. Look for: HTML Publisher Plugin
 3. If not installed: Manage Plugins → Available → search "HTML Publisher" → Install
 ### Issue 6: Telegram notifications not sending
 **Check credentials:**
 ```bash
 # Verify credentials exist (in the Jenkins UI):
 # Jenkins → Manage Jenkins → Manage Credentials
 # Should see:
 # - telegram-bot-token
 # - telegram-chat-id
 # Test manually (substitute your own bot token):
 BOT_TOKEN="<your-bot-token>"
 CHAT_ID="904518516"
 curl -X POST "https://api.telegram.org/bot${BOT_TOKEN}/sendMessage" \
    -d chat_id="${CHAT_ID}" \
    -d text="Test from terminal"
 ```
 ---
 ## 📊 Understanding the Dashboard
 ### Metrics Explained:
 **Cluster Information:**
 - **Kubernetes Version:** Your K8s version
 - **Nodes:** Number of worker nodes
 - **Namespaces:** Total namespaces
 - **Total Pods:** All pods across cluster
 **Resource Capacity:**
 - **Total CPU Cores:** Sum of all node CPUs
 - **Total Memory:** Sum of all node RAM
 - **Avg CPU Usage:** Average CPU across containers
 - **Progress Bar:** Visual CPU usage
 **Pod Status:**
 - **Running:** Healthy pods ✅
 - **Pending:** Pods waiting to start ⏳
 - **Failed:** Crashed pods ❌
 - **Total Restarts:** Container restarts (high = problem)
 **Monthly Costs:**
 - Based on CPU cores and Memory GB
 - Calculated using rates you configured
 - Estimates infrastructure cost
 **Health Checks:**
 - High restart count (>10)
 - Failed pods (>0)
 - Pending pods (>5)
 - High CPU usage (>80%)
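
The thresholds above can be expressed as a small shell check. This is only a sketch of the logic: the function name `health_issues` and its argument order are hypothetical, and the pipeline's own implementation may differ.

```bash
#!/usr/bin/env bash
# health_issues RESTARTS FAILED PENDING CPU_PERCENT
# Hypothetical helper: prints one line per breached threshold.
health_issues() {
  local restarts="$1" failed="$2" pending="$3" cpu="$4"
  [ "$restarts" -gt 10 ] && echo "High restart count: $restarts"
  [ "$failed" -gt 0 ] && echo "Failed pods: $failed"
  [ "$pending" -gt 5 ] && echo "Pending pods: $pending"
  # CPU usage may be fractional (e.g. 23.5), so compare with awk.
  awk -v cpu="$cpu" 'BEGIN { exit !(cpu > 80) }' && echo "High CPU usage: ${cpu}%"
  return 0
}

# 15 restarts, 5 failed pods, 0 pending, 23.5% CPU:
# reports the restart count and the failed pods only.
health_issues 15 5 0 23.5
```

All four comparisons are strict, so a value sitting exactly on a threshold (10 restarts, 80% CPU) is not reported.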
 **Resources by Namespace:**
 - Table showing pod/container count per namespace
 - Sorted by pod count (highest first)
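
For a quick cross-check of the namespace table, the pod counts can be reproduced from plain `kubectl` output. The sketch below assumes input in the column layout of `kubectl get pods -A --no-headers` (namespace in the first column). The function name `pods_per_namespace` and the sample rows are made up for illustration, and it counts pods only, not containers.

```bash
#!/usr/bin/env bash
# pods_per_namespace: hypothetical helper. Reads pod rows on stdin and prints
# "namespace count", highest count first, ties broken by namespace name.
pods_per_namespace() {
  awk '{ count[$1]++ } END { for (ns in count) print ns, count[ns] }' |
    sort -k2,2nr -k1,1
}

# Illustrative rows in the shape of `kubectl get pods -A --no-headers`.
sample='kube-system coredns-abc 1/1 Running 0 5d
monitoring prometheus-server-xyz 2/2 Running 0 5d
monitoring grafana-123 1/1 Running 1 5d
jenkins jenkins-0 1/1 Running 0 2d
monitoring alertmanager-0 1/1 Running 0 5d
kube-system metrics-server-def 1/1 Running 0 5d'

printf '%s\n' "$sample" | pods_per_namespace
# prints:
# monitoring 3
# kube-system 2
# jenkins 1
```

On a live cluster: `kubectl get pods -A --no-headers | pods_per_namespace`.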
 ---
 ## 🎨 Customization
 ### Change Schedule
 Edit cron trigger in Jenkinsfile:
 ```groovy
 triggers {
    cron('0 8 * * 1-5')  // Weekdays 8 AM
    // Examples:
    // cron('0 */6 * * *')     // Every 6 hours
    // cron('0 9 * * MON')     // Mondays 9 AM
    // cron('0 0 * * *')       // Daily midnight
 }
 ```
 ### Add More Metrics
 Add to Prometheus queries section:
 ```groovy
 // Disk I/O
 env.DISK_READ_MB = queryPrometheus(
    "sum(rate(container_fs_reads_bytes_total[5m])) / 1024 / 1024"
 )
 // HTTP Requests (if you have metrics)
 env.HTTP_REQUESTS_PER_SEC = queryPrometheus(
    "sum(rate(http_requests_total[5m]))"
 )
 ```
 Then add to HTML dashboard:
 ```html
 <div class="metric">
    <span class="metric-label">Disk Read</span>
    <span class="metric-value">${env.DISK_READ_MB} MB/s</span>
 </div>
 ```
 ### Change Colors/Styling
 Edit CSS in `generateDashboardHTML()`:
 ```css
 /* Change main gradient */
 background: linear-gradient(135deg, #YOUR_COLOR1 0%, #YOUR_COLOR2 100%);
 /* Change card colors */
 .card h2 {
    color: #YOUR_COLOR;
 }
 ```
 ### Add Email Recipients
 Add to Jenkinsfile:
 ```groovy
 post {
    success {
        emailext (
            to: 'devops-team@company.com',
            subject: "Cluster Health Report - ${new Date().format('yyyy-MM-dd')}",
            body: '''
                <h2>Daily Cluster Health Report</h2>
                <p>Please see attached dashboard.</p>
            ''',
            mimeType: 'text/html',
            attachmentsPattern: '**/dashboard.html'
        )
    }
 }
 ```
 ---
 ## 📈 Usage Examples
 ### Weekly Review
 ```
 Monday 8 AM → Dashboard generated
 Review:
 - Are costs increasing? Why?
 - Any failed pods? Investigate
 - CPU usage trending up? Scale?
 - Restarts increasing? Bug in app?
 ```
 ### Cost Tracking
 ```
 Week 1: $150/month
 Week 2: $180/month ⚠️  (+20%)
 → Check namespace-stats.json
 → Which namespace grew?
 → Review pod counts
 ```
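The percentage in the example above is a plain relative change. A small sketch of that arithmetic, with `pct_change` as a hypothetical helper name:

```bash
#!/usr/bin/env bash
# pct_change OLD NEW
# Hypothetical helper: relative change from OLD to NEW as a signed percentage.
pct_change() {
  awk -v old="$1" -v new="$2" 'BEGIN {
    if (old == 0) { print "n/a"; exit }   # avoid division by zero
    printf "%+.1f%%\n", (new - old) / old * 100
  }'
}

# Week 1 at $150/month, week 2 at $180/month: (180 - 150) / 150 * 100
pct_change 150 180   # prints +20.0%
```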
 ### Capacity Planning
 ```
 Current: 12 CPU cores, 23.5% usage
 If usage > 70% for 7 days:
 → Time to add nodes
 → Dashboard shows trend
 ```
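The "above 70% for 7 days" rule needs a series of daily values, for example one average CPU figure taken from each day's `report.json`. How those values are collected is left open here. The sketch below only shows the streak check, and `sustained_high` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# sustained_high THRESHOLD DAYS VALUE...
# Hypothetical helper: succeeds if VALUE exceeded THRESHOLD on at least DAYS
# consecutive entries. Values are given oldest first.
sustained_high() {
  local threshold="$1" days="$2"
  shift 2
  printf '%s\n' "$@" | awk -v t="$threshold" -v d="$days" '
    { streak = ($1 > t) ? streak + 1 : 0; if (streak >= d) found = 1 }
    END { exit !found }'
}

# Seven consecutive days above 70%: time to plan for more nodes.
if sustained_high 70 7 71 72 75 80 90 74 73; then
  echo "CPU above 70% for 7 days: consider adding nodes"
fi

# A single dip to 60% resets the streak, so this prints nothing.
if sustained_high 70 7 71 72 75 60 90 74 73; then
  echo "CPU above 70% for 7 days: consider adding nodes"
fi
```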
 ### Health Monitoring
 ```
 Dashboard shows:
 ❌ 5 pods in Failed state
 ⚠️ 15 container restarts
 → Click artifact → all-pods.json
 → Find which pods
 → kubectl logs <pod>
 → Fix issue
 ```
 ---
 ## 🔗 Integration with Other Tools
 ### Export to Grafana
 Use `report.json`:
 ```bash
 # Download report.json from Jenkins artifact
 # Import to Grafana via JSON API datasource
 # Create time-series dashboard
 ```
 ### Send to Slack
 Add Slack webhook:
 ```groovy
 post {
    success {
        sh """
             curl -X POST "${SLACK_WEBHOOK_URL}" \
                -H 'Content-Type: application/json' \
                -d '{
                    "text": "Daily Cluster Report: ${env.MONTHLY_TOTAL_COST} USD/month",
                    "attachments": [{
                        "color": "good",
                        "fields": [
                            {"title": "Nodes", "value": "${env.NODE_COUNT}", "short": true},
                            {"title": "Pods", "value": "${env.POD_COUNT}", "short": true}
                        ]
                    }]
                }'
        """
    }
 }
 ```
 ### Store in Database
 Parse JSON and insert:
 ```groovy
 stage('Store in Database') {
    steps {
        script {
             // readJSON is provided by the Pipeline Utility Steps plugin
             def report = readJSON file: "${OUTPUT_DIR}/report.json"
            sh """
                psql -h postgres -U metrics -d cluster_metrics -c "
                    INSERT INTO daily_reports (date, cpu_usage, pod_count, cost)
                    VALUES ('${report.generated_at}', ${report.resources.avg_cpu_usage_percent}, 
                            ${report.cluster.total_pods}, ${report.costs.monthly_total_usd})
                "
            """
        }
    }
 }
 ```
 ---
 ## ✅ Verification Checklist
 After setup, verify:
 - [ ] Jenkins job created
 - [ ] First build succeeds
 - [ ] HTML dashboard accessible
 - [ ] Metrics show real data (not zeros)
 - [ ] Telegram notification received
 - [ ] Costs calculated correctly
 - [ ] JSON report generated
 - [ ] Namespace table populated
 - [ ] Health checks working
 - [ ] Schedule triggers correctly
 ---
 ## 📚 Next Steps
 ### Enhancements:
 1. **Historical Tracking** - Store reports in Git or database
 2. **Alerts** - Trigger alerts on threshold breaches
 3. **Comparison** - Compare week-over-week trends
 4. **Recommendations** - Auto-suggest optimizations
 5. **Deep Dive** - Per-namespace detailed reports
 ### Related Pipelines:
 - Security Scanning (scan images from this report)
 - Cleanup Pipeline (remove resources shown as unused)
 - Backup Pipeline (backup based on importance shown here)
 ---
 **You're all set! 🎉**
 Run your first build and enjoy your cluster health dashboard! 📊✨