📊 Cluster Health Dashboard - Setup Guide

Complete setup guide for the Kubernetes Cluster Health Dashboard Jenkins pipeline.


🎯 What This Dashboard Does

Collects:

  • Cluster information (version, nodes, namespaces, pods)
  • Resource metrics from Prometheus (CPU, Memory, Network)
  • Pod status across all namespaces
  • Node capacity and usage
  • Cost estimation (monthly)
  • Health checks and issue detection

Generates:

  • 📊 Interactive HTML dashboard
  • 📄 JSON report with all metrics
  • 📱 Telegram summary notification
  • 📧 Optional email report

📋 Prerequisites

Required:

  • Jenkins with Kubernetes plugin
  • kubectl configured with cluster access
  • Prometheus running in cluster (for metrics)
  • jq installed on Jenkins agent
  • curl installed on Jenkins agent

Optional:

  • ⚙️ Telegram bot (for notifications)
  • ⚙️ Email configured in Jenkins
  • ⚙️ Grafana (referenced in dashboard)

🚀 Setup Steps

Step 1: Install Required Tools on Jenkins Agent

# SSH to your Jenkins agent or use Jenkins shell

# Install jq (JSON processor)
sudo apt-get update
sudo apt-get install -y jq

# Verify installations
jq --version
kubectl version --client
curl --version
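
The three checks above can be folded into one loop that lists every missing tool at once. This is a minimal sketch; the function name `missing_tools` is illustrative and not part of the pipeline.

```shell
#!/usr/bin/env bash
# Print the name of every tool from the argument list that is not on PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Check the tools this guide lists as required.
missing="$(missing_tools jq kubectl curl)"
if [ -n "$missing" ]; then
    echo "Missing tools:"
    echo "$missing"
else
    echo "All required tools are installed"
fi
```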

Step 2: Configure Prometheus Access

Option A: If Prometheus is in your cluster (recommended)

Check if Prometheus is accessible:

# From Jenkins agent or any pod in cluster
kubectl get svc -n monitoring

# Should see something like:
# prometheus-server   ClusterIP   10.43.xxx.xxx   <none>   80/TCP

Option B: If Prometheus is external

Update Jenkinsfile environment variables:

environment {
    PROMETHEUS_URL = 'http://your-prometheus-url:9090'
}

Test Prometheus access:

# From Jenkins agent
curl "http://prometheus-server.monitoring.svc.cluster.local/api/v1/query?query=up"

# Should return JSON with metrics
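
Getting JSON back is necessary but not sufficient: a Prometheus instant query reports "status": "success" and puts the matching series in data.result, and an empty result list means the query ran but matched nothing. The sketch below checks both with jq. The function name `prom_result_count` is hypothetical and the sample response is hand-written, not captured from a real server.

```shell
#!/usr/bin/env bash
# Print the number of series in a Prometheus instant-query response read
# from stdin. Exits non-zero if the response does not report "success".
prom_result_count() {
    jq -e -r 'if .status == "success" then (.data.result | length) else empty end'
}

# Hand-written sample response with a single series.
sample='{"status":"success","data":{"resultType":"vector","result":[{"metric":{"__name__":"up","job":"kubernetes-nodes"},"value":[1700000000,"1"]}]}}'
printf '%s' "$sample" | prom_result_count

# Against a live server (URL as used elsewhere in this guide):
# curl -s "http://prometheus-server.monitoring.svc.cluster.local/api/v1/query?query=up" | prom_result_count
```

A count of 0 with a successful status usually points at missing scrape targets rather than a connectivity problem.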

Step 3: Set Up Telegram Notifications (Optional)

If you already have a bot from a previous setup, skip this!

A. Create Bot (if not done)

  1. Open Telegram → @BotFather
  2. /newbot
  3. Get token: 1234567890:ABC...

B. Get Chat ID

  1. Telegram → @userinfobot
  2. Get your ID (a number, e.g. 123456789)

C. Add to Jenkins Credentials

Jenkins → Manage Jenkins → Manage Credentials → Add:

Credential 1:

Kind: Secret text
Secret: <your-bot-token>
ID: telegram-bot-token
Description: Telegram Bot Token

Credential 2:

Kind: Secret text
Secret: <your-chat-id>
ID: telegram-chat-id
Description: Telegram Chat ID

Step 4: Adjust Cost Estimates

Edit Jenkinsfile to match your actual cloud costs:

environment {
    // Adjust these to your actual pricing
    CPU_PRICE_PER_HOUR = '0.04'        // $0.04 per vCPU/hour
    MEMORY_PRICE_PER_GB_HOUR = '0.005' // $0.005 per GB/hour
}

Common pricing reference:

  • AWS t3.medium: ~$0.0416/hour (2 vCPU, 4GB RAM)
  • DigitalOcean: $0.06/hour per vCPU, $0.007/GB RAM
  • Local/Bare metal: $0 (or electricity cost)
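
To see how the two rates turn into a monthly figure, here is a worked sketch. It assumes the estimate is (cores × CPU rate + GB × memory rate) × hours per month, with 730 hours per month; the Jenkinsfile's exact formula may differ, and `monthly_cost` is an illustrative name.

```shell
#!/usr/bin/env bash
# Estimate monthly cost from capacity and hourly rates.
# Usage: monthly_cost CPU_CORES MEMORY_GB CPU_PRICE_PER_HOUR MEMORY_PRICE_PER_GB_HOUR
monthly_cost() {
    awk -v cores="$1" -v mem_gb="$2" -v cpu_rate="$3" -v mem_rate="$4" 'BEGIN {
        hours = 730   # assumed average hours per month (365 * 24 / 12)
        printf "%.2f\n", (cores * cpu_rate + mem_gb * mem_rate) * hours
    }'
}

# 12 cores and 48 GB at the default rates shown above.
monthly_cost 12 48 0.04 0.005   # prints 525.60
```

With these inputs the CPU share is 12 × 0.04 × 730 = 350.40 and the memory share is 48 × 0.005 × 730 = 175.20.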

Step 5: Create Jenkins Pipeline

A. Create New Pipeline Job

  1. Jenkins → New Item
  2. Name: cluster-health-dashboard
  3. Type: Pipeline
  4. OK

B. Configure Pipeline

  1. Description:

    Daily cluster health monitoring and reporting. 
    Generates dashboard with metrics, costs, and health checks.
    
  2. Build Triggers:

    • ☑️ Build periodically
    • Schedule: 0 8 * * 1-5 (8 AM weekdays)
  3. Pipeline:

    • Definition: Pipeline script from SCM
    • SCM: Git
    • Repository URL: http://gitea-http.gitea.svc.cluster.local:3000/admin/k3s-gitops
    • Credentials: gitea-credentials
    • Branch: */main
    • Script Path: apps/cluster-health-dashboard/Jenkinsfile

C. Or use Pipeline Script Directly

If you want to test first without Git:

  1. Definition: Pipeline script
  2. Copy entire Jenkinsfile content into the script box
  3. Save

Step 6: Add to GitOps Repository

# On your local machine
cd ~/projects/k3s-gitops

# Create directory
mkdir -p apps/cluster-health-dashboard

# Copy Jenkinsfile
cp /path/to/Jenkinsfile apps/cluster-health-dashboard/

# Commit
git add apps/cluster-health-dashboard/
git commit -m "feat: add cluster health dashboard pipeline"
git push origin main

🧪 Testing

Test 1: Manual Run (First Time)

  1. Jenkins → cluster-health-dashboard → Build with Parameters
  2. Set:
    • REPORT_PERIOD: 24h
    • SEND_EMAIL: false (for first test)
    • SEND_TELEGRAM: true
  3. Click Build

Watch Console Output:

🚀 Starting Cluster Health Dashboard generation...
📋 Collecting cluster information...
Cluster version: v1.28.0
Nodes: 3
Namespaces: 14
Pods: 67
📈 Querying Prometheus for metrics...
✅ Dashboard generated

Test 2: Check Generated Dashboard

After build completes:

  1. Jenkins → cluster-health-dashboard → Build #1
  2. Click "Cluster Health Dashboard" (left sidebar)
  3. Should see beautiful HTML dashboard! 🎨

Test 3: Check Telegram Notification

You should receive:

📊 Cluster Health Report

━━━━━━━━━━━━━━━━━━━━━━
📋 Cluster Info
Version: v1.28.0
Nodes: 3
Namespaces: 14
Total Pods: 67

━━━━━━━━━━━━━━━━━━━━━━
💻 Resources
CPU Cores: 12
Memory: 48 GB
Avg CPU Usage: 23.5%
...

Test 4: Check Artifacts

  1. Build #1 → Artifacts
  2. Should see:
    • dashboard.html
    • report.json
    • namespace-stats.json
    • all-pods.json
    • node-resources.json
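
If you have copied the artifacts to a local directory, the same check can be scripted. A minimal sketch; `check_artifacts` is a hypothetical helper and the directory argument is whatever location you downloaded the files to.

```shell
#!/usr/bin/env bash
# Report which of the expected report files are missing from a directory.
# Returns non-zero if at least one file is absent.
check_artifacts() {
    local dir="$1" file status=0
    for file in dashboard.html report.json namespace-stats.json all-pods.json node-resources.json; do
        if [ ! -f "$dir/$file" ]; then
            echo "missing: $file"
            status=1
        fi
    done
    return "$status"
}

# Example: check the current directory.
check_artifacts . || echo "Some artifacts were not generated"
```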

🔧 Troubleshooting

Issue 1: "Failed to query Prometheus"

Symptoms:

⚠️ Failed to query Prometheus: Connection refused

Fix:

# Check if Prometheus is running
kubectl get pods -n monitoring

# Check service
kubectl get svc -n monitoring

# Test connection from Jenkins pod
kubectl exec -it jenkins-0 -n jenkins -- \
  curl "http://prometheus-server.monitoring.svc.cluster.local/api/v1/query?query=up"

If Prometheus is in a different namespace:

Update Jenkinsfile:

PROMETHEUS_URL = 'http://prometheus-server.YOUR_NAMESPACE.svc.cluster.local'
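
The URL follows the usual in-cluster Service DNS pattern, service.namespace.svc.cluster.local. The sketch below builds it from its parts so you can test a candidate value with curl before editing the Jenkinsfile. It assumes the default cluster domain `cluster.local`; `service_url` is an illustrative name.

```shell
#!/usr/bin/env bash
# Build the in-cluster URL for a Service.
# Usage: service_url SERVICE NAMESPACE [PORT]
# Without a port the URL targets the default HTTP port 80.
service_url() {
    local service="$1" namespace="$2" port="${3:-}"
    if [ -n "$port" ]; then
        printf 'http://%s.%s.svc.cluster.local:%s\n' "$service" "$namespace" "$port"
    else
        printf 'http://%s.%s.svc.cluster.local\n' "$service" "$namespace"
    fi
}

service_url prometheus-server monitoring
# prints http://prometheus-server.monitoring.svc.cluster.local
```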

Issue 2: "jq: command not found"

Fix:

# Install jq on Jenkins agent (needs root in the container; lost when the pod restarts)
kubectl exec jenkins-0 -n jenkins -- apt-get update
kubectl exec jenkins-0 -n jenkins -- apt-get install -y jq

# For a permanent fix, add to Jenkins Dockerfile:
# RUN apt-get update && apt-get install -y jq

Issue 3: "kubectl: command not found"

Fix:

Jenkins needs kubectl. Check installation:

kubectl exec -it jenkins-0 -n jenkins -- kubectl version --client

# If not installed, add to Jenkins image or install:
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl
mv kubectl /usr/local/bin/

Issue 4: Dashboard shows "0" for all metrics

Possible causes:

  1. Prometheus not accessible
  2. Wrong Prometheus URL
  3. No metrics in Prometheus

Debug:

# Test Prometheus query manually
curl "http://prometheus-server.monitoring.svc.cluster.local/api/v1/query?query=up"

# Check if metrics exist
curl "http://prometheus-server.monitoring.svc.cluster.local/api/v1/query?query=container_cpu_usage_seconds_total"

Issue 5: HTML Dashboard not showing

Check:

# Verify HTML Plugin is installed
Jenkins → Manage Jenkins → Manage Plugins → Installed

# Look for: HTML Publisher Plugin

# If not installed:
# Manage Plugins → Available → Search "HTML Publisher" → Install

Issue 6: Telegram notifications not sending

Check credentials:

# Verify credentials exist
Jenkins → Manage Jenkins → Manage Credentials

# Should see:
# - telegram-bot-token
# - telegram-chat-id

# Test manually:
BOT_TOKEN="<your-bot-token>"
CHAT_ID="<your-chat-id>"

curl -X POST "https://api.telegram.org/bot${BOT_TOKEN}/sendMessage" \
    -d chat_id="${CHAT_ID}" \
    -d text="Test from terminal"

📊 Understanding the Dashboard

Metrics Explained:

Cluster Information:

  • Kubernetes Version: Your K8s version
  • Nodes: Number of worker nodes
  • Namespaces: Total namespaces
  • Total Pods: All pods across cluster

Resource Capacity:

  • Total CPU Cores: Sum of all node CPUs
  • Total Memory: Sum of all node RAM
  • Avg CPU Usage: Average CPU across containers
  • Progress Bar: Visual CPU usage

Pod Status:

  • Running: Healthy pods
  • Pending: Pods waiting to start
  • Failed: Crashed pods
  • Total Restarts: Container restarts (high = problem)
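
To reproduce these counts by hand, you can tally the STATUS column of `kubectl get pods -A --no-headers`. The sketch below uses a hand-written sample in place of live output; `count_by_status` is an illustrative name. Note that the STATUS column is more granular than the pod phase (it can show values such as Error or CrashLoopBackOff), so its labels may not match the dashboard's categories one to one.

```shell
#!/usr/bin/env bash
# Count pods per STATUS from `kubectl get pods -A --no-headers` output on stdin.
# With -A the columns are NAMESPACE NAME READY STATUS RESTARTS AGE,
# so STATUS is field 4.
count_by_status() {
    awk '{ count[$4]++ } END { for (s in count) print s, count[s] }' | sort
}

# Hand-written sample standing in for real kubectl output.
sample_pods='monitoring   prometheus-server-0   1/1   Running   0    5d
monitoring   grafana-7d9f          1/1   Running   2    5d
jenkins      jenkins-0             1/1   Running   1    9d
default      batch-job-x2k         0/1   Pending   0    3m
default      web-5c6b              0/1   Error     12   1h'

printf '%s\n' "$sample_pods" | count_by_status

# Against a live cluster:
# kubectl get pods -A --no-headers | count_by_status
```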

Monthly Costs:

  • Based on CPU cores and Memory GB
  • Calculated using rates you configured
  • Estimates infrastructure cost

Health Checks:

  • High restart count (>10)
  • Failed pods (>0)
  • Pending pods (>5)
  • High CPU usage (>80%)
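
The four thresholds above amount to a handful of comparisons. A minimal sketch of that logic, assuming the inputs are already collected; `health_issues` is a hypothetical name and the message wording is not taken from the pipeline.

```shell
#!/usr/bin/env bash
# Print one line per breached threshold; print nothing if all checks pass.
# Usage: health_issues TOTAL_RESTARTS FAILED_PODS PENDING_PODS AVG_CPU_PERCENT
health_issues() {
    awk -v restarts="$1" -v failed="$2" -v pending="$3" -v cpu="$4" 'BEGIN {
        if (restarts > 10) print "High restart count: " restarts
        if (failed > 0)    print "Failed pods: " failed
        if (pending > 5)   print "Pending pods: " pending
        if (cpu > 80)      print "High CPU usage: " cpu "%"
    }'
}

# 15 restarts, 5 failed pods, no pending pods, 23.5% CPU
health_issues 15 5 0 23.5
```

All comparisons are strict, so a value exactly at a threshold (for example 80% CPU) does not raise an issue.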

Resources by Namespace:

  • Table showing pod/container count per namespace
  • Sorted by pod count (highest first)
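
A comparable table can be produced from `kubectl get pods -A --no-headers`, which is handy for cross-checking the dashboard. This sketch counts pods only (not containers) and uses a hand-written sample; `pods_per_namespace` is an illustrative name.

```shell
#!/usr/bin/env bash
# Print "PODS NAMESPACE" rows, highest pod count first, ties broken by name.
# Input: `kubectl get pods -A --no-headers` output, where field 1 is the namespace.
pods_per_namespace() {
    awk '{ pods[$1]++ } END { for (ns in pods) print pods[ns], ns }' \
        | sort -k1,1nr -k2,2
}

# Hand-written sample standing in for real kubectl output.
sample='monitoring   prometheus-server-0   1/1   Running   0   5d
monitoring   grafana-7d9f          1/1   Running   2   5d
jenkins      jenkins-0             1/1   Running   1   9d
default      web-5c6b              1/1   Running   0   1h
default      api-8f2d              1/1   Running   0   1h'

printf '%s\n' "$sample" | pods_per_namespace
```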

🎨 Customization

Change Schedule

Edit cron trigger in Jenkinsfile:

triggers {
    cron('0 8 * * 1-5')  // Weekdays 8 AM
    
    // Examples:
    // cron('0 */6 * * *')     // Every 6 hours
    // cron('0 9 * * 1')       // Mondays 9 AM
    // cron('0 0 * * *')       // Daily midnight
}

Add More Metrics

Add to Prometheus queries section:

// Disk I/O
env.DISK_READ_MB = queryPrometheus(
    "sum(rate(container_fs_reads_bytes_total[5m])) / 1024 / 1024"
)

// HTTP Requests (if you have metrics)
env.HTTP_REQUESTS_PER_SEC = queryPrometheus(
    "sum(rate(http_requests_total[5m]))"
)

Then add to HTML dashboard:

<div class="metric">
    <span class="metric-label">Disk Read</span>
    <span class="metric-value">${env.DISK_READ_MB} MB/s</span>
</div>

Change Colors/Styling

Edit CSS in generateDashboardHTML():

/* Change main gradient */
background: linear-gradient(135deg, #YOUR_COLOR1 0%, #YOUR_COLOR2 100%);

/* Change card colors */
.card h2 {
    color: #YOUR_COLOR;
}

Add Email Recipients

Add to Jenkinsfile:

post {
    success {
        emailext (
            to: 'devops-team@company.com',
            subject: "Cluster Health Report - ${new Date().format('yyyy-MM-dd')}",
            body: '''
                <h2>Daily Cluster Health Report</h2>
                <p>Please see attached dashboard.</p>
            ''',
            mimeType: 'text/html',
            attachmentsPattern: '**/dashboard.html'
        )
    }
}

📈 Usage Examples

Weekly Review

Monday 8 AM → Dashboard generated
Review:
- Are costs increasing? Why?
- Any failed pods? Investigate
- CPU usage trending up? Scale?
- Restarts increasing? Bug in app?

Cost Tracking

Week 1: $150/month
Week 2: $180/month ⚠️  (+20%)
→ Check namespace-stats.json
→ Which namespace grew?
→ Review pod counts
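
The week-over-week percentage is (current - previous) / previous × 100. A small sketch of that arithmetic; `pct_change` is an illustrative helper, not something the pipeline provides.

```shell
#!/usr/bin/env bash
# Percentage change between two cost figures.
# Usage: pct_change PREVIOUS CURRENT
pct_change() {
    awk -v prev="$1" -v cur="$2" 'BEGIN {
        if (prev == 0) { print "n/a"; exit }   # avoid dividing by zero
        printf "%+.1f%%\n", (cur - prev) / prev * 100
    }'
}

pct_change 150 180   # prints +20.0%
```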

Capacity Planning

Current: 12 CPU cores, 23.5% usage
If usage > 70% for 7 days:
→ Time to add nodes
→ Dashboard shows trend
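
The "above 70% for 7 days" rule can be checked mechanically once you have one average CPU reading per day. The sketch below assumes you collect those readings yourself, for example from stored copies of report.json; `sustained_above` is a hypothetical name and the readings are made up.

```shell
#!/usr/bin/env bash
# Succeed only if at least MIN_DAYS readings are supplied on stdin
# (one daily average CPU % per line) and every one is above THRESHOLD.
# Usage: sustained_above THRESHOLD MIN_DAYS < readings
sustained_above() {
    awk -v limit="$1" -v min_days="$2" '
        $1 <= limit { below = 1 }
        END { exit !(NR >= min_days && !below) }
    '
}

# Seven made-up daily readings, all above 70%.
if printf '%s\n' 72 75 71 80 78 74 73 | sustained_above 70 7; then
    echo "CPU above 70% for 7 days: consider adding nodes"
fi
```

Requiring every reading to exceed the threshold avoids reacting to a single busy day.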

Health Monitoring

Dashboard shows:
❌ 5 pods in Failed state
⚠️ 15 container restarts

→ Click artifact → all-pods.json
→ Find which pods
→ kubectl logs <pod>
→ Fix issue

🔗 Integration with Other Tools

Export to Grafana

Use report.json:

# Download report.json from Jenkins artifact
# Import to Grafana via JSON API datasource
# Create time-series dashboard

Send to Slack

Add Slack webhook:

post {
    success {
        sh """
            curl -X POST "${SLACK_WEBHOOK_URL}" \
                -H 'Content-Type: application/json' \
                -d '{
                    "text": "Daily Cluster Report: ${env.MONTHLY_TOTAL_COST} USD/month",
                    "attachments": [{
                        "color": "good",
                        "fields": [
                            {"title": "Nodes", "value": "${env.NODE_COUNT}", "short": true},
                            {"title": "Pods", "value": "${env.POD_COUNT}", "short": true}
                        ]
                    }]
                }'
        """
    }
}

Store in Database

Parse JSON and insert:

stage('Store in Database') {
    steps {
        script {
            def report = readJSON file: "${OUTPUT_DIR}/report.json"
            
            sh """
                psql -h postgres -U metrics -d cluster_metrics -c "
                    INSERT INTO daily_reports (date, cpu_usage, pod_count, cost)
                    VALUES ('${report.generated_at}', ${report.resources.avg_cpu_usage_percent}, 
                            ${report.cluster.total_pods}, ${report.costs.monthly_total_usd})
                "
            """
        }
    }
}

Verification Checklist

After setup, verify:

  • Jenkins job created
  • First build succeeds
  • HTML dashboard accessible
  • Metrics show real data (not zeros)
  • Telegram notification received
  • Costs calculated correctly
  • JSON report generated
  • Namespace table populated
  • Health checks working
  • Schedule triggers correctly

📚 Next Steps

Enhancements:

  1. Historical Tracking - Store reports in Git or database
  2. Alerts - Trigger alerts on threshold breaches
  3. Comparison - Compare week-over-week trends
  4. Recommendations - Auto-suggest optimizations
  5. Deep Dive - Per-namespace detailed reports

Related pipelines:

  • Security Scanning (scan images from this report)
  • Cleanup Pipeline (remove resources shown as unused)
  • Backup Pipeline (backup based on importance shown here)

You're all set! 🎉

Run your first build and enjoy your cluster health dashboard! 📊