diff --git a/apps/cluster-health-dashboard/readme.md b/apps/cluster-health-dashboard/readme.md
new file mode 100644
index 0000000..4d5eddc
--- /dev/null
+++ b/apps/cluster-health-dashboard/readme.md
@@ -0,0 +1,634 @@
+# 📊 Cluster Health Dashboard - Setup Guide
+
+Complete setup guide for the Kubernetes Cluster Health Dashboard Jenkins pipeline.
+
+---
+
+## 🎯 What This Dashboard Does
+
+### Collects:
+- ✅ Cluster information (version, nodes, namespaces, pods)
+- ✅ Resource metrics from Prometheus (CPU, Memory, Network)
+- ✅ Pod status across all namespaces
+- ✅ Node capacity and usage
+- ✅ Cost estimation (monthly)
+- ✅ Health checks and issue detection
+
+### Generates:
+- 📊 Interactive HTML dashboard
+- 📄 JSON report with all metrics
+- 📱 Telegram summary notification
+- 📧 Optional email report
+
+---
+
+## 📋 Prerequisites
+
+### Required:
+- ✅ Jenkins with Kubernetes plugin
+- ✅ kubectl configured with cluster access
+- ✅ Prometheus running in cluster (for metrics)
+- ✅ jq installed on Jenkins agent
+- ✅ curl installed on Jenkins agent
+
+### Optional:
+- ⚙️ Telegram bot (for notifications)
+- ⚙️ Email configured in Jenkins
+- ⚙️ Grafana (referenced in dashboard)
+
+---
+
+## 🚀 Setup Steps
+
+### Step 1: Install Required Tools on Jenkins Agent
+
+```bash
+# SSH to your Jenkins agent or use Jenkins shell
+
+# Install jq (JSON processor)
+sudo apt-get update
+sudo apt-get install -y jq
+
+# Verify installations
+jq --version
+kubectl version --client
+curl --version
+```
+
+### Step 2: Configure Prometheus Access
+
+**Option A: If Prometheus is in your cluster (recommended)**
+
+Check if Prometheus is accessible:
+
+```bash
+# From Jenkins agent or any pod in cluster
+kubectl get svc -n monitoring
+
+# Should see something like:
+# prometheus-server   ClusterIP   10.43.xxx.xxx   <none>   80/TCP
+```
+
+**Option B: If Prometheus is external**
+
+Update Jenkinsfile environment variables:
+
+```groovy
+environment {
+    PROMETHEUS_URL = 'http://your-prometheus-url:9090'
+}
+```
+
+**Test Prometheus access:**
+
+```bash
+# From Jenkins agent
+curl "http://prometheus-server.monitoring.svc.cluster.local/api/v1/query?query=up"
+
+# Should return JSON with metrics
+```
+
+### Step 3: Set Up Telegram Notifications (Optional)
+
+If you already have a bot from a previous setup, skip this step.
+
+**A. Create Bot (if not done)**
+1. Open Telegram → @BotFather
+2. `/newbot`
+3. Get token: `1234567890:ABC...`
+
+**B. Get Chat ID**
+1. Telegram → @userinfobot
+2. Get your ID: `123456789`
+
+**C. Add to Jenkins Credentials**
+
+Jenkins → Manage Jenkins → Manage Credentials → Add:
+
+**Credential 1:**
+```
+Kind: Secret text
+Secret: <your-bot-token>
+ID: telegram-bot-token
+Description: Telegram Bot Token
+```
+
+**Credential 2:**
+```
+Kind: Secret text
+Secret: <your-chat-id>
+ID: telegram-chat-id
+Description: Telegram Chat ID
+```
+
+### Step 4: Adjust Cost Estimates
+
+Edit Jenkinsfile to match your actual cloud costs:
+
+```groovy
+environment {
+    // Adjust these to your actual pricing
+    CPU_PRICE_PER_HOUR = '0.04'          // $0.04 per vCPU/hour
+    MEMORY_PRICE_PER_GB_HOUR = '0.005'   // $0.005 per GB/hour
+}
+```
+
+**Common pricing reference:**
+- AWS t3.medium: ~$0.0416/hour (2 vCPU, 4GB RAM)
+- DigitalOcean: $0.06/hour per vCPU, $0.007/GB RAM
+- Local/Bare metal: $0 (or electricity cost)
+
+### Step 5: Create Jenkins Pipeline
+
+**A. Create New Pipeline Job**
+
+1. Jenkins → New Item
+2. Name: `cluster-health-dashboard`
+3. Type: Pipeline
+4. OK
+
+**B. Configure Pipeline**
+
+1. **Description:**
+   ```
+   Daily cluster health monitoring and reporting.
+   Generates dashboard with metrics, costs, and health checks.
+   ```
+
+2. **Build Triggers:**
+   - ☑️ Build periodically
+   - Schedule: `0 8 * * 1-5` (8 AM weekdays)
+
+3. **Pipeline:**
+   - Definition: Pipeline script from SCM
+   - SCM: Git
+   - Repository URL: `http://gitea-http.gitea.svc.cluster.local:3000/admin/k3s-gitops`
+   - Credentials: `gitea-credentials`
+   - Branch: `*/main`
+   - Script Path: `apps/cluster-health-dashboard/Jenkinsfile`
+
+**C. Or use Pipeline Script Directly**
+
+If you want to test first without Git:
+1. Definition: Pipeline script
+2. Copy entire Jenkinsfile content into the script box
+3. Save
+
+### Step 6: Add to GitOps Repository
+
+```bash
+# On your local machine
+cd ~/projects/k3s-gitops
+
+# Create directory
+mkdir -p apps/cluster-health-dashboard
+
+# Copy Jenkinsfile
+cp /path/to/Jenkinsfile apps/cluster-health-dashboard/
+
+# Commit
+git add apps/cluster-health-dashboard/
+git commit -m "feat: add cluster health dashboard pipeline"
+git push origin main
+```
+
+---
+
+## 🧪 Testing
+
+### Test 1: Manual Run (First Time)
+
+1. Jenkins → cluster-health-dashboard → Build with Parameters
+2. Set:
+   - REPORT_PERIOD: `24h`
+   - SEND_EMAIL: `false` (for first test)
+   - SEND_TELEGRAM: `true`
+3. Build Now
+
+**Watch Console Output:**
+```
+🚀 Starting Cluster Health Dashboard generation...
+📋 Collecting cluster information...
+Cluster version: v1.28.0
+Nodes: 3
+Namespaces: 14
+Pods: 67
+📈 Querying Prometheus for metrics...
+✅ Dashboard generated
+```
+
+### Test 2: Check Generated Dashboard
+
+After build completes:
+
+1. Jenkins → cluster-health-dashboard → Build #1
+2. Click "Cluster Health Dashboard" (left sidebar)
+3. You should see the rendered HTML dashboard 🎨
+
+### Test 3: Check Telegram Notification
+
+You should receive:
+```
+📊 Cluster Health Report
+
+━━━━━━━━━━━━━━━━━━━━━━
+📋 Cluster Info
+Version: v1.28.0
+Nodes: 3
+Namespaces: 14
+Total Pods: 67
+
+━━━━━━━━━━━━━━━━━━━━━━
+💻 Resources
+CPU Cores: 12
+Memory: 48 GB
+Avg CPU Usage: 23.5%
+...
+```
+
+### Test 4: Check Artifacts
+
+1. Build #1 → Artifacts
+2. Should see:
+   - `dashboard.html`
+   - `report.json`
+   - `namespace-stats.json`
+   - `all-pods.json`
+   - `node-resources.json`
+
+---
+
+## 🔧 Troubleshooting
+
+### Issue 1: "Failed to query Prometheus"
+
+**Symptoms:**
+```
+⚠️ Failed to query Prometheus: Connection refused
+```
+
+**Fix:**
+
+```bash
+# Check if Prometheus is running
+kubectl get pods -n monitoring
+
+# Check service
+kubectl get svc -n monitoring
+
+# Test connection from Jenkins pod
+kubectl exec -it jenkins-0 -n jenkins -- \
+    curl "http://prometheus-server.monitoring.svc.cluster.local/api/v1/query?query=up"
+```
+
+**If Prometheus is in a different namespace:**
+
+Update Jenkinsfile:
+```groovy
+PROMETHEUS_URL = 'http://prometheus-server.YOUR_NAMESPACE.svc.cluster.local'
+```
+
+### Issue 2: "jq: command not found"
+
+**Fix:**
+
+```bash
+# Install jq on Jenkins agent
+kubectl exec -it jenkins-0 -n jenkins -- apt-get update
+kubectl exec -it jenkins-0 -n jenkins -- apt-get install -y jq
+
+# Or add to Jenkins Dockerfile:
+# RUN apt-get update && apt-get install -y jq
+```
+
+### Issue 3: "kubectl: command not found"
+
+**Fix:**
+
+Jenkins needs kubectl. Check installation:
+
+```bash
+kubectl exec -it jenkins-0 -n jenkins -- kubectl version --client
+
+# If not installed, add to Jenkins image or install:
+curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
+chmod +x kubectl
+mv kubectl /usr/local/bin/
+```
+
+### Issue 4: Dashboard shows "0" for all metrics
+
+**Possible causes:**
+1. Prometheus not accessible
+2. Wrong Prometheus URL
+3. No metrics in Prometheus
+
+**Debug:**
+
+```bash
+# Test Prometheus query manually
+curl "http://prometheus-server.monitoring.svc.cluster.local/api/v1/query?query=up"
+
+# Check if metrics exist
+curl "http://prometheus-server.monitoring.svc.cluster.local/api/v1/query?query=container_cpu_usage_seconds_total"
+```
+
+### Issue 5: HTML Dashboard not showing
+
+**Check:**
+
+```bash
+# Verify the HTML Publisher plugin is installed:
+# Jenkins → Manage Jenkins → Manage Plugins → Installed
+
+# Look for: HTML Publisher Plugin
+
+# If not installed:
+# Manage Plugins → Available → Search "HTML Publisher" → Install
+```
+
+### Issue 6: Telegram notifications not sending
+
+**Check credentials:**
+
+```bash
+# Verify credentials exist:
+# Jenkins → Manage Jenkins → Manage Credentials
+
+# Should see:
+# - telegram-bot-token
+# - telegram-chat-id
+
+# Test manually:
+BOT_TOKEN="<your-bot-token>"
+CHAT_ID="<your-chat-id>"
+
+curl -X POST "https://api.telegram.org/bot${BOT_TOKEN}/sendMessage" \
+    -d chat_id="${CHAT_ID}" \
+    -d text="Test from terminal"
+```
+
+---
+
+## 📊 Understanding the Dashboard
+
+### Metrics Explained:
+
+**Cluster Information:**
+- **Kubernetes Version:** Your K8s version
+- **Nodes:** Number of worker nodes
+- **Namespaces:** Total namespaces
+- **Total Pods:** All pods across cluster
+
+**Resource Capacity:**
+- **Total CPU Cores:** Sum of all node CPUs
+- **Total Memory:** Sum of all node RAM
+- **Avg CPU Usage:** Average CPU across containers
+- **Progress Bar:** Visual CPU usage
+
+**Pod Status:**
+- **Running:** Healthy pods ✅
+- **Pending:** Pods waiting to start ⏳
+- **Failed:** Crashed pods ❌
+- **Total Restarts:** Container restarts (high = problem)
+
+**Monthly Costs:**
+- Based on CPU cores and Memory GB
+- Calculated using rates you configured
+- Estimates infrastructure cost
+
+**Health Checks:**
+- High restart count (>10)
+- Failed pods (>0)
+- Pending pods (>5)
+- High CPU usage (>80%)
+
+**Resources by Namespace:**
+- Table showing pod/container count per namespace
+- Sorted by pod count (highest first)
+
+---
+
+## 🎨 Customization
+
+### Change Schedule
+
+Edit cron trigger in Jenkinsfile:
+
+```groovy
+triggers {
+    cron('0 8 * * 1-5')  // Weekdays 8 AM
+
+    // Examples:
+    // cron('0 */6 * * *')   // Every 6 hours
+    // cron('0 9 * * MON')   // Mondays 9 AM
+    // cron('0 0 * * *')     // Daily midnight
+}
+```
+
+### Add More Metrics
+
+Add to Prometheus queries section:
+
+```groovy
+// Disk I/O
+env.DISK_READ_MB = queryPrometheus(
+    "sum(rate(container_fs_reads_bytes_total[5m])) / 1024 / 1024"
+)
+
+// HTTP Requests (if you have metrics)
+env.HTTP_REQUESTS_PER_SEC = queryPrometheus(
+    "sum(rate(http_requests_total[5m]))"
+)
+```
+
+Then add to HTML dashboard:
+
+```html
+<div>
+    <span>Disk Read</span>
+    <span>${env.DISK_READ_MB} MB/s</span>
+</div>
+```
+
+### Change Colors/Styling
+
+Edit CSS in `generateDashboardHTML()`:
+
+```css
+/* Change main gradient */
+background: linear-gradient(135deg, #YOUR_COLOR1 0%, #YOUR_COLOR2 100%);
+
+/* Change card colors */
+.card h2 {
+    color: #YOUR_COLOR;
+}
+```
+
+### Add Email Recipients
+
+Add to Jenkinsfile:
+
+```groovy
+post {
+    success {
+        emailext (
+            to: 'devops-team@company.com',
+            subject: "Cluster Health Report - ${new Date().format('yyyy-MM-dd')}",
+            body: '''
+                <h2>Daily Cluster Health Report</h2>
+                <p>Please see attached dashboard.</p>
+            ''',
+            mimeType: 'text/html',
+            attachmentsPattern: '**/dashboard.html'
+        )
+    }
+}
+```
+
+---
+
+## 📈 Usage Examples
+
+### Weekly Review
+
+```
+Monday 8 AM → Dashboard generated
+Review:
+- Are costs increasing? Why?
+- Any failed pods? Investigate
+- CPU usage trending up? Scale?
+- Restarts increasing? Bug in app?
+```
+
+### Cost Tracking
+
+```
+Week 1: $150/month
+Week 2: $180/month ⚠️ (+20%)
+→ Check namespace-stats.json
+→ Which namespace grew?
+→ Review pod counts
+```
+
+### Capacity Planning
+
+```
+Current: 12 CPU cores, 23.5% usage
+If usage > 70% for 7 days:
+→ Time to add nodes
+→ Dashboard shows trend
+```
+
+### Health Monitoring
+
+```
+Dashboard shows:
+❌ 5 pods in Failed state
+⚠️ 15 container restarts
+
+→ Click artifact → all-pods.json
+→ Find which pods
+→ kubectl logs <pod-name>
+→ Fix issue
+```
+
+---
+
+## 🔗 Integration with Other Tools
+
+### Export to Grafana
+
+Use `report.json`:
+
+```bash
+# Download report.json from Jenkins artifact
+# Import to Grafana via JSON API datasource
+# Create time-series dashboard
+```
+
+### Send to Slack
+
+Add Slack webhook:
+
+```groovy
+post {
+    success {
+        sh """
+            curl -X POST ${SLACK_WEBHOOK_URL} \
+            -H 'Content-Type: application/json' \
+            -d '{
+                "text": "Daily Cluster Report: ${env.MONTHLY_TOTAL_COST} USD/month",
+                "attachments": [{
+                    "color": "good",
+                    "fields": [
+                        {"title": "Nodes", "value": "${env.NODE_COUNT}", "short": true},
+                        {"title": "Pods", "value": "${env.POD_COUNT}", "short": true}
+                    ]
+                }]
+            }'
+        """
+    }
+}
+```
+
+### Store in Database
+
+Parse JSON and insert:
+
+```groovy
+stage('Store in Database') {
+    steps {
+        script {
+            def report = readJSON file: "${OUTPUT_DIR}/report.json"
+
+            sh """
+                psql -h postgres -U metrics -d cluster_metrics -c "
+                INSERT INTO daily_reports (date, cpu_usage, pod_count, cost)
+                VALUES ('${report.generated_at}', ${report.resources.avg_cpu_usage_percent},
+                        ${report.cluster.total_pods}, ${report.costs.monthly_total_usd})
+                "
+            """
+        }
+    }
+}
+```
+
+---
+
+## ✅ Verification Checklist
+
+After setup, verify:
+
+- [ ] Jenkins job created
+- [ ] First build succeeds
+- [ ] HTML dashboard accessible
+- [ ] Metrics show real data (not zeros)
+- [ ] Telegram notification received
+- [ ] Costs calculated correctly
+- [ ] JSON report generated
+- [ ] Namespace table populated
+- [ ] Health checks working
+- [ ] Schedule triggers correctly
+
+---
+
+## 📚 Next Steps
+
+### Enhancements:
+1. **Historical Tracking** - Store reports in Git or database
+2. **Alerts** - Trigger alerts on threshold breaches
+3. **Comparison** - Compare week-over-week trends
+4. **Recommendations** - Auto-suggest optimizations
+5. **Deep Dive** - Per-namespace detailed reports
+
+### Related Pipelines:
+- Security Scanning (scan images from this report)
+- Cleanup Pipeline (remove resources shown as unused)
+- Backup Pipeline (backup based on importance shown here)
+
+---
+
+**You're all set! 🎉**
+
+Run your first build and enjoy your cluster health dashboard! 📊✨
\ No newline at end of file