diff --git a/apps/grafana/README-LOKI-INTEGRATION.md b/apps/grafana/README-LOKI-INTEGRATION.md
new file mode 100644
index 0000000..4dec770
--- /dev/null
+++ b/apps/grafana/README-LOKI-INTEGRATION.md
@@ -0,0 +1,388 @@
+# 📊 Grafana + Loki Integration
+
+## ✅ What Was Configured
+
+### 1. Loki Data Source
+**File**: `apps/grafana/loki-datasource.yaml`
+
+Automatically adds Loki as a data source in Grafana:
+- **URL**: http://loki.loki.svc.cluster.local:3100
+- **Type**: Loki
+- **Access**: Proxy (internal cluster access)
+
+### 2. Loki Logs Dashboard
+**File**: `apps/grafana/loki-dashboard.yaml`
+
+Comprehensive dashboard with 7 panels:
+
+#### 📈 Panel 1: Log Rate by Namespace
+- Real-time log ingestion rate
+- Grouped by namespace
+- Shows logs/second
+
+#### 🔥 Panel 2: Error Rate by Namespace
+- Errors, exceptions, and fatal messages
+- Per namespace breakdown
+- 5-minute rate
+
+#### ⚠️ Panel 3: Total Errors (Last Hour)
+- Gauge showing total error count
+- Color-coded thresholds:
+  - Green: < 10 errors
+  - Yellow: 10-50 errors
+  - Red: > 50 errors
+
+#### 🔍 Panel 4: Log Browser
+- Interactive log viewer
+- Filterable by:
+  - Namespace (dropdown)
+  - Pod (dropdown)
+  - Search text (free text)
+- Live tail capability
+
+#### 📊 Panel 5: Top 10 Namespaces by Log Volume
+- Pie chart showing which namespaces generate most logs
+- Based on last hour
+
+#### 🎯 Panel 6: Top 10 Pods by Log Volume
+- Pie chart of chattiest pods
+- Filtered by selected namespace
+
+#### 🚨 Panel 7: Errors & Warnings
+- All errors/warnings across cluster
+- Full log details
+- Sortable and searchable
+
+---
+
+## 🚀 How to Access
+
+### 1. Wait for ArgoCD Sync
+ArgoCD will automatically apply the changes (~2-3 minutes).
+
+### 2. Access Grafana
+```bash
+# Get Grafana URL
+kubectl get ingress -n monitoring
+
+# Or port-forward
+kubectl port-forward -n monitoring svc/k8s-monitoring-grafana 3000:80
+```
+
+### 3. Find the Dashboard
+In Grafana:
+1. Click **Dashboards** (left menu)
+2. Search for **"Loki Logs Dashboard"**
+3. Or navigate to: **Dashboards → Browse → loki-logs**
+
+---
+
+## 🔍 How to Use the Dashboard
+
+### Filters at the Top
+
+**Namespace Filter:**
+- Select one or multiple namespaces
+- Default: All namespaces
+
+**Pod Filter:**
+- Dynamically updates based on selected namespace(s)
+- Default: All pods
+
+**Search Box:**
+- Free-text search across all logs
+- Examples:
+  - `error` - find errors
+  - `timeout` - find timeouts
+  - `sync` - find ArgoCD syncs
+
+### Example Workflows
+
+#### 1. Debug Application Errors
+```
+1. Select namespace: "default"
+2. Select pod: "myapp-xyz"
+3. Search: "error"
+4. Look at "Log Browser" panel
+```
+
+#### 2. Monitor ArgoCD
+```
+1. Select namespace: "argocd"
+2. Search: "sync"
+3. Check "Error Rate" and "Log Browser"
+```
+
+#### 3. Find Noisy Pods
+```
+1. Look at "Top 10 Pods by Log Volume"
+2. Click on highest pod
+3. Use "Log Browser" to see what it's logging
+```
+
+#### 4. Cluster-Wide Error Monitoring
+```
+1. Set namespace to "All"
+2. Check "Total Errors" gauge
+3. Look at "Errors & Warnings" panel at bottom
+```
+
+---
+
+## 📝 LogQL Query Examples
+
+The dashboard uses LogQL (Loki Query Language). Here are some queries you can use:
+
+### Basic Queries
+```logql
+# All logs from a namespace
+{namespace="loki"}
+
+# Logs from specific pod
+{pod="loki-0"}
+
+# Multiple namespaces
+{namespace=~"loki|argocd|grafana"}
+```
+
+### Filtering
+```logql
+# Contains "error"
+{namespace="default"} |= "error"
+
+# Regex match
+{namespace="argocd"} |~ "sync|deploy"
+
+# NOT containing
+{namespace="loki"} != "debug"
+
+# Case insensitive
+{namespace="default"} |~ "(?i)error"
+```
+
+### Metrics Queries
+```logql
+# Log rate per namespace
+sum by (namespace) (rate({namespace=~".+"}[1m]))
+
+# Error count
+sum(count_over_time({namespace=~".+"} |~ "(?i)error"[5m]))
+
+# Top namespaces
+topk(10, sum by (namespace) (count_over_time({namespace=~".+"}[1h])))
+```
+
+### JSON Parsing
+```logql
+# If logs are JSON
+{namespace="app"} | json | level="error"
+
+# Extract fields
+{namespace="app"} | json | line_format "{{.message}}"
+```
+
+---
+
+## 🎨 Dashboard Customization
+
+### Add New Panel
+
+1. Click **"Add panel"** (top right)
+2. Select **"Loki"** as data source
+3. Write your LogQL query
+4. Choose visualization type
+5. Save
+
+### Useful Panel Types
+
+- **Time series**: For rates and counts over time
+- **Logs**: For viewing actual log lines
+- **Stat**: For single values (like total errors)
+- **Gauge**: For thresholds (like error counts)
+- **Table**: For structured data
+- **Pie chart**: For distribution
+
+---
+
+## 🔧 Troubleshooting
+
+### Dashboard Not Appearing
+
+```bash
+# Check ConfigMap created
+kubectl get configmap -n monitoring loki-logs-dashboard
+
+# Check Grafana pod logs
+kubectl logs -n monitoring -l app.kubernetes.io/name=grafana
+
+# Restart Grafana
+kubectl rollout restart deployment/k8s-monitoring-grafana -n monitoring
+```
+
+### Data Source Not Working
+
+```bash
+# Test Loki from Grafana pod
+kubectl exec -n monitoring -it deployment/k8s-monitoring-grafana -- \
+  curl http://loki.loki.svc.cluster.local:3100/ready
+
+# Should return: ready
+```
+
+### No Logs Showing
+
+```bash
+# Check Promtail is running
+kubectl get pods -n loki -l app.kubernetes.io/name=promtail
+
+# Check Promtail logs
+kubectl logs -n loki -l app.kubernetes.io/name=promtail --tail=50
+
+# Test query directly
+kubectl exec -n monitoring -it deployment/k8s-monitoring-grafana -- \
+  curl "http://loki.loki.svc.cluster.local:3100/loki/api/v1/labels"
+```
+
+---
+
+## 📊 What Logs Are Collected
+
+Promtail collects logs from:
+
+### 1. All Pod Logs
+```
+/var/log/pods/__//*.log
+```
+
+### 2. Labels Added Automatically
+Every log line gets these labels:
+- `namespace` - Kubernetes namespace
+- `pod` - Pod name
+- `container` - Container name
+- `node` - Node where pod runs
+- `job` - Always "kubernetes-pods"
+
+### 3. Example Log Entry
+```json
+{
+  "namespace": "loki",
+  "pod": "loki-0",
+  "container": "loki",
+  "node": "master1",
+  "timestamp": "2026-01-05T13:30:00Z",
+  "line": "level=info msg=\"flushing stream\""
+}
+```
+
+---
+
+## 🎯 Advanced Features
+
+### Live Tail
+Click **"Live"** button in Log Browser panel to stream logs in real-time.
+
+### Context
+Click on any log line → "Show context" to see surrounding logs.
+
+### Log Details
+Click on any log line to see:
+- All labels
+- Parsed fields (if JSON)
+- Timestamp
+- Full message
+
+### Sharing
+Click **"Share"** (top right) to:
+- Copy link
+- Create snapshot
+- Export as JSON
+
+---
+
+## 🚨 Alerting (Optional)
+
+You can create alerts based on log patterns:
+
+### Example: Alert on High Error Rate
+```yaml
+alert: HighErrorRate
+expr: sum(rate({namespace=~".+"} |~ "(?i)error"[5m])) > 10
+for: 5m
+annotations:
+  summary: "High error rate detected"
+  description: "{{ $value }} errors/sec"
+```
+
+To add alerts:
+1. Go to dashboard panel
+2. Click "Alert" tab
+3. Configure threshold
+4. Set notification channel
+
+---
+
+## 📈 Performance Tips
+
+### 1. Limit Time Range
+- Use smaller time ranges for faster queries
+- Default: 1 hour (good balance)
+
+### 2. Use Filters
+- Always filter by namespace/pod when possible
+- Reduces data scanned
+
+### 3. Dashboard Refresh
+- Default: 10 seconds
+- Increase if experiencing lag
+
+### 4. Log Volume
+- Monitor "Top 10" panels
+- Consider log retention policy if volume is high
+
+---
+
+## 🔗 Useful Links
+
+- **Loki API**: http://loki.loki.svc.cluster.local:3100
+- **Loki Ready**: http://loki.loki.svc.cluster.local:3100/ready
+- **Loki Metrics**: http://loki.loki.svc.cluster.local:3100/metrics
+- **LogQL Docs**: https://grafana.com/docs/loki/latest/logql/
+
+---
+
+## 📋 Quick Reference
+
+### LogQL Operators
+- `|=` - Contains (exact)
+- `!=` - Does not contain
+- `|~` - Regex match
+- `!~` - Regex not match
+- `| json` - Parse JSON
+- `| logfmt` - Parse logfmt
+- `| line_format` - Format output
+
+### Rate Functions
+- `rate()` - Per-second rate
+- `count_over_time()` - Total count
+- `bytes_over_time()` - Total bytes
+- `bytes_rate()` - Bytes per second
+
+### Aggregations
+- `sum by (label)` - Sum grouped by label
+- `count by (label)` - Count grouped
+- `avg by (label)` - Average
+- `max by (label)` - Maximum
+- `topk(n, query)` - Top N results
+
+---
+
+**Dashboard is ready! It will appear in Grafana after ArgoCD syncs (~2-3 minutes).** 🎉
+
+## Next Steps
+
+1. ✅ Wait for ArgoCD sync
+2. ✅ Open Grafana
+3. ✅ Find "Loki Logs Dashboard"
+4. ✅ Start exploring your logs!
+
+Want me to add more panels or create specific queries for your use case?