docs(grafana): Add comprehensive Loki integration guide
apps/grafana/README-LOKI-INTEGRATION.md
# 📊 Grafana + Loki Integration

## ✅ What Was Configured

### 1. Loki Data Source
**File**: `apps/grafana/loki-datasource.yaml`

Automatically adds Loki as a data source in Grafana:
- **URL**: http://loki.loki.svc.cluster.local:3100
- **Type**: Loki
- **Access**: Proxy (internal cluster access)

### 2. Loki Logs Dashboard
**File**: `apps/grafana/loki-dashboard.yaml`

Comprehensive dashboard with 7 panels:

#### 📈 Panel 1: Log Rate by Namespace
- Real-time log ingestion rate
- Grouped by namespace
- Shows logs/second

#### 🔥 Panel 2: Error Rate by Namespace
- Errors, exceptions, and fatal messages
- Per namespace breakdown
- 5-minute rate

#### ⚠️ Panel 3: Total Errors (Last Hour)
- Gauge showing total error count
- Color-coded thresholds:
  - Green: < 10 errors
  - Yellow: 10-50 errors
  - Red: > 50 errors
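
The gauge thresholds above can be read as a simple mapping from the hourly error count to a color. The sketch below is only an illustration of that mapping; the function name `gauge_color` is hypothetical, and the real thresholds are defined in the dashboard JSON, which may treat the boundary values 10 and 50 differently.

```python
def gauge_color(error_count: int) -> str:
    """Map an hourly error count to the gauge color listed in this guide.

    Assumption: 10 and 50 both fall in the yellow band ("10-50 errors").
    """
    if error_count < 0:
        raise ValueError("error_count must be non-negative")
    if error_count < 10:
        return "green"
    if error_count <= 50:
        return "yellow"
    return "red"


if __name__ == "__main__":
    for count in (0, 9, 10, 50, 51):
        print(count, gauge_color(count))
```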

#### 🔍 Panel 4: Log Browser
- Interactive log viewer
- Filterable by:
  - Namespace (dropdown)
  - Pod (dropdown)
  - Search text (free text)
- Live tail capability

#### 📊 Panel 5: Top 10 Namespaces by Log Volume
- Pie chart showing which namespaces generate most logs
- Based on last hour

#### 🎯 Panel 6: Top 10 Pods by Log Volume
- Pie chart of chattiest pods
- Filtered by selected namespace

#### 🚨 Panel 7: Errors & Warnings
- All errors/warnings across cluster
- Full log details
- Sortable and searchable

---

## 🚀 How to Access

### 1. Wait for ArgoCD Sync
ArgoCD applies the changes automatically, typically within 2-3 minutes.

### 2. Access Grafana
```bash
# Get the Grafana URL from the ingress
kubectl get ingress -n monitoring

# Or port-forward, then open http://localhost:3000
kubectl port-forward -n monitoring svc/k8s-monitoring-grafana 3000:80
```

### 3. Find the Dashboard
In Grafana:
1. Click **Dashboards** (left menu)
2. Search for **"Loki Logs Dashboard"**
3. Or navigate to: **Dashboards → Browse → loki-logs**

---

## 🔍 How to Use the Dashboard

### Filters at the Top

**Namespace Filter:**
- Select one or multiple namespaces
- Default: All namespaces

**Pod Filter:**
- Dynamically updates based on selected namespace(s)
- Default: All pods

**Search Box:**
- Free-text search across all logs
- Examples:
  - `error` - find errors
  - `timeout` - find timeouts
  - `sync` - find ArgoCD syncs
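
To make the three filters concrete, here is a rough sketch of how namespace, pod, and search text could be combined into a single LogQL query string. It is not taken from `loki-dashboard.yaml`: the dashboard relies on Grafana template variables, and the helper names `build_log_query` and `_quote` are hypothetical. The sketch assumes "All" is expressed as the regex `.+` and that the search box maps to a case-sensitive `|=` filter.

```python
import re


def _quote(value: str) -> str:
    """Quote a string for LogQL, escaping backslashes and double quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def _matcher(label: str, values) -> str:
    """Build a regex label matcher; an empty selection means "All" (.+)."""
    pattern = "|".join(re.escape(v) for v in values) if values else ".+"
    return f"{label}=~{_quote(pattern)}"


def build_log_query(namespaces=(), pods=(), search: str = "") -> str:
    """Combine the namespace, pod, and search filters into one log query."""
    selector = "{" + ", ".join(
        [_matcher("namespace", namespaces), _matcher("pod", pods)]
    ) + "}"
    if search:
        return f"{selector} |= {_quote(search)}"
    return selector


if __name__ == "__main__":
    print(build_log_query(["argocd"], search="sync"))
    print(build_log_query(["loki", "grafana"]))
```

Selecting several namespaces becomes an alternation inside one regex matcher, which mirrors the `{namespace=~"loki|argocd|grafana"}` form used elsewhere in this guide.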

### Example Workflows

#### 1. Debug Application Errors
```
1. Select namespace: "default"
2. Select pod: "myapp-xyz"
3. Search: "error"
4. Look at "Log Browser" panel
```

#### 2. Monitor ArgoCD
```
1. Select namespace: "argocd"
2. Search: "sync"
3. Check "Error Rate" and "Log Browser"
```

#### 3. Find Noisy Pods
```
1. Look at "Top 10 Pods by Log Volume"
2. Click on highest pod
3. Use "Log Browser" to see what it's logging
```

#### 4. Cluster-Wide Error Monitoring
```
1. Set namespace to "All"
2. Check "Total Errors" gauge
3. Look at "Errors & Warnings" panel at bottom
```

---

## 📝 LogQL Query Examples

The dashboard uses LogQL (Loki Query Language). Here are some queries you can use:

### Basic Queries
```logql
# All logs from a namespace
{namespace="loki"}

# Logs from a specific pod
{pod="loki-0"}

# Multiple namespaces
{namespace=~"loki|argocd|grafana"}
```

### Filtering
```logql
# Contains "error" (case-sensitive)
{namespace="default"} |= "error"

# Regex match
{namespace="argocd"} |~ "sync|deploy"

# Does not contain "debug"
{namespace="loki"} != "debug"

# Case-insensitive match
{namespace="default"} |~ "(?i)error"
```
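
The four line filter operators can be approximated locally to check what a filter would keep before running it against Loki. This is a sketch, not Loki's implementation: Loki evaluates regexes with RE2 syntax, whereas Python's `re` module is used here and only agrees for simple patterns. `line_filter` and the sample lines are made up for illustration.

```python
import re

# Each operator maps to a predicate over (line, pattern).
FILTERS = {
    "|=": lambda line, p: p in line,
    "!=": lambda line, p: p not in line,
    "|~": lambda line, p: re.search(p, line) is not None,
    "!~": lambda line, p: re.search(p, line) is None,
}


def line_filter(lines, op, pattern):
    """Return the lines that a LogQL-style line filter would keep."""
    if op not in FILTERS:
        raise ValueError(f"unknown operator: {op}")
    keep = FILTERS[op]
    return [line for line in lines if keep(line, pattern)]


SAMPLE = [  # made-up log lines
    'level=info msg="sync started"',
    'level=error msg="sync failed"',
    "ERROR: connection timeout",
    'level=debug msg="cache hit"',
]

if __name__ == "__main__":
    print(line_filter(SAMPLE, "|=", "error"))
    print(line_filter(SAMPLE, "|~", "(?i)error"))
```

The two calls differ only in case handling: `|=` is an exact substring match, so it misses the line that starts with `ERROR`, while the `(?i)` regex catches both.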

### Metrics Queries
```logql
# Log rate per namespace (log lines per second over a 1-minute window)
sum by (namespace) (rate({namespace=~".+"}[1m]))

# Error count over the last 5 minutes (case-insensitive match)
sum(count_over_time({namespace=~".+"} |~ "(?i)error"[5m]))

# Top 10 namespaces by log line count over the last hour
topk(10, sum by (namespace) (count_over_time({namespace=~".+"}[1h])))
```
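
The relationship between `count_over_time` and `rate` is plain arithmetic: the rate is the number of entries in the window divided by the window length in seconds. The sketch below works this through on synthetic timestamps. The function names mirror the LogQL functions for readability only, and the window boundary handling is a simplification rather than Loki's exact behavior.

```python
def count_over_time(timestamps, now, window_s):
    """Count entries with a timestamp in the window (now - window_s, now]."""
    return sum(1 for t in timestamps if now - window_s < t <= now)


def rate(timestamps, now, window_s):
    """Per-second rate: entries in the window divided by the window length."""
    return count_over_time(timestamps, now, window_s) / window_s


if __name__ == "__main__":
    # Synthetic data: one log line every 0.5 s for 5 minutes (600 lines).
    stamps = [i * 0.5 for i in range(1, 601)]
    print(count_over_time(stamps, now=300, window_s=300))
    print(rate(stamps, now=300, window_s=300))
```

With 600 lines in a 300-second window, the count is 600 and the rate is 2 lines per second, which is why a panel in logs/second and a gauge of total errors can be built from the same selector.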

### JSON Parsing
```logql
# Parse JSON logs and keep only lines whose "level" field is "error"
{namespace="app"} | json | level="error"

# Print only the "message" field of each JSON log line
{namespace="app"} | json | line_format "{{.message}}"
```

---

## 🎨 Dashboard Customization

### Add New Panel

1. Click **"Add panel"** (top right)
2. Select **"Loki"** as data source
3. Write your LogQL query
4. Choose visualization type
5. Save

### Useful Panel Types

- **Time series**: For rates and counts over time
- **Logs**: For viewing actual log lines
- **Stat**: For single values (like total errors)
- **Gauge**: For thresholds (like error counts)
- **Table**: For structured data
- **Pie chart**: For distribution

---

## 🔧 Troubleshooting

### Dashboard Not Appearing

```bash
# Check that the dashboard ConfigMap exists
kubectl get configmap -n monitoring loki-logs-dashboard

# Check Grafana pod logs
kubectl logs -n monitoring -l app.kubernetes.io/name=grafana

# Restart Grafana
kubectl rollout restart deployment/k8s-monitoring-grafana -n monitoring
```

### Data Source Not Working

```bash
# Test Loki from the Grafana pod
kubectl exec -n monitoring -it deployment/k8s-monitoring-grafana -- \
  curl http://loki.loki.svc.cluster.local:3100/ready

# Should return: ready
```

### No Logs Showing

```bash
# Check that Promtail is running
kubectl get pods -n loki -l app.kubernetes.io/name=promtail

# Check Promtail logs
kubectl logs -n loki -l app.kubernetes.io/name=promtail --tail=50

# Query the Loki API directly: list the known label names
kubectl exec -n monitoring -it deployment/k8s-monitoring-grafana -- \
  curl "http://loki.loki.svc.cluster.local:3100/loki/api/v1/labels"
```
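
Beyond listing labels, it can help to run an actual LogQL query against Loki's HTTP API. The sketch below only builds the request URL for the `/loki/api/v1/query_range` endpoint, so it can be pasted into the `curl` command above; it does not send anything. The helper name `query_range_url` and the default `limit` are assumptions for illustration; the base URL is the one used throughout this guide.

```python
from urllib.parse import urlencode

LOKI_URL = "http://loki.loki.svc.cluster.local:3100"  # from this guide


def query_range_url(query, start_ns, end_ns, limit=100, base_url=LOKI_URL):
    """Build a URL for Loki's query_range endpoint.

    start_ns and end_ns are Unix timestamps in nanoseconds.
    """
    params = {"query": query, "start": start_ns, "end": end_ns, "limit": limit}
    return f"{base_url}/loki/api/v1/query_range?{urlencode(params)}"


if __name__ == "__main__":
    import time

    end = time.time_ns()
    start = end - 3600 * 1_000_000_000  # one hour earlier
    print(query_range_url('{namespace="loki"}', start, end, limit=5))
```

`urlencode` takes care of percent-encoding the braces and quotes in the selector, which is the part most often broken when the URL is typed by hand.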

---

## 📊 What Logs Are Collected

Promtail collects logs from:

### 1. All Pod Logs
```
/var/log/pods/<namespace>_<pod>_<uid>/<container>/*.log
```
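
The directory layout above encodes the namespace, pod name, pod UID, and container name. As an illustration of that layout only, the sketch below splits such a path into its parts. Promtail typically takes these values from Kubernetes metadata rather than by parsing the path, and both the helper `parse_pod_log_path` and the UID in the example are made up.

```python
from pathlib import PurePosixPath


def parse_pod_log_path(path):
    """Split /var/log/pods/<namespace>_<pod>_<uid>/<container>/<file> into parts."""
    parts = PurePosixPath(path).parts
    # Expected: ('/', 'var', 'log', 'pods', '<ns>_<pod>_<uid>', '<container>', '<file>')
    if len(parts) != 7 or parts[:4] != ("/", "var", "log", "pods"):
        raise ValueError(f"not a pod log path: {path}")
    # Namespace and pod names cannot contain "_", so a plain split is enough;
    # unpacking raises ValueError if the directory name has the wrong shape.
    namespace, pod, uid = parts[4].split("_")
    return {"namespace": namespace, "pod": pod, "uid": uid, "container": parts[5]}


if __name__ == "__main__":
    # Hypothetical UID, for illustration only.
    print(parse_pod_log_path("/var/log/pods/loki_loki-0_1234abcd/loki/0.log"))
```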

### 2. Labels Added Automatically
Every log line gets these labels:
- `namespace` - Kubernetes namespace
- `pod` - Pod name
- `container` - Container name
- `node` - Node where pod runs
- `job` - Always "kubernetes-pods"

### 3. Example Log Entry
```json
{
  "namespace": "loki",
  "pod": "loki-0",
  "container": "loki",
  "node": "master1",
  "timestamp": "2026-01-05T13:30:00Z",
  "line": "level=info msg=\"flushing stream\""
}
```
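
The entry above is a flattened view. Loki's query API returns log results grouped into streams, where each stream carries its label set once and a list of `[timestamp_in_nanoseconds, line]` pairs. The sketch below flattens such a response into entries shaped like the example. The function name `flatten_streams` and the sample response are illustrative assumptions, not output captured from this cluster.

```python
from datetime import datetime, timezone


def flatten_streams(response):
    """Turn a Loki "streams" query result into one dict per log line."""
    entries = []
    for stream in response["data"]["result"]:
        labels = stream["stream"]
        for ts_ns, line in stream["values"]:
            # Timestamps arrive as nanosecond strings; integer division
            # avoids float rounding on such large numbers.
            seconds = int(ts_ns) // 1_000_000_000
            ts = datetime.fromtimestamp(seconds, tz=timezone.utc)
            entries.append(
                {**labels, "timestamp": ts.strftime("%Y-%m-%dT%H:%M:%SZ"), "line": line}
            )
    return entries


SAMPLE_RESPONSE = {  # made-up response mirroring the example entry
    "status": "success",
    "data": {
        "resultType": "streams",
        "result": [
            {
                "stream": {"namespace": "loki", "pod": "loki-0",
                           "container": "loki", "node": "master1"},
                "values": [["1767619800000000000",
                            'level=info msg="flushing stream"']],
            }
        ],
    },
}

if __name__ == "__main__":
    for entry in flatten_streams(SAMPLE_RESPONSE):
        print(entry)
```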

---

## 🎯 Advanced Features

### Live Tail
Click the **"Live"** button in the Log Browser panel to stream logs in real time.

### Context
Click any log line, then **"Show context"**, to see the surrounding logs.

### Log Details
Click on any log line to see:
- All labels
- Parsed fields (if JSON)
- Timestamp
- Full message

### Sharing
Click **"Share"** (top right) to:
- Copy link
- Create snapshot
- Export as JSON

---

## 🚨 Alerting (Optional)

You can create alerts based on log patterns:

### Example: Alert on High Error Rate
```yaml
alert: HighErrorRate
expr: sum(rate({namespace=~".+"} |~ "(?i)error"[5m])) > 10
for: 5m
annotations:
  summary: "High error rate detected"
  description: "{{ $value }} errors/sec"
```
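
The `for: 5m` field means the expression must stay above the threshold continuously for five minutes before the alert fires; until then it is only pending. The sketch below simulates that behavior on a list of `(timestamp_seconds, value)` samples, assuming the usual Prometheus-style semantics. The function `alert_states` and the sample values are hypothetical and not part of Loki or Grafana.

```python
def alert_states(samples, threshold=10.0, for_s=300):
    """Yield (timestamp, state) for each (timestamp, value) sample.

    state is "inactive", "pending", or "firing".
    """
    pending_since = None
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts  # condition just became true
            state = "firing" if ts - pending_since >= for_s else "pending"
        else:
            pending_since = None  # any dip below the threshold resets the timer
            state = "inactive"
        yield ts, state


if __name__ == "__main__":
    # Made-up error rates (errors/sec) sampled at the given seconds.
    samples = [(0, 5), (60, 12), (120, 12), (360, 12), (420, 3), (480, 20)]
    for ts, state in alert_states(samples):
        print(ts, state)
```

A short spike that drops back under the threshold never leaves the pending state, which is the point of setting `for` on a noisy signal like an error rate.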

To add alerts in Grafana:
1. Go to the dashboard panel
2. Click the "Alert" tab
3. Configure the threshold
4. Set a notification channel (called a contact point in newer Grafana versions)

---

## 📈 Performance Tips

### 1. Limit Time Range
- Use smaller time ranges for faster queries
- Default: 1 hour (good balance)

### 2. Use Filters
- Always filter by namespace/pod when possible
- Reduces data scanned

### 3. Dashboard Refresh
- Default: 10 seconds
- Increase if experiencing lag

### 4. Log Volume
- Monitor "Top 10" panels
- Consider log retention policy if volume is high

---

## 🔗 Useful Links

- **Loki API**: http://loki.loki.svc.cluster.local:3100
- **Loki Ready**: http://loki.loki.svc.cluster.local:3100/ready
- **Loki Metrics**: http://loki.loki.svc.cluster.local:3100/metrics
- **LogQL Docs**: https://grafana.com/docs/loki/latest/logql/

---

## 📋 Quick Reference

### LogQL Operators
- `|=` - Line contains string (exact, case-sensitive)
- `!=` - Line does not contain string
- `|~` - Line matches regex
- `!~` - Line does not match regex
- `| json` - Parse JSON
- `| logfmt` - Parse logfmt
- `| line_format` - Format output

### Rate Functions
- `rate()` - Log lines per second
- `count_over_time()` - Total count of log lines in the range
- `bytes_over_time()` - Total bytes in the range
- `bytes_rate()` - Bytes per second

### Aggregations
- `sum by (label)` - Sum grouped by label
- `count by (label)` - Count grouped
- `avg by (label)` - Average
- `max by (label)` - Maximum
- `topk(n, query)` - Top N results
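
To show how `sum by (label)` and `topk` compose, as in the "Top 10" panels, here is a small sketch that applies both to per-pod counts. It is a local illustration only: `sum_by`, `topk`, the pod names, and the counts are made up, and Loki performs these aggregations server-side.

```python
from collections import Counter


def sum_by(series, label):
    """Sum the values of all series that share the same value for `label`."""
    totals = Counter()
    for labels, value in series:
        totals[labels[label]] += value
    return dict(totals)


def topk(k, totals):
    """Return the k largest (key, value) pairs, largest first."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:k]


SERIES = [  # made-up per-pod log line counts for the last hour
    ({"namespace": "loki", "pod": "loki-0"}, 120),
    ({"namespace": "argocd", "pod": "argocd-server-a"}, 300),
    ({"namespace": "argocd", "pod": "argocd-server-b"}, 50),
    ({"namespace": "default", "pod": "myapp-xyz"}, 80),
]

if __name__ == "__main__":
    per_namespace = sum_by(SERIES, "namespace")
    print(topk(2, per_namespace))
```

Grouping happens first and ranking second, which matches the nesting order in `topk(10, sum by (namespace) (...))`.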

---

**Dashboard is ready! It will appear in Grafana after ArgoCD syncs (~2-3 minutes).** 🎉

## Next Steps

1. ✅ Wait for ArgoCD sync
2. ✅ Open Grafana
3. ✅ Find "Loki Logs Dashboard"
4. ✅ Start exploring your logs!