docs(grafana): Add comprehensive Loki integration guide
# 📊 Grafana + Loki Integration

## ✅ What Was Configured

### 1. Loki Data Source
**File**: `apps/grafana/loki-datasource.yaml`

Automatically adds Loki as a data source in Grafana:
- **URL**: http://loki.loki.svc.cluster.local:3100
- **Type**: Loki
- **Access**: Proxy (internal cluster access)
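The repository file is the source of truth; as a rough illustration, a sidecar-provisioned data source for this setup would look something like the sketch below (the `grafana_datasource` label and the ConfigMap name follow the common Grafana sidecar convention and are assumptions, not copied from the repo):

```yaml
# Hypothetical sketch of apps/grafana/loki-datasource.yaml — verify against the actual file.
apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-datasource          # assumed name
  namespace: monitoring
  labels:
    grafana_datasource: "1"      # the Grafana sidecar picks up ConfigMaps with this label
data:
  loki-datasource.yaml: |
    apiVersion: 1
    datasources:
      - name: Loki
        type: loki
        access: proxy
        url: http://loki.loki.svc.cluster.local:3100
        isDefault: false
```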
### 2. Loki Logs Dashboard
**File**: `apps/grafana/loki-dashboard.yaml`

A comprehensive dashboard with 7 panels (a sketch of how it is wired into Grafana follows the panel list):

#### 📈 Panel 1: Log Rate by Namespace
- Real-time log ingestion rate
- Grouped by namespace
- Shows logs/second

#### 🔥 Panel 2: Error Rate by Namespace
- Errors, exceptions, and fatal messages
- Per-namespace breakdown
- 5-minute rate

#### ⚠️ Panel 3: Total Errors (Last Hour)
- Gauge showing total error count
- Color-coded thresholds:
  - Green: < 10 errors
  - Yellow: 10-50 errors
  - Red: > 50 errors

#### 🔍 Panel 4: Log Browser
- Interactive log viewer
- Filterable by:
  - Namespace (dropdown)
  - Pod (dropdown)
  - Search text (free text)
- Live tail capability

#### 📊 Panel 5: Top 10 Namespaces by Log Volume
- Pie chart showing which namespaces generate the most logs
- Based on the last hour

#### 🎯 Panel 6: Top 10 Pods by Log Volume
- Pie chart of the chattiest pods
- Filtered by the selected namespace

#### 🚨 Panel 7: Errors & Warnings
- All errors and warnings across the cluster
- Full log details
- Sortable and searchable
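How the dashboard reaches Grafana depends on the repo's manifests; a minimal sketch, assuming the standard Grafana dashboard sidecar (the `grafana_dashboard` label is that convention's assumption, the ConfigMap name matches the troubleshooting commands later in this guide, and the `uid` is inferred from the `loki-logs` path mentioned below):

```yaml
# Hypothetical shape of the ConfigMap in apps/grafana/loki-dashboard.yaml — check the real file.
apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-logs-dashboard      # same name used in the Troubleshooting section
  namespace: monitoring
  labels:
    grafana_dashboard: "1"       # the Grafana dashboard sidecar watches for this label
data:
  loki-logs.json: |
    { "title": "Loki Logs Dashboard", "uid": "loki-logs", "panels": [] }
```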
---

## 🚀 How to Access

### 1. Wait for ArgoCD Sync
ArgoCD will automatically apply the changes (~2-3 minutes).

### 2. Access Grafana
```bash
# Get the Grafana URL
kubectl get ingress -n monitoring

# Or port-forward
kubectl port-forward -n monitoring svc/k8s-monitoring-grafana 3000:80
```
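If you need the admin password for the login screen, a hedged one-liner (this assumes the chart stores it in a `k8s-monitoring-grafana` secret under the conventional `admin-password` key; adjust the secret name to your release):

```bash
# Decode the Grafana admin password (secret name and key are assumptions based on Grafana chart defaults)
kubectl get secret -n monitoring k8s-monitoring-grafana \
  -o jsonpath='{.data.admin-password}' | base64 -d && echo
```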
### 3. Find the Dashboard
In Grafana:
1. Click **Dashboards** (left menu)
2. Search for **"Loki Logs Dashboard"**
3. Or navigate to: **Dashboards → Browse → loki-logs**

---

## 🔍 How to Use the Dashboard

### Filters at the Top

**Namespace Filter:**
- Select one or multiple namespaces
- Default: All namespaces

**Pod Filter:**
- Dynamically updates based on the selected namespace(s)
- Default: All pods

**Search Box:**
- Free-text search across all logs
- Examples:
  - `error` - find errors
  - `timeout` - find timeouts
  - `sync` - find ArgoCD syncs

### Example Workflows

#### 1. Debug Application Errors
```
1. Select namespace: "default"
2. Select pod: "myapp-xyz"
3. Search: "error"
4. Look at the "Log Browser" panel
```
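For reference, roughly the LogQL this workflow corresponds to (shown as a case-insensitive variant; the exact query the panel runs depends on the dashboard's variables):

```logql
{namespace="default", pod="myapp-xyz"} |~ "(?i)error"
```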
#### 2. Monitor ArgoCD
```
1. Select namespace: "argocd"
2. Search: "sync"
3. Check "Error Rate" and "Log Browser"
```

#### 3. Find Noisy Pods
```
1. Look at "Top 10 Pods by Log Volume"
2. Click on the highest pod
3. Use "Log Browser" to see what it's logging
```

#### 4. Cluster-Wide Error Monitoring
```
1. Set namespace to "All"
2. Check the "Total Errors" gauge
3. Look at the "Errors & Warnings" panel at the bottom
```

---

## 📝 LogQL Query Examples

The dashboard uses LogQL (Loki Query Language). Here are some queries you can use:

### Basic Queries
```logql
# All logs from a namespace
{namespace="loki"}

# Logs from a specific pod
{pod="loki-0"}

# Multiple namespaces
{namespace=~"loki|argocd|grafana"}
```

### Filtering
```logql
# Contains "error"
{namespace="default"} |= "error"

# Regex match
{namespace="argocd"} |~ "sync|deploy"

# NOT containing
{namespace="loki"} != "debug"

# Case-insensitive
{namespace="default"} |~ "(?i)error"
```

### Metrics Queries
```logql
# Log rate per namespace
sum by (namespace) (rate({namespace=~".+"}[1m]))

# Error count
sum(count_over_time({namespace=~".+"} |~ "(?i)error" [5m]))

# Top namespaces
topk(10, sum by (namespace) (count_over_time({namespace=~".+"}[1h])))
```

### JSON Parsing
```logql
# If logs are JSON
{namespace="app"} | json | level="error"

# Extract fields
{namespace="app"} | json | line_format "{{.message}}"
```
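You can also run any of these queries against Loki's HTTP API without going through Grafana. A sketch, reusing the in-cluster URL and the `kubectl exec` + `curl` pattern from the Troubleshooting section below:

```bash
# Query recent ArgoCD "sync" lines via the Loki API
# (start/end default to roughly the last hour when omitted).
kubectl exec -n monitoring -it deployment/k8s-monitoring-grafana -- \
  curl -sG "http://loki.loki.svc.cluster.local:3100/loki/api/v1/query_range" \
    --data-urlencode 'query={namespace="argocd"} |= "sync"' \
    --data-urlencode 'limit=10'
```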
---

## 🎨 Dashboard Customization

### Add a New Panel

1. Click **"Add panel"** (top right)
2. Select **"Loki"** as the data source
3. Write your LogQL query
4. Choose a visualization type
5. Save

### Useful Panel Types

- **Time series**: For rates and counts over time
- **Logs**: For viewing actual log lines
- **Stat**: For single values (like total errors)
- **Gauge**: For thresholds (like error counts)
- **Table**: For structured data
- **Pie chart**: For distributions
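If you prefer editing `loki-dashboard.yaml` instead of the UI, a single panel in the dashboard JSON looks roughly like this minimal sketch (field names follow recent Grafana schemas; the datasource `uid` is an assumption and must match your provisioned Loki data source):

```json
{
  "type": "timeseries",
  "title": "ArgoCD log rate",
  "datasource": { "type": "loki", "uid": "loki" },
  "targets": [
    { "refId": "A", "expr": "sum(rate({namespace=\"argocd\"}[1m]))" }
  ]
}
```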
---

## 🔧 Troubleshooting

### Dashboard Not Appearing

```bash
# Check that the ConfigMap was created
kubectl get configmap -n monitoring loki-logs-dashboard

# Check Grafana pod logs
kubectl logs -n monitoring -l app.kubernetes.io/name=grafana

# Restart Grafana
kubectl rollout restart deployment/k8s-monitoring-grafana -n monitoring
```

### Data Source Not Working

```bash
# Test Loki from the Grafana pod
kubectl exec -n monitoring -it deployment/k8s-monitoring-grafana -- \
  curl http://loki.loki.svc.cluster.local:3100/ready

# Should return: ready
```

### No Logs Showing

```bash
# Check Promtail is running
kubectl get pods -n loki -l app.kubernetes.io/name=promtail

# Check Promtail logs
kubectl logs -n loki -l app.kubernetes.io/name=promtail --tail=50

# Test a query directly
kubectl exec -n monitoring -it deployment/k8s-monitoring-grafana -- \
  curl "http://loki.loki.svc.cluster.local:3100/loki/api/v1/labels"
```

---

## 📊 What Logs Are Collected

Promtail collects logs from:

### 1. All Pod Logs
```
/var/log/pods/<namespace>_<pod>_<uid>/<container>/*.log
```

### 2. Labels Added Automatically
Every log line gets these labels:
- `namespace` - Kubernetes namespace
- `pod` - Pod name
- `container` - Container name
- `node` - Node where the pod runs
- `job` - Always "kubernetes-pods"
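These labels come from Promtail's Kubernetes service discovery. A minimal sketch of the relevant relabeling, approximating what the Promtail Helm chart configures by default (the exact chart config may differ):

```yaml
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod                # discover every pod via the Kubernetes API
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_container_name]
        target_label: container
      - source_labels: [__meta_kubernetes_pod_node_name]
        target_label: node
```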
### 3. Example Log Entry
```json
{
  "namespace": "loki",
  "pod": "loki-0",
  "container": "loki",
  "node": "master1",
  "timestamp": "2026-01-05T13:30:00Z",
  "line": "level=info msg=\"flushing stream\""
}
```

---

## 🎯 Advanced Features

### Live Tail
Click the **"Live"** button in the Log Browser panel to stream logs in real time.

### Context
Click on any log line → "Show context" to see the surrounding logs.

### Log Details
Click on any log line to see:
- All labels
- Parsed fields (if JSON)
- Timestamp
- Full message

### Sharing
Click **"Share"** (top right) to:
- Copy a link
- Create a snapshot
- Export as JSON

---

## 🚨 Alerting (Optional)

You can create alerts based on log patterns:

### Example: Alert on High Error Rate
```yaml
alert: HighErrorRate
expr: sum(rate({namespace=~".+"} |~ "(?i)error" [5m])) > 10
for: 5m
annotations:
  summary: "High error rate detected"
  description: "{{ $value }} errors/sec"
```
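This rule is written in the Prometheus-compatible format the Loki ruler understands; if you manage alerts that way, it would be wrapped in a rule group roughly like the sketch below (the group name and file layout are illustrative, and the Loki ruler must be enabled for this to take effect):

```yaml
# Hypothetical ruler rules file — adapt to however your Loki ruler loads rules.
groups:
  - name: loki-log-alerts
    rules:
      - alert: HighErrorRate
        expr: sum(rate({namespace=~".+"} |~ "(?i)error" [5m])) > 10
        for: 5m
        annotations:
          summary: "High error rate detected"
          description: "{{ $value }} errors/sec"
```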
To add alerts in Grafana:
1. Go to the dashboard panel
2. Click the "Alert" tab
3. Configure a threshold
4. Set a notification channel

---

## 📈 Performance Tips

### 1. Limit the Time Range
- Use smaller time ranges for faster queries
- Default: 1 hour (a good balance)

### 2. Use Filters
- Always filter by namespace/pod when possible
- Reduces the amount of data scanned

### 3. Dashboard Refresh
- Default: 10 seconds
- Increase it if you experience lag

### 4. Log Volume
- Monitor the "Top 10" panels
- Consider a log retention policy if volume is high

---

## 🔗 Useful Links

- **Loki API**: http://loki.loki.svc.cluster.local:3100
- **Loki Ready**: http://loki.loki.svc.cluster.local:3100/ready
- **Loki Metrics**: http://loki.loki.svc.cluster.local:3100/metrics
- **LogQL Docs**: https://grafana.com/docs/loki/latest/logql/

---

## 📋 Quick Reference

### LogQL Operators
- `|=` - Contains (exact)
- `!=` - Does not contain
- `|~` - Regex match
- `!~` - Regex not match
- `| json` - Parse JSON
- `| logfmt` - Parse logfmt
- `| line_format` - Format output

### Rate Functions
- `rate()` - Per-second rate
- `count_over_time()` - Total count
- `bytes_over_time()` - Total bytes
- `bytes_rate()` - Bytes per second

### Aggregations
- `sum by (label)` - Sum grouped by label
- `count by (label)` - Count grouped by label
- `avg by (label)` - Average
- `max by (label)` - Maximum
- `topk(n, query)` - Top N results
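Putting a few of these together, a combined query (a sketch that assumes your application logs are JSON with a `level` field):

```logql
# The 5 namespaces producing the most error-level JSON log lines, as a per-second rate
topk(5, sum by (namespace) (
  rate({namespace=~".+"} | json | level="error" [5m])
))
```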
---

**The dashboard is ready! It will appear in Grafana after ArgoCD syncs (~2-3 minutes).** 🎉

## Next Steps

1. ✅ Wait for ArgoCD sync
2. ✅ Open Grafana
3. ✅ Find the "Loki Logs Dashboard"
4. ✅ Start exploring your logs!