# 📊 Grafana + Loki Integration
## ✅ What Was Configured
### 1. Loki Data Source
**File**: `apps/grafana/loki-datasource.yaml`
Automatically adds Loki as a data source in Grafana:
- **URL**: http://loki.loki.svc.cluster.local:3100
- **Type**: Loki
- **Access**: Proxy (internal cluster access)
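For reference, Grafana's provisioning sidecar typically picks up data sources from a labeled ConfigMap. A minimal sketch of what such a manifest can look like (the label key and exact values depend on your Grafana Helm chart; the actual file in this repo may differ):
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-datasource
  namespace: monitoring
  labels:
    grafana_datasource: "1"   # label the datasource sidecar watches (chart-dependent)
data:
  loki-datasource.yaml: |
    apiVersion: 1
    datasources:
      - name: Loki
        type: loki
        access: proxy
        url: http://loki.loki.svc.cluster.local:3100
        isDefault: false
```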
### 2. Loki Logs Dashboard
**File**: `apps/grafana/loki-dashboard.yaml`
Comprehensive dashboard with 7 panels:
#### 📈 Panel 1: Log Rate by Namespace
- Real-time log ingestion rate
- Grouped by namespace
- Shows logs/second
#### 🔥 Panel 2: Error Rate by Namespace
- Errors, exceptions, and fatal messages
- Per namespace breakdown
- 5-minute rate
#### ⚠️ Panel 3: Total Errors (Last Hour)
- Gauge showing total error count
- Color-coded thresholds:
  - Green: < 10 errors
  - Yellow: 10-50 errors
  - Red: > 50 errors
#### 🔍 Panel 4: Log Browser
- Interactive log viewer
- Filterable by:
  - Namespace (dropdown)
  - Pod (dropdown)
  - Search text (free text)
- Live tail capability
#### 📊 Panel 5: Top 10 Namespaces by Log Volume
- Pie chart showing which namespaces generate most logs
- Based on last hour
#### 🎯 Panel 6: Top 10 Pods by Log Volume
- Pie chart of chattiest pods
- Filtered by selected namespace
#### 🚨 Panel 7: Errors & Warnings
- All errors/warnings across cluster
- Full log details
- Sortable and searchable
---
## 🚀 How to Access
### 1. Wait for ArgoCD Sync
ArgoCD will automatically apply the changes (~2-3 minutes).
### 2. Access Grafana
```bash
# Get Grafana URL
kubectl get ingress -n monitoring
# Or port-forward
kubectl port-forward -n monitoring svc/k8s-monitoring-grafana 3000:80
```
### 3. Find the Dashboard
In Grafana:
1. Click **Dashboards** (left menu)
2. Search for **"Loki Logs Dashboard"**
3. Or navigate to: **Dashboards → Browse → loki-logs**
---
## 🔍 How to Use the Dashboard
### Filters at the Top
**Namespace Filter:**
- Select one or multiple namespaces
- Default: All namespaces
**Pod Filter:**
- Dynamically updates based on selected namespace(s)
- Default: All pods
**Search Box:**
- Free-text search across all logs
- Examples:
  - `error` - find errors
  - `timeout` - find timeouts
  - `sync` - find ArgoCD syncs
### Example Workflows
#### 1. Debug Application Errors
```
1. Select namespace: "default"
2. Select pod: "myapp-xyz"
3. Search: "error"
4. Look at "Log Browser" panel
```
#### 2. Monitor ArgoCD
```
1. Select namespace: "argocd"
2. Search: "sync"
3. Check "Error Rate" and "Log Browser"
```
#### 3. Find Noisy Pods
```
1. Look at "Top 10 Pods by Log Volume"
2. Click on highest pod
3. Use "Log Browser" to see what it's logging
```
#### 4. Cluster-Wide Error Monitoring
```
1. Set namespace to "All"
2. Check "Total Errors" gauge
3. Look at "Errors & Warnings" panel at bottom
```
---
## 📝 LogQL Query Examples
The dashboard uses LogQL (Loki Query Language). Here are some queries you can use:
### Basic Queries
```logql
# All logs from a namespace
{namespace="loki"}
# Logs from specific pod
{pod="loki-0"}
# Multiple namespaces
{namespace=~"loki|argocd|grafana"}
```
### Filtering
```logql
# Contains "error"
{namespace="default"} |= "error"
# Regex match
{namespace="argocd"} |~ "sync|deploy"
# NOT containing
{namespace="loki"} != "debug"
# Case insensitive
{namespace="default"} |~ "(?i)error"
```
### Metrics Queries
```logql
# Log rate per namespace
sum by (namespace) (rate({namespace=~".+"}[1m]))
# Error count
sum(count_over_time({namespace=~".+"} |~ "(?i)error"[5m]))
# Top namespaces
topk(10, sum by (namespace) (count_over_time({namespace=~".+"}[1h])))
```
### JSON Parsing
```logql
# If logs are JSON
{namespace="app"} | json | level="error"
# Extract fields
{namespace="app"} | json | line_format "{{.message}}"
```
---
## 🎨 Dashboard Customization
### Add New Panel
1. Click **"Add panel"** (top right)
2. Select **"Loki"** as data source
3. Write your LogQL query
4. Choose visualization type
5. Save
### Useful Panel Types
- **Time series**: For rates and counts over time
- **Logs**: For viewing actual log lines
- **Stat**: For single values (like total errors)
- **Gauge**: For thresholds (like error counts)
- **Table**: For structured data
- **Pie chart**: For distribution
---
## 🔧 Troubleshooting
### Dashboard Not Appearing
```bash
# Check ConfigMap created
kubectl get configmap -n monitoring loki-logs-dashboard
# Check Grafana pod logs
kubectl logs -n monitoring -l app.kubernetes.io/name=grafana
# Restart Grafana
kubectl rollout restart deployment/k8s-monitoring-grafana -n monitoring
```
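The dashboard is delivered as a ConfigMap that the Grafana dashboard sidecar imports; if the sidecar label is missing, the dashboard never appears. A minimal sketch of the expected shape (the label key is chart-dependent and the JSON here is only a placeholder, not the real seven-panel definition):
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-logs-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"   # label the dashboard sidecar watches (chart-dependent)
data:
  loki-logs.json: |
    {
      "title": "Loki Logs Dashboard",
      "panels": []
    }
```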
### Data Source Not Working
```bash
# Test Loki from Grafana pod
kubectl exec -n monitoring -it deployment/k8s-monitoring-grafana -- \
  curl http://loki.loki.svc.cluster.local:3100/ready
# Should return: ready
```
### No Logs Showing
```bash
# Check Promtail is running
kubectl get pods -n loki -l app.kubernetes.io/name=promtail
# Check Promtail logs
kubectl logs -n loki -l app.kubernetes.io/name=promtail --tail=50
# Test query directly
kubectl exec -n monitoring -it deployment/k8s-monitoring-grafana -- \
curl "http://loki.loki.svc.cluster.local:3100/loki/api/v1/labels"
```
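If Promtail is running but nothing reaches Loki, the usual culprit is the client URL. Assuming Promtail was installed with the grafana/promtail Helm chart, the values should point at Loki's push endpoint, roughly like this (a sketch; the actual values in this repo may differ):
```yaml
# grafana/promtail Helm values (sketch - actual values may differ)
config:
  clients:
    - url: http://loki.loki.svc.cluster.local:3100/loki/api/v1/push
```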
---
## 📊 What Logs Are Collected
Promtail collects logs from:
### 1. All Pod Logs
```
/var/log/pods/<namespace>_<pod>_<uid>/<container>/*.log
```
### 2. Labels Added Automatically
Every log line gets these labels:
- `namespace` - Kubernetes namespace
- `pod` - Pod name
- `container` - Container name
- `node` - Node where pod runs
- `job` - Always "kubernetes-pods"
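These labels come from Promtail's Kubernetes service discovery and relabeling rules. A simplified sketch of the relevant scrape config (not the exact config shipped with the chart):
```yaml
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Map Kubernetes discovery metadata onto the Loki labels listed above
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_container_name]
        target_label: container
      - source_labels: [__meta_kubernetes_pod_node_name]
        target_label: node
```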
### 3. Example Log Entry
```json
{
"namespace": "loki",
"pod": "loki-0",
"container": "loki",
"node": "master1",
"timestamp": "2026-01-05T13:30:00Z",
"line": "level=info msg=\"flushing stream\""
}
```
---
## 🎯 Advanced Features
### Live Tail
Click **"Live"** button in Log Browser panel to stream logs in real-time.
### Context
Click on any log line → "Show context" to see surrounding logs.
### Log Details
Click on any log line to see:
- All labels
- Parsed fields (if JSON)
- Timestamp
- Full message
### Sharing
Click **"Share"** (top right) to:
- Copy link
- Create snapshot
- Export as JSON
---
## 🚨 Alerting (Optional)
You can create alerts based on log patterns:
### Example: Alert on High Error Rate
```yaml
alert: HighErrorRate
expr: sum(rate({namespace=~".+"} |~ "(?i)error"[5m])) > 10
for: 5m
annotations:
  summary: "High error rate detected"
  description: "{{ $value }} errors/sec"
```
To add alerts:
1. Go to dashboard panel
2. Click "Alert" tab
3. Configure threshold
4. Set notification channel
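Alternatively, if alerts are managed by the Loki ruler rather than the Grafana UI, the rule above would live in a Prometheus-style rule group, roughly like this (a sketch; where the file goes depends on your ruler configuration):
```yaml
groups:
  - name: loki-log-alerts
    rules:
      - alert: HighErrorRate
        expr: sum(rate({namespace=~".+"} |~ "(?i)error" [5m])) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "{{ $value }} errors/sec across the cluster"
```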
---
## 📈 Performance Tips
### 1. Limit Time Range
- Use smaller time ranges for faster queries
- Default: 1 hour (good balance)
### 2. Use Filters
- Always filter by namespace/pod when possible
- Reduces data scanned
### 3. Dashboard Refresh
- Default: 10 seconds
- Increase if experiencing lag
### 4. Log Volume
- Monitor "Top 10" panels
- Consider a log retention policy if volume is high (see the sketch below)
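Retention is configured on the Loki side. A hedged sketch of the relevant config keys (exact names and defaults vary by Loki version and Helm chart, so check the docs for your version):
```yaml
# Loki config snippet (sketch - exact keys vary by Loki version and chart)
compactor:
  retention_enabled: true
limits_config:
  retention_period: 168h   # keep logs for roughly 7 days
```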
---
## 🔗 Useful Links
- **Loki API**: http://loki.loki.svc.cluster.local:3100
- **Loki Ready**: http://loki.loki.svc.cluster.local:3100/ready
- **Loki Metrics**: http://loki.loki.svc.cluster.local:3100/metrics
- **LogQL Docs**: https://grafana.com/docs/loki/latest/logql/
---
## 📋 Quick Reference
### LogQL Operators
- `|=` - Line contains string
- `!=` - Line does not contain string
- `|~` - Line matches regex
- `!~` - Line does not match regex
- `| json` - Parse JSON
- `| logfmt` - Parse logfmt
- `| line_format` - Format output
### Rate Functions
- `rate()` - Per-second rate
- `count_over_time()` - Total count
- `bytes_over_time()` - Total bytes
- `bytes_rate()` - Bytes per second
### Aggregations
- `sum by (label)` - Sum grouped by label
- `count by (label)` - Count grouped
- `avg by (label)` - Average
- `max by (label)` - Maximum
- `topk(n, query)` - Top N results
---
**Dashboard is ready! It will appear in Grafana after ArgoCD syncs (~2-3 minutes).** 🎉
## Next Steps
1. ✅ Wait for ArgoCD sync
2. ✅ Open Grafana
3. ✅ Find "Loki Logs Dashboard"
4. ✅ Start exploring your logs!