# 📊 Grafana + Loki Integration
## ✅ What Was Configured
### 1. Loki Data Source
**File**: `apps/grafana/loki-datasource.yaml`
Automatically adds Loki as a data source in Grafana:
- **URL**: http://loki.loki.svc.cluster.local:3100
- **Type**: Loki
- **Access**: Proxy (internal cluster access)
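For reference, Grafana's provisioning sidecar typically picks up data sources from a labeled ConfigMap. A minimal sketch of what such a manifest can look like (the label key and exact values depend on your Grafana Helm chart; the actual file in this repo may differ):
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-datasource
  namespace: monitoring
  labels:
    grafana_datasource: "1"   # label the datasource sidecar watches (chart-dependent)
data:
  loki-datasource.yaml: |
    apiVersion: 1
    datasources:
      - name: Loki
        type: loki
        access: proxy
        url: http://loki.loki.svc.cluster.local:3100
        isDefault: false
```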
### 2. Loki Logs Dashboard
**File**: `apps/grafana/loki-dashboard.yaml`
Comprehensive dashboard with 7 panels:
#### 📈 Panel 1: Log Rate by Namespace
- Real-time log ingestion rate
- Grouped by namespace
- Shows logs/second
#### 🔥 Panel 2: Error Rate by Namespace
- Errors, exceptions, and fatal messages
- Per namespace breakdown
- 5-minute rate
#### ⚠️ Panel 3: Total Errors (Last Hour)
- Gauge showing total error count
- Color-coded thresholds:
  - Green: < 10 errors
  - Yellow: 10-50 errors
  - Red: > 50 errors
#### 🔍 Panel 4: Log Browser
- Interactive log viewer
- Filterable by:
  - Namespace (dropdown)
  - Pod (dropdown)
  - Search text (free text)
- Live tail capability
#### 📊 Panel 5: Top 10 Namespaces by Log Volume
- Pie chart showing which namespaces generate most logs
- Based on last hour
#### 🎯 Panel 6: Top 10 Pods by Log Volume
- Pie chart of chattiest pods
- Filtered by selected namespace
#### 🚨 Panel 7: Errors & Warnings
- All errors/warnings across cluster
- Full log details
- Sortable and searchable
---
## 🚀 How to Access
### 1. Wait for ArgoCD Sync
ArgoCD will automatically apply the changes (~2-3 minutes).
### 2. Access Grafana
```bash
# Get Grafana URL
kubectl get ingress -n monitoring
# Or port-forward
kubectl port-forward -n monitoring svc/k8s-monitoring-grafana 3000:80
```
### 3. Find the Dashboard
In Grafana:
1. Click **Dashboards** (left menu)
2. Search for **"Loki Logs Dashboard"**
3. Or navigate to: **Dashboards → Browse → loki-logs**
---
## 🔍 How to Use the Dashboard
### Filters at the Top
**Namespace Filter:**
- Select one or multiple namespaces
- Default: All namespaces
**Pod Filter:**
- Dynamically updates based on selected namespace(s)
- Default: All pods
**Search Box:**
- Free-text search across all logs
- Examples:
  - `error` - find errors
  - `timeout` - find timeouts
  - `sync` - find ArgoCD syncs
### Example Workflows
#### 1. Debug Application Errors
```
1. Select namespace: "default"
2. Select pod: "myapp-xyz"
3. Search: "error"
4. Look at "Log Browser" panel
```
#### 2. Monitor ArgoCD
```
1. Select namespace: "argocd"
2. Search: "sync"
3. Check "Error Rate" and "Log Browser"
```
#### 3. Find Noisy Pods
```
1. Look at "Top 10 Pods by Log Volume"
2. Click on highest pod
3. Use "Log Browser" to see what it's logging
```
#### 4. Cluster-Wide Error Monitoring
```
1. Set namespace to "All"
2. Check "Total Errors" gauge
3. Look at "Errors & Warnings" panel at bottom
```
---
## 📝 LogQL Query Examples
The dashboard uses LogQL (Loki Query Language). Here are some queries you can use:
### Basic Queries
```logql
# All logs from a namespace
{namespace="loki"}
# Logs from specific pod
{pod="loki-0"}
# Multiple namespaces
{namespace=~"loki|argocd|grafana"}
```
### Filtering
```logql
# Contains "error"
{namespace="default"} |= "error"
# Regex match
{namespace="argocd"} |~ "sync|deploy"
# NOT containing
{namespace="loki"} != "debug"
# Case insensitive
{namespace="default"} |~ "(?i)error"
```
### Metrics Queries
```logql
# Log rate per namespace
sum by (namespace) (rate({namespace=~".+"}[1m]))
# Error count
sum(count_over_time({namespace=~".+"} |~ "(?i)error"[5m]))
# Top namespaces
topk(10, sum by (namespace) (count_over_time({namespace=~".+"}[1h])))
```
### JSON Parsing
```logql
# If logs are JSON
{namespace="app"} | json | level="error"
# Extract fields
{namespace="app"} | json | line_format "{{.message}}"
```
---
## 🎨 Dashboard Customization
### Add New Panel
1. Click **"Add panel"** (top right)
2. Select **"Loki"** as data source
3. Write your LogQL query
4. Choose visualization type
5. Save
### Useful Panel Types
- **Time series**: For rates and counts over time
- **Logs**: For viewing actual log lines
- **Stat**: For single values (like total errors)
- **Gauge**: For thresholds (like error counts)
- **Table**: For structured data
- **Pie chart**: For distribution
---
## 🔧 Troubleshooting
### Dashboard Not Appearing
```bash
# Check ConfigMap created
kubectl get configmap -n monitoring loki-logs-dashboard
# Check Grafana pod logs
kubectl logs -n monitoring -l app.kubernetes.io/name=grafana
# Restart Grafana
kubectl rollout restart deployment/k8s-monitoring-grafana -n monitoring
```
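The dashboard is delivered as a ConfigMap that the Grafana dashboard sidecar imports; if the sidecar label is missing, the dashboard never appears. A minimal sketch of the expected shape (the label key is chart-dependent and the JSON here is only a placeholder, not the real seven-panel definition):
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-logs-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"   # label the dashboard sidecar watches (chart-dependent)
data:
  loki-logs.json: |
    {
      "title": "Loki Logs Dashboard",
      "panels": []
    }
```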
### Data Source Not Working
```bash
# Test Loki from Grafana pod
kubectl exec -n monitoring -it deployment/k8s-monitoring-grafana -- \
  curl http://loki.loki.svc.cluster.local:3100/ready
# Should return: ready
```
### No Logs Showing
```bash
# Check Promtail is running
kubectl get pods -n loki -l app.kubernetes.io/name=promtail
# Check Promtail logs
kubectl logs -n loki -l app.kubernetes.io/name=promtail --tail=50
# Test query directly
kubectl exec -n monitoring -it deployment/k8s-monitoring-grafana -- \
curl "http://loki.loki.svc.cluster.local:3100/loki/api/v1/labels"
```
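If Promtail is running but nothing reaches Loki, the usual culprit is the client URL. Assuming Promtail was installed with the grafana/promtail Helm chart, the values should point at Loki's push endpoint, roughly like this (a sketch; the actual values in this repo may differ):
```yaml
# grafana/promtail Helm values (sketch - actual values may differ)
config:
  clients:
    - url: http://loki.loki.svc.cluster.local:3100/loki/api/v1/push
```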
---
## 📊 What Logs Are Collected
Promtail collects logs from:
### 1. All Pod Logs
```
/var/log/pods/<namespace>_<pod>_<uid>/<container>/*.log
```
### 2. Labels Added Automatically
Every log line gets these labels:
- `namespace` - Kubernetes namespace
- `pod` - Pod name
- `container` - Container name
- `node` - Node where pod runs
- `job` - Always "kubernetes-pods"
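These labels come from Promtail's Kubernetes service discovery and relabeling rules. A simplified sketch of the relevant scrape config (not the exact config shipped with the chart):
```yaml
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Map Kubernetes discovery metadata onto the Loki labels listed above
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_container_name]
        target_label: container
      - source_labels: [__meta_kubernetes_pod_node_name]
        target_label: node
```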
### 3. Example Log Entry
```json
{
"namespace": "loki",
"pod": "loki-0",
"container": "loki",
"node": "master1",
"timestamp": "2026-01-05T13:30:00Z",
"line": "level=info msg=\"flushing stream\""
}
```
---
## 🎯 Advanced Features
### Live Tail
Click **"Live"** button in Log Browser panel to stream logs in real-time.
### Context
Click on any log line → "Show context" to see surrounding logs.
### Log Details
Click on any log line to see:
- All labels
- Parsed fields (if JSON)
- Timestamp
- Full message
### Sharing
Click **"Share"** (top right) to:
- Copy link
- Create snapshot
- Export as JSON
---
## 🚨 Alerting (Optional)
You can create alerts based on log patterns:
### Example: Alert on High Error Rate
```yaml
alert: HighErrorRate
expr: sum(rate({namespace=~".+"} |~ "(?i)error"[5m])) > 10
for: 5m
annotations:
  summary: "High error rate detected"
  description: "{{ $value }} errors/sec"
```
To add alerts:
1. Go to dashboard panel
2. Click "Alert" tab
3. Configure threshold
4. Set notification channel
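Alternatively, if alerts are managed by the Loki ruler rather than the Grafana UI, the rule above would live in a Prometheus-style rule group, roughly like this (a sketch; where the file goes depends on your ruler configuration):
```yaml
groups:
  - name: loki-log-alerts
    rules:
      - alert: HighErrorRate
        expr: sum(rate({namespace=~".+"} |~ "(?i)error" [5m])) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "{{ $value }} errors/sec across the cluster"
```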
---
## 📈 Performance Tips
### 1. Limit Time Range
- Use smaller time ranges for faster queries
- Default: 1 hour (good balance)
### 2. Use Filters
- Always filter by namespace/pod when possible
- Reduces data scanned
### 3. Dashboard Refresh
- Default: 10 seconds
- Increase if experiencing lag
### 4. Log Volume
- Monitor "Top 10" panels
- Consider a log retention policy if volume is high (see the sketch below)
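Retention is configured on the Loki side. A hedged sketch of the relevant config keys (exact names and defaults vary by Loki version and Helm chart, so check the docs for your version):
```yaml
# Loki config snippet (sketch - exact keys vary by Loki version and chart)
compactor:
  retention_enabled: true
limits_config:
  retention_period: 168h   # keep logs for roughly 7 days
```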
---
## 🔗 Useful Links
- **Loki API**: http://loki.loki.svc.cluster.local:3100
- **Loki Ready**: http://loki.loki.svc.cluster.local:3100/ready
- **Loki Metrics**: http://loki.loki.svc.cluster.local:3100/metrics
- **LogQL Docs**: https://grafana.com/docs/loki/latest/logql/
---
## 📋 Quick Reference
### LogQL Operators
- `|=` - Line contains string
- `!=` - Line does not contain string
- `|~` - Line matches regex
- `!~` - Line does not match regex
- `| json` - Parse JSON
- `| logfmt` - Parse logfmt
- `| line_format` - Format output
### Rate Functions
- `rate()` - Per-second rate
- `count_over_time()` - Total count
- `bytes_over_time()` - Total bytes
- `bytes_rate()` - Bytes per second
### Aggregations
- `sum by (label)` - Sum grouped by label
- `count by (label)` - Count grouped
- `avg by (label)` - Average
- `max by (label)` - Maximum
- `topk(n, query)` - Top N results
---
**Dashboard is ready! It will appear in Grafana after ArgoCD syncs (~2-3 minutes).** 🎉
## Next Steps
1. ✅ Wait for ArgoCD sync
2. ✅ Open Grafana
3. ✅ Find "Loki Logs Dashboard"
4. ✅ Start exploring your logs!