# 🔧 ArgoCD Sync vs Actual Deployment Issue
## 🐛 The Problem
**Symptom:**
- ArgoCD shows `Synced`
- Deployment manifest in Kubernetes is updated ✅
- **BUT** pods are still running the old image ❌

**Why This Happens:**
```
┌─────────────────────────────────────────────────────┐
│ Git Repository                                      │
│ deployment.yaml: image: app:v2 ✅                   │
└──────────────────┬──────────────────────────────────┘
                   │ ArgoCD syncs
                   ▼
┌─────────────────────────────────────────────────────┐
│ Kubernetes API (Deployment object)                  │
│ spec.template.spec.containers[0].image: app:v2 ✅   │
└──────────────────┬──────────────────────────────────┘
                   │ Kubernetes controller should trigger rollout
                   ▼
┌─────────────────────────────────────────────────────┐
│ Running Pods                                        │
│ Pod-1: image: app:v1 ❌ (OLD!)                      │
│ Pod-2: image: app:v1 ❌ (OLD!)                      │
└─────────────────────────────────────────────────────┘
ArgoCD says "Synced" because Git == Kubernetes manifest ✅
But pods haven't rolled out yet! ❌
```
---
## 🔍 Why ArgoCD Says "Synced"
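ArgoCD reports two separate statuses:
1. ✅ **Sync status:** does the Git manifest match the Kubernetes Deployment object?
2. ✅ **Health status:** derived from the object's status fields (for a Deployment: `Progressing` while the rollout is running, `Healthy` once the updated replicas are available)

A pipeline that only waits for the **sync status** `Synced` does NOT learn:
- ❌ Are the new pods actually running?
- ❌ Which image are the pods using?
- ❌ Did the rollout complete?

**ArgoCD's job:** Keep Kubernetes resources in sync with Git

**NOT what "Synced" tells you:** Whether pods have finished rolling out
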
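
To see both statuses side by side, you can read them from the ArgoCD `Application` resource. This is a hedged sketch, not the pipeline's actual code: it assumes the Application is named `demo-nginx` in the `argocd` namespace, and it keeps the decision logic in a pure function so it can be tried without a cluster.

```bash
# Decide whether an ArgoCD app is really finished, from its sync + health status.
# Prints one of: syncing, done, degraded, rolling-out
app_state() {
    local sync_status=$1 health_status=$2
    if [ "$sync_status" != "Synced" ]; then
        echo "syncing"        # live objects still differ from Git
    elif [ "$health_status" = "Healthy" ]; then
        echo "done"           # manifest applied AND health checks pass
    elif [ "$health_status" = "Degraded" ]; then
        echo "degraded"       # e.g. a Deployment that exceeded its progress deadline
    else
        echo "rolling-out"    # Synced, but health is Progressing (or another value)
    fi
}

# Read "<sync> <health>" from the Application custom resource.
# App name and namespace (default "argocd") are assumptions; pass your own.
app_statuses() {
    kubectl get application "$1" -n "${2:-argocd}" \
        -o jsonpath='{.status.sync.status} {.status.health.status}'
}

# Usage (needs cluster access):
#   set -- $(app_statuses demo-nginx)
#   app_state "$1" "$2"
```
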
---
## ⚠️ When This Happens
### Scenario 1: Slow Rollout
```
14:00:00 - ArgoCD syncs deployment (v1 → v2)
14:00:05 - ArgoCD: "Synced!" ✅
14:00:10 - Kubernetes starts rollout
14:00:30 - Pod-1 terminates (v1)
14:00:35 - Pod-3 starts (v2)
14:00:50 - Pod-2 terminates (v1)
14:00:55 - Pod-4 starts (v2)
14:01:00 - Rollout complete! ✅
Jenkins checks at 14:00:05: ArgoCD says "Synced"
But pods are still v1! ❌
```
### Scenario 2: Image Pull Delay
```
14:00:00 - ArgoCD syncs
14:00:05 - ArgoCD: "Synced!" ✅
14:00:10 - Kubernetes tries to start new pod
14:00:15 - Pulling image... (slow network)
14:00:45 - Image pulled
14:00:50 - Pod starts
14:01:00 - Pod ready
Jenkins checks at 14:00:05: "Synced" but no new pods yet!
```
### Scenario 3: Resource Constraints
```
14:00:00 - ArgoCD syncs
14:00:05 - ArgoCD: "Synced!" ✅
14:00:10 - Kubernetes: "No resources available"
14:00:20 - Kubernetes: "Waiting for node capacity..."
14:01:00 - Old pod terminates, resources freed
14:01:10 - New pod starts
Jenkins checks at 14:00:05: "Synced" but can't schedule pods!
```
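
The three scenarios leave different traces on the new pods. As a hedged sketch, the function below maps a pod's phase, its `PodScheduled` condition reason, and its first container's waiting reason to the scenarios above. The reason strings (`Unschedulable`, `ContainerCreating`, `ErrImagePull`, `ImagePullBackOff`, `CrashLoopBackOff`) are standard Kubernetes values; the mapping itself, the pipe-delimited input, and the `app=app` label selector are assumptions for illustration.

```bash
# Map a new pod's state to the scenarios above.
#   $1 = pod phase                 (e.g. Pending, Running)
#   $2 = PodScheduled reason       (e.g. Unschedulable, or empty)
#   $3 = container waiting reason  (e.g. ContainerCreating, ErrImagePull, or empty)
classify_pod() {
    local phase=$1 sched_reason=$2 waiting_reason=$3
    if [ "$sched_reason" = "Unschedulable" ]; then
        echo "resource-constraints"        # Scenario 3: scheduler cannot place the pod
        return
    fi
    case "$waiting_reason" in
        ErrImagePull|ImagePullBackOff)
            echo "image-pull-failing" ;;    # tag missing or registry unreachable
        ContainerCreating)
            echo "image-pull-or-startup" ;; # Scenario 2: often still pulling the image
        CrashLoopBackOff)
            echo "crash-loop" ;;
        *)
            if [ "$phase" = "Running" ]; then
                echo "running"              # Scenario 1: pod is fine, rollout still in progress
            else
                echo "unknown"
            fi ;;
    esac
}

# Usage (label selector app=app is an assumption):
#   kubectl get pods -l app=app -o jsonpath='{range .items[*]}{.status.phase}{"|"}{.status.conditions[?(@.type=="PodScheduled")].reason}{"|"}{.status.containerStatuses[0].state.waiting.reason}{"\n"}{end}' |
#   while IFS='|' read -r phase sched waiting; do classify_pod "$phase" "$sched" "$waiting"; done
```

A non-whitespace delimiter (`|`) is used so that empty fields survive `read`; with spaces or tabs, consecutive empty fields would collapse.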
---
## ✅ The Solution
### What Jenkins Must Check:
```groovy
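// ❌ BAD - Only checks ArgoCD
if (argocdStatus == 'Synced') {
    echo 'Done!'
}
// ✅ GOOD - Checks ArgoCD + Kubernetes
if (argocdStatus == 'Synced') {
    // 1. Wait for rollout (a timeout makes sh exit non-zero and fails the build)
    sh 'kubectl rollout status deployment/app --timeout=5m'
    // 2. Verify actual pod images: one entry per container across ALL pods
    //    (label selector app=app is an assumption; use your Deployment's selector)
    def podImages = sh(
        script: "kubectl get pods -l app=app -o jsonpath='{.items[*].status.containerStatuses[*].image}'",
        returnStdout: true
    ).trim().tokenize()
    // Match the ":<tag>" suffix; a plain contains('v2') would also accept 'app:v20'
    if (podImages && podImages.every { it.endsWith(":${newVersion}") }) {
        echo 'Verified!'
    } else {
        error "Pods not running ${newVersion}: ${podImages}"
    }
}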
```
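
For a manual check outside Jenkins, the same image check works as a shell function. This is a sketch, not part of the original Jenkinsfile: it matches the `:<tag>` suffix instead of a substring, so `v2` does not accidentally accept `app:v20`, and it reads one image per line on stdin. The label selector in the usage comment is an assumption.

```bash
# Succeeds only if every image read from stdin ends with ":<tag>".
# Prints each mismatching image to stderr. Empty input also fails,
# since "no pods" is not a successful rollout.
all_images_on_tag() {
    local tag=$1 image count=0 bad=0
    while IFS= read -r image; do
        [ -z "$image" ] && continue
        count=$((count + 1))
        case "$image" in
            *":$tag") ;;                                         # correct tag
            *) echo "wrong image: $image" >&2; bad=$((bad + 1)) ;;
        esac
    done
    [ "$count" -gt 0 ] && [ "$bad" -eq 0 ]
}

# Usage (label selector app=app is an assumption):
#   kubectl get pods -l app=app \
#     -o jsonpath='{range .items[*]}{range .status.containerStatuses[*]}{.image}{"\n"}{end}{end}' \
#     | all_images_on_tag v2
```

Suffix matching also tolerates registry prefixes that the container runtime may add to the reported image name (for example `docker.io/library/app:v2`).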
---
## 🎯 New Jenkinsfile Verification
### Stage 1: ArgoCD Sync Check
```groovy
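stage('Wait for ArgoCD Sync') {
    // Checks:
    // 1. ArgoCD sync status = "Synced"
    // 2. Deployment SPEC image updated
    // Does NOT check if pods rolled out! That's the next stage.

    // Check 1, e.g. via the argocd CLI (app name and 120s timeout are assumptions)
    sh 'argocd app wait demo-nginx --sync --timeout 120'

    // Check 2 reads the pod TEMPLATE, not the running pods (newTag assumed set earlier)
    def specImage = sh(script: "kubectl get deployment/app -o jsonpath='{.spec.template.spec.containers[0].image}'",
                       returnStdout: true).trim()
    if (!specImage.endsWith(":${newTag}")) {
        error "Deployment spec image is ${specImage}, expected tag ${newTag}"
    }
}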
```
**Output:**
```
ArgoCD sync status: Synced
Deployment spec image: app:v2
✅ ArgoCD synced and deployment spec updated!
Note: Pods may still be rolling out - will verify in next stage
```
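
Stage 1 is essentially "poll until a condition holds, or give up". A hedged shell sketch of that loop follows; the condition can be any command. `spec_has_tag` is a hypothetical helper for check 2, and the 120 s timeout in the usage comment mirrors the `Timeout: 120s` in the notification example, with a 5 s interval chosen arbitrarily.

```bash
# Re-run a command every <interval> seconds until it succeeds or <timeout> elapses.
#   poll_until <timeout_s> <interval_s> <command> [args...]
# Returns 0 on success, 1 on timeout.
poll_until() {
    local timeout=$1 interval=$2
    shift 2
    local deadline=$(( $(date +%s) + timeout ))
    until "$@"; do
        if [ "$(date +%s)" -ge "$deadline" ]; then
            echo "timed out after ${timeout}s waiting for: $*" >&2
            return 1
        fi
        sleep "$interval"
    done
}

# Hypothetical condition for check 2: the pod TEMPLATE already has the new tag.
spec_has_tag() {
    case "$(kubectl get deployment/app -o jsonpath='{.spec.template.spec.containers[0].image}')" in
        *":$1") return 0 ;;
        *) return 1 ;;
    esac
}

# Usage (needs cluster access):
#   poll_until 120 5 spec_has_tag v2
```
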
### Stage 2: Wait for Rollout
```groovy
stage('Wait for Deployment') {
// Uses kubectl rollout status
// Waits for actual pod rollout to complete
sh "kubectl rollout status deployment/app --timeout=5m"
}
```
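**What `kubectl rollout status` does:**
- Watches the Deployment's rollout progress
- Waits until all replicas are updated and available and the old ReplicaSet has scaled down
- Exits 0 when the rollout is complete
- Exits non-zero if `--timeout` elapses or the Deployment exceeds its `progressDeadlineSeconds` (which fails the Jenkins `sh` step)
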
**Output:**
```
Waiting for deployment "app" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "app" rollout to finish: 1 old replicas are pending termination...
deployment "app" successfully rolled out
✅ Rollout completed successfully!
```
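
The conditions `kubectl rollout status` waits on can be approximated from the Deployment's own fields. The function below is a simplified sketch modeled on kubectl's Deployment status logic, not a copy of it: it ignores paused rollouts and the progress-deadline check, and it takes the six numbers as arguments so it can be tried without a cluster. Missing status fields (for example no `availableReplicas` yet) should be passed as `0`.

```bash
# Approximates the Deployment checks made by `kubectl rollout status`.
#   $1 metadata.generation   $2 status.observedGeneration
#   $3 spec.replicas         $4 status.updatedReplicas
#   $5 status.replicas       $6 status.availableReplicas
# Prints a progress message; returns 0 only when the rollout is complete.
rollout_done() {
    local gen=$1 observed=$2 desired=$3 updated=$4 total=$5 available=$6
    if [ "$observed" -lt "$gen" ]; then
        echo "waiting for deployment spec update to be observed"; return 1
    fi
    if [ "$updated" -lt "$desired" ]; then
        echo "$updated out of $desired new replicas have been updated"; return 1
    fi
    if [ "$total" -gt "$updated" ]; then
        echo "$((total - updated)) old replicas are pending termination"; return 1
    fi
    if [ "$available" -lt "$updated" ]; then
        echo "$available of $updated updated replicas are available"; return 1
    fi
    echo "successfully rolled out"
}
```

The order of the checks matters: a controller that has not yet observed the new spec (`observedGeneration` behind `generation`) would otherwise report stale, "complete" numbers from the previous rollout.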
### Stage 3: Verify Actual Pods
```groovy
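stage('Verify Deployment') {
    // newTag: the tag this build pushed (assumed to be set earlier in the pipeline)
    // Label selector app=app is an assumption; use your Deployment's selector
    def out = { String cmd -> sh(script: cmd, returnStdout: true).trim() }
    def pods = "kubectl get pods -l app=app -o jsonpath="

    // 1. Deployment status: ready replicas == desired replicas
    if (out("kubectl get deployment/app -o jsonpath='{.status.readyReplicas}'") !=
        out("kubectl get deployment/app -o jsonpath='{.spec.replicas}'")) {
        error 'Not all replicas are ready'
    }
    // 2. Deployment spec image ends with the new tag
    if (!out("kubectl get deployment/app -o jsonpath='{.spec.template.spec.containers[0].image}'").endsWith(":${newTag}")) {
        error 'Deployment spec image does not have the new tag'
    }
    // 3. ACTUAL POD IMAGES (most important!): every container image must end with ":<newTag>"
    def stale = out(pods + "'{.items[*].status.containerStatuses[*].image}'").tokenize()
                    .findAll { !it.endsWith(":${newTag}") }
    if (stale) {
        error "${stale.size()} container(s) running old image: ${stale.unique()}"
    }
    // 4. Pod health: all pods in Running phase
    if (out(pods + "'{.items[*].status.phase}'").tokenize().any { it != 'Running' }) {
        error 'Not all pods are Running'
    }
    // 5. Restart count: more than 3 restarts (assumed threshold) suggests a crash loop
    def restarts = out(pods + "'{.items[*].status.containerStatuses[*].restartCount}'").tokenize()*.toInteger()
    if (restarts && restarts.max() > 3) {
        error "Max restart count ${restarts.max()}: possible crash loop"
    }
}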
```
**Output:**
```
================================================
DEPLOYMENT VERIFICATION
================================================
1. Checking deployment status...
Desired replicas: 2
Updated replicas: 2
Ready replicas: 2
Available replicas: 2
✅ All pods ready
2. Checking deployment spec image...
Deployment spec image: app:v2
Expected tag: v2
✅ Deployment spec correct
3. Checking actual running pod images...
Running pod images:
- app:v2
- app:v2
✅ All pods running correct image
4. Checking pod readiness probes...
✅ All pods in Running state
5. Checking for container restarts...
Max restart count: 0
✅ Restart count acceptable
================================================
✅ ALL VERIFICATION CHECKS PASSED!
================================================
```
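
Checks 4 and 5 can be sketched the same way in shell. The function below reads one `phase|restartCount` line per pod and fails if any pod is not `Running` or the highest restart count exceeds a threshold. The threshold (3 in the usage comment), the input format, the `app=app` selector, and reading only the first container's restart count are all assumptions.

```bash
# Reads "phase|restartCount" lines on stdin, one line per pod.
# Prints the highest restart count; returns 1 if any pod is not Running
# or the highest restart count exceeds <max_restarts>.
check_pod_health() {
    local max_restarts=$1 phase restarts worst=0 failed=0
    while IFS='|' read -r phase restarts; do
        [ -z "$phase" ] && continue
        if [ "$phase" != "Running" ]; then
            echo "pod in phase $phase" >&2
            failed=1
        fi
        restarts=${restarts:-0}   # no container status yet -> treat as 0
        if [ "$restarts" -gt "$worst" ]; then
            worst=$restarts
        fi
    done
    echo "Max restart count: $worst"
    if [ "$worst" -gt "$max_restarts" ]; then
        echo "possible crash loop" >&2
        failed=1
    fi
    return "$failed"
}

# Usage (label selector app=app is an assumption):
#   kubectl get pods -l app=app \
#     -o jsonpath='{range .items[*]}{.status.phase}{"|"}{.status.containerStatuses[0].restartCount}{"\n"}{end}' \
#     | check_pod_health 3
```
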
---
## 🔥 What Happens If Check #3 Fails
```
3. Checking actual running pod images...
Running pod images:
- app:v1 ❌
- app:v1 ❌
❌ Pod running wrong image: app:v1
❌ FAILED: 2 pod(s) running old image!
This is the ArgoCD sync bug - deployment updated but pods not rolled out
```
**Jenkins will:**
1. ❌ Mark build as failed
2. 🔄 Trigger rollback (if enabled)
3. 📱 Send notification with details
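
One caveat for check #3 before rolling back: `kubectl rollout status` can already report success while old pods are still shutting down. Pods with a `deletionTimestamp` no longer count toward the Deployment's replicas, but `kubectl get pods` keeps listing them (as `Terminating`) until their grace period ends, still showing the old image. A hedged sketch that skips such pods, assuming pipe-delimited `deletionTimestamp|image` input and an `app=app` label selector:

```bash
# Reads "deletionTimestamp|image" lines on stdin, one per pod.
# Prints the image of every pod that is NOT being deleted; pods with a
# deletionTimestamp are Terminating and are skipped.
live_pod_images() {
    awk -F'|' '$1 == "" && $2 != "" { print $2 }'
}

# Usage (label selector app=app is an assumption):
#   kubectl get pods -l app=app \
#     -o jsonpath='{range .items[*]}{.metadata.deletionTimestamp}{"|"}{.status.containerStatuses[0].image}{"\n"}{end}' \
#     | live_pod_images
```
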
---
## 🧪 Testing the Fix
### Test 1: Normal Deployment
```bash
# Update the image tag in deployment.yaml, then commit and push
git commit -am "Update to v2"
git push
# Jenkins should:
# 1. Wait for ArgoCD sync ✅
# 2. Wait for rollout ✅
# 3. Verify pods have v2 ✅
# 4. Success! ✅
```
### Test 2: Slow Rollout
```bash
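# Set slow rollout: at most one extra pod at a time, and each new pod must stay
# ready for 30s before it counts as available (minReadySeconds)
# Note: if Git also sets these fields, the next ArgoCD sync overwrites this patch
kubectl patch deployment app -p '{"spec":{"minReadySeconds":30,"strategy":{"rollingUpdate":{"maxUnavailable":0,"maxSurge":1}}}}'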
# Update image
git push
# Jenkins should:
# 1. ArgoCD syncs quickly ✅
# 2. Wait for slow rollout (may take 2-3 minutes) ⏳
# 3. Verify when complete ✅
```
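
With the default `RollingUpdate` settings (25% / 25%) and 2 replicas, Kubernetes already resolves to maxSurge=1 and maxUnavailable=0, the same `rollingUpdate` values the patch above sets, so those values by themselves do not change the pace of the rollout. Percentages for `maxSurge` round up and for `maxUnavailable` round down. A small sketch to compute the effective values:

```bash
# Effective maxSurge / maxUnavailable for percentage settings.
#   rolling_params <replicas> <maxSurge %> <maxUnavailable %>
rolling_params() {
    local replicas=$1 surge_pct=$2 unavail_pct=$3
    # maxSurge percentages round UP, maxUnavailable percentages round DOWN
    local surge=$(( (replicas * surge_pct + 99) / 100 ))
    local unavail=$(( replicas * unavail_pct / 100 ))
    # If both resolve to 0, Kubernetes uses maxUnavailable=1 so the rollout can proceed
    if [ "$surge" -eq 0 ] && [ "$unavail" -eq 0 ]; then
        unavail=1
    fi
    echo "maxSurge=$surge maxUnavailable=$unavail"
}
```

For example, `rolling_params 2 25 25` prints `maxSurge=1 maxUnavailable=0`.
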
### Test 3: Rollout Stuck
```bash
# Point the image at a tag that doesn't exist:
# edit deployment.yaml to image: app:nonexistent, then commit and push
git commit -am "Test: broken image tag"
git push
# Jenkins should:
# 1. ArgoCD syncs ✅
# 2. kubectl rollout status times out ❌
# 3. Rollback triggered ✅
```
---
## 📊 Comparison: Old vs New
### Old Pipeline (Unreliable)
```
1. ArgoCD sync check
├─ Checks: ArgoCD status
├─ Checks: Deployment spec image
└─ Duration: ~30 seconds
⚠️ PROBLEM: Pods might not have rolled out!
2. Success! ✅ (but pods are still old!)
```
### New Pipeline (Reliable)
```
1. ArgoCD sync check
├─ Checks: ArgoCD status
├─ Checks: Deployment spec image
└─ Duration: ~30 seconds
2. Rollout status check
├─ Checks: kubectl rollout status
├─ Waits: For actual pod rollout
└─ Duration: ~1-2 minutes
3. Verification
├─ Checks: Deployment status
├─ Checks: ACTUAL pod images ← KEY!
├─ Checks: Pod health
├─ Checks: Restart count
└─ Duration: ~10 seconds
4. Success! ✅ (pods verified running new version)
```
---
## 🎯 Key Takeaways
### ❌ Don't Trust:
- ArgoCD "Synced" status alone
- Deployment spec image alone
- Health status alone
### ✅ Always Verify:
1. **ArgoCD synced** (manifest applied)
2. **Rollout completed** (`kubectl rollout status`)
3. **Actual pod images** (what's really running)
4. **Pod health** (ready and not crashing)
### 💡 Remember:
```
ArgoCD "Synced" = Git matches Kubernetes manifest ✅
BUT
Kubernetes manifest != Running pods ⚠️
You MUST check actual pods!
```
---
## 🔗 Related Issues
- [ArgoCD #2723](https://github.com/argoproj/argo-cd/issues/2723) - "Synced but pods not updated"
- [Kubernetes #93033](https://github.com/kubernetes/kubernetes/issues/93033) - "Deployment rollout delays"
---
## 🚀 Using the New Jenkinsfile
```bash
# 1. Update Jenkinsfile in your repo
cp Jenkinsfile.telegram.en apps/demo-nginx/Jenkinsfile
# 2. Commit and push
git add apps/demo-nginx/Jenkinsfile
git commit -m "fix: add proper deployment verification"
git push
# 3. Run build
# Jenkins will now properly verify deployments!
```
---
## 📱 Notifications
With the new verification, you'll see:

**During deployment:**
```
⏳ ArgoCD Syncing
Application: demo-nginx
Timeout: 120s
🚀 Deploying to Kubernetes
Deployment: demo-nginx
Image: main-42
Rolling out new pods...
```
**On success:**
```
✅ Deployment Successful!
Verified:
- ArgoCD synced ✅
- Rollout completed ✅
- Pods running main-42 ✅
- All pods healthy ✅
```
**On failure:**
```
❌ Deployment Failed
Error: 2 pods running old image!
Rollback initiated...
```
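
The file name `Jenkinsfile.telegram.en` suggests these messages go to Telegram. A minimal sketch, not necessarily how the Jenkinsfile does it, using the Telegram Bot API `sendMessage` method via `curl`; the environment variable names `TELEGRAM_BOT_TOKEN` and `TELEGRAM_CHAT_ID` are hypothetical. The message text is built by a separate function, parameterized by image tag, so it can be checked offline.

```bash
# Build a success notification like the one above. $1 = image tag (e.g. main-42).
success_message() {
    printf '✅ Deployment Successful!\nVerified:\n- ArgoCD synced ✅\n- Rollout completed ✅\n- Pods running %s ✅\n- All pods healthy ✅\n' "$1"
}

# Send a message with the Telegram Bot API sendMessage method.
# TELEGRAM_BOT_TOKEN and TELEGRAM_CHAT_ID are hypothetical variable names.
send_telegram() {
    curl -fsS "https://api.telegram.org/bot${TELEGRAM_BOT_TOKEN}/sendMessage" \
        --data-urlencode "chat_id=${TELEGRAM_CHAT_ID}" \
        --data-urlencode "text=$1" >/dev/null
}

# Usage (needs network access and a real bot token):
#   send_telegram "$(success_message main-42)"
```
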
---
**This fix ensures Jenkins never reports success until pods are actually running the new version!**