GitLab + Harbor + Docker Swarm: Automated Deployment Solution

Version: 1.0
Created: January 2026
Status: Implementation Ready
Target audience: DevOps Team, Development Team


Executive Summary

This document describes a practical solution for automating the deployment process on the existing infrastructure.

Current situation:

  • GitLab is already installed
  • Harbor Registry is already running
  • Docker Swarm runs several containers
  • 4 environments: Development → Sandbox → Testing → Production
  • Manual deployment via bash scripts
  • No code review process
  • No automatic rollback
  • Prebuilt images arrive from Harbor without visibility

Proposed solution:

  • GitLab CI/CD pipelines for automated deployment
  • GitOps approach: Git as the source of truth for deployments
  • Automated deployment across environments with approval gates
  • One-click rollback capability
  • Deployment history and audit trail
  • Health checks and automatic rollback on failure

Target results:

  • 🚀 Deployment time: from 30-60 minutes down to 5-10 minutes
  • 🔒 Human errors: reduced by 90%
  • 📊 Full visibility: who deployed what, and when
  • Rollback: from 1-2 hours down to 2-3 minutes
  • Compliance: complete audit trail

Contents

  1. Solution Architecture
  2. GitLab CI/CD Pipeline Implementation
  3. Docker Stack Management
  4. Environment Management Strategy
  5. Rollback Strategy
  6. Monitoring & Health Checks
  7. Implementation Roadmap
  8. Best Practices & Tips
  9. Success Metrics
  10. Conclusion & Next Steps

1. Solution Architecture

1.1 Current State Architecture

┌─────────────────────────────────────────────────────────────┐
│                    Current Manual Process                    │
├─────────────────────────────────────────────────────────────┤
│                                                               │
│   Developer → Build Image → Push to Harbor                   │
│                                   ↓                           │
│                          Notify DevOps Team                   │
│                                   ↓                           │
│              DevOps manually runs bash scripts:               │
│                                                               │
│   1. SSH to Swarm manager                                     │
│   2. docker service update app --image harbor/app:new-tag    │
│   3. Check logs manually                                      │
│   4. Hope everything works                                    │
│   5. Repeat for each environment (4x)                         │
│                                                               │
│   Problems:                                                   │
│   • Time consuming (30-60 min per environment)                │
│   • Error prone (typos, wrong tags)                           │
│   • No rollback plan                                          │
│   • No audit trail                                            │
│   • No validation before deployment                           │
└─────────────────────────────────────────────────────────────┘

1.2 Target Automated Architecture

┌──────────────────────────────────────────────────────────────┐
│              Automated GitOps-Based Solution                  │
├──────────────────────────────────────────────────────────────┤
│                                                                │
│  Developer pushes image tag change to Git                     │
│         ↓                                                      │
│  GitLab CI/CD Pipeline automatically:                         │
│         ↓                                                      │
│  ┌─────────────────────────────────────────────────┐         │
│  │  1. Validate docker-compose.yml syntax          │         │
│  │  2. Check image exists in Harbor                │         │
│  │  3. Deploy to Development (automatic)           │         │
│  │  4. Run health checks                           │         │
│  │  5. Wait for manual approval → Sandbox          │         │
│  │  6. Deploy to Sandbox                           │         │
│  │  7. Wait for manual approval → Testing          │         │
│  │  8. Deploy to Testing                           │         │
│  │  9. Wait for manual approval → Production       │         │
│  │ 10. Deploy to Production                        │         │
│  │ 11. Monitor deployment success                  │         │
│  │ 12. Auto-rollback if health checks fail         │         │
│  └─────────────────────────────────────────────────┘         │
│                                                                │
│  Benefits:                                                     │
│  ✅ 5-10 minutes per environment                              │
│  ✅ ~90% fewer human errors                                   │
│  ✅ Automatic rollback on failure                             │
│  ✅ Complete audit trail in Git                               │
│  ✅ Pre-deployment validation                                 │
└──────────────────────────────────────────────────────────────┘

1.3 Git Repository Structure

deployment-configs/          # New GitLab repository
├── README.md
├── .gitlab-ci.yml          # CI/CD pipeline definition
│
├── environments/
│   ├── development/
│   │   ├── docker-compose.yml
│   │   ├── .env
│   │   └── healthcheck.sh
│   │
│   ├── sandbox/
│   │   ├── docker-compose.yml
│   │   ├── .env
│   │   └── healthcheck.sh
│   │
│   ├── testing/
│   │   ├── docker-compose.yml
│   │   ├── .env
│   │   └── healthcheck.sh
│   │
│   └── production/
│       ├── docker-compose.yml
│       ├── .env
│       └── healthcheck.sh
│
├── scripts/
│   ├── deploy.sh           # Deployment script
│   ├── rollback.sh         # Rollback script
│   ├── healthcheck.sh      # Health validation
│   └── validate-compose.sh # Pre-deployment validation
│
└── docs/
    ├── deployment-guide.md
    └── rollback-procedure.md

2. GitLab CI/CD Pipeline Implementation

2.1 Complete .gitlab-ci.yml

# .gitlab-ci.yml - Complete automated deployment pipeline

variables:
  DOCKER_HOST: "tcp://docker-swarm-manager:2376"
  DOCKER_TLS_VERIFY: "1"
  HARBOR_REGISTRY: "harbor.company.com"
  
  # Swarm connection details (stored in GitLab CI/CD variables)
  # SWARM_DEV_HOST, SWARM_SANDBOX_HOST, SWARM_TEST_HOST, SWARM_PROD_HOST
  # SWARM_SSH_KEY (SSH private key for authentication)

stages:
  - validate
  - deploy-dev
  - deploy-sandbox
  - deploy-testing
  - deploy-production
  - rollback

#═══════════════════════════════════════════════════════════
# Stage 1: VALIDATION
#═══════════════════════════════════════════════════════════

validate:syntax:
  stage: validate
  image: docker:24-cli
  script:
    - echo "Validating docker-compose files..."
    - |
      for env in development sandbox testing production; do
        echo "Checking $env environment..."
        # The docker:24-cli image ships Compose as a CLI plugin ("docker compose")
        if docker compose -f environments/$env/docker-compose.yml config > /dev/null; then
          echo "✅ $env: Syntax OK"
        else
          echo "❌ $env: Syntax ERROR"
          exit 1
        fi
      done
  only:
    - branches
  tags:
    - docker

validate:images:
  stage: validate
  image: docker:24-cli
  before_script:
    # Pass the password on stdin so it does not appear in the process list
    - echo "$HARBOR_PASSWORD" | docker login -u "$HARBOR_USER" --password-stdin "$HARBOR_REGISTRY"
  script:
    - echo "Checking if images exist in Harbor..."
    - |
      for env in development sandbox testing production; do
        echo "Checking images for $env..."
        
        # Resolve image names with the variables from the environment's .env substituted
        images=$(docker compose -f environments/$env/docker-compose.yml config --images)
        
        for image in $images; do
          echo "Pulling $image to verify existence..."
          if docker pull "$image"; then
            echo "✅ Image exists: $image"
          else
            echo "❌ Image NOT found: $image"
            exit 1
          fi
        done
      done
  only:
    - branches
  tags:
    - docker

#═══════════════════════════════════════════════════════════
# Stage 2: DEPLOY TO DEVELOPMENT (Automatic)
#═══════════════════════════════════════════════════════════

deploy:development:
  stage: deploy-dev
  image: alpine:latest
  before_script:
    - apk add --no-cache openssh-client bash docker-cli
    - eval $(ssh-agent -s)
    - echo "$SWARM_SSH_KEY" | tr -d '\r' | ssh-add -
    - mkdir -p ~/.ssh
    - chmod 700 ~/.ssh
    - ssh-keyscan -H $SWARM_DEV_HOST >> ~/.ssh/known_hosts
  script:
    - echo "🚀 Deploying to DEVELOPMENT environment..."
    
    # Copy files to swarm manager
    - ssh root@$SWARM_DEV_HOST "mkdir -p /tmp/deploy"
    - scp -r environments/development root@$SWARM_DEV_HOST:/tmp/deploy/
    - scp scripts/deploy.sh root@$SWARM_DEV_HOST:/tmp/deploy/
    
    # Execute deployment
    - |
      ssh root@$SWARM_DEV_HOST bash << 'EOF'
        set -e
        cd /tmp/deploy/development
        
        # Export every variable from .env so docker stack deploy can substitute it
        set -a
        source .env
        set +a
        
        # Deploy stack
        docker stack deploy -c docker-compose.yml --with-registry-auth app-stack
        
        # Wait for services to stabilize
        echo "Waiting for services to start..."
        sleep 30
        
        # Check service status
        docker stack services app-stack
        
        # Run health checks (healthcheck.sh is copied with the environment directory)
        bash ./healthcheck.sh
      EOF
    
    - echo "✅ Deployment to DEVELOPMENT completed"
    
  environment:
    name: development
    url: https://dev.company.com
  
  only:
    - main
    - develop
  
  tags:
    - deployment

#═══════════════════════════════════════════════════════════
# Stage 3: DEPLOY TO SANDBOX (Manual Approval Required)
#═══════════════════════════════════════════════════════════

deploy:sandbox:
  stage: deploy-sandbox
  image: alpine:latest
  before_script:
    - apk add --no-cache openssh-client bash docker-cli
    - eval $(ssh-agent -s)
    - echo "$SWARM_SSH_KEY" | tr -d '\r' | ssh-add -
    - mkdir -p ~/.ssh
    - chmod 700 ~/.ssh
    - ssh-keyscan -H $SWARM_SANDBOX_HOST >> ~/.ssh/known_hosts
  
  script:
    - echo "🚀 Deploying to SANDBOX environment..."
    - ssh root@$SWARM_SANDBOX_HOST "mkdir -p /tmp/deploy"
    - scp -r environments/sandbox root@$SWARM_SANDBOX_HOST:/tmp/deploy/
    - |
      ssh root@$SWARM_SANDBOX_HOST bash << 'EOF'
        set -e
        cd /tmp/deploy/sandbox
        # Export .env values so docker stack deploy can substitute them
        set -a; source .env; set +a
        docker stack deploy -c docker-compose.yml --with-registry-auth app-stack
        sleep 30
        docker stack services app-stack
        bash ./healthcheck.sh
      EOF
    - echo "✅ Deployment to SANDBOX completed"
  
  environment:
    name: sandbox
    url: https://sandbox.company.com
  
  when: manual  # ⚠️ Requires manual approval
  allow_failure: false  # Keep the pipeline blocked at this stage until the job is run
  
  only:
    - main
  
  tags:
    - deployment

#═══════════════════════════════════════════════════════════
# Stage 4: DEPLOY TO TESTING (Manual Approval Required)
#═══════════════════════════════════════════════════════════

deploy:testing:
  stage: deploy-testing
  image: alpine:latest
  before_script:
    - apk add --no-cache openssh-client bash docker-cli
    - eval $(ssh-agent -s)
    - echo "$SWARM_SSH_KEY" | tr -d '\r' | ssh-add -
    - mkdir -p ~/.ssh
    - chmod 700 ~/.ssh
    - ssh-keyscan -H $SWARM_TEST_HOST >> ~/.ssh/known_hosts
  
  script:
    - echo "🚀 Deploying to TESTING environment..."
    - ssh root@$SWARM_TEST_HOST "mkdir -p /tmp/deploy"
    - scp -r environments/testing root@$SWARM_TEST_HOST:/tmp/deploy/
    - |
      ssh root@$SWARM_TEST_HOST bash << 'EOF'
        set -e
        cd /tmp/deploy/testing
        # Export .env values so docker stack deploy can substitute them
        set -a; source .env; set +a
        docker stack deploy -c docker-compose.yml --with-registry-auth app-stack
        sleep 30
        docker stack services app-stack
        bash ./healthcheck.sh
      EOF
    - echo "✅ Deployment to TESTING completed"
  
  environment:
    name: testing
    url: https://testing.company.com
  
  when: manual  # ⚠️ Requires manual approval
  allow_failure: false  # Keep the pipeline blocked at this stage until the job is run
  
  only:
    - main
  
  tags:
    - deployment

#═══════════════════════════════════════════════════════════
# Stage 5: DEPLOY TO PRODUCTION (Manual Approval Required)
#═══════════════════════════════════════════════════════════

deploy:production:
  stage: deploy-production
  image: alpine:latest
  before_script:
    - apk add --no-cache openssh-client bash docker-cli
    - eval $(ssh-agent -s)
    - echo "$SWARM_SSH_KEY" | tr -d '\r' | ssh-add -
    - mkdir -p ~/.ssh
    - chmod 700 ~/.ssh
    - ssh-keyscan -H $SWARM_PROD_HOST >> ~/.ssh/known_hosts
  
  script:
    - echo "🚀 Deploying to PRODUCTION environment..."
    
    # Backup current deployment
    - |
      ssh root@$SWARM_PROD_HOST bash << 'EOF'
        echo "Creating backup of current deployment..."
        # Compute the timestamp once so both commands use the same directory
        BACKUP_PATH="/backup/deployments/$(date +%Y%m%d-%H%M%S)"
        mkdir -p "$BACKUP_PATH"
        docker stack services app-stack --format "{{.Name}} {{.Image}}" > "$BACKUP_PATH/services.txt"
        echo "Backup created: $BACKUP_PATH"
      EOF
    
    # Deploy new version
    - ssh root@$SWARM_PROD_HOST "mkdir -p /tmp/deploy"
    - scp -r environments/production root@$SWARM_PROD_HOST:/tmp/deploy/
    - |
      ssh root@$SWARM_PROD_HOST bash << 'EOF'
        set -e
        cd /tmp/deploy/production
        # Export .env values so docker stack deploy can substitute them
        set -a; source .env; set +a
        
        echo "Starting production deployment..."
        docker stack deploy -c docker-compose.yml --with-registry-auth app-stack
        
        echo "Waiting for services to stabilize..."
        sleep 60
        
        echo "Checking service health..."
        docker stack services app-stack
        
        # Run comprehensive health checks
        if bash ./healthcheck.sh; then
          echo "✅ Health checks PASSED"
        else
          echo "❌ Health checks FAILED - consider rollback"
          exit 1
        fi
      EOF
    
    - echo "✅ Deployment to PRODUCTION completed successfully"
  
  environment:
    name: production
    url: https://app.company.com
  
  when: manual  # ⚠️ Requires manual approval + confirmation
  allow_failure: false  # Keep the pipeline blocked at this stage until the job is run
  
  only:
    - main
  
  tags:
    - deployment

#═══════════════════════════════════════════════════════════
# ROLLBACK JOBS (Manual Trigger)
#═══════════════════════════════════════════════════════════

rollback:production:
  stage: rollback
  image: alpine:latest
  before_script:
    - apk add --no-cache openssh-client bash docker-cli git
    - eval $(ssh-agent -s)
    - echo "$SWARM_SSH_KEY" | tr -d '\r' | ssh-add -
    - mkdir -p ~/.ssh
    - chmod 700 ~/.ssh
    - ssh-keyscan -H $SWARM_PROD_HOST >> ~/.ssh/known_hosts
  
  script:
    - echo "🔄 Rolling back PRODUCTION to previous version..."
    
    # Get previous Git commit
    - PREVIOUS_COMMIT=$(git rev-parse HEAD~1)
    - echo "Rolling back to commit: $PREVIOUS_COMMIT"
    
    # Checkout previous version
    - git checkout $PREVIOUS_COMMIT -- environments/production/
    
    # Deploy previous version
    - ssh root@$SWARM_PROD_HOST "mkdir -p /tmp/rollback"
    - scp -r environments/production root@$SWARM_PROD_HOST:/tmp/rollback/
    - |
      ssh root@$SWARM_PROD_HOST bash << 'EOF'
        set -e
        cd /tmp/rollback/production
        # Export .env values so docker stack deploy can substitute them
        set -a; source .env; set +a
        
        echo "Rolling back to previous version..."
        docker stack deploy -c docker-compose.yml --with-registry-auth app-stack
        
        sleep 30
        
        echo "Verifying rollback..."
        docker stack services app-stack
        bash ./healthcheck.sh
      EOF
    
    - echo "✅ Rollback completed"
  
  environment:
    name: production
  
  when: manual
  
  only:
    - main
  
  tags:
    - deployment

3. Docker Stack Management

3.1 Example docker-compose.yml Structure

# environments/production/docker-compose.yml

version: '3.8'

services:
  
  #════════════════════════════════════════════════════════
  # Frontend Application
  #════════════════════════════════════════════════════════
  frontend:
    image: ${HARBOR_REGISTRY}/company/frontend:${FRONTEND_VERSION}
    networks:
      - app-network
    ports:
      - "80:80"
      - "443:443"
    deploy:
      replicas: 3
      update_config:
        parallelism: 1
        delay: 10s
        failure_action: rollback
        monitor: 30s
      rollback_config:
        parallelism: 1
        delay: 5s
      restart_policy:
        condition: any
        delay: 5s
        max_attempts: 3
      placement:
        constraints:
          - node.role == worker
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
  
  #════════════════════════════════════════════════════════
  # Backend API
  #════════════════════════════════════════════════════════
  api:
    image: ${HARBOR_REGISTRY}/company/api:${API_VERSION}
    networks:
      - app-network
      - db-network
    environment:
      - DATABASE_URL=${DATABASE_URL}
      - REDIS_URL=${REDIS_URL}
      - JWT_SECRET_FILE=${JWT_SECRET_FILE}
    secrets:
      - db_password
      - jwt_secret
    deploy:
      replicas: 5
      update_config:
        parallelism: 2
        delay: 10s
        failure_action: rollback
        monitor: 45s
      rollback_config:
        parallelism: 2
        delay: 5s
      restart_policy:
        condition: any
        delay: 5s
        max_attempts: 3
      placement:
        constraints:
          - node.role == worker
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
  
  #════════════════════════════════════════════════════════
  # Worker Service
  #════════════════════════════════════════════════════════
  worker:
    image: ${HARBOR_REGISTRY}/company/worker:${WORKER_VERSION}
    networks:
      - app-network
      - db-network
    environment:
      - REDIS_URL=${REDIS_URL}
      - QUEUE_NAME=jobs
    deploy:
      replicas: 3
      update_config:
        parallelism: 1
        delay: 10s
        failure_action: rollback
      restart_policy:
        condition: any
        delay: 10s
        max_attempts: 3
      placement:
        constraints:
          - node.role == worker
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
  
  #════════════════════════════════════════════════════════
  # Cache (Redis)
  #════════════════════════════════════════════════════════
  redis:
    image: redis:7-alpine
    networks:
      - app-network
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.role == worker
      restart_policy:
        condition: any
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 3s
      retries: 3
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"

#════════════════════════════════════════════════════════
# Networks
#════════════════════════════════════════════════════════
networks:
  app-network:
    driver: overlay
    attachable: true
  db-network:
    driver: overlay
    internal: true

#════════════════════════════════════════════════════════
# Secrets
#════════════════════════════════════════════════════════
secrets:
  db_password:
    external: true
  jwt_secret:
    external: true

3.2 Environment Variables (.env files)

# environments/production/.env

# Harbor Registry
HARBOR_REGISTRY=harbor.company.com

# Application Versions (THIS IS WHAT YOU UPDATE!)
FRONTEND_VERSION=v2.1.5
API_VERSION=v3.2.1
WORKER_VERSION=v1.8.3

# Database Configuration
DATABASE_URL=postgresql://user@db-prod:5432/appdb

# Redis Configuration
REDIS_URL=redis://redis:6379

# Application Configuration
JWT_SECRET_FILE=/run/secrets/jwt_secret
LOG_LEVEL=info
ENVIRONMENT=production

3.3 Health Check Script

#!/bin/bash
# environments/production/healthcheck.sh

set -e

echo "═══════════════════════════════════════════════"
echo "Running Health Checks for Production"
echo "═══════════════════════════════════════════════"

STACK_NAME="app-stack"
FAILED=0

# Check if all services are running
echo ""
echo "1. Checking service status..."
SERVICES=$(docker stack services $STACK_NAME --format "{{.Name}}")

for service in $SERVICES; do
    REPLICAS=$(docker service ls --filter name=$service --format "{{.Replicas}}")
    echo "   $service: $REPLICAS"
    
    # Anchor the match so "10/10" is not mistaken for zero running replicas
    if echo "$REPLICAS" | grep -q "^0/"; then
        echo "   ❌ Service $service has NO running replicas!"
        FAILED=1
    fi
done

# Check frontend health endpoint (port 80 is published by the frontend service)
echo ""
echo "2. Checking Frontend health endpoint..."
if curl -sf http://localhost/health > /dev/null; then
    echo "   ✅ Frontend health check PASSED"
else
    echo "   ❌ Frontend health check FAILED"
    FAILED=1
fi

# Check API health endpoint
# Note: this works from the manager only if port 3000 is published;
# the sample docker-compose.yml does not publish it.
echo ""
echo "3. Checking API health endpoint..."
if curl -sf http://localhost:3000/health > /dev/null; then
    echo "   ✅ API health check PASSED"
else
    echo "   ❌ API health check FAILED"
    FAILED=1
fi

# Check Redis connectivity
# Note: docker exec only reaches containers on this node, so the Redis task
# must be running on the node where this script executes.
echo ""
echo "4. Checking Redis connectivity..."
REDIS_CONTAINER=$(docker ps -q -f name=${STACK_NAME}_redis | head -n 1)
if [ -n "$REDIS_CONTAINER" ] && docker exec "$REDIS_CONTAINER" redis-cli ping | grep -q PONG; then
    echo "   ✅ Redis connectivity PASSED"
else
    echo "   ❌ Redis connectivity FAILED"
    FAILED=1
fi

# Check for recent errors in logs
# docker service logs takes a service name, so sum the matches per service
echo ""
echo "5. Checking recent logs for errors..."
ERROR_COUNT=0
for service in $SERVICES; do
    COUNT=$(docker service logs --since 5m "$service" 2>&1 | grep -i "error\|fatal\|panic" | wc -l)
    ERROR_COUNT=$((ERROR_COUNT + COUNT))
done
if [ $ERROR_COUNT -gt 10 ]; then
    echo "   ⚠️  Found $ERROR_COUNT errors in last 5 minutes"
    FAILED=1
else
    echo "   ✅ Error count acceptable: $ERROR_COUNT"
fi

echo ""
echo "═══════════════════════════════════════════════"
if [ $FAILED -eq 0 ]; then
    echo "✅ ALL HEALTH CHECKS PASSED"
    echo "═══════════════════════════════════════════════"
    exit 0
else
    echo "❌ HEALTH CHECKS FAILED"
    echo "═══════════════════════════════════════════════"
    exit 1
fi
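
The replica check in the script only fails a service when no replica is running at all. A stricter variant is sketched below. It assumes the Replicas column has the form running/desired (for example 3/5), possibly followed by an annotation, and the function name replicas_converged is hypothetical rather than part of the original scripts.

```shell
#!/bin/bash
# Hypothetical helper: succeed only when running replicas equal desired replicas.
# Input is one value of the Replicas column, e.g. "3/5" or "3/3 (max 1 per node)".
# Returns 0 when converged, 1 when replicas are missing, 2 when the input is malformed.
replicas_converged() {
    local field="${1%% *}"          # drop any annotation after the first space
    local running="${field%%/*}"    # part before the slash
    local desired="${field##*/}"    # part after the slash

    # Both parts must be non-empty and consist of digits only
    [ -n "$running" ] && [ -n "$desired" ] || return 2
    case "$running$desired" in
        *[!0-9]*) return 2 ;;
    esac

    # A service scaled to zero is treated as not converged
    [ "$desired" -gt 0 ] && [ "$running" -eq "$desired" ]
}
```

Inside the service loop it could stand in for the grep test, for example: replicas_converged "$REPLICAS" || FAILED=1.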

4. Environment Management Strategy

4.1 Promotion Flow

┌─────────────────────────────────────────────────────────┐
│              Environment Promotion Flow                  │
└─────────────────────────────────────────────────────────┘

Developer updates image version in Git
         ↓
    Development (Automatic)
         ├─ Deploy immediately
         ├─ Run health checks
         └─ ✅ If successful → enable Sandbox deployment
         
         ↓ (Manual approval required)
         
    Sandbox (Manual Trigger)
         ├─ QA team tests features
         ├─ Run integration tests
         └─ ✅ If approved → enable Testing deployment
         
         ↓ (Manual approval required)
         
    Testing (Manual Trigger)
         ├─ Full regression testing
         ├─ Performance testing
         └─ ✅ If approved → enable Production deployment
         
         ↓ (Manual approval required + confirmation)
         
    Production (Manual Trigger)
         ├─ Backup current state
         ├─ Deploy with rolling updates (update_config in docker-compose.yml)
         ├─ Run comprehensive health checks
         └─ ✅ Monitor or 🔄 Rollback if issues

4.2 Deployment Approval Matrix

Environment   Approval Required             Who Can Approve        Rollback Strategy
-----------   ---------------------------   --------------------   ------------------------------------
Development   No (Automatic)                N/A                    Automatic on health check failure
Sandbox       Yes (Manual)                  Any Developer          Manual via GitLab UI
Testing       Yes (Manual)                  QA Lead, DevOps Lead   Manual via GitLab UI
Production    Yes (Manual + Confirmation)   DevOps Lead, CTO       Automatic on failure + Manual option

4.3 Change Management Workflow

# Example: Updating application version

# 1. Developer receives new image from Harbor
#    New image available: harbor.company.com/company/api:v3.2.2

# 2. Developer creates feature branch
git checkout -b update-api-v3.2.2

# 3. Update version in Development environment
#    Edit environments/development/.env and set:
#    API_VERSION=v3.2.2

# 4. Commit and push
git add environments/development/.env
git commit -m "feat: update API to v3.2.2 in development"
git push origin update-api-v3.2.2

# 5. Create Merge Request in GitLab
#    - Title: "Update API to v3.2.2"
#    - Description: "New features: X, Y, Z. Bug fixes: A, B"
#    - Assign to: DevOps team for review

# 6. After MR approval and merge to main:
#    - GitLab CI automatically deploys to Development
#    - Monitor deployment
#    - If successful, set the same version in environments/sandbox/.env
#      and manually trigger the Sandbox deployment

# 7. QA tests in Sandbox
#    - If approved, update environments/testing/.env
#    - Repeat process

# 8. Production deployment
#    - Update environments/production/.env with new version
#    - Create MR with detailed change log
#    - Require approvals from: DevOps Lead + CTO
#    - Schedule deployment window
#    - Execute manual deployment
#    - Monitor closely
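
Each environment directory carries its own .env, so promoting a release means copying the tested version value from one environment file to the next. The helper below is a minimal sketch of that step. The function name promote_version is hypothetical, and it assumes values contain no "|" or "&" characters, which holds for version tags such as v3.2.2.

```shell
#!/bin/bash
# Hypothetical helper: copy one KEY=value pair from a source .env to a target .env.
# Usage: promote_version KEY SOURCE_ENV_FILE TARGET_ENV_FILE
promote_version() {
    local key="$1" src="$2" dst="$3" value

    # Take the last assignment of KEY in the source file
    value=$(grep -E "^${key}=" "$src" | tail -n 1 | cut -d= -f2-)
    if [ -z "$value" ]; then
        echo "error: $key is not set in $src" >&2
        return 1
    fi

    if grep -qE "^${key}=" "$dst"; then
        # Rewrite the existing line through a temp file (works with GNU and BSD sed)
        sed "s|^${key}=.*|${key}=${value}|" "$dst" > "$dst.tmp" && mv "$dst.tmp" "$dst"
    else
        # KEY is new for the target environment: append it
        echo "${key}=${value}" >> "$dst"
    fi
    echo "$key=$value -> $dst"
}
```

For example, promote_version API_VERSION environments/development/.env environments/sandbox/.env changes only the API_VERSION line of the Sandbox file, and the result still goes through a Merge Request like any other change.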

5. Rollback Strategy

5.1 Automatic Rollback (Health Check Failure)

# In docker-compose.yml - automatic rollback on failure

services:
  api:
    deploy:
      update_config:
        failure_action: rollback  # ← Automatic rollback!
        monitor: 60s             # Monitor for 60 seconds
      rollback_config:
        parallelism: 2           # Roll back 2 at a time
        delay: 5s                # 5s between rollbacks

How it works:

  1. New version deploys
  2. Docker Swarm monitors health checks for 60 seconds
  3. If health checks fail → Automatic rollback to previous version
  4. Previous version restored within 2-3 minutes
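
A pipeline can observe the outcome of these steps by polling the update state that Swarm records on the service. The sketch below assumes the state is read with docker service inspect and that the caller invokes it after the update has started. The function names, the 180-second default timeout, and the exit codes are illustrative choices, not part of the original scripts.

```shell
#!/bin/bash
# Hypothetical helpers for watching a Swarm rolling update.

# Map a Swarm UpdateStatus.State value to ok / in_progress / failed / unknown.
classify_update_state() {
    case "$1" in
        completed|"")                               echo ok ;;          # empty: never updated
        updating|rollback_started)                  echo in_progress ;;
        paused|rollback_paused|rollback_completed)  echo failed ;;
        *)                                          echo unknown ;;
    esac
}

# Poll a service until its update settles.
# Returns 0 on success, 1 if the update paused or was rolled back, 2 on timeout.
wait_for_update() {
    local service="$1" timeout="${2:-180}" interval="${3:-5}" elapsed=0 state
    while [ "$elapsed" -lt "$timeout" ]; do
        # The if-guard avoids a template error when the service has no UpdateStatus yet
        state=$(docker service inspect \
            --format '{{if .UpdateStatus}}{{.UpdateStatus.State}}{{end}}' "$service")
        case "$(classify_update_state "$state")" in
            ok)     return 0 ;;
            failed) echo "update of $service ended in state: $state" >&2; return 1 ;;
        esac
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    echo "timed out waiting for $service" >&2
    return 2
}
```

A deploy job could call wait_for_update app-stack_api after docker stack deploy and fail the job on a non-zero exit code, so that an automatic rollback shows up as a failed pipeline instead of passing silently.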

5.2 Manual Rollback via GitLab

Option A: Rollback via Git History

# GitLab Pipeline: rollback:production job

# 1. Identify previous working version
git log --oneline environments/production/.env

# 2. Checkout previous commit
git checkout <previous-commit-hash> -- environments/production/

# 3. Pipeline redeploys previous version
# 4. Verify health checks

Option B: Rollback via GitLab UI

GitLab → Deployments → Environments → Production
  ↓
Click "Rollback" button
  ↓
Select previous successful deployment
  ↓
Confirm rollback
  ↓
GitLab re-runs the deployment job from the selected deployment

5.3 Emergency Rollback Procedure

#!/bin/bash
# scripts/emergency-rollback.sh

# FOR EMERGENCY USE ONLY - bypasses GitLab pipeline
# Run directly on Swarm manager node

STACK_NAME="app-stack"
BACKUP_DIR="/backup/deployments"

echo "🚨 EMERGENCY ROLLBACK INITIATED"

# Find last backup
LAST_BACKUP=$(ls -td $BACKUP_DIR/* | head -1)
echo "Rolling back to: $LAST_BACKUP"

# Extract previous image versions
# Each line of services.txt holds "<service name> <image>"; the service name
# already includes the stack prefix (for example app-stack_api)
while read -r SERVICE IMAGE; do
    [ -n "$SERVICE" ] || continue
    echo "Rolling back $SERVICE to $IMAGE"
    docker service update --image "$IMAGE" "$SERVICE"
done < "$LAST_BACKUP/services.txt"

echo "✅ Emergency rollback completed"
echo "⚠️  Remember to update Git repository to match!"

6. Monitoring & Health Checks

6.1 Service-Level Health Checks

# In docker-compose.yml

healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
  interval: 30s      # Check every 30 seconds
  timeout: 10s       # Request timeout
  retries: 3         # Fail after 3 attempts
  start_period: 60s  # Grace period for startup

6.2 Stack-Level Monitoring

#!/bin/bash
# scripts/monitor-deployment.sh

STACK_NAME="app-stack"

while true; do
    clear
    echo "═══════════════════════════════════════════════"
    echo "Stack: $STACK_NAME - $(date)"
    echo "═══════════════════════════════════════════════"
    
    # Show service status
    docker stack services $STACK_NAME
    
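    echo ""
    echo "Recent logs (last 10 lines per service):"
    # docker service logs expects a service name, so iterate over the stack's services
    for service in $(docker stack services $STACK_NAME --format "{{.Name}}"); do
        docker service logs --tail 10 "$service"
    done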
    
    sleep 10
done

6.3 Notification Integration

# Add to .gitlab-ci.yml

after_script:
  # Requires curl in the job image (for example: apk add --no-cache curl)
  - |
    if [ "$CI_JOB_STATUS" = "success" ]; then
      MESSAGE="✅ Deployment to $CI_ENVIRONMENT_NAME successful"
    else
      MESSAGE="❌ Deployment to $CI_ENVIRONMENT_NAME FAILED"
    fi
    
    # Send to Slack
    curl -X POST -H 'Content-type: application/json' \
      --data "{\"text\":\"$MESSAGE\nPipeline: $CI_PIPELINE_URL\"}" \
      "$SLACK_WEBHOOK_URL"
    
    # Send email (only if a mail client and SMTP relay are available in the image)
    echo "$MESSAGE" | mail -s "Deployment Notification" devops@company.com

7. Implementation Roadmap

Phase 1: Preparation (Week 1)

Day 1-2: Repository Setup

  • Create deployment-configs repository in GitLab
  • Create directory structure (environments/, scripts/)
  • Add current docker-compose.yml to each environment
  • Create .env files with current versions
  • Commit initial structure

Day 3-4: GitLab Configuration

  • Configure GitLab CI/CD variables:
    • SWARM_DEV_HOST, SWARM_SANDBOX_HOST, SWARM_TEST_HOST, SWARM_PROD_HOST
    • SWARM_SSH_KEY (SSH private key)
    • HARBOR_USER, HARBOR_PASSWORD
    • SLACK_WEBHOOK_URL (optional)
  • Create SSH keys for GitLab Runner → Swarm access
  • Test SSH connectivity from GitLab to each Swarm environment

Day 5: Scripts Development

  • Create deploy.sh script
  • Create healthcheck.sh script
  • Create rollback.sh script
  • Test scripts manually on Development environment

Phase 2: Pipeline Implementation (Week 2)

Day 1-2: Basic Pipeline

  • Create .gitlab-ci.yml with validation stage only
  • Test syntax validation
  • Test image validation

Day 3: Development Deployment

  • Add deploy:development job
  • Test automatic deployment to Development
  • Verify health checks work

Day 4: Sandbox & Testing

  • Add deploy:sandbox job (manual)
  • Add deploy:testing job (manual)
  • Test manual approval workflow

Day 5: Production Deployment

  • Add deploy:production job (manual + confirmation)
  • Add backup before deployment
  • Test during a low-traffic window inside the recommended hours (see 8.2), not on a Friday afternoon

Phase 3: Rollback Implementation (Week 3)

Day 1-2: Automatic Rollback

  • Configure Docker Swarm automatic rollback
  • Test by deploying broken version
  • Verify automatic recovery

Day 3-4: Manual Rollback

  • Implement rollback:production job
  • Test Git-based rollback
  • Document rollback procedure

Day 5: Emergency Procedures

  • Create emergency-rollback.sh script
  • Test emergency rollback
  • Document for on-call team

Phase 4: Monitoring & Optimization (Week 4)

Day 1-2: Monitoring

  • Set up deployment notifications (Slack/Email)
  • Configure Prometheus metrics collection
  • Create Grafana dashboards for deployments

Day 3-4: Documentation

  • Write deployment guide for developers
  • Write operations runbook
  • Create troubleshooting guide
  • Record demo video

Day 5: Team Training

  • Train developers on new workflow
  • Train QA team on approval process
  • Train DevOps team on monitoring/rollback
  • Conduct Q&A session

8. Best Practices & Tips

8.1 Version Management

DO:

# Use semantic versioning
API_VERSION=v3.2.1  # ← Good: Clear, semantic version

# Include Git commit hash for traceability
API_VERSION=v3.2.1-abc123ef

# Use immutable tags
IMAGE=harbor.company.com/app:v1.2.3  # ← Good: Specific version

DON'T:

# Avoid mutable tags
API_VERSION=latest  # ← Bad: Can change unexpectedly

# Avoid ambiguous versions
API_VERSION=production  # ← Bad: What version is this?
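
These rules can be enforced in the validation stage before any deployment runs. The sketch below assumes version variables end in _VERSION, as in the sample .env files, and accepts vMAJOR.MINOR.PATCH with an optional lowercase hex commit suffix. The function names is_immutable_tag and check_env_versions are hypothetical.

```shell
#!/bin/bash
# Hypothetical check: accept vMAJOR.MINOR.PATCH with an optional -<git hash> suffix.
is_immutable_tag() {
    printf '%s\n' "$1" | grep -Eq '^v[0-9]+\.[0-9]+\.[0-9]+(-[0-9a-f]{7,40})?$'
}

# Scan a .env file and report every *_VERSION entry that fails the check.
# Returns 0 when all version entries pass, 1 otherwise.
check_env_versions() {
    local file="$1" key value bad=0
    while IFS='=' read -r key value; do
        case "$key" in
            '#'*) ;;                      # skip comment lines
            *_VERSION)
                if ! is_immutable_tag "$value"; then
                    echo "mutable or malformed tag: $key=$value" >&2
                    bad=1
                fi ;;
        esac
    done < "$file"
    return "$bad"
}
```

Running check_env_versions environments/production/.env in a validate job would reject latest or production before the stack is touched.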

8.2 Deployment Timing

Recommended deployment windows:

  • Development: Anytime (automatic)
  • Sandbox: Business hours (9am-5pm)
  • Testing: Business hours (requires QA)
  • Production:
    • Normal changes: Tuesday-Thursday, 10am-2pm
    • Critical fixes: Anytime with proper approval
    • Avoid: Monday mornings, Friday afternoons, weekends
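
The production window for normal changes can also be checked by a script before the deploy job proceeds. The sketch below is one possible gate. It assumes the runner's local time zone is the one the team schedules in, and the function name in_production_window is hypothetical. Critical fixes would need a documented way to bypass it.

```shell
#!/bin/bash
# Hypothetical gate: succeed only Tuesday-Thursday between 10:00 and 13:59.
# Arguments default to the current local day (1=Monday .. 7=Sunday) and hour (00-23).
in_production_window() {
    local dow="${1:-$(date +%u)}" hour="${2:-$(date +%H)}"

    # Strip one leading zero so values such as "08" compare as plain integers
    hour="${hour#0}"
    hour="${hour:-0}"

    [ "$dow" -ge 2 ] && [ "$dow" -le 4 ] && [ "$hour" -ge 10 ] && [ "$hour" -lt 14 ]
}
```

A production job could start with in_production_window || { echo "Outside the deployment window"; exit 1; } to stop normal changes outside the agreed hours.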

8.3 Communication

Before Production deployment:

Slack announcement template:

📢 Production Deployment Scheduled

🗓 Date: January 15, 2026
⏰ Time: 11:00 AM (EST)
⏱ Duration: ~15 minutes
📝 Changes:
  - API v3.2.1 → v3.2.2 (bug fixes)
  - Frontend v2.1.5 → v2.1.6 (UI improvements)

🔗 Release Notes: [link]
🔗 Rollback Plan: [link]

Please report any issues to #devops-alerts

8.4 Security Considerations

# Store sensitive data as Docker secrets
secrets:
  db_password:
    external: true  # ← Created outside compose file
  api_key:
    external: true

# Never commit secrets to Git!
# Use GitLab CI/CD variables for:
# - SSH keys
# - API tokens
# - Passwords
# - Certificates
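
Because the secrets are marked external, they must already exist in the Swarm before docker stack deploy runs. The sketch below shows one way to create them idempotently on a manager node, reading the value from stdin so it never appears in shell history or in the process list. The function name create_secret_if_missing and the DB_PASSWORD variable in the usage note are assumptions for illustration.

```shell
#!/bin/bash
# Hypothetical helper: create a Docker secret from stdin unless it already exists.
# Usage: printf '%s' "$VALUE" | create_secret_if_missing NAME
create_secret_if_missing() {
    local name="$1"

    # docker secret inspect exits non-zero when the secret does not exist
    if docker secret inspect "$name" >/dev/null 2>&1 </dev/null; then
        echo "secret $name already exists, leaving it unchanged"
        return 0
    fi

    # "-" tells docker secret create to read the secret value from stdin
    docker secret create "$name" - >/dev/null && echo "secret $name created"
}
```

For example: printf '%s' "$DB_PASSWORD" | create_secret_if_missing db_password. Swarm secrets are immutable, so rotating a value means creating a secret under a new name and updating the service to use it.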

8.5 Troubleshooting Common Issues

Issue 1: Pipeline fails with "SSH connection refused"

# Solution: Verify SSH key in GitLab CI/CD variables
# Test manually:
ssh -i ~/.ssh/gitlab_rsa root@swarm-manager

Issue 2: Image pull fails from Harbor

# Solution: Check registry credentials
echo "$HARBOR_PASSWORD" | docker login harbor.company.com -u "$HARBOR_USER" --password-stdin

# Verify image exists:
docker pull harbor.company.com/company/api:v3.2.1

Issue 3: Health checks fail after deployment

# Debug: Check service logs
docker service logs app-stack_api --tail 100

# Check service status
docker service ps app-stack_api

# Manual health check
curl http://localhost:3000/health

Issue 4: Deployment stuck "pending"

# Check swarm node status
docker node ls

# Check resource availability
docker node inspect swarm-worker-1 | grep Resources -A 10

# Check for failed tasks
docker service ps app-stack_api --no-trunc

9. Success Metrics

9.1 Key Performance Indicators

Before Automation:

  • 📊 Deployment frequency: 1-2 per week
  • ⏱ Average deployment time: 30-60 minutes per environment
  • 🐛 Deployment errors: ~20% (typos, wrong tags)
  • 🔄 Rollback time: 1-2 hours (manual)
  • 📝 Audit trail: Partial (chat logs, manual notes)

After Automation (Target):

  • 📊 Deployment frequency: 5-10 per week
  • ⏱ Average deployment time: 5-10 minutes per environment
  • 🐛 Deployment errors: <2% (automated validation)
  • 🔄 Rollback time: 2-3 minutes (automatic)
  • 📝 Audit trail: Complete (Git history + GitLab logs)

9.2 Success Criteria

Week 4 Evaluation:

  • All 4 environments deployed via GitLab CI/CD
  • Zero manual SSH deployments
  • At least 5 successful Production deployments
  • At least 1 successful rollback test
  • Team can deploy without DevOps assistance
  • Complete audit trail for all deployments
  • Average deployment time < 15 minutes

10. Conclusion & Next Steps

Current State

Manual bash script deployments
No audit trail
Error-prone process
Slow rollbacks

Target State (After Implementation)

Automated GitLab CI/CD pipelines
Complete Git-based audit trail
Validated deployments with health checks
2-3 minute automatic rollbacks
Self-service for developers

Immediate Next Steps

  1. This Week:

    • Create GitLab repository structure
    • Configure CI/CD variables
    • Test SSH connectivity
  2. Next Week:

    • Implement basic pipeline
    • Test Development deployments
    • Add validation stages
  3. Week 3-4:

    • Roll out to all environments
    • Implement rollback procedures
    • Train team

Resources Needed

  • Time Investment: 2-4 weeks (1 DevOps engineer)
  • Infrastructure: GitLab Runner (existing OK)
  • Training: 2-3 hours team training session
  • Documentation: Deployment guide + runbooks

Support & Questions

For implementation assistance:


Document Version: 1.0
Last Updated: January 2026
Status: Ready for Implementation
Author: DevOps Team
Review Date: After Phase 2 completion