k3s-gitops/sandbox/description.md
2026-01-13 14:29:19 +00:00

Universal GitLab CI/CD for the COIN Deployment System

A comprehensive analysis of auto.sh and an automation strategy for 4 environments


Executive Summary

The existing COIN application deployment process was analyzed. It consists of:

  • auto.sh - the main orchestration script (600+ lines)
  • deployment.sh - a wrapper for docker compose/swarm operations
  • docker-compose.yml - a complex configuration with 15+ services

The current system uses manual bash scripts to deploy to 2 nodes (node-3, node-4) in the sandbox environment.

Goal: Build a universal GitLab CI/CD pipeline that automates deployment to 4 environments:

  • Development
  • Sandbox
  • Testing
  • Production

Feasibility: YES - the existing architecture is well suited to automation via GitLab CI/CD.

Expected results:

| Metric                  | Current process | With automation | Improvement |
|-------------------------|-----------------|-----------------|-------------|
| Deployment time         | 30-45 min       | 10-15 min       | ↓ 67%       |
| Manual steps            | 8-12            | 0-2             | ↓ 90%       |
| Environment preparation | 15 min          | 3 min           | ↓ 80%       |
| Rollback time           | 20-30 min       | 3-5 min         | ↓ 85%       |
| Error rate              | 15%             | 2%              | ↓ 87%       |
| Supported environments  | 1 (sandbox)     | 4 (all)         | +300%       |

Contents

  1. Detailed Analysis of auto.sh
  2. Analysis of deployment.sh
  3. Analysis of docker-compose.yml
  4. Universal CI/CD Architecture
  5. GitLab CI/CD Pipeline Design
  6. Environment Management
  7. Secrets Management
  8. Rollback Strategy
  9. Monitoring and Verification
  10. Implementation Plan

1. Detailed Analysis of auto.sh

1.1 Functionality Overview

auto.sh is a sophisticated orchestration script of 600+ lines that automates the COIN deployment process.

Main capabilities:

# CLI flags
--dry-run                  # Simulation without real changes
--self-test-only           # Run checks only
--node3-only               # Deploy node-3 only
--node4-only               # Deploy node-4 only
--deploy-only node3|node4  # Deploy without the prepare phase
--skip-db-check            # Skip the migration check
--skip-self-test           # Skip the self-test
--auto-yes                 # Confirm all prompts automatically
--rollback                 # Roll back to the previous version

Workflow diagram:

┌─────────────────────────────────────────────────────────────┐
│                    INPUT PARAMETERS                          │
│  • TASK_ID (41361)                                          │
│  • RELEASE_VERSION (25.22)                                  │
│  • RELEASE_TAG (2025-12-15-11eeef9e99)                     │
│  • PREVIOUS_RELEASE_VERSION (25.21)                         │
│  • PREVIOUS_RELEASE_TAG (2025-12-05-ecacdc6c25)            │
│  • EXPECTED_MIGRATION_ID (565)                              │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│                   SELF-TEST STAGE                            │
│  ✓ Check BASE_DIR exists                                    │
│  ✓ Check previous release directories                       │
│  ✓ Verify Docker contexts (node-3, node-4)                 │
│  ✓ Display configuration summary                            │
│  ✓ Interactive confirmation                                 │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│              PREPARE NODE-4 (Primary)                        │
│  1. Copy previous release directory                         │
│  2. Extract new release from Docker image                   │
│     docker run REGISTRY:TAG release | base64 -d > tar.gz   │
│  3. Extract tarball                                         │
│  4. Copy deploy.sh and docker-compose.yml                   │
│  5. Update TAG in node.env                                  │
│  6. ⚠️ MANUAL: Edit project.env                             │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│              PREPARE NODE-3 (Secondary)                      │
│  1. Copy previous node-3 release directory                  │
│  2. Copy coin directory from prepared node-4                │
│  3. Copy deploy.sh and docker-compose.yml from node-4       │
│  4. Reuse node.env and project.env from node-4              │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│              DEPLOYMENT SELECTION                            │
│  • Interactive: "Deploy node-3?" (yes/no)                  │
│  • Interactive: "Deploy node-4?" (yes/no)                  │
│  OR                                                          │
│  • --node3-only flag                                        │
│  • --node4-only flag                                        │
│  • --deploy-only node3,node4                                │
└────────────────────┬────────────────────────────────────────┘
                     │
              ┌──────┴──────┐
              ▼             ▼
    ┌──────────────┐  ┌──────────────┐
    │ Deploy Node-3│  │ Deploy Node-4│
    │              │  │              │
    │ • Switch ctx │  │ • Switch ctx │
    │ • Run deploy │  │ • Run deploy │
    │ • Verify     │  │ • Verify     │
    └──────────────┘  └──────────────┘
              │             │
              └──────┬──────┘
                     ▼
┌─────────────────────────────────────────────────────────────┐
│                   SUMMARY REPORT                             │
│  • Prepared: node-3 ✓, node-4 ✓                            │
│  • Selected: node-3 ✓, node-4 ✓                            │
│  • Deploy attempted: node-3 ✓, node-4 ✓                    │
│  • Expected DB migration ID: 565                            │
└─────────────────────────────────────────────────────────────┘

1.2 Key Functions

Function: prepare_node4()

Purpose: Prepare the primary deployment directory for node-4

prepare_node4() {
  # 1. Validation
  ensure_dir "$NODE4_PREV"     # Check the previous release exists
  ensure_dir "$BASE_DIR"       # Check the base directory exists
  
  # 2. Directory Setup
  cp -r "$NODE4_PREV" "$NODE4_NEW"  # Copy the directory structure
  cd "$NODE4_NEW"
  rm -rf "$OLD_COIN"                 # Remove the old release
  
  # 3. Extract Release from Docker
  docker run -i --rm "${REGISTRY}:${RELEASE_TAG}" release \
    | base64 -d > "$TARBALL"
  tar -xzf "$TARBALL"
  rm -f "$TARBALL"
  
  # 4. Copy Core Files
  cp "${NEW_COIN}/deploy.sh" ./
  cp "${NEW_COIN}/docker-compose.yml" ./
  
  # 5. Update Configuration
  sed -i "s/^TAG=.*/TAG=${RELEASE_TAG}/" node.env
  sed -i 's/^export TAG_/#export TAG_/' node.env
  
  # 6. Manual step (the automation blocker!)
  echo "Manual step: review and edit project.env"
  confirm "Continue after manual update?"
}

Problems for automation:

  • ⚠️ Manual editing of project.env interrupts automation
  • ⚠️ Interactive confirmation blocks the pipeline
  • ⚠️ No validation of the changes made to project.env

Solution: Use Git-based configuration management
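One way to remove the manual step is to keep project.env in Git and have the pipeline validate it instead of prompting a human. A minimal sketch, assuming a hypothetical `validate_project_env` helper (not part of auto.sh today) with the required key names supplied by the caller:

```shell
# Fail fast if a Git-managed project.env is missing required keys.
validate_project_env() {   # validate_project_env <env_file> <key>...
  local env_file="$1"; shift
  local missing=()
  for key in "$@"; do
    grep -q "^${key}=" "$env_file" || missing+=("$key")
  done
  if [ "${#missing[@]}" -gt 0 ]; then
    echo "Missing keys in ${env_file}: ${missing[*]}" >&2
    return 1
  fi
  echo "project.env OK"
}
```

In a pipeline, this would run during the prepare stage, turning the interactive "Continue after manual update?" prompt into a hard pass/fail check.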

Function: prepare_node3()

Purpose: Prepare node-3 by reusing node-4 artifacts

prepare_node3() {
  # 1. Copy Previous Structure
  cp -r "$NODE3_PREV" "$NODE3_NEW"
  cd "$NODE3_NEW"
  
  # 2. Reuse Node-4 Artifacts
  cp -r "$NODE4_NEW/${NEW_COIN}" ./
  cp "${NEW_COIN}/deploy.sh" ./
  cp "${NEW_COIN}/docker-compose.yml" ./
  
  # 3. Reuse Configurations
  cp "$NODE4_NEW/node.env" ./
  cp "$NODE4_NEW/project.env" ./
  
  # ✓ No manual steps needed!
}

Advantages:

  • Fully automatable
  • Reuses configurations that are already prepared
  • Guarantees that node-3 and node-4 are identical

Function: deploy_node3() / deploy_node4()

Purpose: Perform the actual deployment via the deployment.sh wrapper

deploy_node3() {
  cd "$NODE3_NEW"
  docker context use "$NODE3_CONTEXT"
  
  ./deploy.sh deploy \
    -n "$NODE3_CONTEXT" \          # Docker context
    -w "$NODE3_STACK" \             # Stack name (sbxapp3)
    -N node.env \                   # Node settings
    -P project.env \                # Project settings
    -P project_node3.env \          # Node-specific settings
    -f docker-compose.yml \         # Main compose
    -f custom.secrets.yml \         # Secrets
    -f docker-compose-testshop.yaml \ # Additional services
    -s secrets.override.env \       # Secret overrides
    -u                              # Update images from registry
  
  docker ps  # Verification
}

deployment.sh parameters:

  • -n: Docker context name
  • -w: Swarm stack name
  • -N: Node environment file (multivalue)
  • -P: Project environment file (multivalue)
  • -f: Docker compose file (multivalue)
  • -s: Secrets override file
  • -u: Pull images from registry

Function: rollback()

Purpose: Roll back to the previous version

rollback() {
  # 1. Confirmation
  confirm "⚠ Stop stacks and revert to previous release?"
  
  # 2. Stop Current Stacks
  docker context use "$NODE3_CONTEXT"
  docker stack rm "$NODE3_STACK"
  sleep 3
  
  docker context use "$NODE4_CONTEXT"
  docker stack rm "$NODE4_STACK"
  sleep 3
  
  # 3. Deploy Previous Version (Node-3)
  cd "$NODE3_PREV"
  docker context use "$NODE3_CONTEXT"
  ./deploy.sh deploy [parameters...]
  
  # 4. Deploy Previous Version (Node-4)
  cd "$NODE4_PREV"
  docker context use "$NODE4_CONTEXT"
  ./deploy.sh deploy [parameters...]
  
  echo "ROLLBACK COMPLETED"
  echo "Now running: ${PREVIOUS_RELEASE_VERSION}"
}

Rollback characteristics:

  • Removes the current stacks completely
  • Uses the preserved previous release directories
  • Deployment process identical to a normal deploy
  • ⚠️ Depends on the previous directories still existing
  • ⚠️ No verification after rollback

Function: self_test()

Purpose: Pre-deployment validation

self_test() {
  local issues=()
  
  # Check Directories
  [ -d "$BASE_DIR" ] || issues+=("BASE_DIR missing")
  [ -d "$NODE4_PREV" ] || issues+=("Previous node-4 missing")
  [ -d "$NODE3_PREV" ] || issues+=("Previous node-3 missing")
  
  # Check Docker Contexts
  docker context ls | grep -q "$NODE3_CONTEXT" || \
    issues+=("Node-3 context not found")
  docker context ls | grep -q "$NODE4_CONTEXT" || \
    issues+=("Node-4 context not found")
  
  # Display Configuration Summary
  echo "Release version : ${RELEASE_VERSION}"
  echo "Release tag     : ${RELEASE_TAG}"
  echo "Previous version: ${PREVIOUS_RELEASE_VERSION}"
  echo "Task ID         : ${TASK_ID}"
  echo "Expected MIG ID : ${EXPECTED_MIGRATION_ID}"
  
  # Handle Issues
  if [ "${#issues[@]}" -gt 0 ]; then
    for issue in "${issues[@]}"; do
      echo "- $issue"
    done
    confirm "⚠ Continue despite issues?"
  fi
}

Checks performed:

  • Filesystem structure
  • Docker context availability
  • Configuration display

Missing checks:

  • Docker registry connectivity
  • Image existence
  • Database connectivity
  • Disk space
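The absent checks could be added as small pre-flight functions. A sketch, assuming `docker` and the PostgreSQL client tools are available where the checks run; none of these functions exist in auto.sh today:

```shell
check_disk_space() {     # check_disk_space <path> <min_free_mb>
  local free_mb
  free_mb=$(df -Pm "$1" | awk 'NR==2 {print $4}')
  [ "$free_mb" -ge "$2" ]
}

check_image_exists() {   # assumes registry credentials are present on the host
  docker manifest inspect "${REGISTRY}:${RELEASE_TAG}" > /dev/null
}

check_db_connectivity() {
  pg_isready -h "${DB_HOST}" -p "${DB_PORT}" -U "${DB_USER}"
}
```

Each check returns a nonzero exit code on failure, so they can feed the same `issues+=()` accumulation that self_test() already uses.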

1.3 Configuration Variables

Hardcoded Configuration:

# Base Directory
BASE_DIR="/home/dev-wltsbx/encrypted/sandbox"

# Docker Registry
REGISTRY="wlt-sbx-hb-int.wltsbxinner.walletto.eu/coin/release"

# Docker Contexts
NODE3_CONTEXT="wlt-sbx-dkapp3-ams"      # tcp://10.95.81.131:2376
NODE4_CONTEXT="wlt-sbx-dkapp4-ams"      # tcp://10.95.81.132:2376

# Docker Stacks
NODE3_STACK="sbxapp3"
NODE4_STACK="sbxapp4"

# Database (placeholders)
DB_HOST="${DB_HOST:-YOUR_DB_HOST}"
DB_PORT="${DB_PORT:-5432}"
DB_NAME="${DB_NAME:-coin}"
DB_USER="${DB_USER:-coin}"
DB_PASSWORD="${DB_PASSWORD:-YOUR_DB_PASSWORD}"

Release-specific Variables (user input):

TASK_ID="41361"                              # Jira/Trello task
RELEASE_VERSION="25.22"                       # Semantic version
RELEASE_TAG="2025-12-15-11eeef9e99"          # Docker tag
PREVIOUS_RELEASE_VERSION="25.21"
PREVIOUS_RELEASE_TAG="2025-12-05-ecacdc6c25"
EXPECTED_MIGRATION_ID="565"                   # DB migration check

Derived Paths:

NEW_SUFFIX="_sbx_${RELEASE_TAG}"
PREV_SUFFIX="_sbx_${PREVIOUS_RELEASE_TAG}"

# Result:
# NODE4_NEW="/home/dev-wltsbx/encrypted/sandbox/25.22_sbx_2025-12-15-11eeef9e99-node-4"
# NODE3_NEW="/home/dev-wltsbx/encrypted/sandbox/25.22_sbx_2025-12-15-11eeef9e99-node-3"
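The concatenation behind those results can be reproduced directly; this is a reconstruction, and the exact variable composition inside auto.sh may differ slightly:

```shell
RELEASE_VERSION="25.22"
RELEASE_TAG="2025-12-15-11eeef9e99"
BASE_DIR="/home/dev-wltsbx/encrypted/sandbox"

NEW_SUFFIX="_sbx_${RELEASE_TAG}"
NODE4_NEW="${BASE_DIR}/${RELEASE_VERSION}${NEW_SUFFIX}-node-4"
NODE3_NEW="${BASE_DIR}/${RELEASE_VERSION}${NEW_SUFFIX}-node-3"

echo "${NODE4_NEW}"
# /home/dev-wltsbx/encrypted/sandbox/25.22_sbx_2025-12-15-11eeef9e99-node-4
```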

1.4 Logging

Sophisticated Logging System:

# Log Directory
LOG_DIR="${BASE_DIR}/logs"

# Log File Naming
TIMESTAMP="$(date '+%Y-%m-%d__%H-%M-%S')"
LOGFILE="${LOG_DIR}/deploy_${RELEASE_TAG}__${TIMESTAMP}_task-${TASK_ID}.log"

# Example:
# /home/dev-wltsbx/encrypted/sandbox/logs/
#   deploy_2025-12-15-11eeef9e99__2025-12-15__14-30-00_task-41361.log

Log Message Function:

log_msg() {
  # Strip ANSI color codes for the log file
  printf "%s\n" "$(echo -e "$1" | sed 's/\x1B\[[0-9;]*[JKmsu]//g')" >> "$LOGFILE"
  
  # Print to console with colors
  echo -e "$1"
}

Usage:

log_msg "${BLUE}=== PREPARE NODE-4 ===${RESET}"
log_msg "${GREEN}✓ Node-4 prepared${RESET}"
log_msg "${RED}ERROR: directory not found${RESET}"
log_msg "${YELLOW}⚠ Manual step required${RESET}"

1.5 Status Tracking

Deployment State Flags:

# Preparation Status
PREPARED_NODE3=false
PREPARED_NODE4=false

# Selection Status
SELECTED_NODE3=false
SELECTED_NODE4=false

# Deployment Status
DEPLOY_ATTEMPT_NODE3=false
DEPLOY_ATTEMPT_NODE4=false

# Summary Report
print_summary() {
  echo "Prepared:"
  echo "  - node-4 : ${PREPARED_NODE4}"
  echo "  - node-3 : ${PREPARED_NODE3}"
  
  echo "Selected:"
  echo "  - node-3 : ${SELECTED_NODE3}"
  echo "  - node-4 : ${SELECTED_NODE4}"
  
  echo "Deploy attempted:"
  echo "  - node-3 : ${DEPLOY_ATTEMPT_NODE3}"
  echo "  - node-4 : ${DEPLOY_ATTEMPT_NODE4}"
}

Benefits:

  • Clear audit trail
  • Easy troubleshooting
  • Post-deployment analysis

1.6 Error Handling

Strict Mode:

set -euo pipefail

  • set -e: Exit on any error
  • set -u: Treat undefined variables as errors
  • set -o pipefail: Propagate failures from any command in a pipeline

Validation Functions:

ensure_dir() {
  if [ ! -d "$1" ]; then
    log_msg "${RED}ERROR: directory not found: $1${RESET}"
    exit 1
  fi
}

confirm() {
  local question="$1"
  read -r -p "${question} (yes/no): " answer
  case "$answer" in
    yes|y|Y) return 0 ;;
    *) log_msg "${RED}Operation cancelled${RESET}"; exit 1 ;;
  esac
}

Dry-Run Mode:

run() {
  log_msg "${BLUE}+ $*${RESET}"
  if [ "$DRY_RUN" != "true" ]; then
    "$@"  # Execute only if not dry-run
  fi
}

1.7 Strengths of the Current Architecture

1. Modularity

  • Clear separation of functions
  • Reusable components
  • Easy-to-follow logic flow

2. Flexibility

  • Multiple CLI flags for different scenarios
  • Support for partial deployment
  • Dry-run mode for testing

3. Safety

  • Multiple confirmation points
  • Self-test before deployment
  • Comprehensive logging
  • Error handling

4. Observability

  • Detailed logging of every operation
  • Color-coded console output
  • Status tracking
  • Summary report

5. Rollback Capability

  • Built-in rollback function
  • Preserves previous releases
  • Simple recovery process

1.8 Shortcomings for CI/CD

1. Manual Interventions

# Blocks automation
confirm "Continue after you have manually updated project.env?"
confirm "Запускать деплой node-3?"

2. Interactive Input

# Requires human input
prompt_var "TASK_ID" "41361"
prompt_var "RELEASE_VERSION" "25.22"

3. No Version Control

  • Configurations are not in Git
  • Changes are not traceable
  • No code review process

4. Limited Validation

  • No image existence check
  • No health check verification
  • No smoke tests

5. Single Environment

  • Hardcoded for sandbox
  • No support for testing/production
  • No environment promotion

2. Analysis of deployment.sh

2.1 Functionality

deployment.sh is a wrapper script for docker compose/swarm operations.

Supported Commands:

./deployment.sh COMMAND -n NODE -w STACK -N node.env -P project.env -f compose.yml

Commands:
  check  - Validate compose syntax and print config
  deploy - Deploy to Docker Swarm
  run    - Run locally without Swarm
  stop   - Stop local deployment

Key Parameters:

| Parameter | Purpose          | Example              | Required    |
|-----------|------------------|----------------------|-------------|
| -n        | Node name        | wlt-sbx-dkapp3-ams   | Optional    |
| -w        | Stack name       | sbxapp3              | For deploy  |
| -N        | Node settings    | node.env             | Multi-value |
| -P        | Project settings | project.env          | Multi-value |
| -f        | Compose file     | docker-compose.yml   | Multi-value |
| -s        | Secrets override | secrets.override.env | Optional    |
| -u        | Update images    | (flag)               | Optional    |

2.2 Environment Processing

Multi-layer Configuration Loading:

# 1. Node-specific settings
if [ -f "$NODE_NAME.env" ]; then
  . "$NODE_NAME.env"
fi

# 2. Additional node settings
for NODE_SETTING in "${NODE_SETTINGS[@]}"; do
  . $NODE_SETTING
done

# 3. Project settings (combined)
bash -c "echo '' > .project.tmp.env"
for PRODUCT_SETTING in "${PRODUCT_SETTINGS[@]}"; do
  bash -c "cat $PRODUCT_SETTING >> .project.tmp.env"
done

API-specific Environment Extraction:

# Extract CLIENT_API_* → API_*
grep ^CLIENT_API .project.tmp.env | sed 's/^CLIENT_//' > .project.client.tmp.env

# Extract ADMIN_API_* → API_*
grep ^ADMIN_API .project.tmp.env | sed 's/^ADMIN_//' > .project.admin.tmp.env

# Extract I_CLIENT_API_* → API_*
grep ^I_CLIENT_API .project.tmp.env | sed 's/^I_CLIENT_//' > .project.i_client.tmp.env

# Extract REPORT_GENERATOR_* → *
grep ^REPORT_GENERATOR .project.tmp.env | sed 's/^REPORT_GENERATOR_//' > .project.renderer.tmp.env

Purpose: Allows a single project.env to hold settings for several API services.
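The prefix-stripping transform is easy to verify in isolation; the sample values below are illustrative, not taken from a real project.env:

```shell
# Build a sample combined project env and apply the CLIENT_API_* extraction.
printf 'CLIENT_API_URL=https://client.example\nADMIN_API_URL=https://admin.example\n' > .project.tmp.env

grep ^CLIENT_API .project.tmp.env | sed 's/^CLIENT_//'
# API_URL=https://client.example

rm -f .project.tmp.env
```

The same file thus yields an `API_URL` for the client API container and, via the `ADMIN_` variant, a different `API_URL` for the admin API container.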

2.3 Docker Compose Tag Management

Dynamic TAG Variables:

# Parse TAG_* variables from compose files
IFS=$'\n' tag_vars=($(grep "TAG_" $COMPOSER | sed 's/.*\$TAG_/TAG_/'))

for tag_var in "${tag_vars[@]}"; do
  if [[ "${!tag_var}" == "" ]]; then
    eval "export $tag_var='$TAG'"  # Default to global TAG
  fi
done

Example:

# docker-compose.yml contains:
admin_api:
  image: $DOCKER_REGISTRY/core:$TAG_ADMIN_API

# Script detects TAG_ADMIN_API
# If not set, uses $TAG (global)
# Result: TAG_ADMIN_API="2025-12-15-11eeef9e99"

2.4 Secret Version Management

Secret Versioning System:

# Parse SV_* variables from compose files
IFS=$'\n' secret_vars=($(grep "SV_" $COMPOSER | sed 's/.*\$SV_/SV_/'))

for secret in "${secret_vars[@]}"; do
  if [[ "${!secret}" == "" ]]; then
    eval "export $secret='0'"  # Default version 0
  fi
done

# Load overrides from secrets.override.env
if [ -f "$SECRET_SETTINGS" ]; then
  . $SECRET_SETTINGS
fi

Usage in docker-compose.yml:

secrets:
  db_access:
    file: ./secrets/db_access
    name: db_access.$SV_db_access  # Versioned secret name

Benefits:

  • Allows secret rotation without changing the compose file
  • Multiple versions can coexist
  • Smooth transition between versions

2.5 Deployment Process

Deploy Command Flow:

if [[ "$COMMAND" == "deploy" ]]; then
  # 1. Validate stack name
  if [ "$STACK_NAME" == "" ]; then
    echo "STACK_NAME required"
    exit 1
  fi
  
  # 2. Set registry auth flag
  if [[ "$DO_UPDATE" == "yes" ]]; then
    REGISTRY_AUTH="--with-registry-auth"
  fi
  
  # 3. Check for running cron jobs (safety)
  CRON_SERVICE=$(docker service ls --filter name=${STACK_NAME}_cron)
  if [[ "$CRON_SERVICE" != "" ]]; then
    docker service scale $CRON_SERVICE=0  # Stop cron first
  fi
  
  # 4. Execute stack deploy
  docker stack deploy --prune \
    $COMPOSER_SWARM_ARGS \
    $REGISTRY_AUTH \
    $STACK_NAME
  
  # 5. Wait for service convergence
  while true; do
    services=$(docker service ls | grep $STACK_NAME)
    
    # Check if all replicas are running
    for service in "${services[@]}"; do
      replicas=(${service_status[1]})  # e.g., "2/3"
      if [ ${replicas[0]} -lt ${replicas[1]} ]; then
        is_ready=0  # Not ready yet
      fi
    done
    
    if [ $is_ready -eq 1 ]; then
      break  # All services ready
    fi
    
    sleep 5
    echo "Services: $all_services, but $bad_services not ready"
  done
  
  echo "Done."
fi

Key Features:

  • Automatic cron service handling
  • Service convergence waiting
  • Progress monitoring
  • Registry authentication support

2.6 Health Check Integration

Service Readiness Check:

# Get service status
docker service ls | grep $STACK_NAME | awk '{print $2,$4}'

# Parse replicas
# Format: "SERVICE_NAME 2/3"
# Running: 2
# Desired: 3

# Wait until Running == Desired for all services
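The Running/Desired comparison can be factored into a small helper; this is a sketch, not code taken from deployment.sh:

```shell
# Decide readiness from a Swarm REPLICAS field such as "2/3".
replicas_ready() {   # replicas_ready <running/desired>
  local running="${1%%/*}" desired="${1##*/}"
  [ "$running" -ge "$desired" ]
}
```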

Ignored Services:

re="migrate|test_setup"
if ! [[ "${service_status[0]}" =~ $re ]]; then
  # Check replicas only for non-one-time services
fi

Rationale: migrate and test_setup are one-time jobs and must not count toward the readiness check.
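The skip rule can be exercised on its own, using the same regex the script applies to the service name:

```shell
# One-time services are excluded from the readiness check.
re="migrate|test_setup"
is_one_time() { [[ "$1" =~ $re ]]; }
```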


3. Analysis of docker-compose.yml

3.1 Application Architecture

15+ Microservices:

Core API Services:
├── admin_api           (Admin panel backend)
├── admin_control_api   (Admin control panel)
├── client_api          (Client API)
├── client_individual_webapi (Individual client API)
├── bonus_client_api    (Bonus program API)
├── rtps_api            (Real-time payment system)
├── webhook_api         (Webhook handler)
└── partner_api         (Partner integration)

Frontend Services:
├── admin_web           (Admin SPA)
├── i_client_web        (Client portal SPA)
└── front_nginx         (Reverse proxy & TLS termination)

Background Jobs:
├── migrate             (Database migrations - one-time)
├── task_template       (Task executor)
├── cron_service        (Scheduler)
└── pdf-renderer        (PDF generation service)

3.2 YAML Anchors and Extensions

Reusable Configuration Blocks:

# Secret Permissions Template
x-all-secrets-perm:
  &all-secrets-perm
  uid: "1000"
  gid: "1000"
  mode: 0400

# Secrets List Template
x-secrets:
  &all-secrets
  secrets:
    - source: card_iv.txt
      target: card_iv.txt
      <<: *all-secrets-perm
    - source: db_access
      target: db_access
      <<: *all-secrets-perm
    # ... 8+ secrets

Service Template:

x-deploy:
  &deploy-settings
  deploy:
    replicas: $REPLICAS  # Dynamic from environment
    update_config:
      order: stop-first  # Stop old before starting new
    restart_policy:
      condition: on-failure

x-network:
  &network-simple
  networks:
    - issuing  # All services share a single overlay network

Usage in services:

services:
  admin_api:
    image: $DOCKER_REGISTRY/core:$TAG_ADMIN_API
    <<: [*env-settings, *network-simple, *deploy-settings, *all-secrets]
    command: /entrypoint-admin.sh

Benefits:

  • DRY (Don't Repeat Yourself)
  • Consistency across services
  • Easy maintenance

3.3 Secret Management Strategy

30+ Secrets:

secrets:
  # Encryption Keys
  card_iv.txt:
    file: ./secrets/card_iv.txt
    name: card_iv.$SV_card_iv  # Versioned!
  
  # Database Credentials
  db_access:
    file: ./secrets/db_access
    name: db_access.$SV_db_access
  
  # TLS Certificates (10+ pairs)
  server.admin.crt:
    file: ./secrets/server.admin.crt
    name: server_admin_crt.$SV_server_admin_crt
  server.admin.key:
    file: ./secrets/server.admin.key
    name: server_admin_key.$SV_server_admin_key
  
  # API Authentication
  webhook.auth:
    file: ./secrets/webhook.auth
    name: webhook.auth.$SV_webhook_auth
  
  # Email Configuration
  msmtp.conf:
    file: ./secrets/msmtp.conf
    name: msmtp.conf.$SV_msmtp_conf

Secret Version System:

# In secrets.override.env:
SV_card_iv=1
SV_db_access=2
SV_webhook_auth=1

# Result in Swarm:
# card_iv.1
# db_access.2
# webhook.auth.1

Rotation Process:

  1. Create new secret file: secrets/db_access.v2
  2. Update version: SV_db_access=2
  3. Deploy: Swarm creates db_access.2
  4. The old secret db_access.1 remains available for rollback
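Steps 1-2 of the rotation can be scripted. A sketch: `rotate_secret_version` is a hypothetical helper, and the sed pattern assumes the `SV_name=value` format shown above:

```shell
# Add or bump a secret version entry in secrets.override.env.
rotate_secret_version() {   # rotate_secret_version <name> <new_version>
  local name="$1" version="$2"
  if grep -q "^SV_${name}=" secrets.override.env; then
    sed -i "s/^SV_${name}=.*/SV_${name}=${version}/" secrets.override.env
  else
    echo "SV_${name}=${version}" >> secrets.override.env
  fi
}
```

After bumping the version, a normal deploy creates the new `db_access.2` secret while `db_access.1` stays behind for rollback.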

3.4 Service Configuration

Typical Service Pattern:

admin_api:
  image: $DOCKER_REGISTRY/core:$TAG_ADMIN_API
  command: /entrypoint-admin.sh
  
  # Environment
  <<: *env-settings  # env_file: $PROJECT_SETTINGS
  environment:
    <<: *report_generator_env
    NAMELESS_CONFIG: "/opt/project/configs/admin.conf"
  
  # Networking
  <<: *network-simple
  
  # Deployment
  <<: *deploy-settings
  
  # Secrets
  <<: *all-secrets
  
  # Health Check
  <<: *health-core
  
  # Graceful Shutdown
  <<: *graceful-timeout  # stop_grace_period: 2m

Special Configuration Patterns:

1. Multi-environment injection:

admin_web:
  image: $DOCKER_REGISTRY/internet-banking-admin:$TAG_ADMIN_WEB
  env_file:
    - $PROJECT_SETTINGS       # General settings
    - .project.admin.tmp.env  # Extracted ADMIN_API_* vars

2. Frontend Nginx:

front_nginx:
  image: $DOCKER_REGISTRY/front-web-nginx:$TAG_FRONT_NGINX
  ports:
    - "$PUBLIC_NODE_IP:5443:4443"  # HTTPS
    - "$PUBLIC_NODE_IP:5444:4444"  # WebSocket
  <<: *nginx-settings
  environment:
    FRONTEND_URL: http://admin_web:3000
    BACKEND_URL: http://admin_api:10000
    CLIENT_URL: http://client_api:10005
    # ... routing for all backend services

3. Scheduler (cron):

cron_service:
  image: $DOCKER_REGISTRY/scheduler:$TAG_CRON_SERVICE
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock  # Docker API access
  deploy:
    replicas: 1
    placement:
      constraints:
        - node.role == manager  # Only on manager nodes
  environment:
    - "SCHEDULER_EXEC_MODE=1"

3.5 Networking Architecture

Single Overlay Network:

networks:
  issuing:
    driver: overlay
    driver_opts:
      scope: swarm
    attachable: true  # Allows external containers to attach

Service Discovery:

# Any service can reach any other by name:
# http://admin_api:10000
# http://client_api:10005
# http://pdf-renderer:5000

# Swarm DNS resolves service names automatically

External Access:

# Only front_nginx is exposed externally:
front_nginx:
  ports:
    - "$PUBLIC_NODE_IP:5443:4443"
    - "$PUBLIC_NODE_IP:5444:4444"

# All other services are reachable only inside the overlay network

Benefits:

  • Security: Internal services are isolated
  • Service discovery: Automatic DNS
  • Load balancing: Swarm routing mesh
  • Flexibility: Easy scaling

3.6 Database Migration Service

One-time Migration Job:

migrate:
  image: $DOCKER_REGISTRY/core:$TAG_MIGRATE
  command: /job.sh migrate
  <<: [*env-settings, *network-simple, *deploy-settings, *all-secrets]
  healthcheck:
    test: "exit 0"  # Always healthy (one-time job)

Deployment Behavior:

  1. Swarm starts migrate service
  2. Container runs migrations
  3. Container exits
  4. Service shows as "0/1" (expected)
  5. deployment.sh ignores migrate in the readiness check

Migration Tracking:

  • Database table schema_migrations stores applied IDs
  • auto.sh expects specific EXPECTED_MIGRATION_ID
  • Manual verification after deployment
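The manual verification could be automated. A sketch: the psql query shown in the comment is an assumption about the schema_migrations layout, while the comparison helper itself is generic:

```shell
# Compare the applied migration ID against EXPECTED_MIGRATION_ID.
# Obtaining the applied ID might look like (assumed schema):
#   applied=$(psql -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -tA \
#     -c 'SELECT max(id) FROM schema_migrations')
verify_migration_id() {   # verify_migration_id <applied> <expected>
  if [ "$1" -eq "$2" ]; then
    echo "Migration check OK (id=$1)"
  else
    echo "Migration mismatch: applied=$1 expected=$2" >&2
    return 1
  fi
}
```

Wired into a pipeline verify stage, a mismatch fails the job instead of relying on a human to inspect the database.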

4. Universal CI/CD Architecture

4.1 High-Level Design

Goal: Build a single GitLab CI/CD pipeline that works for all 4 environments.

┌─────────────────────────────────────────────────────────────────┐
│                    GITLAB REPOSITORY STRUCTURE                   │
│                                                                  │
│  coin-gitops/                                                   │
│  ├── .gitlab-ci.yml              # Main pipeline               │
│  ├── .gitlab/                                                   │
│  │   ├── pipelines/                                            │
│  │   │   ├── prepare.yml         # Preparation jobs            │
│  │   │   ├── deploy.yml          # Deployment jobs             │
│  │   │   ├── verify.yml          # Verification jobs           │
│  │   │   └── rollback.yml        # Rollback jobs               │
│  │   └── scripts/                                              │
│  │       ├── prepare-release.sh                                │
│  │       ├── deploy-node.sh                                    │
│  │       └── verify-health.sh                                  │
│  │                                                              │
│  ├── environments/                                              │
│  │   ├── development/                                          │
│  │   │   ├── config.yml          # Environment metadata        │
│  │   │   ├── nodes/                                           │
│  │   │   │   ├── node1/                                       │
│  │   │   │   │   ├── docker-compose.yml                       │
│  │   │   │   │   ├── node.env                                 │
│  │   │   │   │   ├── project.env                              │
│  │   │   │   │   └── secrets.enc # SOPS encrypted             │
│  │   │   │   └── node2/                                       │
│  │   │   │       └── [same structure]                         │
│  │   │   └── common/                                          │
│  │   │       └── project.env     # Shared settings             │
│  │   │                                                         │
│  │   ├── sandbox/                                              │
│  │   │   ├── config.yml                                       │
│  │   │   ├── nodes/                                           │
│  │   │   │   ├── node3/          # wlt-sbx-dkapp3-ams        │
│  │   │   │   │   ├── docker-compose.yml                       │
│  │   │   │   │   ├── custom.secrets.yml                       │
│  │   │   │   │   ├── docker-compose-testshop.yaml             │
│  │   │   │   │   ├── node.env                                 │
│  │   │   │   │   ├── project.env                              │
│  │   │   │   │   ├── project_node3.env                        │
│  │   │   │   │   └── secrets.override.enc                     │
│  │   │   │   └── node4/          # wlt-sbx-dkapp4-ams        │
│  │   │   │       └── [same structure]                         │
│  │   │   └── common/                                          │
│  │   │                                                         │
│  │   ├── testing/                                              │
│  │   │   └── [same structure]                                 │
│  │   │                                                         │
│  │   └── production/                                           │
│  │       ├── config.yml                                       │
│  │       ├── nodes/                                           │
│  │       │   ├── prod1/                                       │
│  │       │   ├── prod2/                                       │
│  │       │   ├── prod3/                                       │
│  │       │   └── prod4/          # 4 nodes for HA              │
│  │       └── common/                                          │
│  │                                                             │
│  ├── scripts/                    # Reusable scripts            │
│  │   ├── prepare-node.sh                                      │
│  │   ├── extract-release.sh                                   │
│  │   ├── deploy-stack.sh                                      │
│  │   └── verify-migration.sh                                  │
│  │                                                             │
│  ├── templates/                  # Configuration templates     │
│  │   ├── docker-compose.base.yml                              │
│  │   ├── node.env.template                                    │
│  │   └── project.env.template                                 │
│  │                                                             │
│  └── docs/                                                     │
│      ├── deployment-guide.md                                   │
│      ├── rollback-procedure.md                                 │
│      └── troubleshooting.md                                    │
└─────────────────────────────────────────────────────────────────┘

4.2 Environment Configuration File

environments/{env}/config.yml:

# Environment Metadata
environment:
  name: sandbox
  type: non-production
  color: yellow
  
# Base Configuration
base:
  directory: /home/dev-wltsbx/encrypted/sandbox
  registry: wlt-sbx-hb-int.wltsbxinner.walletto.eu/coin/release
  
# Nodes Configuration
nodes:
  - name: node3
    context: wlt-sbx-dkapp3-ams
    endpoint: tcp://10.95.81.131:2376
    stack: sbxapp3
    role: primary
    public_ip: 10.95.81.131
    
  - name: node4
    context: wlt-sbx-dkapp4-ams
    endpoint: tcp://10.95.81.132:2376
    stack: sbxapp4
    role: secondary
    public_ip: 10.95.81.132

# Database Configuration
database:
  host: postgres-sandbox.internal
  port: 5432
  name: coin_sandbox
  user: coin

# Deployment Strategy
deployment:
  strategy: sequential  # sequential | parallel | blue-green
  order:
    - node3  # Deploy node3 first
    - node4  # Then node4
  
  health_check:
    enabled: true
    timeout: 300s
    interval: 10s
  
  migration_check:
    enabled: true
    table: schema_migrations
  
  rollback:
    enabled: true
    automatic: false  # Manual approval required

# Approval Requirements
approval:
  required: false  # Sandbox auto-deploys
  approvers: []

# Notification
notifications:
  slack:
    channel: "#deployments-sandbox"
    webhook_url_variable: SLACK_WEBHOOK_SANDBOX

environments/production/config.yml:

environment:
  name: production
  type: production
  color: red

base:
  directory: /srv/coin-production
  registry: harbor.production.company.com/coin/release

nodes:
  - name: prod1
    context: coin-prod-node1
    endpoint: tcp://prod1.internal:2376
    stack: coinprod1
    role: primary
    
  - name: prod2
    context: coin-prod-node2
    endpoint: tcp://prod2.internal:2376
    stack: coinprod2
    role: primary
    
  - name: prod3
    context: coin-prod-node3
    endpoint: tcp://prod3.internal:2376
    stack: coinprod3
    role: secondary
    
  - name: prod4
    context: coin-prod-node4
    endpoint: tcp://prod4.internal:2376
    stack: coinprod4
    role: secondary

deployment:
  strategy: blue-green  # High availability
  health_check:
    enabled: true
    timeout: 600s
  migration_check:
    enabled: true
  rollback:
    enabled: true
    automatic: true  # Auto-rollback on failure

approval:
  required: true
  approvers:
    - DevOps Lead
    - CTO
  change_advisory_board: true

notifications:
  slack:
    channel: "#production-deployments"
  email:
    - ops-team@company.com
    - leadership@company.com
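The `health_check` settings above (a total timeout plus a poll interval) map naturally onto a polling loop in the deployment job. A minimal Python sketch, where the `check` callable is an assumption standing in for e.g. a `docker service ps` query:

```python
import time

def wait_healthy(check, timeout_s=300, interval_s=10, now=time.monotonic, sleep=time.sleep):
    """Poll check() until it returns True or timeout_s elapses.

    Mirrors the config.yml health_check settings (timeout: 300s, interval: 10s).
    `now` and `sleep` are injectable so the loop can be tested without waiting.
    """
    deadline = now() + timeout_s
    while now() < deadline:
        if check():
            return True
        sleep(interval_s)
    return False
```

In the real pipeline, `check()` would wrap a command comparing desired vs. running replicas for the stack being deployed.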

4.3 Universal Pipeline Logic

Dynamic Environment Loading:

# .gitlab-ci.yml
variables:
  ENVIRONMENT: "sandbox"  # Default, can be overridden
  
before_script:
  - |
    # Load environment configuration
    export ENV_CONFIG="environments/${ENVIRONMENT}/config.yml"
    
    if [ ! -f "$ENV_CONFIG" ]; then
      echo "Environment config not found: $ENV_CONFIG"
      exit 1
    fi
    
    # Parse YAML into environment variables (using yq, as elsewhere in the pipeline;
    # an inline python3 -c heredoc would break on the YAML block indentation)
    export ENV_NAME=$(yq eval '.environment.name' "$ENV_CONFIG")
    export ENV_TYPE=$(yq eval '.environment.type' "$ENV_CONFIG")
    export BASE_DIR=$(yq eval '.base.directory' "$ENV_CONFIG")
    export REGISTRY=$(yq eval '.base.registry' "$ENV_CONFIG")
    
    # Export node configurations
    NODE_COUNT=$(yq eval '.nodes | length' "$ENV_CONFIG")
    for idx in $(seq 0 $((NODE_COUNT - 1))); do
      export "NODE_${idx}_NAME=$(yq eval ".nodes[${idx}].name" "$ENV_CONFIG")"
      export "NODE_${idx}_CONTEXT=$(yq eval ".nodes[${idx}].context" "$ENV_CONFIG")"
      export "NODE_${idx}_STACK=$(yq eval ".nodes[${idx}].stack" "$ENV_CONFIG")"
    done

Node Iteration:

# Deploy to all nodes (compact JSON via -I=0 so each node object is one line for `read`;
# a for-loop over $(yq ...) would word-split the JSON)
yq eval -o=json -I=0 '.nodes[]' "$ENV_CONFIG" | while read -r NODE_CONFIG; do
  NODE_NAME=$(echo "$NODE_CONFIG" | jq -r '.name')
  NODE_CONTEXT=$(echo "$NODE_CONFIG" | jq -r '.context')
  NODE_STACK=$(echo "$NODE_CONFIG" | jq -r '.stack')
  
  echo "Deploying to ${NODE_NAME}..."
  
  .gitlab/scripts/deploy-node.sh \
    --environment "$ENVIRONMENT" \
    --node "$NODE_NAME" \
    --context "$NODE_CONTEXT" \
    --stack "$NODE_STACK" \
    --release-tag "$RELEASE_TAG"
done

5. GitLab CI/CD Pipeline Design

5.1 Main Pipeline Structure

.gitlab-ci.yml:

# COIN Universal Deployment Pipeline
# Supports: development, sandbox, testing, production

stages:
  - validate
  - prepare
  - deploy
  - verify
  - notify

# Global Variables
variables:
  ENVIRONMENT: "${CI_ENVIRONMENT_NAME}"  # From GitLab environment
  RELEASE_TAG: "${CI_COMMIT_TAG}"
  TASK_ID: "${CI_MERGE_REQUEST_IID}"
  
# Include modular pipelines
include:
  - local: '.gitlab/pipelines/prepare.yml'
  - local: '.gitlab/pipelines/deploy.yml'
  - local: '.gitlab/pipelines/verify.yml'
  - local: '.gitlab/pipelines/rollback.yml'

# Workflow Rules
workflow:
  rules:
    # Production: tags only
    - if: '$CI_COMMIT_TAG =~ /^\d{4}-\d{2}-\d{2}-[a-f0-9]{10}$/ && $ENVIRONMENT == "production"'
      variables:
        DEPLOY_TYPE: "production-release"
    
    # Testing: manual trigger or tags
    - if: '$CI_COMMIT_TAG && $ENVIRONMENT == "testing"'
      variables:
        DEPLOY_TYPE: "testing-release"
    
    # Sandbox: auto-deploy on master
    - if: '$CI_COMMIT_BRANCH == "master" && $ENVIRONMENT == "sandbox"'
      variables:
        DEPLOY_TYPE: "sandbox-continuous"
    
    # Development: auto-deploy on any push
    - if: '$CI_COMMIT_BRANCH && $ENVIRONMENT == "development"'
      variables:
        DEPLOY_TYPE: "dev-continuous"

# Default configuration
default:
  tags:
    - coin-deployment-runner
  retry:
    max: 2
    when:
      - runner_system_failure
      - stuck_or_timeout_failure

5.2 Validate Stage

.gitlab/pipelines/validate.yml:

# ===============================================
# VALIDATION STAGE
# Pre-deployment checks
# ===============================================

load_environment_config:
  stage: validate
  script:
    - echo "Loading configuration for: ${ENVIRONMENT}"
    - ENV_CONFIG="environments/${ENVIRONMENT}/config.yml"
    
    - |
      if [ ! -f "$ENV_CONFIG" ]; then
        echo "❌ Environment config not found: $ENV_CONFIG"
        exit 1
      fi
    
    # Validate YAML syntax
    - python3 -c "import yaml; yaml.safe_load(open('${ENV_CONFIG}'))"
    - echo "✅ Environment configuration valid"
    
    # Export to artifacts
    - cat $ENV_CONFIG > env_config.yml
  
  artifacts:
    paths:
      - env_config.yml
    expire_in: 1 hour

validate_release_tag:
  stage: validate
  script:
    - echo "Validating release tag: ${RELEASE_TAG}"
    
    # Check tag format: YYYY-MM-DD-<hash>
    - |
      if ! echo "$RELEASE_TAG" | grep -qE '^[0-9]{4}-[0-9]{2}-[0-9]{2}-[a-f0-9]{10}$'; then
        echo "❌ Invalid release tag format: $RELEASE_TAG"
        echo "Expected format: YYYY-MM-DD-<10-char-hash>"
        exit 1
      fi
    
    - echo "✅ Release tag format valid"

check_image_availability:
  stage: validate
  script:
    - echo "Checking Docker image availability..."
    - ENV_CONFIG="environments/${ENVIRONMENT}/config.yml"
    - REGISTRY=$(yq eval '.base.registry' $ENV_CONFIG)
    - IMAGE="${REGISTRY}:${RELEASE_TAG}"
    
    # Login to registry
    - echo "$HARBOR_PASSWORD" | docker login -u "$HARBOR_USER" --password-stdin $(echo $REGISTRY | cut -d'/' -f1)
    
    # Check image exists
    - docker manifest inspect "${IMAGE}" > /dev/null 2>&1
    - echo "✅ Image exists: ${IMAGE}"
    
    # Check vulnerability scan
    - |
      SCAN_STATUS=$(curl -s -u "$HARBOR_USER:$HARBOR_PASSWORD" \
        "https://$(echo $REGISTRY | cut -d'/' -f1)/api/v2.0/projects/coin/repositories/release/artifacts/${RELEASE_TAG}/additions/vulnerabilities" \
        | jq -r '.scan_overview.severity // "unknown"')
      
      echo "Vulnerability scan status: $SCAN_STATUS"
      
      if [ "$SCAN_STATUS" == "Critical" ]; then
        echo "⚠️  Critical vulnerabilities found!"
        echo "Deployment blocked for production"
        
        if [ "$ENVIRONMENT" == "production" ]; then
          exit 1
        fi
      fi
    
    - echo "✅ Image security check passed"

validate_docker_contexts:
  stage: validate
  script:
    - echo "Validating Docker contexts..."
    - ENV_CONFIG="environments/${ENVIRONMENT}/config.yml"
    
    # Check each node context
    - |
      yq eval -o=json -I=0 '.nodes[]' "$ENV_CONFIG" | while read -r node; do
        CONTEXT=$(echo $node | jq -r '.context')
        ENDPOINT=$(echo $node | jq -r '.endpoint')
        
        echo "Checking context: $CONTEXT ($ENDPOINT)"
        
        # Verify context exists
        if ! docker context ls --format '{{.Name}}' | grep -q "^${CONTEXT}$"; then
          echo "❌ Context not found: $CONTEXT"
          exit 1
        fi
        
        # Test connectivity
        docker --context $CONTEXT node ls > /dev/null 2>&1
        if [ $? -eq 0 ]; then
          echo "✅ Context accessible: $CONTEXT"
        else
          echo "❌ Cannot connect to context: $CONTEXT"
          exit 1
        fi
      done

check_database_connectivity:
  stage: validate
  script:
    - echo "Checking database connectivity..."
    - ENV_CONFIG="environments/${ENVIRONMENT}/config.yml"
    
    - DB_HOST=$(yq eval '.database.host' $ENV_CONFIG)
    - DB_PORT=$(yq eval '.database.port' $ENV_CONFIG)
    - DB_NAME=$(yq eval '.database.name' $ENV_CONFIG)
    - DB_USER=$(yq eval '.database.user' $ENV_CONFIG)
    
    - echo "Database: ${DB_USER}@${DB_HOST}:${DB_PORT}/${DB_NAME}"
    
    # Test connection
    - |
      PGPASSWORD="${DB_PASSWORD}" psql \
        -h "${DB_HOST}" \
        -p "${DB_PORT}" \
        -U "${DB_USER}" \
        -d "${DB_NAME}" \
        -c "SELECT 1;" > /dev/null
    
    - echo "✅ Database connection successful"

5.3 Prepare Stage

.gitlab/pipelines/prepare.yml:

# ===============================================
# PREPARATION STAGE
# Prepare deployment directories and artifacts
# ===============================================

prepare_release_directories:
  stage: prepare
  needs:
    - load_environment_config
  script:
    - echo "Preparing release directories..."
    - ENV_CONFIG="env_config.yml"
    - BASE_DIR=$(yq eval '.base.directory' $ENV_CONFIG)
    - REGISTRY=$(yq eval '.base.registry' $ENV_CONFIG)
    
    # Extract release from Docker image
    - echo "Extracting release archive..."
    - IMAGE="${REGISTRY}:${RELEASE_TAG}"
    - docker run -i --rm "${IMAGE}" release | base64 -d > release.tar.gz
    - tar -xzf release.tar.gz
    - rm release.tar.gz
    
    - RELEASE_DIR="coin-${RELEASE_TAG}"
    - echo "Release extracted to: $RELEASE_DIR"
    
    # Prepare for each node
    - |
      yq eval -o=json -I=0 '.nodes[]' "$ENV_CONFIG" | while read -r node; do
        NODE_NAME=$(echo $node | jq -r '.name')
        echo "Preparing node: $NODE_NAME"
        
        TARGET_DIR="${BASE_DIR}/${RELEASE_TAG}-${NODE_NAME}"
        mkdir -p "$TARGET_DIR"
        
        # Copy release files
        cp -r "$RELEASE_DIR"/* "$TARGET_DIR/"
        
        # Copy node-specific configuration
        cp "environments/${ENVIRONMENT}/nodes/${NODE_NAME}"/* "$TARGET_DIR/"
        
        # Decrypt secrets
        sops -d "environments/${ENVIRONMENT}/nodes/${NODE_NAME}/secrets.override.enc" \
          > "$TARGET_DIR/secrets.override.env"
        
        # Update TAG in node.env
        sed -i "s/^TAG=.*/TAG=${RELEASE_TAG}/" "$TARGET_DIR/node.env"
        
        # Add deployment metadata
        cat >> "$TARGET_DIR/node.env" <<EOF

# Deployment Metadata (auto-generated)
DEPLOYED_BY=${CI_COMMIT_AUTHOR}
DEPLOYED_AT=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
PIPELINE_ID=${CI_PIPELINE_ID}
GIT_COMMIT=${CI_COMMIT_SHA}
ENVIRONMENT=${ENVIRONMENT}
NODE_NAME=${NODE_NAME}
EOF
        
        echo "✅ Node prepared: $NODE_NAME"
      done
  
  artifacts:
    paths:
      - coin-${RELEASE_TAG}/
    expire_in: 1 hour

generate_deployment_manifest:
  stage: prepare
  needs:
    - prepare_release_directories
  script:
    - |
      cat > deployment-manifest.json <<EOF
      {
        "release_tag": "${RELEASE_TAG}",
        "environment": "${ENVIRONMENT}",
        "task_id": "${TASK_ID}",
        "deployed_by": "${CI_COMMIT_AUTHOR}",
        "deployed_at": "$(date -u +"%Y-%m-%dT%H:%M:%SZ")",
        "pipeline_id": "${CI_PIPELINE_ID}",
        "git_commit": "${CI_COMMIT_SHA}",
        "git_branch": "${CI_COMMIT_BRANCH}",
        "nodes": $(yq eval '.nodes[].name' env_config.yml -o=json | jq -s .)
      }
      EOF
    
    - cat deployment-manifest.json
    - echo "✅ Deployment manifest generated"
  
  artifacts:
    paths:
      - deployment-manifest.json
    expire_in: 30 days

5.4 Deploy Stage

.gitlab/pipelines/deploy.yml:

# ===============================================
# DEPLOYMENT STAGE
# Deploy to nodes according to strategy
# ===============================================

.deploy_template: &deploy_template
  stage: deploy
  script:
    - echo "Deploying to ${NODE_NAME}..."
    - ENV_CONFIG="env_config.yml"
    - BASE_DIR=$(yq eval '.base.directory' $ENV_CONFIG)
    
    # Get node configuration
    - |
      NODE_CONFIG=$(yq eval ".nodes[] | select(.name == \"${NODE_NAME}\")" $ENV_CONFIG -o=json)
      NODE_CONTEXT=$(echo $NODE_CONFIG | jq -r '.context')
      NODE_STACK=$(echo $NODE_CONFIG | jq -r '.stack')
    
    - echo "Context: $NODE_CONTEXT"
    - echo "Stack: $NODE_STACK"
    
    # Navigate to deployment directory
    - TARGET_DIR="${BASE_DIR}/${RELEASE_TAG}-${NODE_NAME}"
    - cd "$TARGET_DIR"
    
    # Verify deployment.sh exists
    - |
      if [ ! -f "deployment.sh" ]; then
        echo "❌ deployment.sh not found in $TARGET_DIR"
        exit 1
      fi
    
    # Switch Docker context
    - docker context use "$NODE_CONTEXT"
    
    # Execute deployment
    - |
      ./deployment.sh deploy \
        -n "$NODE_CONTEXT" \
        -w "$NODE_STACK" \
        -N node.env \
        -P project.env \
        -P project_${NODE_NAME}.env \
        -f docker-compose.yml \
        -f custom.secrets.yml \
        -f docker-compose-testshop.yaml \
        -s secrets.override.env \
        -u
    
    # Verify services started
    - docker service ls --filter name="$NODE_STACK"
    
    - echo "✅ Deployment completed: ${NODE_NAME}"

# Dynamic node deployment jobs
# Generated based on environment config

deploy_node_primary:
  <<: *deploy_template
  variables:
    NODE_NAME: "node3"  # Will be dynamic in real implementation
  environment:
    name: ${ENVIRONMENT}/node3
    url: https://coin-node3.${ENVIRONMENT}.company.com
  when: manual  # For production, auto for dev/sandbox
  
deploy_node_secondary:
  <<: *deploy_template
  variables:
    NODE_NAME: "node4"
  environment:
    name: ${ENVIRONMENT}/node4
    url: https://coin-node4.${ENVIRONMENT}.company.com
  needs:
    - deploy_node_primary  # Sequential deployment
  when: on_success

6. Environment Management

6.1 Environment-specific Configuration Strategy

Problem: different environments have different requirements:

  • Development: 1-2 nodes, minimal resources, all features ON
  • Sandbox: 2 nodes (node3, node4), test data, some features OFF
  • Testing: 2-3 nodes, production-like, QA validation
  • Production: 4+ nodes, HA, strict security, all checks enabled

Solution: hierarchical configuration with environment-specific overrides.

Configuration Hierarchy

Base Template (shared default values)
    ↓
Environment Common (values shared within dev/sandbox/testing/prod)
    ↓
Node-Specific (individual values for each node)
    ↓
Secrets (encrypted, per-node)

Example for sandbox/node3:

# 1. Base Template
templates/project.env.template:
  DATABASE_POOL_SIZE={{DB_POOL_SIZE}}
  FEATURE_NEW_CHECKOUT={{FEATURE_NEW_CHECKOUT}}
  LOG_LEVEL={{LOG_LEVEL}}

# 2. Environment Common
environments/sandbox/common/project.env:
  DB_POOL_SIZE=10
  FEATURE_NEW_CHECKOUT=true
  LOG_LEVEL=debug

# 3. Node-Specific
environments/sandbox/nodes/node3/project_node3.env:
  NODE_NAME=node3
  PUBLIC_URL=https://coin-node3.sandbox.company.com
  MAX_WORKERS=6

# 4. Secrets
environments/sandbox/nodes/node3/secrets.override.enc:
  DATABASE_PASSWORD=encrypted...
  API_KEY=encrypted...

# Final merged configuration:
DATABASE_POOL_SIZE=10
FEATURE_NEW_CHECKOUT=true
LOG_LEVEL=debug
NODE_NAME=node3
PUBLIC_URL=https://coin-node3.sandbox.company.com
MAX_WORKERS=6
DATABASE_PASSWORD=decrypted_value
API_KEY=decrypted_value
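Conceptually, the hierarchy above is just an ordered merge where later layers win. A small Python sketch (the function name and the `placeholder` value are illustrative, not part of the tooling):

```python
def merge_env_layers(*layers):
    """Merge env-file layers in order; later layers override earlier ones."""
    merged = {}
    for layer in layers:
        merged.update(layer)
    return merged

# The layers from the sandbox/node3 example above:
common = {"DATABASE_POOL_SIZE": "10", "LOG_LEVEL": "debug", "DATABASE_PASSWORD": "placeholder"}
node = {"NODE_NAME": "node3", "MAX_WORKERS": "6"}
secrets = {"DATABASE_PASSWORD": "decrypted_value"}

final = merge_env_layers(common, node, secrets)
# DATABASE_PASSWORD is taken from the secrets layer, the last one applied
```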

6.2 Environment-specific Values Matrix

Comparison Matrix:

| Parameter | Development | Sandbox | Testing | Production |
|-----------|-------------|---------|---------|------------|
| Nodes | 1-2 | 2 (node3, node4) | 2-3 | 4+ (HA) |
| Replicas | 1 | 1-2 | 2-3 | 3-5 |
| Database Pool | 5 | 10 | 20 | 50 |
| Log Level | debug | debug | info | warning |
| Feature Flags | All ON | Most ON | Selected | Stable only |
| Health Check Timeout | 60s | 120s | 180s | 300s |
| Deployment Strategy | replace | sequential | sequential | blue-green |
| Auto-deploy | Yes | Yes | Manual | Manual + CAB |
| Rollback | Manual | Manual | Manual | Auto on failure |
| Monitoring | Basic | Standard | Enhanced | Full |
| Retention | 7 days | 14 days | 30 days | 90 days |

Implementation:

# environments/development/config.yml
deployment:
  replicas: 1
  database_pool_size: 5
  log_level: debug
  feature_flags:
    all: true
  health_check_timeout: 60s
  strategy: replace
  auto_deploy: true
  
# environments/production/config.yml
deployment:
  replicas: 3
  database_pool_size: 50
  log_level: warning
  feature_flags:
    new_checkout: true
    beta_ui: false
    experimental: false
  health_check_timeout: 300s
  strategy: blue-green
  auto_deploy: false
  approval_required: true

6.3 Docker Context Management

Current problem: contexts are hardcoded in auto.sh:

NODE3_CONTEXT="wlt-sbx-dkapp3-ams"
NODE4_CONTEXT="wlt-sbx-dkapp4-ams"

Solution: dynamic context creation on the GitLab Runner.

Docker Context Setup Script

.gitlab/scripts/setup-docker-contexts.sh:

#!/usr/bin/env bash
set -euo pipefail

# Arguments:
# $1 - ENVIRONMENT (development/sandbox/testing/production)

ENVIRONMENT=$1
ENV_CONFIG="environments/${ENVIRONMENT}/config.yml"

echo "Setting up Docker contexts for: ${ENVIRONMENT}"

# Parse nodes from config
yq eval -o=json -I=0 '.nodes[]' "$ENV_CONFIG" | while read -r node; do
  NAME=$(echo $node | jq -r '.name')
  CONTEXT=$(echo $node | jq -r '.context')
  ENDPOINT=$(echo $node | jq -r '.endpoint')
  
  echo "Creating context: $CONTEXT"
  
  # Remove existing context if present
  docker context rm "$CONTEXT" 2>/dev/null || true
  
  # Create context with TLS
  docker context create "$CONTEXT" \
    --description "COIN ${ENVIRONMENT} ${NAME}" \
    --docker "host=${ENDPOINT},ca=/certs/${ENVIRONMENT}/ca.pem,cert=/certs/${ENVIRONMENT}/cert.pem,key=/certs/${ENVIRONMENT}/key.pem"
  
  # Verify context
  if docker --context "$CONTEXT" node ls > /dev/null 2>&1; then
    echo "✅ Context verified: $CONTEXT"
  else
    echo "❌ Context verification failed: $CONTEXT"
    exit 1
  fi
done

echo "All contexts created successfully"

Usage in the pipeline:

setup_docker_contexts:
  stage: .pre
  script:
    - .gitlab/scripts/setup-docker-contexts.sh "${ENVIRONMENT}"
  cache:
    key: docker-contexts-${ENVIRONMENT}
    paths:
      - ~/.docker/contexts/

6.4 Environment Promotion Workflow

Concept: changes move through the environments sequentially.

Development → Sandbox → Testing → Production
   (auto)      (auto)    (manual)  (CAB approval)

Promotion Script:

.gitlab/scripts/promote-environment.sh:

#!/usr/bin/env bash
set -euo pipefail

# Arguments:
# $1 - FROM_ENV (development/sandbox/testing)
# $2 - TO_ENV (sandbox/testing/production)

FROM_ENV=$1
TO_ENV=$2

echo "Promoting configuration: ${FROM_ENV} -> ${TO_ENV}"

# Validation
VALID_PROMOTIONS=(
  "development:sandbox"
  "sandbox:testing"
  "testing:production"
)

PROMOTION="${FROM_ENV}:${TO_ENV}"
if [[ ! " ${VALID_PROMOTIONS[@]} " =~ " ${PROMOTION} " ]]; then
  echo "❌ Invalid promotion path: $PROMOTION"
  echo "Valid promotions:"
  for p in "${VALID_PROMOTIONS[@]}"; do
    echo "  - $p"
  done
  exit 1
fi

# Copy common configuration
echo "Copying common configuration..."
cp -r "environments/${FROM_ENV}/common/project.env" \
      "environments/${TO_ENV}/common/project.env.promoted"

# Review changes
echo "Configuration changes:"
diff "environments/${TO_ENV}/common/project.env" \
     "environments/${TO_ENV}/common/project.env.promoted" || true

# Node-specific configurations
for FROM_NODE in environments/${FROM_ENV}/nodes/*/; do
  NODE_NAME=$(basename "$FROM_NODE")
  TO_NODE="environments/${TO_ENV}/nodes/${NODE_NAME}"
  
  if [ -d "$TO_NODE" ]; then
    echo "Promoting node configuration: $NODE_NAME"
    
    # Copy non-secret files
    cp "${FROM_NODE}/docker-compose.yml" "${TO_NODE}/docker-compose.yml.promoted"
    cp "${FROM_NODE}/project_${NODE_NAME}.env" "${TO_NODE}/project_${NODE_NAME}.env.promoted"
    
    # Secrets are NOT promoted automatically - manual review required
  else
    echo "⚠️  Node ${NODE_NAME} does not exist in ${TO_ENV}"
  fi
done

echo "Promotion prepared. Review .promoted files and commit if acceptable."

GitLab Pipeline Integration:

promote_to_testing:
  stage: promote
  script:
    - .gitlab/scripts/promote-environment.sh sandbox testing
    
    # Create merge request
    - |
      git checkout -b "promote/sandbox-to-testing-${CI_COMMIT_SHORT_SHA}"
      
      # Move promoted files
      find environments/testing -name "*.promoted" | while read file; do
        mv "$file" "${file%.promoted}"
      done
      
      git add environments/testing/
      git commit -m "config: promote sandbox → testing
      
      Promoted configuration from sandbox to testing
      
      - Common project settings
      - Node-specific configurations
      - Docker compose files
      
      Refs: ${CI_COMMIT_SHA}"
      
      git push origin "promote/sandbox-to-testing-${CI_COMMIT_SHORT_SHA}"
    
    # Create MR via GitLab API
    - |
      curl -X POST "${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/merge_requests" \
        --header "PRIVATE-TOKEN: ${GITLAB_API_TOKEN}" \
        --data "source_branch=promote/sandbox-to-testing-${CI_COMMIT_SHORT_SHA}" \
        --data "target_branch=master" \
        --data "title=Promote configuration: sandbox → testing" \
        --data "description=Automated configuration promotion from sandbox to testing.
        
        ## Changes
        - Common configuration updates
        - Node-specific setting adjustments
        
        ## Review Required
        - Verify all changes are appropriate for testing environment
        - Check resource allocations
        - Validate feature flags
        
        ## Next Steps
        After merge, trigger testing deployment pipeline."
  
  when: manual
  only:
    - master

6.5 Feature Flag Management

Purpose: enable/disable features without a code deployment.

Implementation:

# environments/development/common/project.env
# Development: all features ON for testing
FEATURE_NEW_CHECKOUT=true
FEATURE_BETA_UI=true
FEATURE_ADVANCED_REPORTING=true
FEATURE_EXPERIMENTAL_PAYMENT_FLOW=true
FEATURE_AI_RECOMMENDATIONS=true

# environments/sandbox/common/project.env
# Sandbox: most features ON, some experimental features OFF
FEATURE_NEW_CHECKOUT=true
FEATURE_BETA_UI=true
FEATURE_ADVANCED_REPORTING=true
FEATURE_EXPERIMENTAL_PAYMENT_FLOW=false
FEATURE_AI_RECOMMENDATIONS=true

# environments/testing/common/project.env
# Testing: Production-like, only stable features
FEATURE_NEW_CHECKOUT=true
FEATURE_BETA_UI=false
FEATURE_ADVANCED_REPORTING=true
FEATURE_EXPERIMENTAL_PAYMENT_FLOW=false
FEATURE_AI_RECOMMENDATIONS=false

# environments/production/common/project.env
# Production: Only battle-tested features
FEATURE_NEW_CHECKOUT=true
FEATURE_BETA_UI=false
FEATURE_ADVANCED_REPORTING=true
FEATURE_EXPERIMENTAL_PAYMENT_FLOW=false
FEATURE_AI_RECOMMENDATIONS=false
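On the application side, these flags are typically read straight from the environment. A minimal sketch of such a reader (the helper name `feature_enabled` is hypothetical, not part of COIN):

```python
import os

def feature_enabled(name, env=os.environ, default=False):
    """Interpret FEATURE_* environment variables as booleans.

    Treats "true"/"1"/"yes"/"on" (case-insensitive) as enabled, so the plain
    true/false values in the project.env files above work unchanged.
    """
    raw = env.get(f"FEATURE_{name}")
    if raw is None:
        return default
    return raw.strip().lower() in ("true", "1", "yes", "on")
```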

Advanced: LaunchDarkly Integration (optional):

# For production, use LaunchDarkly for gradual rollouts
production_feature_flags:
  stage: deploy
  script:
    - |
      # Get feature flags from LaunchDarkly
      FEATURE_CONFIG=$(curl -X GET \
        "https://app.launchdarkly.com/api/v2/flags/coin-production" \
        -H "Authorization: ${LAUNCHDARKLY_API_KEY}")
      
      # Update environment variables
      echo "FEATURE_NEW_CHECKOUT=$(echo $FEATURE_CONFIG | jq -r '.flags.new_checkout.on')" >> production.env
      echo "FEATURE_BETA_UI=$(echo $FEATURE_CONFIG | jq -r '.flags.beta_ui.on')" >> production.env
  
  only:
    - tags
  environment:
    name: production

6.6 Resource Management per Environment

Development:

# Minimal resources
services:
  admin_api:
    deploy:
      replicas: 1
      resources:
        limits:
          cpus: '0.5'
          memory: 512M
        reservations:
          cpus: '0.25'
          memory: 256M

Sandbox:

# Moderate resources
services:
  admin_api:
    deploy:
      replicas: 1
      resources:
        limits:
          cpus: '1.0'
          memory: 1G
        reservations:
          cpus: '0.5'
          memory: 512M

Production:

# Full resources
services:
  admin_api:
    deploy:
      replicas: 3
      resources:
        limits:
          cpus: '2.0'
          memory: 4G
        reservations:
          cpus: '1.0'
          memory: 2G
      placement:
        constraints:
          - node.labels.env == production
        preferences:
          - spread: node.labels.zone  # Multi-AZ

7. Secrets Management

7.1 Current Secret Management Analysis

The existing setup in docker-compose.yml:

secrets:
  card_iv.txt:
    file: ./secrets/card_iv.txt
    name: card_iv.$SV_card_iv  # Versioned secret
  
  db_access:
    file: ./secrets/db_access
    name: db_access.$SV_db_access
  
  # 30+ total secrets...

Versioning via SV_* variables:

# secrets.override.env
SV_card_iv=1
SV_db_access=2
SV_webhook_auth=1

# Results in Swarm:
# card_iv.1
# card_iv.2  (new version; the old one still exists)
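
The effective Swarm secret name is simply the base name joined with its SV_* version, which is what compose interpolation of `name: card_iv.$SV_card_iv` produces. A tiny sketch of that resolution (function name is illustrative):

```python
def resolve_secret_name(base, sv_vars):
    """Resolve a 'card_iv.$SV_card_iv'-style versioned secret name.

    sv_vars is the parsed secrets.override.env mapping, e.g. {"SV_card_iv": "1"}.
    """
    return f"{base}.{sv_vars['SV_' + base]}"
```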

Problems:

  • Secrets live in plaintext on the filesystem
  • No centralized management
  • Rotation is cumbersome (30+ files)
  • No audit trail of who accessed which secret
  • Risk of leaking via Git (if accidentally committed)

7.2 Multi-Layer Secrets Architecture

Architecture:

Layer 1: GitLab CI/CD Variables (Infrastructure Credentials)
├── HARBOR_USER / HARBOR_PASSWORD
├── SSH_PRIVATE_KEY_NODE3 / SSH_PRIVATE_KEY_NODE4
├── SOPS_GPG_PRIVATE_KEY
├── DB_PASSWORD
├── SLACK_WEBHOOK_URL
└── API tokens для external services

Layer 2: SOPS Encrypted Files in Git (Application Secrets)
├── Database credentials
├── API keys (payment gateway, etc.)
├── Encryption keys
├── JWT secrets
└── Third-party service credentials

Layer 3: Docker Secrets (Runtime)
├── Mounted into containers as files (/run/secrets/)
├── Managed by Swarm
├── Versioned (card_iv.1, card_iv.2)
├── Encrypted at rest & in transit
└── Access control via service definitions

Layer 4: External Secret Manager (Optional - Enterprise)
└── HashiCorp Vault
    ├── Dynamic secrets
    ├── Automatic rotation
    ├── Detailed audit logs
    └── Policy-based access
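
At runtime (Layer 3), a service consumes these secrets by reading the files under /run/secrets/. A minimal sketch, with the base directory injectable for testing:

```python
from pathlib import Path

def read_secret(name, base="/run/secrets"):
    """Read a Docker/Swarm secret mounted as a file, stripping the trailing newline."""
    return Path(base, name).read_text().rstrip("\n")
```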

7.3 SOPS Integration

Setup:

# 1. Generate GPG keys for authorized team members
gpg --full-generate-key
# Name: DevOps Team Member
# Email: devops@company.com

# 2. Export public key
gpg --armor --export devops@company.com > devops.pub.asc

# 3. Import team keys
for key in team/*.pub.asc; do
  gpg --import "$key"
done

.sops.yaml configuration:

creation_rules:
  # Production secrets - только senior team
  - path_regex: environments/production/.*/secrets\..*\.enc$
    pgp: >-
      FBC7B9E2A4F9289AC0C1D4843D16CEE4A27381B4,
      8E2E0E4F09A5F8B9C1D2E3F4A5B6C7D8E9F0A1B2
    encrypted_regex: '^(password|secret|key|token|private_key|api_key)$'
  
  # Testing secrets - team leads + DevOps
  - path_regex: environments/testing/.*/secrets\..*\.enc$
    pgp: >-
      FBC7B9E2A4F9289AC0C1D4843D16CEE4A27381B4,
      1234567890ABCDEF1234567890ABCDEF12345678,
      ABCDEF1234567890ABCDEF1234567890ABCDEF12
    encrypted_regex: '^(password|secret|key|token)$'
  
  # Sandbox secrets - the whole DevOps team
  - path_regex: environments/sandbox/.*/secrets\..*\.enc$
    pgp: >-
      FBC7B9E2A4F9289AC0C1D4843D16CEE4A27381B4,
      1234567890ABCDEF1234567890ABCDEF12345678,
      ABCDEF1234567890ABCDEF1234567890ABCDEF12,
      9876543210FEDCBA9876543210FEDCBA98765432
    encrypted_regex: '^(password|secret|key|token)$'
  
  # Development - all developers
  - path_regex: environments/development/.*/secrets\..*\.enc$
    pgp: >-
      FBC7B9E2A4F9289AC0C1D4843D16CEE4A27381B4,
      DEV_TEAM_KEY_1,
      DEV_TEAM_KEY_2,
      DEV_TEAM_KEY_3
    encrypted_regex: '^(password|secret|key)$'

Create/Edit Encrypted Secrets:

# Create new secret file for sandbox/node3
cd coin-gitops
sops environments/sandbox/nodes/node3/secrets.override.enc

# File opens in $EDITOR as plaintext:
DATABASE_PASSWORD: "sandbox-db-password-123"
API_KEY: "sk-sandbox-api-key-456"
JWT_SECRET: "jwt-signing-secret-789"
REDIS_PASSWORD: "redis-password-abc"
PAYMENT_GATEWAY_API_KEY: "pg-api-key-def"
CARD_ENCRYPTION_KEY: "card-enc-key-ghi"

# On save, automatically encrypted by SOPS
# Safe to commit to Git
git add environments/sandbox/nodes/node3/secrets.override.enc
git commit -m "feat(secrets): add sandbox node3 secrets"

Encrypted File Format:

DATABASE_PASSWORD: ENC[AES256_GCM,data:8hT9k2mP3nQ...,iv:xyz...,tag:abc...,type:str]
API_KEY: ENC[AES256_GCM,data:mK9sL3nQ7pR...,iv:def...,tag:ghi...,type:str]
sops:
    kms: []
    pgp:
        - created_at: "2025-01-14T10:30:00Z"
          enc: |
            -----BEGIN PGP MESSAGE-----
            hQIMA...
            -----END PGP MESSAGE-----
          fp: FBC7B9E2A4F9289AC0C1D4843D16CEE4A27381B4
    version: 3.7.3

7.4 CI/CD Pipeline Secret Handling

Decryption in the pipeline:

decrypt_secrets:
  stage: prepare
  script:
    - echo "Decrypting secrets for ${ENVIRONMENT}..."
    
    # Import GPG key from GitLab CI/CD Variable
    - echo "$SOPS_GPG_PRIVATE_KEY" | base64 -d | gpg --import
    
    # Decrypt secrets for each node
    - ENV_CONFIG="environments/${ENVIRONMENT}/config.yml"
    - |
      yq eval '.nodes[].name' $ENV_CONFIG | while read NODE_NAME; do
        SECRET_FILE="environments/${ENVIRONMENT}/nodes/${NODE_NAME}/secrets.override.enc"
        OUTPUT_FILE="/tmp/secrets-${NODE_NAME}.env"
        
        if [ -f "$SECRET_FILE" ]; then
          echo "Decrypting secrets for: $NODE_NAME"
          sops -d "$SECRET_FILE" > "$OUTPUT_FILE"
          
          # Restrictive permissions
          chmod 600 "$OUTPUT_FILE"
          
          # Validate required secrets present
          for KEY in DATABASE_PASSWORD API_KEY JWT_SECRET; do
            if ! grep -q "^${KEY}:" "$OUTPUT_FILE"; then
              echo "❌ Required secret ${KEY} not found for ${NODE_NAME}"
              exit 1
            fi
          done
          
          echo "✅ Secrets decrypted: $NODE_NAME"
        else
          echo "⚠️  No secrets file for: $NODE_NAME"
        fi
      done
  
  artifacts:
    paths:
      - /tmp/secrets-*.env
    expire_in: 1 hour  # Short expiration for security
  
  after_script:
    # Cleanup decrypted secrets
    - rm -f /tmp/secrets-*.env

Convert YAML secrets to ENV format:

# secrets.override.enc (YAML format):
DATABASE_PASSWORD: "secret123"
API_KEY: "key456"

# Convert to ENV format for deployment.sh:
# yq -o=props emits "KEY = value"; sed strips the spaces around "="
cat /tmp/secrets-node3.env | yq eval -o=props | sed 's/ = /=/' > /tmp/secrets-node3.props.env

# Result:
DATABASE_PASSWORD=secret123
API_KEY=key456

7.5 Docker Secrets Creation in Swarm

Create secrets from decrypted files:

create_docker_secrets:
  stage: deploy
  needs:
    - decrypt_secrets
  script:
    - echo "Creating Docker secrets in Swarm..."
    - ENV_CONFIG="environments/${ENVIRONMENT}/config.yml"
    
    - |
      yq eval -o=json -I=0 '.nodes[]' "$ENV_CONFIG" | while read -r node; do
        NODE_NAME=$(echo $node | jq -r '.name')
        CONTEXT=$(echo $node | jq -r '.context')
        
        docker context use "$CONTEXT"
        
        # Read decrypted secrets
        SECRET_FILE="/tmp/secrets-${NODE_NAME}.env"
        
        # Use a Unix timestamp as the secret version for this deployment
        SECRET_VERSION=$(date +%s)
        
        # Create each secret in Swarm and record its version variable
        # (the SV_* line must stay inside the loop, where $key is in scope)
        while IFS=: read -r key value; do
          SECRET_NAME="${key}_v${SECRET_VERSION}"
          
          echo "$value" | docker secret create "$SECRET_NAME" - || {
            echo "⚠️  Secret ${SECRET_NAME} already exists, skipping"
          }
          
          echo "SV_${key}=${SECRET_VERSION}" >> "secret_versions_${NODE_NAME}.env"
          echo "✅ Secret created: $SECRET_NAME"
        done < <(yq eval 'to_entries | .[] | .key + ":" + .value' "$SECRET_FILE")
      done
    
    - echo "All secrets created in Swarm"
  
  artifacts:
    paths:
      - secret_versions_*.env
    expire_in: 1 day

7.6 Secret Rotation Strategy

Rotation Process:

1. Generate new secret value
2. Create new version in Swarm (e.g., db_password.3)
3. Update SV_db_password=3 in secrets.override.env
4. Deploy - services start using new version
5. Old versions (db_password.1, db_password.2) remain available for rollback
6. After grace period (7-30 days), remove old versions
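
The versioned-naming convention above (e.g. `db_password.3`) can be sketched as a small helper that derives the next version suffix from the secret names already present in Swarm; the function name is illustrative and not part of auto.sh:

```shell
#!/usr/bin/env bash
# Sketch: given a base secret name and a newline-separated list of
# existing secret names (as printed by `docker secret ls --format
# '{{.Name}}'`), print the next versioned name.
next_secret_version() {
  local base="$1" existing="$2"
  # Highest existing version for this base name, 0 if none exist yet
  local max
  max=$(printf '%s\n' "$existing" \
    | sed -n "s/^${base}\.\([0-9][0-9]*\)$/\1/p" \
    | sort -n | tail -1)
  echo "${base}.$(( ${max:-0} + 1 ))"
}
```

With versions 1 and 2 already in Swarm, `next_secret_version db_password "$(docker secret ls --format '{{.Name}}')"` would print `db_password.3`.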

Rotation Script:

.gitlab/scripts/rotate-secret.sh:

#!/usr/bin/env bash
set -euo pipefail

# Arguments:
# $1 - ENVIRONMENT
# $2 - NODE_NAME
# $3 - SECRET_NAME
# $4 - NEW_VALUE

ENVIRONMENT=$1
NODE_NAME=$2
SECRET_NAME=$3
NEW_VALUE=$4

echo "Rotating secret: ${SECRET_NAME} for ${ENVIRONMENT}/${NODE_NAME}"

# Get Docker context
ENV_CONFIG="environments/${ENVIRONMENT}/config.yml"
CONTEXT=$(yq eval ".nodes[] | select(.name == \"${NODE_NAME}\") | .context" $ENV_CONFIG)

# Get current version
SECRET_FILE="environments/${ENVIRONMENT}/nodes/${NODE_NAME}/secrets.override.enc"
CURRENT_VERSION=$(sops -d "$SECRET_FILE" | yq eval ".${SECRET_NAME}_VERSION // 0")

NEW_VERSION=$((CURRENT_VERSION + 1))

echo "Current version: $CURRENT_VERSION"
echo "New version: $NEW_VERSION"

# Create new secret in Swarm
docker context use "$CONTEXT"
echo "$NEW_VALUE" | docker secret create "${SECRET_NAME}.${NEW_VERSION}" -

# Update encrypted file
sops --set "[\"${SECRET_NAME}\"] \"${NEW_VALUE}\"" "$SECRET_FILE"
sops --set "[\"${SECRET_NAME}_VERSION\"] ${NEW_VERSION}" "$SECRET_FILE"

echo "✅ Secret rotated: ${SECRET_NAME} → version ${NEW_VERSION}"
echo ""
echo "Next steps:"
echo "1. Commit updated secrets file"
echo "2. Deploy to apply new secret"
echo "3. After grace period, remove old version:"
echo "   docker secret rm ${SECRET_NAME}.${CURRENT_VERSION}"

Automated Rotation Schedule:

rotate_production_secrets:
  stage: maintenance
  script:
    - |
      # Rotate database password every 90 days
      LAST_ROTATION=$(git log -1 --format=%ct -- environments/production/nodes/*/secrets.override.enc)
      CURRENT=$(date +%s)
      DAYS_SINCE=$((($CURRENT - $LAST_ROTATION) / 86400))
      
      if [ $DAYS_SINCE -gt 90 ]; then
        echo "Database password rotation required (${DAYS_SINCE} days since last)"
        
        # Generate new password
        NEW_PASSWORD=$(openssl rand -base64 32)
        
        # Rotate for all production nodes
        for NODE in prod1 prod2 prod3 prod4; do
          .gitlab/scripts/rotate-secret.sh production "$NODE" "DATABASE_PASSWORD" "$NEW_PASSWORD"
        done
        
        # Create MR for approval
        git checkout -b "security/rotate-db-password-$(date +%Y%m%d)"
        git add environments/production/
        git commit -m "security: rotate production database password
        
        Automated 90-day rotation of database credentials
        
        - Generated new strong password
        - Updated all production nodes
        - Old version will be removed after 30 days"
        git push
        
        # Create MR via API...
      else
        echo "Database password rotation not required (${DAYS_SINCE} days since last)"
      fi
  
  only:
    - schedules
  when: manual

7.7 Secret Access Audit

Audit Logging:

audit_secret_access:
  stage: verify
  script:
    - echo "Auditing secret access..."
    - ENV_CONFIG="environments/${ENVIRONMENT}/config.yml"
    
    - |
      yq eval '.nodes[]' $ENV_CONFIG -o=json | while read node; do
        NODE_NAME=$(echo $node | jq -r '.name')
        CONTEXT=$(echo $node | jq -r '.context')
        
        docker context use "$CONTEXT"
        
        # Get secret usage
        docker secret ls --format '{{.Name}}\t{{.CreatedAt}}\t{{.UpdatedAt}}'
        
        # Get services using secrets
        docker service ls --format '{{.Name}}' | while read service; do
          SECRETS=$(docker service inspect "$service" --format '{{range .Spec.TaskTemplate.ContainerSpec.Secrets}}{{.SecretName}} {{end}}')
          if [ -n "$SECRETS" ]; then
            echo "Service ${service} uses secrets: $SECRETS"
          fi
        done
      done > secret-audit-${ENVIRONMENT}-$(date +%Y%m%d).log
    
    - echo "✅ Audit log created"
  
  artifacts:
    paths:
      - secret-audit-*.log
    expire_in: 1 year
  
  only:
    - schedules

8. Rollback Strategy

8.1 Current Rollback Mechanism Analysis

The existing rollback function in auto.sh:

rollback() {
  # 1. Stop current stacks
  docker stack rm "$NODE3_STACK"
  docker stack rm "$NODE4_STACK"
  sleep 3
  
  # 2. Deploy previous version
  cd "$NODE3_PREV"
  ./deploy.sh deploy [params...]
  
  cd "$NODE4_PREV"
  ./deploy.sh deploy [params...]
}

Problems:

  • ⚠️ Depends on the previous-version directories still existing
  • ⚠️ No verification after rollback
  • ⚠️ Manual trigger only
  • ⚠️ Full stack removal (downtime)
  • ⚠️ No partial rollback (all-or-nothing only)

8.2 Improved Rollback Architecture

Multi-Level Rollback Strategy:

Level 1: Service-Level Rollback (fastest, 1-2 minutes)
├── Revert single service to previous version
├── Keep other services running
├── Minimal impact
└── Use: bug in a single service

Level 2: Stack-Level Rollback (medium, 3-5 minutes)
├── Revert entire stack (all services)
├── Coordinated rollback
├── Moderate impact
└── Use: multiple services affected

Level 3: Infrastructure Rollback (slowest, 5-10 minutes)
├── Revert configuration changes
├── Revert database migrations (if safe)
├── Full environment restore
└── Use: critical infrastructure issues
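
The level selection can be encoded as a simple dispatcher; the decision rules below are illustrative assumptions, not logic taken from auto.sh:

```shell
#!/usr/bin/env bash
# Sketch: pick a rollback level from the blast radius of a failure.
# $1 - number of affected services
# $2 - "yes" when configuration or migrations also changed
choose_rollback_level() {
  local affected="$1" infra_changed="$2"
  if [ "$infra_changed" = "yes" ]; then
    echo "infrastructure"   # Level 3: config/migration restore
  elif [ "$affected" -le 1 ]; then
    echo "service"          # Level 1: single-service revert
  else
    echo "stack"            # Level 2: coordinated stack redeploy
  fi
}
```

A pipeline job could call this after health checks to decide which of the rollback jobs below to trigger.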

8.3 GitLab Pipeline Rollback Jobs

.gitlab/pipelines/rollback.yml:

# ===============================================
# ROLLBACK PIPELINE
# Multi-level rollback strategy
# ===============================================

.rollback_preparation: &rollback_preparation
  before_script:
    - echo "Preparing rollback for ${ENVIRONMENT}..."
    - ENV_CONFIG="environments/${ENVIRONMENT}/config.yml"
    
    # Get previous stable version from Git
    - |
      PREVIOUS_TAG=$(git describe --tags --abbrev=0 HEAD~1)
      echo "Current: ${RELEASE_TAG}"
      echo "Previous: ${PREVIOUS_TAG}"
      echo "PREVIOUS_TAG=${PREVIOUS_TAG}" >> rollback.env
  
  artifacts:
    reports:
      dotenv: rollback.env
    expire_in: 1 hour

rollback_service:
  stage: rollback
  <<: *rollback_preparation
  script:
    - echo "Rolling back service: ${SERVICE_NAME}"
    - NODE_NAME="${NODE_NAME}"
    - ENV_CONFIG="environments/${ENVIRONMENT}/config.yml"
    
    # Get node configuration
    - |
      NODE_CONFIG=$(yq eval ".nodes[] | select(.name == \"${NODE_NAME}\")" $ENV_CONFIG -o=json)
      CONTEXT=$(echo $NODE_CONFIG | jq -r '.context')
      STACK=$(echo $NODE_CONFIG | jq -r '.stack')
    
    # Get previous image tag
    - PREVIOUS_IMAGE="${REGISTRY}/${SERVICE_NAME}:${PREVIOUS_TAG}"
    
    - echo "Rolling back ${SERVICE_NAME} to ${PREVIOUS_TAG}"
    
    # Update service image
    - docker context use "$CONTEXT"
    - |
      docker service update \
        --image "$PREVIOUS_IMAGE" \
        --update-failure-action rollback \
        "${STACK}_${SERVICE_NAME}"
    
    # Wait for service update
    - sleep 30
    
    # Verify service health
    - |
      REPLICAS=$(docker service ls --filter name="${STACK}_${SERVICE_NAME}" --format '{{.Replicas}}')
      echo "Service replicas: $REPLICAS"
      
      if [[ "$REPLICAS" != *"/"* ]]; then
        echo "❌ Service rollback failed"
        exit 1
      fi
      
      RUNNING=$(echo $REPLICAS | cut -d'/' -f1)
      DESIRED=$(echo $REPLICAS | cut -d'/' -f2)
      
      if [ "$RUNNING" -ne "$DESIRED" ]; then
        echo "❌ Service not fully rolled back: $RUNNING/$DESIRED"
        exit 1
      fi
    
    - echo "✅ Service rolled back successfully: ${SERVICE_NAME}"
  
  variables:
    SERVICE_NAME: ""  # Must be provided
    NODE_NAME: ""     # Must be provided
  
  when: manual
  allow_failure: false
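
The replica check in the job above can be factored into a reusable helper that parses the `running/desired` column printed by `docker service ls` (a sketch; the helper name is ours):

```shell
#!/usr/bin/env bash
# Sketch: return 0 when a Swarm replica string like "3/3" shows the
# service fully converged, non-zero otherwise.
replicas_converged() {
  local replicas="$1" running desired
  case "$replicas" in
    */*) ;;          # expected "running/desired" form
    *)   return 1 ;; # anything else counts as not converged
  esac
  running=${replicas%%/*}
  desired=${replicas##*/}
  # Swarm may append " (max N per node)" -- strip the annotation
  desired=${desired%% *}
  [ "$running" -eq "$desired" ]
}
```

Used as `replicas_converged "$REPLICAS" || exit 1`, it replaces the inline `cut`-based parsing in both the service and stack rollback jobs.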

rollback_stack:
  stage: rollback
  <<: *rollback_preparation
  script:
    - echo "Rolling back entire stack: ${NODE_NAME}"
    - ENV_CONFIG="environments/${ENVIRONMENT}/config.yml"
    
    # Get node configuration
    - |
      NODE_CONFIG=$(yq eval ".nodes[] | select(.name == \"${NODE_NAME}\")" $ENV_CONFIG -o=json)
      CONTEXT=$(echo $NODE_CONFIG | jq -r '.context')
      STACK=$(echo $NODE_CONFIG | jq -r '.stack')
      BASE_DIR=$(yq eval '.base.directory' $ENV_CONFIG)
    
    - echo "Context: $CONTEXT"
    - echo "Stack: $STACK"
    - echo "Previous version: $PREVIOUS_TAG"
    
    # Check previous version directory exists
    - PREV_DIR="${BASE_DIR}/${PREVIOUS_TAG}-${NODE_NAME}"
    - |
      if [ ! -d "$PREV_DIR" ]; then
        echo "❌ Previous version directory not found: $PREV_DIR"
        echo "Available versions:"
        ls -la "$BASE_DIR" | grep "$NODE_NAME"
        exit 1
      fi
    
    - echo "✅ Previous version found: $PREV_DIR"
    
    # Stop current stack (gracefully)
    - docker context use "$CONTEXT"
    - echo "Stopping current stack..."
    - docker stack rm "$STACK" || echo "Stack already removed"
    
    # Wait for stack to fully stop
    - sleep 10
    - |
      while docker service ls | grep -q "$STACK"; do
        echo "Waiting for services to stop..."
        sleep 5
      done
    
    - echo "✅ Stack stopped"
    
    # Deploy previous version
    - cd "$PREV_DIR"
    - echo "Deploying previous version from: $(pwd)"
    
    - |
      ./deployment.sh deploy \
        -n "$CONTEXT" \
        -w "$STACK" \
        -N node.env \
        -P project.env \
        -P project_${NODE_NAME}.env \
        -f docker-compose.yml \
        -f custom.secrets.yml \
        -f docker-compose-testshop.yaml \
        -s secrets.override.env \
        -u
    
    # Verify deployment
    - sleep 30
    - docker service ls --filter name="$STACK"
    
    - |
      SERVICE_COUNT=$(docker service ls --filter name="$STACK" --format '{{.Name}}' | wc -l)
      if [ "$SERVICE_COUNT" -lt 5 ]; then
        echo "❌ Rollback incomplete: only $SERVICE_COUNT services running"
        exit 1
      fi
    
    - echo "✅ Stack rolled back successfully: ${NODE_NAME}"
  
  variables:
    NODE_NAME: ""  # Must be provided
  
  when: manual
  allow_failure: false
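
The stack-stop wait in the job above loops forever if a service never terminates. A bounded variant (the 120-second default is an assumption, not a value from auto.sh) could look like:

```shell
#!/usr/bin/env bash
# Sketch: wait until a stack's services are gone, with an upper bound.
# In the pipeline this lister shells out to Docker:
list_stack_services() {
  docker service ls --filter name="${1}_" --format '{{.Name}}' 2>/dev/null
}

wait_for_stack_removal() {
  local stack="$1" timeout="${2:-120}" elapsed=0
  while list_stack_services "$stack" | grep -q .; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "Timed out waiting for stack removal: $stack" >&2
      return 1
    fi
    sleep 5
    elapsed=$((elapsed + 5))
  done
  echo "✅ Stack removed: $stack"
}
```

Calling `wait_for_stack_removal "$STACK" 180` in place of the open-ended loop fails the job with a clear message instead of hanging the runner.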

rollback_all_nodes:
  stage: rollback
  <<: *rollback_preparation
  script:
    - echo "Rolling back all nodes in ${ENVIRONMENT}"
    - ENV_CONFIG="environments/${ENVIRONMENT}/config.yml"
    - BASE_DIR=$(yq eval '.base.directory' $ENV_CONFIG)
    
    # Rollback each node sequentially
    - |
      yq eval '.nodes[]' $ENV_CONFIG -o=json | while read node; do
        NODE_NAME=$(echo $node | jq -r '.name')
        CONTEXT=$(echo $node | jq -r '.context')
        STACK=$(echo $node | jq -r '.stack')
        
        echo "========================================="
        echo "Rolling back node: $NODE_NAME"
        echo "========================================="
        
        PREV_DIR="${BASE_DIR}/${PREVIOUS_TAG}-${NODE_NAME}"
        
        if [ ! -d "$PREV_DIR" ]; then
          echo "❌ Previous version not found for: $NODE_NAME"
          continue
        fi
        
        # Stop and redeploy
        docker context use "$CONTEXT"
        docker stack rm "$STACK" || true
        sleep 10
        
        cd "$PREV_DIR"
        ./deployment.sh deploy \
          -n "$CONTEXT" \
          -w "$STACK" \
          -N node.env \
          -P project.env \
          -P project_${NODE_NAME}.env \
          -f docker-compose.yml \
          -f custom.secrets.yml \
          -f docker-compose-testshop.yaml \
          -s secrets.override.env \
          -u
        
        echo "✅ Node rolled back: $NODE_NAME"
      done
    
    - echo "✅ All nodes rolled back successfully"
  
  when: manual
  allow_failure: false
  environment:
    name: ${ENVIRONMENT}
    action: rollback

8.4 Automatic Rollback Triggers

Health Check Based Auto-Rollback:

verify_deployment_health:
  stage: verify
  script:
    - echo "Monitoring deployment health..."
    - ENV_CONFIG="environments/${ENVIRONMENT}/config.yml"
    - HEALTH_CHECK_TIMEOUT=$(yq eval '.deployment.health_check.timeout' $ENV_CONFIG | sed 's/s//')
    - HEALTH_CHECK_INTERVAL=$(yq eval '.deployment.health_check.interval' $ENV_CONFIG | sed 's/s//')
    
    - START_TIME=$(date +%s)
    - FAILURES=0
    - MAX_FAILURES=3
    
    - |
      while true; do
        CURRENT_TIME=$(date +%s)
        ELAPSED=$((CURRENT_TIME - START_TIME))
        
        if [ $ELAPSED -gt $HEALTH_CHECK_TIMEOUT ]; then
          echo "❌ Health check timeout reached"
          FAILURES=$MAX_FAILURES  # force the rollback path below
        fi
        
        # Check all nodes. Process substitution (not a pipe into
        # `while`) keeps ALL_HEALTHY/FAILURES updates visible in the
        # current shell instead of a throwaway subshell.
        ALL_HEALTHY=true
        while read -r node; do
          NODE_NAME=$(echo "$node" | jq -r '.name')
          CONTEXT=$(echo "$node" | jq -r '.context')
          STACK=$(echo "$node" | jq -r '.stack')
          
          docker context use "$CONTEXT"
          
          # Count services whose "running/desired" replicas do not match
          UNHEALTHY=$(docker service ls --filter name="$STACK" --format '{{.Replicas}}' | awk -F'/' '$1 != $2' | wc -l)
          
          if [ "$UNHEALTHY" -gt 0 ]; then
            echo "⚠️  Unhealthy services detected on $NODE_NAME"
            ALL_HEALTHY=false
            FAILURES=$((FAILURES + 1))
          fi
        done < <(yq eval '.nodes[]' "$ENV_CONFIG" -o=json)
        
        if $ALL_HEALTHY; then
          echo "✅ All services healthy"
          break
        fi
        
        if [ $FAILURES -ge $MAX_FAILURES ]; then
          echo "❌ Max failures reached: $FAILURES"
          echo "Triggering automatic rollback..."
          
          # Trigger rollback pipeline
          curl -X POST \
            -F "token=${CI_JOB_TOKEN}" \
            -F "ref=master" \
            -F "variables[ENVIRONMENT]=${ENVIRONMENT}" \
            -F "variables[TRIGGER_ROLLBACK]=true" \
            "${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/trigger/pipeline"
          
          exit 1
        fi
        
        sleep $HEALTH_CHECK_INTERVAL
      done
  
  retry:
    max: 0  # No retry - trigger rollback instead

8.5 Database Migration Rollback

Problem: Database migrations cannot be rolled back automatically (risk of data loss).

Strategy:

handle_migration_rollback:
  stage: rollback
  script:
    - echo "Handling database migration rollback..."
    - echo "⚠️  WARNING: Database migrations cannot be automatically rolled back"
    
    - ENV_CONFIG="environments/${ENVIRONMENT}/config.yml"
    - DB_HOST=$(yq eval '.database.host' $ENV_CONFIG)
    - DB_NAME=$(yq eval '.database.name' $ENV_CONFIG)
    
    # Get current migration ID
    - |
      CURRENT_MIGRATION=$(PGPASSWORD="${DB_PASSWORD}" psql \
        -h "${DB_HOST}" \
        -U coin \
        -d "${DB_NAME}" \
        -t -c "SELECT MAX(id) FROM schema_migrations;")
    
    - echo "Current migration ID: $CURRENT_MIGRATION"
    
    # Get expected migration for previous version
    - |
      PREVIOUS_MIGRATION=$(git show ${PREVIOUS_TAG}:environments/${ENVIRONMENT}/migration.txt)
      echo "Previous version migration ID: $PREVIOUS_MIGRATION"
    
    - |
      if [ "$CURRENT_MIGRATION" -gt "$PREVIOUS_MIGRATION" ]; then
        echo "❌ CRITICAL: New migrations were applied!"
        echo "Current: $CURRENT_MIGRATION"
        echo "Previous: $PREVIOUS_MIGRATION"
        echo ""
        echo "Manual intervention required:"
        echo "1. Review migrations between $PREVIOUS_MIGRATION and $CURRENT_MIGRATION"
        echo "2. Determine if rollback is safe (check for data loss)"
        echo "3. If safe, manually execute down migrations"
        echo "4. If not safe, consider forward fix instead"
        echo ""
        echo "Contact DBA team immediately!"
        
        # Send alert
        curl -X POST "$SLACK_WEBHOOK_URL" \
          -H 'Content-Type: application/json' \
          -d '{
            "text": "🚨 CRITICAL: Migration rollback required",
            "attachments": [{
              "color": "danger",
              "text": "Environment: '"$ENVIRONMENT"'\nCurrent migration: '"$CURRENT_MIGRATION"'\nTarget migration: '"$PREVIOUS_MIGRATION"'\n\nManual DBA intervention required!"
            }]
          }'
        
        exit 1
      else
        echo "✅ No new migrations applied, safe to rollback"
      fi
  
  when: on_failure
  allow_failure: false

8.6 Rollback Verification

Post-Rollback Checks:

verify_rollback:
  stage: verify
  needs:
    - rollback_stack
  script:
    - echo "Verifying rollback success..."
    - ENV_CONFIG="environments/${ENVIRONMENT}/config.yml"
    
    # 1. Check all services running
    - |
      yq eval '.nodes[]' $ENV_CONFIG -o=json | while read node; do
        NODE_NAME=$(echo $node | jq -r '.name')
        CONTEXT=$(echo $node | jq -r '.context')
        STACK=$(echo $node | jq -r '.stack')
        
        docker context use "$CONTEXT"
        
        echo "Checking services on: $NODE_NAME"
        SERVICES=$(docker service ls --filter name="$STACK" --format '{{.Name}}\t{{.Replicas}}')
        echo "$SERVICES"
        
        # Verify all services converged
        UNCONVERGED=$(echo "$SERVICES" | awk -F'\t' '{
          split($2, a, "/")
          if (a[1] != a[2]) print $1
        }')
        
        if [ -n "$UNCONVERGED" ]; then
          echo "❌ Unconverged services after rollback:"
          echo "$UNCONVERGED"
          exit 1
        fi
      done
    
    - echo "✅ All services converged"
    
    # 2. Health check endpoints
    - |
      yq eval '.nodes[]' $ENV_CONFIG -o=json | while read node; do
        NODE_NAME=$(echo $node | jq -r '.name')
        PUBLIC_IP=$(echo $node | jq -r '.public_ip // ""')
        
        if [ -n "$PUBLIC_IP" ]; then
          echo "Health check: https://${PUBLIC_IP}:5443/health"
          
          HTTP_CODE=$(curl -k -s -o /dev/null -w "%{http_code}" "https://${PUBLIC_IP}:5443/health")
          
          if [ "$HTTP_CODE" != "200" ]; then
            echo "❌ Health check failed: HTTP $HTTP_CODE"
            exit 1
          fi
          
          echo "✅ Health check passed: $NODE_NAME"
        fi
      done
    
    # 3. Smoke tests
    - .gitlab/scripts/smoke-tests.sh "${ENVIRONMENT}"
    
    - echo "✅ Rollback verification complete"

8.7 Rollback Documentation

Post-Rollback Report:

generate_rollback_report:
  stage: notify
  needs:
    - verify_rollback
  script:
    - |
      cat > rollback-report-${ENVIRONMENT}-$(date +%Y%m%d-%H%M%S).md <<EOF
      # Rollback Report
      
      ## Incident Summary
      - **Environment**: ${ENVIRONMENT}
      - **Date**: $(date -u +"%Y-%m-%d %H:%M:%S UTC")
      - **Triggered By**: ${CI_COMMIT_AUTHOR}
      - **Pipeline**: ${CI_PIPELINE_URL}
      
      ## Versions
      - **Failed Version**: ${RELEASE_TAG}
      - **Rolled Back To**: ${PREVIOUS_TAG}
      
      ## Rollback Actions
      - Stack removed: ${STACK_NAME}
      - Previous version deployed: ${PREVIOUS_TAG}
      - Services restarted: All
      - Health checks: Passed
      
      ## Verification
      - All services converged: ✅
      - Health endpoints responding: ✅
      - Smoke tests passed: ✅
      
      ## Impact
      - Downtime: ~5 minutes
      - Affected users: [To be determined]
      - Data loss: None
      
      ## Root Cause
      [To be investigated]
      
      ## Next Steps
      1. Investigate root cause of deployment failure
      2. Fix identified issues
      3. Test fix in lower environments
      4. Schedule re-deployment
      
      ## Timeline
      - Failure detected: $(date -u +"%Y-%m-%d %H:%M:%S UTC")
      - Rollback initiated: $(date -u +"%Y-%m-%d %H:%M:%S UTC")
      - Rollback completed: $(date -u +"%Y-%m-%d %H:%M:%S UTC")
      - Services restored: $(date -u +"%Y-%m-%d %H:%M:%S UTC")
      EOF
    
    - cat rollback-report-*.md
    
    # Send to Slack
    - |
      REPORT=$(cat rollback-report-*.md)
      curl -X POST "$SLACK_WEBHOOK_URL" \
        -H 'Content-Type: application/json' \
        -d '{
          "text": "Rollback Report: '"$ENVIRONMENT"'",
          "attachments": [{
            "color": "warning",
            "text": "```'"$REPORT"'```"
          }]
        }'
  
  artifacts:
    paths:
      - rollback-report-*.md
    expire_in: 1 year

9. Monitoring and Verification

9.1 Multi-Layer Monitoring Architecture

Monitoring Layers:

Layer 1: Infrastructure Monitoring (Swarm Level)
├── Node health (CPU, memory, disk)
├── Service status (running/failed)
├── Container metrics
└── Network performance

Layer 2: Application Monitoring (Service Level)
├── HTTP endpoints health
├── Response times
├── Error rates
├── Transaction volumes

Layer 3: Business Monitoring (Business Level)
├── User activity
├── Transaction success rate
├── Revenue metrics
└── Critical business processes

Layer 4: Deployment Monitoring (CI/CD Level)
├── Pipeline success rate
├── Deployment frequency
├── Lead time for changes
└── MTTR (Mean Time To Recovery)
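
The Layer 4 numbers can be derived from pipeline history. A minimal sketch for MTTR, assuming an incident log of `failed_at recovered_at` Unix-timestamp pairs (the log format is an assumption):

```shell
#!/usr/bin/env bash
# Sketch: mean time to recovery (in minutes) from "failed_at
# recovered_at" Unix-timestamp pairs on stdin, one incident per line.
mean_time_to_recovery() {
  awk '{ sum += $2 - $1; n++ } END { if (n) printf "%.1f\n", sum / n / 60 }'
}
```

Fed from a deployments log (e.g. extracted from pipeline timestamps), it prints the average recovery time; it prints nothing for an empty log.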

9.2 Infrastructure Monitoring (Prometheus + Grafana)

Prometheus Scrape Configuration:

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  # Docker Swarm Manager Metrics
  - job_name: 'docker-swarm-manager'
    static_configs:
      - targets:
        - node3.internal:9323
        - node4.internal:9323
        labels:
          environment: 'sandbox'
  
  # Node Exporter (Host Metrics)
  - job_name: 'node-exporter'
    static_configs:
      - targets:
        - node3.internal:9100
        - node4.internal:9100
        labels:
          environment: 'sandbox'
  
  # cAdvisor (Container Metrics)
  - job_name: 'cadvisor'
    static_configs:
      - targets:
        - node3.internal:8080
        - node4.internal:8080
        labels:
          environment: 'sandbox'
  
  # Application Metrics
  - job_name: 'coin-api'
    dns_sd_configs:
      - names:
        - 'tasks.admin_api'
        - 'tasks.client_api'
        type: 'A'
        port: 9090  # Metrics port

Key Metrics to Monitor:

# Service Health
up{job="coin-api"} == 1

# Container Restarts
rate(container_restart_count[5m]) > 0

# CPU Usage
rate(container_cpu_usage_seconds_total[5m]) * 100

# Memory Usage
container_memory_usage_bytes / container_spec_memory_limit_bytes * 100

# Network Traffic
rate(container_network_receive_bytes_total[5m])
rate(container_network_transmit_bytes_total[5m])

# HTTP Request Rate
rate(http_requests_total[5m])

# HTTP Error Rate
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100

# Response Time (95th percentile)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
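
These expressions can also be evaluated from a pipeline job via Prometheus's HTTP API (`GET /api/v1/query`); the Prometheus address below is an assumption for this setup:

```shell
#!/usr/bin/env bash
# Sketch: run a PromQL instant query against the Prometheus HTTP API
# and print the first result's value ("NaN" when the result is empty).
PROM_URL="${PROM_URL:-http://prometheus.internal:9090}"  # assumed address

parse_prom_value() {
  jq -r '.data.result[0].value[1] // "NaN"'
}

prom_query() {
  curl -s --get "${PROM_URL}/api/v1/query" \
    --data-urlencode "query=$1" | parse_prom_value
}

# Example gate: read the 5m 5xx error rate, then fail the job if high.
# RATE=$(prom_query 'rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100')
```

This lets a `verify` job make rollback decisions on the same metrics the dashboards show.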

Grafana Dashboard - Deployment Overview:

{
  "dashboard": {
    "title": "COIN Deployment Dashboard",
    "panels": [
      {
        "title": "Deployment Timeline",
        "type": "graph",
        "targets": [
          {
            "expr": "changes(deployment_version{environment=\"$environment\"}[1h])"
          }
        ]
      },
      {
        "title": "Service Health",
        "type": "stat",
        "targets": [
          {
            "expr": "count(up{job=\"coin-api\",environment=\"$environment\"} == 1)"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total{status=~\"5..\",environment=\"$environment\"}[5m])"
          }
        ]
      },
      {
        "title": "Response Time (p95)",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{environment=\"$environment\"}[5m]))"
          }
        ]
      }
    ]
  }
}

9.3 Application Health Checks

Health Check Endpoints:

# docker-compose.yml
services:
  admin_api:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:10000/health"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 40s

Comprehensive Health Check Script:

.gitlab/scripts/health-check.sh:

#!/usr/bin/env bash
set -euo pipefail

# Arguments:
# $1 - BASE_URL (e.g., https://coin-node3.sandbox.company.com)
# $2 - ENVIRONMENT

BASE_URL=$1
ENVIRONMENT=$2

echo "Running health checks against: ${BASE_URL}"

FAILED_CHECKS=0

# Test 1: Basic Health Endpoint
echo "Test 1: Health endpoint..."
HTTP_CODE=$(curl -k -s -o /dev/null -w "%{http_code}" "${BASE_URL}/health")
if [ "$HTTP_CODE" = "200" ]; then
  echo "✅ Health check passed (HTTP $HTTP_CODE)"
else
  echo "❌ Health check failed (HTTP $HTTP_CODE)"
  FAILED_CHECKS=$((FAILED_CHECKS + 1))
fi

# Test 2: API Version
echo "Test 2: API version..."
VERSION=$(curl -k -s "${BASE_URL}/api/version" | jq -r '.version // empty')
if [ -n "$VERSION" ]; then
  echo "✅ API version: ${VERSION}"
else
  echo "❌ API version check failed"
  FAILED_CHECKS=$((FAILED_CHECKS + 1))
fi

# Test 3: Database Connectivity
echo "Test 3: Database connectivity..."
DB_STATUS=$(curl -k -s "${BASE_URL}/api/health/database" | jq -r '.status // empty')
if [ "$DB_STATUS" = "ok" ]; then
  echo "✅ Database connectivity OK"
else
  echo "❌ Database connectivity failed: $DB_STATUS"
  FAILED_CHECKS=$((FAILED_CHECKS + 1))
fi

# Test 4: Redis Connectivity
echo "Test 4: Redis connectivity..."
REDIS_STATUS=$(curl -k -s "${BASE_URL}/api/health/redis" | jq -r '.status // empty')
if [ "$REDIS_STATUS" = "ok" ]; then
  echo "✅ Redis connectivity OK"
else
  echo "❌ Redis connectivity failed: $REDIS_STATUS"
  FAILED_CHECKS=$((FAILED_CHECKS + 1))
fi

# Test 5: Critical Endpoints
echo "Test 5: Critical endpoints..."
ENDPOINTS=(
  "/api/auth/status"
  "/api/users/me"
  "/api/transactions/stats"
)

for endpoint in "${ENDPOINTS[@]}"; do
  HTTP_CODE=$(curl -k -s -o /dev/null -w "%{http_code}" \
    -H "Authorization: Bearer ${API_TEST_TOKEN}" \
    "${BASE_URL}${endpoint}")
  
  if [ "$HTTP_CODE" = "200" ] || [ "$HTTP_CODE" = "401" ]; then
    echo "✅ Endpoint reachable: $endpoint (HTTP $HTTP_CODE)"
  else
    echo "❌ Endpoint failed: $endpoint (HTTP $HTTP_CODE)"
    FAILED_CHECKS=$((FAILED_CHECKS + 1))
  fi
done

# Summary
echo ""
echo "========================================"
if [ $FAILED_CHECKS -eq 0 ]; then
  echo "✅ All health checks passed"
  echo "========================================"
  exit 0
else
  echo "❌ ${FAILED_CHECKS} health check(s) failed"
  echo "========================================"
  exit 1
fi

9.4 Smoke Tests

Post-Deployment Smoke Test Suite:

.gitlab/scripts/smoke-tests.sh:

#!/usr/bin/env bash
set -euo pipefail

# Arguments:
# $1 - ENVIRONMENT

ENVIRONMENT=$1
ENV_CONFIG="environments/${ENVIRONMENT}/config.yml"

echo "Running smoke tests for: ${ENVIRONMENT}"

FAILED_TESTS=0

# Get first node URL
FIRST_NODE=$(yq eval '.nodes[0].name' $ENV_CONFIG)
BASE_URL="https://coin-${FIRST_NODE}.${ENVIRONMENT}.company.com"

echo "Testing against: $BASE_URL"

# Test 1: User Authentication
echo "Smoke Test 1: User Authentication..."
AUTH_RESPONSE=$(curl -k -s -X POST "${BASE_URL}/api/auth/login" \
  -H "Content-Type: application/json" \
  -d '{"username":"test_user","password":"test_password"}')

TOKEN=$(echo $AUTH_RESPONSE | jq -r '.token // empty')
if [ -n "$TOKEN" ]; then
  echo "✅ Authentication successful"
else
  echo "❌ Authentication failed"
  FAILED_TESTS=$((FAILED_TESTS + 1))
fi

# Test 2: Create Transaction
echo "Smoke Test 2: Create Transaction..."
TX_RESPONSE=$(curl -k -s -X POST "${BASE_URL}/api/transactions" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"amount":100,"currency":"USD","description":"Smoke test"}')

TX_ID=$(echo $TX_RESPONSE | jq -r '.id // empty')
if [ -n "$TX_ID" ]; then
  echo "✅ Transaction created: $TX_ID"
else
  echo "❌ Transaction creation failed"
  FAILED_TESTS=$((FAILED_TESTS + 1))
fi

# Test 3: Retrieve Transaction
echo "Smoke Test 3: Retrieve Transaction..."
TX_GET=$(curl -k -s "${BASE_URL}/api/transactions/${TX_ID}" \
  -H "Authorization: Bearer $TOKEN")

TX_STATUS=$(echo $TX_GET | jq -r '.status // empty')
if [ "$TX_STATUS" = "pending" ] || [ "$TX_STATUS" = "completed" ]; then
  echo "✅ Transaction retrieved: status=$TX_STATUS"
else
  echo "❌ Transaction retrieval failed"
  FAILED_TESTS=$((FAILED_TESTS + 1))
fi

# Test 4: List Transactions
echo "Smoke Test 4: List Transactions..."
TX_LIST=$(curl -k -s "${BASE_URL}/api/transactions?limit=10" \
  -H "Authorization: Bearer $TOKEN")

TX_COUNT=$(echo $TX_LIST | jq '.items | length')
if [ "$TX_COUNT" -gt 0 ]; then
  echo "✅ Transaction list retrieved: $TX_COUNT items"
else
  echo "❌ Transaction list empty or failed"
  FAILED_TESTS=$((FAILED_TESTS + 1))
fi

# Test 5: Webhook Endpoint
echo "Smoke Test 5: Webhook Processing..."
WEBHOOK_RESPONSE=$(curl -k -s -X POST "${BASE_URL}/api/webhooks/test" \
  -H "X-Webhook-Secret: ${WEBHOOK_SECRET}" \
  -H "Content-Type: application/json" \
  -d '{"event":"test","data":{}}')

WEBHOOK_STATUS=$(echo $WEBHOOK_RESPONSE | jq -r '.status // empty')
if [ "$WEBHOOK_STATUS" = "processed" ]; then
  echo "✅ Webhook processed"
else
  echo "❌ Webhook processing failed"
  FAILED_TESTS=$((FAILED_TESTS + 1))
fi

# Test 6: PDF Generation
echo "Smoke Test 6: PDF Generation..."
PDF_RESPONSE=$(curl -k -s -X POST "${BASE_URL}/api/reports/generate" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"type":"transaction_report","format":"pdf"}')

PDF_URL=$(echo $PDF_RESPONSE | jq -r '.url // empty')
if [ -n "$PDF_URL" ]; then
  echo "✅ PDF generated: $PDF_URL"
else
  echo "❌ PDF generation failed"
  FAILED_TESTS=$((FAILED_TESTS + 1))
fi

# Summary
echo ""
echo "========================================"
echo "Smoke Tests Summary"
echo "========================================"
if [ $FAILED_TESTS -eq 0 ]; then
  echo "✅ All smoke tests passed (6/6)"
  exit 0
else
  echo "❌ ${FAILED_TESTS} smoke test(s) failed"
  exit 1
fi

9.5 Performance Baseline Monitoring

Response Time Tracking:

monitor_performance_baseline:
  stage: verify
  script:
    - echo "Monitoring performance baseline..."
    - BASE_URL="https://coin-node3.${ENVIRONMENT}.company.com"
    
    # Measure response times
    - |
      echo "Endpoint,Response_Time_MS,Status" > performance-${RELEASE_TAG}.csv
      
      ENDPOINTS=(
        "/health"
        "/api/version"
        "/api/auth/status"
        "/api/transactions?limit=10"
      )
      
      for endpoint in "${ENDPOINTS[@]}"; do
        # Single request: measuring time and status with two separate
        # curl calls would time two different requests
        read -r RESPONSE_TIME HTTP_CODE < <(curl -k -s -o /dev/null -w "%{time_total} %{http_code}" "${BASE_URL}${endpoint}")
        RESPONSE_TIME_MS=$(echo "$RESPONSE_TIME * 1000" | bc)
        
        echo "${endpoint},${RESPONSE_TIME_MS},${HTTP_CODE}" >> performance-${RELEASE_TAG}.csv
      done
    
    - cat performance-${RELEASE_TAG}.csv
    
    # Compare with baseline
    - |
      if [ -f "performance-baseline.csv" ]; then
        echo "Comparing with baseline..."
        
        # Simple comparison (production should use proper analysis)
        CURRENT_AVG=$(awk -F',' 'NR>1 {sum+=$2; count++} END {print sum/count}' performance-${RELEASE_TAG}.csv)
        BASELINE_AVG=$(awk -F',' 'NR>1 {sum+=$2; count++} END {print sum/count}' performance-baseline.csv)
        
        DEGRADATION=$(echo "scale=2; ($CURRENT_AVG - $BASELINE_AVG) / $BASELINE_AVG * 100" | bc)
        
        echo "Current average: ${CURRENT_AVG}ms"
        echo "Baseline average: ${BASELINE_AVG}ms"
        echo "Degradation: ${DEGRADATION}%"
        
        # Alert if degradation > 20%
        if (( $(echo "$DEGRADATION > 20" | bc -l) )); then
          echo "⚠️  Performance degradation detected: ${DEGRADATION}%"
          echo "Consider rollback or investigation"
        fi
      else
        echo "No baseline found, creating..."
        cp performance-${RELEASE_TAG}.csv performance-baseline.csv
      fi
  
  artifacts:
    paths:
      - performance-*.csv
    expire_in: 30 days

9.6 Alerting Configuration

**Alertmanager rules:**

```yaml
# alertmanager.yml
route:
  group_by: ['alertname', 'environment']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'slack-notifications'

  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      continue: true

    - match:
        severity: warning
        environment: production
      receiver: 'slack-production'

    - match:
        environment: sandbox
      receiver: 'slack-sandbox'

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - api_url: '${SLACK_WEBHOOK_URL}'
        channel: '#deployments'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '${PAGERDUTY_SERVICE_KEY}'
        description: '{{ .GroupLabels.alertname }}'

  - name: 'slack-production'
    slack_configs:
      - api_url: '${SLACK_WEBHOOK_PRODUCTION}'
        channel: '#production-alerts'
        color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
```
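
Alertmanager resolves these routes by walking the tree top-down: the first child route whose matchers all apply receives the alert, and `continue: true` lets evaluation proceed to later routes. A simplified one-level Python sketch of that selection logic, reusing the receiver names from the config above (real Alertmanager matching is richer — regex matchers, nested routes — so treat this as an illustration only):

```python
# One-level model of the routing tree above: each route is
# (equality matchers, receiver, continue flag). First match wins;
# `continue: true` allows later routes to match as well.
ROUTES = [
    ({"severity": "critical"}, "pagerduty-critical", True),
    ({"severity": "warning", "environment": "production"}, "slack-production", False),
    ({"environment": "sandbox"}, "slack-sandbox", False),
]
DEFAULT_RECEIVER = "slack-notifications"  # the top-level route receiver

def receivers_for(labels):
    """Return the receivers an alert with these labels would be routed to."""
    matched = []
    for matchers, receiver, cont in ROUTES:
        if all(labels.get(k) == v for k, v in matchers.items()):
            matched.append(receiver)
            if not cont:
                break
    # If no child route matched, the alert falls back to the parent receiver.
    return matched or [DEFAULT_RECEIVER]
```

For example, a `severity: critical` production alert goes to PagerDuty, while an unmatched development alert falls through to the default `#deployments` channel.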

**Alert rules:**

```yaml
# prometheus-rules.yml
groups:
  - name: deployment_alerts
    interval: 30s
    rules:
      - alert: DeploymentFailed
        expr: deployment_status{environment="production"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          description: "Deployment to {{ $labels.node }} failed"

      - alert: HighErrorRate
        # Ratio of 5xx responses to all responses (> 5%), not a raw request rate
        expr: |
          sum(rate(http_requests_total{status=~"5..",environment="production"}[5m]))
            /
          sum(rate(http_requests_total{environment="production"}[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          description: "High error rate detected: {{ $value | humanizePercentage }}"

      - alert: ServiceDown
        expr: up{job="coin-api",environment="production"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          description: "Service {{ $labels.instance }} is down"

      - alert: HighMemoryUsage
        expr: container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          description: "Container {{ $labels.container }} memory usage > 90%"
```

## 10. Implementation Plan

### 10.1 Phased Rollout Strategy

**4-phase approach:**

```
Phase 1: Infrastructure Setup (Week 1-2)
├── GitLab Runner installation
├── Docker context configuration
├── SOPS setup
├── Monitoring stack deployment
└── Testing infrastructure

Phase 2: Development Environment (Week 3-4)
├── Migrate development to GitOps
├── Create pipeline templates
├── Test basic workflows
├── Train team
└── Collect feedback

Phase 3: Sandbox + Testing (Week 5-6)
├── Migrate sandbox environment
├── Implement approval workflows
├── Add advanced features (rollback, etc.)
├── Performance tuning
└── Documentation

Phase 4: Production Ready (Week 7-8)
├── Production configuration
├── Security hardening
├── Disaster recovery testing
├── Final training
└── Go-live
```

### 10.2 Week-by-Week Implementation Plan

#### Week 1: Foundation

| Day | Tasks | Deliverables |
|-----|-------|--------------|
| Mon | Kickoff meeting, requirements review | Project charter |
| Tue | GitLab Runner installation, Docker context setup | Working runner |
| Wed | Create repository structure, initial pipeline | Base `.gitlab-ci.yml` |
| Thu | SOPS installation, GPG key generation | Encrypted secrets |
| Fri | Monitoring stack deployment | Prometheus + Grafana |

#### Week 2: Development Pipeline

| Day | Tasks | Deliverables |
|-----|-------|--------------|
| Mon | Development environment configuration | `config.yml` |
| Tue | Prepare stage implementation | Extract + prepare scripts |
| Wed | Deploy stage implementation | Deployment automation |
| Thu | Verification stage implementation | Health checks + smoke tests |
| Fri | End-to-end testing | Working dev pipeline |

#### Week 3: Sandbox Migration

| Day | Tasks | Deliverables |
|-----|-------|--------------|
| Mon | Sandbox configuration creation | Sandbox config files |
| Tue | Secret migration to SOPS | Encrypted secrets |
| Wed | Pipeline adaptation | Sandbox-specific jobs |
| Thu | Testing + validation | Successful deployment |
| Fri | Parallel running (old + new) | Comparison data |

#### Week 4: Advanced Features

| Day | Tasks | Deliverables |
|-----|-------|--------------|
| Mon | Rollback implementation | Rollback pipeline |
| Tue | Automatic rollback triggers | Health-based rollback |
| Wed | Performance monitoring | Baseline tracking |
| Thu | Alert configuration | Alerting rules |
| Fri | Documentation update | User guides |

#### Week 5: Testing Environment

| Day | Tasks | Deliverables |
|-----|-------|--------------|
| Mon | Testing environment setup | Testing configs |
| Tue | Approval workflow implementation | Manual gates |
| Wed | Integration with QA processes | QA checklist |
| Thu | Environment promotion testing | Promotion pipeline |
| Fri | Load testing | Performance report |

#### Week 6: Production Preparation

| Day | Tasks | Deliverables |
|-----|-------|--------------|
| Mon | Production configuration | Prod configs |
| Tue | Security hardening | Security audit |
| Wed | Disaster recovery setup | DR procedures |
| Thu | Change Advisory Board integration | CAB workflow |
| Fri | Production dry-run | Test results |

#### Week 7: Production Migration

| Day | Tasks | Deliverables |
|-----|-------|--------------|
| Mon | Final security review | Sign-off |
| Tue | Production secrets migration | Encrypted prod secrets |
| Wed | Production pipeline testing | Test deployment |
| Thu | Go-live preparation | Runbooks |
| Fri | Production go-live | First prod deployment |

#### Week 8: Stabilization

| Day | Tasks | Deliverables |
|-----|-------|--------------|
| Mon | Monitor production deployments | Metrics report |
| Tue | Address any issues | Bug fixes |
| Wed | Team training sessions | Training materials |
| Thu | Documentation finalization | Complete docs |
| Fri | Project retrospective | Lessons learned |

### 10.3 Success Criteria

**Technical metrics:**

| Metric | Target | Measurement |
|--------|--------|-------------|
| Deployment time | < 15 min | Pipeline duration |
| Success rate | > 95% | Successful/total deploys |
| Rollback time | < 5 min | Rollback duration |
| MTTR | < 30 min | Mean time to recovery |
| Pipeline reliability | > 99% | Runner uptime |

**Process metrics:**

| Metric | Target | Measurement |
|--------|--------|-------------|
| Manual steps | < 2 per deploy | Process audit |
| Approval time | < 2 hours | Approval duration |
| Documentation coverage | 100% | Doc review |
| Team training | 100% | Training completion |
| Knowledge transfer | Complete | Quiz scores |

**Business metrics:**

| Metric | Target | Measurement |
|--------|--------|-------------|
| Deployment frequency | 2x increase | Deploy count |
| Lead time | 50% reduction | Commit to production |
| Change failure rate | < 5% | Failed/total changes |
| Team satisfaction | > 80% | Survey results |
| Cost savings | Measurable | Time saved × hourly rate |
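
Most of these targets can be derived directly from raw pipeline records. A minimal sketch of the arithmetic behind success rate, change failure rate, and average deployment time, using hand-made illustrative records (the field names are assumptions, not an existing API):

```python
from datetime import datetime

# Illustrative deployment records; in practice these would come from the
# GitLab pipelines API or a deployments table.
deployments = [
    {"started": datetime(2024, 1, 1, 10, 0), "finished": datetime(2024, 1, 1, 10, 12), "ok": True},
    {"started": datetime(2024, 1, 2, 10, 0), "finished": datetime(2024, 1, 2, 10, 14), "ok": True},
    {"started": datetime(2024, 1, 3, 10, 0), "finished": datetime(2024, 1, 3, 10, 30), "ok": False},
    {"started": datetime(2024, 1, 4, 10, 0), "finished": datetime(2024, 1, 4, 10, 11), "ok": True},
]

# Success rate and change failure rate (targets: > 95% and < 5%)
success_rate = 100 * sum(d["ok"] for d in deployments) / len(deployments)
change_failure_rate = 100 - success_rate

# Average deployment time in minutes (target: < 15 min)
avg_minutes = sum(
    (d["finished"] - d["started"]).total_seconds() for d in deployments
) / len(deployments) / 60
```

Feeding real pipeline data through the same formulas weekly gives the trend lines needed for the metrics review described in §10.6.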

### 10.4 Risk Mitigation

**Identified risks:**

| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| Pipeline failures during migration | High | Medium | Parallel running, quick rollback |
| Secret leakage | Low | Critical | SOPS encryption, access control |
| Learning curve | Medium | Medium | Training, documentation, support |
| Production incident | Low | Critical | Comprehensive testing, gradual rollout |
| Resistance to change | Medium | Medium | Change management, stakeholder buy-in |

**Contingency plans:**

1. **Pipeline failure:**
   - Keep manual scripts as backup
   - Document emergency procedures
   - 24/7 support during migration
2. **Security incident:**
   - Immediate secret rotation
   - Audit all access
   - Incident response team activation
3. **Team issues:**
   - Extended training period
   - Pair programming sessions
   - Dedicated support channel

### 10.5 Training Plan

**Training modules:**

#### Module 1: GitOps Fundamentals (2 hours)

- Infrastructure as Code concepts
- Git workflow and best practices
- CI/CD pipeline basics
- Hands-on: create a simple pipeline

#### Module 2: COIN Pipeline Deep Dive (3 hours)

- Pipeline architecture overview
- Stage-by-stage walkthrough
- Configuration management
- Hands-on: trigger a deployment

#### Module 3: Secrets Management (2 hours)

- SOPS usage
- Secret rotation procedures
- Security best practices
- Hands-on: encrypt/decrypt secrets

#### Module 4: Troubleshooting (2 hours)

- Reading pipeline logs
- Common failure scenarios
- Debug techniques
- Hands-on: fix a failing pipeline

#### Module 5: Rollback Procedures (2 hours)

- When to roll back
- Rollback execution
- Verification steps
- Hands-on: perform a rollback

#### Module 6: Monitoring & Alerts (2 hours)

- Dashboard overview
- Alert interpretation
- Response procedures
- Hands-on: respond to an alert

### 10.6 Post-Implementation Support

**Support structure:**

```
Tier 1: Self-Service
├── Documentation wiki
├── Troubleshooting guides
├── FAQ
└── Video tutorials

Tier 2: Team Support
├── Slack channel: #cicd-support
├── Office hours: daily 10-11 AM
├── Email: devops-support@company.com
└── Response time: < 4 hours

Tier 3: Expert Support
├── On-call DevOps engineer
├── Escalation for critical issues
├── Response time: < 1 hour
└── 24/7 for production
```

**Continuous improvement:**

- Weekly metrics review
- Monthly retrospectives
- Quarterly pipeline optimization
- Annual security audit
- Regular training updates

## Conclusion

### Summary

A universal GitLab CI/CD pipeline for the COIN application is fully achievable and will deliver:

- **Automation** - 90% reduction in manual operations
- **Universality** - support for all 4 environments
- **Security** - SOPS encryption + audit trail
- **Reliability** - automatic rollback + health checks
- **Observability** - comprehensive monitoring at every level
- **Speed** - 3x faster deployments

### Key Advantages

1. A single process for all environments
2. Git as the source of truth for all configurations
3. Automated deployment, with manual gates where needed
4. Built-in rollback with verification
5. Comprehensive monitoring at all levels
6. Full traceability of every change

### Next Steps

1. Review this document with the team
2. Approve the implementation plan
3. Allocate resources (8 weeks, 1-2 FTE)
4. Hold the kickoff meeting
5. Start Phase 1 implementation

The document is ready for implementation to begin! 🚀