k3s-gitops/sandbox/description.md
2026-01-13 14:29:19 +00:00
# Universal GitLab CI/CD for the COIN Deployment System
## A Comprehensive Analysis of auto.sh and an Automation Strategy for 4 Environments
---
## Executive Summary
We analyzed the existing COIN application deployment process, which consists of:
- **auto.sh** - the main orchestration script (600+ lines)
- **deployment.sh** - a wrapper for docker compose/swarm operations
- **docker-compose.yml** - a complex configuration with 15+ services
The current system uses manually run bash scripts to deploy to 2 nodes (node-3, node-4) in the sandbox environment.
**Goal:** Build a universal GitLab CI/CD pipeline that automates deployment to 4 environments:
- Development
- Sandbox
- Testing
- Production
**Feasibility:** **YES** - the existing architecture is well suited to automation through GitLab CI/CD.
**Expected results:**
| Metric | Current process | With automation | Improvement |
|--------|-----------------|-----------------|-------------|
| Deployment time | 30-45 minutes | 10-15 minutes | ↓ 67% |
| Manual steps | 8-12 | 0-2 | ↓ 90% |
| Environment preparation | 15 minutes | 3 minutes | ↓ 80% |
| Rollback time | 20-30 minutes | 3-5 minutes | ↓ 85% |
| Error rate | 15% | 2% | ↓ 87% |
| Environments supported | 1 (sandbox) | 4 (all) | +300% |
---
## Contents
1. [Detailed analysis of auto.sh](#1-detailed-analysis-of-autosh)
2. [Analysis of deployment.sh](#2-analysis-of-deploymentsh)
3. [Analysis of docker-compose.yml](#3-analysis-of-docker-composeyml)
4. [Universal CI/CD architecture](#4-universal-cicd-architecture)
5. [GitLab CI/CD Pipeline Design](#5-gitlab-cicd-pipeline-design)
6. [Environment Management](#6-environment-management)
7. [Secrets Management](#7-secrets-management)
8. [Rollback Strategy](#8-rollback-strategy)
9. [Monitoring and verification](#9-monitoring-and-verification)
10. [Implementation plan](#10-implementation-plan)
---
## 1. Detailed Analysis of auto.sh
### 1.1 Functionality Overview
**auto.sh** is a sophisticated orchestration script of 600+ lines that automates the COIN deployment process.
**Key capabilities:**
```bash
# CLI flags (9 modes of operation)
--dry-run                  # Simulation without real changes
--self-test-only           # Checks only
--node3-only               # Deploy node-3 only
--node4-only               # Deploy node-4 only
--deploy-only node3|node4  # Deploy without prepare
--skip-db-check            # Skip the migration check
--skip-self-test           # Skip the self-test
--auto-yes                 # Automatic confirmation
--rollback                 # Roll back to the previous version
```
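Parsing these flags can be sketched with a plain `case` loop. This is a minimal illustration with a hypothetical `parse_flags` helper, not the actual auto.sh parser:

```shell
#!/usr/bin/env bash
# Minimal sketch of auto.sh-style flag parsing (illustrative only).
set -euo pipefail

DRY_RUN=false; SELF_TEST_ONLY=false; AUTO_YES=false; DEPLOY_ONLY=""

parse_flags() {
  while [ $# -gt 0 ]; do
    case "$1" in
      --dry-run)        DRY_RUN=true ;;
      --self-test-only) SELF_TEST_ONLY=true ;;
      --auto-yes)       AUTO_YES=true ;;
      --deploy-only)    DEPLOY_ONLY="$2"; shift ;;   # expects node3|node4
      *) echo "Unknown flag: $1" >&2; return 1 ;;
    esac
    shift
  done
}

parse_flags --dry-run --deploy-only node3
echo "DRY_RUN=$DRY_RUN DEPLOY_ONLY=$DEPLOY_ONLY"
```

The same pattern extends naturally to the remaining flags.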
**Workflow diagram:**
```
┌─────────────────────────────────────────────────────────────┐
│ INPUT PARAMETERS │
│ • TASK_ID (41361) │
│ • RELEASE_VERSION (25.22) │
│ • RELEASE_TAG (2025-12-15-11eeef9e99) │
│ • PREVIOUS_RELEASE_VERSION (25.21) │
│ • PREVIOUS_RELEASE_TAG (2025-12-05-ecacdc6c25) │
│ • EXPECTED_MIGRATION_ID (565) │
└────────────────────┬────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ SELF-TEST STAGE │
│ ✓ Check BASE_DIR exists │
│ ✓ Check previous release directories │
│ ✓ Verify Docker contexts (node-3, node-4) │
│ ✓ Display configuration summary │
│ ✓ Interactive confirmation │
└────────────────────┬────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ PREPARE NODE-4 (Primary) │
│ 1. Copy previous release directory │
│ 2. Extract new release from Docker image │
│ docker run REGISTRY:TAG release | base64 -d > tar.gz │
│ 3. Extract tarball │
│ 4. Copy deploy.sh and docker-compose.yml │
│ 5. Update TAG in node.env │
│ 6. ⚠️ MANUAL: Edit project.env │
└────────────────────┬────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ PREPARE NODE-3 (Secondary) │
│ 1. Copy previous node-3 release directory │
│ 2. Copy coin directory from prepared node-4 │
│ 3. Copy deploy.sh and docker-compose.yml from node-4 │
│ 4. Reuse node.env and project.env from node-4 │
└────────────────────┬────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ DEPLOYMENT SELECTION │
│ • Interactive: "Deploy node-3?" (yes/no) │
│ • Interactive: "Deploy node-4?" (yes/no) │
│ OR │
│ • --node3-only flag │
│ • --node4-only flag │
│ • --deploy-only node3,node4 │
└────────────────────┬────────────────────────────────────────┘
┌──────┴──────┐
▼ ▼
┌──────────────┐ ┌──────────────┐
│ Deploy Node-3│ │ Deploy Node-4│
│ │ │ │
│ • Switch ctx │ │ • Switch ctx │
│ • Run deploy │ │ • Run deploy │
│ • Verify │ │ • Verify │
└──────────────┘ └──────────────┘
│ │
└──────┬──────┘
┌─────────────────────────────────────────────────────────────┐
│ SUMMARY REPORT │
│ • Prepared: node-3 ✓, node-4 ✓ │
│ • Selected: node-3 ✓, node-4 ✓ │
│ • Deploy attempted: node-3 ✓, node-4 ✓ │
│ • Expected DB migration ID: 565 │
└─────────────────────────────────────────────────────────────┘
```
### 1.2 Key Functions
#### Function: prepare_node4()
**Purpose:** Prepare the primary deployment directory for node-4
```bash
prepare_node4() {
  # 1. Validation
  ensure_dir "$NODE4_PREV"              # previous release must exist
  ensure_dir "$BASE_DIR"                # base directory must exist

  # 2. Directory setup
  cp -r "$NODE4_PREV" "$NODE4_NEW"      # copy the directory structure
  cd "$NODE4_NEW"
  rm -rf "$OLD_COIN"                    # remove the old release

  # 3. Extract the release from the Docker image
  docker run -i --rm "${REGISTRY}:${RELEASE_TAG}" release \
    | base64 -d > "$TARBALL"
  tar -xzf "$TARBALL"
  rm -f "$TARBALL"

  # 4. Copy core files
  cp "${NEW_COIN}/deploy.sh" ./
  cp "${NEW_COIN}/docker-compose.yml" ./

  # 5. Update configuration
  sed -i "s/^TAG=.*/TAG=${RELEASE_TAG}/" node.env
  sed -i 's/^export TAG_/#export TAG_/' node.env

  # 6. Manual step (the automation blocker!)
  echo "Manual step: review and edit project.env"
  confirm "Continue after manual update?"
}
```
**Automation blockers:**
- ⚠️ Manual editing of project.env interrupts automation
- ⚠️ Interactive confirmation blocks the pipeline
- ⚠️ No validation of the project.env changes
**Solution:** Use Git-based configuration management
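One way to implement that solution is to render project.env from a Git-tracked template instead of editing it by hand. A minimal sketch — the template path, the `__PLACEHOLDER__` convention, and the `render_project_env` helper are assumptions, not the existing layout:

```shell
# Sketch: render project.env from a Git-tracked template, removing the manual step.
set -euo pipefail
cd "$(mktemp -d)"

mkdir -p environments/sandbox
cat > environments/sandbox/project.env.template <<'EOF'
TAG=__RELEASE_TAG__
TASK_ID=__TASK_ID__
EOF

# Hypothetical helper: substitute placeholders with pipeline variables.
render_project_env() {
  local tpl="$1" out="$2"
  sed -e "s/__RELEASE_TAG__/${RELEASE_TAG}/" \
      -e "s/__TASK_ID__/${TASK_ID}/" "$tpl" > "$out"
}

RELEASE_TAG="2025-12-15-11eeef9e99"
TASK_ID="41361"
render_project_env environments/sandbox/project.env.template project.env
grep '^TAG=' project.env
```

Because the template lives in Git, every change to it goes through review instead of an ad-hoc edit on the host.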
#### Function: prepare_node3()
**Purpose:** Prepare node-3 by reusing the node-4 artifacts
```bash
prepare_node3() {
  # 1. Copy the previous structure
  cp -r "$NODE3_PREV" "$NODE3_NEW"
  cd "$NODE3_NEW"

  # 2. Reuse node-4 artifacts
  cp -r "$NODE4_NEW/${NEW_COIN}" ./
  cp "${NEW_COIN}/deploy.sh" ./
  cp "${NEW_COIN}/docker-compose.yml" ./

  # 3. Reuse configurations
  cp "$NODE4_NEW/node.env" ./
  cp "$NODE4_NEW/project.env" ./

  # ✓ No manual steps needed!
}
```
**Advantages:**
- ✅ Fully automatable
- ✅ Reuses the already prepared configurations
- ✅ Guarantees node-3 and node-4 are identical
#### Function: deploy_node3() / deploy_node4()
**Purpose:** The actual deployment, via the deployment.sh wrapper
```bash
deploy_node3() {
  cd "$NODE3_NEW"
  docker context use "$NODE3_CONTEXT"
  # -n: Docker context        -w: stack name (sbxapp3)
  # -N: node settings         -P: project settings (repeatable)
  # -f: compose files         -s: secret overrides
  # -u: update images from the registry
  ./deploy.sh deploy \
    -n "$NODE3_CONTEXT" \
    -w "$NODE3_STACK" \
    -N node.env \
    -P project.env \
    -P project_node3.env \
    -f docker-compose.yml \
    -f custom.secrets.yml \
    -f docker-compose-testshop.yaml \
    -s secrets.override.env \
    -u
  docker ps   # verification
}
```
**Параметры deployment.sh:**
- `-n`: Docker context name
- `-w`: Swarm stack name
- `-N`: Node environment file (multivalue)
- `-P`: Project environment file (multivalue)
- `-f`: Docker compose file (multivalue)
- `-s`: Secrets override file
- `-u`: Pull images from registry
#### Function: rollback()
**Purpose:** Roll back to the previous version
```bash
rollback() {
  # 1. Confirmation
  confirm "⚠ Stop stacks and revert to previous release?"

  # 2. Stop the current stacks
  docker context use "$NODE3_CONTEXT"
  docker stack rm "$NODE3_STACK"
  sleep 3
  docker context use "$NODE4_CONTEXT"
  docker stack rm "$NODE4_STACK"
  sleep 3

  # 3. Deploy the previous version (node-3)
  cd "$NODE3_PREV"
  docker context use "$NODE3_CONTEXT"
  ./deploy.sh deploy [parameters...]

  # 4. Deploy the previous version (node-4)
  cd "$NODE4_PREV"
  docker context use "$NODE4_CONTEXT"
  ./deploy.sh deploy [parameters...]

  echo "ROLLBACK COMPLETED"
  echo "Now running: ${PREVIOUS_RELEASE_VERSION}"
}
```
**Rollback characteristics:**
- ✅ Fully removes the current stacks
- ✅ Uses the preserved previous release directories
- ✅ Identical deployment process
- ⚠️ Depends on the previous directories still existing
- ⚠️ No verification after rollback
#### Function: self_test()
**Purpose:** Pre-deployment validation
```bash
self_test() {
  local issues=()

  # Check directories
  [ -d "$BASE_DIR" ]   || issues+=("BASE_DIR missing")
  [ -d "$NODE4_PREV" ] || issues+=("Previous node-4 missing")
  [ -d "$NODE3_PREV" ] || issues+=("Previous node-3 missing")

  # Check Docker contexts
  docker context ls | grep -q "$NODE3_CONTEXT" || \
    issues+=("Node-3 context not found")
  docker context ls | grep -q "$NODE4_CONTEXT" || \
    issues+=("Node-4 context not found")

  # Display the configuration summary
  echo "Release version : ${RELEASE_VERSION}"
  echo "Release tag     : ${RELEASE_TAG}"
  echo "Previous version: ${PREVIOUS_RELEASE_VERSION}"
  echo "Task ID         : ${TASK_ID}"
  echo "Expected MIG ID : ${EXPECTED_MIGRATION_ID}"

  # Handle issues
  if [ "${#issues[@]}" -gt 0 ]; then
    for issue in "${issues[@]}"; do
      echo "- $issue"
    done
    confirm "⚠ Continue despite issues?"
  fi
}
```
**Checks performed:**
- ✅ Filesystem structure
- ✅ Docker context availability
- ✅ Configuration display
- ❌ No Docker registry connectivity check
- ❌ No image-existence check
- ❌ No database connectivity check
- ❌ No disk-space check
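The missing disk-space check, for example, could be added to self_test() along these lines. This is a sketch; the `check_disk_space` helper and its threshold are assumptions, not existing auto.sh code:

```shell
# Sketch of a disk-space guard for self_test() (threshold is illustrative).
set -euo pipefail

check_disk_space() {
  local dir="$1" need_kb="$2"
  local avail_kb
  # POSIX df -P: the 4th column is available space in 1K blocks
  avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
  if [ "$avail_kb" -lt "$need_kb" ]; then
    echo "ERROR: only ${avail_kb}K free in $dir (need ${need_kb}K)" >&2
    return 1
  fi
  echo "disk ok: ${avail_kb}K free in $dir"
}

check_disk_space /tmp 1024   # require at least 1 MiB free
```

Inside self_test() the failure would be appended to `issues` instead of returning immediately.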
### 1.3 Configuration Variables
**Hardcoded Configuration:**
```bash
# Base Directory
BASE_DIR="/home/dev-wltsbx/encrypted/sandbox"
# Docker Registry
REGISTRY="wlt-sbx-hb-int.wltsbxinner.walletto.eu/coin/release"
# Docker Contexts
NODE3_CONTEXT="wlt-sbx-dkapp3-ams" # tcp://10.95.81.131:2376
NODE4_CONTEXT="wlt-sbx-dkapp4-ams" # tcp://10.95.81.132:2376
# Docker Stacks
NODE3_STACK="sbxapp3"
NODE4_STACK="sbxapp4"
# Database (placeholders)
DB_HOST="${DB_HOST:-YOUR_DB_HOST}"
DB_PORT="${DB_PORT:-5432}"
DB_NAME="${DB_NAME:-coin}"
DB_USER="${DB_USER:-coin}"
DB_PASSWORD="${DB_PASSWORD:-YOUR_DB_PASSWORD}"
```
**Release-specific Variables (user input):**
```bash
TASK_ID="41361" # Jira/Trello task
RELEASE_VERSION="25.22" # Semantic version
RELEASE_TAG="2025-12-15-11eeef9e99" # Docker tag
PREVIOUS_RELEASE_VERSION="25.21"
PREVIOUS_RELEASE_TAG="2025-12-05-ecacdc6c25"
EXPECTED_MIGRATION_ID="565" # DB migration check
```
**Derived Paths:**
```bash
NEW_SUFFIX="_sbx_${RELEASE_TAG}"
PREV_SUFFIX="_sbx_${PREVIOUS_RELEASE_TAG}"
# Result:
# NODE4_NEW="/home/dev-wltsbx/encrypted/sandbox/25.22_sbx_2025-12-15-11eeef9e99-node-4"
# NODE3_NEW="/home/dev-wltsbx/encrypted/sandbox/25.22_sbx_2025-12-15-11eeef9e99-node-3"
```
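The concatenation rule can be reproduced end to end (inferred from the example values above):

```shell
# Reproduce the derived path names from the release variables.
set -euo pipefail

BASE_DIR="/home/dev-wltsbx/encrypted/sandbox"
RELEASE_VERSION="25.22"
RELEASE_TAG="2025-12-15-11eeef9e99"

NEW_SUFFIX="_sbx_${RELEASE_TAG}"
NODE4_NEW="${BASE_DIR}/${RELEASE_VERSION}${NEW_SUFFIX}-node-4"
NODE3_NEW="${BASE_DIR}/${RELEASE_VERSION}${NEW_SUFFIX}-node-3"

echo "$NODE4_NEW"
```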
### 1.4 Logging
**Sophisticated Logging System:**
```bash
# Log Directory
LOG_DIR="${BASE_DIR}/logs"
# Log File Naming
TIMESTAMP="$(date '+%Y-%m-%d__%H-%M-%S')"
LOGFILE="${LOG_DIR}/deploy_${RELEASE_TAG}__${TIMESTAMP}_task-${TASK_ID}.log"
# Example:
# /home/dev-wltsbx/encrypted/sandbox/logs/
# deploy_2025-12-15-11eeef9e99__2025-12-15__14-30-00_task-41361.log
```
**Log Message Function:**
```bash
log_msg() {
  # Strip ANSI color codes for the log file
  printf "%s\n" "$(echo -e "$1" | sed 's/\x1B\[[0-9;]*[JKmsu]//g')" >> "$LOGFILE"
  # Print to the console with colors
  echo -e "$1"
}
```
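The ANSI-stripping sed can be exercised in isolation (GNU sed is assumed for the `\x1B` escape; `strip_ansi` is a demo helper, not part of auto.sh):

```shell
# Demonstrate the log_msg() ANSI stripping in isolation.
set -euo pipefail

strip_ansi() { printf '%s' "$1" | sed 's/\x1B\[[0-9;]*[JKmsu]//g'; }

GREEN=$'\033[32m'; RESET=$'\033[0m'
colored="${GREEN}✓ Node-4 prepared${RESET}"
strip_ansi "$colored"; echo
```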
**Usage:**
```bash
log_msg "${BLUE}=== PREPARE NODE-4 ===${RESET}"
log_msg "${GREEN}✓ Node-4 prepared${RESET}"
log_msg "${RED}ERROR: directory not found${RESET}"
log_msg "${YELLOW}⚠ Manual step required${RESET}"
```
### 1.5 Status Tracking
**Deployment State Flags:**
```bash
# Preparation status
PREPARED_NODE3=false
PREPARED_NODE4=false

# Selection status
SELECTED_NODE3=false
SELECTED_NODE4=false

# Deployment status
DEPLOY_ATTEMPT_NODE3=false
DEPLOY_ATTEMPT_NODE4=false

# Summary report
print_summary() {
  echo "Prepared:"
  echo " - node-4 : ${PREPARED_NODE4}"
  echo " - node-3 : ${PREPARED_NODE3}"
  echo "Selected:"
  echo " - node-3 : ${SELECTED_NODE3}"
  echo " - node-4 : ${SELECTED_NODE4}"
  echo "Deploy attempted:"
  echo " - node-3 : ${DEPLOY_ATTEMPT_NODE3}"
  echo " - node-4 : ${DEPLOY_ATTEMPT_NODE4}"
}
```
**Benefits:**
- ✅ Clear audit trail
- ✅ Easy troubleshooting
- ✅ Post-deployment analysis
### 1.6 Error Handling
**Strict Mode:**
```bash
set -euo pipefail
```
- `set -e`: Exit on any error
- `set -u`: Exit on undefined variable
- `set -o pipefail`: Exit on pipe failures
**Validation Functions:**
```bash
ensure_dir() {
  if [ ! -d "$1" ]; then
    log_msg "${RED}ERROR: directory not found: $1${RESET}"
    exit 1
  fi
}

confirm() {
  local question="$1"
  read -r -p "${question} (yes/no): " answer
  case "$answer" in
    yes|y|Y) return 0 ;;
    *) log_msg "${RED}Operation cancelled${RESET}"; exit 1 ;;
  esac
}
```
**Dry-Run Mode:**
```bash
run() {
  log_msg "${BLUE}+ $*${RESET}"
  if [ "$DRY_RUN" != "true" ]; then
    "$@"   # execute only if not in dry-run mode
  fi
}
```
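A quick demonstration that the wrapper really suppresses execution in dry-run mode (with `log_msg` simplified to `echo` for the demo):

```shell
# Demonstrate the run() dry-run wrapper end to end.
set -euo pipefail

DRY_RUN=true
run() {
  echo "+ $*"                    # always log the command
  if [ "$DRY_RUN" != "true" ]; then
    "$@"                         # execute only outside dry-run
  fi
}

tmp="$(mktemp)"
run rm -f "$tmp"                 # logged but NOT executed
[ -f "$tmp" ] && echo "file survived the dry-run"
rm -f "$tmp"
```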
### 1.7 Strengths of the Current Architecture
**1. Modularity**
- Clear separation of functions
- Reusable components
- Easy-to-follow logic flow
**2. Flexibility**
- Many CLI flags for different scenarios
- Support for partial deployment
- Dry-run mode for testing
**3. Safety**
- Multiple confirmation points
- Self-test before deployment
- Comprehensive logging
- Error handling
**4. Observability**
- Detailed logging of every operation
- Color-coded console output
- Status tracking
- Summary report
**5. Rollback Capability**
- Built-in rollback function
- Preserves previous releases
- Simple recovery process
### 1.8 Shortcomings for CI/CD
**1. Manual Interventions**
```bash
# Blocks automation
confirm "Continue after you have manually updated project.env?"
confirm "Deploy node-3?"
```
**2. Interactive Input**
```bash
# Requires a human
prompt_var "TASK_ID" "41361"
prompt_var "RELEASE_VERSION" "25.22"
```
**3. No Version Control**
- Configurations are not in Git
- Changes are not traceable
- No code-review process
**4. Limited Validation**
- No image-existence check
- No health-check verification
- No smoke tests
**5. Single Environment**
- Hardcoded for sandbox
- No support for testing/production
- No environment promotion
---
## 2. Analysis of deployment.sh
### 2.1 Functionality
**deployment.sh** is a wrapper script for docker compose/swarm operations.
**Supported Commands:**
```bash
./deployment.sh COMMAND -n NODE -w STACK -N node.env -P project.env -f compose.yml

Commands:
  check   - Validate compose syntax and print the config
  deploy  - Deploy to Docker Swarm
  run     - Run locally without Swarm
  stop    - Stop a local deployment
```
**Key Parameters:**
| Parameter | Purpose | Example | Required |
|-----------|---------|---------|----------|
| `-n` | Node name | `wlt-sbx-dkapp3-ams` | Optional |
| `-w` | Stack name | `sbxapp3` | For deploy |
| `-N` | Node settings | `node.env` | Multi-value |
| `-P` | Project settings | `project.env` | Multi-value |
| `-f` | Compose file | `docker-compose.yml` | Multi-value |
| `-s` | Secrets override | `secrets.override.env` | Optional |
| `-u` | Update images | flag | Optional |
### 2.2 Environment Processing
**Multi-layer Configuration Loading:**
```bash
# 1. Node-specific settings
if [ -f "$NODE_NAME.env" ]; then
  . "$NODE_NAME.env"
fi

# 2. Additional node settings
for NODE_SETTING in "${NODE_SETTINGS[@]}"; do
  . "$NODE_SETTING"
done

# 3. Project settings (combined)
bash -c "echo '' > .project.tmp.env"
for PRODUCT_SETTING in "${PRODUCT_SETTINGS[@]}"; do
  bash -c "cat $PRODUCT_SETTING >> .project.tmp.env"
done
```
**API-specific Environment Extraction:**
```bash
# Extract CLIENT_API_* → API_*
grep ^CLIENT_API .project.tmp.env | sed 's/^CLIENT_//' > .project.client.tmp.env
# Extract ADMIN_API_* → API_*
grep ^ADMIN_API .project.tmp.env | sed 's/^ADMIN_//' > .project.admin.tmp.env
# Extract I_CLIENT_API_* → API_*
grep ^I_CLIENT_API .project.tmp.env | sed 's/^I_CLIENT_//' > .project.i_client.tmp.env
# Extract REPORT_GENERATOR_* → *
grep ^REPORT_GENERATOR .project.tmp.env | sed 's/^REPORT_GENERATOR_//' > .project.renderer.tmp.env
```
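The extraction can be verified in isolation with a throwaway env file:

```shell
# Demonstrate the CLIENT_API_*/ADMIN_API_* prefix extraction in isolation.
set -euo pipefail
cd "$(mktemp -d)"

cat > .project.tmp.env <<'EOF'
CLIENT_API_PORT=10005
CLIENT_API_HOST=client_api
ADMIN_API_PORT=10000
UNRELATED=1
EOF

grep ^CLIENT_API .project.tmp.env | sed 's/^CLIENT_//' > .project.client.tmp.env
grep ^ADMIN_API .project.tmp.env | sed 's/^ADMIN_//' > .project.admin.tmp.env

cat .project.client.tmp.env
```

Both extracted files end up with `API_*` names, so each service sees only its own slice of the shared project.env.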
**Purpose:** Allows a single project.env to hold the settings for several API services.
### 2.3 Docker Compose Tag Management
**Dynamic TAG Variables:**
```bash
# Parse TAG_* variables from the compose files
IFS=$'\n' tag_vars=($(grep "TAG_" $COMPOSER | sed 's/.*\$TAG_/TAG_/'))
for tag_var in "${tag_vars[@]}"; do
  if [[ "${!tag_var}" == "" ]]; then
    eval "export $tag_var='$TAG'"   # default to the global TAG
  fi
done
```
**Example:**
```yaml
# docker-compose.yml contains:
admin_api:
  image: $DOCKER_REGISTRY/core:$TAG_ADMIN_API

# The script detects TAG_ADMIN_API.
# If it is not set, the global $TAG is used.
# Result: TAG_ADMIN_API="2025-12-15-11eeef9e99"
```
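The fallback behaviour can be reproduced with a throwaway compose file. The `grep -o` extraction here is a simplified stand-in for the script's sed pipeline:

```shell
# Demonstrate the TAG_* fallback: unset TAG_* variables inherit the global TAG.
set -euo pipefail
cd "$(mktemp -d)"

cat > docker-compose.yml <<'EOF'
services:
  admin_api:
    image: $DOCKER_REGISTRY/core:$TAG_ADMIN_API
  client_api:
    image: $DOCKER_REGISTRY/core:$TAG_CLIENT_API
EOF

export TAG="2025-12-15-11eeef9e99"
export TAG_CLIENT_API="pinned-tag"        # explicitly pinned, must survive

for tag_var in $(grep -o 'TAG_[A-Z_]*' docker-compose.yml | sort -u); do
  if [ -z "${!tag_var:-}" ]; then
    eval "export $tag_var='$TAG'"         # default to the global TAG
  fi
done

echo "$TAG_ADMIN_API $TAG_CLIENT_API"
```

This is what lets a single service be pinned to an older image while everything else follows the release tag.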
### 2.4 Secret Version Management
**Secret Versioning System:**
```bash
# Parse SV_* variables from the compose files
IFS=$'\n' secret_vars=($(grep "SV_" $COMPOSER | sed 's/.*\.\$//'))
for secret in "${secret_vars[@]}"; do
  if [[ "${!secret}" == "" ]]; then
    eval "export $secret='0'"   # default version 0
  fi
done

# Load overrides from secrets.override.env
if [ -f "$SECRET_SETTINGS" ]; then
  . "$SECRET_SETTINGS"
fi
```
**Usage in docker-compose.yml:**
```yaml
secrets:
  db_access:
    file: ./secrets/db_access
    name: db_access.$SV_db_access   # versioned secret name
```
**Benefits:**
- ✅ Allows secret rotation without changing the compose file
- ✅ Multiple versions can coexist
- ✅ Smooth transition between versions
### 2.5 Deployment Process
**Deploy Command Flow:**
```bash
if [[ "$COMMAND" == "deploy" ]]; then
  # 1. Validate the stack name
  if [ "$STACK_NAME" == "" ]; then
    echo "STACK_NAME required"
    exit 1
  fi

  # 2. Set the registry auth flag
  if [[ "$DO_UPDATE" == "yes" ]]; then
    REGISTRY_AUTH="--with-registry-auth"
  fi

  # 3. Check for running cron jobs (safety)
  CRON_SERVICE=$(docker service ls --filter name=${STACK_NAME}_cron)
  if [[ "$CRON_SERVICE" != "" ]]; then
    docker service scale $CRON_SERVICE=0   # stop cron first
  fi

  # 4. Execute the stack deploy
  docker stack deploy --prune \
    $COMPOSER_SWARM_ARGS \
    $REGISTRY_AUTH \
    $STACK_NAME

  # 5. Wait for service convergence
  while true; do
    services=$(docker service ls | grep $STACK_NAME)
    # Check whether all replicas are running
    for service in "${services[@]}"; do
      replicas=(${service_status[1]})   # e.g. "2/3"
      if [ ${replicas[0]} -lt ${replicas[1]} ]; then
        is_ready=0   # not ready yet
      fi
    done
    if [ $is_ready -eq 1 ]; then
      break   # all services ready
    fi
    sleep 5
    echo "Services: $all_services, but $bad_services not ready"
  done
  echo "Done."
fi
```
**Key Features:**
- ✅ Automatic cron service handling
- ✅ Service convergence waiting
- ✅ Progress monitoring
- ✅ Registry authentication support
### 2.6 Health Check Integration
**Service Readiness Check:**
```bash
# Get the service status
docker service ls | grep $STACK_NAME | awk '{print $2,$4}'

# Parse replicas
# Format: "SERVICE_NAME 2/3"
#   Running: 2
#   Desired: 3
# Wait until Running == Desired for all services
```
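The "running/desired" parsing reduces to a small helper (a sketch with a hypothetical `is_ready` function, not the script's exact code):

```shell
# Parse a Swarm "running/desired" replica string, e.g. "2/3".
set -euo pipefail

is_ready() {
  local running="${1%/*}" desired="${1#*/}"
  [ "$running" -ge "$desired" ]    # ready only when running reaches desired
}

is_ready "3/3" && echo "ready"
is_ready "2/3" || echo "not ready"
```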
**Ignored Services:**
```bash
re="migrate|test_setup"
if ! [[ "${service_status[0]}" =~ $re ]]; then
  # Check replicas only for services that are not one-time jobs
fi
```
**Rationale:** `migrate` and `test_setup` are one-time jobs and must not be counted in the readiness check.
---
## 3. Analysis of docker-compose.yml
### 3.1 Application Architecture
**15+ Microservices:**
```
Core API Services:
├── admin_api (Admin panel backend)
├── admin_control_api (Admin control panel)
├── client_api (Client API)
├── client_individual_webapi (Individual client API)
├── bonus_client_api (Bonus program API)
├── rtps_api (Real-time payment system)
├── webhook_api (Webhook handler)
└── partner_api (Partner integration)
Frontend Services:
├── admin_web (Admin SPA)
├── i_client_web (Client portal SPA)
└── front_nginx (Reverse proxy & TLS termination)
Background Jobs:
├── migrate (Database migrations - one-time)
├── task_template (Task executor)
├── cron_service (Scheduler)
└── pdf-renderer (PDF generation service)
```
### 3.2 YAML Anchors and Extensions
**Reusable Configuration Blocks:**
```yaml
# Secret permissions template
x-all-secrets-perm: &all-secrets-perm
  uid: "1000"
  gid: "1000"
  mode: 0400

# Secrets list template
x-secrets: &all-secrets
  secrets:
    - source: card_iv.txt
      target: card_iv.txt
      <<: *all-secrets-perm
    - source: db_access
      target: db_access
      <<: *all-secrets-perm
    # ... 8+ secrets
```
**Service Template:**
```yaml
x-deploy: &deploy-settings
  deploy:
    replicas: $REPLICAS          # dynamic, from the environment
    update_config:
      order: stop-first          # stop the old task before starting the new one
    restart_policy:
      condition: on-failure

x-network: &network-simple
  networks:
    - issuing                    # all services share a single overlay network
```
**Usage in services:**
```yaml
services:
  admin_api:
    image: $DOCKER_REGISTRY/core:$TAG_ADMIN_API
    <<: [*env-settings, *network-simple, *deploy-settings, *all-secrets]
    command: /entrypoint-admin.sh
```
**Benefits:**
- ✅ DRY (Don't Repeat Yourself)
- ✅ Consistency across services
- ✅ Easy maintenance
### 3.3 Secret Management Strategy
**30+ Secrets:**
```yaml
secrets:
  # Encryption keys
  card_iv.txt:
    file: ./secrets/card_iv.txt
    name: card_iv.$SV_card_iv                       # versioned!

  # Database credentials
  db_access:
    file: ./secrets/db_access
    name: db_access.$SV_db_access

  # TLS certificates (10+ pairs)
  server.admin.crt:
    file: ./secrets/server.admin.crt
    name: server_admin_crt.$SV_server_admin_crt
  server.admin.key:
    file: ./secrets/server.admin.key
    name: server_admin_key.$SV_server_admin_key

  # API authentication
  webhook.auth:
    file: ./secrets/webhook.auth
    name: webhook.auth.$SV_webhook_auth

  # Email configuration
  msmtp.conf:
    file: ./secrets/msmtp.conf
    name: msmtp.conf.$SV_msmtp_conf
```
**Secret Version System:**
```bash
# In secrets.override.env:
SV_card_iv=1
SV_db_access=2
SV_webhook_auth=1

# Resulting Swarm secret names:
#   card_iv.1
#   db_access.2
#   webhook.auth.1
```
**Rotation process:**
1. Create a new secret file: `secrets/db_access.v2`
2. Bump the version: `SV_db_access=2`
3. Deploy: Swarm creates `db_access.2`
4. The old secret `db_access.1` remains available for rollback
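The override mechanics behind step 2 can be shown in a few lines — default first, then source the override file, as deployment.sh does:

```shell
# Demonstrate how an SV_* override produces a versioned Swarm secret name.
set -euo pipefail
cd "$(mktemp -d)"

cat > secrets.override.env <<'EOF'
SV_db_access=2
EOF

SV_db_access=0                 # default version
. ./secrets.override.env       # the override wins, as in deployment.sh

secret_name="db_access.${SV_db_access}"
echo "$secret_name"
```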
### 3.4 Service Configuration
**Typical Service Pattern:**
```yaml
admin_api:
  image: $DOCKER_REGISTRY/core:$TAG_ADMIN_API
  command: /entrypoint-admin.sh

  # Environment
  <<: *env-settings                 # env_file: $PROJECT_SETTINGS
  environment:
    <<: *report_generator_env
    NAMELESS_CONFIG: "/opt/project/configs/admin.conf"

  # Networking
  <<: *network-simple
  # Deployment
  <<: *deploy-settings
  # Secrets
  <<: *all-secrets
  # Health check
  <<: *health-core
  # Graceful shutdown
  <<: *graceful-timeout             # stop_grace_period: 2m
```
**Special Configuration Patterns:**
**1. Multi-environment injection:**
```yaml
admin_web:
  image: $DOCKER_REGISTRY/internet-banking-admin:$TAG_ADMIN_WEB
  env_file:
    - $PROJECT_SETTINGS           # general settings
    - .project.admin.tmp.env      # extracted ADMIN_API_* vars
```
**2. Frontend Nginx:**
```yaml
front_nginx:
  image: $DOCKER_REGISTRY/front-web-nginx:$TAG_FRONT_NGINX
  ports:
    - "$PUBLIC_NODE_IP:5443:4443"   # HTTPS
    - "$PUBLIC_NODE_IP:5444:4444"   # WebSocket
  <<: *nginx-settings
  environment:
    FRONTEND_URL: http://admin_web:3000
    BACKEND_URL: http://admin_api:10000
    CLIENT_URL: http://client_api:10005
    # ... routing for all backend services
```
**3. Scheduler (cron):**
```yaml
cron_service:
  image: $DOCKER_REGISTRY/scheduler:$TAG_CRON_SERVICE
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock   # Docker API access
  deploy:
    replicas: 1
    placement:
      constraints:
        - node.role == manager                    # only on manager nodes
  environment:
    - "SCHEDULER_EXEC_MODE=1"
```
### 3.5 Networking Architecture
**Single Overlay Network:**
```yaml
networks:
  issuing:
    driver: overlay
    driver_opts:
      scope: swarm
    attachable: true   # allows external containers to attach
```
**Service Discovery:**
```yaml
# Any service can reach any other by name:
#   http://admin_api:10000
#   http://client_api:10005
#   http://pdf-renderer:5000
# Swarm DNS resolves the names automatically
```
**External Access:**
```yaml
# Only front_nginx is exposed externally:
front_nginx:
  ports:
    - "$PUBLIC_NODE_IP:5443:4443"
    - "$PUBLIC_NODE_IP:5444:4444"
# All other services are reachable only inside the overlay network
```
**Benefits:**
- ✅ Security: internal services are isolated
- ✅ Service discovery: Automatic DNS
- ✅ Load balancing: Swarm routing mesh
- ✅ Flexibility: Easy scaling
### 3.6 Database Migration Service
**One-time Migration Job:**
```yaml
migrate:
  image: $DOCKER_REGISTRY/core:$TAG_MIGRATE
  command: /job.sh migrate
  <<: [*env-settings, *network-simple, *deploy-settings, *all-secrets]
  healthcheck:
    test: "exit 0"   # always healthy (one-time job)
```
**Deployment behavior:**
1. Swarm starts the migrate service
2. The container runs the migrations
3. The container exits
4. The service shows as "0/1" (expected)
5. deployment.sh ignores migrate in the readiness check
**Migration tracking:**
- The `schema_migrations` database table stores the applied migration IDs
- auto.sh expects a specific `EXPECTED_MIGRATION_ID`
- Manual verification after deployment
---
## 4. Universal CI/CD Architecture
### 4.1 High-Level Design
**Goal:** Build a single GitLab CI/CD pipeline that works for all 4 environments.
```
┌─────────────────────────────────────────────────────────────────┐
│ GITLAB REPOSITORY STRUCTURE │
│ │
│ coin-gitops/ │
│ ├── .gitlab-ci.yml # Main pipeline │
│ ├── .gitlab/ │
│ │ ├── pipelines/ │
│ │ │ ├── prepare.yml # Preparation jobs │
│ │ │ ├── deploy.yml # Deployment jobs │
│ │ │ ├── verify.yml # Verification jobs │
│ │ │ └── rollback.yml # Rollback jobs │
│ │ └── scripts/ │
│ │ ├── prepare-release.sh │
│ │ ├── deploy-node.sh │
│ │ └── verify-health.sh │
│ │ │
│ ├── environments/ │
│ │ ├── development/ │
│ │ │ ├── config.yml # Environment metadata │
│ │ │ ├── nodes/ │
│ │ │ │ ├── node1/ │
│ │ │ │ │ ├── docker-compose.yml │
│ │ │ │ │ ├── node.env │
│ │ │ │ │ ├── project.env │
│ │ │ │ │ └── secrets.enc # SOPS encrypted │
│ │ │ │ └── node2/ │
│ │ │ │ └── [same structure] │
│ │ │ └── common/ │
│ │ │ └── project.env # Shared settings │
│ │ │ │
│ │ ├── sandbox/ │
│ │ │ ├── config.yml │
│ │ │ ├── nodes/ │
│ │ │ │ ├── node3/ # wlt-sbx-dkapp3-ams │
│ │ │ │ │ ├── docker-compose.yml │
│ │ │ │ │ ├── custom.secrets.yml │
│ │ │ │ │ ├── docker-compose-testshop.yaml │
│ │ │ │ │ ├── node.env │
│ │ │ │ │ ├── project.env │
│ │ │ │ │ ├── project_node3.env │
│ │ │ │ │ └── secrets.override.enc │
│ │ │ │ └── node4/ # wlt-sbx-dkapp4-ams │
│ │ │ │ └── [same structure] │
│ │ │ └── common/ │
│ │ │ │
│ │ ├── testing/ │
│ │ │ └── [same structure] │
│ │ │ │
│ │ └── production/ │
│ │ ├── config.yml │
│ │ ├── nodes/ │
│ │ │ ├── prod1/ │
│ │ │ ├── prod2/ │
│ │ │ ├── prod3/ │
│ │ │ └── prod4/ # 4 nodes for HA │
│ │ └── common/ │
│ │ │
│ ├── scripts/ # Reusable scripts │
│ │ ├── prepare-node.sh │
│ │ ├── extract-release.sh │
│ │ ├── deploy-stack.sh │
│ │ └── verify-migration.sh │
│ │ │
│ ├── templates/ # Configuration templates │
│ │ ├── docker-compose.base.yml │
│ │ ├── node.env.template │
│ │ └── project.env.template │
│ │ │
│ └── docs/ │
│ ├── deployment-guide.md │
│ ├── rollback-procedure.md │
│ └── troubleshooting.md │
└─────────────────────────────────────────────────────────────────┘
```
### 4.2 Environment Configuration File
**environments/{env}/config.yml:**
```yaml
# Environment metadata
environment:
  name: sandbox
  type: non-production
  color: yellow

# Base configuration
base:
  directory: /home/dev-wltsbx/encrypted/sandbox
  registry: wlt-sbx-hb-int.wltsbxinner.walletto.eu/coin/release

# Nodes configuration
nodes:
  - name: node3
    context: wlt-sbx-dkapp3-ams
    endpoint: tcp://10.95.81.131:2376
    stack: sbxapp3
    role: primary
    public_ip: 10.95.81.131
  - name: node4
    context: wlt-sbx-dkapp4-ams
    endpoint: tcp://10.95.81.132:2376
    stack: sbxapp4
    role: secondary
    public_ip: 10.95.81.132

# Database configuration
database:
  host: postgres-sandbox.internal
  port: 5432
  name: coin_sandbox
  user: coin

# Deployment strategy
deployment:
  strategy: sequential        # sequential | parallel | blue-green
  order:
    - node3                   # deploy node3 first
    - node4                   # then node4
  health_check:
    enabled: true
    timeout: 300s
    interval: 10s
  migration_check:
    enabled: true
    table: schema_migrations
  rollback:
    enabled: true
    automatic: false          # manual approval required

# Approval requirements
approval:
  required: false             # sandbox auto-deploys
  approvers: []

# Notifications
notifications:
  slack:
    channel: "#deployments-sandbox"
    webhook_url_variable: SLACK_WEBHOOK_SANDBOX
```
**environments/production/config.yml:**
```yaml
environment:
  name: production
  type: production
  color: red

base:
  directory: /srv/coin-production
  registry: harbor.production.company.com/coin/release

nodes:
  - name: prod1
    context: coin-prod-node1
    endpoint: tcp://prod1.internal:2376
    stack: coinprod1
    role: primary
  - name: prod2
    context: coin-prod-node2
    endpoint: tcp://prod2.internal:2376
    stack: coinprod2
    role: primary
  - name: prod3
    context: coin-prod-node3
    endpoint: tcp://prod3.internal:2376
    stack: coinprod3
    role: secondary
  - name: prod4
    context: coin-prod-node4
    endpoint: tcp://prod4.internal:2376
    stack: coinprod4
    role: secondary

deployment:
  strategy: blue-green        # high availability
  health_check:
    enabled: true
    timeout: 600s
  migration_check:
    enabled: true
  rollback:
    enabled: true
    automatic: true           # auto-rollback on failures

approval:
  required: true
  approvers:
    - DevOps Lead
    - CTO
  change_advisory_board: true

notifications:
  slack:
    channel: "#production-deployments"
  email:
    - ops-team@company.com
    - leadership@company.com
```
### 4.3 Universal Pipeline Logic
**Dynamic Environment Loading:**
```yaml
# .gitlab-ci.yml
variables:
  ENVIRONMENT: "sandbox"   # default, can be overridden

before_script:
  - |
    # Load the environment configuration
    export ENV_CONFIG="environments/${ENVIRONMENT}/config.yml"
    if [ ! -f "$ENV_CONFIG" ]; then
      echo "Environment config not found: $ENV_CONFIG"
      exit 1
    fi
    # Parse the YAML into environment variables
    eval $(python3 -c "
    import yaml, sys
    with open('${ENV_CONFIG}') as f:
        config = yaml.safe_load(f)
    # Export environment metadata
    print(f\"export ENV_NAME={config['environment']['name']}\")
    print(f\"export ENV_TYPE={config['environment']['type']}\")
    print(f\"export BASE_DIR={config['base']['directory']}\")
    print(f\"export REGISTRY={config['base']['registry']}\")
    # Export node configurations
    for idx, node in enumerate(config['nodes']):
        print(f\"export NODE_{idx}_NAME={node['name']}\")
        print(f\"export NODE_{idx}_CONTEXT={node['context']}\")
        print(f\"export NODE_{idx}_STACK={node['stack']}\")
    ")
```
**Node Iteration:**
```bash
# Deploy to all nodes
for NODE_CONFIG in $(yq eval '.nodes[] | @json' $ENV_CONFIG); do
  NODE_NAME=$(echo $NODE_CONFIG | jq -r '.name')
  NODE_CONTEXT=$(echo $NODE_CONFIG | jq -r '.context')
  NODE_STACK=$(echo $NODE_CONFIG | jq -r '.stack')

  echo "Deploying to ${NODE_NAME}..."
  .gitlab/scripts/deploy-node.sh \
    --environment $ENVIRONMENT \
    --node $NODE_NAME \
    --context $NODE_CONTEXT \
    --stack $NODE_STACK \
    --release-tag $RELEASE_TAG
done
```
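For the `sequential` strategy from config.yml, the loop should also abort on the first failed node so that a broken release never reaches the remaining nodes. A sketch with a stubbed `deploy_node`; the real job would call .gitlab/scripts/deploy-node.sh:

```shell
# Sequential deployment: deploy node by node, abort on the first failure.
set -euo pipefail

NODES="node3 node4"

deploy_node() {                # stub standing in for deploy-node.sh
  echo "deploying $1"
}

for node in $NODES; do
  if ! deploy_node "$node"; then
    echo "deploy failed on $node - aborting remaining nodes" >&2
    exit 1
  fi
done
echo "all nodes deployed"
```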
---
## 5. GitLab CI/CD Pipeline Design
### 5.1 Main Pipeline Structure
**.gitlab-ci.yml:**
```yaml
# COIN universal deployment pipeline
# Supports: development, sandbox, testing, production

stages:
  - validate
  - prepare
  - deploy
  - verify
  - notify

# Global variables
variables:
  ENVIRONMENT: "${CI_ENVIRONMENT_NAME}"   # from the GitLab environment
  RELEASE_TAG: "${CI_COMMIT_TAG}"
  TASK_ID: "${CI_MERGE_REQUEST_IID}"

# Include the modular pipelines
include:
  - local: '.gitlab/pipelines/prepare.yml'
  - local: '.gitlab/pipelines/deploy.yml'
  - local: '.gitlab/pipelines/verify.yml'
  - local: '.gitlab/pipelines/rollback.yml'

# Workflow rules
workflow:
  rules:
    # Production: tags only
    - if: '$CI_COMMIT_TAG =~ /^\d{4}-\d{2}-\d{2}-[a-f0-9]{10}$/ && $ENVIRONMENT == "production"'
      variables:
        DEPLOY_TYPE: "production-release"
    # Testing: manual trigger or tags
    - if: '$CI_COMMIT_TAG && $ENVIRONMENT == "testing"'
      variables:
        DEPLOY_TYPE: "testing-release"
    # Sandbox: automatic on master
    - if: '$CI_COMMIT_BRANCH == "master" && $ENVIRONMENT == "sandbox"'
      variables:
        DEPLOY_TYPE: "sandbox-continuous"
    # Development: automatic on any push
    - if: '$CI_COMMIT_BRANCH && $ENVIRONMENT == "development"'
      variables:
        DEPLOY_TYPE: "dev-continuous"

# Default configuration
default:
  tags:
    - coin-deployment-runner
  retry:
    max: 2
    when:
      - runner_system_failure
      - stuck_or_timeout_failure
```
### 5.2 Validate Stage
**.gitlab/pipelines/validate.yml:**
```yaml
# ===============================================
# VALIDATION STAGE
# Pre-deployment checks
# ===============================================
load_environment_config:
stage: validate
script:
- echo "Loading configuration for: ${ENVIRONMENT}"
- ENV_CONFIG="environments/${ENVIRONMENT}/config.yml"
- |
if [ ! -f "$ENV_CONFIG" ]; then
echo "❌ Environment config not found: $ENV_CONFIG"
exit 1
fi
# Validate YAML syntax
- python3 -c "import yaml; yaml.safe_load(open('${ENV_CONFIG}'))"
- echo "✅ Environment configuration valid"
# Export to artifacts
- cat $ENV_CONFIG > env_config.yml
artifacts:
paths:
- env_config.yml
expire_in: 1 hour
validate_release_tag:
stage: validate
script:
- echo "Validating release tag: ${RELEASE_TAG}"
# Check tag format: YYYY-MM-DD-<hash>
- |
if ! echo "$RELEASE_TAG" | grep -qE '^[0-9]{4}-[0-9]{2}-[0-9]{2}-[a-f0-9]{10}$'; then
echo "❌ Invalid release tag format: $RELEASE_TAG"
echo "Expected format: YYYY-MM-DD-<10-char-hash>"
exit 1
fi
- echo "✅ Release tag format valid"
check_image_availability:
stage: validate
script:
- echo "Checking Docker image availability..."
- ENV_CONFIG="environments/${ENVIRONMENT}/config.yml"
- REGISTRY=$(yq eval '.base.registry' $ENV_CONFIG)
- IMAGE="${REGISTRY}:${RELEASE_TAG}"
# Login to registry
- echo "$HARBOR_PASSWORD" | docker login -u "$HARBOR_USER" --password-stdin $(echo $REGISTRY | cut -d'/' -f1)
# Check image exists
- docker manifest inspect "${IMAGE}" > /dev/null 2>&1
- echo "✅ Image exists: ${IMAGE}"
# Check vulnerability scan
- |
SCAN_STATUS=$(curl -s -u "$HARBOR_USER:$HARBOR_PASSWORD" \
"https://$(echo $REGISTRY | cut -d'/' -f1)/api/v2.0/projects/coin/repositories/release/artifacts/${RELEASE_TAG}/additions/vulnerabilities" \
| jq -r '.scan_overview.severity // "unknown"')
echo "Vulnerability scan status: $SCAN_STATUS"
if [ "$SCAN_STATUS" == "Critical" ]; then
echo "⚠️ Critical vulnerabilities found!"
echo "Deployment blocked for production"
if [ "$ENVIRONMENT" == "production" ]; then
exit 1
fi
fi
- echo "✅ Image security check passed"
validate_docker_contexts:
stage: validate
script:
- echo "Validating Docker contexts..."
- ENV_CONFIG="environments/${ENVIRONMENT}/config.yml"
# Check each node context
- |
yq eval '.nodes[] | @json' $ENV_CONFIG | while read -r node; do
CONTEXT=$(echo $node | jq -r '.context')
ENDPOINT=$(echo $node | jq -r '.endpoint')
echo "Checking context: $CONTEXT ($ENDPOINT)"
# Verify context exists
if ! docker context ls --format '{{.Name}}' | grep -q "^${CONTEXT}$"; then
echo "❌ Context not found: $CONTEXT"
exit 1
fi
# Test connectivity
if docker --context $CONTEXT node ls > /dev/null 2>&1; then
echo "✅ Context accessible: $CONTEXT"
else
echo "❌ Cannot connect to context: $CONTEXT"
exit 1
fi
done
check_database_connectivity:
stage: validate
script:
- echo "Checking database connectivity..."
- ENV_CONFIG="environments/${ENVIRONMENT}/config.yml"
- DB_HOST=$(yq eval '.database.host' $ENV_CONFIG)
- DB_PORT=$(yq eval '.database.port' $ENV_CONFIG)
- DB_NAME=$(yq eval '.database.name' $ENV_CONFIG)
- DB_USER=$(yq eval '.database.user' $ENV_CONFIG)
- echo "Database: ${DB_USER}@${DB_HOST}:${DB_PORT}/${DB_NAME}"
# Test connection
- |
PGPASSWORD="${DB_PASSWORD}" psql \
-h "${DB_HOST}" \
-p "${DB_PORT}" \
-U "${DB_USER}" \
-d "${DB_NAME}" \
-c "SELECT 1;" > /dev/null
- echo "✅ Database connection successful"
```
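The tag-format rule enforced by `validate_release_tag` can also be expressed compactly outside the pipeline; a minimal, illustrative Python helper (not part of the pipeline itself):

```python
import re

# Release tags follow YYYY-MM-DD-<10-char-hash>, e.g. 2025-01-14-a1b2c3d4e5
TAG_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}-[a-f0-9]{10}$")

def is_valid_release_tag(tag: str) -> bool:
    """True if the tag matches the release naming convention used above."""
    return TAG_PATTERN.fullmatch(tag) is not None
```

The same pattern is used in the workflow rules and in the grep check, so keeping it in one place avoids drift between the two.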
### 5.3 Prepare Stage
**.gitlab/pipelines/prepare.yml:**
```yaml
# ===============================================
# PREPARATION STAGE
# Prepare deployment directories and artifacts
# ===============================================
prepare_release_directories:
stage: prepare
needs:
- load_environment_config
script:
- echo "Preparing release directories..."
- ENV_CONFIG="env_config.yml"
- BASE_DIR=$(yq eval '.base.directory' $ENV_CONFIG)
- REGISTRY=$(yq eval '.base.registry' $ENV_CONFIG)
# Extract release from Docker image
- echo "Extracting release archive..."
- IMAGE="${REGISTRY}:${RELEASE_TAG}"
- docker run -i --rm "${IMAGE}" release | base64 -d > release.tar.gz
- tar -xzf release.tar.gz
- rm release.tar.gz
- RELEASE_DIR="coin-${RELEASE_TAG}"
- echo "Release extracted to: $RELEASE_DIR"
# Prepare for each node
- |
yq eval '.nodes[] | @json' $ENV_CONFIG | while read -r node; do
NODE_NAME=$(echo $node | jq -r '.name')
echo "Preparing node: $NODE_NAME"
TARGET_DIR="${BASE_DIR}/${RELEASE_TAG}-${NODE_NAME}"
mkdir -p "$TARGET_DIR"
# Copy release files
cp -r "$RELEASE_DIR"/* "$TARGET_DIR/"
# Copy node-specific configuration
cp "environments/${ENVIRONMENT}/nodes/${NODE_NAME}"/* "$TARGET_DIR/"
# Decrypt secrets
sops -d "environments/${ENVIRONMENT}/nodes/${NODE_NAME}/secrets.override.enc" \
> "$TARGET_DIR/secrets.override.env"
# Update TAG in node.env
sed -i "s/^TAG=.*/TAG=${RELEASE_TAG}/" "$TARGET_DIR/node.env"
# Add deployment metadata
cat >> "$TARGET_DIR/node.env" <<EOF
# Deployment Metadata (auto-generated)
DEPLOYED_BY=${CI_COMMIT_AUTHOR}
DEPLOYED_AT=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
PIPELINE_ID=${CI_PIPELINE_ID}
GIT_COMMIT=${CI_COMMIT_SHA}
ENVIRONMENT=${ENVIRONMENT}
NODE_NAME=${NODE_NAME}
EOF
echo "✅ Node prepared: $NODE_NAME"
done
artifacts:
paths:
- coin-${RELEASE_TAG}/
expire_in: 1 hour
generate_deployment_manifest:
stage: prepare
needs:
- prepare_release_directories
script:
- |
cat > deployment-manifest.json <<EOF
{
"release_tag": "${RELEASE_TAG}",
"environment": "${ENVIRONMENT}",
"task_id": "${TASK_ID}",
"deployed_by": "${CI_COMMIT_AUTHOR}",
"deployed_at": "$(date -u +"%Y-%m-%dT%H:%M:%SZ")",
"pipeline_id": "${CI_PIPELINE_ID}",
"git_commit": "${CI_COMMIT_SHA}",
"git_branch": "${CI_COMMIT_BRANCH}",
"nodes": $(yq eval '.nodes[].name' env_config.yml -o=json | jq -s .)
}
EOF
- cat deployment-manifest.json
- echo "✅ Deployment manifest generated"
artifacts:
paths:
- deployment-manifest.json
expire_in: 30 days
```
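The manifest written by `generate_deployment_manifest` is plain JSON; the same structure as a small Python sketch (field names mirror the job above, values are placeholders):

```python
import json
from datetime import datetime, timezone

def build_manifest(release_tag, environment, task_id, author,
                   pipeline_id, commit, branch, nodes):
    """Mirror of the fields written to deployment-manifest.json."""
    return {
        "release_tag": release_tag,
        "environment": environment,
        "task_id": task_id,
        "deployed_by": author,
        "deployed_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "pipeline_id": pipeline_id,
        "git_commit": commit,
        "git_branch": branch,
        "nodes": nodes,
    }

manifest = build_manifest("2025-01-14-a1b2c3d4e5", "sandbox", "42", "devops",
                          "1001", "abc123", "master", ["node3", "node4"])
print(json.dumps(manifest, indent=2))
```

Keeping the manifest as a 30-day artifact gives a durable audit record of what was deployed where, by whom.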
### 5.4 Deploy Stage
**.gitlab/pipelines/deploy.yml:**
```yaml
# ===============================================
# DEPLOYMENT STAGE
# Deploy to nodes according to strategy
# ===============================================
.deploy_template: &deploy_template
stage: deploy
script:
- echo "Deploying to ${NODE_NAME}..."
- ENV_CONFIG="env_config.yml"
- BASE_DIR=$(yq eval '.base.directory' $ENV_CONFIG)
# Get node configuration
- |
NODE_CONFIG=$(yq eval ".nodes[] | select(.name == \"${NODE_NAME}\")" $ENV_CONFIG -o=json)
NODE_CONTEXT=$(echo $NODE_CONFIG | jq -r '.context')
NODE_STACK=$(echo $NODE_CONFIG | jq -r '.stack')
- echo "Context: $NODE_CONTEXT"
- echo "Stack: $NODE_STACK"
# Navigate to deployment directory
- TARGET_DIR="${BASE_DIR}/${RELEASE_TAG}-${NODE_NAME}"
- cd "$TARGET_DIR"
# Verify deployment.sh exists
- |
if [ ! -f "deployment.sh" ]; then
echo "❌ deployment.sh not found in $TARGET_DIR"
exit 1
fi
# Switch Docker context
- docker context use "$NODE_CONTEXT"
# Execute deployment
- |
./deployment.sh deploy \
-n "$NODE_CONTEXT" \
-w "$NODE_STACK" \
-N node.env \
-P project.env \
-P project_${NODE_NAME}.env \
-f docker-compose.yml \
-f custom.secrets.yml \
-f docker-compose-testshop.yaml \
-s secrets.override.env \
-u
# Verify services started
- docker service ls --filter name="$NODE_STACK"
- echo "✅ Deployment completed: ${NODE_NAME}"
# Dynamic node deployment jobs
# Generated based on environment config
deploy_node_primary:
<<: *deploy_template
variables:
NODE_NAME: "node3" # Will be dynamic in real implementation
environment:
name: ${ENVIRONMENT}/node3
url: https://coin-node3.${ENVIRONMENT}.company.com
when: manual # For production, auto for dev/sandbox
deploy_node_secondary:
<<: *deploy_template
variables:
NODE_NAME: "node4"
environment:
name: ${ENVIRONMENT}/node4
url: https://coin-node4.${ENVIRONMENT}.company.com
needs:
- deploy_node_primary # Sequential deployment
when: on_success
```
---
## 6. Environment Management
### 6.1 Environment-specific Configuration Strategy
**The problem:** each environment has different requirements:
- Development: 1-2 nodes, minimal resources, all features ON
- Sandbox: 2 nodes (node3, node4), test data, some features OFF
- Testing: 2-3 nodes, production-like, QA validation
- Production: 4+ nodes, HA, strict security, all checks enabled
**The solution:** hierarchical configuration with environment-specific overrides.
#### Configuration Hierarchy
```
Base Template (shared defaults)
Environment Common (values shared within dev/sandbox/testing/prod)
Node-Specific (individual per node)
Secrets (encrypted, per-node)
```
**Example for sandbox/node3:**
```bash
# 1. Base Template
templates/project.env.template:
DATABASE_POOL_SIZE={{DB_POOL_SIZE}}
FEATURE_NEW_CHECKOUT={{FEATURE_NEW_CHECKOUT}}
LOG_LEVEL={{LOG_LEVEL}}
# 2. Environment Common
environments/sandbox/common/project.env:
DB_POOL_SIZE=10
FEATURE_NEW_CHECKOUT=true
LOG_LEVEL=debug
# 3. Node-Specific
environments/sandbox/nodes/node3/project_node3.env:
NODE_NAME=node3
PUBLIC_URL=https://coin-node3.sandbox.company.com
MAX_WORKERS=6
# 4. Secrets
environments/sandbox/nodes/node3/secrets.override.enc:
DATABASE_PASSWORD=encrypted...
API_KEY=encrypted...
# Final merged configuration:
DATABASE_POOL_SIZE=10
FEATURE_NEW_CHECKOUT=true
LOG_LEVEL=debug
NODE_NAME=node3
PUBLIC_URL=https://coin-node3.sandbox.company.com
MAX_WORKERS=6
DATABASE_PASSWORD=decrypted_value
API_KEY=decrypted_value
```
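The four layers reduce to a simple "later layer wins" merge; a sketch using the values from the example above:

```python
def merge_config(*layers):
    """Merge configuration layers; later layers override earlier ones
    (base template < environment common < node-specific < secrets)."""
    merged = {}
    for layer in layers:
        merged.update(layer)
    return merged

env_common = {"DATABASE_POOL_SIZE": "10", "FEATURE_NEW_CHECKOUT": "true", "LOG_LEVEL": "debug"}
node_specific = {"NODE_NAME": "node3", "MAX_WORKERS": "6"}
secrets = {"DATABASE_PASSWORD": "decrypted_value", "API_KEY": "decrypted_value"}

final = merge_config(env_common, node_specific, secrets)
```

The precedence order matters: node-specific values must be able to override environment-common ones, and secrets always land last so they cannot be shadowed by plaintext files.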
### 6.2 Environment-specific Values Matrix
**Comparison Matrix:**
| Parameter | Development | Sandbox | Testing | Production |
|-----------|-------------|---------|---------|------------|
| **Nodes** | 1-2 | 2 (node3, node4) | 2-3 | 4+ (HA) |
| **Replicas** | 1 | 1-2 | 2-3 | 3-5 |
| **Database Pool** | 5 | 10 | 20 | 50 |
| **Log Level** | debug | debug | info | warning |
| **Feature Flags** | All ON | Most ON | Selected | Stable only |
| **Health Check Timeout** | 60s | 120s | 180s | 300s |
| **Deployment Strategy** | replace | sequential | sequential | blue-green |
| **Auto-deploy** | Yes | Yes | Manual | Manual + CAB |
| **Rollback** | Manual | Manual | Manual | Auto on failure |
| **Monitoring** | Basic | Standard | Enhanced | Full |
| **Retention** | 7 days | 14 days | 30 days | 90 days |
**Implementation:**
```yaml
# environments/development/config.yml
deployment:
replicas: 1
database_pool_size: 5
log_level: debug
feature_flags:
all: true
health_check_timeout: 60s
strategy: replace
auto_deploy: true
# environments/production/config.yml
deployment:
replicas: 3
database_pool_size: 50
log_level: warning
feature_flags:
new_checkout: true
beta_ui: false
experimental: false
health_check_timeout: 300s
strategy: blue-green
auto_deploy: false
approval_required: true
```
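For quick reference, the matrix above can also be kept as data; a condensed, purely illustrative view (numbers follow the comparison table, not an authoritative config):

```python
# Per-environment deployment defaults, condensed from the comparison matrix
ENVIRONMENT_DEFAULTS = {
    "development": {"replicas": 1, "db_pool": 5,  "log_level": "debug",   "auto_deploy": True},
    "sandbox":     {"replicas": 2, "db_pool": 10, "log_level": "debug",   "auto_deploy": True},
    "testing":     {"replicas": 3, "db_pool": 20, "log_level": "info",    "auto_deploy": False},
    "production":  {"replicas": 3, "db_pool": 50, "log_level": "warning", "auto_deploy": False},
}

def deployment_defaults(environment: str) -> dict:
    """Look up per-environment defaults; an unknown name is a configuration error."""
    if environment not in ENVIRONMENT_DEFAULTS:
        raise KeyError(f"unknown environment: {environment}")
    return ENVIRONMENT_DEFAULTS[environment]
```

Failing fast on an unknown environment name catches typos before any node is touched.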
### 6.3 Docker Context Management
**Current problem:** contexts are hardcoded in auto.sh:
```bash
NODE3_CONTEXT="wlt-sbx-dkapp3-ams"
NODE4_CONTEXT="wlt-sbx-dkapp4-ams"
```
**Solution:** dynamic context creation on the GitLab Runner.
#### Docker Context Setup Script
**.gitlab/scripts/setup-docker-contexts.sh:**
```bash
#!/usr/bin/env bash
set -euo pipefail
# Arguments:
# $1 - ENVIRONMENT (development/sandbox/testing/production)
ENVIRONMENT=$1
ENV_CONFIG="environments/${ENVIRONMENT}/config.yml"
echo "Setting up Docker contexts for: ${ENVIRONMENT}"
# Parse nodes from config
yq eval '.nodes[] | @json' $ENV_CONFIG | while read -r node; do
NAME=$(echo $node | jq -r '.name')
CONTEXT=$(echo $node | jq -r '.context')
ENDPOINT=$(echo $node | jq -r '.endpoint')
echo "Creating context: $CONTEXT"
# Remove existing context if present
docker context rm "$CONTEXT" 2>/dev/null || true
# Create context with TLS
docker context create "$CONTEXT" \
--description "COIN ${ENVIRONMENT} ${NAME}" \
--docker "host=${ENDPOINT},ca=/certs/${ENVIRONMENT}/ca.pem,cert=/certs/${ENVIRONMENT}/cert.pem,key=/certs/${ENVIRONMENT}/key.pem"
# Verify context
if docker --context "$CONTEXT" node ls > /dev/null 2>&1; then
echo "✅ Context verified: $CONTEXT"
else
echo "❌ Context verification failed: $CONTEXT"
exit 1
fi
done
echo "All contexts created successfully"
```
**Usage in the pipeline:**
```yaml
setup_docker_contexts:
stage: .pre
script:
- .gitlab/scripts/setup-docker-contexts.sh "${ENVIRONMENT}"
cache:
key: docker-contexts-${ENVIRONMENT}
paths:
- ~/.docker/contexts/
```
### 6.4 Environment Promotion Workflow
**Concept:** changes move through the environments sequentially.
```
Development → Sandbox → Testing → Production
(auto) (auto) (manual) (CAB approval)
```
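Only adjacent hops in this chain are legal; the rule can be stated in a few lines (illustrative sketch of the check the promotion script performs):

```python
# Allowed promotion edges from the chain above
VALID_PROMOTIONS = {
    ("development", "sandbox"),
    ("sandbox", "testing"),
    ("testing", "production"),
}

def can_promote(src: str, dst: str) -> bool:
    """Only adjacent environments in the chain may be promoted."""
    return (src, dst) in VALID_PROMOTIONS
```

Skipping a stage (e.g. development straight to production) is rejected by design: every configuration change must soak in each intermediate environment.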
**Promotion Script:**
**.gitlab/scripts/promote-environment.sh:**
```bash
#!/usr/bin/env bash
set -euo pipefail
# Arguments:
# $1 - FROM_ENV (development/sandbox/testing)
# $2 - TO_ENV (sandbox/testing/production)
FROM_ENV=$1
TO_ENV=$2
echo "Promoting configuration: ${FROM_ENV}${TO_ENV}"
# Validation
VALID_PROMOTIONS=(
"development:sandbox"
"sandbox:testing"
"testing:production"
)
PROMOTION="${FROM_ENV}:${TO_ENV}"
if [[ ! " ${VALID_PROMOTIONS[@]} " =~ " ${PROMOTION} " ]]; then
echo "❌ Invalid promotion path: $PROMOTION"
echo "Valid promotions:"
for p in "${VALID_PROMOTIONS[@]}"; do
echo " - $p"
done
exit 1
fi
# Copy common configuration
echo "Copying common configuration..."
cp -r "environments/${FROM_ENV}/common/project.env" \
"environments/${TO_ENV}/common/project.env.promoted"
# Review changes
echo "Configuration changes:"
diff "environments/${TO_ENV}/common/project.env" \
"environments/${TO_ENV}/common/project.env.promoted" || true
# Node-specific configurations
for FROM_NODE in environments/${FROM_ENV}/nodes/*/; do
NODE_NAME=$(basename "$FROM_NODE")
TO_NODE="environments/${TO_ENV}/nodes/${NODE_NAME}"
if [ -d "$TO_NODE" ]; then
echo "Promoting node configuration: $NODE_NAME"
# Copy non-secret files
cp "${FROM_NODE}/docker-compose.yml" "${TO_NODE}/docker-compose.yml.promoted"
cp "${FROM_NODE}/project_${NODE_NAME}.env" "${TO_NODE}/project_${NODE_NAME}.env.promoted"
# Secrets are NOT promoted automatically - manual review required
else
echo "⚠️ Node ${NODE_NAME} does not exist in ${TO_ENV}"
fi
done
echo "Promotion prepared. Review .promoted files and commit if acceptable."
```
**GitLab Pipeline Integration:**
```yaml
promote_to_testing:
stage: promote
script:
- .gitlab/scripts/promote-environment.sh sandbox testing
# Create merge request
- |
git checkout -b "promote/sandbox-to-testing-${CI_COMMIT_SHORT_SHA}"
# Move promoted files
find environments/testing -name "*.promoted" | while read file; do
mv "$file" "${file%.promoted}"
done
git add environments/testing/
git commit -m "config: promote sandbox → testing
Promoted configuration from sandbox to testing
- Common project settings
- Node-specific configurations
- Docker compose files
Refs: ${CI_COMMIT_SHA}"
git push origin "promote/sandbox-to-testing-${CI_COMMIT_SHORT_SHA}"
# Create MR via GitLab API
- |
curl -X POST "${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/merge_requests" \
--header "PRIVATE-TOKEN: ${GITLAB_API_TOKEN}" \
--data "source_branch=promote/sandbox-to-testing-${CI_COMMIT_SHORT_SHA}" \
--data "target_branch=master" \
--data "title=Promote configuration: sandbox → testing" \
--data "description=Automated configuration promotion from sandbox to testing.
## Changes
- Common configuration updates
- Node-specific setting adjustments
## Review Required
- Verify all changes are appropriate for testing environment
- Check resource allocations
- Validate feature flags
## Next Steps
After merge, trigger testing deployment pipeline."
when: manual
only:
- master
```
### 6.5 Feature Flag Management
**Purpose:** enable/disable features without a code deployment.
**Implementation:**
```bash
# environments/development/common/project.env
# Development: all features ON for testing
FEATURE_NEW_CHECKOUT=true
FEATURE_BETA_UI=true
FEATURE_ADVANCED_REPORTING=true
FEATURE_EXPERIMENTAL_PAYMENT_FLOW=true
FEATURE_AI_RECOMMENDATIONS=true
# environments/sandbox/common/project.env
# Sandbox: most features ON, some experimental ones OFF
FEATURE_NEW_CHECKOUT=true
FEATURE_BETA_UI=true
FEATURE_ADVANCED_REPORTING=true
FEATURE_EXPERIMENTAL_PAYMENT_FLOW=false
FEATURE_AI_RECOMMENDATIONS=true
# environments/testing/common/project.env
# Testing: Production-like, only stable features
FEATURE_NEW_CHECKOUT=true
FEATURE_BETA_UI=false
FEATURE_ADVANCED_REPORTING=true
FEATURE_EXPERIMENTAL_PAYMENT_FLOW=false
FEATURE_AI_RECOMMENDATIONS=false
# environments/production/common/project.env
# Production: Only battle-tested features
FEATURE_NEW_CHECKOUT=true
FEATURE_BETA_UI=false
FEATURE_ADVANCED_REPORTING=true
FEATURE_EXPERIMENTAL_PAYMENT_FLOW=false
FEATURE_AI_RECOMMENDATIONS=false
```
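Since flags are plain `FEATURE_*=true|false` lines, every service can read them uniformly; a minimal parsing sketch (hypothetical helper, not an existing COIN module):

```python
def parse_feature_flags(env_text: str) -> dict:
    """Extract FEATURE_* variables from env-file text as booleans."""
    flags = {}
    for line in env_text.splitlines():
        line = line.strip()
        if line.startswith("FEATURE_") and "=" in line:
            key, _, value = line.partition("=")
            flags[key] = value.strip().lower() == "true"
    return flags

sandbox_env = """\
# Sandbox flags
FEATURE_NEW_CHECKOUT=true
FEATURE_EXPERIMENTAL_PAYMENT_FLOW=false
"""
```

Anything that is not exactly `true` (case-insensitive) is treated as off, which keeps a typo from silently enabling a feature.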
**Advanced: LaunchDarkly Integration (optional):**
```yaml
# For production, use LaunchDarkly for gradual rollouts
production_feature_flags:
stage: deploy
script:
- |
# Get feature flags from LaunchDarkly
FEATURE_CONFIG=$(curl -X GET \
"https://app.launchdarkly.com/api/v2/flags/coin-production" \
-H "Authorization: ${LAUNCHDARKLY_API_KEY}")
# Update environment variables
echo "FEATURE_NEW_CHECKOUT=$(echo $FEATURE_CONFIG | jq -r '.flags.new_checkout.on')" >> production.env
echo "FEATURE_BETA_UI=$(echo $FEATURE_CONFIG | jq -r '.flags.beta_ui.on')" >> production.env
only:
- tags
environment:
name: production
```
### 6.6 Resource Management per Environment
**Development:**
```yaml
# Minimal resources
services:
admin_api:
deploy:
replicas: 1
resources:
limits:
cpus: '0.5'
memory: 512M
reservations:
cpus: '0.25'
memory: 256M
```
**Sandbox:**
```yaml
# Moderate resources
services:
admin_api:
deploy:
replicas: 1
resources:
limits:
cpus: '1.0'
memory: 1G
reservations:
cpus: '0.5'
memory: 512M
```
**Production:**
```yaml
# Full resources
services:
admin_api:
deploy:
replicas: 3
resources:
limits:
cpus: '2.0'
memory: 4G
reservations:
cpus: '1.0'
memory: 2G
placement:
constraints:
- node.labels.env == production
preferences:
- spread: node.labels.zone # Multi-AZ
```
---
## 7. Secrets Management
### 7.1 Current Secret Management Analysis
**The existing scheme in docker-compose.yml:**
```yaml
secrets:
card_iv.txt:
file: ./secrets/card_iv.txt
name: card_iv.$SV_card_iv # Versioned secret
db_access:
file: ./secrets/db_access
name: db_access.$SV_db_access
# 30+ total secrets...
```
**Versioning via SV_* variables:**
```bash
# secrets.override.env
SV_card_iv=1
SV_db_access=2
SV_webhook_auth=1
# Results in Swarm:
# card_iv.1
# card_iv.2 (after bumping SV_card_iv=2; the old version still exists)
```
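The mapping from `SV_*` variables to the versioned Swarm secret names that docker-compose.yml references (`name: card_iv.$SV_card_iv`) can be sketched as:

```python
def resolve_secret_names(env_vars: dict) -> dict:
    """Resolve SV_<name>=<version> pairs to the versioned secret names
    that the compose file references (e.g. card_iv.$SV_card_iv -> card_iv.1)."""
    prefix = "SV_"
    return {
        key[len(prefix):]: f"{key[len(prefix):]}.{version}"
        for key, version in env_vars.items()
        if key.startswith(prefix)
    }
```

Non-`SV_` variables (such as `TAG`) pass through untouched, so the same env file can carry both secret versions and ordinary settings.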
**Problems:**
- ❌ Secrets stored in plaintext on the filesystem
- ❌ No centralized management
- ❌ Rotation is cumbersome (30+ files)
- ❌ No audit trail of who accessed what
- ❌ Risk of leaking via Git (if committed by accident)
### 7.2 Multi-Layer Secrets Architecture
**Архитектура:**
```
Layer 1: GitLab CI/CD Variables (Infrastructure Credentials)
├── HARBOR_USER / HARBOR_PASSWORD
├── SSH_PRIVATE_KEY_NODE3 / SSH_PRIVATE_KEY_NODE4
├── SOPS_GPG_PRIVATE_KEY
├── DB_PASSWORD
├── SLACK_WEBHOOK_URL
└── API tokens for external services
Layer 2: SOPS Encrypted Files in Git (Application Secrets)
├── Database credentials
├── API keys (payment gateway, etc.)
├── Encryption keys
├── JWT secrets
└── Third-party service credentials
Layer 3: Docker Secrets (Runtime)
├── Mounted into containers as files (/run/secrets/)
├── Managed by Swarm
├── Versioned (card_iv.1, card_iv.2)
├── Encrypted at rest & in transit
└── Access control via service definitions
Layer 4: External Secret Manager (Optional - Enterprise)
└── HashiCorp Vault
├── Dynamic secrets
├── Automatic rotation
├── Detailed audit logs
└── Policy-based access
```
### 7.3 SOPS Integration
**Setup:**
```bash
# 1. Generate GPG keys for authorized team members
gpg --full-generate-key
# Name: DevOps Team Member
# Email: devops@company.com
# 2. Export public key
gpg --armor --export devops@company.com > devops.pub.asc
# 3. Import team keys
for key in team/*.pub.asc; do
gpg --import "$key"
done
```
**.sops.yaml configuration:**
```yaml
creation_rules:
# Production secrets - senior team only
- path_regex: environments/production/.*/secrets\..*\.enc$
pgp: >-
FBC7B9E2A4F9289AC0C1D4843D16CEE4A27381B4,
8E2E0E4F09A5F8B9C1D2E3F4A5B6C7D8E9F0A1B2
# Unanchored + case-insensitive so UPPER_CASE keys (DATABASE_PASSWORD, API_KEY) match
encrypted_regex: '(?i)(password|secret|key|token|private_key|api_key)'
# Testing secrets - team leads + DevOps
- path_regex: environments/testing/.*/secrets\..*\.enc$
pgp: >-
FBC7B9E2A4F9289AC0C1D4843D16CEE4A27381B4,
1234567890ABCDEF1234567890ABCDEF12345678,
ABCDEF1234567890ABCDEF1234567890ABCDEF12
encrypted_regex: '(?i)(password|secret|key|token)'
# Sandbox secrets - entire DevOps team
- path_regex: environments/sandbox/.*/secrets\..*\.enc$
pgp: >-
FBC7B9E2A4F9289AC0C1D4843D16CEE4A27381B4,
1234567890ABCDEF1234567890ABCDEF12345678,
ABCDEF1234567890ABCDEF1234567890ABCDEF12,
9876543210FEDCBA9876543210FEDCBA98765432
encrypted_regex: '(?i)(password|secret|key|token)'
# Development - all developers
- path_regex: environments/development/.*/secrets\..*\.enc$
pgp: >-
FBC7B9E2A4F9289AC0C1D4843D16CEE4A27381B4,
DEV_TEAM_KEY_1,
DEV_TEAM_KEY_2,
DEV_TEAM_KEY_3
encrypted_regex: '(?i)(password|secret|key)'
```
**Create/Edit Encrypted Secrets:**
```bash
# Create new secret file for sandbox/node3
cd coin-gitops
sops environments/sandbox/nodes/node3/secrets.override.enc
# File opens in $EDITOR as plaintext:
DATABASE_PASSWORD: "sandbox-db-password-123"
API_KEY: "sk-sandbox-api-key-456"
JWT_SECRET: "jwt-signing-secret-789"
REDIS_PASSWORD: "redis-password-abc"
PAYMENT_GATEWAY_API_KEY: "pg-api-key-def"
CARD_ENCRYPTION_KEY: "card-enc-key-ghi"
# On save, automatically encrypted by SOPS
# Safe to commit to Git
git add environments/sandbox/nodes/node3/secrets.override.enc
git commit -m "feat(secrets): add sandbox node3 secrets"
```
**Encrypted File Format:**
```yaml
DATABASE_PASSWORD: ENC[AES256_GCM,data:8hT9k2mP3nQ...,iv:xyz...,tag:abc...,type:str]
API_KEY: ENC[AES256_GCM,data:mK9sL3nQ7pR...,iv:def...,tag:ghi...,type:str]
sops:
kms: []
pgp:
- created_at: "2025-01-14T10:30:00Z"
enc: |
-----BEGIN PGP MESSAGE-----
hQIMA...
-----END PGP MESSAGE-----
fp: FBC7B9E2A4F9289AC0C1D4843D16CEE4A27381B4
version: 3.7.3
```
### 7.4 CI/CD Pipeline Secret Handling
**Decryption in the pipeline:**
```yaml
decrypt_secrets:
stage: prepare
script:
- echo "Decrypting secrets for ${ENVIRONMENT}..."
# Import GPG key from GitLab CI/CD Variable
- echo "$SOPS_GPG_PRIVATE_KEY" | base64 -d | gpg --import
# Decrypt secrets for each node
- ENV_CONFIG="environments/${ENVIRONMENT}/config.yml"
- |
yq eval '.nodes[].name' $ENV_CONFIG | while read -r NODE_NAME; do
SECRET_FILE="environments/${ENVIRONMENT}/nodes/${NODE_NAME}/secrets.override.enc"
# Workspace-relative output: files under /tmp cannot be collected as GitLab artifacts
OUTPUT_FILE="secrets-${NODE_NAME}.env"
if [ -f "$SECRET_FILE" ]; then
echo "Decrypting secrets for: $NODE_NAME"
sops -d "$SECRET_FILE" > "$OUTPUT_FILE"
# Restrictive permissions
chmod 600 "$OUTPUT_FILE"
# Validate required secrets present
for KEY in DATABASE_PASSWORD API_KEY JWT_SECRET; do
if ! grep -q "^${KEY}:" "$OUTPUT_FILE"; then
echo "❌ Required secret ${KEY} not found for ${NODE_NAME}"
exit 1
fi
done
echo "✅ Secrets decrypted: $NODE_NAME"
else
echo "⚠️ No secrets file for: $NODE_NAME"
fi
done
artifacts:
paths:
- secrets-*.env
# Short expiration for security; no after_script cleanup, because artifacts
# are uploaded after after_script runs and would otherwise be deleted first
expire_in: 1 hour
```
**Convert YAML secrets to ENV format:**
```bash
# secrets.override.enc (YAML format, after sops -d):
DATABASE_PASSWORD: "secret123"
API_KEY: "key456"
# Convert to ENV format for deployment.sh
# (yq's -o=props emits "KEY = value" with spaces, which is not valid env syntax):
sops -d secrets.override.enc | yq eval 'to_entries | .[] | .key + "=" + .value' > secrets-node3.props.env
# Result:
DATABASE_PASSWORD=secret123
API_KEY=key456
```
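The same conversion as a dependency-free sketch, assuming the flat one-level `KEY: "value"` format shown above (no nesting, anchors, or multi-line values):

```python
def yaml_map_to_env(yaml_text: str) -> str:
    """Convert a flat KEY: "value" mapping into KEY=value lines."""
    out = []
    for raw in yaml_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not a key: value pair
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        value = value.strip().strip('"')
        out.append(key.strip() + "=" + value)
    return "\n".join(out)
```

For anything beyond this flat format, a real YAML parser should be used instead.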
### 7.5 Docker Secrets Creation in Swarm
**Create secrets from decrypted files:**
```yaml
create_docker_secrets:
stage: deploy
needs:
- decrypt_secrets
script:
- echo "Creating Docker secrets in Swarm..."
- ENV_CONFIG="environments/${ENVIRONMENT}/config.yml"
- |
yq eval '.nodes[] | @json' $ENV_CONFIG | while read -r node; do
NODE_NAME=$(echo $node | jq -r '.name')
CONTEXT=$(echo $node | jq -r '.context')
docker context use "$CONTEXT"
# Read decrypted secrets (workspace-relative artifact from decrypt_secrets)
SECRET_FILE="secrets-${NODE_NAME}.env"
# One version stamp for this deployment
SECRET_VERSION=$(date +%s) # Unix timestamp
# Create each secret in Swarm using the <name>.<version> convention
while IFS=: read -r key value; do
SECRET_NAME="${key}.${SECRET_VERSION}"
echo "$value" | docker secret create "$SECRET_NAME" - || {
echo "⚠️ Secret ${SECRET_NAME} already exists, skipping"
}
echo "✅ Secret created: $SECRET_NAME"
# Record the version variable consumed by docker-compose.yml (name: <key>.$SV_<key>)
echo "SV_${key}=${SECRET_VERSION}" >> secret_versions_${NODE_NAME}.env
done < <(yq eval 'to_entries | .[] | .key + ":" + .value' "$SECRET_FILE")
done
- echo "All secrets created in Swarm"
artifacts:
paths:
- secret_versions_*.env
expire_in: 1 day
```
### 7.6 Secret Rotation Strategy
**Rotation Process:**
```
1. Generate new secret value
2. Create new version in Swarm (e.g., db_password.3)
3. Update SV_db_password=3 in secrets.override.env
4. Deploy - services start using new version
5. Old versions (db_password.1, db_password.2) remain available for rollback
6. After grace period (7-30 days), remove old versions
```
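Step 6 needs a way to decide which old versions are safe to remove; an illustrative helper, assuming creation timestamps are known per version:

```python
from datetime import datetime, timedelta

def removable_versions(versions: dict, now: datetime, grace_days: int = 30) -> list:
    """Given {version: created_at}, keep the newest version and return
    older ones whose grace period has expired (safe to `docker secret rm`)."""
    newest = max(versions)
    cutoff = now - timedelta(days=grace_days)
    return sorted(v for v, created in versions.items()
                  if v != newest and created < cutoff)

now = datetime(2025, 3, 1)
history = {
    1: datetime(2024, 11, 1),   # old, past the grace period
    2: datetime(2025, 2, 20),   # old, but still inside the grace period
    3: datetime(2025, 2, 25),   # current version, always kept
}
```

The newest version is never returned, so even a misconfigured grace period cannot delete the secret currently in use.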
**Rotation Script:**
**.gitlab/scripts/rotate-secret.sh:**
```bash
#!/usr/bin/env bash
set -euo pipefail
# Arguments:
# $1 - ENVIRONMENT
# $2 - NODE_NAME
# $3 - SECRET_NAME
# $4 - NEW_VALUE
ENVIRONMENT=$1
NODE_NAME=$2
SECRET_NAME=$3
NEW_VALUE=$4
echo "Rotating secret: ${SECRET_NAME} for ${ENVIRONMENT}/${NODE_NAME}"
# Get Docker context
ENV_CONFIG="environments/${ENVIRONMENT}/config.yml"
CONTEXT=$(yq eval ".nodes[] | select(.name == \"${NODE_NAME}\") | .context" $ENV_CONFIG)
# Get current version
SECRET_FILE="environments/${ENVIRONMENT}/nodes/${NODE_NAME}/secrets.override.enc"
CURRENT_VERSION=$(sops -d "$SECRET_FILE" | yq eval ".${SECRET_NAME}_VERSION // 0")
NEW_VERSION=$((CURRENT_VERSION + 1))
echo "Current version: $CURRENT_VERSION"
echo "New version: $NEW_VERSION"
# Create new secret in Swarm
docker context use "$CONTEXT"
echo "$NEW_VALUE" | docker secret create "${SECRET_NAME}.${NEW_VERSION}" -
# Update encrypted file
sops --set "[\"${SECRET_NAME}\"] \"${NEW_VALUE}\"" "$SECRET_FILE"
sops --set "[\"${SECRET_NAME}_VERSION\"] ${NEW_VERSION}" "$SECRET_FILE"
echo "✅ Secret rotated: ${SECRET_NAME} → version ${NEW_VERSION}"
echo ""
echo "Next steps:"
echo "1. Commit updated secrets file"
echo "2. Deploy to apply new secret"
echo "3. After grace period, remove old version:"
echo " docker secret rm ${SECRET_NAME}.${CURRENT_VERSION}"
```
**Automated Rotation Schedule:**
```yaml
rotate_production_secrets:
stage: maintenance
script:
- |
# Rotate database password every 90 days
LAST_ROTATION=$(git log -1 --format=%ct -- environments/production/nodes/*/secrets.override.enc)
CURRENT=$(date +%s)
DAYS_SINCE=$((($CURRENT - $LAST_ROTATION) / 86400))
if [ $DAYS_SINCE -gt 90 ]; then
echo "Database password rotation required (${DAYS_SINCE} days since last)"
# Generate new password
NEW_PASSWORD=$(openssl rand -base64 32)
# Rotate for all production nodes
for NODE in prod1 prod2 prod3 prod4; do
.gitlab/scripts/rotate-secret.sh production "$NODE" "DATABASE_PASSWORD" "$NEW_PASSWORD"
done
# Create MR for approval
git checkout -b "security/rotate-db-password-$(date +%Y%m%d)"
git add environments/production/
git commit -m "security: rotate production database password
Automated 90-day rotation of database credentials
- Generated new strong password
- Updated all production nodes
- Old version will be removed after 30 days"
git push
# Create MR via API...
else
echo "Database password rotation not required (${DAYS_SINCE} days since last)"
fi
only:
- schedules
when: manual
```
### 7.7 Secret Access Audit
**Audit Logging:**
```yaml
audit_secret_access:
stage: verify
script:
- echo "Auditing secret access..."
- ENV_CONFIG="environments/${ENVIRONMENT}/config.yml"
- |
yq eval '.nodes[] | @json' $ENV_CONFIG | while read -r node; do
NODE_NAME=$(echo $node | jq -r '.name')
CONTEXT=$(echo $node | jq -r '.context')
docker context use "$CONTEXT"
# Get secret usage
docker secret ls --format '{{.Name}}\t{{.CreatedAt}}\t{{.UpdatedAt}}'
# Get services using secrets
docker service ls --format '{{.Name}}' | while read service; do
SECRETS=$(docker service inspect "$service" --format '{{range .Spec.TaskTemplate.ContainerSpec.Secrets}}{{.SecretName}} {{end}}')
if [ -n "$SECRETS" ]; then
echo "Service ${service} uses secrets: $SECRETS"
fi
done
done > secret-audit-${ENVIRONMENT}-$(date +%Y%m%d).log
- echo "✅ Audit log created"
artifacts:
paths:
- secret-audit-*.log
expire_in: 1 year
only:
- schedules
```
---
## 8. Rollback Strategy
### 8.1 Current Rollback Mechanism Analysis
**The existing rollback function in auto.sh:**
```bash
rollback() {
# 1. Stop current stacks
docker stack rm "$NODE3_STACK"
docker stack rm "$NODE4_STACK"
sleep 3
# 2. Deploy previous version
cd "$NODE3_PREV"
./deploy.sh deploy [params...]
cd "$NODE4_PREV"
./deploy.sh deploy [params...]
}
```
**Problems:**
- ⚠️ Depends on the previous release directories still existing
- ⚠️ No verification after rollback
- ⚠️ Manual trigger only
- ⚠️ Stacks are removed entirely (downtime)
- ⚠️ No partial rollback (all-or-nothing only)
### 8.2 Improved Rollback Architecture
**Multi-Level Rollback Strategy:**
```
Level 1: Service-Level Rollback (fastest, 1-2 minutes)
├── Revert single service to previous version
├── Keep other services running
├── Minimal impact
└── Use when: a bug in a single service
Level 2: Stack-Level Rollback (medium, 3-5 minutes)
├── Revert entire stack (all services)
├── Coordinated rollback
├── Moderate impact
└── Use: multiple services affected
Level 3: Infrastructure Rollback (slowest, 5-10 minutes)
├── Revert configuration changes
├── Revert database migrations (if safe)
├── Full environment restore
└── Use: critical infrastructure issues
```
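Picking the level is a blast-radius decision; an illustrative helper with assumed thresholds (a heuristic sketch, not a prescribed policy):

```python
def rollback_level(affected_services: list, infra_issue: bool = False) -> int:
    """Map blast radius to the rollback levels above:
    infrastructure problem -> 3, one broken service -> 1, several -> 2."""
    if infra_issue:
        return 3
    if len(affected_services) <= 1:
        return 1
    return 2
```

Starting at the lowest sufficient level keeps downtime proportional to the actual damage: a level-1 service revert takes 1-2 minutes, while a level-3 restore costs 5-10.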
### 8.3 GitLab Pipeline Rollback Jobs
**.gitlab/pipelines/rollback.yml:**
```yaml
# ===============================================
# ROLLBACK PIPELINE
# Multi-level rollback strategy
# ===============================================
.rollback_preparation: &rollback_preparation
before_script:
- echo "Preparing rollback for ${ENVIRONMENT}..."
- ENV_CONFIG="environments/${ENVIRONMENT}/config.yml"
# Get previous stable version from Git
- |
PREVIOUS_TAG=$(git describe --tags --abbrev=0 HEAD~1)
echo "Current: ${RELEASE_TAG}"
echo "Previous: ${PREVIOUS_TAG}"
echo "PREVIOUS_TAG=${PREVIOUS_TAG}" >> rollback.env
artifacts:
reports:
dotenv: rollback.env
expire_in: 1 hour
rollback_service:
stage: rollback
<<: *rollback_preparation
script:
- echo "Rolling back service: ${SERVICE_NAME}"
- NODE_NAME="${NODE_NAME}"
- ENV_CONFIG="environments/${ENVIRONMENT}/config.yml"
# Get node configuration
- |
NODE_CONFIG=$(yq eval ".nodes[] | select(.name == \"${NODE_NAME}\")" $ENV_CONFIG -o=json)
CONTEXT=$(echo $NODE_CONFIG | jq -r '.context')
STACK=$(echo $NODE_CONFIG | jq -r '.stack')
# Get previous image tag (REGISTRY comes from the environment config)
- REGISTRY=$(yq eval '.base.registry' $ENV_CONFIG)
- PREVIOUS_IMAGE="${REGISTRY}/${SERVICE_NAME}:${PREVIOUS_TAG}"
- echo "Rolling back ${SERVICE_NAME} to ${PREVIOUS_TAG}"
# Update service image
- docker context use "$CONTEXT"
- |
docker service update \
--image "$PREVIOUS_IMAGE" \
--update-failure-action rollback \
"${STACK}_${SERVICE_NAME}"
# Wait for service update
- sleep 30
# Verify service health
- |
REPLICAS=$(docker service ls --filter name="${STACK}_${SERVICE_NAME}" --format '{{.Replicas}}')
echo "Service replicas: $REPLICAS"
if [[ "$REPLICAS" != *"/"* ]]; then
echo "❌ Service rollback failed"
exit 1
fi
RUNNING=$(echo $REPLICAS | cut -d'/' -f1)
DESIRED=$(echo $REPLICAS | cut -d'/' -f2)
if [ "$RUNNING" -ne "$DESIRED" ]; then
echo "❌ Service not fully rolled back: $RUNNING/$DESIRED"
exit 1
fi
- echo "✅ Service rolled back successfully: ${SERVICE_NAME}"
variables:
SERVICE_NAME: "" # Must be provided
NODE_NAME: "" # Must be provided
when: manual
allow_failure: false
rollback_stack:
stage: rollback
<<: *rollback_preparation
script:
- echo "Rolling back entire stack: ${NODE_NAME}"
- ENV_CONFIG="environments/${ENVIRONMENT}/config.yml"
# Get node configuration
- |
NODE_CONFIG=$(yq eval ".nodes[] | select(.name == \"${NODE_NAME}\")" $ENV_CONFIG -o=json)
CONTEXT=$(echo $NODE_CONFIG | jq -r '.context')
STACK=$(echo $NODE_CONFIG | jq -r '.stack')
BASE_DIR=$(yq eval '.base.directory' $ENV_CONFIG)
- echo "Context: $CONTEXT"
- echo "Stack: $STACK"
- echo "Previous version: $PREVIOUS_TAG"
# Check previous version directory exists
- PREV_DIR="${BASE_DIR}/${PREVIOUS_TAG}-${NODE_NAME}"
- |
if [ ! -d "$PREV_DIR" ]; then
echo "❌ Previous version directory not found: $PREV_DIR"
echo "Available versions:"
ls -la "$BASE_DIR" | grep "$NODE_NAME" || true
exit 1
fi
- echo "✅ Previous version found: $PREV_DIR"
# Stop current stack (gracefully)
- docker context use "$CONTEXT"
- echo "Stopping current stack..."
- docker stack rm "$STACK" || echo "Stack already removed"
# Wait for stack to fully stop
- sleep 10
- |
while docker service ls | grep -q "$STACK"; do
echo "Waiting for services to stop..."
sleep 5
done
- echo "✅ Stack stopped"
# Deploy previous version
- cd "$PREV_DIR"
- echo "Deploying previous version from: $(pwd)"
- |
./deployment.sh deploy \
-n "$CONTEXT" \
-w "$STACK" \
-N node.env \
-P project.env \
-P project_${NODE_NAME}.env \
-f docker-compose.yml \
-f custom.secrets.yml \
-f docker-compose-testshop.yaml \
-s secrets.override.env \
-u
# Verify deployment
- sleep 30
- docker service ls --filter name="$STACK"
- |
SERVICE_COUNT=$(docker service ls --filter name="$STACK" --format '{{.Name}}' | wc -l)
if [ "$SERVICE_COUNT" -lt 5 ]; then
echo "❌ Rollback incomplete: only $SERVICE_COUNT services running"
exit 1
fi
- echo "✅ Stack rolled back successfully: ${NODE_NAME}"
variables:
NODE_NAME: "" # Must be provided
when: manual
allow_failure: false
rollback_all_nodes:
stage: rollback
<<: *rollback_preparation
script:
- echo "Rolling back all nodes in ${ENVIRONMENT}"
- ENV_CONFIG="environments/${ENVIRONMENT}/config.yml"
- BASE_DIR=$(yq eval '.base.directory' $ENV_CONFIG)
# Rollback each node sequentially
- |
yq eval '.nodes[]' $ENV_CONFIG -o=json -I=0 | while read -r node; do  # -I=0: one JSON object per line so each read gets a full object
NODE_NAME=$(echo $node | jq -r '.name')
CONTEXT=$(echo $node | jq -r '.context')
STACK=$(echo $node | jq -r '.stack')
echo "========================================="
echo "Rolling back node: $NODE_NAME"
echo "========================================="
PREV_DIR="${BASE_DIR}/${PREVIOUS_TAG}-${NODE_NAME}"
if [ ! -d "$PREV_DIR" ]; then
echo "❌ Previous version not found for: $NODE_NAME"
continue
fi
# Stop and redeploy
docker context use "$CONTEXT"
docker stack rm "$STACK" || true
sleep 10
cd "$PREV_DIR"
./deployment.sh deploy \
-n "$CONTEXT" \
-w "$STACK" \
-N node.env \
-P project.env \
-P project_${NODE_NAME}.env \
-f docker-compose.yml \
-f custom.secrets.yml \
-f docker-compose-testshop.yaml \
-s secrets.override.env \
-u
echo "✅ Node rolled back: $NODE_NAME"
done
- echo "✅ All nodes rolled back successfully"
when: manual
allow_failure: false
environment:
name: ${ENVIRONMENT}
action: rollback
```
### 8.4 Automatic Rollback Triggers
**Health Check Based Auto-Rollback:**
```yaml
verify_deployment_health:
stage: verify
script:
- echo "Monitoring deployment health..."
- ENV_CONFIG="environments/${ENVIRONMENT}/config.yml"
- HEALTH_CHECK_TIMEOUT=$(yq eval '.deployment.health_check.timeout' $ENV_CONFIG | sed 's/s//')
- HEALTH_CHECK_INTERVAL=$(yq eval '.deployment.health_check.interval' $ENV_CONFIG | sed 's/s//')
- START_TIME=$(date +%s)
- FAILURES=0
- MAX_FAILURES=3
- |
while true; do
CURRENT_TIME=$(date +%s)
ELAPSED=$(($CURRENT_TIME - $START_TIME))
if [ $ELAPSED -gt $HEALTH_CHECK_TIMEOUT ]; then
echo "❌ Health check timeout reached"
# Fail the job so the on_failure rollback path can take over
exit 1
fi
# Check all nodes
ALL_HEALTHY=true
# Process substitution keeps the loop in the current shell, so the
# ALL_HEALTHY/FAILURES updates survive (a pipeline would run the loop
# in a subshell and silently discard them)
while read -r node; do
NODE_NAME=$(echo $node | jq -r '.name')
CONTEXT=$(echo $node | jq -r '.context')
STACK=$(echo $node | jq -r '.stack')
docker context use "$CONTEXT"
# Count services whose running replica count differs from desired
UNHEALTHY=$(docker service ls --filter name="$STACK" --format '{{.Replicas}}' | awk -F'/' '$1 != $2' | wc -l)
if [ "$UNHEALTHY" -gt 0 ]; then
echo "⚠️ Unhealthy services detected on $NODE_NAME"
ALL_HEALTHY=false
FAILURES=$(($FAILURES + 1))
fi
done < <(yq eval '.nodes[]' $ENV_CONFIG -o=json -I=0)
if $ALL_HEALTHY; then
echo "✅ All services healthy"
break
fi
if [ $FAILURES -ge $MAX_FAILURES ]; then
echo "❌ Max failures reached: $FAILURES"
echo "Triggering automatic rollback..."
# Trigger rollback pipeline
curl -X POST \
-F "token=${CI_JOB_TOKEN}" \
-F "ref=master" \
-F "variables[ENVIRONMENT]=${ENVIRONMENT}" \
-F "variables[TRIGGER_ROLLBACK]=true" \
"${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/trigger/pipeline"
exit 1
fi
sleep $HEALTH_CHECK_INTERVAL
done
retry:
max: 0 # No retry - trigger rollback instead
```
### 8.5 Database Migration Rollback
**Problem:** Database migrations cannot be rolled back automatically (risk of data loss).
**Strategy:**
```yaml
handle_migration_rollback:
stage: rollback
script:
- echo "Handling database migration rollback..."
- echo "⚠️ WARNING: Database migrations cannot be automatically rolled back"
- ENV_CONFIG="environments/${ENVIRONMENT}/config.yml"
- DB_HOST=$(yq eval '.database.host' $ENV_CONFIG)
- DB_NAME=$(yq eval '.database.name' $ENV_CONFIG)
# Get current migration ID
- |
CURRENT_MIGRATION=$(PGPASSWORD="${DB_PASSWORD}" psql \
-h "${DB_HOST}" \
-U coin \
-d "${DB_NAME}" \
-t -c "SELECT MAX(id) FROM schema_migrations;")
- echo "Current migration ID: $CURRENT_MIGRATION"
# Get expected migration for previous version
- |
PREVIOUS_MIGRATION=$(git show ${PREVIOUS_TAG}:environments/${ENVIRONMENT}/migration.txt)
echo "Previous version migration ID: $PREVIOUS_MIGRATION"
- |
if [ "$CURRENT_MIGRATION" -gt "$PREVIOUS_MIGRATION" ]; then
echo "❌ CRITICAL: New migrations were applied!"
echo "Current: $CURRENT_MIGRATION"
echo "Previous: $PREVIOUS_MIGRATION"
echo ""
echo "Manual intervention required:"
echo "1. Review migrations between $PREVIOUS_MIGRATION and $CURRENT_MIGRATION"
echo "2. Determine if rollback is safe (check for data loss)"
echo "3. If safe, manually execute down migrations"
echo "4. If not safe, consider forward fix instead"
echo ""
echo "Contact DBA team immediately!"
# Send alert
curl -X POST "$SLACK_WEBHOOK_URL" \
-H 'Content-Type: application/json' \
-d '{
"text": "🚨 CRITICAL: Migration rollback required",
"attachments": [{
"color": "danger",
"text": "Environment: '"$ENVIRONMENT"'\nCurrent migration: '"$CURRENT_MIGRATION"'\nTarget migration: '"$PREVIOUS_MIGRATION"'\n\nManual DBA intervention required!"
}]
}'
exit 1
else
echo "✅ No new migrations applied, safe to rollback"
fi
when: on_failure
allow_failure: false
```
### 8.6 Rollback Verification
**Post-Rollback Checks:**
```yaml
verify_rollback:
stage: verify
needs:
- rollback_stack
script:
- echo "Verifying rollback success..."
- ENV_CONFIG="environments/${ENVIRONMENT}/config.yml"
# 1. Check all services running
- |
yq eval '.nodes[]' $ENV_CONFIG -o=json -I=0 | while read -r node; do
NODE_NAME=$(echo $node | jq -r '.name')
CONTEXT=$(echo $node | jq -r '.context')
STACK=$(echo $node | jq -r '.stack')
docker context use "$CONTEXT"
echo "Checking services on: $NODE_NAME"
SERVICES=$(docker service ls --filter name="$STACK" --format '{{.Name}}\t{{.Replicas}}')
echo "$SERVICES"
# Verify all services converged
UNCONVERGED=$(echo "$SERVICES" | awk -F'\t' '{
split($2, a, "/")
if (a[1] != a[2]) print $1
}')
if [ -n "$UNCONVERGED" ]; then
echo "❌ Unconverged services after rollback:"
echo "$UNCONVERGED"
exit 1
fi
done
- echo "✅ All services converged"
# 2. Health check endpoints
- |
yq eval '.nodes[]' $ENV_CONFIG -o=json -I=0 | while read -r node; do
NODE_NAME=$(echo $node | jq -r '.name')
PUBLIC_IP=$(echo $node | jq -r '.public_ip // ""')
if [ -n "$PUBLIC_IP" ]; then
echo "Health check: https://${PUBLIC_IP}:5443/health"
HTTP_CODE=$(curl -k -s -o /dev/null -w "%{http_code}" "https://${PUBLIC_IP}:5443/health")
if [ "$HTTP_CODE" != "200" ]; then
echo "❌ Health check failed: HTTP $HTTP_CODE"
exit 1
fi
echo "✅ Health check passed: $NODE_NAME"
fi
done
# 3. Smoke tests
- .gitlab/scripts/smoke-tests.sh "${ENVIRONMENT}"
- echo "✅ Rollback verification complete"
```
### 8.7 Rollback Documentation
**Post-Rollback Report:**
```yaml
generate_rollback_report:
stage: notify
needs:
- verify_rollback
script:
- |
cat > rollback-report-${ENVIRONMENT}-$(date +%Y%m%d-%H%M%S).md <<EOF
# Rollback Report
## Incident Summary
- **Environment**: ${ENVIRONMENT}
- **Date**: $(date -u +"%Y-%m-%d %H:%M:%S UTC")
- **Triggered By**: ${CI_COMMIT_AUTHOR}
- **Pipeline**: ${CI_PIPELINE_URL}
## Versions
- **Failed Version**: ${RELEASE_TAG}
- **Rolled Back To**: ${PREVIOUS_TAG}
## Rollback Actions
- Stack removed: ${STACK_NAME}
- Previous version deployed: ${PREVIOUS_TAG}
- Services restarted: All
- Health checks: Passed
## Verification
- All services converged: ✅
- Health endpoints responding: ✅
- Smoke tests passed: ✅
## Impact
- Downtime: ~5 minutes
- Affected users: [To be determined]
- Data loss: None
## Root Cause
[To be investigated]
## Next Steps
1. Investigate root cause of deployment failure
2. Fix identified issues
3. Test fix in lower environments
4. Schedule re-deployment
## Timeline
- Failure detected: $(date -u +"%Y-%m-%d %H:%M:%S UTC")
- Rollback initiated: $(date -u +"%Y-%m-%d %H:%M:%S UTC")
- Rollback completed: $(date -u +"%Y-%m-%d %H:%M:%S UTC")
- Services restored: $(date -u +"%Y-%m-%d %H:%M:%S UTC")
EOF
- cat rollback-report-*.md
# Send to Slack
- |
REPORT=$(cat rollback-report-*.md)
curl -X POST "$SLACK_WEBHOOK_URL" \
-H 'Content-Type: application/json' \
-d '{
"text": "Rollback Report: '"$ENVIRONMENT"'",
"attachments": [{
"color": "warning",
"text": "```'"$REPORT"'```"
}]
}'
artifacts:
paths:
- rollback-report-*.md
expire_in: 1 year
```
---
## 9. Monitoring and Verification
### 9.1 Multi-Layer Monitoring Architecture
**Monitoring Layers:**
```
Layer 1: Infrastructure Monitoring (Swarm Level)
├── Node health (CPU, memory, disk)
├── Service status (running/failed)
├── Container metrics
└── Network performance
Layer 2: Application Monitoring (Service Level)
├── HTTP endpoints health
├── Response times
├── Error rates
└── Transaction volumes
Layer 3: Business Monitoring (Business Level)
├── User activity
├── Transaction success rate
├── Revenue metrics
└── Critical business processes
Layer 4: Deployment Monitoring (CI/CD Level)
├── Pipeline success rate
├── Deployment frequency
├── Lead time for changes
└── MTTR (Mean Time To Recovery)
```
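The Layer 4 metrics are plain ratios over deployment records. As a minimal sketch, the MTTR calculation below uses hardcoded sample incident timestamps; in a real setup the data source would be Prometheus or the GitLab API:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Sample incident log: "failure_epoch recovery_epoch" pairs.
# The numbers here are illustrative only.
INCIDENTS="
1700000000 1700000600
1700100000 1700101800
"

total=0
count=0
while read -r failed recovered; do
  # skip the blank lines produced by the quoted sample above
  if [ -n "$failed" ]; then
    total=$(( total + recovered - failed ))
    count=$(( count + 1 ))
  fi
done <<< "$INCIDENTS"

MTTR_MIN=$(( total / count / 60 ))
echo "MTTR: ${MTTR_MIN} minutes"   # (600s + 1800s) / 2 incidents = 20 min
```

The same loop shape works for lead time (commit timestamp to deploy timestamp) once the pairs are pulled from CI records.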
### 9.2 Infrastructure Monitoring (Prometheus + Grafana)
**Prometheus Scrape Configuration:**
```yaml
# prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
# Docker Swarm Manager Metrics
- job_name: 'docker-swarm-manager'
static_configs:
- targets:
- node3.internal:9323
- node4.internal:9323
labels:
environment: 'sandbox'
# Node Exporter (Host Metrics)
- job_name: 'node-exporter'
static_configs:
- targets:
- node3.internal:9100
- node4.internal:9100
labels:
environment: 'sandbox'
# cAdvisor (Container Metrics)
- job_name: 'cadvisor'
static_configs:
- targets:
- node3.internal:8080
- node4.internal:8080
labels:
environment: 'sandbox'
# Application Metrics
- job_name: 'coin-api'
dns_sd_configs:
- names:
- 'tasks.admin_api'
- 'tasks.client_api'
type: 'A'
port: 9090 # Metrics port
```
**Key Metrics to Monitor:**
```prometheus
# Service Health
up{job="coin-api"} == 1
# Container Restarts
rate(container_restart_count[5m]) > 0
# CPU Usage
rate(container_cpu_usage_seconds_total[5m]) * 100
# Memory Usage
container_memory_usage_bytes / container_spec_memory_limit_bytes * 100
# Network Traffic
rate(container_network_receive_bytes_total[5m])
rate(container_network_transmit_bytes_total[5m])
# HTTP Request Rate
rate(http_requests_total[5m])
# HTTP Error Rate
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100
# Response Time (95th percentile)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
```
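For intuition, the 5xx error-rate expression above reduces to simple arithmetic over two counter samples: `rate()` is the counter delta divided by the window length, and the percentage divides the 5xx rate by the total rate. A minimal offline sketch with illustrative numbers (not a real Prometheus scrape):

```shell
#!/usr/bin/env bash
set -euo pipefail

T_WINDOW=300                      # the [5m] range in seconds
TOTAL_T0=10000; TOTAL_T1=13000    # http_requests_total samples
ERR_T0=200;     ERR_T1=230        # http_requests_total{status=~"5.."} samples

# ((err delta / window) / (total delta / window)) * 100
ERR_PCT=$(awk -v w="$T_WINDOW" -v t0="$TOTAL_T0" -v t1="$TOTAL_T1" \
              -v e0="$ERR_T0" -v e1="$ERR_T1" \
  'BEGIN { printf "%.2f", ((e1 - e0) / w) / ((t1 - t0) / w) * 100 }')
echo "5xx error rate: ${ERR_PCT}%"   # 30/3000 = 1.00%
```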
**Grafana Dashboard - Deployment Overview:**
```json
{
"dashboard": {
"title": "COIN Deployment Dashboard",
"panels": [
{
"title": "Deployment Timeline",
"type": "graph",
"targets": [
{
"expr": "changes(deployment_version{environment=\"$environment\"}[1h])"
}
]
},
{
"title": "Service Health",
"type": "stat",
"targets": [
{
"expr": "count(up{job=\"coin-api\",environment=\"$environment\"} == 1)"
}
]
},
{
"title": "Error Rate",
"type": "graph",
"targets": [
{
"expr": "rate(http_requests_total{status=~\"5..\",environment=\"$environment\"}[5m])"
}
]
},
{
"title": "Response Time (p95)",
"type": "graph",
"targets": [
{
"expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{environment=\"$environment\"}[5m]))"
}
]
}
]
}
}
```
### 9.3 Application Health Checks
**Health Check Endpoints:**
```yaml
# docker-compose.yml
services:
admin_api:
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:10000/health"]
interval: 10s
timeout: 5s
retries: 3
start_period: 40s
```
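Deployment scripts can gate on the same semantics: probe every `interval` seconds, give up after `retries` consecutive failures. A minimal sketch with the probe abstracted out; for a live container the probe could wrap `docker inspect --format '{{.State.Health.Status}}'`:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Generic wait-for-healthy loop mirroring the compose healthcheck fields
wait_healthy() {
  local probe=$1 interval=${2:-10} retries=${3:-3} failures=0
  while true; do
    if "$probe"; then
      echo "healthy"
      return 0
    fi
    failures=$(( failures + 1 ))
    if [ "$failures" -ge "$retries" ]; then
      echo "unhealthy after ${failures} failed probes"
      return 1
    fi
    sleep "$interval"
  done
}

# Demo probe: fails twice, then succeeds (a service warming up)
ATTEMPTS=0
probe_demo() {
  ATTEMPTS=$(( ATTEMPTS + 1 ))
  [ "$ATTEMPTS" -ge 3 ]
}

wait_healthy probe_demo 0 5   # prints "healthy" on the third probe
```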
**Comprehensive Health Check Script:**
**.gitlab/scripts/health-check.sh:**
```bash
#!/usr/bin/env bash
set -euo pipefail
# Arguments:
# $1 - BASE_URL (e.g., https://coin-node3.sandbox.company.com)
# $2 - ENVIRONMENT
BASE_URL=$1
ENVIRONMENT=$2
echo "Running health checks against: ${BASE_URL}"
FAILED_CHECKS=0
# Test 1: Basic Health Endpoint
echo "Test 1: Health endpoint..."
HTTP_CODE=$(curl -k -s -o /dev/null -w "%{http_code}" "${BASE_URL}/health" || true)  # || true: count the failure instead of aborting under set -e
if [ "$HTTP_CODE" = "200" ]; then
echo "✅ Health check passed (HTTP $HTTP_CODE)"
else
echo "❌ Health check failed (HTTP $HTTP_CODE)"
FAILED_CHECKS=$((FAILED_CHECKS + 1))
fi
# Test 2: API Version
echo "Test 2: API version..."
VERSION=$(curl -k -s "${BASE_URL}/api/version" | jq -r '.version // empty' || true)
if [ -n "$VERSION" ]; then
echo "✅ API version: ${VERSION}"
else
echo "❌ API version check failed"
FAILED_CHECKS=$((FAILED_CHECKS + 1))
fi
# Test 3: Database Connectivity
echo "Test 3: Database connectivity..."
DB_STATUS=$(curl -k -s "${BASE_URL}/api/health/database" | jq -r '.status // empty' || true)
if [ "$DB_STATUS" = "ok" ]; then
echo "✅ Database connectivity OK"
else
echo "❌ Database connectivity failed: $DB_STATUS"
FAILED_CHECKS=$((FAILED_CHECKS + 1))
fi
# Test 4: Redis Connectivity
echo "Test 4: Redis connectivity..."
REDIS_STATUS=$(curl -k -s "${BASE_URL}/api/health/redis" | jq -r '.status // empty' || true)
if [ "$REDIS_STATUS" = "ok" ]; then
echo "✅ Redis connectivity OK"
else
echo "❌ Redis connectivity failed: $REDIS_STATUS"
FAILED_CHECKS=$((FAILED_CHECKS + 1))
fi
# Test 5: Critical Endpoints
echo "Test 5: Critical endpoints..."
ENDPOINTS=(
"/api/auth/status"
"/api/users/me"
"/api/transactions/stats"
)
for endpoint in "${ENDPOINTS[@]}"; do
HTTP_CODE=$(curl -k -s -o /dev/null -w "%{http_code}" \
-H "Authorization: Bearer ${API_TEST_TOKEN:-}" \
"${BASE_URL}${endpoint}" || true)
if [ "$HTTP_CODE" = "200" ] || [ "$HTTP_CODE" = "401" ]; then
echo "✅ Endpoint reachable: $endpoint (HTTP $HTTP_CODE)"
else
echo "❌ Endpoint failed: $endpoint (HTTP $HTTP_CODE)"
FAILED_CHECKS=$((FAILED_CHECKS + 1))
fi
done
# Summary
echo ""
echo "========================================"
if [ $FAILED_CHECKS -eq 0 ]; then
echo "✅ All health checks passed"
echo "========================================"
exit 0
else
echo "❌ ${FAILED_CHECKS} health check(s) failed"
echo "========================================"
exit 1
fi
```
### 9.4 Smoke Tests
**Post-Deployment Smoke Test Suite:**
**.gitlab/scripts/smoke-tests.sh:**
```bash
#!/usr/bin/env bash
set -euo pipefail
# Arguments:
# $1 - ENVIRONMENT
ENVIRONMENT=$1
ENV_CONFIG="environments/${ENVIRONMENT}/config.yml"
echo "Running smoke tests for: ${ENVIRONMENT}"
FAILED_TESTS=0
# Get first node URL
FIRST_NODE=$(yq eval '.nodes[0].name' $ENV_CONFIG)
BASE_URL="https://coin-${FIRST_NODE}.${ENVIRONMENT}.company.com"
echo "Testing against: $BASE_URL"
# Test 1: User Authentication
echo "Smoke Test 1: User Authentication..."
AUTH_RESPONSE=$(curl -k -s -X POST "${BASE_URL}/api/auth/login" \
-H "Content-Type: application/json" \
-d '{"username":"test_user","password":"test_password"}')
TOKEN=$(echo $AUTH_RESPONSE | jq -r '.token // empty')
if [ -n "$TOKEN" ]; then
echo "✅ Authentication successful"
else
echo "❌ Authentication failed"
FAILED_TESTS=$((FAILED_TESTS + 1))
fi
# Test 2: Create Transaction
echo "Smoke Test 2: Create Transaction..."
TX_RESPONSE=$(curl -k -s -X POST "${BASE_URL}/api/transactions" \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"amount":100,"currency":"USD","description":"Smoke test"}')
TX_ID=$(echo $TX_RESPONSE | jq -r '.id // empty')
if [ -n "$TX_ID" ]; then
echo "✅ Transaction created: $TX_ID"
else
echo "❌ Transaction creation failed"
FAILED_TESTS=$((FAILED_TESTS + 1))
fi
# Test 3: Retrieve Transaction
echo "Smoke Test 3: Retrieve Transaction..."
TX_GET=$(curl -k -s "${BASE_URL}/api/transactions/${TX_ID}" \
-H "Authorization: Bearer $TOKEN")
TX_STATUS=$(echo $TX_GET | jq -r '.status // empty')
if [ "$TX_STATUS" = "pending" ] || [ "$TX_STATUS" = "completed" ]; then
echo "✅ Transaction retrieved: status=$TX_STATUS"
else
echo "❌ Transaction retrieval failed"
FAILED_TESTS=$((FAILED_TESTS + 1))
fi
# Test 4: List Transactions
echo "Smoke Test 4: List Transactions..."
TX_LIST=$(curl -k -s "${BASE_URL}/api/transactions?limit=10" \
-H "Authorization: Bearer $TOKEN")
TX_COUNT=$(echo $TX_LIST | jq '.items | length' 2>/dev/null || echo 0)
if [ "$TX_COUNT" -gt 0 ]; then
echo "✅ Transaction list retrieved: $TX_COUNT items"
else
echo "❌ Transaction list empty or failed"
FAILED_TESTS=$((FAILED_TESTS + 1))
fi
# Test 5: Webhook Endpoint
echo "Smoke Test 5: Webhook Processing..."
WEBHOOK_RESPONSE=$(curl -k -s -X POST "${BASE_URL}/api/webhooks/test" \
-H "X-Webhook-Secret: ${WEBHOOK_SECRET:-}" \
-H "Content-Type: application/json" \
-d '{"event":"test","data":{}}')
WEBHOOK_STATUS=$(echo $WEBHOOK_RESPONSE | jq -r '.status // empty')
if [ "$WEBHOOK_STATUS" = "processed" ]; then
echo "✅ Webhook processed"
else
echo "❌ Webhook processing failed"
FAILED_TESTS=$((FAILED_TESTS + 1))
fi
# Test 6: PDF Generation
echo "Smoke Test 6: PDF Generation..."
PDF_RESPONSE=$(curl -k -s -X POST "${BASE_URL}/api/reports/generate" \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"type":"transaction_report","format":"pdf"}')
PDF_URL=$(echo $PDF_RESPONSE | jq -r '.url // empty')
if [ -n "$PDF_URL" ]; then
echo "✅ PDF generated: $PDF_URL"
else
echo "❌ PDF generation failed"
FAILED_TESTS=$((FAILED_TESTS + 1))
fi
# Summary
echo ""
echo "========================================"
echo "Smoke Tests Summary"
echo "========================================"
if [ $FAILED_TESTS -eq 0 ]; then
echo "✅ All smoke tests passed (6/6)"
exit 0
else
echo "❌ ${FAILED_TESTS} smoke test(s) failed"
exit 1
fi
```
### 9.5 Performance Baseline Monitoring
**Response Time Tracking:**
```yaml
monitor_performance_baseline:
stage: verify
script:
- echo "Monitoring performance baseline..."
- BASE_URL="https://coin-node3.${ENVIRONMENT}.company.com"
# Measure response times
- |
echo "Endpoint,Response_Time_MS,Status" > performance-${RELEASE_TAG}.csv
ENDPOINTS=(
"/health"
"/api/version"
"/api/auth/status"
"/api/transactions?limit=10"
)
for endpoint in "${ENDPOINTS[@]}"; do
RESPONSE_TIME=$(curl -k -s -o /dev/null -w "%{time_total}" "${BASE_URL}${endpoint}")
HTTP_CODE=$(curl -k -s -o /dev/null -w "%{http_code}" "${BASE_URL}${endpoint}")
RESPONSE_TIME_MS=$(echo "$RESPONSE_TIME * 1000" | bc)
echo "${endpoint},${RESPONSE_TIME_MS},${HTTP_CODE}" >> performance-${RELEASE_TAG}.csv
done
- cat performance-${RELEASE_TAG}.csv
# Compare with baseline
- |
if [ -f "performance-baseline.csv" ]; then
echo "Comparing with baseline..."
# Simple comparison (production should use proper analysis)
CURRENT_AVG=$(awk -F',' 'NR>1 {sum+=$2; count++} END {print sum/count}' performance-${RELEASE_TAG}.csv)
BASELINE_AVG=$(awk -F',' 'NR>1 {sum+=$2; count++} END {print sum/count}' performance-baseline.csv)
DEGRADATION=$(echo "scale=2; ($CURRENT_AVG - $BASELINE_AVG) / $BASELINE_AVG * 100" | bc)
echo "Current average: ${CURRENT_AVG}ms"
echo "Baseline average: ${BASELINE_AVG}ms"
echo "Degradation: ${DEGRADATION}%"
# Alert if degradation > 20%
if (( $(echo "$DEGRADATION > 20" | bc -l) )); then
echo "⚠️ Performance degradation detected: ${DEGRADATION}%"
echo "Consider rollback or investigation"
fi
else
echo "No baseline found, creating..."
cp performance-${RELEASE_TAG}.csv performance-baseline.csv
fi
artifacts:
paths:
- performance-*.csv
expire_in: 30 days
```
### 9.6 Alerting Configuration
**Alertmanager Rules:**
```yaml
# alertmanager.yml
route:
group_by: ['alertname', 'environment']
group_wait: 10s
group_interval: 10s
repeat_interval: 12h
receiver: 'slack-notifications'
routes:
- match:
severity: critical
receiver: 'pagerduty-critical'
continue: true
- match:
severity: warning
environment: production
receiver: 'slack-production'
- match:
environment: sandbox
receiver: 'slack-sandbox'
receivers:
- name: 'slack-notifications'
slack_configs:
- api_url: '${SLACK_WEBHOOK_URL}'
channel: '#deployments'
title: '{{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
- name: 'pagerduty-critical'
pagerduty_configs:
- service_key: '${PAGERDUTY_SERVICE_KEY}'
description: '{{ .GroupLabels.alertname }}'
- name: 'slack-production'
slack_configs:
- api_url: '${SLACK_WEBHOOK_PRODUCTION}'
channel: '#production-alerts'
color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
```
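Routing here is first-match-wins down the `routes` list, falling back to the top-level `receiver`. The toy re-implementation below only illustrates that decision order (it ignores the `continue: true` fan-out on critical alerts); real configs should be checked with `amtool config routes test`:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Mirror of the receivers defined in the alertmanager.yml above
pick_receiver() {
  local severity=$1 environment=$2
  if [ "$severity" = "critical" ]; then
    echo "pagerduty-critical"
  elif [ "$severity" = "warning" ] && [ "$environment" = "production" ]; then
    echo "slack-production"
  elif [ "$environment" = "sandbox" ]; then
    echo "slack-sandbox"
  else
    echo "slack-notifications"   # top-level default receiver
  fi
}

pick_receiver critical production   # → pagerduty-critical
pick_receiver warning sandbox       # → slack-sandbox
pick_receiver info testing          # → slack-notifications
```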
**Alert Rules:**
```yaml
# prometheus-rules.yml
groups:
- name: deployment_alerts
interval: 30s
rules:
- alert: DeploymentFailed
expr: deployment_status{environment="production"} == 0
for: 2m
labels:
severity: critical
annotations:
description: "Deployment to {{ $labels.node }} failed"
- alert: HighErrorRate
expr: rate(http_requests_total{status=~"5..",environment="production"}[5m]) / rate(http_requests_total{environment="production"}[5m]) > 0.05
for: 5m
labels:
severity: warning
annotations:
description: "5xx error ratio above 5%: {{ $value }}"
- alert: ServiceDown
expr: up{job="coin-api",environment="production"} == 0
for: 1m
labels:
severity: critical
annotations:
description: "Service {{ $labels.instance }} is down"
- alert: HighMemoryUsage
expr: container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.9
for: 5m
labels:
severity: warning
annotations:
description: "Container {{ $labels.container }} memory usage > 90%"
```
---
## 10. Implementation Plan
### 10.1 Phased Rollout Strategy
**4-Phase Approach:**
```
Phase 1: Infrastructure Setup (Week 1-2)
├── GitLab Runner installation
├── Docker context configuration
├── SOPS setup
├── Monitoring stack deployment
└── Testing infrastructure
Phase 2: Development Environment (Week 3-4)
├── Migrate development to GitOps
├── Create pipeline templates
├── Test basic workflows
├── Train team
└── Collect feedback
Phase 3: Sandbox + Testing (Week 5-6)
├── Migrate sandbox environment
├── Implement approval workflows
├── Add advanced features (rollback, etc.)
├── Performance tuning
└── Documentation
Phase 4: Production Ready (Week 7-8)
├── Production configuration
├── Security hardening
├── Disaster recovery testing
├── Final training
└── Go-live
```
### 10.2 Week-by-Week Implementation Plan
**Week 1: Foundation**
| Day | Tasks | Deliverables |
|-----|-------|--------------|
| Mon | Kickoff meeting, Requirements review | Project charter |
| Tue | GitLab Runner installation, Docker context setup | Working runner |
| Wed | Create repository structure, Initial pipeline | Base .gitlab-ci.yml |
| Thu | SOPS installation, GPG key generation | Encrypted secrets |
| Fri | Monitoring stack deployment | Prometheus + Grafana |
**Week 2: Development Pipeline**
| Day | Tasks | Deliverables |
|-----|-------|--------------|
| Mon | Development environment configuration | config.yml |
| Tue | Prepare stage implementation | Extract + prepare scripts |
| Wed | Deploy stage implementation | Deployment automation |
| Thu | Verification stage implementation | Health checks + smoke tests |
| Fri | End-to-end testing | Working dev pipeline |
**Week 3: Sandbox Migration**
| Day | Tasks | Deliverables |
|-----|-------|--------------|
| Mon | Sandbox configuration creation | Sandbox config files |
| Tue | Secret migration to SOPS | Encrypted secrets |
| Wed | Pipeline adaptation | Sandbox-specific jobs |
| Thu | Testing + validation | Successful deployment |
| Fri | Parallel running (old + new) | Comparison data |
**Week 4: Advanced Features**
| Day | Tasks | Deliverables |
|-----|-------|--------------|
| Mon | Rollback implementation | Rollback pipeline |
| Tue | Automatic rollback triggers | Health-based rollback |
| Wed | Performance monitoring | Baseline tracking |
| Thu | Alert configuration | Alerting rules |
| Fri | Documentation update | User guides |
**Week 5: Testing Environment**
| Day | Tasks | Deliverables |
|-----|-------|--------------|
| Mon | Testing environment setup | Testing configs |
| Tue | Approval workflow implementation | Manual gates |
| Wed | Integration with QA processes | QA checklist |
| Thu | Environment promotion testing | Promotion pipeline |
| Fri | Load testing | Performance report |
**Week 6: Production Preparation**
| Day | Tasks | Deliverables |
|-----|-------|--------------|
| Mon | Production configuration | Prod configs |
| Tue | Security hardening | Security audit |
| Wed | Disaster recovery setup | DR procedures |
| Thu | Change Advisory Board integration | CAB workflow |
| Fri | Production dry-run | Test results |
**Week 7: Production Migration**
| Day | Tasks | Deliverables |
|-----|-------|--------------|
| Mon | Final security review | Sign-off |
| Tue | Production secrets migration | Encrypted prod secrets |
| Wed | Production pipeline testing | Test deployment |
| Thu | Go-live preparation | Runbooks |
| Fri | Production go-live | First prod deployment |
**Week 8: Stabilization**
| Day | Tasks | Deliverables |
|-----|-------|--------------|
| Mon | Monitor production deployments | Metrics report |
| Tue | Address any issues | Bug fixes |
| Wed | Team training sessions | Training materials |
| Thu | Documentation finalization | Complete docs |
| Fri | Project retrospective | Lessons learned |
### 10.3 Success Criteria
**Technical Metrics:**
| Metric | Target | Measurement |
|--------|--------|-------------|
| Deployment time | < 15 min | Pipeline duration |
| Success rate | > 95% | Successful/total deploys |
| Rollback time | < 5 min | Rollback duration |
| MTTR | < 30 min | Mean time to recovery |
| Pipeline reliability | > 99% | Runner uptime |
**Process Metrics:**
| Metric | Target | Measurement |
|--------|--------|-------------|
| Manual steps | < 2 per deploy | Process audit |
| Approval time | < 2 hours | Approval duration |
| Documentation coverage | 100% | Doc review |
| Team training | 100% | Training completion |
| Knowledge transfer | Complete | Quiz scores |
**Business Metrics:**
| Metric | Target | Measurement |
|--------|--------|-------------|
| Deployment frequency | 2x increase | Deploy count |
| Lead time | 50% reduction | Commit to production |
| Change failure rate | < 5% | Failed/total changes |
| Team satisfaction | > 80% | Survey results |
| Cost savings | Measurable | Time saved × hourly rate |
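The change-failure-rate and deployment-frequency targets are straightforward ratios. A minimal sketch with illustrative counts; in practice the numbers would be pulled from the GitLab deployments API (`GET /projects/:id/deployments`) and counted by status:

```shell
#!/usr/bin/env bash
set -euo pipefail

TOTAL_DEPLOYS=40      # deployments in the measurement window (illustrative)
FAILED_DEPLOYS=1
WINDOW_DAYS=30

CFR=$(awk -v f="$FAILED_DEPLOYS" -v t="$TOTAL_DEPLOYS" \
  'BEGIN { printf "%.1f", f * 100 / t }')
FREQ=$(awk -v t="$TOTAL_DEPLOYS" -v d="$WINDOW_DAYS" \
  'BEGIN { printf "%.2f", t / d }')

echo "Change failure rate: ${CFR}%"        # target: < 5%
echo "Deployment frequency: ${FREQ}/day"
```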
### 10.4 Risk Mitigation
**Identified Risks:**
| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| Pipeline failures during migration | High | Medium | Parallel running, quick rollback |
| Secret leakage | Low | Critical | SOPS encryption, access control |
| Learning curve | Medium | Medium | Training, documentation, support |
| Production incident | Low | Critical | Comprehensive testing, gradual rollout |
| Resistance to change | Medium | Medium | Change management, stakeholder buy-in |
**Contingency Plans:**
1. **Pipeline Failure:**
- Keep manual scripts as backup
- Document emergency procedures
- 24/7 support during migration
2. **Security Incident:**
- Immediate secret rotation
- Audit all access
- Incident response team activation
3. **Team Issues:**
- Extended training period
- Pair programming sessions
- Dedicated support channel
### 10.5 Training Plan
**Training Modules:**
**Module 1: GitOps Fundamentals (2 hours)**
- Infrastructure as Code concepts
- Git workflow and best practices
- CI/CD pipeline basics
- Hands-on: Create simple pipeline
**Module 2: COIN Pipeline Deep Dive (3 hours)**
- Pipeline architecture overview
- Stage-by-stage walkthrough
- Configuration management
- Hands-on: Trigger deployment
**Module 3: Secrets Management (2 hours)**
- SOPS usage
- Secret rotation procedures
- Security best practices
- Hands-on: Encrypt/decrypt secrets
**Module 4: Troubleshooting (2 hours)**
- Reading pipeline logs
- Common failure scenarios
- Debug techniques
- Hands-on: Fix failing pipeline
**Module 5: Rollback Procedures (2 hours)**
- When to rollback
- Rollback execution
- Verification steps
- Hands-on: Perform rollback
**Module 6: Monitoring & Alerts (2 hours)**
- Dashboard overview
- Alert interpretation
- Response procedures
- Hands-on: Respond to alert
### 10.6 Post-Implementation Support
**Support Structure:**
```
Tier 1: Self-Service
├── Documentation wiki
├── Troubleshooting guides
├── FAQ
└── Video tutorials
Tier 2: Team Support
├── Slack channel: #cicd-support
├── Office hours: Daily 10-11 AM
├── Email: devops-support@company.com
└── Response time: < 4 hours
Tier 3: Expert Support
├── On-call DevOps engineer
├── Escalation for critical issues
├── Response time: < 1 hour
└── 24/7 for production
```
**Continuous Improvement:**
- Weekly metrics review
- Monthly retrospectives
- Quarterly pipeline optimization
- Annual security audit
- Regular training updates
---
## Conclusion
### Final Assessment
A universal GitLab CI/CD pipeline for the COIN application is **fully achievable** and will deliver:
- **Automation**: 90% reduction in manual operations
- **Universality**: support for all 4 environments
- **Security**: SOPS encryption + audit trail
- **Reliability**: automatic rollback + health checks
- **Observability**: comprehensive monitoring
- **Speed**: 3x faster deployments
### Key Benefits
1. **A single process** for all environments
2. **Git as the source of truth** for all configuration
3. **Automated deployment** with manual gates where needed
4. **Built-in rollback** with verification
5. **Comprehensive monitoring** at every level
6. **Full traceability** of every change
### Next Steps
1. Review this document with the team
2. Approve the implementation plan
3. Allocate resources (8 weeks, 1-2 FTE)
4. Kickoff meeting
5. Start Phase 1 implementation
**This document is ready for implementation kick-off!** 🚀