docs: add detailed architecture document for GitOps solution

2026-01-12 13:00:31 +00:00
parent 7467b1d9fd
commit dd061c8f1f
1 changed files with 855 additions and 0 deletions
--- a/docs/gitops-cicd/02-architecture.md
+++ b/docs/gitops-cicd/02-architecture.md
@@ -0,0 +1,855 @@
+# FinTech GitOps CI/CD - Solution Architecture
+
+**Version:** 1.0  
+**Date:** January 2026  
+**Target audience:** Architects, DevOps, Infrastructure, Security
+
+---
+
+## Contents
+
+1. [General Architecture](#1-general-architecture)
+2. [Network Architecture](#2-network-architecture)
+3. [Zones and Their Purpose](#3-zones-and-their-purpose)
+4. [Data Flows](#4-data-flows)
+5. [High Availability and Scaling](#5-high-availability-and-scaling)
+6. [Disaster Recovery](#6-disaster-recovery)
+
+---
+
+## 1. General Architecture
+
+### 1.1 Design Principles
+
+**Defense in Depth:**
+Layered protection with isolation at every level:
+- Network segmentation via VLANs
+- Firewalls between all zones
+- Application-level authentication and authorization
+- Encryption at rest and in transit
+- Audit logging at every level
+
+**Least Privilege:**
+Minimum necessary rights for each component:
+- Service accounts with restricted permissions
+- Network access only to required endpoints
+- Time-bound credentials where possible
+- Regular rotation of secrets
+
+**Immutable Infrastructure:**
+Infrastructure as code, with changes made only through Git:
+- No manual changes on servers
+- All changes are version controlled
+- Reproducible deployments
+- Easy rollback via Git history
+
+**Observability:**
+Full visibility into all processes:
+- Centralized logging
+- Metrics from all components
+- Distributed tracing for requests
+- Audit trail for compliance
+
+### 1.2 Logical Layers
+
+**Presentation Layer (User Interface):**
+- Portainer UI for visual Swarm management
+- Grafana for dashboards and metrics
+- Jenkins Blue Ocean for CI/CD visualization
+- Ollama web interface for AI interaction
+- Gitea web UI for repository management
+
+**API Layer:**
+- Docker Swarm API for cluster management
+- Harbor API for registry operations
+- Gitea API for Git operations
+- Jenkins API for triggering builds
+- Prometheus API for metrics
+- MCP Server API for AI integration
+
+**Service Layer (Business Logic):**
+- GitOps Operator - automatic synchronization
+- Jenkins pipelines - CI/CD logic
+- Harbor webhooks - notifications about new images
+- AlertManager - rules for alerts
+- AI models - request processing
+
+**Data Layer:**
+- PostgreSQL - relational data
+- Git repositories - code and configuration
+- Harbor storage - Docker images
+- Prometheus TSDB - metric time series
+- Loki - logs
+- Vector DB - embeddings for AI
+
+**Infrastructure Layer:**
+- Docker Swarm - orchestration platform
+- Overlay networks - service communication
+- Shared storage - persistent data
+- Backup systems - disaster recovery
+
+---
+
+## 2. Network Architecture
+
+### 2.1 VLAN Segmentation
+
+**VLAN 10 - Management & CI/CD Zone:**
+- Subnet: 10.10.10.0/24
+- Gateway: 10.10.10.1
+- Components: Gitea, Jenkins, Harbor, GitOps Operator, Portainer
+- Access: Only via VPN with MFA
+- Isolation: Strict firewall at the boundary
+
+**VLAN 20 - Docker Swarm Cluster Zone:**
+- Subnet: 10.20.0.0/16 (sized for a large number of containers)
+- Manager subnet: 10.20.1.0/24
+- Worker subnet: 10.20.2.0/23
+- Gateway: 10.20.0.1
+- Components: Swarm managers, workers, overlay networks
+- Access: Only from the Management zone and Monitoring zone
+- Isolation: Encrypted overlay network inside the zone
+
+**VLAN 30 - AI & Analytics Zone:**
+- Subnet: 10.30.10.0/24
+- Gateway: 10.30.10.1
+- Components: Ollama, MCP Server, Vector Database
+- Access: Read-only to data sources
+- Isolation: Cannot initiate changes in other zones
+
+**VLAN 40 - Monitoring & Logging Zone:**
+- Subnet: 10.40.10.0/24
+- Gateway: 10.40.10.1
+- Components: Prometheus, Grafana, Loki, AlertManager
+- Access: Read-only metrics collection
+- Isolation: Cannot manage components
+
+**VLAN 50 - Data & Database Zone:**
+- Subnet: 10.50.0.0/16
+- Infrastructure DB subnet: 10.50.10.0/24
+- Application DB subnet: 10.50.20.0/23
+- Storage subnet: 10.50.30.0/24
+- Gateway: 10.50.0.1
+- Components: PostgreSQL, Application databases, Shared storage
+- Access: Strictly controlled, encrypted connections
+- Isolation: Strictest of all zones, every connection audited
+
+**VLAN 60 - Backup & DR Zone:**
+- Subnet: 10.60.10.0/24
+- Gateway: 10.60.10.1
+- Components: Backup server, long-term storage
+- Access: Write-only for backup agents, read for recovery
+- Isolation: Offline storage, air-gapped where possible
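The addressing plan above can be sanity-checked mechanically. The following is a minimal sketch, not part of the committed document, that uses Python's standard `ipaddress` module to confirm that the six zone subnets do not overlap and that each nested subnet lies inside its parent zone. The dictionary keys and the function names `find_overlaps` and `find_strays` are illustrative assumptions; only the CIDR values come from the list above.

```python
import ipaddress

# Zone subnets copied from the VLAN list above (keys are illustrative names).
ZONES = {
    "management": "10.10.10.0/24",
    "swarm": "10.20.0.0/16",
    "ai": "10.30.10.0/24",
    "monitoring": "10.40.10.0/24",
    "data": "10.50.0.0/16",
    "backup": "10.60.10.0/24",
}

# Nested subnets that are expected to sit inside their parent zone.
NESTED = {
    "swarm": ["10.20.1.0/24", "10.20.2.0/23"],
    "data": ["10.50.10.0/24", "10.50.20.0/23", "10.50.30.0/24"],
}


def find_overlaps(cidrs):
    """Return every pair of names whose networks overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in cidrs.items()}
    names = sorted(nets)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if nets[a].overlaps(nets[b])
    ]


def find_strays(zones, nested):
    """Return nested subnets that are not contained in their parent zone."""
    strays = []
    for zone, children in nested.items():
        parent = ipaddress.ip_network(zones[zone])
        for child in children:
            if not ipaddress.ip_network(child).subnet_of(parent):
                strays.append((zone, child))
    return strays


print("overlapping zones:", find_overlaps(ZONES))
print("misplaced subnets:", find_strays(ZONES, NESTED))
```

Both lists come back empty for the plan as written. Running a check like this in CI whenever the plan changes is one possible way to catch an overlapping or misaligned CIDR before it reaches the firewall configuration.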
+
+### 2.2 Firewall Rules
+
+**Principle:** Deny all, allow only what is explicitly needed
+
+**Management VLAN → Swarm VLAN:**
+```
+Source: 10.10.10.40 (GitOps Operator)
+Destination: 10.20.1.0/24 (Swarm Managers)
+Ports: 2377/tcp (cluster management)
+Action: ALLOW
+Logging: YES
+
+Source: 10.10.10.50 (Portainer)
+Destination: 10.20.1.0/24 (Swarm Managers)
+Ports: 2376/tcp (Docker API over TLS)
+Action: ALLOW
+Logging: YES
+
+All other traffic: DENY
+```
+
+**Swarm VLAN → Harbor (Management VLAN):**
+```
+Source: 10.20.0.0/16 (All Swarm nodes)
+Destination: 10.10.10.30 (Harbor)
+Ports: 443/tcp, 5000/tcp (HTTPS, Docker registry)
+Protocol: TLS 1.3 with mutual authentication
+Action: ALLOW
+Logging: YES
+
+All other traffic: DENY
+```
+
+**AI VLAN → Data Sources:**
+```
+Source: 10.30.10.20 (MCP Server)
+Destination: Multiple (via MCP connectors)
+Ports: Varies (SSH 22, HTTPS 443, PostgreSQL 5432, etc.)
+Access: READ-ONLY
+Authentication: Service account per destination
+Logging: ALL QUERIES LOGGED
+Action: ALLOW with rate limiting
+
+Write operations: DENY
+```
+
+**Monitoring VLAN → All Zones:**
+```
+Source: 10.40.10.10 (Prometheus)
+Destination: ALL VLANs
+Ports: Metrics endpoints (usually 9090-9999)
+Access: READ-ONLY metrics scraping
+Action: ALLOW
+Logging: NO (too verbose, metrics only)
+
+Any non-metrics ports: DENY
+```
+
+**Data VLAN → Backup VLAN:**
+```
+Source: 10.50.0.0/16 (All databases)
+Destination: 10.60.10.10 (Backup server)
+Ports: Backup protocol specific
+Direction: ONE-WAY (source → backup only)
+Action: ALLOW
+Logging: YES
+Encryption: MANDATORY
+
+Reverse direction: DENY (except for restore procedures)
+```
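The "deny all, allow explicitly" principle behind these rule blocks can be illustrated with a small matcher. This is only a sketch of the evaluation logic, not firewall configuration: the `Rule` class and `is_allowed` function are hypothetical names, and only two of the ALLOW rules listed above are encoded.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    source: str        # CIDR of the permitted source
    destination: str   # CIDR of the permitted destination
    ports: frozenset   # permitted TCP ports


# Two of the ALLOW rules listed above; anything not matched is denied.
RULES = [
    # GitOps Operator -> Swarm managers
    Rule("10.10.10.40/32", "10.20.1.0/24", frozenset({2377})),
    # All Swarm nodes -> Harbor
    Rule("10.20.0.0/16", "10.10.10.30/32", frozenset({443, 5000})),
]


def is_allowed(src, dst, port, rules=RULES):
    """Return True only if an explicit rule matches; the default is DENY."""
    src_ip = ipaddress.ip_address(src)
    dst_ip = ipaddress.ip_address(dst)
    for rule in rules:
        if (
            src_ip in ipaddress.ip_network(rule.source)
            and dst_ip in ipaddress.ip_network(rule.destination)
            and port in rule.ports
        ):
            return True
    return False


print(is_allowed("10.10.10.40", "10.20.1.2", 2377))  # matches the first rule
print(is_allowed("10.30.10.20", "10.20.1.2", 2377))  # no rule for the AI zone
```

Because the function falls through to `False`, adding a new zone grants it nothing until a rule is written for it, which mirrors the intent of the trailing `DENY` lines in each block.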
+
+### 2.3 External Connectivity
+
+**VPN Gateway:**
+- Public IP for VPN connections
+- Multi-factor authentication is mandatory
+- Certificate-based authentication + one-time password
+- Split tunneling is prohibited (all traffic goes through the VPN)
+- Session timeout: 8 hours
+- Idle timeout: 30 minutes
+- Disconnect after 3 failed MFA attempts
+
+**Jump Host/Bastion:**
+- Single point of entry after the VPN
+- Session recording for auditing
+- No direct access to production systems, only through the jump host
+- Centralized management of authorized keys
+- Automatic logout after 15 minutes idle
+- Audit log of all commands
+
+**Authorized users:**
+- Developers: Access to Gitea, Jenkins, Portainer (read-only for production)
+- DevOps: Full access to all management systems
+- Security team: Read-only audit access to everything
+- Managers: Grafana and reporting dashboards only
+
+---
+
+## 3. Zones and Their Purpose
+
+### 3.1 Management & CI/CD Zone
+
+**Purpose:**
+Centralized management of code, CI/CD processes, and the container registry.
+
+**Criticality:** HIGH - downtime blocks deployment of new versions
+
+**Components:**
+
+**Gitea (10.10.10.10):**
+- Role: Single source of truth for all code and configuration
+- Interactions: Accepts pushes from developers, sends webhooks to Jenkins
+- Dependencies: PostgreSQL (VLAN 50), shared storage for Git LFS
+- SLA: 99.9% uptime
+
+**Jenkins (10.10.10.20):**
+- Role: CI automation, building and testing applications
+- Interactions: Receives webhooks from Gitea, pushes images to Harbor, updates Git
+- Dependencies: Gitea, Harbor, Docker build agents
+- SLA: 99.5% uptime (can run in degraded mode)
+
+**Harbor (10.10.10.30):**
+- Role: Enterprise container registry with security scanning
+- Interactions: Accepts pushes from Jenkins and pulls from Swarm nodes
+- Dependencies: PostgreSQL, object storage for images
+- SLA: 99.9% uptime (critical for image pulls)
+
+**GitOps Operator (10.10.10.40):**
+- Role: Automatic Git → Swarm synchronization
+- Interactions: Monitors Gitea, applies changes to Swarm through the API
+- Dependencies: Gitea, Docker Swarm API
+- SLA: 99.9% uptime
+
+**Portainer (10.10.10.50):**
+- Role: Web UI for managing and monitoring Swarm
+- Interactions: Connects to Swarm managers through the Docker API
+- Dependencies: Docker Swarm API, PostgreSQL for its own database
+- SLA: 99% uptime (not critical, a CLI alternative exists)
+
+**Redundancy:**
+- Gitea: Master-slave replication, automated failover
+- Jenkins: Standby instance in warm mode
+- Harbor: Geo-replication to a secondary site
+- GitOps Operator: Active-passive pair
+- Portainer: Standby instance
+
+### 3.2 Docker Swarm Cluster Zone
+
+**Purpose:**
+Running production workloads with high availability and load balancing.
+
+**Criticality:** CRITICAL - direct impact on business services
+
+**Swarm Manager Nodes (10.20.1.1-3):**
+- Count: 3 for quorum (an odd number is recommended)
+- Role: Cluster orchestration, scheduling, API endpoint
+- Raft consensus: At least 2 of the 3 managers must be alive for the cluster to operate
+- Workload: Do NOT run application containers (infrastructure only)
+- CPU: 4 vCPU each
+- RAM: 8 GB each
+- Disk: 200 GB SSD each
+- Network: 10 Gbps for Raft communication
+
+**Swarm Worker Nodes (10.20.2.1-N):**
+- Count: Depends on workload, minimum 3 for redundancy
+- Role: Running application containers
+- Constraints: Nodes can be labeled for specific workloads
+- CPU: 8-16 vCPU each
+- RAM: 32-64 GB each
+- Disk: 500 GB SSD each
+- Network: 10 Gbps for overlay network performance
+
+**Overlay Networks:**
+- IPsec encryption of overlay traffic (enabled per network with the `encrypted` option)
+- Service discovery via DNS
+- Load balancing via the routing mesh
+- Isolation between different stacks
+
+**Secrets Management:**
+- Docker Swarm secrets encrypted at rest
+- Rotation via stack update
+- Mounted as files in containers
+- Audit log of access to secrets
+
+**Redundancy:**
+- Manager nodes: (N-1)/2 failure tolerance (3 nodes = 1 failure OK)
+- Worker nodes: Application replicas spread across different nodes
+- Persistent data: Replicated storage (GlusterFS or NFS with HA)
+- Network: Bonded interfaces for redundancy
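The manager quorum arithmetic referenced in this section follows from Raft's majority rule. A minimal sketch, with the function names `quorum` and `failure_tolerance` chosen here purely for illustration:

```python
def quorum(managers: int) -> int:
    """Smallest number of managers that forms a Raft majority."""
    if managers < 1:
        raise ValueError("a swarm needs at least one manager")
    return managers // 2 + 1


def failure_tolerance(managers: int) -> int:
    """How many managers can be lost while a majority remains."""
    return managers - quorum(managers)


for n in (1, 3, 4, 5, 7):
    print(f"{n} managers: quorum {quorum(n)}, tolerates {failure_tolerance(n)} failure(s)")
```

With 3 managers the quorum is 2 and one failure is tolerated. A fourth manager raises the quorum to 3 without raising the tolerance, which is why an odd manager count is recommended.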
+
+### 3.3 AI & Analytics Zone
+
+**Purpose:**
+Providing AI-powered assistance through analysis of internal data sources.
+
+**Criticality:** MEDIUM - a convenience, not critical for operations
+
+**Ollama Server (10.30.10.10):**
+- Role: Running AI models locally on in-house hardware
+- Models: Llama 3.3 70B, Qwen 2.5 Coder, DeepSeek, and others
+- Interactions: Receives requests from users and context from the MCP Server
+- Requirements: GPU highly recommended for performance
+- CPU: 16 vCPU (or fewer if a GPU is present)
+- RAM: 64 GB (models need a lot of memory)
+- GPU: NVIDIA A100 40GB or 2x RTX 4090 24GB (optional but recommended)
+- Disk: 2 TB NVMe SSD (models are 10-100 GB each)
+- Network: 10 Gbps for fast responses
+
+**MCP Server (10.30.10.20):**
+- Role: Integrating AI with data sources (Gitea, Swarm, DBs, logs)
+- Connectors: Modular plugins for each source
+- Interactions: Read-only queries to data sources, passing context to Ollama
+- Security: Service account per connector, all queries audited
+- CPU: 8 vCPU
+- RAM: 16 GB
+- Disk: 100 GB SSD
+- Network: 1 Gbps
+
+**Vector Database (10.30.10.30):**
+- Role: Storing documentation embeddings for semantic search
+- Technology: Qdrant or Milvus
+- Size: Depends on the amount of documentation
+- CPU: 4 vCPU
+- RAM: 16 GB (depends on index size)
+- Disk: 500 GB SSD
+- Network: 1 Gbps
+
+**Data Flow:**
+User → Ollama → MCP Server → (in parallel):
+- Gitea MCP Connector → Gitea (documentation, code)
+- Swarm MCP Connector → Docker API (status, logs)
+- Database MCP Connector → PostgreSQL (metadata)
+- Prometheus MCP Connector → Metrics
+- Loki MCP Connector → Logs
+→ Aggregated context → Ollama → Response to the user
+
+**Redundancy:**
+- Ollama: Standby instance (warm standby)
+- MCP Server: Active-passive pair
+- Vector DB: Replicated for HA
+
+### 3.4 Monitoring & Logging Zone
+
+**Purpose:**
+Infrastructure observability for proactive monitoring and troubleshooting.
+
+**Criticality:** HIGH - required for detecting problems
+
+**Prometheus (10.40.10.10):**
+- Role: Collecting and storing time-series metrics
+- Scrape targets: All infrastructure components
+- Retention: 30 days in Prometheus, long-term in Thanos/VictoriaMetrics
+- CPU: 8 vCPU
+- RAM: 32 GB
+- Disk: 2 TB HDD (time-series data)
+- Network: 1 Gbps
+
+**Grafana (10.40.10.20):**
+- Role: Visualizing metrics and logs
+- Dashboards: Preconfigured for each component
+- Alerting: Visual alert editor
+- CPU: 4 vCPU
+- RAM: 8 GB
+- Disk: 100 GB SSD
+- Network: 1 Gbps
+
+**Loki (10.40.10.30):**
+- Role: Centralized log storage
+- Agents: Promtail on every node
+- Retention: 90 days
+- CPU: 8 vCPU
+- RAM: 16 GB
+- Disk: 5 TB HDD (logs)
+- Network: 1 Gbps
+
+**AlertManager (10.40.10.40):**
+- Role: Processing and routing alerts
+- Integrations: Slack, Email, PagerDuty, Telegram
+- Deduplication: Grouping of similar alerts
+- CPU: 2 vCPU
+- RAM: 4 GB
+- Disk: 50 GB SSD
+- Network: 1 Gbps
+
+**Redundancy:**
+- Prometheus: Federated setup, multiple instances
+- Grafana: Load balanced instances
+- Loki: Distributed deployment
+- AlertManager: Clustered for HA
+
+### 3.5 Data & Database Zone
+
+**Purpose:**
+Storing persistent data for infrastructure and applications.
+
+**Criticality:** CRITICAL - data loss is unacceptable
+
+**Infrastructure PostgreSQL Cluster (10.50.10.10-11):**
+- Role: Databases for Gitea, Harbor, Portainer
+- Topology: Master-slave with automatic failover
+- Backup: Continuous WAL archiving + daily full backup
+- Encryption: At rest (LUKS) and in transit (TLS)
+- CPU: 8 vCPU per instance
+- RAM: 16 GB per instance
+- Disk: 500 GB SSD per instance
+- Network: 10 Gbps
+
+**Application Databases (10.50.20.x):**
+- Role: Databases for business applications
+- Technologies: Depends on the applications (PostgreSQL, MySQL, MongoDB)
+- Isolation: Each application in its own database/schema
+- Backup: Application-specific strategy
+- Resources: Depends on workload
+
+**Shared Storage (10.50.30.1-3):**
+- Role: Persistent volumes for Swarm services
+- Technology: GlusterFS (replicated) or NFS with HA
+- Replication: 3x for fault tolerance
+- Snapshots: Hourly, 7-day retention
+- Capacity: 10 TB (grows as needed)
+- Network: 10 Gbps for I/O performance
+
+**Redundancy:**
+- PostgreSQL: Synchronous replication, automatic failover
+- Shared Storage: Distributed replication (GlusterFS 3-way)
+- Backups: Multiple copies in different locations
+
+### 3.6 Backup & DR Zone
+
+**Purpose:**
+Protection against data loss and fast recovery from disasters.
+
+**Criticality:** CRITICAL for long-term business resilience
+
+**Backup Server (10.60.10.10):**
+- Role: Receiving and storing backups
+- Technology: Bacula or Bareos (enterprise backup solution)
+- Scheduling: Automated on a schedule + on-demand
+- Encryption: All backups encrypted at rest
+- CPU: 4 vCPU
+- RAM: 8 GB
+- Disk: 20 TB HDD (RAID 10)
+- Network: 10 Gbps for fast backups
+
+**Backup Strategy:**
+
+**Hourly Incremental:**
+- Git repositories (changes only)
+- Retention: 48 hours
+
+**Daily Full:**
+- Databases (full dump)
+- Docker Swarm configs
+- Important logs
+- Retention: 30 days
+
+**Weekly Full:**
+- Full snapshot of the entire infrastructure
+- VM images, configs, data
+- Retention: 12 weeks
+
+**Monthly Archives:**
+- Long-term compliance storage
+- Retention: 7 years (regulatory requirement)
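The retention windows above determine how many backup copies exist at steady state, which feeds into sizing the backup server. A rough sketch of that arithmetic follows; the tier keys and function names are illustrative, and treating a month as 30 days and a year as 365 days is a simplifying assumption.

```python
from datetime import timedelta

# (interval between backups, retention window) for each tier listed above.
# A month is approximated as 30 days and a year as 365 days.
TIERS = {
    "hourly_incremental": (timedelta(hours=1), timedelta(hours=48)),
    "daily_full": (timedelta(days=1), timedelta(days=30)),
    "weekly_full": (timedelta(weeks=1), timedelta(weeks=12)),
    "monthly_archive": (timedelta(days=30), timedelta(days=7 * 365)),
}


def retained_copies(interval: timedelta, retention: timedelta) -> int:
    """Number of backups kept at steady state for one tier."""
    return retention // interval


def is_expired(age: timedelta, tier: str) -> bool:
    """True if a backup of the given age is past its tier's retention window."""
    _, retention = TIERS[tier]
    return age > retention


for name, (interval, retention) in TIERS.items():
    print(f"{name}: {retained_copies(interval, retention)} copies retained")
```

Multiplying each count by the typical size of one backup in that tier gives a first estimate of the capacity the backup server has to hold.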
+
+**DR Site (optional, in a different data center):**
+- Role: Geographic redundancy
+- Replication: Asynchronous from the primary site
+- RTO (Recovery Time Objective): 4 hours
+- RPO (Recovery Point Objective): 15 minutes
+- Testing: Quarterly DR drills
+
+---
+
+## 4. Data Flows
+
+### 4.1 Development Workflow
+
+**Developer commits code:**
+```
+Developer Workstation
+↓ (SSH over VPN)
+Gitea (VLAN 10)
+↓ (Webhook HTTPS + signature verification)
+Jenkins (VLAN 10)
+↓ (git clone through SSH)
+Gitea
+```
+
+**CI Pipeline execution:**
+```
+Jenkins
+↓ (build application)
+Build Agent (ephemeral container/VM)
+↓ (run tests)
+Test results → Archived in Jenkins
+↓ (build Docker image)
+Docker build agent
+↓ (security scan with Trivy)
+Vulnerability report
+↓ (docker push over TLS + creds)
+Harbor (VLAN 10)
+```
+
+**Update GitOps repo:**
+```
+Jenkins
+↓ (update image tag in the compose file)
+Gitea GitOps repository
+↓ (commit + push)
+Gitea
+```
+
+### 4.2 CD Workflow
+
+**GitOps sync:**
+```
+GitOps Operator (VLAN 10)
+↓ (poll Git repository every 30 sec)
+Gitea
+↓ (detect changes)
+GitOps Operator
+↓ (docker stack deploy via the Swarm API)
+Swarm Managers (VLAN 20)
+```
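The poll, detect, deploy cycle in this flow can be sketched as a small reconciliation loop. This is not the operator's actual implementation: `sync_once` and `run` are hypothetical names, and the two callables are injected so the sketch stays independent of how the head revision is read or how the stack is deployed. In a real setup they might wrap `git ls-remote` and `docker stack deploy`, which is an assumption rather than something the document specifies.

```python
import time
from typing import Callable, Optional


def sync_once(
    get_head: Callable[[], str],
    deploy: Callable[[str], None],
    last_applied: Optional[str],
) -> str:
    """Deploy if the repository head differs from the last applied revision.

    Returns the revision that is applied after this call.
    """
    head = get_head()
    if head != last_applied:
        deploy(head)
    return head


def run(get_head, deploy, interval_seconds: float = 30.0,
        iterations: Optional[int] = None) -> Optional[str]:
    """Poll forever, or for `iterations` rounds, at the 30 s interval above."""
    last_applied = None
    done = 0
    while iterations is None or done < iterations:
        last_applied = sync_once(get_head, deploy, last_applied)
        done += 1
        if iterations is None or done < iterations:
            time.sleep(interval_seconds)
    return last_applied


# Usage with stand-in callables: three polls, two distinct revisions.
revisions = iter(["a1", "a1", "b2"])
applied = []
run(lambda: next(revisions), applied.append, interval_seconds=0, iterations=3)
print(applied)
```

Because `last_applied` is only updated after `deploy` returns, a deployment that raises leaves the old revision recorded, so a supervisor that restarts the loop would retry the same change.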
+
+**Swarm orchestration:**
+```
+Swarm Manager
+↓ (schedule tasks on workers)
+Swarm Scheduler
+↓ (pull image from Harbor)
+Worker Nodes ↔ Harbor (VLAN 10)
+↓ (start containers)
+Application Running
+```
+
+**Service update (rolling):**
+```
+Swarm Manager
+↓ (stop 1 task of N)
+Worker Node A
+↓ (start new task with the new image)
+Worker Node B
+↓ (verify health check)
+Health Check (5 consecutive passes required)
+↓ (proceed to next task)
+Repeat until all tasks updated
+```
+
+### 4.3 AI Interaction Flow
+
+**User query:**
+```
+User (via Web UI or API)
+↓ (HTTPS request)
+Ollama Server (VLAN 30)
+↓ (request context via the MCP protocol)
+MCP Server (VLAN 30)
+```
+
+**MCP gathers context (parallel):**
+```
+MCP Server
+├→ Gitea MCP Connector → Gitea API (docs, code)
+├→ Swarm MCP Connector → Docker API (logs, metrics)
+├→ Database MCP Connector → PostgreSQL (metadata)
+├→ Prometheus MCP Connector → Prometheus API (metrics)
+└→ Loki MCP Connector → Loki API (logs)
+↓ (all responses aggregated)
+MCP Server
+↓ (full context sent to AI)
+Ollama Server
+↓ (generate response)
+User
+```
+
+**AI response with action:**
+```
+AI determines action needed
+↓ (if requires change)
+AI suggests change to user
+↓ (user approves)
+Change committed to Git
+↓ (normal GitOps flow)
+Applied to infrastructure
+```
+
+### 4.4 Monitoring Data Flow
+
+**Metrics collection:**
+```
+All Infrastructure Components
+↓ (expose metrics endpoints)
+Prometheus Exporters
+↓ (scrape every 15 seconds)
+Prometheus (VLAN 40)
+↓ (evaluate alert rules)
+AlertManager (VLAN 40)
+↓ (route notifications)
+Slack/Email/PagerDuty
+```
+
+**Logs collection:**
+```
+All Containers
+↓ (stdout/stderr)
+Docker logging driver
+↓ (forward)
+Promtail Agent (on every node)
+↓ (push)
+Loki (VLAN 40)
+↓ (index and store)
+Loki Storage
+↓ (query)
+Grafana or CLI
+```
+
+**Audit logs:**
+```
+All Infrastructure Actions
+├→ Gitea (Git operations)
+├→ Docker Swarm (API calls)
+├→ Harbor (image push/pull)
+├→ Jenkins (builds)
+└→ SSH sessions (bastion)
+↓ (forward)
+Centralized Syslog
+↓ (store)
+Long-term Audit Storage (7 years)
+```
+
+---
+
+## 5. High Availability and Scaling
+
+### 5.1 HA Strategy
+
+**Tier 1 - Critical (99.99% uptime):**
+- Docker Swarm (application platform)
+- Harbor (cannot deploy without it)
+- Shared Storage (persistent data)
+- Strategy: Active-Active where possible, N+1 redundancy
+
+**Tier 2 - Important (99.9% uptime):**
+- Gitea (code access)
+- GitOps Operator (CD automation)
+- Databases (infrastructure metadata)
+- Strategy: Active-Passive with automatic failover
+
+**Tier 3 - Nice to have (99% uptime):**
+- Jenkins (can wait for restore)
+- Portainer (CLI alternative exists)
+- Monitoring (short downtime acceptable)
+- Strategy: Warm standby, manual failover
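To make the tier targets concrete, each uptime percentage can be converted into a yearly downtime budget. A minimal sketch, assuming a 365-day year; the function name `downtime_budget_minutes` is illustrative:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_budget_minutes(uptime_percent: float) -> float:
    """Minutes of downtime per year permitted by an uptime target."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)


for tier, target in (("Tier 1", 99.99), ("Tier 2", 99.9), ("Tier 3", 99.0)):
    minutes = downtime_budget_minutes(target)
    print(f"{tier}: {target}% -> {minutes:.1f} min/year ({minutes / 60:.2f} h)")
```

This works out to roughly 53 minutes per year for Tier 1, under 9 hours for Tier 2, and about 88 hours for Tier 3, which helps explain why manual failover is considered acceptable only for the lowest tier.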
+
+### 5.2 Scaling Points
+
+**Vertical Scaling (adding resources):**
+- Databases: More RAM for cache
+- Ollama: Adding GPUs for speed
+- Harbor storage: More disk for images
+- Limit: Hardware limitations
+
+**Horizontal Scaling (adding instances):**
+- Swarm Workers: Add nodes for capacity
+- Jenkins Agents: Dynamic scaling on demand
+- Prometheus: Federation for distributed scraping
+- MCP Connectors: Independent instances per source
+
+**Data Scaling:**
+- PostgreSQL: Read replicas for read-heavy workloads
+- Harbor: Geo-replication for distributed teams
+- Loki: Sharding by time
+- Git: Repository sharding (rarely needed)
+
+### 5.3 Capacity Planning
+
+**Metrics to track:**
+- CPU utilization (target <70% average)
+- Memory utilization (target <80%)
+- Disk usage (alert at 80%, critical at 90%)
+- Network bandwidth (baseline + trend analysis)
+- IOPS (SSD wear, performance degradation)
+
+**Growth projections:**
+- Applications: 20% growth per year
+- Code repositories: 30% growth per year (cumulative)
+- Logs: 50% growth per year (more verbose logging)
+- Metrics retention: Linear in the number of services
+
+**Scaling triggers:**
+- Add Swarm worker when CPU >80% sustained
+- Upgrade database when query latency >100ms p95
+- Expand storage when >75% used
+- Add Jenkins agents when queue >5 builds
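The scaling triggers above amount to four threshold checks. A minimal sketch of how they could be evaluated against a metrics snapshot follows; the `Snapshot` fields and the `scaling_actions` function are hypothetical names, and how "sustained" CPU is measured is left to the monitoring system.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    cpu_percent_sustained: float   # sustained Swarm worker CPU utilization
    db_latency_p95_ms: float       # database query latency, 95th percentile
    storage_used_percent: float    # share of storage capacity in use
    jenkins_queue_length: int      # builds waiting in the Jenkins queue


def scaling_actions(s: Snapshot) -> list:
    """Return the actions whose trigger thresholds are exceeded."""
    actions = []
    if s.cpu_percent_sustained > 80:
        actions.append("add Swarm worker")
    if s.db_latency_p95_ms > 100:
        actions.append("upgrade database")
    if s.storage_used_percent > 75:
        actions.append("expand storage")
    if s.jenkins_queue_length > 5:
        actions.append("add Jenkins agents")
    return actions


print(scaling_actions(Snapshot(85, 40, 60, 2)))
```

All comparisons are strict, so a value sitting exactly on a threshold does not trigger an action.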
+
+---
+
+## 6. Disaster Recovery
+
+### 6.1 RTO and RPO Targets
+
+**Recovery Time Objective (RTO):**
+- Tier 1 services: 1 hour
+- Tier 2 services: 4 hours
+- Tier 3 services: 24 hours
+- Full infrastructure: 8 hours
+
+**Recovery Point Objective (RPO):**
+- Databases: 15 minutes (via WAL shipping)
+- Git repositories: 1 hour (hourly backup)
+- Docker images: 0 (replicated to DR)
+- Configs: 0 (in Git)
+- Logs: 1 hour (buffered before ingestion)
+
+### 6.2 DR Scenarios
+
+**Scenario 1: Single server failure**
+- Detection: Automated monitoring
+- Response: Automatic failover to redundant instance
+- Recovery time: <5 minutes
+- Data loss: None (active-active or sync replication)
+
+**Scenario 2: Network partition**
+- Detection: Raft consensus loss, monitoring alerts
+- Response: Manual investigation, possible split-brain resolution
+- Recovery time: 30 minutes
+- Data loss: Possible if write to minority partition
+
+**Scenario 3: Data center failure**
+- Detection: Total loss of connectivity
+- Response: Failover to DR site
+- Recovery time: 4 hours (RTO)
+- Data loss: Up to 15 minutes (RPO)
+
+**Scenario 4: Ransomware/Corruption**
+- Detection: File integrity monitoring, unusual encryption activity
+- Response: Isolate affected systems, restore from clean backup
+- Recovery time: 8 hours (full rebuild)
+- Data loss: Up to last clean backup (potentially hours)
+
+**Scenario 5: Human error (accidental delete)**
+- Detection: Git history, audit logs
+- Response: Restore from backup or Git revert
+- Recovery time: 1-2 hours
+- Data loss: None (everything in version control)
+
+### 6.3 Recovery Procedures
+
+**Database Recovery:**
+- Stop application access
+- Restore base backup
+- Apply WAL logs up to the target point in time
+- Verify data integrity
+- Resume application access
+
+**Git Repository Recovery:**
+- Clone from DR site or restore from backup
+- Verify commit history integrity
+- Restore hooks and configurations
+- Test push/pull operations
+- Notify team of recovery
+
+**Docker Swarm Recovery:**
+- Deploy manager nodes from backup configs
+- Join worker nodes
+- Restore network and volume configs
+- Deploy stacks from Git
+- Verify service health
+
+**Full Site Recovery:**
+- Deploy infrastructure from Terraform/IaC
+- Restore databases from backup
+- Clone Git repositories
+- Deploy Docker Swarm
+- Apply all stacks from GitOps
+- Verify end-to-end functionality
+- Switch DNS to DR site
+- Notify stakeholders
+
+### 6.4 Testing DR
+
+**Monthly:**
+- Restore test on separate infrastructure
+- Verify backup integrity
+- Test recovery procedures
+
+**Quarterly:**
+- Full DR drill with failover to the DR site
+- Measure actual RTO/RPO
+- Update procedures based on findings
+
+**Annually:**
+- Tabletop exercise with all stakeholders
+- Test business continuity plans
+- Update procedures and train on changes
+
+---
+
+**Next documents:**
+- **03-security-compliance.md** - Detailed security requirements
+- **04-component-specifications.md** - Technical component specifications
+- **05-development-environment.md** - Dev environment for testing
+
+---
+
+**Approval:**
+- Enterprise Architect: _______________
+- Security Architect: _______________
+- Infrastructure Lead: _______________
+- Date: _______________