docs: add detailed architecture document for GitOps solution

2026-01-12 13:00:31 +00:00
parent 7467b1d9fd
commit dd061c8f1f
1 changed files with 855 additions and 0 deletions
--- a/docs/gitops-cicd/02-architecture.md
+++ b/docs/gitops-cicd/02-architecture.md
@@ -0,0 +1,855 @@
 # FinTech GitOps CI/CD - Solution Architecture
 **Version:** 1.0  
 **Date:** January 2026  
 **Target audience:** Architects, DevOps, Infrastructure, Security
 ---
 ## Contents
 1. [Overall Architecture](#1-overall-architecture)
 2. [Network Architecture](#2-network-architecture)
 3. [Zones and Their Purpose](#3-zones-and-their-purpose)
 4. [Data Flows](#4-data-flows)
 5. [High Availability and Scaling](#5-high-availability-and-scaling)
 6. [Disaster Recovery](#6-disaster-recovery)
 ---
 ## 1. Overall Architecture
 ### 1.1 Design Principles
 **Defense in Depth:**
 Multi-layered protection with isolation at every level:
 - Network segmentation via VLANs
 - Firewalls between all zones
 - Application-level authentication and authorization
 - Encryption at rest and in transit
 - Audit logging at every level
 **Least Privilege:**
 The minimum permissions each component needs:
 - Service accounts with limited permissions
 - Network access only to required endpoints
 - Time-bound credentials where possible
 - Regular rotation of secrets
 **Immutable Infrastructure:**
 Infrastructure as code, with changes made only through Git:
 - No manual changes on servers
 - All changes are version controlled
 - Reproducible deployments
 - Easy rollback via Git history
 **Observability:**
 Full visibility into all processes:
 - Centralized logging
 - Metrics from all components
 - Distributed tracing for requests
 - Audit trail for compliance
 ### 1.2 Logical Layers
 **Presentation Layer (User Interface):**
 - Portainer UI for visual Swarm management
 - Grafana for dashboards and metrics
 - Jenkins Blue Ocean for CI/CD visualization
 - Ollama web interface for AI interaction
 - Gitea web UI for repository management
 **API Layer:**
 - Docker Swarm API for cluster management
 - Harbor API for registry operations
 - Gitea API for Git operations
 - Jenkins API for triggering builds
 - Prometheus API for metrics
 - MCP Server API for AI integration
 **Service Layer (Business Logic):**
 - GitOps Operator - automatic synchronization
 - Jenkins pipelines - CI/CD logic
 - Harbor webhooks - notifications about new images
 - AlertManager - alerting rules
 - AI models - request processing
 **Data Layer:**
 - PostgreSQL - relational data
 - Git repositories - code and configuration
 - Harbor storage - Docker images
 - Prometheus TSDB - metric time series
 - Loki - logs
 - Vector DB - embeddings for AI
 **Infrastructure Layer:**
 - Docker Swarm - orchestration platform
 - Overlay networks - service communication
 - Shared storage - persistent data
 - Backup systems - disaster recovery
 ---
 ## 2. Network Architecture
 ### 2.1 VLAN Segmentation
 **VLAN 10 - Management & CI/CD Zone:**
 - Subnet: 10.10.10.0/24
 - Gateway: 10.10.10.1
 - Components: Gitea, Jenkins, Harbor, GitOps Operator, Portainer
 - Access: Only via VPN with MFA
 - Isolation: Strict firewall at the zone boundary
 **VLAN 20 - Docker Swarm Cluster Zone:**
 - Subnet: 10.20.0.0/16 (sized for a large number of containers)
 - Manager subnet: 10.20.1.0/24
 - Worker subnet: 10.20.2.0/23
 - Gateway: 10.20.0.1
 - Components: Swarm managers, workers, overlay networks
 - Access: Only from the Management zone and the Monitoring zone
 - Isolation: Encrypted overlay network inside
 **VLAN 30 - AI & Analytics Zone:**
 - Subnet: 10.30.10.0/24
 - Gateway: 10.30.10.1
 - Components: Ollama, MCP Server, Vector Database
 - Access: Read-only access to data sources
 - Isolation: Cannot initiate changes in other zones
 **VLAN 40 - Monitoring & Logging Zone:**
 - Subnet: 10.40.10.0/24
 - Gateway: 10.40.10.1
 - Components: Prometheus, Grafana, Loki, AlertManager
 - Access: Read-only metrics collection
 - Isolation: Cannot manage components
 **VLAN 50 - Data & Database Zone:**
 - Subnet: 10.50.0.0/16
 - Infrastructure DB subnet: 10.50.10.0/24
 - Application DB subnet: 10.50.20.0/23
 - Storage subnet: 10.50.30.0/24
 - Gateway: 10.50.0.1
 - Components: PostgreSQL, application databases, shared storage
 - Access: Strictly controlled, encrypted connections
 - Isolation: The strictest of all zones; every connection is audited
 **VLAN 60 - Backup & DR Zone:**
 - Subnet: 10.60.10.0/24
 - Gateway: 10.60.10.1
 - Components: Backup server, long-term storage
 - Access: Write-only for backup agents, read for recovery
 - Isolation: Offline storage, air-gapped where possible
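An address plan like the one above can be checked mechanically before it is handed to the network team. The following is a minimal sketch using Python's standard `ipaddress` module; the `VLANS` table simply restates the subnets listed above, and the `validate` helper is a hypothetical name introduced for illustration.

```python
import ipaddress

# VLAN supernets and their sub-allocations, restated from the list above.
VLANS = {
    10: ("10.10.10.0/24", []),
    20: ("10.20.0.0/16", ["10.20.1.0/24", "10.20.2.0/23"]),
    30: ("10.30.10.0/24", []),
    40: ("10.40.10.0/24", []),
    50: ("10.50.0.0/16", ["10.50.10.0/24", "10.50.20.0/23", "10.50.30.0/24"]),
    60: ("10.60.10.0/24", []),
}


def validate(vlans):
    """Return a list of problems; an empty list means the plan is consistent."""
    problems = []
    nets = {vid: ipaddress.ip_network(cidr) for vid, (cidr, _) in vlans.items()}
    ids = sorted(nets)
    # No two VLANs may share address space.
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if nets[a].overlaps(nets[b]):
                problems.append(f"VLAN {a} overlaps VLAN {b}")
    for vid, (_, subnets) in vlans.items():
        subs = [ipaddress.ip_network(s) for s in subnets]
        # Every sub-allocation must sit inside its VLAN's supernet.
        for s in subs:
            if not s.subnet_of(nets[vid]):
                problems.append(f"{s} is outside VLAN {vid}")
        # Sub-allocations inside one VLAN must not overlap each other.
        for i, a in enumerate(subs):
            for b in subs[i + 1:]:
                if a.overlaps(b):
                    problems.append(f"{a} overlaps {b} in VLAN {vid}")
    return problems


print(validate(VLANS))  # prints []
```

Because `ip_network` is strict by default, it also rejects a CIDR whose host bits are set, which catches typos such as `10.20.3.0/23`.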
 ### 2.2 Firewall Rules
 **Principle:** Deny all, allow only what is explicitly needed
 **Management VLAN → Swarm VLAN:**
 ```
 Source: 10.10.10.40 (GitOps Operator)
 Destination: 10.20.1.0/24 (Swarm Managers)
 Ports: 2377/tcp (cluster management)
 Action: ALLOW
 Logging: YES
 Source: 10.10.10.50 (Portainer)
 Destination: 10.20.1.0/24 (Swarm Managers)
 Ports: 2376/tcp (Docker API over TLS)
 Action: ALLOW
 Logging: YES
 All other traffic: DENY
 ```
 **Swarm VLAN → Harbor (Management VLAN):**
 ```
 Source: 10.20.0.0/16 (All Swarm nodes)
 Destination: 10.10.10.30 (Harbor)
 Ports: 443/tcp, 5000/tcp (HTTPS, Docker registry)
 Protocol: TLS 1.3 with mutual authentication
 Action: ALLOW
 Logging: YES
 All other traffic: DENY
 ```
 **AI VLAN → Data Sources:**
 ```
 Source: 10.30.10.20 (MCP Server)
 Destination: Multiple (via MCP connectors)
 Ports: Varies (SSH 22, HTTPS 443, PostgreSQL 5432, etc.)
 Access: READ-ONLY
 Authentication: Service account per destination
 Logging: ALL QUERIES LOGGED
 Action: ALLOW with rate limiting
 Write operations: DENY
 ```
 **Monitoring VLAN → All Zones:**
 ```
 Source: 10.40.10.10 (Prometheus)
 Destination: ALL VLANs
 Ports: Metrics endpoints (typically 9090-9999)
 Access: READ-ONLY metrics scraping
 Action: ALLOW
 Logging: NO (too verbose, metrics only)
 Any non-metrics ports: DENY
 ```
 **Data VLAN → Backup VLAN:**
 ```
 Source: 10.50.0.0/16 (All databases)
 Destination: 10.60.10.10 (Backup server)
 Ports: Backup protocol specific
 Direction: ONE-WAY (source → backup only)
 Action: ALLOW
 Logging: YES
 Encryption: MANDATORY
 Reverse direction: DENY (except for restore procedures)
 ```
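The "deny all, allow explicitly" principle can be illustrated with a small rule evaluator. This is a sketch only: the `Rule` type and `is_allowed` function are hypothetical, and the two rules shown restate the GitOps Operator and Harbor entries from the blocks above. A real firewall would also track protocol, direction, and connection state.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    source: str          # CIDR of permitted senders
    destination: str     # CIDR of permitted receivers
    ports: frozenset     # permitted TCP destination ports


RULES = [
    # GitOps Operator -> Swarm managers, cluster management
    Rule("10.10.10.40/32", "10.20.1.0/24", frozenset({2377})),
    # All Swarm nodes -> Harbor, HTTPS and registry
    Rule("10.20.0.0/16", "10.10.10.30/32", frozenset({443, 5000})),
]


def is_allowed(src, dst, port, rules=RULES):
    """Default deny: traffic passes only if some rule matches explicitly."""
    s, d = ipaddress.ip_address(src), ipaddress.ip_address(dst)
    for rule in rules:
        if (s in ipaddress.ip_network(rule.source)
                and d in ipaddress.ip_network(rule.destination)
                and port in rule.ports):
            return True
    return False


print(is_allowed("10.10.10.40", "10.20.1.2", 2377))  # operator to manager
print(is_allowed("10.10.10.20", "10.20.1.2", 2377))  # Jenkins to manager
```

The second call falls through every rule and is denied, which is the behavior the `All other traffic: DENY` lines describe.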
 ### 2.3 External Connectivity
 **VPN Gateway:**
 - Public IP for VPN connections
 - Multi-factor authentication is mandatory
 - Certificate-based authentication + one-time password
 - Split tunneling is prohibited (all traffic goes through the VPN)
 - Session timeout: 8 hours
 - Idle timeout: 30 minutes
 - Disconnect after 3 failed MFA attempts
 **Jump Host/Bastion:**
 - Single entry point after the VPN
 - Session recording for auditing
 - No direct access to production systems, only via the jump host
 - Authorized keys are managed centrally
 - Automatic logout after 15 minutes idle
 - Audit log of all commands
 **Allowed users:**
 - Developers: Access to Gitea, Jenkins, Portainer (read-only for production)
 - DevOps: Full access to all management systems
 - Security team: Read-only audit access to everything
 - Managers: Grafana and reporting dashboards only
 ---
 ## 3. Zones and Their Purpose
 ### 3.1 Management & CI/CD Zone
 **Purpose:**
 Centralized management of code, CI/CD processes, and the container registry.
 **Criticality:** HIGH - downtime blocks deployment of new versions
 **Components:**
 **Gitea (10.10.10.10):**
 - Role: Single source of truth for all code and configuration
 - Interaction: Accepts pushes from developers, sends webhooks to Jenkins
 - Dependencies: PostgreSQL (VLAN 50), shared storage for Git LFS
 - SLA: 99.9% uptime
 **Jenkins (10.10.10.20):**
 - Role: CI automation; builds and tests applications
 - Interaction: Receives webhooks from Gitea, pushes images to Harbor, updates Git
 - Dependencies: Gitea, Harbor, Docker build agents
 - SLA: 99.5% uptime (can run in degraded mode)
 **Harbor (10.10.10.30):**
 - Role: Enterprise container registry with security scanning
 - Interaction: Accepts pushes from Jenkins, serves pulls from Swarm nodes
 - Dependencies: PostgreSQL, object storage for images
 - SLA: 99.9% uptime (critical for image pulls)
 **GitOps Operator (10.10.10.40):**
 - Role: Automatic Git → Swarm synchronization
 - Interaction: Monitors Gitea, applies changes to Swarm via the API
 - Dependencies: Gitea, Docker Swarm API
 - SLA: 99.9% uptime
 **Portainer (10.10.10.50):**
 - Role: Web UI for managing and monitoring Swarm
 - Interaction: Connects to Swarm managers via the Docker API
 - Dependencies: Docker Swarm API, PostgreSQL for its own database
 - SLA: 99% uptime (not critical; a CLI alternative exists)
 **Redundancy:**
 - Gitea: Master-slave replication, automated failover
 - Jenkins: Standby instance in warm mode
 - Harbor: Geo-replication to a secondary site
 - GitOps Operator: Active-passive pair
 - Portainer: Standby instance
 ### 3.2 Docker Swarm Cluster Zone
 **Purpose:**
 Runs production workloads with high availability and load balancing.
 **Criticality:** CRITICAL - direct impact on business services
 **Swarm Manager Nodes (10.20.1.1-3):**
 - Count: 3 for quorum (an odd number is recommended)
 - Role: Cluster orchestration, scheduling, API endpoint
 - Raft consensus: At least 2 of the 3 managers must be alive for the cluster to operate
 - Workload: Do NOT run application containers (infrastructure only)
 - CPU: 4 vCPU each
 - RAM: 8 GB each
 - Disk: 200 GB SSD each
 - Network: 10 Gbps for Raft communication
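The quorum arithmetic behind the manager count is worth spelling out. Raft needs a strict majority of managers to agree, so the quorum is `N // 2 + 1` and the number of failures tolerated is `(N - 1) // 2`. The helper names below are illustrative only.

```python
def raft_quorum(managers: int) -> int:
    """Smallest number of managers that forms a strict majority."""
    return managers // 2 + 1


def failures_tolerated(managers: int) -> int:
    """How many managers can be lost while a majority still remains."""
    return (managers - 1) // 2


for n in (1, 3, 4, 5, 7):
    print(f"{n} managers: quorum {raft_quorum(n)}, tolerates {failures_tolerated(n)}")
```

Three managers and four managers both tolerate a single failure, which is why an odd count is recommended: the extra even node adds Raft traffic without adding fault tolerance.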
 **Swarm Worker Nodes (10.20.2.1-N):**
 - Count: Depends on workload; minimum 3 for redundancy
 - Role: Run application containers
 - Constraints: Nodes can be labeled for specific workloads
 - CPU: 8-16 vCPU each
 - RAM: 32-64 GB each
 - Disk: 500 GB SSD each
 - Network: 10 Gbps for overlay network performance
 **Overlay Networks:**
 - Data-plane encryption (IPsec), enabled per network with the `encrypted` driver option
 - Service discovery via DNS
 - Load balancing via the routing mesh
 - Isolation between different stacks
 **Secrets Management:**
 - Docker Swarm secrets encrypted at rest
 - Rotation via stack update
 - Mounted as files in containers
 - Audit log of access to secrets
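Since secrets are mounted as files, application code reads them from the filesystem rather than from environment variables. A minimal sketch, assuming the default Linux mount location `/run/secrets/<name>`; the `read_secret` helper and the secret name in the usage note are hypothetical.

```python
from pathlib import Path


def read_secret(name: str, base_dir: str = "/run/secrets") -> str:
    """Read a Swarm secret that was mounted into the container as a file."""
    # Reject names that would escape the secrets directory.
    if Path(name).name != name or name in ("", ".", ".."):
        raise ValueError(f"invalid secret name: {name!r}")
    path = Path(base_dir) / name
    # Secret files often end with a trailing newline; strip it.
    return path.read_text(encoding="utf-8").strip()
```

A service would call something like `read_secret("db_password")` at startup. Because a rotated secret arrives through a stack update that restarts the tasks, reading once at startup is usually sufficient.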
 **Redundancy:**
 - Manager nodes: (N-1)/2 failure tolerance (3 nodes = 1 failure tolerated)
 - Worker nodes: Application replicas are spread across different nodes
 - Persistent data: Replicated storage (GlusterFS or NFS with HA)
 - Network: Bonded interfaces for redundancy
 ### 3.3 AI & Analytics Zone
 **Purpose:**
 Provides AI-powered assistance by analyzing internal data sources.
 **Criticality:** MEDIUM - a convenience, not critical for operations
 **Ollama Server (10.30.10.10):**
 - Role: Runs AI models locally on owned hardware
 - Models: Llama 3.3 70B, Qwen 2.5 Coder, DeepSeek, and others
 - Interaction: Receives requests from users and context from the MCP Server
 - Requirements: GPU highly recommended for performance
 - CPU: 16 vCPU (or fewer if a GPU is available)
 - RAM: 64 GB (models need a lot of memory)
 - GPU: NVIDIA A100 40GB or 2x RTX 4090 24GB (optional but recommended)
 - Disk: 2 TB NVMe SSD (each model takes 10-100 GB)
 - Network: 10 Gbps for fast responses
 **MCP Server (10.30.10.20):**
 - Role: Integrates AI with data sources (Gitea, Swarm, DBs, logs)
 - Connectors: Modular plugins, one per source
 - Interaction: Read-only queries to data sources, passes context to Ollama
 - Security: A service account per connector, all queries audited
 - CPU: 8 vCPU
 - RAM: 16 GB
 - Disk: 100 GB SSD
 - Network: 1 Gbps
 **Vector Database (10.30.10.30):**
 - Role: Stores documentation embeddings for semantic search
 - Technology: Qdrant or Milvus
 - Size: Depends on the amount of documentation
 - CPU: 4 vCPU
 - RAM: 16 GB (depends on index size)
 - Disk: 500 GB SSD
 - Network: 1 Gbps
 **Data Flow:**
 User → Ollama → MCP Server → (in parallel):
 - Gitea MCP Connector → Gitea (documentation, code)
 - Swarm MCP Connector → Docker API (status, logs)
 - Database MCP Connector → PostgreSQL (metadata)
 - Prometheus MCP Connector → Metrics
 - Loki MCP Connector → Logs
 → Aggregated context → Ollama → Response to the user
 **Redundancy:**
 - Ollama: Standby instance (warm standby)
 - MCP Server: Active-passive pair
 - Vector DB: Replicated for HA
 ### 3.4 Monitoring & Logging Zone
 **Purpose:**
 Infrastructure observability for proactive monitoring and troubleshooting.
 **Criticality:** HIGH - required to detect problems
 **Prometheus (10.40.10.10):**
 - Role: Collects and stores time-series metrics
 - Scrape targets: All infrastructure components
 - Retention: 30 days in Prometheus, long-term in Thanos/VictoriaMetrics
 - CPU: 8 vCPU
 - RAM: 32 GB
 - Disk: 2 TB HDD (time-series data)
 - Network: 1 Gbps
 **Grafana (10.40.10.20):**
 - Role: Visualization of metrics and logs
 - Dashboards: Preconfigured for each component
 - Alerting: Visual alert editor
 - CPU: 4 vCPU
 - RAM: 8 GB
 - Disk: 100 GB SSD
 - Network: 1 Gbps
 **Loki (10.40.10.30):**
 - Role: Centralized log storage
 - Agents: Promtail on every node
 - Retention: 90 days
 - CPU: 8 vCPU
 - RAM: 16 GB
 - Disk: 5 TB HDD (logs)
 - Network: 1 Gbps
 **AlertManager (10.40.10.40):**
 - Role: Alert processing and routing
 - Integrations: Slack, Email, PagerDuty, Telegram
 - Deduplication: Groups similar alerts
 - CPU: 2 vCPU
 - RAM: 4 GB
 - Disk: 50 GB SSD
 - Network: 1 Gbps
 **Redundancy:**
 - Prometheus: Federated setup, multiple instances
 - Grafana: Load balanced instances
 - Loki: Distributed deployment
 - AlertManager: Clustered for HA
 ### 3.5 Data & Database Zone
 **Purpose:**
 Stores persistent data for infrastructure and applications.
 **Criticality:** CRITICAL - data loss is unacceptable
 **Infrastructure PostgreSQL Cluster (10.50.10.10-11):**
 - Role: Databases for Gitea, Harbor, Portainer
 - Topology: Master-slave with automatic failover
 - Backup: Continuous WAL archiving + daily full backup
 - Encryption: At rest (LUKS) and in transit (TLS)
 - CPU: 8 vCPU per instance
 - RAM: 16 GB per instance
 - Disk: 500 GB SSD per instance
 - Network: 10 Gbps
 **Application Databases (10.50.20.x):**
 - Role: Databases for business applications
 - Technologies: Depends on the application (PostgreSQL, MySQL, MongoDB)
 - Isolation: Each application in its own database/schema
 - Backup: Application-specific strategy
 - Resources: Depends on workload
 **Shared Storage (10.50.30.1-3):**
 - Role: Persistent volumes for Swarm services
 - Technology: GlusterFS (replicated) or NFS with HA
 - Replication: 3x for fault tolerance
 - Snapshots: Hourly, retention 7 days
 - Capacity: 10 TB (grows as needed)
 - Network: 10 Gbps for I/O performance
 **Redundancy:**
 - PostgreSQL: Synchronous replication, automatic failover
 - Shared Storage: Distributed replication (GlusterFS 3-way)
 - Backups: Multiple copies in different locations
 ### 3.6 Backup & DR Zone
 **Purpose:**
 Protects against data loss and enables fast recovery after a disaster.
 **Criticality:** CRITICAL for long-term business resilience
 **Backup Server (10.60.10.10):**
 - Role: Receives and stores backups
 - Technology: Bacula or Bareos (enterprise backup solution)
 - Scheduling: Automated on a schedule + on-demand
 - Encryption: All backups encrypted at rest
 - CPU: 4 vCPU
 - RAM: 8 GB
 - Disk: 20 TB HDD (RAID 10)
 - Network: 10 Gbps for fast backups
 **Backup Strategy:**
 **Hourly Incremental:**
 - Git repositories (changes only)
 - Retention: 48 hours
 **Daily Full:**
 - Databases (full dump)
 - Docker Swarm configs
 - Important logs
 - Retention: 30 days
 **Weekly Full:**
 - Full snapshot of the entire infrastructure
 - VM images, configs, data
 - Retention: 12 weeks
 **Monthly Archives:**
 - Long-term compliance storage
 - Retention: 7 years (regulatory requirement)
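The retention tiers above reduce to a lookup from backup kind to maximum age. The sketch below is illustrative: the tier keys and the `is_expired` helper are hypothetical names, and seven years is approximated as 7 x 365 days.

```python
from datetime import datetime, timedelta

# Retention per backup tier, restated from the strategy above.
RETENTION = {
    "hourly": timedelta(hours=48),
    "daily": timedelta(days=30),
    "weekly": timedelta(weeks=12),
    "monthly": timedelta(days=7 * 365),  # 7 years, ignoring leap days
}


def is_expired(kind: str, created_at: datetime, now: datetime) -> bool:
    """True when a backup of the given tier is older than its retention window."""
    return now - created_at > RETENTION[kind]


now = datetime(2026, 1, 12, 13, 0)
print(is_expired("hourly", now - timedelta(hours=49), now))  # past 48 hours
print(is_expired("daily", now - timedelta(days=29), now))    # still within 30 days
```

A pruning job would apply this check to each catalog entry. In practice Bacula and Bareos implement retention through their own pool and volume settings, so a function like this is more useful for auditing than for deletion.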
 **DR Site (optional, in a different data center):**
 - Role: Geographic redundancy
 - Replication: Asynchronous from the primary site
 - RTO (Recovery Time Objective): 4 hours
 - RPO (Recovery Point Objective): 15 minutes
 - Testing: Quarterly DR drills
 ---
 ## 4. Data Flows
 ### 4.1 Development Workflow
 **Developer commits code:**
 ```
 Developer Workstation
 ↓ (SSH over VPN)
 Gitea (VLAN 10)
 ↓ (Webhook HTTPS + signature verification)
 Jenkins (VLAN 10)
 ↓ (git clone through SSH)
 Gitea
 ```
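The "signature verification" step on the webhook can be sketched as an HMAC comparison. This assumes the sender signs the raw request body with HMAC-SHA256 and a shared secret and sends the hex digest in a header (Gitea documents an `X-Gitea-Signature` header of this kind; confirm against the deployed version). The function name and sample values are hypothetical.

```python
import hashlib
import hmac


def verify_signature(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Check that signature_hex is the HMAC-SHA256 of body under secret."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, which avoids timing side channels.
    return hmac.compare_digest(expected.encode(), signature_hex.encode())


secret = b"example-shared-secret"          # hypothetical value
body = b'{"ref": "refs/heads/main"}'       # raw bytes exactly as received
good = hmac.new(secret, body, hashlib.sha256).hexdigest()
print(verify_signature(secret, body, good))         # matching signature
print(verify_signature(secret, body + b" ", good))  # tampered body
```

The check must run on the raw bytes of the request, before any JSON parsing or re-serialization, because even a whitespace change alters the digest.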
 **CI Pipeline execution:**
 ```
 Jenkins
 ↓ (build application)
 Build Agent (ephemeral container/VM)
 ↓ (run tests)
 Test results → Archived in Jenkins
 ↓ (build Docker image)
 Docker build agent
 ↓ (security scan with Trivy)
 Vulnerability report
 ↓ (docker push over TLS + creds)
 Harbor (VLAN 10)
 ```
 **Update GitOps repo:**
 ```
 Jenkins
 ↓ (update image tag in compose file)
 Gitea GitOps repository
 ↓ (commit + push)
 Gitea
 ```
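The "update image tag" step amounts to a targeted text edit in the compose file. The sketch below uses a regular expression so that comments and formatting are preserved; it handles only plain, unquoted `image: name:tag` lines. The function name, registry host, and image path are hypothetical.

```python
import re


def set_image_tag(compose_text: str, image: str, new_tag: str) -> str:
    """Replace the tag of every `image: <image>:<tag>` line in a compose file."""
    pattern = re.compile(
        r"^([ \t]*image:[ \t]*)" + re.escape(image) + r":[\w.\-]+[ \t]*$",
        re.MULTILINE,
    )
    # A function replacement avoids backslash escapes in new_tag being interpreted.
    return pattern.sub(lambda m: f"{m.group(1)}{image}:{new_tag}", compose_text)


compose = """services:
  api:
    image: harbor.example.internal/payments/api:1.4.2
    deploy:
      replicas: 3
"""
print(set_image_tag(compose, "harbor.example.internal/payments/api", "1.4.3"))
```

A pipeline would write the result back, commit, and push. Pinning by digest instead of by tag would make the deployed artifact unambiguous, at the cost of less readable diffs.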
 ### 4.2 CD Workflow
 **GitOps sync:**
 ```
 GitOps Operator (VLAN 10)
 ↓ (poll Git repository every 30 sec)
 Gitea
 ↓ (detect changes)
 GitOps Operator
 ↓ (docker stack deploy via Swarm API)
 Swarm Managers (VLAN 20)
 ```
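The polling loop in this flow can be sketched as a small reconcile function. This is not the operator's actual implementation: `get_head` and `deploy` are hypothetical callables that, in a real operator, might wrap `git ls-remote` and `docker stack deploy` respectively.

```python
import time
from typing import Callable, Optional


def sync_once(get_head: Callable[[], str],
              deploy: Callable[[str], None],
              last_applied: Optional[str]) -> str:
    """Deploy only when the repository head differs from what was last applied."""
    head = get_head()
    if head != last_applied:
        deploy(head)
    return head


def run(get_head, deploy, interval_seconds=30.0, max_iterations=None,
        sleep=time.sleep):
    """Poll at a fixed interval; a failed sync is retried on the next tick."""
    last = None
    done = 0
    while max_iterations is None or done < max_iterations:
        try:
            last = sync_once(get_head, deploy, last)
        except Exception as exc:
            # last is left unchanged, so the same revision is retried.
            print(f"sync failed, will retry: {exc}")
        sleep(interval_seconds)
        done += 1
    return last
```

Recording the revision as applied only after `deploy` returns makes the loop idempotent: an unchanged head costs one cheap lookup per tick, and a failed deploy is attempted again without manual intervention.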
 **Swarm orchestration:**
 ```
 Swarm Manager
 ↓ (schedule tasks on workers)
 Swarm Scheduler
 ↓ (pull image from Harbor)
 Worker Nodes ↔ Harbor (VLAN 10)
 ↓ (start containers)
 Application Running
 ```
 **Service update (rolling):**
 ```
 Swarm Manager
 ↓ (stop 1 task out of N)
 Worker Node A
 ↓ (start new task with the new image)
 Worker Node B
 ↓ (verify health check)
 Health Check (5 consecutive passes required)
 ↓ (proceed to next task)
 Repeat until all tasks updated
 ```
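The "5 consecutive passes" gate in the rolling update can be expressed as a streak counter that resets on any failure. This is an illustrative sketch, not Swarm's own health-check logic; `probe` is a hypothetical callable that returns `True` when the new task answers its health endpoint.

```python
from typing import Callable


def wait_healthy(probe: Callable[[], bool],
                 required_passes: int = 5,
                 max_attempts: int = 30) -> bool:
    """Return True once probe succeeds required_passes times in a row."""
    streak = 0
    for _ in range(max_attempts):
        if probe():
            streak += 1
            if streak >= required_passes:
                return True
        else:
            # Any failure resets the streak; passes must be consecutive.
            streak = 0
    return False
```

A caller would proceed to the next task only when this returns `True` and would otherwise trigger a rollback. A production version would also sleep between probes.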
 ### 4.3 AI Interaction Flow
 **User query:**
 ```
 User (via Web UI or API)
 ↓ (HTTPS request)
 Ollama Server (VLAN 30)
 ↓ (request context via MCP protocol)
 MCP Server (VLAN 30)
 ```
 **MCP gathers context (parallel):**
 ```
 MCP Server
 ├→ Gitea MCP Connector → Gitea API (docs, code)
 ├→ Swarm MCP Connector → Docker API (logs, metrics)
 ├→ Database MCP Connector → PostgreSQL (metadata)
 ├→ Prometheus MCP Connector → Prometheus API (metrics)
 └→ Loki MCP Connector → Loki API (logs)
 ↓ (all responses aggregated)
 MCP Server
 ↓ (full context sent to AI)
 Ollama Server
 ↓ (generate response)
 User
 ```
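The parallel fan-out to connectors can be sketched with a thread pool. The connector functions here are hypothetical stand-ins for the real MCP connectors; the point is that one failing source is recorded as an error and does not discard the context gathered from the others.

```python
from concurrent.futures import ThreadPoolExecutor


def gather_context(query, connectors, timeout=10.0):
    """Query every connector in parallel and collect results and errors by name."""
    context, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max(1, len(connectors))) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in connectors.items()}
        for name, future in futures.items():
            try:
                # timeout bounds the wait for each result; the pool itself
                # still waits for running connectors when the block exits.
                context[name] = future.result(timeout=timeout)
            except Exception as exc:
                errors[name] = str(exc)
    return context, errors


def gitea_docs(query):          # hypothetical connector
    return f"docs matching {query!r}"


def loki_logs(query):           # hypothetical connector that fails
    raise RuntimeError("loki unreachable")


print(gather_context("disk alerts", {"gitea": gitea_docs, "loki": loki_logs}))
```

Returning errors alongside the context lets the caller tell the model which sources were unavailable instead of silently answering from partial data.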
 **AI response with action:**
 ```
 AI determines action needed
 ↓ (if requires change)
 AI suggests change to user
 ↓ (user approves)
 Change committed to Git
 ↓ (normal GitOps flow)
 Applied to infrastructure
 ```
 ### 4.4 Monitoring Data Flow
 **Metrics collection:**
 ```
 All Infrastructure Components
 ↓ (expose metrics endpoints)
 Prometheus Exporters
 ↓ (scrape every 15 seconds)
 Prometheus (VLAN 40)
 ↓ (evaluate alert rules)
 AlertManager (VLAN 40)
 ↓ (route notifications)
 Slack/Email/PagerDuty
 ```
 **Logs collection:**
 ```
 All Containers
 ↓ (stdout/stderr)
 Docker logging driver
 ↓ (forward)
 Promtail Agent (on every node)
 ↓ (push)
 Loki (VLAN 40)
 ↓ (index and store)
 Loki Storage
 ↓ (query)
 Grafana or CLI
 ```
 **Audit logs:**
 ```
 All Infrastructure Actions
 ├→ Gitea (Git operations)
 ├→ Docker Swarm (API calls)
 ├→ Harbor (image push/pull)
 ├→ Jenkins (builds)
 └→ SSH sessions (bastion)
 ↓ (forward)
 Centralized Syslog
 ↓ (store)
 Long-term Audit Storage (7 years)
 ```
 ---
 ## 5. High Availability and Scaling
 ### 5.1 HA Strategy
 **Tier 1 - Critical (99.99% uptime):**
 - Docker Swarm (application platform)
 - Harbor (cannot deploy without it)
 - Shared Storage (persistent data)
 - Strategy: Active-Active where possible, N+1 redundancy
 **Tier 2 - Important (99.9% uptime):**
 - Gitea (code access)
 - GitOps Operator (CD automation)
 - Databases (infrastructure metadata)
 - Strategy: Active-Passive with automatic failover
 **Tier 3 - Nice to have (99% uptime):**
 - Jenkins (can wait for restore)
 - Portainer (CLI alternative exists)
 - Monitoring (short downtime acceptable)
 - Strategy: Warm standby, manual failover
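The uptime targets of the three tiers translate into concrete downtime budgets: over a 365-day year, 99.99% allows roughly 52.6 minutes, 99.9% about 8.76 hours, and 99% about 87.6 hours. A short sketch of the arithmetic, with an illustrative helper name:

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_hours(uptime_percent: float) -> float:
    """Yearly downtime budget implied by an uptime target."""
    return HOURS_PER_YEAR * (1 - uptime_percent / 100)


for tier, target in (("Tier 1", 99.99), ("Tier 2", 99.9), ("Tier 3", 99.0)):
    hours = allowed_downtime_hours(target)
    print(f"{tier}: {target}% uptime allows {hours:.2f} h/year ({hours * 60:.0f} min)")
```

The Tier 1 budget is shorter than a single manual failover would typically take, which is why that tier relies on active-active designs rather than standby instances.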
 ### 5.2 Scaling Points
 **Vertical Scaling (adding resources):**
 - Databases: More RAM for cache
 - Ollama: Add GPUs for speed
 - Harbor storage: More disk for images
 - Limit: Hardware limitations
 **Horizontal Scaling (adding instances):**
 - Swarm Workers: Add nodes for capacity
 - Jenkins Agents: Dynamic scaling on demand
 - Prometheus: Federation for distributed scraping
 - MCP Connectors: Independent instances per source
 **Data Scaling:**
 - PostgreSQL: Read replicas for read-heavy workloads
 - Harbor: Geo-replication for distributed teams
 - Loki: Sharding by time
 - Git: Repository sharding (rarely needed)
 ### 5.3 Capacity Planning
 **Metrics to track:**
 - CPU utilization (target <70% average)
 - Memory utilization (target <80%)
 - Disk usage (alert at 80%, critical at 90%)
 - Network bandwidth (baseline + trend analysis)
 - IOPS (SSD wear, performance degradation)
 **Growth projections:**
 - Applications: 20% growth per year
 - Code repositories: 30% growth per year (cumulative)
 - Logs: 50% growth per year (more verbose logging)
 - Metrics retention: Linear in the number of services
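Growth rates like these compound, so it helps to compute when a volume will be exhausted. The sketch below projects usage under a fixed annual rate; the starting figure of 1 TB of logs is a hypothetical example, while the 5 TB capacity and 50% rate come from this document.

```python
import math


def projected(current: float, annual_growth: float, years: float) -> float:
    """Size after compounding annual_growth (0.5 means 50%) for the given years."""
    return current * (1 + annual_growth) ** years


def years_until(current: float, capacity: float, annual_growth: float) -> float:
    """Years until compounded usage reaches capacity (0 if already there)."""
    if current >= capacity:
        return 0.0
    return math.log(capacity / current) / math.log(1 + annual_growth)


# Hypothetical: 1 TB of logs today on the 5 TB Loki disk, growing 50% per year.
print(projected(1.0, 0.5, 3))                # usage in TB after three years
print(round(years_until(1.0, 5.0, 0.5), 2))  # years until the disk is full
```

Under those assumptions the disk fills in just under four years, so the expansion should be budgeted well before the 80% disk alert fires.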
 **Scaling triggers:**
 - Add a Swarm worker when CPU >80% sustained
 - Upgrade the database when query latency >100ms p95
 - Expand storage when >75% used
 - Add Jenkins agents when queue >5 builds
 ---
 ## 6. Disaster Recovery
 ### 6.1 RTO and RPO Targets
 **Recovery Time Objective (RTO):**
 - Tier 1 services: 1 hour
 - Tier 2 services: 4 hours
 - Tier 3 services: 24 hours
 - Full infrastructure: 8 hours
 **Recovery Point Objective (RPO):**
 - Databases: 15 minutes (via WAL shipping)
 - Git repositories: 1 hour (hourly backup)
 - Docker images: 0 (replicated to DR)
 - Configs: 0 (in Git)
 - Logs: 1 hour (buffered before ingestion)
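An RPO is only meaningful if the age of the newest recovery point is checked against it continuously. A minimal sketch, assuming a hypothetical mapping from data set to the timestamp of its last successful backup or shipped WAL segment; the targets restate the non-zero RPO values above.

```python
from datetime import datetime, timedelta

# RPO targets restated from the list above (zero-RPO items are omitted).
RPO = {
    "databases": timedelta(minutes=15),
    "git": timedelta(hours=1),
    "logs": timedelta(hours=1),
}


def rpo_violations(last_recovery_point: dict, now: datetime) -> list:
    """Names of data sets whose newest recovery point is older than the RPO."""
    return sorted(
        name for name, taken_at in last_recovery_point.items()
        if now - taken_at > RPO[name]
    )


now = datetime(2026, 1, 12, 13, 0)
state = {
    "databases": now - timedelta(minutes=22),  # WAL shipping has stalled
    "git": now - timedelta(minutes=40),
    "logs": now - timedelta(minutes=5),
}
print(rpo_violations(state, now))  # prints ['databases']
```

Alerting on this condition turns the RPO from a paper target into something that is observed, and it surfaces a stalled WAL archive before a failure makes the gap matter.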
 ### 6.2 DR Scenarios
 **Scenario 1: Single server failure**
 - Detection: Automated monitoring
 - Response: Automatic failover to redundant instance
 - Recovery time: <5 minutes
 - Data loss: None (active-active or sync replication)
 **Scenario 2: Network partition**
 - Detection: Raft consensus loss, monitoring alerts
 - Response: Manual investigation, possible split-brain resolution
 - Recovery time: 30 minutes
 - Data loss: Possible if writes were accepted by the minority partition
 **Scenario 3: Data center failure**
 - Detection: Total loss of connectivity
 - Response: Failover to DR site
 - Recovery time: 4 hours (RTO)
 - Data loss: Up to 15 minutes (RPO)
 **Scenario 4: Ransomware/Corruption**
 - Detection: File integrity monitoring, unusual encryption activity
 - Response: Isolate affected systems, restore from clean backup
 - Recovery time: 8 hours (full rebuild)
 - Data loss: Up to last clean backup (potentially hours)
 **Scenario 5: Human error (accidental delete)**
 - Detection: Git history, audit logs
 - Response: Restore from backup or Git revert
 - Recovery time: 1-2 hours
 - Data loss: None (everything in version control)
 ### 6.3 Recovery Procedures
 **Database Recovery:**
 - Stop application access
 - Restore base backup
 - Apply WAL logs up to the target point in time
 - Verify data integrity
 - Resume application access
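Point-in-time recovery starts from the newest base backup taken at or before the target time; WAL is then replayed forward from that backup up to the target (in PostgreSQL the target is typically set with `recovery_target_time`). Choosing the base backup can be sketched as follows; the function name and timestamps are illustrative.

```python
from datetime import datetime


def pick_base_backup(backup_times, target):
    """Return the newest base backup taken at or before the recovery target."""
    candidates = [t for t in backup_times if t <= target]
    if not candidates:
        raise ValueError("no base backup precedes the recovery target")
    return max(candidates)


backups = [
    datetime(2026, 1, 10, 2, 0),
    datetime(2026, 1, 11, 2, 0),
    datetime(2026, 1, 12, 2, 0),
]
# Recover to just before an incident at 01:30 on 12 January.
print(pick_base_backup(backups, datetime(2026, 1, 12, 1, 30)))
```

Here the backup from 11 January is selected, because the one from 12 January was taken after the target. Replay then needs an unbroken WAL archive from that backup to the target, which is why continuous WAL archiving is listed alongside the daily full backup.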
 **Git Repository Recovery:**
 - Clone from the DR site or restore from backup
 - Verify commit history integrity
 - Restore hooks and configurations
 - Test push/pull operations
 - Notify team of recovery
 **Docker Swarm Recovery:**
 - Deploy manager nodes from backup configs
 - Join worker nodes
 - Restore network and volume configs
 - Deploy stacks from Git
 - Verify service health
 **Full Site Recovery:**
 - Deploy infrastructure from Terraform/IaC
 - Restore databases from backup
 - Clone Git repositories
 - Deploy Docker Swarm
 - Apply all stacks from GitOps
 - Verify end-to-end functionality
 - Switch DNS to DR site
 - Notify stakeholders
 ### 6.4 Testing DR
 **Monthly:**
 - Restore test on separate infrastructure
 - Verify backup integrity
 - Test recovery procedures
 **Quarterly:**
 - Full DR drill with failover to the DR site
 - Measure actual RTO/RPO
 - Update procedures based on findings
 **Annually:**
 - Tabletop exercise with all stakeholders
 - Test business continuity plans
 - Update procedures and train staff on changes
 ---
 **Next documents:**
 - **03-security-compliance.md** - Detailed security requirements
 - **04-component-specifications.md** - Technical specifications of components
 - **05-development-environment.md** - Dev environment for testing
 ---
 **Approval:**
 - Enterprise Architect: _______________
 - Security Architect: _______________
 - Infrastructure Lead: _______________
 - Date: _______________