# Ollama Server Requirements for FinTech DevOps with MCP Integration

**Version:** 2.0
**Date:** January 2026
**Status:** Production Ready
**Target audience:** Infrastructure Team, DevOps, Security, Management

---

## Executive Summary

### Business Case

**Problem:** The FinTech company generates a large volume of technical information (code, logs, documentation, Kubernetes configurations) scattered across many systems. Developers and DevOps engineers spend 30-40% of their time searching for information, analyzing logs, and writing documentation.

**Solution:** A self-hosted AI assistant built on Ollama, connected to all of the company's data sources through MCP (Model Context Protocol).

**Key benefits for FinTech:**

- ✅ Data never leaves the corporate network (PCI DSS, GDPR compliance)
- ✅ No dependency on external AI providers (OpenAI, Anthropic)
- ✅ Full control over the information being processed
- ✅ Ability to train on confidential data

**Expected impact:**

- 40% less time spent searching for information
- 50% faster documentation writing
- 30% shorter troubleshooting cycles
- ROI: 8-12 months

---

## Table of Contents

1. [Goals and Use Cases](#1-goals-and-use-cases)
2. [Solution Architecture](#2-solution-architecture)
3. [Server Requirements](#3-server-requirements)
4. [AI Model Selection](#4-ai-model-selection)
5. [MCP Services](#5-mcp-services)
6. [Knowledge Base (RAG)](#6-knowledge-base-rag)
7. [Security](#7-security)
8. [Deployment](#8-deployment)
9. [Monitoring](#9-monitoring)
10. [Budget](#10-budget)

---

## 1. Goals and Use Cases

### 1.1 Core Tasks

**For the DevOps team (5 people):**

1. **Kubernetes / Docker Swarm analysis**
   - "Why is this pod in CrashLoopBackOff?"
   - "How do we optimize resource requests?"
   - "Show all pods with high memory usage"

2. **Log-based troubleshooting**
   - "Find the cause of the 500 errors in the logs from the last hour"
   - "Which services are showing connection timeouts?"
   - "Analyze the performance degradation"

3. **Infrastructure code generation**
   - "Create a Helm chart for a microservice with PostgreSQL"
   - "Write Terraform for AWS RDS with encryption"
   - "Generate a docker-compose.yml"

**For developers (5 people):**

1. **Code generation and review**
   - "Write unit tests for this service"
   - "Optimize this SQL query"
   - "Code review: find potential security issues"

2. **Working with documentation**
   - "How do I use our internal payment API?"
   - "Show integration examples for the fraud detection service"

A minimal example of what such a question looks like at the API level is sketched at the end of this section.

### 1.2 Technical Requirements

- **Concurrent users:** up to 10 people
- **Peak concurrent requests:** 8 at a time
- **Data sources:**
  - Gitea (100+ repositories)
  - Docker Swarm (50+ services)
  - Kubernetes cluster (150+ pods, where used)
  - Loki logs (1 TB/month)
  - Technical documentation (5000+ documents)
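To make the use cases above concrete, the sketch below shows a single question sent to the assistant over Ollama's `/api/generate` endpoint. It is an illustration only: it calls the Ollama address from section 2 directly, while real clients go through the API gateway, and the `ask()` helper and its defaults are hypothetical.

```python
"""Minimal sketch: sending one DevOps question to the Ollama HTTP API.

Assumptions (not prescribed by this document): the `requests` library is
installed, the caller is inside the corporate network, and the Ollama
endpoint from section 2 (10.30.10.10:11434) is reachable directly.
"""
import requests

OLLAMA_URL = "http://10.30.10.10:11434/api/generate"  # use the gateway URL in production


def ask(prompt: str, model: str = "qwen2.5-coder:32b", timeout: int = 120) -> str:
    """Send a single non-streaming prompt to Ollama and return the answer text."""
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=timeout,
    )
    resp.raise_for_status()
    return resp.json()["response"]


if __name__ == "__main__":
    print(ask("Why would a Kubernetes pod be stuck in CrashLoopBackOff?"))
```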
---

## 2. Solution Architecture

### 2.1 High-Level Architecture

```
USER ACCESS LAYER
├─ Web UI (Gradio)
├─ VS Code (Extension)
└─ CLI Tool (Python)
        │
        ▼
API GATEWAY / REVERSE PROXY (Traefik/Nginx)
├─ TLS termination
├─ Authentication (LDAP/OIDC)
├─ Rate limiting (100 req/min per user)
└─ IP: 10.30.10.5
        │
        ▼
OLLAMA INFERENCE LAYER
├─ Models (hot-loaded):
│  ├─ qwen2.5-coder:32b (Code)
│  ├─ deepseek-r1:32b (Reasoning)
│  └─ llama3.3:70b-q4 (Universal)
├─ GPU: 1x NVIDIA RTX 4090 24GB
├─ CPU: 32 vCPU
├─ RAM: 128 GB
└─ IP: 10.30.10.10:11434
        │
        ▼
MCP (MODEL CONTEXT PROTOCOL) LAYER
├─ MCP Orchestrator (IP: 10.30.10.20)
│  ├─ Request routing
│  └─ Context assembly
└─ MCP servers: Gitea · Docker Swarm · K8s · Logs (Loki) · Docs · CI/CD
        │
        ▼
KNOWLEDGE BASE / RAG LAYER
├─ Vector Database (Qdrant)
│  ├─ technical-docs (5000+ docs)
│  ├─ code-snippets (10000+ samples)
│  ├─ k8s-configs (500+ manifests)
│  ├─ incidents (1000+ postmortems)
│  ├─ Storage: 500 GB
│  └─ IP: 10.30.10.30:6333
└─ Embedding Service
   ├─ bge-large-en-v1.5
   ├─ Text chunking (512 tokens)
   └─ IP: 10.30.10.31
```
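The diagram describes the MCP Orchestrator only at the box level. The sketch below illustrates, under stated assumptions, what "request routing" and "context assembly" could mean in code: pick the MCP servers relevant to a question, collect their snippets, and prepend them to the prompt sent to Ollama. The `MCP_SERVERS` endpoint map, the keyword router, and the `/call` JSON contract are hypothetical placeholders, not the MCP wire protocol.

```python
"""Illustrative sketch of the orchestrator's two jobs: routing and context assembly.

Hypothetical: the endpoint map, the keyword router, and the /call JSON contract
are assumptions for illustration; a real orchestrator speaks the MCP protocol.
"""
import requests

OLLAMA_URL = "http://10.30.10.10:11434/api/generate"

# Assumed internal endpoints for three of the MCP servers shown in the diagram
MCP_SERVERS = {
    "kubernetes": "http://10.30.10.21/call",
    "loki":       "http://10.30.10.22/call",
    "gitea":      "http://10.30.10.23/call",
}

ROUTES = {  # naive keyword routing; a real router could be model-driven
    "pod": ["kubernetes", "loki"],
    "log": ["loki"],
    "repo": ["gitea"],
}


def route(question: str) -> list[str]:
    """Pick which MCP servers look relevant for this question."""
    hits = [srv for kw, srvs in ROUTES.items() if kw in question.lower() for srv in srvs]
    return sorted(set(hits))


def assemble_context(question: str) -> str:
    """Query each relevant MCP server and concatenate the returned snippets."""
    parts = []
    for name in route(question):
        resp = requests.post(MCP_SERVERS[name], json={"question": question}, timeout=30)
        resp.raise_for_status()
        parts.append(f"### {name}\n{resp.json().get('context', '')}")
    return "\n\n".join(parts)


def answer(question: str, model: str = "deepseek-r1:32b") -> str:
    """Assemble context, then ask Ollama for the final answer."""
    prompt = f"{assemble_context(question)}\n\nQuestion: {question}"
    resp = requests.post(OLLAMA_URL, json={"model": model, "prompt": prompt, "stream": False}, timeout=300)
    resp.raise_for_status()
    return resp.json()["response"]
```

A production orchestrator could also let the model decide which tools to call, but the control flow stays the same: route, gather context, generate.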
---

## 3. Server Requirements

### 3.1 Production Configuration (Recommended)

| Component | Specification | Rationale |
|-----------|--------------|-----------|
| **GPU** | 1x NVIDIA RTX 4090 24GB VRAM | Best price/performance balance for 32B models |
| **GPU (alternative)** | 1x NVIDIA L40 48GB VRAM | For 70B models and large contexts |
| **CPU** | AMD Ryzen 9 7950X (16 cores, 32 threads) | Preprocessing, embedding, parallel MCP calls |
| **RAM** | 128 GB DDR5 ECC | 64 GB for OS/services + 64 GB for model offloading |
| **Storage Primary** | 2x 2TB NVMe SSD (RAID 1) | Model cache, vector DB, fast I/O |
| **Storage Secondary** | 4TB SATA SSD | Document storage, backups |
| **Network** | 2x 10 Gbps (bonded) | High throughput for MCP data retrieval |
| **PSU** | 1600W 80+ Titanium | GPU power requirements |

**Estimated Cost:** $12,000-15,000 (with RTX 4090) or $18,000-22,000 (with L40)

### 3.2 GPU Selection Guide

| Use Case | GPU | VRAM | Models Supported | Cost |
|----------|-----|------|------------------|------|
| **Code generation only** | RTX 3090 | 24 GB | qwen2.5-coder:32b | $1,000-1,500 |
| **Balanced (recommended)** | RTX 4090 | 24 GB | 32B models, 70B Q4 | $1,600-2,000 |
| **Large context (70B)** | L40 | 48 GB | llama3.3:70b | $6,000-8,000 |
| **Maximum capacity** | A100 | 80 GB | Multiple 70B models | $10,000-15,000 |

**Recommendation for FinTech:** RTX 4090 24GB is the optimal choice for 10 users.

### 3.3 Resource Allocation

**VRAM:**

```
Model memory (Q4 quantization):
  qwen2.5-coder:32b  → 22 GB VRAM
  deepseek-r1:32b    → 24 GB VRAM
  llama3.3:70b-q4    → 40 GB VRAM (needs L40)
```

**RAM (128 GB breakdown):**

```
16 GB → OS (Ubuntu Server)
 8 GB → Ollama service
32 GB → Vector DB (Qdrant)
16 GB → MCP Services
 8 GB → Embedding service
 8 GB → API Gateway + misc
40 GB → Model offloading buffer
```

**Storage (2 TB NVMe):**

```
300 GB → AI Models
500 GB → Vector Database
200 GB → MCP Services cache
100 GB → OS and applications
900 GB → Free space / growth
```
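The per-model VRAM figures above are empirical. A rough rule of thumb for Q4-quantized weights, used here only as an assumption, is about 0.55 bytes per parameter plus a few gigabytes of overhead for KV cache and CUDA context. The sketch below just encodes that arithmetic so the numbers can be sanity-checked for other model sizes; `estimate_q4_vram_gb` and the overhead constant are illustrative, not measured.

```python
def estimate_q4_vram_gb(params_billions: float, overhead_gb: float = 4.0) -> float:
    """Rough VRAM estimate for a Q4-quantized model.

    Assumption: ~0.55 bytes per parameter for Q4 weights plus a flat overhead
    for KV cache, activations, and CUDA context. Real usage depends on context
    length and the exact quantization variant.
    """
    weights_gb = params_billions * 0.55
    return round(weights_gb + overhead_gb, 1)


if __name__ == "__main__":
    for name, size_b in [("qwen2.5-coder:32b", 32), ("deepseek-r1:32b", 32), ("llama3.3:70b-q4", 70)]:
        print(f"{name:20s} ~{estimate_q4_vram_gb(size_b)} GB")
```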
---

## 4. AI Model Selection

### 4.1 Recommended Model Pool

**Primary Models:**

#### 1. qwen2.5-coder:32b - Code Specialist

```
Purpose:  Code generation, review, debugging
Size:     20 GB (Q4)
VRAM:     22 GB
Context:  32k tokens
Speed:    ~45 tokens/sec (RTX 4090)

Strengths:
  ✓ Best fit for infrastructure code (Terraform, K8s)
  ✓ Understands DevOps patterns
  ✓ Writes clear code comments

Use cases:
  • Generating Helm charts
  • Writing Bash scripts
  • Code review for security issues
  • Dockerfile optimization
```

#### 2. deepseek-r1:32b - Reasoning Engine

```
Purpose:  Complex analysis, troubleshooting
Size:     22 GB (Q4)
VRAM:     24 GB
Context:  64k tokens
Speed:    ~40 tokens/sec

Strengths:
  ✓ Excellent reasoning for root cause analysis
  ✓ Multi-step problem solving
  ✓ Complex system-level analysis

Use cases:
  • Log analysis and troubleshooting
  • Architecture decision making
  • Incident post-mortems
  • Performance optimization
```

#### 3. llama3.3:70b-q4 - Universal Assistant

```
Purpose:  Documentation, explanations
Size:     38 GB (Q4)
VRAM:     40 GB (needs L40)
Context:  128k tokens
Speed:    ~25 tokens/sec

Strengths:
  ✓ Best for long-form documentation
  ✓ Excellent writing quality
  ✓ Multi-lingual

Use cases:
  • Technical documentation
  • README files
  • Architecture design documents
```

### 4.2 Model Performance Benchmarks

**Real-world performance on an RTX 4090:**

| Task | Model | Context | Time | Quality |
|------|-------|---------|------|---------|
| **Code generation** | qwen2.5-coder:32b | 8k | 12 sec | 9/10 |
| **Log analysis** | deepseek-r1:32b | 32k | 25 sec | 9/10 |
| **Documentation** | llama3.3:70b-q4 | 64k | 90 sec* | 10/10 |
| **Quick Q&A** | qwen2.5-coder:32b | 2k | 3 sec | 8/10 |

\*The 70B model runs on the RTX 4090 via CPU offloading.

---

## 5. MCP Services

### 5.1 MCP Architecture

**Model Context Protocol (MCP)** is a standardized way of connecting AI models to external data sources.

### 5.2 MCP Server: Gitea

**Capabilities:**

```
1. list_repositories()
2. get_file(repo, path, branch)
3. search_code(query, language)
4. get_commit_history(repo, file)
5. get_pull_requests(repo)
6. compare_branches(repo, base, head)
7. get_documentation(repo)
8. analyze_dependencies(repo)
```

**Configuration:**

```yaml
gitea:
  url: "https://git.thedevops.dev"
  read_only: true
  allowed_repos:
    - "admin/k3s-gitops"
    - "devops/*"
  max_requests_per_minute: 100
  cache_ttl: 300
```

### 5.3 MCP Server: Docker Swarm

**Capabilities:**

```
1. list_services()
2. get_service_logs(service, tail, since)
3. describe_service(service)
4. list_stacks()
5. get_stack_services(stack)
6. analyze_service_health(service)
7. get_swarm_nodes()
```

**Security:**

```yaml
docker_swarm:
  read_only: true
  secrets_masking: true
  secret_patterns:
    - "*_PASSWORD"
    - "*_TOKEN"
    - "*_KEY"
```

### 5.4 MCP Server: Kubernetes

**Capabilities:**

```
1. get_pods(namespace, labels)
2. get_pod_logs(pod, namespace, container)
3. describe_pod(pod, namespace)
4. get_deployments(namespace)
5. get_events(namespace, since)
6. analyze_resource_usage(namespace)
```

**RBAC:**

```yaml
kubernetes:
  read_only: true
  namespaces:
    allowed: ["production", "staging"]
    denied: ["kube-system"]
  mask_secrets: true
```

### 5.5 MCP Server: Logs (Loki)

**Capabilities** (an implementation sketch for `query_logs` appears at the end of this section):

```
1. query_logs(query, start, end)
2. search_errors(service, since)
3. analyze_patterns(service, time_range)
4. get_service_logs(service, tail)
5. trace_request(request_id)
```

**Security:**

```yaml
loki:
  max_query_range: "24h"
  max_lines: 5000
  sensitive_patterns:
    - regex: '\b\d{16}\b'        # Credit cards
      replacement: "[CARD_REDACTED]"
    - regex: 'password=\S+'
      replacement: "password=[REDACTED]"
```

### 5.6 MCP Server: Documentation

**Capabilities:**

```
1. search_docs(query, category)
2. get_document(doc_id)
3. list_runbooks()
4. get_architecture_docs()
5. search_code_examples(language, topic)
```

### 5.7 MCP Server: CI/CD

**Capabilities:**

```
1. get_build_status(job)
2. get_build_logs(job, build_number)
3. list_failed_builds(since)
4. get_argocd_applications()
5. get_application_health(app)
```
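As referenced in section 5.5, here is a sketch of how the `query_logs` capability of the Loki MCP server could be implemented so that the limits and redaction rules from its configuration are enforced before any log line reaches the model. It assumes the standard Loki HTTP API (`/loki/api/v1/query_range`), a reachable internal Loki address, and the `requests` library; registering the function as an MCP tool is omitted.

```python
"""Sketch of the query_logs capability from section 5.5.

Illustration only, not the shipped MCP server: the Loki URL is an assumed
internal address, and MAX_LINES / SENSITIVE mirror the YAML config above.
"""
import re

import requests

LOKI_URL = "http://loki.company.local:3100"   # assumed internal address
MAX_LINES = 5000                              # mirrors max_lines in the config
SENSITIVE = [
    (re.compile(r"\b\d{16}\b"), "[CARD_REDACTED]"),
    (re.compile(r"password=\S+"), "password=[REDACTED]"),
]


def _redact(line: str) -> str:
    """Apply the sensitive_patterns rules before anything reaches the model."""
    for pattern, replacement in SENSITIVE:
        line = pattern.sub(replacement, line)
    return line


def query_logs(query: str, start: str, end: str, limit: int = 1000) -> list[str]:
    """Run a LogQL query over a bounded time range and return redacted lines."""
    resp = requests.get(
        f"{LOKI_URL}/loki/api/v1/query_range",
        params={"query": query, "start": start, "end": end,
                "limit": min(limit, MAX_LINES)},
        timeout=30,
    )
    resp.raise_for_status()
    lines: list[str] = []
    for stream in resp.json()["data"]["result"]:
        for _ts, line in stream["values"]:
            lines.append(_redact(line))
    return lines
```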
---

## 6. Knowledge Base (RAG)

### 6.1 RAG Architecture

**Data Sources:**

- Technical Documentation (5000+ docs)
- Code Repositories (10000+ snippets)
- Kubernetes Configs (500+ manifests)
- Incident History (1000+ postmortems)

### 6.2 Vector Database (Qdrant)

**Configuration:**

```yaml
service:
  host: "0.0.0.0"
  http_port: 6333
storage:
  storage_path: "/var/lib/qdrant/storage"
  on_disk_payload: true
log_level: "INFO"
```

**Collections:**

```python
collections = [
    "technical_docs",   # 5000+ documents
    "code_snippets",    # 10000+ samples
    "incidents",        # 1000+ postmortems
    "k8s_configs",      # 500+ manifests
    "runbooks",         # 200+ procedures
]
```

### 6.3 Embedding Service

**Model:** bge-large-en-v1.5 (1024 dimensions)

**Implementation:**

```python
from fastapi import FastAPI
from sentence_transformers import SentenceTransformer

app = FastAPI()
model = SentenceTransformer("BAAI/bge-large-en-v1.5")

@app.post("/embed")
async def create_embeddings(texts: list[str]):
    embeddings = model.encode(texts, normalize_embeddings=True)
    return {"embeddings": embeddings.tolist()}
```

---

## 7. Security

### 7.1 Network Isolation

**Firewall Rules:**

```
Inbound:
├─ 443   (HTTPS) from Corporate VPN
├─ 11434 (Ollama) from MCP Orchestrator only
└─ 6333  (Qdrant) from Ollama server only

Outbound:
├─ 3000 (Gitea API)
├─ 2377 (Docker Swarm API)
├─ 6443 (Kubernetes API)
└─ 3100 (Loki query API)

Default: DENY ALL
```

### 7.2 Authentication

```yaml
authentication:
  provider: "ldap"
  ldap:
    url: "ldaps://ldap.company.local:636"
    user_base: "ou=users,dc=company,dc=local"

authorization:
  roles:
    - name: "devops"
      permissions:
        - "query:*"
        - "mcp:*:read"
      members:
        - "cn=devops-team,ou=groups"
```

### 7.3 Secrets Masking

```python
PATTERNS = [
    (r'password:\s*"?([^"\s]+)"?', r'password: "[REDACTED]"'),
    (r'token:\s*"?([^"\s]+)"?', r'token: "[REDACTED]"'),
    (r'\b\d{16}\b', '[CARD_REDACTED]'),             # Credit cards
    (r'\b\d{3}-\d{2}-\d{4}\b', '[SSN_REDACTED]'),   # SSN
]
```

### 7.4 Audit Logging

```
# Log format:
# timestamp | user | action | details | result

2026-01-12 14:23:45 | vladimir.levinas | query   | model=qwen2.5-coder:32b | success
2026-01-12 14:23:46 | vladimir.levinas | mcp_k8s | method=get_pods         | success
```

---

## 8. Deployment

### 8.1 Installation (Ubuntu 22.04)

**Step 1: System Setup**

```bash
# Update system
apt update && apt upgrade -y

# Install NVIDIA drivers
apt install -y nvidia-driver-535

# Install Docker
curl -fsSL https://get.docker.com | sh

# Reboot
reboot
```

**Step 2: Install Ollama**

```bash
curl -fsSL https://ollama.com/install.sh | sh
systemctl enable ollama
systemctl start ollama

# Pull models
ollama pull qwen2.5-coder:32b
ollama pull deepseek-r1:32b
```

**Step 3: Deploy Infrastructure**

```bash
# Clone repo
git clone https://git.thedevops.dev/devops/ollama-infrastructure
cd ollama-infrastructure

# Configure
cp .env.example .env
# Edit .env with your settings

# Deploy
docker-compose up -d

# Initialize Vector DB
python3 scripts/init-vector-db.py

# Load initial data (see the ingestion sketch after the checklist below)
python3 scripts/load-docs.py
```

### 8.2 Production Checklist

- [ ] Hardware tested
- [ ] GPU drivers working (`nvidia-smi`)
- [ ] Ollama and models downloaded
- [ ] Docker containers running
- [ ] Vector DB initialized
- [ ] MCP services tested
- [ ] End-to-end test passed
- [ ] TLS certificates valid
- [ ] LDAP authentication working
- [ ] Rate limiting configured
- [ ] Audit logging enabled
- [ ] Backups configured
- [ ] Monitoring configured
- [ ] Team trained
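The `scripts/load-docs.py` ingestion script referenced in Step 3 lives in the infrastructure repository and is not reproduced here. The sketch below only illustrates the general shape such a script could take, using components already named in section 6: bge-large-en-v1.5 embeddings and a Qdrant collection. The docs directory, the character-based chunking, and the collection choice are assumptions.

```python
"""Illustrative ingestion sketch (not the actual scripts/load-docs.py).

Assumptions: qdrant-client and sentence-transformers are installed, Qdrant
runs at 10.30.10.30:6333, documents are Markdown files under ./docs, and
~1500-character chunks approximate the 512-token chunking from section 2.
"""
from pathlib import Path

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

COLLECTION = "technical_docs"
client = QdrantClient(url="http://10.30.10.30:6333")
model = SentenceTransformer("BAAI/bge-large-en-v1.5")


def chunk(text: str, size: int = 1500) -> list[str]:
    """Naive fixed-size character chunking as a stand-in for token-aware splitting."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def ingest(docs_dir: str = "./docs") -> None:
    # Recreate the collection with the bge-large-en-v1.5 vector size (1024)
    client.recreate_collection(
        collection_name=COLLECTION,
        vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
    )
    points, next_id = [], 0
    for path in Path(docs_dir).rglob("*.md"):
        for piece in chunk(path.read_text(encoding="utf-8")):
            vector = model.encode(piece, normalize_embeddings=True).tolist()
            points.append(PointStruct(id=next_id, vector=vector,
                                      payload={"source": str(path), "text": piece}))
            next_id += 1
    client.upsert(collection_name=COLLECTION, points=points)


if __name__ == "__main__":
    ingest()
```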
---

## 9. Monitoring

### 9.1 Key Metrics

**GPU Metrics:**

```
nvidia_gpu_temperature_celsius
nvidia_gpu_utilization_percent
nvidia_gpu_memory_used_bytes
nvidia_gpu_power_usage_watts
```

**Ollama Metrics:**

```
ollama_requests_total
ollama_request_duration_seconds
ollama_tokens_per_second
```

**MCP Metrics:**

```
mcp_requests_total{service="gitea"}
mcp_request_duration_seconds
mcp_errors_total
```

### 9.2 Grafana Dashboards

**Dashboard 1: Ollama Overview**

- GPU utilization
- Request rate
- Response time
- Active users

**Dashboard 2: MCP Services**

- Request distribution by service
- Success/error rates
- Latency percentiles

**Dashboard 3: Vector DB**

- Collection sizes
- Query performance
- Cache hit rate

---

## 10. Budget

### 10.1 Hardware Costs

| Item | Specification | Cost |
|------|--------------|------|
| **GPU** | NVIDIA RTX 4090 24GB | $1,600-2,000 |
| **CPU** | AMD Ryzen 9 7950X | $500-600 |
| **RAM** | 128GB DDR5 ECC | $600-800 |
| **Storage** | 2x 2TB NVMe + 4TB SATA | $800-1,000 |
| **Motherboard** | High-end workstation | $400-500 |
| **PSU** | 1600W Titanium | $300-400 |
| **Case/Cooling** | Enterprise grade | $300-400 |
| **Network** | 2x 10GbE NIC | $200-300 |
| **TOTAL** | | **$12,000-15,000** |

### 10.2 Software Costs

| Item | Cost |
|------|------|
| OS (Ubuntu Server) | FREE |
| Ollama | FREE |
| Qdrant | FREE (open source) |
| All MCP services | FREE (self-developed) |
| Monitoring (Prometheus/Grafana) | FREE |
| **TOTAL** | **$0** |

### 10.3 Annual Operational Costs

| Item | Cost |
|------|------|
| Electricity (~500W 24/7) | $650/year |
| Cooling | $200/year |
| Maintenance | $500/year |
| Training/Documentation | $2,000/year |
| **TOTAL Annual OpEx** | **$3,350/year** |

### 10.4 ROI Analysis

**Total Initial Investment:** $12,000-15,000

**Annual Savings:**

```
Time savings for 10 engineers:
├─ 4 hours/week saved per person
├─ 40 hours/week total
├─ 2080 hours/year total
└─ At $100/hour = $208,000/year saved

Productivity increase:
├─ 30% faster troubleshooting
├─ 50% faster documentation
└─ Estimated value: $100,000/year

Total annual benefit: ~$308,000
```

**Payback Period:** ~1-2 months
**3-Year ROI:** ~6000%

---

## Appendix A: Quick Reference

### Service URLs

```
API Gateway:  https://ai.company.local
Ollama API:   http://10.30.10.10:11434
Qdrant:       http://10.30.10.30:6333
Grafana:      https://monitoring.company.local
```

### Common Commands

```bash
# Check Ollama status
ollama list

# Run model test
ollama run qwen2.5-coder:32b "Hello"

# Check GPU
nvidia-smi

# View logs
docker-compose logs -f ollama

# Backup Vector DB
docker exec qdrant tar -czf /backup/qdrant-$(date +%Y%m%d).tar.gz /qdrant/storage
```

---

**Document Version:** 2.0
**Last Updated:** January 2026
**Status:** Production Ready

**Approvals:**

- [ ] Infrastructure Lead
- [ ] Security Lead
- [ ] DevOps Lead
- [ ] Financial Approval