# Ollama Server Requirements for FinTech DevOps with MCP Integration
**Version:** 2.0
**Date:** January 2026
**Status:** Production Ready
**Target audience:** Infrastructure Team, DevOps, Security, Management
---
## Executive Summary
### Business Case
**Problem:**
The FinTech company generates a large volume of technical information (code, logs, documentation, Kubernetes configurations) spread across many systems. Developers and DevOps engineers spend 30-40% of their time searching for information, analyzing logs, and writing documentation.
**Solution:**
A self-hosted AI assistant built on Ollama, connected to all of the company's data sources via MCP (Model Context Protocol).
**Key benefits for FinTech:**
- ✅ Data never leaves the corporate network (PCI DSS, GDPR compliance)
- ✅ No dependency on external AI providers (OpenAI, Anthropic)
- ✅ Full control over the information being processed
- ✅ Option to train on confidential data
**Expected impact:**
- 40% less time spent searching for information
- 50% faster documentation writing
- 30% shorter troubleshooting cycles
- ROI: 8-12 months
---
## Table of Contents
1. [Goals and Use Cases](#1-goals-and-use-cases)
2. [Solution Architecture](#2-solution-architecture)
3. [Server Requirements](#3-server-requirements)
4. [AI Model Selection](#4-ai-model-selection)
5. [MCP Services](#5-mcp-services)
6. [Knowledge Base (RAG)](#6-knowledge-base-rag)
7. [Security](#7-security)
8. [Deployment](#8-deployment)
9. [Monitoring](#9-monitoring)
10. [Budget](#10-budget)
---
## 1. Goals and Use Cases
### 1.1 Core Tasks
**For the DevOps team (5 people):**
1. **Kubernetes/Docker Swarm analysis**
- "Why is this pod in CrashLoopBackOff?"
- "How do I optimize resource requests?"
- "Show all pods with high memory usage"
2. **Log-driven troubleshooting**
- "Find the cause of the 500 errors in the logs from the last hour"
- "Which services are showing connection timeouts?"
- "Analyze this performance degradation"
3. **Infrastructure code generation**
- "Create a Helm chart for a microservice with PostgreSQL"
- "Write Terraform for AWS RDS with encryption"
- "Generate a docker-compose.yml"
**For developers (5 people):**
1. **Code generation and review**
- "Write unit tests for this service"
- "Optimize this SQL query"
- "Code review: find potential security issues"
2. **Working with documentation**
- "How do I use our internal payment API?"
- "Show integration examples for the fraud detection service"
### 1.2 Technical Requirements
- **Concurrent users:** up to 10 people
- **Peak concurrent requests:** 8 simultaneous
- **Data sources:**
- Gitea (100+ repositories)
- Docker Swarm (50+ services)
- Kubernetes cluster (150+ pods, if used)
- Loki logs (1 TB/month)
- Technical documentation (5000+ documents)
---
## 2. Solution Architecture
### 2.1 High-Level Architecture
```
┌─────────────────────────────────────────────────────────────┐
│ USER ACCESS LAYER │
│ │
│ ┌──────────┐ ┌───────────┐ ┌──────────┐ │
│ │ Web UI │ │ VS Code │ │ CLI Tool │ │
│ │(Gradio) │ │(Extension)│ │ (Python) │ │
│ └────┬─────┘ └─────┬─────┘ └────┬─────┘ │
└───────┼──────────────┼──────────────┼─────────────────────┘
│ │ │
└──────────────┼──────────────┘
┌──────────────────────▼─────────────────────────────────────┐
│ API GATEWAY / REVERSE PROXY │
│ (Traefik/Nginx) │
│ • TLS termination │
│ • Authentication (LDAP/OIDC) │
│ • Rate limiting (100 req/min per user) │
│ • IP: 10.30.10.5 │
└──────────────────────┬─────────────────────────────────────┘
┌──────────────────────▼─────────────────────────────────────┐
│ OLLAMA INFERENCE LAYER │
│ │
│ ┌─────────────────────────────────────┐ │
│ │ Ollama Server │ │
│ │ │ │
│ │ Models (Hot-loaded): │ │
│ │ • qwen2.5-coder:32b (Code) │ │
│ │ • deepseek-r1:32b (Reasoning) │ │
│ │ • llama3.3:70b-q4 (Universal) │ │
│ │ │ │
│ │ GPU: 1x NVIDIA RTX 4090 24GB │ │
│ │ CPU: 32 vCPU │ │
│ │ RAM: 128 GB │ │
│ │ IP: 10.30.10.10:11434 │ │
│ └─────────────────────────────────────┘ │
└──────────────────────┬─────────────────────────────────────┘
┌──────────────────────▼─────────────────────────────────────┐
│ MCP (MODEL CONTEXT PROTOCOL) LAYER │
│ │
│ ┌─────────────────────────────────────┐ │
│ │ MCP Orchestrator │ │
│ │ • Request routing │ │
│ │ • Context assembly │ │
│ │ IP: 10.30.10.20 │ │
│ └───────┬─────────────────────────────┘ │
│ │ │
│ ┌────┼────┬────────┬────────┬────────┬────────┐ │
│ │ │ │ │ │ │ │ │
│ ┌──▼─┐ ┌▼──┐ ┌▼────┐ ┌▼─────┐ ┌▼────┐ ┌▼─────┐ │
│ │Git │ │Swm│ │ K8s │ │ Logs │ │Docs │ │CI/CD │ │
│ │ea │ │arm│ │ │ │(Loki)│ │ │ │ │ │
│ └────┘ └───┘ └─────┘ └──────┘ └─────┘ └──────┘ │
└──────────────────────┬─────────────────────────────────────┘
┌──────────────────────▼─────────────────────────────────────┐
│ KNOWLEDGE BASE / RAG LAYER │
│ │
│ ┌─────────────────────────────────────┐ │
│ │ Vector Database (Qdrant) │ │
│ │ • technical-docs (5000+ docs) │ │
│ │ • code-snippets (10000+ samples) │ │
│ │ • k8s-configs (500+ manifests) │ │
│ │ • incidents (1000+ postmortems) │ │
│ │ Storage: 500 GB │ │
│ │ IP: 10.30.10.30:6333 │ │
│ └─────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────┐ │
│ │ Embedding Service │ │
│ │ • bge-large-en-v1.5 │ │
│ │ • Text chunking (512 tokens) │ │
│ │ IP: 10.30.10.31 │ │
│ └─────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
```
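**Quick connectivity check (illustrative):** the layers above expose plain HTTP endpoints, so the wiring can be smoke-tested with a short script. This sketch assumes the IP plan from the diagram and only calls documented read-only endpoints (Ollama's `/api/tags`, Qdrant's `/collections`).
```python
# Connectivity smoke test for the inference and RAG layers.
# IPs follow the diagram above; adjust for your environment.
import requests

CHECKS = {
    "Ollama (models list)": "http://10.30.10.10:11434/api/tags",
    "Qdrant (collections)": "http://10.30.10.30:6333/collections",
}

for name, url in CHECKS.items():
    try:
        requests.get(url, timeout=5).raise_for_status()
        print(f"OK   {name}")
    except requests.RequestException as exc:
        print(f"FAIL {name}: {exc}")
```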
---
## 3. Server Requirements
### 3.1 Production Configuration (Recommended)
| Component | Specification | Rationale |
|-----------|--------------|-----------|
| **GPU** | 1x NVIDIA RTX 4090 24GB VRAM | Best price/performance for 32B models |
| **GPU (alternative)** | 1x NVIDIA L40 48GB VRAM | For 70B models and larger contexts |
| **CPU** | AMD Ryzen 9 7950X (16 cores, 32 threads) | Preprocessing, embedding, parallel MCP calls |
| **RAM** | 128 GB DDR5 ECC | OS and services (~88 GB) plus model-offloading buffer (see 3.3) |
| **Storage Primary** | 2x 2TB NVMe SSD (RAID 1) | Model cache, vector DB, fast I/O |
| **Storage Secondary** | 4TB SATA SSD | Document storage, backups |
| **Network** | 2x 10 Gbps (bonded) | High throughput for MCP data retrieval |
| **PSU** | 1600W 80+ Titanium | GPU power requirements |
**Estimated Cost:** $12,000-15,000 (with RTX 4090) or $18,000-22,000 (with L40)
### 3.2 GPU Selection Guide
| Use Case | GPU | VRAM | Models Supported | Cost |
|----------|-----|------|------------------|------|
| **Code generation only** | RTX 3090 | 24 GB | qwen2.5-coder:32b | $1,000-1,500 |
| **Balanced (recommended)** | RTX 4090 | 24 GB | 32B models, 70B Q4 | $1,600-2,000 |
| **Large context (70B)** | L40 | 48 GB | llama3.3:70b | $6,000-8,000 |
| **Maximum capacity** | A100 | 80 GB | Multiple 70B models | $10,000-15,000 |
**Recommendation for FinTech:**
The RTX 4090 24GB is the best price/performance fit for a 10-user team.
### 3.3 Resource Allocation
**VRAM:**
```
Model Memory (Q4 quantization):
qwen2.5-coder:32b → 22 GB VRAM
deepseek-r1:32b → 24 GB VRAM
llama3.3:70b-q4 → 40 GB VRAM (needs L40)
```
**RAM (128 GB breakdown):**
```
16 GB → OS (Ubuntu Server)
8 GB → Ollama service
32 GB → Vector DB (Qdrant)
16 GB → MCP Services
8 GB → Embedding service
8 GB → API Gateway + misc
40 GB → Model offloading buffer
```
**Storage (2 TB NVMe):**
```
300 GB → AI Models
500 GB → Vector Database
200 GB → MCP Services cache
100 GB → OS and applications
900 GB → Free space / growth
```
---
## 4. AI Model Selection
### 4.1 Recommended Model Pool
**Primary Models:**
#### 1. qwen2.5-coder:32b - Code Specialist
```
Purpose: Code generation, review, debugging
Size: 20 GB (Q4)
VRAM: 22 GB
Context: 32k tokens
Speed: ~45 tokens/sec (RTX 4090)
Strengths:
✓ Best for infrastructure code (Terraform, K8s)
✓ Understands DevOps patterns
✓ Writes clear code comments
Use cases:
• Helm chart generation
• Bash script writing
• Code review for security issues
• Dockerfile optimization
```
#### 2. deepseek-r1:32b - Reasoning Engine
```
Purpose: Complex analysis, troubleshooting
Size: 22 GB (Q4)
VRAM: 24 GB
Context: 64k tokens
Speed: ~40 tokens/sec
Strengths:
✓ Excellent reasoning for root cause analysis
✓ Multi-step problem solving
✓ Complex systems analysis
Use cases:
• Log analysis and troubleshooting
• Architecture decision making
• Incident post-mortems
• Performance optimization
```
#### 3. llama3.3:70b-q4 - Universal Assistant
```
Purpose: Documentation, explanations
Size: 38 GB (Q4)
VRAM: 40 GB (needs L40)
Context: 128k tokens
Speed: ~25 tokens/sec
Strengths:
✓ Best for long-form documentation
✓ Excellent writing quality
✓ Multi-lingual
Use cases:
• Technical documentation
• README files
• Architecture design documents
```
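Because each model targets a different task class, clients pick the model per request. Below is a minimal sketch of task-based routing against Ollama's standard `/api/generate` endpoint; the task-to-model mapping is our own convention, not part of Ollama.
```python
# Route a prompt to the model suited for the task, then call
# Ollama's /api/generate endpoint (non-streaming).
import requests

TASK_MODELS = {                       # illustrative mapping
    "code":      "qwen2.5-coder:32b",
    "reasoning": "deepseek-r1:32b",
    "docs":      "llama3.3:70b-q4",   # tag as used in this document
}

def ask(task: str, prompt: str, host: str = "http://10.30.10.10:11434") -> str:
    resp = requests.post(
        f"{host}/api/generate",
        json={"model": TASK_MODELS[task], "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]

print(ask("code", "Write a Bash script that prunes unused Docker images."))
```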
### 4.2 Model Performance Benchmarks
**Real-world performance on an RTX 4090:**
| Task | Model | Context | Time | Quality |
|------|-------|---------|------|---------|
| **Code generation** | qwen2.5-coder:32b | 8k | 12 sec | 9/10 |
| **Log analysis** | deepseek-r1:32b | 32k | 25 sec | 9/10 |
| **Documentation** | llama3.3:70b-q4 | 64k | 90 sec* | 10/10 |
| **Quick Q&A** | qwen2.5-coder:32b | 2k | 3 sec | 8/10 |
*The 70B model runs on the RTX 4090 via CPU offloading
---
## 5. MCP Services
### 5.1 MCP Architecture
**Model Context Protocol (MCP)** is a standardized way to connect AI models to external data sources.
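For orientation, a server implementing this protocol can be very small. The sketch below assumes the reference Python SDK (`mcp` package) and its `FastMCP` helper; the single tool is a placeholder, not one of the production capabilities listed below.
```python
# Minimal MCP server sketch using the reference Python SDK.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-mcp-server")

@mcp.tool()
def echo_service_name(service: str) -> str:
    """Placeholder tool: returns the service name back to the model."""
    return f"service: {service}"

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default
```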
### 5.2 MCP Server: Gitea
**Capabilities:**
```
1. list_repositories()
2. get_file(repo, path, branch)
3. search_code(query, language)
4. get_commit_history(repo, file)
5. get_pull_requests(repo)
6. compare_branches(repo, base, head)
7. get_documentation(repo)
8. analyze_dependencies(repo)
```
**Configuration:**
```yaml
gitea:
  url: "https://git.thedevops.dev"
  read_only: true
  allowed_repos:
    - "admin/k3s-gitops"
    - "devops/*"
  max_requests_per_minute: 100
  cache_ttl: 300
```
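These capabilities map onto Gitea's REST API. A sketch of `get_file` over the documented contents endpoint (`GET /api/v1/repos/{owner}/{repo}/contents/{path}`); the `GITEA_TOKEN` environment variable is our own assumption for a read-only token.
```python
# Sketch of the get_file capability over Gitea's REST API.
import base64
import os

import requests

GITEA_URL = "https://git.thedevops.dev"
GITEA_TOKEN = os.environ["GITEA_TOKEN"]  # read-only token (assumption)

def get_file(repo: str, path: str, branch: str = "main") -> str:
    """Fetch one file's text content from a repo given as "owner/name"."""
    resp = requests.get(
        f"{GITEA_URL}/api/v1/repos/{repo}/contents/{path}",
        params={"ref": branch},
        headers={"Authorization": f"token {GITEA_TOKEN}"},
        timeout=10,
    )
    resp.raise_for_status()
    # Gitea returns file content base64-encoded in the JSON payload
    return base64.b64decode(resp.json()["content"]).decode("utf-8")

print(get_file("admin/k3s-gitops", "README.md"))
```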
### 5.3 MCP Server: Docker Swarm
**Capabilities:**
```
1. list_services()
2. get_service_logs(service, tail, since)
3. describe_service(service)
4. list_stacks()
5. get_stack_services(stack)
6. analyze_service_health(service)
7. get_swarm_nodes()
```
**Security:**
```yaml
docker_swarm:
  read_only: true
  secrets_masking: true
  secret_patterns:
    - "*_PASSWORD"
    - "*_TOKEN"
    - "*_KEY"
```
### 5.4 MCP Server: Kubernetes
**Capabilities:**
```
1. get_pods(namespace, labels)
2. get_pod_logs(pod, namespace, container)
3. describe_pod(pod, namespace)
4. get_deployments(namespace)
5. get_events(namespace, since)
6. analyze_resource_usage(namespace)
```
**RBAC:**
```yaml
kubernetes:
  read_only: true
  namespaces:
    allowed: ["production", "staging"]
    denied: ["kube-system"]
  mask_secrets: true
```
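With this RBAC in place, a capability such as `get_pods` reduces to read-only calls through the official Python client. A sketch, assuming a kubeconfig (or in-cluster service account) bound to a list/get-only role:
```python
# Sketch of the get_pods capability via the official kubernetes client.
from kubernetes import client, config

def get_pods(namespace: str, labels: str | None = None) -> list[dict]:
    config.load_kube_config()  # config.load_incluster_config() inside a pod
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace, label_selector=labels or "")
    return [{"name": p.metadata.name, "phase": p.status.phase}
            for p in pods.items]

print(get_pods("production", labels="app=payments"))
```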
### 5.5 MCP Server: Logs (Loki)
**Capabilities:**
```
1. query_logs(query, start, end)
2. search_errors(service, since)
3. analyze_patterns(service, time_range)
4. get_service_logs(service, tail)
5. trace_request(request_id)
```
**Security:**
```yaml
loki:
  max_query_range: "24h"
  max_lines: 5000
  sensitive_patterns:
    - regex: '\b\d{16}\b'   # Credit card numbers
      replacement: "[CARD_REDACTED]"
    - regex: 'password=\S+'
      replacement: "password=[REDACTED]"
```
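`query_logs` can sit directly on Loki's range-query endpoint (`/loki/api/v1/query_range`), with the masking rules above applied before anything reaches the model. A sketch; the Loki address and the inline `mask` helper are assumptions (the helper mirrors the two `sensitive_patterns` rules).
```python
# Sketch of query_logs over Loki's range-query API, masking before return.
import re
import time

import requests

LOKI_URL = "http://10.30.10.40:3100"  # hypothetical address for the Loki host

def mask(line: str) -> str:
    line = re.sub(r"\b\d{16}\b", "[CARD_REDACTED]", line)
    return re.sub(r"password=\S+", "password=[REDACTED]", line)

def query_logs(query: str, start: float, end: float, limit: int = 5000) -> list[str]:
    resp = requests.get(
        f"{LOKI_URL}/loki/api/v1/query_range",
        params={"query": query,
                "start": int(start * 1e9),  # Loki expects nanoseconds
                "end": int(end * 1e9),
                "limit": limit},
        timeout=30,
    )
    resp.raise_for_status()
    streams = resp.json()["data"]["result"]
    return [mask(line) for s in streams for _, line in s["values"]]

now = time.time()
print(query_logs('{service="payments"} |= "error"', now - 3600, now))
```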
### 5.6 MCP Server: Documentation
**Capabilities:**
```
1. search_docs(query, category)
2. get_document(doc_id)
3. list_runbooks()
4. get_architecture_docs()
5. search_code_examples(language, topic)
```
### 5.7 MCP Server: CI/CD
**Capabilities:**
```
1. get_build_status(job)
2. get_build_logs(job, build_number)
3. list_failed_builds(since)
4. get_argocd_applications()
5. get_application_health(app)
```
---
## 6. Knowledge Base (RAG)
### 6.1 RAG Architecture
**Data Sources:**
- Technical Documentation (5000+ docs)
- Code Repositories (10000+ snippets)
- Kubernetes Configs (500+ manifests)
- Incident History (1000+ postmortems)
### 6.2 Vector Database (Qdrant)
**Configuration:**
```yaml
service:
  host: "0.0.0.0"
  port: 6333
storage:
  storage_path: "/var/lib/qdrant/storage"
  on_disk_payload: true
log_level: "INFO"
```
**Collections:**
```python
collections = [
    "technical_docs",   # 5000+ documents
    "code_snippets",    # 10000+ samples
    "incidents",        # 1000+ postmortems
    "k8s_configs",      # 500+ manifests
    "runbooks",         # 200+ procedures
]
```
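The collections can be created with the official `qdrant-client`. A sketch assuming cosine distance and the 1024-dimensional vectors produced by bge-large-en-v1.5 (section 6.3):
```python
# Create the RAG collections in Qdrant (idempotent).
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(url="http://10.30.10.30:6333")

existing = {c.name for c in client.get_collections().collections}
for name in collections:  # the list defined above
    if name not in existing:
        client.create_collection(
            collection_name=name,
            # 1024 dims matches bge-large-en-v1.5; cosine is an assumption
            vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
        )
```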
### 6.3 Embedding Service
**Model:** bge-large-en-v1.5 (1024 dimensions)
**Implementation:**
```python
from fastapi import FastAPI
from sentence_transformers import SentenceTransformer

app = FastAPI()
model = SentenceTransformer("BAAI/bge-large-en-v1.5")

@app.post("/embed")
async def create_embeddings(texts: list[str]):
    # normalize_embeddings=True makes cosine similarity a plain dot product
    embeddings = model.encode(texts, normalize_embeddings=True)
    return {"embeddings": embeddings.tolist()}
```
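End to end, a RAG query embeds the question, retrieves the nearest chunks from Qdrant, and prepends them to the Ollama prompt. A compact sketch wiring the three services together; the embedding port, the payload key `text`, and the prompt template are assumptions.
```python
# Sketch of the full RAG round trip: embed -> retrieve -> generate.
import requests

EMBED_URL = "http://10.30.10.31:8000/embed"  # port 8000 is an assumption
QDRANT_URL = "http://10.30.10.30:6333"
OLLAMA_URL = "http://10.30.10.10:11434"

def rag_query(question: str, collection: str = "technical_docs") -> str:
    # 1. Embed the question with the service from 6.3
    vector = requests.post(EMBED_URL, json=[question],
                           timeout=30).json()["embeddings"][0]
    # 2. Retrieve the top-5 chunks via Qdrant's search endpoint
    hits = requests.post(
        f"{QDRANT_URL}/collections/{collection}/points/search",
        json={"vector": vector, "limit": 5, "with_payload": True},
        timeout=30,
    ).json()["result"]
    # Payload key "text" is an assumption of the ingestion scripts
    context = "\n---\n".join(h["payload"].get("text", "") for h in hits)
    # 3. Generate an answer grounded in the retrieved context
    resp = requests.post(
        f"{OLLAMA_URL}/api/generate",
        json={"model": "deepseek-r1:32b",
              "prompt": f"Context:\n{context}\n\nQuestion: {question}",
              "stream": False},
        timeout=300,
    )
    return resp.json()["response"]
```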
---
## 7. Security
### 7.1 Network Isolation
**Firewall Rules:**
```
Inbound:
├─ 443 (HTTPS) from Corporate VPN
├─ 11434 (Ollama) from MCP Orchestrator only
└─ 6333 (Qdrant) from Ollama server only
Outbound:
├─ 3000 (Gitea API)
├─ 2377 (Docker Swarm API)
├─ 6443 (Kubernetes API)
└─ 3100 (Loki query API)
Default: DENY ALL
```
### 7.2 Authentication
```yaml
authentication:
  provider: "ldap"
  ldap:
    url: "ldaps://ldap.company.local:636"
    user_base: "ou=users,dc=company,dc=local"
authorization:
  roles:
    - name: "devops"
      permissions:
        - "query:*"
        - "mcp:*:read"
      members:
        - "cn=devops-team,ou=groups"
```
### 7.3 Secrets Masking
```python
PATTERNS = [
    (r'password:\s*"?([^"\s]+)"?', r'password: "[REDACTED]"'),
    (r'token:\s*"?([^"\s]+)"?', r'token: "[REDACTED]"'),
    (r'\b\d{16}\b', '[CARD_REDACTED]'),            # Credit card numbers
    (r'\b\d{3}-\d{2}-\d{4}\b', '[SSN_REDACTED]'),  # US SSNs
]
```
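Applying the patterns is a single `re.sub` pass over any text leaving the MCP layer; a small helper sketch:
```python
# Apply the masking patterns to text before it reaches the model.
import re

def mask_secrets(text: str) -> str:
    for pattern, replacement in PATTERNS:
        text = re.sub(pattern, replacement, text)
    return text

print(mask_secrets('db: password: "hunter2" card 4111111111111111'))
# -> db: password: "[REDACTED]" card [CARD_REDACTED]
```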
### 7.4 Audit Logging
```
# Log format: timestamp | user | action | details | result
2026-01-12 14:23:45 | vladimir.levinas | query   | model=qwen2.5-coder:32b | success
2026-01-12 14:23:46 | vladimir.levinas | mcp_k8s | method=get_pods         | success
```
---
## 8. Deployment
### 8.1 Installation (Ubuntu 22.04)
**Step 1: System Setup**
```bash
# Update system
apt update && apt upgrade -y
# Install NVIDIA drivers
apt install -y nvidia-driver-535
# Install Docker
curl -fsSL https://get.docker.com | sh
# Reboot
reboot
```
**Step 2: Install Ollama**
```bash
curl -fsSL https://ollama.com/install.sh | sh
systemctl enable ollama
systemctl start ollama
# Pull models
ollama pull qwen2.5-coder:32b
ollama pull deepseek-r1:32b
```
**Step 3: Deploy Infrastructure**
```bash
# Clone repo
git clone https://git.thedevops.dev/devops/ollama-infrastructure
cd ollama-infrastructure
# Configure
cp .env.example .env
# Edit .env with your settings
# Deploy
docker-compose up -d
# Initialize Vector DB
python3 scripts/init-vector-db.py
# Load initial data
python3 scripts/load-docs.py
```
### 8.2 Production Checklist
- [ ] Hardware протестирован
- [ ] GPU drivers работают (`nvidia-smi`)
- [ ] Ollama и модели загружены
- [ ] Docker containers запущены
- [ ] Vector DB инициализирован
- [ ] MCP services тестированы
- [ ] End-to-end тест пройден
- [ ] TLS сертификаты валидны
- [ ] LDAP authentication работает
- [ ] Rate limiting настроен
- [ ] Audit logging включен
- [ ] Backup настроен
- [ ] Monitoring настроен
- [ ] Team обучена
---
## 9. Monitoring
### 9.1 Key Metrics
**GPU Metrics:**
```
nvidia_gpu_temperature_celsius
nvidia_gpu_utilization_percent
nvidia_gpu_memory_used_bytes
nvidia_gpu_power_usage_watts
```
**Ollama Metrics:**
```
ollama_requests_total
ollama_request_duration_seconds
ollama_tokens_per_second
```
**MCP Metrics:**
```
mcp_requests_total{service="gitea"}
mcp_request_duration_seconds
mcp_errors_total
```
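MCP services written in Python can export these series with the standard `prometheus_client` library. A sketch; the metric and label names follow the list above, while the wrapper function and port are assumptions:
```python
# Sketch: exporting the MCP metrics above with prometheus_client.
import time

from prometheus_client import Counter, Histogram, start_http_server

MCP_REQUESTS = Counter("mcp_requests_total", "MCP requests", ["service"])
MCP_LATENCY = Histogram("mcp_request_duration_seconds",
                        "MCP request latency", ["service"])
MCP_ERRORS = Counter("mcp_errors_total", "MCP request errors", ["service"])

def handle(service: str, fn, *args):
    """Wrap one MCP call with request, error, and latency metrics."""
    MCP_REQUESTS.labels(service=service).inc()
    start = time.monotonic()
    try:
        return fn(*args)
    except Exception:
        MCP_ERRORS.labels(service=service).inc()
        raise
    finally:
        MCP_LATENCY.labels(service=service).observe(time.monotonic() - start)

start_http_server(9100)  # /metrics endpoint; port is an assumption
```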
### 9.2 Grafana Dashboards
**Dashboard 1: Ollama Overview**
- GPU utilization
- Request rate
- Response time
- Active users
**Dashboard 2: MCP Services**
- Request distribution by service
- Success/error rates
- Latency percentiles
**Dashboard 3: Vector DB**
- Collection sizes
- Query performance
- Cache hit rate
---
## 10. Budget
### 10.1 Hardware Costs
| Item | Specification | Cost |
|------|--------------|------|
| **GPU** | NVIDIA RTX 4090 24GB | $1,600-2,000 |
| **CPU** | AMD Ryzen 9 7950X | $500-600 |
| **RAM** | 128GB DDR5 ECC | $600-800 |
| **Storage** | 2x 2TB NVMe + 4TB SATA | $800-1,000 |
| **Motherboard** | High-end workstation | $400-500 |
| **PSU** | 1600W Titanium | $300-400 |
| **Case/Cooling** | Enterprise grade | $300-400 |
| **Network** | 2x 10GbE NIC | $200-300 |
| **TOTAL** | | **$12,000-15,000** |
### 10.2 Software Costs
| Item | Cost |
|------|------|
| OS (Ubuntu Server) | FREE |
| Ollama | FREE |
| Qdrant | FREE (open source) |
| All MCP services | FREE (self-developed) |
| Monitoring (Prometheus/Grafana) | FREE |
| **TOTAL** | **$0** |
### 10.3 Annual Operational Costs
| Item | Cost |
|------|------|
| Electricity (~500W 24/7) | $650/year |
| Cooling | $200/year |
| Maintenance | $500/year |
| Training/Documentation | $2,000/year |
| **TOTAL Annual OpEx** | **$3,350/year** |
### 10.4 ROI Analysis
**Total Initial Investment:** $12,000-15,000
**Annual Savings:**
```
Time savings for 10 engineers:
├─ 4 hours/week saved per person
├─ 40 hours/week total
├─ 2080 hours/year total
└─ At $100/hour = $208,000/year saved
Productivity increase:
├─ 30% faster troubleshooting
├─ 50% faster documentation
└─ Estimated value: $100,000/year
Total annual benefit: ~$308,000
```
**Payback Period:** ~1-2 months
**3-Year ROI:** 6000%
---
## Appendix A: Quick Reference
### Service URLs
```
API Gateway: https://ai.company.local
Ollama API: http://10.30.10.10:11434
Qdrant: http://10.30.10.30:6333
Grafana: https://monitoring.company.local
```
### Common Commands
```bash
# Check Ollama status
ollama list
# Run model test
ollama run qwen2.5-coder:32b "Hello"
# Check GPU
nvidia-smi
# View logs
docker-compose logs -f ollama
# Backup Vector DB
docker exec qdrant tar -czf /backup/qdrant-$(date +%Y%m%d).tar.gz /qdrant/storage
```
---
**Document Version:** 2.0
**Last Updated:** January 2026
**Status:** Production Ready
**Approvals:**
- [ ] Infrastructure Lead
- [ ] Security Lead
- [ ] DevOps Lead
- [ ] Financial Approval