Update docs/gitops-cicd/11-ollama-comprehensive-enterprise-guide.md

2026-01-13 07:52:03 +00:00
parent 1e11f0bdf1
commit 05e8b1bedb


@@ -182,7 +182,7 @@ Self-hosted AI infrastructure based on Ollama with integr
### Level 1: User Access Layer
**Web interface** built on Open WebUI provides convenient browser-based access without installing any additional software. This is the primary way most users interact with the system.
**VS Code Extension** integrates the AI assistant directly into the development workflow. Developers can ask questions about code, generate tests, and get explanations without leaving the IDE.
@@ -237,7 +237,7 @@ Embedding Service uses the bge-large-en-v1.5 model for c
| **Network** | 2x 10 Gbps (bonded) | High throughput for MCP data retrieval |
| **PSU** | 1600W 80+ Titanium | GPU power requirements |
**Estimated cost:** $12,000-15,000
### GPU selection by usage scenario
@@ -261,29 +261,7 @@ Embedding Service uses the bge-large-en-v1.5 model for c
*with partial offloading to RAM
### System memory allocation (128 GB)
```
16 GB → Ubuntu Server operating system
8 GB  → Ollama service
32 GB → Qdrant vector database
16 GB → MCP services
8 GB  → Embedding service
8 GB  → API Gateway + monitoring
40 GB → Model offloading buffer
```
### Storage allocation (2 TB NVMe)
```
300 GB → AI models
500 GB → Vector database
200 GB → MCP services cache
100 GB → OS and applications
900 GB → Reserve for growth
```
---
## AI model selection and optimization
@@ -604,33 +582,6 @@ An effective AI assistant builds each interact
**Relevance-based selection** - instead of selecting messages by recency, the relevance of each message to the current query is scored via embedding similarity.
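
A minimal sketch of this selection step, assuming an `embed()` helper that calls the deployment's embedding service (the endpoint, request shape, and response field here are placeholders):

```python
import numpy as np
import requests

EMBED_URL = "http://localhost:8080/embed"  # hypothetical embedding-service endpoint

def embed(text: str) -> np.ndarray:
    """Return an embedding for `text` via the embedding service (assumed request/response shape)."""
    resp = requests.post(EMBED_URL, json={"text": text}, timeout=10)
    resp.raise_for_status()
    return np.array(resp.json()["embedding"], dtype=np.float32)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_relevant(history: list[str], query: str, k: int = 10) -> list[str]:
    """Keep the k history messages most similar to the current query."""
    q_vec = embed(query)
    scored = sorted(((cosine(embed(m), q_vec), m) for m in history), reverse=True)
    return [m for _, m in scored[:k]]
```

In practice the history embeddings would be cached alongside the messages rather than recomputed on every query.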
### Persistent storage
PostgreSQL stores conversation data:
- **sessions** table: ID, user_id, created_at, updated_at, title, status
- **messages** table: session_id, role, content, created_at, model_used, token_count
- JSONB columns for semi-structured metadata
**Indexes:**
- (user_id, updated_at) for listing recent sessions
- (session_id, created_at) for retrieving session history
**Partitioning:** Monthly partitions keep performance stable as the data grows.
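
A minimal schema sketch matching the layout above; the column types, DSN, and partitioning clause are assumptions, and the monthly partitions themselves still have to be created (for example by a maintenance job):

```python
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS sessions (
    id         UUID PRIMARY KEY,
    user_id    TEXT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    title      TEXT,
    status     TEXT,
    metadata   JSONB
);
CREATE INDEX IF NOT EXISTS idx_sessions_user_recent ON sessions (user_id, updated_at);

-- messages is range-partitioned by month on created_at, as described above;
-- concrete partitions (e.g. messages_2026_01) are created by a maintenance job.
CREATE TABLE IF NOT EXISTS messages (
    session_id  UUID NOT NULL REFERENCES sessions (id),
    role        TEXT NOT NULL,
    content     TEXT NOT NULL,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    model_used  TEXT,
    token_count INTEGER,
    metadata    JSONB
) PARTITION BY RANGE (created_at);
CREATE INDEX IF NOT EXISTS idx_messages_session_time ON messages (session_id, created_at);
"""

def init_schema(dsn: str = "postgresql://assistant@localhost/assistant") -> None:
    """Apply the conversation-store DDL (the DSN is a placeholder)."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(DDL)
```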
### Privacy and retention
**Encryption:**
- At rest: database or filesystem-level encryption
- In transit: TLS for all communication
**Access controls:**
- Users see only their own conversations
- RBAC for managers, with an audit trail
**Retention policies:**
- Automated cleanup according to policy
- User right to deletion
- Anonymization for analytics
### Search and navigation
@@ -652,678 +603,7 @@ PostgreSQL stores conversation data:
**Sharing links** - read-only URLs with an expiration time and access controls.
### Analytics
**Usage metrics:**
- Active users per day
- Number of sessions
- Average messages per session
- Peak usage times
**Query patterns:**
- Common question types
- Frequently discussed topics
- Typical workflows
**User satisfaction:**
- Explicit ratings
- Implicit signals (conversation length, corrections)
### Session management table
| Parameter | Value | Rationale |
|----------|----------|-------------|
| Max messages in window | 40 | Balance of context vs. performance |
| Summarization trigger | 30 messages | Before the window is exhausted |
| Compression ratio | 5:1 | 5 messages → 1 summary |
| Max session idle time | 24 hours | Auto-close inactive sessions |
| Max concurrent sessions | 10/user | Abuse prevention |
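
A minimal sketch of the window-management logic these parameters imply; `summarize()` stands in for a call to the local model, and all names are illustrative:

```python
MAX_WINDOW = 40        # max messages kept in the context window
SUMMARIZE_AT = 30      # trigger summarization before the window is exhausted
GROUP_SIZE = 5         # compression ratio 5:1 -> summarize in groups of five

def summarize(messages: list[str]) -> str:
    """Placeholder: ask the local model to compress a group of messages into one summary."""
    raise NotImplementedError

def maintain_window(history: list[str]) -> list[str]:
    """Compress the oldest messages once the summarization threshold is reached."""
    if len(history) < SUMMARIZE_AT:
        return history
    oldest, recent = history[:GROUP_SIZE], history[GROUP_SIZE:]
    compressed = "Summary of earlier discussion: " + summarize(oldest)
    return ([compressed] + recent)[-MAX_WINDOW:]
```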
### Retention policy table
| Data type | Retention | Action | Access |
|------------|-----------|----------|--------|
| Active sessions | Indefinite | N/A | User only |
| Inactive (<30d) | Indefinite | N/A | User only |
| Old (30-90d) | Summarized | Messages → summary | User only |
| Very old (>90d) | Archived | Cold storage | Read-only |
| Marked deletion | 30d grace | Permanent delete | User during grace |
---
## Data storage strategy
### Tiered architecture
An effective AI infrastructure requires a sophisticated approach to storing different types of data, each with its own characteristics and requirements.
### Hot Storage: NVMe SSD RAID
**Primary tier** provides high performance for frequently accessed data.
**Contents:**
- AI models (300 GB) - fast loading is critical for UX
- Vector DB indices (200 GB) - intensive I/O on every query
- Recent conversations (100 GB) - frequent access
**Characteristics:**
- NVMe interface: several GB/s of throughput
- Latency: <100 microseconds
- RAID 1: fault tolerance without downtime
### Warm Storage: SATA SSD
**Secondary tier** offers more capacity at a lower price.
**Contents:**
- Vector DB payload (300 GB)
- Source documents (200 GB)
- Older conversations (200 GB)
- Daily backups (1 TB)
**Characteristics:**
- SATA interface: sufficient speed
- Cost-effective for large volumes
- Acceptable latency for less frequent access
### Cold Storage: Object Storage
**Tertiary tier** for archival data and compliance.
**Contents:**
- Very old sessions (500 GB)
- Weekly backups (500 GB)
- Long-term analytics (variable)
**Characteristics:**
- S3-compatible storage
- Dramatically lower cost
- Retrieval latency measured in seconds
### Lifecycle Management
**Automated policies:**
- Hot → Warm after one month of inactivity
- Warm → Cold after three months
- Deletion according to retention policy
- Compression of older data
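
A minimal sketch of such a lifecycle job, assuming each session record carries a `last_active` timestamp and a `tier` field; the actual data movement between tiers is left as a placeholder:

```python
from datetime import datetime, timedelta, timezone

HOT_TO_WARM = timedelta(days=30)   # Hot -> Warm after a month of inactivity
WARM_TO_COLD = timedelta(days=90)  # Warm -> Cold after three months

def target_tier(last_active: datetime, now: datetime | None = None) -> str:
    """Map a session's last activity time to its target storage tier."""
    idle = (now or datetime.now(timezone.utc)) - last_active
    if idle >= WARM_TO_COLD:
        return "cold"
    if idle >= HOT_TO_WARM:
        return "warm"
    return "hot"

def move_to_tier(session_id: str, tier: str) -> None:
    """Placeholder: copy the data to the target tier and update its location metadata."""
    raise NotImplementedError

def run_lifecycle(sessions: list[dict]) -> None:
    """Nightly job: move every session whose tier no longer matches the policy."""
    for s in sessions:
        tier = target_tier(s["last_active"])
        if s["tier"] != tier:
            move_to_tier(s["id"], tier)
```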
### Backup Strategy
**Continuous WAL archiving** in PostgreSQL for point-in-time recovery.
**Daily full backups:**
- Qdrant snapshots
- PostgreSQL dumps
- Stored on the warm and cold tiers
**Weekly full backups:**
- AI models (rarely change)
- Configuration
- Stored on the cold tier
**Testing:** Automated restore tests in a test environment.
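
A minimal sketch of the daily backup job using `pg_dump` and Qdrant's collection snapshot endpoint; the DSN, backup directory, and collection name are placeholders:

```python
import subprocess
from datetime import date

import requests

PG_DSN = "postgresql://assistant@localhost/assistant"  # placeholder connection string
QDRANT_URL = "http://localhost:6333"                   # Qdrant HTTP API
BACKUP_DIR = "/backups/daily"                          # placeholder warm-tier mount point

def backup_postgres() -> None:
    """Dump the conversation database in pg_dump's compressed custom format."""
    out = f"{BACKUP_DIR}/postgres-{date.today()}.dump"
    subprocess.run(["pg_dump", "--format=custom", f"--file={out}", PG_DSN], check=True)

def backup_qdrant(collection: str = "knowledge_base") -> None:
    """Trigger a server-side snapshot of one Qdrant collection via its HTTP API."""
    resp = requests.post(f"{QDRANT_URL}/collections/{collection}/snapshots", timeout=600)
    resp.raise_for_status()

if __name__ == "__main__":
    backup_postgres()
    backup_qdrant()
```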
### Storage tier allocation table
| Data | Volume | Tier | Access pattern | Latency | Retention |
|--------|--------|------|----------------|---------|-----------|
| AI models | 300 GB | Hot | On load | <1s | Indefinite |
| Vector indices | 200 GB | Hot | On query | <100ms | Indefinite |
| Vector payload | 300 GB | Warm | On retrieval | <500ms | Indefinite |
| Recent sessions | 100 GB | Hot | Very frequent | <50ms | Indefinite |
| Old sessions | 200 GB | Warm | Occasional | <1s | Until deletion |
| Archived | 500 GB | Cold | Rare | <10s | Until deletion |
| Source docs | 200 GB | Warm | On reindex | <2s | Indefinite |
### Backup strategy table
| Type | Frequency | Retention | Location | RTO | RPO |
|-----|-----------|-----------|----------|-----|-----|
| PostgreSQL WAL | Continuous | 7d | Object | 1h | 5min |
| PostgreSQL full | Daily | 30d | Warm+Cold | 2h | 24h |
| Qdrant snapshot | Daily | 30d | Warm | 3h | 24h |
| Qdrant snapshot | Weekly | 90d | Cold | 6h | 7d |
| AI models | Weekly | Indefinite | Cold | 1h | 7d |
| Configuration | On change | Indefinite | Git | 30min | Last commit |
---
## Security and compliance
### Network Isolation
**Firewall rules** implement least privilege:
**Inbound:**
- 443 (HTTPS) from the corporate VPN
- 11434 (Ollama) only from the MCP Orchestrator
- 6333 (Qdrant) only from the Ollama server
**Outbound:**
- 3000 (Gitea API)
- 2377 (Docker Swarm API)
- 6443 (Kubernetes API)
- 3100 (Loki API)
- Default: DENY ALL
**IDS/IPS** monitors traffic for suspicious patterns using ML-based anomaly detection.
### Authentication and authorization
**LDAP integration** for enterprises:
- Authentication with corporate credentials
- Group membership determines access levels
- Centralized password management
**OIDC** for modern cloud-native auth:
- Integration with Okta, Auth0, Azure AD
- SSO capabilities
- MFA support
**RBAC (Role-Based Access Control):**
- **devops role**: query:*, mcp:*:read
- **developer role**: query:code, mcp:gitea:read
- **viewer role**: query:docs
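
A minimal sketch of how these role-to-permission grants could be evaluated, treating `*` as a wildcard; the permission strings come from the list above, while the matching logic itself is illustrative:

```python
from fnmatch import fnmatch

# Grants per role, as listed above ('*' acts as a wildcard).
ROLE_PERMISSIONS = {
    "devops": ["query:*", "mcp:*:read"],
    "developer": ["query:code", "mcp:gitea:read"],
    "viewer": ["query:docs"],
}

def is_allowed(roles: list[str], permission: str) -> bool:
    """Return True if any of the user's roles grants the requested permission."""
    return any(
        fnmatch(permission, pattern)
        for role in roles
        for pattern in ROLE_PERMISSIONS.get(role, [])
    )

# Example: a developer may read Gitea through MCP, but not the Kubernetes MCP server.
assert is_allowed(["developer"], "mcp:gitea:read")
assert not is_allowed(["developer"], "mcp:kubernetes:read")
```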
### Secrets Masking
**Automated patterns:**
```
password:\s*"?([^"\s]+)"? → password: "[REDACTED]"
token:\s*"?([^"\s]+)"? → token: "[REDACTED]"
\b\d{16}\b → [CARD_REDACTED]
\b\d{3}-\d{2}-\d{4}\b → [SSN_REDACTED]
```
**Applied to:**
- MCP server responses
- System logs
- Conversation histories
- Export files
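
A minimal sketch of applying the patterns above with Python's `re` module; a real deployment would likely need a longer, carefully ordered rule list:

```python
import re

# (pattern, replacement) pairs mirroring the masking rules above
MASKING_RULES = [
    (r'password:\s*"?([^"\s]+)"?', 'password: "[REDACTED]"'),
    (r'token:\s*"?([^"\s]+)"?', 'token: "[REDACTED]"'),
    (r'\b\d{16}\b', '[CARD_REDACTED]'),
    (r'\b\d{3}-\d{2}-\d{4}\b', '[SSN_REDACTED]'),
]

def mask_secrets(text: str) -> str:
    """Apply every masking rule before the text is returned, logged, stored, or exported."""
    for pattern, replacement in MASKING_RULES:
        text = re.sub(pattern, replacement, text)
    return text

print(mask_secrets('config: password: "s3cr3t", card 4111111111111111'))
# -> config: password: "[REDACTED]", card [CARD_REDACTED]
```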
### Audit Logging
**All operations are logged:**
```
Timestamp | User | Action | Details | Result
2026-01-12 14:23:45 | user@company.com | query | model=qwen2.5-coder | success
2026-01-12 14:23:46 | user@company.com | mcp_k8s | get_pods | success
```
**Retention:** 1 year for compliance.
**Analysis:** Regular review for suspicious patterns.
### Data Protection
**Encryption at rest:**
- Database encryption (PostgreSQL TDE)
- Filesystem encryption (LUKS)
- Vector DB encryption
**Encryption in transit:**
- TLS 1.3 for all connections
- Certificate management via Let's Encrypt or an internal CA
**DLP (Data Loss Prevention):**
- Content inspection on egress
- Block transmission of sensitive patterns
- Alert on suspicious exports
### Compliance
**PCI DSS:** Data never leaves the secured network.
**GDPR:**
- Right to deletion implemented
- Data minimization principles
- Consent management
- Data portability через exports
**SOC 2:**
- Comprehensive audit trails
- Access controls documented
- Regular security reviews
- Incident response procedures
### Security Monitoring
**Metrics tracked:**
- Failed authentication attempts
- Unusual access patterns
- MCP server errors
- Rate limit hits
- Secrets exposure attempts
**Alerting:**
- Slack integration for the security team
- PagerDuty for critical alerts
- Email for regular notifications
### Security controls table
| Control | Type | Layer | Monitoring |
|----------|-----|---------|------------|
| Network firewall | Preventive | Infrastructure | 24/7 |
| TLS encryption | Preventive | Transport | Certificate monitoring |
| LDAP auth | Preventive | Application | Login success rate |
| RBAC | Preventive | Application | Access patterns |
| Secrets masking | Preventive | Application | Exposure attempts |
| Audit logging | Detective | All layers | Log analysis |
| IDS/IPS | Detective/Preventive | Network | Alert monitoring |
| Backup encryption | Preventive | Storage | Backup verification |
---
## Monitoring and observability
### Key Metrics
**GPU Metrics:**
- nvidia_gpu_temperature_celsius
- nvidia_gpu_utilization_percent
- nvidia_gpu_memory_used_bytes
- nvidia_gpu_power_usage_watts
**Ollama Metrics:**
- ollama_requests_total
- ollama_request_duration_seconds
- ollama_tokens_per_second
- ollama_active_models
**MCP Metrics:**
- mcp_requests_total{service="gitea"}
- mcp_request_duration_seconds
- mcp_errors_total
- mcp_cache_hit_ratio
**RAG Metrics:**
- qdrant_collection_size
- qdrant_query_duration_seconds
- embedding_generation_duration
- reranking_duration
**Storage Metrics:**
- disk_usage_percent{tier="hot"}
- disk_iops{tier="hot"}
- disk_throughput_bytes
- backup_last_success_timestamp
### Grafana Dashboards
**Dashboard 1: Ollama Overview**
- GPU utilization timeline
- Request rate by model
- Response time percentiles (p50, p95, p99)
- Active users count
- Token generation rate
**Dashboard 2: MCP Services**
- Request distribution pie chart
- Success/error rates by service
- Latency heatmap
- Cache hit rates
- Top users by requests
**Dashboard 3: Vector DB**
- Collection sizes growth
- Query performance trends
- Cache effectiveness
- Index rebuild status
**Dashboard 4: User Experience**
- Average response time
- User satisfaction ratings
- Session duration distribution
- Popular query types
- Error rate by type
**Dashboard 5: Infrastructure Health**
- CPU/RAM utilization
- Disk I/O patterns
- Network throughput
- Temperature monitoring
- Power consumption
### Alerting Strategy
**Critical Alerts (PagerDuty):**
- Ollama service down
- GPU temperature >85°C
- Disk usage >90%
- Authentication system unavailable
- Backup failed
**Warning Alerts (Slack):**
- High error rate (>5%)
- Slow response times (p95 >10s)
- GPU utilization consistently >95%
- MCP service degraded
- Cache miss rate >50%
**Info Alerts (Email):**
- Scheduled maintenance reminders
- Usage statistics weekly digest
- Capacity planning recommendations
### Logging Strategy
**Structured logging** in JSON format for all components:
```json
{
"timestamp": "2026-01-12T14:23:45Z",
"level": "INFO",
"service": "ollama",
"message": "Model loaded",
"model": "qwen2.5-coder:32b",
"load_time_ms": 2341
}
```
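
One way to emit records in this shape from the Python-based components is a small `logging` formatter; a minimal sketch (the service name and extra fields follow the example above):

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render log records as single-line JSON in the shape shown above."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc)
                                 .strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        entry.update(getattr(record, "extra_fields", {}))  # e.g. model, load_time_ms
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service="ollama"))
logger = logging.getLogger("ollama")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Model loaded",
            extra={"extra_fields": {"model": "qwen2.5-coder:32b", "load_time_ms": 2341}})
```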
**Log aggregation** via Loki:
- Central collection
- Retention: 30 days hot, 90 days warm
- Full-text search capability
- Correlation with metrics
**Log levels:**
- ERROR: Failures requiring attention
- WARN: Degraded performance
- INFO: Normal operations
- DEBUG: Detailed troubleshooting (disabled in production)
### Distributed Tracing
OpenTelemetry for end-to-end request tracing:
- User request → API Gateway
- Gateway → Ollama
- Ollama → MCP services
- MCP → Backend systems
- RAG → Vector DB
Jaeger UI for visualizing traces and identifying bottlenecks.
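
A minimal sketch of instrumenting one hop of that chain with the OpenTelemetry Python SDK, exporting spans over OTLP to a local collector or Jaeger; the endpoint, service name, and span names are placeholders:

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Ship spans to an OTLP endpoint (recent Jaeger versions accept OTLP directly).
provider = TracerProvider(resource=Resource.create({"service.name": "api-gateway"}))
exporter = OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

def call_ollama(prompt: str) -> str:
    raise NotImplementedError  # placeholder for the actual HTTP call to Ollama

def handle_user_request(prompt: str) -> str:
    """One traced hop (gateway -> Ollama); nested spans model the downstream calls."""
    with tracer.start_as_current_span("gateway.handle_request") as span:
        span.set_attribute("prompt.length", len(prompt))
        with tracer.start_as_current_span("ollama.generate"):
            return call_ollama(prompt)
```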
### Health Checks
**Liveness probes:**
- Ollama /health endpoint
- Qdrant readiness
- PostgreSQL connectivity
- MCP services status
**Readiness probes:**
- Models loaded
- Indices ready
- Database connections available
**Frequency:** Every 30 seconds.
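
A minimal probe loop along these lines, assuming each service exposes the HTTP endpoints named above; the exact paths and ports should be checked against the deployed versions:

```python
import time

import requests

# (name, URL) pairs; the paths are assumptions to verify against the running services
PROBES = [
    ("ollama", "http://localhost:11434/"),          # Ollama answers on its root path
    ("qdrant", "http://localhost:6333/readyz"),     # Qdrant readiness endpoint
    ("mcp-gitea", "http://localhost:8101/health"),  # hypothetical MCP service port
]

def is_healthy(url: str) -> bool:
    try:
        return requests.get(url, timeout=5).status_code == 200
    except requests.RequestException:
        return False

if __name__ == "__main__":
    while True:
        for name, url in PROBES:
            print(f"{name}: {'ok' if is_healthy(url) else 'FAIL'}")
        time.sleep(30)  # probe interval from the section above
```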
### Capacity Planning
**Trend analysis:**
- Usage growth rate
- Storage consumption trends
- Peak load patterns
- Resource saturation points
**Forecasting:**
- When additional GPU needed
- Storage expansion timeline
- Network bandwidth requirements
- Team growth accommodation
### Monitoring table
| Component | Metric | Warning threshold | Critical threshold | Action |
|-----------|---------|-------------------|-------------------|--------|
| GPU | Temperature | >75°C | >85°C | Check cooling |
| GPU | Utilization | >85% | >95% | Consider scaling |
| GPU | Memory | >20GB | >23GB | Model optimization |
| Storage | Disk usage | >75% | >90% | Cleanup/expansion |
| Storage | IOPS | >80% max | >95% max | Storage upgrade |
| API | Error rate | >2% | >5% | Investigate logs |
| API | Latency p95 | >5s | >10s | Performance tuning |
| RAG | Query time | >1s | >2s | Index optimization |
---
## Economic justification
### Capital expenditure (CapEx)
| Component | Cost |
|-----------|-----------|
| GPU (RTX 4090 24GB) | $1,600-2,000 |
| CPU (Ryzen 9 7950X) | $500-600 |
| RAM (128GB DDR5 ECC) | $600-800 |
| Storage (NVMe + SATA) | $800-1,000 |
| Motherboard (High-end) | $400-500 |
| PSU (1600W Titanium) | $300-400 |
| Case/Cooling | $300-400 |
| Network (2x 10GbE) | $200-300 |
| **TOTAL CapEx** | **$12,000-15,000** |
### Annual operating expenditure (OpEx)
| Item | Cost |
|--------|-----------|
| Electricity (~500 W 24/7) | $650/year |
| Cooling | $200/year |
| Maintenance | $500/year |
| Training/documentation | $2,000/year |
| **TOTAL OpEx** | **$3,350/year** |
### Software (free)
All software components are open source:
- Ubuntu Server: FREE
- Ollama: FREE
- Qdrant: FREE
- PostgreSQL: FREE
- All MCP services: FREE (self-developed)
- Prometheus/Grafana: FREE
### ROI Analysis
**Time savings for a team of 10 engineers:**
| Activity | Saved | Hours/year | Value ($100/hour) |
|------------|-------------|-----------|---------------------|
| Information search | 40% | 832 hours | $83,200 |
| Writing documentation | 50% | 520 hours | $52,000 |
| Troubleshooting | 30% | 624 hours | $62,400 |
| Code review | 20% | 208 hours | $20,800 |
| **TOTAL** | | **2,184 hours** | **$218,400/year** |
**ROI calculation:**
```
Total Investment: $15,000 (CapEx) + $3,350 (OpEx year 1) = $18,350
Annual Benefit: $218,400
Payback Period: 18,350 / 218,400 = 0.08 years ≈ 1 month
3-Year ROI: (3 × $218,400 - $18,350 - 2 × $3,350) / $18,350 ≈ 3,434%
```
### Comparison with cloud AI APIs
**OpenAI GPT-4 pricing:**
- Prompt: $0.03 per 1K tokens
- Completion: $0.06 per 1K tokens
**Typical query:**
- 2K tokens prompt (context + question)
- 1K tokens completion
- Cost per query: $0.12
**Monthly cost for 10 users:**
- 50 queries/day per user = 500 queries/day
- 500 × 30 days = 15,000 queries/month
- 15,000 × $0.12 = $1,800/month = $21,600/year
**Self-hosted advantages:**
- Lower cost after year 1
- Complete data control
- No API rate limits
- Customizable models
- No vendor lock-in
### 3-year TCO (Total Cost of Ownership) table
| Year | CapEx | OpEx | Total annual | Cumulative | Cloud alternative |
|-----|-------|------|--------------|------------|-------------------|
| 1 | $15,000 | $3,350 | $18,350 | $18,350 | $21,600 |
| 2 | $0 | $3,350 | $3,350 | $21,700 | $43,200 |
| 3 | $0 | $3,350 | $3,350 | $25,050 | $64,800 |
| **Savings** | | | | | **$39,750** |
---
## Deployment Roadmap
### Phase 1: Foundation (Weeks 1-2)
**Infrastructure setup:**
- Server assembly and OS installation
- Network configuration
- GPU driver installation
- Docker setup
**Deliverables:**
- Working server с GPU functional
- Network connectivity verified
- Monitoring baseline established
### Phase 2: Core Services (Weeks 3-4)
**AI infrastructure:**
- Ollama installation
- Model download and testing
- Basic API Gateway setup
**Deliverables:**
- Models responding to queries
- Simple web interface functional
- Performance benchmarks completed
### Phase 3: MCP Integration (Weeks 5-6)
**MCP services deployment:**
- Gitea MCP server
- Docker Swarm MCP server
- Kubernetes MCP server (if applicable)
**Deliverables:**
- Models accessing corporate systems
- Read-only access verified
- Security controls tested
### Phase 4: RAG Implementation (Weeks 7-8)
**Knowledge base setup:**
- Qdrant deployment
- Embedding service
- Initial document indexing
**Deliverables:**
- Vector DB operational
- Initial corpus indexed
- Search quality validated
### Phase 5: Production Readiness (Weeks 9-10)
**Finalization:**
- Authentication integration
- Monitoring dashboards
- Backup automation
- Documentation
**Deliverables:**
- Production-ready system
- Team training completed
- Operational runbooks
- Go-live approval
### Phase 6: Rollout (Weeks 11-12)
**Gradual adoption:**
- Pilot group (2-3 users)
- Feedback collection
- Issue resolution
- Full team rollout
---
## Operational Excellence
### Daily Operations
**Health checks:**
- Morning dashboard review
- Check overnight alerts
- Verify backup success
- Monitor disk usage
**User support:**
- Answer questions in Slack
- Collect feedback
- Document common issues
### Weekly Tasks
**Performance review:**
- Analyze usage trends
- Review slow queries
- Check error patterns
- Optimize as needed
**Content updates:**
- Reindex modified documents
- Update code snippets
- Refresh runbooks
**Capacity planning:**
- Review storage trends
- Analyze GPU utilization
- Forecast growth
### Monthly Tasks
**Security review:**
- Audit logs analysis
- Access patterns review
- Update firewall rules
- Vulnerability scanning
**System maintenance:**
- OS updates
- Driver updates
- Dependency updates
- Performance tuning
**Reporting:**
- Usage statistics
- ROI tracking
- User satisfaction
- Improvement recommendations
### Quarterly Tasks
**Major upgrades:**
- Model updates
- Infrastructure upgrades
- Feature additions
**Strategy review:**
- Roadmap adjustment
- Budget review
- Team expansion planning
**Training:**
- Advanced features training
- New team members onboarding
- Best practices sharing
---
## Best Practices
@@ -1366,103 +646,6 @@ Payback Period: 18,350 / 218,400 = 0.08 years ≈ 1 month
4. **Test backups** regularly
5. **Plan for growth** from day one
---
## Troubleshooting Guide
### GPU Issues
**Symptom:** Model loading fails
**Causes:**
- Insufficient VRAM
- Driver issues
- Cooling problems
**Resolution:**
1. Check nvidia-smi output
2. Verify model size vs VRAM
3. Update drivers if needed
4. Check temperatures
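
A quick check along the lines of steps 1-2, reading VRAM and temperature through `nvidia-smi`'s query interface (a single-GPU setup is assumed; the per-model VRAM figure is illustrative and depends on quantization):

```python
import subprocess

def gpu_status() -> dict:
    """Read used/total VRAM (MiB) and temperature (°C) from nvidia-smi's CSV output."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=memory.used,memory.total,temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    used, total, temp = (int(v) for v in out.split(", "))
    return {"vram_used_mib": used, "vram_total_mib": total, "temp_c": temp}

# Rough VRAM needs in MiB (illustrative; adjust for the quantization actually in use).
MODEL_VRAM_MIB = {"qwen2.5-coder:32b": 20_000}

def can_load(model: str) -> bool:
    """Compare free VRAM against the model's estimated footprint."""
    status = gpu_status()
    free = status["vram_total_mib"] - status["vram_used_mib"]
    return free >= MODEL_VRAM_MIB.get(model, free + 1)  # unknown models: assume it won't fit

print(gpu_status())
```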
**Symptom:** Slow inference
**Causes:**
- GPU throttling due to heat
- CPU bottleneck
- Insufficient RAM
**Resolution:**
1. Monitor GPU temperature
2. Check cooling system
3. Verify CPU usage
4. Check RAM availability
### MCP Service Issues
**Symptom:** MCP timeouts
**Causes:**
- Backend system slow/down
- Network issues
- Rate limiting
**Resolution:**
1. Check backend system health
2. Verify network connectivity
3. Review rate limit settings
4. Check MCP logs
**Symptom:** Incorrect data returned
**Causes:**
- Cache staleness
- Backend API changes
- Parsing errors
**Resolution:**
1. Clear MCP cache
2. Verify backend API format
3. Check MCP server logs
4. Update parsers if needed
### RAG Issues
**Symptom:** Poor search quality
**Causes:**
- Outdated index
- Poor chunk strategy
- Embedding model issues
**Resolution:**
1. Trigger reindexing
2. Review chunk configuration
3. Test embedding service
4. Analyze user feedback
**Symptom:** Slow searches
**Causes:**
- Index size too large
- Insufficient resources
- Network latency
**Resolution:**
1. Optimize index parameters
2. Add more RAM/storage
3. Check Qdrant configuration
4. Review network latency
### Storage Issues
**Symptom:** Disk full
**Causes:**
- Uncontrolled growth
- Failed cleanup jobs
- Backup accumulation
**Resolution:**
1. Run cleanup scripts
2. Archive old data
3. Verify retention policies
4. Plan capacity expansion
---
## Conclusion
@@ -1480,29 +663,4 @@ Self-hosted AI infrastructure based on Ollama with integr
**History as context.** Persistent storage and intelligent management of conversation history are critical for the user experience and for continuous improvement of the system.
### The path forward
Deploying this kind of infrastructure is not a one-off project but the start of a continuous improvement journey. The system will evolve together with:
- The arrival of new, more capable models
- Broader integrations with corporate systems
- Growth of the knowledge base
- A growing user base
- Maturing best practices
### Next steps
1. **Assess your organization's readiness** for adoption
2. **Plan the budget** and obtain approvals
3. **Form a team** for deployment and support
4. **Run a pilot deployment** with a small group of users
5. **Improve iteratively** based on feedback
6. **Roll out gradually** to the whole team
With the right strategy, investment, and commitment, a self-hosted AI infrastructure becomes a powerful enabler of productivity, work quality, and innovation in your organization.
---
**Document version:** 1.0
**Date:** January 2026
**Author:** Based on infrastructure requirements for k3s-gitops
**Status:** Comprehensive Guide