DevOps AI Toolkit

dot-ai
prds

358-mcp-server-integration.md•13.4 KiB

# PRD: MCP Server Integration for Extended Tool Capabilities **Created**: 2026-01-30 **Status**: Draft **Owner**: Viktor Farcic **Last Updated**: 2026-01-30 **GitHub Issue**: [#358](https://github.com/vfarcic/dot-ai/issues/358) **Priority**: Medium **Complexity**: High **Supersedes**: [PRD #150](https://github.com/vfarcic/dot-ai/blob/main/prds/done/150-remediation-observability-integration.md) --- ## Executive Summary Add MCP client capability to dot-ai, enabling connection to MCP servers running in the cluster. This allows dot-ai tools (remediate, operate, query) to leverage tools from external MCP servers without building custom integrations for each platform. Two integration patterns are supported: - **Bundled MCP servers** (`mcpServers`): We deploy and configure, users just enable - **Custom MCP servers** (`customMcpServers`): Users deploy their own, provide endpoint The `attachTo` mechanism configures which MCP servers provide tools to which dot-ai tools. --- ## Problem Statement ### Current Limitations The remediation tool (and other dot-ai tools) rely solely on kubectl-based investigation, limiting diagnostic capabilities for issues requiring: - **Historical metrics** (CPU/memory trends over time) - **Performance data** (latency, throughput, error rates) - **Distributed traces** (cross-service correlation) - **Observability alerts** (active incidents, alert history) ### Why Not Build Custom Integrations? The original approach (PRD #150) proposed building custom observability tools following the kubectl-tools pattern. This was rejected because: - Requires building and maintaining N integrations (Prometheus, Datadog, Jaeger, etc.) - Duplicates work already done in the MCP ecosystem - Limits users to only the tools we implement - High maintenance burden as APIs evolve ### MCP Ecosystem Opportunity The MCP ecosystem already has servers for many observability platforms. By adding MCP client capability, we can: - Leverage existing, maintained MCP servers - Allow users to use any MCP server, not just ones we implement - Maintain single integration point (MCP client) vs. N custom integrations - Benefit automatically from community MCP improvements --- ## Solution Overview ### Architecture ``` ┌─────────────────────────────────────────────────────────────┐ │ User's Cluster │ ├─────────────────────────────────────────────────────────────┤ │ │ │ ┌─────────────┐ MCP Protocol ┌─────────────────┐ │ │ │ dot-ai │◄───────────────────► │ Prometheus MCP │ │ │ │ (MCP Client)│ │ (bundled) │ │ │ │ │ └─────────────────┘ │ │ │ remediate │ MCP Protocol ┌─────────────────┐ │ │ │ operate │◄───────────────────► │ Jaeger MCP │ │ │ │ query │ │ (custom) │ │ │ └─────────────┘ └─────────────────┘ │ │ │ └─────────────────────────────────────────────────────────────┘ ``` ### Configuration Model ```yaml # values.yaml # Bundled MCP servers - we deploy, users just enable and configure backing service mcpServers: prometheus: enabled: true prometheusUrl: "http://prometheus.monitoring.svc:9090" attachTo: - remediate - query # Custom MCP servers - users deploy, provide endpoint customMcpServers: jaeger: endpoint: "http://jaeger-mcp.tracing.svc:3000" attachTo: - remediate ``` ### Key Concepts | Concept | Description | |---------|-------------| | **`mcpServers`** | MCP servers we bundle and deploy via Helm chart | | **`customMcpServers`** | User-deployed MCP servers, just provide endpoint | | **`attachTo`** | List of dot-ai tools that can use this MCP server's tools | | **MCP Client** | New capability in dot-ai to connect to MCP servers | ### How It Works 1. User enables `mcpServers.prometheus` or adds entry to `customMcpServers` 2. Helm chart deploys bundled MCP servers (if enabled) 3. dot-ai connects to configured MCP servers as MCP client 4. dot-ai discovers available tools from each MCP server 5. When `remediate` runs, it gets tools from MCP servers where `attachTo` includes "remediate" 6. AI uses both kubectl tools (from agentic-tools plugin) and MCP server tools --- ## User Journey ### Before (kubectl-only) ``` 1. User: "My pods keep restarting" 2. AI calls kubectl tools → sees OOMKilled events 3. AI: "Pods restarting due to OOMKilled - increase memory limit" 4. User manually checks Prometheus for actual memory usage 5. User discovers memory spikes to 450Mi, limit is 512Mi 6. User implements fix based on combined data ``` ### After (with MCP integration) ``` 1. User: "My pods keep restarting" 2. AI calls kubectl_describe → sees OOMKilled events 3. AI calls prometheus_query → sees memory peaks at 450Mi consistently 4. AI calls prometheus_query → sees memory request set to 128Mi 5. AI: "Memory peaks at 450Mi, limit is 512Mi, but requests too low. Set memory request=256Mi, limit=512Mi based on actual usage" 6. User applies fix with complete confidence ``` --- ## Existing MCP Servers During implementation, evaluate these existing MCP servers for suitability: ### Observability & Monitoring | MCP Server | Purpose | Potential Use | |------------|---------|---------------| | Prometheus MCP | Metrics queries, alerts | remediate, query | | Grafana MCP | Dashboard data, alerts | remediate, query | | Datadog MCP | APM, metrics, logs | remediate, query | | New Relic MCP | APM, metrics | remediate, query | ### Tracing & Debugging | MCP Server | Purpose | Potential Use | |------------|---------|---------------| | Jaeger MCP | Distributed traces | remediate | | Tempo MCP | Trace queries | remediate | | OpenTelemetry MCP | Traces, metrics | remediate, query | ### Logging | MCP Server | Purpose | Potential Use | |------------|---------|---------------| | Loki MCP | Log queries | remediate | | Elasticsearch MCP | Log search | remediate | ### Selection Criteria When selecting MCP servers to bundle or recommend: - **Maturity**: Is it actively maintained? - **Popularity**: Is the backing service commonly used? - **Testing ease**: Can we test it in CI without complex setup? - **API stability**: Is the MCP server API stable? - **License**: Compatible with our project? --- ## Success Criteria ### Functional Requirements - [ ] dot-ai can connect to MCP servers running in the cluster - [ ] Tool discovery works for connected MCP servers - [ ] `attachTo` correctly routes tools to specified dot-ai tools - [ ] Bundled MCP servers deploy correctly via Helm - [ ] Custom MCP servers integrate via endpoint configuration - [ ] Remediation uses MCP server tools alongside kubectl tools ### Quality Requirements - [ ] MCP connection failures don't break core functionality - [ ] Graceful degradation when MCP servers are unavailable - [ ] Clear error messages for configuration issues - [ ] Integration tests validate both bundled and custom patterns ### User Experience - [ ] Configuration is intuitive (`mcpServers` vs `customMcpServers`) - [ ] Documentation explains both integration patterns - [ ] Example configurations provided for common setups --- ## Milestones ### Milestone 1: MCP Client + Helm Config + Prometheus + Remediate **Goal**: Validate entire architecture end-to-end with Prometheus + remediate **Tasks**: - [ ] Implement MCP client capability in dot-ai - [ ] Add Helm configuration for `mcpServers` and `customMcpServers` - [ ] Implement `attachTo` mechanism for tool routing - [ ] Research and select Prometheus MCP server to bundle - [ ] Bundle Prometheus MCP server in Helm chart - [ ] Update remediation to use tools from attached MCP servers - [ ] Integration tests for Prometheus + remediate flow **Success Criteria**: - Prometheus MCP server deploys via Helm when enabled - Remediation uses Prometheus tools for metrics-based analysis - AI correlates kubectl data with Prometheus metrics - End-to-end test passes ### Milestone 2: Operate Tool Integration **Goal**: Extend MCP server integration to operate tool **Tasks**: - [ ] Discussion: Which MCP server is a good candidate for operate? - [ ] Add selected MCP server as bundled or test-only (customMcpServers) - [ ] Update operate tool to use attached MCP server tools - [ ] Integration tests for operate + MCP server **Success Criteria**: - Operate tool can use tools from attached MCP servers - Integration test validates the flow ### Milestone 3: Query Tool Integration **Goal**: Extend MCP server integration to query tool **Tasks**: - [ ] Discussion: Which MCP server is a good candidate for query? - [ ] Add selected MCP server as bundled or test-only (customMcpServers) - [ ] Update query tool to use attached MCP server tools - [ ] Integration tests for query + MCP server **Success Criteria**: - Query tool can use tools from attached MCP servers - Integration test validates the flow ### Milestone 4: Documentation **Goal**: Complete user documentation for MCP server integration **Tasks**: - [ ] Setup guide for bundled MCP servers - [ ] Setup guide for custom MCP servers - [ ] Configuration reference for `mcpServers` and `customMcpServers` - [ ] Troubleshooting guide - [ ] "Request new bundled MCP server" process in docs **Success Criteria**: - Users can set up both bundled and custom MCP servers from docs - Documentation follows existing patterns --- ## Technical Considerations ### MCP Client Implementation dot-ai needs to act as an MCP client to connect to MCP servers. Key considerations: - Support SSE and/or streamable-http transport - Handle tool discovery from multiple MCP servers - Merge tools from MCP servers with plugin tools - Route tool execution to correct MCP server ### Tool Routing with `attachTo` ```typescript // Pseudocode for tool collection function getToolsForOperation(operation: 'remediate' | 'operate' | 'query') { const tools = []; // Add plugin tools (kubectl, helm, etc.) tools.push(...pluginManager.getDiscoveredTools()); // Add MCP server tools where attachTo includes this operation for (const mcpServer of connectedMcpServers) { if (mcpServer.attachTo.includes(operation)) { tools.push(...mcpServer.tools); } } return tools; } ``` ### Helm Chart Changes New templates needed: - MCP server Deployment + Service for each bundled server - ConfigMap for MCP server configuration - Update dot-ai ConfigMap with MCP server connection info ### Failure Handling - MCP server unavailable at startup: Log warning, continue without those tools - MCP server becomes unavailable: Graceful degradation, tools marked unavailable - Tool execution fails: Return error to AI, let it try alternative approach --- ## Dependencies ### External Dependencies - MCP server implementations (Prometheus MCP, etc.) - MCP protocol libraries for client implementation ### Internal Dependencies - Plugin system (tools from plugins + MCP servers need to coexist) - Remediate, operate, query tools (need updates to use MCP server tools) - Helm chart (needs new templates) --- ## Risks and Mitigations | Risk | Impact | Likelihood | Mitigation | |------|--------|------------|------------| | Suitable MCP servers don't exist | High | Low | Fall back to building minimal custom integration | | MCP server API instability | Medium | Medium | Pin versions, test upgrades | | Performance overhead | Medium | Low | Connection pooling, async discovery | | Complex user configuration | Medium | Medium | Good defaults, clear docs, validation | --- ## Out of Scope - Building custom observability integrations (superseded approach) - Running MCP servers outside the cluster (network complexity) - Authentication between dot-ai and MCP servers (assume in-cluster trust) - MCP server development (we use existing servers) --- ## Open Questions 1. **Q**: Which Prometheus MCP server should we bundle? **A**: TBD - evaluate during Milestone 1 2. **Q**: Should we support MCP servers outside the cluster? **A**: No - adds network/auth complexity, can revisit later if needed 3. **Q**: How do we handle MCP servers that require configuration? **A**: Use `env` field in Helm values, similar to plugins --- ## Work Log ### 2026-01-30: PRD Creation **Status**: Draft **Context**: Supersedes PRD #150 **Key Decisions**: - Use MCP client approach instead of custom tool integrations - Two configuration patterns: `mcpServers` (bundled) and `customMcpServers` (user-deployed) - `attachTo` mechanism for explicit tool-to-operation routing - Prometheus + remediate as first integration to validate architecture - Separate milestones per dot-ai tool (operate, query) with discussion phase for MCP server selection **Next Steps**: Begin Milestone 1 - MCP Client + Prometheus + Remediate

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/vfarcic/dot-ai'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

358-mcp-server-integration.md•13.4 KiB