DevOps AI Toolkit

dot-ai
prds

150-remediation-observability-integration.md•12.6 KiB

# PRD: Remediation Observability Integration

**Created**: 2025-10-06
**Status**: Draft
**Owner**: Viktor Farcic
**Last Updated**: 2025-10-06
**GitHub Issue**: [#150](https://github.com/vfarcic/dot-ai/issues/150)
**Priority**: Medium
**Complexity**: High
**Related PRDs**: [#143 - Tool-Based Remediation (Phase 1)](https://github.com/vfarcic/dot-ai/blob/main/prds/143-tool-based-remediation-observability.md)

---

## Executive Summary

Extend the remediation tool with observability data source tools (metrics, traces, alerts) to enable comprehensive issue analysis beyond kubectl-based investigation. This builds on the tool-based architecture established in PRD #143 Phase 1.

---

## Problem Statement

### Current Limitations

The remediation tool successfully uses kubectl-based tools for cluster state investigation but lacks access to:
- **Historical metrics** (CPU/memory trends over time)
- **Performance data** (latency, throughput, error rates)
- **Observability platforms** (monitoring alerts, trace data)
- **Custom metrics** (application-specific measurements)

### Real-World Impact

**Example Scenario**: Pod crashes with OOMKilled
- **Current behavior**: AI sees "OOMKilled" event, recommends increasing memory limit
- **Missing context**:
  - Actual memory usage patterns over last hour
  - Whether memory was growing steadily or spiked suddenly
  - Memory utilization across all replicas
  - Any correlated alerts or incidents

**Result**: User must manually:
- Check Prometheus/DataDog/Grafana after AI analysis
- Correlate metrics with Kubernetes events
- Make remediation decisions without complete picture
- Run multiple tools to get full understanding

---

## Solution Overview

Extend remediation investigation with observability tools following the architecture pattern established for kubectl tools in PRD #143.

### Design Principles

1. **Reuse existing architecture**: Follow the pattern from `src/core/kubectl-tools.ts`:
   - Tool definitions using `AITool` interface
   - Tool executor function with switch statement
   - Tool collection array for passing to `toolLoop()`

2. **User-configurable**: Let users choose which observability tools to enable via server configuration

3. **AI-driven selection**: AI autonomously decides when to use observability vs kubectl based on issue type

4. **Provider-agnostic**: Works with both Anthropic and Vercel AI SDK providers

### Tools to Be Determined

During implementation, we'll research and select which observability tools to integrate. Candidates include:
- **Metrics platforms**: Prometheus, DataDog, Grafana, New Relic
- **Tracing systems**: Jaeger, Tempo, OpenTelemetry
- **Log aggregation**: Loki, Elasticsearch, Splunk
- **Alert systems**: Alertmanager, PagerDuty

The specific tools and their priority will be decided based on:
- User demand and common use cases
- Integration complexity
- API availability and stability
- Maintenance burden

---

## User Journey

### Before (Kubectl-Only)
```
1. User: "My pods keep restarting"
2. AI calls kubectl tools → sees OOMKilled events
3. AI: "Pods restarting due to OOMKilled - increase memory limit"
4. User manually checks Prometheus for actual memory usage
5. User discovers memory spikes to 450Mi, limit is 512Mi
6. User implements fix based on combined data
```

### After (With Observability Integration)
```
1. User: "My pods keep restarting"
2. AI calls kubectl_describe → sees OOMKilled events
3. AI calls observability tool → sees memory peaks at 450Mi consistently
4. AI calls observability tool → sees memory request set to 128Mi
5. AI analyzes: "Memory peaks at 450Mi, limit is 512Mi, but requests too low"
6. AI: "Set memory request=256Mi, limit=512Mi based on actual usage"
7. User applies fix with complete confidence

(Complete analysis without manual correlation)
```

---

## Technical Approach

### Architecture Pattern (Already Established)

From PRD #143, we have this proven pattern:

**Tool Definition**:
```typescript
// src/core/kubectl-tools.ts
export const KUBECTL_GET_TOOL: AITool = {
  name: 'kubectl_get',
  description: 'Get Kubernetes resources...',
  inputSchema: {
    type: 'object',
    properties: { ... },
    required: ['resource']
  }
};
```

**Tool Executor**:
```typescript
export async function executeKubectlTools(toolName: string, input: any) {
  switch (toolName) {
    case 'kubectl_get':
      // Execute and return result
      return { success: true, data: output };
    default:
      return { success: false, error: 'Unknown tool' };
  }
}
```

**Tool Collection**:
```typescript
export const KUBECTL_INVESTIGATION_TOOLS: AITool[] = [
  KUBECTL_GET_TOOL,
  KUBECTL_DESCRIBE_TOOL,
  // ... more tools
];
```

**Integration with Remediation**:
```typescript
// src/tools/remediate.ts
import { KUBECTL_INVESTIGATION_TOOLS, executeKubectlTools } from '../core/kubectl-tools';

// Use in toolLoop()
const result = await aiProvider.toolLoop({
  systemPrompt,
  tools: KUBECTL_INVESTIGATION_TOOLS,
  executeFunction: executeKubectlTools
});
```

### Observability Tools (To Be Designed)

Following the same pattern, we'll create:

```typescript
// src/core/observability-tools.ts (to be created)

export const OBSERVABILITY_TOOL_1: AITool = {
  name: 'tool_name_tbd',
  description: 'Query observability data...',
  inputSchema: { /* TBD during implementation */ }
};

export async function executeObservabilityTools(toolName: string, input: any) {
  switch (toolName) {
    case 'tool_name_tbd':
      // Implementation TBD
      return { success: true, data: result };
    default:
      return { success: false, error: 'Unknown tool' };
  }
}

export const OBSERVABILITY_INVESTIGATION_TOOLS: AITool[] = [
  // Tools TBD during implementation
];
```

### Configuration System

Server-level configuration for enabling/disabling observability tools:

```typescript
// Environment variables (example pattern)
OBSERVABILITY_ENABLED=true
OBSERVABILITY_PROVIDER_1_ENABLED=true
OBSERVABILITY_PROVIDER_1_URL=https://...
OBSERVABILITY_PROVIDER_1_API_KEY=xxx

// Runtime tool selection
const investigationTools = [
  ...KUBECTL_INVESTIGATION_TOOLS,
  ...(config.observabilityEnabled ? OBSERVABILITY_INVESTIGATION_TOOLS : [])
];
```

---

## Success Criteria

### Functional Requirements
- [ ] At least one observability tool integrated and functional
- [ ] AI successfully correlates Kubernetes events with observability data
- [ ] Remediation recommendations include metrics-based justification
- [ ] Users can enable/disable observability tools via server configuration
- [ ] Works with both Anthropic and Vercel AI SDK providers

### Quality Requirements
- [ ] Investigation quality improved (metrics-driven recommendations)
- [ ] At least 80% of performance issues include observability analysis
- [ ] Tool execution errors handled gracefully
- [ ] Integration tests validate observability tool functionality

### User Experience
- [ ] Configuration clear and well-documented
- [ ] Error messages helpful when observability endpoints unavailable
- [ ] No degradation when observability tools disabled
- [ ] AI explanations include observability evidence when used

---

## Milestones

### Milestone 1: Tool Selection and Design ⏳
**Goal**: Decide which observability tools to integrate and how

**Tasks**:
- [ ] Research common observability platforms used with Kubernetes
- [ ] Survey user needs and use cases
- [ ] Design tool definitions (name, description, input schema)
- [ ] Define priority order for implementation
- [ ] Document tool selection rationale

**Success Criteria**:
- Clear list of tools to implement with priority order
- Tool interface designs reviewed and approved
- Integration complexity assessed for each tool

### Milestone 2: Configuration System ⏳
**Goal**: Server-level configuration for enabling observability tools

**Tasks**:
- [ ] Design configuration schema for observability providers
- [ ] Implement configuration validation and loading
- [ ] Add runtime tool enablement based on config
- [ ] Health check for configured observability endpoints
- [ ] Environment variable documentation

**Success Criteria**:
- Users can configure observability endpoints via env vars
- Server validates configuration on startup
- Health checks report observability connectivity
- Clear error messages for misconfiguration

### Milestone 3: First Observability Tool Implementation ⏳
**Goal**: Integrate highest-priority observability tool

**Tasks**:
- [ ] Create observability tool definitions file
- [ ] Implement tool executor function
- [ ] Add client library integration
- [ ] Error handling for unreachable endpoints
- [ ] Integration tests with real/mock endpoint
- [ ] Validate works with Anthropic provider
- [ ] Validate works with Vercel provider

**Success Criteria**:
- Tool functional and returning correct data
- Integration tests passing
- AI successfully uses tool in investigations
- Documentation complete

### Milestone 4: AI Correlation Enhancement ⏳
**Goal**: AI intelligently combines kubectl + observability data

**Tasks**:
- [ ] Update investigation prompt for multi-source analysis
- [ ] Test AI tool selection logic (when to use observability)
- [ ] Validate recommendations include observability evidence
- [ ] Measure improvement in root cause accuracy
- [ ] Document AI behavior patterns

**Success Criteria**:
- AI autonomously selects appropriate tools for issue type
- Recommendations reference both kubectl and observability data
- Root cause analysis quality measurably improved
- Performance issues include metrics analysis

### Milestone 5: Additional Tools (If Applicable) ⏳
**Goal**: Add more observability tools based on Milestone 1 priority

**Tasks**:
- [ ] Implement additional tools following established pattern
- [ ] Integration tests for each new tool
- [ ] Documentation updates
- [ ] Configuration examples

**Success Criteria**:
- Each tool follows same architecture pattern
- All integration tests passing
- Users can enable any combination of tools

---

## Dependencies

### External Dependencies
- ✅ `toolLoop()` architecture from PRD #143 Phase 1
- ✅ kubectl tools pattern established and proven
- ⏳ Access to observability endpoints for testing
- ⏳ API keys for observability platforms

### Internal Dependencies
- ✅ AIProvider interface with tool support
- ✅ Remediation investigation framework
- ⏳ Server configuration system for observability settings

### Potential Blockers
- Observability API rate limits
- Authentication complexity for certain platforms
- Network access restrictions to external services
- Cost of observability API calls

---

## Risks and Mitigations

| Risk | Impact | Likelihood | Mitigation |
|------|--------|-----------|------------|
| Observability API rate limits | High | Medium | Implement caching, request throttling |
| Unreliable external APIs | Medium | Medium | Graceful fallbacks, timeout handling |
| Increased complexity | Medium | High | Follow established pattern, good docs |
| Authentication challenges | Medium | Medium | Support multiple auth methods |
| Integration maintenance | High | Medium | Start with most stable/popular platforms |

---

## Out of Scope

- **NOT building observability platforms**: We integrate with existing platforms, don't replace them
- **NOT creating custom metrics**: We query existing metrics only
- **NOT modifying observability configurations**: Read-only access to observability data
- **NOT including all possible platforms**: Focus on most common/requested tools

---

## Open Questions

1. **Q**: Which observability tools should we prioritize first?
   **A**: TBD - will research during Milestone 1

2. **Q**: Should configuration be per-MCP-call or server-level?
   **A**: Server-level (decided in PRD #143) - users configure once, applies to all investigations

3. **Q**: How do we handle observability API costs?
   **A**: TBD - may need usage monitoring and rate limiting

4. **Q**: Should we support on-premise vs cloud observability?
   **A**: TBD - depends on selected tools and user requirements

---

## Work Log

### 2025-10-06: PRD Creation (Extracted from PRD #143)
**Status**: Draft
**Context**: PRD #143 Phase 1 complete - tool-based architecture validated

**PRD Scope**:
- Extracted Phase 2 (Observability Integration) from PRD #143 into separate PRD
- Kept tool selection flexible - to be determined during implementation
- Focused on extending proven kubectl tools architecture pattern
- Emphasized user-configurable tool enablement
- Milestones designed for incremental delivery

**Design Decisions**:
- Follow established architecture pattern from `src/core/kubectl-tools.ts`
- Server-level configuration (not per-call)
- AI-driven tool selection (not prescriptive rules)
- Provider-agnostic implementation (Anthropic + Vercel)

**Next Session**: Begin Milestone 1 (Tool Selection and Design)

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/vfarcic/dot-ai'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

150-remediation-observability-integration.md•12.6 KiB

# PRD: Remediation Observability Integration

**Created**: 2025-10-06
**Status**: Draft
**Owner**: Viktor Farcic
**Last Updated**: 2025-10-06
**GitHub Issue**: [#150](https://github.com/vfarcic/dot-ai/issues/150)
**Priority**: Medium
**Complexity**: High
**Related PRDs**: [#143 - Tool-Based Remediation (Phase 1)](https://github.com/vfarcic/dot-ai/blob/main/prds/143-tool-based-remediation-observability.md)

---

## Executive Summary

Extend the remediation tool with observability data source tools (metrics, traces, alerts) to enable comprehensive issue analysis beyond kubectl-based investigation. This builds on the tool-based architecture established in PRD #143 Phase 1.

---

## Problem Statement

### Current Limitations

The remediation tool successfully uses kubectl-based tools for cluster state investigation but lacks access to:
- **Historical metrics** (CPU/memory trends over time)
- **Performance data** (latency, throughput, error rates)
- **Observability platforms** (monitoring alerts, trace data)
- **Custom metrics** (application-specific measurements)

### Real-World Impact

**Example Scenario**: Pod crashes with OOMKilled
- **Current behavior**: AI sees "OOMKilled" event, recommends increasing memory limit
- **Missing context**:
  - Actual memory usage patterns over last hour
  - Whether memory was growing steadily or spiked suddenly
  - Memory utilization across all replicas
  - Any correlated alerts or incidents

**Result**: User must manually:
- Check Prometheus/DataDog/Grafana after AI analysis
- Correlate metrics with Kubernetes events
- Make remediation decisions without complete picture
- Run multiple tools to get full understanding

---

## Solution Overview

Extend remediation investigation with observability tools following the architecture pattern established for kubectl tools in PRD #143.

### Design Principles

1. **Reuse existing architecture**: Follow the pattern from `src/core/kubectl-tools.ts`:
   - Tool definitions using `AITool` interface
   - Tool executor function with switch statement
   - Tool collection array for passing to `toolLoop()`

2. **User-configurable**: Let users choose which observability tools to enable via server configuration

3. **AI-driven selection**: AI autonomously decides when to use observability vs kubectl based on issue type

4. **Provider-agnostic**: Works with both Anthropic and Vercel AI SDK providers

### Tools to Be Determined

During implementation, we'll research and select which observability tools to integrate. Candidates include:
- **Metrics platforms**: Prometheus, DataDog, Grafana, New Relic
- **Tracing systems**: Jaeger, Tempo, OpenTelemetry
- **Log aggregation**: Loki, Elasticsearch, Splunk
- **Alert systems**: Alertmanager, PagerDuty

The specific tools and their priority will be decided based on:
- User demand and common use cases
- Integration complexity
- API availability and stability
- Maintenance burden

---

## User Journey

### Before (Kubectl-Only)
```
1. User: "My pods keep restarting"
2. AI calls kubectl tools → sees OOMKilled events
3. AI: "Pods restarting due to OOMKilled - increase memory limit"
4. User manually checks Prometheus for actual memory usage
5. User discovers memory spikes to 450Mi, limit is 512Mi
6. User implements fix based on combined data
```

### After (With Observability Integration)
```
1. User: "My pods keep restarting"
2. AI calls kubectl_describe → sees OOMKilled events
3. AI calls observability tool → sees memory peaks at 450Mi consistently
4. AI calls observability tool → sees memory request set to 128Mi
5. AI analyzes: "Memory peaks at 450Mi, limit is 512Mi, but requests too low"
6. AI: "Set memory request=256Mi, limit=512Mi based on actual usage"
7. User applies fix with complete confidence

(Complete analysis without manual correlation)
```

---

## Technical Approach

### Architecture Pattern (Already Established)

From PRD #143, we have this proven pattern:

**Tool Definition**:
```typescript
// src/core/kubectl-tools.ts
export const KUBECTL_GET_TOOL: AITool = {
  name: 'kubectl_get',
  description: 'Get Kubernetes resources...',
  inputSchema: {
    type: 'object',
    properties: { ... },
    required: ['resource']
  }
};
```

**Tool Executor**:
```typescript
export async function executeKubectlTools(toolName: string, input: any) {
  switch (toolName) {
    case 'kubectl_get':
      // Execute and return result
      return { success: true, data: output };
    default:
      return { success: false, error: 'Unknown tool' };
  }
}
```

**Tool Collection**:
```typescript
export const KUBECTL_INVESTIGATION_TOOLS: AITool[] = [
  KUBECTL_GET_TOOL,
  KUBECTL_DESCRIBE_TOOL,
  // ... more tools
];
```

**Integration with Remediation**:
```typescript
// src/tools/remediate.ts
import { KUBECTL_INVESTIGATION_TOOLS, executeKubectlTools } from '../core/kubectl-tools';

// Use in toolLoop()
const result = await aiProvider.toolLoop({
  systemPrompt,
  tools: KUBECTL_INVESTIGATION_TOOLS,
  executeFunction: executeKubectlTools
});
```

### Observability Tools (To Be Designed)

Following the same pattern, we'll create:

```typescript
// src/core/observability-tools.ts (to be created)

export const OBSERVABILITY_TOOL_1: AITool = {
  name: 'tool_name_tbd',
  description: 'Query observability data...',
  inputSchema: { /* TBD during implementation */ }
};

export async function executeObservabilityTools(toolName: string, input: any) {
  switch (toolName) {
    case 'tool_name_tbd':
      // Implementation TBD
      return { success: true, data: result };
    default:
      return { success: false, error: 'Unknown tool' };
  }
}

export const OBSERVABILITY_INVESTIGATION_TOOLS: AITool[] = [
  // Tools TBD during implementation
];
```

### Configuration System

Server-level configuration for enabling/disabling observability tools:

```typescript
// Environment variables (example pattern)
OBSERVABILITY_ENABLED=true
OBSERVABILITY_PROVIDER_1_ENABLED=true
OBSERVABILITY_PROVIDER_1_URL=https://...
OBSERVABILITY_PROVIDER_1_API_KEY=xxx

// Runtime tool selection
const investigationTools = [
  ...KUBECTL_INVESTIGATION_TOOLS,
  ...(config.observabilityEnabled ? OBSERVABILITY_INVESTIGATION_TOOLS : [])
];
```

---

## Success Criteria

### Functional Requirements
- [ ] At least one observability tool integrated and functional
- [ ] AI successfully correlates Kubernetes events with observability data
- [ ] Remediation recommendations include metrics-based justification
- [ ] Users can enable/disable observability tools via server configuration
- [ ] Works with both Anthropic and Vercel AI SDK providers

### Quality Requirements
- [ ] Investigation quality improved (metrics-driven recommendations)
- [ ] At least 80% of performance issues include observability analysis
- [ ] Tool execution errors handled gracefully
- [ ] Integration tests validate observability tool functionality

### User Experience
- [ ] Configuration clear and well-documented
- [ ] Error messages helpful when observability endpoints unavailable
- [ ] No degradation when observability tools disabled
- [ ] AI explanations include observability evidence when used

---

## Milestones

### Milestone 1: Tool Selection and Design ⏳
**Goal**: Decide which observability tools to integrate and how

**Tasks**:
- [ ] Research common observability platforms used with Kubernetes
- [ ] Survey user needs and use cases
- [ ] Design tool definitions (name, description, input schema)
- [ ] Define priority order for implementation
- [ ] Document tool selection rationale

**Success Criteria**:
- Clear list of tools to implement with priority order
- Tool interface designs reviewed and approved
- Integration complexity assessed for each tool

### Milestone 2: Configuration System ⏳
**Goal**: Server-level configuration for enabling observability tools

**Tasks**:
- [ ] Design configuration schema for observability providers
- [ ] Implement configuration validation and loading
- [ ] Add runtime tool enablement based on config
- [ ] Health check for configured observability endpoints
- [ ] Environment variable documentation

**Success Criteria**:
- Users can configure observability endpoints via env vars
- Server validates configuration on startup
- Health checks report observability connectivity
- Clear error messages for misconfiguration

### Milestone 3: First Observability Tool Implementation ⏳
**Goal**: Integrate highest-priority observability tool

**Tasks**:
- [ ] Create observability tool definitions file
- [ ] Implement tool executor function
- [ ] Add client library integration
- [ ] Error handling for unreachable endpoints
- [ ] Integration tests with real/mock endpoint
- [ ] Validate works with Anthropic provider
- [ ] Validate works with Vercel provider

**Success Criteria**:
- Tool functional and returning correct data
- Integration tests passing
- AI successfully uses tool in investigations
- Documentation complete

### Milestone 4: AI Correlation Enhancement ⏳
**Goal**: AI intelligently combines kubectl + observability data

**Tasks**:
- [ ] Update investigation prompt for multi-source analysis
- [ ] Test AI tool selection logic (when to use observability)
- [ ] Validate recommendations include observability evidence
- [ ] Measure improvement in root cause accuracy
- [ ] Document AI behavior patterns

**Success Criteria**:
- AI autonomously selects appropriate tools for issue type
- Recommendations reference both kubectl and observability data
- Root cause analysis quality measurably improved
- Performance issues include metrics analysis

### Milestone 5: Additional Tools (If Applicable) ⏳
**Goal**: Add more observability tools based on Milestone 1 priority

**Tasks**:
- [ ] Implement additional tools following established pattern
- [ ] Integration tests for each new tool
- [ ] Documentation updates
- [ ] Configuration examples

**Success Criteria**:
- Each tool follows same architecture pattern
- All integration tests passing
- Users can enable any combination of tools

---

## Dependencies

### External Dependencies
- ✅ `toolLoop()` architecture from PRD #143 Phase 1
- ✅ kubectl tools pattern established and proven
- ⏳ Access to observability endpoints for testing
- ⏳ API keys for observability platforms

### Internal Dependencies
- ✅ AIProvider interface with tool support
- ✅ Remediation investigation framework
- ⏳ Server configuration system for observability settings

### Potential Blockers
- Observability API rate limits
- Authentication complexity for certain platforms
- Network access restrictions to external services
- Cost of observability API calls

---

## Risks and Mitigations

| Risk | Impact | Likelihood | Mitigation |
|------|--------|-----------|------------|
| Observability API rate limits | High | Medium | Implement caching, request throttling |
| Unreliable external APIs | Medium | Medium | Graceful fallbacks, timeout handling |
| Increased complexity | Medium | High | Follow established pattern, good docs |
| Authentication challenges | Medium | Medium | Support multiple auth methods |
| Integration maintenance | High | Medium | Start with most stable/popular platforms |

---

## Out of Scope

- **NOT building observability platforms**: We integrate with existing platforms, don't replace them
- **NOT creating custom metrics**: We query existing metrics only
- **NOT modifying observability configurations**: Read-only access to observability data
- **NOT including all possible platforms**: Focus on most common/requested tools

---

## Open Questions

1. **Q**: Which observability tools should we prioritize first?
   **A**: TBD - will research during Milestone 1

2. **Q**: Should configuration be per-MCP-call or server-level?
   **A**: Server-level (decided in PRD #143) - users configure once, applies to all investigations

3. **Q**: How do we handle observability API costs?
   **A**: TBD - may need usage monitoring and rate limiting

4. **Q**: Should we support on-premise vs cloud observability?
   **A**: TBD - depends on selected tools and user requirements

---

## Work Log

### 2025-10-06: PRD Creation (Extracted from PRD #143)
**Status**: Draft
**Context**: PRD #143 Phase 1 complete - tool-based architecture validated

**PRD Scope**:
- Extracted Phase 2 (Observability Integration) from PRD #143 into separate PRD
- Kept tool selection flexible - to be determined during implementation
- Focused on extending proven kubectl tools architecture pattern
- Emphasized user-configurable tool enablement
- Milestones designed for incremental delivery

**Design Decisions**:
- Follow established architecture pattern from `src/core/kubectl-tools.ts`
- Server-level configuration (not per-call)
- AI-driven tool selection (not prescriptive rules)
- Provider-agnostic implementation (Anthropic + Vercel)

**Next Session**: Begin Milestone 1 (Tool Selection and Design)