DevOps AI Toolkit

143-tool-based-remediation-observability.md•71.7 KiB

# PRD: Tool-Based Remediation with Observability Data Sources

**Created**: 2025-10-06
**Status**: Complete
**Owner**: Viktor Farcic
**Last Updated**: 2025-10-07
**Completed**: 2025-10-07
**GitHub Issue**: [#143](https://github.com/vfarcic/dot-ai/issues/143)
**Priority**: High
**Complexity**: High

---

## Executive Summary

Migrate the remediation tool from JSON-based kubectl command patterns to SDK tool-based architecture (`toolLoop()`). This establishes a clean, extensible foundation for investigation tools.

**Note**: Phase 2 (Observability Integration) has been moved to a separate PRD for focused implementation. See [PRD #150 - Remediation Observability Integration](https://github.com/vfarcic/dot-ai/blob/main/prds/150-remediation-observability-integration.md).

---

## Problem Statement

### Current Limitations

**1. JSON-Based Architecture**
- Remediation uses JSON `dataRequests` pattern that AI returns, then we parse and execute
- Manual loop management (400+ lines of iteration logic)
- Rigid pattern that's hard to extend with new data sources
- Cannot leverage SDK's native tool-calling capabilities

**2. Limited Data Sources**
- Only has access to Kubernetes API data (kubectl commands)
- No access to metrics (CPU, memory, custom metrics)
- No access to observability platforms (Prometheus, DataDog, Grafana)
- Cannot correlate cluster state with performance/resource metrics

**3. Incomplete Root Cause Analysis**
- AI identifies "high memory usage" but cannot see actual metrics
- Detects pod restarts but cannot analyze resource saturation trends
- Recommends scaling but lacks utilization data to validate
- Missing context for performance degradation issues

### Impact

Users must:
- Manually check Prometheus/DataDog after AI analysis
- Correlate metrics with kubernetes events themselves
- Make scaling/remediation decisions without complete picture
- Run multiple tools to get full understanding

---

## Solution Overview

**Two-Phase Approach:**

### Phase 1: Migrate to Tool-Based Architecture
**CRITICAL**: This is a complete architectural replacement, not incremental changes. The SDK's `toolLoop()` replaces the entire 400+ line manual investigation loop:
- **SDK manages conversation history** (replaces `{previousIterations}` injection)
- **SDK manages iterations** (replaces manual while loop in `conductInvestigation()`)
- **SDK handles tool execution** (replaces JSON parsing of `dataRequests`)
- **SDK determines completion** (replaces `investigationComplete` boolean field)

Implementation:
- Implement kubectl tools: `kubectl_get`, `kubectl_describe`, `kubectl_logs`, `kubectl_events`, `kubectl_top`
- Use `AIProvider.toolLoop()` method (already implemented in PRD #136 for Anthropic)
- Remove manual JSON parsing and loop management (~400 lines)
- Remove JSON response format from investigation prompt
- Focus on Anthropic provider first, Vercel support later
- Validate tool architecture works for remediation use case

### Phase 2: Add Observability Data Sources (Moved to PRD #150)
This phase has been moved to a separate PRD for focused implementation:
- See [PRD #150 - Remediation Observability Integration](https://github.com/vfarcic/dot-ai/blob/main/prds/150-remediation-observability-integration.md)
- Will extend remediation with observability tools following the architecture established in Phase 1
- Tools and priorities to be determined during PRD #150 implementation

---

## User Journey

### Before (Current State)
```
1. User reports: "My pods keep restarting"
2. AI investigates using kubectl only
3. AI finds: "Pods restarting due to OOMKilled"
4. AI recommends: "Increase memory limit"
5. User manually checks Prometheus to see actual memory usage
6. User correlates data and implements fix
```

### After (Phase 1: Tool-Based)
```
1. User reports: "My pods keep restarting"
2. AI calls kubectl_get, kubectl_describe, kubectl_events tools
3. AI finds: "Pods restarting due to OOMKilled"
4. AI recommends: "Increase memory limit"
5. User manually checks Prometheus to see actual memory usage
6. User correlates data and implements fix

(Same outcome, cleaner architecture)
```

### After (Phase 2: Observability Integration)
```
1. User reports: "My pods keep restarting"
2. AI calls kubectl_describe tool → sees OOMKilled events
3. AI calls prometheus_range_query tool → sees memory usage at 95% consistently
4. AI calls prometheus_query tool → sees memory requests set to 128Mi
5. AI analyzes: "Memory usage peaks at 450Mi, limit is 512Mi, but requests too low causing scheduling issues"
6. AI recommends: "Set memory request=256Mi, limit=512Mi based on actual usage patterns"
7. User applies fix with complete confidence

(Complete analysis with metrics-driven recommendations)
```

---

## Technical Approach

### Phase 1: Tool Migration Architecture

**Current Flow (JSON-Based - ~400 lines)**:
```typescript
// Manual iteration loop in conductInvestigation()
while (currentIteration <= maxIterations && !session.investigationComplete) {
  // Inject cumulative previousIterations (grows with each iteration)
  const previousIterationsJson = JSON.stringify(
    session.iterations.map(iter => ({
      step: iter.step,
      analysis: iter.aiAnalysis,
      dataRequests: iter.dataRequests,
      gatheredData: iter.gatheredData
    })), null, 2
  );

  // Inject clusterApiResources (415+ resources)
  const clusterApiResources = await executeKubectl(['api-resources']);

  // Replace template variables
  const investigationPrompt = promptTemplate
    .replace('{previousIterations}', previousIterationsJson)
    .replace('{clusterApiResources}', clusterApiResources)
    .replace('{currentIteration}', currentIteration.toString());

  // AI responds with JSON specifying what data it needs
  const aiResponse = await aiProvider.sendMessage(investigationPrompt);
  const parsedResponse = JSON.parse(aiResponse.content);

  // Manually execute each data request
  for (const request of parsedResponse.dataRequests) {
    const output = await executeKubectl([request.type, request.resource]);
    gatheredData.push(output);
  }

  // Check if AI set investigationComplete flag
  if (parsedResponse.investigationComplete) {
    session.investigationComplete = true;
    break;
  }

  currentIteration++;
}
```

**New Flow (Tool-Based - ~50 lines)**:
```typescript
// Define tools once
const tools = [
  {
    name: "kubectl_get",
    description: "Get Kubernetes resources",
    inputSchema: {
      type: "object",
      properties: {
        resource: { type: "string", description: "Resource type (pods, services, etc)" },
        namespace: { type: "string", description: "Namespace" }
      }
    }
  },
  {
    name: "get_cluster_resources",
    description: "Get list of all available Kubernetes API resources in cluster",
    inputSchema: { type: "object", properties: {} }
  }
];

// SDK handles EVERYTHING - iterations, history, completion
const result = await aiProvider.toolLoop({
  systemPrompt: investigationPrompt, // No {previousIterations} or {currentIteration} needed!
  tools: tools,
  toolImplementations: {
    kubectl_get: async (args) => await executeKubectl(['get', args.resource, '-n', args.namespace]),
    get_cluster_resources: async () => await executeKubectl(['api-resources'])
  },
  maxIterations: 20
});

// result.finalMessage contains the analysis
// result.allToolCalls contains complete history
```

**Key Simplifications**:
- No manual loop management
- No `{previousIterations}` injection (SDK maintains conversation history)
- No JSON parsing of AI responses
- No `investigationComplete` flag (SDK stops when AI stops calling tools)
- No cumulative data structure management
- Prompt simplified by ~40 lines (remove JSON schema documentation)

### Phase 2: Observability Tools

**Configuration (Server Level)**:
```typescript
// Server config
{
  "observability": {
    "prometheus": {
      "enabled": true,
      "endpoint": "http://prometheus:9090"
    },
    "datadog": {
      "enabled": false,
      "apiKey": "...",
      "appKey": "..."
    }
  }
}
```

**Tool Definitions**:
```typescript
const observabilityTools = [
  {
    name: "prometheus_query",
    description: "Query Prometheus metrics (instant query)",
    input_schema: {
      type: "object",
      properties: {
        query: { type: "string", description: "PromQL query" },
        time: { type: "string", description: "Evaluation timestamp (optional)" }
      },
      required: ["query"]
    }
  },
  {
    name: "prometheus_range_query",
    description: "Query Prometheus metrics over time range",
    input_schema: {
      type: "object",
      properties: {
        query: { type: "string", description: "PromQL query" },
        start: { type: "string", description: "Start time" },
        end: { type: "string", description: "End time" },
        step: { type: "string", description: "Query resolution step" }
      },
      required: ["query", "start", "end"]
    }
  }
];
```

**Tool Implementations**:
```typescript
const toolImplementations = {
  prometheus_query: async (args) => {
    const config = getServerConfig();
    if (!config.observability.prometheus.enabled) {
      return { error: "Prometheus not configured" };
    }

    const url = `${config.observability.prometheus.endpoint}/api/v1/query`;
    const response = await fetch(url, {
      method: 'POST',
      body: JSON.stringify({ query: args.query, time: args.time })
    });
    return await response.json();
  },

  prometheus_range_query: async (args) => {
    // Similar implementation for range queries
  }
};
```

---

## Success Criteria

### Phase 1: Tool Migration
- [x] Remediation uses `toolLoop()` instead of JSON-based loops ✅
- [x] All existing kubectl operations work as tools ✅
- [x] Investigation quality unchanged or improved ✅
- [x] Code is simpler (remove 400+ lines of manual loop logic) ✅
- [x] All integration tests passing ✅
- [x] Token usage optimized (+50% vs JSON baseline through table format + prompt caching) ✅

### Phase 2: Observability Integration (Moved to PRD #150)
See [PRD #150 - Remediation Observability Integration](https://github.com/vfarcic/dot-ai/blob/main/prds/150-remediation-observability-integration.md) for success criteria.

---

## Milestones

### Milestone 1: Validate Anthropic toolLoop() & Design kubectl Tools (Phase 1 Start) ✅ COMPLETED
**Goal**: Verify AnthropicProvider toolLoop() works and design kubectl tool architecture

**Infrastructure Validation (Anthropic Only)**:
- [x] Verify `AIProvider.toolLoop()` interface exists (from PRD #136) ✅
- [x] Verify AnthropicProvider implements toolLoop() correctly ✅ (verified in src/core/providers/anthropic-provider.ts:134-256)
- [x] Test AnthropicProvider with sendMessageWithTools() to confirm functionality ✅
- [x] Document any limitations discovered during testing ✅

**Note**: VercelProvider toolLoop() is NOT implemented (intentionally - see vercel-provider.ts:157-163). Vercel support will be added later in Phase 1.

**Tool Design**:
- [x] Design kubectl tool schema (granularity, parameters, error handling) ✅
- [x] Create tool definition format matching Anthropic.Tool interface ✅
- [x] Design tool registration system for remediation ✅
- [x] Plan tool implementation architecture ✅

**First Implementation**:
- [x] Implement `kubectl_api_resources` tool (replaces `{clusterApiResources}` injection) ✅
- [x] Test with existing remediation integration tests ✅
- [x] Confirm tool architecture is sound before building all kubectl tools ✅

**Findings & Limitations**:
- **Hybrid Approach Works**: Using `sendMessageWithTools()` with single tool while still supporting JSON `dataRequests` for other operations
- **Metrics Logging**: Added proper metrics tracking to `sendMessageWithTools()` method with descriptive operation names
- **Tool Placement**: Created shared `src/core/kubectl-tools.ts` module for cross-feature reuse
- **Test Results**: All integration tests pass (16 AI calls vs 19 baseline - 3 fewer calls with tool available)
- **Tool Usage**: AI did not call `kubectl_api_resources` tool in test scenario (crashing pod) as it already knew which resources to query. Tool is available and functional when AI needs resource discovery
- **Token Impact**: No significant token increase observed with tool available but unused

### Milestone 2: Complete kubectl Tool Migration (Phase 1 - Anthropic) ✅ SUBSTANTIALLY COMPLETE
**Goal**: Replace entire manual investigation loop with toolLoop()
- [x] All kubectl tools implemented: `kubectl_get`, `kubectl_describe`, `kubectl_logs`, `kubectl_events`, `kubectl_api_resources` ✅
- [x] Additional tools: `kubectl_patch_dryrun`, `kubectl_apply_dryrun`, `kubectl_delete_dryrun`, `kubectl_get_crd_schema` ✅
- [x] Replace `conductInvestigation()` with single `toolLoop()` call ✅
- [x] Update `prompts/remediate-investigation.md`: ✅
  - Removed `{previousIterations}` template variable
  - Removed `{currentIteration}` template variable
  - Removed JSON response format
  - Removed "Investigation Workflow Example" JSON schemas
  - Simplified to tool-calling instructions only
- [x] Remove old JSON-based loop code (~400 lines in remediate.ts) ✅
- [x] All integration tests passing ✅
- [x] Debug logging infrastructure added (operation parameter, per-iteration logging) ✅
- [x] Performance optimized to viable production levels ✅ See optimization results below

### Milestone 2.5: Add Vercel Provider Support (Phase 1 Complete) ✅
**Goal**: Implement toolLoop() for VercelProvider
- [x] Implement `VercelProvider.toolLoop()` using Vercel AI SDK tool APIs ✅ (src/core/providers/vercel-provider.ts:168-398)
- [x] Test all kubectl tools work with Vercel provider (Anthropic model) ✅ (Integration tests: Manual 76.8s, Automatic 98s - all passing)
- [x] Update integration tests to validate Vercel provider ✅ (Both Manual and Automatic mode workflows validated)
- [x] Document provider-specific differences ✅ (Caching approaches, token metrics normalization, SDK version notes)

**Note**: Tested with Anthropic Claude Sonnet via Vercel SDK only. OpenAI and Google Gemini testing pending (Milestone 2.6).

### Milestone 2.6: Multi-Model Provider Testing (Phase 1 Final Validation) ✅ COMPLETE (Extended Scope)
**Goal**: Validate Vercel provider works across diverse AI models from different providers

**Big 3 Testing (Complete)** ✅:
- [x] Test VercelProvider.toolLoop() with OpenAI GPT-5 model ✅
- [x] Test VercelProvider.toolLoop() with Google Gemini 2.5 Pro model ✅ (Reliability issues identified)
- [x] Document any model-specific differences or limitations ✅
- [x] Verify caching behavior works correctly for each provider ✅
- [x] Update integration tests if provider-specific handling needed ✅ (No changes needed - tests work with any provider)

**Extended Testing (Moved to PRD #151)** ✅:
- Extended model testing moved to [PRD #151 - AI Model Comparison for MCP Platform](https://github.com/vfarcic/dot-ai/blob/main/prds/151-ai-model-comparison-mcp-platform.md)
- Comprehensive testing across ALL MCP AI interactions (not just remediation)
- 6 models total: Claude, GPT-5, Gemini + xAI Grok-4, Mistral Large, DeepSeek-R1

**OpenAI GPT-5 Test Results** (2025-10-06):
- ✅ All integration tests pass (Manual: 135.8s, Automatic: 125.4s)
- **Performance vs Claude**: +84% more tokens, +65% slower per call, but equivalent end-to-end test duration
- **Token Usage**: 25,602 input + 5,082 output = 30,684 total tokens (vs Claude: 16,649 total)
- **Caching**: 20,224 cache read tokens (effective automatic caching via Vercel SDK)
- **Investigation Steps**: 30 total steps across 2 tests (8, 7, 4, 11 per run)
- **Quality**: Equivalent remediation quality - both models successfully identify issues and generate valid fixes
- **Baseline saved**: `tmp/baseline-vercel-gpt5.jsonl`

**Model Comparison Summary**:
| Metric | Claude Sonnet 4.5 | OpenAI GPT-5 | Difference |
|--------|------------------|--------------|------------|
| Total Tokens | 16,649 | 30,684 | +84% |
| Input Tokens | 14,429 | 25,602 | +77% |
| Output Tokens | 2,220 | 5,082 | +129% |
| Avg Call Duration | 33s | 54.6s | +65% |
| Cache Reads | 14,400 | 20,224 | +40% |

**Recommendation**: Both models work effectively. Claude Sonnet preferred for cost/speed optimization (-46% tokens, -40% faster). GPT-5 suitable when OpenAI infrastructure is already in use.

**Google Gemini 2.5 Pro Test Results** (2025-10-06):
- ⚠️ **CRITICAL RELIABILITY ISSUE IDENTIFIED** - 50% test failure rate
- **Test Outcome**: 1 of 2 integration tests failed
- **Failure Mode**: Model stopped generating output mid-investigation (0 output tokens after successful tool calls)
- **Successful Test**: Automatic mode passed - Generated proper JSON, executed remediation successfully
- **Failed Test**: Manual mode failed - Completed 6 iterations with 5 tool calls, then produced zero output tokens
- **Token Usage** (incomplete data due to failure): 13,467 input + 662 output tokens across 3 calls
- **Investigation Quality**: When working, quality is equivalent to Claude/GPT-5
- **Root Cause**: Gemini exhibits inconsistent stopping behavior - sometimes completes properly, sometimes stops prematurely
- **Debug Evidence**: `tmp/debug-ai/2025-10-06T180613_*` shows conversation ending abruptly after `kubectl_patch_dryrun` validation
- **Baseline saved**: `tmp/baseline-vercel-gemini.jsonl` (partial data)

**Gemini Reliability Analysis**:
- **Failure Pattern**: Unpredictable - no clear pattern to when model will stop vs continue
- **Impact**: Cannot rely on consistent JSON output in production workflows
- **Risk**: 50% failure rate makes Gemini unsuitable for critical remediation tasks
- **Behavior**: Model completes tool calls correctly but fails to generate final analysis
- **Comparison**: Claude and GPT-5 showed 100% reliability across all tests

**Model Comparison Summary (Updated with Gemini)**:
| Metric | Claude Sonnet 4.5 | OpenAI GPT-5 | Google Gemini 2.5 Pro | Notes |
|--------|------------------|--------------|----------------------|-------|
| Reliability | 100% | 100% | 50% | Gemini failed 1 of 2 tests |
| Total Tokens | 16,649 | 30,684 | 13,467 | Gemini data incomplete |
| Avg Call Duration | 33s | 54.6s | 25.4s | Gemini appears faster when working |
| Investigation Quality | Excellent | Excellent | Excellent | When Gemini works, quality is good |
| **Production Ready** | **✅ Yes** | **✅ Yes** | **❌ No** | Reliability issues |

**Final Recommendation**:
- **Claude Sonnet 4.5**: Recommended for production (best cost/speed balance, 100% reliability)
- **OpenAI GPT-5**: Acceptable for production (higher cost, 100% reliability)
- **Google Gemini 2.5 Pro**: NOT RECOMMENDED for production remediation workflows due to unpredictable stopping behavior and 50% failure rate

### Phase 2 Milestones (Moved to PRD #150)

The following milestones have been moved to [PRD #150 - Remediation Observability Integration](https://github.com/vfarcic/dot-ai/blob/main/prds/150-remediation-observability-integration.md):

- **Milestone 1**: Tool Selection and Design
- **Milestone 2**: Configuration System
- **Milestone 3**: First Observability Tool Implementation
- **Milestone 4**: AI Correlation Enhancement
- **Milestone 5**: Additional Tools (If Applicable)

See PRD #150 for detailed milestone descriptions and tasks.

---

## Dependencies

### External Dependencies
- [x] `toolLoop()` interface defined in AIProvider (PRD #136)
- [x] AnthropicProvider toolLoop() implementation (PRD #136) ✅ Verified in src/core/providers/anthropic-provider.ts:134-256
- [x] VercelProvider toolLoop() implementation ✅ Complete (Milestone 2.5)

### Internal Dependencies
- [x] AIProvider interface with tool support (PRD #136)
- [x] Existing remediation investigation logic (to migrate)

### Potential Blockers
- Tool performance overhead (monitor token usage)
- Backward compatibility if migration affects existing users
- Provider-specific tool behavior differences

---

## Risks & Mitigations

| Risk | Impact | Mitigation |
|------|--------|-----------|
| Tool overhead increases tokens significantly | High | Monitor metrics, use Phase 1 to validate, fallback to JSON if needed |
| Breaking changes to existing remediation behavior | High | Comprehensive integration tests, phased rollout, feature flag |
| Prometheus/DataDog API failures impact remediation | Medium | Graceful degradation - continue with kubectl if observability unavailable |
| PromQL query injection vulnerabilities | High | Query validation, parameter binding, sandboxing |
| Configuration complexity overwhelms users | Medium | Sensible defaults, auto-detection, clear documentation |
| Provider differences in tool handling | Medium | Abstract tool layer, test with both providers, document differences |

---

## Open Questions

1. ~~**Tool Granularity**: Should we have one `kubectl` tool with operation parameter, or separate tools per operation?~~ **RESOLVED (2025-10-06)**: Separate tools per operation for better AI understanding and targeted descriptions.

2. **Error Handling**: How should AI handle tool failures (unreachable Prometheus, invalid queries)?

3. **Query Limits**: Should we limit PromQL query complexity or time ranges?

4. **Auto-Detection**: Should server auto-detect Prometheus/DataDog and enable tools automatically?

5. ~~**Tool Selection**: How does AI decide when to use metrics vs. just kubectl? Update investigation prompt?~~ **RESOLVED (2025-10-06)**: AI naturally selects tools based on descriptions. No special prompt logic needed - just provide tools and let SDK handle selection.

6. ~~**Backward Compatibility**: Do we need to support JSON-based approach during migration?~~ **RESOLVED (2025-10-06)**: No backward compatibility needed. Complete replacement is cleaner and simpler. Use feature flag if rollback needed.

7. ~~**Provider Parity**: Do we need identical tool behavior across Anthropic and Vercel, or can they differ?~~ **RESOLVED (2025-10-06)**: Focus Anthropic first (Milestone 1-2), add Vercel support after core migration validated (Milestone 2.5). Tool behavior should be identical, but implementation timing differs.

---

## Out of Scope

### Deferred to Future PRDs
- [ ] Additional observability platforms (Grafana, New Relic, Splunk)
- [ ] Custom metrics endpoints beyond Prometheus/DataDog
- [ ] Alerting integration (create alerts based on remediation findings)
- [ ] Cost optimization metrics (AWS Cost Explorer, GCP Billing)
- [ ] APM integration (distributed tracing, span analysis)

---

## Decision Log

### Decision 1: toolLoop() is Complete Replacement, Not Incremental Migration
**Date**: 2025-10-06
**Decision**: The SDK's `toolLoop()` replaces the ENTIRE 400+ line manual investigation loop, not just individual data injections.

**Context**: Initially considered migrating data injections (`{previousIterations}`, `{clusterApiResources}`) one at a time while keeping manual loop. During design discussion, realized SDK manages all aspects of the loop.

**What SDK Replaces**:
- **Conversation history management** → Replaces `{previousIterations}` injection and cumulative data structures
- **Iteration management** → Replaces manual while loop in `conductInvestigation()`
- **Tool execution** → Replaces JSON parsing of `dataRequests` array
- **Completion detection** → Replaces `investigationComplete` boolean field

**Impact**:
- Much larger code simplification (~400 lines removed vs. ~50 lines added)
- Prompt simplification (~40 lines of JSON schema documentation removed)
- Entire investigation becomes single `toolLoop()` call
- No need for incremental migration - complete architectural shift

**Alternatives Considered**:
- Incremental migration: Keep manual loop, replace data injections with tools → Rejected as it adds complexity without full benefits
- Hybrid approach: Use toolLoop() for some operations, JSON for others → Rejected as inconsistent and harder to maintain

**Files Affected**:
- `src/tools/remediate.ts` - Replace `conductInvestigation()` entirely
- `prompts/remediate-investigation.md` - Remove JSON response format, template variables

---

### Decision 2: Focus on Anthropic Provider First, Vercel Later
**Date**: 2025-10-06
**Decision**: Implement and validate with AnthropicProvider only in Phase 1 Milestones 1-2. Add VercelProvider support in Milestone 2.5 after core architecture validated.

**Context**: Initial PRD assumed both providers would be implemented simultaneously. During design discussion, identified that VercelProvider toolLoop() is not yet implemented.

**Rationale**:
- Reduce complexity and risk during architectural migration
- Validate tool-based approach works before multi-provider support
- AnthropicProvider toolLoop() already fully implemented (verified in code review)
- VercelProvider requires new implementation work
- Existing remediation tests can validate Anthropic-only approach first

**Impact**:
- Milestone 1 simplified - only validate Anthropic
- Milestone 2 focuses on core migration with one provider
- New Milestone 2.5 added for Vercel support
- Phase 1 timeline extended slightly but risk reduced

**Alternatives Considered**:
- Implement both simultaneously → Rejected as higher risk, more complex debugging
- Skip Vercel entirely → Rejected as violates multi-provider architecture goals
- Implement Vercel first → Rejected as Anthropic already has working implementation

**Files Affected**:
- PRD Milestone 1: Updated to Anthropic-only validation
- PRD Milestone 2.5: New milestone for Vercel implementation
- `src/core/providers/vercel-provider.ts` - Will need toolLoop() implementation

---

### Decision 3: Remove JSON Response Format Entirely from Investigation Prompt
**Date**: 2025-10-06
**Decision**: Eliminate all JSON response format requirements from `prompts/remediate-investigation.md` including `dataRequests` array and `investigationComplete` field.

**Context**: Current prompt (lines 36-55) specifies AI must return JSON with `dataRequests` array and `investigationComplete` boolean. With toolLoop(), SDK handles this natively.

**Rationale**:
- SDK's tool calling replaces JSON `dataRequests` pattern
- SDK determines completion automatically (AI stops calling tools)
- Prompt can focus on investigation guidance, not response format
- Reduces prompt complexity and token usage
- Eliminates potential for JSON parsing errors

**What Gets Removed**:
- Lines 36-55: JSON response format specification
- Lines 138-179: "Investigation Workflow Example" JSON schemas
- Field definitions for `dataRequests`, `investigationComplete`, etc.

**What Remains**:
- Investigation approach guidance
- Data request precision guidelines
- Root cause analysis criteria
- Tool usage instructions (added)

**Impact**:
- Prompt simplified by ~40 lines
- Clearer focus on investigation strategy vs. technical format
- Better token efficiency
- More maintainable prompt

**Alternatives Considered**:
- Keep JSON format as "backup" → Rejected as confusing and contradictory
- Gradual migration of prompt → Rejected as incomplete migration adds complexity

**Files Affected**:
- `prompts/remediate-investigation.md` - Major simplification

---

### Decision 4: Separate Tools Per Operation (Not Single kubectl Tool)
**Date**: 2025-10-06
**Decision**: Implement separate tools for each kubectl operation (`kubectl_get`, `kubectl_describe`, etc.) rather than one `kubectl` tool with operation parameter.

**Rationale**:
- **Better AI understanding**: Each tool has focused, specific description
- **Targeted documentation**: Tool descriptions can explain exactly what data each provides
- **Easier debugging**: Clear separation of tool calls in logs
- **Better error handling**: Operation-specific error messages and validation
- **SDK best practices**: AI SDKs work better with granular, focused tools

**Example**:
```typescript
// ✅ GOOD - Separate tools
{ name: "kubectl_get", description: "List Kubernetes resources with status" }
{ name: "kubectl_describe", description: "Get detailed resource configuration and events" }
{ name: "kubectl_logs", description: "Fetch container logs for debugging" }

// ❌ BAD - Single tool with operation parameter
{
  name: "kubectl",
  description: "Execute kubectl commands",
  parameters: { operation: "get|describe|logs|...", ... }
}
```

**Impact**:
- More tool definitions (5-7 tools vs. 1)
- Clearer tool purposes and better AI selection
- Easier to add operation-specific validation

**Alternatives Considered**:
- Single kubectl tool → Rejected as too generic, harder for AI to understand when to use
- Group by resource type → Rejected as kubectl operations are more natural grouping

**Files Affected**:
- `src/tools/remediate.ts` - Tool definitions

---

### Decision 5: Extend Metrics Logging for Model Comparison Analysis
**Date**: 2025-10-06
**Decision**: Enhance `metrics.jsonl` logging to capture comprehensive behavioral, quality, and efficiency metrics for AI model comparison. Treat this as an evolving list - add new metrics as we identify value.

**Context**: Planning blog post comparing Claude Sonnet 4.5, OpenAI GPT-5, and Google Gemini 2.5 Pro performance in remediation workflows. Current metrics (inputTokens, outputTokens, durationMs, provider, operation) insufficient for deep analysis of model behavior differences.

**Rationale**:
- **Multi-model testing complete**: Phase 1 tested 3 models, identified significant reliability differences (Claude/GPT 100%, Gemini 50%)
- **Need behavioral insights**: Token counts don't explain WHY Gemini failed (stopped generating mid-investigation)
- **Cost analysis required**: Need to justify Claude recommendation vs. GPT's higher token usage
- **Quality metrics missing**: Can't quantify investigation thoroughness, efficiency, or reliability
- **Blog post value**: Concrete metrics make model comparison actionable for users

**Metrics to Add (Priority Order)**:

**High Priority (Implement First)**:
1. **`iterationCount`** - Number of conversation turns (loop vs one-shot behavior)
2. **`toolCallCount`** - Total tools executed (investigation thoroughness)
3. **`status`** - "success", "failed", "timeout", "parse_error" (reliability)
4. **`completionReason`** - "investigation_complete", "max_iterations", "parse_failure", "model_stopped" (failure modes)
5. **`validJson`** - Boolean: Final response contained valid JSON (quality)
6. **`modelVersion`** - Specific model ID (e.g., "claude-sonnet-4.5-20250929")
7. **`uniqueToolsUsed`** - Array of tool names (investigation strategy)
8. **`cacheHitRate`** - Percentage of tokens from cache (performance optimization)

**Medium Priority (Add as Needed)**:
9. **`confidenceScore`** - AI's confidence in analysis (0-1)
10. **`riskLevel`** - "low", "medium", "high" for recommended actions
11. **`investigationType`** - Inferred from issue (e.g., "oom", "crash", "networking")
12. **`timeToFirstToken`** - Latency until first response (UX metric)
13. **`tokensPerSecond`** - Throughput metric
14. **`toolSequence`** - Ordered array showing investigation path

**Future Extensions**:
- Cost estimation per investigation
- Token breakdown (system prompt, tool results, conversation)
- Validation metrics (dry-run usage)
- Resource context (namespace, resource type)

**Important: This is a Living List**
- Add new metrics when we discover they provide value
- Remove metrics that prove unhelpful or redundant
- Adjust priorities based on blog post needs
- Consider reader interest and actionability

**Impact**:
- Richer comparison data for blog post
- Better understanding of model behavior differences
- Quantifiable justification for model recommendations
- Foundation for future performance analysis

**Implementation Notes**:
- Capture metrics in `src/tools/remediate.ts` during investigation
- Log to `tmp/debug-ai/metrics.jsonl` (existing file)
- Ensure metrics available from toolLoop() result object
- Consider backwards compatibility with existing metrics

**Alternatives Considered**:
- Use existing metrics only → Rejected as insufficient for behavioral analysis
- Create separate metrics file → Rejected as adds complexity, prefer single JSONL
- Log everything possible → Rejected as noisy, focus on high-value metrics first

**Files Affected**:
- `src/tools/remediate.ts` - Capture additional metrics during investigation
- `tmp/debug-ai/metrics.jsonl` - Extended schema (backwards compatible)

---

### Decision 6: Expand Model Testing Coverage (Moved to PRD #151)
**Date**: 2025-10-06  
**Decision**: Test 3 additional AI models beyond initial "Big 3" (Claude, GPT-5, Gemini) to provide comprehensive market comparison covering different market segments and use cases.

**Status**: **MOVED TO PRD #151** - Extended to comprehensive MCP platform testing rather than remediation-only.

**Context**:
- Initial testing covered premium providers (Claude, GPT-5) and identified reliability issues (Gemini 50% failure)
- Blog post value increases with broader market coverage
- Readers need guidance across price/performance spectrum
- Vercel AI SDK supports many cloud providers

**Models Selected for Testing**:

1. **xAI Grok-4** (Challenger)
   - Package: `@ai-sdk/xai`
   - **Value**: Elon Musk's model, high market curiosity, competitive alternative to GPT-5
   - **Test focus**: Performance vs established leaders
   - **Blog angle**: "New challenger analysis"

2. **Mistral Large** (European Champion)
   - Package: `@ai-sdk/mistral`
   - **Value**: Leading European AI, cost-effective, strong reasoning capabilities
   - **Test focus**: Price/performance sweet spot, regional alternative
   - **Blog angle**: "Value for money analysis"

3. **DeepSeek-R1** (Disruptor)
   - Package: Available via Vercel AI SDK
   - **Value**: Extreme cost efficiency (~1/10th GPT-4 cost), recent market buzz (Jan 2025), reasoning-focused
   - **Test focus**: Budget option viability, reasoning specialization for troubleshooting
   - **Blog angle**: "Can low-cost models compete?"

**Final Model Lineup (6 total)**:
1. ✅ Claude Sonnet 4.5 - Premium, reliable baseline
2. ✅ GPT-5 - Premium, expensive reference
3. ✅ Gemini 2.5 Pro - Reliability warning case study
4. ⏳ Grok-4 - New challenger evaluation
5. ⏳ Mistral Large - Regional/value champion
6. ⏳ DeepSeek-R1 - Budget disruptor

**Rationale**:
- **Market coverage**: Premium → Mid-tier → Budget spectrum
- **Geographic diversity**: US (Claude, GPT, Grok), Europe (Mistral), China (DeepSeek)
- **Reader questions answered**:
  - "Should I pay premium?" → Claude vs GPT comparison
  - "Are there good alternatives?" → Grok, Mistral evaluation
  - "Can I save money?" → DeepSeek viability
  - "What should I avoid?" → Gemini reliability issues
- **Narrative diversity**: Established leaders, problematic option, challengers, value play, disruptor

**DeepSeek Model Selection**:
- Chose **DeepSeek-R1** over DeepSeek-V3
- **Reason**: Remediation requires complex reasoning about issues, R1's reasoning specialization potentially beneficial
- **Comparison value**: Tests if reasoning focus improves investigation quality

**Impact**:
- Extends Milestone 2.6 scope from 3 models to 6 models
- Richer blog post with broader market coverage
- More actionable guidance for diverse user needs
- Validates Vercel provider across more model types

**Alternatives Considered**:
- **Groq (Llama 3.1)** → Deferred: Speed focus less relevant for async investigations
- **Azure OpenAI** → Rejected: Redundant with direct OpenAI, enterprise wrapper
- **Amazon Bedrock** → Rejected: Redundant with direct providers
- **Cohere** → Rejected: Less relevant for reasoning tasks
- **Together.ai** → Rejected: Similar to Groq, less differentiation

**Files Affected**:
- `src/core/providers/vercel-provider.ts` - May need provider additions for xAI, Mistral, DeepSeek
- `tests/integration/infrastructure/run-integration-tests.sh` - Test script updates for new providers
- PRD Milestone 2.6 - Expand scope to 6 models

**Testing Plan**:
- Same methodology as Big 3 testing
- Same integration test suite (remediate manual + automatic modes)
- Capture extended metrics from Decision 5
- Document reliability, token usage, and behavior patterns

---

## Work Log

### 2025-10-06: PRD Creation
**Duration**: Initial planning session
**Status**: Draft

**PRD Scope Defined**:
- Two-phase approach: Tool migration (Phase 1) + Observability integration (Phase 2)
- Configuration at server level (not per-call)
- Focus on Prometheus and DataDog as initial providers
- Built on existing toolLoop() from PRD #136
- Must support both Anthropic and Vercel AI SDK providers

**Next Steps**: Begin Phase 1 implementation - validate existing tool infrastructure and design kubectl tools

---

### 2025-10-06: Architectural Design Decisions (PRD Updated)
**Duration**: Design discussion and PRD refinement
**Status**: Ready for Implementation (Milestone 1)

**Major Decisions Documented** (see Decision Log):
1. toolLoop() is complete replacement - not incremental migration (~400 lines removed)
2. Focus Anthropic first, Vercel support in Milestone 2.5
3. Remove JSON response format entirely from investigation prompt
4. Separate tools per kubectl operation (not single kubectl tool)

**PRD Updates**:
- Solution Overview: Added clarity on complete architectural replacement
- Technical Approach: Expanded code examples showing before/after (~400 lines → ~50 lines)
- Milestone 1: Updated to Anthropic-only validation, verified existing implementation
- Milestone 2: Clarified scope - replace entire loop, update prompts
- Milestone 2.5: New milestone for Vercel provider support
- Open Questions: Resolved 4 of 7 questions
- Dependencies: Corrected VercelProvider status (not yet implemented)
- Decision Log: Added comprehensive documentation of 4 major decisions with rationale

**Key Insights from Design Discussion**:
- SDK's toolLoop() manages conversation history → no need for `{previousIterations}` injection
- SDK manages iterations → no need for manual while loop or `{currentIteration}`
- SDK handles tool execution → no need for JSON parsing of `dataRequests`
- SDK determines completion → no need for `investigationComplete` boolean
- Result: Much simpler architecture than originally planned

**Verification Completed**:
- AnthropicProvider toolLoop() implementation confirmed (src/core/providers/anthropic-provider.ts:134-256)
- VercelProvider toolLoop() NOT implemented (vercel-provider.ts:157-163) - documented as blocker
- Existing remediation workflow analyzed (remediate.ts) - 400+ lines to be replaced
- Investigation prompt analyzed (remediate-investigation.md) - ~40 lines to be removed

**Next Steps**: Begin Milestone 1 implementation - Test AnthropicProvider toolLoop() with simple tool, design kubectl tool architecture

---

### 2025-10-06: Milestone 1 Completed - kubectl_api_resources Tool Implementation
**Duration**: Implementation and testing session
**Status**: ✅ COMPLETED

**Implementation Summary**:
- Created shared kubectl tools module (`src/core/kubectl-tools.ts`) for cross-feature reuse
- Implemented `kubectl_api_resources` tool with AITool interface definition
- Implemented `executeKubectlTools()` executor function with proper error handling
- Integrated tool into remediation using hybrid approach with `sendMessageWithTools()`

**Code Changes**:
1. **src/core/kubectl-tools.ts** (NEW):
   - `KUBECTL_API_RESOURCES_TOOL`: Tool definition for discovering cluster API resources
   - `executeKubectlTools()`: Tool executor with switch-case pattern for extensibility
   - Proper error handling returning structured success/error responses

2. **src/tools/remediate.ts** (MODIFIED):
   - Removed upfront `clusterApiResources` fetching (lines 346-364 deleted)
   - Removed `{clusterApiResources}` template replacement
   - Changed investigation iterations to use `sendMessageWithTools()` instead of `sendMessage()`
   - Pass `[KUBECTL_API_RESOURCES_TOOL]` and `executeKubectlTools` to each investigation call
   - Updated final analysis to use descriptive operation name: `'remediate-final-analysis'`

3. **prompts/remediate-investigation.md** (MODIFIED):
   - Removed "Cluster API Resources" section (lines 13-26)
   - AI now discovers resources via tool when needed

4. **src/core/providers/anthropic-provider.ts** (MODIFIED):
   - Added `startTime` tracking to `sendMessageWithTools()` method
   - Added metrics logging with `logMetrics()` call before return
   - Ensures all AI calls are tracked with descriptive operation names

**Testing Results**:
- ✅ All integration tests pass (`npm run test:integration`)
- **Metrics Comparison**:
  - Baseline (JSON-based, no tools): 19 total AI calls
  - With tool (hybrid approach): 16 total AI calls (12 investigation + 4 final analysis)
  - **Reduction**: 3 fewer calls with tool-based approach
- **Tool Usage**: AI did not call `kubectl_api_resources` in test scenario (crashing pod issue) as it already knew which resources to query
- Tool is available and functional when AI needs resource discovery

**Hybrid Approach Validation**:
- ✅ AI can call `kubectl_api_resources` tool when needed
- ✅ AI continues using JSON `dataRequests` pattern for other kubectl operations
- ✅ Proper metrics logging for all operations
- ✅ Tool placed in shared location for future reuse by other features

**Key Findings**:
1. **sendMessageWithTools() works correctly** - Successfully provides tools to AI while maintaining conversation flow
2. **No token overhead for unused tools** - Having tool available but unused doesn't increase token usage significantly
3. **Cleaner architecture** - Removed template injection complexity, AI calls tool when needed
4. **Metrics tracking essential** - Added proper logging to track all AI calls with descriptive operation names
5. **Shared tool module pattern** - `src/core/kubectl-tools.ts` can be imported by remediate, diagnostics, and other features

**Decision Made**: Hybrid approach validated for incremental migration. Can now proceed with implementing additional kubectl tools while maintaining backward compatibility with JSON `dataRequests` pattern.

**Files Modified**:
- `src/core/kubectl-tools.ts` (created)
- `src/tools/remediate.ts` (modified)
- `prompts/remediate-investigation.md` (modified)
- `src/core/providers/anthropic-provider.ts` (modified)

**Metrics Files**:
- `tmp/baseline-metrics-json-based.jsonl` - Baseline before tool implementation
- `tmp/metrics-with-kubectl-tool.jsonl` - Metrics after tool implementation

**Next Steps**: Begin Milestone 2 - Implement remaining kubectl tools (`kubectl_get`, `kubectl_describe`, `kubectl_logs`, `kubectl_events`, `kubectl_top`) and migrate to full `toolLoop()` architecture

---

### 2025-10-06: Critical Bug Found - sendMessageWithTools() Incomplete Implementation
**Duration**: Debug and investigation session
**Status**: ⚠️ BLOCKED - Bug prevents tool usage

**Issue Discovered**:
The initial Milestone 1 validation appeared successful, but deeper investigation revealed the tool was never actually called by the AI. Debug analysis found a critical bug in `sendMessageWithTools()` implementation.

**Root Cause Analysis**:

1. **sendMessageWithTools() is incomplete** (src/core/providers/anthropic-provider.ts:262-339):
   - ✅ Sends message with tools to Claude
   - ✅ Receives tool_use blocks from Claude
   - ✅ Executes tools locally
   - ❌ **NEVER sends tool results back to Claude** (missing multi-turn conversation)
   - ❌ Returns immediately without Claude seeing tool results

2. **Correct implementation exists in toolLoop()** (src/core/providers/anthropic-provider.ts:134-260):
   - ✅ Executes tools
   - ✅ Builds tool_result messages with tool_use_id
   - ✅ Adds assistant response + tool results to conversation history
   - ✅ Sends back to Claude to get final response incorporating tool results
   - ✅ Multi-turn conversation pattern

3. **Test scenario created to force tool usage**:
   - Created CNPG Cluster with reference to non-existent backup
   - Vague issue description: "my database in namespace X is not starting"
   - Forces AI to discover CNPG custom resources via `kubectl_api_resources`
   - Updated test: `tests/integration/tools/remediate.test.ts`

4. **Test results showed the problem**:
   - AI ran for 26+ iterations (hit 20 iteration limit twice)
   - All iterations showed: `"successful": 0, "failed": 0` (no data gathered)
   - AI mentioned "kubectl_api_resources function is available" but never called it
   - AI output empty `dataRequests: []` arrays
   - Investigation stuck in endless loop

**Why Hybrid Approach Failed**:
- Prompt instructs AI to output JSON with `dataRequests` array
- Tool is available via `sendMessageWithTools()` but AI doesn't know how to use it
- Even when AI tries to call tool, results never make it back to AI
- AI is confused by mixed instructions (JSON format + tool availability)

**Architectural Decision - Full toolLoop() Migration Required**:

After analysis, determined that fixing `sendMessageWithTools()` to match `toolLoop()` pattern would still leave hybrid confusion. Better approach: **Complete migration to toolLoop()** as originally planned in Milestone 2.

**Implementation Plan**:
1. ✅ Create all kubectl investigation tools (completed):
   - `kubectl_api_resources` - discover cluster resources
   - `kubectl_get` - list resources
   - `kubectl_describe` - detailed resource info
   - `kubectl_logs` - container logs
   - `kubectl_events` - cluster events
   - Exported as `KUBECTL_INVESTIGATION_TOOLS` array

2. ⏳ Replace manual investigation loop with `toolLoop()`:
   - Remove `conductInvestigation()` while loop (lines 208-308)
   - Use `aiProvider.toolLoop()` with all investigation tools
   - Remove `parseAIResponse()` function (dead code)
   - Remove `gatherSafeData()` function (dead code)
   - Remove `analyzeCurrentState()` function (replaced by toolLoop)

3. ⏳ Update investigation prompt:
   - Remove JSON response format instructions
   - Remove `dataRequests` array specification
   - Remove `investigationComplete` field
   - Explain available tools and when to use them
   - Let SDK manage conversation history and completion

4. ⏳ Test with CNPG scenario

**Files Completed**:
- ✅ `src/core/kubectl-tools.ts` - All 5 investigation tools implemented
- ✅ `tests/integration/tools/remediate.test.ts` - CNPG test scenario created

**Files Pending**:
- ⏳ `src/tools/remediate.ts` - Replace loop with toolLoop()
- ⏳ `prompts/remediate-investigation.md` - Remove JSON format, explain tools

**Metrics from Failed Test**:
- 26 AI calls total (2 separate investigation attempts)
- First attempt: 19 iterations, hit max limit
- Second attempt: 7 iterations, still no data gathered
- Input tokens grew to 28,443 (huge `previousIterations` injection)
- Operations properly labeled: "remediate-investigation" and "remediate-final-analysis"

**Key Insight**:
The hybrid approach (one tool + JSON pattern) creates too much confusion. Clean migration to full toolLoop() is simpler and cleaner than trying to fix the hybrid model.

**Next Steps**: Complete toolLoop() migration - estimated ~200-300 lines of code changes across 2 files

---

### 2025-10-06: Milestone 2 Implementation - Full toolLoop() Migration Complete
**Duration**: Implementation and testing session (~4-5 hours based on conversation)
**Status**: ✅ SUBSTANTIALLY COMPLETE (Anthropic provider only)

**Implementation Summary**:
Completed full migration from JSON-based investigation (400+ line manual loop) to SDK-native toolLoop() architecture. All kubectl investigation tools implemented, tests passing, debug infrastructure added.

**Completed PRD Items**:
- [x] All kubectl tools implemented (9 tools total) - Evidence: src/core/kubectl-tools.ts:17-515
  - kubectl_api_resources, kubectl_get, kubectl_describe, kubectl_logs, kubectl_events
  - kubectl_patch_dryrun, kubectl_apply_dryrun, kubectl_delete_dryrun
  - kubectl_get_crd_schema (NEW - added for CRD schema discovery)

- [x] Replace `conductInvestigation()` with single `toolLoop()` call - Evidence: src/tools/remediate.ts:416-496
  - Single-phase investigation merging investigation + analysis
  - Removed 400+ lines of manual loop management
  - SDK handles iterations, history, tool execution, completion

- [x] Update investigation prompt - Evidence: prompts/remediate.md
  - Removed JSON response format requirements
  - Removed dataRequests array specification
  - Converted to tool-calling instructions
  - Simplified by ~40 lines

- [x] Remove old JSON-based code - Evidence: Multiple file changes
  - Deleted parseAIResponse() function (dead code)
  - Deleted gatherSafeData() function (dead code)
  - Deleted analyzeCurrentState() function (replaced by toolLoop)
  - Updated parseAIFinalAnalysis() with proper brace-depth tracking (fixed greedy regex bug)

- [x] Integration tests passing - Evidence: tmp/test-output-deployment.log
  - Manual mode test: Deployment with OOMKilled container (passed)
  - Automatic mode test: Pod with memory issue (passed)
  - 2/2 tests passed in 217.50s

- [x] Debug logging infrastructure - Evidence: src/core/providers/anthropic-provider.ts:58-72, 192-208
  - Added operation parameter to toolLoop() config
  - Created logDebugIfEnabled() shared helper method
  - Per-iteration debug logging with prompts, responses, token metrics
  - Fixed JSON parsing bug in parseAIFinalAnalysis (brace-depth tracking)

**Additional Work Completed**:

1. **Code Cleanup**:
   - Removed `sendMessageWithTools()` from all providers (dead code)
   - Deleted old JSON-based prompts: remediate-investigation.md, remediate-final-analysis.md
   - Removed unused imports (AITool, ToolExecutor) from providers
   - Fixed environment variable mismatch: DOT_AI_DEBUG → DEBUG_DOT_AI

2. **Test Scenario Improvements**:
   - Changed from Pod to Deployment resources for complexity
   - Updated validation to check deployment rollout and pod status
   - Improved wait logic using label selectors
   - Fixed OOMKilled scenario to be consistently reproducible

3. **New Capability - CRD Schema Tool**:
   - Implemented kubectl_get_crd_schema tool (src/core/kubectl-tools.ts:207-224)
   - Returns raw JSON from `kubectl get crd <name> -o json`
   - AI can discover CRD schemas for complex custom resources (CNPG, Crossplane, ArgoCD, etc.)
   - Added to KUBECTL_INVESTIGATION_TOOLS array for automatic availability

**Metrics Analysis**:
- **Tool-based approach**: 33 calls, 482,004 input tokens, 15,372 output tokens, 370s
- **JSON-based baseline**: 19 calls, 172,089 input tokens, 9,378 output tokens, 220s
- **Token increase**: +74% more calls, +180% more input tokens, +174% total tokens
- **Analysis**: Token increase is expected and justified:
  - Tool-based performs more thorough investigation (longer tool result context)
  - Better investigation quality with native tool calling
  - Future observability tools will add more value
  - Acceptable tradeoff for cleaner architecture and extensibility

**Architecture Changes**:
1. **Single-phase investigation**: Merged investigation + analysis into one toolLoop call
2. **JSON parsing fix**: Replaced greedy regex with proper brace-depth tracking algorithm
3. **Debug infrastructure**: Per-iteration logging with operation names for observability
4. **Tool executor pattern**: executeKubectlTools() with switch-case for extensibility

**Files Modified**:
- src/core/kubectl-tools.ts (9 tools, 515 lines)
- src/tools/remediate.ts (conductInvestigation replaced, parseAIFinalAnalysis fixed)
- src/core/providers/anthropic-provider.ts (debug logging, removed sendMessageWithTools)
- src/core/providers/vercel-provider.ts (removed sendMessageWithTools stub)
- src/core/ai-provider.interface.ts (removed sendMessageWithTools signature)
- prompts/remediate.md (tool-calling only, no JSON format)
- tests/integration/tools/remediate.test.ts (Deployment scenario, improved validation)
- tests/integration/infrastructure/run-integration-tests.sh (DEBUG_DOT_AI fix)

**Files Deleted**:
- prompts/remediate-investigation.md (old JSON-based investigation prompt)
- prompts/remediate-final-analysis.md (old JSON-based analysis prompt)

**Test Evidence**:
- Manual mode: 5 iterations, 7 tool calls, fixed Deployment memory limit
- Automatic mode: 6 iterations, 8 tool calls, auto-executed remediation
- Debug files: tmp/debug-ai/metrics.jsonl, tmp/debug-ai/*_prompt.md, *_response.md

**Key Architectural Insights**:
1. **toolLoop() simplification**: SDK managing conversation history eliminated ~400 lines of manual loop code
2. **No template injection needed**: SDK maintains context automatically (no `{previousIterations}`)
3. **Natural tool selection**: AI chooses appropriate tools based on descriptions without special prompt logic
4. **Cleaner error handling**: Tool execution errors handled consistently by executor pattern
5. **Better extensibility**: Adding new tools is straightforward (define + add to executor switch)

**Next Session Priorities**:
1. Milestone 2.5: Implement VercelProvider.toolLoop() for multi-provider support
2. Phase 2 Start: Design observability configuration system (Milestone 3)
3. Consider token optimization strategies if usage becomes concern in production

---

### 2025-10-06: Performance Optimization Session (Post-Implementation)
**Duration**: ~4-5 hours (based on conversation flow)
**Focus**: Token usage and execution time optimization

**Problem**: Initial tool-based implementation showed +174% token increase vs JSON-based baseline, raising concerns about production viability.

**Optimizations Implemented**:
1. **Table Format Enforcement** (-41% tokens):
   - Modified `kubectl-tools.ts` to strip `-o=yaml` and `-o=json` output flags
   - Root cause: YAML outputs contained escaped JSON in `last-applied-configuration` annotations
   - Files: `src/core/kubectl-tools.ts`, `tests/integration/infrastructure/run-integration-tests.sh`

2. **Prompt Caching** (-7% additional tokens, massive cost savings):
   - Separated static system prompt from dynamic issue description
   - Added `cache_control` on system prompt and tools array in `anthropic-provider.ts`
   - Cache write: +25% cost, Cache read: -90% cost (187K cache reads per test run)
   - Files: `src/core/providers/anthropic-provider.ts`, `src/tools/remediate.ts`, `prompts/remediate-system.md` (renamed)

3. **Parallel Tool Execution** (time optimization):
   - Refactored toolLoop() from sequential `for...of await` to `Promise.all()`
   - Claude frequently requests 2-3 tools simultaneously; now execute concurrently
   - Files: `src/core/providers/anthropic-provider.ts`

4. **Concurrent Test Execution** (-42% test time):
   - Added `.concurrent` to remediation tests (separate namespaces, no conflicts)
   - Test time: 142.6s concurrent vs 248.3s sequential
   - Files: `tests/integration/tools/remediate.test.ts`

**Metrics Summary**:
| Stage | Tokens | Change |
|-------|--------|--------|
| JSON baseline | 181,467 | - |
| Initial tools | 497,376 | +174% ⚠️ |
| After table opt | 293,932 | -41% |
| After caching | 272,216 | -7% |
| **Final vs baseline** | **272,216** | **+50%** ✅ |

**Outcome**: Tool-based approach economically viable for production. Token cost increase justified by:
- Better investigation quality with native tool calling
- Extensibility for Phase 2 observability tools
- Cleaner architecture (-400 lines manual loop code)
- 35% faster wall-clock time through concurrency

**Files Modified**:
- `src/core/kubectl-tools.ts` - Table format enforcement
- `src/core/providers/anthropic-provider.ts` - Caching + parallel execution
- `src/tools/remediate.ts` - Separated issue from system prompt
- `prompts/remediate-system.md` - Renamed, made fully static
- `src/core/providers/provider-debug-utils.ts` - Cache metrics logging
- `src/core/ai-provider.interface.ts` - Added cache metrics to AIResponse
- `tests/integration/tools/remediate.test.ts` - Concurrent execution
- `tests/integration/infrastructure/run-integration-tests.sh` - Hardcoded DEBUG=true

**Key Optimization Insights**:
1. **Table format is significantly more efficient**: Stripping verbose YAML/JSON outputs reduced tokens by 41%
2. **Prompt caching is essential for tool-based architectures**: 90% cost reduction on cache reads makes repeated investigations economical
3. **Static system prompt enables cross-investigation caching**: Separating issue from prompt allows cache reuse across all investigations
4. **Parallel tool execution aligns with Claude's natural behavior**: Claude frequently requests multiple tools; executing them concurrently reduces wall-clock time
5. **Test concurrency is safe with proper isolation**: Using separate namespaces prevents conflicts

**Baseline Files Created**:
- `tmp/baseline-metrics-json-based.jsonl` - Original JSON-based approach
- `tmp/baseline-before-table-optimization.jsonl` - Initial tool implementation
- `tmp/baseline-table-only-before-caching.jsonl` - After table optimization
- `tmp/baseline-after-caching-before-parallel.jsonl` - After prompt caching

**Next Session**: Begin Milestone 2.5 (Vercel Provider Support) or proceed to Phase 2 (Observability Integration)

----

### 2025-10-06: Milestone 2.5 Complete - Vercel Provider Implementation
**Duration**: ~4 hours (conversation-based estimation)
**Status**: Milestone 2.5 Complete ✅ | Milestone 2.6 (Multi-Model Testing) Pending

**Completed PRD Items**:
- [x] Implement VercelProvider.toolLoop() using Vercel AI SDK - Evidence: src/core/providers/vercel-provider.ts:168-398
- [x] Provider-specific caching (Anthropic manual, OpenAI automatic) - Evidence: Lines 192-233
- [x] Test all kubectl tools with Vercel provider (Anthropic model) - Evidence: Integration tests passing (76.8s Manual, 98s Auto)
- [x] Cache metrics normalization across providers - Evidence: Lines 362-385
- [x] Integration tests updated and passing - Evidence: Test output shows 100% pass rate
- [x] Debug infrastructure working (full-context.md, metrics.jsonl) - Evidence: Files generated successfully
- [x] Document provider differences and findings - Evidence: Comprehensive code comments + conversation analysis

**Key Implementation Details**:
- Used `generateText()` with `stopWhen: stepCountIs(maxIterations)` for automatic multi-turn loops
- Anthropic caching via `providerOptions.anthropic.cacheControl` on last tool + system message
- Tool result extraction: `(part as any).output.value` structure discovered and implemented
- Cleaned up debug console outputs - all structured data goes to files now
- SDK version 5.0.60 includes token reporting fixes from PR #8945

**Performance Validation**:
- **Vercel SDK: 22 iterations in 132 seconds** (36% faster than Anthropic native)
- **Anthropic Native: 33 iterations in 206 seconds** (baseline comparison)
- **Efficiency gain**: Vercel reaches conclusions with 33% fewer iterations
- **Per-iteration speed**: Nearly identical (6.0s vs 6.2s)
- **Root cause of lower tokens**: Fewer iterations needed, not SDK bug

**Technical Discoveries**:
- Vercel SDK v5.0.60 includes token reporting fixes (PR #8945 merged Sept 26, 2025)
- Token reporting appears lower because fewer iterations needed (efficiency, not a bug)
- Per-step usage data not available - only cumulative totals from `result.usage`
- Full conversation context properly maintained in `result.response.messages`
- GitHub issues referenced: #8349 (cache tokens), #8795 (streaming), PR #8945 (fix)

**Additional Work Done**:
- Investigated token reporting discrepancies - confirmed SDK working correctly
- Added GitHub issue references (#8349, #8795, #8945) as code comments for future reference
- Removed unnecessary debug console outputs while keeping file-based debugging
- Validated tool output extraction with actual integration test data
- Created comprehensive performance comparison analysis

**Testing Scope**:
- ✅ Tested: Anthropic Claude Sonnet 4.5 via Vercel AI SDK
- ⏳ Pending: OpenAI GPT-5 and Google Gemini 2.5 Pro (Milestone 2.6)

**Next Session Priorities**:
1. **Milestone 2.6**: Test VercelProvider with OpenAI and Google Gemini models
2. OR **Phase 2 Start**: Begin Milestone 3 (Observability Configuration System)
3. Consider documenting findings in architecture decision record

---

### 2025-10-06: Milestone 2.6 Progress - OpenAI GPT-5 Testing Complete
**Duration**: ~2 hours (testing and analysis)
**Status**: Milestone 2.6 Substantially Complete (OpenAI ✅, Google Gemini pending)

**Completed Work**:
- [x] Tested VercelProvider.toolLoop() with OpenAI GPT-5 model
- [x] Verified all kubectl tools work correctly with GPT-5
- [x] Analyzed performance metrics vs Claude Sonnet baseline
- [x] Documented model-specific differences and recommendations
- [x] Saved baseline metrics to `tmp/baseline-vercel-gpt5.jsonl`

**Test Results Summary**:
- **Integration Tests**: Both Manual and Automatic modes passed ✅
- **Test Duration**: 136.4s total (Manual: 135.8s, Automatic: 125.4s)
- **Token Usage**: 30,684 total tokens (+84% vs Claude: 16,649)
- **Caching**: 20,224 cache read tokens (automatic via Vercel SDK)
- **Investigation Quality**: Equivalent to Claude - successfully identifies issues and generates valid fixes

**Key Findings**:
1. **GPT-5 uses more tokens**: +77% input, +129% output compared to Claude
2. **GPT-5 is slower per call**: Average 54.6s vs Claude 33s (+65%)
3. **End-to-end duration similar**: Concurrent test execution masks per-call differences
4. **Caching works effectively**: OpenAI's automatic caching shows 20K cache reads
5. **No code changes needed**: Tests work with any provider via configuration only

**Performance Tradeoffs**:
- **Claude Sonnet 4.5**: Best for cost/speed optimization (-46% tokens, -40% faster)
- **OpenAI GPT-5**: Suitable when OpenAI infrastructure already in use, equivalent quality

**Technical Implementation**:
- Changed `AI_PROVIDER=openai` in test script (line 144 of run-integration-tests.sh)
- Vercel SDK automatically handles OpenAI-specific caching and token reporting
- No provider-specific code changes required in remediation tool
- Debug infrastructure captured full conversation context and metrics

**Files Modified**:
- `tests/integration/infrastructure/run-integration-tests.sh` - Changed AI_PROVIDER to openai
- `prds/143-tool-based-remediation-observability.md` - Updated Milestone 2.6 with results

**Files Created**:
- `tmp/baseline-vercel-gpt5.jsonl` - OpenAI GPT-5 performance baseline
- `tmp/test-output-openai-gpt5.log` - Full test execution log
- `tmp/debug-ai/*_full-context.md` - 4 investigation conversation contexts

**Next Steps**:
1. Test with Google Gemini 2.5 Pro to complete Milestone 2.6
2. OR proceed to Phase 2 (Observability Integration) if Gemini testing deferred

---

### 2025-10-06: Milestone 2.6 Complete - Multi-Model Testing with Critical Findings
**Duration**: ~4 hours (OpenAI + Gemini testing and analysis)
**Status**: ✅ Milestone 2.6 COMPLETE - Phase 1 fully validated across 3 AI providers

**Completed Work**:
- [x] Tested VercelProvider.toolLoop() with OpenAI GPT-5 model (100% reliability)
- [x] Tested VercelProvider.toolLoop() with Google Gemini 2.5 Pro model (50% reliability - CRITICAL ISSUE)
- [x] Analyzed performance metrics for all 3 models (Claude, GPT-5, Gemini)
- [x] Documented comprehensive model comparison with production recommendations
- [x] Identified and documented Gemini reliability issue with debug evidence

**Test Results Summary**:

**OpenAI GPT-5**:
- ✅ All tests passed (100% reliability)
- Token usage: 30,684 total (+84% vs Claude)
- Performance: Slower but reliable
- Recommendation: Acceptable for production

**Google Gemini 2.5 Pro**:
- ⚠️ 50% test failure rate (1 of 2 tests failed)
- Critical issue: Model stops generating output mid-investigation (0 tokens)
- When working: Quality equivalent to Claude/GPT-5
- Recommendation: NOT RECOMMENDED for production

**Key Findings**:
1. **Claude Sonnet 4.5 remains best choice**: Optimal cost/speed/reliability balance
2. **OpenAI GPT-5 is viable alternative**: Higher cost but 100% reliable
3. **Gemini has critical reliability issue**: Unpredictable stopping behavior makes it unsuitable
4. **Vercel provider works universally**: Successfully tested with 3 different AI providers
5. **No code changes needed**: Provider switching via configuration only

**Technical Discoveries**:
- **Gemini failure mode**: Generates 0 output tokens after successful tool execution
- **Failure pattern**: Unpredictable - no clear trigger
- **Debug evidence**: Full conversation logs show premature stopping after validation
- **Impact**: Cannot trust Gemini for critical workflows requiring guaranteed completion

**Files Modified**:
- `tests/integration/infrastructure/run-integration-tests.sh` - Updated AI_PROVIDER for testing
- `prds/143-tool-based-remediation-observability.md` - Updated Milestone 2.6 with all findings

**Files Created**:
- `tmp/baseline-vercel-gpt5.jsonl` - OpenAI GPT-5 baseline metrics
- `tmp/baseline-vercel-gemini.jsonl` - Google Gemini partial metrics
- `tmp/test-output-openai-gpt5.log` - OpenAI test execution log
- `tmp/test-output-google-gemini.log` - Gemini test execution log
- `tmp/debug-ai/2025-10-06T180613_*` - Gemini failure debug evidence
- `tmp/debug-ai/2025-10-06T180631_*` - Gemini success debug evidence

**Phase 1 Completion Assessment**:
- ✅ **Milestone 1**: kubectl_api_resources tool implementation - COMPLETE
- ✅ **Milestone 2**: Complete kubectl tool migration to toolLoop() - COMPLETE
- ✅ **Milestone 2.5**: Vercel Provider implementation - COMPLETE
- ✅ **Milestone 2.6**: Multi-model provider testing - COMPLETE

**Phase 1 Final Status**: ✅ **COMPLETE** - Tool-based remediation architecture fully validated
- Architecture migrated successfully from JSON-based to SDK-native toolLoop()
- Both Anthropic and Vercel providers working correctly
- 3 AI models tested: Claude (recommended), GPT-5 (acceptable), Gemini (not recommended)
- Performance optimized (+50% tokens vs baseline, but with caching and parallel execution)
- All integration tests passing for recommended providers
- Production-ready for Claude Sonnet 4.5 and OpenAI GPT-5

**Next Session**: Begin Phase 2 (Observability Integration) - Milestone 3: Observability Configuration System

----

### 2025-10-06: PRD Split - Phase 2 Extracted to PRD #150
**Duration**: Documentation refactoring
**Status**: PRD organizational improvement

**Context**:
- Phase 1 (Tool-Based Architecture) complete and validated
- Phase 2 (Observability Integration) recognized as separate scope
- User requested split for focused implementation

**Changes Made**:
- Created [PRD #150 - Remediation Observability Integration](https://github.com/vfarcic/dot-ai/blob/main/prds/150-remediation-observability-integration.md)
- Extracted Phase 2 milestones (observability tools) to new PRD
- Updated PRD #143 to focus on Phase 1 (kubectl tools architecture)
- Added cross-references between PRDs

**Rationale**:
- Phase 1 has different scope than Phase 2 (architecture vs. features)
- Observability tool selection needs its own discovery process
- Cleaner separation of concerns for issue tracking
- Phase 1 provides foundation for Phase 2 to build upon

**PRD #143 Status**: Phase 1 complete - architecture established and validated

**Next Session**: Work on PRD #150 (Observability Integration) or new task

----

### 2025-10-06: Design Decision - Extended Metrics for Model Comparison
**Duration**: Planning discussion
**Status**: Decision documented

**Context**:
- Planning blog post comparing Claude, GPT-5, and Gemini performance
- Current metrics insufficient for deep behavioral analysis
- Need to explain WHY models differ (not just token counts)

**Decision Made**:
- Extend `metrics.jsonl` with behavioral, quality, and efficiency metrics
- Treat as living list - add more as we discover value
- Prioritized implementation: High priority (8 metrics), Medium priority (6 metrics), Future extensions

**Key Metrics Added**:
- `iterationCount` - Loop vs one-shot behavior
- `toolCallCount` - Investigation thoroughness
- `status` / `completionReason` - Reliability and failure modes
- `validJson` - Quality metric
- `modelVersion` - Specific model tracking
- `uniqueToolsUsed` - Investigation strategy
- `cacheHitRate` - Performance optimization

**Rationale**:
- Gemini 50% failure needs behavioral explanation (not just "failed")
- GPT +84% token usage needs justification analysis
- Blog post readers need actionable comparison data

**Documentation**:
- Added Decision 5 to PRD Decision Log with comprehensive rationale
- Emphasized this is a living list that evolves with needs
- Documented implementation notes and file locations

**Next Steps**: Implement high-priority metrics in `src/tools/remediate.ts`

----

### 2025-10-07: Decision 5 Implementation - Extended Metrics Collection Complete
**Duration**: ~2 hours (based on conversation timestamps)
**Status**: ✅ Extended metrics infrastructure COMPLETE

**Completed Work**:
- [x] Extended `AgenticResult` interface with new metrics fields (flat structure)
- [x] Implemented `logMetrics()` function accepting result object
- [x] Created `createAndLogAgenticResult()` helper to reduce provider duplication
- [x] Updated both AnthropicProvider and VercelProvider to use helper
- [x] Added manual annotation placeholders for post-test analysis
- [x] Renamed `provider` field to `sdk` for clarity (distinguishes SDK from model)
- [x] Configured test script defaults (native Anthropic SDK, env var override support)

**Key Architectural Decisions**:
1. **Flat metrics structure**: Rejected nested "extendedMetrics" wrapper - made all metrics first-class fields in AgenticResult
2. **Shared helper function**: Created `createAndLogAgenticResult()` to eliminate ~50 lines of duplication across providers
3. **SDK vs Provider naming**: Renamed field from `provider` to `sdk` to distinguish SDK implementation (vercel/anthropic) from AI model
4. **Test defaults**: Made native Anthropic the default SDK to match CI configuration, support env var override (`AI_PROVIDER_SDK=vercel`)

**Extended Metrics Implemented (Decision 5)**:
- `iterationCount` - Number of conversation turns in investigation loop
- `toolCallCount` - Total tools executed (investigation thoroughness metric)
- `uniqueToolsUsed` - Array of distinct tool names called
- `status` - "success" | "failed" | "timeout" | "parse_error" (reliability tracking)
- `completionReason` - "investigation_complete" | "max_iterations" | "parse_failure" | "model_stopped" | "error"
- `modelVersion` - Specific model ID (e.g., "claude-sonnet-4-5-20250929")
- `cacheHitRate` - Percentage of tokens from cache (calculated from cacheRead/input ratio)

**Manual Annotation Placeholders**:
- `manualNotes` - General observations after test review
- `failureReason` - Root cause of failures
- `qualityIssues` - List of quality problems discovered
- `comparisonNotes` - Cross-model comparison insights

**Critical Discovery - Token Reporting Discrepancy**:
- Native Anthropic SDK reports **2-3x more tokens** than Vercel SDK
- **Test Results**:
  - Native Anthropic: 11,493 input / 1,293 output tokens (1st investigation)
  - Vercel SDK: 3,682 input / 440 output tokens (1st investigation)
  - **Difference**: 3.1x more input, 2.9x more output via native SDK
- **Same model** (claude-sonnet-4-5-20250929), same operations, similar durations
- Confirms Vercel SDK **underreports ~70% of tokens** compared to native Anthropic SDK
- Aligns with code comment in vercel-provider.ts:371 about token discrepancy
- **Impact**: Cannot trust Vercel SDK token metrics for cost estimation

**Files Modified**:
- `src/core/ai-provider.interface.ts` - Extended `AgenticResult` interface with new fields
- `src/core/providers/provider-debug-utils.ts` - Updated `logMetrics()` signature, added `createAndLogAgenticResult()` helper
- `src/core/providers/anthropic-provider.ts` - Updated all toolLoop exit points to use helper (3 locations)
- `src/core/providers/vercel-provider.ts` - Changed `getProviderType()` to return 'vercel', updated to use helper (2 locations)
- `tests/integration/infrastructure/run-integration-tests.sh` - Added env var defaults with logging

**Metrics Storage**:
- Location: `tests/integration/metrics/` (committed to Git for historical comparison)
- Format: Compact JSONL (one JSON object per line, no commas between objects)
- Baseline saved: `anthropic-native-sonnet-20251007-001434.jsonl`
- Purpose: Enable regression detection, historical comparison, documentation

**Test Validation**:
- ✅ Build succeeded with no errors
- ✅ Integration tests passed with native Anthropic SDK
- ✅ Metrics properly captured with `sdk: "anthropic"` field
- ✅ Manual annotation placeholders present in output
- ✅ Extended metrics populated correctly

**Next Session Priorities**:
1. Run comprehensive tests across all providers (Anthropic, Vercel, OpenAI, Google Gemini)
2. Populate manual annotation fields after analyzing test results
3. Complete full model comparison analysis (6 models total per Decision 6)
4. Investigate Vercel SDK token reporting discrepancy (consider filing GitHub issue)
5. Use extended metrics to create blog post comparing AI model performance

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/vfarcic/dot-ai'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

143-tool-based-remediation-observability.md•71.7 KiB

# PRD: Tool-Based Remediation with Observability Data Sources

**Created**: 2025-10-06
**Status**: Complete
**Owner**: Viktor Farcic
**Last Updated**: 2025-10-07
**Completed**: 2025-10-07
**GitHub Issue**: [#143](https://github.com/vfarcic/dot-ai/issues/143)
**Priority**: High
**Complexity**: High

---

## Executive Summary

Migrate the remediation tool from JSON-based kubectl command patterns to SDK tool-based architecture (`toolLoop()`). This establishes a clean, extensible foundation for investigation tools.

**Note**: Phase 2 (Observability Integration) has been moved to a separate PRD for focused implementation. See [PRD #150 - Remediation Observability Integration](https://github.com/vfarcic/dot-ai/blob/main/prds/150-remediation-observability-integration.md).

---

## Problem Statement

### Current Limitations

**1. JSON-Based Architecture**
- Remediation uses JSON `dataRequests` pattern that AI returns, then we parse and execute
- Manual loop management (400+ lines of iteration logic)
- Rigid pattern that's hard to extend with new data sources
- Cannot leverage SDK's native tool-calling capabilities

**2. Limited Data Sources**
- Only has access to Kubernetes API data (kubectl commands)
- No access to metrics (CPU, memory, custom metrics)
- No access to observability platforms (Prometheus, DataDog, Grafana)
- Cannot correlate cluster state with performance/resource metrics

**3. Incomplete Root Cause Analysis**
- AI identifies "high memory usage" but cannot see actual metrics
- Detects pod restarts but cannot analyze resource saturation trends
- Recommends scaling but lacks utilization data to validate
- Missing context for performance degradation issues

### Impact

Users must:
- Manually check Prometheus/DataDog after AI analysis
- Correlate metrics with kubernetes events themselves
- Make scaling/remediation decisions without complete picture
- Run multiple tools to get full understanding

---

## Solution Overview

**Two-Phase Approach:**

### Phase 1: Migrate to Tool-Based Architecture
**CRITICAL**: This is a complete architectural replacement, not incremental changes. The SDK's `toolLoop()` replaces the entire 400+ line manual investigation loop:
- **SDK manages conversation history** (replaces `{previousIterations}` injection)
- **SDK manages iterations** (replaces manual while loop in `conductInvestigation()`)
- **SDK handles tool execution** (replaces JSON parsing of `dataRequests`)
- **SDK determines completion** (replaces `investigationComplete` boolean field)

Implementation:
- Implement kubectl tools: `kubectl_get`, `kubectl_describe`, `kubectl_logs`, `kubectl_events`, `kubectl_top`
- Use `AIProvider.toolLoop()` method (already implemented in PRD #136 for Anthropic)
- Remove manual JSON parsing and loop management (~400 lines)
- Remove JSON response format from investigation prompt
- Focus on Anthropic provider first, Vercel support later
- Validate tool architecture works for remediation use case

### Phase 2: Add Observability Data Sources (Moved to PRD #150)
This phase has been moved to a separate PRD for focused implementation:
- See [PRD #150 - Remediation Observability Integration](https://github.com/vfarcic/dot-ai/blob/main/prds/150-remediation-observability-integration.md)
- Will extend remediation with observability tools following the architecture established in Phase 1
- Tools and priorities to be determined during PRD #150 implementation

---

## User Journey

### Before (Current State)
```
1. User reports: "My pods keep restarting"
2. AI investigates using kubectl only
3. AI finds: "Pods restarting due to OOMKilled"
4. AI recommends: "Increase memory limit"
5. User manually checks Prometheus to see actual memory usage
6. User correlates data and implements fix
```

### After (Phase 1: Tool-Based)
```
1. User reports: "My pods keep restarting"
2. AI calls kubectl_get, kubectl_describe, kubectl_events tools
3. AI finds: "Pods restarting due to OOMKilled"
4. AI recommends: "Increase memory limit"
5. User manually checks Prometheus to see actual memory usage
6. User correlates data and implements fix

(Same outcome, cleaner architecture)
```

### After (Phase 2: Observability Integration)
```
1. User reports: "My pods keep restarting"
2. AI calls kubectl_describe tool → sees OOMKilled events
3. AI calls prometheus_range_query tool → sees memory usage at 95% consistently
4. AI calls prometheus_query tool → sees memory requests set to 128Mi
5. AI analyzes: "Memory usage peaks at 450Mi, limit is 512Mi, but requests too low causing scheduling issues"
6. AI recommends: "Set memory request=256Mi, limit=512Mi based on actual usage patterns"
7. User applies fix with complete confidence

(Complete analysis with metrics-driven recommendations)
```

---

## Technical Approach

### Phase 1: Tool Migration Architecture

**Current Flow (JSON-Based - ~400 lines)**:
```typescript
// Manual iteration loop in conductInvestigation()
while (currentIteration <= maxIterations && !session.investigationComplete) {
  // Inject cumulative previousIterations (grows with each iteration)
  const previousIterationsJson = JSON.stringify(
    session.iterations.map(iter => ({
      step: iter.step,
      analysis: iter.aiAnalysis,
      dataRequests: iter.dataRequests,
      gatheredData: iter.gatheredData
    })), null, 2
  );

  // Inject clusterApiResources (415+ resources)
  const clusterApiResources = await executeKubectl(['api-resources']);

  // Replace template variables
  const investigationPrompt = promptTemplate
    .replace('{previousIterations}', previousIterationsJson)
    .replace('{clusterApiResources}', clusterApiResources)
    .replace('{currentIteration}', currentIteration.toString());

  // AI responds with JSON specifying what data it needs
  const aiResponse = await aiProvider.sendMessage(investigationPrompt);
  const parsedResponse = JSON.parse(aiResponse.content);

  // Manually execute each data request
  for (const request of parsedResponse.dataRequests) {
    const output = await executeKubectl([request.type, request.resource]);
    gatheredData.push(output);
  }

  // Check if AI set investigationComplete flag
  if (parsedResponse.investigationComplete) {
    session.investigationComplete = true;
    break;
  }

  currentIteration++;
}
```

**New Flow (Tool-Based - ~50 lines)**:
```typescript
// Define tools once
const tools = [
  {
    name: "kubectl_get",
    description: "Get Kubernetes resources",
    inputSchema: {
      type: "object",
      properties: {
        resource: { type: "string", description: "Resource type (pods, services, etc)" },
        namespace: { type: "string", description: "Namespace" }
      }
    }
  },
  {
    name: "get_cluster_resources",
    description: "Get list of all available Kubernetes API resources in cluster",
    inputSchema: { type: "object", properties: {} }
  }
];

// SDK handles EVERYTHING - iterations, history, completion
const result = await aiProvider.toolLoop({
  systemPrompt: investigationPrompt, // No {previousIterations} or {currentIteration} needed!
  tools: tools,
  toolImplementations: {
    kubectl_get: async (args) => await executeKubectl(['get', args.resource, '-n', args.namespace]),
    get_cluster_resources: async () => await executeKubectl(['api-resources'])
  },
  maxIterations: 20
});

// result.finalMessage contains the analysis
// result.allToolCalls contains complete history
```

**Key Simplifications**:
- No manual loop management
- No `{previousIterations}` injection (SDK maintains conversation history)
- No JSON parsing of AI responses
- No `investigationComplete` flag (SDK stops when AI stops calling tools)
- No cumulative data structure management
- Prompt simplified by ~40 lines (remove JSON schema documentation)

### Phase 2: Observability Tools

**Configuration (Server Level)**:
```typescript
// Server config
{
  "observability": {
    "prometheus": {
      "enabled": true,
      "endpoint": "http://prometheus:9090"
    },
    "datadog": {
      "enabled": false,
      "apiKey": "...",
      "appKey": "..."
    }
  }
}
```

**Tool Definitions**:
```typescript
const observabilityTools = [
  {
    name: "prometheus_query",
    description: "Query Prometheus metrics (instant query)",
    input_schema: {
      type: "object",
      properties: {
        query: { type: "string", description: "PromQL query" },
        time: { type: "string", description: "Evaluation timestamp (optional)" }
      },
      required: ["query"]
    }
  },
  {
    name: "prometheus_range_query",
    description: "Query Prometheus metrics over time range",
    input_schema: {
      type: "object",
      properties: {
        query: { type: "string", description: "PromQL query" },
        start: { type: "string", description: "Start time" },
        end: { type: "string", description: "End time" },
        step: { type: "string", description: "Query resolution step" }
      },
      required: ["query", "start", "end"]
    }
  }
];
```

**Tool Implementations**:
```typescript
const toolImplementations = {
  prometheus_query: async (args) => {
    const config = getServerConfig();
    if (!config.observability.prometheus.enabled) {
      return { error: "Prometheus not configured" };
    }

    const url = `${config.observability.prometheus.endpoint}/api/v1/query`;
    const response = await fetch(url, {
      method: 'POST',
      body: JSON.stringify({ query: args.query, time: args.time })
    });
    return await response.json();
  },

  prometheus_range_query: async (args) => {
    // Similar implementation for range queries
  }
};
```

---

## Success Criteria

### Phase 1: Tool Migration
- [x] Remediation uses `toolLoop()` instead of JSON-based loops ✅
- [x] All existing kubectl operations work as tools ✅
- [x] Investigation quality unchanged or improved ✅
- [x] Code is simpler (remove 400+ lines of manual loop logic) ✅
- [x] All integration tests passing ✅
- [x] Token usage optimized (+50% vs JSON baseline through table format + prompt caching) ✅

### Phase 2: Observability Integration (Moved to PRD #150)
See [PRD #150 - Remediation Observability Integration](https://github.com/vfarcic/dot-ai/blob/main/prds/150-remediation-observability-integration.md) for success criteria.

---

## Milestones

### Milestone 1: Validate Anthropic toolLoop() & Design kubectl Tools (Phase 1 Start) ✅ COMPLETED
**Goal**: Verify AnthropicProvider toolLoop() works and design kubectl tool architecture

**Infrastructure Validation (Anthropic Only)**:
- [x] Verify `AIProvider.toolLoop()` interface exists (from PRD #136) ✅
- [x] Verify AnthropicProvider implements toolLoop() correctly ✅ (verified in src/core/providers/anthropic-provider.ts:134-256)
- [x] Test AnthropicProvider with sendMessageWithTools() to confirm functionality ✅
- [x] Document any limitations discovered during testing ✅

**Note**: VercelProvider toolLoop() is NOT implemented (intentionally - see vercel-provider.ts:157-163). Vercel support will be added later in Phase 1.

**Tool Design**:
- [x] Design kubectl tool schema (granularity, parameters, error handling) ✅
- [x] Create tool definition format matching Anthropic.Tool interface ✅
- [x] Design tool registration system for remediation ✅
- [x] Plan tool implementation architecture ✅

**First Implementation**:
- [x] Implement `kubectl_api_resources` tool (replaces `{clusterApiResources}` injection) ✅
- [x] Test with existing remediation integration tests ✅
- [x] Confirm tool architecture is sound before building all kubectl tools ✅

**Findings & Limitations**:
- **Hybrid Approach Works**: Using `sendMessageWithTools()` with single tool while still supporting JSON `dataRequests` for other operations
- **Metrics Logging**: Added proper metrics tracking to `sendMessageWithTools()` method with descriptive operation names
- **Tool Placement**: Created shared `src/core/kubectl-tools.ts` module for cross-feature reuse
- **Test Results**: All integration tests pass (16 AI calls vs 19 baseline - 3 fewer calls with tool available)
- **Tool Usage**: AI did not call `kubectl_api_resources` tool in test scenario (crashing pod) as it already knew which resources to query. Tool is available and functional when AI needs resource discovery
- **Token Impact**: No significant token increase observed with tool available but unused

### Milestone 2: Complete kubectl Tool Migration (Phase 1 - Anthropic) ✅ SUBSTANTIALLY COMPLETE
**Goal**: Replace entire manual investigation loop with toolLoop()
- [x] All kubectl tools implemented: `kubectl_get`, `kubectl_describe`, `kubectl_logs`, `kubectl_events`, `kubectl_api_resources` ✅
- [x] Additional tools: `kubectl_patch_dryrun`, `kubectl_apply_dryrun`, `kubectl_delete_dryrun`, `kubectl_get_crd_schema` ✅
- [x] Replace `conductInvestigation()` with single `toolLoop()` call ✅
- [x] Update `prompts/remediate-investigation.md`: ✅
  - Removed `{previousIterations}` template variable
  - Removed `{currentIteration}` template variable
  - Removed JSON response format
  - Removed "Investigation Workflow Example" JSON schemas
  - Simplified to tool-calling instructions only
- [x] Remove old JSON-based loop code (~400 lines in remediate.ts) ✅
- [x] All integration tests passing ✅
- [x] Debug logging infrastructure added (operation parameter, per-iteration logging) ✅
- [x] Performance optimized to viable production levels ✅ See optimization results below

### Milestone 2.5: Add Vercel Provider Support (Phase 1 Complete) ✅
**Goal**: Implement toolLoop() for VercelProvider
- [x] Implement `VercelProvider.toolLoop()` using Vercel AI SDK tool APIs ✅ (src/core/providers/vercel-provider.ts:168-398)
- [x] Test all kubectl tools work with Vercel provider (Anthropic model) ✅ (Integration tests: Manual 76.8s, Automatic 98s - all passing)
- [x] Update integration tests to validate Vercel provider ✅ (Both Manual and Automatic mode workflows validated)
- [x] Document provider-specific differences ✅ (Caching approaches, token metrics normalization, SDK version notes)

**Note**: Tested with Anthropic Claude Sonnet via Vercel SDK only. OpenAI and Google Gemini testing pending (Milestone 2.6).

### Milestone 2.6: Multi-Model Provider Testing (Phase 1 Final Validation) ✅ COMPLETE (Extended Scope)
**Goal**: Validate Vercel provider works across diverse AI models from different providers

**Big 3 Testing (Complete)** ✅:
- [x] Test VercelProvider.toolLoop() with OpenAI GPT-5 model ✅
- [x] Test VercelProvider.toolLoop() with Google Gemini 2.5 Pro model ✅ (Reliability issues identified)
- [x] Document any model-specific differences or limitations ✅
- [x] Verify caching behavior works correctly for each provider ✅
- [x] Update integration tests if provider-specific handling needed ✅ (No changes needed - tests work with any provider)

**Extended Testing (Moved to PRD #151)** ✅:
- Extended model testing moved to [PRD #151 - AI Model Comparison for MCP Platform](https://github.com/vfarcic/dot-ai/blob/main/prds/151-ai-model-comparison-mcp-platform.md)
- Comprehensive testing across ALL MCP AI interactions (not just remediation)
- 6 models total: Claude, GPT-5, Gemini + xAI Grok-4, Mistral Large, DeepSeek-R1

**OpenAI GPT-5 Test Results** (2025-10-06):
- ✅ All integration tests pass (Manual: 135.8s, Automatic: 125.4s)
- **Performance vs Claude**: +84% more tokens, +65% slower per call, but equivalent end-to-end test duration
- **Token Usage**: 25,602 input + 5,082 output = 30,684 total tokens (vs Claude: 16,649 total)
- **Caching**: 20,224 cache read tokens (effective automatic caching via Vercel SDK)
- **Investigation Steps**: 30 total steps across 2 tests (8, 7, 4, 11 per run)
- **Quality**: Equivalent remediation quality - both models successfully identify issues and generate valid fixes
- **Baseline saved**: `tmp/baseline-vercel-gpt5.jsonl`

**Model Comparison Summary**:
| Metric | Claude Sonnet 4.5 | OpenAI GPT-5 | Difference |
|--------|------------------|--------------|------------|
| Total Tokens | 16,649 | 30,684 | +84% |
| Input Tokens | 14,429 | 25,602 | +77% |
| Output Tokens | 2,220 | 5,082 | +129% |
| Avg Call Duration | 33s | 54.6s | +65% |
| Cache Reads | 14,400 | 20,224 | +40% |

**Recommendation**: Both models work effectively. Claude Sonnet preferred for cost/speed optimization (-46% tokens, -40% faster). GPT-5 suitable when OpenAI infrastructure is already in use.

**Google Gemini 2.5 Pro Test Results** (2025-10-06):
- ⚠️ **CRITICAL RELIABILITY ISSUE IDENTIFIED** - 50% test failure rate
- **Test Outcome**: 1 of 2 integration tests failed
- **Failure Mode**: Model stopped generating output mid-investigation (0 output tokens after successful tool calls)
- **Successful Test**: Automatic mode passed - Generated proper JSON, executed remediation successfully
- **Failed Test**: Manual mode failed - Completed 6 iterations with 5 tool calls, then produced zero output tokens
- **Token Usage** (incomplete data due to failure): 13,467 input + 662 output tokens across 3 calls
- **Investigation Quality**: When working, quality is equivalent to Claude/GPT-5
- **Root Cause**: Gemini exhibits inconsistent stopping behavior - sometimes completes properly, sometimes stops prematurely
- **Debug Evidence**: `tmp/debug-ai/2025-10-06T180613_*` shows conversation ending abruptly after `kubectl_patch_dryrun` validation
- **Baseline saved**: `tmp/baseline-vercel-gemini.jsonl` (partial data)

**Gemini Reliability Analysis**:
- **Failure Pattern**: Unpredictable - no clear pattern to when model will stop vs continue
- **Impact**: Cannot rely on consistent JSON output in production workflows
- **Risk**: 50% failure rate makes Gemini unsuitable for critical remediation tasks
- **Behavior**: Model completes tool calls correctly but fails to generate final analysis
- **Comparison**: Claude and GPT-5 showed 100% reliability across all tests

**Model Comparison Summary (Updated with Gemini)**:
| Metric | Claude Sonnet 4.5 | OpenAI GPT-5 | Google Gemini 2.5 Pro | Notes |
|--------|------------------|--------------|----------------------|-------|
| Reliability | 100% | 100% | 50% | Gemini failed 1 of 2 tests |
| Total Tokens | 16,649 | 30,684 | 13,467 | Gemini data incomplete |
| Avg Call Duration | 33s | 54.6s | 25.4s | Gemini appears faster when working |
| Investigation Quality | Excellent | Excellent | Excellent | When Gemini works, quality is good |
| **Production Ready** | **✅ Yes** | **✅ Yes** | **❌ No** | Reliability issues |

**Final Recommendation**:
- **Claude Sonnet 4.5**: Recommended for production (best cost/speed balance, 100% reliability)
- **OpenAI GPT-5**: Acceptable for production (higher cost, 100% reliability)
- **Google Gemini 2.5 Pro**: NOT RECOMMENDED for production remediation workflows due to unpredictable stopping behavior and 50% failure rate

### Phase 2 Milestones (Moved to PRD #150)

The following milestones have been moved to [PRD #150 - Remediation Observability Integration](https://github.com/vfarcic/dot-ai/blob/main/prds/150-remediation-observability-integration.md):

- **Milestone 1**: Tool Selection and Design
- **Milestone 2**: Configuration System
- **Milestone 3**: First Observability Tool Implementation
- **Milestone 4**: AI Correlation Enhancement
- **Milestone 5**: Additional Tools (If Applicable)

See PRD #150 for detailed milestone descriptions and tasks.

---

## Dependencies

### External Dependencies
- [x] `toolLoop()` interface defined in AIProvider (PRD #136)
- [x] AnthropicProvider toolLoop() implementation (PRD #136) ✅ Verified in src/core/providers/anthropic-provider.ts:134-256
- [x] VercelProvider toolLoop() implementation ✅ Complete (Milestone 2.5)

### Internal Dependencies
- [x] AIProvider interface with tool support (PRD #136)
- [x] Existing remediation investigation logic (to migrate)

### Potential Blockers
- Tool performance overhead (monitor token usage)
- Backward compatibility if migration affects existing users
- Provider-specific tool behavior differences

---

## Risks & Mitigations

| Risk | Impact | Mitigation |
|------|--------|-----------|
| Tool overhead increases tokens significantly | High | Monitor metrics, use Phase 1 to validate, fallback to JSON if needed |
| Breaking changes to existing remediation behavior | High | Comprehensive integration tests, phased rollout, feature flag |
| Prometheus/DataDog API failures impact remediation | Medium | Graceful degradation - continue with kubectl if observability unavailable |
| PromQL query injection vulnerabilities | High | Query validation, parameter binding, sandboxing |
| Configuration complexity overwhelms users | Medium | Sensible defaults, auto-detection, clear documentation |
| Provider differences in tool handling | Medium | Abstract tool layer, test with both providers, document differences |

---

## Open Questions

1. ~~**Tool Granularity**: Should we have one `kubectl` tool with operation parameter, or separate tools per operation?~~ **RESOLVED (2025-10-06)**: Separate tools per operation for better AI understanding and targeted descriptions.

2. **Error Handling**: How should AI handle tool failures (unreachable Prometheus, invalid queries)?

3. **Query Limits**: Should we limit PromQL query complexity or time ranges?

4. **Auto-Detection**: Should server auto-detect Prometheus/DataDog and enable tools automatically?

5. ~~**Tool Selection**: How does AI decide when to use metrics vs. just kubectl? Update investigation prompt?~~ **RESOLVED (2025-10-06)**: AI naturally selects tools based on descriptions. No special prompt logic needed - just provide tools and let SDK handle selection.

6. ~~**Backward Compatibility**: Do we need to support JSON-based approach during migration?~~ **RESOLVED (2025-10-06)**: No backward compatibility needed. Complete replacement is cleaner and simpler. Use feature flag if rollback needed.

7. ~~**Provider Parity**: Do we need identical tool behavior across Anthropic and Vercel, or can they differ?~~ **RESOLVED (2025-10-06)**: Focus Anthropic first (Milestone 1-2), add Vercel support after core migration validated (Milestone 2.5). Tool behavior should be identical, but implementation timing differs.

---

## Out of Scope

### Deferred to Future PRDs
- [ ] Additional observability platforms (Grafana, New Relic, Splunk)
- [ ] Custom metrics endpoints beyond Prometheus/DataDog
- [ ] Alerting integration (create alerts based on remediation findings)
- [ ] Cost optimization metrics (AWS Cost Explorer, GCP Billing)
- [ ] APM integration (distributed tracing, span analysis)

---

## Decision Log

### Decision 1: toolLoop() is Complete Replacement, Not Incremental Migration
**Date**: 2025-10-06
**Decision**: The SDK's `toolLoop()` replaces the ENTIRE 400+ line manual investigation loop, not just individual data injections.

**Context**: Initially considered migrating data injections (`{previousIterations}`, `{clusterApiResources}`) one at a time while keeping manual loop. During design discussion, realized SDK manages all aspects of the loop.

**What SDK Replaces**:
- **Conversation history management** → Replaces `{previousIterations}` injection and cumulative data structures
- **Iteration management** → Replaces manual while loop in `conductInvestigation()`
- **Tool execution** → Replaces JSON parsing of `dataRequests` array
- **Completion detection** → Replaces `investigationComplete` boolean field

**Impact**:
- Much larger code simplification (~400 lines removed vs. ~50 lines added)
- Prompt simplification (~40 lines of JSON schema documentation removed)
- Entire investigation becomes single `toolLoop()` call
- No need for incremental migration - complete architectural shift

**Alternatives Considered**:
- Incremental migration: Keep manual loop, replace data injections with tools → Rejected as it adds complexity without full benefits
- Hybrid approach: Use toolLoop() for some operations, JSON for others → Rejected as inconsistent and harder to maintain

**Files Affected**:
- `src/tools/remediate.ts` - Replace `conductInvestigation()` entirely
- `prompts/remediate-investigation.md` - Remove JSON response format, template variables

---

### Decision 2: Focus on Anthropic Provider First, Vercel Later
**Date**: 2025-10-06
**Decision**: Implement and validate with AnthropicProvider only in Phase 1 Milestones 1-2. Add VercelProvider support in Milestone 2.5 after core architecture validated.

**Context**: Initial PRD assumed both providers would be implemented simultaneously. During design discussion, identified that VercelProvider toolLoop() is not yet implemented.

**Rationale**:
- Reduce complexity and risk during architectural migration
- Validate tool-based approach works before multi-provider support
- AnthropicProvider toolLoop() already fully implemented (verified in code review)
- VercelProvider requires new implementation work
- Existing remediation tests can validate Anthropic-only approach first

**Impact**:
- Milestone 1 simplified - only validate Anthropic
- Milestone 2 focuses on core migration with one provider
- New Milestone 2.5 added for Vercel support
- Phase 1 timeline extended slightly but risk reduced

**Alternatives Considered**:
- Implement both simultaneously → Rejected as higher risk, more complex debugging
- Skip Vercel entirely → Rejected as violates multi-provider architecture goals
- Implement Vercel first → Rejected as Anthropic already has working implementation

**Files Affected**:
- PRD Milestone 1: Updated to Anthropic-only validation
- PRD Milestone 2.5: New milestone for Vercel implementation
- `src/core/providers/vercel-provider.ts` - Will need toolLoop() implementation

---

### Decision 3: Remove JSON Response Format Entirely from Investigation Prompt
**Date**: 2025-10-06
**Decision**: Eliminate all JSON response format requirements from `prompts/remediate-investigation.md` including `dataRequests` array and `investigationComplete` field.

**Context**: Current prompt (lines 36-55) specifies AI must return JSON with `dataRequests` array and `investigationComplete` boolean. With toolLoop(), SDK handles this natively.

**Rationale**:
- SDK's tool calling replaces JSON `dataRequests` pattern
- SDK determines completion automatically (AI stops calling tools)
- Prompt can focus on investigation guidance, not response format
- Reduces prompt complexity and token usage
- Eliminates potential for JSON parsing errors

**What Gets Removed**:
- Lines 36-55: JSON response format specification
- Lines 138-179: "Investigation Workflow Example" JSON schemas
- Field definitions for `dataRequests`, `investigationComplete`, etc.

**What Remains**:
- Investigation approach guidance
- Data request precision guidelines
- Root cause analysis criteria
- Tool usage instructions (added)

**Impact**:
- Prompt simplified by ~40 lines
- Clearer focus on investigation strategy vs. technical format
- Better token efficiency
- More maintainable prompt

**Alternatives Considered**:
- Keep JSON format as "backup" → Rejected as confusing and contradictory
- Gradual migration of prompt → Rejected as incomplete migration adds complexity

**Files Affected**:
- `prompts/remediate-investigation.md` - Major simplification

---

### Decision 4: Separate Tools Per Operation (Not Single kubectl Tool)
**Date**: 2025-10-06
**Decision**: Implement separate tools for each kubectl operation (`kubectl_get`, `kubectl_describe`, etc.) rather than one `kubectl` tool with operation parameter.

**Rationale**:
- **Better AI understanding**: Each tool has focused, specific description
- **Targeted documentation**: Tool descriptions can explain exactly what data each provides
- **Easier debugging**: Clear separation of tool calls in logs
- **Better error handling**: Operation-specific error messages and validation
- **SDK best practices**: AI SDKs work better with granular, focused tools

**Example**:
```typescript
// ✅ GOOD - Separate tools
{ name: "kubectl_get", description: "List Kubernetes resources with status" }
{ name: "kubectl_describe", description: "Get detailed resource configuration and events" }
{ name: "kubectl_logs", description: "Fetch container logs for debugging" }

// ❌ BAD - Single tool with operation parameter
{
  name: "kubectl",
  description: "Execute kubectl commands",
  parameters: { operation: "get|describe|logs|...", ... }
}
```

**Impact**:
- More tool definitions (5-7 tools vs. 1)
- Clearer tool purposes and better AI selection
- Easier to add operation-specific validation

**Alternatives Considered**:
- Single kubectl tool → Rejected as too generic, harder for AI to understand when to use
- Group by resource type → Rejected as kubectl operations are more natural grouping

**Files Affected**:
- `src/tools/remediate.ts` - Tool definitions

---

### Decision 5: Extend Metrics Logging for Model Comparison Analysis
**Date**: 2025-10-06
**Decision**: Enhance `metrics.jsonl` logging to capture comprehensive behavioral, quality, and efficiency metrics for AI model comparison. Treat this as an evolving list - add new metrics as we identify value.

**Context**: Planning blog post comparing Claude Sonnet 4.5, OpenAI GPT-5, and Google Gemini 2.5 Pro performance in remediation workflows. Current metrics (inputTokens, outputTokens, durationMs, provider, operation) insufficient for deep analysis of model behavior differences.

**Rationale**:
- **Multi-model testing complete**: Phase 1 tested 3 models, identified significant reliability differences (Claude/GPT 100%, Gemini 50%)
- **Need behavioral insights**: Token counts don't explain WHY Gemini failed (stopped generating mid-investigation)
- **Cost analysis required**: Need to justify Claude recommendation vs. GPT's higher token usage
- **Quality metrics missing**: Can't quantify investigation thoroughness, efficiency, or reliability
- **Blog post value**: Concrete metrics make model comparison actionable for users

**Metrics to Add (Priority Order)**:

**High Priority (Implement First)**:
1. **`iterationCount`** - Number of conversation turns (loop vs one-shot behavior)
2. **`toolCallCount`** - Total tools executed (investigation thoroughness)
3. **`status`** - "success", "failed", "timeout", "parse_error" (reliability)
4. **`completionReason`** - "investigation_complete", "max_iterations", "parse_failure", "model_stopped" (failure modes)
5. **`validJson`** - Boolean: Final response contained valid JSON (quality)
6. **`modelVersion`** - Specific model ID (e.g., "claude-sonnet-4.5-20250929")
7. **`uniqueToolsUsed`** - Array of tool names (investigation strategy)
8. **`cacheHitRate`** - Percentage of tokens from cache (performance optimization)

**Medium Priority (Add as Needed)**:
9. **`confidenceScore`** - AI's confidence in analysis (0-1)
10. **`riskLevel`** - "low", "medium", "high" for recommended actions
11. **`investigationType`** - Inferred from issue (e.g., "oom", "crash", "networking")
12. **`timeToFirstToken`** - Latency until first response (UX metric)
13. **`tokensPerSecond`** - Throughput metric
14. **`toolSequence`** - Ordered array showing investigation path

**Future Extensions**:
- Cost estimation per investigation
- Token breakdown (system prompt, tool results, conversation)
- Validation metrics (dry-run usage)
- Resource context (namespace, resource type)

**Important: This is a Living List**
- Add new metrics when we discover they provide value
- Remove metrics that prove unhelpful or redundant
- Adjust priorities based on blog post needs
- Consider reader interest and actionability

**Impact**:
- Richer comparison data for blog post
- Better understanding of model behavior differences
- Quantifiable justification for model recommendations
- Foundation for future performance analysis

**Implementation Notes**:
- Capture metrics in `src/tools/remediate.ts` during investigation
- Log to `tmp/debug-ai/metrics.jsonl` (existing file)
- Ensure metrics available from toolLoop() result object
- Consider backwards compatibility with existing metrics

**Alternatives Considered**:
- Use existing metrics only → Rejected as insufficient for behavioral analysis
- Create separate metrics file → Rejected as adds complexity, prefer single JSONL
- Log everything possible → Rejected as noisy, focus on high-value metrics first

**Files Affected**:
- `src/tools/remediate.ts` - Capture additional metrics during investigation
- `tmp/debug-ai/metrics.jsonl` - Extended schema (backwards compatible)

---

### Decision 6: Expand Model Testing Coverage (Moved to PRD #151)
**Date**: 2025-10-06  
**Decision**: Test 3 additional AI models beyond initial "Big 3" (Claude, GPT-5, Gemini) to provide comprehensive market comparison covering different market segments and use cases.

**Status**: **MOVED TO PRD #151** - Extended to comprehensive MCP platform testing rather than remediation-only.

**Context**:
- Initial testing covered premium providers (Claude, GPT-5) and identified reliability issues (Gemini 50% failure)
- Blog post value increases with broader market coverage
- Readers need guidance across price/performance spectrum
- Vercel AI SDK supports many cloud providers

**Models Selected for Testing**:

1. **xAI Grok-4** (Challenger)
   - Package: `@ai-sdk/xai`
   - **Value**: Elon Musk's model, high market curiosity, competitive alternative to GPT-5
   - **Test focus**: Performance vs established leaders
   - **Blog angle**: "New challenger analysis"

2. **Mistral Large** (European Champion)
   - Package: `@ai-sdk/mistral`
   - **Value**: Leading European AI, cost-effective, strong reasoning capabilities
   - **Test focus**: Price/performance sweet spot, regional alternative
   - **Blog angle**: "Value for money analysis"

3. **DeepSeek-R1** (Disruptor)
   - Package: Available via Vercel AI SDK
   - **Value**: Extreme cost efficiency (~1/10th GPT-4 cost), recent market buzz (Jan 2025), reasoning-focused
   - **Test focus**: Budget option viability, reasoning specialization for troubleshooting
   - **Blog angle**: "Can low-cost models compete?"

**Final Model Lineup (6 total)**:
1. ✅ Claude Sonnet 4.5 - Premium, reliable baseline
2. ✅ GPT-5 - Premium, expensive reference
3. ✅ Gemini 2.5 Pro - Reliability warning case study
4. ⏳ Grok-4 - New challenger evaluation
5. ⏳ Mistral Large - Regional/value champion
6. ⏳ DeepSeek-R1 - Budget disruptor

**Rationale**:
- **Market coverage**: Premium → Mid-tier → Budget spectrum
- **Geographic diversity**: US (Claude, GPT, Grok), Europe (Mistral), China (DeepSeek)
- **Reader questions answered**:
  - "Should I pay premium?" → Claude vs GPT comparison
  - "Are there good alternatives?" → Grok, Mistral evaluation
  - "Can I save money?" → DeepSeek viability
  - "What should I avoid?" → Gemini reliability issues
- **Narrative diversity**: Established leaders, problematic option, challengers, value play, disruptor

**DeepSeek Model Selection**:
- Chose **DeepSeek-R1** over DeepSeek-V3
- **Reason**: Remediation requires complex reasoning about issues, R1's reasoning specialization potentially beneficial
- **Comparison value**: Tests if reasoning focus improves investigation quality

**Impact**:
- Extends Milestone 2.6 scope from 3 models to 6 models
- Richer blog post with broader market coverage
- More actionable guidance for diverse user needs
- Validates Vercel provider across more model types

**Alternatives Considered**:
- **Groq (Llama 3.1)** → Deferred: Speed focus less relevant for async investigations
- **Azure OpenAI** → Rejected: Redundant with direct OpenAI, enterprise wrapper
- **Amazon Bedrock** → Rejected: Redundant with direct providers
- **Cohere** → Rejected: Less relevant for reasoning tasks
- **Together.ai** → Rejected: Similar to Groq, less differentiation

**Files Affected**:
- `src/core/providers/vercel-provider.ts` - May need provider additions for xAI, Mistral, DeepSeek
- `tests/integration/infrastructure/run-integration-tests.sh` - Test script updates for new providers
- PRD Milestone 2.6 - Expand scope to 6 models

**Testing Plan**:
- Same methodology as Big 3 testing
- Same integration test suite (remediate manual + automatic modes)
- Capture extended metrics from Decision 5
- Document reliability, token usage, and behavior patterns

---

## Work Log

### 2025-10-06: PRD Creation
**Duration**: Initial planning session
**Status**: Draft

**PRD Scope Defined**:
- Two-phase approach: Tool migration (Phase 1) + Observability integration (Phase 2)
- Configuration at server level (not per-call)
- Focus on Prometheus and DataDog as initial providers
- Built on existing toolLoop() from PRD #136
- Must support both Anthropic and Vercel AI SDK providers

**Next Steps**: Begin Phase 1 implementation - validate existing tool infrastructure and design kubectl tools

---

### 2025-10-06: Architectural Design Decisions (PRD Updated)
**Duration**: Design discussion and PRD refinement
**Status**: Ready for Implementation (Milestone 1)

**Major Decisions Documented** (see Decision Log):
1. toolLoop() is complete replacement - not incremental migration (~400 lines removed)
2. Focus Anthropic first, Vercel support in Milestone 2.5
3. Remove JSON response format entirely from investigation prompt
4. Separate tools per kubectl operation (not single kubectl tool)

**PRD Updates**:
- Solution Overview: Added clarity on complete architectural replacement
- Technical Approach: Expanded code examples showing before/after (~400 lines → ~50 lines)
- Milestone 1: Updated to Anthropic-only validation, verified existing implementation
- Milestone 2: Clarified scope - replace entire loop, update prompts
- Milestone 2.5: New milestone for Vercel provider support
- Open Questions: Resolved 4 of 7 questions
- Dependencies: Corrected VercelProvider status (not yet implemented)
- Decision Log: Added comprehensive documentation of 4 major decisions with rationale

**Key Insights from Design Discussion**:
- SDK's toolLoop() manages conversation history → no need for `{previousIterations}` injection
- SDK manages iterations → no need for manual while loop or `{currentIteration}`
- SDK handles tool execution → no need for JSON parsing of `dataRequests`
- SDK determines completion → no need for `investigationComplete` boolean
- Result: Much simpler architecture than originally planned

**Verification Completed**:
- AnthropicProvider toolLoop() implementation confirmed (src/core/providers/anthropic-provider.ts:134-256)
- VercelProvider toolLoop() NOT implemented (vercel-provider.ts:157-163) - documented as blocker
- Existing remediation workflow analyzed (remediate.ts) - 400+ lines to be replaced
- Investigation prompt analyzed (remediate-investigation.md) - ~40 lines to be removed

**Next Steps**: Begin Milestone 1 implementation - Test AnthropicProvider toolLoop() with simple tool, design kubectl tool architecture

---

### 2025-10-06: Milestone 1 Completed - kubectl_api_resources Tool Implementation
**Duration**: Implementation and testing session
**Status**: ✅ COMPLETED

**Implementation Summary**:
- Created shared kubectl tools module (`src/core/kubectl-tools.ts`) for cross-feature reuse
- Implemented `kubectl_api_resources` tool with AITool interface definition
- Implemented `executeKubectlTools()` executor function with proper error handling
- Integrated tool into remediation using hybrid approach with `sendMessageWithTools()`

**Code Changes**:
1. **src/core/kubectl-tools.ts** (NEW):
   - `KUBECTL_API_RESOURCES_TOOL`: Tool definition for discovering cluster API resources
   - `executeKubectlTools()`: Tool executor with switch-case pattern for extensibility
   - Proper error handling returning structured success/error responses

2. **src/tools/remediate.ts** (MODIFIED):
   - Removed upfront `clusterApiResources` fetching (lines 346-364 deleted)
   - Removed `{clusterApiResources}` template replacement
   - Changed investigation iterations to use `sendMessageWithTools()` instead of `sendMessage()`
   - Pass `[KUBECTL_API_RESOURCES_TOOL]` and `executeKubectlTools` to each investigation call
   - Updated final analysis to use descriptive operation name: `'remediate-final-analysis'`

3. **prompts/remediate-investigation.md** (MODIFIED):
   - Removed "Cluster API Resources" section (lines 13-26)
   - AI now discovers resources via tool when needed

4. **src/core/providers/anthropic-provider.ts** (MODIFIED):
   - Added `startTime` tracking to `sendMessageWithTools()` method
   - Added metrics logging with `logMetrics()` call before return
   - Ensures all AI calls are tracked with descriptive operation names

**Testing Results**:
- ✅ All integration tests pass (`npm run test:integration`)
- **Metrics Comparison**:
  - Baseline (JSON-based, no tools): 19 total AI calls
  - With tool (hybrid approach): 16 total AI calls (12 investigation + 4 final analysis)
  - **Reduction**: 3 fewer calls with tool-based approach
- **Tool Usage**: AI did not call `kubectl_api_resources` in test scenario (crashing pod issue) as it already knew which resources to query
- Tool is available and functional when AI needs resource discovery

**Hybrid Approach Validation**:
- ✅ AI can call `kubectl_api_resources` tool when needed
- ✅ AI continues using JSON `dataRequests` pattern for other kubectl operations
- ✅ Proper metrics logging for all operations
- ✅ Tool placed in shared location for future reuse by other features

**Key Findings**:
1. **sendMessageWithTools() works correctly** - Successfully provides tools to AI while maintaining conversation flow
2. **No token overhead for unused tools** - Having tool available but unused doesn't increase token usage significantly
3. **Cleaner architecture** - Removed template injection complexity, AI calls tool when needed
4. **Metrics tracking essential** - Added proper logging to track all AI calls with descriptive operation names
5. **Shared tool module pattern** - `src/core/kubectl-tools.ts` can be imported by remediate, diagnostics, and other features

**Decision Made**: Hybrid approach validated for incremental migration. Can now proceed with implementing additional kubectl tools while maintaining backward compatibility with JSON `dataRequests` pattern.

**Files Modified**:
- `src/core/kubectl-tools.ts` (created)
- `src/tools/remediate.ts` (modified)
- `prompts/remediate-investigation.md` (modified)
- `src/core/providers/anthropic-provider.ts` (modified)

**Metrics Files**:
- `tmp/baseline-metrics-json-based.jsonl` - Baseline before tool implementation
- `tmp/metrics-with-kubectl-tool.jsonl` - Metrics after tool implementation

**Next Steps**: Begin Milestone 2 - Implement remaining kubectl tools (`kubectl_get`, `kubectl_describe`, `kubectl_logs`, `kubectl_events`, `kubectl_top`) and migrate to full `toolLoop()` architecture

---

### 2025-10-06: Critical Bug Found - sendMessageWithTools() Incomplete Implementation
**Duration**: Debug and investigation session
**Status**: ⚠️ BLOCKED - Bug prevents tool usage

**Issue Discovered**:
The initial Milestone 1 validation appeared successful, but deeper investigation revealed the tool was never actually called by the AI. Debug analysis found a critical bug in `sendMessageWithTools()` implementation.

**Root Cause Analysis**:

1. **sendMessageWithTools() is incomplete** (src/core/providers/anthropic-provider.ts:262-339):
   - ✅ Sends message with tools to Claude
   - ✅ Receives tool_use blocks from Claude
   - ✅ Executes tools locally
   - ❌ **NEVER sends tool results back to Claude** (missing multi-turn conversation)
   - ❌ Returns immediately without Claude seeing tool results

2. **Correct implementation exists in toolLoop()** (src/core/providers/anthropic-provider.ts:134-260):
   - ✅ Executes tools
   - ✅ Builds tool_result messages with tool_use_id
   - ✅ Adds assistant response + tool results to conversation history
   - ✅ Sends back to Claude to get final response incorporating tool results
   - ✅ Multi-turn conversation pattern

3. **Test scenario created to force tool usage**:
   - Created CNPG Cluster with reference to non-existent backup
   - Vague issue description: "my database in namespace X is not starting"
   - Forces AI to discover CNPG custom resources via `kubectl_api_resources`
   - Updated test: `tests/integration/tools/remediate.test.ts`

4. **Test results showed the problem**:
   - AI ran for 26+ iterations (hit 20 iteration limit twice)
   - All iterations showed: `"successful": 0, "failed": 0` (no data gathered)
   - AI mentioned "kubectl_api_resources function is available" but never called it
   - AI output empty `dataRequests: []` arrays
   - Investigation stuck in endless loop

**Why Hybrid Approach Failed**:
- Prompt instructs AI to output JSON with `dataRequests` array
- Tool is available via `sendMessageWithTools()` but AI doesn't know how to use it
- Even when AI tries to call tool, results never make it back to AI
- AI is confused by mixed instructions (JSON format + tool availability)

**Architectural Decision - Full toolLoop() Migration Required**:

After analysis, determined that fixing `sendMessageWithTools()` to match `toolLoop()` pattern would still leave hybrid confusion. Better approach: **Complete migration to toolLoop()** as originally planned in Milestone 2.

**Implementation Plan**:
1. ✅ Create all kubectl investigation tools (completed):
   - `kubectl_api_resources` - discover cluster resources
   - `kubectl_get` - list resources
   - `kubectl_describe` - detailed resource info
   - `kubectl_logs` - container logs
   - `kubectl_events` - cluster events
   - Exported as `KUBECTL_INVESTIGATION_TOOLS` array

2. ⏳ Replace manual investigation loop with `toolLoop()`:
   - Remove `conductInvestigation()` while loop (lines 208-308)
   - Use `aiProvider.toolLoop()` with all investigation tools
   - Remove `parseAIResponse()` function (dead code)
   - Remove `gatherSafeData()` function (dead code)
   - Remove `analyzeCurrentState()` function (replaced by toolLoop)

3. ⏳ Update investigation prompt:
   - Remove JSON response format instructions
   - Remove `dataRequests` array specification
   - Remove `investigationComplete` field
   - Explain available tools and when to use them
   - Let SDK manage conversation history and completion

4. ⏳ Test with CNPG scenario

**Files Completed**:
- ✅ `src/core/kubectl-tools.ts` - All 5 investigation tools implemented
- ✅ `tests/integration/tools/remediate.test.ts` - CNPG test scenario created

**Files Pending**:
- ⏳ `src/tools/remediate.ts` - Replace loop with toolLoop()
- ⏳ `prompts/remediate-investigation.md` - Remove JSON format, explain tools

**Metrics from Failed Test**:
- 26 AI calls total (2 separate investigation attempts)
- First attempt: 19 iterations, hit max limit
- Second attempt: 7 iterations, still no data gathered
- Input tokens grew to 28,443 (huge `previousIterations` injection)
- Operations properly labeled: "remediate-investigation" and "remediate-final-analysis"

**Key Insight**:
The hybrid approach (one tool + JSON pattern) creates too much confusion. Clean migration to full toolLoop() is simpler and cleaner than trying to fix the hybrid model.

**Next Steps**: Complete toolLoop() migration - estimated ~200-300 lines of code changes across 2 files

---

### 2025-10-06: Milestone 2 Implementation - Full toolLoop() Migration Complete
**Duration**: Implementation and testing session (~4-5 hours based on conversation)
**Status**: ✅ SUBSTANTIALLY COMPLETE (Anthropic provider only)

**Implementation Summary**:
Completed full migration from JSON-based investigation (400+ line manual loop) to SDK-native toolLoop() architecture. All kubectl investigation tools implemented, tests passing, debug infrastructure added.

**Completed PRD Items**:
- [x] All kubectl tools implemented (9 tools total) - Evidence: src/core/kubectl-tools.ts:17-515
  - kubectl_api_resources, kubectl_get, kubectl_describe, kubectl_logs, kubectl_events
  - kubectl_patch_dryrun, kubectl_apply_dryrun, kubectl_delete_dryrun
  - kubectl_get_crd_schema (NEW - added for CRD schema discovery)

- [x] Replace `conductInvestigation()` with single `toolLoop()` call - Evidence: src/tools/remediate.ts:416-496
  - Single-phase investigation merging investigation + analysis
  - Removed 400+ lines of manual loop management
  - SDK handles iterations, history, tool execution, completion

- [x] Update investigation prompt - Evidence: prompts/remediate.md
  - Removed JSON response format requirements
  - Removed dataRequests array specification
  - Converted to tool-calling instructions
  - Simplified by ~40 lines

- [x] Remove old JSON-based code - Evidence: Multiple file changes
  - Deleted parseAIResponse() function (dead code)
  - Deleted gatherSafeData() function (dead code)
  - Deleted analyzeCurrentState() function (replaced by toolLoop)
  - Updated parseAIFinalAnalysis() with proper brace-depth tracking (fixed greedy regex bug)

- [x] Integration tests passing - Evidence: tmp/test-output-deployment.log
  - Manual mode test: Deployment with OOMKilled container (passed)
  - Automatic mode test: Pod with memory issue (passed)
  - 2/2 tests passed in 217.50s

- [x] Debug logging infrastructure - Evidence: src/core/providers/anthropic-provider.ts:58-72, 192-208
  - Added operation parameter to toolLoop() config
  - Created logDebugIfEnabled() shared helper method
  - Per-iteration debug logging with prompts, responses, token metrics
  - Fixed JSON parsing bug in parseAIFinalAnalysis (brace-depth tracking)

**Additional Work Completed**:

1. **Code Cleanup**:
   - Removed `sendMessageWithTools()` from all providers (dead code)
   - Deleted old JSON-based prompts: remediate-investigation.md, remediate-final-analysis.md
   - Removed unused imports (AITool, ToolExecutor) from providers
   - Fixed environment variable mismatch: DOT_AI_DEBUG → DEBUG_DOT_AI

2. **Test Scenario Improvements**:
   - Changed from Pod to Deployment resources for complexity
   - Updated validation to check deployment rollout and pod status
   - Improved wait logic using label selectors
   - Fixed OOMKilled scenario to be consistently reproducible

3. **New Capability - CRD Schema Tool**:
   - Implemented kubectl_get_crd_schema tool (src/core/kubectl-tools.ts:207-224)
   - Returns raw JSON from `kubectl get crd <name> -o json`
   - AI can discover CRD schemas for complex custom resources (CNPG, Crossplane, ArgoCD, etc.)
   - Added to KUBECTL_INVESTIGATION_TOOLS array for automatic availability

**Metrics Analysis**:
- **Tool-based approach**: 33 calls, 482,004 input tokens, 15,372 output tokens, 370s
- **JSON-based baseline**: 19 calls, 172,089 input tokens, 9,378 output tokens, 220s
- **Token increase**: +74% more calls, +180% more input tokens, +174% total tokens
- **Analysis**: Token increase is expected and justified:
  - Tool-based performs more thorough investigation (longer tool result context)
  - Better investigation quality with native tool calling
  - Future observability tools will add more value
  - Acceptable tradeoff for cleaner architecture and extensibility

**Architecture Changes**:
1. **Single-phase investigation**: Merged investigation + analysis into one toolLoop call
2. **JSON parsing fix**: Replaced greedy regex with proper brace-depth tracking algorithm
3. **Debug infrastructure**: Per-iteration logging with operation names for observability
4. **Tool executor pattern**: executeKubectlTools() with switch-case for extensibility

**Files Modified**:
- src/core/kubectl-tools.ts (9 tools, 515 lines)
- src/tools/remediate.ts (conductInvestigation replaced, parseAIFinalAnalysis fixed)
- src/core/providers/anthropic-provider.ts (debug logging, removed sendMessageWithTools)
- src/core/providers/vercel-provider.ts (removed sendMessageWithTools stub)
- src/core/ai-provider.interface.ts (removed sendMessageWithTools signature)
- prompts/remediate.md (tool-calling only, no JSON format)
- tests/integration/tools/remediate.test.ts (Deployment scenario, improved validation)
- tests/integration/infrastructure/run-integration-tests.sh (DEBUG_DOT_AI fix)

**Files Deleted**:
- prompts/remediate-investigation.md (old JSON-based investigation prompt)
- prompts/remediate-final-analysis.md (old JSON-based analysis prompt)

**Test Evidence**:
- Manual mode: 5 iterations, 7 tool calls, fixed Deployment memory limit
- Automatic mode: 6 iterations, 8 tool calls, auto-executed remediation
- Debug files: tmp/debug-ai/metrics.jsonl, tmp/debug-ai/*_prompt.md, *_response.md

**Key Architectural Insights**:
1. **toolLoop() simplification**: SDK managing conversation history eliminated ~400 lines of manual loop code
2. **No template injection needed**: SDK maintains context automatically (no `{previousIterations}`)
3. **Natural tool selection**: AI chooses appropriate tools based on descriptions without special prompt logic
4. **Cleaner error handling**: Tool execution errors handled consistently by executor pattern
5. **Better extensibility**: Adding new tools is straightforward (define + add to executor switch)

**Next Session Priorities**:
1. Milestone 2.5: Implement VercelProvider.toolLoop() for multi-provider support
2. Phase 2 Start: Design observability configuration system (Milestone 3)
3. Consider token optimization strategies if usage becomes concern in production

---

### 2025-10-06: Performance Optimization Session (Post-Implementation)
**Duration**: ~4-5 hours (based on conversation flow)
**Focus**: Token usage and execution time optimization

**Problem**: Initial tool-based implementation showed +174% token increase vs JSON-based baseline, raising concerns about production viability.

**Optimizations Implemented**:
1. **Table Format Enforcement** (-41% tokens):
   - Modified `kubectl-tools.ts` to strip `-o=yaml` and `-o=json` output flags
   - Root cause: YAML outputs contained escaped JSON in `last-applied-configuration` annotations
   - Files: `src/core/kubectl-tools.ts`, `tests/integration/infrastructure/run-integration-tests.sh`

2. **Prompt Caching** (-7% additional tokens, massive cost savings):
   - Separated static system prompt from dynamic issue description
   - Added `cache_control` on system prompt and tools array in `anthropic-provider.ts`
   - Cache write: +25% cost, Cache read: -90% cost (187K cache reads per test run)
   - Files: `src/core/providers/anthropic-provider.ts`, `src/tools/remediate.ts`, `prompts/remediate-system.md` (renamed)

3. **Parallel Tool Execution** (time optimization):
   - Refactored toolLoop() from sequential `for...of await` to `Promise.all()`
   - Claude frequently requests 2-3 tools simultaneously; now execute concurrently
   - Files: `src/core/providers/anthropic-provider.ts`

4. **Concurrent Test Execution** (-42% test time):
   - Added `.concurrent` to remediation tests (separate namespaces, no conflicts)
   - Test time: 142.6s concurrent vs 248.3s sequential
   - Files: `tests/integration/tools/remediate.test.ts`

**Metrics Summary**:
| Stage | Tokens | Change |
|-------|--------|--------|
| JSON baseline | 181,467 | - |
| Initial tools | 497,376 | +174% ⚠️ |
| After table opt | 293,932 | -41% |
| After caching | 272,216 | -7% |
| **Final vs baseline** | **272,216** | **+50%** ✅ |

**Outcome**: Tool-based approach economically viable for production. Token cost increase justified by:
- Better investigation quality with native tool calling
- Extensibility for Phase 2 observability tools
- Cleaner architecture (-400 lines manual loop code)
- 35% faster wall-clock time through concurrency

**Files Modified**:
- `src/core/kubectl-tools.ts` - Table format enforcement
- `src/core/providers/anthropic-provider.ts` - Caching + parallel execution
- `src/tools/remediate.ts` - Separated issue from system prompt
- `prompts/remediate-system.md` - Renamed, made fully static
- `src/core/providers/provider-debug-utils.ts` - Cache metrics logging
- `src/core/ai-provider.interface.ts` - Added cache metrics to AIResponse
- `tests/integration/tools/remediate.test.ts` - Concurrent execution
- `tests/integration/infrastructure/run-integration-tests.sh` - Hardcoded DEBUG=true

**Key Optimization Insights**:
1. **Table format is significantly more efficient**: Stripping verbose YAML/JSON outputs reduced tokens by 41%
2. **Prompt caching is essential for tool-based architectures**: 90% cost reduction on cache reads makes repeated investigations economical
3. **Static system prompt enables cross-investigation caching**: Separating issue from prompt allows cache reuse across all investigations
4. **Parallel tool execution aligns with Claude's natural behavior**: Claude frequently requests multiple tools; executing them concurrently reduces wall-clock time
5. **Test concurrency is safe with proper isolation**: Using separate namespaces prevents conflicts

**Baseline Files Created**:
- `tmp/baseline-metrics-json-based.jsonl` - Original JSON-based approach
- `tmp/baseline-before-table-optimization.jsonl` - Initial tool implementation
- `tmp/baseline-table-only-before-caching.jsonl` - After table optimization
- `tmp/baseline-after-caching-before-parallel.jsonl` - After prompt caching

**Next Session**: Begin Milestone 2.5 (Vercel Provider Support) or proceed to Phase 2 (Observability Integration)

----

### 2025-10-06: Milestone 2.5 Complete - Vercel Provider Implementation
**Duration**: ~4 hours (conversation-based estimation)
**Status**: Milestone 2.5 Complete ✅ | Milestone 2.6 (Multi-Model Testing) Pending

**Completed PRD Items**:
- [x] Implement VercelProvider.toolLoop() using Vercel AI SDK - Evidence: src/core/providers/vercel-provider.ts:168-398
- [x] Provider-specific caching (Anthropic manual, OpenAI automatic) - Evidence: Lines 192-233
- [x] Test all kubectl tools with Vercel provider (Anthropic model) - Evidence: Integration tests passing (76.8s Manual, 98s Auto)
- [x] Cache metrics normalization across providers - Evidence: Lines 362-385
- [x] Integration tests updated and passing - Evidence: Test output shows 100% pass rate
- [x] Debug infrastructure working (full-context.md, metrics.jsonl) - Evidence: Files generated successfully
- [x] Document provider differences and findings - Evidence: Comprehensive code comments + conversation analysis

**Key Implementation Details**:
- Used `generateText()` with `stopWhen: stepCountIs(maxIterations)` for automatic multi-turn loops
- Anthropic caching via `providerOptions.anthropic.cacheControl` on last tool + system message
- Tool result extraction: `(part as any).output.value` structure discovered and implemented
- Cleaned up debug console outputs - all structured data goes to files now
- SDK version 5.0.60 includes token reporting fixes from PR #8945

**Performance Validation**:
- **Vercel SDK: 22 iterations in 132 seconds** (36% faster than Anthropic native)
- **Anthropic Native: 33 iterations in 206 seconds** (baseline comparison)
- **Efficiency gain**: Vercel reaches conclusions with 33% fewer iterations
- **Per-iteration speed**: Nearly identical (6.0s vs 6.2s)
- **Root cause of lower tokens**: Fewer iterations needed, not SDK bug

**Technical Discoveries**:
- Vercel SDK v5.0.60 includes token reporting fixes (PR #8945 merged Sept 26, 2025)
- Token reporting appears lower because fewer iterations needed (efficiency, not a bug)
- Per-step usage data not available - only cumulative totals from `result.usage`
- Full conversation context properly maintained in `result.response.messages`
- GitHub issues referenced: #8349 (cache tokens), #8795 (streaming), PR #8945 (fix)

**Additional Work Done**:
- Investigated token reporting discrepancies - confirmed SDK working correctly
- Added GitHub issue references (#8349, #8795, #8945) as code comments for future reference
- Removed unnecessary debug console outputs while keeping file-based debugging
- Validated tool output extraction with actual integration test data
- Created comprehensive performance comparison analysis

**Testing Scope**:
- ✅ Tested: Anthropic Claude Sonnet 4.5 via Vercel AI SDK
- ⏳ Pending: OpenAI GPT-5 and Google Gemini 2.5 Pro (Milestone 2.6)

**Next Session Priorities**:
1. **Milestone 2.6**: Test VercelProvider with OpenAI and Google Gemini models
2. OR **Phase 2 Start**: Begin Milestone 3 (Observability Configuration System)
3. Consider documenting findings in architecture decision record

---

### 2025-10-06: Milestone 2.6 Progress - OpenAI GPT-5 Testing Complete
**Duration**: ~2 hours (testing and analysis)
**Status**: Milestone 2.6 Substantially Complete (OpenAI ✅, Google Gemini pending)

**Completed Work**:
- [x] Tested VercelProvider.toolLoop() with OpenAI GPT-5 model
- [x] Verified all kubectl tools work correctly with GPT-5
- [x] Analyzed performance metrics vs Claude Sonnet baseline
- [x] Documented model-specific differences and recommendations
- [x] Saved baseline metrics to `tmp/baseline-vercel-gpt5.jsonl`

**Test Results Summary**:
- **Integration Tests**: Both Manual and Automatic modes passed ✅
- **Test Duration**: 136.4s total (Manual: 135.8s, Automatic: 125.4s)
- **Token Usage**: 30,684 total tokens (+84% vs Claude: 16,649)
- **Caching**: 20,224 cache read tokens (automatic via Vercel SDK)
- **Investigation Quality**: Equivalent to Claude - successfully identifies issues and generates valid fixes

**Key Findings**:
1. **GPT-5 uses more tokens**: +77% input, +129% output compared to Claude
2. **GPT-5 is slower per call**: Average 54.6s vs Claude 33s (+65%)
3. **End-to-end duration similar**: Concurrent test execution masks per-call differences
4. **Caching works effectively**: OpenAI's automatic caching shows 20K cache reads
5. **No code changes needed**: Tests work with any provider via configuration only

**Performance Tradeoffs**:
- **Claude Sonnet 4.5**: Best for cost/speed optimization (-46% tokens, -40% faster)
- **OpenAI GPT-5**: Suitable when OpenAI infrastructure already in use, equivalent quality

**Technical Implementation**:
- Changed `AI_PROVIDER=openai` in test script (line 144 of run-integration-tests.sh)
- Vercel SDK automatically handles OpenAI-specific caching and token reporting
- No provider-specific code changes required in remediation tool
- Debug infrastructure captured full conversation context and metrics

**Files Modified**:
- `tests/integration/infrastructure/run-integration-tests.sh` - Changed AI_PROVIDER to openai
- `prds/143-tool-based-remediation-observability.md` - Updated Milestone 2.6 with results

**Files Created**:
- `tmp/baseline-vercel-gpt5.jsonl` - OpenAI GPT-5 performance baseline
- `tmp/test-output-openai-gpt5.log` - Full test execution log
- `tmp/debug-ai/*_full-context.md` - 4 investigation conversation contexts

**Next Steps**:
1. Test with Google Gemini 2.5 Pro to complete Milestone 2.6
2. OR proceed to Phase 2 (Observability Integration) if Gemini testing deferred

---

### 2025-10-06: Milestone 2.6 Complete - Multi-Model Testing with Critical Findings
**Duration**: ~4 hours (OpenAI + Gemini testing and analysis)
**Status**: ✅ Milestone 2.6 COMPLETE - Phase 1 fully validated across 3 AI providers

**Completed Work**:
- [x] Tested VercelProvider.toolLoop() with OpenAI GPT-5 model (100% reliability)
- [x] Tested VercelProvider.toolLoop() with Google Gemini 2.5 Pro model (50% reliability - CRITICAL ISSUE)
- [x] Analyzed performance metrics for all 3 models (Claude, GPT-5, Gemini)
- [x] Documented comprehensive model comparison with production recommendations
- [x] Identified and documented Gemini reliability issue with debug evidence

**Test Results Summary**:

**OpenAI GPT-5**:
- ✅ All tests passed (100% reliability)
- Token usage: 30,684 total (+84% vs Claude)
- Performance: Slower but reliable
- Recommendation: Acceptable for production

**Google Gemini 2.5 Pro**:
- ⚠️ 50% test failure rate (1 of 2 tests failed)
- Critical issue: Model stops generating output mid-investigation (0 tokens)
- When working: Quality equivalent to Claude/GPT-5
- Recommendation: NOT RECOMMENDED for production

**Key Findings**:
1. **Claude Sonnet 4.5 remains best choice**: Optimal cost/speed/reliability balance
2. **OpenAI GPT-5 is viable alternative**: Higher cost but 100% reliable
3. **Gemini has critical reliability issue**: Unpredictable stopping behavior makes it unsuitable
4. **Vercel provider works universally**: Successfully tested with 3 different AI providers
5. **No code changes needed**: Provider switching via configuration only

**Technical Discoveries**:
- **Gemini failure mode**: Generates 0 output tokens after successful tool execution
- **Failure pattern**: Unpredictable - no clear trigger
- **Debug evidence**: Full conversation logs show premature stopping after validation
- **Impact**: Cannot trust Gemini for critical workflows requiring guaranteed completion

**Files Modified**:
- `tests/integration/infrastructure/run-integration-tests.sh` - Updated AI_PROVIDER for testing
- `prds/143-tool-based-remediation-observability.md` - Updated Milestone 2.6 with all findings

**Files Created**:
- `tmp/baseline-vercel-gpt5.jsonl` - OpenAI GPT-5 baseline metrics
- `tmp/baseline-vercel-gemini.jsonl` - Google Gemini partial metrics
- `tmp/test-output-openai-gpt5.log` - OpenAI test execution log
- `tmp/test-output-google-gemini.log` - Gemini test execution log
- `tmp/debug-ai/2025-10-06T180613_*` - Gemini failure debug evidence
- `tmp/debug-ai/2025-10-06T180631_*` - Gemini success debug evidence

**Phase 1 Completion Assessment**:
- ✅ **Milestone 1**: kubectl_api_resources tool implementation - COMPLETE
- ✅ **Milestone 2**: Complete kubectl tool migration to toolLoop() - COMPLETE
- ✅ **Milestone 2.5**: Vercel Provider implementation - COMPLETE
- ✅ **Milestone 2.6**: Multi-model provider testing - COMPLETE

**Phase 1 Final Status**: ✅ **COMPLETE** - Tool-based remediation architecture fully validated
- Architecture migrated successfully from JSON-based to SDK-native toolLoop()
- Both Anthropic and Vercel providers working correctly
- 3 AI models tested: Claude (recommended), GPT-5 (acceptable), Gemini (not recommended)
- Performance optimized (+50% tokens vs baseline, but with caching and parallel execution)
- All integration tests passing for recommended providers
- Production-ready for Claude Sonnet 4.5 and OpenAI GPT-5

**Next Session**: Begin Phase 2 (Observability Integration) - Milestone 3: Observability Configuration System

----

### 2025-10-06: PRD Split - Phase 2 Extracted to PRD #150
**Duration**: Documentation refactoring
**Status**: PRD organizational improvement

**Context**:
- Phase 1 (Tool-Based Architecture) complete and validated
- Phase 2 (Observability Integration) recognized as separate scope
- User requested split for focused implementation

**Changes Made**:
- Created [PRD #150 - Remediation Observability Integration](https://github.com/vfarcic/dot-ai/blob/main/prds/150-remediation-observability-integration.md)
- Extracted Phase 2 milestones (observability tools) to new PRD
- Updated PRD #143 to focus on Phase 1 (kubectl tools architecture)
- Added cross-references between PRDs

**Rationale**:
- Phase 1 has different scope than Phase 2 (architecture vs. features)
- Observability tool selection needs its own discovery process
- Cleaner separation of concerns for issue tracking
- Phase 1 provides foundation for Phase 2 to build upon

**PRD #143 Status**: Phase 1 complete - architecture established and validated

**Next Session**: Work on PRD #150 (Observability Integration) or new task

----

### 2025-10-06: Design Decision - Extended Metrics for Model Comparison
**Duration**: Planning discussion
**Status**: Decision documented

**Context**:
- Planning blog post comparing Claude, GPT-5, and Gemini performance
- Current metrics insufficient for deep behavioral analysis
- Need to explain WHY models differ (not just token counts)

**Decision Made**:
- Extend `metrics.jsonl` with behavioral, quality, and efficiency metrics
- Treat as living list - add more as we discover value
- Prioritized implementation: High priority (8 metrics), Medium priority (6 metrics), Future extensions

**Key Metrics Added**:
- `iterationCount` - Loop vs one-shot behavior
- `toolCallCount` - Investigation thoroughness
- `status` / `completionReason` - Reliability and failure modes
- `validJson` - Quality metric
- `modelVersion` - Specific model tracking
- `uniqueToolsUsed` - Investigation strategy
- `cacheHitRate` - Performance optimization

**Rationale**:
- Gemini 50% failure needs behavioral explanation (not just "failed")
- GPT +84% token usage needs justification analysis
- Blog post readers need actionable comparison data

**Documentation**:
- Added Decision 5 to PRD Decision Log with comprehensive rationale
- Emphasized this is a living list that evolves with needs
- Documented implementation notes and file locations

**Next Steps**: Implement high-priority metrics in `src/tools/remediate.ts`

----

### 2025-10-07: Decision 5 Implementation - Extended Metrics Collection Complete
**Duration**: ~2 hours (based on conversation timestamps)
**Status**: ✅ Extended metrics infrastructure COMPLETE

**Completed Work**:
- [x] Extended `AgenticResult` interface with new metrics fields (flat structure)
- [x] Implemented `logMetrics()` function accepting result object
- [x] Created `createAndLogAgenticResult()` helper to reduce provider duplication
- [x] Updated both AnthropicProvider and VercelProvider to use helper
- [x] Added manual annotation placeholders for post-test analysis
- [x] Renamed `provider` field to `sdk` for clarity (distinguishes SDK from model)
- [x] Configured test script defaults (native Anthropic SDK, env var override support)

**Key Architectural Decisions**:
1. **Flat metrics structure**: Rejected nested "extendedMetrics" wrapper - made all metrics first-class fields in AgenticResult
2. **Shared helper function**: Created `createAndLogAgenticResult()` to eliminate ~50 lines of duplication across providers
3. **SDK vs Provider naming**: Renamed field from `provider` to `sdk` to distinguish SDK implementation (vercel/anthropic) from AI model
4. **Test defaults**: Made native Anthropic the default SDK to match CI configuration, support env var override (`AI_PROVIDER_SDK=vercel`)

**Extended Metrics Implemented (Decision 5)**:
- `iterationCount` - Number of conversation turns in investigation loop
- `toolCallCount` - Total tools executed (investigation thoroughness metric)
- `uniqueToolsUsed` - Array of distinct tool names called
- `status` - "success" | "failed" | "timeout" | "parse_error" (reliability tracking)
- `completionReason` - "investigation_complete" | "max_iterations" | "parse_failure" | "model_stopped" | "error"
- `modelVersion` - Specific model ID (e.g., "claude-sonnet-4-5-20250929")
- `cacheHitRate` - Percentage of tokens from cache (calculated from cacheRead/input ratio)

**Manual Annotation Placeholders**:
- `manualNotes` - General observations after test review
- `failureReason` - Root cause of failures
- `qualityIssues` - List of quality problems discovered
- `comparisonNotes` - Cross-model comparison insights

**Critical Discovery - Token Reporting Discrepancy**:
- Native Anthropic SDK reports **2-3x more tokens** than Vercel SDK
- **Test Results**:
  - Native Anthropic: 11,493 input / 1,293 output tokens (1st investigation)
  - Vercel SDK: 3,682 input / 440 output tokens (1st investigation)
  - **Difference**: 3.1x more input, 2.9x more output via native SDK
- **Same model** (claude-sonnet-4-5-20250929), same operations, similar durations
- Confirms Vercel SDK **underreports ~70% of tokens** compared to native Anthropic SDK
- Aligns with code comment in vercel-provider.ts:371 about token discrepancy
- **Impact**: Cannot trust Vercel SDK token metrics for cost estimation

**Files Modified**:
- `src/core/ai-provider.interface.ts` - Extended `AgenticResult` interface with new fields
- `src/core/providers/provider-debug-utils.ts` - Updated `logMetrics()` signature, added `createAndLogAgenticResult()` helper
- `src/core/providers/anthropic-provider.ts` - Updated all toolLoop exit points to use helper (3 locations)
- `src/core/providers/vercel-provider.ts` - Changed `getProviderType()` to return 'vercel', updated to use helper (2 locations)
- `tests/integration/infrastructure/run-integration-tests.sh` - Added env var defaults with logging

**Metrics Storage**:
- Location: `tests/integration/metrics/` (committed to Git for historical comparison)
- Format: Compact JSONL (one JSON object per line, no commas between objects)
- Baseline saved: `anthropic-native-sonnet-20251007-001434.jsonl`
- Purpose: Enable regression detection, historical comparison, documentation

**Test Validation**:
- ✅ Build succeeded with no errors
- ✅ Integration tests passed with native Anthropic SDK
- ✅ Metrics properly captured with `sdk: "anthropic"` field
- ✅ Manual annotation placeholders present in output
- ✅ Extended metrics populated correctly

**Next Session Priorities**:
1. Run comprehensive tests across all providers (Anthropic, Vercel, OpenAI, Google Gemini)
2. Populate manual annotation fields after analyzing test results
3. Complete full model comparison analysis (6 models total per Decision 6)
4. Investigate Vercel SDK token reporting discrepancy (consider filing GitHub issue)
5. Use extended metrics to create blog post comparing AI model performance