# PRD: OpenTelemetry Tracing for MCP Server

**Created**: 2025-10-03
**Status**: Complete
**Owner**: Viktor Farcic
**GitHub Issue**: [#137](https://github.com/vfarcic/dot-ai/issues/137)
**Last Updated**: 2025-10-30

## Executive Summary

Add OpenTelemetry distributed tracing to the MCP server to provide vendor-neutral observability for debugging complex multi-step workflows, measuring AI provider performance, and understanding Kubernetes operation latency. This enables production-ready monitoring without infrastructure lock-in.

## Problem Statement

The DevOps AI Toolkit MCP server handles complex operations including:
- Multi-step workflows (buildPlatform: list → submitAnswers → execute)
- AI provider calls (Claude, OpenAI) with variable latency
- Kubernetes cluster operations (discovery, deployment, remediation)
- Session-based stateful interactions across tool calls
- HTTP/SSE and STDIO transport protocols

**Current Gap**: No distributed tracing capability to understand:
- Where time is spent in multi-tool workflows
- Which AI provider calls are slow or failing
- How Kubernetes API latency impacts user experience
- How errors correlate across complex request chains

**Impact**: Difficult to debug performance issues, optimize AI costs, and troubleshoot production incidents.

## Documentation Changes

### Files Created/Updated
- **`docs/observability-guide.md`** - New File - Complete guide for OpenTelemetry tracing, configuration, and usage
- **`docs/deployment-guide.md`** - Updated - Add tracing configuration for production deployments
- **`docs/development-guide.md`** - New File - Developer guide for adding instrumentation to new tools
- **`README.md`** - Project Overview - Add observability to core capabilities
- **`src/core/tracing/`** - Technical Implementation - OpenTelemetry instrumentation modules

### Content Location Map
- **Feature Overview**: See `docs/observability-guide.md` (Section: "What is Distributed Tracing")
- **Configuration**: See `docs/observability-guide.md` (Section: "Setup and Configuration")
- **Tool Instrumentation**: See `docs/development-guide.md` (Section: "Adding Tracing to Tools")
- **Production Deployment**: See `docs/deployment-guide.md` (Section: "Observability Configuration")
- **Trace Analysis**: See `docs/observability-guide.md` (Section: "Understanding Traces")
- **Integration Examples**: See `docs/observability-guide.md` (Section: "Backend Integration")

### User Journey Validation
- [ ] **Primary workflow** documented end-to-end: Enable tracing → Deploy MCP → View traces → Debug issues
- [ ] **Developer workflow** complete: Add instrumentation → Test locally → Verify traces → Deploy
- [ ] **Operations workflow** complete: Configure collector → Deploy server → Monitor traces → Troubleshoot
- [ ] **Cross-references** between development docs and observability docs work correctly

## Solution Overview

**Standard Server-Side OpenTelemetry Implementation with Official GenAI Semantic Conventions**

Implement OpenTelemetry instrumentation following industry best practices and official GenAI semantic conventions:
1. **Auto-instrumentation**: HTTP, Express middleware tracing (SERVER spans)
2. **Generic wrapper instrumentation**: Tool execution, AI calls via reusable wrappers
3. **Official GenAI conventions**: Use `gen_ai.*` attributes for AI/LLM operations
4. **Hierarchical span structure**: SERVER → INTERNAL → CLIENT span relationships
5. **Direct export**: Server exports traces to OTel collector (not through MCP protocol)
6. **Trace context propagation**: Correlate multi-step workflows and sessions
7. **Integration**: Extend existing Logger and AIProvider interfaces

**NOT implementing**: The controversial "send traces through MCP" approach from modelcontextprotocol/discussions/269.

### Generic Instrumentation Strategy (Decision: 2025-10-29)

**Principle**: Instrument at architectural boundaries using generic wrappers, not manual spans scattered throughout the codebase.

**Four Strategic Instrumentation Points:**

1. **HTTP Entry Points** (Auto-instrumented)
   - Uses `@opentelemetry/instrumentation-express`
   - Zero code changes required
   - Creates SERVER spans with `http.*` attributes
   - Example: `POST /mcp/tools`

2. **MCP Tool Execution** (Generic wrapper: `withToolTracing()`)
   - Single wrapper function instruments all 5 MCP tools automatically
   - Creates INTERNAL spans with `gen_ai.tool.*` attributes
   - Example: `execute_tool recommend`
   - Implementation: `src/core/tracing/tool-tracing.ts`

3. **AI Provider Calls** (Interface-level tracing: `TracedAIProvider`)
   - Base class implements tracing for all AI providers
   - Creates CLIENT spans with official `gen_ai.*` attributes
   - Example: `chat claude-3-5-sonnet`
   - All providers (Anthropic, Vercel, NoOp) inherit tracing behavior

4. **Kubernetes API Calls** (Optional proxy: `createTracedK8sClient()`)
   - Generic proxy wrapper for transparent instrumentation
   - Creates CLIENT spans with `k8s.*` and `http.*` attributes
   - Example: `k8s.listAPIGroups`
   - Implementation: `src/core/tracing/k8s-tracing.ts`

**Benefits:**
- ✅ ~100-150 lines of instrumentation code cover the entire system
- ✅ New tools/providers automatically traced
- ✅ No manual span management in business logic
- ✅ Standards-compliant with official OpenTelemetry GenAI conventions
- ✅ Maintainable: change tracing logic in one place
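
As a rough illustration of the generic-wrapper idea, the sketch below wraps any async tool handler in a span named `execute_tool <tool>` with a `gen_ai.tool.name` attribute. It is not the actual `withToolTracing()` implementation: the `SpanLike` and `TracerLike` interfaces, the `recordError` method, and `createRecordingTracer()` are hypothetical stand-ins for the OpenTelemetry API so the example runs without dependencies.

```typescript
// Hypothetical stand-ins for the OpenTelemetry API (assumption, not the real types).
interface SpanLike {
  setAttribute(key: string, value: string | number | boolean): void;
  recordError(error: unknown): void;
  end(): void;
}

interface TracerLike {
  startSpan(name: string): SpanLike;
}

type ToolHandler<A, R> = (args: A) => Promise<R>;

// Sketch of a generic wrapper: one function instruments every tool handler.
// The real `withToolTracing()` signature may differ.
function withToolTracing<A, R>(
  tracer: TracerLike,
  toolName: string,
  handler: ToolHandler<A, R>,
): ToolHandler<A, R> {
  return async (args: A): Promise<R> => {
    const span = tracer.startSpan(`execute_tool ${toolName}`);
    span.setAttribute("gen_ai.tool.name", toolName);
    try {
      return await handler(args);
    } catch (error) {
      // Record the failure on the span, then let the caller see the same error.
      span.recordError(error);
      throw error;
    } finally {
      // The span always ends, whether the handler succeeded or threw.
      span.end();
    }
  };
}

// In-memory tracer used only to demonstrate the wrapper.
interface RecordedSpan {
  name: string;
  attributes: Record<string, string | number | boolean>;
  error?: string;
  ended: boolean;
}

function createRecordingTracer(): { tracer: TracerLike; spans: RecordedSpan[] } {
  const spans: RecordedSpan[] = [];
  const tracer: TracerLike = {
    startSpan(name) {
      const record: RecordedSpan = { name, attributes: {}, ended: false };
      spans.push(record);
      return {
        setAttribute: (key, value) => {
          record.attributes[key] = value;
        },
        recordError: (error) => {
          record.error = error instanceof Error ? error.message : String(error);
        },
        end: () => {
          record.ended = true;
        },
      };
    },
  };
  return { tracer, spans };
}
```

Because the handler's business logic never touches the span, registering a new tool through the wrapper is enough to get it traced.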

## Implementation Requirements

### Core Functionality
- [ ] **HTTP/MCP request tracing**: Automatic span creation for all incoming requests - Documented in `docs/observability-guide.md` (Section: "Request Tracing")
- [ ] **Tool execution spans**: Each of 5 active MCP tools traced (recommend, version, manageOrgData, remediate, projectSetup) - Documented in `docs/development-guide.md` (Section: "Tool Spans")
- [ ] **Error tracking**: Integration with existing error-handling system - Documented in `docs/observability-guide.md` (Section: "Error Correlation")
- [ ] **Trace context propagation**: Session-based workflow correlation - Documented in `docs/development-guide.md` (Section: "Context Propagation")

### Deep Instrumentation
- [ ] **AI provider tracing**: Claude/OpenAI API call spans with latency/tokens - Documented in `docs/observability-guide.md` (Section: "AI Provider Metrics")
- [ ] **Kubernetes operations**: Cluster API calls, discovery, deployments - Documented in `docs/observability-guide.md` (Section: "Kubernetes Operations")
- [ ] **Multi-step workflows**: Trace buildPlatform intent mapping → script discovery → execution - Documented in `docs/development-guide.md` (Section: "Complex Workflows")
- [ ] **Session lifecycle**: Track session creation, continuity, and completion - Documented in `docs/observability-guide.md` (Section: "Session Tracking")

### Configuration & Deployment
- [ ] **Environment-based config**: OTEL_EXPORTER_OTLP_ENDPOINT, service name, sampling - Documented in `docs/deployment-guide.md` (Section: "Environment Variables")
- [ ] **Multiple exporters**: Console (dev), OTLP (production), Jaeger, Zipkin - Documented in `docs/observability-guide.md` (Section: "Exporter Configuration")
- [ ] **Sampling strategies**: Always-on (dev), probability-based (production) - Documented in `docs/deployment-guide.md` (Section: "Sampling Configuration")
- [ ] **Zero-config default**: Works out-of-box with console exporter for local development - Documented in `docs/development-guide.md` (Section: "Getting Started")
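
To make the probability-based sampling requirement concrete, here is a minimal sketch of a deterministic sampling decision derived from the trace ID, so that every span in a trace gets the same answer. It is illustrative only: `shouldSample` is a hypothetical helper, and in practice the OpenTelemetry SDK's built-in ratio-based sampler would normally be configured instead of hand-written logic.

```typescript
// Illustrative only: decides whether a trace is kept, based on its trace ID.
// `traceId` is expected to be a hex string (OpenTelemetry trace IDs are 32 hex chars).
function shouldSample(traceId: string, ratio: number): boolean {
  if (ratio >= 1) return true; // always-on (development)
  if (ratio <= 0) return false; // always-off
  // Map the first 8 hex characters (32 bits) to a number in [0, 1).
  const bucket = parseInt(traceId.slice(0, 8), 16) / 0x100000000;
  // Invalid hex yields NaN, and `NaN < ratio` is false, so such traces are dropped.
  return bucket < ratio;
}
```

Deriving the decision from the trace ID rather than a random number means a parent and its children never disagree about whether the trace is recorded.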

### Documentation Quality Requirements
- [ ] **All examples work**: Configuration examples validated in integration tests
- [ ] **Complete user journeys**: End-to-end workflows from setup to trace analysis documented
- [ ] **Consistent terminology**: OpenTelemetry terms used correctly across all documentation
- [ ] **Working cross-references**: All links between observability docs and core docs resolve correctly

### Success Criteria
- [ ] **Minimal overhead**: <2ms latency added per request with tracing enabled
- [ ] **Complete visibility**: All tool executions, AI calls, and K8s operations traced
- [ ] **Developer experience**: Simple API for adding spans to new tools
- [ ] **Production ready**: Configurable sampling, multiple backends, robust error handling
- [ ] **Zero infrastructure requirement**: Works with any OTel-compatible backend

## Implementation Progress

### Phase 1: Core Tracing Foundation [Status: ✅ COMPLETE - 100%]
**Target**: Basic distributed tracing working for HTTP requests and tool execution
**Note**: Integration tests will be written at the end after all phases complete

**Documentation Changes:**
- [x] **`docs/observability-guide.md`**: Create comprehensive user guide with tracing concepts, setup, and usage
- [x] **`docs/setup/docker-setup.md`**: Add tracing environment variables with link to observability guide
- [x] **`docs/setup/kubernetes-setup.md`**: Add tracing note with link to observability guide
- [x] **`docs/setup/kubernetes-toolhive-setup.md`**: Add tracing note with link to observability guide
- [x] **`docs/mcp-setup.md`**: Add links to observability guide in Configuration Components and Next Steps

**Implementation Tasks:**
- [x] Add OpenTelemetry dependencies (`@opentelemetry/sdk-node`, `@opentelemetry/api`, `@opentelemetry/auto-instrumentations-node`)
- [x] Create `src/core/tracing/tracer.ts` with initialization and configuration logic
- [x] Create `src/core/tracing/config.ts` with environment-based configuration
- [x] Create `src/core/tracing/types.ts` with TypeScript type definitions
- [x] Create `src/core/tracing/http-tracing.ts` with HTTP SERVER span creation
- [x] Implement HTTP middleware tracing for HTTP/SSE transport (SERVER spans working with proper context propagation)
- [x] Fix trace context propagation (CLIENT spans now children of SERVER span)
- [x] Add tool execution span wrapper for all 5 MCP tools (created `src/core/tracing/tool-tracing.ts` with `withToolTracing`)
- [x] Implement STDIO transport tracing (tool spans work for MCP calls through Claude Code)
- [x] Integrate tracer with MCP server startup and graceful shutdown
- [x] Configure console exporter for local development
- [x] Add environment variable configuration (OTEL_SERVICE_NAME, OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_TRACING_ENABLED, OTEL_DEBUG)
- [x] Add OTLP exporter support (Phase 3 work completed early)
- [x] Add OpenTelemetry status to version tool (shows tracing config and health)
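
The environment variables listed above could be read into a configuration object along these lines. This is a sketch, not the contents of `src/core/tracing/config.ts`: the `TracingConfig` shape, the `dot-ai-mcp` default service name, the treatment of `OTEL_DEBUG` as a boolean flag, and the rule "use OTLP when an endpoint is set, otherwise console" are all assumptions made for illustration. The environment is passed in as a plain object so the function is easy to test; real code would pass `process.env`.

```typescript
// Hypothetical config shape (assumption): the real config.ts may expose other fields.
interface TracingConfig {
  enabled: boolean;
  serviceName: string;
  exporter: "console" | "otlp";
  otlpEndpoint: string | undefined;
  debug: boolean;
}

type Env = Record<string, string | undefined>;

function loadTracingConfig(env: Env): TracingConfig {
  // Treat empty or whitespace-only values as unset.
  const otlpEndpoint = env["OTEL_EXPORTER_OTLP_ENDPOINT"]?.trim() || undefined;
  return {
    // Opt-in: tracing stays off unless explicitly enabled.
    enabled: env["OTEL_TRACING_ENABLED"] === "true",
    // Default service name is a placeholder chosen for this sketch.
    serviceName: env["OTEL_SERVICE_NAME"]?.trim() || "dot-ai-mcp",
    // Assumed selection rule: OTLP when an endpoint is configured, console otherwise.
    exporter: otlpEndpoint ? "otlp" : "console",
    otlpEndpoint,
    debug: env["OTEL_DEBUG"] === "true",
  };
}
```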

### Phase 2: Deep Instrumentation [Status: ✅ COMPLETE - 100%]
**Target**: AI provider calls, Kubernetes operations, and vector database fully traced

**Documentation Changes:**
- [ ] **`docs/development-guide.md`**: Create developer guide for adding instrumentation to new tools and operations
- [ ] **`docs/observability-guide.md`**: Add "AI Provider Metrics" and "Kubernetes Operations" sections
- [ ] **`docs/observability-guide.md`**: Document trace analysis workflows for common debugging scenarios

**Implementation Tasks:**
- [x] **AI Provider Chat Tracing** - Complete generic wrapper instrumentation ✅
  - [x] Created `src/core/tracing/ai-tracing.ts` with `withAITracing()` wrapper supporting `chat`, `tool_loop`, `embeddings` operations
  - [x] Integrated with `AnthropicProvider.sendMessage()` for chat operations
  - [x] Integrated with `VercelProvider.sendMessage()` for chat operations (supports all Vercel providers: OpenAI, Google, Anthropic, xAI, Mistral, DeepSeek, OpenRouter, custom endpoints)
  - [x] Official GenAI semantic conventions: `gen_ai.operation.name`, `gen_ai.provider.name`, `gen_ai.request.model`, `gen_ai.usage.*`
  - [x] Token usage tracking (input tokens, output tokens, cache read tokens, cache creation tokens)
  - [x] Tested with Jaeger - spans showing descriptive names like `chat claude-sonnet-4-5-20250929` with full GenAI attributes
  - [x] Context propagation working - AI spans properly nested under tool execution spans
- [x] **AI Provider ToolLoop Tracing** - Complete generic wrapper instrumentation with iteration visibility ✅
  - [x] Wrap `AnthropicProvider.toolLoop()` with `withAITracing(operation: 'tool_loop')`
  - [x] Wrap `VercelProvider.toolLoop()` with `withAITracing(operation: 'tool_loop')`
  - [x] Add `tool_loop_iteration` spans to `AnthropicProvider.toolLoop()` for per-iteration visibility
  - [x] Removed `isEnabled()` checks from tracing wrappers - trust OpenTelemetry no-op tracer (simpler code, zero overhead when disabled)
  - [x] Tested with Jaeger - iteration spans properly nested under `tool_loop` span, showing clear workflow progression
- [x] **AI Provider Embeddings Tracing** - Complete generic wrapper instrumentation ✅
  - [x] Wrapped `VercelEmbeddingProvider.generateEmbedding()` with `withAITracing(operation: 'embeddings')`
  - [x] Wrapped `VercelEmbeddingProvider.generateEmbeddings()` with `withAITracing(operation: 'embeddings')`
  - [x] Official GenAI semantic conventions: `gen_ai.operation.name`, `gen_ai.provider.name`, `gen_ai.request.model`
  - [x] Embedding metrics tracking (count, dimensions via `gen_ai.embeddings.count`, `gen_ai.embeddings.dimensions`)
  - [x] Tested with capability scan - spans showing `embeddings text-embedding-3-small` with proper nesting
  - [x] Context propagation working - embedding spans properly nested under tool execution spans
- [x] **Kubernetes Client Library Tracing** - Complete generic proxy wrapper instrumentation ✅
  - [x] Created `src/core/tracing/k8s-tracing.ts` with `createTracedK8sClient()` proxy wrapper for transparent instrumentation
  - [x] Integrated in `src/core/discovery.ts` - Wrapped CoreV1Api and VersionApi clients
  - [x] Integrated in `src/tools/version.ts` - Wrapped AppsV1Api and AdmissionregistrationV1Api clients
  - [x] JavaScript Proxy pattern creates CLIENT spans with `k8s.api`, `k8s.method` attributes
  - [x] Zero code changes required in existing K8s operations - automatic tracing via proxy
- [x] **Kubectl CLI Tracing** - Complete wrapper instrumentation for CLI commands ✅
  - [x] Created `withKubectlTracing()` wrapper function in `src/core/tracing/k8s-tracing.ts`
  - [x] Integrated in `src/core/kubernetes-utils.ts` - Wrapped `executeKubectl()` function
  - [x] Creates CLIENT spans with `k8s.client: 'kubectl'`, `k8s.operation`, `k8s.resource` attributes
  - [x] Tested with capability scanning kubectl commands - spans showing proper operation details
- [x] **Qdrant Vector Database Tracing** - Complete generic wrapper instrumentation ✅
  - [x] Created `src/core/tracing/qdrant-tracing.ts` with `withQdrantTracing()` wrapper for all vector DB operations
  - [x] Database semantic conventions: `db.system: 'qdrant'`, `db.operation.name`, `db.collection.name`, `db.vector.*`
  - [x] Integrated with all 10 VectorDBService operations (upsert, search, searchByKeywords, getDocument, deleteDocument, deleteAll, getAllDocuments, getCollectionInfo, healthCheck, initializeCollection)
  - [x] Result metadata tracking: `db.query.result_count`, `db.vector.top_score` for search operations
  - [x] Tested with capability scan workflow - all operations traced correctly (delete_all, upsert, search, list)
  - [x] Context propagation working - vector DB spans properly nested under tool execution spans
- [x] Instrument deployment operations in `src/tools/deploy-manifests.ts` (Uses executeKubectl which is already traced)
- [~] Add session lifecycle tracing with session ID propagation (Moved to PRD #197)
- [~] Implement trace context propagation across multi-step workflows (buildPlatform, remediate) (Moved to PRD #197)
- [~] Add custom span attributes for tool parameters and results (Not required - current attributes sufficient)

### Phase 3: Advanced Features & Production Readiness [Status: 🔄 DEFERRED]
**Target**: Production-grade observability with metrics, sampling, and multiple backends
**Note**: Not required - OTLP exporter and current feature set sufficient for production use

**Documentation Changes:**
- [ ] **`docs/observability-guide.md`**: Add "Advanced Configuration", "Metrics", and "Production Best Practices" sections
- [ ] **`docs/deployment-guide.md`**: Document production sampling strategies and backend integration
- [ ] **Cross-file validation**: Ensure observability integrates seamlessly with deployment and development workflows

**Implementation Tasks:**
- [x] Add OTLP exporter support (HTTP exporter complete, tested with Jaeger)
- [ ] Add native Jaeger and Zipkin exporters with auto-detection
- [ ] Implement configurable sampling strategies (always-on, probability-based, rate-limiting)
- [ ] Add OpenTelemetry Metrics API for request counts, durations, error rates
- [ ] Create custom metrics for AI token usage, K8s API call counts, tool execution frequency
- [ ] Implement trace baggage for user context propagation
- [ ] Performance benchmarking to validate <2ms overhead target

### Phase 4: Testing & Documentation [Status: 🔄 DEFERRED]
**Target**: Comprehensive testing and documentation
**Note**: Not required - feature validated through manual testing, documentation sufficient for current needs

**Testing Tasks:**
- [ ] Add integration tests for tracing with mock OTel collector
- [ ] Test tool execution spans for all 5 MCP tools
- [ ] Test AI provider call spans with different models
- [ ] Test Kubernetes operation spans
- [ ] Test trace context propagation across multi-step workflows
- [ ] Test all exporter types (console, OTLP, Jaeger, Zipkin)
- [ ] Test sampling configurations
- [ ] Performance testing to validate <2ms overhead target

**Documentation Tasks:**
- [ ] Complete `docs/observability-guide.md` with all sections
- [ ] Complete `docs/development-guide.md` with instrumentation examples
- [ ] Update `docs/deployment-guide.md` with tracing configuration
- [ ] Update `README.md` with observability capabilities
- [ ] Validate all code examples work
- [ ] Validate all cross-references resolve correctly

## Technical Implementation Checklist

### Architecture & Design
- [ ] Design tracer initialization with lazy loading to minimize startup overhead (src/core/tracing/tracer.ts)
- [ ] Create span factory with consistent attribute naming conventions (src/core/tracing/span-factory.ts)
- [ ] Design Logger integration for automatic trace context injection (src/core/tracing/logger-integration.ts)
- [ ] Plan exporter selection strategy based on environment variables (src/core/tracing/exporters.ts)
- [ ] Design sampling configuration with environment-based overrides (src/core/tracing/sampling.ts)
- [ ] Document tracing architecture and span hierarchy
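
One possible shape for the Logger integration item above is a small helper that adds the active trace and span IDs to structured log fields. This is a sketch under assumptions: `withTraceContext`, `ActiveSpanContext`, and the `trace_id`/`span_id` field names are hypothetical, and the active context is supplied through a callback rather than read from the OpenTelemetry API (where it would typically come from the active span's span context).

```typescript
// Hypothetical view of the currently active span (assumption for this sketch).
interface ActiveSpanContext {
  traceId: string;
  spanId: string;
}

type LogFields = Record<string, unknown>;

// Returns the log fields enriched with trace identifiers when a span is active.
// The input object is never mutated.
function withTraceContext(
  fields: LogFields,
  getActive: () => ActiveSpanContext | undefined,
): LogFields {
  const active = getActive();
  if (!active) return fields; // no active span: log as before
  return { ...fields, trace_id: active.traceId, span_id: active.spanId };
}
```

With identifiers attached this way, a log line can be matched to the trace that produced it in whichever backend stores the spans.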

### Development Tasks
- [ ] Implement `TracingService` class with start/stop lifecycle management
- [ ] Create `withSpan` utility for wrapping async operations with tracing
- [ ] Add `instrumentTool` decorator for automatic tool span creation
- [ ] Implement trace context extraction from MCP session IDs
- [ ] Build error tracking integration with existing `ErrorHandler` class
- [ ] Create span attribute helpers for consistent metadata

### Documentation Validation
- [ ] **Automated testing**: Configuration examples execute successfully in integration tests
- [ ] **Cross-file consistency**: Tracing terminology aligned across all documentation
- [ ] **User journey testing**: Complete setup-to-analysis workflows can be followed end-to-end
- [ ] **Link validation**: All references between observability docs and core documentation resolve correctly

### Quality Assurance
- [ ] Unit tests for tracer initialization and span creation (>90% coverage)
- [ ] Unit tests for exporter configuration and selection logic (>90% coverage)
- [ ] Integration tests with mock OpenTelemetry collector
- [ ] Performance tests validating <2ms overhead per request
- [ ] Load testing with tracing enabled on large-scale operations
- [ ] Trace data validation ensuring correct span relationships and attributes

## Dependencies & Blockers

### External Dependencies
- [ ] OpenTelemetry SDK and API packages (npm packages) - ✅ Available
- [ ] OpenTelemetry collector or compatible backend (optional for dev) - ✅ Console exporter works out-of-box
- [ ] Backend for production (Jaeger, Zipkin, Grafana Tempo, vendor services) - User choice

### Internal Dependencies
- [ ] Existing Logger interface for trace context integration - ✅ Available
- [ ] Error handling system for error span tracking - ✅ Available (src/core/error-handling.ts)
- [ ] MCP server with 5 tools for instrumentation - ✅ Available
- [ ] HTTP/SSE transport for request tracing - ✅ Available

### Current Blockers
- [ ] None currently identified - all dependencies are satisfied

## Risk Management

### Identified Risks
- [ ] **Risk**: Performance overhead impacting request latency | **Mitigation**: Benchmark early, implement sampling, use async exports | **Owner**: Developer
- [ ] **Risk**: Additional complexity in error handling and logging | **Mitigation**: Extend existing patterns, comprehensive testing | **Owner**: Developer
- [ ] **Risk**: Configuration complexity for users | **Mitigation**: Zero-config defaults, clear documentation, environment variable standards | **Owner**: Developer
- [ ] **Risk**: Vendor lock-in with specific backends | **Mitigation**: OpenTelemetry standard ensures portability, support multiple exporters | **Owner**: Developer

### Mitigation Actions
- [ ] Performance benchmarking in Phase 1 to validate overhead targets
- [ ] Developer guide with clear examples for adding instrumentation
- [ ] Default to console exporter for zero-config local development
- [ ] Support standard OTEL environment variables for backend-agnostic configuration

## Decision Log

### Open Questions
- [ ] What default sampling rate for production (1%, 10%, 100%)?
- [ ] Should we include trace IDs in all log messages automatically?

### Resolved Decisions

**Initial Architecture (2025-10-03):**
- [x] **Standard server-side OTel implementation** - **Decided**: 2025-10-03 | **Rationale**: Industry best practice, avoids MCP protocol controversy, mature ecosystem
- [x] **Direct trace export to collector** - **Decided**: 2025-10-03 | **Rationale**: Standard approach, avoids security concerns, better separation of concerns
- [x] **Extend existing Logger interface** - **Decided**: 2025-10-03 | **Rationale**: Minimal disruption, automatic trace context in logs, familiar developer experience
- [x] **Tracing disabled by default (opt-in)** - **Decided**: 2025-10-29 | **Rationale**: Avoid console noise for users not interested in observability. Set OTEL_TRACING_ENABLED=true to enable. Console exporter outputs to stderr when enabled.

**Semantic Conventions & Architecture (2025-10-29):**
- [x] **Use official OpenTelemetry GenAI semantic conventions** - **Decided**: 2025-10-29 | **Rationale**: OpenTelemetry provides comprehensive AI/LLM/agent support through `gen_ai.*` attributes, which the specification already publishes with Development status. Using them ensures interoperability with AI observability tools, future compatibility, and community alignment. Replaces custom conventions (`mcp.tool.name`, `ai.provider`) with official ones (`gen_ai.tool.name`, `gen_ai.provider.name`). | **Impact**: All attribute names change to official conventions. Code examples need updating. Documentation must reference official OpenTelemetry GenAI specifications. | **Resources**: [GenAI Semantic Conventions](https://opentelemetry.io/docs/specs/semconv/gen-ai/), [GenAI Spans](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/), [AI Agent Observability Blog](https://opentelemetry.io/blog/2025/ai-agent-observability/)

- [x] **Combine multiple span types in hierarchical structure** - **Decided**: 2025-10-29 | **Rationale**: Standard OpenTelemetry pattern for multi-layer architectures. HTTP requests create SERVER spans, business logic creates INTERNAL spans, external calls create CLIENT spans. Natural parent-child relationships reflect actual execution flow. | **Impact**: Traces show complete end-to-end visibility: HTTP request → MCP tool → AI provider → K8s API all properly nested with appropriate span kinds.

- [x] **Generic instrumentation via wrappers, not manual spans** - **Decided**: 2025-10-29 | **Rationale**: Instrument at architectural boundaries (HTTP layer, tool execution, AI providers, K8s client) using generic wrappers. Avoids littering the codebase with manual span creation. Coverage is automatic as new tools/providers are added. | **Impact**: ~100-150 lines of wrapper code instrument the entire system. New tools are automatically traced when registered. Changes to tracing logic happen in one place. | **Code Changes**: Create `withToolTracing()` wrapper (src/core/tracing/tool-tracing.ts), `TracedAIProvider` base class (src/core/providers/base-traced-provider.ts), optional `createTracedK8sClient()` proxy (src/core/tracing/k8s-tracing.ts).

- [x] **Four-layer instrumentation architecture** - **Decided**: 2025-10-29 | **Rationale**: Clear separation of concerns with instrumentation at each architectural layer: (1) HTTP entry points via auto-instrumentation, (2) MCP tool execution via generic wrapper, (3) AI provider calls via base class, (4) Kubernetes API via optional proxy. | **Impact**: Phase 1 focuses on layers 1-3 (HTTP, tools, AI). Layer 4 (K8s) is optional for Phase 2. Each layer uses appropriate span kind (SERVER, INTERNAL, CLIENT) and semantic conventions (http.*, gen_ai.*, k8s.*).
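
To show what the GenAI convention decision means in practice, the sketch below assembles the span name and `gen_ai.*` attributes for one AI call. `buildGenAISpan` and `AICallInfo` are hypothetical names used only here. The attribute keys follow the official GenAI semantic conventions, while `tool_loop` is this project's own operation value rather than one defined by the specification.

```typescript
type GenAIOperation = "chat" | "tool_loop" | "embeddings";

// Hypothetical description of a single AI provider call (assumption for this sketch).
interface AICallInfo {
  operation: GenAIOperation;
  provider: string;
  model: string;
  inputTokens?: number;
  outputTokens?: number;
}

interface GenAISpanDescription {
  name: string;
  attributes: Record<string, string | number>;
}

function buildGenAISpan(info: AICallInfo): GenAISpanDescription {
  const attributes: Record<string, string | number> = {
    "gen_ai.operation.name": info.operation,
    "gen_ai.provider.name": info.provider,
    "gen_ai.request.model": info.model,
  };
  // Token usage is only known after the provider responds, so it is optional.
  if (info.inputTokens !== undefined) {
    attributes["gen_ai.usage.input_tokens"] = info.inputTokens;
  }
  if (info.outputTokens !== undefined) {
    attributes["gen_ai.usage.output_tokens"] = info.outputTokens;
  }
  // Span name follows the "<operation> <model>" pattern used in this PRD's examples.
  return { name: `${info.operation} ${info.model}`, attributes };
}
```

For a chat call against `claude-sonnet-4-5-20250929`, this produces the span name `chat claude-sonnet-4-5-20250929`, matching the names reported from the Jaeger tests.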

**Proposed Agentic System Conventions (Future):**
- [~] **Use `gen_ai.task.*` and `gen_ai.action.*` attributes when standardized** - **Status**: Monitoring GitHub Issue [#2664](https://github.com/open-telemetry/semantic-conventions/issues/2664) | **Rationale**: OpenTelemetry is developing comprehensive semantic conventions for AI agentic systems including tasks, actions, agents, teams, artifacts, and memory. These conventions will standardize telemetry for multi-step AI workflows. | **Impact**: When finalized, we can add task-level and action-level spans for complex workflows like `recommend` (which has multiple stages: intent analysis → capability scan → AI recommendation → manifest generation). | **Timeline**: Monitor for stability; likely Phase 3 implementation.

## Scope Management

### In Scope (Current Version)
- [ ] HTTP request and tool execution tracing
- [ ] AI provider call instrumentation (Claude, OpenAI)
- [ ] Kubernetes operation tracing
- [ ] Session lifecycle and context propagation
- [ ] Console, OTLP, Jaeger, Zipkin exporters
- [ ] Configurable sampling strategies
- [ ] Integration with existing error handling and logging
- [ ] Developer utilities for adding instrumentation

### Out of Scope (Future Versions)
- [~] Custom trace visualization UI
- [~] Automatic anomaly detection in traces
- [~] Cost analysis and optimization recommendations
- [~] Trace-based alerting and notifications
- [~] Historical trace analysis and trend identification
- [~] Multi-tenant trace isolation

### Deferred Items
- [~] Custom visualization - **Reason**: Use existing OTel-compatible tools (Jaeger, Grafana) **Target**: Not planned
- [~] Anomaly detection - **Reason**: Focus on instrumentation first, analysis tools exist **Target**: Future enhancement
- [~] Cost optimization - **Reason**: Requires trace correlation with billing data **Target**: v2.0
- [~] Alerting - **Reason**: Use existing observability platform alerting **Target**: Not planned (external tool responsibility)

## Testing & Validation

### Test Coverage Requirements
- [ ] Unit tests for tracer initialization and configuration (>90% coverage)
- [ ] Unit tests for span factory and instrumentation utilities (>90% coverage)
- [ ] Integration tests with mock OpenTelemetry collector
- [ ] Performance tests validating <2ms overhead target
- [ ] Load tests with high-volume trace generation
- [ ] Trace data validation tests ensuring correct span relationships

### User Acceptance Testing
- [ ] Verify traces appear in console exporter during local development
- [ ] Test OTLP export to Jaeger/Zipkin backends
- [ ] Confirm AI provider spans include token counts and model information
- [ ] Validate Kubernetes operation spans include resource types and namespaces
- [ ] Verify error spans correctly capture exception details
- [ ] Test multi-step workflow trace correlation (buildPlatform, remediate)

## Documentation & Communication

### Documentation Completion Status
- [ ] **`docs/observability-guide.md`**: Complete - User guide with tracing concepts, setup, configuration, usage
- [ ] **`docs/development-guide.md`**: Complete - Developer guide for adding instrumentation to tools
- [ ] **`docs/deployment-guide.md`**: Updated - Added tracing configuration for production deployments
- [ ] **`README.md`**: Updated - Added observability to core capabilities
- [ ] **Cross-file consistency**: Complete - OpenTelemetry terminology and patterns aligned

### Communication & Training
- [ ] Team announcement of observability capabilities
- [ ] Create demo showing trace collection and analysis workflow
- [ ] Prepare documentation for interpreting traces and debugging with distributed tracing
- [ ] Establish guidelines for adding instrumentation to new tools and features

## Launch Checklist

### Pre-Launch
- [ ] All Phase 1 implementation tasks completed
- [ ] Performance overhead validated (<2ms per request)
- [ ] Console exporter working for local development
- [ ] Documentation and configuration examples completed
- [ ] Developer guide tested with new tool instrumentation

### Launch
- [ ] Deploy tracing-enabled MCP server to staging environment
- [ ] Monitor performance metrics and overhead
- [ ] Validate trace data quality and completeness
- [ ] Collect team feedback on developer experience

### Post-Launch
- [ ] Analyze trace data to identify performance bottlenecks
- [ ] Monitor overhead and optimize if needed
- [ ] Iterate on instrumentation based on usage insights
- [ ] Plan Phase 2 enhancements (AI/K8s deep instrumentation)

## Work Log

### 2025-10-03: Initial PRD Creation
**Duration**: ~45 minutes
**Primary Focus**: Research OpenTelemetry integration and create comprehensive PRD

**Completed Work**:
- Researched OpenTelemetry MCP integration patterns and community discussions
- Analyzed existing MCP server architecture and logging infrastructure
- Created GitHub issue #137 for OpenTelemetry tracing feature
- Developed comprehensive PRD following documentation-first approach
- Structured implementation as 3 major phases with clear milestones

**Key Decisions**:
- **Standard server-side implementation**: Avoiding controversial MCP protocol trace forwarding
- **Extend existing patterns**: Building on current Logger and ErrorHandler infrastructure
- **Zero-config defaults**: Console exporter for immediate local development value
- **Vendor-neutral**: OpenTelemetry standard ensures backend portability

**Next Steps**: Ready for implementation of Phase 1 - Core Tracing Foundation

### 2025-10-29: Architecture Refinement - GenAI Conventions & Generic Instrumentation
**Duration**: ~90 minutes
**Primary Focus**: Research OpenTelemetry GenAI semantic conventions and design generic instrumentation strategy

**Completed Work**:
- Researched official OpenTelemetry GenAI semantic conventions (gen_ai.* attributes)
- Analyzed OpenTelemetry AI agent observability standards and best practices
- Investigated span hierarchy patterns for multi-layer architectures (SERVER → INTERNAL → CLIENT)
- Designed generic instrumentation strategy using wrappers instead of manual spans
- Identified four strategic instrumentation points for minimal code changes

**Key Decisions**:
- **Use official GenAI conventions**: Replace custom attributes with `gen_ai.*` standards
- **Hierarchical span structure**: Combine SERVER, INTERNAL, CLIENT spans naturally
- **Generic wrappers**: Instrument at architectural boundaries (HTTP, tools, AI, K8s)
- **Four-layer architecture**: Auto-instrumentation + three generic wrappers

**PRD Updates**:
- Updated Decision Log with four major architectural decisions (2025-10-29)
- Replaced custom semantic conventions with official GenAI attributes
- Added comprehensive span hierarchy documentation with examples
- Updated example trace output to use correct attribute names
- Added generic instrumentation strategy to Solution Overview

**Impact on Implementation**:
- Phase 1 implementation simplified: ~100-150 lines of wrapper code
- New tools/providers automatically traced when added
- All attribute names changed to official conventions
- Documentation must reference OpenTelemetry GenAI specifications

**Resources Added**:
- [OpenTelemetry GenAI Semantic Conventions](https://opentelemetry.io/docs/specs/semconv/gen-ai/)
- [GenAI Spans Specification](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/)
- [AI Agent Observability Blog](https://opentelemetry.io/blog/2025/ai-agent-observability/)
- [Agentic Systems Proposal (Issue #2664)](https://github.com/open-telemetry/semantic-conventions/issues/2664)

**Next Steps**: Implement Phase 1 with generic wrappers and official GenAI conventions

### 2025-10-29: Phase 1 Core Tracing Foundation Implementation
**Duration**: ~3 hours
**Primary Focus**: Core OpenTelemetry infrastructure and auto-instrumentation

**Completed PRD Items**:
- [x] Installed OpenTelemetry SDK dependencies (sdk-node, api, auto-instrumentations-node)
- [x] Created core tracing infrastructure (types.ts, config.ts, tracer.ts, index.ts)
- [x] Implemented lazy initialization with console exporter
- [x] Integrated tracer with MCP server startup and graceful shutdown
- [x] Configured environment variable support (OTEL_TRACING_ENABLED, OTEL_SERVICE_NAME, OTEL_DEBUG)
- [x] Tested auto-instrumentation with manual requests (Qdrant, OpenAI, Anthropic CLIENT spans working)
- [x] Changed default session directory to `./tmp/sessions`

**Key Decisions**:
- **Tracing disabled by default**: Set `OTEL_TRACING_ENABLED=true` to opt in (avoids console noise)
- **Resource import fix**: Used `resourceFromAttributes()` instead of `new Resource()` for TypeScript compatibility
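
The opt-in default could look roughly like the sketch below. It is not the contents of `src/core/tracing/config.ts`: the `TracingConfigSketch` shape, the function name, and the strict `'true'` comparison are assumptions made for illustration. Only the three environment variable names and the `dot-ai-mcp` service name come from this log.

```typescript
// Hypothetical config shape; the real types live in src/core/tracing/types.ts.
interface TracingConfigSketch {
  enabled: boolean;
  serviceName: string;
  debug: boolean;
}

// Reads tracing settings from an environment-like map.
// Taking the map as a parameter (instead of reading process.env directly)
// keeps the function easy to test.
function loadTracingConfigSketch(
  env: Record<string, string | undefined>,
): TracingConfigSketch {
  return {
    // Assumption: only the literal string "true" enables tracing.
    enabled: env['OTEL_TRACING_ENABLED'] === 'true',
    // Assumption: fall back to the service name seen in exported spans.
    serviceName: env['OTEL_SERVICE_NAME'] ?? 'dot-ai-mcp',
    debug: env['OTEL_DEBUG'] === 'true',
  };
}
```

Called with `process.env` at startup, a function like this would leave tracing off unless `OTEL_TRACING_ENABLED=true` is set explicitly.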

**What's Working**:
- ✅ Auto-instrumentation captures outbound HTTP calls (undici instrumentation)
- ✅ Qdrant vector DB queries traced (GET /collections, POST /points/scroll)
- ✅ OpenAI embeddings API calls traced (1049ms duration observed)
- ✅ Anthropic API calls traced (showing 401 errors correctly)
- ✅ Console exporter outputs JSON spans to stderr
- ✅ Service identification (dot-ai-mcp v0.125.0) with full resource attributes

**Known Limitations**:
- ❌ HTTP SERVER spans not appearing (REST API requests not traced at entry point)
- ❌ MCP tool execution spans not implemented (Layer 2 INTERNAL spans pending)
- ❌ No span hierarchy (all spans are roots with parentSpanContext: undefined)
- ❌ Documentation not written yet

**Files Created**:
- `src/core/tracing/types.ts` - TypeScript type definitions and interfaces
- `src/core/tracing/config.ts` - Environment-based configuration with `loadTracingConfig()`
- `src/core/tracing/tracer.ts` - Core OpenTelemetry SDK integration with `NodeSDK`
- `src/core/tracing/index.ts` - Public API exports

**Files Modified**:
- `src/mcp/server.ts` - Added `getTracer()` initialization and `shutdownTracer()` cleanup
- `package.json` - Added OpenTelemetry dependencies

**Next Session Priorities**:
- Implement Layer 2: MCP tool execution wrappers (INTERNAL spans with gen_ai.tool.* attributes)
- Add HTTP SERVER span creation for REST API endpoints
- Write `docs/observability-guide.md` comprehensive user guide
- Update `README.md` to mention observability capabilities

### 2025-10-29: HTTP SERVER Spans, OTLP Exporter & Context Propagation
**Duration**: ~3 hours
**Primary Focus**: Complete HTTP tracing infrastructure with proper span hierarchy

**Completed PRD Items**:
- [x] HTTP SERVER span creation - Evidence: `src/core/tracing/http-tracing.ts` (170 lines) with W3C Trace Context extraction, OpenTelemetry HTTP semantic conventions
- [x] HTTP transport integration - Evidence: `src/interfaces/mcp.ts` updated with `createHttpServerSpan()` at entry point
- [x] Context propagation - Evidence: Wrapped request handler in `context.with(trace.setSpan(context.active(), span), async () => {...})`
- [x] OTLP exporter - Evidence: `src/core/tracing/tracer.ts` with `OTLPTraceExporter` for Jaeger integration
- [x] Jaeger testing - Evidence: Traces showing 1 trace with 20 spans, depth 2 (proper parent-child hierarchy)

**Key Implementation Details**:
- **Manual HTTP SERVER span creation**: Resolved auto-instrumentation timing issues by creating explicit SERVER spans with `createHttpServerSpan()` function
- **W3C Trace Context support**: Extract parent trace context from HTTP headers using `propagation.extract()` for distributed tracing compatibility
- **OpenTelemetry semantic conventions**: HTTP semantic-convention attributes (`http.request.method`, `url.path`, `http.response.status_code`, `client.address`, `user_agent.original`)
- **Context propagation fix**: Wrapped entire HTTP request handler in `context.with(trace.setSpan(context.active(), span), async () => {...})` so CLIENT spans inherit SERVER as parent
- **OTLP exporter**: Implemented with default endpoint `http://localhost:4318/v1/traces`, configurable via `OTEL_EXPORTER_OTLP_ENDPOINT` environment variable
- **Span hierarchy**: Proper SpanKind usage - SERVER (1) for HTTP requests, CLIENT (2) for outbound calls
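
To illustrate what the W3C Trace Context extraction step reads from an incoming request, here is a minimal sketch that parses a `traceparent` header by hand. The project itself relies on `propagation.extract()`, so `parseTraceparentSketch` and `ParentContextSketch` are hypothetical names used only for illustration, and the sketch accepts only the four-field version `00` layout.

```typescript
// Parent trace information carried by a W3C `traceparent` header.
interface ParentContextSketch {
  traceId: string; // 32 lowercase hex characters
  spanId: string; // 16 lowercase hex characters
  sampled: boolean; // bit 0 of the trace-flags field
}

// Layout: <version>-<trace-id>-<parent-id>-<trace-flags>
const TRACEPARENT_PATTERN = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparentSketch(header: string | undefined): ParentContextSketch | null {
  if (!header) return null;
  const match = TRACEPARENT_PATTERN.exec(header.trim());
  if (!match) return null;
  const [, version, traceId, spanId, flags] = match;
  // Simplification: only version 00 is accepted here.
  if (version !== '00') return null;
  // All-zero identifiers are invalid per the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 0x01 };
}
```

When the header is missing or malformed the function returns `null`, which corresponds to the SERVER span starting a new trace instead of joining an upstream one.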

**Test Results**:
- ✅ SERVER spans appearing in Jaeger with correct span kind (1) and operation name format: `{METHOD} {path}`
- ✅ CLIENT spans (Qdrant, OpenAI, Anthropic) automatically traced by auto-instrumentation
- ✅ **Context propagation working**: Single trace with 20 spans, depth 2 (CLIENT spans are children of SERVER span)
- ✅ Visual waterfall in Jaeger showing timing relationships and request flow
- ✅ Error tracking working (401 from Anthropic API correctly captured with error attributes)
- ✅ Request duration tracking (554-609ms for version tool with all dependency calls)

**Files Created**:
- `src/core/tracing/http-tracing.ts` - HTTP SERVER span creation module with W3C Trace Context extraction

**Files Modified**:
- `src/core/tracing/tracer.ts` - Added OTLP exporter implementation with `OTLPTraceExporter`
- `src/core/tracing/index.ts` - Exported `createHttpServerSpan` and `withHttpServerTracing` functions
- `src/interfaces/mcp.ts` - Integrated HTTP span creation at request entry point, wrapped handler in active context

**Known Limitations**:
- STDIO transport not instrumented (only HTTP has SERVER spans)
- No INTERNAL spans for MCP tool execution yet (Layer 2 pending)
- Only OTLP exporter implemented (Jaeger/Zipkin native exporters pending)
- Documentation not written yet

**Next Session Priorities**:
- Implement Layer 2: INTERNAL spans for MCP tool execution (5 tools: recommend, version, manageOrgData, remediate, projectSetup)
- Add STDIO transport SERVER spans
- Begin Phase 2: AI provider and Kubernetes deep instrumentation
- Write `docs/observability-guide.md` comprehensive user guide

### 2025-10-29: STDIO Tool Tracing & Version Tool Enhancement
**Duration**: ~3 hours
**Primary Focus**: Implement tool execution tracing for STDIO transport (MCP)

**Completed PRD Items**:
- [x] Tool execution span wrapper for all 5 MCP tools - Evidence: `src/core/tracing/tool-tracing.ts` created with `withToolTracing()` function
- [x] STDIO transport tracing - Evidence: Tool spans working for MCP calls through Claude Code
- [x] OpenTelemetry status in version tool - Evidence: `src/tools/version.ts` with `getTracingStatus()` function
- [x] Tool description optimization - Evidence: Reduced from 30+ words to 6 words to save LLM context tokens

**Key Implementation Details**:
- **Generic tool tracing wrapper**: Created `withToolTracing<T>(toolName, args, handler)` that wraps any tool handler with INTERNAL spans
- **GenAI semantic conventions**: Uses `gen_ai.tool.name`, `gen_ai.tool.input`, `gen_ai.tool.duration_ms`, `gen_ai.tool.success` attributes
- **Universal integration**: Modified `registerTool()` in `src/interfaces/mcp.ts` to automatically wrap all tool handlers - zero changes needed to individual tools
- **Context propagation**: Wraps handler execution in `context.with(trace.setSpan(context.active(), span), async () => {...})` so child operations inherit tool span as parent
- **Version tool enhancement**: Added tracing status reporting (enabled, exporterType, endpoint, serviceName, initialized)
- **Token optimization**: Simplified VERSION_TOOL_DESCRIPTION from detailed feature list to "Get comprehensive system health and diagnostics"
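
The generic wrapper idea can be sketched without any OpenTelemetry dependency. In the example below an in-memory array stands in for the tracer, and the handler signature is an assumption; the actual `withToolTracing()` in `src/core/tracing/tool-tracing.ts` uses the OpenTelemetry API and may differ in detail. The span name and attribute keys are the ones listed in this entry.

```typescript
// Minimal stand-in for a span; the real wrapper creates OpenTelemetry INTERNAL spans.
interface ToolSpanSketch {
  name: string;
  attributes: Record<string, string | number | boolean>;
  status: 'OK' | 'ERROR';
}

// In-memory sink used only for this illustration.
const recordedToolSpans: ToolSpanSketch[] = [];

async function withToolTracingSketch<T>(
  toolName: string,
  args: unknown,
  handler: () => Promise<T>, // Assumption: the real handler signature may differ.
): Promise<T> {
  const startedAt = Date.now();
  const span: ToolSpanSketch = {
    name: `execute_tool ${toolName}`,
    attributes: {
      'gen_ai.tool.name': toolName,
      'gen_ai.tool.input': JSON.stringify(args ?? {}),
    },
    status: 'OK',
  };
  try {
    const result = await handler();
    span.attributes['gen_ai.tool.success'] = true;
    return result;
  } catch (error) {
    span.status = 'ERROR';
    span.attributes['gen_ai.tool.success'] = false;
    throw error; // Re-throw so callers still see the original failure.
  } finally {
    // Runs on success and on failure, so the span is always completed.
    span.attributes['gen_ai.tool.duration_ms'] = Date.now() - startedAt;
    recordedToolSpans.push(span);
  }
}
```

Because the wrapping happens in one place, a registration function can apply it to every handler, which is why individual tools need no changes.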

**Span Hierarchy**:
- **STDIO (MCP) transport**: `execute_tool version` (INTERNAL) → child spans (CLIENT)
- **HTTP (REST) transport**: `POST /api/v1/tools/version` (SERVER) → `execute_tool version` (INTERNAL) → child spans (CLIENT)

**Test Results**:
- ✅ STDIO tracing working: MCP calls through Claude Code appearing in Jaeger
- ✅ Tool spans showing correct attributes: tool name, input JSON, duration, success status
- ✅ Proper span hierarchy: 1 trace with 20 spans, depth 2
- ✅ Version tool reports tracing status: enabled=true, exporterType=otlp, endpoint, serviceName=dot-ai-mcp
- ✅ Fixed environment variable issue: the OTLP endpoint must be set via `OTEL_EXPORTER_OTLP_ENDPOINT`, not `OTEL_OTLP_ENDPOINT`

**Files Created**:
- `src/core/tracing/tool-tracing.ts` - Generic tool tracing wrapper with GenAI semantic conventions

**Files Modified**:
- `src/core/tracing/index.ts` - Exported `withToolTracing` function
- `src/interfaces/mcp.ts` - Integrated tool tracing wrapper in `registerTool()` method
- `src/tools/version.ts` - Added `getTracingStatus()` function, simplified tool description

**Architecture Decision**:
- **Integration tests at the end**: Decided to write comprehensive integration tests after all implementation phases complete (Phase 4), not incrementally per phase

**Known Discoveries**:
- Phase 2 is needed for better span names: child spans currently show only generic "GET"/"POST" names, so AI provider and K8s instrumentation is required to produce descriptive names like "chat anthropic claude-3-5-sonnet" and "k8s.getCoreV1Api"

**Next Session Priorities**:
- Begin Phase 2: Deep Instrumentation (AI provider call spans, Kubernetes operation spans)
- Make child spans more distinguishable in Jaeger with descriptive operation names
- Consider Phase 4 comprehensive integration testing strategy

### 2025-10-29: Phase 2 AI Provider Chat Tracing Implementation
**Duration**: ~4 hours
**Primary Focus**: Implement AI provider tracing with official GenAI semantic conventions

**Completed PRD Items**:
- [x] **AI Provider Chat Tracing** - Complete generic wrapper instrumentation
  - Created `src/core/tracing/ai-tracing.ts` with unified `withAITracing()` wrapper
  - Integrated with `AnthropicProvider.sendMessage()`
  - Integrated with `VercelProvider.sendMessage()`
  - Implemented official OpenTelemetry GenAI semantic conventions
  - Token usage tracking (input, output, cache metrics)
  - Verified with Jaeger showing `chat claude-sonnet-4-5-20250929` spans

**Key Implementation Details**:
- **Unified tracing wrapper**: Single `withAITracing()` function supports 3 operation types (`chat`, `tool_loop`, `embeddings`)
- **GenAI semantic conventions**: Using official `gen_ai.*` attributes per OpenTelemetry spec
- **Dynamic provider support**: Works with all Vercel AI SDK providers (OpenAI, Google, Anthropic, xAI, Mistral, DeepSeek, OpenRouter, custom)
- **Context propagation**: AI CLIENT spans properly nested under tool INTERNAL spans
- **HTTP auto-instrumentation**: Nested POST spans remain visible and are kept for now as a discovery tool
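
As a rough illustration of how a span name such as `chat claude-sonnet-4-5-20250929` and its `gen_ai.*` attributes can be derived from a single call description, consider the sketch below. The helper names and the input shape are hypothetical. The attribute keys are taken from the OpenTelemetry GenAI semantic conventions, but the exact set recorded by `withAITracing()` is an assumption.

```typescript
// The three operation types the unified wrapper distinguishes.
type GenAiOperationSketch = 'chat' | 'tool_loop' | 'embeddings';

// Hypothetical description of one provider call.
interface GenAiCallSketch {
  operation: GenAiOperationSketch;
  provider: string;
  model: string;
  inputTokens?: number;
  outputTokens?: number;
}

// Span name format: "<operation> <model>".
function genAiSpanNameSketch(call: GenAiCallSketch): string {
  return `${call.operation} ${call.model}`;
}

function genAiAttributesSketch(call: GenAiCallSketch): Record<string, string | number> {
  const attributes: Record<string, string | number> = {
    'gen_ai.operation.name': call.operation,
    'gen_ai.provider.name': call.provider,
    'gen_ai.request.model': call.model,
  };
  // Token usage is only known after the provider responds, so it is optional.
  if (call.inputTokens !== undefined) {
    attributes['gen_ai.usage.input_tokens'] = call.inputTokens;
  }
  if (call.outputTokens !== undefined) {
    attributes['gen_ai.usage.output_tokens'] = call.outputTokens;
  }
  return attributes;
}
```

Keeping the naming logic in one helper is what lets the same wrapper serve every provider exposed through the Vercel AI SDK.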

**Bonus Refactoring Work**:
- **Unified embedding providers**: Consolidated `OpenAIEmbeddingProvider`, `GoogleEmbeddingProvider`, `MistralEmbeddingProvider` into single `VercelEmbeddingProvider`
  - Reduced code by 159 lines (32% reduction, 494→335 lines)
  - All three providers now use Vercel AI SDK `embed()` function
  - Dynamic provider selection with `getProviderType()` method
  - Maintains backward compatibility through factory pattern

**Test Results**:
- ✅ Jaeger traces showing proper span hierarchy: SERVER → INTERNAL → CLIENT
- ✅ GenAI attributes complete: operation, provider, model, tokens, duration
- ✅ Build successful with zero TypeScript errors
- ✅ Context propagation working correctly

**Files Created**:
- `src/core/tracing/ai-tracing.ts` - Unified AI tracing wrapper (~150 lines)

**Files Modified**:
- `src/core/providers/anthropic-provider.ts` - Wrapped `sendMessage()` with AI tracing
- `src/core/providers/vercel-provider.ts` - Wrapped `sendMessage()` with AI tracing
- `src/core/embedding-service.ts` - Unified three provider classes into one (major refactor)
- `src/core/tracing/index.ts` - Exported `withAITracing` function
- `src/core/index.ts` - Updated to export `VercelEmbeddingProvider`

**Architecture Decisions**:
- **Keep HTTP auto-instrumentation for now**: Nested POST spans provide discovery value showing what still needs manual instrumentation
- **Embedding provider unification**: Cleaner architecture with single class handling all providers via Vercel AI SDK
- **Generic wrapper pattern proven**: Same pattern will be used for `toolLoop()` and embeddings tracing

**Next Session Priorities**:
- Add tracing to `toolLoop()` operations (AnthropicProvider and VercelProvider)
- Add tracing to embeddings operations (`VercelEmbeddingProvider`)
- Begin Kubernetes client instrumentation
- Consider disabling HTTP auto-instrumentation once all operations are manually traced

### 2025-10-29: ToolLoop Iteration Tracing & Tracing Simplification
**Duration**: ~2 hours
**Primary Focus**: Add per-iteration visibility to toolLoop operations and simplify tracing code

**Completed PRD Items**:
- [x] **ToolLoop Iteration Spans** - Added `tool_loop_iteration` spans to `AnthropicProvider.toolLoop()`
  - Evidence: Each iteration wrapped in INTERNAL span, tested with remediate tool (4 iterations traced)
  - Proper span lifecycle with error handling and early return support
  - Jaeger shows clear iteration progression within `tool_loop` span
- [x] **Tracing Code Simplification** - Removed `isEnabled()` checks from all tracing wrappers
  - Evidence: Updated `ai-tracing.ts`, `http-tracing.ts` to trust OpenTelemetry no-op tracer
  - Simpler code with negligible overhead when tracing is disabled
  - Kept `isEnabled()` only for status reporting and initialization

**Files Modified**:
- `src/core/providers/anthropic-provider.ts` - Added iteration span wrapping with proper early-return handling
- `src/core/tracing/ai-tracing.ts` - Removed `isEnabled()` check, trust no-op tracer
- `src/core/tracing/http-tracing.ts` - Removed `isEnabled()` check, removed unused `getTracer` import

**Architecture Decisions**:
- **Iteration spans only for Anthropic provider**: VercelProvider uses SDK's internal loop, no per-iteration hooks available
- **Trust OpenTelemetry no-op tracer**: Eliminates redundant checks, cleaner code
- **Clear span naming**: Use `tool_loop_iteration` to show clear parent-child relationship with `tool_loop` span
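
The "proper span lifecycle with early return support" point comes down to ending each iteration span in a `finally` block. A minimal sketch of that control flow follows, with a plain object standing in for a span. The function names and the `step` callback are hypothetical and do not mirror `AnthropicProvider.toolLoop()`.

```typescript
interface LoopSpanSketch {
  name: string;
  ended: boolean;
}

const loopSpans: LoopSpanSketch[] = [];

function startLoopSpanSketch(name: string): LoopSpanSketch {
  const span: LoopSpanSketch = { name, ended: false };
  loopSpans.push(span);
  return span;
}

// `step` returns true once the model has produced a final answer.
// The function returns the number of iterations that were executed.
function toolLoopSketch(maxIterations: number, step: (iteration: number) => boolean): number {
  const loopSpan = startLoopSpanSketch('tool_loop');
  try {
    for (let i = 1; i <= maxIterations; i++) {
      const iterationSpan = startLoopSpanSketch('tool_loop_iteration');
      try {
        if (step(i)) {
          return i; // Early return: both finally blocks still run.
        }
      } finally {
        iterationSpan.ended = true;
      }
    }
    return maxIterations;
  } finally {
    loopSpan.ended = true;
  }
}
```

Without the inner `finally`, an early `return` or a thrown error would leave the iteration span open, and it would never be exported.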

**Test Results**:
- ✅ Tested with remediate tool - 4 iterations traced successfully
- ✅ Proper span hierarchy: `execute_tool remediate` → `tool_loop claude-sonnet-4-5-20250929` → `tool_loop_iteration` (×4)
- ✅ Token metrics captured correctly across all iterations
- ✅ Build passes with zero errors

**Next Session Priorities**:
- Add embeddings tracing (`VercelEmbeddingProvider`)
- Begin Kubernetes client instrumentation
- Plan HTTP auto-instrumentation removal (Phase 3)

### 2025-10-30: Phase 2 Completion - Embeddings & Kubernetes Tracing + Capability Scan Refactoring
**Duration**: ~6 hours
**Primary Focus**: Complete Phase 2 AI provider and Kubernetes instrumentation, remove manual mode from capability scanning

**Completed PRD Items**:
- [x] **AI Provider Embeddings Tracing** - Complete generic wrapper instrumentation
  - Wrapped `VercelEmbeddingProvider.generateEmbedding()` and `generateEmbeddings()` with `withAITracing(operation: 'embeddings')`
  - Official GenAI semantic conventions: `gen_ai.operation.name: 'embeddings'`, `gen_ai.provider.name`, `gen_ai.request.model`
  - Embedding metrics tracking: `gen_ai.embeddings.count`, `gen_ai.embeddings.dimensions`
  - Tested with capability scan - spans showing `embeddings text-embedding-3-small` with proper context propagation
- [x] **Kubernetes Client Library Tracing** - Complete generic proxy wrapper instrumentation
  - Created `src/core/tracing/k8s-tracing.ts` (~150 lines) with `createTracedK8sClient()` proxy wrapper
  - JavaScript Proxy pattern for transparent method interception - zero code changes to existing operations
  - Integrated in `src/core/discovery.ts` (CoreV1Api, VersionApi) and `src/tools/version.ts` (AppsV1Api, AdmissionregistrationV1Api)
  - Creates CLIENT spans with `k8s.api`, `k8s.method` attributes
- [x] **Kubectl CLI Tracing** - Complete wrapper instrumentation for CLI commands
  - Created `withKubectlTracing()` wrapper function in `src/core/tracing/k8s-tracing.ts`
  - Integrated in `src/core/kubernetes-utils.ts` - wrapped `executeKubectl()` function
  - Creates CLIENT spans with `k8s.client: 'kubectl'`, `k8s.operation`, `k8s.resource` attributes
  - Tested with capability scanning kubectl commands

**Additional Work Done (Out of PRD Scope)**:
- **Capability Scan Workflow Simplification** - Removed manual processing mode for cleaner UX
  - Removed `processingMode` field from session interface (removed `'processing-mode'` step)
  - Deleted `handleProcessingMode` function entirely (~50 lines removed)
  - Updated workflow routing in `src/tools/organizational-data.ts` to skip processing-mode step
  - Modified `handleResourceSelection` and `handleResourceSpecification` to transition directly to scanning
  - Fixed duplicate CRD fetching bug discovered during testing - moved CRD fetch into manual mode block only
  - Updated integration tests in `tests/integration/tools/manage-org-data-capabilities.test.ts`
    - Removed processing-mode expectations from all test workflows
    - Updated resource lists to use actual cluster resources (Deployment.apps, Service, Pod, ConfigMap)
    - All 11 integration tests passing (406.59s duration)
  - Verified end-to-end with manual MCP testing - workflow now: resource-selection → [resource-specification] → scanning → complete
  - Database verification: 64 capabilities stored in qdrant-test container (capabilities-policies collection)

**Key Implementation Details**:
- **Dual K8s tracing strategy**: Client library tracing (Proxy wrapper) + kubectl CLI tracing (function wrapper)
- **Generic instrumentation pattern**: Both K8s wrappers follow same pattern as AI tracing - instrument at boundaries
- **Negligible overhead when disabled**: Relies on the OpenTelemetry no-op tracer, with no manual `isEnabled()` checks
- **Context propagation verified**: K8s CLIENT spans properly nested under tool INTERNAL spans
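
The JavaScript Proxy pattern used for the client library can be sketched as follows. This is not the code in `src/core/tracing/k8s-tracing.ts`: the recorder array replaces real CLIENT spans, and the function name, the record shape, and the `<api>.<method>` naming are assumptions. Only the `k8s.api` and `k8s.method` attribute keys come from this entry.

```typescript
// Record of one intercepted call; stands in for a CLIENT span in this sketch.
interface TracedCallSketch {
  name: string;
  attributes: Record<string, string>;
  ok: boolean;
}

function createTracedClientSketch<T extends object>(
  client: T,
  apiName: string,
  calls: TracedCallSketch[],
): T {
  return new Proxy(client, {
    get(target, prop, receiver) {
      const value = Reflect.get(target, prop, receiver);
      // Pass through non-function properties and symbol keys untouched.
      if (typeof value !== 'function' || typeof prop !== 'string') {
        return value;
      }
      const method: string = prop;
      return (...args: unknown[]) => {
        const finish = (ok: boolean): void => {
          calls.push({
            name: `${apiName}.${method}`,
            attributes: { 'k8s.api': apiName, 'k8s.method': method },
            ok,
          });
        };
        try {
          const result = value.apply(target, args);
          if (result instanceof Promise) {
            // Async method: record the call once the promise settles.
            return result.then(
              (resolved: unknown) => { finish(true); return resolved; },
              (error: unknown) => { finish(false); throw error; },
            );
          }
          finish(true);
          return result;
        } catch (error) {
          finish(false);
          throw error;
        }
      };
    },
  });
}
```

Callers keep using the wrapped object exactly like the original client, which is what makes the instrumentation transparent to business logic.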

**Files Created**:
- `src/core/tracing/k8s-tracing.ts` - Dual K8s tracing module (~150 lines)

**Files Modified**:
- `src/core/embedding-service.ts` - Added embeddings tracing wrappers
- `src/core/discovery.ts` - Wrapped K8s API clients with traced proxies
- `src/core/kubernetes-utils.ts` - Wrapped kubectl execution with tracing
- `src/tools/version.ts` - Wrapped K8s API clients with traced proxies
- `src/core/tracing/index.ts` - Exported K8s tracing functions
- `src/core/capability-scan-workflow.ts` - Removed manual mode logic, fixed duplicate CRD bug
- `src/tools/organizational-data.ts` - Updated routing to skip processing-mode
- `tests/integration/tools/manage-org-data-capabilities.test.ts` - Updated all tests for simplified workflow

**Phase 2 Progress**:
- **Before**: 33% complete (3/9 items - AI providers only)
- **After**: 67% complete (6/9 items - AI providers + K8s client/kubectl complete)
- **Remaining**: Deployment operations instrumentation, session lifecycle, multi-step workflow propagation

**Architecture Decisions**:
- **K8s Proxy Pattern**: JavaScript Proxy wrapper provides transparent instrumentation without modifying business logic
- **Kubectl wrapper approach**: Function wrapper intercepts CLI execution, parses args for operation/resource metadata
- **Simplified capability scan**: Removed manual mode based on user feedback - auto mode covers all use cases
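
The kubectl wrapper approach derives span metadata from the CLI arguments. A simplified sketch of such parsing follows. The function name, the `kubectl <operation>` span name, and the parsing rules are assumptions, and only the attribute keys (`k8s.client`, `k8s.operation`, `k8s.resource`) come from this entry.

```typescript
interface KubectlSpanInfoSketch {
  name: string;
  attributes: Record<string, string>;
}

// Derives span metadata from kubectl arguments such as ["get", "crd"].
// Limitation of this sketch: a flag value passed as a separate token before
// the operation (for example "-n default get pods") would be misread.
function describeKubectlCallSketch(args: string[]): KubectlSpanInfoSketch {
  // The first token that is not a flag is treated as the operation.
  const operation = args.find((arg) => !arg.startsWith('-')) ?? 'unknown';
  const operationIndex = args.indexOf(operation);
  const next: string | undefined =
    operationIndex >= 0 ? args[operationIndex + 1] : undefined;

  const attributes: Record<string, string> = {
    'k8s.client': 'kubectl',
    'k8s.operation': operation,
  };
  // The token right after the operation is the resource, unless it is a flag.
  if (next !== undefined && !next.startsWith('-')) {
    attributes['k8s.resource'] = next;
  }
  return { name: `kubectl ${operation}`, attributes };
}
```

With this shape, `kubectl get crd` yields a resource attribute while `kubectl apply -f manifest.yaml` yields only the operation.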

**Test Results**:
- ✅ Embeddings tracing working - spans visible in Jaeger during capability scan
- ✅ K8s client tracing working - CoreV1Api methods traced (listNamespace, etc.)
- ✅ Kubectl tracing working - CLI commands traced (kubectl get crd, etc.)
- ✅ Context propagation verified - K8s spans nested under tool spans
- ✅ All 11 capability scan integration tests passing
- ✅ Database storage verified - 64 capabilities in test Qdrant
- ✅ Build successful with zero TypeScript errors

**Bugs Fixed**:
- **Duplicate CRD fetching**: Discovered during Jaeger trace analysis. Manual mode was prefetching CRDs that auto mode didn't use, causing two `kubectl get crd` calls. Fixed by moving the CRD fetch inside the manual mode block. (Note: Manual mode was subsequently removed entirely)

**Known Discoveries**:
- Manual mode incomplete: Didn't store capabilities to database, only showed preview
- User decision: Remove manual mode entirely - auto mode with resource selection covers all use cases
- Duplicate operations visible in traces helped identify inefficient code paths

**Next Session Priorities**:
- Complete remaining Phase 2 items: deployment operations, session lifecycle, multi-step workflows
- Begin Phase 3: Advanced features (sampling strategies, native exporters, metrics)
- Begin Phase 4: Documentation (`docs/observability-guide.md`, `docs/development-guide.md`)
- Consider Phase 4 comprehensive integration testing for tracing features

### 2025-10-30: Qdrant Vector Database Tracing + Capability Scanning Bug Fix
**Duration**: ~4 hours
**Primary Focus**: Complete Phase 2 vector database instrumentation and fix resource schema fetching

**Completed PRD Items**:
- [x] **Qdrant Vector Database Tracing** - Full instrumentation with database semantic conventions
  - Created `src/core/tracing/qdrant-tracing.ts` with generic `withQdrantTracing()` wrapper
  - Integrated with all 10 VectorDBService operations (upsert, search, searchByKeywords, getDocument, deleteDocument, deleteAll, getAllDocuments, getCollectionInfo, healthCheck, initializeCollection)
  - Official database semantic conventions: `db.system: 'qdrant'`, `db.operation.name`, `db.collection.name`, `db.vector.*`
  - Result metadata tracking: `db.query.result_count`, `db.vector.top_score` for search operations
  - Tested comprehensively with capability scan workflow

**Additional Work Done**:
- **Fixed capability scanning bug** for resources with API groups (Deployment.apps, StatefulSet.apps)
  - Updated `src/core/capability-scan-workflow.ts` resource schema fetching logic
  - Changed from "extract Kind first, then fallback to CRD" to "try full name first, then fallback to Kind"
  - Pattern: Try `kubectl explain <full-name>` → if fails, try `kubectl explain <Kind>`
  - This works for both CRDs (clusters.postgresql.cnpg.io) and built-in resources with groups (Deployment.apps)
  - Verified working with Deployment.apps, Service, ConfigMap, and apps.devopstoolkit.live (CRD)
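
The try-with-fallback lookup can be sketched as below. The `ExplainRunnerSketch` type is a hypothetical stand-in for whatever executes `kubectl explain`; it is injected so the logic can be exercised without a cluster. Deriving the Kind by cutting at the first dot is an assumption about how names like `Deployment.apps` are split.

```typescript
// Hypothetical runner: returns `kubectl explain` output, or throws if the lookup fails.
type ExplainRunnerSketch = (resource: string) => string;

function explainWithFallbackSketch(
  fullName: string,
  runExplain: ExplainRunnerSketch,
): string {
  try {
    // First attempt: the full name, e.g. "clusters.postgresql.cnpg.io".
    return runExplain(fullName);
  } catch (error) {
    // Assumption: the Kind is everything before the first dot.
    const [kind = fullName] = fullName.split('.');
    if (kind === fullName) {
      throw error; // No group suffix, so there is nothing to fall back to.
    }
    // Second attempt: the bare Kind, e.g. "Deployment".
    return runExplain(kind);
  }
}
```

The appeal of this design is that no list of built-in API groups has to be maintained; the cluster's own response decides which form is valid.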

**Key Implementation Details**:
- **Generic wrapper pattern**: Single `withQdrantTracing()` function handles all 13 operation types
- **Automatic result metadata**: Captures result counts, top scores for search operations
- **Negligible overhead when disabled**: Relies on the OpenTelemetry no-op tracer (no manual `isEnabled()` checks)
- **Elegant bug fix**: Try-with-fallback pattern avoids hardcoded API group lists
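
The automatic result metadata for search operations can be pictured with the sketch below. The function name and the `ScoredPointSketch` shape are hypothetical. The attribute keys are the ones listed in this entry, and using `'search'` as the value of `db.operation.name` is an assumption.

```typescript
// Hypothetical shape of one search hit returned by the vector database.
interface ScoredPointSketch {
  id: string;
  score: number;
}

function searchSpanAttributesSketch(
  collection: string,
  results: ScoredPointSketch[],
): Record<string, string | number> {
  const attributes: Record<string, string | number> = {
    'db.system': 'qdrant',
    'db.operation.name': 'search',
    'db.collection.name': collection,
    'db.query.result_count': results.length,
  };
  // A top score only exists when the search returned at least one hit.
  if (results.length > 0) {
    attributes['db.vector.top_score'] = Math.max(...results.map((r) => r.score));
  }
  return attributes;
}
```

Computing these values inside the wrapper means individual `VectorDBService` methods do not need to know anything about tracing.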

**Files Created**:
- `src/core/tracing/qdrant-tracing.ts` - Qdrant tracing module (~150 lines)

**Files Modified**:
- `src/core/vector-db-service.ts` - Wrapped all 10 operations with Qdrant tracing
- `src/core/tracing/index.ts` - Exported `withQdrantTracing`
- `src/core/capability-scan-workflow.ts` - Fixed resource schema fetching (try full name first, fallback to Kind)

**Test Results**:
- ✅ All Qdrant operations traced: delete_all, upsert, search, list
- ✅ Built-in resources scan successfully: Deployment.apps, Service, ConfigMap
- ✅ CRD scanning works: apps.devopstoolkit.live
- ✅ Database semantic conventions correctly applied (`db.system`, `db.operation.name`, `db.collection.name`, `db.vector.*`)
- ✅ Context propagation verified - Qdrant spans nested under tool spans
- ✅ Build successful with zero TypeScript errors

**Phase 2 Progress**:
- **Before**: 67% complete (6/9 items - AI providers + K8s only)
- **After**: 78% complete (7/9 items - AI providers + K8s + Qdrant complete)
- **Remaining**: Deployment operations instrumentation, session lifecycle, multi-step workflow propagation, custom span attributes

**Architecture Decisions**:
- **Try-with-fallback pattern**: More elegant than hardcoding built-in API groups
  - Works for all CRDs (full name succeeds)
  - Works for all built-in resources with groups (full name fails, Kind succeeds)
  - Works for all core resources (full name succeeds)
- **Generic Qdrant wrapper**: Same pattern as AI and K8s tracing - instrument at boundaries, not scattered throughout code

**Next Session Priorities**:
- Begin Phase 3 or Phase 4: Documentation is critical for production readiness
- Complete remaining Phase 2 items: deployment operations, session lifecycle, multi-step workflows
- Integration tests for tracing features

### 2025-10-30: Observability Documentation - User Guide Creation
**Duration**: ~2 hours
**Primary Focus**: Create comprehensive observability documentation for users

**Completed PRD Items**:
- [x] Created `docs/observability-guide.md` (176 lines) - Complete user guide with:
  - Environment variables table with all tracing configuration options
  - Quick Start section with verification steps
  - "What Gets Traced" covering all instrumentation categories (MCP tools, AI providers, Kubernetes, vector DB)
  - Backend Integration with Jaeger Docker setup and OTLP configuration examples
  - "Viewing Traces" section explaining Jaeger UI navigation and trace hierarchy
- [x] Updated `docs/mcp-setup.md` - Added tracing references in Configuration Components table and Next Steps section
- [x] Removed auto-instrumentation - Completed transition to manual-only tracing (commit 54e30b9)

**Documentation Approach**:
- **Project-specific focus**: No generic OpenTelemetry explanations, only links to official docs for concepts
- **Validated examples**: All commands and configurations tested before documentation (Jaeger Docker, OTLP endpoint)
- **User-centric outputs**: Natural language agent responses instead of raw JSON for verification steps
- **Comprehensive coverage**: Environment variables, backend integration, trace viewing, span hierarchy

**Phase 1 Status**:
- **Implementation**: 100% complete ✅
- **Documentation**: 75% complete (3/4 items - missing deployment-guide.md update)

**Next Session Priorities**:
- Add tracing configuration section to `docs/deployment-guide.md` to complete Phase 1
- Consider creating `docs/development-guide.md` for developer instrumentation patterns (Phase 2 doc requirement)
- Complete remaining Phase 2 implementation items (deployment operations, session lifecycle, multi-step workflows)

### 2025-10-30: Helm Chart Enhancement & Deployment Documentation
**Duration**: ~2 hours
**Primary Focus**: Enable tracing configuration for Kubernetes/Docker deployments

**Completed PRD Items**:
- [x] Added `extraEnv` support to Helm chart:
  - `charts/values.yaml`: Added extraEnv field with commented tracing examples (OTEL_TRACING_ENABLED, OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_SERVICE_NAME)
  - `charts/templates/deployment.yaml`: Render extraEnv entries via toYaml template with proper indentation
  - `charts/templates/mcpserver.yaml`: Render extraEnv entries for ToolHive deployment method
  - Added customLlm/customEmbeddings secret key definitions to fix template evaluation error
  - Tested with `helm template` - extraEnv renders correctly in both standard and ToolHive modes
- [x] Updated deployment documentation:
  - `docs/setup/docker-setup.md`: Added optional tracing environment variables section with link to observability guide
  - `docs/setup/kubernetes-setup.md`: Added tracing note in installation notes with link to observability guide
  - `docs/setup/kubernetes-toolhive-setup.md`: Added tracing note in installation notes with link to observability guide

**Implementation Approach**:
- **Generic extraEnv pattern**: Follows Helm best practices by allowing any environment variables, not just tracing-specific fields
- **Minimal documentation**: Simple note with link to observability guide rather than duplicating configuration details
- **Consistent across deployment methods**: Docker, standard Kubernetes, and ToolHive all support tracing configuration
- **User-friendly examples**: Commented examples in values.yaml show exact syntax for tracing configuration

**Phase 1 Status**:
- **Complete**: 100% (all implementation + all documentation) ✅
- **PRD divergence**: Referenced "deployment-guide.md" doesn't exist - actual files are `docs/setup/docker-setup.md`, `docs/setup/kubernetes-setup.md`, `docs/setup/kubernetes-toolhive-setup.md`

**Next Session Priorities**:
- Complete Phase 2 remaining items: deployment operations instrumentation, session lifecycle tracing, multi-step workflow trace propagation
- Consider Phase 3 advanced features (sampling strategies, metrics, native exporters)
- Consider Phase 4 testing and integration tests for tracing functionality

### 2025-10-30: PRD Completion - Phase 2 Complete, Phase 3/4 Deferred
**Duration**: ~15 minutes
**Primary Focus**: Finalize PRD status and document completion

**Completed Work**:
- Verified deployment operations already instrumented via the `executeKubectl` wrapper
- Confirmed Phase 2 is 100% complete (all instrumentation working)
- Updated PRD status from "Draft" to "Complete"
- Marked Phase 3 and Phase 4 as "Deferred" (not required for production use)
- Created PRD #197 for multi-step workflow distributed tracing (separate feature)

**Key Decisions**:
- **Deployment tracing complete**: All kubectl commands (including apply/wait) already traced via the `withKubectlTracing` wrapper
- **Phase 2 at 100%**: All AI providers, K8s operations, vector DB, and deployments fully instrumented
- **Phase 3/4 deferred**: OTLP exporter and current feature set sufficient for production, no need for additional exporters, sampling strategies, metrics, or formal testing
- **Session tracing separate**: Multi-step workflow trace propagation moved to PRD #197 as independent feature

**Final Status**:
- **Phase 1**: ✅ 100% Complete - Core tracing foundation
- **Phase 2**: ✅ 100% Complete - Deep instrumentation (AI, K8s, Qdrant, deployments)
- **Phase 3**: 🔄 Deferred - Advanced features not required
- **Phase 4**: 🔄 Deferred - Testing and additional documentation not required

**Feature Capabilities Delivered**:
- ✅ HTTP/MCP request tracing with proper span hierarchy
- ✅ All 5 MCP tools traced (recommend, version, manageOrgData, remediate, projectSetup)
- ✅ AI provider operations traced (chat, toolLoop, embeddings) with official GenAI conventions
- ✅ Kubernetes client library and kubectl CLI fully traced
- ✅ Qdrant vector database operations traced
- ✅ Deployment operations traced (kubectl apply, wait)
- ✅ OTLP exporter for production observability backends
- ✅ Environment-based configuration (OTEL_TRACING_ENABLED, OTEL_EXPORTER_OTLP_ENDPOINT)
- ✅ Comprehensive user documentation (docs/observability-guide.md)
- ✅ Production deployment configuration (Helm chart, Docker, Kubernetes)

**PRD Complete**: OpenTelemetry tracing is production-ready and fully operational.

---

## Appendix

### Supporting Materials
- [OpenTelemetry Documentation](https://opentelemetry.io/docs/) - Official OTel documentation
- [OpenTelemetry JavaScript SDK](https://opentelemetry.io/docs/languages/js/) - Node.js implementation guide
- [MCP OpenTelemetry Discussion #269](https://github.com/modelcontextprotocol/modelcontextprotocol/discussions/269) - Community discussion on tracing
- [Existing Error Handling System](./src/core/error-handling.ts) - Current logging and error infrastructure

### Research Findings
- OpenTelemetry is becoming the de facto standard for AI agent observability (2025 trend)
- Standard server-side implementation preferred over MCP protocol forwarding
- Minimal overhead (<2ms per request) is expected to be achievable with asynchronous export; this target has not been benchmarked, since performance testing was deferred
- Strong ecosystem support with multiple backend options (Jaeger, Grafana Tempo, vendors)
- Natural integration with existing Logger interface patterns

### Example Trace Output with Official GenAI Conventions

#### Example 1: Complete Request Trace Hierarchy

```json
// Root span: HTTP SERVER (durations are in milliseconds)
{
  "traceId": "a1b2c3d4e5f60718293a4b5c6d7e8f90",
  "spanId": "1111111111111111",
  "parentSpanId": null,
  "name": "POST /mcp/tools",
  "kind": "SERVER",
  "timestamp": "2025-10-29T10:15:30.123Z",
  "duration": 10500,
  "attributes": {
    "http.request.method": "POST",
    "http.route": "/mcp/tools",
    "url.path": "/mcp/tools",
    "url.scheme": "https",
    "server.address": "mcp.example.com",
    "server.port": 443,
    "http.response.status_code": 200
  },
  "status": { "code": "OK" }
}

// Child span: MCP Tool Execution (INTERNAL)
{
  "traceId": "a1b2c3d4e5f60718293a4b5c6d7e8f90",
  "spanId": "2222222222222222",
  "parentSpanId": "1111111111111111",
  "name": "execute_tool recommend",
  "kind": "INTERNAL",
  "timestamp": "2025-10-29T10:15:30.150Z",
  "duration": 10400,
  "attributes": {
    "gen_ai.operation.name": "execute_tool",
    "gen_ai.tool.name": "recommend",
    "gen_ai.tool.type": "function",
    "gen_ai.tool.description": "Get AI recommendations for Kubernetes deployments",
    "gen_ai.agent.id": "dot-ai-mcp",
    "gen_ai.agent.name": "DevOps AI Assistant",
    "mcp.session.id": "session_abc123",
    "mcp.tool.stage": "recommend"
  },
  "status": { "code": "OK" }
}

// Grandchild span: AI Provider Call (CLIENT)
{
  "traceId": "a1b2c3d4e5f60718293a4b5c6d7e8f90",
  "spanId": "3333333333333333",
  "parentSpanId": "2222222222222222",
  "name": "chat claude-3-5-sonnet",
  "kind": "CLIENT",
  "timestamp": "2025-10-29T10:15:31.200Z",
  "duration": 2000,
  "attributes": {
    "gen_ai.operation.name": "chat",
    "gen_ai.provider.name": "anthropic",
    "gen_ai.request.model": "claude-3-5-sonnet",
    "gen_ai.response.model": "claude-3-5-sonnet-20241022",
    "gen_ai.usage.input_tokens": 1500,
    "gen_ai.usage.output_tokens": 800,
    "gen_ai.request.temperature": 0.7,
    "gen_ai.request.max_tokens": 4096,
    "gen_ai.response.finish_reasons": ["stop"],
    "gen_ai.conversation.id": "session_abc123"
  },
  "status": { "code": "OK" }
}

// Sibling span: Kubernetes API Call (CLIENT)
{
  "traceId": "a1b2c3d4e5f60718293a4b5c6d7e8f90",
  "spanId": "4444444444444444",
  "parentSpanId": "2222222222222222",
  "name": "GET /apis",
  "kind": "CLIENT",
  "timestamp": "2025-10-29T10:15:33.300Z",
  "duration": 150,
  "attributes": {
    "http.request.method": "GET",
    "server.address": "kubernetes.default.svc",
    "server.port": 443,
    "k8s.operation": "list_api_groups",
    "k8s.client": "kubernetes-client",
    "http.response.status_code": 200
  },
  "status": { "code": "OK" }
}
```
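
To make the parent-child relationships in Example 1 concrete, the following sketch rebuilds the hierarchy from a flat list of exported spans by following `parentSpanId`. The `FlatSpanSketch` shape and the `buildSpanTree` and `renderSpanTree` helpers are hypothetical names introduced for illustration; tracing backends such as Jaeger do this reconstruction themselves.

```typescript
// Hypothetical, trimmed-down span shape mirroring the JSON fields in Example 1.
interface FlatSpanSketch {
  spanId: string;
  parentSpanId: string | null;
  name: string;
  kind: "SERVER" | "INTERNAL" | "CLIENT";
}

interface SpanNodeSketch extends FlatSpanSketch {
  children: SpanNodeSketch[];
}

// Link each span to its parent; spans without a known parent become roots.
function buildSpanTree(spans: FlatSpanSketch[]): SpanNodeSketch[] {
  const byId = new Map<string, SpanNodeSketch>();
  const ordered: SpanNodeSketch[] = [];
  for (const span of spans) {
    const node: SpanNodeSketch = { ...span, children: [] };
    byId.set(span.spanId, node);
    ordered.push(node);
  }
  const roots: SpanNodeSketch[] = [];
  for (const node of ordered) {
    const parent = node.parentSpanId !== null ? byId.get(node.parentSpanId) : undefined;
    if (parent !== undefined) {
      parent.children.push(node);
    } else {
      roots.push(node);
    }
  }
  return roots;
}

// Render the tree as indented "KIND name" lines, two spaces per level.
function renderSpanTree(nodes: SpanNodeSketch[], depth: number = 0): string[] {
  const lines: string[] = [];
  for (const node of nodes) {
    lines.push("  ".repeat(depth) + node.kind + " " + node.name);
    for (const line of renderSpanTree(node.children, depth + 1)) {
      lines.push(line);
    }
  }
  return lines;
}
```

Applied to the four spans above, the rendered lines show the SERVER span at the root, the INTERNAL tool span beneath it, and both CLIENT spans as siblings under the tool span.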

#### Example 2: Error Trace with Exception

```json
{
  "traceId": "e4a1b2c3d4e5f60718293a4b5c6d7e8f",
  "spanId": "5555555555555555",
  "parentSpanId": "6666666666666666",
  "name": "chat gpt-4o",
  "kind": "CLIENT",
  "timestamp": "2025-10-29T10:20:15.000Z",
  "duration": 30000,
  "attributes": {
    "gen_ai.operation.name": "chat",
    "gen_ai.provider.name": "openai",
    "gen_ai.request.model": "gpt-4o",
    "error.type": "timeout"
  },
  "events": [
    {
      "name": "exception",
      "timestamp": "2025-10-29T10:20:45.000Z",
      "attributes": {
        "exception.type": "TimeoutError",
        "exception.message": "Request timed out after 30s",
        "exception.stacktrace": "..."
      }
    }
  ],
  "status": {
    "code": "ERROR",
    "message": "Request timed out after 30s"
  }
}
```
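
The error fields in Example 2 could be derived from a thrown value along these lines. This is a simplified sketch: `describeSpanError` and its return shape are hypothetical, and it reuses the error's `name` for both `error.type` and `exception.type`, whereas the example above uses a separate low-cardinality `error.type` value (`timeout`).

```typescript
// Hypothetical shapes for illustration; they mirror the JSON fields in Example 2.
interface ExceptionEventSketch {
  name: "exception";
  attributes: Record<string, string>;
}

interface SpanErrorFieldsSketch {
  attributes: { "error.type": string };
  events: ExceptionEventSketch[];
  status: { code: "ERROR"; message: string };
}

function describeSpanError(thrown: unknown): SpanErrorFieldsSketch {
  // Normalize non-Error values (strings, objects) into an Error.
  const error = thrown instanceof Error ? thrown : new Error(String(thrown));
  const type = error.name || "Error";

  const eventAttributes: Record<string, string> = {
    "exception.type": type,
    "exception.message": error.message,
  };
  // Stack traces are optional and only attached when available.
  if (typeof error.stack === "string") {
    eventAttributes["exception.stacktrace"] = error.stack;
  }

  return {
    attributes: { "error.type": type },
    events: [{ name: "exception", attributes: eventAttributes }],
    status: { code: "ERROR", message: error.message },
  };
}
```

With `@opentelemetry/api`, the equivalent calls on a live span are `span.recordException(error)` followed by `span.setStatus({ code: SpanStatusCode.ERROR, message: error.message })`.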

### Implementation References
- `@opentelemetry/sdk-node` - Core SDK for Node.js
- `@opentelemetry/api` - OpenTelemetry API
- `@opentelemetry/auto-instrumentations-node` - Automatic HTTP/Express instrumentation
- `@opentelemetry/exporter-trace-otlp-http` - OTLP exporter for production
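- `@opentelemetry/exporter-jaeger` - Jaeger exporter (not implemented; native exporters were deferred with Phase 3, and Jaeger was tested through the OTLP exporter)
- `@opentelemetry/exporter-zipkin` - Zipkin exporter (not implemented; deferred with Phase 3)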

### Official OpenTelemetry Semantic Conventions for MCP Operations

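**Note**: This PRD follows the official OpenTelemetry GenAI semantic conventions as defined in the [OpenTelemetry GenAI Specifications](https://opentelemetry.io/docs/specs/semconv/gen-ai/). These conventions are still in development status, so attribute names may change before they stabilize.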

#### General Service Attributes
- `service.name`: "dot-ai-mcp"
- `OTEL_SEMCONV_STABILITY_OPT_IN`: "gen_ai_latest_experimental" (environment variable, not a span attribute)

#### GenAI Tool Execution Spans (INTERNAL)
Span name: `execute_tool {gen_ai.tool.name}`

**Required attributes:**
- `gen_ai.operation.name`: "execute_tool"

**Recommended attributes:**
- `gen_ai.tool.name`: Tool being executed (e.g., "recommend", "remediate", "manageOrgData")
- `gen_ai.tool.description`: Tool functionality description
- `gen_ai.tool.type`: "function" (MCP tools are function-type tools)
- `gen_ai.tool.call.id`: Optional unique tool call identifier
- `gen_ai.agent.id`: "dot-ai-mcp" (identifies the MCP server agent)
- `gen_ai.agent.name`: "DevOps AI Assistant"
- `gen_ai.agent.description`: "Kubernetes deployment recommendation agent"

**Custom MCP attributes (namespaced):**
- `mcp.session.id`: Session identifier for stateful interactions
- `mcp.transport`: "stdio" | "http"
- `mcp.tool.stage`: Workflow stage (e.g., "recommend", "chooseSolution")
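
As a rough sketch of how the attributes above fit together, the helper below assembles the span name and attribute set for one tool execution. `describeToolSpan` and `ToolSpanInputSketch` are hypothetical names for illustration and are not the signature of the project's `withToolTracing()` wrapper.

```typescript
// Hypothetical input shape for illustration only.
interface ToolSpanInputSketch {
  toolName: string;
  description?: string;
  sessionId?: string;
  transport?: "stdio" | "http";
  stage?: string;
}

interface ToolSpanSketch {
  name: string;
  kind: "INTERNAL";
  attributes: Record<string, string>;
}

function describeToolSpan(input: ToolSpanInputSketch): ToolSpanSketch {
  // Required and fixed attributes for every MCP tool execution span.
  const attributes: Record<string, string> = {
    "gen_ai.operation.name": "execute_tool",
    "gen_ai.tool.name": input.toolName,
    "gen_ai.tool.type": "function",
    "gen_ai.agent.id": "dot-ai-mcp",
    "gen_ai.agent.name": "DevOps AI Assistant",
  };
  // Optional attributes are only set when a value is known.
  if (input.description !== undefined) attributes["gen_ai.tool.description"] = input.description;
  if (input.sessionId !== undefined) attributes["mcp.session.id"] = input.sessionId;
  if (input.transport !== undefined) attributes["mcp.transport"] = input.transport;
  if (input.stage !== undefined) attributes["mcp.tool.stage"] = input.stage;

  return { name: "execute_tool " + input.toolName, kind: "INTERNAL", attributes };
}
```

Omitting unknown optional values, rather than setting empty strings, keeps the exported spans free of placeholder attributes.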

#### GenAI LLM Inference Spans (CLIENT)
Span name: `{gen_ai.operation.name} {gen_ai.request.model}`
Example: `chat claude-3-5-sonnet`

**Required attributes:**
- `gen_ai.operation.name`: "chat" | "text_completion" | "generate_content"
- `gen_ai.provider.name`: "anthropic" | "openai" | "google" | "xai" | etc.

**Conditionally required attributes:**
- `gen_ai.request.model`: Model identifier (e.g., "claude-3-5-sonnet", "gpt-4o")
- `gen_ai.conversation.id`: Conversation/session identifier (when session-based)

**Recommended attributes:**
- `gen_ai.response.model`: Actual model that generated response (may differ from request)
- `gen_ai.usage.input_tokens`: Prompt token count
- `gen_ai.usage.output_tokens`: Completion token count
- `gen_ai.request.temperature`: Temperature parameter
- `gen_ai.request.max_tokens`: Maximum output tokens
- `gen_ai.request.top_p`: Top-p sampling parameter
- `gen_ai.request.frequency_penalty`: Frequency penalty setting
- `gen_ai.request.stop_sequences`: Stop sequences array
- `gen_ai.response.finish_reasons`: Array of finish reasons (e.g., ["stop"], ["length"])
- `gen_ai.response.id`: Unique completion identifier

**Opt-in attributes (contain sensitive data):**
- `gen_ai.input.messages`: Full chat history (JSON array)
- `gen_ai.output.messages`: Model response messages (JSON array)
- `gen_ai.system_instructions`: System prompt/instructions
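
The split between always-on and opt-in attributes can be illustrated with a small helper that only attaches message content when explicitly requested. The `ChatCallSketch` shape, the `describeChatSpan` name, and the `captureContent` flag are assumptions for this sketch and do not reflect the project's `withAITracing()` implementation.

```typescript
// Hypothetical input shape for illustration; not the project's AIProvider types.
interface ChatCallSketch {
  provider: string;
  requestModel: string;
  responseModel?: string;
  inputTokens?: number;
  outputTokens?: number;
  finishReasons?: string[];
  conversationId?: string;
  inputMessages?: unknown[]; // sensitive: may contain user data
}

type ChatAttrValueSketch = string | number | string[];

interface ChatSpanSketch {
  name: string;
  kind: "CLIENT";
  attributes: Record<string, ChatAttrValueSketch>;
}

function describeChatSpan(call: ChatCallSketch, captureContent: boolean = false): ChatSpanSketch {
  const attributes: Record<string, ChatAttrValueSketch> = {
    "gen_ai.operation.name": "chat",
    "gen_ai.provider.name": call.provider,
    "gen_ai.request.model": call.requestModel,
  };
  if (call.conversationId !== undefined) attributes["gen_ai.conversation.id"] = call.conversationId;
  if (call.responseModel !== undefined) attributes["gen_ai.response.model"] = call.responseModel;
  if (call.inputTokens !== undefined) attributes["gen_ai.usage.input_tokens"] = call.inputTokens;
  if (call.outputTokens !== undefined) attributes["gen_ai.usage.output_tokens"] = call.outputTokens;
  if (call.finishReasons !== undefined) attributes["gen_ai.response.finish_reasons"] = call.finishReasons;

  // Opt-in only: message content is serialized when the caller asks for it.
  if (captureContent && call.inputMessages !== undefined) {
    attributes["gen_ai.input.messages"] = JSON.stringify(call.inputMessages);
  }

  return { name: "chat " + call.requestModel, kind: "CLIENT", attributes };
}
```

Defaulting `captureContent` to `false` means prompts and responses stay out of trace backends unless an operator deliberately enables them.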

#### GenAI Embeddings Spans (CLIENT)
Span name: `embeddings {gen_ai.request.model}`

**Required attributes:**
- `gen_ai.operation.name`: "embeddings"

**Recommended attributes:**
- `gen_ai.request.model`: Embedding model name (e.g., "text-embedding-3-small")
- `gen_ai.usage.input_tokens`: Input token count
- `gen_ai.request.encoding_formats`: Requested encoding formats (e.g., ["base64"])

#### Kubernetes API Spans (CLIENT)
Span name: `{http.request.method} {http.route}` or `k8s.{operation}`

**HTTP attributes:**
- `http.request.method`: "GET" | "POST" | "PUT" | "DELETE"
- `server.address`: Kubernetes API server address
- `server.port`: API server port (typically 443)
- `http.response.status_code`: HTTP status code

**Custom K8s attributes:**
- `k8s.operation`: Kubernetes operation type (e.g., "list_api_groups", "get_pod", "create_deployment")
- `k8s.resource.kind`: Resource kind (e.g., "Pod", "Deployment", "Service")
- `k8s.namespace`: Namespace (when applicable)
- `k8s.client`: "kubernetes-client"
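
The PRD instruments the Kubernetes client through a JavaScript `Proxy`. The sketch below shows the general shape of that technique under simplifying assumptions: instead of creating real OpenTelemetry spans it reports a plain object to a hypothetical `record` callback, and `createTracedClientSketch` is an illustrative name, not the project's `createTracedK8sClient()`.

```typescript
// Hypothetical record of a finished span, for illustration only.
interface RecordedK8sSpanSketch {
  name: string;
  kind: "CLIENT";
  attributes: Record<string, string>;
  status: "OK" | "ERROR";
}

function createTracedClientSketch<T extends object>(
  client: T,
  record: (span: RecordedK8sSpanSketch) => void,
): T {
  return new Proxy(client, {
    get(target, prop, receiver): unknown {
      const value: unknown = Reflect.get(target, prop, receiver);
      // Only method calls are traced; plain properties pass through untouched.
      if (typeof value !== "function" || typeof prop !== "string") return value;
      const method = value as (...args: unknown[]) => unknown;

      return (...args: unknown[]): unknown => {
        const finish = (status: "OK" | "ERROR"): void => {
          record({
            name: "k8s." + prop,
            kind: "CLIENT",
            attributes: { "k8s.client": "kubernetes-client", "k8s.operation": prop },
            status,
          });
        };
        try {
          const result = method.apply(target, args);
          if (result instanceof Promise) {
            // Async calls finish the span when the promise settles.
            return result.then(
              (resolved: unknown) => { finish("OK"); return resolved; },
              (reason: unknown) => { finish("ERROR"); throw reason; },
            );
          }
          finish("OK");
          return result;
        } catch (error) {
          finish("ERROR");
          throw error;
        }
      };
    },
  });
}
```

Because the proxy has the same type as the wrapped client, call sites do not need to change to gain tracing.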

#### HTTP Server Spans (SERVER)
Span name: `{http.request.method} {http.route}`

**Required attributes:**
- `http.request.method`: HTTP method (e.g., "POST", "GET")
- `url.path`: URL path (e.g., "/mcp/tools")
- `url.scheme`: "http" | "https"

**Recommended attributes:**
- `http.route`: Route template (e.g., "/mcp/tools")
- `server.address`: Server address
- `server.port`: Server port
- `client.address`: Client IP address
- `user_agent.original`: User agent string
- `http.response.status_code`: HTTP response status code

#### Span Kind Guidelines
- **SERVER**: HTTP/SSE entry points (incoming requests)
- **INTERNAL**: MCP tool execution, business logic, workflows
- **CLIENT**: AI provider calls, K8s API calls, vector DB queries

#### Future: Agentic System Conventions (Proposed)
When [Issue #2664](https://github.com/open-telemetry/semantic-conventions/issues/2664) is finalized:
- `gen_ai.task.*`: Task-level spans for multi-step workflows
- `gen_ai.action.*`: Action-level spans for execution steps
- `gen_ai.artifact.*`: Input/output artifacts
- `gen_ai.memory.*`: Persistent context storage

137-opentelemetry-tracing.md•71.7 KiB

# PRD: OpenTelemetry Tracing for MCP Server

**Created**: 2025-10-03
**Status**: Complete
**Owner**: Viktor Farcic
**GitHub Issue**: [#137](https://github.com/vfarcic/dot-ai/issues/137)
**Last Updated**: 2025-10-30

## Executive Summary
Add OpenTelemetry distributed tracing to the MCP server to provide vendor-neutral observability for debugging complex multi-step workflows, measuring AI provider performance, and understanding Kubernetes operation latency. This enables production-ready monitoring without infrastructure lock-in.

## Problem Statement

The DevOps AI Toolkit MCP server handles complex operations including:
- Multi-step workflows (buildPlatform: list → submitAnswers → execute)
- AI provider calls (Claude, OpenAI) with variable latency
- Kubernetes cluster operations (discovery, deployment, remediation)
- Session-based stateful interactions across tool calls
- HTTP/SSE and STDIO transport protocols

**Current Gap**: No distributed tracing capability to understand:
- Where time is spent in multi-tool workflows
- Which AI provider calls are slow or failing
- How Kubernetes API latency impacts user experience
- How errors correlate across complex request chains

**Impact**: Difficult to debug performance issues, optimize AI costs, and troubleshoot production incidents.

## Documentation Changes

### Files Created/Updated
- **`docs/observability-guide.md`** - New File - Complete guide for OpenTelemetry tracing, configuration, and usage
- **`docs/deployment-guide.md`** - Updated - Add tracing configuration for production deployments
- **`docs/development-guide.md`** - New File - Developer guide for adding instrumentation to new tools
- **`README.md`** - Project Overview - Add observability to core capabilities
- **`src/core/tracing/`** - Technical Implementation - OpenTelemetry instrumentation modules

### Content Location Map
- **Feature Overview**: See `docs/observability-guide.md` (Section: "What is Distributed Tracing")
- **Configuration**: See `docs/observability-guide.md` (Section: "Setup and Configuration")
- **Tool Instrumentation**: See `docs/development-guide.md` (Section: "Adding Tracing to Tools")
- **Production Deployment**: See `docs/deployment-guide.md` (Section: "Observability Configuration")
- **Trace Analysis**: See `docs/observability-guide.md` (Section: "Understanding Traces")
- **Integration Examples**: See `docs/observability-guide.md` (Section: "Backend Integration")

### User Journey Validation
- [ ] **Primary workflow** documented end-to-end: Enable tracing → Deploy MCP → View traces → Debug issues
- [ ] **Developer workflow** complete: Add instrumentation → Test locally → Verify traces → Deploy
- [ ] **Operations workflow** complete: Configure collector → Deploy server → Monitor traces → Troubleshoot
- [ ] **Cross-references** between development docs and observability docs work correctly

## Solution Overview

**Standard Server-Side OpenTelemetry Implementation with Official GenAI Semantic Conventions**

Implement OpenTelemetry instrumentation following industry best practices and official GenAI semantic conventions:
1. **Auto-instrumentation**: HTTP, Express middleware tracing (SERVER spans)
2. **Generic wrapper instrumentation**: Tool execution, AI calls via reusable wrappers
3. **Official GenAI conventions**: Use `gen_ai.*` attributes for AI/LLM operations
4. **Hierarchical span structure**: SERVER → INTERNAL → CLIENT span relationships
5. **Direct export**: Server exports traces to OTel collector (not through MCP protocol)
6. **Trace context propagation**: Correlate multi-step workflows and sessions
7. **Integration**: Extend existing Logger and AIProvider interfaces

**NOT implementing**: The controversial "send traces through MCP" approach from modelcontextprotocol/discussions/269.

### Generic Instrumentation Strategy (Decision: 2025-10-29)

**Principle**: Instrument at architectural boundaries using generic wrappers, not manual spans scattered throughout the codebase.

**Four Strategic Instrumentation Points:**

1. **HTTP Entry Points** (Auto-instrumented)
   - Uses `@opentelemetry/instrumentation-express`
   - Zero code changes required
   - Creates SERVER spans with `http.*` attributes
   - Example: `POST /mcp/tools`

2. **MCP Tool Execution** (Generic wrapper: `withToolTracing()`)
   - Single wrapper function instruments all 5 MCP tools automatically
   - Creates INTERNAL spans with `gen_ai.tool.*` attributes
   - Example: `execute_tool recommend`
   - Implementation: `src/core/tracing/tool-tracing.ts`

3. **AI Provider Calls** (Interface-level tracing: `TracedAIProvider`)
   - Base class implements tracing for all AI providers
   - Creates CLIENT spans with official `gen_ai.*` attributes
   - Example: `chat claude-3-5-sonnet`
   - All providers (Anthropic, Vercel, NoOp) inherit tracing behavior

4. **Kubernetes API Calls** (Optional proxy: `createTracedK8sClient()`)
   - Generic proxy wrapper for transparent instrumentation
   - Creates CLIENT spans with `k8s.*` and `http.*` attributes
   - Example: `k8s.listAPIGroups`
   - Implementation: `src/core/tracing/k8s-tracing.ts`

**Benefits:**
- ✅ ~100-150 lines of instrumentation code covers entire system
- ✅ New tools/providers automatically traced
- ✅ No manual span management in business logic
- ✅ Standards-compliant with official OpenTelemetry GenAI conventions
- ✅ Maintainable: change tracing logic in one place

## Implementation Requirements

### Core Functionality
- [ ] **HTTP/MCP request tracing**: Automatic span creation for all incoming requests - Documented in `docs/observability-guide.md` (Section: "Request Tracing")
- [ ] **Tool execution spans**: Each of 5 active MCP tools traced (recommend, version, manageOrgData, remediate, projectSetup) - Documented in `docs/development-guide.md` (Section: "Tool Spans")
- [ ] **Error tracking**: Integration with existing error-handling system - Documented in `docs/observability-guide.md` (Section: "Error Correlation")
- [ ] **Trace context propagation**: Session-based workflow correlation - Documented in `docs/development-guide.md` (Section: "Context Propagation")

### Deep Instrumentation
- [ ] **AI provider tracing**: Claude/OpenAI API call spans with latency/tokens - Documented in `docs/observability-guide.md` (Section: "AI Provider Metrics")
- [ ] **Kubernetes operations**: Cluster API calls, discovery, deployments - Documented in `docs/observability-guide.md` (Section: "Kubernetes Operations")
- [ ] **Multi-step workflows**: Trace buildPlatform intent mapping → script discovery → execution - Documented in `docs/development-guide.md` (Section: "Complex Workflows")
- [ ] **Session lifecycle**: Track session creation, continuity, and completion - Documented in `docs/observability-guide.md` (Section: "Session Tracking")

### Configuration & Deployment
- [ ] **Environment-based config**: OTEL_EXPORTER_OTLP_ENDPOINT, service name, sampling - Documented in `docs/deployment-guide.md` (Section: "Environment Variables")
- [ ] **Multiple exporters**: Console (dev), OTLP (production), Jaeger, Zipkin - Documented in `docs/observability-guide.md` (Section: "Exporter Configuration")
- [ ] **Sampling strategies**: Always-on (dev), probability-based (production) - Documented in `docs/deployment-guide.md` (Section: "Sampling Configuration")
- [ ] **Zero-config default**: Works out-of-box with console exporter for local development - Documented in `docs/development-guide.md` (Section: "Getting Started")

### Documentation Quality Requirements
- [ ] **All examples work**: Configuration examples validated in integration tests
- [ ] **Complete user journeys**: End-to-end workflows from setup to trace analysis documented
- [ ] **Consistent terminology**: OpenTelemetry terms used correctly across all documentation
- [ ] **Working cross-references**: All links between observability docs and core docs resolve correctly

### Success Criteria
- [ ] **Minimal overhead**: <2ms latency added per request with tracing enabled
- [ ] **Complete visibility**: All tool executions, AI calls, and K8s operations traced
- [ ] **Developer experience**: Simple API for adding spans to new tools
- [ ] **Production ready**: Configurable sampling, multiple backends, robust error handling
- [ ] **Zero infrastructure requirement**: Works with any OTel-compatible backend

## Implementation Progress

### Phase 1: Core Tracing Foundation [Status: ✅ COMPLETE - 100%]
**Target**: Basic distributed tracing working for HTTP requests and tool execution
**Note**: Integration tests will be written at the end after all phases complete

**Documentation Changes:**
- [x] **`docs/observability-guide.md`**: Create comprehensive user guide with tracing concepts, setup, and usage
- [x] **`docs/setup/docker-setup.md`**: Add tracing environment variables with link to observability guide
- [x] **`docs/setup/kubernetes-setup.md`**: Add tracing note with link to observability guide
- [x] **`docs/setup/kubernetes-toolhive-setup.md`**: Add tracing note with link to observability guide
- [x] **`docs/mcp-setup.md`**: Add links to observability guide in Configuration Components and Next Steps

**Implementation Tasks:**
- [x] Add OpenTelemetry dependencies (`@opentelemetry/sdk-node`, `@opentelemetry/api`, `@opentelemetry/auto-instrumentations-node`)
- [x] Create `src/core/tracing/tracer.ts` with initialization and configuration logic
- [x] Create `src/core/tracing/config.ts` with environment-based configuration
- [x] Create `src/core/tracing/types.ts` with TypeScript type definitions
- [x] Create `src/core/tracing/http-tracing.ts` with HTTP SERVER span creation
- [x] Implement HTTP middleware tracing for HTTP/SSE transport (SERVER spans working with proper context propagation)
- [x] Fix trace context propagation (CLIENT spans now children of SERVER span)
- [x] Add tool execution span wrapper for all 5 MCP tools (created `src/core/tracing/tool-tracing.ts` with `withToolTracing`)
- [x] Implement STDIO transport tracing (tool spans work for MCP calls through Claude Code)
- [x] Integrate tracer with MCP server startup and graceful shutdown
- [x] Configure console exporter for local development
- [x] Add environment variable configuration (OTEL_SERVICE_NAME, OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_TRACING_ENABLED, OTEL_DEBUG)
- [x] Add OTLP exporter support (Phase 3 work completed early)
- [x] Add OpenTelemetry status to version tool (shows tracing config and health)

### Phase 2: Deep Instrumentation [Status: ✅ COMPLETE - 100%]
**Target**: AI provider calls, Kubernetes operations, and vector database fully traced

**Documentation Changes:**
- [ ] **`docs/development-guide.md`**: Create developer guide for adding instrumentation to new tools and operations
- [ ] **`docs/observability-guide.md`**: Add "AI Provider Metrics" and "Kubernetes Operations" sections
- [ ] **`docs/observability-guide.md`**: Document trace analysis workflows for common debugging scenarios

**Implementation Tasks:**
- [x] **AI Provider Chat Tracing** - Complete generic wrapper instrumentation ✅
  - [x] Created `src/core/tracing/ai-tracing.ts` with `withAITracing()` wrapper supporting `chat`, `tool_loop`, `embeddings` operations
  - [x] Integrated with `AnthropicProvider.sendMessage()` for chat operations
  - [x] Integrated with `VercelProvider.sendMessage()` for chat operations (supports all Vercel providers: OpenAI, Google, Anthropic, xAI, Mistral, DeepSeek, OpenRouter, custom endpoints)
  - [x] Official GenAI semantic conventions: `gen_ai.operation.name`, `gen_ai.provider.name`, `gen_ai.request.model`, `gen_ai.usage.*`
  - [x] Token usage tracking (input tokens, output tokens, cache read tokens, cache creation tokens)
  - [x] Tested with Jaeger - spans showing descriptive names like `chat claude-sonnet-4-5-20250929` with full GenAI attributes
  - [x] Context propagation working - AI spans properly nested under tool execution spans
- [x] **AI Provider ToolLoop Tracing** - Complete generic wrapper instrumentation with iteration visibility ✅
  - [x] Wrap `AnthropicProvider.toolLoop()` with `withAITracing(operation: 'tool_loop')`
  - [x] Wrap `VercelProvider.toolLoop()` with `withAITracing(operation: 'tool_loop')`
  - [x] Add `tool_loop_iteration` spans to `AnthropicProvider.toolLoop()` for per-iteration visibility
  - [x] Removed `isEnabled()` checks from tracing wrappers - trust OpenTelemetry no-op tracer (simpler code, zero overhead when disabled)
  - [x] Tested with Jaeger - iteration spans properly nested under `tool_loop` span, showing clear workflow progression
- [x] **AI Provider Embeddings Tracing** - Complete generic wrapper instrumentation ✅
  - [x] Wrapped `VercelEmbeddingProvider.generateEmbedding()` with `withAITracing(operation: 'embeddings')`
  - [x] Wrapped `VercelEmbeddingProvider.generateEmbeddings()` with `withAITracing(operation: 'embeddings')`
  - [x] Official GenAI semantic conventions: `gen_ai.operation.name`, `gen_ai.provider.name`, `gen_ai.request.model`
  - [x] Embedding metrics tracking (count, dimensions via `gen_ai.embeddings.count`, `gen_ai.embeddings.dimensions`)
  - [x] Tested with capability scan - spans showing `embeddings text-embedding-3-small` with proper nesting
  - [x] Context propagation working - embedding spans properly nested under tool execution spans
- [x] **Kubernetes Client Library Tracing** - Complete generic proxy wrapper instrumentation ✅
  - [x] Created `src/core/tracing/k8s-tracing.ts` with `createTracedK8sClient()` proxy wrapper for transparent instrumentation
  - [x] Integrated in `src/core/discovery.ts` - Wrapped CoreV1Api and VersionApi clients
  - [x] Integrated in `src/tools/version.ts` - Wrapped AppsV1Api and AdmissionregistrationV1Api clients
  - [x] JavaScript Proxy pattern creates CLIENT spans with `k8s.api`, `k8s.method` attributes
  - [x] Zero code changes required in existing K8s operations - automatic tracing via proxy
- [x] **Kubectl CLI Tracing** - Complete wrapper instrumentation for CLI commands ✅
  - [x] Created `withKubectlTracing()` wrapper function in `src/core/tracing/k8s-tracing.ts`
  - [x] Integrated in `src/core/kubernetes-utils.ts` - Wrapped `executeKubectl()` function
  - [x] Creates CLIENT spans with `k8s.client: 'kubectl'`, `k8s.operation`, `k8s.resource` attributes
  - [x] Tested with capability scanning kubectl commands - spans showing proper operation details
- [x] **Qdrant Vector Database Tracing** - Complete generic wrapper instrumentation ✅
  - [x] Created `src/core/tracing/qdrant-tracing.ts` with `withQdrantTracing()` wrapper for all vector DB operations
  - [x] Database semantic conventions: `db.system: 'qdrant'`, `db.operation.name`, `db.collection.name`, `db.vector.*`
  - [x] Integrated with all 10 VectorDBService operations (upsert, search, searchByKeywords, getDocument, deleteDocument, deleteAll, getAllDocuments, getCollectionInfo, healthCheck, initializeCollection)
  - [x] Result metadata tracking: `db.query.result_count`, `db.vector.top_score` for search operations
  - [x] Tested with capability scan workflow - all operations traced correctly (delete_all, upsert, search, list)
  - [x] Context propagation working - vector DB spans properly nested under tool execution spans
- [x] Instrument deployment operations in `src/tools/deploy-manifests.ts` (Uses executeKubectl which is already traced)
- [~] Add session lifecycle tracing with session ID propagation (Moved to PRD #197)
- [~] Implement trace context propagation across multi-step workflows (buildPlatform, remediate) (Moved to PRD #197)
- [~] Add custom span attributes for tool parameters and results (Not required - current attributes sufficient)

### Phase 3: Advanced Features & Production Readiness [Status: 🔄 DEFERRED]
**Target**: Production-grade observability with metrics, sampling, and multiple backends
**Note**: Not required - OTLP exporter and current feature set sufficient for production use

**Documentation Changes:**
- [ ] **`docs/observability-guide.md`**: Add "Advanced Configuration", "Metrics", and "Production Best Practices" sections
- [ ] **`docs/deployment-guide.md`**: Document production sampling strategies and backend integration
- [ ] **Cross-file validation**: Ensure observability integrates seamlessly with deployment and development workflows

**Implementation Tasks:**
- [x] Add OTLP exporter support (HTTP exporter complete, tested with Jaeger)
- [ ] Add native Jaeger and Zipkin exporters with auto-detection
- [ ] Implement configurable sampling strategies (always-on, probability-based, rate-limiting)
- [ ] Add OpenTelemetry Metrics API for request counts, durations, error rates
- [ ] Create custom metrics for AI token usage, K8s API call counts, tool execution frequency
- [ ] Implement trace baggage for user context propagation
- [ ] Performance benchmarking to validate <2ms overhead target

### Phase 4: Testing & Documentation [Status: 🔄 DEFERRED]
**Target**: Comprehensive testing and documentation
**Note**: Not required - feature validated through manual testing, documentation sufficient for current needs

**Testing Tasks:**
- [ ] Add integration tests for tracing with mock OTel collector
- [ ] Test tool execution spans for all 5 MCP tools
- [ ] Test AI provider call spans with different models
- [ ] Test Kubernetes operation spans
- [ ] Test trace context propagation across multi-step workflows
- [ ] Test all exporter types (console, OTLP, Jaeger, Zipkin)
- [ ] Test sampling configurations
- [ ] Performance testing to validate <2ms overhead target

**Documentation Tasks:**
- [ ] Complete `docs/observability-guide.md` with all sections
- [ ] Complete `docs/development-guide.md` with instrumentation examples
- [ ] Update `docs/deployment-guide.md` with tracing configuration
- [ ] Update `README.md` with observability capabilities
- [ ] Validate all code examples work
- [ ] Validate all cross-references resolve correctly

## Technical Implementation Checklist

### Architecture & Design
- [ ] Design tracer initialization with lazy loading to minimize startup overhead (src/core/tracing/tracer.ts)
- [ ] Create span factory with consistent attribute naming conventions (src/core/tracing/span-factory.ts)
- [ ] Design Logger integration for automatic trace context injection (src/core/tracing/logger-integration.ts)
- [ ] Plan exporter selection strategy based on environment variables (src/core/tracing/exporters.ts)
- [ ] Design sampling configuration with environment-based overrides (src/core/tracing/sampling.ts)
- [ ] Document tracing architecture and span hierarchy

### Development Tasks
- [ ] Implement `TracingService` class with start/stop lifecycle management
- [ ] Create `withSpan` utility for wrapping async operations with tracing
- [ ] Add `instrumentTool` decorator for automatic tool span creation
- [ ] Implement trace context extraction from MCP session IDs
- [ ] Build error tracking integration with existing `ErrorHandler` class
- [ ] Create span attribute helpers for consistent metadata

### Documentation Validation
- [ ] **Automated testing**: Configuration examples execute successfully in integration tests
- [ ] **Cross-file consistency**: Tracing terminology aligned across all documentation
- [ ] **User journey testing**: Complete setup-to-analysis workflows can be followed end-to-end
- [ ] **Link validation**: All references between observability docs and core documentation resolve correctly

### Quality Assurance
- [ ] Unit tests for tracer initialization and span creation (>90% coverage)
- [ ] Unit tests for exporter configuration and selection logic (>90% coverage)
- [ ] Integration tests with mock OpenTelemetry collector
- [ ] Performance tests validating <2ms overhead per request
- [ ] Load testing with tracing enabled on large-scale operations
- [ ] Trace data validation ensuring correct span relationships and attributes

## Dependencies & Blockers

### External Dependencies
- [ ] OpenTelemetry SDK and API packages (npm packages) - ✅ Available
- [ ] OpenTelemetry collector or compatible backend (optional for dev) - ✅ Console exporter works out-of-box
- [ ] Backend for production (Jaeger, Zipkin, Grafana Tempo, vendor services) - User choice

### Internal Dependencies
- [ ] Existing Logger interface for trace context integration - ✅ Available
- [ ] Error handling system for error span tracking - ✅ Available (src/core/error-handling.ts)
- [ ] MCP server with 6 tools for instrumentation - ✅ Available
- [ ] HTTP/SSE transport for request tracing - ✅ Available

### Current Blockers
- [ ] None currently identified - all dependencies are satisfied

## Risk Management

### Identified Risks
- [ ] **Risk**: Performance overhead impacting request latency | **Mitigation**: Benchmark early, implement sampling, use async exports | **Owner**: Developer
- [ ] **Risk**: Additional complexity in error handling and logging | **Mitigation**: Extend existing patterns, comprehensive testing | **Owner**: Developer
- [ ] **Risk**: Configuration complexity for users | **Mitigation**: Zero-config defaults, clear documentation, environment variable standards | **Owner**: Developer
- [ ] **Risk**: Vendor lock-in with specific backends | **Mitigation**: OpenTelemetry standard ensures portability, support multiple exporters | **Owner**: Developer

### Mitigation Actions
- [ ] Performance benchmarking in Phase 1 to validate overhead targets
- [ ] Developer guide with clear examples for adding instrumentation
- [ ] Default to console exporter for zero-config local development
- [ ] Support standard OTEL environment variables for backend-agnostic configuration

## Decision Log

### Open Questions
- [ ] What default sampling rate for production (1%, 10%, 100%)?
- [ ] Should we include trace IDs in all log messages automatically?

### Resolved Decisions

**Initial Architecture (2025-10-03):**
- [x] **Standard server-side OTel implementation** - **Decided**: 2025-10-03 | **Rationale**: Industry best practice, avoids MCP protocol controversy, mature ecosystem
- [x] **Direct trace export to collector** - **Decided**: 2025-10-03 | **Rationale**: Standard approach, avoids security concerns, better separation of concerns
- [x] **Extend existing Logger interface** - **Decided**: 2025-10-03 | **Rationale**: Minimal disruption, automatic trace context in logs, familiar developer experience
- [x] **Tracing disabled by default (opt-in)** - **Decided**: 2025-10-29 | **Rationale**: Avoid console noise for users not interested in observability. Set `OTEL_TRACING_ENABLED=true` to enable. Console exporter outputs to stderr when enabled.

**Semantic Conventions & Architecture (2025-10-29):**
- [x] **Use official OpenTelemetry GenAI semantic conventions** - **Decided**: 2025-10-29 | **Rationale**: OpenTelemetry has comprehensive AI/LLM/Agent support through `gen_ai.*` attributes, which currently carry the "Development" stability status. Adopting them ensures interoperability with AI observability tools, future compatibility, and community alignment. Replaces custom conventions (`mcp.tool.name`, `ai.provider`) with official ones (`gen_ai.tool.name`, `gen_ai.provider.name`). | **Impact**: All attribute names change to official conventions. Code examples need updating. Documentation must reference official OpenTelemetry GenAI specifications. | **Resources**: [GenAI Semantic Conventions](https://opentelemetry.io/docs/specs/semconv/gen-ai/), [GenAI Spans](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/), [AI Agent Observability Blog](https://opentelemetry.io/blog/2025/ai-agent-observability/)

- [x] **Combine multiple span types in hierarchical structure** - **Decided**: 2025-10-29 | **Rationale**: Standard OpenTelemetry pattern for multi-layer architectures. HTTP requests create SERVER spans, business logic creates INTERNAL spans, external calls create CLIENT spans. Natural parent-child relationships reflect actual execution flow. | **Impact**: Traces show complete end-to-end visibility: HTTP request → MCP tool → AI provider → K8s API all properly nested with appropriate span kinds.

- [x] **Generic instrumentation via wrappers, not manual spans** - **Decided**: 2025-10-29 | **Rationale**: Instrument at architectural boundaries (HTTP layer, tool execution, AI providers, K8s client) using generic wrappers. Avoids littering codebase with manual span creation. Automatic coverage as new tools/providers are added. | **Impact**: ~100-150 lines of wrapper code instruments entire system. New tools automatically traced when registered. Changes to tracing logic happen in one place. | **Code Changes**: Create `withToolTracing()` wrapper (src/core/tracing/tool-tracing.ts), `TracedAIProvider` base class (src/core/providers/base-traced-provider.ts), optional `createTracedK8sClient()` proxy (src/core/tracing/k8s-tracing.ts).

- [x] **Four-layer instrumentation architecture** - **Decided**: 2025-10-29 | **Rationale**: Clear separation of concerns with instrumentation at each architectural layer: (1) HTTP entry points via auto-instrumentation, (2) MCP tool execution via generic wrapper, (3) AI provider calls via base class, (4) Kubernetes API via optional proxy. | **Impact**: Phase 1 focuses on layers 1-3 (HTTP, tools, AI). Layer 4 (K8s) is optional for Phase 2. Each layer uses appropriate span kind (SERVER, INTERNAL, CLIENT) and semantic conventions (http.*, gen_ai.*, k8s.*).
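
As a rough illustration of the layering decided above, the sketch below nests SERVER, INTERNAL, and CLIENT spans using a tiny in-memory recorder. `RecordingTracer`, `withSpan`, and the span names chosen for the AI and Kubernetes calls are hypothetical stand-ins. Real code would presumably rely on the OpenTelemetry API and its context manager, which also tracks the active span across `await` boundaries, something this synchronous stack does not attempt.

```typescript
// Minimal in-memory stand-in for a tracer; NOT the OpenTelemetry API.
type SpanKind = "SERVER" | "INTERNAL" | "CLIENT";

interface RecordedSpan {
  id: number;
  parentId: number | null;
  name: string;
  kind: SpanKind;
}

class RecordingTracer {
  readonly spans: RecordedSpan[] = [];
  private readonly active: RecordedSpan[] = []; // stack of currently open spans

  // Runs fn inside a new span whose parent is whatever span is active right now.
  withSpan<T>(name: string, kind: SpanKind, fn: () => T): T {
    const parent = this.active[this.active.length - 1];
    const span: RecordedSpan = {
      id: this.spans.length + 1,
      parentId: parent ? parent.id : null,
      name,
      kind,
    };
    this.spans.push(span);
    this.active.push(span);
    try {
      return fn();
    } finally {
      this.active.pop(); // restore the previous active span, even on errors
    }
  }
}

// One REST request passing through the layers: HTTP -> tool -> AI / K8s calls.
function handleRequest(tracer: RecordingTracer): string {
  return tracer.withSpan("POST /api/v1/tools/version", "SERVER", () =>
    tracer.withSpan("execute_tool version", "INTERNAL", () => {
      tracer.withSpan("chat example-model", "CLIENT", () => "ai response");
      return tracer.withSpan("k8s listNamespace", "CLIENT", () => "ok");
    })
  );
}

const exampleTracer = new RecordingTracer();
handleRequest(exampleTracer);
console.log(
  exampleTracer.spans
    .map((s) => `${s.kind} ${s.name} (parent: ${s.parentId})`)
    .join("\n")
);
```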

**Proposed Agentic System Conventions (Future):**
- [~] **Use `gen_ai.task.*` and `gen_ai.action.*` attributes when standardized** - **Status**: Monitoring GitHub Issue [#2664](https://github.com/open-telemetry/semantic-conventions/issues/2664) | **Rationale**: OpenTelemetry is developing comprehensive semantic conventions for AI agentic systems including tasks, actions, agents, teams, artifacts, and memory. These conventions will standardize telemetry for multi-step AI workflows. | **Impact**: When finalized, we can add task-level and action-level spans for complex workflows like `recommend` (which has multiple stages: intent analysis → capability scan → AI recommendation → manifest generation). | **Timeline**: Monitor for stability; likely Phase 3 implementation.

## Scope Management

### In Scope (Current Version)
- [ ] HTTP request and tool execution tracing
- [ ] AI provider call instrumentation (Claude, OpenAI)
- [ ] Kubernetes operation tracing
- [ ] Session lifecycle and context propagation
- [ ] Console, OTLP, Jaeger, Zipkin exporters
- [ ] Configurable sampling strategies
- [ ] Integration with existing error handling and logging
- [ ] Developer utilities for adding instrumentation

### Out of Scope (Future Versions)
- [~] Custom trace visualization UI
- [~] Automatic anomaly detection in traces
- [~] Cost analysis and optimization recommendations
- [~] Trace-based alerting and notifications
- [~] Historical trace analysis and trend identification
- [~] Multi-tenant trace isolation

### Deferred Items
- [~] Custom visualization - **Reason**: Use existing OTel-compatible tools (Jaeger, Grafana) **Target**: Not planned
- [~] Anomaly detection - **Reason**: Focus on instrumentation first, analysis tools exist **Target**: Future enhancement
- [~] Cost optimization - **Reason**: Requires trace correlation with billing data **Target**: v2.0
- [~] Alerting - **Reason**: Use existing observability platform alerting **Target**: Not planned (external tool responsibility)

## Testing & Validation

### Test Coverage Requirements
- [ ] Unit tests for tracer initialization and configuration (>90% coverage)
- [ ] Unit tests for span factory and instrumentation utilities (>90% coverage)
- [ ] Integration tests with mock OpenTelemetry collector
- [ ] Performance tests validating <2ms overhead target
- [ ] Load tests with high-volume trace generation
- [ ] Trace data validation tests ensuring correct span relationships

### User Acceptance Testing
- [ ] Verify traces appear in console exporter during local development
- [ ] Test OTLP export to Jaeger/Zipkin backends
- [ ] Confirm AI provider spans include token counts and model information
- [ ] Validate Kubernetes operation spans include resource types and namespaces
- [ ] Verify error spans correctly capture exception details
- [ ] Test multi-step workflow trace correlation (buildPlatform, remediate)

## Documentation & Communication

### Documentation Completion Status
- [ ] **`docs/observability-guide.md`**: Complete - User guide with tracing concepts, setup, configuration, usage
- [ ] **`docs/development-guide.md`**: Complete - Developer guide for adding instrumentation to tools
- [ ] **`docs/deployment-guide.md`**: Updated - Added tracing configuration for production deployments
- [ ] **`README.md`**: Updated - Added observability to core capabilities
- [ ] **Cross-file consistency**: Complete - OpenTelemetry terminology and patterns aligned

### Communication & Training
- [ ] Team announcement of observability capabilities
- [ ] Create demo showing trace collection and analysis workflow
- [ ] Prepare documentation for interpreting traces and debugging with distributed tracing
- [ ] Establish guidelines for adding instrumentation to new tools and features

## Launch Checklist

### Pre-Launch
- [ ] All Phase 1 implementation tasks completed
- [ ] Performance overhead validated (<2ms per request)
- [ ] Console exporter working for local development
- [ ] Documentation and configuration examples completed
- [ ] Developer guide tested with new tool instrumentation

### Launch
- [ ] Deploy tracing-enabled MCP server to staging environment
- [ ] Monitor performance metrics and overhead
- [ ] Validate trace data quality and completeness
- [ ] Collect team feedback on developer experience

### Post-Launch
- [ ] Analyze trace data to identify performance bottlenecks
- [ ] Monitor overhead and optimize if needed
- [ ] Iterate on instrumentation based on usage insights
- [ ] Plan Phase 2 enhancements (AI/K8s deep instrumentation)

## Work Log

### 2025-10-03: Initial PRD Creation
**Duration**: ~45 minutes
**Primary Focus**: Research OpenTelemetry integration and create comprehensive PRD

**Completed Work**:
- Researched OpenTelemetry MCP integration patterns and community discussions
- Analyzed existing MCP server architecture and logging infrastructure
- Created GitHub issue #137 for OpenTelemetry tracing feature
- Developed comprehensive PRD following documentation-first approach
- Structured implementation as 3 major phases with clear milestones

**Key Decisions**:
- **Standard server-side implementation**: Avoiding controversial MCP protocol trace forwarding
- **Extend existing patterns**: Building on current Logger and ErrorHandler infrastructure
- **Zero-config defaults**: Console exporter for immediate local development value
- **Vendor-neutral**: OpenTelemetry standard ensures backend portability

**Next Steps**: Ready for implementation of Phase 1 - Core Tracing Foundation

### 2025-10-29: Architecture Refinement - GenAI Conventions & Generic Instrumentation
**Duration**: ~90 minutes
**Primary Focus**: Research OpenTelemetry GenAI semantic conventions and design generic instrumentation strategy

**Completed Work**:
- Researched official OpenTelemetry GenAI semantic conventions (gen_ai.* attributes)
- Analyzed OpenTelemetry AI agent observability standards and best practices
- Investigated span hierarchy patterns for multi-layer architectures (SERVER → INTERNAL → CLIENT)
- Designed generic instrumentation strategy using wrappers instead of manual spans
- Identified four strategic instrumentation points for minimal code changes

**Key Decisions**:
- **Use official GenAI conventions**: Replace custom attributes with `gen_ai.*` standards
- **Hierarchical span structure**: Combine SERVER, INTERNAL, CLIENT spans naturally
- **Generic wrappers**: Instrument at architectural boundaries (HTTP, tools, AI, K8s)
- **Four-layer architecture**: Auto-instrumentation + three generic wrappers

**PRD Updates**:
- Updated Decision Log with four major architectural decisions (2025-10-29)
- Replaced custom semantic conventions with official GenAI attributes
- Added comprehensive span hierarchy documentation with examples
- Updated example trace output to use correct attribute names
- Added generic instrumentation strategy to Solution Overview

**Impact on Implementation**:
- Phase 1 implementation simplified: ~100-150 lines of wrapper code
- New tools/providers automatically traced when added
- All attribute names changed to official conventions
- Documentation must reference OpenTelemetry GenAI specifications

**Resources Added**:
- [OpenTelemetry GenAI Semantic Conventions](https://opentelemetry.io/docs/specs/semconv/gen-ai/)
- [GenAI Spans Specification](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/)
- [AI Agent Observability Blog](https://opentelemetry.io/blog/2025/ai-agent-observability/)
- [Agentic Systems Proposal (Issue #2664)](https://github.com/open-telemetry/semantic-conventions/issues/2664)

**Next Steps**: Implement Phase 1 with generic wrappers and official GenAI conventions

### 2025-10-29: Phase 1 Core Tracing Foundation Implementation
**Duration**: ~3 hours
**Primary Focus**: Core OpenTelemetry infrastructure and auto-instrumentation

**Completed PRD Items**:
- [x] Installed OpenTelemetry SDK dependencies (sdk-node, api, auto-instrumentations-node)
- [x] Created core tracing infrastructure (types.ts, config.ts, tracer.ts, index.ts)
- [x] Implemented lazy initialization with console exporter
- [x] Integrated tracer with MCP server startup and graceful shutdown
- [x] Configured environment variable support (`OTEL_TRACING_ENABLED`, `OTEL_SERVICE_NAME`, `OTEL_DEBUG`)
- [x] Tested auto-instrumentation with manual requests (Qdrant, OpenAI, Anthropic CLIENT spans working)
- [x] Changed default session directory to `./tmp/sessions`

**Key Decisions**:
- **Tracing disabled by default**: Set `OTEL_TRACING_ENABLED=true` to opt in (avoids console noise)
- **Resource import fix**: Used resourceFromAttributes() instead of new Resource() for TypeScript compatibility
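
A minimal sketch of what opt-in configuration loading might look like. Only the three environment variable names and the `loadTracingConfig()` name come from this log. The `TracingConfig` shape, the case-insensitive `true` check, and the `dot-ai-mcp` fallback for the service name are assumptions made for illustration.

```typescript
// Hypothetical config shape; the real types.ts may differ.
interface TracingConfig {
  enabled: boolean;
  serviceName: string;
  debug: boolean;
}

type Env = Record<string, string | undefined>;

// Only the literal string "true" switches a flag on, so an unset or
// misspelled value leaves tracing disabled (the opt-in default).
function isTrue(value: string | undefined): boolean {
  return value !== undefined && value.trim().toLowerCase() === "true";
}

// Takes the environment as a parameter so it can be exercised without
// touching the real process environment.
function loadTracingConfig(env: Env): TracingConfig {
  return {
    enabled: isTrue(env["OTEL_TRACING_ENABLED"]),
    serviceName: env["OTEL_SERVICE_NAME"] || "dot-ai-mcp", // assumed default
    debug: isTrue(env["OTEL_DEBUG"]),
  };
}

const defaultConfig = loadTracingConfig({});
const optedInConfig = loadTracingConfig({
  OTEL_TRACING_ENABLED: "true",
  OTEL_SERVICE_NAME: "my-mcp",
});
console.log(defaultConfig.enabled, optedInConfig.enabled);
```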

**What's Working**:
- ✅ Auto-instrumentation captures outbound HTTP calls (undici instrumentation)
- ✅ Qdrant vector DB queries traced (GET /collections, POST /points/scroll)
- ✅ OpenAI embeddings API calls traced (1049ms duration observed)
- ✅ Anthropic API calls traced (showing 401 errors correctly)
- ✅ Console exporter outputs JSON spans to stderr
- ✅ Service identification (dot-ai-mcp v0.125.0) with full resource attributes

**Known Limitations**:
- ❌ HTTP SERVER spans not appearing (REST API requests not traced at entry point)
- ❌ MCP tool execution spans not implemented (Layer 2 INTERNAL spans pending)
- ❌ No span hierarchy (all spans are roots with parentSpanContext: undefined)
- ❌ Documentation not written yet

**Files Created**:
- `src/core/tracing/types.ts` - TypeScript type definitions and interfaces
- `src/core/tracing/config.ts` - Environment-based configuration with loadTracingConfig()
- `src/core/tracing/tracer.ts` - Core OpenTelemetry SDK integration with NodeSDK
- `src/core/tracing/index.ts` - Public API exports

**Files Modified**:
- `src/mcp/server.ts` - Added getTracer() initialization and shutdownTracer() cleanup
- `package.json` - Added OpenTelemetry dependencies

**Next Session Priorities**:
- Implement Layer 2: MCP tool execution wrappers (INTERNAL spans with gen_ai.tool.* attributes)
- Add HTTP SERVER span creation for REST API endpoints
- Write `docs/observability-guide.md` comprehensive user guide
- Update `README.md` to mention observability capabilities

### 2025-10-29: HTTP SERVER Spans, OTLP Exporter & Context Propagation
**Duration**: ~3 hours
**Primary Focus**: Complete HTTP tracing infrastructure with proper span hierarchy

**Completed PRD Items**:
- [x] HTTP SERVER span creation - Evidence: `src/core/tracing/http-tracing.ts` (170 lines) with W3C Trace Context extraction, OpenTelemetry HTTP semantic conventions
- [x] HTTP transport integration - Evidence: `src/interfaces/mcp.ts` updated with `createHttpServerSpan()` at entry point
- [x] Context propagation - Evidence: Wrapped request handler in `context.with(trace.setSpan(context.active(), span), async () => {...})`
- [x] OTLP exporter - Evidence: `src/core/tracing/tracer.ts` with `OTLPTraceExporter` for Jaeger integration
- [x] Jaeger testing - Evidence: Traces showing 1 trace with 20 spans, depth 2 (proper parent-child hierarchy)

**Key Implementation Details**:
- **Manual HTTP SERVER span creation**: Resolved auto-instrumentation timing issues by creating explicit SERVER spans with `createHttpServerSpan()` function
- **W3C Trace Context support**: Extract parent trace context from HTTP headers using `propagation.extract()` for distributed tracing compatibility
- **OpenTelemetry semantic conventions**: HTTP SERVER spans carry the standard attributes (`http.request.method`, `url.path`, `http.response.status_code`, `client.address`, `user_agent.original`)
- **Context propagation fix**: Wrapped entire HTTP request handler in `context.with(trace.setSpan(context.active(), span), async () => {...})` so CLIENT spans inherit SERVER as parent
- **OTLP exporter**: Implemented with default endpoint `http://localhost:4318/v1/traces`, configurable via `OTEL_EXPORTER_OTLP_ENDPOINT` environment variable
- **Span hierarchy**: Proper SpanKind usage - SERVER (1) for HTTP requests, CLIENT (2) for outbound calls
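
For reference, the W3C `traceparent` header that `propagation.extract()` reads has a fixed four-field layout: version, trace ID, parent span ID, and flags. The sketch below parses it by hand purely to illustrate the format. `parseTraceparent` and `ParentContext` are hypothetical names, the parser assumes the version `00` field layout, and the project itself delegates this work to the OpenTelemetry propagator.

```typescript
interface ParentContext {
  traceId: string; // 32 lowercase hex characters
  spanId: string; // 16 lowercase hex characters
  sampled: boolean; // lowest bit of the trace-flags field
}

// version "-" trace-id "-" parent-id "-" trace-flags
const TRACEPARENT_PATTERN =
  /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string | undefined): ParentContext | null {
  if (header === undefined) return null;
  const match = TRACEPARENT_PATTERN.exec(header.trim());
  if (match === null) return null;

  const version = match[1] ?? "";
  const traceId = match[2] ?? "";
  const spanId = match[3] ?? "";
  const flags = match[4] ?? "00";

  // The spec forbids version "ff" and all-zero trace or span IDs.
  if (version === "ff") return null;
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;

  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// A request without a valid header simply starts a new root trace.
const incoming = parseTraceparent(
  "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
);
console.log(incoming);
```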

**Test Results**:
- ✅ SERVER spans appearing in Jaeger with correct span kind (1) and operation name format: `{METHOD} {path}`
- ✅ CLIENT spans (Qdrant, OpenAI, Anthropic) automatically traced by auto-instrumentation
- ✅ **Context propagation working**: Single trace with 20 spans, depth 2 (CLIENT spans are children of SERVER span)
- ✅ Visual waterfall in Jaeger showing timing relationships and request flow
- ✅ Error tracking working (401 from Anthropic API correctly captured with error attributes)
- ✅ Request duration tracking (554-609ms for version tool with all dependency calls)

**Files Created**:
- `src/core/tracing/http-tracing.ts` - HTTP SERVER span creation module with W3C Trace Context extraction

**Files Modified**:
- `src/core/tracing/tracer.ts` - Added OTLP exporter implementation with `OTLPTraceExporter`
- `src/core/tracing/index.ts` - Exported `createHttpServerSpan` and `withHttpServerTracing` functions
- `src/interfaces/mcp.ts` - Integrated HTTP span creation at request entry point, wrapped handler in active context

**Known Limitations**:
- STDIO transport not instrumented (only HTTP has SERVER spans)
- No INTERNAL spans for MCP tool execution yet (Layer 2 pending)
- Only OTLP exporter implemented (Jaeger/Zipkin native exporters pending)
- Documentation not written yet

**Next Session Priorities**:
- Implement Layer 2: INTERNAL spans for MCP tool execution (5 tools: recommend, version, manageOrgData, remediate, projectSetup)
- Add STDIO transport SERVER spans
- Begin Phase 2: AI provider and Kubernetes deep instrumentation
- Write `docs/observability-guide.md` comprehensive user guide

### 2025-10-29: STDIO Tool Tracing & Version Tool Enhancement
**Duration**: ~3 hours
**Primary Focus**: Implement tool execution tracing for STDIO transport (MCP)

**Completed PRD Items**:
- [x] Tool execution span wrapper for all 5 MCP tools - Evidence: `src/core/tracing/tool-tracing.ts` created with `withToolTracing()` function
- [x] STDIO transport tracing - Evidence: Tool spans working for MCP calls through Claude Code
- [x] OpenTelemetry status in version tool - Evidence: `src/tools/version.ts` with `getTracingStatus()` function
- [x] Tool description optimization - Evidence: Reduced from 30+ words to 7 words to save LLM context tokens

**Key Implementation Details**:
- **Generic tool tracing wrapper**: Created `withToolTracing<T>(toolName, args, handler)` that wraps any tool handler with INTERNAL spans
- **GenAI semantic conventions**: Uses the official `gen_ai.tool.name` attribute plus project-specific `gen_ai.tool.input`, `gen_ai.tool.duration_ms`, and `gen_ai.tool.success` attributes in the same namespace
- **Universal integration**: Modified `registerTool()` in `src/interfaces/mcp.ts` to automatically wrap all tool handlers - zero changes needed to individual tools
- **Context propagation**: Wraps handler execution in `context.with(trace.setSpan(context.active(), span), async () => {...})` so child operations inherit tool span as parent
- **Version tool enhancement**: Added tracing status reporting (enabled, exporterType, endpoint, serviceName, initialized)
- **Token optimization**: Simplified VERSION_TOOL_DESCRIPTION from detailed feature list to "Get comprehensive system health and diagnostics"
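
The wrapper and registration pattern described above could be sketched roughly as follows. The attribute names and the `execute_tool {name}` span name come from this log. `ToolSpan`, `recordedSpans`, and the simplified `registerTool()` signature are hypothetical, and the in-memory span object stands in for a real OpenTelemetry INTERNAL span.

```typescript
// Minimal span stand-in; the real wrapper would use the OpenTelemetry API.
interface ToolSpan {
  name: string;
  attributes: Record<string, string | number | boolean>;
  error?: string;
  ended: boolean;
}

const recordedSpans: ToolSpan[] = [];

type ToolHandler<A, R> = (args: A) => Promise<R>;

async function withToolTracing<A, R>(
  toolName: string,
  args: A,
  handler: ToolHandler<A, R>
): Promise<R> {
  const span: ToolSpan = {
    name: `execute_tool ${toolName}`,
    attributes: {
      "gen_ai.tool.name": toolName,
      "gen_ai.tool.input": JSON.stringify(args ?? null),
    },
    ended: false,
  };
  recordedSpans.push(span);
  const startedAt = Date.now();
  try {
    const result = await handler(args);
    span.attributes["gen_ai.tool.success"] = true;
    return result;
  } catch (err) {
    span.attributes["gen_ai.tool.success"] = false;
    span.error = err instanceof Error ? err.message : String(err);
    throw err; // tracing must not swallow the tool's own failure
  } finally {
    // Runs on success and failure, so the span always ends exactly once.
    span.attributes["gen_ai.tool.duration_ms"] = Date.now() - startedAt;
    span.ended = true;
  }
}

// Registration wraps every handler once, so individual tools stay free of
// tracing code.
type AnyHandler = (args: unknown) => Promise<unknown>;
const tools = new Map<string, AnyHandler>();

function registerTool(name: string, handler: AnyHandler): void {
  tools.set(name, (args) => withToolTracing(name, args, handler));
}

registerTool("version", async () => ({ status: "healthy" }));
```

Because the wrapping happens inside `registerTool()`, a tool added later is traced without any change to its handler.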

**Span Hierarchy**:
- **STDIO (MCP) transport**: `execute_tool version` (INTERNAL) → child spans (CLIENT)
- **HTTP (REST) transport**: `POST /api/v1/tools/version` (SERVER) → `execute_tool version` (INTERNAL) → child spans (CLIENT)

**Test Results**:
- ✅ STDIO tracing working: MCP calls through Claude Code appearing in Jaeger
- ✅ Tool spans showing correct attributes: tool name, input JSON, duration, success status
- ✅ Proper span hierarchy: 1 trace with 20 spans, depth 2
- ✅ Version tool reports tracing status: enabled=true, exporterType=otlp, endpoint, serviceName=dot-ai-mcp
- ✅ Fixed environment variable issue: the OTLP endpoint must be set through the standard `OTEL_EXPORTER_OTLP_ENDPOINT` variable, not `OTEL_OTLP_ENDPOINT`

**Files Created**:
- `src/core/tracing/tool-tracing.ts` - Generic tool tracing wrapper with GenAI semantic conventions

**Files Modified**:
- `src/core/tracing/index.ts` - Exported `withToolTracing` function
- `src/interfaces/mcp.ts` - Integrated tool tracing wrapper in `registerTool()` method
- `src/tools/version.ts` - Added `getTracingStatus()` function, simplified tool description

**Architecture Decision**:
- **Integration tests at the end**: Decided to write comprehensive integration tests after all implementation phases complete (Phase 4), not incrementally per phase

**Known Discoveries**:
- Phase 2 is needed for better span names: child spans currently show generic `GET`/`POST` names, so AI provider and K8s instrumentation must supply descriptive ones such as `chat anthropic claude-3-5-sonnet` and `k8s.getCoreV1Api`

**Next Session Priorities**:
- Begin Phase 2: Deep Instrumentation (AI provider call spans, Kubernetes operation spans)
- Make child spans more distinguishable in Jaeger with descriptive operation names
- Consider Phase 4 comprehensive integration testing strategy

### 2025-10-29: Phase 2 AI Provider Chat Tracing Implementation
**Duration**: ~4 hours
**Primary Focus**: Implement AI provider tracing with official GenAI semantic conventions

**Completed PRD Items**:
- [x] **AI Provider Chat Tracing** - Complete generic wrapper instrumentation
  - Created `src/core/tracing/ai-tracing.ts` with unified `withAITracing()` wrapper
  - Integrated with `AnthropicProvider.sendMessage()`
  - Integrated with `VercelProvider.sendMessage()`
  - Implemented official OpenTelemetry GenAI semantic conventions
  - Token usage tracking (input, output, cache metrics)
  - Verified with Jaeger showing `chat claude-sonnet-4-5-20250929` spans

**Key Implementation Details**:
- **Unified tracing wrapper**: Single `withAITracing()` function supports 3 operation types (`chat`, `tool_loop`, `embeddings`)
- **GenAI semantic conventions**: Using official `gen_ai.*` attributes per OpenTelemetry spec
- **Dynamic provider support**: Works with all Vercel AI SDK providers (OpenAI, Google, Anthropic, xAI, Mistral, DeepSeek, OpenRouter, custom)
- **Context propagation**: AI CLIENT spans properly nested under tool INTERNAL spans
- **HTTP auto-instrumentation**: Nested POST spans visible but kept for now as discovery tool
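
To make the attribute mapping concrete, here is a hedged sketch of how span names and `gen_ai.*` attributes might be assembled for one AI call. The `gen_ai.operation.name`, `gen_ai.provider.name`, `gen_ai.request.model`, and `gen_ai.usage.*` keys are official GenAI attributes, and `tool_loop` is this project's own operation name. The helper names, the `TokenUsage` shape, and the token numbers are hypothetical, and cache metrics are left out because their attribute names are not stated in this log.

```typescript
type GenAIOperation = "chat" | "tool_loop" | "embeddings";

interface TokenUsage {
  inputTokens?: number;
  outputTokens?: number;
}

type GenAIAttributes = Record<string, string | number>;

// Span names follow the "{operation} {model}" shape seen in Jaeger.
function genAISpanName(operation: GenAIOperation, model: string): string {
  return `${operation} ${model}`;
}

function genAIAttributes(
  operation: GenAIOperation,
  provider: string,
  model: string,
  usage: TokenUsage = {}
): GenAIAttributes {
  const attrs: GenAIAttributes = {
    "gen_ai.operation.name": operation,
    "gen_ai.provider.name": provider,
    "gen_ai.request.model": model,
  };
  // Token counts are only known after the provider responds, so they are optional.
  if (usage.inputTokens !== undefined) {
    attrs["gen_ai.usage.input_tokens"] = usage.inputTokens;
  }
  if (usage.outputTokens !== undefined) {
    attrs["gen_ai.usage.output_tokens"] = usage.outputTokens;
  }
  return attrs;
}

const chatSpanName = genAISpanName("chat", "claude-sonnet-4-5-20250929");
const chatAttributes = genAIAttributes(
  "chat",
  "anthropic",
  "claude-sonnet-4-5-20250929",
  { inputTokens: 1200, outputTokens: 350 } // illustrative numbers only
);
console.log(chatSpanName, chatAttributes);
```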

**Bonus Refactoring Work**:
- **Unified embedding providers**: Consolidated `OpenAIEmbeddingProvider`, `GoogleEmbeddingProvider`, `MistralEmbeddingProvider` into single `VercelEmbeddingProvider`
  - Reduced code by 159 lines (32% reduction, 494→335 lines)
  - All three providers now use Vercel AI SDK `embed()` function
  - Dynamic provider selection with `getProviderType()` method
  - Maintains backward compatibility through factory pattern

**Test Results**:
- ✅ Jaeger traces showing proper span hierarchy: SERVER → INTERNAL → CLIENT
- ✅ GenAI attributes complete: operation, provider, model, tokens, duration
- ✅ Build successful with zero TypeScript errors
- ✅ Context propagation working correctly

**Files Created**:
- `src/core/tracing/ai-tracing.ts` - Unified AI tracing wrapper (~150 lines)

**Files Modified**:
- `src/core/providers/anthropic-provider.ts` - Wrapped `sendMessage()` with AI tracing
- `src/core/providers/vercel-provider.ts` - Wrapped `sendMessage()` with AI tracing
- `src/core/embedding-service.ts` - Unified three provider classes into one (major refactor)
- `src/core/tracing/index.ts` - Exported `withAITracing` function
- `src/core/index.ts` - Updated to export `VercelEmbeddingProvider`

**Architecture Decisions**:
- **Keep HTTP auto-instrumentation for now**: Nested POST spans provide discovery value showing what still needs manual instrumentation
- **Embedding provider unification**: Cleaner architecture with single class handling all providers via Vercel AI SDK
- **Generic wrapper pattern proven**: Same pattern will be used for `toolLoop()` and embeddings tracing

**Next Session Priorities**:
- Add tracing to `toolLoop()` operations (AnthropicProvider and VercelProvider)
- Add tracing to embeddings operations (`VercelEmbeddingProvider`)
- Begin Kubernetes client instrumentation
- Consider disabling HTTP auto-instrumentation once all operations are manually traced

### 2025-10-29: ToolLoop Iteration Tracing & Tracing Simplification
**Duration**: ~2 hours
**Primary Focus**: Add per-iteration visibility to toolLoop operations and simplify tracing code

**Completed PRD Items**:
- [x] **ToolLoop Iteration Spans** - Added `tool_loop_iteration` spans to `AnthropicProvider.toolLoop()`
  - Evidence: Each iteration wrapped in INTERNAL span, tested with remediate tool (4 iterations traced)
  - Proper span lifecycle with error handling and early return support
  - Jaeger shows clear iteration progression within `tool_loop` span
- [x] **Tracing Code Simplification** - Removed `isEnabled()` checks from all tracing wrappers
  - Evidence: Updated `ai-tracing.ts`, `http-tracing.ts` to trust OpenTelemetry no-op tracer
  - Simpler code with negligible overhead when tracing is disabled (the no-op tracer records and exports nothing)
  - Kept `isEnabled()` only for status reporting and initialization
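
The reasoning behind dropping the `isEnabled()` checks can be sketched with a pair of interchangeable tracers. `SimpleSpan`, `SimpleTracer`, `noopTracer`, `RecordingTracerStub`, and `traced()` are all hypothetical names used to illustrate the pattern. In the actual code the no-op behavior comes from the OpenTelemetry API when no SDK is registered.

```typescript
interface SimpleSpan {
  setAttribute(key: string, value: string | number | boolean): void;
  end(): void;
}

interface SimpleTracer {
  startSpan(name: string): SimpleSpan;
}

// One shared span whose methods do nothing: no allocation per call, no export.
const NOOP_SPAN: SimpleSpan = { setAttribute: () => {}, end: () => {} };
const noopTracer: SimpleTracer = { startSpan: () => NOOP_SPAN };

// A recording tracer for comparison; it only remembers which spans ended.
class RecordingTracerStub implements SimpleTracer {
  readonly ended: string[] = [];
  startSpan(name: string): SimpleSpan {
    return {
      setAttribute: () => {},
      end: () => {
        this.ended.push(name);
      },
    };
  }
}

// The wrapper never asks whether tracing is enabled. It always calls the
// tracer it was given, and the no-op tracer makes those calls cheap.
function traced<T>(tracer: SimpleTracer, name: string, fn: () => T): T {
  const span = tracer.startSpan(name);
  span.setAttribute("example.wrapped", true);
  try {
    return fn();
  } finally {
    span.end();
  }
}

console.log(traced(noopTracer, "disabled-path", () => 1 + 1));
```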

**Files Modified**:
- `src/core/providers/anthropic-provider.ts` - Added iteration span wrapping with proper early-return handling
- `src/core/tracing/ai-tracing.ts` - Removed `isEnabled()` check, trust no-op tracer
- `src/core/tracing/http-tracing.ts` - Removed `isEnabled()` check, removed unused `getTracer` import

**Architecture Decisions**:
- **Iteration spans only for Anthropic provider**: VercelProvider uses SDK's internal loop, no per-iteration hooks available
- **Trust OpenTelemetry no-op tracer**: Eliminates redundant checks, cleaner code
- **Clear span naming**: Use `tool_loop_iteration` to show clear parent-child relationship with `tool_loop` span
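
The early-return handling mentioned above is the subtle part of per-iteration spans: an iteration that finishes the loop must still end its span. A minimal sketch, assuming a hypothetical `step` callback that represents one model round-trip and an in-memory `LoopSpan` in place of real OpenTelemetry spans:

```typescript
interface LoopSpan {
  name: string;
  iteration: number | null; // null for the enclosing tool_loop span
  ended: boolean;
}

const loopSpans: LoopSpan[] = [];

function startLoopSpan(name: string, iteration: number | null): LoopSpan {
  const span: LoopSpan = { name, iteration, ended: false };
  loopSpans.push(span);
  return span;
}

// Hypothetical shape of one iteration's outcome.
interface StepResult {
  done: boolean;
  text: string;
}

type Step = (iteration: number) => Promise<StepResult>;

async function toolLoop(
  model: string,
  maxIterations: number,
  step: Step
): Promise<string> {
  const loopSpan = startLoopSpan(`tool_loop ${model}`, null);
  try {
    for (let i = 1; i <= maxIterations; i++) {
      const iterationSpan = startLoopSpan("tool_loop_iteration", i);
      try {
        const result = await step(i);
        // Early return: both finally blocks still run, so no span is leaked.
        if (result.done) return result.text;
      } finally {
        iterationSpan.ended = true;
      }
    }
    return "max iterations reached";
  } finally {
    loopSpan.ended = true;
  }
}
```

Putting `end` in `finally` blocks rather than after the `await` is what keeps the span lifecycle correct for early returns and thrown errors alike.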

**Test Results**:
- ✅ Tested with remediate tool - 4 iterations traced successfully
- ✅ Proper span hierarchy: `execute_tool remediate` → `tool_loop claude-sonnet-4-5-20250929` → `tool_loop_iteration` (×4)
- ✅ Token metrics captured correctly across all iterations
- ✅ Build passes with zero errors

**Next Session Priorities**:
- Add embeddings tracing (`VercelEmbeddingProvider`)
- Begin Kubernetes client instrumentation
- Plan HTTP auto-instrumentation removal (Phase 3)

### 2025-10-30: Phase 2 Completion - Embeddings & Kubernetes Tracing + Capability Scan Refactoring
**Duration**: ~6 hours
**Primary Focus**: Complete Phase 2 AI provider and Kubernetes instrumentation, remove manual mode from capability scanning

**Completed PRD Items**:
- [x] **AI Provider Embeddings Tracing** - Complete generic wrapper instrumentation
  - Wrapped `VercelEmbeddingProvider.generateEmbedding()` and `generateEmbeddings()` with `withAITracing()` using the `'embeddings'` operation
  - Official GenAI semantic conventions: `gen_ai.operation.name: 'embeddings'`, `gen_ai.provider.name`, `gen_ai.request.model`
  - Embedding metrics tracking: `gen_ai.embeddings.count`, `gen_ai.embeddings.dimensions`
  - Tested with capability scan - spans showing `embeddings text-embedding-3-small` with proper context propagation
- [x] **Kubernetes Client Library Tracing** - Complete generic proxy wrapper instrumentation
  - Created `src/core/tracing/k8s-tracing.ts` (~150 lines) with `createTracedK8sClient()` proxy wrapper
  - JavaScript Proxy pattern for transparent method interception - zero code changes to existing operations
  - Integrated in `src/core/discovery.ts` (CoreV1Api, VersionApi) and `src/tools/version.ts` (AppsV1Api, AdmissionregistrationV1Api)
  - Creates CLIENT spans with `k8s.api`, `k8s.method` attributes
- [x] **Kubectl CLI Tracing** - Complete wrapper instrumentation for CLI commands
  - Created `withKubectlTracing()` wrapper function in `src/core/tracing/k8s-tracing.ts`
  - Integrated in `src/core/kubernetes-utils.ts` - wrapped `executeKubectl()` function
  - Creates CLIENT spans with `k8s.client: 'kubectl'`, `k8s.operation`, `k8s.resource` attributes
  - Tested with capability scanning kubectl commands
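
The Proxy-based interception used for the client library can be sketched as below. The `createTracedK8sClient` name and the `k8s.api` / `k8s.method` attributes come from this log. The second parameter, the `{api}.{method}` span name, the in-memory `K8sSpan` record, and `FakeCoreV1Api` are assumptions standing in for the real OpenTelemetry CLIENT spans and Kubernetes client classes.

```typescript
interface K8sSpan {
  name: string;
  attributes: Record<string, string>;
  error?: string;
  ended: boolean;
}

const k8sSpans: K8sSpan[] = [];

function createTracedK8sClient<T extends object>(client: T, apiName: string): T {
  return new Proxy(client, {
    get(target, prop, receiver) {
      const value = Reflect.get(target, prop, receiver);
      // Only string-keyed methods are wrapped; plain properties pass through.
      if (typeof prop !== "string" || typeof value !== "function") return value;
      const method: string = prop;

      return (...args: unknown[]) => {
        const span: K8sSpan = {
          name: `${apiName}.${method}`,
          attributes: { "k8s.api": apiName, "k8s.method": method },
          ended: false,
        };
        k8sSpans.push(span);
        const finish = (err?: unknown): void => {
          if (err !== undefined) {
            span.error = err instanceof Error ? err.message : String(err);
          }
          span.ended = true;
        };
        try {
          // Call on the real target so the client's own `this` stays intact.
          const result = value.apply(target, args);
          if (result instanceof Promise) {
            return result.then(
              (r) => { finish(); return r; },
              (e) => { finish(e); throw e; }
            );
          }
          finish();
          return result;
        } catch (e) {
          finish(e);
          throw e;
        }
      };
    },
  });
}

// Hypothetical stand-in for a Kubernetes API client such as CoreV1Api.
class FakeCoreV1Api {
  readonly basePath = "https://example.invalid";
  async listNamespace(): Promise<string[]> {
    return ["default", "kube-system"];
  }
}

const coreApi = createTracedK8sClient(new FakeCoreV1Api(), "CoreV1Api");
```

Callers keep using `coreApi.listNamespace()` exactly as before, which is why no existing operation needs to change.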

**Additional Work Done (Out of PRD Scope)**:
- **Capability Scan Workflow Simplification** - Removed manual processing mode for cleaner UX
  - Removed `processingMode` field from session interface (removed `'processing-mode'` step)
  - Deleted `handleProcessingMode` function entirely (~50 lines removed)
  - Updated workflow routing in `src/tools/organizational-data.ts` to skip processing-mode step
  - Modified `handleResourceSelection` and `handleResourceSpecification` to transition directly to scanning
  - Fixed duplicate CRD fetching bug discovered during testing - moved CRD fetch into manual mode block only
  - Updated integration tests in `tests/integration/tools/manage-org-data-capabilities.test.ts`
    - Removed processing-mode expectations from all test workflows
    - Updated resource lists to use actual cluster resources (Deployment.apps, Service, Pod, ConfigMap)
    - All 11 integration tests passing (406.59s duration)
  - Verified end-to-end with manual MCP testing - workflow now: resource-selection → [resource-specification] → scanning → complete
  - Database verification: 64 capabilities stored in qdrant-test container (capabilities-policies collection)

**Key Implementation Details**:
- **Dual K8s tracing strategy**: Client library tracing (Proxy wrapper) + kubectl CLI tracing (function wrapper)
- **Generic instrumentation pattern**: Both K8s wrappers follow same pattern as AI tracing - instrument at boundaries
- **Negligible overhead when disabled**: Trust OpenTelemetry no-op tracer, no manual `isEnabled()` checks
- **Context propagation verified**: K8s CLIENT spans properly nested under tool INTERNAL spans

**Files Created**:
- `src/core/tracing/k8s-tracing.ts` - Dual K8s tracing module (~150 lines)

**Files Modified**:
- `src/core/embedding-service.ts` - Added embeddings tracing wrappers
- `src/core/discovery.ts` - Wrapped K8s API clients with traced proxies
- `src/core/kubernetes-utils.ts` - Wrapped kubectl execution with tracing
- `src/tools/version.ts` - Wrapped K8s API clients with traced proxies
- `src/core/tracing/index.ts` - Exported K8s tracing functions
- `src/core/capability-scan-workflow.ts` - Removed manual mode logic, fixed duplicate CRD bug
- `src/tools/organizational-data.ts` - Updated routing to skip processing-mode
- `tests/integration/tools/manage-org-data-capabilities.test.ts` - Updated all tests for simplified workflow

**Phase 2 Progress**:
- **Before**: 33% complete (3/9 items - AI providers only)
- **After**: 67% complete (6/9 items - AI providers + K8s client/kubectl complete)
- **Remaining**: Deployment operations instrumentation, session lifecycle, multi-step workflow propagation

**Architecture Decisions**:
- **K8s Proxy Pattern**: JavaScript Proxy wrapper provides transparent instrumentation without modifying business logic
- **Kubectl wrapper approach**: Function wrapper intercepts CLI execution, parses args for operation/resource metadata
- **Simplified capability scan**: Removed manual mode based on user feedback - auto mode covers all use cases
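
How the kubectl wrapper might derive operation and resource metadata from CLI arguments is sketched below. The three attribute names come from this log. `parseKubectlArgs`, `kubectlSpanAttributes`, the `unknown` fallback, and the short list of value-taking flags are assumptions, and the real parser may handle more cases.

```typescript
interface KubectlMetadata {
  operation: string;
  resource: string | null;
}

// Flags that consume the following argument (a partial, illustrative list).
const FLAGS_WITH_VALUE = new Set([
  "-n", "--namespace",
  "-o", "--output",
  "-l", "--selector",
  "--kubeconfig",
  "--context",
]);

function parseKubectlArgs(args: string[]): KubectlMetadata {
  const positional: string[] = [];
  for (let i = 0; i < args.length; i++) {
    const arg = args[i] ?? "";
    if (arg.startsWith("-")) {
      // "--flag=value" is a single token; "--flag value" needs one extra skip.
      if (FLAGS_WITH_VALUE.has(arg)) i++;
      continue;
    }
    positional.push(arg);
  }
  // First positional is the verb (get, apply, explain), second the resource.
  return {
    operation: positional[0] ?? "unknown",
    resource: positional[1] ?? null,
  };
}

function kubectlSpanAttributes(args: string[]): Record<string, string> {
  const { operation, resource } = parseKubectlArgs(args);
  const attrs: Record<string, string> = {
    "k8s.client": "kubectl",
    "k8s.operation": operation,
  };
  if (resource !== null) attrs["k8s.resource"] = resource;
  return attrs;
}

console.log(kubectlSpanAttributes(["get", "crd", "-o", "json"]));
```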

**Test Results**:
- ✅ Embeddings tracing working - spans visible in Jaeger during capability scan
- ✅ K8s client tracing working - CoreV1Api methods traced (listNamespace, etc.)
- ✅ Kubectl tracing working - CLI commands traced (kubectl get crd, etc.)
- ✅ Context propagation verified - K8s spans nested under tool spans
- ✅ All 11 capability scan integration tests passing
- ✅ Database storage verified - 64 capabilities in test Qdrant
- ✅ Build successful with zero TypeScript errors

**Bugs Fixed**:
- **Duplicate CRD fetching**: Discovered during Jaeger trace analysis - manual mode was prefetching CRDs that auto mode didn't use, causing `kubectl get crd` to run twice. Fixed by moving the CRD fetch inside the manual mode block. (Note: manual mode was subsequently removed entirely.)

**Known Discoveries**:
- Manual mode incomplete: Didn't store capabilities to database, only showed preview
- User decision: Remove manual mode entirely - auto mode with resource selection covers all use cases
- Duplicate operations visible in traces helped identify inefficient code paths

**Next Session Priorities**:
- Complete remaining Phase 2 items: deployment operations, session lifecycle, multi-step workflows
- Begin Phase 3: Advanced features (sampling strategies, native exporters, metrics)
- Begin Phase 4: Documentation (`docs/observability-guide.md`, `docs/development-guide.md`)
- Consider Phase 4 comprehensive integration testing for tracing features

### 2025-10-30: Qdrant Vector Database Tracing + Capability Scanning Bug Fix
**Duration**: ~4 hours
**Primary Focus**: Complete Phase 2 vector database instrumentation and fix resource schema fetching

**Completed PRD Items**:
- [x] **Qdrant Vector Database Tracing** - Full instrumentation with database semantic conventions
  - Created `src/core/tracing/qdrant-tracing.ts` with generic `withQdrantTracing()` wrapper
  - Integrated with all 10 VectorDBService operations (upsert, search, searchByKeywords, getDocument, deleteDocument, deleteAll, getAllDocuments, getCollectionInfo, healthCheck, initializeCollection)
  - Official database semantic conventions: `db.system: 'qdrant'`, `db.operation.name`, `db.collection.name`, `db.vector.*`
  - Result metadata tracking: `db.query.result_count`, `db.vector.top_score` for search operations
  - Tested comprehensively with capability scan workflow

**Additional Work Done**:
- **Fixed capability scanning bug** for resources with API groups (Deployment.apps, StatefulSet.apps)
  - Updated `src/core/capability-scan-workflow.ts` resource schema fetching logic
  - Changed from "extract Kind first, then fallback to CRD" to "try full name first, then fallback to Kind"
  - Pattern: Try `kubectl explain <full-name>` → if fails, try `kubectl explain <Kind>`
  - This works for both CRDs (clusters.postgresql.cnpg.io) and built-in resources with groups (Deployment.apps)
  - Verified working with Deployment.apps, Service, ConfigMap, and apps.devopstoolkit.live (CRD)
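
The try-with-fallback pattern above can be sketched roughly as follows. This is a minimal illustration, not the code in `src/core/capability-scan-workflow.ts`: `ExplainFn` and `explainWithFallback` are hypothetical names, and the injected function is assumed to run `kubectl explain <name>` and reject when kubectl cannot resolve the name.

```typescript
// Hypothetical signature: resolves with the `kubectl explain` output for the
// given name, or rejects when kubectl cannot resolve it.
type ExplainFn = (resourceName: string) => Promise<string>;

// Try the full resource name first; on failure, retry with the part before
// the first dot (assumed here to be the Kind, e.g. "Deployment.apps" -> "Deployment").
async function explainWithFallback(fullName: string, explain: ExplainFn): Promise<string> {
  try {
    return await explain(fullName);
  } catch (fullNameError) {
    const kind = fullName.split('.')[0];
    // A name without a group suffix has nothing to fall back to.
    if (kind === fullName) throw fullNameError;
    return explain(kind);
  }
}
```

Because the fallback is driven by the failure itself, no list of built-in API groups has to be maintained.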

**Key Implementation Details**:
- **Generic wrapper pattern**: Single `withQdrantTracing()` function handles all 10 operation types
- **Automatic result metadata**: Captures result counts, top scores for search operations
- **Zero overhead when disabled**: Trust OpenTelemetry no-op tracer (no manual isEnabled() checks)
- **Elegant bug fix**: Try-with-fallback pattern avoids hardcoded API group lists
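
A generic wrapper along these lines might look like the sketch below. It is an illustration under stated assumptions rather than the contents of `src/core/tracing/qdrant-tracing.ts`: the parameter shape (`QdrantOperation`), the span name, and the way result metadata is detected are hypothetical, and `SpanLike`/`TracerLike` are minimal stand-ins for the `@opentelemetry/api` types so the example runs without dependencies. Error recording is omitted for brevity.

```typescript
// Minimal stand-ins for the Span/Tracer types from @opentelemetry/api.
type AttributeValue = string | number | boolean;
interface SpanLike {
  setAttribute(key: string, value: AttributeValue): void;
  end(): void;
}
interface TracerLike {
  startSpan(name: string): SpanLike;
}

// Hypothetical parameter shape; the real wrapper's signature may differ.
interface QdrantOperation {
  operation: string; // e.g. "search", "upsert"
  collection: string;
}

async function withQdrantTracing<T>(
  tracer: TracerLike,
  op: QdrantOperation,
  handler: () => Promise<T>
): Promise<T> {
  const span = tracer.startSpan(`${op.operation} ${op.collection}`);
  span.setAttribute('db.system', 'qdrant');
  span.setAttribute('db.operation.name', op.operation);
  span.setAttribute('db.collection.name', op.collection);
  try {
    const result = await handler();
    // Result metadata: arrays carry a count; hits with a numeric score carry a top score.
    if (Array.isArray(result)) {
      span.setAttribute('db.query.result_count', result.length);
      const scores = result
        .map((hit: { score?: unknown } | null) => hit?.score)
        .filter((s): s is number => typeof s === 'number');
      if (scores.length > 0) {
        span.setAttribute('db.vector.top_score', Math.max(...scores));
      }
    }
    return result;
  } finally {
    span.end(); // always close the span, including when the handler throws
  }
}
```

With a no-op tracer passed in, the wrapper only adds a few cheap method calls, which is why no explicit `isEnabled()` check is needed at call sites.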

**Files Created**:
- `src/core/tracing/qdrant-tracing.ts` - Qdrant tracing module (~150 lines)

**Files Modified**:
- `src/core/vector-db-service.ts` - Wrapped all 10 operations with Qdrant tracing
- `src/core/tracing/index.ts` - Exported `withQdrantTracing`
- `src/core/capability-scan-workflow.ts` - Fixed resource schema fetching (try full name first, fallback to Kind)

**Test Results**:
- ✅ All Qdrant operations traced: delete_all, upsert, search, list
- ✅ Built-in resources scan successfully: Deployment.apps, Service, ConfigMap
- ✅ CRD scanning works: apps.devopstoolkit.live
- ✅ Database semantic conventions correctly applied (`db.system`, `db.operation.name`, `db.collection.name`, `db.vector.*`)
- ✅ Context propagation verified - Qdrant spans nested under tool spans
- ✅ Build successful with zero TypeScript errors

**Phase 2 Progress**:
- **Before**: 67% complete (6/9 items - AI providers + K8s only)
- **After**: 78% complete (7/9 items - AI providers + K8s + Qdrant complete)
- **Remaining**: Deployment operations instrumentation, session lifecycle, multi-step workflow propagation, custom span attributes

**Architecture Decisions**:
- **Try-with-fallback pattern**: More elegant than hardcoding built-in API groups
  - Works for all CRDs (full name succeeds)
  - Works for all built-in resources with groups (full name fails, Kind succeeds)
  - Works for all core resources (full name succeeds)
- **Generic Qdrant wrapper**: Same pattern as AI and K8s tracing - instrument at boundaries, not scattered throughout code

**Next Session Priorities**:
- Begin Phase 3 or Phase 4: Documentation is critical for production readiness
- Complete remaining Phase 2 items: deployment operations, session lifecycle, multi-step workflows
- Integration tests for tracing features

### 2025-10-30: Observability Documentation - User Guide Creation
**Duration**: ~2 hours
**Primary Focus**: Create comprehensive observability documentation for users

**Completed PRD Items**:
- [x] Created `docs/observability-guide.md` (176 lines) - Complete user guide with:
  - Environment variables table with all tracing configuration options
  - Quick Start section with verification steps
  - "What Gets Traced" covering all instrumentation categories (MCP tools, AI providers, Kubernetes, vector DB)
  - Backend Integration with Jaeger Docker setup and OTLP configuration examples
  - "Viewing Traces" section explaining Jaeger UI navigation and trace hierarchy
- [x] Updated `docs/mcp-setup.md` - Added tracing references in Configuration Components table and Next Steps section
- [x] Removed auto-instrumentation - Completed transition to manual-only tracing (commit 54e30b9)

**Documentation Approach**:
- **Project-specific focus**: No generic OpenTelemetry explanations, only links to official docs for concepts
- **Validated examples**: All commands and configurations tested before documentation (Jaeger Docker, OTLP endpoint)
- **User-centric outputs**: Natural language agent responses instead of raw JSON for verification steps
- **Comprehensive coverage**: Environment variables, backend integration, trace viewing, span hierarchy

**Phase 1 Status**:
- **Implementation**: 100% complete ✅
- **Documentation**: 75% complete (3/4 items - missing deployment-guide.md update)

**Next Session Priorities**:
- Add tracing configuration section to `docs/deployment-guide.md` to complete Phase 1
- Consider creating `docs/development-guide.md` for developer instrumentation patterns (Phase 2 doc requirement)
- Complete remaining Phase 2 implementation items (deployment operations, session lifecycle, multi-step workflows)

### 2025-10-30: Helm Chart Enhancement & Deployment Documentation
**Duration**: ~2 hours
**Primary Focus**: Enable tracing configuration for Kubernetes/Docker deployments

**Completed PRD Items**:
- [x] Added `extraEnv` support to Helm chart:
  - `charts/values.yaml`: Added extraEnv field with commented tracing examples (OTEL_TRACING_ENABLED, OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_SERVICE_NAME)
  - `charts/templates/deployment.yaml`: Render extraEnv entries via toYaml template with proper indentation
  - `charts/templates/mcpserver.yaml`: Render extraEnv entries for ToolHive deployment method
  - Added customLlm/customEmbeddings secret key definitions to fix template evaluation error
  - Tested with `helm template` - extraEnv renders correctly in both standard and ToolHive modes
- [x] Updated deployment documentation:
  - `docs/setup/docker-setup.md`: Added optional tracing environment variables section with link to observability guide
  - `docs/setup/kubernetes-setup.md`: Added tracing note in installation notes with link to observability guide
  - `docs/setup/kubernetes-toolhive-setup.md`: Added tracing note in installation notes with link to observability guide

**Implementation Approach**:
- **Generic extraEnv pattern**: Follows Helm best practices by allowing any environment variables, not just tracing-specific fields
- **Minimal documentation**: Simple note with link to observability guide rather than duplicating configuration details
- **Consistent across deployment methods**: Docker, standard Kubernetes, and ToolHive all support tracing configuration
- **User-friendly examples**: Commented examples in values.yaml show exact syntax for tracing configuration
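
On the server side, the variables supplied through `extraEnv` could be read roughly as shown below. This is a sketch only: `loadTracingConfig`, the `TracingConfig` shape, the `'true'` enable value, and the fallback endpoint are assumptions for illustration (4318 is the conventional OTLP/HTTP port), not the project's confirmed behavior. The `dot-ai-mcp` fallback matches the `service.name` listed in the appendix.

```typescript
interface TracingConfig {
  enabled: boolean;
  serviceName: string;
  otlpEndpoint: string;
}

// Reads tracing settings from an environment-like map.
// Defaults are assumptions for this sketch, not confirmed project defaults.
function loadTracingConfig(env: Record<string, string | undefined>): TracingConfig {
  return {
    // Tracing stays off unless explicitly enabled.
    enabled: (env['OTEL_TRACING_ENABLED'] ?? '').toLowerCase() === 'true',
    serviceName: env['OTEL_SERVICE_NAME'] || 'dot-ai-mcp',
    otlpEndpoint: env['OTEL_EXPORTER_OTLP_ENDPOINT'] || 'http://localhost:4318',
  };
}
```

In a Node.js process this would typically be called once at startup as `loadTracingConfig(process.env)`.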

**Phase 1 Status**:
- **Complete**: 100% (all implementation + all documentation) ✅
- **PRD divergence**: Referenced "deployment-guide.md" doesn't exist - actual files are `docs/setup/docker-setup.md`, `docs/setup/kubernetes-setup.md`, `docs/setup/kubernetes-toolhive-setup.md`

**Next Session Priorities**:
- Complete Phase 2 remaining items: deployment operations instrumentation, session lifecycle tracing, multi-step workflow trace propagation
- Consider Phase 3 advanced features (sampling strategies, metrics, native exporters)
- Consider Phase 4 testing and integration tests for tracing functionality

### 2025-10-30: PRD Completion - Phase 2 Complete, Phase 3/4 Deferred
**Duration**: ~15 minutes
**Primary Focus**: Finalize PRD status and document completion

**Completed Work**:
- Verified deployment operations already instrumented via executeKubectl wrapper
- Confirmed Phase 2 is 100% complete (all instrumentation working)
- Updated PRD status from "Draft" to "Complete"
- Marked Phase 3 and Phase 4 as "Deferred" (not required for production use)
- Created PRD #197 for multi-step workflow distributed tracing (separate feature)

**Key Decisions**:
- **Deployment tracing complete**: All kubectl commands (including apply/wait) already traced via withKubectlTracing wrapper
- **Phase 2 at 100%**: All AI providers, K8s operations, vector DB, and deployments fully instrumented
- **Phase 3/4 deferred**: The OTLP exporter and current feature set are sufficient for production, so additional exporters, sampling strategies, metrics, and formal testing are not needed
- **Session tracing separate**: Multi-step workflow trace propagation moved to PRD #197 as independent feature

**Final Status**:
- **Phase 1**: ✅ 100% Complete - Core tracing foundation
- **Phase 2**: ✅ 100% Complete - Deep instrumentation (AI, K8s, Qdrant, deployments)
- **Phase 3**: 🔄 Deferred - Advanced features not required
- **Phase 4**: 🔄 Deferred - Testing and additional documentation not required

**Feature Capabilities Delivered**:
- ✅ HTTP/MCP request tracing with proper span hierarchy
- ✅ All 5 MCP tools traced (recommend, version, manageOrgData, remediate, projectSetup)
- ✅ AI provider operations traced (chat, toolLoop, embeddings) with official GenAI conventions
- ✅ Kubernetes client library and kubectl CLI fully traced
- ✅ Qdrant vector database operations traced
- ✅ Deployment operations traced (kubectl apply, wait)
- ✅ OTLP exporter for production observability backends
- ✅ Environment-based configuration (OTEL_TRACING_ENABLED, OTEL_EXPORTER_OTLP_ENDPOINT)
- ✅ Comprehensive user documentation (docs/observability-guide.md)
- ✅ Production deployment configuration (Helm chart, Docker, Kubernetes)

**PRD Complete**: OpenTelemetry tracing is production-ready and fully operational.

---

## Appendix

### Supporting Materials
- [OpenTelemetry Documentation](https://opentelemetry.io/docs/) - Official OTel documentation
- [OpenTelemetry JavaScript SDK](https://opentelemetry.io/docs/languages/js/) - Node.js implementation guide
- [MCP OpenTelemetry Discussion #269](https://github.com/modelcontextprotocol/modelcontextprotocol/discussions/269) - Community discussion on tracing
- [Existing Error Handling System](./src/core/error-handling.ts) - Current logging and error infrastructure

### Research Findings
- OpenTelemetry is becoming the standard for AI agent observability (2025 trend)
- Standard server-side implementation preferred over MCP protocol forwarding
- Minimal overhead (<2ms) achievable with proper async export configuration
- Strong ecosystem support with multiple backend options (Jaeger, Grafana Tempo, vendors)
- Natural integration with existing Logger interface patterns

### Example Trace Output with Official GenAI Conventions

#### Example 1: Complete Request Trace Hierarchy

```json
// Root span: HTTP SERVER
{
  "traceId": "a1b2c3d4e5f60718293a4b5c6d7e8f90",
  "spanId": "1111111111111111",
  "parentSpanId": null,
  "name": "POST /mcp/tools",
  "kind": "SERVER",
  "timestamp": "2025-10-29T10:15:30.123Z",
  "duration": 10500,
  "attributes": {
    "http.request.method": "POST",
    "http.route": "/mcp/tools",
    "url.path": "/mcp/tools",
    "url.scheme": "https",
    "server.address": "mcp.example.com",
    "server.port": 443,
    "http.response.status_code": 200
  },
  "status": { "code": "OK" }
}

// Child span: MCP Tool Execution (INTERNAL)
{
  "traceId": "a1b2c3d4e5f60718293a4b5c6d7e8f90",
  "spanId": "2222222222222222",
  "parentSpanId": "1111111111111111",
  "name": "execute_tool recommend",
  "kind": "INTERNAL",
  "timestamp": "2025-10-29T10:15:30.150Z",
  "duration": 10400,
  "attributes": {
    "gen_ai.operation.name": "execute_tool",
    "gen_ai.tool.name": "recommend",
    "gen_ai.tool.type": "function",
    "gen_ai.tool.description": "Get AI recommendations for Kubernetes deployments",
    "gen_ai.agent.id": "dot-ai-mcp",
    "gen_ai.agent.name": "DevOps AI Assistant",
    "mcp.session.id": "session_abc123",
    "mcp.tool.stage": "recommend"
  },
  "status": { "code": "OK" }
}

// Grandchild span: AI Provider Call (CLIENT)
{
  "traceId": "a1b2c3d4e5f60718293a4b5c6d7e8f90",
  "spanId": "3333333333333333",
  "parentSpanId": "2222222222222222",
  "name": "chat claude-3-5-sonnet",
  "kind": "CLIENT",
  "timestamp": "2025-10-29T10:15:31.200Z",
  "duration": 2000,
  "attributes": {
    "gen_ai.operation.name": "chat",
    "gen_ai.provider.name": "anthropic",
    "gen_ai.request.model": "claude-3-5-sonnet",
    "gen_ai.response.model": "claude-3-5-sonnet-20241022",
    "gen_ai.usage.input_tokens": 1500,
    "gen_ai.usage.output_tokens": 800,
    "gen_ai.request.temperature": 0.7,
    "gen_ai.request.max_tokens": 4096,
    "gen_ai.response.finish_reasons": ["stop"],
    "gen_ai.conversation.id": "session_abc123"
  },
  "status": { "code": "OK" }
}

// Sibling span: Kubernetes API Call (CLIENT)
{
  "traceId": "a1b2c3d4e5f60718293a4b5c6d7e8f90",
  "spanId": "4444444444444444",
  "parentSpanId": "2222222222222222",
  "name": "GET /apis",
  "kind": "CLIENT",
  "timestamp": "2025-10-29T10:15:33.300Z",
  "duration": 150,
  "attributes": {
    "http.request.method": "GET",
    "server.address": "kubernetes.default.svc",
    "server.port": 443,
    "k8s.operation": "list_api_groups",
    "k8s.client": "kubernetes-client",
    "http.response.status_code": 200
  },
  "status": { "code": "OK" }
}
```
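
The hierarchy in this example is carried entirely by `parentSpanId`. As a small illustration of how a backend such as Jaeger can rebuild the nesting from a flat list of spans, the sketch below links each span to its parent and renders an indented outline. `FlatSpan`, `buildTraceTree`, and `renderTree` are hypothetical names used only for this example.

```typescript
interface FlatSpan {
  spanId: string;
  parentSpanId: string | null;
  name: string;
}

interface SpanNode extends FlatSpan {
  children: SpanNode[];
}

// Link every span to its parent; spans whose parent is absent become roots.
function buildTraceTree(spans: FlatSpan[]): SpanNode[] {
  const nodes = new Map<string, SpanNode>();
  for (const span of spans) {
    nodes.set(span.spanId, { ...span, children: [] });
  }
  const roots: SpanNode[] = [];
  nodes.forEach((node) => {
    const parent = node.parentSpanId ? nodes.get(node.parentSpanId) : undefined;
    if (parent) {
      parent.children.push(node);
    } else {
      roots.push(node);
    }
  });
  return roots;
}

// Render the tree as indented lines, two spaces per nesting level.
function renderTree(nodes: SpanNode[], depth = 0, lines: string[] = []): string[] {
  for (const node of nodes) {
    lines.push('  '.repeat(depth) + node.name);
    renderTree(node.children, depth + 1, lines);
  }
  return lines;
}
```

Applied to the four spans above, `renderTree(buildTraceTree(spans))` yields `POST /mcp/tools` at the top level, `execute_tool recommend` one level down, and `chat claude-3-5-sonnet` and `GET /apis` as siblings beneath it.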

#### Example 2: Error Trace with Exception

```json
{
  "traceId": "e4409a1c2b3d4e5f60718293a4b5c6d7",
  "spanId": "5555555555555555",
  "parentSpanId": "6666666666666666",
  "name": "chat gpt-4o",
  "kind": "CLIENT",
  "timestamp": "2025-10-29T10:20:15.000Z",
  "duration": 30000,
  "attributes": {
    "gen_ai.operation.name": "chat",
    "gen_ai.provider.name": "openai",
    "gen_ai.request.model": "gpt-4o",
    "error.type": "timeout"
  },
  "events": [
    {
      "name": "exception",
      "timestamp": "2025-10-29T10:20:45.000Z",
      "attributes": {
        "exception.type": "TimeoutError",
        "exception.message": "Request timed out after 30s",
        "exception.stacktrace": "..."
      }
    }
  ],
  "status": {
    "code": "ERROR",
    "message": "Request timed out after 30s"
  }
}
```
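
A span like this one is usually produced by catching the failure, recording it on the span, and rethrowing. The sketch below shows that flow under stated assumptions: `ErrorSpanLike` is a minimal stand-in that mirrors the `setAttribute`, `recordException`, `setStatus`, and `end` methods of the `@opentelemetry/api` `Span` (where the status code is an enum rather than a string), and `traceCall` is a hypothetical helper name. The sketch sets `error.type` to the error's `name`, whereas the example above uses the shorter label `timeout`.

```typescript
// Minimal stand-in for the parts of the @opentelemetry/api Span used here.
interface ErrorSpanLike {
  setAttribute(key: string, value: string): void;
  recordException(error: Error): void;
  setStatus(status: { code: 'OK' | 'ERROR'; message?: string }): void;
  end(): void;
}

// Run `call`, marking the span OK on success or ERROR (with an exception event) on failure.
async function traceCall<T>(span: ErrorSpanLike, call: () => Promise<T>): Promise<T> {
  try {
    const result = await call();
    span.setStatus({ code: 'OK' });
    return result;
  } catch (err) {
    const error = err instanceof Error ? err : new Error(String(err));
    span.setAttribute('error.type', error.name);
    span.recordException(error); // becomes the "exception" event on the span
    span.setStatus({ code: 'ERROR', message: error.message });
    throw error; // the caller still sees the original failure
  } finally {
    span.end();
  }
}
```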

### Implementation References
- `@opentelemetry/sdk-node` - Core SDK for Node.js
- `@opentelemetry/api` - OpenTelemetry API
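- `@opentelemetry/auto-instrumentations-node` - Automatic HTTP/Express instrumentation (removed when the implementation moved to manual-only tracing)
- `@opentelemetry/exporter-trace-otlp-http` - OTLP exporter for production
- `@opentelemetry/exporter-jaeger` - Jaeger exporter (native exporters were part of the deferred Phase 3)
- `@opentelemetry/exporter-zipkin` - Zipkin exporter (native exporters were part of the deferred Phase 3)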

### Official OpenTelemetry Semantic Conventions for MCP Operations

**Note**: Using official OpenTelemetry GenAI semantic conventions (development status) as defined in [OpenTelemetry GenAI Specifications](https://opentelemetry.io/docs/specs/semconv/gen-ai/).

#### General Service Attributes
- `service.name`: "dot-ai-mcp"
- `OTEL_SEMCONV_STABILITY_OPT_IN`: "gen_ai_latest_experimental" (environment variable)

#### GenAI Tool Execution Spans (INTERNAL)
Span name: `execute_tool {gen_ai.tool.name}`

**Required attributes:**
- `gen_ai.operation.name`: "execute_tool"

**Recommended attributes:**
- `gen_ai.tool.name`: Tool being executed (e.g., "recommend", "remediate", "manageOrgData")
- `gen_ai.tool.description`: Tool functionality description
- `gen_ai.tool.type`: "function" (MCP tools are function-type tools)
- `gen_ai.tool.call.id`: Optional unique tool call identifier
- `gen_ai.agent.id`: "dot-ai-mcp" (identifies the MCP server agent)
- `gen_ai.agent.name`: "DevOps AI Assistant"
- `gen_ai.agent.description`: "Kubernetes deployment recommendation agent"

**Custom MCP attributes (namespaced):**
- `mcp.session.id`: Session identifier for stateful interactions
- `mcp.transport`: "stdio" | "http"
- `mcp.tool.stage`: Workflow stage (e.g., "recommend", "chooseSolution")

#### GenAI LLM Inference Spans (CLIENT)
Span name: `{gen_ai.operation.name} {gen_ai.request.model}`
Example: `chat claude-3-5-sonnet`

**Required attributes:**
- `gen_ai.operation.name`: "chat" | "text_completion" | "generate_content"
- `gen_ai.provider.name`: "anthropic" | "openai" | "google" | "xai" | etc.

**Conditionally required attributes:**
- `gen_ai.request.model`: Model identifier (e.g., "claude-3-5-sonnet", "gpt-4o")
- `gen_ai.conversation.id`: Conversation/session identifier (when session-based)

**Recommended attributes:**
- `gen_ai.response.model`: Actual model that generated response (may differ from request)
- `gen_ai.usage.input_tokens`: Prompt token count
- `gen_ai.usage.output_tokens`: Completion token count
- `gen_ai.request.temperature`: Temperature parameter
- `gen_ai.request.max_tokens`: Maximum output tokens
- `gen_ai.request.top_p`: Top-p sampling parameter
- `gen_ai.request.frequency_penalty`: Frequency penalty setting
- `gen_ai.request.stop_sequences`: Stop sequences array
- `gen_ai.response.finish_reasons`: Array of finish reasons (e.g., ["stop"], ["length"])
- `gen_ai.response.id`: Unique completion identifier

**Opt-in attributes (contain sensitive data):**
- `gen_ai.input.messages`: Full chat history (JSON array)
- `gen_ai.output.messages`: Model response messages (JSON array)
- `gen_ai.system_instructions`: System prompt/instructions
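
To make the naming rule and attribute tiers concrete, the sketch below assembles the span name and attribute map for a chat call. `ChatCall` and `buildChatSpan` are hypothetical names for illustration; only the attribute keys come from the lists above, and optional attributes are emitted only when a value is present. Opt-in attributes with sensitive content are deliberately left out.

```typescript
type AttrValue = string | number | string[];

// Hypothetical input shape describing one chat request and its response.
interface ChatCall {
  provider: string; // e.g. "anthropic"
  requestModel: string; // e.g. "claude-3-5-sonnet"
  responseModel?: string;
  inputTokens?: number;
  outputTokens?: number;
  finishReasons?: string[];
  conversationId?: string;
}

function buildChatSpan(call: ChatCall): { name: string; attributes: Record<string, AttrValue> } {
  // Required and conditionally required attributes are always set.
  const attributes: Record<string, AttrValue> = {
    'gen_ai.operation.name': 'chat',
    'gen_ai.provider.name': call.provider,
    'gen_ai.request.model': call.requestModel,
  };
  // Recommended attributes are added only when known.
  const optional: Array<[string, AttrValue | undefined]> = [
    ['gen_ai.response.model', call.responseModel],
    ['gen_ai.usage.input_tokens', call.inputTokens],
    ['gen_ai.usage.output_tokens', call.outputTokens],
    ['gen_ai.response.finish_reasons', call.finishReasons],
    ['gen_ai.conversation.id', call.conversationId],
  ];
  for (const [key, value] of optional) {
    if (value !== undefined) attributes[key] = value;
  }
  // Span name follows `{gen_ai.operation.name} {gen_ai.request.model}`.
  return { name: `chat ${call.requestModel}`, attributes };
}
```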

#### GenAI Embeddings Spans (CLIENT)
Span name: `embeddings {gen_ai.request.model}`

**Required attributes:**
- `gen_ai.operation.name`: "embeddings"

**Recommended attributes:**
- `gen_ai.request.model`: Embedding model name (e.g., "text-embedding-3-small")
- `gen_ai.usage.input_tokens`: Input token count
- `gen_ai.request.encoding_formats`: Requested encoding formats (e.g., ["base64"])

#### Kubernetes API Spans (CLIENT)
Span name: `{http.request.method} {http.route}` or `k8s.{operation}`

**HTTP attributes:**
- `http.request.method`: "GET" | "POST" | "PUT" | "DELETE"
- `server.address`: Kubernetes API server address
- `server.port`: API server port (typically 443)
- `http.response.status_code`: HTTP status code

**Custom K8s attributes:**
- `k8s.operation`: Kubernetes operation type (e.g., "list_api_groups", "get_pod", "create_deployment")
- `k8s.resource.kind`: Resource kind (e.g., "Pod", "Deployment", "Service")
- `k8s.namespace`: Namespace (when applicable)
- `k8s.client`: "kubernetes-client"

#### HTTP Server Spans (SERVER)
Span name: `{http.request.method} {http.route}`

**Required attributes:**
- `http.request.method`: HTTP method (e.g., "POST", "GET")
- `url.path`: URL path (e.g., "/mcp/tools")
- `url.scheme`: "http" | "https"

**Recommended attributes:**
- `http.route`: Route template (e.g., "/mcp/tools")
- `server.address`: Server address
- `server.port`: Server port
- `client.address`: Client IP address
- `user_agent.original`: User agent string
- `http.response.status_code`: HTTP response status code

#### Span Kind Guidelines
- **SERVER**: HTTP/SSE entry points (incoming requests)
- **INTERNAL**: MCP tool execution, business logic, workflows
- **CLIENT**: AI provider calls, K8s API calls, vector DB queries

#### Future: Agentic System Conventions (Proposed)
When [Issue #2664](https://github.com/open-telemetry/semantic-conventions/issues/2664) is finalized:
- `gen_ai.task.*`: Task-level spans for multi-step workflows
- `gen_ai.action.*`: Action-level spans for execution steps
- `gen_ai.artifact.*`: Input/output artifacts
- `gen_ai.memory.*`: Persistent context storage