137-opentelemetry-tracing.md•21 KiB

# PRD: OpenTelemetry Tracing for MCP Server

**Created**: 2025-10-03
**Status**: Draft
**Owner**: Viktor Farcic
**GitHub Issue**: [#137](https://github.com/vfarcic/dot-ai/issues/137)
**Last Updated**: 2025-10-03

## Executive Summary

Add OpenTelemetry distributed tracing to the MCP server to provide vendor-neutral observability for debugging complex multi-step workflows, measuring AI provider performance, and understanding Kubernetes operation latency. This enables production-ready monitoring without infrastructure lock-in.

## Problem Statement

The DevOps AI Toolkit MCP server handles complex operations including:
- Multi-step workflows (buildPlatform: list → submitAnswers → execute)
- AI provider calls (Claude, OpenAI) with variable latency
- Kubernetes cluster operations (discovery, deployment, remediation)
- Session-based stateful interactions across tool calls
- HTTP/SSE and STDIO transport protocols

**Current Gap**: No distributed tracing capability to understand:
- Where time is spent in multi-tool workflows
- Which AI provider calls are slow or failing
- How Kubernetes API latency impacts user experience
- How errors correlate across complex request chains

**Impact**: Difficult to debug performance issues, optimize AI costs, and troubleshoot production incidents.

## Documentation Changes

### Files Created/Updated
- **`docs/observability-guide.md`** - New File - Complete guide for OpenTelemetry tracing, configuration, and usage
- **`docs/deployment-guide.md`** - Updated - Add tracing configuration for production deployments
- **`docs/development-guide.md`** - New File - Developer guide for adding instrumentation to new tools
- **`README.md`** - Project Overview - Add observability to core capabilities
- **`src/core/tracing/`** - Technical Implementation - OpenTelemetry instrumentation modules

### Content Location Map
- **Feature Overview**: See `docs/observability-guide.md` (Section: "What is Distributed Tracing")
- **Configuration**: See `docs/observability-guide.md` (Section: "Setup and Configuration")
- **Tool Instrumentation**: See `docs/development-guide.md` (Section: "Adding Tracing to Tools")
- **Production Deployment**: See `docs/deployment-guide.md` (Section: "Observability Configuration")
- **Trace Analysis**: See `docs/observability-guide.md` (Section: "Understanding Traces")
- **Integration Examples**: See `docs/observability-guide.md` (Section: "Backend Integration")

### User Journey Validation
- [ ] **Primary workflow** documented end-to-end: Enable tracing → Deploy MCP → View traces → Debug issues
- [ ] **Developer workflow** complete: Add instrumentation → Test locally → Verify traces → Deploy
- [ ] **Operations workflow** complete: Configure collector → Deploy server → Monitor traces → Troubleshoot
- [ ] **Cross-references** between development docs and observability docs work correctly

## Solution Overview

**Standard Server-Side OpenTelemetry Implementation**

Implement OpenTelemetry instrumentation following industry best practices:

1. **Auto-instrumentation**: HTTP, Express middleware tracing
2. **Manual instrumentation**: Tool execution, AI calls, Kubernetes operations
3. **Direct export**: Server exports traces to OTel collector (not through MCP protocol)
4. **Trace context propagation**: Correlate multi-step workflows and sessions
5. **Integration**: Extend existing Logger interface for trace context

**NOT implementing**: The controversial "send traces through MCP" approach from modelcontextprotocol/discussions/269.

## Implementation Requirements

### Core Functionality
- [ ] **HTTP/MCP request tracing**: Automatic span creation for all incoming requests
  - Documented in `docs/observability-guide.md` (Section: "Request Tracing")
- [ ] **Tool execution spans**: Each of the 6 MCP tools traced (recommend, version, testDocs, manageOrgData, remediate, buildPlatform)
  - Documented in `docs/development-guide.md` (Section: "Tool Spans")
- [ ] **Error tracking**: Integration with existing error-handling system
  - Documented in `docs/observability-guide.md` (Section: "Error Correlation")
- [ ] **Trace context propagation**: Session-based workflow correlation
  - Documented in `docs/development-guide.md` (Section: "Context Propagation")

### Deep Instrumentation
- [ ] **AI provider tracing**: Claude/OpenAI API call spans with latency/tokens
  - Documented in `docs/observability-guide.md` (Section: "AI Provider Metrics")
- [ ] **Kubernetes operations**: Cluster API calls, discovery, deployments
  - Documented in `docs/observability-guide.md` (Section: "Kubernetes Operations")
- [ ] **Multi-step workflows**: Trace buildPlatform intent mapping → script discovery → execution
  - Documented in `docs/development-guide.md` (Section: "Complex Workflows")
- [ ] **Session lifecycle**: Track session creation, continuity, and completion
  - Documented in `docs/observability-guide.md` (Section: "Session Tracking")

### Configuration & Deployment
- [ ] **Environment-based config**: OTEL_EXPORTER_OTLP_ENDPOINT, service name, sampling
  - Documented in `docs/deployment-guide.md` (Section: "Environment Variables")
- [ ] **Multiple exporters**: Console (dev), OTLP (production), Jaeger, Zipkin
  - Documented in `docs/observability-guide.md` (Section: "Exporter Configuration")
- [ ] **Sampling strategies**: Always-on (dev), probability-based (production)
  - Documented in `docs/deployment-guide.md` (Section: "Sampling Configuration")
- [ ] **Zero-config default**: Works out-of-box with console exporter for local development
  - Documented in `docs/development-guide.md` (Section: "Getting Started")

### Documentation Quality Requirements
- [ ] **All examples work**: Configuration examples validated in integration tests
- [ ] **Complete user journeys**: End-to-end workflows from setup to trace analysis documented
- [ ] **Consistent terminology**: OpenTelemetry terms used correctly across all documentation
- [ ] **Working cross-references**: All links between observability docs and core docs resolve correctly

### Success Criteria
- [ ] **Minimal overhead**: <2ms latency added per request with tracing enabled
- [ ] **Complete visibility**: All tool executions, AI calls, and K8s operations traced
- [ ] **Developer experience**: Simple API for adding spans to new tools
- [ ] **Production ready**: Configurable sampling, multiple backends, robust error handling
- [ ] **Zero infrastructure requirement**: Works with any OTel-compatible backend

## Implementation Progress

### Phase 1: Core Tracing Foundation [Status: ⏳ PENDING]
**Target**: Basic distributed tracing working for HTTP requests and tool execution

**Documentation Changes:**
- [ ] **`docs/observability-guide.md`**: Create comprehensive user guide with tracing concepts, setup, and usage
- [ ] **`docs/deployment-guide.md`**: Add tracing configuration section for production deployments
- [ ] **`README.md`**: Update capabilities section to mention observability and distributed tracing

**Implementation Tasks:**
- [ ] Add OpenTelemetry dependencies (`@opentelemetry/sdk-node`, `@opentelemetry/api`, `@opentelemetry/auto-instrumentations-node`)
- [ ] Create `src/core/tracing/tracer.ts` with initialization and configuration logic
- [ ] Implement request tracing for both STDIO and HTTP/SSE transports (HTTP middleware applies only to the latter)
- [ ] Add tool execution span wrapper for all 6 MCP tools
- [ ] Integrate with existing Logger interface for trace context injection
- [ ] Configure console exporter for local development
- [ ] Add environment variable configuration (OTEL_SERVICE_NAME, OTEL_EXPORTER_OTLP_ENDPOINT)

### Phase 2: Deep Instrumentation [Status: ⏳ PENDING]
**Target**: AI provider calls and Kubernetes operations fully traced

**Documentation Changes:**
- [ ] **`docs/development-guide.md`**: Create developer guide for adding instrumentation to new tools and operations
- [ ] **`docs/observability-guide.md`**: Add "AI Provider Metrics" and "Kubernetes Operations" sections
- [ ] **`docs/observability-guide.md`**: Document trace analysis workflows for common debugging scenarios

**Implementation Tasks:**
- [ ] Instrument Claude/OpenAI API calls in `src/core/claude.ts` with latency, token count, model attributes
- [ ] Add Kubernetes client instrumentation in `src/core/cluster-utils.ts` for API calls
- [ ] Trace cluster discovery operations in `src/core/discovery.ts` with resource counts
- [ ] Instrument deployment operations in `src/tools/deploy-manifests.ts`
- [ ] Add session lifecycle tracing with session ID propagation
- [ ] Implement trace context propagation across multi-step workflows (buildPlatform, remediate)
- [ ] Add custom span attributes for tool parameters and results

### Phase 3: Advanced Features & Production Readiness [Status: ⏳ PENDING]
**Target**: Production-grade observability with metrics, sampling, and multiple backends

**Documentation Changes:**
- [ ] **`docs/observability-guide.md`**: Add "Advanced Configuration", "Metrics", and "Production Best Practices" sections
- [ ] **`docs/deployment-guide.md`**: Document production sampling strategies and backend integration
- [ ] **Cross-file validation**: Ensure observability integrates seamlessly with deployment and development workflows

**Implementation Tasks:**
- [ ] Add OTLP, Jaeger, and Zipkin exporter support with auto-detection
- [ ] Implement configurable sampling strategies (always-on, probability-based, rate-limiting)
- [ ] Add OpenTelemetry Metrics API for request counts, durations, error rates
- [ ] Create custom metrics for AI token usage, K8s API call counts, tool execution frequency
- [ ] Implement trace baggage for user context propagation
- [ ] Add integration tests for tracing with mock OTel collector
- [ ] Performance benchmarking to validate <2ms overhead target

## Technical Implementation Checklist

### Architecture & Design
- [ ] Design tracer initialization with lazy loading to minimize startup overhead (src/core/tracing/tracer.ts)
- [ ] Create span factory with consistent attribute naming conventions (src/core/tracing/span-factory.ts)
- [ ] Design Logger integration for automatic trace context injection (src/core/tracing/logger-integration.ts)
- [ ] Plan exporter selection strategy based on environment variables (src/core/tracing/exporters.ts)
- [ ] Design sampling configuration with environment-based overrides (src/core/tracing/sampling.ts)
- [ ] Document tracing architecture and span hierarchy

### Development Tasks
- [ ] Implement `TracingService` class with start/stop lifecycle management
- [ ] Create `withSpan` utility for wrapping async operations with tracing
- [ ] Add `instrumentTool` decorator for automatic tool span creation
- [ ] Implement trace context extraction from MCP session IDs
- [ ] Build error tracking integration with existing `ErrorHandler` class
- [ ] Create span attribute helpers for consistent metadata

### Documentation Validation
- [ ] **Automated testing**: Configuration examples execute successfully in integration tests
- [ ] **Cross-file consistency**: Tracing terminology aligned across all documentation
- [ ] **User journey testing**: Complete setup-to-analysis workflows can be followed end-to-end
- [ ] **Link validation**: All references between observability docs and core documentation resolve correctly

### Quality Assurance
- [ ] Unit tests for tracer initialization and span creation (>90% coverage)
- [ ] Unit tests for exporter configuration and selection logic (>90% coverage)
- [ ] Integration tests with mock OpenTelemetry collector
- [ ] Performance tests validating <2ms overhead per request
- [ ] Load testing with tracing enabled on large-scale operations
- [ ] Trace data validation ensuring correct span relationships and attributes

## Dependencies & Blockers

### External Dependencies
- [ ] OpenTelemetry SDK and API packages (npm packages) - ✅ Available
- [ ] OpenTelemetry collector or compatible backend (optional for dev) - ✅ Console exporter works out-of-box
- [ ] Backend for production (Jaeger, Zipkin, Grafana Tempo, vendor services) - User choice

### Internal Dependencies
- [ ] Existing Logger interface for trace context integration - ✅ Available
- [ ] Error handling system for error span tracking - ✅ Available (src/core/error-handling.ts)
- [ ] MCP server with 6 tools for instrumentation - ✅ Available
- [ ] HTTP/SSE transport for request tracing - ✅ Available

### Current Blockers
- [ ] None currently identified - all dependencies are satisfied

## Risk Management

### Identified Risks
- [ ] **Risk**: Performance overhead impacting request latency | **Mitigation**: Benchmark early, implement sampling, use async exports | **Owner**: Developer
- [ ] **Risk**: Additional complexity in error handling and logging | **Mitigation**: Extend existing patterns, comprehensive testing | **Owner**: Developer
- [ ] **Risk**: Configuration complexity for users | **Mitigation**: Zero-config defaults, clear documentation, environment variable standards | **Owner**: Developer
- [ ] **Risk**: Vendor lock-in with specific backends | **Mitigation**: OpenTelemetry standard ensures portability, support multiple exporters | **Owner**: Developer

### Mitigation Actions
- [ ] Performance benchmarking in Phase 1 to validate overhead targets
- [ ] Developer guide with clear examples for adding instrumentation
- [ ] Default to console exporter for zero-config local development
- [ ] Support standard OTEL environment variables for backend-agnostic configuration

## Decision Log

### Open Questions
- [ ] Should we enable tracing by default or opt-in via environment variable?
- [ ] What default sampling rate for production (1%, 10%, 100%)?
- [ ] Should we include trace IDs in all log messages automatically?
- [ ] Which OpenTelemetry semantic conventions apply to MCP operations?

### Resolved Decisions
- [x] Standard server-side OTel implementation
  - **Decided**: 2025-10-03
  - **Rationale**: Industry best practice, avoids MCP protocol controversy, mature ecosystem
- [x] Direct trace export to collector
  - **Decided**: 2025-10-03
  - **Rationale**: Standard approach, avoids security concerns, better separation of concerns
- [x] Extend existing Logger interface
  - **Decided**: 2025-10-03
  - **Rationale**: Minimal disruption, automatic trace context in logs, familiar developer experience
- [x] Zero-config with console exporter default
  - **Decided**: 2025-10-03
  - **Rationale**: Excellent developer experience, no barrier to adoption, see traces immediately

## Scope Management

### In Scope (Current Version)
- [ ] HTTP request and tool execution tracing
- [ ] AI provider call instrumentation (Claude, OpenAI)
- [ ] Kubernetes operation tracing
- [ ] Session lifecycle and context propagation
- [ ] Console, OTLP, Jaeger, Zipkin exporters
- [ ] Configurable sampling strategies
- [ ] Integration with existing error handling and logging
- [ ] Developer utilities for adding instrumentation

### Out of Scope (Future Versions)
- [~] Custom trace visualization UI
- [~] Automatic anomaly detection in traces
- [~] Cost analysis and optimization recommendations
- [~] Trace-based alerting and notifications
- [~] Historical trace analysis and trend identification
- [~] Multi-tenant trace isolation

### Deferred Items
- [~] Custom visualization
  - **Reason**: Use existing OTel-compatible tools (Jaeger, Grafana)
  - **Target**: Not planned
- [~] Anomaly detection
  - **Reason**: Focus on instrumentation first, analysis tools exist
  - **Target**: Future enhancement
- [~] Cost optimization
  - **Reason**: Requires trace correlation with billing data
  - **Target**: v2.0
- [~] Alerting
  - **Reason**: Use existing observability platform alerting
  - **Target**: Not planned (external tool responsibility)

## Testing & Validation

### Test Coverage Requirements
- [ ] Unit tests for tracer initialization and configuration (>90% coverage)
- [ ] Unit tests for span factory and instrumentation utilities (>90% coverage)
- [ ] Integration tests with mock OpenTelemetry collector
- [ ] Performance tests validating <2ms overhead target
- [ ] Load tests with high-volume trace generation
- [ ] Trace data validation tests ensuring correct span relationships

### User Acceptance Testing
- [ ] Verify traces appear in console exporter during local development
- [ ] Test OTLP export to Jaeger/Zipkin backends
- [ ] Confirm AI provider spans include token counts and model information
- [ ] Validate Kubernetes operation spans include resource types and namespaces
- [ ] Verify error spans correctly capture exception details
- [ ] Test multi-step workflow trace correlation (buildPlatform, remediate)

## Documentation & Communication

### Documentation Completion Status
- [ ] **`docs/observability-guide.md`**: Complete - User guide with tracing concepts, setup, configuration, usage
- [ ] **`docs/development-guide.md`**: Complete - Developer guide for adding instrumentation to tools
- [ ] **`docs/deployment-guide.md`**: Updated - Added tracing configuration for production deployments
- [ ] **`README.md`**: Updated - Added observability to core capabilities
- [ ] **Cross-file consistency**: Complete - OpenTelemetry terminology and patterns aligned

### Communication & Training
- [ ] Team announcement of observability capabilities
- [ ] Create demo showing trace collection and analysis workflow
- [ ] Prepare documentation for interpreting traces and debugging with distributed tracing
- [ ] Establish guidelines for adding instrumentation to new tools and features

## Launch Checklist

### Pre-Launch
- [ ] All Phase 1 implementation tasks completed
- [ ] Performance overhead validated (<2ms per request)
- [ ] Console exporter working for local development
- [ ] Documentation and configuration examples completed
- [ ] Developer guide tested with new tool instrumentation

### Launch
- [ ] Deploy tracing-enabled MCP server to staging environment
- [ ] Monitor performance metrics and overhead
- [ ] Validate trace data quality and completeness
- [ ] Collect team feedback on developer experience

### Post-Launch
- [ ] Analyze trace data to identify performance bottlenecks
- [ ] Monitor overhead and optimize if needed
- [ ] Iterate on instrumentation based on usage insights
- [ ] Plan Phase 2 enhancements (AI/K8s deep instrumentation)

## Work Log

### 2025-10-03: Initial PRD Creation
**Duration**: ~45 minutes
**Primary Focus**: Research OpenTelemetry integration and create comprehensive PRD

**Completed Work**:
- Researched OpenTelemetry MCP integration patterns and community discussions
- Analyzed existing MCP server architecture and logging infrastructure
- Created GitHub issue #137 for OpenTelemetry tracing feature
- Developed comprehensive PRD following documentation-first approach
- Structured implementation as 3 major phases with clear milestones

**Key Decisions**:
- **Standard server-side implementation**: Avoiding controversial MCP protocol trace forwarding
- **Extend existing patterns**: Building on current Logger and ErrorHandler infrastructure
- **Zero-config defaults**: Console exporter for immediate local development value
- **Vendor-neutral**: OpenTelemetry standard ensures backend portability

**Next Steps**: Ready for implementation of Phase 1 - Core Tracing Foundation

---

## Appendix

### Supporting Materials
- [OpenTelemetry Documentation](https://opentelemetry.io/docs/) - Official OTel documentation
- [OpenTelemetry JavaScript SDK](https://opentelemetry.io/docs/languages/js/) - Node.js implementation guide
- [MCP OpenTelemetry Discussion #269](https://github.com/modelcontextprotocol/modelcontextprotocol/discussions/269) - Community discussion on tracing
- [Existing Error Handling System](./src/core/error-handling.ts) - Current logging and error infrastructure

### Research Findings
- OpenTelemetry is becoming a standard for AI agent observability (2025 trend)
- Standard server-side implementation preferred over MCP protocol forwarding
- Minimal overhead (<2ms) achievable with proper async export configuration
- Strong ecosystem support with multiple backend options (Jaeger, Grafana Tempo, vendors)
- Natural integration with existing Logger interface patterns

### Example Trace Output

Illustrative console exporter output during local development (IDs are placeholders):

```json
{
  "traceId": "a1b2c3d4e5f60718293a4b5c6d7e8f90",
  "spanId": "b1c2d3e4f5061728",
  "name": "recommend",
  "kind": "INTERNAL",
  "timestamp": "2025-10-03T10:15:30.123Z",
  "duration": 2456,
  "attributes": {
    "tool.name": "recommend",
    "tool.stage": "recommend",
    "intent": "deploy PostgreSQL database"
  },
  "status": { "code": "OK" }
}
```

### Implementation References
- `@opentelemetry/sdk-node` - Core SDK for Node.js
- `@opentelemetry/api` - OpenTelemetry API
- `@opentelemetry/auto-instrumentations-node` - Automatic HTTP/Express instrumentation
- `@opentelemetry/exporter-trace-otlp-http` - OTLP exporter for production
- `@opentelemetry/exporter-jaeger` - Jaeger exporter (deprecated upstream; recent Jaeger versions accept OTLP directly)
- `@opentelemetry/exporter-zipkin` - Zipkin exporter

### Semantic Conventions for MCP Operations
- `service.name`: "dot-ai-mcp"
- `mcp.tool.name`: Tool being executed
- `mcp.session.id`: Session identifier for stateful interactions
- `mcp.transport`: "stdio" | "http"
- `ai.provider`: "anthropic" | "openai"
- `ai.model`: Model identifier (e.g., "claude-3-5-sonnet")
- `ai.tokens.prompt`: Prompt token count
- `ai.tokens.completion`: Completion token count
- `k8s.operation`: Kubernetes operation type
- `k8s.resource.kind`: Resource kind being operated on
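
The attribute names listed under "Semantic Conventions for MCP Operations" could be kept consistent with a small set of helpers, in the spirit of the "span attribute helpers" task. The following is a minimal TypeScript sketch, not the project's implementation: the function and type names (`toolAttributes`, `aiCallAttributes`, `compact`, and the input interfaces) are hypothetical, and the sketch deliberately avoids importing the OpenTelemetry SDK so that it runs on its own.

```typescript
// Hypothetical helpers: none of these names come from the dot-ai codebase.
// Only the attribute keys are taken from the PRD's semantic conventions list.
type AttributeValue = string | number | boolean;
type SpanAttributes = Record<string, AttributeValue>;

interface ToolSpanInput {
  toolName: string;
  transport: "stdio" | "http";
  sessionId?: string; // absent for stateless calls
}

interface AiCallInput {
  provider: "anthropic" | "openai";
  model: string;
  promptTokens?: number;
  completionTokens?: number;
}

// Drop undefined values so optional fields never show up as empty attributes.
function compact(entries: Record<string, AttributeValue | undefined>): SpanAttributes {
  const out: SpanAttributes = {};
  for (const key of Object.keys(entries)) {
    const value = entries[key];
    if (value !== undefined) {
      out[key] = value;
    }
  }
  return out;
}

// Attributes for a tool execution span.
function toolAttributes(input: ToolSpanInput): SpanAttributes {
  return compact({
    "mcp.tool.name": input.toolName,
    "mcp.transport": input.transport,
    "mcp.session.id": input.sessionId,
  });
}

// Attributes for an AI provider call span.
function aiCallAttributes(input: AiCallInput): SpanAttributes {
  return compact({
    "ai.provider": input.provider,
    "ai.model": input.model,
    "ai.tokens.prompt": input.promptTokens,
    "ai.tokens.completion": input.completionTokens,
  });
}
```

With the real SDK, the returned objects could presumably be passed to a span's `setAttributes` call. `service.name` is left out on purpose because OpenTelemetry treats it as a resource attribute, typically set once through `OTEL_SERVICE_NAME`, and not per span.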
