# Why DevOps AI Toolkit?

**Understanding the unique value of specialized DevOps intelligence over general-purpose AI assistants.**

---

## The Question

With powerful AI assistants like Claude Code available, why use a specialized DevOps toolkit? Can't you just use Claude Code with kubectl and API calls?

**Short answer**: You can - for simple tasks. But for production-grade DevOps operations, you need **organizational context**, **autonomous operations**, and **specialized intelligence** that general-purpose AI cannot provide.

---

## Architecture Comparison

### General-Purpose AI + Manual API Calls

```mermaid
flowchart TB
    subgraph User["User Terminal"]
        CC[Claude Code]
    end
    subgraph Manual["Manual Operations"]
        Bash[Bash/kubectl]
        API[API Calls]
    end
    subgraph Cluster["Kubernetes Cluster"]
        K8s[Kubernetes API]
    end
    CC --> Bash
    CC --> API
    Bash --> K8s
    API --> K8s
    style CC fill:#e1bee7,stroke:#6a1b9a,color:#000
    style Manual fill:#fff3e0,stroke:#e65100,color:#000
    style K8s fill:#326ce5,stroke:#1565c0,color:#fff
```

**Characteristics:**

- Generic AI with no DevOps-specific training
- Manual kubectl commands and API calls
- No persistent state between sessions
- No organizational context
- Human must be present for all operations

### DevOps AI Toolkit Ecosystem

```mermaid
flowchart TB
    subgraph User["User Terminal"]
        CC[Claude Code + MCP]
    end
    subgraph MCP["dot-ai MCP Server"]
        Tools[9 Specialized Tools]
        Prompts[46 DevOps Prompts]
        Sessions[Session Management]
        Vector[(Qdrant Vector DB)]
    end
    subgraph Controller["dot-ai-controller"]
        Remediation[Event-Driven Remediation]
        Solutions[Solution Tracking]
        Sync[Resource Sync]
        Capabilities[Capability Discovery]
    end
    subgraph UI["dot-ai-ui"]
        Viz[Interactive Visualizations]
        Dashboard[Resource Dashboard]
    end
    subgraph Cluster["Kubernetes Cluster"]
        K8s[Kubernetes API]
        Events[Cluster Events]
        CRDs[Custom Resources]
    end
    CC <--> Tools
    Tools <--> Vector
    Tools <--> K8s
    Events --> Remediation
    CRDs --> Capabilities
    Remediation --> Tools
    Capabilities --> Tools
    Sync --> Vector
    Tools --> Viz
    K8s --> Dashboard
    style CC fill:#e1bee7,stroke:#6a1b9a,color:#000
    style MCP fill:#c8e6c9,stroke:#2e7d32,color:#000
    style Controller fill:#bbdefb,stroke:#1565c0,color:#000
    style UI fill:#e1bee7,stroke:#6a1b9a,color:#000
    style K8s fill:#326ce5,stroke:#1565c0,color:#fff
```

**Characteristics:**

- Specialized DevOps intelligence
- Persistent organizational knowledge
- Autonomous operations (controller)
- Rich visualizations
- Multi-step workflow support

---

## Key Differentiators

### 1. Organizational Context & Knowledge Management

```mermaid
flowchart LR
    subgraph Generic["General-Purpose AI"]
        G1[Each session starts fresh]
        G2[No org patterns]
        G3[No policy awareness]
        G4[Must re-explain context]
    end
    subgraph Toolkit["DevOps AI Toolkit"]
        subgraph Knowledge["Persistent Knowledge Base"]
            Patterns[(Deployment Patterns)]
            Policies[(Governance Policies)]
            Caps[(Cluster Capabilities)]
            Resources[(Resource Index)]
        end
        T1[Context automatically applied]
        T2[Semantic search]
        T3[Team knowledge compounds]
    end
    Knowledge --> T1
    Knowledge --> T2
    Knowledge --> T3
    style Generic fill:#ffcdd2,stroke:#c62828,color:#000
    style Toolkit fill:#c8e6c9,stroke:#2e7d32,color:#000
    style Knowledge fill:#fff9c4,stroke:#f9a825,color:#000
```

| Capability | General-Purpose AI | DevOps AI Toolkit |
|------------|-------------------|-------------------|
| Deployment patterns | None - starts fresh | Vector DB stores org patterns |
| Policy enforcement | Manual checks | Automatic policy matching |
| Resource capabilities | Must discover each time | Indexed with semantic search |
| Historical context | Conversation only | Persistent across sessions |
| Team knowledge | Not captured | Stores rationale & best practices |

**Example**: When you ask to "deploy a database", the toolkit automatically:

1. Searches your organization's database deployment patterns
2. Applies relevant governance policies
3. Matches against discovered cluster capabilities
4. Recommends solutions that fit your organization's standards

### 2. Autonomous Operations

```mermaid
flowchart TB
    subgraph Event["Kubernetes Event"]
        Warning[Warning: FailedScheduling]
    end
    subgraph Controller["dot-ai-controller"]
        Watch[Event Watcher]
        Filter[Event Filter]
        Rate[Rate Limiter]
    end
    subgraph MCP["MCP Server"]
        Remediate[Remediate Tool]
        AI[AI Analysis]
    end
    subgraph Actions["Remediation"]
        Analyze[Root Cause Analysis]
        Fix[Apply Fix]
        Notify[Slack/Google Chat]
    end
    Warning --> Watch
    Watch --> Filter
    Filter --> Rate
    Rate --> Remediate
    Remediate --> AI
    AI --> Analyze
    Analyze --> Fix
    Analyze --> Notify
    style Event fill:#ffcdd2,stroke:#c62828,color:#000
    style Controller fill:#bbdefb,stroke:#1565c0,color:#000
    style MCP fill:#c8e6c9,stroke:#2e7d32,color:#000
    style Actions fill:#fff9c4,stroke:#f9a825,color:#000
```

**This is impossible with general-purpose AI.** Claude Code only operates when you're actively using it. The dot-ai-controller provides 24/7 autonomous capabilities:

| CRD | Function |
|-----|----------|
| **RemediationPolicy** | Watches events, triggers AI analysis, auto-fixes issues |
| **Solution** | Tracks deployed resources, manages lifecycle |
| **ResourceSyncConfig** | Keeps vector DB synchronized with cluster state |
| **CapabilityScanConfig** | Auto-discovers new CRDs and operators |

### 3. Multi-Step Workflow Support

```mermaid
sequenceDiagram
    participant User
    participant MCP as DevOps AI Toolkit
    participant AI as AI Engine
    participant K8s as Kubernetes
    User->>MCP: "Deploy PostgreSQL with HA"
    MCP->>AI: Analyze intent + org patterns
    AI->>MCP: 3 recommended solutions
    MCP->>User: Present options with trade-offs
    User->>MCP: "Choose option 2"
    MCP->>AI: Generate configuration questions
    AI->>MCP: Required parameters
    MCP->>User: Ask about storage, replicas, etc.
    User->>MCP: Provide answers
    MCP->>AI: Apply org policies + generate manifests
    AI->>MCP: Complete YAML with dry-run validation
    MCP->>User: Show manifests + validation results
    User->>MCP: "Deploy it"
    MCP->>K8s: Apply manifests
    K8s->>MCP: Deployment status
    MCP->>User: Success + documentation URL
```

**General-purpose AI workflow:**

```
User: "Deploy postgres with HA"
AI: *suggests kubectl commands*
User: *runs commands, gets errors*
AI: *debugs errors*
User: *runs more commands*
... (manual orchestration continues)
```

**DevOps AI Toolkit workflow:**

```
recommend → chooseSolution → answerQuestion → generateManifests → deployManifests
```

Each step maintains session state, applies organizational context, and validates before proceeding.

### 4. Security Through Controlled Tool Access

```mermaid
flowchart TB
    subgraph Analysis["Analysis Phase"]
        direction TB
        A1[kubectl get]
        A2[kubectl describe]
        A3[kubectl logs]
        A4[kubectl top]
    end
    subgraph Remediation["Remediation Phase"]
        direction TB
        R1[kubectl apply]
        R2[kubectl delete]
        R3[kubectl scale]
        R4[kubectl rollout]
    end
    subgraph GenericAI["General-Purpose AI"]
        All[Full bash access All commands available No restrictions]
    end
    User([User Request]) --> Analysis
    Analysis -->|User approves| Remediation
    style Analysis fill:#c8e6c9,stroke:#2e7d32,color:#000
    style Remediation fill:#fff9c4,stroke:#f9a825,color:#000
    style GenericAI fill:#ffcdd2,stroke:#c62828,color:#000
```

**This is a critical security differentiator.** General-purpose AI assistants have unrestricted access to all bash commands. The DevOps AI Toolkit implements **phase-based tool restrictions**:

| Workflow Phase | Available Tools | Why |
|----------------|-----------------|-----|
| **Analysis** | Read-only: `kubectl get`, `describe`, `logs`, `top` | Safe exploration without risk |
| **User Decision** | None - waiting for approval | Human-in-the-loop checkpoint |
| **Remediation** | Write: `kubectl apply`, `delete`, `scale`, `rollout` | Only after explicit approval |

**How it works:**

1. **During analysis**, AI can only use read-only kubectl tools - it cannot modify cluster state even if it wanted to
2. **User reviews** the analysis and proposed remediation
3. **Only after approval** are write tools attached to the AI context
4. **Each workflow step** has a specific, limited tool set

```mermaid
sequenceDiagram
    participant User
    participant MCP as DevOps AI Toolkit
    participant AI as AI Engine
    participant K8s as Kubernetes
    User->>MCP: "Fix the failing pod"
    Note over MCP: Attach read-only tools only
    MCP->>AI: Analyze with kubectl get/describe/logs
    AI->>K8s: kubectl get pods (read)
    AI->>K8s: kubectl describe pod (read)
    AI->>K8s: kubectl logs pod (read)
    AI->>MCP: Root cause + proposed fix
    MCP->>User: "Found issue. Apply fix?"
    Note over User: Human decision point
    User->>MCP: "Yes, apply the fix"
    Note over MCP: Now attach write tools
    MCP->>AI: Execute with kubectl apply/scale
    AI->>K8s: kubectl apply -f fix.yaml (write)
    AI->>MCP: Remediation complete
```

**Benefits:**

- **Blast radius limitation** - AI mistakes during analysis cannot modify cluster state
- **Audit trail** - Clear separation between what AI observed vs what it changed
- **Compliance** - Meets security requirements for human approval before changes
- **Confidence** - Users can let AI investigate freely knowing it cannot break anything

**Comparison:**

| Aspect | General-Purpose AI | DevOps AI Toolkit |
|--------|-------------------|-------------------|
| Tool access | All bash commands always | Phase-restricted tool sets |
| Analysis safety | Could accidentally modify | Read-only tools only |
| Change approval | Implicit (runs what you ask) | Explicit human checkpoint |
| Blast radius | Unlimited | Limited by workflow phase |

### 5. Reliability Through Deterministic Operations

```mermaid
flowchart LR
    subgraph AgentBased["Agent-Based (Unpredictable)"]
        direction TB
        LLM1[LLM decides what to fetch]
        LLM2[LLM decides how to process]
        LLM3[LLM decides what to return]
    end
    subgraph Hybrid["DevOps AI Toolkit (Hybrid)"]
        direction TB
        Code[Code executes operations]
        Inject[Data injected to context]
        AI[AI reasons with complete info]
    end
    style AgentBased fill:#ffcdd2,stroke:#c62828,color:#000
    style Hybrid fill:#c8e6c9,stroke:#2e7d32,color:#000
```

**The toolkit uses a hybrid architecture** that combines deterministic code execution with AI reasoning - not pure agent-based operations where AI decides everything.
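The hybrid split can be illustrated with a minimal Python sketch. It is not the toolkit's implementation: the collector functions, their canned return values, and the prompt layout are all hypothetical, chosen only to show the shape of the idea. Code runs a fixed set of collectors every time, and the model only ever sees the assembled result.

```python
from typing import Callable, Dict, List

# Hypothetical collectors. A real system would query a vector DB or the
# Kubernetes API; canned data keeps this sketch self-contained.
def fetch_patterns(intent: str) -> List[str]:
    return ["Databases run as operator-managed clusters with 3 replicas"]

def fetch_policies(intent: str) -> List[str]:
    return ["All workloads must set resource limits"]

def fetch_capabilities(intent: str) -> List[str]:
    return ["postgresql.example.org/v1 Cluster"]

# Fixed pipeline: the set of collectors is decided by code, not by the model.
COLLECTORS: Dict[str, Callable[[str], List[str]]] = {
    "patterns": fetch_patterns,
    "policies": fetch_policies,
    "capabilities": fetch_capabilities,
}

def build_context(intent: str) -> Dict[str, List[str]]:
    """Deterministic step: run every collector, never skip one."""
    return {name: collect(intent) for name, collect in COLLECTORS.items()}

def render_prompt(intent: str, context: Dict[str, List[str]]) -> str:
    """Inject the collected data into the text handed to the model."""
    sections = [f"User intent: {intent}"]
    for name, items in context.items():
        sections.append(f"## {name}")
        sections.extend(f"- {item}" for item in items)
    return "\n".join(sections)

context = build_context("deploy a database")
prompt = render_prompt("deploy a database", context)
print(prompt)
```

The model call itself is left out on purpose. The point of the sketch is that whatever reasoning happens next starts from the same three sections on every request.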
#### Code-Based Operations vs Agent Operations

| Approach | General-Purpose AI | DevOps AI Toolkit |
|----------|-------------------|-------------------|
| Data collection | AI decides what to fetch | Code fetches required data |
| Processing | AI interprets raw output | Code parses and structures |
| Consistency | Varies by conversation | Deterministic execution |
| Reliability | Depends on AI's choices | Guaranteed operations |

**Example - Capability Discovery:**

```mermaid
flowchart TB
    subgraph Agent["Pure Agent Approach"]
        A1[AI: Should I check CRDs?]
        A2[AI: Which kubectl command?]
        A3[AI: How to parse output?]
        A4[AI: What's important?]
        A1 --> A2 --> A3 --> A4
    end
    subgraph Toolkit["DevOps AI Toolkit"]
        T1[Code: kubectl get crds]
        T2[Code: Parse to structured data]
        T3[Code: Extract schemas]
        T4[AI: Reason about capabilities]
        T1 --> T2 --> T3 --> T4
    end
    style Agent fill:#ffcdd2,stroke:#c62828,color:#000
    style Toolkit fill:#c8e6c9,stroke:#2e7d32,color:#000
```

- **Pure agent**: AI might forget to check CRDs, use wrong commands, or miss important fields
- **Toolkit**: Code reliably collects all CRDs, parses them correctly, then AI reasons about the structured result

#### Context Injection vs Tool-Based Retrieval

```mermaid
flowchart TB
    subgraph ToolBased["Tool-Based Retrieval"]
        Q1[AI receives user query]
        Q2{AI decides: fetch patterns?}
        Q3{AI decides: fetch policies?}
        Q4{AI decides: fetch capabilities?}
        Q5[AI might miss critical context]
        Q1 --> Q2
        Q2 -->|maybe| Q3
        Q3 -->|maybe| Q4
        Q4 --> Q5
    end
    subgraph Injected["Context Injection"]
        I1[User query arrives]
        I2[Code: Fetch relevant patterns]
        I3[Code: Fetch relevant policies]
        I4[Code: Fetch capabilities]
        I5[AI receives complete context]
        I1 --> I2
        I2 --> I3
        I3 --> I4
        I4 --> I5
    end
    style ToolBased fill:#ffcdd2,stroke:#c62828,color:#000
    style Injected fill:#c8e6c9,stroke:#2e7d32,color:#000
```

| Aspect | Tool-Based Retrieval | Context Injection |
|--------|---------------------|-------------------|
| Data availability | AI might not call the tool | Always present in context |
| Consistency | Varies by AI's judgment | Guaranteed inclusion |
| Org patterns | AI might forget to check | Always included for recommendations |
| Policies | AI might skip policy lookup | Always enforced |
| Capabilities | AI might miss some | Complete set provided |

**Why this matters:**

When a user asks "deploy a database", the toolkit:

1. **Code** fetches matching patterns from vector DB (not left to AI's discretion)
2. **Code** fetches applicable policies (guaranteed, not optional)
3. **Code** fetches cluster capabilities (complete, not partial)
4. **AI** receives all context and reasons about the best solution

A pure agent approach might:

- Forget to check organizational patterns
- Skip policy validation
- Miss available operators
- Give inconsistent recommendations

**The result**: Predictable, policy-compliant recommendations every time - not just when the AI "remembers" to check.

### 6. Specialized DevOps Intelligence

| Capability | General-Purpose AI | DevOps AI Toolkit |
|------------|-------------------|-------------------|
| Kubernetes expertise | Generic knowledge | 46 specialized prompts |
| Deployment recommendations | Manual research | AI recommends based on capabilities |
| Operator awareness | Must discover manually | Auto-detects Crossplane, CAPI, Kyverno, KEDA |
| Helm chart selection | Manual ArtifactHub search | AI-powered chart selection |
| Remediation guidance | Generic troubleshooting | Structured analysis with confidence scores |

**The 9 specialized MCP tools:**

| Tool | Purpose |
|------|---------|
| `recommend` | AI-powered deployment recommendations |
| `query` | Natural language cluster exploration |
| `remediate` | Root cause analysis and remediation |
| `operate` | Day 2 operations (scale, update, rollback) |
| `manageOrgData` | Pattern, policy, and capability management |
| `projectSetup` | Repository governance automation |
| `chooseSolution` | Solution selection with configuration |
| `answerQuestion` | Multi-step Q&A workflow |
| `version` | System health and diagnostics |

### 7. Full Operational Dashboard (Not Just Visualizations)

The toolkit is evolving from returning visualization URLs to providing a **complete Kubernetes operational dashboard** with AI deeply integrated.

```mermaid
flowchart TB
    subgraph Terminal["General-Purpose AI"]
        Text[Plain text output]
        Manual[Manual kubectl commands]
    end
    subgraph Evolution["DevOps AI Toolkit Evolution"]
        subgraph Phase1["Current: Visualization URLs"]
            V1[MCP returns visualization URL]
            V2[User opens in browser]
            V3[Mermaid, Cards, Tables, Code]
        end
        subgraph Phase2["Upcoming: Full Dashboard"]
            D1[Kubernetes Resource Browser]
            D2[AI-Powered Actions]
            D3[Real-time Status]
            D4[Integrated Troubleshooting]
        end
        Phase1 --> Phase2
    end
    style Terminal fill:#ffcdd2,stroke:#c62828,color:#000
    style Phase1 fill:#fff9c4,stroke:#f9a825,color:#000
    style Phase2 fill:#c8e6c9,stroke:#2e7d32,color:#000
```

#### Current: AI-Generated Visualizations

MCP tools return visualization URLs for complex output:

- **Mermaid diagrams** - topology, workflows, dependencies
- **Card grids** - solution comparison with status indicators
- **Syntax-highlighted code** - YAML manifests with copy
- **Data tables** - resources with AI-driven status coloring
- **Bar charts** - resource metrics visualization

#### Upcoming: Full Kubernetes Dashboard

The dashboard transforms from visualization-only to a **complete operational interface**:

```mermaid
flowchart LR
    subgraph Dashboard["dot-ai-ui Dashboard"]
        Sidebar[Resource Sidebar All K8s kinds]
        List[Resource Lists Dynamic columns]
        Detail[Resource Detail Tabs: Overview, YAML, Events, Logs]
        Actions[AI Action Bar Query, Remediate, Operate]
    end
    subgraph MCP["MCP as Backend"]
        API[REST API Endpoints]
        Tools[AI Tools]
        Vector[(Qdrant)]
    end
    subgraph K8s["Kubernetes"]
        Resources[Live Resources]
        Events[Events]
        Logs[Pod Logs]
    end
    Sidebar --> API
    List --> API
    Detail --> API
    Actions --> Tools
    API --> Vector
    API --> Resources
    Tools --> K8s
    style Dashboard fill:#e1bee7,stroke:#6a1b9a,color:#000
    style MCP fill:#c8e6c9,stroke:#2e7d32,color:#000
    style K8s fill:#326ce5,stroke:#1565c0,color:#fff
```

**Dashboard Features:**

| Feature | Description |
|---------|-------------|
| **Resource Browser** | Sidebar showing all resource kinds (Pods, Deployments, CRDs) with counts |
| **Dynamic Tables** | Columns auto-generated from Kubernetes printer columns |
| **Resource Detail** | Tabbed view: Overview, Metadata, Spec, Status, YAML, Events, Logs |
| **Namespace Filtering** | Quick namespace selector for scoping views |
| **Multi-Select** | Select multiple resources for batch AI analysis |
| **AI Action Bar** | Context-aware buttons: Query, Remediate, Operate, Recommend |
| **Status Coloring** | AI-driven problem indication (red/yellow/green) |
| **Pod Logs** | Container logs with multi-container support |
| **Events Timeline** | Kubernetes events for any resource |

**MCP as Backend:**

The MCP server provides REST APIs that power the dashboard:

```
GET  /api/v1/resources/kinds   → Sidebar navigation
GET  /api/v1/resources         → Resource tables with live status
GET  /api/v1/resource          → Single resource detail (full spec/status)
GET  /api/v1/events            → Kubernetes events for a resource
GET  /api/v1/logs              → Pod container logs
GET  /api/v1/namespaces        → Namespace dropdown
POST /api/v1/tools/query       → AI-powered cluster analysis
POST /api/v1/tools/remediate   → AI-powered troubleshooting
```

**AI Integration in Dashboard:**

```mermaid
sequenceDiagram
    participant User
    participant Dashboard
    participant MCP
    participant AI
    participant K8s
    User->>Dashboard: Click "Analyze" on Deployment
    Dashboard->>MCP: POST /tools/query with context
    MCP->>K8s: Gather resource state
    MCP->>AI: Analyze with read-only tools
    AI->>MCP: Structured analysis
    MCP->>Dashboard: Visualization data
    Dashboard->>User: Inline results with status colors
    User->>Dashboard: Click "Remediate"
    Note over Dashboard: Phase-restricted tools activate
    Dashboard->>MCP: POST /tools/remediate
    MCP->>AI: Analyze with write tools available
    AI->>MCP: Remediation plan
    MCP->>Dashboard: Actions with approval gates
```

**Key Differentiator:** The dashboard isn't just a visualization layer - it's an **AI-native operations interface** where:

- Resource context flows automatically to AI tools
- AI results render inline with status-based styling
- Tool restrictions (read-only vs write) are enforced
- Human approval gates are built into the workflow

### 8. Semantic Search & Natural Language Queries

```mermaid
flowchart TB
    subgraph Query["Natural Language Query"]
        Q1["What resources are consuming the most memory?"]
    end
    subgraph Processing["Query Processing"]
        Parse[Parse Intent]
        Tools[Select kubectl Tools]
        Execute[Execute Commands]
        Correlate[Correlate Results]
    end
    subgraph Results["Intelligent Results"]
        Answer[Structured Answer]
        Viz[Visualization]
        Actions[Suggested Actions]
    end
    Q1 --> Parse
    Parse --> Tools
    Tools --> Execute
    Execute --> Correlate
    Correlate --> Answer
    Correlate --> Viz
    Correlate --> Actions
    style Query fill:#bbdefb,stroke:#1565c0,color:#000
    style Processing fill:#c8e6c9,stroke:#2e7d32,color:#000
    style Results fill:#e1bee7,stroke:#6a1b9a,color:#000
```

Instead of:

```bash
kubectl top pods --all-namespaces | sort -k4 -rn | head -10
kubectl get hpa --all-namespaces
kubectl describe node | grep -A5 "Allocated resources"
```

Just ask:

```
"What resources in production are consuming the most memory?"
```

The AI uses multiple kubectl tools, correlates the data, and provides a comprehensive answer with visualization.
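To make the "correlate" step concrete, here is a small Python sketch of the kind of post-processing that can sit behind such a question. It is an illustration under stated assumptions, not the toolkit's code: the sample text imitates the column layout of `kubectl top pods --all-namespaces`, the pod names and figures are invented, and the helpers `parse_memory` and `top_memory` are hypothetical.

```python
# Invented sample in the column layout of `kubectl top pods --all-namespaces`.
SAMPLE_OUTPUT = """\
NAMESPACE    NAME        CPU(cores)   MEMORY(bytes)
production   api-7d9f    120m         512Mi
production   db-0        300m         2Gi
staging      api-5c2a    15m          96Mi
"""

# Binary suffixes used by Kubernetes memory quantities.
UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}

def parse_memory(value: str) -> int:
    """Convert a quantity such as '512Mi' to bytes."""
    for suffix, factor in UNITS.items():
        if value.endswith(suffix):
            return int(value[: -len(suffix)]) * factor
    return int(value)  # plain byte count, no suffix

def top_memory(output: str, namespace: str, limit: int = 10):
    """Return (pod, bytes) pairs for one namespace, largest first."""
    rows = []
    for line in output.splitlines()[1:]:  # skip the header row
        if not line.strip():
            continue
        ns, name, _cpu, mem = line.split()
        if ns == namespace:
            rows.append((name, parse_memory(mem)))
    return sorted(rows, key=lambda row: row[1], reverse=True)[:limit]

result = top_memory(SAMPLE_OUTPUT, "production")
print(result)
```

Converting to bytes before sorting matters here: a plain numeric sort on the raw column would rank `512Mi` above `2Gi`.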
---

## When to Use Each Approach

### Use General-Purpose AI When:

- Simple, one-off kubectl operations
- Ad-hoc troubleshooting that doesn't need automation
- Quick prototyping before formalizing patterns
- Environments without MCP support

### Use DevOps AI Toolkit When:

- You want to codify deployment patterns
- Teams need consistent policy enforcement
- Autonomous remediation is desired (24/7 operations)
- Rich visualizations improve understanding
- Semantic search over resources is valuable
- Multi-step deployment workflows are common
- Knowledge sharing across team members matters
- Operator-heavy environments (Crossplane, CAPI, etc.)

---

## Quantified Comparison

| Metric | General-Purpose AI | DevOps AI Toolkit |
|--------|-------------------|-------------------|
| Specialized MCP tools | 0 | 9 |
| DevOps prompts | 0 | 46 |
| Kubernetes CRDs | 0 | 4 |
| Visualization types | 0 (text only) | 6 (Mermaid, Cards, Tables, Code, Charts, Dashboard) |
| Vector collections | 0 | 4 |
| Autonomous operations | None | Event-driven |
| Session persistence | Conversation only | Full workflow state |
| Tool access control | Unrestricted | Phase-restricted |
| Human approval gates | None | Built-in checkpoints |
| Data collection | Agent-decided | Code-guaranteed |
| Context availability | Tool-dependent | Injected automatically |
| Operation consistency | Variable | Deterministic |
| Web dashboard | None | Full K8s resource browser with AI actions |
| REST API endpoints | 0 | 8+ (resources, events, logs, tools) |

---

## Summary

**General-purpose AI** is capable for simple operations and ad-hoc tasks.

**DevOps AI Toolkit** transforms Kubernetes operations into an intelligent, autonomous, and organization-aware system:

1. **Reduces cognitive load** - AI handles complexity, presents options clearly
2. **Enforces consistency** - Patterns and policies applied automatically
3. **Operates autonomously** - Responds to events without human presence
4. **Captures knowledge** - Organizational expertise persists and compounds
5. **Accelerates onboarding** - New team members benefit from codified patterns
6. **Provides operational visibility** - Full dashboard with AI-native actions
7. **Guarantees safety** - Phase-restricted tools and human approval gates

The toolkit is not a replacement for AI assistants - it's a specialized enhancement layer that makes AI dramatically more effective for DevOps and Kubernetes operations. With the upcoming full dashboard, it becomes a **complete operational interface** where AI assistance is seamlessly integrated into everyday cluster management.

---

## Next Steps

- [Quick Start Guide](quick-start.md) - Get started in minutes
- [Tools Overview](guides/mcp-tools-overview.md) - Explore all available tools
- [Pattern Management](guides/pattern-management-guide.md) - Codify your deployment patterns
- [Capability Management](guides/mcp-capability-management-guide.md) - Discover cluster capabilities
