# Enhanced ARC MCP Server - Comprehensive Troubleshooting Guide
## Overview
This enhanced version of the ARC MCP server includes comprehensive troubleshooting capabilities based on real-world experience with ARC installations and cleanup operations. The system can automatically detect, diagnose, and fix common issues without requiring manual command-line intervention.
## Enhanced Troubleshooting Scenarios
### 1. Namespace Stuck Terminating
**Issue:** Namespace remains in "Terminating" state indefinitely
**Real-world Example:** `arc-systems` namespace stuck for 9+ hours due to finalizers
**Auto-Fix Capabilities:**
- ✅ Automatically detects stuck resources with finalizers
- ✅ Force removes finalizers from orphaned runner resources
- ✅ Uses `kubectl patch` to clear namespace finalizers
- ✅ Force finalizes namespace via API endpoint
**Manual Steps Automated:**
```bash
# These steps are now automated by the MCP server
# 1. Clear metadata finalizers on the namespace object
kubectl patch namespace arc-systems -p '{"metadata":{"finalizers":null}}' --type=merge
# 2. Empty spec.finalizers through the finalize subresource (requires jq)
kubectl get namespace arc-systems -o json | jq '.spec.finalizers = []' | kubectl replace --raw "/api/v1/namespaces/arc-systems/finalize" -f -
```
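Before clearing finalizers, it usually helps to see what is still holding the namespace open. The following is a minimal sketch, assuming `kubectl` and `jq` are on the PATH; the function names are illustrative and are not taken from the MCP server itself.

```bash
# Hypothetical helper names; only the kubectl/jq invocations are standard.

# Build the API path of the namespace finalize subresource.
finalize_path() {
  printf '/api/v1/namespaces/%s/finalize' "$1"
}

# List every namespaced object still present in the namespace.
list_remaining_resources() {
  kubectl api-resources --verbs=list --namespaced -o name \
    | xargs -n 1 kubectl get --show-kind --ignore-not-found -n "$1"
}

# Print the namespace status conditions, which explain why deletion is blocked.
show_namespace_conditions() {
  kubectl get namespace "$1" \
    -o jsonpath='{range .status.conditions[*]}{.type}{": "}{.message}{"\n"}{end}'
}

# Parameterized form of the finalize call shown above.
force_finalize_namespace() {
  kubectl get namespace "$1" -o json \
    | jq '.spec.finalizers = []' \
    | kubectl replace --raw "$(finalize_path "$1")" -f -
}

# Usage (needs cluster access):
#   list_remaining_resources arc-systems
#   show_namespace_conditions arc-systems
#   force_finalize_namespace arc-systems
```

Inspecting first keeps the forced finalize as a last resort, since it can leave orphaned objects behind if a controller was still cleaning up.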
### 2. Image Pull Authentication Issues
**Issue:** `ImagePullBackOff` or `ErrImagePull` due to GitHub Container Registry authentication
**Real-world Example:** `ghcr.io/actions/actions-runner-controller:v0.27.6`: unauthorized
**Auto-Fix Capabilities:**
- ✅ Detects GHCR authentication failures
- ✅ Attempts alternative image repositories
- ✅ Uses specific stable image versions
- ✅ Configures image pull secrets if needed
- ✅ Falls back to DockerHub mirrors
**Prevention Strategies:**
- Uses proven image versions instead of `latest`
- Configures proper GitHub token permissions
- Implements repository failover mechanisms
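The detection and pull-secret steps can be approximated by hand as follows. This is a sketch under assumptions: the helper names are hypothetical, `GITHUB_USER` and `GITHUB_TOKEN` are assumed to be set by the caller, and the token is assumed to carry the `read:packages` scope.

```bash
# Hypothetical helpers; the kubectl subcommands and flags are standard.

# Succeed when a container waiting reason indicates an image pull failure.
is_pull_failure() {
  case "$1" in
    ImagePullBackOff|ErrImagePull) return 0 ;;
    *) return 1 ;;
  esac
}

# Print "pod<TAB>waiting-reason" for each pod in the namespace.
list_waiting_reasons() {
  kubectl get pods -n "$1" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}'
}

# Create a ghcr.io pull secret. $1 = namespace, $2 = secret name.
create_ghcr_pull_secret() {
  kubectl create secret docker-registry "$2" \
    --namespace "$1" \
    --docker-server=ghcr.io \
    --docker-username="$GITHUB_USER" \
    --docker-password="$GITHUB_TOKEN"
}

# Usage (needs cluster access):
#   list_waiting_reasons arc-systems
#   create_ghcr_pull_secret arc-systems ghcr-pull
```

The secret still has to be referenced from the pod spec or service account (`imagePullSecrets`) before it takes effect.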
### 3. Certificate Manager Issues
**Issue:** cert-manager pods not ready, blocking ARC installation
**Real-world Example:** CRDs not available, webhook not responsive
**Auto-Fix Capabilities:**
- ✅ Waits intelligently for cert-manager readiness
- ✅ Tests webhook connectivity with sample resources
- ✅ Validates CRDs are properly installed
- ✅ Provides fallback installation methods (Helm vs kubectl)
**Comprehensive Validation:**
- Checks namespace existence
- Validates all deployments are ready
- Tests CRD availability
- Verifies webhook responsiveness
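The four validation checks above could be reproduced manually along these lines. This is a rough sketch, assuming cert-manager lives in the `cert-manager` namespace; the function names and the `webhook-probe` Issuer are made up for illustration.

```bash
# Hypothetical helpers; the probe resource name is an assumption.

# Emit a throwaway self-signed Issuer used only to exercise the webhook.
sample_issuer_manifest() {
  cat <<'EOF'
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: webhook-probe
  namespace: cert-manager
spec:
  selfSigned: {}
EOF
}

check_cert_manager() {
  # 1. Namespace exists
  kubectl get namespace cert-manager >/dev/null || return 1
  # 2. All deployments report Available
  kubectl wait --for=condition=Available deployment --all \
    -n cert-manager --timeout=300s || return 1
  # 3. Core CRDs are registered
  kubectl get crd certificates.cert-manager.io issuers.cert-manager.io \
    >/dev/null || return 1
  # 4. Webhook answers: a server-side dry run passes through admission
  sample_issuer_manifest | kubectl apply --dry-run=server -f - >/dev/null
}

# Usage (needs cluster access):
#   check_cert_manager && echo "cert-manager ready"
```

A server-side dry run is used so the probe never persists an object in the cluster.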
### 4. Helm Installation Timeouts
**Issue:** Helm installations timeout due to resource constraints or image pulls
**Real-world Example:** Installation hangs for 10+ minutes waiting for pods
**Auto-Fix Capabilities:**
- ✅ Dynamically adjusts timeout values based on cluster size
- ✅ Monitors pod startup progress in real-time
- ✅ Detects and resolves resource constraint issues
- ✅ Provides intelligent retry mechanisms
**Smart Monitoring:**
- Real-time pod status updates
- Progress visualization during installation
- Intelligent failure detection and recovery
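As an illustration of scaling the timeout with cluster size, the sketch below uses an invented heuristic (300 seconds base, 30 seconds per node, capped at 900 seconds). The formula and function names are assumptions for demonstration, not the values the MCP server uses.

```bash
# Illustrative heuristic only: 300s base + 30s per node, capped at 900s.
helm_timeout_seconds() {
  local nodes="$1"
  local seconds=$((300 + nodes * 30))
  if [ "$seconds" -gt 900 ]; then
    seconds=900
  fi
  echo "$seconds"
}

# Count the nodes currently registered in the cluster.
count_nodes() {
  kubectl get nodes --no-headers | wc -l | tr -d ' '
}

# $1 = release name, $2 = chart reference, $3 = namespace
install_with_scaled_timeout() {
  helm upgrade --install "$1" "$2" \
    --namespace "$3" --create-namespace \
    --wait --timeout "$(helm_timeout_seconds "$(count_nodes)")s"
}

# Usage (needs cluster access; release and chart are placeholders):
#   install_with_scaled_timeout <release> <chart> arc-systems
```

`--wait` makes Helm block until the release's resources are ready, so the timeout bounds the whole rollout rather than only the API call.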
### 5. Pod Security Standards Violations
**Issue:** Pods rejected due to security policy violations
**Real-world Example:** `runAsNonRoot` conflicts, privilege escalation issues
**Auto-Fix Capabilities:**
- ✅ Automatically configures proper security contexts
- ✅ Adjusts namespace security policies when needed
- ✅ Uses privileged namespaces for compatibility
- ✅ Optimizes security settings for ARC requirements
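Adjusting a namespace's Pod Security Standards level is done through namespace labels. The following sketch shows one way to preview and apply such a change; the helper names are hypothetical, while the `pod-security.kubernetes.io/` label prefix and the three levels are the standard Kubernetes ones.

```bash
# Hypothetical helpers around the standard Pod Security Admission labels.

# Succeed only for the three defined Pod Security Standards levels.
is_valid_psa_level() {
  case "$1" in
    privileged|baseline|restricted) return 0 ;;
    *) return 1 ;;
  esac
}

# Build a label such as pod-security.kubernetes.io/enforce=baseline.
# $1 = mode (enforce, audit, or warn), $2 = level
psa_label() {
  printf 'pod-security.kubernetes.io/%s=%s' "$1" "$2"
}

# Preview which existing pods would violate the level, without changing anything.
preview_pod_security_level() {
  is_valid_psa_level "$2" || return 1
  kubectl label --dry-run=server --overwrite namespace "$1" \
    "$(psa_label enforce "$2")"
}

# Apply the level to the namespace.
set_pod_security_level() {
  is_valid_psa_level "$2" || return 1
  kubectl label --overwrite namespace "$1" "$(psa_label enforce "$2")"
}

# Usage (needs cluster access):
#   preview_pod_security_level arc-systems baseline
#   set_pod_security_level arc-systems privileged
```

Previewing with a server-side dry run surfaces warnings for non-compliant pods before the stricter or looser level is enforced.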
### 6. Resource Finalizer Issues
**Issue:** Custom resources stuck due to finalizers preventing deletion
**Real-world Example:** `runner.actions.summerwind.dev` finalizers on 3 runner instances
**Auto-Fix Capabilities:**
- ✅ Detects all stuck resources with finalizers
- ✅ Force removes finalizers from specific resource types
- ✅ Handles AutoscalingRunnerSets, RunnerDeployments, and Runners
- ✅ Provides granular finalizer management
**Supported Resource Types:**
- `runners.actions.summerwind.dev`
- `autoscalingrunnersets.actions.github.com`
- `runnerdeployments.actions.summerwind.dev`
- `horizontalrunnerautoscalers.actions.summerwind.dev`
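A manual equivalent of the finalizer removal might look like the sketch below. The function names are hypothetical and the resource types are passed in as arguments, so the same loop works for any of the types listed above.

```bash
# Hypothetical helpers; the merge patch itself is the one used earlier in this guide.

# The merge patch that clears metadata.finalizers.
finalizer_patch() {
  printf '%s' '{"metadata":{"finalizers":null}}'
}

# $1 = namespace, remaining arguments = resource types to process.
clear_finalizers() {
  local ns="$1"
  shift
  local kind obj
  for kind in "$@"; do
    # -o name prints TYPE/NAME, which kubectl patch accepts directly.
    kubectl get "$kind" -n "$ns" -o name 2>/dev/null | while read -r obj; do
      kubectl patch "$obj" -n "$ns" --type=merge -p "$(finalizer_patch)"
    done
  done
}

# Usage (needs cluster access):
#   clear_finalizers arc-systems \
#     runners.actions.summerwind.dev \
#     runnerdeployments.actions.summerwind.dev
```

Errors from `kubectl get` are suppressed so that a resource type whose CRD is already gone does not abort the loop.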
## Enhanced Installation Process
The enhanced installation process includes six phases with comprehensive troubleshooting:
### Phase 1: Prerequisites Validation with Issue Detection
- ✅ Validates Kubernetes cluster connectivity
- ✅ Checks Helm availability and configuration
- ✅ Validates GitHub token permissions
- ✅ Detects existing installation conflicts
- ✅ Performs resource capacity analysis
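For reference, a hand-run version of these prerequisite checks could look like the sketch below. It assumes `GITHUB_TOKEN` holds a personal access token and that any existing release shows `actions-runner-controller` in its `helm list` row; both are assumptions, and the function names are illustrative.

```bash
# Hypothetical helpers; the CLI calls themselves are standard.

# Succeed only if the named command is on the PATH.
require_cmd() {
  command -v "$1" >/dev/null 2>&1
}

check_prerequisites() {
  local tool
  for tool in kubectl helm curl; do
    require_cmd "$tool" || { echo "missing tool: $tool" >&2; return 1; }
  done
  # Cluster connectivity
  kubectl cluster-info >/dev/null || return 1
  # Helm client is usable
  helm version --short >/dev/null || return 1
  # Token is accepted by the GitHub API
  curl -fsS -H "Authorization: Bearer ${GITHUB_TOKEN}" \
    https://api.github.com/user >/dev/null || return 1
  # Existing ARC releases that could conflict (informational only)
  helm list --all-namespaces | grep -i 'actions-runner-controller' || true
  # Node capacity overview for a quick resource check
  kubectl get nodes -o wide
}

# Usage (needs cluster access):
#   check_prerequisites
```

Each check returns early on failure, so the first missing prerequisite is the one reported.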
### Phase 2: Environment Assessment with AI Optimization
- ✅ Analyzes cluster topology
- ✅ Generates optimal scaling configurations
- ✅ Assesses security posture
- ✅ Creates intelligent installation plan
### Phase 3: ARC Installation with Real-time Monitoring
- ✅ Creates namespace with security policies
- ✅ Installs cert-manager with comprehensive validation
- ✅ Deploys ARC controller with progress tracking
- ✅ Monitors pod startup and health
### Phase 4: Security Hardening with AI Configuration
- ✅ Applies enterprise-grade security policies
- ✅ Configures network policies
- ✅ Sets up proper RBAC
- ✅ Enables compliance monitoring
### Phase 5: Validation with Comprehensive Testing
- ✅ Validates all components are healthy
- ✅ Tests webhook connectivity
- ✅ Performs compliance scoring
- ✅ Generates security reports
### Phase 6: Runner Guidance with AI Recommendations
- ✅ Generates optimal runner configurations
- ✅ Provides testing workflows
- ✅ Creates next-step recommendations
- ✅ Enables conversational management
## Enhanced Cleanup Process
The enhanced cleanup process includes six phases with force recovery capabilities:
### Phase 1: Enhanced Validation with Issue Detection
- ✅ Detects stuck resources and finalizers
- ✅ Identifies namespace terminating issues
- ✅ Analyzes resource dependencies
- ✅ Performs safety checks
### Phase 2: Comprehensive Troubleshooting
- ✅ Automatically applies fixes for known issues
- ✅ Removes finalizers from stuck resources
- ✅ Resolves namespace terminating problems
- ✅ Handles orphaned resources
### Phase 3: Forced Resource Cleanup
- ✅ Force removes runner resources with grace period bypass
- ✅ Uninstalls Helm releases with timeout handling
- ✅ Removes deployments and services
- ✅ Cleans up secrets and configurations
### Phase 4: Finalizer Removal
- ✅ Systematically removes finalizers from all ARC resources
- ✅ Handles multiple resource types
- ✅ Uses proper API calls for finalizer management
- ✅ Provides recovery tracking
### Phase 5: Namespace Force Deletion
- ✅ Attempts graceful namespace deletion first
- ✅ Force removes namespace finalizers if needed
- ✅ Uses API endpoint for finalizer management
- ✅ Waits intelligently for completion
### Phase 6: Final Verification
- ✅ Comprehensive verification of cleanup completeness
- ✅ Checks for remaining resources across all namespaces
- ✅ Validates Custom Resource Definitions
- ✅ Provides detailed cleanup report
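A quick manual check for leftover ARC Custom Resource Definitions can be sketched as below. The function names are hypothetical, and the filter assumes ARC CRDs belong to the `actions.summerwind.dev` or `actions.github.com` API groups.

```bash
# Hypothetical helpers; assumes ARC CRD names end in one of the two groups below.

# Keep only CRD names from stdin that belong to an ARC API group.
filter_arc_crds() {
  grep -E '\.actions\.(summerwind\.dev|github\.com)$' || true
}

# List ARC CRDs that are still registered in the cluster.
list_remaining_arc_crds() {
  kubectl get crd -o name | filter_arc_crds
}

# Usage (needs cluster access):
#   list_remaining_arc_crds
```

An empty result suggests the CRDs are gone; any names printed point at definitions that still need removal.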
## Real-time Progress Updates
All operations provide real-time progress updates in the VS Code chat:
```
## ARC Installation Progress
Progress: 60% [████████████░░░░░░░░]
Installation Phases:
✅ Prerequisites
✅ Assessment
⚡ Installation
⏸️ Security
⏸️ Validation
⏸️ Runners
Current Phase: Installing ARC controller with AI optimization...
This process typically takes 2-5 minutes
Sit back and enjoy the show!
```
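A progress line of this shape can be produced with a few lines of shell. The sketch below is illustrative only and uses ASCII `#` and `-` for the bar so it renders in any terminal; `render_bar` is a made-up name, not a function of the MCP server.

```bash
# Render "Progress: N% [bar]" with a fixed-width ASCII bar.
# $1 = percent complete (0-100), $2 = bar width in characters (default 20)
render_bar() {
  local percent="$1"
  local width="${2:-20}"
  local filled=$((percent * width / 100))
  local bar=""
  local i=0
  while [ "$i" -lt "$width" ]; do
    if [ "$i" -lt "$filled" ]; then
      bar="${bar}#"
    else
      bar="${bar}-"
    fi
    i=$((i + 1))
  done
  printf 'Progress: %d%% [%s]\n' "$percent" "$bar"
}

render_bar 60
# prints: Progress: 60% [############--------]
```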
## AI Insights
The system provides intelligent insights throughout the process:
- Analyzing cluster state for safe cleanup operations
- Environment validated - safe to proceed with cleanup
- No runner resources found - skipping this phase
- Evaluating namespace arc-systems for safe removal
- Some components may require manual cleanup - see verification results
## Safety Features
### Default Safety Mode
- Cleanup functionality is disabled by default (`CLEANUP_ARC=false`)
- Requires explicit enablement to prevent accidental deletions
- Provides dry-run mode for validation
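The opt-in behavior can be pictured as a guard around every destructive step. In the sketch below, `CLEANUP_ARC` comes from this guide, while `DRY_RUN` and the function names are assumptions introduced purely for illustration.

```bash
# Hypothetical guard; DRY_RUN is an assumed variable name.

# Succeed only when cleanup has been explicitly enabled.
cleanup_enabled() {
  [ "${CLEANUP_ARC:-false}" = "true" ]
}

# Run a command only if cleanup is enabled and dry-run mode is off.
run_cleanup_step() {
  if ! cleanup_enabled; then
    echo "skipped (CLEANUP_ARC is not true): $*"
    return 0
  fi
  if [ "${DRY_RUN:-false}" = "true" ]; then
    echo "dry-run: $*"
    return 0
  fi
  "$@"
}

# Usage:
#   CLEANUP_ARC=true DRY_RUN=true run_cleanup_step kubectl delete namespace arc-systems
```

Defaulting both checks to the safe branch means a missing or misspelled variable results in no deletion.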
### Intelligent Recovery
- Automatic detection of common failure patterns
- Self-healing capabilities for known issues
- Comprehensive rollback strategies
### Comprehensive Logging
- Detailed operation logs with timestamps
- AI insights and recommendations
- Troubleshooting results and recovery actions
## Usage Examples
### Enable Enhanced Installation
The enhanced installation is used automatically when calling the installation tool. It includes:
- Comprehensive troubleshooting
- Real-time progress updates
- Automatic issue resolution
- Intelligent recovery mechanisms
### Enable Enhanced Cleanup
```bash
# Set environment variable to enable cleanup
export CLEANUP_ARC=true
# Or update your MCP configuration
{
"args": ["--rm", "-i", "-e", "CLEANUP_ARC=true", ...]
}
```
### Natural Language Commands
The system supports natural language for all operations:
- "Install ARC with troubleshooting"
- "Cleanup the stuck ARC installation"
- "Fix the namespace terminating issue"
- "Force remove all ARC components"
## Troubleshooting Scenarios Covered
| Issue | Severity | Auto-Fix | Description |
|-------|----------|----------|-------------|
| Namespace Stuck Terminating | High | ✅ | Finalizers blocking namespace deletion |
| Image Pull Authentication | Critical | ✅ | GHCR authentication failures |
| Cert-Manager Not Ready | High | ✅ | CRDs or webhook issues |
| Helm Installation Timeout | Medium | ✅ | Resource constraints or image pulls |
| Pod Security Violations | Medium | ✅ | Security context misconfigurations |
| GitHub Token Issues | Critical | ✅ | Invalid or expired tokens |
| Resource Limit Issues | Medium | ✅ | Insufficient cluster resources |
| Network Policy Problems | Medium | ✅ | Connectivity blocked by policies |
| CRD Version Conflicts | High | ✅ | Custom Resource Definition issues |
| Webhook Configuration | High | ✅ | Admission controller problems |
| Runner Registration | Medium | ✅ | GitHub integration failures |
## Benefits
1. **Zero Manual Intervention**: All common issues are detected and fixed automatically
2. **Real-world Experience**: Based on actual troubleshooting scenarios
3. **Comprehensive Coverage**: Handles installation, cleanup, and recovery
4. **Intelligent Recovery**: Self-healing capabilities for known issues
5. **Safety First**: Multiple safety mechanisms prevent accidental damage
6. **Visual Feedback**: Real-time progress updates and AI insights
7. **Natural Language**: Conversational interface for all operations
## Continuous Improvement
The troubleshooting scenarios are continuously updated based on:
- Real-world deployment experiences
- Community feedback and issues
- New ARC versions and changes
- Kubernetes platform evolution
This ensures the MCP server stays current with the latest challenges and solutions in the ARC ecosystem.