# SCS-MCP Architecture
## System Overview
SCS-MCP (Smart Code Search - Model Context Protocol) is a code intelligence system that provides semantic search, code analysis, and voice interaction to Claude Desktop and other MCP-compatible clients.
```
┌─────────────────────────────────────────────────────────────┐
│                       Claude Desktop                        │
│                       (or MCP Client)                       │
└────────────────────┬────────────────────────────────────────┘
                     │ MCP Protocol
┌────────────────────▼────────────────────────────────────────┐
│                      MCP Server Layer                       │
│  ┌───────────────────────────────────────────────────────┐  │
│  │               Request Router & Handler                │  │
│  └───────────────────────────────────────────────────────┘  │
└────────────────────┬────────────────────────────────────────┘
                     │
┌────────────────────▼────────────────────────────────────────┐
│                        Core Services                        │
│   ┌────────────┐   ┌────────────┐   ┌──────────────────┐    │
│   │   Search   │   │  Analysis  │   │  Orchestration   │    │
│   │   Engine   │   │   Tools    │   │    Framework     │    │
│   └────────────┘   └────────────┘   └──────────────────┘    │
└────────────────────┬────────────────────────────────────────┘
                     │
┌────────────────────▼────────────────────────────────────────┐
│                         Data Layer                          │
│   ┌────────────┐   ┌────────────┐   ┌──────────────────┐    │
│   │   SQLite   │   │ Embeddings │   │   Git History    │    │
│   │  Database  │   │   Cache    │   │     Analyzer     │    │
│   └────────────┘   └────────────┘   └──────────────────┘    │
└─────────────────────────────────────────────────────────────┘
```
## Core Components
### 1. MCP Server (`src/server.py`)
The main entry point that implements the Model Context Protocol specification.
**Responsibilities**:
- Protocol implementation (JSON-RPC)
- Request routing and validation
- Response formatting
- Error handling
- Rate limiting
**Key Features**:
- Async request handling
- Thread-safe operations
- Connection pooling
- Request caching
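The routing and validation responsibilities can be illustrated with a minimal sketch of JSON-RPC 2.0 request handling. This is not the code in `src/server.py`: the `HANDLERS` registry, the `search` method name, and the `handle_request` helper are hypothetical, and only by-name (object) parameters are accepted to keep the example short. The error codes are the standard JSON-RPC 2.0 ones.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical handler registry; the method name is illustrative only.
HANDLERS: Dict[str, Callable[[dict], Any]] = {
    "search": lambda params: {"query": params.get("query", ""), "results": []},
}


def make_error(request_id: Any, code: int, message: str) -> dict:
    """Build a JSON-RPC 2.0 error response."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "error": {"code": code, "message": message},
    }


def handle_request(raw: str) -> dict:
    """Validate one JSON-RPC 2.0 request and route it to a handler."""
    try:
        request = json.loads(raw)
    except json.JSONDecodeError:
        return make_error(None, -32700, "Parse error")

    is_valid = (
        isinstance(request, dict)
        and request.get("jsonrpc") == "2.0"
        and isinstance(request.get("method"), str)
    )
    if not is_valid:
        return make_error(None, -32600, "Invalid Request")

    request_id = request.get("id")
    handler = HANDLERS.get(request["method"])
    if handler is None:
        return make_error(request_id, -32601, "Method not found")

    # Simplification: this sketch only accepts by-name parameters.
    params = request.get("params", {})
    if not isinstance(params, dict):
        return make_error(request_id, -32602, "Invalid params")

    return {"jsonrpc": "2.0", "id": request_id, "result": handler(params)}
```

Returning an error object instead of raising keeps every failure inside the protocol, so a client always receives a well-formed response.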
### 2. Search Engine (`src/search_engine.py`)
Hybrid search system combining semantic and keyword matching.
**Architecture**:
```
SearchEngine
├── EmbeddingGenerator (Sentence Transformers)
│   └── all-MiniLM-L6-v2 model
├── IndexManager
│   ├── SQLite storage
│   └── FAISS vector index (optional)
├── QueryProcessor
│   ├── Query parsing
│   ├── Synonym expansion
│   └── Filter application
└── ResultRanker
    ├── Semantic scoring
    ├── Keyword matching
    └── Hybrid ranking
```
**Search Pipeline**:
1. Query preprocessing (tokenization, normalization)
2. Embedding generation
3. Vector similarity search
4. Keyword matching
5. Result fusion and ranking
6. Post-processing and formatting
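Steps 3 to 5 of the pipeline can be sketched as a weighted combination of a vector score and a keyword score. The fusion formula, the `alpha` weight of 0.7, and the function names below are assumptions made for illustration; the document does not specify how `ResultRanker` merges the two signals.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_score(query: str, text: str) -> float:
    """Fraction of query terms that appear in the text."""
    terms = set(query.lower().split())
    words = set(text.lower().split())
    return len(terms & words) / len(terms) if terms else 0.0


def hybrid_rank(
    query: str,
    query_vec: Sequence[float],
    docs: List[Tuple[str, str, Sequence[float]]],
    alpha: float = 0.7,  # assumed weight of the semantic score
) -> List[Tuple[str, float]]:
    """Rank (doc_id, text, embedding) triples by a weighted hybrid score."""
    scored = []
    for doc_id, text, vec in docs:
        semantic = cosine_similarity(query_vec, vec)
        keyword = keyword_score(query, text)
        scored.append((doc_id, alpha * semantic + (1 - alpha) * keyword))
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

A linear blend is only one option. Rank-based fusion is a common alternative when the two scores are not on comparable scales.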
### 3. Code Indexer (`src/enhanced_indexer.py`)
Multi-language code parser and indexer using Tree-sitter.
**Supported Languages**:
- Python (full AST analysis)
- JavaScript/TypeScript
- Java
- C/C++
- Go
- Rust
- Ruby
**Index Structure**:
```sql
symbols
├── id (PRIMARY KEY)
├── name
├── type (function/class/variable)
├── file_path
├── line_number
├── column_number
├── signature
├── docstring
├── code_snippet
├── embedding (BLOB)
└── metadata (JSON)
dependencies
├── source_symbol_id
├── target_symbol_id
├── dependency_type
└── context
git_history
├── commit_hash
├── file_path
├── change_type
├── diff_content
└── metadata
```
### 4. Analysis Tools (`src/tools/`)
Modular analysis components for code intelligence.
**Tool Categories**:
#### Code Quality Tools
- `instant_review.py`: Real-time code review
- `complexity_analyzer.py`: Cyclomatic complexity
- `test_gap_analyzer.py`: Test coverage analysis
- `security_analyzer.py`: Basic vulnerability scanning
#### Git Analysis Tools
- `git_analyzer.py`: Repository history analysis
- `git_search.py`: Commit message search
- `change_tracker.py`: File change tracking
#### Dependency Tools
- `dependency_analyzer.py`: Import analysis
- `circular_detector.py`: Circular dependency detection
- `usage_tracker.py`: Symbol usage tracking
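Circular dependency detection is typically a cycle search over the import graph. The sketch below uses a depth-first search with three node states. It is an illustration under assumptions, not the contents of `circular_detector.py`: `find_cycle` and the adjacency-dict input format are hypothetical.

```python
from typing import Dict, List, Optional

WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished


def find_cycle(graph: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of nodes, or None if acyclic."""
    color: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GRAY
        path.append(node)
        for target in graph.get(node, []):
            state = color.get(target, WHITE)
            if state == GRAY:
                # Back edge: the cycle runs from `target` to the current node.
                return path[path.index(target):] + [target]
            if state == WHITE:
                found = visit(target)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in graph:
        if color.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None
```

The recursion depth equals the longest dependency chain, so a very deep graph would need an iterative variant.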
#### Model Information Tools
- `model_info_tools.py`: AI model capabilities
- `cost_estimator.py`: Token usage estimation
- `model_selector.py`: Task-based model selection
### 5. Orchestration Framework (`src/orchestrators/`)
High-level coordination for complex operations.
**Orchestrator Pattern**:
```python
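class Orchestrator:
    def __init__(self):
        self.tools = []
        self.pipeline = []  # ordered stage identifiers

    async def execute(self, context):
        results = {}
        for stage in self.pipeline:
            # Each stage can read the results of the stages before it.
            results[stage] = await self.run_stage(stage, context, results)
        return self.aggregate_results(results)

    async def run_stage(self, stage, context, results):
        raise NotImplementedError  # placeholder: supplied by a concrete orchestrator

    def aggregate_results(self, results):
        return results  # placeholder: return per-stage results unchanged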
```
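To make the pattern concrete, here is a small runnable sketch in which each stage is an async callable that receives the context and the results accumulated so far. `PipelineOrchestrator`, `count_lines`, and `flag_long_file` are hypothetical names invented for this example and are not part of `src/orchestrators/`.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List, Tuple

# A stage receives the shared context and the results of earlier stages.
Stage = Callable[[dict, Dict[str, Any]], Awaitable[Any]]


class PipelineOrchestrator:
    """Runs named async stages in order, passing earlier results forward."""

    def __init__(self) -> None:
        self.pipeline: List[Tuple[str, Stage]] = []

    def add_stage(self, name: str, stage: Stage) -> None:
        self.pipeline.append((name, stage))

    async def execute(self, context: dict) -> Dict[str, Any]:
        results: Dict[str, Any] = {}
        for name, stage in self.pipeline:
            results[name] = await stage(context, results)
        return results


async def count_lines(context: dict, results: Dict[str, Any]) -> int:
    return len(context["source"].splitlines())


async def flag_long_file(context: dict, results: Dict[str, Any]) -> bool:
    # Depends on the output of the earlier "count_lines" stage.
    return results["count_lines"] > context["max_lines"]


if __name__ == "__main__":
    orchestrator = PipelineOrchestrator()
    orchestrator.add_stage("count_lines", count_lines)
    orchestrator.add_stage("flag_long_file", flag_long_file)
    print(asyncio.run(orchestrator.execute({"source": "a\nb\nc", "max_lines": 2})))
```

Passing `results` into every stage lets later stages build on earlier ones without sharing mutable state outside the pipeline.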
**Available Orchestrators**:
- `DebtOrchestrator`: Technical debt analysis
- `RefactorOrchestrator`: Refactoring coordination
- `MigrationOrchestrator`: Code migration planning
- `QualityOrchestrator`: Comprehensive quality assessment
- `PerformanceOrchestrator`: Performance optimization
### 6. Voice Assistant (`voice-assistant/`)
Web-based voice interface with media capture capabilities.
**Architecture**:
```
Voice Assistant
├── Server (Node.js/Express)
│   ├── WebSocket handler
│   ├── MCP client
│   └── Media processor
├── Web UI (HTML/JS)
│   ├── Voice recognition (Web Speech API)
│   ├── Media gallery
│   └── Real-time updates
├── VS Code Extension
│   ├── Editor context
│   ├── Command palette
│   └── Status bar
└── Storage
    ├── SQLite (metadata)
    └── File system (media)
```
## Data Flow
### Search Request Flow
```
1. Client Request
   └── MCP Server receives query
       └── Query Processor
           ├── Parse and validate
           ├── Generate embedding
           └── Build search parameters
               └── Search Engine
                   ├── Vector search
                   ├── Keyword search
                   └── Merge results
                       └── Post-processor
                           ├── Format response
                           ├── Add context
                           └── Return to client
```
### Indexing Flow
```
1. File Discovery
   └── File Scanner
       └── Language Detector
           └── Parser (Tree-sitter)
               ├── Extract symbols
               ├── Extract dependencies
               └── Extract documentation
                   └── Embedding Generator
                       └── Database Writer
                           ├── Store symbols
                           ├── Store embeddings
                           └── Update indices
```
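The symbol and documentation extraction steps can be sketched for Python sources with the standard library `ast` module. This is a stand-in for illustration only: the indexer itself is described as using Tree-sitter, and `extract_symbols` is a hypothetical helper whose output keys merely mirror some of the `symbols` columns.

```python
import ast
from typing import Dict, List


def extract_symbols(source: str, file_path: str) -> List[Dict[str, object]]:
    """Collect functions and classes with their location and docstring."""
    tree = ast.parse(source, filename=file_path)
    symbols: List[Dict[str, object]] = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            kind = "function"
        elif isinstance(node, ast.ClassDef):
            kind = "class"
        else:
            continue
        symbols.append({
            "name": node.name,
            "type": kind,
            "file_path": file_path,
            "line_number": node.lineno,
            "docstring": ast.get_docstring(node),  # None when absent
        })
    # ast.walk is breadth-first, so sort to get source order.
    return sorted(symbols, key=lambda symbol: symbol["line_number"])
```

Each returned dict could then be passed to the embedding and database-writing stages.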
## Database Schema
### Core Tables
```sql
-- Symbol storage
CREATE TABLE symbols (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    type TEXT NOT NULL,
    file_path TEXT NOT NULL,
    line_number INTEGER,
    signature TEXT,
    docstring TEXT,
    code TEXT,
    embedding BLOB,
    metadata JSON,
    created_at TIMESTAMP,
    updated_at TIMESTAMP
);

-- Search index
CREATE TABLE search_index (
    id INTEGER PRIMARY KEY,
    symbol_id INTEGER,
    content TEXT,
    embedding BLOB,
    tfidf_vector BLOB,
    FOREIGN KEY (symbol_id) REFERENCES symbols(id)
);

-- Dependencies
CREATE TABLE dependencies (
    id INTEGER PRIMARY KEY,
    source_id INTEGER,
    target_id INTEGER,
    type TEXT,
    context TEXT,
    FOREIGN KEY (source_id) REFERENCES symbols(id),
    FOREIGN KEY (target_id) REFERENCES symbols(id)
);

-- Git history
CREATE TABLE git_commits (
    hash TEXT PRIMARY KEY,
    message TEXT,
    author TEXT,
    timestamp TIMESTAMP,
    files_changed TEXT,
    stats JSON
);

-- Cache
CREATE TABLE cache (
    key TEXT PRIMARY KEY,
    value BLOB,
    expires_at TIMESTAMP
);
```
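As an illustration of the `embedding BLOB` column, the sketch below round-trips a vector through SQLite using only the standard library. The float32 packing via `array` is an assumption, since the document does not state the on-disk encoding, and the table here is a reduced copy of `symbols` with made-up row values.

```python
import sqlite3
from array import array
from typing import List


def pack_embedding(vector: List[float]) -> bytes:
    """Serialize a vector as raw float32 bytes (assumed encoding)."""
    return array("f", vector).tobytes()


def unpack_embedding(blob: bytes) -> List[float]:
    """Inverse of pack_embedding."""
    values = array("f")
    values.frombytes(blob)
    return values.tolist()


conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE symbols (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        type TEXT NOT NULL,
        file_path TEXT NOT NULL,
        embedding BLOB
    )
    """
)
# Parameterized queries keep user-supplied values out of the SQL text.
conn.execute(
    "INSERT INTO symbols (name, type, file_path, embedding) VALUES (?, ?, ?, ?)",
    ("parse_config", "function", "src/config.py", pack_embedding([0.5, -1.0, 0.25])),
)
row = conn.execute(
    "SELECT embedding FROM symbols WHERE name = ?", ("parse_config",)
).fetchone()
restored = unpack_embedding(row[0])
```

With the 384-dimensional output of all-MiniLM-L6-v2, this encoding would take 1,536 bytes per symbol.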
## Performance Optimizations
### 1. Caching Strategy
**Multi-level caching**:
- L1: In-memory LRU cache (most recent queries)
- L2: SQLite cache table (persistent cache)
- L3: File system cache (large results)
**Cache invalidation**:
- Time-based expiry (TTL: 5 minutes default)
- Event-based (file changes)
- Manual refresh
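The L1 level and its invalidation rules can be sketched as an LRU cache with per-entry expiry. `TTLCache` is a hypothetical class written for this example. The 300-second default mirrors the 5-minute TTL above, and the injectable `clock` parameter exists only to make the behaviour testable.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Hashable, Optional, Tuple


class TTLCache:
    """In-memory LRU cache whose entries also expire after a fixed TTL."""

    def __init__(
        self,
        max_size: int = 128,
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._items: "OrderedDict[Hashable, Tuple[float, Any]]" = OrderedDict()
        self._max_size = max_size
        self._ttl = ttl_seconds
        self._clock = clock

    def get(self, key: Hashable) -> Optional[Any]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._items[key]  # time-based expiry
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return value

    def set(self, key: Hashable, value: Any) -> None:
        self._items[key] = (self._clock() + self._ttl, value)
        self._items.move_to_end(key)
        while len(self._items) > self._max_size:
            self._items.popitem(last=False)  # evict least recently used

    def invalidate(self, key: Hashable) -> None:
        """Event-based or manual invalidation, e.g. after a file change."""
        self._items.pop(key, None)
```

A file-change event would call `invalidate` for every cached query that touched the changed file; how that mapping is tracked is left out of this sketch.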
### 2. Indexing Optimizations
- **Incremental indexing**: Only changed files
- **Parallel processing**: Multi-threaded parsing
- **Batch operations**: Bulk database inserts
- **Lazy loading**: On-demand embedding generation
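Incremental indexing needs a cheap way to tell which files changed. One common approach, assumed here rather than taken from the codebase, is to compare a content hash against the hash recorded at the last index run. `files_to_reindex` and its in-memory inputs are hypothetical.

```python
import hashlib
from typing import Dict, List


def content_hash(data: bytes) -> str:
    """Stable fingerprint of a file's contents."""
    return hashlib.sha256(data).hexdigest()


def files_to_reindex(
    current: Dict[str, bytes], known_hashes: Dict[str, str]
) -> List[str]:
    """Return paths whose contents differ from the last indexed version.

    `known_hashes` is updated in place so the next call sees the new state.
    """
    changed = []
    for path, data in current.items():
        digest = content_hash(data)
        if known_hashes.get(path) != digest:
            changed.append(path)
            known_hashes[path] = digest
    return sorted(changed)
```

Comparing modification times is cheaper but less reliable; hashing the contents catches files that were rewritten with identical text and skips them.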
### 3. Search Optimizations
- **Query optimization**: SQL query planning
- **Vector quantization**: Reduced embedding size
- **Early termination**: Stop at result threshold
- **Result streaming**: Progressive response
## Security Considerations
### 1. Input Validation
- Parameter sanitization
- SQL injection prevention
- Path traversal protection
- Size limits enforcement
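Path traversal protection usually comes down to resolving the requested path and checking that it still lies inside the allowed root. The helper below is a minimal sketch of that check with `pathlib`; `resolve_inside` is a hypothetical name, not a function from the codebase.

```python
from pathlib import Path
from typing import Union


def resolve_inside(root: Union[str, Path], requested: str) -> Path:
    """Resolve `requested` relative to `root`, rejecting escapes from root."""
    root_path = Path(root).resolve()
    # resolve() collapses ".." segments and follows symlinks, so the check
    # below runs against the real target. An absolute `requested` replaces
    # the root in the join and is rejected unless it points inside the root.
    candidate = (root_path / requested).resolve()
    if candidate != root_path and root_path not in candidate.parents:
        raise ValueError(f"path escapes project root: {requested!r}")
    return candidate
```

Checking the resolved path rather than the raw string avoids bypasses such as `src/../../etc/passwd`, which a simple prefix check would miss.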
### 2. Access Control
- File system boundaries
- Git repository isolation
- Read-only operations
- No code execution
### 3. Data Protection
- No credential storage
- Temporary file cleanup
- Secure communication (MCP protocol)
- Error message sanitization
## Scalability
### Horizontal Scaling
```
Load Balancer
├── MCP Server Instance 1
├── MCP Server Instance 2
└── MCP Server Instance N
└── Shared Database (PostgreSQL)
└── Distributed Cache (Redis)
```
### Vertical Scaling
- **Memory**: Increase embedding cache size
- **CPU**: More worker threads
- **Storage**: Larger index capacity
- **GPU**: Hardware acceleration for embeddings
## Monitoring & Observability
### Metrics
- Request latency (p50, p95, p99)
- Search accuracy (precision/recall)
- Index size and growth
- Cache hit rates
- Error rates
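For the latency metric, the percentiles can be computed from recorded samples with the nearest-rank method. This is a sketch under assumptions: the `percentile` helper is hypothetical, and a production system would more likely rely on a metrics library with histogram buckets than on raw samples.

```python
import math
from typing import List, Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples recorded")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range 1..100")
    ordered: List[float] = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def latency_summary(samples_ms: Sequence[float]) -> dict:
    """Summarize request latencies as p50, p95 and p99."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}
```

Nearest-rank always returns an observed sample, which keeps the reported values easy to trace back to real requests.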
### Logging
```python
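import logging

logger = logging.getLogger(__name__)

# Structured logging: with the standard library, fields travel on the
# log record via `extra` (assumes a stdlib logger with a structured formatter)
logger.info("search_request", extra={
    "query": query,
    "results": len(results),
    "latency_ms": latency,
    "cache_hit": cache_hit
})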
```
### Health Checks
```json
GET /health
{
"status": "healthy",
"version": "1.0.0",
"uptime": 3600,
"index_size": 150000,
"cache_hit_rate": 0.85
}
```
## Future Architecture Considerations
### Planned Enhancements
1. **Distributed indexing**: Apache Spark integration
2. **Real-time updates**: File watcher integration
3. **Advanced ML models**: CodeBERT, GraphCodeBERT
4. **Cloud deployment**: AWS Lambda, Google Cloud Run
5. **Multi-tenant support**: Workspace isolation
### API Evolution
```yaml
# Proposed v2 API structure
/api/v2/
  /search
    /semantic
    /keyword
    /hybrid
  /analyze
    /quality
    /security
    /performance
  /refactor
    /suggest
    /preview
    /apply
```
## Development Guidelines
### Code Organization
```
src/
├── core/ # Core functionality
├── tools/ # Analysis tools
├── orchestrators/ # High-level coordinators
├── utils/ # Shared utilities
├── models/ # Data models
└── tests/ # Test suites
```
### Design Principles
1. **Modularity**: Loosely coupled components
2. **Extensibility**: Plugin architecture
3. **Testability**: Dependency injection
4. **Performance**: Async-first design
5. **Reliability**: Graceful degradation
### Testing Strategy
- Unit tests: 80% coverage minimum
- Integration tests: API endpoints
- Performance tests: Load testing
- End-to-end tests: User workflows
## Deployment Architecture
### Docker Deployment
```dockerfile
# Multi-stage build
FROM python:3.11-slim AS builder
# Build stage
FROM python:3.11-slim
# Runtime stage
```
### Kubernetes Deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scs-mcp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: scs-mcp
  template:
    metadata:
      labels:
        app: scs-mcp  # must match spec.selector.matchLabels
    spec:
      containers:
        - name: scs-mcp
          image: scs-mcp:latest
          resources:
            requests:
              memory: "1Gi"
              cpu: "500m"
            limits:
              memory: "2Gi"
              cpu: "1000m"
```
## Contributing
See [CONTRIBUTING.md](../CONTRIBUTING.md) for development setup and guidelines.