IMPLEMENTATION_GUIDE.md•20.9 kB
# Crawl4AI MCP Server Implementation Guide
This technical guide provides detailed implementation considerations and best practices for developing the enhanced Crawl4AI MCP Server architecture. It serves as a companion to the `MIGRATION_PLAN.md` and `ENHANCED_ARCHITECTURE.md` documents.
## Technical Implementation Considerations
### 1. Adapter Implementation Details
The `Crawl4AIAdapter` class needs significant refactoring to properly work with the Crawl4AI Docker API:
#### Parameter Transformation
Crawl4AI Docker API expects different parameter structures than the current Firecrawl implementation:
```typescript
// Current Firecrawl format (flat structure):
{
"url": "https://example.com",
"formats": ["markdown", "html"],
"onlyMainContent": true
}
// Required Crawl4AI format (nested structure):
{
"urls": ["https://example.com"],
"browser_config": {
"type": "BrowserConfig",
"params": {"headless": true}
},
"crawler_config": {
"type": "CrawlerRunConfig",
"params": {
"formats": ["markdown", "html"],
"onlyMainContent": true
}
}
}
```
Implementation example:
```typescript
private transformScrapeParams(params: any): any {
const { url, ...options } = params;
return {
urls: [url],
browser_config: {
type: "BrowserConfig",
params: {
headless: options.mobile ? false : true,
// Transform other browser-specific options
}
},
crawler_config: {
type: "CrawlerRunConfig",
params: {
formats: options.formats || ["markdown"],
onlyMainContent: options.onlyMainContent || false,
// Transform other crawler-specific options
}
}
};
}
```
### 2. Authentication Mechanisms
Crawl4AI Docker server supports multiple authentication methods:
#### Bearer Token Authentication
```typescript
public setApiKey(apiKey: string): void {
if (!apiKey) return;
this.apiClient.defaults.headers.common['Authorization'] = `Bearer ${apiKey}`;
}
```
#### JWT Authentication
For JWT authentication, implement:
```typescript
public async authenticate(email: string): Promise<string> {
try {
const response = await this.executeRequest(
'POST',
'/auth',
{ email }
);
if (response.token) {
this.apiClient.defaults.headers.common['Authorization'] = `Bearer ${response.token}`;
return response.token;
}
throw new Error('Authentication failed: No token received');
} catch (error) {
console.error('Authentication error:', error);
throw error;
}
}
```
### 3. Error Handling and Retries
Implement more sophisticated error handling specific to Crawl4AI:
```typescript
// Update error classification based on Crawl4AI error patterns
switch (status) {
case 401:
case 403:
errorType = ErrorType.AUTHENTICATION;
errorMessage = 'Authentication error: Invalid or expired API key';
break;
case 429:
errorType = ErrorType.RATE_LIMIT;
errorMessage = 'Rate limit exceeded: Too many requests';
retryable = true;
break;
case 503:
case 504:
errorType = ErrorType.SERVER;
errorMessage = 'Crawl4AI server temporarily unavailable';
retryable = true;
break;
// Additional Crawl4AI-specific error codes
case 422:
errorType = ErrorType.VALIDATION;
errorMessage = `Validation error: ${errorData.message || 'Invalid parameters'}`;
break;
// etc.
}
```
### 4. WebSocket Support
Crawl4AI Docker server supports WebSocket connections for streaming results:
```typescript
public async streamCrawl(urls: string[], options: any = {}): Promise<WebSocket> {
const params = this.transformCrawlParams({
urls,
...options,
stream: true
});
// Create WebSocket connection
const wsUrl = `${this.baseUrl.replace('http', 'ws')}/crawl/stream`;
const ws = new WebSocket(wsUrl);
ws.onopen = () => {
ws.send(JSON.stringify(params));
};
return ws;
}
```
## Infrastructure as Code Implementations
### AWS CloudFormation Template
```yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: 'Crawl4AI Docker Server on AWS'
Parameters:
InstanceType:
Type: String
Default: t3.medium
AllowedValues:
- t3.medium
- t3.large
- t3.xlarge
Resources:
Crawl4AISecurityGroup:
Type: AWS::EC2::SecurityGroup
Properties:
GroupDescription: Security group for Crawl4AI Docker server
SecurityGroupIngress:
- IpProtocol: tcp
FromPort: 11235
ToPort: 11235
CidrIp: 0.0.0.0/0 # Should be restricted in production
Crawl4AIInstance:
Type: AWS::EC2::Instance
Properties:
InstanceType: !Ref InstanceType
ImageId: ami-0261755bbcb8c4a84 # Amazon Linux 2 (update as needed)
SecurityGroupIds:
- !Ref Crawl4AISecurityGroup
UserData:
Fn::Base64: !Sub |
#!/bin/bash
yum update -y
amazon-linux-extras install docker -y
service docker start
systemctl enable docker
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:latest
Outputs:
ServerURL:
Description: URL for the Crawl4AI server
Value: !Sub http://${Crawl4AIInstance.PublicDnsName}:11235
```
### Terraform Module for Multi-Cloud
```hcl
variable "cloud_provider" {
description = "Cloud provider to use (aws, gcp, azure)"
type = string
default = "aws"
}
module "crawl4ai_server" {
source = "./modules/${var.cloud_provider}"
# Common variables
instance_size = var.instance_size
docker_version = var.docker_version
# Provider-specific variables
region = var.region
vpc_id = var.vpc_id
subnet_id = var.subnet_id
}
```
## User Management System
### User Model
```typescript
interface User {
id: string;
email: string;
name: string;
apiKey: string;
createdAt: Date;
lastLoginAt: Date;
preferences: {
defaultProvider: string;
instanceSize: string;
autoShutdownMinutes: number;
};
deployments: Deployment[];
}
interface Deployment {
id: string;
provider: string;
region: string;
instanceType: string;
status: 'creating' | 'running' | 'stopped' | 'error';
createdAt: Date;
lastActiveAt: Date;
endpoint: string;
cost: {
hourlyRate: number;
monthlyCost: number;
currency: string;
};
}
```
### User Registration Flow
```typescript
async function registerUser(email: string, name: string, password: string): Promise<User> {
// Validate email format
if (!isValidEmail(email)) {
throw new Error('Invalid email format');
}
// Check if user already exists
const existingUser = await UserStorage.findByEmail(email);
if (existingUser) {
throw new Error('User already exists');
}
// Generate API key
const apiKey = generateSecureApiKey();
// Create user record
const user = {
id: uuidv4(),
email,
name,
apiKey,
createdAt: new Date(),
lastLoginAt: new Date(),
preferences: {
defaultProvider: 'aws',
instanceSize: 'medium',
autoShutdownMinutes: 60
},
deployments: []
};
// Store password securely (using bcrypt or similar)
await UserStorage.storePassword(user.id, await hashPassword(password));
// Store user record
await UserStorage.save(user);
return user;
}
```
## Dashboard Implementation
### Frontend Components
The user dashboard will need several key components:
1. **Authentication Module**:
- Login/registration forms
- Password reset functionality
- Session management
2. **Provider Configuration**:
- Cloud provider selection
- Secure credential collection
- Deployment options
3. **Management Interface**:
- Instance status monitoring
- Resource scaling controls
- Usage statistics and billing information
### Backend API Endpoints
```typescript
// User management
app.post('/api/users/register', handleUserRegistration);
app.post('/api/users/login', handleUserLogin);
app.get('/api/users/me', requireAuth, handleGetCurrentUser);
// Cloud provider management
app.post('/api/providers/configure', requireAuth, handleProviderConfiguration);
app.get('/api/providers/status', requireAuth, handleGetProviderStatus);
// Deployment management
app.post('/api/deployments', requireAuth, handleCreateDeployment);
app.get('/api/deployments', requireAuth, handleListDeployments);
app.get('/api/deployments/:id', requireAuth, handleGetDeployment);
app.post('/api/deployments/:id/start', requireAuth, handleStartDeployment);
app.post('/api/deployments/:id/stop', requireAuth, handleStopDeployment);
app.delete('/api/deployments/:id', requireAuth, handleDeleteDeployment);
// Monitoring and utilization
app.get('/api/deployments/:id/metrics', requireAuth, handleGetDeploymentMetrics);
app.get('/api/deployments/:id/logs', requireAuth, handleGetDeploymentLogs);
app.get('/api/deployments/:id/billing', requireAuth, handleGetDeploymentBilling);
```
## Secure Credential Management
### Storing Cloud Provider Credentials
Never store cloud provider credentials in plain text. Use proper encryption and secure storage:
```typescript
async function storeProviderCredentials(
userId: string,
provider: string,
credentials: any
): Promise<void> {
// Encrypt credentials before storage
const encryptedCredentials = await encryptData(
JSON.stringify(credentials),
process.env.ENCRYPTION_KEY!
);
// Store in secure storage (KV, database, etc.)
await CredentialStorage.save(userId, provider, encryptedCredentials);
// Optional: Store a key reference in the user record
await UserStorage.updateProviderStatus(userId, provider, 'configured');
}
```
### Using Credentials for Deployments
```typescript
async function getProviderCredentials(userId: string, provider: string): Promise<any> {
// Retrieve encrypted credentials
const encryptedCredentials = await CredentialStorage.get(userId, provider);
if (!encryptedCredentials) {
throw new Error(`No ${provider} credentials found for user`);
}
// Decrypt credentials for use
const credentials = JSON.parse(
await decryptData(encryptedCredentials, process.env.ENCRYPTION_KEY!)
);
return credentials;
}
```
## Monitoring and Observability
### Health Checks
Implement regular health checks for Crawl4AI deployments:
```typescript
async function checkDeploymentHealth(deploymentId: string): Promise<HealthStatus> {
try {
const deployment = await DeploymentStorage.get(deploymentId);
if (!deployment) {
throw new Error('Deployment not found');
}
// Check if instance is responsive
const response = await fetch(`${deployment.endpoint}/health`, {
timeout: 5000
});
if (!response.ok) {
return {
status: 'unhealthy',
message: `Server returned status ${response.status}`,
lastChecked: new Date()
};
}
const healthData = await response.json();
return {
status: 'healthy',
message: 'Service is running normally',
metrics: {
memoryUsage: healthData.memory_usage,
cpuUsage: healthData.cpu_usage,
uptime: healthData.uptime
},
lastChecked: new Date()
};
} catch (error) {
return {
status: 'error',
message: error.message,
lastChecked: new Date()
};
}
}
```
### Alerting and Notifications
```typescript
async function monitorDeployments(): Promise<void> {
// Get all active deployments
const activeDeployments = await DeploymentStorage.findActive();
for (const deployment of activeDeployments) {
const health = await checkDeploymentHealth(deployment.id);
// Update stored health status
await DeploymentStorage.updateHealth(deployment.id, health);
// Send alerts for unhealthy deployments
if (health.status !== 'healthy') {
await sendAlert(deployment.userId, {
type: 'deployment_unhealthy',
deploymentId: deployment.id,
message: health.message,
timestamp: new Date(),
severity: health.status === 'error' ? 'high' : 'medium'
});
}
}
}
```
## Cost Optimization Strategies
### Auto-Shutdown for Idle Instances
```typescript
async function checkAndShutdownIdleDeployments(): Promise<void> {
const runningDeployments = await DeploymentStorage.findByStatus('running');
for (const deployment of runningDeployments) {
const user = await UserStorage.get(deployment.userId);
const idleMinutes = calculateIdleTime(deployment.lastActiveAt);
if (idleMinutes >= user.preferences.autoShutdownMinutes) {
console.log(`Shutting down idle deployment ${deployment.id} after ${idleMinutes} minutes`);
try {
// Get provider-specific client
const provider = getProviderClient(deployment.provider);
// Stop the deployment
await provider.stopDeployment(deployment.id);
// Update deployment status
await DeploymentStorage.updateStatus(deployment.id, 'stopped');
// Notify user
await sendNotification(deployment.userId, {
type: 'deployment_auto_shutdown',
deploymentId: deployment.id,
message: `Your ${deployment.provider} deployment was automatically stopped after ${idleMinutes} minutes of inactivity`,
timestamp: new Date()
});
} catch (error) {
console.error(`Failed to shutdown deployment ${deployment.id}:`, error);
}
}
}
}
```
### Right-Sizing Recommendations
```typescript
function generateSizingRecommendation(deployment: Deployment, usageMetrics: UsageMetrics): SizingRecommendation {
// Analyze CPU utilization
const avgCpuUtilization = calculateAverageCpuUtilization(usageMetrics);
const peakCpuUtilization = calculatePeakCpuUtilization(usageMetrics);
// Analyze memory utilization
const avgMemoryUtilization = calculateAverageMemoryUtilization(usageMetrics);
const peakMemoryUtilization = calculatePeakMemoryUtilization(usageMetrics);
// Determine if over-provisioned
const isOverProvisioned = avgCpuUtilization < 20 && peakCpuUtilization < 50 &&
avgMemoryUtilization < 40 && peakMemoryUtilization < 70;
// Determine if under-provisioned
const isUnderProvisioned = avgCpuUtilization > 70 || peakCpuUtilization > 90 ||
avgMemoryUtilization > 80 || peakMemoryUtilization > 95;
// Make recommendation
if (isOverProvisioned) {
return {
currentSize: deployment.instanceType,
recommendedSize: getNextSmallerSize(deployment.provider, deployment.instanceType),
reason: 'Instance is significantly over-provisioned. Downsizing could reduce costs by approximately 30-50%.',
potentialSavings: calculatePotentialSavings(deployment, -1)
};
} else if (isUnderProvisioned) {
return {
currentSize: deployment.instanceType,
recommendedSize: getNextLargerSize(deployment.provider, deployment.instanceType),
reason: 'Instance is under-provisioned, which could lead to performance issues. Upgrading would improve reliability.',
potentialSavings: calculatePotentialSavings(deployment, 1)
};
}
return {
currentSize: deployment.instanceType,
recommendedSize: deployment.instanceType,
reason: 'Current instance size appears to be appropriate for your usage patterns.',
potentialSavings: 0
};
}
```
## Deployment Strategies
### Continuous Integration/Continuous Deployment
Set up a CI/CD pipeline for the MCP Server:
```yaml
# GitHub Actions workflow example
name: Deploy MCP Server
on:
push:
branches: [main]
workflow_dispatch:
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Setup Node.js
uses: actions/setup-node@v3
with:
node-version: '16'
- name: Install dependencies
run: npm ci
- name: Run tests
run: npm test
deploy:
needs: test
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Setup Node.js
uses: actions/setup-node@v3
with:
node-version: '16'
- name: Install dependencies
run: npm ci
- name: Build
run: npm run build
- name: Deploy to Cloudflare
uses: cloudflare/wrangler-action@v2
with:
apiToken: ${{ secrets.CLOUDFLARE_API_TOKEN }}
```
### Blue-Green Deployments
For zero-downtime updates to the Crawl4AI Docker server:
```bash
#!/bin/bash
# Blue-Green deployment script for Crawl4AI Docker server
# Pull the new image
docker pull unclecode/crawl4ai:latest
# Start the new container (green)
docker run -d --name crawl4ai-green -p 11236:11235 --shm-size=1g unclecode/crawl4ai:latest
# Wait for the new container to initialize
echo "Waiting for green deployment to initialize..."
sleep 30
# Check if the new container is healthy
HEALTH_CHECK=$(curl -s http://localhost:11236/health)
if [ $? -ne 0 ] || [[ "$HEALTH_CHECK" != *"status":"healthy"* ]]; then
echo "Health check failed for green deployment, rolling back..."
docker stop crawl4ai-green
docker rm crawl4ai-green
exit 1
fi
# Update the load balancer/proxy to route to the new container
# This depends on your specific infrastructure
# Stop the old container (blue)
docker stop crawl4ai
docker rm crawl4ai
# Rename the green container to be the new blue
docker rename crawl4ai-green crawl4ai
# Update the port mapping if needed
# This might require stopping and restarting with the correct port
echo "Blue-Green deployment completed successfully"
```
## Security Best Practices
### JWT Token Configuration
```typescript
// JWT configuration for secure tokens
const jwtConfig = {
algorithm: 'RS256',
expiresIn: '1h',
audience: 'crawl4ai-mcp-server',
issuer: 'crawl4ai-auth'
};
// Generate JWT token
function generateToken(userId: string, permissions: string[]): string {
return jwt.sign(
{
sub: userId,
permissions,
iat: Math.floor(Date.now() / 1000)
},
PRIVATE_KEY,
jwtConfig
);
}
// Verify JWT token
function verifyToken(token: string): JWTPayload {
try {
return jwt.verify(token, PUBLIC_KEY, jwtConfig);
} catch (error) {
throw new AuthError('Invalid or expired token');
}
}
```
### API Rate Limiting
```typescript
// Rate limiting middleware
function rateLimitMiddleware(options: RateLimitOptions) {
const limiter = new RateLimiter({
windowMs: options.windowMs || 15 * 60 * 1000, // 15 minutes by default
max: options.max || 100, // 100 requests per windowMs by default
standardHeaders: true,
legacyHeaders: false,
keyGenerator: (req) => {
// Use user ID if authenticated, IP otherwise
return req.user?.id || req.ip;
}
});
return limiter;
}
// Apply rate limiting to sensitive routes
app.use('/api/deployments', rateLimitMiddleware({ windowMs: 5 * 60 * 1000, max: 20 }));
app.use('/api/providers/configure', rateLimitMiddleware({ windowMs: 60 * 60 * 1000, max: 10 }));
```
### Firewall Configuration
Example AWS security group configuration to secure the Crawl4AI Docker server:
```typescript
const securityGroup = new aws.ec2.SecurityGroup('crawl4ai-sg', {
description: 'Security group for Crawl4AI Docker server',
vpcId: vpc.id,
ingress: [
{
description: 'Crawl4AI API access',
fromPort: 11235,
toPort: 11235,
protocol: 'tcp',
// Only allow access from the Cloudflare Worker IPs
cidrBlocks: CLOUDFLARE_IP_RANGES,
},
{
description: 'SSH access',
fromPort: 22,
toPort: 22,
protocol: 'tcp',
// Only allow SSH from admin IPs
cidrBlocks: ADMIN_IP_ADDRESSES,
}
],
egress: [
{
fromPort: 0,
toPort: 0,
protocol: '-1',
cidrBlocks: ['0.0.0.0/0'],
}
],
tags: {
Name: 'crawl4ai-security-group',
},
});
```
## Conclusion
This implementation guide provides detailed technical considerations and code examples for building the enhanced Crawl4AI MCP Server architecture. By following these patterns and best practices, developers can create a robust, secure, and user-friendly solution that properly integrates with Crawl4AI while supporting multiple users and cloud providers.
The focus throughout the implementation should be on:
1. Security first - protecting user data and credentials
2. Performance optimization - ensuring efficient API calls and resource usage
3. User experience - making deployment and management straightforward
4. Cost effectiveness - helping users minimize infrastructure costs
5. Reliability - implementing proper monitoring and recovery procedures
By addressing these aspects from the start, the resulting system will provide a strong foundation that can be extended with additional features over time.