Skip to main content
Glama
IMPLEMENTATION_GUIDE.md20.9 kB
# Crawl4AI MCP Server Implementation Guide This technical guide provides detailed implementation considerations and best practices for developing the enhanced Crawl4AI MCP Server architecture. It serves as a companion to the `MIGRATION_PLAN.md` and `ENHANCED_ARCHITECTURE.md` documents. ## Technical Implementation Considerations ### 1. Adapter Implementation Details The `Crawl4AIAdapter` class needs significant refactoring to properly work with the Crawl4AI Docker API: #### Parameter Transformation Crawl4AI Docker API expects different parameter structures than the current Firecrawl implementation: ```typescript // Current Firecrawl format (flat structure): { "url": "https://example.com", "formats": ["markdown", "html"], "onlyMainContent": true } // Required Crawl4AI format (nested structure): { "urls": ["https://example.com"], "browser_config": { "type": "BrowserConfig", "params": {"headless": true} }, "crawler_config": { "type": "CrawlerRunConfig", "params": { "formats": ["markdown", "html"], "onlyMainContent": true } } } ``` Implementation example: ```typescript private transformScrapeParams(params: any): any { const { url, ...options } = params; return { urls: [url], browser_config: { type: "BrowserConfig", params: { headless: options.mobile ? false : true, // Transform other browser-specific options } }, crawler_config: { type: "CrawlerRunConfig", params: { formats: options.formats || ["markdown"], onlyMainContent: options.onlyMainContent || false, // Transform other crawler-specific options } } }; } ``` ### 2. Authentication Mechanisms Crawl4AI Docker server supports multiple authentication methods: #### Bearer Token Authentication ```typescript public setApiKey(apiKey: string): void { if (!apiKey) return; this.apiClient.defaults.headers.common['Authorization'] = `Bearer ${apiKey}`; } ``` #### JWT Authentication For JWT authentication, implement: ```typescript public async authenticate(email: string): Promise<string> { try { const response = await this.executeRequest( 'POST', '/auth', { email } ); if (response.token) { this.apiClient.defaults.headers.common['Authorization'] = `Bearer ${response.token}`; return response.token; } throw new Error('Authentication failed: No token received'); } catch (error) { console.error('Authentication error:', error); throw error; } } ``` ### 3. Error Handling and Retries Implement more sophisticated error handling specific to Crawl4AI: ```typescript // Update error classification based on Crawl4AI error patterns switch (status) { case 401: case 403: errorType = ErrorType.AUTHENTICATION; errorMessage = 'Authentication error: Invalid or expired API key'; break; case 429: errorType = ErrorType.RATE_LIMIT; errorMessage = 'Rate limit exceeded: Too many requests'; retryable = true; break; case 503: case 504: errorType = ErrorType.SERVER; errorMessage = 'Crawl4AI server temporarily unavailable'; retryable = true; break; // Additional Crawl4AI-specific error codes case 422: errorType = ErrorType.VALIDATION; errorMessage = `Validation error: ${errorData.message || 'Invalid parameters'}`; break; // etc. } ``` ### 4. WebSocket Support Crawl4AI Docker server supports WebSocket connections for streaming results: ```typescript public async streamCrawl(urls: string[], options: any = {}): Promise<WebSocket> { const params = this.transformCrawlParams({ urls, ...options, stream: true }); // Create WebSocket connection const wsUrl = `${this.baseUrl.replace('http', 'ws')}/crawl/stream`; const ws = new WebSocket(wsUrl); ws.onopen = () => { ws.send(JSON.stringify(params)); }; return ws; } ``` ## Infrastructure as Code Implementations ### AWS CloudFormation Template ```yaml AWSTemplateFormatVersion: '2010-09-09' Description: 'Crawl4AI Docker Server on AWS' Parameters: InstanceType: Type: String Default: t3.medium AllowedValues: - t3.medium - t3.large - t3.xlarge Resources: Crawl4AISecurityGroup: Type: AWS::EC2::SecurityGroup Properties: GroupDescription: Security group for Crawl4AI Docker server SecurityGroupIngress: - IpProtocol: tcp FromPort: 11235 ToPort: 11235 CidrIp: 0.0.0.0/0 # Should be restricted in production Crawl4AIInstance: Type: AWS::EC2::Instance Properties: InstanceType: !Ref InstanceType ImageId: ami-0261755bbcb8c4a84 # Amazon Linux 2 (update as needed) SecurityGroupIds: - !Ref Crawl4AISecurityGroup UserData: Fn::Base64: !Sub | #!/bin/bash yum update -y amazon-linux-extras install docker -y service docker start systemctl enable docker docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:latest Outputs: ServerURL: Description: URL for the Crawl4AI server Value: !Sub http://${Crawl4AIInstance.PublicDnsName}:11235 ``` ### Terraform Module for Multi-Cloud ```hcl variable "cloud_provider" { description = "Cloud provider to use (aws, gcp, azure)" type = string default = "aws" } module "crawl4ai_server" { source = "./modules/${var.cloud_provider}" # Common variables instance_size = var.instance_size docker_version = var.docker_version # Provider-specific variables region = var.region vpc_id = var.vpc_id subnet_id = var.subnet_id } ``` ## User Management System ### User Model ```typescript interface User { id: string; email: string; name: string; apiKey: string; createdAt: Date; lastLoginAt: Date; preferences: { defaultProvider: string; instanceSize: string; autoShutdownMinutes: number; }; deployments: Deployment[]; } interface Deployment { id: string; provider: string; region: string; instanceType: string; status: 'creating' | 'running' | 'stopped' | 'error'; createdAt: Date; lastActiveAt: Date; endpoint: string; cost: { hourlyRate: number; monthlyCost: number; currency: string; }; } ``` ### User Registration Flow ```typescript async function registerUser(email: string, name: string, password: string): Promise<User> { // Validate email format if (!isValidEmail(email)) { throw new Error('Invalid email format'); } // Check if user already exists const existingUser = await UserStorage.findByEmail(email); if (existingUser) { throw new Error('User already exists'); } // Generate API key const apiKey = generateSecureApiKey(); // Create user record const user = { id: uuidv4(), email, name, apiKey, createdAt: new Date(), lastLoginAt: new Date(), preferences: { defaultProvider: 'aws', instanceSize: 'medium', autoShutdownMinutes: 60 }, deployments: [] }; // Store password securely (using bcrypt or similar) await UserStorage.storePassword(user.id, await hashPassword(password)); // Store user record await UserStorage.save(user); return user; } ``` ## Dashboard Implementation ### Frontend Components The user dashboard will need several key components: 1. **Authentication Module**: - Login/registration forms - Password reset functionality - Session management 2. **Provider Configuration**: - Cloud provider selection - Secure credential collection - Deployment options 3. **Management Interface**: - Instance status monitoring - Resource scaling controls - Usage statistics and billing information ### Backend API Endpoints ```typescript // User management app.post('/api/users/register', handleUserRegistration); app.post('/api/users/login', handleUserLogin); app.get('/api/users/me', requireAuth, handleGetCurrentUser); // Cloud provider management app.post('/api/providers/configure', requireAuth, handleProviderConfiguration); app.get('/api/providers/status', requireAuth, handleGetProviderStatus); // Deployment management app.post('/api/deployments', requireAuth, handleCreateDeployment); app.get('/api/deployments', requireAuth, handleListDeployments); app.get('/api/deployments/:id', requireAuth, handleGetDeployment); app.post('/api/deployments/:id/start', requireAuth, handleStartDeployment); app.post('/api/deployments/:id/stop', requireAuth, handleStopDeployment); app.delete('/api/deployments/:id', requireAuth, handleDeleteDeployment); // Monitoring and utilization app.get('/api/deployments/:id/metrics', requireAuth, handleGetDeploymentMetrics); app.get('/api/deployments/:id/logs', requireAuth, handleGetDeploymentLogs); app.get('/api/deployments/:id/billing', requireAuth, handleGetDeploymentBilling); ``` ## Secure Credential Management ### Storing Cloud Provider Credentials Never store cloud provider credentials in plain text. Use proper encryption and secure storage: ```typescript async function storeProviderCredentials( userId: string, provider: string, credentials: any ): Promise<void> { // Encrypt credentials before storage const encryptedCredentials = await encryptData( JSON.stringify(credentials), process.env.ENCRYPTION_KEY! ); // Store in secure storage (KV, database, etc.) await CredentialStorage.save(userId, provider, encryptedCredentials); // Optional: Store a key reference in the user record await UserStorage.updateProviderStatus(userId, provider, 'configured'); } ``` ### Using Credentials for Deployments ```typescript async function getProviderCredentials(userId: string, provider: string): Promise<any> { // Retrieve encrypted credentials const encryptedCredentials = await CredentialStorage.get(userId, provider); if (!encryptedCredentials) { throw new Error(`No ${provider} credentials found for user`); } // Decrypt credentials for use const credentials = JSON.parse( await decryptData(encryptedCredentials, process.env.ENCRYPTION_KEY!) ); return credentials; } ``` ## Monitoring and Observability ### Health Checks Implement regular health checks for Crawl4AI deployments: ```typescript async function checkDeploymentHealth(deploymentId: string): Promise<HealthStatus> { try { const deployment = await DeploymentStorage.get(deploymentId); if (!deployment) { throw new Error('Deployment not found'); } // Check if instance is responsive const response = await fetch(`${deployment.endpoint}/health`, { timeout: 5000 }); if (!response.ok) { return { status: 'unhealthy', message: `Server returned status ${response.status}`, lastChecked: new Date() }; } const healthData = await response.json(); return { status: 'healthy', message: 'Service is running normally', metrics: { memoryUsage: healthData.memory_usage, cpuUsage: healthData.cpu_usage, uptime: healthData.uptime }, lastChecked: new Date() }; } catch (error) { return { status: 'error', message: error.message, lastChecked: new Date() }; } } ``` ### Alerting and Notifications ```typescript async function monitorDeployments(): Promise<void> { // Get all active deployments const activeDeployments = await DeploymentStorage.findActive(); for (const deployment of activeDeployments) { const health = await checkDeploymentHealth(deployment.id); // Update stored health status await DeploymentStorage.updateHealth(deployment.id, health); // Send alerts for unhealthy deployments if (health.status !== 'healthy') { await sendAlert(deployment.userId, { type: 'deployment_unhealthy', deploymentId: deployment.id, message: health.message, timestamp: new Date(), severity: health.status === 'error' ? 'high' : 'medium' }); } } } ``` ## Cost Optimization Strategies ### Auto-Shutdown for Idle Instances ```typescript async function checkAndShutdownIdleDeployments(): Promise<void> { const runningDeployments = await DeploymentStorage.findByStatus('running'); for (const deployment of runningDeployments) { const user = await UserStorage.get(deployment.userId); const idleMinutes = calculateIdleTime(deployment.lastActiveAt); if (idleMinutes >= user.preferences.autoShutdownMinutes) { console.log(`Shutting down idle deployment ${deployment.id} after ${idleMinutes} minutes`); try { // Get provider-specific client const provider = getProviderClient(deployment.provider); // Stop the deployment await provider.stopDeployment(deployment.id); // Update deployment status await DeploymentStorage.updateStatus(deployment.id, 'stopped'); // Notify user await sendNotification(deployment.userId, { type: 'deployment_auto_shutdown', deploymentId: deployment.id, message: `Your ${deployment.provider} deployment was automatically stopped after ${idleMinutes} minutes of inactivity`, timestamp: new Date() }); } catch (error) { console.error(`Failed to shutdown deployment ${deployment.id}:`, error); } } } } ``` ### Right-Sizing Recommendations ```typescript function generateSizingRecommendation(deployment: Deployment, usageMetrics: UsageMetrics): SizingRecommendation { // Analyze CPU utilization const avgCpuUtilization = calculateAverageCpuUtilization(usageMetrics); const peakCpuUtilization = calculatePeakCpuUtilization(usageMetrics); // Analyze memory utilization const avgMemoryUtilization = calculateAverageMemoryUtilization(usageMetrics); const peakMemoryUtilization = calculatePeakMemoryUtilization(usageMetrics); // Determine if over-provisioned const isOverProvisioned = avgCpuUtilization < 20 && peakCpuUtilization < 50 && avgMemoryUtilization < 40 && peakMemoryUtilization < 70; // Determine if under-provisioned const isUnderProvisioned = avgCpuUtilization > 70 || peakCpuUtilization > 90 || avgMemoryUtilization > 80 || peakMemoryUtilization > 95; // Make recommendation if (isOverProvisioned) { return { currentSize: deployment.instanceType, recommendedSize: getNextSmallerSize(deployment.provider, deployment.instanceType), reason: 'Instance is significantly over-provisioned. Downsizing could reduce costs by approximately 30-50%.', potentialSavings: calculatePotentialSavings(deployment, -1) }; } else if (isUnderProvisioned) { return { currentSize: deployment.instanceType, recommendedSize: getNextLargerSize(deployment.provider, deployment.instanceType), reason: 'Instance is under-provisioned, which could lead to performance issues. Upgrading would improve reliability.', potentialSavings: calculatePotentialSavings(deployment, 1) }; } return { currentSize: deployment.instanceType, recommendedSize: deployment.instanceType, reason: 'Current instance size appears to be appropriate for your usage patterns.', potentialSavings: 0 }; } ``` ## Deployment Strategies ### Continuous Integration/Continuous Deployment Set up a CI/CD pipeline for the MCP Server: ```yaml # GitHub Actions workflow example name: Deploy MCP Server on: push: branches: [main] workflow_dispatch: jobs: test: runs-on: ubuntu-latest steps: - uses: actions/checkout@v3 - name: Setup Node.js uses: actions/setup-node@v3 with: node-version: '16' - name: Install dependencies run: npm ci - name: Run tests run: npm test deploy: needs: test runs-on: ubuntu-latest steps: - uses: actions/checkout@v3 - name: Setup Node.js uses: actions/setup-node@v3 with: node-version: '16' - name: Install dependencies run: npm ci - name: Build run: npm run build - name: Deploy to Cloudflare uses: cloudflare/wrangler-action@v2 with: apiToken: ${{ secrets.CLOUDFLARE_API_TOKEN }} ``` ### Blue-Green Deployments For zero-downtime updates to the Crawl4AI Docker server: ```bash #!/bin/bash # Blue-Green deployment script for Crawl4AI Docker server # Pull the new image docker pull unclecode/crawl4ai:latest # Start the new container (green) docker run -d --name crawl4ai-green -p 11236:11235 --shm-size=1g unclecode/crawl4ai:latest # Wait for the new container to initialize echo "Waiting for green deployment to initialize..." sleep 30 # Check if the new container is healthy HEALTH_CHECK=$(curl -s http://localhost:11236/health) if [ $? -ne 0 ] || [[ "$HEALTH_CHECK" != *"status":"healthy"* ]]; then echo "Health check failed for green deployment, rolling back..." docker stop crawl4ai-green docker rm crawl4ai-green exit 1 fi # Update the load balancer/proxy to route to the new container # This depends on your specific infrastructure # Stop the old container (blue) docker stop crawl4ai docker rm crawl4ai # Rename the green container to be the new blue docker rename crawl4ai-green crawl4ai # Update the port mapping if needed # This might require stopping and restarting with the correct port echo "Blue-Green deployment completed successfully" ``` ## Security Best Practices ### JWT Token Configuration ```typescript // JWT configuration for secure tokens const jwtConfig = { algorithm: 'RS256', expiresIn: '1h', audience: 'crawl4ai-mcp-server', issuer: 'crawl4ai-auth' }; // Generate JWT token function generateToken(userId: string, permissions: string[]): string { return jwt.sign( { sub: userId, permissions, iat: Math.floor(Date.now() / 1000) }, PRIVATE_KEY, jwtConfig ); } // Verify JWT token function verifyToken(token: string): JWTPayload { try { return jwt.verify(token, PUBLIC_KEY, jwtConfig); } catch (error) { throw new AuthError('Invalid or expired token'); } } ``` ### API Rate Limiting ```typescript // Rate limiting middleware function rateLimitMiddleware(options: RateLimitOptions) { const limiter = new RateLimiter({ windowMs: options.windowMs || 15 * 60 * 1000, // 15 minutes by default max: options.max || 100, // 100 requests per windowMs by default standardHeaders: true, legacyHeaders: false, keyGenerator: (req) => { // Use user ID if authenticated, IP otherwise return req.user?.id || req.ip; } }); return limiter; } // Apply rate limiting to sensitive routes app.use('/api/deployments', rateLimitMiddleware({ windowMs: 5 * 60 * 1000, max: 20 })); app.use('/api/providers/configure', rateLimitMiddleware({ windowMs: 60 * 60 * 1000, max: 10 })); ``` ### Firewall Configuration Example AWS security group configuration to secure the Crawl4AI Docker server: ```typescript const securityGroup = new aws.ec2.SecurityGroup('crawl4ai-sg', { description: 'Security group for Crawl4AI Docker server', vpcId: vpc.id, ingress: [ { description: 'Crawl4AI API access', fromPort: 11235, toPort: 11235, protocol: 'tcp', // Only allow access from the Cloudflare Worker IPs cidrBlocks: CLOUDFLARE_IP_RANGES, }, { description: 'SSH access', fromPort: 22, toPort: 22, protocol: 'tcp', // Only allow SSH from admin IPs cidrBlocks: ADMIN_IP_ADDRESSES, } ], egress: [ { fromPort: 0, toPort: 0, protocol: '-1', cidrBlocks: ['0.0.0.0/0'], } ], tags: { Name: 'crawl4ai-security-group', }, }); ``` ## Conclusion This implementation guide provides detailed technical considerations and code examples for building the enhanced Crawl4AI MCP Server architecture. By following these patterns and best practices, developers can create a robust, secure, and user-friendly solution that properly integrates with Crawl4AI while supporting multiple users and cloud providers. The focus throughout the implementation should be on: 1. Security first - protecting user data and credentials 2. Performance optimization - ensuring efficient API calls and resource usage 3. User experience - making deployment and management straightforward 4. Cost effectiveness - helping users minimize infrastructure costs 5. Reliability - implementing proper monitoring and recovery procedures By addressing these aspects from the start, the resulting system will provide a strong foundation that can be extended with additional features over time.

Latest Blog Posts

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/BjornMelin/crawl4ai-mcp-server'

If you have feedback or need assistance with the MCP directory API, please join our Discord server