Crawl4AI MCP Server

CONDUCTOR.md•19.3 KiB

# CONDUCTOR.md

> _Read me first. Every other doc is linked below._

## Critical Context (Read First)

- **Tech Stack**: Python 3.7+, FastMCP, crawl4ai, Playwright, asyncio, Microsoft MarkItDown
- **Main File**: `crawl4ai_mcp/server.py` (3,392 lines) - Complete MCP server implementation
- **Core Mechanic**: Advanced web crawling MCP server with 19 tools, JavaScript support, file processing
- **Key Integration**: crawl4ai engine, Playwright browser automation, YouTube transcripts, Google search
- **Platform Support**: STDIO (default), Pure HTTP (recommended), Legacy HTTP/SSE, Claude Desktop
- **DO NOT**: Modify output suppression system - breaks MCP JSON protocol communication

## Table of Contents

1. [Architecture](ARCHITECTURE.md) - Tech stack, folder structure, infrastructure
2. [Design Tokens](DESIGN.md) - Colors, typography, visual system
3. [UI/UX Patterns](UIUX.md) - Components, interactions, accessibility
4. [Runtime Config](CONFIG.md) - Environment variables, feature flags
5. [Data Model](DATA_MODEL.md) - Database schema, entities, relationships
6. [API Contracts](API.md) - Endpoints, request/response formats, auth
7. [Build & Release](BUILD.md) - Build process, deployment, CI/CD
8. [Testing Guide](TEST.md) - Test strategies, E2E scenarios, coverage
9. [Operational Playbooks](PLAYBOOKS/DEPLOY.md) - Deployment, rollback, monitoring
10. [Contributing](CONTRIBUTING.md) - Code style, PR process, conventions
11. [Error Ledger](ERRORS.md) - Critical P0/P1 error tracking
12. [Task Management](TASKS.md) - Active tasks, phase tracking, context preservation

## Quick Reference

- **MCP Server Instance**: `crawl4ai_mcp/server.py:229` - FastMCP("Crawl4AI MCP Server")
- **Tool Selection Guide**: `crawl4ai_mcp/__init__.py:11-155` - Comprehensive AI tool mapping
- **Core Tools Implementation**: `crawl4ai_mcp/server.py:250-3300` - All 19 MCP tools
- **File Processor**: `crawl4ai_mcp/file_processor.py:1-319` - MarkItDown integration
- **YouTube Processor**: `crawl4ai_mcp/youtube_processor.py:1-607` - Transcript extraction
- **Google Search Engine**: `crawl4ai_mcp/google_search_processor.py:1-563` - 31 search genres
- **Configuration Manager**: `crawl4ai_mcp/config.py:1-397` - Server behavior config
- **Output Suppression**: `crawl4ai_mcp/suppress_output.py:1-49` - MCP protocol protection
- **HTTP Transport Example**: `examples/pure_streamable_http_server.py:1-400` - Pure HTTP
- **Cache System**: `crawl4ai_mcp/server.py:150-200` - 15-minute self-cleaning cache
- **Main Entry Point**: `crawl4ai_mcp/server.py:3349-3393` - Server startup logic
- **Extraction Strategies**: `crawl4ai_mcp/strategies.py:1-273` - Content extraction patterns
- **Setup Automation**: `setup.sh` & `setup_windows.bat` - Environment setup scripts

## Current State

- [x] Production-ready MCP server with 19 comprehensive tools
- [x] Full JavaScript/SPA support via Playwright integration
- [x] Advanced file processing (PDF, Office docs, ZIP archives)
- [x] Stable YouTube transcript extraction (no auth required)
- [x] Google search with 31 specialized genres
- [x] Multiple transport protocols (STDIO, Pure HTTP, Legacy SSE)
- [x] Intelligent caching system with 15-minute auto-cleanup
- [x] Claude Desktop integration with multiple config templates
- [x] Comprehensive security audit completed
- [x] Output suppression system protecting MCP JSON protocol

## Development Workflow

1. **Environment Setup**: Execute `./setup.sh` (Linux/macOS) or `setup_windows.bat`
2. **Server Launch**: `python -m crawl4ai_mcp.server` (STDIO) or HTTP via scripts
3. **Tool Testing**: Use Claude Desktop or MCP client to validate tool functionality
4. **Output Monitoring**: Ensure output suppression maintains MCP JSON integrity
5. **Configuration Updates**: Modify `crawl4ai_mcp/config.py` for behavior adjustments
6. **Security Validation**: Run security checks before commits (no exposed secrets)

## Task Templates

### 1. Implement New MCP Tool

1. Add tool function to `crawl4ai_mcp/server.py:250+` with `@mcp.tool()` decorator
2. Update tool selection guide in `__init__.py:11-155` with use case mappings
3. Test tool functionality through Claude Desktop integration
4. Validate output suppression doesn't interfere with responses
5. Update documentation and examples

### 2. Debug JavaScript/SPA Crawling Issues

1. Verify Playwright settings in `crawl_url` tool parameters
2. Check `wait_for_js: true`, `simulate_user: true` configuration
3. Test problematic URL with 30-60 second timeouts
4. Validate content extraction quality and completeness
5. Document findings in troubleshooting section

### 3. Add File Processing Format Support

1. Check Microsoft MarkItDown compatibility in `file_processor.py:50-100`
2. Add new format to supported formats enumeration
3. Test with sample files using `process_file` tool
4. Ensure 100MB size limit handling works correctly
5. Update documentation with format specifications

### 4. Transport Protocol Configuration

1. Review server startup logic in `server.py:3349-3393`
2. Configure transport settings in `config.py` as needed
3. Test with appropriate example server in `examples/`
4. Verify output suppression system remains intact
5. Validate full MCP JSON protocol compliance

### 5. Performance Optimization

1. Profile tool execution times using cache system metrics
2. Identify bottlenecks in `crawl4ai_mcp/server.py` tool implementations
3. Optimize concurrent processing in batch operations
4. Test improvements with realistic workloads
5. Document performance characteristics and limitations

## Anti-Patterns (Avoid These)

- ❌ **Don't modify suppress_output.py** - Breaks MCP JSON protocol communication completely
- ❌ **Don't disable crawl4ai output suppression** - Verbose logs corrupt MCP responses
- ❌ **Don't exceed 100MB file processing limit** - System will reject with clear error messages
- ❌ **Don't ignore JavaScript execution settings** - SPA sites require `wait_for_js: true`
- ❌ **Don't set excessive deep crawl depth** - Stability limited to 5 pages maximum
- ❌ **Don't hardcode credentials anywhere** - Always use environment variables exclusively

## Version History

- **v1.0.0** - Initial release
- **v1.1.0** - Feature added (see JOURNAL.md YYYY-MM-DD)

[Link major versions to journal entries]

## Continuous Engineering Journal

Claude, keep an ever-growing changelog in [`JOURNAL.md`](JOURNAL.md).

### What to Journal

- **Major changes**: New features, significant refactors, API changes
- **Bug fixes**: What broke, why, and how it was fixed
- **Frustration points**: Problems that took multiple attempts to solve
- **Design decisions**: Why we chose one approach over another
- **Performance improvements**: Before/after metrics
- **User feedback**: Notable issues or requests
- **Learning moments**: New techniques or patterns discovered

### Journal Format

```
## YYYY-MM-DD HH:MM

### [Short Title]
- **What**: Brief description of the change
- **Why**: Reason for the change
- **How**: Technical approach taken
- **Issues**: Any problems encountered
- **Result**: Outcome and any metrics

### [Short Title] |ERROR:ERR-YYYY-MM-DD-001|
- **What**: Critical P0/P1 error description
- **Why**: Root cause analysis
- **How**: Fix implementation
- **Issues**: Debugging challenges
- **Result**: Resolution and prevention measures

### [Task Title] |TASK:TASK-YYYY-MM-DD-001|
- **What**: Task implementation summary
- **Why**: Part of [Phase Name] phase
- **How**: Technical approach and key decisions
- **Issues**: Blockers encountered and resolved
- **Result**: Task completed, findings documented in ARCHITECTURE.md
```

### Compaction Rule

When `JOURNAL.md` exceeds **500 lines**:

1. Claude summarizes the oldest half into `JOURNAL_ARCHIVE/<year>-<month>.md`
2. Remaining entries stay in `JOURNAL.md` so the file never grows unbounded

> ⚠️ Claude must NEVER delete raw history; only move & summarize.

### 2. ARCHITECTURE.md

**Purpose**: System design, tech stack decisions, and code structure with line numbers.

**Required Elements**:

- Technology stack listing
- Directory structure diagram
- Key architectural decisions with rationale
- Component architecture with exact line numbers
- System flow diagram (ASCII art)
- Common patterns section
- Keywords for search optimization

**Line Number Format**:

````
#### ComponentName Structure

```typescript
// Major classes with exact line numbers
class MainClass { /* lines 100-500 */ } //
class Helper { /* lines 501-600 */ } //
```
````

### 3. DESIGN.md

**Purpose**: Visual design system, styling, and theming documentation.

**Required Sections**:

- Typography system
- Color palette (with hex values)
- Visual effects specifications
- Character/entity design
- UI/UX component styling
- Animation system
- Mobile design considerations
- Accessibility guidelines
- Keywords section

### 4. DATA_MODEL.md

**Purpose**: Database schema, application models, and data structures.

**Required Elements**:

- Database schema (SQL)
- Application data models (TypeScript/language interfaces)
- Validation rules
- Common queries
- Data migration history
- Keywords for entities

### 5. API.md

**Purpose**: Complete API documentation with examples.

**Structure for Each Endpoint**:

````
### Endpoint Name

```http
METHOD /api/endpoint
```

#### Request

```json
{ "field": "type" }
```

#### Response

```json
{ "field": "value" }
```

#### Details
- **Rate limit**: X requests per Y seconds
- **Auth**: Required/Optional
- **Notes**: Special considerations
````

### 6. CONFIG.md

**Purpose**: Runtime configuration, environment variables, and settings.

**Required Sections**:

- Environment variables (required and optional)
- Application configuration constants
- Feature flags
- Performance tuning settings
- Security configuration
- Common patterns for configuration changes

### 7. BUILD.md

**Purpose**: Build process, deployment, and CI/CD documentation.

**Include**:

- Prerequisites
- Build commands
- CI/CD pipeline configuration
- Deployment steps
- Rollback procedures
- Troubleshooting guide

### 8. TEST.md

**Purpose**: Testing strategies, patterns, and examples.

**Sections**:

- Test stack and tools
- Running tests commands
- Test structure
- Coverage goals
- Common test patterns
- Debugging tests

### 9. UIUX.md

**Purpose**: Interaction patterns, user flows, and behavior specifications.

**Cover**:

- Input methods
- State transitions
- Component behaviors
- User flows
- Accessibility patterns
- Performance considerations

### 10. CONTRIBUTING.md

**Purpose**: Guidelines for contributors.

**Include**:

- Code of conduct
- Development setup
- Code style guide
- Commit message format
- PR process
- Common patterns

### 11. PLAYBOOKS/DEPLOY.md

**Purpose**: Step-by-step operational procedures.

**Format**:

- Pre-deployment checklist
- Deployment steps (multiple options)
- Post-deployment verification
- Rollback procedures
- Troubleshooting

### 12. ERRORS.md (Critical Error Ledger)

**Purpose**: Track and resolve P0/P1 critical errors with full traceability.

**Required Structure**:

```
# Critical Error Ledger

## Schema
| ID | First seen | Status | Severity | Affected area | Link to fix |
|----|------------|--------|----------|---------------|-------------|

## Active Errors
[New errors added here, newest first]

## Resolved Errors
[Moved here when fixed, with links to fixes]
```

**Error ID Format**: `ERR-YYYY-MM-DD-001` (increment for multiple per day)

**Severity Definitions**:

- **P0**: Complete outage, data loss, security breach
- **P1**: Major functionality broken, significant performance degradation
- **P2**: Minor functionality (not tracked in ERRORS.md)
- **P3**: Cosmetic issues (not tracked in ERRORS.md)

**Claude's Error Logging Process**:

1. When P0/P1 error occurs, immediately add to Active Errors
2. Create corresponding JOURNAL.md entry with details
3. When resolved:
   - Move to Resolved Errors section
   - Update status to "resolved"
   - Add commit hash and PR link
   - Add `|ERROR:<ID>|` tag to JOURNAL.md entry
   - Link back to JOURNAL entry from ERRORS.md

### 13. TASKS.md (Active Task Management)

**Purpose**: Track ongoing work with phase awareness and context preservation between sessions.

**IMPORTANT**: TASKS.md complements Claude's built-in todo system - it does NOT replace it:

- Claude's todos: For immediate task tracking within a session
- TASKS.md: For preserving context and state between sessions

**Required Structure**:

```
# Task Management

## Active Phase
**Phase**: [High-level project phase name]
**Started**: YYYY-MM-DD
**Target**: YYYY-MM-DD
**Progress**: X/Y tasks completed

## Current Task
**Task ID**: TASK-YYYY-MM-DD-NNN
**Title**: [Descriptive task name]
**Status**: PLANNING | IN_PROGRESS | BLOCKED | TESTING | COMPLETE
**Started**: YYYY-MM-DD HH:MM
**Dependencies**: [List task IDs this depends on]

### Task Context
- **Previous Work**: [Link to related tasks/PRs]
- **Key Files**: [Primary files being modified with line ranges]
- **Environment**: [Specific config/versions if relevant]
- **Next Steps**: [Immediate actions when resuming]

### Findings & Decisions
- **FINDING-001**: [Discovery that affects approach]
- **DECISION-001**: [Technical choice made] → Link to ARCHITECTURE.md
- **BLOCKER-001**: [Issue preventing progress] → Link to resolution

### Task Chain
1. ✅ [Completed prerequisite task] (TASK-YYYY-MM-DD-001)
2. 🔄 [Current task] (CURRENT)
3. ⏳ [Next planned task]
4. ⏳ [Future task in phase]
```

**Task Management Rules**:

1. **One Active Task**: Only one task should be IN_PROGRESS at a time
2. **Context Capture**: Before switching tasks, capture all context needed to resume
3. **Findings Documentation**: Record unexpected discoveries that impact the approach
4. **Decision Linking**: Link architectural decisions to ARCHITECTURE.md
5. **Completion Trigger**: When task completes:
   - Generate JOURNAL.md entry with task summary
   - Archive task details to TASKS_ARCHIVE/YYYY-MM/TASK-ID.md
   - Load next task from chain or prompt for new phase

**Task States**:

- **PLANNING**: Defining approach and breaking down work
- **IN_PROGRESS**: Actively working on implementation
- **BLOCKED**: Waiting on external dependency or decision
- **TESTING**: Implementation complete, validating functionality
- **COMPLETE**: Task finished and documented

**Integration with Journal**:

- Each completed task auto-generates a journal entry
- Journal references task ID for full context
- Critical findings promoted to relevant documentation

## Documentation Optimization Rules

### 1. Line Number Anchors

- Add exact line numbers for every class, function, and major code section
- Format: `**Class Name (Lines 100-200)**`
- Add HTML anchors: ``
- Update when code structure changes significantly

### 2. Quick Reference Card

- Place in CLAUDE.md after Table of Contents
- Include 10-15 most common code locations
- Format: ``**Feature**: `file:lines` - Description``

### 3. Current State Tracking

- Use checkbox format in CLAUDE.md
- `- [x] Completed feature`
- `- [ ] In-progress feature`
- Update after each work session

### 4. Task Templates

- Provide 3-5 step-by-step workflows
- Include specific line numbers
- Reference files that need updating
- Add test/verification steps

### 5. Keywords Sections

- Add to each major .md file
- List alternative search terms
- Format: `## Keywords `
- Include synonyms and related terms

### 6. Anti-Patterns

- Use ❌ emoji for clarity
- Explain why each is problematic
- Include 5-6 critical mistakes
- Place prominently in CLAUDE.md

### 7. System Flow Diagrams

- Use ASCII art for simplicity
- Show data/control flow
- Keep visual and readable
- Place in ARCHITECTURE.md

### 8. Common Patterns

- Add to relevant docs (CONFIG.md, ARCHITECTURE.md)
- Show exact code changes needed
- Include before/after examples
- Reference specific functions

### 9. Version History

- Link to JOURNAL.md entries
- Format: `v1.0.0 - Feature (see JOURNAL.md YYYY-MM-DD)`
- Track major changes only

### 10. Cross-Linking

- Link between related sections
- Use relative paths: `[Link](./FILE.md#section)`
- Ensure bidirectional linking where appropriate

## Journal System Setup

### JOURNAL.md Structure

```
# Engineering Journal

## YYYY-MM-DD HH:MM

### [Descriptive Title]
- **What**: Brief description of the change
- **Why**: Reason for the change
- **How**: Technical approach taken
- **Issues**: Any problems encountered
- **Result**: Outcome and any metrics

---

[Entries continue chronologically]
```

### Journal Best Practices

1. **Entry Timing**: Add entry immediately after significant work
2. **Detail Level**: Include enough detail to understand the change months later
3. **Problem Documentation**: Especially document multi-attempt solutions
4. **Learning Moments**: Capture new techniques discovered
5. **Metrics**: Include performance improvements, time saved, etc.

### Archive Process

When JOURNAL.md exceeds 500 lines:

1. Create `JOURNAL_ARCHIVE/` directory
2. Move oldest 250 lines to `JOURNAL_ARCHIVE/YYYY-MM.md`
3. Add summary header to archive file
4. Keep recent entries in main JOURNAL.md

## Implementation Steps

### Phase 1: Initial Setup (30-60 minutes)

1. **Create CLAUDE.md** with all required sections
2. **Fill Critical Context** with 6 essential facts
3. **Create Table of Contents** with placeholder links
4. **Add Quick Reference** with top 10-15 code locations
5. **Set up Journal section** with formatting rules

### Phase 2: Core Documentation (2-4 hours)

1. **Create each .md file** from the list above
2. **Add Keywords section** to each file
3. **Cross-link between files** where relevant
4. **Add line numbers** to code references
5. **Create PLAYBOOKS/ directory** with DEPLOY.md
6. **Create ERRORS.md** with schema table

### Phase 3: Optimization (1-2 hours)

1. **Add Task Templates** to CLAUDE.md
2. **Create ASCII system flow** in ARCHITECTURE.md
3. **Add Common Patterns** sections
4. **Document Anti-Patterns**
5. **Set up Version History**

### Phase 4: First Journal Entry

Create initial JOURNAL.md entry documenting the setup:

```
## YYYY-MM-DD HH:MM

### Documentation Framework Implementation
- **What**: Implemented CLAUDE.md modular documentation system
- **Why**: Improve AI navigation and code maintainability
- **How**: Split monolithic docs into focused modules with cross-linking
- **Issues**: None - clean implementation
- **Result**: [Number] documentation files created with full cross-referencing
```

## Maintenance Guidelines

### Daily

- Update JOURNAL.md with significant changes
- Mark completed items in Current State
- Update line numbers if major refactoring

### Weekly

- Review and update Quick Reference section
- Check for broken cross-links
- Update Task Templates if workflows change

### Monthly

- Review Keywords sections for completeness
- Update Version History
- Check if JOURNAL.md needs archiving

### Per Release

- Update Version History in CLAUDE.md
- Create comprehensive JOURNAL.md entry
- Review all documentation for accuracy
- Update Current State checklist

## Benefits of This System

1. **AI Efficiency**: Claude can quickly navigate to exact code locations
2. **Modularity**: Easy to update specific documentation without affecting others
3. **Discoverability**: New developers/AI can quickly understand the project
4. **History Tracking**: Complete record of changes and decisions
5. **Task Automation**: Templates reduce repetitive instructions
6. **Error Prevention**: Anti-patterns prevent common mistakes
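
## Example: Journal Archive Sketch

The Archive Process described under Journal System Setup (move the oldest 250 lines once `JOURNAL.md` passes 500 lines, never deleting raw history) could be automated along the following lines. This is a minimal sketch, not part of the repository: the function names `split_journal` and `archive_journal` are hypothetical, and it assumes the oldest entries sit at the top of the file under an optional `# ` title line, with each entry starting at a `## ` heading. It pushes the cut forward to the next entry heading so no entry is split, and it writes only a placeholder header; the written summary is still left to the author.

```python
from datetime import date
from pathlib import Path

MAX_LINES = 500   # archive threshold stated in the Archive Process
MOVE_LINES = 250  # how many of the oldest lines to move


def split_journal(lines, max_lines=MAX_LINES, move_lines=MOVE_LINES):
    """Return (archived, kept); archived is empty if no archiving is needed.

    Assumption: an optional '# ' title line comes first and the oldest
    entries follow it, each entry starting with a '## ' heading.
    """
    if len(lines) <= max_lines:
        return [], list(lines)
    title = lines[:1] if lines[0].startswith("# ") else []
    body = lines[len(title):]
    cut = min(move_lines, len(body))
    # Push the cut forward to the next entry heading so no entry is split.
    while cut < len(body) and not body[cut].startswith("## "):
        cut += 1
    return body[:cut], title + body[cut:]


def archive_journal(journal=Path("JOURNAL.md"),
                    archive_dir=Path("JOURNAL_ARCHIVE"), today=None):
    """Move the oldest lines into archive_dir/YYYY-MM.md; nothing is deleted."""
    today = today or date.today()
    lines = journal.read_text(encoding="utf-8").splitlines()
    archived, kept = split_journal(lines)
    if not archived:
        return None
    archive_dir.mkdir(parents=True, exist_ok=True)
    target = archive_dir / f"{today:%Y-%m}.md"
    # Placeholder header; the actual summary text is written separately.
    header = f"# Archive {today:%Y-%m}: {len(archived)} lines moved from {journal.name}"
    with target.open("a", encoding="utf-8") as fh:
        fh.write("\n".join([header, "", *archived]) + "\n")
    journal.write_text("\n".join(kept) + "\n", encoding="utf-8")
    return target
```

Calling `archive_journal()` from the repository root returns the archive file path when lines were moved and `None` when the journal is still under the threshold. The archive file is opened in append mode, so running it twice in the same month adds to the existing archive rather than overwriting it.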
