Crawl4AI RAG MCP Server

endpoints.md•18.1 kB

# API Endpoints This document provides comprehensive documentation for all REST API endpoints in the Crawl4AI RAG MCP Server. ## Base URL ``` http://localhost:8080 ``` ## Authentication All API endpoints (except `/health`) require authentication using a Bearer token: ```http Authorization: Bearer your-api-key-here ``` The API key is configured via environment variables: - `LOCAL_API_KEY`: For server mode - `REMOTE_API_KEY`: For client mode ## System Endpoints ### GET /health Basic health check endpoint (no authentication required). **Request:** ```http GET /health HTTP/1.1 Host: localhost:8080 ``` **Response:** ```json { "status": "healthy", "timestamp": "2024-01-15T14:30:00Z" } ``` ### GET /api/v1/status Get comprehensive system status and component health. **Request:** ```http GET /api/v1/status HTTP/1.1 Host: localhost:8080 Authorization: Bearer your-api-key ``` **Response:** ```json { "success": true, "status": "operational", "components": { "database": "healthy", "crawl4ai": "healthy", "session": "abc123-session-id" }, "timestamp": "2024-01-15T14:30:00Z" } ``` ### GET /api/v1/help Get comprehensive help documentation for all available tools. **Request:** ```http GET /api/v1/help HTTP/1.1 Host: localhost:8080 ``` **Response:** ```json { "success": true, "tools": [ { "name": "crawl_url", "example": "Crawl http://www.example.com without storing", "parameters": "url: string" }, ... ], "api_info": { "base_url": "/api/v1", "authentication": "Bearer token required in Authorization header", "formats": { "retention_policy": ["permanent", "session_only", "30_days"], "http_methods": { ... } } }, "timestamp": "2024-01-15T14:30:00Z" } ``` ## Crawling Endpoints ### POST /api/v1/crawl Crawl a URL without storing the content. **Request:** ```http POST /api/v1/crawl HTTP/1.1 Host: localhost:8080 Authorization: Bearer your-api-key Content-Type: application/json { "url": "https://example.com" } ``` **Parameters:** - `url` (string, required): URL to crawl **Response:** ```json { "success": true, "data": { "url": "https://example.com", "title": "Example Domain", "content": "Full page content...", "markdown": "# Example Domain\n\nContent in markdown...", "crawl_time": 2.34 }, "timestamp": "2024-01-15T14:30:00Z" } ``` ### POST /api/v1/crawl/store Crawl a URL and store it permanently in the knowledge base. **Request:** ```http POST /api/v1/crawl/store HTTP/1.1 Host: localhost:8080 Authorization: Bearer your-api-key Content-Type: application/json { "url": "https://example.com", "tags": "example,documentation", "retention_policy": "permanent" } ``` **Parameters:** - `url` (string, required): URL to crawl - `tags` (string, optional): Comma-separated tags for organization - `retention_policy` (string, optional): Storage policy ('permanent', 'session_only', '30_days'). Default: 'permanent' **Response:** ```json { "success": true, "data": { "success": true, "content_id": 12345, "url": "https://example.com", "title": "Example Domain", "stored": true, "embeddings_created": true }, "timestamp": "2024-01-15T14:30:00Z" } ``` ### POST /api/v1/crawl/temp Crawl a URL and store it temporarily (session only). **Request:** ```http POST /api/v1/crawl/temp HTTP/1.1 Host: localhost:8080 Authorization: Bearer your-api-key Content-Type: application/json { "url": "https://example.com", "tags": "temporary,testing" } ``` **Parameters:** - `url` (string, required): URL to crawl - `tags` (string, optional): Comma-separated tags for organization **Response:** ```json { "success": true, "data": { "success": true, "content_id": 12346, "url": "https://example.com", "retention_policy": "session_only", "stored": true }, "timestamp": "2024-01-15T14:30:00Z" } ``` ### POST /api/v1/crawl/deep/store Deep crawl multiple pages using DFS strategy and store all discovered pages. **Request:** ```http POST /api/v1/crawl/deep/store HTTP/1.1 Host: localhost:8080 Authorization: Bearer your-api-key Content-Type: application/json { "url": "https://docs.example.com", "max_depth": 3, "max_pages": 50, "retention_policy": "permanent", "tags": "deep_crawl,documentation", "include_external": false, "score_threshold": 0.0, "timeout": 600 } ``` **Parameters:** - `url` (string, required): Starting URL for deep crawl - `max_depth` (integer, optional): Maximum depth to crawl (1-5, default: 2) - `max_pages` (integer, optional): Maximum pages to crawl (1-250, default: 10) - `retention_policy` (string, optional): Storage policy (default: 'permanent') - `tags` (string, optional): Tags for organization (auto-adds 'deep_crawl') - `include_external` (boolean, optional): Follow external domain links (default: false) - `score_threshold` (number, optional): Minimum URL score (0.0-1.0, default: 0.0) - `timeout` (integer, optional): Maximum crawl time in seconds (60-1800) **Response:** ```json { "success": true, "data": { "success": true, "pages_crawled": 48, "pages_stored": 45, "failed_pages": 3, "crawl_depth_reached": 3, "crawl_time": 125.5, "stored_pages": [ { "url": "https://docs.example.com/page1", "content_id": 12347, "depth": 1 }, ... ], "failed_urls": [ { "url": "https://docs.example.com/broken", "error": "404 Not Found" } ] }, "timestamp": "2024-01-15T14:30:00Z" } ``` ## Search Endpoints ### POST /api/v1/search Search stored knowledge using semantic similarity. **Request:** ```http POST /api/v1/search HTTP/1.1 Host: localhost:8080 Authorization: Bearer your-api-key Content-Type: application/json { "query": "machine learning algorithms", "limit": 10, "tags": "ml,algorithms" } ``` **Parameters:** - `query` (string, required): Search query text - `limit` (integer, optional): Number of results (default: 5, max: 1000) - `tags` (string, optional): Comma-separated tags to filter by (ANY match) **Response:** ```json { "success": true, "data": [ { "url": "https://docs.example.com/ml-algorithms", "title": "Machine Learning Algorithms Guide", "content": "This guide covers various machine learning algorithms...", "timestamp": "2024-01-15T14:30:00Z", "tags": "ml,algorithms,guide", "similarity_score": 0.92 }, ... ], "timestamp": "2024-01-15T14:30:00Z" } ``` ### POST /api/v1/search/target Intelligent search that discovers tags from initial results and expands search. **Request:** ```http POST /api/v1/search/target HTTP/1.1 Host: localhost:8080 Authorization: Bearer your-api-key Content-Type: application/json { "query": "react hooks", "initial_limit": 5, "expanded_limit": 20 } ``` **Parameters:** - `query` (string, required): Search query for tag discovery - `initial_limit` (integer, optional): Initial results for tag discovery (1-100, default: 5) - `expanded_limit` (integer, optional): Maximum expanded results (1-1000, default: 20) **Response:** ```json { "success": true, "data": { "success": true, "query": "react hooks", "results": [ { "url": "https://example.com/react-hooks-guide", "title": "React Hooks Complete Guide", "content": "...", "similarity_score": 0.95 }, ... ], "discovered_tags": ["react", "hooks", "javascript", "frontend"], "expansion_used": true, "initial_results_count": 5, "expanded_results_count": 18 }, "timestamp": "2024-01-15T14:30:00Z" } ``` ## Knowledge Base Management ### GET /api/v1/memory List stored content in the knowledge base. **Request:** ```http GET /api/v1/memory?filter=permanent&limit=100 HTTP/1.1 Host: localhost:8080 Authorization: Bearer your-api-key ``` **Parameters:** - `filter` (string, optional): Filter by retention policy ('permanent', 'session_only', '30_days') - `limit` (integer, optional): Maximum results to return (default: 100, max: 1000) **Response:** ```json { "success": true, "data": { "content": [ { "url": "https://example.com/page1", "title": "Page Title", "timestamp": "2024-01-15T14:30:00Z", "retention_policy": "permanent", "tags": "example,docs" }, ... ], "count": 50, "total_count": 1250, "limited": true, "limit": 100 }, "timestamp": "2024-01-15T14:30:00Z" } ``` ### DELETE /api/v1/memory Remove specific content by URL from the knowledge base. **Request:** ```http DELETE /api/v1/memory?url=https://example.com HTTP/1.1 Host: localhost:8080 Authorization: Bearer your-api-key ``` **Parameters:** - `url` (string, required): URL to remove **Response:** ```json { "success": true, "data": { "removed_count": 1, "attempted_urls": [ "https://example.com", "https://www.example.com" ], "original_url": "https://example.com" }, "timestamp": "2024-01-15T14:30:00Z" } ``` ### DELETE /api/v1/memory/temp Clear all temporary content from the current session. **Request:** ```http DELETE /api/v1/memory/temp HTTP/1.1 Host: localhost:8080 Authorization: Bearer your-api-key ``` **Response:** ```json { "success": true, "data": { "removed_count": 5, "session_id": "abc123-session-id" }, "timestamp": "2024-01-15T14:30:00Z" } ``` ## Database Statistics ### GET /api/v1/stats Get comprehensive database statistics including record counts, storage size, and recent activity. **Request:** ```http GET /api/v1/stats HTTP/1.1 Host: localhost:8080 Authorization: Bearer your-api-key ``` **Response:** ```json { "success": true, "data": { "success": true, "using_ram_db": true, "database_path": "/app/data/crawl4ai_rag.db", "total_pages": 1250, "vector_embeddings": 45678, "sessions": 12, "database_size_mb": 45.23, "database_size_bytes": 47423488, "retention_breakdown": { "permanent": 1100, "session_only": 120, "30_days": 30 }, "recent_activity": [ { "url": "https://example.com/recent", "title": "Recent Page", "timestamp": "2024-01-15T14:25:00Z", "size_kb": 123.4, "retention_policy": "permanent" }, ... ], "storage_breakdown": { "content_mb": 30.5, "embeddings_mb": 12.8, "metadata_mb": 1.93 }, "top_tags": [ {"tag": "documentation", "count": 450}, {"tag": "tutorial", "count": 320}, ... ], "vec_extension_available": true }, "timestamp": "2024-01-15T14:30:00Z" } ``` ### GET /api/v1/db/stats Get RAM database sync statistics and health metrics. **Request:** ```http GET /api/v1/db/stats HTTP/1.1 Host: localhost:8080 Authorization: Bearer your-api-key ``` **Response:** ```json { "success": true, "mode": "memory", "disk_db_path": "/app/data/crawl4ai_rag.db", "disk_db_size_mb": 45.23, "sync_metrics": { "total_syncs": 127, "last_sync_time": 1705329000.0, "last_sync_duration": 0.15, "total_records_synced": 3456, "failed_syncs": 0, "pending_changes": 0 }, "health": { "pending_changes": 0, "last_sync_ago_seconds": 12.5, "sync_success_rate": 1.0 }, "timestamp": "2024-01-15T14:30:00Z" } ``` ### GET /api/v1/domains List all unique domains stored in the knowledge base with page counts. **Request:** ```http GET /api/v1/domains HTTP/1.1 Host: localhost:8080 Authorization: Bearer your-api-key ``` **Response:** ```json { "success": true, "data": { "success": true, "domains": [ {"domain": "docs.python.org", "page_count": 245}, {"domain": "github.com", "page_count": 189}, {"domain": "example.com", "page_count": 156}, ... ], "total_domains": 42, "total_pages": 1250 }, "timestamp": "2024-01-15T14:30:00Z" } ``` ## Domain Blocking ### GET /api/v1/blocked-domains List all blocked domain patterns. **Request:** ```http GET /api/v1/blocked-domains HTTP/1.1 Host: localhost:8080 Authorization: Bearer your-api-key ``` **Response:** ```json { "success": true, "data": { "success": true, "blocked_domains": [ { "pattern": "*.ru", "description": "Block all Russian domains", "created_at": "2024-01-01T00:00:00Z" }, { "pattern": "*porn*", "description": "Block URLs containing 'porn'", "created_at": "2024-01-01T00:00:00Z" }, ... ], "count": 6 }, "timestamp": "2024-01-15T14:30:00Z" } ``` ### POST /api/v1/blocked-domains Add a domain pattern to the blocklist. **Request:** ```http POST /api/v1/blocked-domains HTTP/1.1 Host: localhost:8080 Authorization: Bearer your-api-key Content-Type: application/json { "pattern": "*.spam-site.com", "description": "Known spam domain" } ``` **Parameters:** - `pattern` (string, required): Domain pattern to block (supports wildcards) - `*.ru` - Block all .ru domains - `*spam*` - Block URLs containing 'spam' - `example.com` - Block exact domain - `description` (string, optional): Reason for blocking **Response:** ```json { "success": true, "data": { "success": true, "pattern": "*.spam-site.com", "description": "Known spam domain" }, "timestamp": "2024-01-15T14:30:00Z" } ``` ### DELETE /api/v1/blocked-domains Remove a domain pattern from the blocklist (requires authorization keyword). **Request:** ```http DELETE /api/v1/blocked-domains?pattern=*.spam-site.com&keyword=secret-keyword HTTP/1.1 Host: localhost:8080 Authorization: Bearer your-api-key ``` **Parameters:** - `pattern` (string, required): Domain pattern to unblock - `keyword` (string, required): Authorization keyword (from BLOCKED_DOMAIN_KEYWORD env var) **Response:** ```json { "success": true, "data": { "success": true, "pattern": "*.spam-site.com", "removed": true }, "timestamp": "2024-01-15T14:30:00Z" } ``` ## Error Responses All endpoints return consistent error responses: ### 400 Bad Request Invalid input or validation error: ```json { "success": false, "error": "URL contains dangerous SQL pattern: SELECT", "timestamp": "2024-01-15T14:30:00Z" } ``` ### 401 Unauthorized Missing or invalid API key: ```json { "success": false, "error": "Unauthorized", "timestamp": "2024-01-15T14:30:00Z" } ``` ### 404 Not Found Resource not found: ```json { "success": false, "error": "Pattern 'example.com' not found in blocklist", "timestamp": "2024-01-15T14:30:00Z" } ``` ### 500 Internal Server Error Server-side error: ```json { "success": false, "error": "Internal server error", "timestamp": "2024-01-15T14:30:00Z" } ``` ## Rate Limiting The API implements rate limiting to prevent abuse: - Default limit: 60 requests per minute per API key - Configurable via `RATE_LIMIT_PER_MINUTE` environment variable - Rate limit headers included in responses: - `X-RateLimit-Limit`: Maximum requests per window - `X-RateLimit-Remaining`: Requests remaining in current window - `X-RateLimit-Reset`: Time when the rate limit resets ## Pydantic Models All request bodies are validated using Pydantic models: ### CrawlRequest ```python { "url": str # Required, validated URL } ``` ### CrawlStoreRequest ```python { "url": str, # Required "tags": str, # Optional, max 255 chars "retention_policy": str # Optional: permanent|session_only|30_days } ``` ### DeepCrawlStoreRequest ```python { "url": str, # Required "max_depth": int, # Optional, 1-5, default 2 "max_pages": int, # Optional, 1-250, default 10 "include_external": bool, # Optional, default false "score_threshold": float, # Optional, 0.0-1.0, default 0.0 "timeout": int, # Optional, 60-1800 seconds "tags": str, # Optional "retention_policy": str # Optional } ``` ### SearchRequest ```python { "query": str, # Required, max 500 chars "limit": int, # Optional, 1-1000, default 5 "tags": str # Optional, comma-separated } ``` ### TargetSearchRequest ```python { "query": str, # Required, max 500 chars "initial_limit": int, # Optional, 1-100, default 5 "expanded_limit": int # Optional, 1-1000, default 20 } ``` ## Testing Endpoints ### Using curl ```bash # Test health check curl http://localhost:8080/health # Get API help curl -H "Authorization: Bearer your-api-key" \ http://localhost:8080/api/v1/help # Crawl and store a URL curl -X POST http://localhost:8080/api/v1/crawl/store \ -H "Authorization: Bearer your-api-key" \ -H "Content-Type: application/json" \ -d '{"url": "https://example.com", "tags": "test"}' # Search knowledge base curl -X POST http://localhost:8080/api/v1/search \ -H "Authorization: Bearer your-api-key" \ -H "Content-Type: application/json" \ -d '{"query": "example", "limit": 5}' # Get database stats curl -H "Authorization: Bearer your-api-key" \ http://localhost:8080/api/v1/stats # List blocked domains curl -H "Authorization: Bearer your-api-key" \ http://localhost:8080/api/v1/blocked-domains ``` ### Using Python ```python import requests API_URL = "http://localhost:8080" API_KEY = "your-api-key" headers = { "Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json" } # Crawl and store response = requests.post( f"{API_URL}/api/v1/crawl/store", headers=headers, json={"url": "https://example.com", "tags": "python,test"} ) print(response.json()) # Search response = requests.post( f"{API_URL}/api/v1/search", headers=headers, json={"query": "python programming", "limit": 10} ) print(response.json()) # Get stats response = requests.get( f"{API_URL}/api/v1/stats", headers=headers ) print(response.json()) ``` ## See Also - [API Overview](index.md) - [Security Documentation](../advanced/security.md) - [Batch Operations](../advanced/batch-operations.md) - [RAM Database Mode](../advanced/ram-database.md) - [Troubleshooting Guide](../guides/troubleshooting.md)

Loading blob content...

Latest Blog Posts

Don't Use Large Strings as Cache Keys
By punkpeye on January 11, 2026.
markdown
node-js
cache
What are Claude Skills?
By punkpeye on January 10, 2026.
mcp
skills
How to Test MCP Streamable HTTP Endpoints Using cURL
By punkpeye on January 2, 2026.
tutorial
bash

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/Rob-P-Smith/mcpragcrawl4ai'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

endpoints.md•18.1 kB