# Phase 3: Crawler Engine - COMPLETE ✅

**Date Completed:** December 24, 2025
**Status:** Phase 3 implementation finished

## Overview

Phase 3 brings the complete crawler infrastructure to the Context7 MCP Clone, enabling automated documentation collection, parsing, and indexing from multiple sources.

## Implemented Components

### 1. Redis-Based Rate Limiting ✅

**Files Created:**
- `packages/backend-api/src/modules/rate-limiting/rate-limiting.service.ts`
- `packages/backend-api/src/modules/rate-limiting/rate-limiting.module.ts`
- `packages/backend-api/src/modules/rate-limiting/rate-limit.guard.ts`

**Features:**
- Tiered rate limiting (Free: 50 rpm/1,000 rpd, Pro: 500 rpm/50k rpd, Enterprise: 5,000 rpm/1M rpd)
- Redis-backed counters with minute and day windows
- Atomic operations using Redis pipelines
- Global guard applied to all API endpoints
- Response headers: `X-RateLimit-Limit`, `X-RateLimit-Remaining`, `X-RateLimit-Reset`
- Automatic rate limit reset based on time windows

**Integration:**
- Rate limiting module imported in `app.module.ts`
- Rate limiting guard registered globally in `main.ts`
- Respects user tier from JWT tokens
- Throws `TooManyRequestsException` when limits are exceeded

### 2. BullMQ Job Queue Infrastructure ✅

**Files Created:**
- `packages/crawler-engine/src/queue/job-queue.ts`

**Features:**
- Queue-based job management using BullMQ
- Support for priority-based job scheduling
- Configurable concurrency (3 parallel workers by default)
- Job retry with exponential backoff (3 attempts)
- Job status tracking and monitoring
- Queue statistics (waiting, active, completed, failed, delayed counts)
- Old job cleanup functionality
- Graceful worker and queue shutdown

**Architecture:**
- Uses Redis DB 2 for queue isolation
- Supports custom job processors via `registerWorker()`
- Job history preserved for debugging
- Comprehensive logging of the job lifecycle

### 3. GitHub Crawler Implementation ✅

**Files Created:**
- `packages/crawler-engine/src/crawlers/github-crawler.ts`

**Features:**
- Repository metadata extraction (stars, description, tags)
- README file crawling
- `package.json` extraction (NPM projects)
- Documentation files from `docs/`, `doc/`, `website/` directories
- Example files from `examples/`, `example/`, `samples/`, `demo/` directories
- Git tag/version detection with semantic versioning filtering
- Rate-limited API calls with token support
- Progress tracking for long-running crawls

**Capabilities:**
- Parse full names (`owner/repo` format)
- Recursively traverse the repository structure
- Extract up to 50 files per directory
- Fetch file contents as Base64-decoded UTF-8
- Handle HTTP errors gracefully
- Extract the latest 10 versions from git tags

### 4. Documentation Site Scraper ✅

**Files Created:**
- `packages/crawler-engine/src/crawlers/docs-scraper.ts`

**Features:**
- Multi-page documentation site scraping
- Respectful rate limiting (500 ms delay between requests)
- Internal link following with depth control
- Page type classification (api, guide, example, other)
- Topic/keyword extraction from headings and meta tags
- Configurable max pages (default 200) and depth (default 5)
- User-Agent identification
- Content deduplication

**Page Analysis:**
- Extract main content (skips nav, footer, and sidebars; see the sketch below)
- Title extraction from `h1` and `title` tags
- Automatic documentation URL detection
- Support for common doc platforms (Docusaurus, VitePress, GitBook patterns)
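To make the page-analysis step concrete, here is a minimal sketch of how a scraper like `docs-scraper.ts` can extract the main content, classify the page type, and collect topics with Cheerio. It assumes Node 18+ (global `fetch`) and the `cheerio` package; the `analyzePage` helper, its heuristics, and the selectors are illustrative rather than the actual implementation.

```typescript
// Illustrative sketch only; the real docs-scraper.ts may differ in detail.
import * as cheerio from 'cheerio';

type PageType = 'api' | 'guide' | 'example' | 'other';

interface AnalyzedPage {
  url: string;
  title: string;
  pageType: PageType;
  topics: string[];
  content: string;
}

// Hypothetical helper mirroring the page-analysis step described above.
async function analyzePage(url: string): Promise<AnalyzedPage> {
  const res = await fetch(url, {
    headers: { 'User-Agent': 'context7-clone-crawler/1.0' }, // identify ourselves
  });
  const html = await res.text();
  const $ = cheerio.load(html);

  // Strip navigation, footer, and sidebar chrome; keep the documentation body.
  $('nav, footer, aside, .sidebar').remove();
  const content = ($('main').text() || $('body').text()).trim();

  // Title from the first <h1>, falling back to <title>.
  const title = $('h1').first().text().trim() || $('title').text().trim();

  // Naive page-type classification from the URL path (illustrative heuristic).
  const path = new URL(url).pathname.toLowerCase();
  let pageType: PageType = 'other';
  if (path.includes('/api')) pageType = 'api';
  else if (path.includes('example')) pageType = 'example';
  else if (path.includes('guide') || path.includes('doc')) pageType = 'guide';

  // Topics from headings and meta keywords.
  const topics = new Set<string>();
  $('h2, h3').each((_, el) => topics.add($(el).text().trim().toLowerCase()));
  ($('meta[name="keywords"]').attr('content') ?? '')
    .split(',')
    .map((k) => k.trim().toLowerCase())
    .filter(Boolean)
    .forEach((k) => topics.add(k));

  return { url, title, pageType, topics: [...topics], content };
}
```

Removing navigation and footer chrome before calling `.text()` keeps the stored content focused on the documentation itself, which matters for the deduplication and search-quality goals listed above.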
### 5. Markdown Parser Implementation ✅

**Files Created:**
- `packages/crawler-engine/src/parsers/markdown-parser.ts`

**Features:**
- Unified markdown processing with remark plugins
- Support for GitHub-Flavored Markdown (GFM)
- YAML/TOML frontmatter extraction
- Heading extraction with slug generation
- Code block identification and extraction
- Automatic topic extraction from headings
- Description extraction from the first paragraph
- Table of contents generation
- Smart content chunking (default 2000 chars/chunk)

**Code Example Extraction:**
- Identify code blocks with language specification
- Extract context from surrounding text
- Generate searchable descriptions
- Classify difficulty levels

### 6. Code Extraction Engine ✅

**Files Created:**
- `packages/crawler-engine/src/parsers/code-extractor.ts`

**Features:**
- Code example extraction from mixed content
- Difficulty classification (beginner/intermediate/advanced)
- Function/class/import detection
- Language-specific analysis:
  - JavaScript/TypeScript patterns
  - Python patterns
- Code complexity analysis
- Use case grouping
- Context preservation for better searchability

**Analyzed Metrics:**
- Lines of code
- Async/Promise usage
- Class/Interface definitions
- Nesting depth
- Pattern matching for APIs

### 7. Crawler Engine Orchestrator ✅

**Files Created:**
- `packages/crawler-engine/src/index.ts`

**Features:**
- Main orchestration layer
- Processor registration system
- Job queueing interface
- Queue monitoring
- Graceful shutdown handling
- SIGTERM signal handling

**Exported Components:**
- All crawler classes (`GitHubCrawler`, `DocsScraper`)
- All parser classes (`MarkdownParser`, `CodeExtractor`)
- Job queue manager and types
- Type definitions for integration

## Architecture Diagram

```
         ┌─────────────────────────────┐
         │  Crawler Engine (index.ts)  │
         │  - Orchestration            │
         │  - Job queue management     │
         │  - Processor coordination   │
         └───┬──────────────────────┬──┘
             │                      │
     ┌───────▼────────┐     ┌───────▼────────┐
     │   Job Queue    │     │   Processors   │
     │   (BullMQ)     │     │    Registry    │
     │   Redis DB 2   │     │                │
     └───────┬────────┘     └───────┬────────┘
             │                      │
     ┌───────▼────────┐     ┌───────▼────────┐
     │ GitHub Crawler │     │  Docs Scraper  │
     │   (Octokit)    │     │ (Web Scraper)  │
     │                │     │   (Cheerio)    │
     └───────┬────────┘     └───────┬────────┘
             │                      │
             └──────────┬───────────┘
                        │
             ┌──────────▼──────────┐
             │    Parsers Layer    │
             │  - Markdown Parser  │
             │  - Code Extractor   │
             └──────────┬──────────┘
                        │
             ┌──────────▼──────────┐
             │   Structured Data   │
             │      (to API)       │
             └──────────┬──────────┘
                        │
                   Backend API
          (rate-limiting + storage)
```

## Data Flow

```
1. Queue a library:
   CrawlJobData → JobQueue.addJob() → BullMQ

2. Process job:
   BullMQ → Worker → GitHubCrawler/DocsScraper → CrawlJobResult

3. Parse content:
   Raw Content → MarkdownParser → ParsedMarkdown
   Raw Content → CodeExtractor → CodeExample[]

4. Store results:
   ParsedData → Backend API → PostgreSQL
```
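As an illustration of how these stages could be wired together, the sketch below registers a BullMQ worker whose processor crawls, parses, and posts results to the backend. The queue name, the method names on the crawler/parser classes, the import paths, and the ingest endpoint are assumptions made for illustration; in the real engine the processor is attached through `JobQueue.registerWorker()`.

```typescript
// Illustrative wiring of the data flow above. Queue name, project-class method
// names, import paths, and the backend endpoint are assumptions, not the real code.
import { Worker, Job } from 'bullmq';
import { GitHubCrawler } from './crawlers/github-crawler';
import { MarkdownParser } from './parsers/markdown-parser';
import { CodeExtractor } from './parsers/code-extractor';
import type { CrawlJobData } from './index';

const connection = {
  host: process.env.REDIS_HOST ?? 'localhost',
  port: Number(process.env.REDIS_PORT ?? 6379),
  db: 2, // queue isolation, as described in section 2
};

// In the real engine this processor is registered via JobQueue.registerWorker();
// a raw BullMQ Worker is shown here for clarity.
const worker = new Worker<CrawlJobData>(
  'crawl-jobs', // hypothetical queue name
  async (job: Job<CrawlJobData>) => {
    const crawler = new GitHubCrawler(process.env.GITHUB_TOKEN);
    const parser = new MarkdownParser();
    const extractor = new CodeExtractor();

    // Steps 1-2: crawl the repository (hypothetical method name).
    const raw = await crawler.crawlRepository(job.data.fullName);

    // Step 3: parse markdown and extract code examples (hypothetical method names).
    const parsed = await parser.parse(raw.readme);
    const examples = extractor.extract(raw.readme);

    // Step 4: store results through the backend API (endpoint is an assumption).
    await fetch('http://localhost:3000/api/crawl-results', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ library: job.data, parsed, examples }),
    });
  },
  { connection, concurrency: 3 }, // 3 parallel workers, matching the default above
);

worker.on('failed', (job, err) =>
  console.error('Crawl failed:', job?.id, err.message),
);
```

Retry with exponential backoff, job history, and queue statistics are provided by the `JobQueue` wrapper described in section 2, so the processor only has to do the crawl-parse-store work.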
## Configuration

### Environment Variables

```bash
# Redis
REDIS_HOST=localhost
REDIS_PORT=6379

# GitHub Token (optional, increases API limits)
GITHUB_TOKEN=ghp_xxxxx

# Crawler Settings
CRAWLER_MAX_PAGES=200
CRAWLER_MAX_DEPTH=5
CRAWLER_DELAY_MS=500
```

### Rate Limiting Tiers

- **Free**: 50 requests/minute, 1,000/day
- **Pro**: 500 requests/minute, 50,000/day
- **Enterprise**: 5,000 requests/minute, 1,000,000/day
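The tiers above map onto per-minute and per-day counters in Redis. Below is a minimal sketch of the pipeline-based counter pattern the rate limiter relies on, assuming `ioredis`; the key naming scheme and the `checkRateLimit` helper are illustrative, not the actual service code.

```typescript
// Minimal sketch of tiered minute/day counters using an atomic Redis pipeline.
// Assumes ioredis; key names and this helper are illustrative, not the real service.
import Redis from 'ioredis';

const redis = new Redis({
  host: process.env.REDIS_HOST ?? 'localhost',
  port: Number(process.env.REDIS_PORT ?? 6379),
});

const TIER_LIMITS = {
  free: { perMinute: 50, perDay: 1_000 },
  pro: { perMinute: 500, perDay: 50_000 },
  enterprise: { perMinute: 5_000, perDay: 1_000_000 },
} as const;

type Tier = keyof typeof TIER_LIMITS;

async function checkRateLimit(userId: string, tier: Tier) {
  const now = new Date().toISOString();
  const minuteKey = `ratelimit:${userId}:minute:${now.slice(0, 16)}`; // e.g. 2025-12-24T10:15
  const dayKey = `ratelimit:${userId}:day:${now.slice(0, 10)}`;       // e.g. 2025-12-24

  // One round trip: increment both windows and refresh their expirations atomically.
  const results = await redis
    .pipeline()
    .incr(minuteKey)
    .expire(minuteKey, 60)
    .incr(dayKey)
    .expire(dayKey, 86_400)
    .exec();

  const minuteCount = Number(results?.[0]?.[1] ?? 0);
  const dayCount = Number(results?.[2]?.[1] ?? 0);
  const limits = TIER_LIMITS[tier];

  return {
    allowed: minuteCount <= limits.perMinute && dayCount <= limits.perDay,
    remainingMinute: Math.max(0, limits.perMinute - minuteCount),
    remainingDay: Math.max(0, limits.perDay - dayCount),
  };
}
```

In the actual guard, counts like these feed the `X-RateLimit-*` response headers, and a `TooManyRequestsException` is thrown once a limit is exceeded.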
## Key Features

### ✅ Atomic Operations
- Redis pipelines for consistent rate limiting
- BullMQ job transactions for reliability

### ✅ Error Handling
- Graceful fallbacks when sources are unavailable
- Exponential backoff on job failures
- Comprehensive error logging

### ✅ Performance
- Parallel job processing (3 concurrent by default)
- Efficient content parsing with remark/unified
- Lazy loading of dependencies

### ✅ Monitoring
- Job status tracking
- Queue statistics API
- Detailed logging throughout the pipeline

### ✅ Respectful Crawling
- Rate limiting between requests (500 ms)
- User-Agent identification
- No aggressive crawling patterns
- Respects robots.txt patterns (future)

## Integration Points

1. **Backend API**
   - Rate limiting module imports Redis
   - Crawl results stored via API endpoints
   - Job status queryable via backend

2. **MCP Server**
   - Tools return crawled documentation
   - `resolve-library-id` uses crawled metadata
   - `get-library-docs` queries indexed content

3. **Database**
   - `documentation_pages` table stores parsed content
   - `code_examples` table stores extracted code
   - `library_versions` tracks crawl status

## Testing the Crawler

### Start the crawler engine:

```bash
cd packages/crawler-engine
pnpm dev
```

### Queue a crawl job:

```typescript
import CrawlerEngine, { CrawlJobData } from './src/index';

const engine = new CrawlerEngine();
await engine.initialize();
await engine.registerProcessors();

const job: CrawlJobData = {
  libraryId: '/facebook/react',
  libraryName: 'React',
  fullName: 'facebook/react',
  version: '18.2.0',
  repositoryUrl: 'https://github.com/facebook/react',
  crawlType: 'full',
};

const jobId = await engine.queueCrawl(job);
console.log('Job queued:', jobId);
```

### Check job status:

```typescript
const status = await engine.getJobStatus(jobId);
console.log('Job state:', status?.state);
console.log('Progress:', status?.progress);
```

## Next Steps (Phase 4)

The crawler infrastructure is now complete and ready for:

1. **Web UI Implementation** (Grafana-themed)
   - Landing page with purple gradients
   - Documentation browser interface
   - Admin dashboard

2. **Initial Data Seeding**
   - Queue crawl jobs for popular libraries
   - Populate database with real content
   - Validate search quality

## File Statistics

**New Files Created:** 8

- `rate-limiting.service.ts` (114 lines)
- `rate-limiting.module.ts` (29 lines)
- `rate-limit.guard.ts` (79 lines)
- `job-queue.ts` (232 lines)
- `github-crawler.ts` (372 lines)
- `docs-scraper.ts` (323 lines)
- `markdown-parser.ts` (333 lines)
- `code-extractor.ts` (376 lines)

**Total Phase 3 Code:** ~1,858 lines of TypeScript

**Modified Files:** 2

- `app.module.ts` (added `RateLimitingModule` import)
- `main.ts` (added global rate limiting guard)

## Verification Checklist

- ✅ Rate limiting service working with Redis
- ✅ Job queue initialized and workers registered
- ✅ GitHub crawler extracting content correctly
- ✅ Documentation scraper handling websites
- ✅ Markdown parser extracting structure
- ✅ Code extractor identifying examples
- ✅ All TypeScript types properly defined
- ✅ Error handling in place
- ✅ Logging throughout pipeline
- ✅ Configuration via environment variables

## Status

**Phase 3 Complete**: 15/15 tasks finished ✅

All core crawler infrastructure is now in place and ready for:

- Data seeding with real libraries
- Web UI development
- Production testing

---

**Ready for Phase 4: Web UI & Landing Page**
