# Crawler Service Refactoring Specification

## Overview

This specification outlines a comprehensive refactoring of the `CrawlerService` to:

1. Detect and handle SPAs (Single Page Applications) vs. static pages appropriately
2. Apply SOLID principles to improve the architecture
3. Improve organization, maintainability, and extensibility

## Current Issues

The current `CrawlerService` implementation has several limitations:

1. **Single Responsibility Principle (SRP) violations**:
   - Handles too many responsibilities: crawling, content extraction, robots.txt handling, job management, and document creation
   - Methods like `crawlPage`, `updateJobProgress`, and `loadRobotsTxt` should live in separate classes

2. **Open/Closed Principle issues**:
   - Not easily extendable for different crawling strategies (SPA vs. regular pages)
   - Hard-coded content extraction logic isn't flexible enough for different site structures

3. **Lack of proper abstractions**:
   - Direct dependency on `PrismaClient` and `DocumentService`
   - No interfaces for HTTP requests or HTML parsing
   - No separation between static and dynamic page handling

## Architecture Design

### Directory Structure

```
/src/services/crawler/
├── index.ts                       # Main export file
├── interfaces/                    # Core interfaces
│   ├── ICrawler.ts                # Base crawler interface
│   ├── IContentExtractor.ts       # Content extraction interface
│   ├── IPageDetector.ts           # Page type detection interface
│   ├── ILinkExtractor.ts          # Link extraction interface
│   ├── IRateLimiter.ts            # Rate limiting interface
│   ├── IJobManager.ts             # Job management interface
│   ├── IDocumentProcessor.ts      # Document processing interface
│   ├── IRobotsTxtService.ts       # Robots.txt interface
│   ├── IUrlQueue.ts               # URL queue interface
│   └── types.ts                   # Common types and enums
├── implementations/               # Interface implementations
│   ├── BaseCrawler.ts             # Abstract base crawler implementation
│   ├── StandardCrawler.ts         # Main crawler implementation
│   ├── CheerioExtractor.ts        # Cheerio-based content extractor
│   ├── PuppeteerExtractor.ts      # Puppeteer-based content extractor
│   ├── SPADetector.ts             # SPA detection implementation
│   ├── DefaultLinkExtractor.ts    # Link extraction implementation
│   ├── PrismaJobManager.ts        # Job management with Prisma
│   ├── DocumentProcessor.ts       # Document processing implementation
│   ├── RobotsTxtService.ts        # Robots.txt handling
│   ├── InMemoryUrlQueue.ts        # URL queue implementation
│   └── TokenBucketRateLimiter.ts  # Rate limiting implementation
├── factories/                     # Factory classes
│   ├── CrawlingStrategyFactory.ts # Factory for selecting extraction strategy
│   └── ServiceFactory.ts          # Factory for creating service instances
└── utils/                         # Helper utilities
    ├── UrlUtils.ts                # URL handling utilities
    ├── HtmlUtils.ts               # HTML parsing utilities
    ├── DelayUtils.ts              # Timing and delay utilities
    └── LoggingUtils.ts            # Crawler-specific logging
```

### Core Interfaces

#### `ICrawler`

```typescript
export interface ICrawler {
  crawl(jobId: string, startUrl: string, options: CrawlOptions): Promise<void>;
  initialize(): Promise<void>;
  stop(): Promise<void>;
  pause(): Promise<void>;
  resume(): Promise<void>;
  getProgress(): Promise<CrawlProgress>;
}
```

#### `IContentExtractor`

```typescript
export interface IContentExtractor {
  extract(url: string, options: ExtractionOptions): Promise<ExtractedContent>;
  supportsPageType(pageType: PageType): boolean;
  cleanup(): Promise<void>;
}
```

#### `IPageDetector`

```typescript
export interface IPageDetector {
  detectPageType(url: string, htmlContent?: string): Promise<PageTypeResult>;
  isSPA(url: string, htmlContent?: string): Promise<boolean>;
}
```
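These interfaces (and those that follow) reference shared types such as `PageType`, `PageTypeResult`, `ExtractionOptions`, `ExtractedContent`, and `CrawlProgress`. A minimal sketch of how `types.ts` might declare them is shown below; the exact fields are illustrative assumptions, except where later sections of this spec pin them down (e.g. `ExtractedContent` carrying `url`, `title`, `content`, and `metadata`):

```typescript
// types.ts — illustrative shapes only; fields beyond those used in this spec are assumptions
export enum PageType {
  STATIC = 'static',
  SPA = 'spa'
}

export interface PageTypeResult {
  isSPA: boolean;
  confidence: number;                                // 0–1 score from the detector
  pageType: PageType;
  detectionMethod: 'static' | 'dynamic' | 'hybrid';
}

export interface ExtractionOptions {
  userAgent?: string;
  timeout?: number;                                  // Milliseconds
}

export interface ExtractedContent {
  url: string;
  title: string;
  content: string;                                   // Extracted page content (HTML or cleaned text)
  metadata: Record<string, unknown>;
}

export interface CrawlProgress {
  pagesCrawled: number;
  pagesQueued: number;
  errors: number;
}
```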
#### `ILinkExtractor`

```typescript
export interface ILinkExtractor {
  extractLinks(htmlContent: string, baseUrl: string, currentUrl: string): Promise<string[]>;
  extractPaginationLinks(htmlContent: string, baseUrl: string, currentUrl: string): Promise<string[]>;
}
```

#### `IRateLimiter`

```typescript
export interface IRateLimiter {
  acquireToken(domain: string): Promise<void>;
  releaseToken(domain: string): void;
  setRateLimit(domain: string, rateLimit: number): void;
  getRateLimit(domain: string): number;
}
```

#### `IJobManager`

```typescript
export interface IJobManager {
  createJob(data: JobCreateData): Promise<Job>;
  updateProgress(jobId: string, progress: number, stats: JobStats): Promise<void>;
  markJobCompleted(jobId: string, stats: JobStats): Promise<void>;
  markJobFailed(jobId: string, error: string, stats: JobStats): Promise<void>;
  shouldContinue(jobId: string): Promise<boolean>;
}
```

#### `IDocumentProcessor`

```typescript
export interface IDocumentProcessor {
  createDocument(data: DocumentCreateData): Promise<Document>;
  findRecentDocument(url: string, age: number): Promise<Document | null>;
  copyDocument(existingDocument: Document, jobId: string, level: number): Promise<Document>;
}
```

#### `IRobotsTxtService`

```typescript
export interface IRobotsTxtService {
  loadRobotsTxt(baseUrl: string, userAgent: string): Promise<void>;
  isAllowed(url: string): boolean;
  getCrawlDelay(): number | null;
}
```

#### `IUrlQueue`

```typescript
export interface IUrlQueue {
  add(url: string, depth: number): void;
  addBulk(urls: Array<{ url: string; depth: number }>): void;
  getNext(): { url: string; depth: number } | null;
  has(url: string): boolean;
  size(): number;
  markVisited(url: string): void;
  isVisited(url: string): boolean;
  visitedCount(): number;
}
```

## SPA Detection Strategy

The system will use a hybrid approach to distinguish SPAs from static pages.

### Static Analysis

First, the system will analyze the static HTML for:

- JavaScript framework signatures (React, Angular, Vue)
- Minimal HTML with heavy JavaScript loading
- SPA-specific DOM structures (such as `#app` or `#root` divs)
- Client-side routing code (history API or hash-based routing)

### Dynamic Analysis

If the static analysis is inconclusive, the system will use Puppeteer to:

- Monitor DOM changes after the initial load
- Track XHR/fetch API calls
- Observe behavior after user interactions
- Check for dynamic content loading

### Implementation Details

The `SPADetector` will use a scoring approach:

```typescript
export class SPADetector implements IPageDetector {
  // Framework signature patterns
  private readonly signaturePatterns = [
    { pattern: /react|reactjs/i, framework: 'React' },
    { pattern: /angular|ng-/i, framework: 'Angular' },
    { pattern: /vue|vuejs/i, framework: 'Vue' },
    { pattern: /ember|emberjs/i, framework: 'Ember' },
    { pattern: /backbone/i, framework: 'Backbone' }
  ];

  // Cache detection results by domain
  private domainTypeCache = new Map<string, PageTypeResult>();

  async detectPageType(url: string, htmlContent?: string): Promise<PageTypeResult> {
    // Check cache first
    const domain = new URL(url).hostname;
    if (this.domainTypeCache.has(domain)) {
      return this.domainTypeCache.get(domain)!;
    }

    // Static analysis (score from 0 to 1)
    let score = await this.analyzeStaticContent(url, htmlContent);
    let detectionMethod: 'static' | 'hybrid' = 'static';

    // Dynamic analysis when static analysis is inconclusive
    if (score > 0.3 && score < 0.7) {
      const dynamicScore = await this.analyzeDynamicBehavior(url);
      score = score * 0.4 + dynamicScore * 0.6;
      detectionMethod = 'hybrid';
    }

    // Determine result
    const isSPA = score >= 0.6;
    const result: PageTypeResult = {
      isSPA,
      confidence: score,
      pageType: isSPA ? PageType.SPA : PageType.STATIC,
      detectionMethod
    };

    // Cache and return
    this.domainTypeCache.set(domain, result);
    return result;
  }

  async isSPA(url: string, htmlContent?: string): Promise<boolean> {
    return (await this.detectPageType(url, htmlContent)).isSPA;
  }
}
```
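The class above delegates to two private helpers (`analyzeStaticContent`, `analyzeDynamicBehavior`) that this spec does not define. As a rough illustration, a Puppeteer-based dynamic-analysis helper could look like the sketch below; the heuristic (DOM growth after scripts settle plus the number of XHR/fetch calls) and every threshold in it are assumptions, not requirements of this spec:

```typescript
import puppeteer from 'puppeteer';

// Hypothetical helper for SPADetector: returns a 0–1 "looks like a SPA" score.
async function analyzeDynamicBehavior(url: string): Promise<number> {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();

    // Count client-side data fetching (XHR / fetch) triggered by the page
    let apiCalls = 0;
    page.on('request', (request) => {
      const type = request.resourceType();
      if (type === 'xhr' || type === 'fetch') apiCalls++;
    });

    // Load only the initial HTML, then let scripts run for a short window
    await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 30_000 });
    const initialNodes = await page.evaluate(() => document.querySelectorAll('*').length);
    await new Promise((resolve) => setTimeout(resolve, 3_000));
    const settledNodes = await page.evaluate(() => document.querySelectorAll('*').length);

    // Heavy post-load DOM growth and many API calls both suggest a SPA
    const growth = initialNodes > 0 ? (settledNodes - initialNodes) / initialNodes : 0;
    const growthScore = Math.min(growth, 1);      // DOM doubled or more => 1
    const apiScore = Math.min(apiCalls / 5, 1);   // 5+ API calls => 1
    return growthScore * 0.6 + apiScore * 0.4;
  } finally {
    await browser.close();
  }
}
```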
## Content Extraction Strategy

Based on the page type detection, the system will select the appropriate content extraction strategy.

### CheerioExtractor (for static pages)

- Fast, lightweight HTML parsing
- Lower resource usage
- Suitable for static content sites

### PuppeteerExtractor (for SPAs)

- Full browser rendering
- JavaScript execution
- Waits for dynamic content to load
- Handles client-side routing

### Strategy Selection

The `CrawlingStrategyFactory` will handle strategy selection:

```typescript
export class CrawlingStrategyFactory {
  constructor(
    private readonly pageDetector: IPageDetector,
    private readonly cheerioExtractor: IContentExtractor,
    private readonly puppeteerExtractor: IContentExtractor,
    private readonly options: StrategyFactoryOptions
  ) {}

  async getExtractorForUrl(url: string, htmlContent?: string): Promise<IContentExtractor> {
    // Check if a strategy is forced in options
    if (this.options.forceStrategy === 'cheerio') {
      return this.cheerioExtractor;
    } else if (this.options.forceStrategy === 'puppeteer') {
      return this.puppeteerExtractor;
    }

    // Detect page type and select the appropriate extractor
    const pageTypeResult = await this.pageDetector.detectPageType(url, htmlContent);
    return pageTypeResult.isSPA ? this.puppeteerExtractor : this.cheerioExtractor;
  }
}
```
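For reference, here is a minimal sketch of what the Cheerio-based extractor could look like. The fetch-and-parse details (global `fetch`, `AbortSignal.timeout`, the default user agent string, and the metadata fields) are assumptions for illustration; a real implementation would add encoding handling, redirect limits, and richer metadata extraction:

```typescript
import * as cheerio from 'cheerio';
import { IContentExtractor } from '../interfaces/IContentExtractor';
import { ExtractedContent, ExtractionOptions, PageType } from '../interfaces/types';

// Minimal sketch of a static-page extractor; not the final implementation.
export class CheerioExtractor implements IContentExtractor {
  supportsPageType(pageType: PageType): boolean {
    return pageType === PageType.STATIC;
  }

  async extract(url: string, options: ExtractionOptions): Promise<ExtractedContent> {
    const response = await fetch(url, {
      headers: { 'User-Agent': options.userAgent ?? 'docmcp-crawler' }, // illustrative default UA
      signal: AbortSignal.timeout(options.timeout ?? 30_000)
    });
    const html = await response.text();

    const $ = cheerio.load(html);
    $('script, style, noscript').remove();            // Drop non-content nodes

    return {
      url,
      title: $('title').text().trim(),
      content: $('body').html() ?? '',                // Cleaned HTML, reused later for link extraction
      metadata: {
        description: $('meta[name="description"]').attr('content') ?? ''
      }
    };
  }

  async cleanup(): Promise<void> {
    // Nothing to release for the Cheerio-based extractor
  }
}
```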
## Main Crawler Implementation

The `StandardCrawler` will coordinate all components:

```typescript
export class StandardCrawler implements ICrawler {
  constructor(
    private readonly pageDetector: IPageDetector,
    private readonly strategyFactory: CrawlingStrategyFactory,
    private readonly linkExtractor: ILinkExtractor,
    private readonly urlQueue: IUrlQueue,
    private readonly jobManager: IJobManager,
    private readonly documentProcessor: IDocumentProcessor,
    private readonly robotsTxtService: IRobotsTxtService,
    private readonly rateLimiter: IRateLimiter,
    private readonly options: CrawlOptions
  ) {}

  async crawl(jobId: string, startUrl: string, options: CrawlOptions): Promise<void> {
    // Initialize and configure
    await this.initialize();
    this.urlQueue.add(startUrl, 0);
    const crawlOptions = { ...this.options, ...options };

    try {
      // Main crawling loop
      while (this.urlQueue.size() > 0) {
        // Check job status (cancelled, paused)
        if (!(await this.jobManager.shouldContinue(jobId))) {
          break;
        }

        const { url, depth } = this.urlQueue.getNext()!;

        // Skip if already visited or too deep
        if (this.urlQueue.isVisited(url) || depth > crawlOptions.maxDepth) {
          continue;
        }

        // Skip URLs disallowed by robots.txt
        if (crawlOptions.respectRobotsTxt && !this.robotsTxtService.isAllowed(url)) {
          this.urlQueue.markVisited(url);
          continue;
        }

        // Process URL
        try {
          // Throttle requests per domain
          await this.rateLimiter.acquireToken(new URL(url).hostname);

          // Select the appropriate extractor based on page type
          const extractor = await this.strategyFactory.getExtractorForUrl(url);

          // Extract content
          const content = await extractor.extract(url, {
            userAgent: crawlOptions.userAgent,
            timeout: crawlOptions.timeout
          });

          // Mark URL as visited
          this.urlQueue.markVisited(url);

          // Process document
          await this.documentProcessor.createDocument({
            url: content.url,
            title: content.title,
            content: content.content,
            metadata: content.metadata,
            crawlDate: new Date(),
            level: depth,
            jobId
          });

          // Extract and queue new links
          const links = await this.linkExtractor.extractLinks(
            content.content,
            crawlOptions.baseUrl,
            url
          );

          for (const link of links) {
            if (!this.urlQueue.isVisited(link)) {
              this.urlQueue.add(link, depth + 1);
            }
          }

          // Update job progress
          await this.updateJobProgress(jobId);
        } catch (error) {
          // Handle errors for this URL: mark it visited,
          // record the error on the job, and continue crawling
          this.urlQueue.markVisited(url);
        }
      }

      // Complete job
      await this.markJobCompleted(jobId);
    } catch (error) {
      // Handle unexpected errors
      await this.markJobFailed(jobId, error);
    }
  }
}
```

## Performance Considerations

1. **Resource Management**
   - Puppeteer is only used when needed (for SPAs)
   - Detection results are cached at the domain level
   - Cheerio is used by default for static pages (lower resource usage)

2. **Optimization Strategies**
   - Reuse of existing document data when available
   - Domain-level rate limiting to avoid overloading servers
   - Configurable crawl depth and timeouts

3. **Error Handling**
   - Graceful handling of network failures
   - Appropriate retries with exponential backoff
   - Detailed error reporting

## Implementation Plan

1. Create interfaces and types
2. Implement core utilities
3. Build concrete implementations for each interface
4. Create factory classes
5. Implement the main crawler
6. Write tests for each component
7. Integration testing
8. Performance testing and optimization

## Configuration Options

The system will support the following configuration options:

```typescript
export interface CrawlOptions {
  maxDepth: number;               // Maximum crawl depth
  baseUrl: string;                // Base URL for same-domain checking
  rateLimit?: number;             // Milliseconds between requests
  respectRobotsTxt?: boolean;     // Whether to respect robots.txt
  userAgent?: string;             // User agent for requests
  timeout?: number;               // Request timeout
  forceStrategy?: 'cheerio' | 'puppeteer'; // Force a specific strategy
  maxRedirects?: number;          // Maximum redirects to follow
  reuseCachedContent?: boolean;   // Whether to reuse recently crawled content
  cacheExpiry?: number;           // Age limit for reusing content (in days)
}
```

## Benefits of the New Architecture

1. **SOLID Principles Compliance**
   - Single Responsibility: each class has one purpose
   - Open/Closed: easy to extend without modifying existing code
   - Liskov Substitution: implementations are interchangeable
   - Interface Segregation: clean, focused interfaces
   - Dependency Inversion: high-level modules depend on abstractions

2. **Better Testing**
   - Components can be tested in isolation
   - Interfaces allow for easy mocking
   - Reduced test complexity

3. **Improved Maintainability**
   - Smaller, focused classes
   - Clear separation of concerns
   - Well-defined interfaces

4. **Enhanced Flexibility**
   - Support for both static sites and SPAs
   - Easy to add new extraction strategies
   - Configurable behavior
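As a closing illustration, the sketch below shows how a caller might obtain a fully wired crawler and start a job. The `ServiceFactory.createCrawler` signature is an assumption based on the directory structure above; only `CrawlOptions` and `ICrawler` come directly from this spec:

```typescript
import { ServiceFactory } from './factories/ServiceFactory';
import { ICrawler } from './interfaces/ICrawler';
import { CrawlOptions } from './interfaces/types';

// Hypothetical wiring: ServiceFactory assembles the detector, extractors,
// URL queue, job manager, etc. behind the ICrawler interface.
const options: CrawlOptions = {
  maxDepth: 3,
  baseUrl: 'https://docs.example.com',
  rateLimit: 500,               // ms between requests per domain
  respectRobotsTxt: true,
  userAgent: 'docmcp-crawler',
  timeout: 30_000,
  reuseCachedContent: true,
  cacheExpiry: 7                // Reuse documents younger than 7 days
};

async function run(jobId: string): Promise<void> {
  // In practice the jobId would come from IJobManager.createJob()
  const crawler: ICrawler = ServiceFactory.createCrawler(options);
  await crawler.crawl(jobId, 'https://docs.example.com/getting-started', options);
  console.log(await crawler.getProgress());
}

run('job-123').catch(console.error);
```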
