# Web Scraper Module
Web scraping for ingesting documentation sites into the knowledge base.
## scraper/
```python
from ragstack_common.scraper import ScrapeJob, ScrapeConfig, ScrapeStatus, ScrapeScope
class ScrapeStatus(str, Enum):
PENDING, DISCOVERING, PROCESSING, COMPLETED, COMPLETED_WITH_ERRORS, FAILED, CANCELLED
class ScrapeScope(str, Enum):
SUBPAGES # Only paths under starting URL
HOSTNAME # Same subdomain
DOMAIN # All subdomains
@dataclass
class ScrapeConfig:
max_pages: int = 100
max_depth: int = 3
scope: ScrapeScope = ScrapeScope.SUBPAGES
    include_patterns: list[str] | None = None
    exclude_patterns: list[str] | None = None
scrape_mode: str = "auto" # auto, fast, full
cookies: str | None = None
force_rescrape: bool = False
@dataclass
class ScrapeJob:
job_id: str
base_url: str
status: ScrapeStatus
config: ScrapeConfig
total_urls: int
processed_count: int
failed_count: int
```
**Architecture:** Discovery via SQS, HTTP-first fetching with Playwright fallback, SHA-256 content deduplication.
## Overview
The scraper module enables ingesting entire documentation sites or website sections into the knowledge base. It crawls pages, extracts content as markdown, and handles deduplication. Designed for technical documentation, blogs, and knowledge bases.
## Data Models
### ScrapeStatus
```python
class ScrapeStatus(str, Enum):
PENDING = "PENDING" # Job created, not started
DISCOVERING = "DISCOVERING" # Finding URLs to scrape
PROCESSING = "PROCESSING" # Scraping pages
COMPLETED = "COMPLETED" # All pages processed successfully
COMPLETED_WITH_ERRORS = "COMPLETED_WITH_ERRORS" # Some pages failed
FAILED = "FAILED" # Job failed completely
CANCELLED = "CANCELLED" # User cancelled job
```
### ScrapeScope
```python
class ScrapeScope(str, Enum):
SUBPAGES = "SUBPAGES" # Only URLs under starting path
HOSTNAME = "HOSTNAME" # Same subdomain (e.g., docs.example.com)
DOMAIN = "DOMAIN" # All subdomains (e.g., *.example.com)
```
**Example:** starting from `https://docs.example.com/api/`:
| Scope | Includes | Excludes |
|-------|----------|----------|
| SUBPAGES | `docs.example.com/api/*` | `docs.example.com/blog/*` |
| HOSTNAME | `docs.example.com/*` | `www.example.com/*` |
| DOMAIN | `*.example.com/*` | `other-site.com/*` |
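For intuition, here is a minimal sketch of how scope matching can be implemented with the standard library. The helper name `url_in_scope` and the naive two-label domain heuristic are illustrative assumptions, not the module's actual code.
```python
from urllib.parse import urlparse

from ragstack_common.scraper import ScrapeScope

def url_in_scope(url: str, base_url: str, scope: ScrapeScope) -> bool:
    """Illustrative scope check (assumption, not the module's implementation)."""
    target, base = urlparse(url), urlparse(base_url)
    if target.hostname is None or base.hostname is None:
        return False
    if scope == ScrapeScope.SUBPAGES:
        # Same host, and the path must sit under the starting path
        return target.hostname == base.hostname and target.path.startswith(base.path)
    if scope == ScrapeScope.HOSTNAME:
        # Exact hostname match (e.g. docs.example.com only)
        return target.hostname == base.hostname
    # DOMAIN: any subdomain of the registered domain (naive two-label heuristic)
    base_domain = ".".join(base.hostname.split(".")[-2:])
    return target.hostname == base_domain or target.hostname.endswith("." + base_domain)

# url_in_scope("https://docs.example.com/api/auth", "https://docs.example.com/api/", ScrapeScope.SUBPAGES)  -> True
# url_in_scope("https://blog.example.com/post", "https://docs.example.com/api/", ScrapeScope.HOSTNAME)      -> False
```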
### ScrapeConfig
```python
@dataclass
class ScrapeConfig:
max_pages: int = 100
max_depth: int = 3
scope: ScrapeScope = ScrapeScope.SUBPAGES
    include_patterns: list[str] | None = None  # Regex patterns to include
    exclude_patterns: list[str] | None = None  # Regex patterns to exclude
scrape_mode: str = "auto" # auto, fast, full
cookies: str | None = None # Cookie string for auth
force_rescrape: bool = False # Ignore cached content
```
**scrape_mode:**
- `auto`: HTTP first, fallback to Playwright for dynamic content
- `fast`: HTTP only (faster, may miss dynamic content)
- `full`: Always use Playwright (slower, handles JavaScript)
### ScrapeJob
```python
@dataclass
class ScrapeJob:
job_id: str # Unique job identifier
base_url: str # Starting URL
status: ScrapeStatus # Current job status
config: ScrapeConfig # Scraping configuration
total_urls: int # Total URLs discovered
processed_count: int # URLs successfully processed
failed_count: int # URLs that failed
```
## Usage
### Basic Scraping
```python
from ragstack_common.scraper import ScrapeJob, ScrapeConfig, ScrapeScope, ScrapeStatus
# Create config
config = ScrapeConfig(
max_pages=50,
max_depth=2,
scope=ScrapeScope.SUBPAGES,
scrape_mode="auto"
)
# Create job
job = ScrapeJob(
job_id="scrape-123",
base_url="https://docs.example.com/api/",
status=ScrapeStatus.PENDING,
config=config,
total_urls=0,
processed_count=0,
failed_count=0
)
```
### Scope Examples
#### Subpages Only
```python
config = ScrapeConfig(
scope=ScrapeScope.SUBPAGES,
max_pages=100
)
# Starting URL: https://docs.example.com/api/
# Scrapes: /api/overview, /api/authentication, /api/endpoints/...
# Skips: /blog/, /about/, /pricing/
```
#### Same Subdomain
```python
config = ScrapeConfig(
scope=ScrapeScope.HOSTNAME,
max_pages=200
)
# Starting URL: https://docs.example.com/api/
# Scrapes: /api/*, /guides/*, /tutorials/*
# Skips: www.example.com/*, blog.example.com/*
```
#### All Subdomains
```python
config = ScrapeConfig(
scope=ScrapeScope.DOMAIN,
max_pages=500
)
# Starting URL: https://docs.example.com/
# Scrapes: docs.example.com/*, blog.example.com/*, www.example.com/*
# Skips: other-domain.com/*
```
### URL Filtering
#### Include Patterns
```python
config = ScrapeConfig(
include_patterns=[
r"/api/.*", # Include all API docs
r"/guides/.*", # Include all guides
r".*\.html$" # Include only HTML files
]
)
```
#### Exclude Patterns
```python
config = ScrapeConfig(
exclude_patterns=[
r".*/archive/.*", # Skip archived pages
r".*/draft/.*", # Skip draft pages
r".*\.pdf$", # Skip PDFs
r".*/comments.*" # Skip comment sections
]
)
```
#### Combined Filtering
```python
config = ScrapeConfig(
scope=ScrapeScope.HOSTNAME,
include_patterns=[r"/docs/.*"],
exclude_patterns=[r".*/v1/.*", r".*/deprecated/.*"],
max_pages=200
)
# Scrapes: /docs/api/, /docs/guides/
# Skips: /docs/v1/*, /docs/deprecated/*, /blog/*
```
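As a rough sketch of how these filters might combine (assuming include patterns are OR'd and exclude patterns veto), a check like the following could decide whether a discovered URL is kept; `url_passes_filters` is a hypothetical helper, not part of the package.
```python
import re

from ragstack_common.scraper import ScrapeConfig

def url_passes_filters(url: str, config: ScrapeConfig) -> bool:
    """Illustrative filtering: include first (match any), then exclude (match none)."""
    if config.include_patterns and not any(
        re.search(pattern, url) for pattern in config.include_patterns
    ):
        return False  # include list given, but nothing matched
    if config.exclude_patterns and any(
        re.search(pattern, url) for pattern in config.exclude_patterns
    ):
        return False  # an exclude pattern matched
    return True
```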
### Scrape Modes
#### Auto Mode (Recommended)
```python
config = ScrapeConfig(scrape_mode="auto")
# Tries HTTP first (fast)
# Falls back to Playwright if:
# - Content is JavaScript-rendered
# - Page requires browser features
# - HTTP fetch fails
```
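A minimal sketch of the HTTP-first-with-browser-fallback idea, using `requests` and Playwright's sync API. The length-based heuristic for detecting JavaScript-rendered pages is an assumption for illustration; the scraper's real detection logic may differ.
```python
import requests
from playwright.sync_api import sync_playwright

def fetch_html(url: str, timeout: int = 20) -> str:
    """HTTP-first fetch with a Playwright fallback (illustrative heuristic)."""
    try:
        resp = requests.get(url, timeout=timeout)
        resp.raise_for_status()
        # Naive heuristic: a substantial HTML body suggests server-rendered content
        if len(resp.text) > 2000:
            return resp.text
    except requests.RequestException:
        pass  # fall through to the browser
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # let JS finish rendering
        html = page.content()
        browser.close()
        return html
```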
#### Fast Mode
```python
config = ScrapeConfig(scrape_mode="fast")
# HTTP only
# Pros: 5-10x faster, lower resource usage
# Cons: Misses JavaScript-rendered content
# Use for: Static sites, server-rendered docs
```
#### Full Mode
```python
config = ScrapeConfig(scrape_mode="full")
# Always uses Playwright browser
# Pros: Handles all dynamic content
# Cons: Slower, higher resource usage
# Use for: Single-page apps, heavily dynamic sites
```
### Authentication
```python
config = ScrapeConfig(
cookies="session=abc123; auth_token=xyz789"
)
# Passes cookies to all requests
# Use for: Authenticated documentation, private wikis
```
**Note:** The cookie value is a standard HTTP `Cookie` header string: semicolon-separated `name=value` pairs.
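Before launching a large job, it can be worth verifying that the cookie string actually grants access. This snippet assumes the scraper simply forwards the string as a `Cookie` header; the URL is hypothetical.
```python
import requests

cookies = "session=abc123; auth_token=xyz789"
resp = requests.get(
    "https://internal-docs.example.com/",  # hypothetical authenticated site
    headers={"Cookie": cookies},           # forward the raw cookie string
    timeout=10,
)
print(resp.status_code)  # 200 suggests the cookies grant access
```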
### Force Rescrape
```python
config = ScrapeConfig(force_rescrape=True)
# Ignores content hash cache
# Scrapes even if content unchanged
# Use for: Testing, forcing updates
```
## Content Deduplication
The scraper uses SHA-256 content hashing to prevent duplicate ingestion:
```python
# Content hash computed from:
# - Main content text (excludes navigation, footer, etc.)
# - URL path (normalized)
# Example:
# URL 1: https://docs.example.com/api/authentication
# URL 2: https://docs.example.com/api/authentication?utm_source=email
# If content identical → only scrapes once
```
**Deduplication rules:**
- Same content, different URLs → scrapes once, uses canonical URL
- Same URL, different content → rescrapes if hash differs
- `force_rescrape=True` → bypasses hash check
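A minimal sketch of how such a hash could be computed with `hashlib`; the exact normalization (which URL parts are kept, how boilerplate is stripped) is an assumption, not the module's documented behavior.
```python
import hashlib
from urllib.parse import urlparse

def content_hash(url: str, main_text: str) -> str:
    """Illustrative SHA-256 hash over the normalized path plus extracted main text."""
    path = urlparse(url).path.rstrip("/") or "/"  # query string and fragment are dropped
    payload = f"{path}\n{main_text.strip()}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

# Tracking parameters do not change the hash:
h1 = content_hash("https://docs.example.com/api/authentication", "How to authenticate...")
h2 = content_hash("https://docs.example.com/api/authentication?utm_source=email", "How to authenticate...")
assert h1 == h2
```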
## Discovery Architecture
```
User initiates scrape
↓
Discovery Lambda
├─ Fetch starting URL
├─ Extract links
├─ Apply scope/filters
└─ Send URLs to SQS queue
↓
Processing Lambda (parallel)
├─ Fetch page content
├─ Extract markdown
├─ Check content hash
├─ Save to S3
└─ Update job progress
```
**Benefits:**
- Parallel processing (10-100 concurrent fetches)
- Resilient to failures (SQS retries)
- Scalable (handles 1,000+ page sites)
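For context, the discovery-to-queue handoff might look roughly like this with boto3; the queue URL, message shape, and helper name are illustrative assumptions.
```python
import json

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/scrape-urls"  # hypothetical

def enqueue_urls(job_id: str, urls: list[str]) -> None:
    """Send discovered URLs to SQS in batches of 10 (the SQS batch limit)."""
    for i in range(0, len(urls), 10):
        batch = urls[i : i + 10]
        sqs.send_message_batch(
            QueueUrl=QUEUE_URL,
            Entries=[
                {"Id": str(n), "MessageBody": json.dumps({"job_id": job_id, "url": u})}
                for n, u in enumerate(batch)
            ],
        )
```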
## Complete Example
```python
from ragstack_common.scraper import ScrapeJob, ScrapeConfig, ScrapeScope, ScrapeStatus
def scrape_documentation_site():
"""
Scrape Python documentation site.
"""
config = ScrapeConfig(
# Limit to 500 pages
max_pages=500,
# Depth 3 from starting URL
max_depth=3,
# Stay within docs.python.org
scope=ScrapeScope.HOSTNAME,
# Include only current version docs
include_patterns=[r"/3/.*"],
# Exclude tutorials and archives
exclude_patterns=[r".*/tutorial/.*", r".*/archive/.*"],
# Auto mode for optimal speed/completeness
scrape_mode="auto",
# Don't force rescrape (use cache)
force_rescrape=False
)
job = ScrapeJob(
job_id="scrape-python-docs",
base_url="https://docs.python.org/3/",
status=ScrapeStatus.PENDING,
config=config,
total_urls=0,
processed_count=0,
failed_count=0
)
return job
```
## Progress Tracking
```python
# Check job progress
if job.status == ScrapeStatus.PROCESSING:
    progress = job.processed_count / job.total_urls if job.total_urls else 0.0
print(f"Progress: {progress:.1%} ({job.processed_count}/{job.total_urls})")
print(f"Failed: {job.failed_count}")
# Check completion
if job.status == ScrapeStatus.COMPLETED:
print(f"Scrape complete: {job.processed_count} pages")
elif job.status == ScrapeStatus.COMPLETED_WITH_ERRORS:
print(f"Scrape complete with {job.failed_count} failures")
elif job.status == ScrapeStatus.FAILED:
print("Scrape failed")
```
## Error Handling
```python
from ragstack_common.scraper import ScrapeJob, ScrapeStatus
# Partial failures
if job.status == ScrapeStatus.COMPLETED_WITH_ERRORS:
success_rate = job.processed_count / (job.processed_count + job.failed_count)
if success_rate > 0.9:
print("Scrape mostly successful")
else:
print(f"High failure rate: {job.failed_count} of {job.total_urls}")
# Complete failure
if job.status == ScrapeStatus.FAILED:
# Check if base URL accessible
# Check scope settings (too restrictive?)
# Check authentication (cookies expired?)
pass
```
**Common issues:**
- **0 pages scraped**: Scope too restrictive, include patterns too narrow
- **High failure rate**: Site blocking bot traffic, rate limiting
- **Missing content**: `scrape_mode="fast"` on JavaScript-heavy site
## Best Practices
1. **Start Small**: Test with `max_pages=10` before full scrape
2. **Scope Appropriately**: Use SUBPAGES for documentation subsections
3. **Use Patterns**: Combine include/exclude patterns for precise control
4. **Choose Mode Wisely**: Use `auto` unless you know site is static or fully dynamic
5. **Monitor Progress**: Track `processed_count` and `failed_count`
6. **Handle Errors**: Expect some failures on large sites (broken links, 404s)
7. **Respect Rate Limits**: Use reasonable `max_pages` to avoid overwhelming sites
8. **Cache**: Set `force_rescrape=False` to leverage content hashing
## See Also
- [STORAGE.md](./STORAGE.md) - S3 storage for scraped content
- [TEXT_EXTRACTORS.md](./TEXT_EXTRACTORS.md) - HTML to markdown conversion
- [appsync.py](./UTILITIES.md#appsync) - Scrape job progress updates