# Content Processing
## Overview
The content processing system transforms raw content from various sources into searchable document chunks through a modular strategy-pipeline-splitter architecture. The system handles web pages, local files, and package registries, processing different content types with specialized pipelines that preserve document structure while optimizing chunk sizes for embedding generation.
## Strategy Pattern Architecture
The content processing system uses the Strategy pattern to handle different content sources:
```mermaid
classDiagram
    class ScraperStrategy {
        <<interface>>
        +canHandle(url) boolean
        +scrape(options, callback, signal) Promise~void~
        +cleanup() Promise~void~
    }
    class BaseScraperStrategy {
        <<abstract>>
        #visited Set~string~
        #queue QueueItem[]
        +scrape(options, callback, signal) Promise~void~
        #processItem(item) Promise~ProcessItemResult~*
    }
    class WebScraperStrategy {
        +canHandle(url) boolean
        #processItem(item) Promise~ProcessItemResult~
    }
    class LocalFileStrategy {
        +canHandle(url) boolean
        #processItem(item) Promise~ProcessItemResult~
    }
    class NpmScraperStrategy {
        +canHandle(url) boolean
        #processItem(item) Promise~ProcessItemResult~
    }
    class PyPiScraperStrategy {
        +canHandle(url) boolean
        #processItem(item) Promise~ProcessItemResult~
    }
    class GitHubScraperStrategy {
        +canHandle(url) boolean
        #processItem(item) Promise~ProcessItemResult~
    }
    ScraperStrategy <|.. BaseScraperStrategy : implements
    BaseScraperStrategy <|-- WebScraperStrategy : extends
    BaseScraperStrategy <|-- LocalFileStrategy : extends
    BaseScraperStrategy <|-- NpmScraperStrategy : extends
    BaseScraperStrategy <|-- PyPiScraperStrategy : extends
    BaseScraperStrategy <|-- GitHubScraperStrategy : extends
```
**Code Reference:**
- `src/scraper/types.ts` - ScraperStrategy interface definition
- `src/scraper/strategies/BaseScraperStrategy.ts` - Abstract base implementation
- `src/scraper/strategies/*.ts` - Concrete strategy implementations
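The interface contract in the diagram is small. A minimal sketch of how a registry might select a strategy via `canHandle` is shown below; the `ScrapeOptions` and `ProgressCallback` shapes are illustrative assumptions, not the exact types in `src/scraper/types.ts`:
```typescript
// Illustrative sketch -- option and callback shapes are assumptions,
// not the exact types defined in src/scraper/types.ts.
interface ScrapeOptions {
  url: string;
  maxDepth?: number;
}
type ProgressCallback = (progress: { url: string; pagesProcessed: number }) => void;

interface ScraperStrategy {
  canHandle(url: string): boolean;
  scrape(options: ScrapeOptions, callback: ProgressCallback, signal?: AbortSignal): Promise<void>;
  cleanup(): Promise<void>;
}

// Pick the first strategy that claims the URL; fail loudly if none matches.
function selectStrategy(strategies: ScraperStrategy[], url: string): ScraperStrategy {
  const strategy = strategies.find((s) => s.canHandle(url));
  if (!strategy) throw new Error(`No scraper strategy can handle ${url}`);
  return strategy;
}
```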
## Architecture Components
### Scraper Strategies
Handle different content sources and coordinate the overall scraping process:
- **WebScraperStrategy**: HTTP/HTTPS URLs with JavaScript rendering support
- **LocalFileStrategy**: Local filesystem access with directory traversal
- **NpmScraperStrategy**: npm registry package documentation
- **PyPiScraperStrategy**: PyPI package documentation
- **GitHubScraperStrategy**: GitHub repository documentation
Each strategy manages URL discovery, scope filtering, and progress tracking while delegating content processing to pipelines.
### Content Fetchers
Abstract content retrieval across different sources:
- **HttpFetcher**: Web content with Playwright support, retry logic, and error handling
- **FileFetcher**: Local filesystem access with MIME type detection and encoding resolution
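A sketch of the shared fetcher shape this abstraction implies; the interface and field names are assumptions for illustration:
```typescript
// Assumed shape for illustration; the real fetchers also cover retries,
// Playwright rendering, and encoding resolution.
interface FetchedContent {
  content: string | Uint8Array;
  mimeType: string;
  charset?: string;
}

interface ContentFetcher {
  // e.g. HttpFetcher accepts http(s):// sources, FileFetcher accepts file:// paths
  canFetch(source: string): boolean;
  fetch(source: string, signal?: AbortSignal): Promise<FetchedContent>;
}
```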
### Processing Pipelines
Transform raw content using middleware chains and content-type-specific logic:
- **HtmlPipeline**: Converts HTML to clean markdown via middleware, then applies semantic splitting
- **MarkdownPipeline**: Processes markdown with metadata extraction and semantic splitting
- **JsonPipeline**: Validates JSON structure and applies hierarchical splitting
- **SourceCodePipeline**: Handles programming languages with language detection and tree-sitter-based semantic splitting
- **TextPipeline**: Fallback for generic text content with basic processing
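How a factory might route fetched content to one of these pipelines, sketched under assumed interfaces (`RawContent` and `ContentPipeline` here are simplified stand-ins):
```typescript
// Simplified stand-ins; the real PipelineFactory and pipeline interfaces may differ.
interface RawContent {
  mimeType: string;
  body: string;
}

interface ContentPipeline {
  canProcess(content: RawContent): boolean;
  process(content: RawContent): Promise<string[]>; // returns chunk bodies
}

class PipelineFactory {
  constructor(
    private pipelines: ContentPipeline[], // HtmlPipeline, MarkdownPipeline, JsonPipeline, SourceCodePipeline
    private fallback: ContentPipeline,    // TextPipeline
  ) {}

  // First pipeline that accepts the content wins; TextPipeline catches everything else.
  select(content: RawContent): ContentPipeline {
    return this.pipelines.find((p) => p.canProcess(content)) ?? this.fallback;
  }
}
```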
### Middleware System
Transform content through ordered middleware chains within pipelines:
**HTML Processing** uses the most extensive middleware pipeline, executed in a specific order to ensure correctness and efficiency:
- **Dynamic Content Rendering**: (Optional) Uses Playwright to render JavaScript-heavy pages when needed
- **DOM Parsing**: Converts the raw HTML string into a manipulable DOM object using Cheerio
- **Metadata Extraction**: Extracts the document title from `<title>` or `<h1>` tags in the full DOM
- **Link Discovery**: Gathers all links from the complete page for the crawler to potentially follow
- **Content Sanitization**: Removes large, irrelevant sections like navigation, footers, ads, and boilerplate
- **URL Normalization**: Cleans the _remaining_ content by converting relative image/link URLs to absolute ones and removing non-functional links (anchors, `javascript:`, etc.) while preserving their text content
- **Markdown Conversion**: Converts the final, cleaned, and normalized HTML into Markdown format
**Markdown Processing** applies lighter middleware:
- Front matter and metadata extraction
- Link processing and resolution
**JSON and Source Code** use minimal middleware to preserve structural integrity.
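Conceptually, each pipeline threads a shared context through its middleware in declaration order. A minimal sketch of that dispatch follows; the context fields are simplified assumptions:
```typescript
// Conceptual sketch; the real context also carries the Cheerio DOM, options, and errors.
interface MiddlewareContext {
  content: string;  // HTML on the way in, markdown after conversion
  title?: string;   // filled by metadata extraction
  links: string[];  // filled by link discovery
}

interface ContentProcessorMiddleware {
  process(ctx: MiddlewareContext, next: () => Promise<void>): Promise<void>;
}

// Run middleware in order, so sanitization and URL normalization only see the
// DOM after metadata and links have already been captured from the full page.
async function runPipeline(ctx: MiddlewareContext, chain: ContentProcessorMiddleware[]): Promise<void> {
  const dispatch = async (i: number): Promise<void> => {
    const middleware = chain[i];
    if (middleware) await middleware.process(ctx, () => dispatch(i + 1));
  };
  await dispatch(0);
}
```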
### Document Splitters
Segment content into semantic chunks while preserving document structure:
- **SemanticMarkdownSplitter**: Hierarchy-aware splitting based on heading structure
- **JsonDocumentSplitter**: Property-based hierarchical splitting for JSON files
- **TreesitterSourceCodeSplitter**: Semantic boundary detection for source code
- **TextDocumentSplitter**: Line-based fallback for unsupported content
### Size Optimization
**GreedySplitter** provides universal size optimization across all content types:
- Merges small chunks until reaching minimum size thresholds
- Respects semantic boundaries from content-specific splitters
- Handles oversized content while preserving document structure
- Ensures optimal chunk sizes for embedding generation
Chunk sizes are controlled by three character-based thresholds:
| Setting | Role |
|:--------|:-----|
| `minChunkSize` | Floor for merging -- chunks below this are combined with neighbors |
| `preferredChunkSize` | Soft target -- the optimizer splits when merging would exceed this |
| `maxChunkSize` | Hard ceiling -- no chunk body will exceed this value |
All sizes are measured in **characters** (`string.length`), not tokens. Before embedding,
a metadata header (page title, URL, section path) is prepended to each chunk, so the total
text sent to the embedding model is slightly larger than the chunk body. The actual token
count depends on the embedding model's tokenizer.
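For example, with placeholder values (not the project's actual defaults), the three thresholds relate like this:
```typescript
// Placeholder numbers -- not the project's defaults.
const chunkSizes = {
  minChunkSize: 500,        // merge chunks smaller than this with their neighbors
  preferredChunkSize: 1500, // stop merging once the next merge would pass this
  maxChunkSize: 4000,       // hard ceiling for any single chunk body
};

// All comparisons use character counts, not tokens.
const body = "## Setup\nRun the installer and verify the service starts.";
const tooSmall = body.length < chunkSizes.minChunkSize; // true: this chunk will be merged
```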
## Content Processing Flow
```mermaid
graph TD
    subgraph "Content Sources"
        A1[Web URLs]
        A2[Local Files]
        A3[Package Registries]
    end
    subgraph "Strategy Layer"
        B1[WebScraperStrategy]
        B2[LocalFileStrategy]
        B3[Package Strategies]
    end
    subgraph "Fetching Layer"
        C1[HttpFetcher]
        C2[FileFetcher]
    end
    subgraph "Pipeline Selection"
        D[PipelineFactory]
        E1[HtmlPipeline]
        E2[MarkdownPipeline]
        E3[JsonPipeline]
        E4[SourceCodePipeline]
        E5[TextPipeline]
    end
    subgraph "Two-Phase Splitting"
        F1[Semantic Splitters]
        F2[GreedySplitter]
    end
    subgraph "Output"
        G[ContentChunk Array]
        H[Embedding Generation]
        I[Database Storage]
    end
    A1 --> B1
    A2 --> B2
    A3 --> B3
    B1 --> C1
    B2 --> C2
    B3 --> C1
    C1 --> D
    C2 --> D
    D --> E1
    D --> E2
    D --> E3
    D --> E4
    D --> E5
    E1 --> F1
    E2 --> F1
    E3 --> F1
    E4 --> F1
    E5 --> F1
    F1 --> F2
    F2 --> G
    G --> H
    H --> I
    style A1 fill:#e1f5fe
    style A2 fill:#e1f5fe
    style A3 fill:#e1f5fe
    style G fill:#f3e5f5
    style I fill:#e8f5e8
```
## Content-Type-Specific Processing
Different content types require specialized processing approaches while following the same overall pattern:
```mermaid
graph TD
    subgraph "HTML Processing"
        A1[Raw HTML] --> B1[Playwright Rendering]
        B1 --> C1[DOM Parsing]
        C1 --> D1[Metadata & Link Extraction]
        D1 --> E1[Content Sanitization]
        E1 --> F1[URL Normalization]
        F1 --> G1[HTML to Markdown]
        G1 --> H1[SemanticMarkdownSplitter]
    end
    subgraph "JSON Processing"
        A2[Raw JSON] --> B2[Structure Validation]
        B2 --> C2[JsonDocumentSplitter]
    end
    subgraph "Source Code Processing"
        A3[Raw Source Code] --> B3[Language Detection]
        B3 --> C3[TreesitterSourceCodeSplitter]
    end
    subgraph "Markdown Processing"
        A4[Raw Markdown] --> B4[Metadata Extraction]
        B4 --> C4[SemanticMarkdownSplitter]
    end
    H1 --> G[GreedySplitter]
    C2 --> G
    C3 --> G
    C4 --> G
    G --> H[ContentChunk Array]
    style A1 fill:#ffebee
    style A2 fill:#fff3e0
    style A3 fill:#f3e5f5
    style A4 fill:#e8f5e8
    style H fill:#e1f5fe
```
**Key Processing Differences:**
- **HTML**: Multi-stage middleware pipeline for web content cleaning and conversion
- **JSON**: Structural validation with hierarchical object/array splitting
- **Source Code**: Tree-sitter semantic boundary detection with language-specific parsing
- **Markdown**: Direct semantic splitting with metadata preservation
All content types converge on GreedySplitter for universal size optimization while preserving content-specific semantic boundaries.
## Two-Phase Splitting Architecture
### Phase 1: Semantic Splitting
Content-type-specific splitters preserve document structure:
**SemanticMarkdownSplitter** (HTML, Markdown):
- Analyzes heading hierarchy (H1-H6)
- Creates hierarchical paths like `["Guide", "Installation", "Setup"]`
- Preserves code blocks, tables, and list structures
- Maintains parent-child relationships
**JsonDocumentSplitter** (JSON):
- Object and property-level splitting
- Hierarchical path construction
- Concatenation-friendly chunk design
- Structural context preservation
**TreesitterSourceCodeSplitter** (Source Code):
- Language detection with language-specific parsing
- Semantic boundary detection via tree-sitter
**TextDocumentSplitter** (Plain Text):
- Line-based splitting with simple hierarchical structure
- Fallback for content without a specialized splitter
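To make the Phase 1 output concrete, here is how a short markdown document might map onto chunks with hierarchical paths; the `ContentChunk` shape is simplified for illustration:
```typescript
// Simplified chunk shape; the real chunks carry additional metadata.
interface ContentChunk {
  content: string;
  path: string[]; // heading hierarchy, e.g. ["Guide", "Installation", "Setup"]
}

// A document like "# Guide\n## Installation\n### Setup\nRun npm install." could yield:
const phaseOneChunks: ContentChunk[] = [
  { content: "# Guide", path: ["Guide"] },
  { content: "## Installation", path: ["Guide", "Installation"] },
  { content: "### Setup\nRun npm install.", path: ["Guide", "Installation", "Setup"] },
];
```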
### Phase 2: Size Optimization
**GreedySplitter** applies universal optimization:
- **Greedy Concatenation**: Merges small chunks until minimum size
- **Boundary Respect**: Preserves major section breaks (H1/H2 headings)
- **Metadata Merging**: Combines chunk metadata intelligently
- **Context Preservation**: Maintains hierarchical relationships
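A reduced sketch of the greedy pass, showing only the size-driven merge (boundary and metadata handling are omitted):
```typescript
// Reduced sketch: merge neighbors until minChunkSize is reached, without letting a
// merge exceed maxChunkSize. The real GreedySplitter also respects H1/H2 boundaries
// and merges chunk metadata.
function greedyMerge(chunks: string[], minChunkSize: number, maxChunkSize: number): string[] {
  const merged: string[] = [];
  let current = "";
  for (const chunk of chunks) {
    const combined = current ? `${current}\n\n${chunk}` : chunk;
    if (current && (current.length >= minChunkSize || combined.length > maxChunkSize)) {
      merged.push(current); // current is big enough, or the merge would overshoot: flush it
      current = chunk;
    } else {
      current = combined;
    }
  }
  if (current) merged.push(current);
  return merged;
}
```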
## Chunk Structure and Relationships
### Hierarchical Structure
Chunks maintain hierarchical relationships through path-based organization, enabling context-aware search and retrieval:
- **Parent**: Path with one fewer element
- **Children**: Paths extending current by one level
- **Siblings**: Same path length with shared parent
- **Context**: Related chunks automatically included in search results
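Because these relationships are encoded purely in the path arrays, they can be derived by comparing paths; a sketch with hypothetical helpers:
```typescript
// Hypothetical helpers -- the rules mirror the relationships listed above.
type SectionPath = string[];

const isParentOf = (a: SectionPath, b: SectionPath): boolean =>
  a.length === b.length - 1 && a.every((segment, i) => segment === b[i]);

const areSiblings = (a: SectionPath, b: SectionPath): boolean =>
  a.length === b.length &&
  a.slice(0, -1).every((segment, i) => segment === b[i]) &&
  a[a.length - 1] !== b[b.length - 1];

// ["Guide", "Installation"] is the parent of ["Guide", "Installation", "Setup"];
// ["Guide", "Installation", "Setup"] and ["Guide", "Installation", "Usage"] are siblings.
```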
### Search Context Retrieval
When returning search results, the system provides comprehensive context:
- The matching chunk itself
- Parent chunks for broader context
- Previous and following siblings for navigation
- Direct child chunks for deeper exploration
This approach ensures users receive coherent, contextual results regardless of content type.
## Error Handling and Quality Control
### Content Filtering
Automatic filtering of low-quality content:
- Navigation menus and sidebars
- Advertisement content and widgets
- Boilerplate text and templates
- Duplicate content detection
- Minimum content length thresholds
### Error Recovery
Graceful handling of processing errors:
- **Recoverable Errors**: Retry with backoff
- **Content Errors**: Skip and continue processing
- **Fatal Errors**: Stop with detailed error information
- **Warning Conditions**: Log and continue
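As an illustration of the recoverable path only, a generic retry-with-backoff sketch; the classification test and delays are assumptions, not the project's actual policy:
```typescript
// Generic retry-with-backoff; the recoverability check and delays are assumptions.
async function withRetry<T>(operation: () => Promise<T>, maxAttempts = 3): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await operation();
    } catch (error) {
      const recoverable = error instanceof Error && /timeout|429|503/i.test(error.message);
      // Content and fatal errors surface immediately; recoverable ones retry with backoff.
      if (!recoverable || attempt === maxAttempts) throw error;
      await new Promise((resolve) => setTimeout(resolve, 2 ** attempt * 1000));
    }
  }
}
```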
### Progress Tracking
Real-time processing feedback:
- Page-level progress updates
- Processing rate metrics
- Error count and classification
- Memory usage monitoring
## System Integration
The content processing system integrates with downstream components through standardized interfaces:
- **Embedding Generation**: Consistent chunk formatting enables seamless vector generation across all content types
- **Database Storage**: Hierarchical paths and metadata support efficient indexing and retrieval
- **Search System**: Context-aware results leverage chunk relationships for comprehensive response generation