Extracts content from GitHub repositories including README files, documentation, and code files, with support for different branches and file types.
Provides full document lifecycle management including storage, retrieval, full-text search, and CRUD operations with automatic collection creation and schema management.
Document Extractor MCP Server
A Model Context Protocol (MCP) server that extracts document content from Microsoft Learn and GitHub URLs, storing them in PocketBase for easy retrieval and search.
Features
✅ Latest MCP SDK Features (v1.12.0+)
Modern McpServer architecture with enhanced capabilities
Multiple transport protocols: STDIO, Streamable HTTP, SSE
Dynamic tool management with lazy loading
Session management for stateful connections
Server-Sent Events support with backwards compatibility
Real-time server statistics and metrics
✅ Content Extraction
Microsoft Learn articles with rich metadata
GitHub files (README, documentation, code files)
Intelligent content parsing and cleaning
Duplicate detection and updates
✅ PocketBase Integration
Persistent document storage
Full-text search capabilities
Metadata preservation
CRUD operations
✅ Advanced Server Features
Multiple transport modes (STDIO/HTTP)
Health check and info endpoints
Read-only mode support
Enhanced error handling and debugging
Resource endpoints for server metrics
✅ Rich Metadata
Word counts and content statistics
Source attribution and URLs
Extraction timestamps
Content headers and descriptions
Requirements
Node.js 18+ with ES modules support
PocketBase server running
Network access for content extraction
Installation
1. Install Dependencies
2. PocketBase Setup
The MCP server supports both local and remote PocketBase instances. Choose the setup that best fits your needs:
Option A: Local PocketBase Instance
Download and install PocketBase:
# Download from https://pocketbase.io/docs/
# Extract the executable to your preferred directory
Start local PocketBase server:
# Run from the directory containing pocketbase.exe
.\pocketbase.exe serve
# Or specify custom port and data directory
.\pocketbase.exe serve --http="127.0.0.1:8090" --dir="./pb_data"
Set up admin account:
Access PocketBase Admin UI at http://127.0.0.1:8090/_/
Create your admin account
Note the email/password for configuration
Option B: Remote PocketBase Instance
Deploy PocketBase to your preferred hosting:
Railway, Fly.io, DigitalOcean, AWS, etc.
Follow your hosting provider's deployment guide
Ensure HTTPS is enabled for production
Configure your remote instance:
Set up admin account through the web interface
Configure CORS settings if needed
Note the full URL (e.g., https://your-pb-instance.com)
Option C: Docker PocketBase
Using Docker Compose:
version: '3.8'
services:
  pocketbase:
    image: ghcr.io/muchobien/pocketbase:latest
    ports:
      - "8090:8090"
    volumes:
      - ./pb_data:/pb/pb_data
Collection Management (Automatic for all setups):
The server will automatically create the required documents collection on startup
If AUTO_CREATE_COLLECTION=true (default), no manual setup is needed
Use the ensure_collection tool to manually verify/create collections
Use the collection_info tool to check collection status
Manual Collection Setup (if needed):
Access PocketBase Admin UI
Create a new collection named documents
Add these fields:
title (Text, required)
content (Text, required)
metadata (JSON, required)
created (Date, auto-generated)
updated (Date, optional)
3. Environment Configuration
Create a .env file in the project root. The server supports both local and remote PocketBase instances:
For Local PocketBase Instance:
For Remote PocketBase Instance:
For Dockerized PocketBase:
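As a minimal sketch of how settings for any of the three setups might be read at startup: the variable names POCKETBASE_URL, POCKETBASE_ADMIN_EMAIL, POCKETBASE_ADMIN_PASSWORD, and DOCUMENTS_COLLECTION, the loadConfig helper, and the fallback values are hypothetical placeholders for illustration. Only AUTO_CREATE_COLLECTION is a name this README confirms.

```javascript
// Sketch only: every variable name except AUTO_CREATE_COLLECTION is a
// hypothetical placeholder, and so are the fallback values.
function loadConfig(env) {
  const required = ['POCKETBASE_ADMIN_EMAIL', 'POCKETBASE_ADMIN_PASSWORD'];
  const missing = required.filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error(`Missing required settings: ${missing.join(', ')}`);
  }
  return {
    // Local and Docker setups would point at a local port; a remote setup
    // would use the full HTTPS URL of the hosted instance.
    pocketbaseUrl: env.POCKETBASE_URL || 'http://127.0.0.1:8090',
    adminEmail: env.POCKETBASE_ADMIN_EMAIL,
    adminPassword: env.POCKETBASE_ADMIN_PASSWORD,
    collection: env.DOCUMENTS_COLLECTION || 'documents',
    // Auto-creation stays on unless it is explicitly set to "false".
    autoCreateCollection: env.AUTO_CREATE_COLLECTION !== 'false',
  };
}
```

Calling loadConfig(process.env) after the .env file has been loaded would fail fast when the admin credentials are absent, rather than at the first database call.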
Usage
Starting the Server
The server supports multiple transport modes:
Transport Modes
STDIO Mode (Default)
Perfect for Claude Desktop and command-line MCP clients:
HTTP Mode
Enables web-based clients and testing with multiple protocols:
Available endpoints in HTTP mode:
POST /mcp - Streamable HTTP transport (modern protocol 2025-03-26)
GET /sse - Server-Sent Events transport (legacy protocol 2024-11-05)
POST /messages - SSE message endpoint
GET /health - Health check endpoint
GET /info - Server information endpoint
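To illustrate the Streamable HTTP endpoint, here is a hedged sketch of how a client might assemble an initialize request for POST /mcp. The helper name buildMcpPost, the client name, and the base URL http://localhost:3000 are assumptions for illustration; the port this server listens on is not stated here.

```javascript
// Builds the URL and fetch() options for a JSON-RPC message sent to POST /mcp.
function buildMcpPost(baseUrl, message, sessionId) {
  const headers = {
    'Content-Type': 'application/json',
    // Streamable HTTP clients accept either a JSON body or an SSE stream.
    'Accept': 'application/json, text/event-stream',
  };
  // After initialization, the session ID issued by the server is echoed back.
  if (sessionId) headers['Mcp-Session-Id'] = sessionId;
  return {
    url: new URL('/mcp', baseUrl).toString(),
    options: { method: 'POST', headers, body: JSON.stringify(message) },
  };
}

const initializeMessage = {
  jsonrpc: '2.0',
  id: 1,
  method: 'initialize',
  params: {
    protocolVersion: '2025-03-26',
    capabilities: {},
    clientInfo: { name: 'example-client', version: '0.0.1' }, // hypothetical client
  },
};

// Assumed base URL; substitute the address your server actually uses.
const initRequest = buildMcpPost('http://localhost:3000', initializeMessage);
console.log(initRequest.url);
```

Passing initRequest.url and initRequest.options to fetch() would send the message; the helper only prepares the request so it can be inspected first.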
Available Tools
1. extract_document
Extract and store content from URLs.
Parameters:
url (string, required): Microsoft Learn or GitHub URL
Example:
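A minimal sketch of the JSON-RPC message an MCP client would send to invoke this tool, assuming the standard tools/call method. The helper name toolCall is illustrative, and the URL is one of the Microsoft Learn examples listed under Supported Sources.

```javascript
// Wraps a tool name and its arguments in an MCP tools/call request.
function toolCall(id, name, args) {
  return {
    jsonrpc: '2.0',
    id,
    method: 'tools/call',
    params: { name, arguments: args },
  };
}

const extractRequest = toolCall(1, 'extract_document', {
  url: 'https://learn.microsoft.com/en-us/dotnet/core/introduction',
});
console.log(JSON.stringify(extractRequest, null, 2));
```

The same helper works for the other tools by changing the name and arguments, for example toolCall(2, 'get_document', { id: '...' }).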
2. list_documents
List stored documents with pagination.
Parameters:
limit (number, optional): Max results per page (1-100, default: 20)
page (number, optional): Page number (default: 1)
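The documented ranges and defaults can be expressed as a small validation step. This is a dependency-free sketch rather than the server's actual validation, and the function name parseListParams is hypothetical.

```javascript
// Applies the documented defaults (limit 20, page 1) and rejects values
// outside the documented ranges.
function parseListParams(input = {}) {
  const limit = input.limit === undefined ? 20 : input.limit;
  const page = input.page === undefined ? 1 : input.page;
  if (!Number.isInteger(limit) || limit < 1 || limit > 100) {
    throw new RangeError('limit must be an integer between 1 and 100');
  }
  if (!Number.isInteger(page) || page < 1) {
    throw new RangeError('page must be a positive integer');
  }
  return { limit, page };
}
```

Rejecting out-of-range values, instead of silently clamping them, makes a client mistake visible in the tool response.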
3. search_documents
Search documents by title or content.
Parameters:
query (string, required): Search query
limit (number, optional): Max results (1-100, default: 50)
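One way a title-or-content search could map onto PocketBase is a filter expression using the ~ (contains) operator. This is a sketch under that assumption; how the server actually builds its query is not documented here, and buildSearchFilter is a hypothetical helper.

```javascript
// Builds a PocketBase filter that matches the query in either field.
// Single quotes in the query are escaped so they cannot end the string literal.
function buildSearchFilter(query) {
  const quoted = "'" + String(query).replace(/'/g, "\\'") + "'";
  return `title ~ ${quoted} || content ~ ${quoted}`;
}

console.log(buildSearchFilter('azure openai'));
```

The resulting string would be passed as the filter option of a list request against the documents collection. The PocketBase JavaScript SDK also provides a pb.filter() helper with placeholder binding, which is generally preferable to manual escaping.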
4. get_document
Retrieve a specific document by ID.
Parameters:
id (string, required): Document ID
5. delete_document
Delete a document by ID.
Parameters:
id (string, required): Document ID to delete
6. ensure_collection ✨ New!
Check if the documents collection exists and create it if needed.
Parameters: None
Description: Automatically verifies the documents collection exists in PocketBase. If not found, creates the collection with the proper schema including all required fields and indexes.
7. collection_info ✨ New!
Get detailed information about the documents collection including statistics.
Parameters: None
Description: Returns comprehensive collection information including schema details, record counts, indexes, and timestamps.
Available Resources
1. stats://server
Real-time server statistics and metrics.
Content:
Total document count
Server information (name, version, uptime)
Memory usage statistics
Environment information
Read-only mode status
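The items above could be gathered with Node.js built-ins, as in this sketch. The field names, the buildServerStats helper, and the sample server name are assumptions; the real resource may use a different shape.

```javascript
// Assembles a statistics payload from values Node.js exposes at runtime.
// documentCount and readOnly are supplied by the caller.
function buildServerStats({ name, version, documentCount, readOnly }) {
  const memory = process.memoryUsage();
  return {
    documents: { total: documentCount },
    server: { name, version, uptimeSeconds: Math.round(process.uptime()) },
    memory: { rssBytes: memory.rss, heapUsedBytes: memory.heapUsed },
    environment: { node: process.version, platform: process.platform },
    readOnly,
  };
}

const sampleStats = buildServerStats({
  name: 'document-extractor', // hypothetical server name
  version: '1.1.0',
  documentCount: 3,
  readOnly: false,
});
console.log(JSON.stringify(sampleStats, null, 2));
```

A client would read the resource with an MCP resources/read request whose uri parameter is stats://server.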
Dynamic Tool Management
The server supports dynamic tool management with lazy loading:
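As an illustration of the lazy-loading idea, the sketch below registers tool factories up front and only builds a tool the first time it is requested. The LazyToolRegistry class is hypothetical and is not the server's actual mechanism.

```javascript
// Tools are registered as factories; each factory runs at most once,
// on first use, and its result is cached.
class LazyToolRegistry {
  constructor() {
    this.factories = new Map();
    this.loaded = new Map();
  }

  register(name, factory) {
    this.factories.set(name, factory);
  }

  // Listing names never triggers loading, so discovery stays cheap.
  names() {
    return [...this.factories.keys()];
  }

  get(name) {
    if (!this.loaded.has(name)) {
      const factory = this.factories.get(name);
      if (!factory) throw new Error(`Unknown tool: ${name}`);
      this.loaded.set(name, factory());
    }
    return this.loaded.get(name);
  }
}

const registry = new LazyToolRegistry();
registry.register('extract_document', () => ({
  run: (args) => `extracting ${args.url}`, // placeholder handler
}));
console.log(registry.names());
```

The design keeps tool discovery independent of expensive setup such as database authentication, which can be deferred to the factory.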
Session Management
In HTTP mode, the server supports session management:
Streamable HTTP: Modern session management with automatic session ID generation
SSE (Legacy): Backwards compatible session handling
Session persistence: Sessions are maintained across requests
Automatic cleanup: Sessions are cleaned up when connections close
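The lifecycle described above (generate an ID, keep the session across requests, remove it on close) can be sketched with a small in-memory store. SessionStore and its method names are hypothetical and only illustrate the pattern.

```javascript
// Keeps one entry per active session, keyed by a generated session ID.
class SessionStore {
  // createId is injected so any ID generator can be used.
  constructor(createId) {
    this.createId = createId;
    this.sessions = new Map();
  }

  // Called when a client initializes; returns the new session ID.
  open(transport) {
    const id = this.createId();
    this.sessions.set(id, { transport, createdAt: Date.now() });
    return id;
  }

  // Looks up the session for a follow-up request; undefined if unknown.
  get(id) {
    return this.sessions.get(id);
  }

  // Called when the connection closes; returns true if a session was removed.
  close(id) {
    return this.sessions.delete(id);
  }
}
```

In Node.js, randomUUID from the node:crypto module is a suitable createId function.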
Supported Sources
Microsoft Learn
Full article extraction
Metadata preservation (description, keywords, author)
Section headers extraction
Content cleaning and formatting
Example URLs:
https://learn.microsoft.com/en-us/azure/cognitive-services/openai/
https://learn.microsoft.com/en-us/dotnet/core/introduction
GitHub
File content extraction (README, docs, code)
Repository metadata
Branch handling (main/master fallback)
File type detection
Supported URL formats:
https://github.com/owner/repo (assumes README.md)
https://github.com/owner/repo/blob/main/file.md
https://raw.githubusercontent.com/owner/repo/main/file.md
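The three URL formats can be normalized to raw-content URLs before fetching. The sketch below is one possible approach, not the server's confirmed logic; rawUrlCandidates is a hypothetical helper, and it assumes branch names without slashes.

```javascript
// Returns the raw.githubusercontent.com URLs to try, in order.
// A bare repository URL yields README.md on main, then master as a fallback.
function rawUrlCandidates(input) {
  const url = new URL(input);
  const parts = url.pathname.split('/').filter(Boolean);
  if (url.hostname === 'raw.githubusercontent.com') {
    return [url.toString()];
  }
  if (url.hostname !== 'github.com' || parts.length < 2) {
    throw new Error(`Unsupported GitHub URL: ${input}`);
  }
  const [owner, repo, kind, branch, ...file] = parts;
  const raw = (ref, path) =>
    `https://raw.githubusercontent.com/${owner}/${repo}/${ref}/${path}`;
  if (kind === 'blob' && branch && file.length > 0) {
    return [raw(branch, file.join('/'))];
  }
  if (parts.length === 2) {
    return ['main', 'master'].map((ref) => raw(ref, 'README.md'));
  }
  throw new Error(`Unsupported GitHub URL: ${input}`);
}

console.log(rawUrlCandidates('https://github.com/owner/repo'));
```

A caller would request each candidate in turn and stop at the first successful response, which gives the main/master fallback described above.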
Configuration Options
Environment Variables
| Variable | Description | Default |
| | PocketBase server URL | |
| | Admin email for authentication | Required |
| | Admin password | Required |
| | Collection name for documents | |
| | Enable debug logging | |
| | Environment mode | |
| | Disable write operations | |
| AUTO_CREATE_COLLECTION | Auto-create collections on startup | true |
Debug Mode
Enable detailed logging:
Debug logs include:
Authentication status
Content extraction details
Database operations
Error context
Error Handling
The server implements comprehensive error handling:
Network errors: Timeout and connection issues
Authentication errors: PocketBase connection problems
Validation errors: Invalid input parameters
Content errors: Extraction failures
Database errors: Storage and retrieval issues
All errors are returned as structured MCP responses with appropriate error codes.
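As a sketch of what a structured error response might look like, the helper below maps the five categories to an MCP tool result flagged with isError. The labels and the toToolError helper are assumptions, not the server's actual output.

```javascript
// Human-readable labels for the error categories listed above.
const ERROR_LABELS = {
  network: 'Network error',
  authentication: 'Authentication error',
  validation: 'Validation error',
  content: 'Content error',
  database: 'Database error',
};

// Converts a caught error into an MCP tool result that clients can
// recognize as a failure through the isError flag.
function toToolError(category, error) {
  const label = ERROR_LABELS[category] || 'Unexpected error';
  return {
    isError: true,
    content: [{ type: 'text', text: `${label}: ${error.message}` }],
  };
}

console.log(toToolError('validation', new Error('url is required')));
```

Returning the failure as a tool result, rather than throwing, lets the calling model read the message and decide how to recover.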
Development
Scripts
Testing the Server
Troubleshooting
Common Issues
Authentication Failed
Verify PocketBase is running: http://127.0.0.1:8090
Check admin credentials in .env
Ensure admin user exists in PocketBase
Content Extraction Errors
Check network connectivity
Verify URL accessibility
Review debug logs for details
Collection Not Found
Use the ensure_collection tool to automatically create the collection
Check collection name in environment variables
Verify AUTO_CREATE_COLLECTION is enabled
Check collection permissions
Module Import Errors
Ensure "type": "module" is set in package.json
Use Node.js 18+ with ES modules support
Check all dependencies are installed
Debug Information
Enable debug mode to see detailed logs:
PocketBase Collection Schema
If you need to recreate the collection, use this schema:
MCP Client Configuration
Claude Desktop Configuration
Add this to your Claude Desktop MCP settings:
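Claude Desktop reads server definitions from the mcpServers key of its configuration file. The sketch below generates such an entry; the server key document-extractor, the script path, and the helper name are placeholders, and the env block should carry whichever variables your .env file defines.

```javascript
// Produces a Claude Desktop configuration entry that launches the
// server over STDIO with Node.js.
function claudeDesktopConfig(serverPath, env) {
  return {
    mcpServers: {
      'document-extractor': { // placeholder key; any unique name works
        command: 'node',
        args: [serverPath],
        env,
      },
    },
  };
}

const desktopConfig = claudeDesktopConfig(
  '/absolute/path/to/server.js', // placeholder path to the server entry point
  { AUTO_CREATE_COLLECTION: 'true' }
);
console.log(JSON.stringify(desktopConfig, null, 2));
```

The printed JSON can be merged into the existing configuration file; Claude Desktop needs a restart to pick up the change.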
License
MIT License - see LICENSE file for details.
Contributing
Fork the repository
Create a feature branch
Make your changes
Add tests if applicable
Submit a pull request
Changelog
v1.1.0 ✨ Latest Update
Latest MCP SDK v1.13.1+: Upgraded to the newest Model Context Protocol SDK
Latest PocketBase SDK v0.26.1+: Updated to the latest PocketBase features
Collection Management Tools: Added ensure_collection and collection_info tools
Auto-Collection Creation: Automatic database schema setup on startup
Enhanced Lazy Loading: Improved dynamic tool management
Latest SSE Features: Modern Server-Sent Events implementation
Improved Error Handling: Better collection management error recovery
Enhanced Documentation: Comprehensive usage examples and troubleshooting
v1.0.0
Updated to latest Anthropic MCP SDK
Added comprehensive error handling
Implemented input validation with Zod
Enhanced metadata extraction
Added debug logging
Improved documentation
Added PocketBase integration
Support for Microsoft Learn and GitHub
Deployment
Smithery Deployment
This MCP server supports deployment on Smithery, a platform for hosting MCP servers.
TypeScript Deploy (Recommended)
The fastest way to deploy this server on Smithery:
Fork or Clone this repository to your GitHub account
Connect GitHub to Smithery (or claim your server if already listed)
Navigate to the Deployments tab on your server page
Click Deploy - Smithery will automatically build and host your server
The smithery.yaml file is already configured for TypeScript/Node.js deployment.
Note: Despite being called "TypeScript Deploy", this method works perfectly for Node.js projects with ES modules.
Custom Deploy (Docker)
For advanced deployment with full Docker control:
Replace smithery.yaml with the container configuration:
cp smithery-container.yaml smithery.yaml
Push to GitHub with the updated configuration
Deploy via Smithery's Deployments tab
The Dockerfile is optimized for production deployment with security best practices.
Configuration
When deploying on Smithery, you'll configure:
PocketBase URL: Your PocketBase instance URL
Admin Credentials: Email and password for PocketBase admin
Collection Settings: Default collection name and auto-creation
Debug Mode: Enable detailed logging (optional)
Best Practices for Smithery
Tool Discovery: All tools are available without authentication for discovery
Lazy Authentication: API validation occurs only when tools are invoked
Environment Variables: Configuration is handled via Smithery's config schema
Health Checks: Built-in health monitoring at the /health endpoint