LLM Researcher
A lightweight MCP (Model Context Protocol) server for LLM orchestration that provides efficient web content search and extraction capabilities. This CLI tool enables LLMs to search DuckDuckGo and extract clean, LLM-friendly content from web pages.
Built with TypeScript, tsup, and vitest for a modern development experience.
Features
- MCP Server Support: Provides Model Context Protocol server for LLM integration
- Free Operation: Uses DuckDuckGo HTML endpoint (no API costs)
- GitHub Code Search: Search GitHub repositories for code examples and implementation patterns
- Smart Content Extraction: Playwright + @mozilla/readability for clean content
- LLM-Optimized Output: Sanitized Markdown (h1-h3, bold, italic, links only)
- Rate Limited: Respects DuckDuckGo with 1 req/sec limit
- Cross-Platform: Works on macOS, Linux, and WSL
- Multiple Modes: CLI, MCP server, search, direct URL, and interactive modes
- Type Safe: Full TypeScript implementation with strict typing
- Modern Tooling: Built with tsup bundler and vitest testing
Installation
Prerequisites
- Node.js 20.0.0 or higher
- No local Chrome installation required (uses Playwright's bundled Chromium)
Setup
Usage
MCP Server Mode
Use as a Model Context Protocol server to provide search and content extraction tools to LLMs:
Setting up with Claude Code
MCP Tool Usage Examples
Once configured, you can use these tools in Claude:
Command Line Interface
Development
Scripts
Interactive Commands
When in search results view:
- 1-10: Select a result by number
- b or back: Return to search results
- open <n>: Open result #n in external browser
- q or quit: Exit the program
When viewing content:
- b or back: Return to search results
- /<term>: Search for term within the extracted content
- open: Open current page in external browser
- q or quit: Exit the program
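The commands above map naturally onto a small parser. The following is a minimal sketch of how such input could be interpreted; the `Command` type and the `parseCommand` function are hypothetical names chosen for illustration and are not taken from the project's source.

```typescript
// Hypothetical command shapes; the project's actual types are not shown in this README.
type Command =
  | { kind: "select"; index: number }
  | { kind: "open"; index?: number }
  | { kind: "find"; term: string }
  | { kind: "back" }
  | { kind: "quit" }
  | { kind: "unknown"; input: string };

function parseCommand(raw: string): Command {
  const input = raw.trim();
  if (input === "q" || input === "quit") return { kind: "quit" };
  if (input === "b" || input === "back") return { kind: "back" };

  // "/<term>" searches within the extracted content.
  if (input.startsWith("/") && input.length > 1) {
    return { kind: "find", term: input.slice(1) };
  }

  // "open" targets the current page; "open <n>" targets result #n.
  const open = /^open(?:\s+(\d+))?$/.exec(input);
  if (open) {
    return open[1] === undefined
      ? { kind: "open" }
      : { kind: "open", index: Number(open[1]) };
  }

  // A bare number from 1 to 10 selects a search result.
  if (/^\d+$/.test(input)) {
    const index = Number(input);
    if (index >= 1 && index <= 10) return { kind: "select", index };
  }

  return { kind: "unknown", input };
}
```

A caller would still need to decide which commands are valid in which view (for example, `/<term>` only applies while viewing content); this sketch leaves that check to the surrounding loop.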
Configuration
Environment Variables
Create a .env file in the project root:
Configuration File
Create ~/.llmresearcherrc in your home directory:
Configuration Options
| Option | Default | Description |
|---|---|---|
| userAgent | Mozilla/5.0 (compatible; LLMResearcher/1.0) | User agent for HTTP requests |
| timeout | 30000 | Request timeout in milliseconds |
| maxRetries | 3 | Maximum retry attempts for failed requests |
| rateLimitDelay | 1000 | Delay between requests in milliseconds |
| cacheEnabled | true | Enable/disable local caching |
| maxResults | 10 | Maximum search results to display |
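As an illustration of how these options and their defaults fit together, here is a minimal sketch that merges user-supplied overrides onto the defaults from the table. The names `ResearcherConfig`, `DEFAULT_CONFIG`, and `resolveConfig` are assumptions made for this example; the actual loading logic in src/config.ts may differ.

```typescript
// Option names and default values come from the table above.
// The type and function names are hypothetical.
type ResearcherConfig = {
  userAgent: string;
  timeout: number;
  maxRetries: number;
  rateLimitDelay: number;
  cacheEnabled: boolean;
  maxResults: number;
};

const DEFAULT_CONFIG: ResearcherConfig = {
  userAgent: "Mozilla/5.0 (compatible; LLMResearcher/1.0)",
  timeout: 30000,
  maxRetries: 3,
  rateLimitDelay: 1000,
  cacheEnabled: true,
  maxResults: 10,
};

// Fall back to the default for every option the caller leaves unset.
// `??` keeps explicit falsy values such as cacheEnabled: false.
function resolveConfig(overrides: Partial<ResearcherConfig> = {}): ResearcherConfig {
  return {
    userAgent: overrides.userAgent ?? DEFAULT_CONFIG.userAgent,
    timeout: overrides.timeout ?? DEFAULT_CONFIG.timeout,
    maxRetries: overrides.maxRetries ?? DEFAULT_CONFIG.maxRetries,
    rateLimitDelay: overrides.rateLimitDelay ?? DEFAULT_CONFIG.rateLimitDelay,
    cacheEnabled: overrides.cacheEnabled ?? DEFAULT_CONFIG.cacheEnabled,
    maxResults: overrides.maxResults ?? DEFAULT_CONFIG.maxResults,
  };
}
```

For example, `resolveConfig({ timeout: 5000 })` would keep every default except the timeout.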
Architecture
Core Components
- MCPResearchServer (src/mcp-server.ts)
  - Model Context Protocol server implementation
  - Three main tools: github_code_search, duckduckgo_web_search, extract_content
  - JSON-based responses for LLM consumption
- DuckDuckGoSearcher (src/search.ts)
  - HTML scraping of DuckDuckGo search results with locale support
  - URL decoding for /l/?uddg= format links
  - Rate limiting and retry logic
- GitHubCodeSearcher (src/github-code-search.ts)
  - GitHub Code Search API integration via gh CLI
  - Advanced query support with language, repo, and file filters
  - Authentication and rate limiting
- ContentExtractor (src/extractor.ts)
  - Playwright-based page rendering with resource blocking
  - @mozilla/readability for main content extraction
  - DOMPurify sanitization and Markdown conversion
- CLIInterface (src/cli.ts)
  - Interactive command-line interface
  - Search result navigation
  - Content viewing and text search
- Configuration (src/config.ts)
  - Environment and RC file configuration loading
  - Verbose logging support
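The /l/?uddg= link decoding listed under DuckDuckGoSearcher refers to DuckDuckGo's redirect links, which carry the real destination in a percent-encoded `uddg` query parameter. The sketch below shows one way to recover the target URL using the standard `URL` API; the function name `decodeDuckDuckGoUrl` is hypothetical and the project's own implementation may handle more cases.

```typescript
// Hypothetical helper: unwrap a DuckDuckGo redirect link into its destination URL.
function decodeDuckDuckGoUrl(href: string): string {
  let parsed: URL;
  try {
    // Result links can be protocol-relative ("//duckduckgo.com/l/?...") or
    // path-only ("/l/?..."), so resolve them against a base URL.
    parsed = new URL(href, "https://duckduckgo.com");
  } catch {
    return href; // Not parseable as a URL: leave it untouched.
  }

  const isDuckDuckGo =
    parsed.hostname === "duckduckgo.com" || parsed.hostname.endsWith(".duckduckgo.com");
  if (!isDuckDuckGo || parsed.pathname !== "/l/") return href;

  // searchParams.get() already percent-decodes the value.
  return parsed.searchParams.get("uddg") ?? href;
}
```

Links that are not DuckDuckGo redirects are returned unchanged, so the helper can be applied to every result link without checking its shape first.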
Content Processing Pipeline
MCP Server Mode
- Search:
  - DuckDuckGo: HTML endpoint → Parse results → JSON response with pagination
  - GitHub: Code Search API → Format results → JSON response with code snippets
- Extract: URL from search results → Playwright navigation → Content extraction
- Process: @mozilla/readability → DOMPurify sanitization → Clean JSON output
- Output: Structured JSON for LLM consumption
CLI Mode
- Search: DuckDuckGo HTML endpoint → Parse results → Display numbered list
- Extract: Playwright navigation → Resource blocking → JS rendering
- Process: @mozilla/readability → DOMPurify sanitization → Turndown Markdown
- Output: Clean Markdown with h1-h3, bold, italic, links only
Security Features
- Resource Blocking: Prevents loading of images, CSS, fonts for speed and security
- Content Sanitization: DOMPurify removes scripts, iframes, and dangerous elements
- Limited Markdown: Only allows safe formatting elements (h1-h3, strong, em, a)
- Rate Limiting: Respects DuckDuckGo's rate limits with exponential backoff
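To make the request spacing concrete, here is a minimal sketch of a limiter that enforces a minimum interval between requests, using the 1000 ms `rateLimitDelay` default from the configuration table. The `RateLimiter` class and its method names are assumptions for illustration, not the project's actual code.

```typescript
// Hypothetical limiter: spaces requests at least `minIntervalMs` apart.
class RateLimiter {
  private nextAllowed = 0;
  private readonly minIntervalMs: number;

  constructor(minIntervalMs: number = 1000) {
    this.minIntervalMs = minIntervalMs;
  }

  // Reserve the next free slot and return how long the caller must wait for it.
  // `now` is injectable so the scheduling logic can be checked without real timers.
  reserve(now: number = Date.now()): number {
    const start = Math.max(now, this.nextAllowed);
    this.nextAllowed = start + this.minIntervalMs;
    return start - now;
  }

  // Sleep until the reserved slot arrives.
  async wait(): Promise<void> {
    const delay = this.reserve();
    if (delay > 0) {
      await new Promise<void>((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

A caller would `await limiter.wait()` before each search request. Because each call reserves its own slot, concurrent callers queue up one interval apart instead of all firing once the first delay ends.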
Examples
MCP Server Usage with Claude Code
1. GitHub Code Search
2. Web Search with Locale
3. Content Extraction
CLI Examples
Basic Search
Direct URL Mode
Verbose Mode
Testing
Running Tests
Test Coverage
The test suite includes:
- Unit Tests: Individual component testing
  - search.test.ts: DuckDuckGo search functionality, URL decoding, rate limiting
  - extractor.test.ts: Content extraction, Markdown conversion, resource management
  - config.test.ts: Configuration validation and environment handling
- Integration Tests: End-to-end workflow testing
  - integration.test.ts: Complete search-to-extraction workflows, error handling, cleanup
Test Features
- Fast: Powered by vitest for quick feedback
- Type-safe: Full TypeScript support in tests
- Isolated: Each test cleans up its resources
- Comprehensive: Covers search, extraction, configuration, and integration scenarios
Troubleshooting
Common Issues
"Browser not found" Error
Rate Limiting Issues
- The tool automatically handles rate limiting with 1-second delays
- If you encounter 429 errors, the tool will automatically retry with exponential backoff
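The retry behavior described above can be sketched as follows. This is an illustrative example, not the tool's actual implementation: `HttpError`, `backoffDelay`, and `withRetry` are hypothetical names, the doubling schedule and the 30-second cap are assumptions, and the defaults of 3 retries and a 1000 ms base delay are borrowed from the `maxRetries` and `rateLimitDelay` options.

```typescript
// Hypothetical error type carrying an HTTP status code.
class HttpError extends Error {
  status: number;

  constructor(status: number) {
    super(`HTTP ${status}`);
    this.status = status;
  }
}

// Delay doubles with each attempt (1s, 2s, 4s, ...) up to an assumed cap.
function backoffDelay(attempt: number, baseMs = 1000, capMs = 30000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Retry `task` on HTTP 429 only; any other error is rethrown immediately.
async function withRetry<T>(
  task: () => Promise<T>,
  maxRetries = 3,
  baseMs = 1000,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await task();
    } catch (error) {
      const retryable = error instanceof HttpError && error.status === 429;
      if (!retryable || attempt >= maxRetries) throw error;
      const delay = backoffDelay(attempt, baseMs);
      await new Promise<void>((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

With these defaults, a request that keeps receiving 429 responses is attempted four times in total, waiting 1, 2, and 4 seconds between attempts before the error is surfaced.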
Content Extraction Failures
- Some sites may block automated access
- The tool includes fallback extraction methods (main → body content)
- Use verbose mode (-v) to see detailed error information
Permission Denied (Unix/Linux)
Performance Optimization
The tool is optimized for speed:
- Resource Blocking: Automatically blocks images, CSS, fonts
- Network Idle: Waits for JavaScript to complete rendering
- Content Caching: Supports local caching to avoid repeated requests
- Minimal Dependencies: Uses lightweight, focused libraries
Development
Project Structure
Dependencies
Runtime Dependencies
- @modelcontextprotocol/sdk: Model Context Protocol server implementation
- @mozilla/readability: Content extraction from HTML
- cheerio: HTML parsing for search results
- commander: CLI argument parsing
- dompurify: HTML sanitization
- dotenv: Environment variable loading
- jsdom: DOM manipulation for server-side processing
- playwright: Browser automation for JS rendering
- turndown: HTML to Markdown conversion
Development Dependencies
- typescript: TypeScript compiler
- tsup: Fast TypeScript bundler
- vitest: Fast unit test framework
- @types/*: TypeScript type definitions
License
MIT License - see LICENSE file for details.
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
Roadmap
Planned Features
- Enhanced MCP Tools: Additional specialized search tools for documentation, APIs, etc.
- Caching Layer: SQLite-based URL → Markdown caching with 24-hour TTL
- Search Engine Abstraction: Support for Brave Search, Bing, and other engines
- Content Summarization: Optional AI-powered content summarization
- Export Formats: JSON, plain text, and other output formats
- Batch Processing: Process multiple URLs from file input
- SSE Transport: Support for Server-Sent Events MCP transport
Performance Improvements
- Parallel Processing: Concurrent content extraction for multiple results
- Smart Caching: Intelligent cache invalidation based on content freshness
- Memory Optimization: Streaming content processing for large documents