# LLM Researcher

A lightweight MCP (Model Context Protocol) server for LLM orchestration that provides efficient web content search and extraction capabilities. This CLI tool enables LLMs to search DuckDuckGo and extract clean, LLM-friendly content from web pages.

Built with **TypeScript**, **tsup**, and **vitest** for a modern development experience.

## Features

- **MCP Server Support**: Provides a Model Context Protocol server for LLM integration
- **Free Operation**: Uses the DuckDuckGo HTML endpoint (no API costs)
- **GitHub Code Search**: Search GitHub repositories for code examples and implementation patterns
- **Smart Content Extraction**: Playwright + @mozilla/readability for clean content
- **LLM-Optimized Output**: Sanitized Markdown (h1-h3, bold, italic, links only)
- **Rate Limited**: Respects DuckDuckGo with a 1 req/sec limit
- **Cross-Platform**: Works on macOS, Linux, and WSL
- **Multiple Modes**: CLI, MCP server, search, direct URL, and interactive modes
- **Type Safe**: Full TypeScript implementation with strict typing
- **Modern Tooling**: Built with the tsup bundler and vitest testing

## Installation

### Prerequisites

- Node.js 20.0.0 or higher
- No local Chrome installation required (uses Playwright's bundled Chromium)

### Setup

```bash
# Clone or download the project
cd light-research-mcp

# Install dependencies (using pnpm)
pnpm install

# Build the project
pnpm build

# Install Playwright browsers
pnpm install-browsers

# Optional: Link globally for system-wide access
pnpm link --global
```

## Usage

### MCP Server Mode

Use as a Model Context Protocol server to provide search and content extraction tools to LLMs:

```bash
# Start MCP server (stdio transport)
llmresearcher --mcp

# The server provides these tools to MCP clients:
# - github_code_search: Search GitHub repositories for code
# - duckduckgo_web_search: Search the web with DuckDuckGo
# - extract_content: Extract detailed content from URLs
```

#### Setting up with Claude Code

```bash
# Add as an MCP server to Claude Code
claude mcp add light-research-mcp /path/to/light-research-mcp/dist/bin/llmresearcher.js --mcp

# Or with project scope for team sharing
claude mcp add light-research-mcp -s project /path/to/light-research-mcp/dist/bin/llmresearcher.js --mcp

# List configured servers
claude mcp list

# Check server status
claude mcp get light-research-mcp
```

#### MCP Tool Usage Examples

Once configured, you can use these tools in Claude:

```
> Search for React hooks examples on GitHub
Tool: github_code_search
Query: "useState useEffect hooks language:javascript"

> Search for TypeScript best practices
Tool: duckduckgo_web_search
Query: "TypeScript best practices 2024"
Locale: us-en (or wt-wt for no region)

> Extract content from a search result
Tool: extract_content
URL: https://example.com/article-from-search-results
```

### Command Line Interface

```bash
# Search mode - Search DuckDuckGo and interactively browse results
llmresearcher "machine learning transformers"

# GitHub Code Search mode - Search GitHub for code
llmresearcher -g "useState hooks language:typescript"

# Direct URL mode - Extract content from specific URL
llmresearcher -u https://example.com/article

# Interactive mode - Enter interactive search session
llmresearcher

# Verbose logging - See detailed operation logs
llmresearcher -v "search query"

# MCP Server mode - Start as Model Context Protocol server
llmresearcher --mcp
```

## Development

### Scripts

```bash
# Build the project
pnpm build

# Build in watch mode (for development)
pnpm dev

# Run tests
pnpm test

# Run tests in CI mode (single run)
pnpm test:run

# Type checking
pnpm type-check

# Clean build artifacts
pnpm clean

# Install Playwright browsers
pnpm install-browsers
```

### Interactive Commands

When in search results view:

- **1-10**: Select a result by number
- **b** or **back**: Return to search results
- **open \<n>**: Open result #n in external browser
- **q** or **quit**: Exit the program

When viewing content:

- **b** or **back**: Return to search results
- **/\<term>**: Search for term within the extracted content
- **open**: Open current page in external browser
- **q** or **quit**: Exit the program

## Configuration

### Environment Variables

Create a `.env` file in the project root:

```env
USER_AGENT=Mozilla/5.0 (compatible; LLMResearcher/1.0)
TIMEOUT=30000
MAX_RETRIES=3
RATE_LIMIT_DELAY=1000
CACHE_ENABLED=true
MAX_RESULTS=10
```

### Configuration File

Create `~/.llmresearcherrc` in your home directory:

```json
{
  "userAgent": "Mozilla/5.0 (compatible; LLMResearcher/1.0)",
  "timeout": 30000,
  "maxRetries": 3,
  "rateLimitDelay": 1000,
  "cacheEnabled": true,
  "maxResults": 10
}
```

### Configuration Options

| Option | Default | Description |
|--------|---------|-------------|
| `userAgent` | `Mozilla/5.0 (compatible; LLMResearcher/1.0)` | User agent for HTTP requests |
| `timeout` | `30000` | Request timeout in milliseconds |
| `maxRetries` | `3` | Maximum retry attempts for failed requests |
| `rateLimitDelay` | `1000` | Delay between requests in milliseconds |
| `cacheEnabled` | `true` | Enable/disable local caching |
| `maxResults` | `10` | Maximum search results to display |

## Architecture

### Core Components

1. **MCPResearchServer** (`src/mcp-server.ts`)
   - Model Context Protocol server implementation
   - Three main tools: github_code_search, duckduckgo_web_search, extract_content
   - JSON-based responses for LLM consumption
2. **DuckDuckGoSearcher** (`src/search.ts`)
   - HTML scraping of DuckDuckGo search results with locale support
   - URL decoding for `/l/?uddg=` format links
   - Rate limiting and retry logic
3. **GitHubCodeSearcher** (`src/github-code-search.ts`)
   - GitHub Code Search API integration via gh CLI
   - Advanced query support with language, repo, and file filters
   - Authentication and rate limiting
4. **ContentExtractor** (`src/extractor.ts`)
   - Playwright-based page rendering with resource blocking
   - @mozilla/readability for main content extraction
   - DOMPurify sanitization and Markdown conversion
5. **CLIInterface** (`src/cli.ts`)
   - Interactive command-line interface
   - Search result navigation
   - Content viewing and text search
6. **Configuration** (`src/config.ts`)
   - Environment and RC file configuration loading
   - Verbose logging support

### Content Processing Pipeline

#### MCP Server Mode

1. **Search**:
   - DuckDuckGo: HTML endpoint → Parse results → JSON response with pagination
   - GitHub: Code Search API → Format results → JSON response with code snippets
2. **Extract**: URL from search results → Playwright navigation → Content extraction
3. **Process**: @mozilla/readability → DOMPurify sanitization → Clean JSON output
4. **Output**: Structured JSON for LLM consumption

#### CLI Mode

1. **Search**: DuckDuckGo HTML endpoint → Parse results → Display numbered list
2. **Extract**: Playwright navigation → Resource blocking → JS rendering
3. **Process**: @mozilla/readability → DOMPurify sanitization → Turndown Markdown
4. **Output**: Clean Markdown with h1-h3, **bold**, *italic*, [links](url) only

### Security Features

- **Resource Blocking**: Prevents loading of images, CSS, fonts for speed and security
- **Content Sanitization**: DOMPurify removes scripts, iframes, and dangerous elements
- **Limited Markdown**: Only allows safe formatting elements (h1-h3, strong, em, a)
- **Rate Limiting**: Respects DuckDuckGo's rate limits with exponential backoff

## Examples

### MCP Server Usage with Claude Code

#### 1. GitHub Code Search

```
You: "Find React hook examples for state management"

Claude uses github_code_search tool:
{
  "query": "useState useReducer state management language:javascript",
  "results": [
    {
      "title": "facebook/react/packages/react/src/ReactHooks.js",
      "url": "https://raw.githubusercontent.com/facebook/react/main/packages/react/src/ReactHooks.js",
      "snippet": "function useState(initialState) {\n return dispatcher.useState(initialState);\n}"
    }
  ],
  "pagination": {
    "currentPage": 1,
    "hasNextPage": true,
    "nextPageToken": "2"
  }
}
```

#### 2. Web Search with Locale

```
You: "Search for Vue.js tutorials in Japanese"

Claude uses duckduckgo_web_search tool:
{
  "query": "Vue.js チュートリアル入門",
  "locale": "jp-jp",
  "results": [
    {
      "title": "Vue.js入門ガイド",
      "url": "https://example.com/vue-tutorial",
      "snippet": "Vue.jsの基本的な使い方を学ぶチュートリアル..."
    }
  ]
}
```

#### 3. Content Extraction

```
You: "Extract the full content from that Vue.js tutorial"

Claude uses extract_content tool:
{
  "url": "https://example.com/vue-tutorial",
  "title": "Vue.js入門ガイド",
  "extractedAt": "2024-01-15T10:30:00.000Z",
  "content": "# Vue.js入門ガイド\n\nVue.jsは...\n\n## インストール\n\n..."
}
```

### CLI Examples

#### Basic Search

```bash
$ llmresearcher "python web scraping"

🔍 Search Results:
══════════════════════════════════════════════════

1. Python Web Scraping Tutorial
   URL: https://realpython.com/python-web-scraping-practical-introduction/
   Complete guide to web scraping with Python using requests and Beautiful Soup...

2. Web Scraping with Python - BeautifulSoup and requests
   URL: https://www.dataquest.io/blog/web-scraping-python-tutorial/
   Learn how to scrape websites with Python, Beautiful Soup, and requests...
══════════════════════════════════════════════════
Commands: [1-10] select result | b) back | q) quit | open <n>) open in browser

> 1

📥 Extracting content from: Python Web Scraping Tutorial

📄 Content:
══════════════════════════════════════════════════

**Python Web Scraping Tutorial**
Source: https://realpython.com/python-web-scraping-practical-introduction/
Extracted: 2024-01-15T10:30:00.000Z

──────────────────────────────────────────────────

# Python Web Scraping: A Practical Introduction

Web scraping is the process of collecting and parsing raw data from the web...

## What Is Web Scraping?

Web scraping is a technique to automatically access and extract large amounts...

══════════════════════════════════════════════════
Commands: b) back to results | /<term>) search in text | q) quit | open) open in browser

> /beautiful soup

🔍 Found 3 matches for "beautiful soup":
──────────────────────────────────────────────────
Line 15: Beautiful Soup is a Python library for parsing HTML and XML documents.
Line 42: from bs4 import BeautifulSoup
Line 67: soup = BeautifulSoup(html_content, 'html.parser')
```

### Direct URL Mode

```bash
$ llmresearcher -u https://docs.python.org/3/tutorial/

📄 Content:
══════════════════════════════════════════════════

**The Python Tutorial**
Source: https://docs.python.org/3/tutorial/
Extracted: 2024-01-15T10:35:00.000Z

──────────────────────────────────────────────────

# The Python Tutorial

Python is an easy to learn, powerful programming language...

## An Informal Introduction to Python

In the following examples, input and output are distinguished...
```

### Verbose Mode

```bash
$ llmresearcher -v "nodejs tutorial"

[VERBOSE] Searching: https://duckduckgo.com/html/?q=nodejs%20tutorial&kl=us-en
[VERBOSE] Response: 200 in 847ms
[VERBOSE] Parsed 10 results
[VERBOSE] Launching browser...
[VERBOSE] Blocking resource: https://example.com/style.css
[VERBOSE] Blocking resource: https://example.com/image.png
[VERBOSE] Navigating to page...
[VERBOSE] Page loaded in 1243ms
[VERBOSE] Processing content with Readability...
[VERBOSE] Readability extraction successful
[VERBOSE] Closing browser...
```

## Testing

### Running Tests

```bash
# Run tests in watch mode
pnpm test

# Run tests once (CI mode)
pnpm test:run

# Run tests with coverage
pnpm test -- --coverage
```

### Test Coverage

The test suite includes:

- **Unit Tests**: Individual component testing
  - `search.test.ts`: DuckDuckGo search functionality, URL decoding, rate limiting
  - `extractor.test.ts`: Content extraction, Markdown conversion, resource management
  - `config.test.ts`: Configuration validation and environment handling
- **Integration Tests**: End-to-end workflow testing
  - `integration.test.ts`: Complete search-to-extraction workflows, error handling, cleanup

### Test Features

- **Fast**: Powered by vitest for quick feedback
- **Type-safe**: Full TypeScript support in tests
- **Isolated**: Each test cleans up its resources
- **Comprehensive**: Covers search, extraction, configuration, and integration scenarios

## Troubleshooting

### Common Issues

**"Browser not found" Error**

```bash
pnpm install-browsers
```

**Rate Limiting Issues**

- The tool automatically handles rate limiting with 1-second delays
- If you encounter 429 errors, the tool will automatically retry with exponential backoff

**Content Extraction Failures**

- Some sites may block automated access
- The tool includes fallback extraction methods (main → body content)
- Use verbose mode (`-v`) to see detailed error information

**Permission Denied (Unix/Linux)**

```bash
chmod +x dist/bin/llmresearcher.js
```

### Performance Optimization

The tool is optimized for speed:

- **Resource Blocking**: Automatically blocks images, CSS, fonts
- **Network Idle**: Waits for JavaScript to complete rendering
- **Content Caching**: Supports local caching to avoid repeated requests
- **Minimal Dependencies**: Uses lightweight, focused libraries

## Development

### Project Structure

```
light-research-mcp/
├── dist/                             # Built JavaScript files (generated)
│   ├── bin/
│   │   └── llmresearcher.js          # CLI entry point (executable)
│   └── *.js                          # Compiled TypeScript modules
├── src/                              # TypeScript source files
│   ├── bin.ts                        # CLI entry point
│   ├── index.ts                      # Main LLMResearcher class
│   ├── mcp-server.ts                 # MCP server implementation
│   ├── search.ts                     # DuckDuckGo search implementation
│   ├── github-code-search.ts         # GitHub Code Search implementation
│   ├── extractor.ts                  # Content extraction with Playwright
│   ├── cli.ts                        # Interactive CLI interface
│   ├── config.ts                     # Configuration management
│   └── types.ts                      # TypeScript type definitions
├── test/                             # Test files (vitest)
│   ├── search.test.ts                # Search functionality tests
│   ├── extractor.test.ts             # Content extraction tests
│   ├── config.test.ts                # Configuration tests
│   ├── mcp-locale.test.ts            # MCP locale functionality tests
│   ├── mcp-content-extractor.test.ts # MCP content extractor tests
│   └── integration.test.ts           # End-to-end integration tests
├── tsconfig.json                     # TypeScript configuration
├── tsup.config.ts                    # Build configuration
├── vitest.config.ts                  # Test configuration
├── package.json
└── README.md
```

### Dependencies

#### Runtime Dependencies

- **@modelcontextprotocol/sdk**: Model Context Protocol server implementation
- **@mozilla/readability**: Content extraction from HTML
- **cheerio**: HTML parsing for search results
- **commander**: CLI argument parsing
- **dompurify**: HTML sanitization
- **dotenv**: Environment variable loading
- **jsdom**: DOM manipulation for server-side processing
- **playwright**: Browser automation for JS rendering
- **turndown**: HTML to Markdown conversion

#### Development Dependencies

- **typescript**: TypeScript compiler
- **tsup**: Fast TypeScript bundler
- **vitest**: Fast unit test framework
- **@types/***: TypeScript type definitions

## License

MIT License - see LICENSE file for details.

## Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request

## Roadmap

### Planned Features

- **Enhanced MCP Tools**: Additional specialized search tools for documentation, APIs, etc.
- **Caching Layer**: SQLite-based URL → Markdown caching with 24-hour TTL
- **Search Engine Abstraction**: Support for Brave Search, Bing, and other engines
- **Content Summarization**: Optional AI-powered content summarization
- **Export Formats**: JSON, plain text, and other output formats
- **Batch Processing**: Process multiple URLs from file input
- **SSE Transport**: Support for Server-Sent Events MCP transport

### Performance Improvements

- **Parallel Processing**: Concurrent content extraction for multiple results
- **Smart Caching**: Intelligent cache invalidation based on content freshness
- **Memory Optimization**: Streaming content processing for large documents
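
## Sketch: Decoding DuckDuckGo Redirect Links

The Core Components section lists "URL decoding for `/l/?uddg=` format links" as part of `DuckDuckGoSearcher`. The snippet below is a minimal sketch of what such decoding could look like, not the code in `src/search.ts`. The function name `decodeDuckDuckGoUrl` and the fallback behavior (returning the input unchanged) are assumptions made for illustration.

```typescript
// Hypothetical helper: the name and behavior are illustrative and are not
// taken from src/search.ts.
function decodeDuckDuckGoUrl(href: string): string {
  // Result links may be protocol-relative ("//duckduckgo.com/l/?uddg=...")
  // or path-only ("/l/?uddg=..."), so resolve them against a base URL first.
  let parsed: URL;
  try {
    parsed = new URL(href, "https://duckduckgo.com");
  } catch {
    return href; // not parseable: return the input unchanged
  }

  const isDuckDuckGoHost =
    parsed.hostname === "duckduckgo.com" ||
    parsed.hostname.endsWith(".duckduckgo.com");

  if (isDuckDuckGoHost && parsed.pathname === "/l/") {
    // searchParams.get() already percent-decodes the parameter value.
    const target = parsed.searchParams.get("uddg");
    if (target) {
      return target;
    }
  }

  // Not a redirect link: keep the original href.
  return href;
}
```

Resolving against a base URL lets one code path handle absolute, protocol-relative, and path-only links, and returning the input on failure keeps a single malformed link from aborting result parsing.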

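
## Sketch: Retry with Exponential Backoff

The Troubleshooting section states that 429 responses are retried with exponential backoff, and the configuration table exposes `maxRetries` (default `3`) and `rateLimitDelay` (default `1000` ms). The sketch below shows one way those two settings could drive a retry loop. The doubling factor, the function names `backoffDelay` and `withRetry`, and the choice to retry on every rejection rather than only on 429 are assumptions, not details confirmed by this README.

```typescript
// Hypothetical helpers: names and the doubling factor are assumptions.

// Delay before retry number `attempt` (0-based): base, 2x base, 4x base, ...
function backoffDelay(attempt: number, baseDelayMs: number): number {
  return baseDelayMs * 2 ** attempt;
}

// Runs `operation`, retrying up to `maxRetries` times after a failure.
// This simplified version retries on any rejection, not only HTTP 429.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 1000,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (error) {
      // Out of retries: surface the last error to the caller.
      if (attempt >= maxRetries) {
        throw error;
      }
      const delayMs = backoffDelay(attempt, baseDelayMs);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

With the default values, the waits before the three retries would be 1000 ms, 2000 ms, and 4000 ms under these assumptions.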