# LLM Researcher
A lightweight MCP (Model Context Protocol) server for LLM orchestration that provides efficient web content search and extraction capabilities. This CLI tool enables LLMs to search DuckDuckGo and extract clean, LLM-friendly content from web pages.
Built with **TypeScript**, **tsup**, and **vitest** for a modern development experience.
## Features
- **MCP Server Support**: Provides a Model Context Protocol server for LLM integration
- **Free Operation**: Uses DuckDuckGo HTML endpoint (no API costs)
- **GitHub Code Search**: Search GitHub repositories for code examples and implementation patterns
- **Smart Content Extraction**: Playwright + @mozilla/readability for clean content
- **LLM-Optimized Output**: Sanitized Markdown (h1-h3, bold, italic, links only)
- **Rate Limited**: Limits requests to DuckDuckGo to 1 per second
- **Cross-Platform**: Works on macOS, Linux, and WSL
- **Multiple Modes**: CLI, MCP server, search, direct URL, and interactive modes
- **Type Safe**: Full TypeScript implementation with strict typing
- **Modern Tooling**: Built with tsup bundler and vitest testing
## Installation
### Prerequisites
- Node.js 20.0.0 or higher
- No local Chrome installation required (uses Playwright's bundled Chromium)
### Setup
```bash
# Clone or download the project
cd light-research-mcp
# Install dependencies (using pnpm)
pnpm install
# Build the project
pnpm build
# Install Playwright browsers
pnpm install-browsers
# Optional: Link globally for system-wide access
pnpm link --global
```
## Usage
### MCP Server Mode
Use as a Model Context Protocol server to provide search and content extraction tools to LLMs:
```bash
# Start MCP server (stdio transport)
llmresearcher --mcp
# The server provides these tools to MCP clients:
# - github_code_search: Search GitHub repositories for code
# - duckduckgo_web_search: Search the web with DuckDuckGo
# - extract_content: Extract detailed content from URLs
```
#### Setting up with Claude Code
```bash
# Add as an MCP server to Claude Code
claude mcp add light-research-mcp /path/to/light-research-mcp/dist/bin/llmresearcher.js --mcp
# Or with project scope for team sharing
claude mcp add light-research-mcp -s project /path/to/light-research-mcp/dist/bin/llmresearcher.js --mcp
# List configured servers
claude mcp list
# Check server status
claude mcp get light-research-mcp
```
#### MCP Tool Usage Examples
Once configured, you can use these tools in Claude:
```
> Search for React hooks examples on GitHub
Tool: github_code_search
Query: "useState useEffect hooks language:javascript"
> Search for TypeScript best practices
Tool: duckduckgo_web_search
Query: "TypeScript best practices 2024"
Locale: us-en (or wt-wt for no region)
> Extract content from a search result
Tool: extract_content
URL: https://example.com/article-from-search-results
```
### Command Line Interface
```bash
# Search mode - Search DuckDuckGo and interactively browse results
llmresearcher "machine learning transformers"
# GitHub Code Search mode - Search GitHub for code
llmresearcher -g "useState hooks language:typescript"
# Direct URL mode - Extract content from specific URL
llmresearcher -u https://example.com/article
# Interactive mode - Enter interactive search session
llmresearcher
# Verbose logging - See detailed operation logs
llmresearcher -v "search query"
# MCP Server mode - Start as Model Context Protocol server
llmresearcher --mcp
```
## Development
### Scripts
```bash
# Build the project
pnpm build
# Build in watch mode (for development)
pnpm dev
# Run tests
pnpm test
# Run tests in CI mode (single run)
pnpm test:run
# Type checking
pnpm type-check
# Clean build artifacts
pnpm clean
# Install Playwright browsers
pnpm install-browsers
```
### Interactive Commands
When in search results view:
- **1-10**: Select a result by number
- **b** or **back**: Return to search results
- **open \<n>**: Open result #n in external browser
- **q** or **quit**: Exit the program
When viewing content:
- **b** or **back**: Return to search results
- **/\<term>**: Search for term within the extracted content
- **open**: Open current page in external browser
- **q** or **quit**: Exit the program
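The `/<term>` command can be pictured as a line-by-line search over the extracted Markdown. The sketch below is a minimal illustration, not the project's implementation: `findInContent` and `LineMatch` are hypothetical names, and case-insensitive substring matching is an assumption.

```typescript
// Hypothetical helper; names and matching rules are assumptions for illustration.
interface LineMatch {
  lineNumber: number; // 1-based, like the "Line 15: ..." entries in the CLI output
  text: string;
}

function findInContent(content: string, term: string): LineMatch[] {
  const needle = term.trim().toLowerCase();
  if (needle === "") return []; // an empty term matches nothing
  return content
    .split("\n")
    .map((text, index) => ({ lineNumber: index + 1, text }))
    .filter(({ text }) => text.toLowerCase().includes(needle));
}

const sampleContent = [
  "# Python Web Scraping",
  "Beautiful Soup is a Python library for parsing HTML and XML documents.",
  "Requests fetches the page before parsing.",
].join("\n");

for (const match of findInContent(sampleContent, "beautiful soup")) {
  console.log(`Line ${match.lineNumber}: ${match.text}`);
}
```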
## Configuration
### Environment Variables
Create a `.env` file in the project root:
```env
USER_AGENT=Mozilla/5.0 (compatible; LLMResearcher/1.0)
TIMEOUT=30000
MAX_RETRIES=3
RATE_LIMIT_DELAY=1000
CACHE_ENABLED=true
MAX_RESULTS=10
```
### Configuration File
Create `~/.llmresearcherrc` in your home directory:
```json
{
  "userAgent": "Mozilla/5.0 (compatible; LLMResearcher/1.0)",
  "timeout": 30000,
  "maxRetries": 3,
  "rateLimitDelay": 1000,
  "cacheEnabled": true,
  "maxResults": 10
}
```
### Configuration Options
| Option | Default | Description |
|--------|---------|-------------|
| `userAgent` | `Mozilla/5.0 (compatible; LLMResearcher/1.0)` | User agent for HTTP requests |
| `timeout` | `30000` | Request timeout in milliseconds |
| `maxRetries` | `3` | Maximum retry attempts for failed requests |
| `rateLimitDelay` | `1000` | Delay between requests in milliseconds |
| `cacheEnabled` | `true` | Enable/disable local caching |
| `maxResults` | `10` | Maximum search results to display |
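The two configuration sources and the defaults in the table could be combined roughly as follows. This is a sketch under stated assumptions: the README does not say which source wins, so the precedence used here (defaults, then RC file, then environment variables) is a guess, and `resolveConfig`, `envOverrides`, and `envNumber` are hypothetical names.

```typescript
interface ResearcherConfig {
  userAgent: string;
  timeout: number;
  maxRetries: number;
  rateLimitDelay: number;
  cacheEnabled: boolean;
  maxResults: number;
}

// Defaults copied from the options table.
const DEFAULTS: ResearcherConfig = {
  userAgent: "Mozilla/5.0 (compatible; LLMResearcher/1.0)",
  timeout: 30000,
  maxRetries: 3,
  rateLimitDelay: 1000,
  cacheEnabled: true,
  maxResults: 10,
};

type Env = Record<string, string | undefined>;

// Parse a numeric variable; missing or malformed values are ignored.
function envNumber(env: Env, key: string): number | undefined {
  const raw = env[key];
  if (raw === undefined || raw.trim() === "") return undefined;
  const parsed = Number(raw);
  return Number.isFinite(parsed) ? parsed : undefined;
}

// Collect only the variables that are actually set.
function envOverrides(env: Env): Partial<ResearcherConfig> {
  const overrides: Partial<ResearcherConfig> = {};
  const userAgent = env["USER_AGENT"];
  if (userAgent !== undefined) overrides.userAgent = userAgent;
  const timeout = envNumber(env, "TIMEOUT");
  if (timeout !== undefined) overrides.timeout = timeout;
  const maxRetries = envNumber(env, "MAX_RETRIES");
  if (maxRetries !== undefined) overrides.maxRetries = maxRetries;
  const rateLimitDelay = envNumber(env, "RATE_LIMIT_DELAY");
  if (rateLimitDelay !== undefined) overrides.rateLimitDelay = rateLimitDelay;
  const cacheEnabled = env["CACHE_ENABLED"];
  if (cacheEnabled !== undefined) overrides.cacheEnabled = cacheEnabled === "true";
  const maxResults = envNumber(env, "MAX_RESULTS");
  if (maxResults !== undefined) overrides.maxResults = maxResults;
  return overrides;
}

// Assumed precedence: defaults < RC file < environment.
function resolveConfig(rcFile: Partial<ResearcherConfig>, env: Env): ResearcherConfig {
  return { ...DEFAULTS, ...rcFile, ...envOverrides(env) };
}

const resolved = resolveConfig({ timeout: 15000, maxResults: 5 }, { TIMEOUT: "5000" });
console.log(resolved.timeout, resolved.maxResults, resolved.maxRetries);
```

In a real process the second argument would be `process.env`; it is passed in explicitly here so the merge logic stays easy to test.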
## Architecture
### Core Components
1. **MCPResearchServer** (`src/mcp-server.ts`)
   - Model Context Protocol server implementation
   - Three main tools: github_code_search, duckduckgo_web_search, extract_content
   - JSON-based responses for LLM consumption
2. **DuckDuckGoSearcher** (`src/search.ts`)
   - HTML scraping of DuckDuckGo search results with locale support
   - URL decoding for `/l/?uddg=` format links
   - Rate limiting and retry logic
3. **GitHubCodeSearcher** (`src/github-code-search.ts`)
   - GitHub Code Search API integration via gh CLI
   - Advanced query support with language, repo, and file filters
   - Authentication and rate limiting
4. **ContentExtractor** (`src/extractor.ts`)
   - Playwright-based page rendering with resource blocking
   - @mozilla/readability for main content extraction
   - DOMPurify sanitization and Markdown conversion
5. **CLIInterface** (`src/cli.ts`)
   - Interactive command-line interface
   - Search result navigation
   - Content viewing and text search
6. **Configuration** (`src/config.ts`)
   - Environment and RC file configuration loading
   - Verbose logging support
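The `/l/?uddg=` decoding mentioned for DuckDuckGoSearcher refers to DuckDuckGo's redirect links, which carry the real target in a percent-encoded `uddg` query parameter. A minimal sketch of that step, assuming a hypothetical helper named `decodeDuckDuckGoLink` (the actual code in `src/search.ts` may differ):

```typescript
// Hypothetical helper: unwrap a DuckDuckGo redirect link into its target URL.
function decodeDuckDuckGoLink(href: string): string {
  let parsed: URL;
  try {
    // Result links may be protocol-relative ("//duckduckgo.com/l/?...")
    // or site-relative ("/l/?..."), so resolve against the DuckDuckGo origin.
    parsed = new URL(href, "https://duckduckgo.com");
  } catch {
    return href; // not a parseable URL; leave it untouched
  }
  const isRedirect =
    parsed.hostname.endsWith("duckduckgo.com") && parsed.pathname === "/l/";
  if (!isRedirect) return href;
  // searchParams.get() already percent-decodes the value.
  return parsed.searchParams.get("uddg") ?? href;
}

console.log(
  decodeDuckDuckGoLink(
    "//duckduckgo.com/l/?uddg=https%3A%2F%2Fexample.com%2Farticle%3Fid%3D1&rut=abc"
  )
);
```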
### Content Processing Pipeline
#### MCP Server Mode
1. **Search**:
   - DuckDuckGo: HTML endpoint → Parse results → JSON response with pagination
   - GitHub: Code Search API → Format results → JSON response with code snippets
2. **Extract**: URL from search results → Playwright navigation → Content extraction
3. **Process**: @mozilla/readability → DOMPurify sanitization → Clean JSON output
4. **Output**: Structured JSON for LLM consumption
#### CLI Mode
1. **Search**: DuckDuckGo HTML endpoint → Parse results → Display numbered list
2. **Extract**: Playwright navigation → Resource blocking → JS rendering
3. **Process**: @mozilla/readability → DOMPurify sanitization → Turndown Markdown
4. **Output**: Clean Markdown with h1-h3, **bold**, *italic*, [links](url) only
### Security Features
- **Resource Blocking**: Prevents loading of images, CSS, fonts for speed and security
- **Content Sanitization**: DOMPurify removes scripts, iframes, and dangerous elements
- **Limited Markdown**: Only allows safe formatting elements (h1-h3, strong, em, a)
- **Rate Limiting**: Respects DuckDuckGo's rate limits with exponential backoff
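Resource blocking of this kind is usually a small predicate applied to every outgoing request. The sketch below assumes the blocked set is exactly images, stylesheets, and fonts, using the type names that Playwright's `request.resourceType()` reports; `shouldBlock` is a hypothetical name, and the wiring shown in the comment is one plausible way to attach it, not necessarily how `src/extractor.ts` does it.

```typescript
// Resource types to skip, using Playwright's resourceType() vocabulary.
const BLOCKED_RESOURCE_TYPES = new Set<string>(["image", "stylesheet", "font"]);

// Hypothetical predicate: true when a request should be aborted.
function shouldBlock(resourceType: string): boolean {
  return BLOCKED_RESOURCE_TYPES.has(resourceType);
}

// With Playwright installed, the predicate could be attached like this:
//
//   await page.route("**/*", (route) =>
//     shouldBlock(route.request().resourceType())
//       ? route.abort()
//       : route.continue()
//   );

for (const type of ["document", "script", "image", "stylesheet", "font"]) {
  console.log(`${type}: ${shouldBlock(type) ? "blocked" : "allowed"}`);
}
```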
## Examples
### MCP Server Usage with Claude Code
#### 1. GitHub Code Search
```
You: "Find React hook examples for state management"
Claude uses github_code_search tool:
{
  "query": "useState useReducer state management language:javascript",
  "results": [
    {
      "title": "facebook/react/packages/react/src/ReactHooks.js",
      "url": "https://raw.githubusercontent.com/facebook/react/main/packages/react/src/ReactHooks.js",
      "snippet": "function useState(initialState) {\n return dispatcher.useState(initialState);\n}"
    }
  ],
  "pagination": {
    "currentPage": 1,
    "hasNextPage": true,
    "nextPageToken": "2"
  }
}
```
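The response shape above can be written down as TypeScript types. The field names come from the example; everything else in this sketch is an assumption, including the `paginate` helper, the fixed page size, and the idea that `nextPageToken` is simply the next page number as a string. How the server actually computes pages is not documented here.

```typescript
interface SearchResult {
  title: string;
  url: string;
  snippet: string;
}

interface Pagination {
  currentPage: number;
  hasNextPage: boolean;
  nextPageToken?: string; // assumed absent on the last page
}

interface SearchResponse {
  query: string;
  results: SearchResult[];
  pagination: Pagination;
}

// Hypothetical helper: slice an in-memory result list into one page.
function paginate(
  query: string,
  allResults: SearchResult[],
  page: number,
  pageSize: number
): SearchResponse {
  const start = (page - 1) * pageSize;
  const results = allResults.slice(start, start + pageSize);
  const hasNextPage = start + pageSize < allResults.length;
  return {
    query,
    results,
    pagination: {
      currentPage: page,
      hasNextPage,
      ...(hasNextPage ? { nextPageToken: String(page + 1) } : {}),
    },
  };
}

const demoResults: SearchResult[] = [1, 2, 3].map((n) => ({
  title: `Result ${n}`,
  url: `https://example.com/${n}`,
  snippet: `Snippet ${n}`,
}));

console.log(JSON.stringify(paginate("demo", demoResults, 1, 2), null, 2));
```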
#### 2. Web Search with Locale
```
You: "Search for Vue.js tutorials in Japanese"
Claude uses duckduckgo_web_search tool:
{
  "query": "Vue.js チュートリアル 入門",
  "locale": "jp-jp",
  "results": [
    {
      "title": "Vue.js入門ガイド",
      "url": "https://example.com/vue-tutorial",
      "snippet": "Vue.jsの基本的な使い方を学ぶチュートリアル..."
    }
  ]
}
```
#### 3. Content Extraction
```
You: "Extract the full content from that Vue.js tutorial"
Claude uses extract_content tool:
{
  "url": "https://example.com/vue-tutorial",
  "title": "Vue.js入門ガイド",
  "extractedAt": "2024-01-15T10:30:00.000Z",
  "content": "# Vue.js入門ガイド\n\nVue.jsは...\n\n## インストール\n\n..."
}
```
### CLI Examples
#### Basic Search
```bash
$ llmresearcher "python web scraping"
🔍 Search Results:
══════════════════════════════════════════════════
1. Python Web Scraping Tutorial
URL: https://realpython.com/python-web-scraping-practical-introduction/
Complete guide to web scraping with Python using requests and Beautiful Soup...
2. Web Scraping with Python - BeautifulSoup and requests
URL: https://www.dataquest.io/blog/web-scraping-python-tutorial/
Learn how to scrape websites with Python, Beautiful Soup, and requests...
══════════════════════════════════════════════════
Commands: [1-10] select result | b) back | q) quit | open <n>) open in browser
> 1
📥 Extracting content from: Python Web Scraping Tutorial
📄 Content:
══════════════════════════════════════════════════
**Python Web Scraping Tutorial**
Source: https://realpython.com/python-web-scraping-practical-introduction/
Extracted: 2024-01-15T10:30:00.000Z
──────────────────────────────────────────────────
# Python Web Scraping: A Practical Introduction
Web scraping is the process of collecting and parsing raw data from the web...
## What Is Web Scraping?
Web scraping is a technique to automatically access and extract large amounts...
══════════════════════════════════════════════════
Commands: b) back to results | /<term>) search in text | q) quit | open) open in browser
> /beautiful soup
🔍 Found 3 matches for "beautiful soup":
──────────────────────────────────────────────────
Line 15: Beautiful Soup is a Python library for parsing HTML and XML documents.
Line 42: from bs4 import BeautifulSoup
Line 67: soup = BeautifulSoup(html_content, 'html.parser')
```
### Direct URL Mode
```bash
$ llmresearcher -u https://docs.python.org/3/tutorial/
📄 Content:
══════════════════════════════════════════════════
**The Python Tutorial**
Source: https://docs.python.org/3/tutorial/
Extracted: 2024-01-15T10:35:00.000Z
──────────────────────────────────────────────────
# The Python Tutorial
Python is an easy to learn, powerful programming language...
## An Informal Introduction to Python
In the following examples, input and output are distinguished...
```
### Verbose Mode
```bash
$ llmresearcher -v "nodejs tutorial"
[VERBOSE] Searching: https://duckduckgo.com/html/?q=nodejs%20tutorial&kl=us-en
[VERBOSE] Response: 200 in 847ms
[VERBOSE] Parsed 10 results
[VERBOSE] Launching browser...
[VERBOSE] Blocking resource: https://example.com/style.css
[VERBOSE] Blocking resource: https://example.com/image.png
[VERBOSE] Navigating to page...
[VERBOSE] Page loaded in 1243ms
[VERBOSE] Processing content with Readability...
[VERBOSE] Readability extraction successful
[VERBOSE] Closing browser...
```
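The first log line shows the shape of the search request: the query goes in `q` and the locale in `kl`. A small sketch of building that URL, assuming a hypothetical `buildSearchUrl` helper and `us-en` as the default locale (the log suggests this default but the README does not state it):

```typescript
// Hypothetical helper: build the DuckDuckGo HTML endpoint URL for a query.
function buildSearchUrl(query: string, locale: string = "us-en"): string {
  // encodeURIComponent encodes spaces as %20, matching the verbose log;
  // URLSearchParams would produce "+" instead.
  const q = encodeURIComponent(query);
  const kl = encodeURIComponent(locale);
  return `https://duckduckgo.com/html/?q=${q}&kl=${kl}`;
}

console.log(buildSearchUrl("nodejs tutorial"));
// → https://duckduckgo.com/html/?q=nodejs%20tutorial&kl=us-en
```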
## Testing
### Running Tests
```bash
# Run tests in watch mode
pnpm test
# Run tests once (CI mode)
pnpm test:run
# Run tests with coverage
pnpm test -- --coverage
```
### Test Coverage
The test suite includes:
- **Unit Tests**: Individual component testing
  - `search.test.ts`: DuckDuckGo search functionality, URL decoding, rate limiting
  - `extractor.test.ts`: Content extraction, Markdown conversion, resource management
  - `config.test.ts`: Configuration validation and environment handling
- **Integration Tests**: End-to-end workflow testing
  - `integration.test.ts`: Complete search-to-extraction workflows, error handling, cleanup
### Test Features
- **Fast**: Powered by vitest for quick feedback
- **Type-safe**: Full TypeScript support in tests
- **Isolated**: Each test cleans up its resources
- **Comprehensive**: Covers search, extraction, configuration, and integration scenarios
## Troubleshooting
### Common Issues
**"Browser not found" Error**
```bash
pnpm install-browsers
```
**Rate Limiting Issues**
- The tool automatically handles rate limiting with 1-second delays
- If you encounter 429 errors, the tool will automatically retry with exponential backoff
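
Retrying with exponential backoff means doubling the wait after each failed attempt. The sketch below illustrates the idea using the documented defaults (`rateLimitDelay` of 1000 ms, `maxRetries` of 3); the function names, the doubling factor, and the 30-second cap are assumptions rather than the tool's actual retry code.

```typescript
// Delay before retry number `attempt` (0-based): base, 2x base, 4x base, ...
function backoffDelayMs(
  attempt: number,
  baseDelayMs: number = 1000,
  maxDelayMs: number = 30000 // assumed cap, not taken from the README
): number {
  return Math.min(baseDelayMs * 2 ** attempt, maxDelayMs);
}

// Run `operation`, retrying up to `maxRetries` times after failures.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxRetries: number = 3,
  baseDelayMs: number = 1000
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (error) {
      if (attempt >= maxRetries) throw error; // retries exhausted
      const delay = backoffDelayMs(attempt, baseDelayMs);
      await new Promise<void>((resolve) => setTimeout(resolve, delay));
    }
  }
}

console.log([0, 1, 2].map((attempt) => backoffDelayMs(attempt)));
// → [ 1000, 2000, 4000 ]
```

A production version would typically retry only on specific failures, such as HTTP 429 responses, rather than on every thrown error.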
**Content Extraction Failures**
- Some sites may block automated access
- The tool includes fallback extraction methods (main → body content)
- Use verbose mode (`-v`) to see detailed error information
**Permission Denied (Unix/Linux)**
```bash
chmod +x dist/bin/llmresearcher.js
```
### Performance Optimization
The tool is optimized for speed:
- **Resource Blocking**: Automatically blocks images, CSS, fonts
- **Network Idle**: Waits for JavaScript to complete rendering
- **Content Caching**: Supports local caching to avoid repeated requests
- **Minimal Dependencies**: Uses lightweight, focused libraries
## Development
### Project Structure
```
light-research-mcp/
├── dist/                              # Built JavaScript files (generated)
│   ├── bin/
│   │   └── llmresearcher.js           # CLI entry point (executable)
│   └── *.js                           # Compiled TypeScript modules
├── src/                               # TypeScript source files
│   ├── bin.ts                         # CLI entry point
│   ├── index.ts                       # Main LLMResearcher class
│   ├── mcp-server.ts                  # MCP server implementation
│   ├── search.ts                      # DuckDuckGo search implementation
│   ├── github-code-search.ts          # GitHub Code Search implementation
│   ├── extractor.ts                   # Content extraction with Playwright
│   ├── cli.ts                         # Interactive CLI interface
│   ├── config.ts                      # Configuration management
│   └── types.ts                       # TypeScript type definitions
├── test/                              # Test files (vitest)
│   ├── search.test.ts                 # Search functionality tests
│   ├── extractor.test.ts              # Content extraction tests
│   ├── config.test.ts                 # Configuration tests
│   ├── mcp-locale.test.ts             # MCP locale functionality tests
│   ├── mcp-content-extractor.test.ts  # MCP content extractor tests
│   └── integration.test.ts            # End-to-end integration tests
├── tsconfig.json                      # TypeScript configuration
├── tsup.config.ts                     # Build configuration
├── vitest.config.ts                   # Test configuration
├── package.json
└── README.md
```
### Dependencies
#### Runtime Dependencies
- **@modelcontextprotocol/sdk**: Model Context Protocol server implementation
- **@mozilla/readability**: Content extraction from HTML
- **cheerio**: HTML parsing for search results
- **commander**: CLI argument parsing
- **dompurify**: HTML sanitization
- **dotenv**: Environment variable loading
- **jsdom**: DOM manipulation for server-side processing
- **playwright**: Browser automation for JS rendering
- **turndown**: HTML to Markdown conversion
#### Development Dependencies
- **typescript**: TypeScript compiler
- **tsup**: Fast TypeScript bundler
- **vitest**: Fast unit test framework
- **@types/***: TypeScript type definitions
## License
MIT License - see LICENSE file for details.
## Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request
## Roadmap
### Planned Features
- **Enhanced MCP Tools**: Additional specialized search tools for documentation, APIs, etc.
- **Caching Layer**: SQLite-based URL → Markdown caching with 24-hour TTL
- **Search Engine Abstraction**: Support for Brave Search, Bing, and other engines
- **Content Summarization**: Optional AI-powered content summarization
- **Export Formats**: JSON, plain text, and other output formats
- **Batch Processing**: Process multiple URLs from file input
- **SSE Transport**: Support for Server-Sent Events MCP transport
### Performance Improvements
- **Parallel Processing**: Concurrent content extraction for multiple results
- **Smart Caching**: Intelligent cache invalidation based on content freshness
- **Memory Optimization**: Streaming content processing for large documents