LLM Researcher

A lightweight MCP (Model Context Protocol) server for LLM orchestration that provides efficient web content search and extraction capabilities. This CLI tool enables LLMs to search DuckDuckGo and extract clean, LLM-friendly content from web pages.

Built with TypeScript, tsup, and vitest for a modern development experience.

Features

  • MCP Server Support: Provides Model Context Protocol server for LLM integration

  • Free Operation: Uses DuckDuckGo HTML endpoint (no API costs)

  • GitHub Code Search: Search GitHub repositories for code examples and implementation patterns

  • Smart Content Extraction: Playwright + @mozilla/readability for clean content

  • LLM-Optimized Output: Sanitized Markdown (h1-h3, bold, italic, links only)

  • Rate Limited: Respects DuckDuckGo with 1 req/sec limit

  • Cross-Platform: Works on macOS, Linux, and WSL

  • Multiple Modes: CLI, MCP server, search, direct URL, and interactive modes

  • Type Safe: Full TypeScript implementation with strict typing

  • Modern Tooling: Built with tsup bundler and vitest testing

Installation

Prerequisites

  • Node.js 20.0.0 or higher

  • No local Chrome installation required (uses Playwright's bundled Chromium)

Setup

    # Clone or download the project
    cd light-research-mcp

    # Install dependencies (using pnpm)
    pnpm install

    # Build the project
    pnpm build

    # Install Playwright browsers
    pnpm install-browsers

    # Optional: Link globally for system-wide access
    pnpm link --global

Usage

MCP Server Mode

Use as a Model Context Protocol server to provide search and content extraction tools to LLMs:

    # Start MCP server (stdio transport)
    llmresearcher --mcp

    # The server provides these tools to MCP clients:
    # - github_code_search: Search GitHub repositories for code
    # - duckduckgo_web_search: Search the web with DuckDuckGo
    # - extract_content: Extract detailed content from URLs

Setting up with Claude Code

    # Add as an MCP server to Claude Code
    claude mcp add light-research-mcp /path/to/light-research-mcp/dist/bin/llmresearcher.js --mcp

    # Or with project scope for team sharing
    claude mcp add light-research-mcp -s project /path/to/light-research-mcp/dist/bin/llmresearcher.js --mcp

    # List configured servers
    claude mcp list

    # Check server status
    claude mcp get light-research-mcp

MCP Tool Usage Examples

Once configured, you can use these tools in Claude:

    > Search for React hooks examples on GitHub
    Tool: github_code_search
    Query: "useState useEffect hooks language:javascript"

    > Search for TypeScript best practices
    Tool: duckduckgo_web_search
    Query: "TypeScript best practices 2024"
    Locale: us-en (or wt-wt for no region)

    > Extract content from a search result
    Tool: extract_content
    URL: https://example.com/article-from-search-results
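For clients that talk to the server directly over stdio, each tool invocation above corresponds to an MCP `tools/call` request, which is a JSON-RPC 2.0 message. The following is a minimal sketch of how such a request might be built. `buildToolCall` is a hypothetical helper that is not part of this project, and the argument names `query` and `locale` are assumptions inferred from the examples in this README.

```typescript
// Shape of an MCP "tools/call" request (JSON-RPC 2.0).
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

// Hypothetical helper: wraps a tool name and its arguments in a request object.
function buildToolCall(
  id: number,
  name: string,
  args: Record<string, unknown>,
): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name, arguments: args },
  };
}

// Argument names ("query", "locale") are assumed, not confirmed by the source.
const request = buildToolCall(1, "duckduckgo_web_search", {
  query: "TypeScript best practices 2024",
  locale: "us-en",
});

console.log(JSON.stringify(request, null, 2));
```

In practice an MCP client library would normally build and send this message, so the sketch is mainly useful for understanding what goes over the wire.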

Command Line Interface

    # Search mode - Search DuckDuckGo and interactively browse results
    llmresearcher "machine learning transformers"

    # GitHub Code Search mode - Search GitHub for code
    llmresearcher -g "useState hooks language:typescript"

    # Direct URL mode - Extract content from specific URL
    llmresearcher -u https://example.com/article

    # Interactive mode - Enter interactive search session
    llmresearcher

    # Verbose logging - See detailed operation logs
    llmresearcher -v "search query"

    # MCP Server mode - Start as Model Context Protocol server
    llmresearcher --mcp

Development

Scripts

    # Build the project
    pnpm build

    # Build in watch mode (for development)
    pnpm dev

    # Run tests
    pnpm test

    # Run tests in CI mode (single run)
    pnpm test:run

    # Type checking
    pnpm type-check

    # Clean build artifacts
    pnpm clean

    # Install Playwright browsers
    pnpm install-browsers

Interactive Commands

When in search results view:

  • 1-10: Select a result by number

  • b or back: Return to search results

  • open <n>: Open result #n in external browser

  • q or quit: Exit the program

When viewing content:

  • b or back: Return to search results

  • /<term>: Search for term within the extracted content

  • open: Open current page in external browser

  • q or quit: Exit the program
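The `/<term>` command can be pictured as a line-by-line scan of the extracted Markdown. This is a minimal sketch rather than the project's actual implementation: `findInContent` is a hypothetical name, and case-insensitive substring matching is an assumption.

```typescript
interface LineMatch {
  lineNumber: number; // 1-based, as a reader would count lines
  text: string;
}

// Hypothetical helper: returns every line that contains the term, ignoring case.
function findInContent(content: string, term: string): LineMatch[] {
  const needle = term.toLowerCase();
  if (needle === "") return []; // an empty term matches nothing

  const matches: LineMatch[] = [];
  const lines = content.split("\n");
  for (let i = 0; i < lines.length; i++) {
    if (lines[i].toLowerCase().indexOf(needle) !== -1) {
      matches.push({ lineNumber: i + 1, text: lines[i] });
    }
  }
  return matches;
}

const article = [
  "# Parsing HTML",
  "Beautiful Soup is a Python library for parsing HTML and XML documents.",
  "Install it with pip.",
].join("\n");

for (const match of findInContent(article, "beautiful soup")) {
  console.log("Line " + match.lineNumber + ": " + match.text);
}
```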

Configuration

Environment Variables

Create a .env file in the project root:

    USER_AGENT=Mozilla/5.0 (compatible; LLMResearcher/1.0)
    TIMEOUT=30000
    MAX_RETRIES=3
    RATE_LIMIT_DELAY=1000
    CACHE_ENABLED=true
    MAX_RESULTS=10

Configuration File

Create ~/.llmresearcherrc in your home directory:

    {
      "userAgent": "Mozilla/5.0 (compatible; LLMResearcher/1.0)",
      "timeout": 30000,
      "maxRetries": 3,
      "rateLimitDelay": 1000,
      "cacheEnabled": true,
      "maxResults": 10
    }

Configuration Options

| Option | Default | Description |
| --- | --- | --- |
| userAgent | Mozilla/5.0 (compatible; LLMResearcher/1.0) | User agent for HTTP requests |
| timeout | 30000 | Request timeout in milliseconds |
| maxRetries | 3 | Maximum retry attempts for failed requests |
| rateLimitDelay | 1000 | Delay between requests in milliseconds |
| cacheEnabled | true | Enable/disable local caching |
| maxResults | 10 | Maximum search results to display |
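The two configuration sources can be thought of as layers over the defaults. Below is a minimal sketch of such a merge. `resolveConfig` and `numberFrom` are hypothetical names, and the precedence used here (defaults, then RC file, then environment variables) is an assumption; the source does not state which source wins. The function takes the parsed RC object and the environment as plain arguments so it stays independent of file and process access.

```typescript
interface ResearcherConfig {
  userAgent: string;
  timeout: number;
  maxRetries: number;
  rateLimitDelay: number;
  cacheEnabled: boolean;
  maxResults: number;
}

// Defaults as documented in the configuration options.
const DEFAULT_CONFIG: ResearcherConfig = {
  userAgent: "Mozilla/5.0 (compatible; LLMResearcher/1.0)",
  timeout: 30000,
  maxRetries: 3,
  rateLimitDelay: 1000,
  cacheEnabled: true,
  maxResults: 10,
};

type Env = Record<string, string | undefined>;

// Parses a numeric value, keeping the fallback when it is unset or not a number.
function numberFrom(value: string | undefined, fallback: number): number {
  if (value === undefined || value.trim() === "") return fallback;
  const parsed = Number(value);
  return isFinite(parsed) ? parsed : fallback;
}

// Assumed precedence: defaults < RC file < environment variables.
function resolveConfig(rc: Partial<ResearcherConfig>, env: Env): ResearcherConfig {
  const base: ResearcherConfig = { ...DEFAULT_CONFIG, ...rc };
  const cacheRaw = env["CACHE_ENABLED"];
  return {
    userAgent: env["USER_AGENT"] ?? base.userAgent,
    timeout: numberFrom(env["TIMEOUT"], base.timeout),
    maxRetries: numberFrom(env["MAX_RETRIES"], base.maxRetries),
    rateLimitDelay: numberFrom(env["RATE_LIMIT_DELAY"], base.rateLimitDelay),
    cacheEnabled: cacheRaw === undefined ? base.cacheEnabled : cacheRaw.toLowerCase() === "true",
    maxResults: numberFrom(env["MAX_RESULTS"], base.maxResults),
  };
}

const config = resolveConfig({ timeout: 5000 }, { MAX_RESULTS: "5" });
console.log(config);
```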

Architecture

Core Components

  1. MCPResearchServer (src/mcp-server.ts)

    • Model Context Protocol server implementation

    • Three main tools: github_code_search, duckduckgo_web_search, extract_content

    • JSON-based responses for LLM consumption

  2. DuckDuckGoSearcher (src/search.ts)

    • HTML scraping of DuckDuckGo search results with locale support

    • URL decoding for /l/?uddg= format links

    • Rate limiting and retry logic

  3. GitHubCodeSearcher (src/github-code-search.ts)

    • GitHub Code Search API integration via gh CLI

    • Advanced query support with language, repo, and file filters

    • Authentication and rate limiting

  4. ContentExtractor (src/extractor.ts)

    • Playwright-based page rendering with resource blocking

    • @mozilla/readability for main content extraction

    • DOMPurify sanitization and Markdown conversion

  5. CLIInterface (src/cli.ts)

    • Interactive command-line interface

    • Search result navigation

    • Content viewing and text search

  6. Configuration (src/config.ts)

    • Environment and RC file configuration loading

    • Verbose logging support
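The URL decoding step handled by DuckDuckGoSearcher can be illustrated in a few lines. DuckDuckGo's HTML results wrap each target in a redirect link whose `uddg` query parameter carries the percent-encoded destination. This is a minimal sketch, not the project's code: `decodeDuckDuckGoUrl` is a hypothetical name, and returning the input unchanged for non-redirect links is an assumption.

```typescript
// Hypothetical helper: unwraps "/l/?uddg=<encoded target>" redirect links.
function decodeDuckDuckGoUrl(href: string): string {
  let parsed: URL;
  try {
    // The base also resolves relative ("/l/?...") and protocol-relative ("//...") links.
    parsed = new URL(href, "https://duckduckgo.com");
  } catch {
    return href; // not parseable: leave untouched
  }

  const isDuckDuckGo = /(^|\.)duckduckgo\.com$/.test(parsed.hostname);
  if (!isDuckDuckGo || parsed.pathname !== "/l/") return href;

  // searchParams.get() already percent-decodes the value.
  const target = parsed.searchParams.get("uddg");
  return target !== null && target !== "" ? target : href;
}

console.log(
  decodeDuckDuckGoUrl("//duckduckgo.com/l/?uddg=https%3A%2F%2Fexample.com%2Farticle&rut=abc"),
);
```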

Content Processing Pipeline

MCP Server Mode

  1. Search:

    • DuckDuckGo: HTML endpoint → Parse results → JSON response with pagination

    • GitHub: Code Search API → Format results → JSON response with code snippets

  2. Extract: URL from search results → Playwright navigation → Content extraction

  3. Process: @mozilla/readability → DOMPurify sanitization → Clean JSON output

  4. Output: Structured JSON for LLM consumption

CLI Mode

  1. Search: DuckDuckGo HTML endpoint → Parse results → Display numbered list

  2. Extract: Playwright navigation → Resource blocking → JS rendering

  3. Process: @mozilla/readability → DOMPurify sanitization → Turndown Markdown

  4. Output: Clean Markdown with h1-h3, bold, italic, links only

Security Features

  • Resource Blocking: Prevents loading of images, CSS, fonts for speed and security

  • Content Sanitization: DOMPurify removes scripts, iframes, and dangerous elements

  • Limited Markdown: Only allows safe formatting elements (h1-h3, strong, em, a)

  • Rate Limiting: Respects DuckDuckGo's rate limits with exponential backoff
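The sanitization and limited-Markdown rules above amount to an allow-list. As a sketch, the list could be expressed as a DOMPurify options object. `ALLOWED_TAGS` and `ALLOWED_ATTR` are real DOMPurify options, but the exact values the project passes are not shown in the source, so the lists below (including allowing only `href`) are assumptions, and `isAllowedTag` is a hypothetical helper.

```typescript
// Assumed allow-list matching the formatting elements named above.
const SANITIZE_OPTIONS = {
  ALLOWED_TAGS: ["h1", "h2", "h3", "strong", "em", "a"],
  ALLOWED_ATTR: ["href"], // assumption: links need href and nothing else
};

// Hypothetical helper: checks a tag name against the allow-list, ignoring case.
function isAllowedTag(tagName: string): boolean {
  return SANITIZE_OPTIONS.ALLOWED_TAGS.indexOf(tagName.toLowerCase()) !== -1;
}

// With DOMPurify installed, the options could be applied as (not executed here):
//   const clean = DOMPurify.sanitize(dirtyHtml, SANITIZE_OPTIONS);

console.log(isAllowedTag("h2"), isAllowedTag("script"));
```

An allow-list is the safer design here: anything not explicitly permitted is dropped, so new or unusual elements cannot slip through.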

Examples

MCP Server Usage with Claude Code

1. GitHub Code Search

    You: "Find React hook examples for state management"

    Claude uses github_code_search tool:
    {
      "query": "useState useReducer state management language:javascript",
      "results": [
        {
          "title": "facebook/react/packages/react/src/ReactHooks.js",
          "url": "https://raw.githubusercontent.com/facebook/react/main/packages/react/src/ReactHooks.js",
          "snippet": "function useState(initialState) {\n return dispatcher.useState(initialState);\n}"
        }
      ],
      "pagination": {
        "currentPage": 1,
        "hasNextPage": true,
        "nextPageToken": "2"
      }
    }

2. Web Search with Locale

    You: "Search for Vue.js tutorials in Japanese"

    Claude uses duckduckgo_web_search tool:
    {
      "query": "Vue.js チュートリアル 入門",
      "locale": "jp-jp",
      "results": [
        {
          "title": "Vue.js入門ガイド",
          "url": "https://example.com/vue-tutorial",
          "snippet": "Vue.jsの基本的な使い方を学ぶチュートリアル..."
        }
      ]
    }

3. Content Extraction

    You: "Extract the full content from that Vue.js tutorial"

    Claude uses extract_content tool:
    {
      "url": "https://example.com/vue-tutorial",
      "title": "Vue.js入門ガイド",
      "extractedAt": "2024-01-15T10:30:00.000Z",
      "content": "# Vue.js入門ガイド\n\nVue.jsは...\n\n## インストール\n\n..."
    }
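A client consuming these responses may want typed access to the results and the pagination fields. The interfaces below are a sketch modeled only on the example payloads in this section. Which fields are optional is an assumption, and `nextPageToken` as a function name is hypothetical.

```typescript
interface SearchResult {
  title: string;
  url: string;
  snippet: string;
}

// Field names follow the example payloads; optionality is assumed.
interface SearchResponse {
  query: string;
  locale?: string;
  results: SearchResult[];
  pagination?: {
    currentPage: number;
    hasNextPage: boolean;
    nextPageToken?: string;
  };
}

// Returns the token for the next page, or null when there is none.
function nextPageToken(response: SearchResponse): string | null {
  const page = response.pagination;
  if (!page || !page.hasNextPage || page.nextPageToken === undefined) return null;
  return page.nextPageToken;
}

const raw =
  '{"query":"useState","results":[],' +
  '"pagination":{"currentPage":1,"hasNextPage":true,"nextPageToken":"2"}}';
const parsed = JSON.parse(raw) as SearchResponse;

console.log(nextPageToken(parsed));
```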

CLI Examples

Basic Search

    $ llmresearcher "python web scraping"

    🔍 Search Results:
    ══════════════════════════════════════════════════
    1. Python Web Scraping Tutorial
       URL: https://realpython.com/python-web-scraping-practical-introduction/
       Complete guide to web scraping with Python using requests and Beautiful Soup...

    2. Web Scraping with Python - BeautifulSoup and requests
       URL: https://www.dataquest.io/blog/web-scraping-python-tutorial/
       Learn how to scrape websites with Python, Beautiful Soup, and requests...
    ══════════════════════════════════════════════════
    Commands: [1-10] select result | b) back | q) quit | open <n>) open in browser

    > 1

    📥 Extracting content from: Python Web Scraping Tutorial

    📄 Content:
    ══════════════════════════════════════════════════
    **Python Web Scraping Tutorial**
    Source: https://realpython.com/python-web-scraping-practical-introduction/
    Extracted: 2024-01-15T10:30:00.000Z
    ──────────────────────────────────────────────────
    # Python Web Scraping: A Practical Introduction

    Web scraping is the process of collecting and parsing raw data from the web...

    ## What Is Web Scraping?

    Web scraping is a technique to automatically access and extract large amounts...
    ══════════════════════════════════════════════════
    Commands: b) back to results | /<term>) search in text | q) quit | open) open in browser

    > /beautiful soup

    🔍 Found 3 matches for "beautiful soup":
    ──────────────────────────────────────────────────
    Line 15: Beautiful Soup is a Python library for parsing HTML and XML documents.
    Line 42: from bs4 import BeautifulSoup
    Line 67: soup = BeautifulSoup(html_content, 'html.parser')

Direct URL Mode

    $ llmresearcher -u https://docs.python.org/3/tutorial/

    📄 Content:
    ══════════════════════════════════════════════════
    **The Python Tutorial**
    Source: https://docs.python.org/3/tutorial/
    Extracted: 2024-01-15T10:35:00.000Z
    ──────────────────────────────────────────────────
    # The Python Tutorial

    Python is an easy to learn, powerful programming language...

    ## An Informal Introduction to Python

    In the following examples, input and output are distinguished...

Verbose Mode

    $ llmresearcher -v "nodejs tutorial"

    [VERBOSE] Searching: https://duckduckgo.com/html/?q=nodejs%20tutorial&kl=us-en
    [VERBOSE] Response: 200 in 847ms
    [VERBOSE] Parsed 10 results
    [VERBOSE] Launching browser...
    [VERBOSE] Blocking resource: https://example.com/style.css
    [VERBOSE] Blocking resource: https://example.com/image.png
    [VERBOSE] Navigating to page...
    [VERBOSE] Page loaded in 1243ms
    [VERBOSE] Processing content with Readability...
    [VERBOSE] Readability extraction successful
    [VERBOSE] Closing browser...
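The search URL in the first log line shows how the query and locale reach the DuckDuckGo HTML endpoint: `q` carries the percent-encoded query and `kl` the locale. A minimal sketch of building that URL follows. `buildSearchUrl` is a hypothetical name, and defaulting the locale to `us-en` is an assumption based on the log output above.

```typescript
// Hypothetical helper: builds the HTML-endpoint URL seen in the verbose log.
function buildSearchUrl(query: string, locale: string = "us-en"): string {
  // encodeURIComponent encodes spaces as %20, matching the logged URL.
  return (
    "https://duckduckgo.com/html/?q=" +
    encodeURIComponent(query) +
    "&kl=" +
    encodeURIComponent(locale)
  );
}

console.log(buildSearchUrl("nodejs tutorial"));
console.log(buildSearchUrl("Vue.js チュートリアル", "jp-jp"));
```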

Testing

Running Tests

    # Run tests in watch mode
    pnpm test

    # Run tests once (CI mode)
    pnpm test:run

    # Run tests with coverage
    pnpm test -- --coverage

Test Coverage

The test suite includes:

  • Unit Tests: Individual component testing

    • search.test.ts: DuckDuckGo search functionality, URL decoding, rate limiting

    • extractor.test.ts: Content extraction, Markdown conversion, resource management

    • config.test.ts: Configuration validation and environment handling

  • Integration Tests: End-to-end workflow testing

    • integration.test.ts: Complete search-to-extraction workflows, error handling, cleanup

Test Features

  • Fast: Powered by vitest for quick feedback

  • Type-safe: Full TypeScript support in tests

  • Isolated: Each test cleans up its resources

  • Comprehensive: Covers search, extraction, configuration, and integration scenarios

Troubleshooting

Common Issues

"Browser not found" Error

pnpm install-browsers

Rate Limiting Issues

  • The tool automatically handles rate limiting with 1-second delays

  • If you encounter 429 errors, the tool will automatically retry with exponential backoff
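The retry behavior described above can be sketched as follows. This is an illustration rather than the project's implementation: `backoffDelay` and `retryWithBackoff` are hypothetical names, and the doubling factor is an assumption, since the source only says "exponential backoff". The base delay and retry count correspond to `rateLimitDelay` and `maxRetries` in the configuration.

```typescript
// Delay before retry number `attempt` (0-based): base, 2 * base, 4 * base, ...
function backoffDelay(attempt: number, baseDelayMs: number): number {
  return baseDelayMs * Math.pow(2, attempt);
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Runs `operation`, retrying up to `maxRetries` times with growing waits.
// `wait` is injectable so callers (and tests) can avoid real delays.
async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  maxRetries: number,
  baseDelayMs: number,
  wait: (ms: number) => Promise<void> = sleep,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt < maxRetries) {
        await wait(backoffDelay(attempt, baseDelayMs));
      }
    }
  }
  throw lastError; // every attempt failed
}

console.log([0, 1, 2].map((attempt) => backoffDelay(attempt, 1000)));
```

With the default settings (`maxRetries` 3, `rateLimitDelay` 1000), this sketch would wait 1000 ms, 2000 ms, and 4000 ms between attempts before giving up.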

Content Extraction Failures

  • Some sites may block automated access

  • The tool includes fallback extraction methods (main → body content)

  • Use verbose mode (-v) to see detailed error information

Permission Denied (Unix/Linux)

chmod +x dist/bin/llmresearcher.js

Performance Optimization

The tool is optimized for speed:

  • Resource Blocking: Automatically blocks images, CSS, fonts

  • Network Idle: Waits for JavaScript to complete rendering

  • Content Caching: Supports local caching to avoid repeated requests

  • Minimal Dependencies: Uses lightweight, focused libraries
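Resource blocking comes down to a decision per request based on its type. The sketch below keeps that decision in a pure function. `shouldBlock` is a hypothetical name, and the type strings are the values Playwright reports from `request.resourceType()` (`image`, `stylesheet`, `font`). The blocked set follows the list above, and the Playwright wiring is shown only as a comment because it needs a running browser.

```typescript
// Resource types to skip, using Playwright's resourceType() names.
const BLOCKED_RESOURCE_TYPES = ["image", "stylesheet", "font"];

// Hypothetical helper: decides whether a request should be aborted.
function shouldBlock(resourceType: string): boolean {
  return BLOCKED_RESOURCE_TYPES.indexOf(resourceType) !== -1;
}

// With Playwright, the predicate could be wired up as (not executed here):
//   await page.route("**/*", (route) =>
//     shouldBlock(route.request().resourceType())
//       ? route.abort()
//       : route.continue(),
//   );

console.log(shouldBlock("image"), shouldBlock("document"));
```

Scripts are deliberately not blocked in this sketch, since the pipeline relies on JavaScript rendering.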

Development

Project Structure

    light-research-mcp/
    ├── dist/                                  # Built JavaScript files (generated)
    │   ├── bin/
    │   │   └── llmresearcher.js               # CLI entry point (executable)
    │   └── *.js                               # Compiled TypeScript modules
    ├── src/                                   # TypeScript source files
    │   ├── bin.ts                             # CLI entry point
    │   ├── index.ts                           # Main LLMResearcher class
    │   ├── mcp-server.ts                      # MCP server implementation
    │   ├── search.ts                          # DuckDuckGo search implementation
    │   ├── github-code-search.ts              # GitHub Code Search implementation
    │   ├── extractor.ts                       # Content extraction with Playwright
    │   ├── cli.ts                             # Interactive CLI interface
    │   ├── config.ts                          # Configuration management
    │   └── types.ts                           # TypeScript type definitions
    ├── test/                                  # Test files (vitest)
    │   ├── search.test.ts                     # Search functionality tests
    │   ├── extractor.test.ts                  # Content extraction tests
    │   ├── config.test.ts                     # Configuration tests
    │   ├── mcp-locale.test.ts                 # MCP locale functionality tests
    │   ├── mcp-content-extractor.test.ts      # MCP content extractor tests
    │   └── integration.test.ts                # End-to-end integration tests
    ├── tsconfig.json                          # TypeScript configuration
    ├── tsup.config.ts                         # Build configuration
    ├── vitest.config.ts                       # Test configuration
    ├── package.json
    └── README.md

Dependencies

Runtime Dependencies

  • @modelcontextprotocol/sdk: Model Context Protocol server implementation

  • @mozilla/readability: Content extraction from HTML

  • cheerio: HTML parsing for search results

  • commander: CLI argument parsing

  • dompurify: HTML sanitization

  • dotenv: Environment variable loading

  • jsdom: DOM manipulation for server-side processing

  • playwright: Browser automation for JS rendering

  • turndown: HTML to Markdown conversion

Development Dependencies

  • typescript: TypeScript compiler

  • tsup: Fast TypeScript bundler

  • vitest: Fast unit test framework

  • @types/*: TypeScript type definitions

License

MIT License - see LICENSE file for details.

Contributing

  1. Fork the repository

  2. Create a feature branch

  3. Make your changes

  4. Add tests if applicable

  5. Submit a pull request

Roadmap

Planned Features

  • Enhanced MCP Tools: Additional specialized search tools for documentation, APIs, etc.

  • Caching Layer: SQLite-based URL → Markdown caching with 24-hour TTL

  • Search Engine Abstraction: Support for Brave Search, Bing, and other engines

  • Content Summarization: Optional AI-powered content summarization

  • Export Formats: JSON, plain text, and other output formats

  • Batch Processing: Process multiple URLs from file input

  • SSE Transport: Support for Server-Sent Events MCP transport

Performance Improvements

  • Parallel Processing: Concurrent content extraction for multiple results

  • Smart Caching: Intelligent cache invalidation based on content freshness

  • Memory Optimization: Streaming content processing for large documents
