# CortexScout (cortex-scout) — Search and Web Extraction Engine for AI Agents

## Overview

CortexScout provides a single, self-hostable Rust binary that exposes search and extraction capabilities over MCP (stdio) and an optional HTTP server. Output formats are structured and optimized for downstream LLM use.

It is built to handle the practical failure modes of web retrieval (rate limits, bot challenges, JavaScript-heavy pages) through progressive fallbacks: native retrieval → Chromium CDP rendering → HITL workflows.

Highlights:

- Extracts large result sets and listing data from Airbnb while handling DataDome anti-bot protections.
- Searches and extracts product information and search results from Amazon while navigating AWS Shield protections.
- Supports Google, DuckDuckGo, and Brave as search engines for web search and parallel meta-search.
- Provides LLM synthesis for deep research via OpenAI or any OpenAI-compatible endpoint, including local Ollama instances.
- Extracts data from Ticketmaster by handling Cloudflare Turnstile challenges.
- Retrieves protected job listings and data from Upwork by navigating reCAPTCHA and other bot detection systems.
## Tools (Capability Roster)

| Area | MCP Tools / Capabilities |
| --- | --- |
| Search | `web_search_json`, multi-engine meta-search |
| Fetch | `web_fetch` (structured output formats such as `clean_json`) |
| Crawl | Link-following page crawl with a configurable link budget |
| Extraction | Query-focused content extraction with relevance filtering |
| Anti-bot handling | CDP rendering, proxy rotation, block-aware retries |
| HITL | `hitl_web_fetch`, `human_auth_session`, `visual_scout`, `non_robot_search` |
| Memory | `memory_search` (LanceDB-backed semantic memory) |
| Deep research | `deep_research` (multi-hop search + scrape + LLM synthesis) |
## Ecosystem Integration

While CortexScout runs as a standalone tool today, it is designed to integrate with CortexDB and CortexStudio for multi-agent scaling, shared retrieval artifacts, and centralized governance.
## Anti-Bot Efficacy & Validation

This repository includes captured evidence artifacts that validate extraction and HITL flows against representative protected targets.

| Target | Protection | Evidence | Notes |
| --- | --- | --- | --- |
| — | Cloudflare + Auth | Auth-gated listings extraction | |
| Ticketmaster | Cloudflare Turnstile | Challenge-handled extraction | |
| Airbnb | DataDome | Large result sets under bot controls | |
| Upwork | reCAPTCHA | Protected listings retrieval | |
| Amazon | AWS Shield | Search result extraction | |
| nowsecure.nl | Cloudflare | Manual return path validated | |

See proof/README.md for methodology and raw outputs.
## Quick Start

### Option A — Prebuilt binaries

Download the latest release assets from GitHub Releases and run one of:

- `cortex-scout-mcp` — MCP stdio server (recommended for VS Code / Cursor / Claude Desktop)
- `cortex-scout` — optional HTTP server (default port `5000`; override via `--port`, `PORT`, or `CORTEX_SCOUT_PORT`)

Health check (HTTP server):

```shell
./cortex-scout --port 5000
curl http://localhost:5000/health
```
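If you launch the HTTP server from a script, a small readiness poll avoids racing the first request against startup. A minimal sketch, assuming only the `/health` route shown above (the helper name and timings are illustrative, not part of CortexScout):

```python
import time
import urllib.error
import urllib.request

def wait_for_health(url: str, timeout_secs: float = 10.0) -> bool:
    """Poll a /health endpoint until it answers 200, or give up."""
    deadline = time.monotonic() + timeout_secs
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry shortly
        time.sleep(0.25)
    return False

# Usage: wait_for_health("http://localhost:5000/health")
```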
### Option B — Build from source

Basic build (search, scrape, deep research, memory):

```shell
git clone https://github.com/cortex-works/cortex-scout.git
cd cortex-scout/mcp-server
cargo build --release --bin cortex-scout-mcp
```

Full build (includes `hitl_web_fetch` / visible-browser HITL):

```shell
cargo build --release --all-features --bin cortex-scout-mcp
```

If you also want the optional HTTP server binary, build it explicitly with `cargo build --release --bin cortex-scout`.

Local MCP smoke test:

```shell
python3 publish/ci/smoke_mcp.py
```

This runs a newline-delimited JSON-RPC stdio session against the local `cortex-scout-mcp` binary and exercises the main public tools with safe example inputs.
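The transport behind that smoke test is plain newline-delimited JSON-RPC 2.0 over stdio. A simplified sketch of the framing, assuming the standard MCP handshake methods (`initialize`, `notifications/initialized`, `tools/list`) and a binary path you supply — the real `smoke_mcp.py` may differ:

```python
import json
import subprocess

def jsonrpc_line(method: str, params: dict, req_id: int) -> str:
    """One JSON-RPC 2.0 request, newline-terminated for stdio transport."""
    return json.dumps({"jsonrpc": "2.0", "id": req_id,
                       "method": method, "params": params}) + "\n"

def smoke(binary: str) -> list:
    """Handshake with a stdio MCP server and list its tools."""
    proc = subprocess.Popen([binary], stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE, text=True)
    proc.stdin.write(jsonrpc_line("initialize", {
        "protocolVersion": "2024-11-05",
        "capabilities": {},
        "clientInfo": {"name": "smoke", "version": "0"},
    }, 1))
    # MCP expects this notification (no id) before further requests.
    proc.stdin.write(json.dumps(
        {"jsonrpc": "2.0", "method": "notifications/initialized"}) + "\n")
    proc.stdin.write(jsonrpc_line("tools/list", {}, 2))
    proc.stdin.flush()
    replies = [json.loads(proc.stdout.readline()) for _ in range(2)]
    proc.terminate()
    return replies
```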
## MCP Integration (VS Code / Cursor / Claude Desktop)

Add a server entry to your MCP config.

VS Code (`mcp.json` — global, or `settings.json` under `mcp.servers`):

```json
// mcp.json (global): top-level key is "servers"
// settings.json (workspace): use "mcp.servers" instead
{
  "servers": {
    "cortex-scout": {
      "type": "stdio",
      "command": "env",
      "args": [
        "RUST_LOG=warn",
        "SEARCH_ENGINES=google,bing,duckduckgo,brave",
        "LANCEDB_URI=/YOUR_PATH/cortex-scout/lancedb",
        "HTTP_TIMEOUT_SECS=30",
        "MAX_CONTENT_CHARS=10000",
        "/YOUR_PATH/cortex-scout/mcp-server/target/release/cortex-scout-mcp"
      ]
    }
  }
}
```

Default behavior is direct/no-proxy. Add `IP_LIST_PATH` and `PROXY_SOURCE_PATH` only if you want proxy tools available. If you want `proxy_control` available without routing normal traffic through proxies, point `IP_LIST_PATH` at an empty `ip.txt` file and let agents populate it on demand.
> **Important:** Always use `RUST_LOG=warn`, not `info`. At `info` level, the server emits hundreds of log lines per request to stderr, which can confuse MCP clients that monitor stderr.

> **Windows:** Windows has no `env` command. Use the `command` + `env` object format instead — see docs/IDE_SETUP.md.
With deep research (LLM synthesis via OpenRouter / any OpenAI-compatible API):

```json
{
  "servers": {
    "cortex-scout": {
      "type": "stdio",
      "command": "env",
      "args": [
        "RUST_LOG=warn",
        "SEARCH_ENGINES=google,bing,duckduckgo,brave",
        "LANCEDB_URI=/YOUR_PATH/cortex-scout/lancedb",
        "HTTP_TIMEOUT_SECS=30",
        "MAX_CONTENT_CHARS=10000",
        "OPENAI_BASE_URL=https://openrouter.ai/api/v1",
        "OPENAI_API_KEY=sk-or-v1-...",
        "DEEP_RESEARCH_LLM_MODEL=moonshotai/kimi-k2.5",
        "DEEP_RESEARCH_ENABLED=1",
        "DEEP_RESEARCH_SYNTHESIS=1",
        "DEEP_RESEARCH_SYNTHESIS_MAX_TOKENS=4096",
        "/YOUR_PATH/cortex-scout/mcp-server/target/release/cortex-scout-mcp"
      ]
    }
  }
}
```

Multi-IDE guide: docs/IDE_SETUP.md
## Configuration (cortex-scout.json)

Create `cortex-scout.json` in the same directory as the binary (or the repository root). All fields are optional; environment variables act as a fallback.

```json
{
  "deep_research": {
    "enabled": true,
    "llm_base_url": "http://localhost:1234/v1",
    "llm_api_key": "",
    "llm_model": "lfm2-2.6b",
    "synthesis_enabled": true,
    "synthesis_max_sources": 3,
    "synthesis_max_chars_per_source": 800,
    "synthesis_max_tokens": 1024
  }
}
```

## Key Environment Variables
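The "config file first, environment as fallback" resolution described above can be sketched in a few lines. This is illustrative glue, not CortexScout's actual loader; the field and variable names follow the examples in this README:

```python
import json
import os
from pathlib import Path

def resolve_setting(config_path: str, json_key: str, env_var: str, default=None):
    """Prefer cortex-scout.json's deep_research section, then the env var."""
    path = Path(config_path)
    if path.is_file():
        section = json.loads(path.read_text()).get("deep_research", {})
        if json_key in section:
            return section[json_key]
    return os.environ.get(env_var, default)

# Usage: resolve_setting("cortex-scout.json", "llm_model",
#                        "DEEP_RESEARCH_LLM_MODEL", default=None)
```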
### Core

| Variable | Default | Description |
| --- | --- | --- |
| `RUST_LOG` | — | Log level. Keep `warn`, not `info` |
| `HTTP_TIMEOUT_SECS` | — | Per-request read timeout (seconds) |
| — | — | TCP connect timeout (seconds) |
| — | — | Max concurrent outbound HTTP connections |
| `MAX_CONTENT_CHARS` | — | Max characters returned per scraped page |

### Browser / Anti-bot

| Variable | Default | Description |
| --- | --- | --- |
| — | auto-detected | Override path to Chromium/Chrome/Brave binary |
| — | — | Retry search engine fetches via native Chromium CDP when blocked |
| — | unset | Set … |
| — | — | Max links followed per page crawl |

### Search

| Variable | Default | Description |
| --- | --- | --- |
| `SEARCH_ENGINES` | — | Active engines (comma-separated) |
| — | — | Results per engine before merge/dedup |

### Proxy

| Variable | Default | Description |
| --- | --- | --- |
| `IP_LIST_PATH` | — | Optional path to an `ip.txt` proxy IP list |
| `PROXY_SOURCE_PATH` | — | Optional path to a proxy source list |

### Semantic Memory (LanceDB)

| Variable | Default | Description |
| --- | --- | --- |
| `LANCEDB_URI` | — | Directory path for persistent research memory. Omit to disable |
| — | — | Set … |
| — | built-in | HuggingFace model ID or local path for the embedding model |

### Deep Research

| Variable | Default | Description |
| --- | --- | --- |
| `DEEP_RESEARCH_ENABLED` | — | Set `1` to enable deep research |
| `OPENAI_API_KEY` | — | API key for LLM synthesis. Omit for key-less local endpoints (Ollama) |
| `OPENAI_BASE_URL` | — | OpenAI-compatible endpoint (OpenRouter, Ollama, LM Studio, etc.) |
| `DEEP_RESEARCH_LLM_MODEL` | — | Model identifier (must be supported by the endpoint) |
| `DEEP_RESEARCH_SYNTHESIS` | — | Set `1` to enable LLM synthesis |
| `DEEP_RESEARCH_SYNTHESIS_MAX_TOKENS` | — | Max tokens for the synthesis response |
| — | — | Max source documents fed to LLM synthesis |
| — | — | Max characters extracted per source for synthesis |

### HTTP Server only

| Variable | Default | Description |
| --- | --- | --- |
| `PORT` / `CORTEX_SCOUT_PORT` | `5000` | Listening port for the HTTP server binary (`cortex-scout`) |
## Agent Best Practices

Recommended operational flow:

1. Call `memory_search` before any new research run — skip live fetching if similarity ≥ 0.60 and `skip_live_fetch` is `true`.
2. For initial topic discovery use `web_search_json` (returns structured snippets at lower token cost than a full scrape).
3. For known URLs use `web_fetch` with `output_format="clean_json"`; set `query` + `strict_relevance=true` to truncate irrelevant content.
4. On 403/429: call `proxy_control` with `action:"grab"` to refresh the proxy list, then retry with `use_proxy:true`.
5. For auth-gated pages: `visual_scout` to confirm the gate type → `human_auth_session` to complete login (cookies persisted under `~/.cortex-scout/sessions/`).
6. For deep research: `deep_research` handles multi-hop search + scrape + LLM synthesis automatically. Tune `depth` (1–3) and `max_sources` per run cost budget.
7. For CAPTCHA or heavy-JS pages where all other paths fail: `hitl_web_fetch` opens a visible Brave/Chrome window for human completion (requires an `--all-features` build and a local desktop session).
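The flow above amounts to a simple dispatcher your agent can implement. A sketch under stated assumptions: the tool names match this README's roster, but `plan_next_step`, its input shape, and the thresholds are hypothetical client-side glue, not CortexScout's API:

```python
def plan_next_step(state: dict) -> str:
    """Pick the next CortexScout tool per the recommended flow (illustrative)."""
    # 1. Reuse memory first: skip live fetching on a strong hit.
    if state.get("memory_similarity", 0.0) >= 0.60 and state.get("skip_live_fetch"):
        return "use_memory_result"
    # 4. Blocked? Refresh proxies, then retry the fetch through one.
    if state.get("last_status") in (403, 429):
        return "proxy_control"       # action:"grab", then retry use_proxy:true
    # 5. Auth walls take the HITL login path (confirm first with visual_scout).
    if state.get("auth_gated"):
        return "human_auth_session"
    # 3. Known URL: targeted fetch; 2. otherwise discover via search.
    if state.get("url"):
        return "web_fetch"           # output_format="clean_json"
    return "web_search_json"
```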
## FAQ

### Why does `deep_research` with Ollama or qwen3.5 sometimes fail or fall back to heuristic mode?

Some reasoning-capable local models return OpenAI-compatible `/v1/chat/completions` responses with `message.reasoning` populated but `message.content` empty. CortexScout now retries local Ollama endpoints through native `/api/chat` with `think:false` when that pattern is detected.
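That failure pattern is easy to recognize in a raw response payload. A hedged sketch (the field names follow the OpenAI-compatible response shape described above; the helper itself is illustrative, not CortexScout's code):

```python
def needs_native_retry(response: dict) -> bool:
    """True when a chat completion carries reasoning but no usable content."""
    try:
        msg = response["choices"][0]["message"]
    except (KeyError, IndexError, TypeError):
        return False  # malformed response; handle separately
    has_reasoning = bool(msg.get("reasoning"))
    has_content = bool((msg.get("content") or "").strip())
    return has_reasoning and not has_content
```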
Recommended config for local 4B-class Ollama models:

- `llm_api_key: ""` in `cortex-scout.json` is valid and means "no auth required"
- Keep `synthesis_max_sources` at 1–2
- Keep `synthesis_max_chars_per_source` around 600–1000
- Keep `synthesis_max_tokens` around 512–768
If you still see slow or unstable synthesis, reduce `synthesis_max_sources` before increasing token limits.
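Putting those recommendations together, a `cortex-scout.json` for a local Ollama model might look like the following sketch. The model name is a placeholder you must replace; `http://localhost:11434/v1` is Ollama's default OpenAI-compatible endpoint:

```json
{
  "deep_research": {
    "enabled": true,
    "llm_base_url": "http://localhost:11434/v1",
    "llm_api_key": "",
    "llm_model": "YOUR_LOCAL_MODEL",
    "synthesis_enabled": true,
    "synthesis_max_sources": 2,
    "synthesis_max_chars_per_source": 800,
    "synthesis_max_tokens": 512
  }
}
```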
### Why do I see Chromium profile lock errors?

Each headless request uses a unique temporary profile, so normal scraping and `deep_research` are safe from profile lock races. Only HITL flows (like `non_robot_search`) that use a real browser profile can hit a lock if you run them concurrently or have Brave/Chrome open on the same profile. To avoid this, run HITL calls one at a time and close all browser windows before reusing a profile.

Checklist:

- Use a recent build (2026-03-05 or newer)
- Avoid persistent profile paths unless you need a logged-in session
- Run HITL/profile flows sequentially
- Close all browser windows before reusing a profile
- Let CortexScout use its own temp profiles for concurrent research
### My MCP client connects but tools fail or time out immediately. What should I check first?

Check these before anything else:

1. Use `RUST_LOG=warn`, not `info`.
2. On macOS/Linux `env`-style configs, pass the binary path directly after the env assignments. Do not insert `"--"` in `mcp.json` args.
3. On Windows, do not use `env`; use `command` plus an `env` object.
4. Make sure the binary path points to a current build.
## Versioning and Changelog

See CHANGELOG.md.

## License

MIT. See LICENSE.