| search | Run a multi-engine web search and return a ranked, deduplicated link list. Best for:
- Discovery queries ("what is X", "find me X", "who is X").
- Getting a list of URLs you can hand to `fetch` / `fetch_batch` next.
- Topics likely to be after your knowledge cutoff (use `freshness="week"`).
- Filtering to specific domains (`include_domains=["python.org"]`) or
content types (`category="paper"|"pdf"|"github"|"news"|"forum"|"blog"`).
Not recommended for:
- You already know the URL -> use `fetch` instead.
- You want both links AND their full text in one call -> use `research`.
- You want to query pages already in the local cache -> use `cache_search`.
- Reading PDFs/DOCX from a known URL -> use `read_doc`.
Returns:
- markdown (default): numbered list of `n. title`, `<url>`, snippet — ~40%
fewer tokens than json.
- json: dict with `results` (list of {title,url,snippet,engines,score}),
`engines`, `cached`, optional `errors` map, optional `hint` string.
Common mistakes:
- Passing a URL as `query` — that's `fetch`'s job.
- Cranking `max_results` to 50 hoping for better recall; engines cap around
10-20 results each, so anything beyond that is duplicate noise.
- Adding `engines=["startpage","brave","bing","baidu"]` by default — those
need browser rendering or captcha-friendly conditions; stick with the
defaults unless they returned 0.
- Using `category="news"` for breaking news without also setting
`freshness="day"` — the index lag is days, not minutes.
Args:
query: Natural-language query (the same string a human would type).
engines: Subset of `engines()`. None = duckduckgo+mojeek+googlenews.
(startpage is opt-in and browser-rendered.)
max_results: Merged result count after dedup. 5-20 is the useful range.
use_cache: Reuse the last result for this exact (query, engines,
max_results) within the cache TTL. False forces a re-fetch.
max_age_hours: Treat cached results older than this as a read miss; a
fresh result is ALWAYS written back to the cache regardless of this
value, so caching is never disabled. Use 0 to force-refresh while
keeping cache writes; None = use server default TTL (7 days).
freshness: "day"|"week"|"month"|"year" — restrict to recent results.
include_domains: List of domains to restrict to (e.g. ["python.org"]).
exclude_domains: List of domains to exclude.
category: "news"|"pdf"|"github"|"paper"|"forum"|"blog" — content-type
shortcut. "paper" => arxiv/acm/springer/ieee/etc; "forum" =>
reddit/HN/stackexchange; "github" => github.com only.
include_text: Substring required in title or snippet (case-insensitive).
exclude_text: Substring forbidden in title or snippet.
format: "markdown" (default) or "json".
|
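The `search` row above says the json format returns a dict with a ranked `results` list that can be handed to `fetch` / `fetch_batch`. A minimal sketch of that hand-off follows. The helper name `top_urls` and the sample payload are hypothetical; only the field names come from the description above.

```python
from typing import Any


def top_urls(search_result: dict[str, Any], n: int = 3) -> list[str]:
    """Return the URLs of the first n entries of a json-format search result.

    Assumes `results` is already ranked, as the tool description states.
    """
    results = search_result.get("results", [])
    return [item["url"] for item in results[:n]]


# Illustrative payload shaped like the documented json return value.
# The titles, URLs, and scores are made up for this example.
sample_search = {
    "results": [
        {"title": "Example A", "url": "https://example.com/a",
         "snippet": "first snippet", "engines": ["duckduckgo"], "score": 1.0},
        {"title": "Example B", "url": "https://example.com/b",
         "snippet": "second snippet", "engines": ["mojeek"], "score": 0.8},
        {"title": "Example C", "url": "https://example.com/c",
         "snippet": "third snippet", "engines": ["googlenews"], "score": 0.5},
    ],
    "engines": ["duckduckgo", "mojeek", "googlenews"],
    "cached": False,
}

print(top_urls(sample_search, 2))
```

The returned list is what a follow-up `fetch_batch` call would take as `urls`.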
| fetch | Fetch one URL and return reader-mode Markdown of the main content. Best for:
- You already have a URL (from `search`, the user, or your own knowledge)
and need the actual page text.
- Verifying a single claim by reading the source.
- Pages that need reader-mode cleanup (nav/footer/scripts stripped).
Not recommended for:
- Multiple URLs at once -> use `fetch_batch` (concurrent, one round-trip).
- "Search then read top N" -> use `research` (one call, not two).
- PDF/DOCX URLs -> use `read_doc` (proper binary parsing).
- You don't have a URL yet -> use `search` first.
Returns:
- markdown (default): a small header (URL, render method, token count)
plus the cleaned page body.
- json: {url, title, content, method, truncated, tokens_estimated,
author, published_date, sitename}.
Common mistakes:
- Passing a search query instead of a URL.
- Using `render="http"` on a JS-only SPA — it returns near-empty content;
use "auto" (default) or "browser".
- Forgetting that results are cached 7 days — use `force_refresh=True`
or `max_age_hours=0` for a fresh pull.
Args:
url: Absolute http(s) URL.
render: "auto" (try HTTP, fall back to stealth Chromium), "http"
(fast, fails on JS), "browser" (slow, robust).
force_refresh: Bypass the page cache entirely.
max_age_hours: Treat cached pages older than this as a miss. 0 = same
as force_refresh. None = server default TTL (7 days).
format: "markdown" or "json".
|
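The interaction between `force_refresh`, `max_age_hours`, and the 7-day default TTL described for `fetch` can be modelled as a small decision function. This is a sketch of the documented rules, not the server's actual code. The function name `cache_hit` is hypothetical, and treating an entry exactly at the age limit as a hit is an assumption.

```python
from typing import Optional

DEFAULT_TTL_HOURS = 7 * 24  # "server default TTL (7 days)" from the Args above


def cache_hit(age_hours: float,
              max_age_hours: Optional[float] = None,
              force_refresh: bool = False) -> bool:
    """Return True if a cached page of the given age would be reused."""
    if force_refresh:
        return False  # bypass the page cache entirely
    limit = DEFAULT_TTL_HOURS if max_age_hours is None else max_age_hours
    if limit == 0:
        return False  # 0 behaves like force_refresh
    # Entries older than the limit are a miss (boundary handling is assumed).
    return age_hours <= limit


print(cache_hit(age_hours=30))                    # within the default TTL
print(cache_hit(age_hours=30, max_age_hours=24))  # older than the caller's limit
```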
| fetch_batch | Fetch a list of URLs in parallel. Per-URL failures do not raise. Best for:
- 2+ URLs you want to read in one round-trip.
- Reading the top N results of a previous `search` call.
Not recommended for:
- A single URL -> `fetch` (no list-wrapping overhead).
- "Search and then read" -> `research` collapses both into one tool call.
- PDFs/DOCX -> `read_doc` per file.
Returns:
- markdown (default): each page rendered as a Markdown section, separated
by horizontal rules; failed URLs become inline error notes.
- json: list[dict], one entry per URL, with `error` set on failures.
Common mistakes:
- Passing a single URL inside a 1-element list — use `fetch` directly.
- Expecting a failed URL to raise an exception. Per-URL failures do not raise, so
check each item's `error` field instead.
Args:
urls: List of absolute http(s) URLs.
render: Same as `fetch`.
format: "markdown" or "json".
|
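Because `fetch_batch` reports failures per item rather than raising, a caller has to split the json result itself. A minimal sketch follows, assuming each entry carries a `url` key alongside the documented `error` field. The helper name `split_batch` and the sample entries are hypothetical.

```python
from typing import Any


def split_batch(items: list[dict[str, Any]]) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Separate successful entries from failed ones using the `error` field."""
    ok = [item for item in items if not item.get("error")]
    failed = [item for item in items if item.get("error")]
    return ok, failed


# Illustrative json-format result: one success, one failure.
sample_batch = [
    {"url": "https://example.com/a", "content": "page body"},
    {"url": "https://example.com/missing", "error": "HTTP 404"},
]

ok_items, failed_items = split_batch(sample_batch)
for item in failed_items:
    print("failed:", item["url"], "->", item["error"])
```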
| read_doc | Read an http(s) document (or a sandboxed local file) into Markdown. Best for:
- Remote PDFs and DOCX from an http(s) URL (parsed locally, no remote API).
- Local PDF/DOCX/text/Markdown files — ONLY when local reads are enabled
(see Security below).
- Paginating through a long document via `start` / `length`.
Not recommended for:
- Arbitrary HTML web pages -> `fetch` does reader-mode cleanup that this
tool does not.
- Pages discovered through search -> `fetch` or `research`.
Security (local files are sandboxed and OFF by default):
- Local-file reads are DISABLED unless the server operator sets the
SEARCH_MCP_DOCUMENT_ROOT env var to a directory. With it unset, a local
path raises a "local file reads are disabled" error — pass an http(s)
URL instead, or ask the operator to enable the sandbox.
- When enabled, `source` must resolve INSIDE that root; relative paths
resolve against the root (not the process CWD) and any `..` traversal
that escapes the root is rejected. `file://` URLs are always rejected.
- Remote http(s) sources are unaffected by this setting.
Returns:
- markdown (default): rendered document text with a small header.
- json: {content, title, format, total_chars, start, returned_chars,
truncated}. Use `total_chars` and `returned_chars` to drive pagination.
Common mistakes:
- Calling this on a normal article URL — you'll get raw HTML noise; use
`fetch` instead.
- Forgetting to advance `start` when paginating: next call should pass
`start = previous_start + returned_chars`.
- Passing a negative `length` (raises an error) or a `start` past the end
(clamped to EOF: you'll get `returned_chars == 0`, `start == total_chars`,
and `truncated == False` — that's the signal you've paged off the end).
Args:
source: http(s) URL, or a local path UNDER SEARCH_MCP_DOCUMENT_ROOT when
local reads are enabled (disabled by default — see Security).
start: Character offset to begin reading from. Default 0. Clamped into
[0, total_chars]; a negative value is treated as 0.
length: Max characters to return; None = read to end (still capped by
the per-call max content size). Must be >= 0 — a negative length
is rejected with a ValueError.
format: "markdown" or "json".
|
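The pagination rule for `read_doc` (`start = previous_start + returned_chars`, with a `start` past the end clamped to EOF) can be sketched as a loop. `fake_read_doc` below is a hypothetical in-memory stand-in that mimics the documented clamping, not the real tool. Reading `truncated == True` as "more text remains after this page" is an assumption.

```python
from typing import Any, Callable, Optional


def fake_read_doc(text: str, start: int = 0, length: Optional[int] = None) -> dict[str, Any]:
    """Hypothetical stand-in for read_doc's json output over an in-memory string."""
    if length is not None and length < 0:
        raise ValueError("length must be >= 0")
    total = len(text)
    start = max(0, min(start, total))  # clamp into [0, total_chars]
    end = total if length is None else min(total, start + length)
    chunk = text[start:end]
    return {"content": chunk, "total_chars": total, "start": start,
            "returned_chars": len(chunk), "truncated": end < total}


def read_all(read_page: Callable[[int, int], dict[str, Any]], page_size: int) -> str:
    """Page through a document, advancing start by returned_chars each call."""
    parts: list[str] = []
    start = 0
    while True:
        page = read_page(start, page_size)
        parts.append(page["content"])
        start = page["start"] + page["returned_chars"]
        # Stop when nothing came back or nothing remains after this page.
        if page["returned_chars"] == 0 or not page["truncated"]:
            break
    return "".join(parts)


document = "abcdefghij"
print(read_all(lambda s, n: fake_read_doc(document, s, n), page_size=4))
```

In real use, `read_page` would wrap a `read_doc` call with `format="json"` for a fixed `source`.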
| research | One-shot research: search the web, fetch the top results, return both. Best for:
- Open-ended questions that need finding sources AND reading them
("what's new with X", "summarize the controversy around Y").
- Replacing a `search` + N x `fetch` chain with one call.
- Producing a citable brief with [n]-style source references.
Not recommended for:
- You only need links -> `search` (cheaper, no fetching).
- You only need to read one URL you already have -> `fetch`.
- You want to query previously-fetched cached pages -> `cache_search`.
Returns:
- markdown (default): a "Research brief" with a Sources index then the
full Markdown body of each fetched document, separated by horizontal
rules; includes a token estimate.
- json: {question, engines, sources:[{rank,title,url,snippet,...}],
documents:[...], tokens_estimated, errors}.
Common mistakes:
- Using `depth=8` for a quick lookup — that's 8 page fetches; 2-3 is
almost always enough.
- Calling `research` for a known URL — that's `fetch` territory.
- Forgetting that `fetch=False` returns sources only (much cheaper if
the LLM only needs to pick which one to read).
Args:
question: What you want to know, in natural language.
depth: How many top results to fetch (1-8). 3 is a good default.
engines: Override the engine set (see `engines()` for names).
fetch: If False, return source list without reading them.
use_cache: Reuse cached search/page data within TTL.
max_age_hours: Treat cached search results AND cached page bodies older
than this as a read miss; fresh data is always written back. 0 =
force-refresh both the engine search and every fetched page body;
None = server default TTL (7 days). A non-zero value is honored for
both halves (it used to be ignored for anything but 0).
format: "markdown" or "json".
|
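The json output of `research` lists `sources` with a `rank`, which is what [n]-style references point at. A minimal sketch of turning that list into a citation index follows. The helper name `source_index` and the sample brief are hypothetical, and using `rank` directly as the citation number is an assumption.

```python
from typing import Any


def source_index(brief: dict[str, Any]) -> list[str]:
    """Format each source as "[rank] title <url>", ordered by rank."""
    sources = sorted(brief.get("sources", []), key=lambda s: s["rank"])
    return [f"[{s['rank']}] {s['title']} <{s['url']}>" for s in sources]


# Illustrative json-format brief; titles and URLs are made up.
sample_brief = {
    "question": "what changed in release X",
    "sources": [
        {"rank": 2, "title": "Changelog", "url": "https://example.com/changelog", "snippet": "notes"},
        {"rank": 1, "title": "Release post", "url": "https://example.com/release", "snippet": "summary"},
    ],
    "documents": [],
    "errors": {},
}

for line in source_index(sample_brief):
    print(line)
```

This works equally well on a `fetch=False` result, where only the source list is returned.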
| cache_search | Full-text search over pages already fetched into the local SQLite FTS5 index. Best for:
- Recalling something the user/agent fetched earlier in the conversation
("what did that Wikipedia page say about X").
- Avoiding re-fetching content already in the local cache.
- Quick keyword grep across the corpus you've built up.
Not recommended for:
- Discovering new pages on the open web -> use `search` or `research`.
- When the cache is empty (fresh install) -> `search`/`research` first to
populate it.
Returns:
- markdown (default): a per-hit list of title, URL, and a
`[bracket]`-highlighted snippet around the matched terms.
- json: list of {url, title, snippet}.
Common mistakes:
- Treating this like web search — it ONLY hits pages already in the local
cache. If the user hasn't fetched anything, you'll get zero hits.
- Using natural-language phrases without quoting them; FTS5 treats
whitespace-separated terms as an implicit AND. For an exact phrase use `"like this"`.
Args:
query: FTS5 query. Bare terms = AND. Supports OR / NOT, prefix
(`term*`), and phrase (`"exact phrase"`).
limit: Max hits to return.
format: "markdown" or "json".
|
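Since bare terms in a `cache_search` query are ANDed and exact phrases need double quotes, it can help to build queries with small helpers. The sketch below only constructs query strings. The helper names are hypothetical, and doubling embedded double quotes follows the SQL-style escaping that FTS5 uses inside quoted strings.

```python
def fts5_phrase(text: str) -> str:
    """Quote text as a single FTS5 phrase, doubling any embedded double quotes."""
    return '"' + text.replace('"', '""') + '"'


def fts5_any(terms: list[str]) -> str:
    """Match any of the given phrases by joining them with OR."""
    return " OR ".join(fts5_phrase(term) for term in terms)


print(fts5_phrase("like this"))
print(fts5_any(["cache ttl", "sqlite"]))
```

Passing user-supplied text through `fts5_phrase` also keeps words such as `OR` or `NOT` from being read as operators.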
| engines | List engine names accepted by the engines= parameter of search / research. Best for:
- Discovering which engines are available before passing a non-default engine.
- Building user-facing UIs that let humans pick engines.
Not recommended for:
- Calling on every search — the list is static; cache it.
Returns:
- A list of engine name strings (e.g. ["duckduckgo", "mojeek",
"googlenews", "startpage", "brave", "bing", "baidu"]).
Common mistakes:
- Passing one of these names as a query to `search` — they go in the
`engines=` argument, not `query`.
Defaults: duckduckgo + mojeek + googlenews (all reliable, no captchas;
googlenews is an RSS index with structured publish dates).
Opt-in: startpage (browser-rendered, slower), brave (PoW captcha after a
few calls), bing (UA-gated), baidu (results wrapped in
baidu.com/link redirects).
|
| compare | Fetch 2-5 URLs concurrently and return per-URL excerpts so the LLM can
compare them against a single question in one round trip. Best for:
- Side-by-side product/feature/article comparisons.
- "Compare X to Y" or "How does A differ from B" queries.
- Triangulating a fact across multiple sources.
Not recommended for:
- >5 URLs -> use `fetch_batch`.
- 1 URL -> use `fetch`.
- Don't have URLs yet -> use `search` or `research` first.
Returns:
- markdown (default): a comparison brief with per-URL sections, each
containing title, sitename, published date, and a smart-truncated excerpt.
- json: {question, urls, excerpts:[{url, title, excerpt, ...}],
tokens_estimated}.
Common mistakes:
- Asking `compare` to actually answer the question; it only returns material,
and the LLM does the comparison.
- Passing >5 URLs and expecting them all to fit in context — use
`fetch_batch` for bulk reads.
Args:
question: The comparison question the LLM will answer using the
returned excerpts.
urls: 2-5 absolute http(s) URLs.
format: "markdown" (default) or "json".
|
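The URL-count guidance for `compare` (1 URL goes to `fetch`, 2-5 to `compare`, more than 5 to `fetch_batch`, none yet to `search`) can be written as a small router. This is a sketch for the case where a comparison is the goal. The function names are hypothetical, and the URL check is a simplified reading of "absolute http(s) URL".

```python
from urllib.parse import urlparse


def is_absolute_http_url(url: str) -> bool:
    """True for URLs with an http or https scheme and a host part."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def pick_compare_tool(urls: list[str]) -> str:
    """Name the tool the guidance above suggests for this many URLs."""
    bad = [u for u in urls if not is_absolute_http_url(u)]
    if bad:
        raise ValueError(f"not absolute http(s) URLs: {bad}")
    if not urls:
        return "search"       # no URLs yet: discover some first
    if len(urls) == 1:
        return "fetch"
    if len(urls) <= 5:
        return "compare"
    return "fetch_batch"      # more than 5: bulk read instead


print(pick_compare_tool(["https://example.com/a", "https://example.com/b"]))
```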
| extract_structured | Pull JSON-LD, OpenGraph, Twitter cards, and microdata from a web page. Best for:
- Product pages (price, currency, availability, brand, rating).
- Article pages (author, publish date, image, headline).
- Recipe / event / video pages where rich metadata IS the answer.
- Cases where `fetch` returns prose but you need fields.
Not recommended for:
- Just reading a page -> use `fetch`.
- PDFs / DOCX -> use `read_doc`.
- Pages that don't publish schema.org metadata (most blogs) — you'll get
empty lists; fall back to `fetch`.
Returns:
- markdown (default): a flattened key/value view with each block printed
as a JSON code block under its syntax heading.
- json: {url, json_ld:[], microdata:[], opengraph:[], rdfa:[]}. Twitter
card meta tags are surfaced inside the `opengraph` list.
Common mistakes:
- Calling on every URL "just in case" — most sites have no structured
data, and `fetch` is what you actually want.
Args:
url: Absolute http(s) URL.
format: "markdown" (default) or "json".
|
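When `extract_structured` is used for product or article fields, the caller usually wants one block of a given schema.org type out of the `json_ld` list. A minimal sketch follows, assuming each `json_ld` entry is a plain dict with an `@type` key. The helper name and the sample payload are hypothetical, and nested `@graph` containers are not handled.

```python
from typing import Any, Optional


def first_of_type(json_ld: list[dict[str, Any]], schema_type: str) -> Optional[dict[str, Any]]:
    """Return the first JSON-LD block whose @type matches, or None."""
    for block in json_ld:
        types = block.get("@type", [])
        if isinstance(types, str):
            types = [types]  # @type may be a single string or a list
        if schema_type in types:
            return block
    return None


# Illustrative json-format result; the shop URL and values are made up.
sample_structured = {
    "url": "https://shop.example/widget",
    "json_ld": [
        {"@type": "BreadcrumbList"},
        {"@type": "Product", "name": "Widget",
         "offers": {"@type": "Offer", "price": "19.99", "priceCurrency": "USD"}},
    ],
    "microdata": [],
    "opengraph": [],
    "rdfa": [],
}

product = first_of_type(sample_structured["json_ld"], "Product")
if product is not None:
    print(product["name"], product["offers"]["price"], product["offers"]["priceCurrency"])
```

A `None` result is the cue to fall back to `fetch`, as the row above recommends for pages without structured data.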