<!DOCTYPE html>
<html lang="en">
<head>
<title>mcp-server-webcrawl | MCP server for web crawlers</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="description" content="AI search and retrieval for web crawlers" />
<link rel="shortcut icon" href="../media/static/images/mcp-server-webcrawl/favicon.b04adb6828.png"/>
<link type="text/css" rel="stylesheet" href="../media/static/styles/css/mcp.min.b04adb6828.css" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<meta property="og:image" content="https://pragmar.com/media/static/images/mcp-server-webcrawl/og-mcp-server-webcrawl.png?202505251919" />
<meta property="og:description" content="AI search and retrieval for web crawlers" />
<meta property="og:title" content="mcp-server-webcrawl | MCP server for web crawlers" />
<meta name="twitter:card" content="summary" />
<script>var _SiteOneUrlDepth = 2;</script></head>
<body>
<header>
<div>
<div class="constrain header__wrap">
<nav class="links">
<a href="../index.html">Home</a>
<a href="../mcp-server-webcrawl/help/index.html">Help</a>
<a href="https://github.com/pragmar/mcp-server-webcrawl">GitHub</a>
</nav>
<h1 class="header__main">
<a href="../mcp-server-webcrawl/index.html">
<span class="accessible">mcp-server-webcrawl</span>
<img src="../media/static/images/mcp-server-webcrawl/mcpswc.b04adb6828.svg" alt="mcp-server-webcrawl logo and visual metaphors alluding to DC adapter interchange"/>
</a>
</h1>
</div>
</div>
</header>
<main>
<div class="constrain">
<h2>AI Search and Retrieval for Web Crawlers</h2>
<div class="summary">
<div>
<p>
Bridge the gap between your web crawl and AI language models using the
Model Context Protocol (MCP). With <strong>mcp-server-webcrawl</strong>,
your AI client filters and analyzes your crawled web content, under your direction or autonomously,
to extract insights.
</p>
<p>
Support for
<a href="https://en.wikipedia.org/wiki/WARC_(file_format)">WARC</a>,
<a href="https://en.wikipedia.org/wiki/Wget">wget</a>,
<a href="https://interro.bot/">InterroBot</a>,
<a href="https://github.com/projectdiscovery/katana">Katana</a>, and
<a href="https://crawler.siteone.io/">SiteOne</a> crawlers
is available out of the box. The server includes a full-text search interface
with boolean support, plus resource filtering by type,
HTTP status, and more. <strong>mcp-server-webcrawl</strong> provides the LLM a complete menu
with which to search your web content.
</p>
<p>
<strong>mcp-server-webcrawl is free</strong> and <a href="https://github.com/pragmar/mcp-server-webcrawl">open
source</a>. It requires <a href="https://claude.ai/download">Claude Desktop</a> and
<a href="https://www.python.org/downloads/">Python</a> (&gt;=3.10). Install it from the
command line with pip:
</p>
<pre class="summary__pip">pip install mcp-server-webcrawl</pre>
</div>
<div class="video__wrap">
<video src="../media/static/images/mcp-server-webcrawl/mcpdemo.mp4"
poster="/media/static/images/mcp-server-webcrawl/mcpdemo.png"
autoplay loop muted playsinline
aria-label="MCP demo video (autoplay, no audio) showcasing MCP setup using mcp-server-webcrawl">Your browser does not support the video tag.</video>
</div>
</div>
<h2>Main Features</h2>
<div class="features">
<div class="features__list">
<ul>
<li>Claude Desktop ready</li>
<li>Multi-crawler compatible</li>
<li>Filter by type, status, and more</li>
</ul>
</div>
<div class="features__list">
<ul>
<li>Boolean search support</li>
<li>Support for Markdown and snippets</li>
<li>Roll your own website knowledgebase</li>
</ul>
</div>
</div>
<h2>Getting Started</h2>
<div class="summary alternate">
<div class="tabbed__selection">
<p>
Setup videos are available for each supported crawler, showing how
to connect your crawl data to your LLM.
</p>
<p>
<!--
In addition to your preferred web crawler,
<strong>mcp-server-webcrawl</strong> requires
<a href="https://python.org/">Python</a> and
<a href="https://python.org/">Claude Desktop</a> (or other
MCP-supporting client).
-->
If you prefer text to video, step-by-step guides are available within the
<strong>mcp-server-webcrawl</strong>
<a href="https://pragmar.github.io/mcp-server-webcrawl/guides.html">documentation</a>.
</p>
</div>
<div class="tabbed">
<input type="radio" name="videos" id="radioVideosWget" class="tabbed__radio" checked>
<label for="radioVideosWget" class="tabbed__label">wget</label>
<input type="radio" name="videos" id="radioVideosWarc" class="tabbed__radio">
<label for="radioVideosWarc" class="tabbed__label">WARC</label>
<input type="radio" name="videos" id="radioVideosInterrobot" class="tabbed__radio">
<label for="radioVideosInterrobot" class="tabbed__label">InterroBot</label>
<input type="radio" name="videos" id="radioVideosKatana" class="tabbed__radio">
<label for="radioVideosKatana" class="tabbed__label">Katana</label>
<input type="radio" name="videos" id="radioVideosSiteone" class="tabbed__radio">
<label for="radioVideosSiteone" class="tabbed__label">SiteOne</label>
<div class="tabbed__content">
<div id="videosWget">
<iframe loading="lazy" width="100%" height="100%" src="https://www.youtube.com/embed/uqEEqVsofhc" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div>
<div id="videosWarc">
<iframe loading="lazy" width="100%" height="100%" src="https://www.youtube.com/embed/fx-4WZu-UT8" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div>
<div id="videosInterrobot">
<iframe loading="lazy" width="100%" height="100%" src="https://www.youtube.com/embed/55y8oKWXJLs" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div>
<div id="videosKatana">
<iframe loading="lazy" width="100%" height="100%" src="https://www.youtube.com/embed/sOMaojm0R0Y" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div>
<div id="videosSiteone">
<iframe loading="lazy" width="100%" height="100%" src="https://www.youtube.com/embed/JOGRYbo6WwI" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div>
</div>
</div>
</div>
<h2>MCP Configuration</h2>
<div class="summary">
<div class="tabbed">
<input type="radio" name="crawler" id="radioConfWget" class="tabbed__radio" checked>
<label for="radioConfWget" class="tabbed__label">wget</label>
<input type="radio" name="crawler" id="radioConfWarc" class="tabbed__radio">
<label for="radioConfWarc" class="tabbed__label">WARC</label>
<input type="radio" name="crawler" id="radioConfInterrobot" class="tabbed__radio">
<label for="radioConfInterrobot" class="tabbed__label">InterroBot</label>
<input type="radio" name="crawler" id="radioConfKatana" class="tabbed__radio">
<label for="radioConfKatana" class="tabbed__label">Katana</label>
<input type="radio" name="crawler" id="radioConfSiteone" class="tabbed__radio">
<label for="radioConfSiteone" class="tabbed__label">SiteOne</label>
<div class="tabbed__content">
<div id="confWget">
<pre><span class="comment"># Windows: command set to "mcp-server-webcrawl"</span>
<span class="comment"># macOS: command set to absolute path, i.e.</span>
<span class="comment"># the value of $ which mcp-server-webcrawl</span>
{
"mcpServers": {
"webcrawl": {
"command": "/path/to/mcp-server-webcrawl",
"args": ["--crawler", "wget", "--datasrc",
"/path/to/wget/archives/"]
}
}
}
<span class="comment"># tested configurations (macOS Terminal/Windows WSL)</span>
<span class="comment"># from /path/to/wget/archives/ as current working directory</span>
<span class="comment"># --adjust-extension for file extensions, e.g. *.html</span>
<span>$ wget --mirror https://example.com</span>
<span>$ wget --mirror https://example.com --adjust-extension</span></pre>
</div>
<div id="confWarc">
<pre><span class="comment"># Windows: command set to "mcp-server-webcrawl"</span>
<span class="comment"># macOS: command set to absolute path, i.e.</span>
<span class="comment"># the value of $ which mcp-server-webcrawl</span>
{
"mcpServers": {
"webcrawl": {
"command": "/path/to/mcp-server-webcrawl",
"args": ["--crawler", "warc", "--datasrc",
"/path/to/warc/archives/"]
}
}
}
<span class="comment"># tested configurations (macOS Terminal/Windows WSL)</span>
<span class="comment"># from /path/to/warc/archives/ as current working directory</span>
<span>$ wget --warc-file=example --recursive https://example.com</span>
<span>$ wget --warc-file=example --recursive --page-requisites https://example.com</span></pre>
</div>
<div id="confInterrobot">
<pre><span class="comment"># Windows: command set to "mcp-server-webcrawl"</span>
<span class="comment"># macOS: command set to absolute path, i.e.</span>
<span class="comment"># the value of $ which mcp-server-webcrawl</span>
{
"mcpServers": {
"webcrawl": {
"command": "/path/to/mcp-server-webcrawl",
"args": ["--crawler", "interrobot", "--datasrc",
"[homedir]/Documents/InterroBot/interrobot.v2.db"]
}
}
}
<span class="comment"># crawls executed in InterroBot (windowed)</span>
<span class="comment"># Windows: replace [homedir] with /Users/...</span>
<span class="comment"># macOS: path provided on InterroBot settings page</span></pre>
</div>
<div id="confKatana">
<pre>
<span class="comment"># Windows: command set to "mcp-server-webcrawl"</span>
<span class="comment"># macOS: command set to absolute path, i.e.</span>
<span class="comment"># the value of $ which mcp-server-webcrawl</span>
{
"mcpServers": {
"webcrawl": {
"command": "/path/to/mcp-server-webcrawl",
"args": ["--crawler", "katana", "--datasrc",
"/path/to/katana/crawls/"]
}
}
}
<span class="comment"># tested configurations (macOS Terminal/PowerShell/WSL)</span>
<span class="comment"># -store-response to save crawl contents</span>
<span class="comment"># -store-response-dir allows for expansion of hosts</span>
<span class="comment"># consistent with default Katana behavior to </span>
<span class="comment"># spread assets across host directories</span>
<span>$ katana -u https://example.com -store-response -store-response-dir /path/to/katana/crawls/example.com/</span></pre>
</div>
<div id="confSiteone">
<pre><span class="comment"># Windows: command set to "mcp-server-webcrawl"</span>
<span class="comment"># macOS: command set to absolute path, i.e.</span>
<span class="comment"># the value of $ which mcp-server-webcrawl</span>
{
"mcpServers": {
"webcrawl": {
"command": "/path/to/mcp-server-webcrawl",
"args": ["--crawler", "siteone", "--datasrc",
"/path/to/siteone/archives/"]
}
}
}
<span class="comment"># crawls executed in SiteOne (windowed)</span>
<span class="comment"># *Generate offline website* must be checked</span></pre>
</div>
</div>
</div>
<div class="configuration__selection">
<p>
In Claude Desktop's developer settings, locate the MCP configuration file.
Open it in a text editor, add the example for your crawler, and
change the datasrc path to point at your crawl data.
</p>
<p>
To search more than one crawl, add further <strong>mcp-server-webcrawl</strong> connections under mcpServers,
each with its own name.
</p>
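<p>
As a rough sketch of the above, two connections might sit side by side as follows.
The names "webcrawl-wget" and "webcrawl-warc" are illustrative assumptions, not names
the server requires, and both paths are placeholders to replace with your own.
</p>
<pre><span class="comment"># hypothetical connection names; placeholder paths</span>
{
  "mcpServers": {
    "webcrawl-wget": {
      "command": "/path/to/mcp-server-webcrawl",
      "args": ["--crawler", "wget", "--datasrc",
        "/path/to/wget/archives/"]
    },
    "webcrawl-warc": {
      "command": "/path/to/mcp-server-webcrawl",
      "args": ["--crawler", "warc", "--datasrc",
        "/path/to/warc/archives/"]
    }
  }
}</pre>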
<p>
For additional technical information, including crawler feature support, be sure to check out <a href="../mcp-server-webcrawl/help/index.html">help</a>.
</p>
</div>
</div>
<div class="tabbed__visualization">
<img src="../media/static/images/mcp-server-webcrawl/netwww.b04adb6828.svg" alt="Abstraction of LLM clients (Claude and OpenAI) communicating with a website archive"/>
</div>
</div>
</main>
<footer>
<nav class="pragmar">
<div class="pragmar__also">Software by <a href="../index.html">Pragmar</a></div>
<div class="pragmar__products__wrap">
<div class="pragmar__products">
<a class="pragmar__product interrobot" href="https://interro.bot/?utm_source=pragmar.com">
<img src="../media/static/images/home/interrobot.b04adb6828.png" alt="InterroBot icon"/>
<div><strong>InterroBot</strong>.
Web crawler and analyzer. Free/paid.</div>
</a>
<a class="pragmar__product appstat" href="../appstat/index.html">
<img src="../media/static/images/home/appstat.b04adb6828.png" alt="appstat icon"/>
<div><strong>appstat</strong>.
Windows process monitor. Free.</div>
</a>
<a class="pragmar__product moffitor" href="../moffitor/index.html">
<img src="../media/static/images/home/moffitor.b04adb6828.png" alt="Moffitor icon"/>
<div><strong>Moffitor</strong>.
One-click monitor sleep. Free.</div>
</a>
<a class="pragmar__product qbit" href="../qbit/index.html">
<img src="../media/static/images/home/qbit.b04adb6828.png" alt="Qbit icon"/>
<div><strong>Qbit</strong>.
Skybox generator for game devs. Free/paid.</div>
</a>
</div>
</div>
</nav>
</footer>
<script src="../media/static/scripts/js/main.min.b04adb6828.js"></script>
</body>
</html>