Glama

mcp-server-webcrawl

index.html (16.5 kB)
<!DOCTYPE html> <html lang="en"> <!-- Mirrored from pragmar.com/mcp-server-webcrawl/ by HTTrack Website Copier/3.x [XR&CO'2014], Wed, 27 Aug 2025 18:49:25 GMT --> <!-- Added by HTTrack --><meta http-equiv="content-type" content="text/html;charset=utf-8" /><!-- /Added by HTTrack --> <head> <title>mcp-server-webcrawl | MCP server for web crawlers</title> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> <meta name="description" content="AI search and retrieval for web crawlers" /> <link rel="shortcut icon" href="../media/static/images/mcp-server-webcrawl/favicone939.png?202508252215"/> <link type="text/css" rel="stylesheet" href="../media/static/styles/css/mcp.mine939.css?202508252215" /> <meta name="viewport" content="width=device-width, initial-scale=1.0" /> <meta name="og:image" content="../media/static/images/mcp-server-webcrawl/og-mcp-server-webcrawle939.png?202508252215" /> <meta name="og:description" content="AI search and retrieval for web crawlers" /> <meta name="og:title" content="mcp-server-webcrawl | MCP server for web crawlers" /> <meta name="twitter:card" content="summary" /> </head> <body> <header> <div> <div class="constrain header__wrap"> <nav class="links"> <a href="../index.html">Home</a> <a href="help/index.html">Help</a> <a href="https://github.com/pragmar/mcp-server-webcrawl">Github</a> </nav> <h1 class="header__main"> <a href="index.html"> <span class="accessible">mcp-server-webcrawl</span> <img src="../media/static/images/mcp-server-webcrawl/mcpswce939.svg?202508252215" alt="mcp-server-webcrawl logo and visual metaphors alluding to DC adapter interchange"/> </a> </h1> </div> </div> </header> <main> <div class="constrain"> <h2>AI Search and Retrieval for Web Crawlers</h2> <div class="summary"> <div class="video__wrap"> <video src="../media/static/images/mcp-server-webcrawl/what-is-mcp-server-webcrawl.mp4" poster="../media/static/images/mcp-server-webcrawl/what-is-mcp-server-webcrawl.webp" autoplay muted controls 
aria-label="MCP demo video (autoplay) showcasing MCP server setup usage with mcp-server-webcrawl"> <track kind="subtitles" src="../media/static/images/mcp-server-webcrawl/what-is-mcp-server-webcrawl.vtt" srclang="en" label="English subtitles" default> Your browser does not support the video tag. </video> </div> <div> <p> With <strong>mcp-server-webcrawl</strong>, your AI client filters and analyzes web content under your direction or autonomously. </p> <p> Support for multiple crawlers, including <a href="https://en.wikipedia.org/wiki/WARC_(file_format)">WARC</a>, <a href="https://en.wikipedia.org/wiki/Wget">wget</a>, <a href="https://interro.bot/">InterroBot</a>, <a href="https://github.com/projectdiscovery/katana">Katana</a>, and <a href="https://crawler.siteone.io/">SiteOne</a> is baked in. </p> <p>The server includes a full-text search interface with boolean support, filtering by type, HTTP status, and more. </p> </div> </div> <h2>Main Features</h2> <div class="features"> <div class="features__list"> <ul> <li>Claude Desktop ready</li> <li>Multi-crawler compatible</li> <li>Filter by type, status, and more</li> </ul> </div> <div class="features__list"> <ul> <li>Boolean search support</li> <li>Support for Markdown and snippets</li> <li>Roll your own website knowledgebase</li> </ul> </div> </div> <h2>Getting Started</h2> <div class="summary alternate"> <div class="tabbed__selection"> <p> <strong>mcp-server-webcrawl</strong> is free and <a href="https://github.com/pragmar/mcp-server-webcrawl">open source</a>, and requires <a href="https://claude.ai/download">Claude Desktop</a> and <a href="https://www.python.org/downloads/">Python</a> (>=3.10). It is installed on the command line, via pip install: <pre class="summary__pip">pip install mcp-server-webcrawl</pre> </p> <p> Setup videos are available for each supported crawler, showing how to connect your crawl data to your LLM. 
</p> <p> If you prefer text, step-by-step guides are available in the <a href="https://pragmar.github.io/mcp-server-webcrawl/guides.html">docs</a>. </p> </div> <div class="tabbed"> <input type="radio" name="videos" id="radioVideosWget" class="tabbed__radio" checked> <label for="radioVideosWget" class="tabbed__label">wget</label> <input type="radio" name="videos" id="radioVideosWarc" class="tabbed__radio"> <label for="radioVideosWarc" class="tabbed__label">WARC</label> <input type="radio" name="videos" id="radioVideosInterrobot" class="tabbed__radio"> <label for="radioVideosInterrobot" class="tabbed__label">InterroBot</label> <input type="radio" name="videos" id="radioVideosKatana" class="tabbed__radio"> <label for="radioVideosKatana" class="tabbed__label">Katana</label> <input type="radio" name="videos" id="radioVideosSiteone" class="tabbed__radio"> <label for="radioVideosSiteone" class="tabbed__label">SiteOne</label> <div class="tabbed__content"> <div id="videosWget"> <iframe loading="lazy" width="100%" height="100%" src="https://www.youtube.com/embed/uqEEqVsofhc" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe> </div> <div id="videosWarc"> <iframe loading="lazy" width="100%" height="100%" src="https://www.youtube.com/embed/fx-4WZu-UT8" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe> </div> <div id="videosInterrobot"> <iframe loading="lazy" width="100%" height="100%" src="https://www.youtube.com/embed/55y8oKWXJLs" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" 
referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe> </div> <div id="videosKatana"> <iframe loading="lazy" width="100%" height="100%" src="https://www.youtube.com/embed/sOMaojm0R0Y" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe> </div> <div id="videosSiteone"> <iframe loading="lazy" width="100%" height="100%" src="https://www.youtube.com/embed/JOGRYbo6WwI" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe> </div> </div> </div> </div> <h2>MCP Configuration</h2> <div class="summary"> <div class="tabbed"> <input type="radio" name="crawler" id="radioConfWget" class="tabbed__radio" checked> <label for="radioConfWget" class="tabbed__label">wget</label> <input type="radio" name="crawler" id="radioConfWarc" class="tabbed__radio"> <label for="radioConfWarc" class="tabbed__label">WARC</label> <input type="radio" name="crawler" id="radioConfInterrobot" class="tabbed__radio"> <label for="radioConfInterrobot" class="tabbed__label">InterroBot</label> <input type="radio" name="crawler" id="radioConfKatana" class="tabbed__radio"> <label for="radioConfKatana" class="tabbed__label">Katana</label> <input type="radio" name="crawler" id="radioConfSiteone" class="tabbed__radio"> <label for="radioConfSiteone" class="tabbed__label">SiteOne</label> <div class="tabbed__content"> <div id="confWget"> <pre><span class="comment"># Windows: command set to "mcp-server-webcrawl"</span> <span class="comment"># macOS: command set to absolute path, i.e.</span> <span class="comment"># the value of $ which mcp-server-webcrawl</span> { "mcpServers": { "webcrawl": { "command": "/path/to/mcp-server-webcrawl", "args": ["--crawler", 
"wget", "--datasrc", "/path/to/wget/archives/"] } } } <span class="comment"># tested configurations (macOS Terminal/Windows WSL)</span> <span class="comment"># from /path/to/wget/archives/ as current working directory</span> <span class="comment"># --adjust-extension for file extensions, e.g. *.html</span> <span>$ wget --mirror https://example.com</span> <span>$ wget --mirror https://example.com --adjust-extension</span></pre> </div> <div id="confWarc"> <pre><span class="comment"># Windows: command set to "mcp-server-webcrawl"</span> <span class="comment"># macOS: command set to absolute path, i.e.</span> <span class="comment"># the value of $ which mcp-server-webcrawl</span> { "mcpServers": { "webcrawl": { "command": "/path/to/mcp-server-webcrawl", "args": ["--crawler", "warc", "--datasrc", "/path/to/warc/archives/"] } } } <span class="comment"># tested configurations (macOS Terminal/Windows WSL)</span> <span class="comment"># from /path/to/warc/archives/ as current working directory</span> <span>$ wget --warc-file=example --recursive https://example.com</span> <span>$ wget --warc-file=example --recursive --page-requisites https://example.com</span></pre> </div> <div id="confInterrobot"> <pre><span class="comment"># Windows: command set to "mcp-server-webcrawl"</span> <span class="comment"># macOS: command set to absolute path, i.e.</span> <span class="comment"># the value of $ which mcp-server-webcrawl</span> { "mcpServers": { "webcrawl": { "command": "/path/to/mcp-server-webcrawl", "args": ["--crawler", "interrobot", "--datasrc", "[homedir]/Documents/InterroBot/interrobot.v2.db"] } } } <span class="comment"># crawls executed in InterroBot (windowed)</span> <span class="comment"># Windows: replace [homedir] with /Users/...</span> <span class="comment"># macOS: path provided on InterroBot settings page</span></pre> </div> <div id="confKatana"> <pre> <span class="comment"># Windows: command set to "mcp-server-webcrawl"</span> <span class="comment"># macOS: command 
set to absolute path, i.e.</span> <span class="comment"># the value of $ which mcp-server-webcrawl</span> { "mcpServers": { "webcrawl": { "command": "/path/to/mcp-server-webcrawl", "args": ["--crawler", "katana", "--datasrc", "/path/to/katana/crawls/"] } } } <span class="comment"># tested configurations (macOS Terminal/PowerShell/WSL)</span> <span class="comment"># -store-response to save crawl contents</span> <span class="comment"># -store-response-dir allows for expansion of hosts</span> <span class="comment"># &nbsp; consistent with default Katana behavior to </span> <span class="comment"># &nbsp; spread assets across host directories</span> <span>$ katana -u https://example.com -store-response -store-response-dir /path/to/katana/crawls/example.com/</span></pre> </div> <div id="confSiteone"> <pre><span class="comment"># Windows: command set to "mcp-server-webcrawl"</span> <span class="comment"># macOS: command set to absolute path, i.e.</span> <span class="comment"># the value of $ which mcp-server-webcrawl</span> { "mcpServers": { "webcrawl": { "command": "/path/to/mcp-server-webcrawl", "args": ["--crawler", "siteone", "--datasrc", "/path/to/siteone/archives/"] } } } <span class="comment"># crawls executed in SiteOne (windowed)</span> <span class="comment"># *Generate offline website* must be checked</span></pre> </div> </div> </div> <div class="configuration__selection"> <p> In Claude's developer settings, locate the MCP configuration. Open it in a text editor and modify the example to reflect your datasrc path. </p> <p> To connect more than one crawl, add further <strong>mcp-server-webcrawl</strong> entries under mcpServers. </p> <p> For additional technical information, including crawler feature support, see <a href="help/index.html">help</a>. 
</p> </div> </div> <div class="tabbed__visualization"> <img src="../media/static/images/mcp-server-webcrawl/netwwwe939.svg?202508252215" alt="Abstraction of LLM clients (Claude and OpenAI) communicating with a website archive"/> </div> </div> </main> <footer> <nav class="pragmar"> <div class="pragmar__also">Software by <a href="../index.html">Pragmar</a></div> <div class="pragmar__products__wrap"> <div class="pragmar__products"> <a class="pragmar__product interrobot" href="https://interro.bot/?utm_source=pragmar.com"> <img src="../media/static/images/home/interrobote939.png?202508252215" alt="InterroBot icon"/> <div><strong>InterroBot</strong>. Web crawler and analyzer. Free/paid.</div> </a> <a class="pragmar__product appstat" href="../appstat/index.html"> <img src="../media/static/images/home/appstate939.png?202508252215" alt="appstat icon"/> <div><strong>appstat</strong>. Windows process monitor. Free.</div> </a> <a class="pragmar__product moffitor" href="../moffitor/index.html"> <img src="../media/static/images/home/moffitore939.png?202508252215" alt="Moffitor icon"/> <div><strong>Moffitor</strong>. One-click monitor sleep. Free.</div> </a> <a class="pragmar__product qbit" href="../qbit/index.html"> <img src="../media/static/images/home/qbite939.png?202508252215" alt="Qbit icon"/> <div><strong>Qbit</strong>. Skybox generator for game devs. Free/paid.</div> </a> </div> </div> </nav> </footer> <script src="../media/static/scripts/js/main.mine939.js?202508252215"></script> </body> <!-- Mirrored from pragmar.com/mcp-server-webcrawl/ by HTTrack Website Copier/3.x [XR&CO'2014], Wed, 27 Aug 2025 18:49:27 GMT --> </html>
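
The MCP Configuration section of the page above says that more than one mcp-server-webcrawl connection can be listed under mcpServers. As a minimal sketch (not part of the project itself), the Python below assembles such a configuration. The function name `build_mcp_config` and the entry names `webcrawl-wget` and `webcrawl-katana` are hypothetical; the `--crawler` and `--datasrc` arguments, the crawler names, and the placeholder paths are copied from the page's own examples.

```python
import json
import shutil

# Crawler names as they appear in the configuration examples on the page.
SUPPORTED_CRAWLERS = {"wget", "warc", "interrobot", "katana", "siteone"}


def build_mcp_config(sources, command=None):
    """Return a config dict with one mcpServers entry per source.

    sources: iterable of (entry_name, crawler, datasrc) tuples.
    command: path to the executable; when omitted, it is looked up on PATH
             (the page asks for an absolute path on macOS), falling back
             to the bare command name.
    """
    command = command or shutil.which("mcp-server-webcrawl") or "mcp-server-webcrawl"
    servers = {}
    for name, crawler, datasrc in sources:
        if crawler not in SUPPORTED_CRAWLERS:
            raise ValueError(f"unsupported crawler: {crawler}")
        servers[name] = {
            "command": command,
            "args": ["--crawler", crawler, "--datasrc", datasrc],
        }
    return {"mcpServers": servers}


if __name__ == "__main__":
    # Entry names below are illustrative only.
    config = build_mcp_config([
        ("webcrawl-wget", "wget", "/path/to/wget/archives/"),
        ("webcrawl-katana", "katana", "/path/to/katana/crawls/"),
    ])
    print(json.dumps(config, indent=2))
```

The printed JSON can be pasted into the MCP configuration opened from Claude's developer settings, after replacing the placeholder paths with real datasrc locations.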

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/pragmar/mcp_server_webcrawl'
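
For readers who would rather script the lookup, here is a rough Python equivalent of the curl call using only the standard library. The helper names `build_request` and `fetch_server_info` are hypothetical, and decoding the body as JSON is an assumption, since the response format is not described here.

```python
import json
import urllib.request

# URL copied from the curl example above.
API_URL = "https://glama.ai/api/mcp/v1/servers/pragmar/mcp_server_webcrawl"


def build_request(url=API_URL):
    """Build a GET request for the directory entry without sending it."""
    return urllib.request.Request(url, method="GET")


def fetch_server_info(url=API_URL, timeout=10):
    """Send the request and decode the body, assuming it is JSON."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_server_info()` performs the network request; `build_request()` alone is enough to inspect what would be sent.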
