The MarkCrawl server enables comprehensive website crawling, content extraction, and LLM-powered structured data analysis.

- Crawl websites (`crawl_site`) — Fetch and convert web pages to clean Markdown or plain text, stripping boilerplate. Supports JavaScript rendering (React/Vue/Angular SPAs), subdomain inclusion, and configurable page limits.
- List crawled pages (`list_pages`) — Get an overview of all crawled pages with URLs, titles, and word counts.
- Search crawled content (`search_pages`) — Keyword search across crawled page titles and content, with relevance-ranked results and context snippets; no network requests needed.
- Read specific pages (`read_page`) — Retrieve the full content of any crawled page by its URL.
- Extract structured data (`extract_data`) — Use an LLM to extract specified fields (e.g. pricing, features, API endpoints) from crawled pages, with support for automatic field discovery based on your analysis goal.

Designed for LLM pipelines, RAG workflows, and AI agent integration via MCP. Optional modules provide LLM extraction of structured data from crawled content (including automated field extraction such as pricing and API endpoints), LangChain tool wrappers for agents and chains, and a RAG pipeline that uploads crawled content with vector embeddings to Supabase pgvector so LLMs can query the vector store for grounded answers.
MarkCrawl by iD8 🕷️📝
Turn any website into clean Markdown for LLM pipelines — in one command.
```
pip install markcrawl
markcrawl --base https://docs.example.com --out ./output --show-progress
```

MarkCrawl crawls a website, strips navigation/scripts/boilerplate, and writes clean Markdown files with a structured JSONL index. Every page includes a citation with the access date. No API keys needed.
Quickstart (2 minutes)
```
pip install markcrawl
markcrawl --base https://httpbin.org --out ./demo --show-progress
```

Your `./demo` folder now contains:

```
demo/
├── index__a4f3b2c1d0.md   ← clean Markdown of the page
└── pages.jsonl            ← structured index (one JSON line per page)
```

Each line in `pages.jsonl`:

```json
{
  "url": "https://httpbin.org/",
  "title": "httpbin.org",
  "crawled_at": "2026-04-04T12:30:00Z",
  "citation": "httpbin.org. httpbin.org. Available at: https://httpbin.org/ [Accessed April 04, 2026].",
  "tool": "markcrawl",
  "text": "# httpbin.org\n\nA simple HTTP Request & Response Service..."
}
```

How it compares
MarkCrawl sits between simple scraping libraries and heavyweight crawling platforms. It's designed for developers who want clean Markdown from a website without configuring a framework.
| | MarkCrawl | FireCrawl | Crawl4AI | Scrapy |
| --- | --- | --- | --- | --- |
| License | MIT | AGPL-3.0 | Apache-2.0 | BSD-3 |
| Install | `pip install markcrawl` | SaaS or self-host | pip + Playwright | pip + framework |
| Output | Markdown + JSONL | Markdown + JSON | Markdown | Custom pipelines |
| JS rendering | Optional (`--render-js`) | Built-in | Built-in | Plugin |
| LLM extraction | Optional add-on | Via API | Built-in | None |
| Best for | Single-site crawl → Markdown | Hosted scraping API | AI-native crawling | Large-scale distributed |
If you need a hosted API, use FireCrawl. If you need distributed crawling at scale, use Scrapy. MarkCrawl is for local, single-site crawls that produce LLM-ready output.
Installation
The core crawler is the only thing you need. Everything else is optional.
```
pip install markcrawl   # Core crawler (free, no API keys)
```

Optional add-ons:

```
pip install markcrawl[extract]    # + LLM extraction (OpenAI, Claude, Gemini, Grok)
pip install markcrawl[js]         # + JavaScript rendering (Playwright)
pip install markcrawl[upload]     # + Supabase upload with embeddings
pip install markcrawl[mcp]        # + MCP server for AI agents
pip install markcrawl[langchain]  # + LangChain tool wrappers
pip install markcrawl[all]        # Everything
```

For Playwright, also run `playwright install chromium` after installing.
Or install from source:

```
git clone https://github.com/AIMLPM/markcrawl.git
cd markcrawl
python -m venv .venv
source .venv/bin/activate
pip install -e ".[all]"
```

Crawling
```
markcrawl --base https://www.example.com --out ./output --show-progress
```

Add flags as needed:

```
markcrawl \
  --base https://www.example.com \
  --out ./output \
  --include-subdomains \
  --render-js \
  --concurrency 5 \
  --proxy http://proxy:8080 \
  --max-pages 200 \
  --format markdown \
  --show-progress
```

Here `--include-subdomains` also crawls `sub.example.com`, `--render-js` renders JavaScript (React, Vue, etc.), `--concurrency 5` fetches 5 pages in parallel, `--proxy` routes requests through a proxy, `--max-pages 200` stops after 200 pages, and `--format` accepts `markdown` or `text`.

Resume an interrupted crawl:

```
markcrawl --base https://www.example.com --out ./output --resume --show-progress
```

Output
Each page becomes a `.md` file with a citation header:

```
# Getting Started
> URL: https://docs.example.com/getting-started
> Crawled: April 04, 2026
> Citation: Getting Started. docs.example.com. Available at: https://docs.example.com/getting-started [Accessed April 04, 2026].

Welcome to the platform. This guide walks you through installation...
```

Navigation, footers, cookie banners, and scripts are stripped. Only the main content remains.
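The header lines follow a simple `> Key: value` convention, so downstream tools can recover the metadata without a Markdown parser. A small illustrative sketch (the sample text mirrors the example above; this helper is not part of MarkCrawl):

```python
# Recover metadata from the "> Key: value" header lines of a crawled .md file.
md = """# Getting Started
> URL: https://docs.example.com/getting-started
> Crawled: April 04, 2026

Welcome to the platform. This guide walks you through installation..."""

meta = {}
for line in md.splitlines():
    if line.startswith("> "):
        # Split on the first ": " only, so URLs in the value stay intact
        key, _, value = line[2:].partition(": ")
        meta[key] = value

print(meta)
```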
| Argument | Description |
| --- | --- |
| `--base` | Base site URL to crawl |
| `--out` | Output directory |
| `--format` | Output format: `markdown` (default) or `text` |
| `--show-progress` | Print progress and crawl events |
| `--render-js` | Render JavaScript with Playwright before extracting |
| `--concurrency` | Pages to fetch in parallel |
| `--proxy` | HTTP/HTTPS proxy URL |
| `--resume` | Resume from saved state |
| `--include-subdomains` | Include subdomains under the base domain |
| `--max-pages` | Max pages to save |
| | Delay between requests in seconds |
| | Per-request timeout in seconds |
| | Skip pages with fewer words than a threshold |
| | Override the default user agent |
| | Enable/disable sitemap discovery |
Optional: structured extraction
If you need structured data (not just text), the extraction add-on uses an LLM to pull specific fields from each page.
```
pip install markcrawl[extract]

markcrawl-extract \
  --jsonl ./output/pages.jsonl \
  --fields company_name pricing features \
  --show-progress
```

Auto-discover fields across multiple crawled sites:

```
markcrawl-extract \
  --jsonl ./comp1/pages.jsonl ./comp2/pages.jsonl ./comp3/pages.jsonl \
  --auto-fields \
  --context "competitor pricing analysis" \
  --show-progress
```

Supports OpenAI, Anthropic (Claude), Google Gemini, and xAI (Grok) via `--provider`.
Provider and model selection
```
markcrawl-extract --jsonl ... --fields pricing --provider openai     # default
markcrawl-extract --jsonl ... --fields pricing --provider anthropic  # Claude
markcrawl-extract --jsonl ... --fields pricing --provider gemini     # Gemini
markcrawl-extract --jsonl ... --fields pricing --provider grok       # Grok
markcrawl-extract --jsonl ... --fields pricing --model gpt-4o        # override model
```

| Provider | API key env var | Default model |
| --- | --- | --- |
| OpenAI | `OPENAI_API_KEY` | `gpt-4o-mini` |
| Anthropic | `ANTHROPIC_API_KEY` | |
| Google Gemini | `GEMINI_API_KEY` | |
| xAI (Grok) | `XAI_API_KEY` | |
All extraction CLI arguments
| Argument | Description |
| --- | --- |
| `--jsonl` | Path(s) to `pages.jsonl` files |
| `--fields` | Field names to extract (space-separated) |
| `--auto-fields` | Auto-discover fields by sampling pages |
| `--context` | Describe your goal for auto-discovery |
| | Pages to sample for auto-discovery |
| `--provider` | LLM provider: `openai` (default), `anthropic`, `gemini`, or `grok` |
| `--model` | Override the default model |
| | Output path |
| | Delay between LLM calls in seconds |
| `--show-progress` | Print progress |
Output format
Extracted rows include LLM attribution:
```json
{
  "url": "https://competitor.com/pricing",
  "citation": "Pricing. competitor.com. Available at: ... [Accessed April 04, 2026].",
  "pricing_tiers": "Starter ($29/mo), Pro ($99/mo), Enterprise (contact sales)",
  "extracted_by": "gpt-4o-mini (openai)",
  "extraction_note": "Field values were extracted by an LLM and may be interpreted, not verbatim."
}
```

Optional: Supabase vector search (RAG)
Chunk pages, generate embeddings, and upload to Supabase with pgvector:
```
pip install markcrawl[upload]

markcrawl --base https://docs.example.com --out ./output --show-progress
markcrawl-upload --jsonl ./output/pages.jsonl --show-progress
```

Requires `SUPABASE_URL`, `SUPABASE_KEY`, and `OPENAI_API_KEY`. See docs/SUPABASE.md for table setup, query examples, and recommendations.
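What the vector store enables, conceptually: the query is embedded, compared to stored chunk embeddings by similarity, and the closest chunks are returned as grounding context. A toy illustration with hand-made 3-dimensional vectors (real embeddings have hundreds of dimensions, and pgvector performs this ranking server-side):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy "embeddings" for three stored chunks and one query
chunks = {
    "pricing page": [0.9, 0.1, 0.0],
    "auth guide":   [0.1, 0.9, 0.2],
    "changelog":    [0.0, 0.2, 0.9],
}
query = [0.85, 0.15, 0.05]

# Rank stored chunks by similarity to the query and keep the best match
best = max(chunks, key=lambda name: cosine(query, chunks[name]))
print(best)
```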
Optional: agent integrations
MarkCrawl includes integrations for AI agents. Each is an optional add-on.
```
pip install markcrawl[mcp]
```

Add to your MCP client configuration:

```json
{
  "mcpServers": {
    "markcrawl": {
      "command": "python",
      "args": ["-m", "markcrawl.mcp_server"]
    }
  }
}
```

Tools: `crawl_site`, `list_pages`, `read_page`, `search_pages`, `extract_data`
```
pip install markcrawl[langchain]
```

```python
from markcrawl.langchain import all_tools
from langchain_openai import ChatOpenAI
from langchain.agents import initialize_agent, AgentType

agent = initialize_agent(
    tools=all_tools,
    llm=ChatOpenAI(model="gpt-4o-mini"),
    agent=AgentType.STRUCTURED_CHAT_ZERO_SHOT_REACT_DESCRIPTION,
)
agent.run("Crawl docs.example.com and summarize their auth guide")
```

OpenClaw skill:

```
npx clawhub install markcrawl-skill
```

See AIMLPM/markcrawl-clawhub-skill.
Copy the system prompt from docs/LLM_PROMPT.md into any LLM to get an assistant that generates correct MarkCrawl commands.
When NOT to use MarkCrawl
- Sites behind login/auth — no cookie or session support
- Aggressive bot protection (Cloudflare, Akamai) — no anti-bot evasion
- Millions of pages — designed for hundreds to low thousands; use Scrapy for scale
- PDF content — HTML only (PDF support is on the roadmap)
- JavaScript SPAs without `--render-js` — add `markcrawl[js]` for React/Vue/Angular
Architecture
MarkCrawl is a web crawler. The optional layers (extraction, upload, agents) are separate add-ons that work with the crawler's output.
```
CORE (free, no API keys)         OPTIONAL ADD-ONS

┌──────────────────────────┐
│ 1. Discover URLs         │     markcrawl[extract]   — LLM field extraction
│    (sitemap or links)    │     markcrawl[upload]    — Supabase/pgvector RAG
│ 2. Fetch & clean HTML    │     markcrawl[js]        — Playwright JS rendering
│ 3. Write Markdown + JSONL│     markcrawl[mcp]       — MCP server for agents
│    + auto-citation       │     markcrawl[langchain] — LangChain tools
└──────────────────────────┘
```

For internals, see docs/ARCHITECTURE.md.
Extending MarkCrawl
```python
from markcrawl import crawl

result = crawl("https://example.com", out_dir="./output")
print(f"Saved {result.pages_saved} pages")
```

Process the output in your own pipeline:

```python
import json

with open(result.index_file) as f:
    for line in f:
        page = json.loads(line)
        your_db.insert(page)  # Pinecone, Weaviate, Elasticsearch, etc.
```

Use individual components:

```python
from markcrawl import chunk_text
from markcrawl.extract import LLMClient, extract_fields
```

See docs/ARCHITECTURE.md for the full module map and extensibility guide.
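The exact signature of `chunk_text` isn't documented here. As an illustration of what chunking with configurable overlap means, a sliding-window word chunker can be sketched as follows (the function name and parameters below are hypothetical, not MarkCrawl's API):

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word-based chunks; consecutive chunks share `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # advance this many words per chunk
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last chunk reached the end of the text
    return chunks

# 500 numbered words -> three chunks covering words 0..199, 160..359, 320..499
text = " ".join(str(i) for i in range(500))
chunks = chunk_words(text)
```

Overlap keeps a sentence that straddles a chunk boundary fully present in at least one chunk, which helps retrieval quality in RAG pipelines.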
Cost
The core crawler is free. Two optional features have API costs:
| Feature | Cost | When |
| --- | --- | --- |
| Structured extraction | ~$0.01-0.03 per page | Only when you run `markcrawl-extract` |
| Supabase upload | ~$0.0001 per page | Only when you run `markcrawl-upload` |
Setting up API keys
Only needed for extraction and upload. The core crawler requires no keys.
```
# .env — in your working directory
OPENAI_API_KEY="sk-..."        # extraction (--provider openai) + upload
ANTHROPIC_API_KEY="sk-ant-..." # extraction (--provider anthropic)
GEMINI_API_KEY="AI..."         # extraction (--provider gemini)
XAI_API_KEY="xai-..."          # extraction (--provider grok)
SUPABASE_URL="https://..."     # upload
SUPABASE_KEY="eyJ..."          # upload (service-role key)
```

Load it with `source .env`.
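If you prefer to load the keys from Python rather than `source` the file, a few lines suffice. This is an illustrative sketch, not part of MarkCrawl; the `python-dotenv` package does the same thing robustly, and the file name and key below are made up for the demo:

```python
import os

def load_env(path: str = ".env") -> None:
    """Minimal .env loader: KEY="value" lines, skipping blanks and comments."""
    with open(path) as f:
        for raw in f:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Don't clobber variables already set in the environment
            os.environ.setdefault(key.strip(), value.strip().strip('"'))

# Demo with a throwaway file and a fake key name
with open("demo.env", "w") as f:
    f.write('# comment line\nFAKE_PROVIDER_KEY="sk-test-123"\n')
load_env("demo.env")
print(os.environ["FAKE_PROVIDER_KEY"])
```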
Project structure

```
├── README.md
├── LICENSE
├── PRIVACY.md
├── SECURITY.md
├── CONTRIBUTING.md
├── CODE_OF_CONDUCT.md
├── Dockerfile
├── glama.json
├── pyproject.toml
├── requirements.txt
├── .github/
│   ├── pull_request_template.md
│   └── workflows/
│       ├── ci.yml
│       └── publish.yml
├── docs/
│   ├── ARCHITECTURE.md
│   ├── LLM_PROMPT.md
│   ├── MCP_SUBMISSION.md
│   └── SUPABASE.md
├── tests/
│   ├── test_core.py
│   ├── test_chunker.py
│   ├── test_extract.py
│   └── test_upload.py
└── markcrawl/
    ├── __init__.py
    ├── cli.py
    ├── core.py
    ├── chunker.py
    ├── exceptions.py
    ├── utils.py
    ├── extract.py
    ├── extract_cli.py
    ├── upload.py
    ├── upload_cli.py
    ├── langchain.py
    └── mcp_server.py
```

Roadmap
- Canonical URL support
- Fuzzy duplicate-content detection
- PDF support
- Authenticated crawling
- Multi-provider embeddings

Features

- `pip install markcrawl` on PyPI
- 82 automated tests + GitHub Actions CI (Python 3.10-3.13) + ruff linting
- Markdown and plain text output with auto-citation
- Sitemap-first crawling with robots.txt compliance
- Text chunking with configurable overlap
- Supabase/pgvector upload for RAG
- JavaScript rendering via Playwright
- Concurrent fetching and proxy support
- Resume interrupted crawls
- LLM extraction (OpenAI, Claude, Gemini, Grok) with auto-field discovery
- MCP server, LangChain tools, OpenClaw skill
Contributing
See CONTRIBUTING.md. If you used an LLM to generate code, include the prompt in your PR.
Security
See SECURITY.md.
Privacy
MarkCrawl runs locally. No telemetry, no analytics, no data sent anywhere. See PRIVACY.md.
License
MIT. See LICENSE.