Skip to main content
Glama
AIMLPM

AIMLPM/markcrawl

MarkCrawl by iD8 🕷️📝

Turn any website into clean Markdown for LLM pipelines — in one command.

CI PyPI Version License MCP Server

pip install markcrawl
markcrawl --base https://docs.example.com --out ./output --show-progress

MarkCrawl crawls a website, strips navigation/scripts/boilerplate, and writes clean Markdown files with a structured JSONL index. Every page includes a citation with the access date. No API keys needed.

Quickstart (2 minutes)

pip install markcrawl
markcrawl --base https://httpbin.org --out ./demo --show-progress

Your ./demo folder now contains:

demo/
├── index__a4f3b2c1d0.md    ← clean Markdown of the page
└── pages.jsonl              ← structured index (one JSON line per page)

Each line in pages.jsonl:

{
  "url": "https://httpbin.org/",
  "title": "httpbin.org",
  "crawled_at": "2026-04-04T12:30:00Z",
  "citation": "httpbin.org. httpbin.org. Available at: https://httpbin.org/ [Accessed April 04, 2026].",
  "tool": "markcrawl",
  "text": "# httpbin.org\n\nA simple HTTP Request & Response Service..."
}

How it compares

MarkCrawl sits between simple scraping libraries and heavyweight crawling platforms. It's designed for developers who want clean Markdown from a website without configuring a framework.

MarkCrawl

FireCrawl

Crawl4AI

Scrapy

License

MIT

AGPL-3.0

Apache-2.0

BSD-3

Install

pip install markcrawl

SaaS or self-host

pip + Playwright

pip + framework

Output

Markdown + JSONL

Markdown + JSON

Markdown

Custom pipelines

JS rendering

Optional (--render-js)

Built-in

Built-in

Plugin

LLM extraction

Optional add-on

Via API

Built-in

None

Best for

Single-site crawl → Markdown

Hosted scraping API

AI-native crawling

Large-scale distributed

If you need a hosted API, use FireCrawl. If you need distributed crawling at scale, use Scrapy. MarkCrawl is for local, single-site crawls that produce LLM-ready output.

Installation

The core crawler is the only thing you need. Everything else is optional.

pip install markcrawl                # Core crawler (free, no API keys)

Optional add-ons:

pip install markcrawl[extract]       # + LLM extraction (OpenAI, Claude, Gemini, Grok)
pip install markcrawl[js]            # + JavaScript rendering (Playwright)
pip install markcrawl[upload]        # + Supabase upload with embeddings
pip install markcrawl[mcp]           # + MCP server for AI agents
pip install markcrawl[langchain]     # + LangChain tool wrappers
pip install markcrawl[all]           # Everything

For Playwright, also run playwright install chromium after installing.

git clone https://github.com/AIMLPM/markcrawl.git
cd markcrawl
python -m venv .venv
source .venv/bin/activate
pip install -e ".[all]"

Crawling

markcrawl --base https://www.example.com --out ./output --show-progress

Add flags as needed:

markcrawl \
  --base https://www.example.com \
  --out ./output \
  --include-subdomains \        # crawl sub.example.com too
  --render-js \                 # render JavaScript (React, Vue, etc.)
  --concurrency 5 \             # fetch 5 pages in parallel
  --proxy http://proxy:8080 \   # route through a proxy
  --max-pages 200 \             # stop after 200 pages
  --format markdown \           # or "text" for plain text
  --show-progress

Resume an interrupted crawl:

markcrawl --base https://www.example.com --out ./output --resume --show-progress

Output

Each page becomes a .md file with a citation header:

# Getting Started

> URL: https://docs.example.com/getting-started
> Crawled: April 04, 2026
> Citation: Getting Started. docs.example.com. Available at: https://docs.example.com/getting-started [Accessed April 04, 2026].

Welcome to the platform. This guide walks you through installation...

Navigation, footer, cookie banners, and scripts are stripped. Only the main content remains.

Argument

Description

--base

Base site URL to crawl

--out

Output directory

--format

markdown or text (default: markdown)

--show-progress

Print progress and crawl events

--render-js

Render JavaScript with Playwright before extracting

--concurrency

Pages to fetch in parallel (default: 1)

--proxy

HTTP/HTTPS proxy URL

--resume

Resume from saved state

--include-subdomains

Include subdomains under the base domain

--max-pages

Max pages to save; 0 = unlimited (default: 500)

--delay

Delay between requests in seconds (default: 1.0)

--timeout

Per-request timeout in seconds (default: 15)

--min-words

Skip pages with fewer words (default: 20)

--user-agent

Override the default user agent

--use-sitemap / --no-sitemap

Enable/disable sitemap discovery

Optional: structured extraction

If you need structured data (not just text), the extraction add-on uses an LLM to pull specific fields from each page.

pip install markcrawl[extract]

markcrawl-extract \
  --jsonl ./output/pages.jsonl \
  --fields company_name pricing features \
  --show-progress

Auto-discover fields across multiple crawled sites:

markcrawl-extract \
  --jsonl ./comp1/pages.jsonl ./comp2/pages.jsonl ./comp3/pages.jsonl \
  --auto-fields \
  --context "competitor pricing analysis" \
  --show-progress

Supports OpenAI, Anthropic (Claude), Google Gemini, and xAI (Grok) via --provider.

Provider and model selection

markcrawl-extract --jsonl ... --fields pricing --provider openai         # default
markcrawl-extract --jsonl ... --fields pricing --provider anthropic      # Claude
markcrawl-extract --jsonl ... --fields pricing --provider gemini         # Gemini
markcrawl-extract --jsonl ... --fields pricing --provider grok           # Grok
markcrawl-extract --jsonl ... --fields pricing --model gpt-4o           # override model

Provider

API key env var

Default model

OpenAI

OPENAI_API_KEY

gpt-4o-mini

Anthropic

ANTHROPIC_API_KEY

claude-sonnet-4-20250514

Google Gemini

GEMINI_API_KEY

gemini-2.0-flash

xAI (Grok)

XAI_API_KEY

grok-3-mini-fast

All extraction CLI arguments

Argument

Description

--jsonl

Path(s) to pages.jsonl — pass multiple for cross-site analysis

--fields

Field names to extract (space-separated)

--auto-fields

Auto-discover fields by sampling pages

--context

Describe your goal for auto-discovery

--sample-size

Pages to sample for auto-discovery (default: 3)

--provider

openai, anthropic, gemini, or grok

--model

Override the default model

--output

Output path (default: extracted.jsonl)

--delay

Delay between LLM calls in seconds (default: 0.25)

--show-progress

Print progress

Output format

Extracted rows include LLM attribution:

{
  "url": "https://competitor.com/pricing",
  "citation": "Pricing. competitor.com. Available at: ... [Accessed April 04, 2026].",
  "pricing_tiers": "Starter ($29/mo), Pro ($99/mo), Enterprise (contact sales)",
  "extracted_by": "gpt-4o-mini (openai)",
  "extraction_note": "Field values were extracted by an LLM and may be interpreted, not verbatim."
}

Optional: Supabase vector search (RAG)

Chunk pages, generate embeddings, and upload to Supabase with pgvector:

pip install markcrawl[upload]

markcrawl --base https://docs.example.com --out ./output --show-progress
markcrawl-upload --jsonl ./output/pages.jsonl --show-progress

Requires SUPABASE_URL, SUPABASE_KEY, and OPENAI_API_KEY. See docs/SUPABASE.md for table setup, query examples, and recommendations.

Optional: agent integrations

MarkCrawl includes integrations for AI agents. Each is an optional add-on.

pip install markcrawl[mcp]
{
  "mcpServers": {
    "markcrawl": {
      "command": "python",
      "args": ["-m", "markcrawl.mcp_server"]
    }
  }
}

Tools: crawl_site, list_pages, read_page, search_pages, extract_data

pip install markcrawl[langchain]
from markcrawl.langchain import all_tools
from langchain_openai import ChatOpenAI
from langchain.agents import initialize_agent, AgentType

agent = initialize_agent(tools=all_tools, llm=ChatOpenAI(model="gpt-4o-mini"),
                         agent=AgentType.STRUCTURED_CHAT_ZERO_SHOT_REACT_DESCRIPTION)
agent.run("Crawl docs.example.com and summarize their auth guide")
npx clawhub install markcrawl-skill

See AIMLPM/markcrawl-clawhub-skill.

Copy the system prompt from docs/LLM_PROMPT.md into any LLM to get an assistant that generates correct MarkCrawl commands.

When NOT to use MarkCrawl

  • Sites behind login/auth — no cookie or session support

  • Aggressive bot protection (Cloudflare, Akamai) — no anti-bot evasion

  • Millions of pages — designed for hundreds to low thousands; use Scrapy for scale

  • PDF content — HTML only (PDF support is on the roadmap)

  • JavaScript SPAs without --render-js — add markcrawl[js] for React/Vue/Angular

Architecture

MarkCrawl is a web crawler. The optional layers (extraction, upload, agents) are separate add-ons that work with the crawler's output.

CORE (free, no API keys)              OPTIONAL ADD-ONS
┌──────────────────────────┐
│ 1. Discover URLs         │          markcrawl[extract]  — LLM field extraction
│    (sitemap or links)    │          markcrawl[upload]   — Supabase/pgvector RAG
│ 2. Fetch & clean HTML    │          markcrawl[js]       — Playwright JS rendering
│ 3. Write Markdown + JSONL│          markcrawl[mcp]      — MCP server for agents
│    + auto-citation       │          markcrawl[langchain] — LangChain tools
└──────────────────────────┘

For internals, see docs/ARCHITECTURE.md.

Extending MarkCrawl

from markcrawl import crawl

result = crawl("https://example.com", out_dir="./output")
print(f"Saved {result.pages_saved} pages")
# Process output in your own pipeline
import json
with open(result.index_file) as f:
    for line in f:
        page = json.loads(line)
        your_db.insert(page)  # Pinecone, Weaviate, Elasticsearch, etc.
# Use individual components
from markcrawl import chunk_text
from markcrawl.extract import LLMClient, extract_fields

See docs/ARCHITECTURE.md for the full module map and extensibility guide.

Cost

The core crawler is free. Two optional features have API costs:

Feature

Cost

When

Structured extraction

~$0.01-0.03 per page

markcrawl-extract

Supabase upload

~$0.0001 per page

markcrawl-upload

Setting up API keys

Only needed for extraction and upload. The core crawler requires no keys.

# .env — in your working directory
OPENAI_API_KEY="sk-..."           # extraction (--provider openai) + upload
ANTHROPIC_API_KEY="sk-ant-..."    # extraction (--provider anthropic)
GEMINI_API_KEY="AI..."            # extraction (--provider gemini)
XAI_API_KEY="xai-..."             # extraction (--provider grok)
SUPABASE_URL="https://..."        # upload
SUPABASE_KEY="eyJ..."             # upload (service-role key)
source .env
.
├── README.md
├── LICENSE
├── PRIVACY.md
├── SECURITY.md
├── CONTRIBUTING.md
├── CODE_OF_CONDUCT.md
├── Dockerfile
├── glama.json
├── pyproject.toml
├── requirements.txt
├── .github/
│   ├── pull_request_template.md
│   └── workflows/
│       ├── ci.yml
│       └── publish.yml
├── docs/
│   ├── ARCHITECTURE.md
│   ├── LLM_PROMPT.md
│   ├── MCP_SUBMISSION.md
│   └── SUPABASE.md
├── tests/
│   ├── test_core.py
│   ├── test_chunker.py
│   ├── test_extract.py
│   └── test_upload.py
└── markcrawl/
    ├── __init__.py
    ├── cli.py
    ├── core.py
    ├── chunker.py
    ├── exceptions.py
    ├── utils.py
    ├── extract.py
    ├── extract_cli.py
    ├── upload.py
    ├── upload_cli.py
    ├── langchain.py
    └── mcp_server.py

Roadmap

  • Canonical URL support

  • Fuzzy duplicate-content detection

  • PDF support

  • Authenticated crawling

  • Multi-provider embeddings

  • pip install markcrawl on PyPI

  • 82 automated tests + GitHub Actions CI (Python 3.10-3.13) + ruff linting

  • Markdown and plain text output with auto-citation

  • Sitemap-first crawling with robots.txt compliance

  • Text chunking with configurable overlap

  • Supabase/pgvector upload for RAG

  • JavaScript rendering via Playwright

  • Concurrent fetching and proxy support

  • Resume interrupted crawls

  • LLM extraction (OpenAI, Claude, Gemini, Grok) with auto-field discovery

  • MCP server, LangChain tools, OpenClaw skill

Contributing

See CONTRIBUTING.md. If you used an LLM to generate code, include the prompt in your PR.

Security

See SECURITY.md.

Privacy

MarkCrawl runs locally. No telemetry, no analytics, no data sent anywhere. See PRIVACY.md.

License

MIT. See LICENSE.

Install Server
A
security – no known vulnerabilities
A
license - permissive license
A
quality - A tier

Latest Blog Posts

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/AIMLPM/markcrawl'

If you have feedback or need assistance with the MCP directory API, please join our Discord server