@digidai/mcp-website2markdown
URL to Markdown Converter
Convert any web page to clean Markdown — JS-heavy SPAs, paywalled content, Chinese platforms (WeChat, Zhihu, Feishu), and more. Powered by Cloudflare Workers with a 5-layer fallback pipeline and 14 site adapters.
Quick Start
# Convert any URL to Markdown (try it now!)
curl -H "Accept: text/markdown" https://md.genedai.me/https://example.com
# WeChat article
curl -H "Accept: text/markdown" "https://md.genedai.me/https://mp.weixin.qq.com/s/YOUR_ARTICLE_ID"
# JSON output with metadata
curl "https://md.genedai.me/https://example.com?format=json&raw=true"
Or just open in your browser: md.genedai.me/https://example.com
Need browser-rendered pages (WeChat, Feishu, JS-heavy SPAs) or higher limits? Get a free API key at md.genedai.me/portal/.
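The Quick Start calls all follow one pattern: the target URL is appended to the service base, and options go in the query string. A minimal Python sketch of that pattern follows. `build_convert_url` is a hypothetical helper name, not part of any SDK, and it assumes the target URL carries no query string of its own, since this README does not describe how those are separated from service options.

```python
from urllib.parse import urlencode

# Hosted instance named in this README.
BASE_URL = "https://md.genedai.me"


def build_convert_url(target: str, **options: str) -> str:
    """Append the target URL to the base and add options as query parameters.

    Hypothetical helper for illustration; assumes `target` has no query string.
    """
    url = f"{BASE_URL}/{target}"
    if options:
        url += "?" + urlencode(options)
    return url


# Same request as the JSON example in Quick Start.
print(build_convert_url("https://example.com", format="json", raw="true"))
```

The returned string can be passed to any HTTP client, with the `Accept` or `Authorization` headers added as needed.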
How It Works
https://md.genedai.me/<target-url>
Conversion Flow
Request
│
▼
Fetch target with Accept: text/markdown
│
├─ Response is text/markdown? ──▶ Path 1: Native Markdown
│
└─ Response is text/html?
│
├─ Anti-bot / JS-required detected? ──▶ Path 3: Browser Rendering → Readability + Turndown
│
└─ Normal HTML ──▶ Path 2: Readability + Turndown
Path | When | How |
Native | Target site supports Markdown for Agents | Cloudflare edge converts via |
Fallback | Normal HTML pages | Readability extracts main content → Turndown converts to Markdown |
Browser | Anti-bot pages, JS-rendered content | Headless Chrome renders the page → Readability + Turndown |
Jina | Explicit | Convert via Jina Reader API while preserving the same output/query surface |
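The decision tree above can be read as a small pure function. The sketch below is an illustration of the flow as documented, not the service's actual code. `choose_path`, its parameter names, and the error raised for other content types are assumptions made for the example.

```python
from typing import Optional


def choose_path(content_type: str,
                needs_browser: bool = False,
                engine: Optional[str] = None) -> str:
    """Pick a conversion path from the upstream response (illustrative only)."""
    # An explicit engine=jina request bypasses the built-in pipeline.
    if engine == "jina":
        return "jina"
    # Ignore parameters such as "; charset=utf-8".
    mime = content_type.split(";")[0].strip().lower()
    if mime == "text/markdown":
        return "native"      # Path 1: target already serves Markdown
    if mime == "text/html":
        # Path 3 when anti-bot or JS-required content is detected, else Path 2.
        return "browser" if needs_browser else "fallback"
    # The flow chart does not cover other types; raising is an assumption.
    raise ValueError(f"content type not covered by the flow: {mime}")


print(choose_path("text/html", needs_browser=True))
```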
API Usage
Browser (URL bar)
# Full URL
https://md.genedai.me/https://example.com/page
# Bare domain (auto-prepends https://)
https://md.genedai.me/example.com/page
Raw Markdown API
# Get raw Markdown via query param
curl "https://md.genedai.me/https://example.com/page?raw=true"
# Get raw Markdown via Accept header
curl https://md.genedai.me/https://example.com/page \
-H "Accept: text/markdown"
API Keys and Tiers
Sign up at md.genedai.me/portal/ with your email to get an API key. No password; a sign-in link is emailed to you.
Tier | Credits/month | Browser rendering | Proxy / Engine selection |
Anonymous (no key) | — | ❌ cache + readability only | ❌ |
Free | 1,000 | ✅ | ❌ |
Pro | 50,000 | ✅ | ✅ ( |
Credit cost is fixed per request type, not per actual conversion path (so bills are predictable even if a site silently switches from static to browser rendering behind the scenes):
Endpoint | Credits |
| 1 |
| 1 |
| 1 |
| 3 |
| 2 |
Cache hits on a paying tier still consume 1 credit; when your quota is
exhausted the API keeps serving cached URLs (with X-Quota-Exceeded: true)
but rejects cache-miss requests with 429.
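A client may want to tell these quota outcomes apart. The following sketch classifies a response from its status code and headers. `classify_response` and its return labels are hypothetical, and a 429 may also come from the per-IP rate limits mentioned elsewhere in this README, so the label for it is deliberately generic.

```python
from typing import Mapping


def classify_response(status: int, headers: Mapping[str, str]) -> str:
    """Map a response to a coarse quota state (labels are illustrative)."""
    # Header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    if status == 429:
        return "rejected"            # cache miss while over quota, or rate limited
    if lowered.get("x-quota-exceeded", "").lower() == "true":
        return "cached_over_quota"   # served from cache although quota is used up
    return "ok"


print(classify_response(200, {"X-Quota-Exceeded": "true"}))
```

A caller could back off or prompt for an upgrade on `rejected`, and warn on `cached_over_quota` while still using the returned content.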
Using your key
# Bearer header (recommended)
curl "https://md.genedai.me/https://example.com/page?raw=true" \
-H "Authorization: Bearer mk_..."
# The old ?token= query-parameter form is supported for legacy
# PUBLIC_API_TOKEN deployments, but NOT for mk_ keys. Never put a real
# API key in a query string: logs, referrers, and monitoring capture it.
Every authenticated response includes per-key rate limit headers:
X-RateLimit-Limit: 50000
X-RateLimit-Remaining: 49993
X-Request-Cost: 1
Portal API (session cookie)
Once signed in at /portal/, these endpoints are available under the same
session cookie:
Endpoint | Method | Description |
| GET | Current account (email, tier, account_id) |
| GET | List your keys (prefix only, never plaintext) |
| POST | Create a new key; plaintext returned once |
| DELETE | Revoke a key (takes effect within 60s — LRU cache) |
| GET | Usage breakdown (tier, quota, used, daily history) |
| POST | Destroy session, clear cookie |
/api/usage also accepts an Authorization: Bearer mk_... header so SDK
and CLI tools can poll usage without a session.
Output Formats
# Markdown (default)
curl "https://md.genedai.me/https://example.com?format=markdown&raw=true"
# Clean HTML
curl "https://md.genedai.me/https://example.com?format=html&raw=true"
# Plain text (no formatting)
curl "https://md.genedai.me/https://example.com?format=text&raw=true"
# JSON (structured: url, title, markdown, method, timestamp)
curl "https://md.genedai.me/https://example.com?format=json&raw=true"
CSS Selector Extraction
Extract specific page elements instead of the full article:
# Extract only the article body
curl "https://md.genedai.me/https://example.com?selector=.article-body&raw=true"
# Extract a specific section
curl "https://md.genedai.me/https://example.com?selector=%23main-content&raw=true"
selector maximum length is 256 characters.
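Selectors containing characters such as `#` must be percent-encoded, as in the second example above. A small sketch of encoding plus a client-side length check follows. `selector_param` is a hypothetical helper, and applying the 256-character limit to the raw (unencoded) selector is an assumption.

```python
from urllib.parse import quote

MAX_SELECTOR_LENGTH = 256  # limit stated in this README


def selector_param(selector: str) -> str:
    """Return a percent-encoded `selector=` query fragment.

    Assumption: the length limit applies to the selector before encoding.
    """
    if len(selector) > MAX_SELECTOR_LENGTH:
        raise ValueError("selector exceeds 256 characters")
    # safe="" also encodes "/" so the selector cannot alter the URL path.
    return "selector=" + quote(selector, safe="")


print(selector_param("#main-content"))
```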
Force Browser Rendering
curl "https://md.genedai.me/https://example.com/js-heavy-page?raw=true&force_browser=true"
Jina Reader Engine
Use engine=jina to convert via r.jina.ai instead of the built-in pipeline. This is useful for JS-heavy pages when browser rendering is unavailable. Free tier: 20 RPM, 2 concurrent, per-IP rate limit.
curl "https://md.genedai.me/https://example.com?raw=true&engine=jina"
Jina is also used automatically as a last-resort fallback when Readability extraction produces very little content and no browser/proxy path was used.
Cache Control
Results are cached in KV for fast repeat access. To bypass cache:
curl "https://md.genedai.me/https://example.com?raw=true&no_cache=true"
Batch Conversion
Convert multiple URLs in a single request:
curl -X POST https://md.genedai.me/api/batch \
-H "Authorization: Bearer <api-token>" \
-H "Content-Type: application/json" \
-d '{
"urls": [
"https://example.com/page1",
{
"url": "https://example.com/page2",
"format": "text",
"selector": "article",
"force_browser": false,
"no_cache": true
}
]
}'
urls supports:
String item: "https://example.com/a" (defaults to markdown)
Object item: { "url": "...", "format?": "markdown|html|text|json", "selector?": "...", "force_browser?": boolean, "no_cache?": boolean, "engine?": "jina" }
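The two item shapes can be checked on the client before sending. This is a minimal sketch, assuming the 10-URL batch limit listed in the endpoint table. `build_batch_body` is a hypothetical helper, and rejecting unknown keys locally is a choice of this example, not documented server behavior.

```python
import json

MAX_BATCH_URLS = 10  # "max 10" per the API Endpoints table
ALLOWED_KEYS = {"url", "format", "selector", "force_browser", "no_cache", "engine"}


def build_batch_body(items: list) -> str:
    """Validate string/object items and return the JSON request body."""
    if len(items) > MAX_BATCH_URLS:
        raise ValueError("a batch accepts at most 10 URLs")
    urls = []
    for item in items:
        if isinstance(item, str):
            urls.append(item)  # string items default to markdown
            continue
        unknown = set(item) - ALLOWED_KEYS
        if "url" not in item or unknown:
            raise ValueError(f"invalid batch item: {item!r}")
        urls.append(dict(item))
    return json.dumps({"urls": urls})


print(build_batch_body([
    "https://example.com/page1",
    {"url": "https://example.com/page2", "format": "text", "no_cache": True},
]))
```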
Response:
{
"results": [
{
"url": "...",
"format": "markdown",
"content": "...",
"markdown": "...",
"title": "...",
"method": "...",
"cached": false,
"fallbacks": ["jsonld"]
},
{
"url": "...",
"format": "text",
"content": "...",
"title": "...",
"method": "...",
"cached": true
}
]
}
Structured Extraction API
Extract structured fields from URL or raw HTML.
curl -X POST https://md.genedai.me/api/extract \
-H "Authorization: Bearer <api-token>" \
-H "Content-Type: application/json" \
-d '{
"strategy": "css",
"url": "https://example.com/article",
"schema": {
"fields": [
{ "name": "title", "selector": "h1", "type": "text", "required": true },
{ "name": "author", "selector": ".author", "type": "text" }
]
},
"include_markdown": true
}'
Batch extraction (items) is also supported (max 10 items).
Additional extraction capabilities:
Use either top-level url/html or nested input.url/input.html.
schema.fields[*].required fails extraction when a required field is missing.
options supports dedupe, includeEmpty, and regexFlags.
include_markdown: true attaches converted markdown alongside extracted data.
Job API (create / query / stream / run)
Submit crawl/extract tasks as queued jobs, then run and monitor. Jobs are persisted as queued records in KV; execution begins when you call /run:
# 1) Create job
curl -X POST https://md.genedai.me/api/jobs \
-H "Authorization: Bearer <api-token>" \
-H "Content-Type: application/json" \
-H "Idempotency-Key: demo-job-1" \
-d '{
"type": "crawl",
"tasks": [
"https://example.com/a",
"https://example.com/b"
],
"priority": 10,
"maxRetries": 2
}'
# 2) Query status
curl -H "Authorization: Bearer <api-token>" \
https://md.genedai.me/api/jobs/<job-id>
# 3) Watch status stream (SSE)
curl -N -H "Authorization: Bearer <api-token>" \
https://md.genedai.me/api/jobs/<job-id>/stream
# 4) Execute queued tasks
curl -X POST -H "Authorization: Bearer <api-token>" \
https://md.genedai.me/api/jobs/<job-id>/run
Job API notes:
Supports both type: "crawl" and type: "extract".
type: "crawl" accepts string URLs or object tasks with format, selector, force_browser, and no_cache.
type: "extract" reuses the same task shape as /api/extract.
Idempotency-Key is keyed by both the header value and request payload: same key + same payload returns the existing job; same key + different payload returns 409 Conflict.
priority is normalized to 1..100 (default 10), maxRetries to 0..10 (default 2).
Up to 100 tasks are allowed per job.
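The normalization and idempotency rules in these notes can be sketched as below. This is an illustration under stated assumptions, not the service's implementation: "normalized" is read as clamping into the range, the payload is compared through a SHA-256 fingerprint of its sorted JSON, and `normalize_job`, `resolve_idempotency`, and the in-memory `store` dict are hypothetical stand-ins.

```python
import hashlib
import json

MAX_TASKS = 100  # per-job task limit stated in the notes


def clamp(value, low, high, default):
    """Return default when unset, otherwise force value into [low, high]."""
    if value is None:
        return default
    return max(low, min(high, int(value)))


def normalize_job(job: dict) -> dict:
    """Apply the documented ranges and defaults (clamping is an assumption)."""
    if len(job.get("tasks", [])) > MAX_TASKS:
        raise ValueError("a job allows at most 100 tasks")
    return {
        **job,
        "priority": clamp(job.get("priority"), 1, 100, 10),
        "maxRetries": clamp(job.get("maxRetries"), 0, 10, 2),
    }


def resolve_idempotency(store: dict, key: str, payload: dict) -> str:
    """Return 'created', 'existing', or 'conflict' (the 409 case)."""
    fingerprint = hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode("utf-8")
    ).hexdigest()
    seen = store.get(key)
    if seen is None:
        store[key] = fingerprint
        return "created"
    return "existing" if seen == fingerprint else "conflict"


store = {}
job = normalize_job({"type": "crawl", "tasks": ["https://example.com/a"], "priority": 500})
print(job["priority"], job["maxRetries"])
print(resolve_idempotency(store, "demo-job-1", job))
```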
Deep Crawl API
Run BFS/BestFirst deep crawl with filters/scoring and opt-in checkpoint resume.
# non-stream
curl -X POST https://md.genedai.me/api/deepcrawl \
-H "Authorization: Bearer <api-token>" \
-H "Content-Type: application/json" \
-d '{
"seed": "https://example.com/docs",
"max_depth": 2,
"max_pages": 20,
"strategy": "best_first",
"filters": {
"allow_domains": ["example.com"],
"url_patterns": ["https://example.com/docs/*"]
},
"scorer": {
"keywords": ["api", "reference"],
"weight": 2
},
"checkpoint": {
"crawl_id": "docs-crawl-001",
"snapshot_interval": 5
}
}'
# stream mode (SSE: start/node/done/fail)
curl -N -X POST https://md.genedai.me/api/deepcrawl \
-H "Authorization: Bearer <api-token>" \
-H "Content-Type: application/json" \
-d '{
"seed": "https://example.com/docs",
"stream": true
}'
Deep crawl request supports:
include_external to traverse off-domain links.
filters.url_patterns, filters.allow_domains, filters.block_domains, filters.content_types.
scorer.keywords, scorer.weight, scorer.score_threshold.
output.include_markdown to attach per-page markdown.
fetch.selector, fetch.force_browser, fetch.no_cache to control page conversion.
checkpoint.crawl_id, checkpoint.resume, checkpoint.snapshot_interval, checkpoint.ttl_seconds.
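To make the `best_first` strategy concrete, here is a generic best-first traversal bounded by `max_depth` and `max_pages`. It is a sketch of the general technique, not the service's crawler. The in-memory `links` graph stands in for fetched pages, and scoring URLs by keyword occurrence is an assumption, since this README does not say what the scorer inspects.

```python
import heapq
from itertools import count


def keyword_score(url: str, keywords: list, weight: float = 1.0) -> float:
    """Score a URL by how many keywords it contains (assumed scoring rule)."""
    return weight * sum(1 for kw in keywords if kw in url.lower())


def best_first_crawl(seed, links, keywords, max_depth=2, max_pages=20, weight=1.0):
    """Visit the highest-scoring known URL first, within depth and page limits."""
    order = []
    seen = {seed}
    tie = count()  # preserves discovery order among equal scores
    frontier = [(-keyword_score(seed, keywords, weight), next(tie), 0, seed)]
    while frontier and len(order) < max_pages:
        _, _, depth, url = heapq.heappop(frontier)
        order.append(url)
        if depth >= max_depth:
            continue  # do not expand links beyond the depth limit
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                score = keyword_score(nxt, keywords, weight)
                heapq.heappush(frontier, (-score, next(tie), depth + 1, nxt))
    return order


# Hypothetical link graph used only for this example.
links = {
    "https://example.com/docs": [
        "https://example.com/docs/blog",
        "https://example.com/docs/api",
        "https://example.com/docs/api/reference",
    ],
}
for page in best_first_crawl("https://example.com/docs", links, ["api", "reference"], weight=2):
    print(page)
```

With a plain BFS the three child pages would be visited in discovery order; the priority queue is what moves the keyword-rich URLs to the front.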
Supported Sites
Special adapters for optimal extraction on these platforms:
Site | Features |
WeChat ( | MicroMessenger UA, image proxy for hotlink bypass |
Feishu/Lark Docs (document surfaces such as | Virtual scroll handling, R2 image storage, UI noise removal |
Zhihu ( | Login wall removal, lazy image swap, hybrid proxy bypass |
Yuque ( | SPA rendering, sidebar/toc removal |
Notion ( | SPA rendering, lazy scroll loading |
Juejin ( | Login popup removal, code block expansion |
Twitter/X ( | Stealth rendering, login wall bypass |
Reddit ( | URL transform to old.reddit.com, content extraction |
CSDN ( | Login popup removal, code block expansion |
36Kr ( | Stealth rendering, content extraction |
Toutiao ( | Stealth rendering, content extraction |
NetEase ( | Content extraction |
Weibo ( | Stealth rendering, hybrid proxy bypass |
All other sites | Generic mobile UA, lazy image handling |
JavaScript / TypeScript
const res = await fetch(
"https://md.genedai.me/https://example.com/page?raw=true"
);
const markdown = await res.text();
console.log(res.headers.get("X-Markdown-Method"));
console.log(res.headers.get("X-Cache-Status")); // "HIT" or "MISS"
Python
import requests
url = "https://md.genedai.me/https://example.com/page"
resp = requests.get(url, params={"raw": "true", "format": "json"})
data = resp.json()
print(data["title"], data["method"])
API Endpoints
Endpoint | Method | Description |
| GET | Landing page with URL input form |
| GET | Convert URL and render Markdown as HTML page |
| GET | Return raw Markdown as plain text |
| GET | Return structured JSON (url, title, markdown, method) |
| GET | Return HTML output for preview/basic rendering |
| GET | Return plain text (no formatting) |
| GET | Extract specific CSS selector |
| GET | Force browser rendering |
| GET | Convert via Jina Reader API using the same output formats |
| GET | Bypass KV cache |
| GET | SSE conversion stream ( |
| POST | Batch convert multiple URLs (max 10) |
| POST | Structured extraction API ( |
| POST | Create queued crawl/extract job record |
| GET | Query job status |
| GET | SSE job status stream |
| POST | Execute queued/failed tasks in job |
| POST | Deep crawl API (BFS/BestFirst, stream/non-stream, checkpoint) |
| GET | Dynamic Open Graph image for landing/rendered pages |
| GET | Image proxy (bypasses hotlink protection) |
| GET | Serve image from R2 storage |
| GET | Health + runtime + operational metrics |
Authentication Matrix
The hosted instance at md.genedai.me uses D1-backed API keys with tiers
(see API Keys and Tiers). Self-hosted deployments
can skip the AUTH_DB binding and fall back to the legacy
API_TOKEN / PUBLIC_API_TOKEN secrets.
Route Group | Anonymous | Free tier ( | Pro tier ( |
| ✅ cache + readability | ✅ full pipeline | ✅ + |
| ✅ cache + readability | ✅ full pipeline | ✅ full + params |
| ❌ 401 | ✅ | ✅ |
| ❌ 401 | ✅ | ✅ |
| ❌ 401 | ✅ | ✅ |
| ❌ 401 | ✅ | ✅ |
| — | session cookie | session cookie or Bearer key |
| public | public | public |
| public (single-use token) | — | — |
| public HTML | — | — |
| public | public | public |
The batch / extract / deepcrawl / jobs endpoints are always gated because they either fan out into many conversions or touch Browser Rendering directly.
Response Headers (Raw API)
Header | Description |
|
|
| The original target URL |
| Token count (native Markdown for Agents only) |
|
|
|
|
|
|
| Comma-separated fallback list (when used) |
|
|
|
|
| Monthly credit quota (authenticated requests only) |
| Credits remaining this month |
| Fixed per-request-type credit cost |
|
|
| Present on |
|
|
Features
Feature | Description |
Any Website | Works on every site with four conversion paths |
Site Adapters | Specialized extractors for WeChat, Feishu, Zhihu, Yuque, Notion, Juejin |
Anti-Bot Bypass | Browser Rendering handles JS challenges, CAPTCHAs, and verification |
3-Tier Cache | In-memory hot cache → Cloudflare Cache API (per-colo, free) → KV (global, persistent) |
Developer Portal | Self-service signup, API key management, real-time usage dashboard |
Tier System | Anonymous (cache+readability only), Free (1k/mo), Pro (50k/mo) |
R2 Image Storage | Images stored reliably, served via proxy URLs |
Multiple Formats | Markdown, HTML, text, or structured JSON output |
CSS Selectors | Target specific page elements for extraction |
Batch API v2 | Convert up to 10 URLs with per-item format/selector/browser/cache options |
Structured Extraction | CSS/XPath/Regex extraction via |
Job Dispatcher | Queue + run + monitor crawl/extract workloads via |
Deep Crawl | BFS + BestFirst traversal, filters/scorers, stream mode, checkpoint/resume |
Table Support | Improved handling of simple and complex tables |
Smart Extraction | Readability strips nav, ads, sidebars — extracts main article content |
Rendered View | Dark-themed Markdown preview with GitHub CSS and tab switching |
Session Profiles | Persist/replay cookies and localStorage for repeat authenticated crawling |
Proxy Pool Fallback | Multi-proxy + UA/header variant rotation for challenge-prone targets |
SSRF Protection | Blocks private IPs, IPv6 link-local, cloud metadata endpoints |
Timeout Protection | Time-budgeted scrolling for Feishu virtual scroll documents |
Built-in Rate Limiting | Per-IP limits for conversion, stream, and batch routes |
Runtime Paywall Rules | Support dynamic paywall rule updates via env/KV JSON |
Operational Health |
|
Tech Stack
Component | Role |
Edge runtime — global deployment | |
Headless Chrome for JS-heavy/anti-bot pages | |
Edge key-value cache for converted content | |
Object storage for images | |
Native HTML→Markdown at edge | |
Article content extraction (Firefox Reader View) | |
HTML→Markdown conversion | |
Puppeteer API for Browser Rendering | |
Lightweight DOM for Workers | |
Unit testing framework |
AI Agent Integration
Three ways to use Website2Markdown from AI agents:
Agent Skills (Claude Code, OpenClaw, Claw)
One command install, auto-discovered by your agent. Includes usage patterns, error handling, and guides for all 21 adapters.
# Claude Code
git clone https://github.com/Digidai/website2markdown-skills ~/.claude/skills/website2markdown
# Codex CLI
git clone https://github.com/Digidai/website2markdown-skills ~/.codex/skills/website2markdown
# Gemini CLI
git clone https://github.com/Digidai/website2markdown-skills ~/.gemini/skills/website2markdown
# OpenClaw
npx clawhub@latest install website2markdown
One command, auto-discovered in new sessions. See the website2markdown-skills repo for full documentation.
MCP Server (Claude Desktop, Cursor IDE, Windsurf)
Standard MCP protocol with convert_url tool.
npm install -g @digidai/mcp-website2markdown
Claude Desktop config (~/.claude/claude_desktop_config.json):
{
"mcpServers": {
"website2markdown": {
"command": "mcp-website2markdown",
"env": {
"WEBSITE2MARKDOWN_API_URL": "https://md.genedai.me"
}
}
}
}
llms.txt
Machine-readable API description for AI system auto-discovery:
https://md.genedai.me/llms.txt
Which to choose?
| Skills | MCP Server | llms.txt |
Best for | CLI-based agents (Claude Code, OpenClaw) | IDE-based agents (Claude Desktop, Cursor) | Any AI with web access |
Latency | Direct HTTP (fastest) | MCP protocol overhead | Direct HTTP |
Context | Rich (patterns, error handling, adapters) | Tool schema only | API description |
Install | | | None |
Project Structure
md-genedai/
├── src/
│ ├── index.ts # Router + conversion + extraction + job/deepcrawl endpoints
│ ├── types.ts # Shared TS types (Env, extraction/job payloads, adapters)
│ ├── config.ts # Limits, timeouts, UA and parser constants
│ ├── utils.ts # Shared helpers (headers, parsing, formatting)
│ ├── converter.ts # Readability + Turndown pipeline and content shaping
│ ├── security.ts # SSRF guardrails, retry wrappers, safe fetch helpers
│ ├── paywall.ts # Paywall heuristics + runtime rule updates
│ ├── proxy.ts # Forward proxy + pool parsing/selection
│ ├── browser/
│ │ ├── index.ts # Browser rendering orchestrator and capacity control
│ │ ├── stealth.ts # Anti-detection hardening
│ │ └── adapters/ # 14 site-specific browser adapters
│ ├── cache/
│ │ └── index.ts # KV conversion cache + R2 image storage
│ ├── extraction/
│ │ └── strategies.ts # CSS/XPath/Regex structured extraction
│ ├── dispatcher/
│ │ ├── model.ts # Job schema + KV persistence/idempotency
│ │ └── runner.ts # Job execution and retry orchestration
│ ├── deepcrawl/
│ │ ├── bfs.ts # BFS/BestFirst traversal core
│ │ ├── filters.ts # Crawl filters (domains, patterns, content hints)
│ │ └── scorers.ts # Keyword/domain scoring for BestFirst strategy
│ ├── session/
│ │ └── profile.ts # Session profile capture/replay (cookie/localStorage)
│ ├── observability/
│ │ └── metrics.ts # Throughput/success/retry/backlog/latency snapshots
│ ├── templates/
│ │ ├── landing.ts # Landing page HTML
│ │ ├── rendered.ts # Markdown preview page HTML
│ │ ├── loading.ts # SSE loading/progress page HTML
│ │ └── error.ts # Error page HTML
│ └── __tests__/ # 37 test files
├── docs/
│ └── slo-reference.md # SLO targets used by /api/health operational metrics
├── scripts/
│ └── smoke-api.sh # End-to-end API smoke checks for deployed/local worker
├── package.json
├── wrangler.toml # Worker config: browser, KV, R2 bindings
├── tsconfig.json
├── vitest.config.ts
└── .gitignore
Deployment
This project uses Cloudflare Git Integration — push to main and Cloudflare automatically builds and deploys.
Setup (one-time)
Fork or push this repo to GitHub
Create required resources:
# Create KV namespace
wrangler kv namespace create CACHE_KV
# Update the namespace ID in wrangler.toml
# Create R2 bucket
wrangler r2 bucket create md-images
Go to Cloudflare Dashboard > Workers & Pages > Create > Import a Git repository
Select the GitHub repo; Cloudflare will deploy automatically on every push to main
Secrets / Runtime Variables
# Required: Bearer auth for protected write APIs
# Used by: /api/batch, /api/extract, /api/jobs, /api/deepcrawl
wrangler secret put API_TOKEN
# Optional: protect raw convert API + /api/stream
wrangler secret put PUBLIC_API_TOKEN
# Optional: dynamic paywall rules (JSON array)
wrangler secret put PAYWALL_RULES_JSON
# Optional: single upstream proxy (format: username:password@host:port)
wrangler secret put PROXY_URL
# Optional: proxy pool for rotation/fallback (comma or newline separated)
wrangler secret put PROXY_POOL
Optional KV-driven paywall rule source:
Set PAYWALL_RULES_KV_KEY (plain env var) to a KV key that stores JSON paywall rules.
If both PAYWALL_RULES_JSON and a KV key are configured, the KV value takes precedence.
Example plain env var in wrangler.toml:
[vars]
PAYWALL_RULES_KV_KEY = "paywall:rules:v1"
Browser Rendering Binding
[browser]
binding = "MYBROWSER"
Note: Browser Rendering requires a Workers Paid plan. It only works in deployed Workers or with wrangler dev --remote.
Custom Domain
In Cloudflare Dashboard > Workers & Pages > your Worker > Settings > Domains & Routes
Add your custom domain (e.g. md.example.com)
Local Development
npm install
npm run dev # Local dev at http://localhost:8787
npm run build # Dry-run bundle to dist/
npm run typecheck # Type check
npm test # Run unit tests
npm run test:watch # Watch mode
npm run test:coverage # Coverage
npm run smoke:api # API smoke checks (requires BASE_URL + API_TOKEN env vars)
Checkpoint behavior:
Deep crawl checkpoint persistence is only enabled when you provide checkpoint options such as crawl_id, resume, snapshot_interval, or ttl_seconds.
If you omit checkpoint, the API still returns a crawlId for tracing, but no checkpoint record is written.
Resume requests must match the original crawl configuration; changing filters, scoring, or fetch options returns 409 Conflict.
Smoke example:
BASE_URL="https://md.genedai.me" \
API_TOKEN="<api-token>" \
TARGET_URL="https://example.com" \
npm run smoke:api
Validation Workflow (2026-03-06)
Use Node 22 locally (see .nvmrc) or rely on GitHub Actions in .github/workflows/ci.yml:
Check | Command |
Type safety |
|
Unit/integration tests |
|
Coverage |
|
Worker bundle dry-run |
|
Live health check |
|
Live public conversion |
|
Production note:
Protected write APIs (/api/extract, /api/jobs*, /api/deepcrawl, /api/batch) require API_TOKEN.
If API_TOKEN is not configured in the deployed Worker, these endpoints return 503 (API_TOKEN not set).
License
MIT