| crawl_site | Crawl a website and save extracted content as clean Markdown or plain text. This tool fetches pages from the given URL, strips navigation, footers,
scripts, and boilerplate, then saves each page as a Markdown file with a
JSONL index (pages.jsonl). It respects robots.txt and uses sitemap-first
discovery when available.
Use this tool when asked to research, read, analyze, or archive a website.
The pages.jsonl index written to output_dir is required by search_pages,
read_page, list_pages, and extract_data.
Typical workflow: crawl_site → list_pages or search_pages → read_page.
Args:
url: The base URL to crawl (e.g. "https://docs.example.com/"). Only
public, non-authenticated pages will be fetched.
output_dir: Directory to save output files. Each crawl creates .md files
and a pages.jsonl index here. Default: ./crawl_output
format: Output format — "markdown" (preserves headings, code blocks,
lists) or "text" (plain text). Default: "markdown".
max_pages: Maximum number of pages to save. Set to 0 for unlimited.
Default: 100. Use lower values (10-20) for quick previews.
include_subdomains: If True, also crawl subdomains (e.g. docs.example.com
when crawling example.com). Default: False.
render_js: If True, use a headless Chromium browser to render JavaScript
before extracting content. Required for React/Vue/Angular sites.
Slower but necessary for SPAs. Default: False.
|
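The output layout described above (one .md file per page plus a pages.jsonl index) can be sketched as follows. This is a minimal illustration, not the tool's implementation; the record keys (`url`, `title`, `file`, `word_count`) and the slug scheme are assumptions about the schema.

```python
import json
import pathlib
import re

def save_page(output_dir, url, title, markdown):
    """Save one crawled page as a .md file and append a record to pages.jsonl.

    A sketch of the layout crawl_site describes; the exact record keys
    and filename scheme are assumptions, not the tool's actual schema.
    """
    out = pathlib.Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Derive a filesystem-safe filename from the URL.
    slug = re.sub(r"[^a-z0-9]+", "-", url.lower()).strip("-") or "index"
    md_path = out / f"{slug}.md"
    md_path.write_text(markdown, encoding="utf-8")
    record = {
        "url": url,
        "title": title,
        "file": md_path.name,
        "word_count": len(markdown.split()),
    }
    # pages.jsonl holds one JSON record per line (JSON Lines format).
    with open(out / "pages.jsonl", "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record
```

The downstream tools only need the pages.jsonl path, which is why it is the handle passed to search_pages, read_page, list_pages, and extract_data.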
| search_pages | Search through previously crawled pages by keyword. Performs case-insensitive keyword search across page titles and text content.
Results are ranked by the number of matching query words found. Each result
includes the page URL, title, and a text snippet showing context around the
first match.
This is a read-only operation on local files — no network requests are made.
Requires a prior crawl_site call to have populated the pages.jsonl file.
Args:
query: Search query — one or more keywords separated by spaces. All words
are searched independently (OR logic). Example: "authentication API key".
jsonl_path: Full path to the pages.jsonl file from a previous crawl. If
empty, defaults to <WEBCRAWLER_OUTPUT_DIR>/pages.jsonl.
max_results: Maximum number of results to return. Default: 10. Use lower
values for focused searches, higher for comprehensive surveys.
|
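The ranking described above (case-insensitive OR matching, scored by how many query words hit, with a snippet around the first match) can be sketched in a few lines of Python. The `title` and `text` record keys are assumptions about the pages.jsonl schema:

```python
import json

def search_pages(jsonl_path, query, max_results=10):
    """Keyword search over a pages.jsonl index.

    A sketch of the behaviour search_pages describes: words are matched
    independently (OR logic), and results are ranked by how many distinct
    query words appear in the title or text.
    """
    words = [w.lower() for w in query.split()]
    results = []
    with open(jsonl_path, encoding="utf-8") as fh:
        for line in fh:
            page = json.loads(line)
            haystack = (page.get("title", "") + " " + page.get("text", "")).lower()
            hits = [w for w in words if w in haystack]
            if not hits:
                continue
            # Snippet: a window of context around the first matching word.
            i = haystack.find(hits[0])
            snippet = haystack[max(0, i - 40): i + 40]
            results.append({
                "url": page["url"],
                "title": page.get("title", ""),
                "score": len(hits),
                "snippet": snippet,
            })
    results.sort(key=lambda r: r["score"], reverse=True)
    return results[:max_results]
```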
| read_page | Read the full extracted content of a specific crawled page by its URL. Returns the complete Markdown or text content of a single page, including
its title and source URL. Use this after search_pages to read the full
content of a relevant result.
This is a read-only operation on local files — no network requests are made.
URL matching is case-insensitive and tolerates trailing slashes.
Args:
url: The URL of the page to read. Must match a URL from a previous
crawl; matching is case-insensitive and ignores trailing slashes.
Example: "https://docs.example.com/auth".
jsonl_path: Full path to the pages.jsonl file. If empty, defaults to
<WEBCRAWLER_OUTPUT_DIR>/pages.jsonl.
|
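The tolerant URL matching described above (case-insensitive, trailing slashes ignored) amounts to a normalize-then-compare lookup. A minimal sketch, assuming records carry a `url` key:

```python
import json

def normalize(url):
    """Lower-case the URL and drop any trailing slash, mirroring the
    tolerant matching read_page describes."""
    return url.lower().rstrip("/")

def read_page(jsonl_path, url):
    """Return the index record whose URL matches after normalization,
    or None if no page matches. A sketch only; the real tool returns
    the page's full Markdown or text content."""
    target = normalize(url)
    with open(jsonl_path, encoding="utf-8") as fh:
        for line in fh:
            page = json.loads(line)
            if normalize(page["url"]) == target:
                return page
    return None
```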
| list_pages | List all pages from a previous crawl with their URLs, titles, and word counts. Returns a summary of every page in the crawl index. Use this to get an
overview of available content before searching or reading specific pages.
Word counts help identify content-rich pages vs. thin landing pages.
This is a read-only operation on local files — no network requests are made.
Args:
jsonl_path: Full path to the pages.jsonl file. If empty, defaults to
<WEBCRAWLER_OUTPUT_DIR>/pages.jsonl.
|
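The summary view described above is a straightforward pass over the index. A sketch, assuming each record has a `url` and optionally `title`, `word_count`, or `text` keys (the fallback word count is an assumption, not documented behaviour):

```python
import json

def list_pages(jsonl_path):
    """Summarize every record in the crawl index: URL, title, word count.

    Falls back to counting words in 'text' when the index record lacks a
    precomputed word_count (an assumption about the record shape).
    """
    summary = []
    with open(jsonl_path, encoding="utf-8") as fh:
        for line in fh:
            page = json.loads(line)
            count = page.get("word_count", len(page.get("text", "").split()))
            summary.append({
                "url": page["url"],
                "title": page.get("title", ""),
                "word_count": count,
            })
    return summary
```

Sorting this list by word_count descending is a quick way to surface the content-rich pages the description mentions.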
| extract_data | Extract structured fields from crawled pages using an LLM. Analyzes each crawled page and pulls out specific data fields you define
(e.g. company_name, pricing, features, api_endpoints). If no fields are
specified, the LLM automatically discovers relevant fields by sampling
pages from the crawl.
This tool makes external API calls to OpenAI (requires OPENAI_API_KEY
environment variable). Results are saved to extracted.jsonl and include
LLM attribution metadata.
Use this for competitive research, API documentation analysis, or building
structured datasets from unstructured web content.
Args:
jsonl_path: Full path to the pages.jsonl file. If empty, defaults to
<WEBCRAWLER_OUTPUT_DIR>/pages.jsonl.
fields: Comma-separated field names to extract. Example:
"company_name,pricing,features,api_endpoints". Leave empty to
let the LLM auto-discover the most relevant fields.
context: Description of your analysis goal. Improves auto-field
discovery quality. Example: "competitor pricing analysis" or
"API documentation review". Ignored when fields are specified.
sample_size: Number of pages to sample for auto-field discovery.
Default: 3. Higher values give better field suggestions but
cost more tokens.
|
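Two mechanical parts of the workflow above can be sketched without touching the OpenAI API: parsing the comma-separated `fields` argument, and shaping an extracted.jsonl record with attribution metadata. The record keys and model string here are hypothetical, not the tool's actual schema:

```python
import json

def parse_fields(fields):
    """Turn the comma-separated `fields` argument into a clean list.
    An empty string means 'auto-discover', as extract_data describes."""
    return [f.strip() for f in fields.split(",") if f.strip()]

def build_record(url, extracted, model="gpt-4o-mini"):
    """Shape one extracted.jsonl record with LLM attribution metadata.
    Key names and the model string are assumptions for illustration."""
    return {
        "url": url,
        "fields": extracted,
        "llm": {"provider": "openai", "model": model},
    }

def write_records(jsonl_path, records):
    """Append extraction records to extracted.jsonl, one JSON per line."""
    with open(jsonl_path, "a", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(rec) + "\n")
```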