
AIMLPM/markcrawl

Server Configuration

Describes the environment variables used to configure the server. None is strictly required; each API key is needed only when the corresponding provider is used.

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| XAI_API_KEY | No | API key for xAI/Grok (required for extraction tool with `--provider grok`) | (none) |
| GEMINI_API_KEY | No | API key for Google Gemini (required for extraction tool with `--provider gemini`) | (none) |
| OPENAI_API_KEY | No | API key for OpenAI (required for extraction tool with `--provider openai` and Supabase upload) | (none) |
| ANTHROPIC_API_KEY | No | API key for Anthropic/Claude (required for extraction tool with `--provider anthropic`) | (none) |
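The provider-to-key mapping above can be sketched as a small lookup, handy for failing fast before starting an extraction run. The mapping comes from the table; the helper functions themselves are illustrative, not part of markcrawl:

```python
import os

# Which environment variable each --provider value needs
# (taken from the configuration table above).
PROVIDER_KEYS = {
    "grok": "XAI_API_KEY",
    "gemini": "GEMINI_API_KEY",
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}

def required_key(provider: str) -> str:
    """Return the env var name a given extraction provider requires."""
    try:
        return PROVIDER_KEYS[provider]
    except KeyError:
        raise ValueError(f"unknown provider: {provider}")

def key_is_set(provider: str) -> bool:
    """True if the key for this provider is present in the environment."""
    return bool(os.environ.get(required_key(provider)))
```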

Capabilities

Features and capabilities supported by this server

| Capability | Details |
| --- | --- |
| tools | `{"listChanged": false}` |
| prompts | `{"listChanged": false}` |
| resources | `{"subscribe": false, "listChanged": false}` |
| experimental | `{}` |

Tools

Functions exposed to the LLM to take actions

crawl_site

Crawl a website and save extracted content as clean Markdown or plain text.

This tool fetches pages from the given URL, strips navigation, footers,
scripts, and boilerplate, then saves each page as a Markdown file with a
JSONL index (pages.jsonl). It respects robots.txt and uses sitemap-first
discovery when available.

Use this tool when asked to research, read, analyze, or archive a website.
The output_dir from this tool is required by search_pages, read_page,
list_pages, and extract_data.

Typical workflow: crawl_site → list_pages or search_pages → read_page.

Args:
    url: The base URL to crawl (e.g. "https://docs.example.com/"). Only
        public, non-authenticated pages will be fetched.
    output_dir: Directory to save output files. Each crawl creates .md files
        and a pages.jsonl index here. Default: ./crawl_output
    format: Output format — "markdown" (preserves headings, code blocks,
        lists) or "text" (plain text). Default: "markdown".
    max_pages: Maximum number of pages to save. Set to 0 for unlimited.
        Default: 100. Use lower values (10-20) for quick previews.
    include_subdomains: If True, also crawl subdomains (e.g. docs.example.com
        when crawling example.com). Default: False.
    render_js: If True, use a headless Chromium browser to render JavaScript
        before extracting content. Required for React/Vue/Angular sites.
        Slower but necessary for SPAs. Default: False.
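The robots.txt handling mentioned above can be approximated with the standard library. This is a sketch of the permission check a polite crawler runs before fetching each URL, not markcrawl's actual implementation:

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Check whether the given robots.txt text permits `agent` to fetch `url`."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)

# Example policy: everything allowed except /private/
rules = "User-agent: *\nDisallow: /private/\n"
```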
search_pages

Search through previously crawled pages by keyword.

Performs case-insensitive keyword search across page titles and text content.
Results are ranked by the number of matching query words found. Each result
includes the page URL, title, and a text snippet showing context around the
first match.

This is a read-only operation on local files — no network requests are made.
Requires a prior crawl_site call to have populated the pages.jsonl file.

Args:
    query: Search query — one or more keywords separated by spaces. All words
        are searched independently (OR logic). Example: "authentication API key".
    jsonl_path: Full path to the pages.jsonl file from a previous crawl. If
        empty, defaults to <WEBCRAWLER_OUTPUT_DIR>/pages.jsonl.
    max_results: Maximum number of results to return. Default: 10. Use lower
        values for focused searches, higher for comprehensive surveys.
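The OR-keyword ranking described above can be sketched in a few lines. The record fields (`url`, `title`, `text`) follow the pages.jsonl shape implied by this page; the exact scoring is an assumption, not markcrawl's algorithm:

```python
def search(pages, query, max_results=10):
    """Rank pages by how many distinct query words appear (case-insensitive OR)."""
    words = [w.lower() for w in query.split()]
    scored = []
    for page in pages:
        haystack = (page.get("title", "") + " " + page.get("text", "")).lower()
        score = sum(1 for w in words if w in haystack)
        if score:  # keep only pages matching at least one word
            scored.append((score, page))
    scored.sort(key=lambda item: -item[0])
    return [page for _, page in scored[:max_results]]
```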
read_page

Read the full extracted content of a specific crawled page by its URL.

Returns the complete Markdown or text content of a single page, including
its title and source URL. Use this after search_pages to read the full
content of a relevant result.

This is a read-only operation on local files — no network requests are made.
URL matching is case-insensitive and tolerates trailing slashes.

Args:
    url: The exact URL of the page to read. Must match a URL from a previous
        crawl. Case-insensitive. Example: "https://docs.example.com/auth".
    jsonl_path: Full path to the pages.jsonl file. If empty, defaults to
        <WEBCRAWLER_OUTPUT_DIR>/pages.jsonl.
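The case-insensitive, trailing-slash-tolerant URL matching described above might look like the following. The normalization rules come from the text; the function names are illustrative:

```python
def normalize(url: str) -> str:
    """Lowercase the URL and drop any trailing slash so lookups match loosely."""
    return url.strip().lower().rstrip("/")

def find_page(pages, url):
    """Return the first crawled page whose URL matches after normalization."""
    target = normalize(url)
    for page in pages:
        if normalize(page.get("url", "")) == target:
            return page
    return None
```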
list_pages

List all pages from a previous crawl with their URLs, titles, and word counts.

Returns a summary of every page in the crawl index. Use this to get an
overview of available content before searching or reading specific pages.
Word counts help identify content-rich pages vs. thin landing pages.

This is a read-only operation on local files — no network requests are made.

Args:
    jsonl_path: Full path to the pages.jsonl file. If empty, defaults to
        <WEBCRAWLER_OUTPUT_DIR>/pages.jsonl.
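A minimal reader producing the URL/title/word-count summary described above, one record per pages.jsonl line. The field names are assumptions based on this page:

```python
import json

def list_pages(jsonl_path):
    """Yield (url, title, word_count) for each record in a pages.jsonl file."""
    with open(jsonl_path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            rec = json.loads(line)
            text = rec.get("text", "")
            yield rec.get("url", ""), rec.get("title", ""), len(text.split())
```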
extract_data

Extract structured fields from crawled pages using an LLM.

Analyzes each crawled page and pulls out specific data fields you define
(e.g. company_name, pricing, features, api_endpoints). If no fields are
specified, the LLM automatically discovers relevant fields by sampling
pages from the crawl.

This tool makes external API calls to OpenAI (requires OPENAI_API_KEY
environment variable). Results are saved to extracted.jsonl and include
LLM attribution metadata.

Use this for competitive research, API documentation analysis, or building
structured datasets from unstructured web content.

Args:
    jsonl_path: Full path to the pages.jsonl file. If empty, defaults to
        <WEBCRAWLER_OUTPUT_DIR>/pages.jsonl.
    fields: Comma-separated field names to extract. Example:
        "company_name,pricing,features,api_endpoints". Leave empty to
        let the LLM auto-discover the most relevant fields.
    context: Description of your analysis goal. Improves auto-field
        discovery quality. Example: "competitor pricing analysis" or
        "API documentation review". Ignored when fields are specified.
    sample_size: Number of pages to sample for auto-field discovery.
        Default: 3. Higher values give better field suggestions but
        cost more tokens.
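The fields parameter above is a plain comma-separated string. A sketch of how it might be split, with an empty string signalling auto-discovery (the helper name and the None convention are illustrative):

```python
def parse_fields(fields: str):
    """Split a comma-separated field list; an empty string means auto-discover."""
    names = [f.strip() for f in fields.split(",") if f.strip()]
    return names or None  # None signals auto-discovery to the extractor
```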

Prompts

Interactive templates invoked by user choice

No prompts.

Resources

Contextual data attached and managed by the client

No resources.
