AIMLPM/markcrawl
Server Quality Checklist
- Disambiguation: 5/5
Each tool has a clearly distinct purpose: crawl_site handles network fetching and initial extraction, while the remaining tools provide different local analysis modes (listing, reading, searching, and LLM-based structured extraction). No functional overlap exists between tools.
- Naming Consistency: 5/5. All five tools follow a consistent verb_noun snake_case pattern (crawl_site, extract_data, list_pages, read_page, search_pages). The verb choices clearly indicate the action performed (crawl/extract/list/read/search).
- Tool Count: 5/5. Five tools is ideal for this focused domain. The set includes one acquisition tool (crawl_site), three discovery/retrieval tools (list_pages, search_pages, read_page), and one analysis tool (extract_data), with no redundant or extraneous operations.
- Completeness: 4/5. The tool surface covers the complete core workflow from crawling to structured analysis with no dead ends. Minor gaps exist in crawl lifecycle management (no tools for deleting crawls, resuming interrupted crawls, or updating crawl parameters), but these can be worked around via filesystem operations.
Average score: 4.6/5 across all 5 tools scored.
See the tool scores section below for per-tool breakdowns.
This repository includes a README.md file.
This repository includes a LICENSE file.
Latest release: v1.0.2
No tool usage detected in the last 30 days. Usage tracking helps demonstrate server value.
Tip: use the "Try in Browser" feature on the server page to seed initial usage.
This repository includes a glama.json configuration file.
- This server provides 5 tools.
No known security issues or vulnerabilities reported.
This server has been verified by its author.
Tool Scores
extract_data
- Behavior: 4/5
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
With no annotations provided, the description carries full disclosure burden and succeeds well: warns about external OpenAI API dependency, required OPENAI_API_KEY environment variable, output file location ('extracted.jsonl'), and cost implications ('cost more tokens'). Could improve by mentioning error handling or rate limiting behavior.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
- Conciseness: 4/5. Is the description appropriately sized, front-loaded, and free of redundancy?
Well-structured with logical flow: purpose → mechanism → behavioral warnings → use cases → detailed args. The Args section is necessary given zero schema coverage. Slightly verbose in the opening but every section adds distinct value not present in structured fields.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
- Completeness: 4/5. Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given 0% schema coverage, no annotations, and presence of output schema, the description successfully covers: input file handling, output file destination (extracted.jsonl), authentication requirements, and the dual-mode operation (manual fields vs. auto-discovery). Adequately complete for safe invocation.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
- Parameters: 5/5. Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema has 0% description coverage (only titles provided). The Args section compensates comprehensively: documents all 4 parameters with types, defaults, format examples (comma-separated fields), and behavioral notes (context is ignored when fields specified, sample_size tradeoffs). Fully bridges the schema gap.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
- Purpose: 5/5. Does the description clearly state what the tool does and how it differs from similar tools?
Opens with specific verb ('Extract'), clear resource ('structured fields from crawled pages'), and mechanism ('using an LLM'). The LLM-based extraction clearly distinguishes this from siblings like read_page or list_pages which presumably handle raw content rather than structured extraction.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
- Usage Guidelines: 4/5. Does the description explain when to use this tool, when not to, or what alternatives exist?
Provides explicit use cases ('competitive research, API documentation analysis, or building structured datasets') that clarify when to use extraction over simple reading. Documents the conditional logic for auto-discovery vs. specified fields. Lacks explicit comparison to sibling alternatives (e.g., 'use this instead of read_page when...').
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
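The dual-mode argument handling described in this card (explicit fields take precedence, with context ignored when fields are specified; fields is a comma-separated string) can be sketched as a small resolver. This is a hypothetical illustration of the documented behavior; `resolve_mode` and its return shape are not part of the server.

```python
def resolve_mode(fields=None, context=None):
    """Hypothetical argument resolution: explicit fields win; context only guides auto-discovery."""
    if fields:
        # fields is a comma-separated string, e.g. "title, price"
        return {"mode": "manual", "fields": [f.strip() for f in fields.split(",")]}
    return {"mode": "auto", "context": context}

print(resolve_mode(fields="title, price", context="this is ignored"))
print(resolve_mode(context="pricing pages"))
```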
read_page
- Behavior: 4/5
No annotations provided, so description carries full burden. Discloses critical safety info ('read-only operation on local files — no network requests'), return format ('Markdown or text'), and matching behavior ('case-insensitive and tolerates trailing slashes'). Could add error conditions for missing URLs.
- Conciseness: 4/5
Well-structured with purpose front-loaded, followed by return value, usage context, behavioral notes, and Args section. Each sentence earns its place. Slightly verbose in Args section but necessary given zero schema coverage.
- Completeness: 4/5
With output schema present, description appropriately focuses on invocation context rather than return values. Covers purpose, prerequisites, parameter semantics, and safety properties. Missing only edge case handling (e.g., what happens if URL not found in index).
- Parameters: 5/5
Schema has 0% description coverage. Description fully compensates: url includes constraints ('exact URL', 'must match previous crawl', 'case-insensitive') and example; jsonl_path explains default behavior ('defaults to <WEBCRAWLER_OUTPUT_DIR>/pages.jsonl'). Adds essential meaning missing from schema.
- Purpose: 5/5
Specific verb 'Read' + resource 'extracted content of a specific crawled page' + scope 'full content of a single page'. Clearly distinguishes from sibling search_pages (finds results vs reads them) and crawl_site (crawls vs reads existing data).
- Usage Guidelines: 4/5
Explicitly states 'Use this after search_pages to read the full content of a relevant result', establishing clear workflow order. Also notes prerequisite 'Must match a URL from a previous crawl'. Lacks explicit 'when not to use' alternatives, but context is clear.
crawl_site
- Behavior: 4/5
With no annotations provided, the description carries the full burden. It discloses key behavioral traits: robots.txt compliance, sitemap-first discovery, headless Chromium option for SPAs (with performance warning 'Slower but necessary'), and authentication constraints ('Only public, non-authenticated pages will be fetched'). Minor gap: does not mention rate limiting or error handling behavior.
- Conciseness: 4/5
Well-structured with logical flow: purpose → usage context → dependencies → workflow → parameters. Information is front-loaded with the core action. Despite length, every sentence serves a necessary function given the lack of schema documentation and annotations. Minor deduction for slight verbosity in the Args section, though justified by the 0% schema coverage.
- Completeness: 5/5
Comprehensive coverage for a 6-parameter tool with zero annotations and zero schema descriptions. Addresses the tool's role in the ecosystem (workflow integration), output artifacts (.md files and pages.jsonl), and behavioral constraints. Since an output schema exists (per context signals), the brief mention of output files is sufficient.
- Parameters: 5/5
The schema has 0% description coverage, requiring the description to compensate fully. The Args section provides rich semantic detail for all 6 parameters: examples ('https://docs.example.com/'), value guidance ('Use lower values (10-20) for quick previews'), format implications ('preserves headings, code blocks'), and technology context ('Required for React/Vue/Angular sites').
- Purpose: 5/5
The description opens with a specific verb and resource ('Crawl a website and save extracted content as clean Markdown or plain text'), clearly distinguishing it from siblings like 'read_page' (single page) or 'extract_data' (data extraction). It further clarifies the scope by mentioning boilerplate stripping and robots.txt respect.
- Usage Guidelines: 5/5
Explicitly states when to use ('when asked to research, read, analyze, or archive a website') and provides a clear workflow sequence ('crawl_site → list_pages or search_pages → read_page'). Critically, it notes the dependency relationship: 'The output_dir from this tool is required by search_pages, read_page, list_pages, and extract_data', which is essential for correct orchestration.
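The workflow sequence called out above (crawl_site → list_pages or search_pages → read_page) can be simulated end to end. The stubs below mirror the tool names and the documented matching rules, but run against an in-memory fixture rather than the real network and pages.jsonl file; every name and return shape here is an illustrative assumption, not the server's actual code.

```python
PAGES = {}  # url -> {"title": ..., "content": ...}; stands in for pages.jsonl on disk

def crawl_site(url, fixture):
    """Stub: the real tool fetches over the network and writes Markdown files plus pages.jsonl."""
    PAGES.update(fixture)
    return {"output_dir": "/tmp/crawl", "pages_crawled": len(fixture)}

def list_pages():
    """Overview of every crawled page: URL, title, word count."""
    return [{"url": u, "title": p["title"], "word_count": len(p["content"].split())}
            for u, p in PAGES.items()]

def read_page(url):
    """Full content of a single page; matching is case-insensitive and tolerates trailing slashes."""
    key = url.rstrip("/").lower()
    for u, p in PAGES.items():
        if u.rstrip("/").lower() == key:
            return p["content"]
    raise KeyError(f"URL not found in crawl index: {url}")

crawl_site("https://docs.example.com/", {
    "https://docs.example.com/intro": {"title": "Intro", "content": "Getting started guide"},
})
print(read_page("https://docs.example.com/INTRO/"))  # case- and slash-tolerant lookup
```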
list_pages
- Behavior: 4/5
With no annotations provided, the description carries full burden of behavioral disclosure. It explicitly states 'read-only operation on local files' and 'no network requests are made,' addressing safety and execution context. It also clarifies that it returns a summary of 'every page in the crawl index.' Minor gap: doesn't mention error handling (e.g., file not found) or performance with large files.
- Conciseness: 5/5
Well-structured with front-loaded key details (URLs, titles, word counts in first sentence). Zero waste: each sentence addresses functionality, return value, usage timing, data interpretation (word counts), safety characteristics, and parameter details. The Args section efficiently documents the single parameter.
- Completeness: 5/5
Given the tool has an output schema (so return values needn't be detailed) and only one optional parameter, the description is comprehensive. It covers purpose, workflow context, safety constraints, and parameter semantics—sufficient for an agent to invoke correctly without external documentation.
- Parameters: 5/5
Schema description coverage is 0% (parameter lacks description field), but the description fully compensates by documenting: 'Full path to the pages.jsonl file' and the default behavior 'If empty, defaults to <WEBCRAWLER_OUTPUT_DIR>/pages.jsonl.' This provides complete semantic meaning where the schema fails to do so.
- Purpose: 5/5
Description explicitly states the verb (List), resource (pages from a previous crawl), and return details (URLs, titles, word counts). It clearly distinguishes from siblings like crawl_site (emphasizes 'previous crawl' vs. new crawl) and read_page/search_pages (emphasizes 'overview' vs. specific content retrieval).
- Usage Guidelines: 5/5
Provides explicit workflow guidance: 'Use this to get an overview of available content before searching or reading specific pages.' This establishes when to use the tool (before search/read operations) and implicitly references sibling alternatives (search_pages, read_page).
search_pages
- Behavior: 5/5
With no annotations provided, description carries full burden and excels: discloses read-only/local-file nature ('no network requests'), ranking algorithm ('ranked by the number of matching query words'), search logic ('case-insensitive', 'OR logic'), and result structure ('URL, title, and a text snippet').
- Conciseness: 4/5
Well-structured with logical flow from general function to specific behavior to parameters. Length is appropriate given zero schema coverage necessitating detailed Args section. No redundant sentences, though slightly verbose compared to ultra-concise ideal.
- Completeness: 5/5
Comprehensive for a search tool: covers search mechanics, ranking, result format, safety characteristics, file dependencies, and all three parameters. Output schema exists (per context signals), so brief mention of result contents is sufficient without detailing return structure.
- Parameters: 5/5
Schema has 0% description coverage (only titles), but description fully compensates: explains query syntax ('one or more keywords separated by spaces', example provided), jsonl_path context ('from a previous crawl', default location), and max_results usage guidance ('lower values for focused searches').
- Purpose: 5/5
Specific verb+resource ('Search through previously crawled pages by keyword') clearly distinguishes from siblings: contrasts with crawl_site (which populates data) and implies difference from list_pages/read_page by specifying keyword search across content rather than enumeration or retrieval.
- Usage Guidelines: 5/5
Explicitly states prerequisite dependency: 'Requires a prior crawl_site call to have populated the pages.jsonl file.' This clearly defines when to use (after crawling) and implies when not to use (no crawl data available), providing essential workflow context.
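The matching semantics documented for search_pages (case-insensitive, OR logic, results ranked by the number of matching query words) are easy to model. The sketch below is an assumed reimplementation for illustration only, not the server's actual code; snippet extraction is omitted.

```python
def search_pages(query, pages, max_results=10):
    """Rank pages by how many of the query's words they contain (case-insensitive OR)."""
    words = query.lower().split()  # one or more keywords separated by spaces
    scored = []
    for page in pages:
        text = page["content"].lower()
        hits = sum(1 for w in words if w in text)
        if hits:  # OR logic: any single matching word qualifies a page
            scored.append((hits, page))
    scored.sort(key=lambda t: -t[0])  # most matching words first
    return [p for _, p in scored[:max_results]]

pages = [
    {"url": "/a", "content": "Install the crawler with pip"},
    {"url": "/b", "content": "Crawler configuration and install options"},
]
print([p["url"] for p in search_pages("install crawler options", pages)])  # → ['/b', '/a']
```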
GitHub Badge
Glama performs regular codebase and documentation scans to:
- Confirm that the MCP server is working as expected.
- Confirm that there are no obvious security issues.
- Evaluate tool definition quality.
Our badge communicates server capabilities, safety, and installation instructions.
Card Badge
Copy to your README.md:
Score Badge
Copy to your README.md:
How to claim the server?
If you are the author of the server, you simply need to authenticate using GitHub.
However, if the MCP server belongs to an organization, you need to first add glama.json to the root of your repository.
{
"$schema": "https://glama.ai/mcp/schemas/server.json",
"maintainers": [
"your-github-username"
]
}
Then, authenticate using GitHub.
Browse examples.
How to make a release?
A "release" on Glama is not the same as a GitHub release. To create a Glama release:
- Claim the server if you haven't already.
- Go to the Dockerfile admin page, configure the build spec, and click Deploy.
- Once the build test succeeds, click Make Release, enter a version, and publish.
This process allows Glama to run security checks on your server and enables users to deploy it.
How to add a LICENSE?
Please follow the instructions in the GitHub documentation.
Once GitHub recognizes the license, the system will automatically detect it within a few hours.
If the license does not appear on the server after some time, you can manually trigger a new scan using the MCP server admin interface.
How to sync the server with GitHub?
Servers are automatically synced at least once per day, but you can also sync manually at any time to instantly update the server profile.
To manually sync the server, click the "Sync Server" button in the MCP server admin interface.
How is the quality score calculated?
The overall quality score combines two components: Tool Definition Quality (70%) and Server Coherence (30%).
Tool Definition Quality measures how well each tool describes itself to AI agents. Every tool receives a Tool Definition Quality Score (TDQS) from 1–5, combining six weighted dimensions: Purpose Clarity (25%), Usage Guidelines (20%), Behavioral Transparency (20%), Parameter Semantics (15%), Conciseness & Structure (10%), and Contextual Completeness (10%). The server-level definition quality score is 60% mean TDQS + 40% minimum TDQS, so a single poorly described tool pulls the overall score down.
Server Coherence evaluates how well the tools work together as a set, scoring four dimensions equally: Disambiguation (can agents tell tools apart?), Naming Consistency, Tool Count Appropriateness, and Completeness (are there gaps in the tool surface?).
Tiers are derived from the overall score: A (≥3.5), B (≥3.0), C (≥2.0), D (≥1.0), F (<1.0). B and above is considered passing.
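The formula can be checked against this server's own numbers. The sketch below plugs the per-card dimension scores into the stated weights; tool names are inferred from each card's Purpose assessment, and Glama's internal rounding may differ slightly.

```python
# Six TDQS dimensions with the weights stated above.
W = {"purpose": 0.25, "usage": 0.20, "behavior": 0.20,
     "params": 0.15, "concise": 0.10, "complete": 0.10}

def tdqs(scores):
    """Weighted 1-5 definition-quality score for a single tool."""
    return sum(W[dim] * s for dim, s in scores.items())

# Per-tool dimension scores as reported in the cards above.
tools = {
    "extract_data": {"purpose": 5, "usage": 4, "behavior": 4, "params": 5, "concise": 4, "complete": 4},
    "read_page":    {"purpose": 5, "usage": 4, "behavior": 4, "params": 5, "concise": 4, "complete": 4},
    "crawl_site":   {"purpose": 5, "usage": 5, "behavior": 4, "params": 5, "concise": 4, "complete": 5},
    "list_pages":   {"purpose": 5, "usage": 5, "behavior": 4, "params": 5, "concise": 5, "complete": 5},
    "search_pages": {"purpose": 5, "usage": 5, "behavior": 5, "params": 5, "concise": 4, "complete": 5},
}
scores = [tdqs(s) for s in tools.values()]
tool_quality = 0.6 * (sum(scores) / len(scores)) + 0.4 * min(scores)  # 60% mean + 40% min
coherence = (5 + 5 + 5 + 4) / 4  # disambiguation, naming, tool count, completeness
overall = 0.7 * tool_quality + 0.3 * coherence
print(round(overall, 2))  # → 4.61
```

The per-tool mean works out to 4.64, consistent with the 4.6/5 average reported above, and the overall score clears the Tier A threshold of 3.5.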
Latest Blog Posts
MCP directory API
We provide all the information about MCP servers via our MCP API.
curl -X GET 'https://glama.ai/api/mcp/v1/servers/AIMLPM/markcrawl'
If you have feedback or need assistance with the MCP directory API, please join our Discord server.