Glama

extract_structured_data

Extract structured JSON data from any public webpage by defining the fields you need. Specify a schema and URL to get validated, typed JSON output.

Instructions

Extract structured JSON from any public webpage using Extrapify's schema-guided extraction engine. Define the fields you want (title, price, author, tags, etc.) and their types, point the tool at a URL, and get back validated, typed JSON. Handles JavaScript-heavy pages via Browserless rendering. Ideal for scraping product pages, articles, job listings, company data, search results, and any other structured web content. Returns extracted fields, confidence score, item count, and tokens used.
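
As a rough illustration of the call described above, the sketch below builds the JSON-RPC 2.0 `tools/call` message that an MCP client would typically send for this tool. It assumes the server follows the standard MCP request shape; the helper name `build_tool_call` and the example URL and fields are hypothetical, and the transport (stdio or HTTP) is deliberately left out.

```python
import json

TOOL_NAME = "extract_structured_data"


def build_tool_call(url, schema, request_id=1):
    """Return a JSON-RPC 2.0 'tools/call' request for the extraction tool.

    Hypothetical helper: it only assembles the message, it does not send it.
    """
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": TOOL_NAME,
            # "mode" is optional and omitted here, so the documented
            # default ("auto") would apply.
            "arguments": {"url": url, "schema": schema},
        },
    }


request = build_tool_call(
    "https://example.com/article",
    {"title": "string", "author": "string", "tags": "string[]"},
)
print(json.dumps(request, indent=2))
```

The shape of the result (extracted fields, confidence score, item count, tokens used) is not specified on this page, so the sketch stops at the request.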

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| url | Yes | Fully qualified public webpage URL to extract structured data from (e.g. https://example.com/article). Must be publicly accessible. Does not support login-protected or paywalled pages. | |
| schema | Yes | Schema definition that controls what fields to extract. Each key is the field name and each value is the field type. Supported types: "string", "number", "integer", "float", "boolean", "date", "datetime", "url", and array variants using [] suffix (e.g. "string[]"). Example: { "title": "string", "price": "number", "tags": "string[]", "published_at": "date" }. Nested objects are supported for grouped fields. | |
| mode | No | Extraction mode controlling how many items are returned. "auto" detects automatically based on page structure (recommended). "single" forces extraction of one primary item only (use for product pages, articles, profiles). "list" extracts all matching items as an array (use for search results, directories, tables). Default: "auto". | auto |
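
Because the schema and mode rules above are easy to get wrong, a client may want to check its arguments before calling the tool. The following is a minimal sketch of such a pre-flight check, based only on the types and modes listed in the table; the names `validate_schema` and `build_arguments` are hypothetical, and the server's real validation may be stricter or looser (for example, arrays of nested objects are not covered here).

```python
# Scalar types and modes copied from the parameter table above.
SCALAR_TYPES = {
    "string", "number", "integer", "float",
    "boolean", "date", "datetime", "url",
}
MODES = {"auto", "single", "list"}


def validate_schema(schema, path=""):
    """Return a list of error strings; an empty list means the schema looks valid."""
    if not isinstance(schema, dict):
        return [f"{path or 'schema'}: expected an object"]
    errors = []
    for name, spec in schema.items():
        field = f"{path}.{name}" if path else name
        if isinstance(spec, dict):
            # Nested object used to group fields: validate it recursively.
            errors.extend(validate_schema(spec, field))
        elif isinstance(spec, str):
            # Strip a single "[]" suffix to get the element type of an array.
            base = spec[:-2] if spec.endswith("[]") else spec
            if base not in SCALAR_TYPES:
                errors.append(f"{field}: unsupported type {spec!r}")
        else:
            errors.append(f"{field}: type must be a string or a nested object")
    return errors


def build_arguments(url, schema, mode="auto"):
    """Assemble the tool arguments, raising ValueError on obvious mistakes."""
    if not url.startswith(("http://", "https://")):
        raise ValueError("url must be fully qualified (http:// or https://)")
    if mode not in MODES:
        raise ValueError(f"mode must be one of {sorted(MODES)}")
    errors = validate_schema(schema)
    if errors:
        raise ValueError("; ".join(errors))
    return {"url": url, "schema": schema, "mode": mode}
```

For example, `validate_schema({"seller": {"name": "string", "rating": "decimal"}})` reports the nested field `seller.rating`, since "decimal" is not among the listed types.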
Behavior: 5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description fully covers behavioral traits. It explains the extraction process (schema-guided, handles JS with Browserless), return format (validated typed JSON), and what is returned ('extracted fields, confidence score, item count, and tokens used'). It also discloses that it only works on public pages, setting clear expectations.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is efficiently structured: it starts with the primary purpose, then elaborates on features and ideal use cases, and ends with return values. Every sentence adds unique information without redundancy. It is front-loaded with the most important detail (what it extracts) and fits within a few sentences.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (3 parameters, one being a complex object schema) and no output schema, the description is remarkably complete. It covers all input parameters with usage guidance, explains the extraction engine's capabilities, and lists the output fields. No critical information is missing for an agent to decide and invoke the tool correctly.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 5/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

The description adds significant value beyond the schema's property descriptions. For 'url', it specifies that the page must be public and not login-protected or paywalled. For 'schema', it provides supported types, examples, and notes nested object support. For 'mode', it explains each enum value with concrete use cases. This extra context is critical for correct usage.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's action: 'Extract structured JSON from any public webpage' using a schema-guided engine. It specifies the verb (extract), resource (structured JSON), and context (public webpages, JavaScript-heavy handling). Since there are no sibling tools, there is nothing it needs to be distinguished from.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 5/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description explicitly states when to use the tool: 'Ideal for scraping product pages, articles, job listings, company data, search results, and any other structured web content.' It also clarifies limitations: 'Does not support login-protected or paywalled pages.' The mode parameter description provides additional guidance on when to use 'single' vs 'list' modes.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/christ0pper/extrapify-mcp'
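
The same request can be sketched in Python with only the standard library. The helper names `build_server_request` and `fetch_server` are hypothetical, and the sketch assumes the endpoint returns a JSON body, which this page does not state explicitly.

```python
import json
import urllib.request
from urllib.parse import quote

API_BASE = "https://glama.ai/api/mcp/v1/servers"


def build_server_request(owner, name):
    """Build (but do not send) the GET request shown in the curl command."""
    url = f"{API_BASE}/{quote(owner, safe='')}/{quote(name, safe='')}"
    return urllib.request.Request(
        url, method="GET", headers={"Accept": "application/json"}
    )


def fetch_server(owner, name, timeout=10):
    """Send the request and decode the body, assuming it is JSON."""
    request = build_server_request(owner, name)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


req = build_server_request("christ0pper", "extrapify-mcp")
print(req.get_method(), req.full_url)
```

Calling `fetch_server("christ0pper", "extrapify-mcp")` would perform the actual network request; it is kept separate so the request can be inspected first.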

If you have feedback or need assistance with the MCP directory API, please join our Discord server.