query_webpage_data
Extract structured data from webpages using natural language descriptions or AgentQL queries. Accepts live URLs or raw HTML as input.
Instructions
Extracts structured data from a webpage using an AgentQL query or a natural language prompt. Accepts either a live URL or raw HTML as the data source.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| content-type | Yes | MIME type of the request body, required to correctly parse the payload. | |
| query | No | An AgentQL (AQL) query string specifying the exact data fields to extract from the page. If omitted, a query will be auto-generated from the prompt. | |
| prompt | No | A natural language description of the data to extract, used to auto-generate an AgentQL query when no explicit query is provided. | |
| url | No | The fully qualified URL of the webpage to load and query. Either url or html must be provided as the data source. | |
| html | No | Raw HTML content of the webpage to query, used as an alternative to providing a live URL. | |
| mode | No | Controls the response generation strategy: 'fast' prioritizes speed, 'standard' prioritizes accuracy and completeness. | |
| wait_for | No | Number of seconds to wait for dynamic page content to load before capturing the snapshot. Maximum allowed wait time is 10 seconds. | |
| is_scroll_to_bottom_enabled | No | When enabled, the browser scrolls to the bottom of the page before capturing the snapshot, useful for triggering lazy-loaded content. | |
| is_screenshot_enabled | No | When enabled, a screenshot of the page is captured during the query session, which may be useful for debugging or visual verification. | |
| browser_profile | No | Determines the browser profile used for the session: 'light' uses a fast headless browser, 'stealth' applies anti-detection techniques for bot-protected pages. | |
| proxy | No | Optional proxy configuration to route the browser session through a specific proxy server, useful for geo-restricted or access-controlled pages. | |
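To make the schema above concrete, here is a small sketch of how a request body for this tool might be assembled. The field names (`query`, `prompt`, `url`, `html`, `params`) come from the schema; `build_query_data_body` is a hypothetical helper written for illustration, not part of the server.

```python
def build_query_data_body(query=None, prompt=None, url=None, html=None, **params):
    """Assemble a /v1/query-data request body, dropping unset fields (a sketch)."""
    body = {
        k: v
        for k, v in {"query": query, "prompt": prompt, "url": url, "html": html}.items()
        if v is not None
    }
    # Optional tuning knobs are nested under "params", mirroring the schema.
    cleaned = {k: v for k, v in params.items() if v is not None}
    if cleaned:
        body["params"] = cleaned
    return body


# Example: an explicit AQL query against a live URL, with two optional params.
body = build_query_data_body(
    query="{ products[] { name price } }",
    url="https://example.com/catalog",
    mode="fast",
    wait_for=5,
)
```

Note that, per the schema, `url` and `html` are both optional individually but one of the two must be supplied as the data source, and `query` falls back to being auto-generated from `prompt` when omitted.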
Implementation Reference
- servers/agentql/server.py:1216 (registration)

  The `@mcp.tool()` decorator registers `query_webpage_data` as an MCP tool on the FastMCP server instance.

  ```python
  @mcp.tool()
  ```

- servers/agentql/server.py:1217-1265 (handler)

  The `query_webpage_data` async function is the tool handler: it constructs a request model, injects auth, and executes an HTTP POST to `/v1/query-data` to extract structured data from a webpage.

  ```python
  async def query_webpage_data(
      content_type: str = Field(..., alias="content-type", description="MIME type of the request body, required to correctly parse the payload."),
      query: str | None = Field(None, description="An AgentQL (AQL) query string specifying the exact data fields to extract from the page. If omitted, a query will be auto-generated from the prompt."),
      prompt: str | None = Field(None, description="A natural language description of the data to extract, used to auto-generate an AgentQL query when no explicit query is provided."),
      url: str | None = Field(None, description="The fully qualified URL of the webpage to load and query. Either url or html must be provided as the data source."),
      html: str | None = Field(None, description="Raw HTML content of the webpage to query, used as an alternative to providing a live URL."),
      mode: Literal["fast", "standard"] | None = Field(None, description="Controls the response generation strategy: 'fast' prioritizes speed, 'standard' prioritizes accuracy and completeness."),
      wait_for: int | None = Field(None, description="Number of seconds to wait for dynamic page content to load before capturing the snapshot. Maximum allowed wait time is 10 seconds."),
      is_scroll_to_bottom_enabled: bool | None = Field(None, description="When enabled, the browser scrolls to the bottom of the page before capturing the snapshot, useful for triggering lazy-loaded content."),
      is_screenshot_enabled: bool | None = Field(None, description="When enabled, a screenshot of the page is captured during the query session, which may be useful for debugging or visual verification."),
      browser_profile: Literal["light", "stealth", "tf-browser"] | None = Field(None, description="Determines the browser profile used for the session: 'light' uses a fast headless browser, 'stealth' applies anti-detection techniques for bot-protected pages."),
      proxy: _models.TetraProxy | _models.CustomProxy | None = Field(None, description="Optional proxy configuration to route the browser session through a specific proxy server, useful for geo-restricted or access-controlled pages."),
  ) -> dict[str, Any] | ToolResult:
      """Extracts structured data from a webpage using an AgentQL query or a natural language prompt.

      Accepts either a live URL or raw HTML as the data source."""
      # Construct request model with validation
      try:
          _request = _models.QueryDataServiceV1QueryDataPostRequest(
              header=_models.QueryDataServiceV1QueryDataPostRequestHeader(content_type=content_type),
              body=_models.QueryDataServiceV1QueryDataPostRequestBody(
                  query=query,
                  prompt=prompt,
                  url=url,
                  html=html,
                  params=_models.QueryDataServiceV1QueryDataPostRequestBodyParams(
                      mode=mode,
                      wait_for=wait_for,
                      is_scroll_to_bottom_enabled=is_scroll_to_bottom_enabled,
                      is_screenshot_enabled=is_screenshot_enabled,
                      browser_profile=browser_profile,
                      proxy=proxy,
                  )
                  if any(v is not None for v in [mode, wait_for, is_scroll_to_bottom_enabled, is_screenshot_enabled, browser_profile, proxy])
                  else None,
              ),
          )
      except pydantic.ValidationError as _validation_err:
          logging.error(f"Parameter validation failed for query_webpage_data: {_validation_err}")
          raise ValueError(f"Invalid parameters: {_validation_err.errors()}") from _validation_err

      # Extract parameters for API call
      _http_path = "/v1/query-data"
      _http_body = _request.body.model_dump(by_alias=True, exclude_none=True) if _request.body else None
      _http_headers = _request.header.model_dump(by_alias=True, exclude_none=True) if _request.header else {}

      # Inject per-operation authentication
      _auth = await _get_auth_for_operation("query_webpage_data")
      _http_headers.update(_auth.get("headers", {}))

      _request_id = str(uuid.uuid4())
      _log_tool_invocation("query_webpage_data", "POST", _http_path, _request_id)

      # Execute request (returns normalized dict and status code)
      _response_data, _ = await _execute_tool_request(
          tool_name="query_webpage_data",
          method="POST",
          path=_http_path,
          request_id=_request_id,
          body=_http_body,
          headers=_http_headers,
      )
      return _response_data
  ```

- servers/agentql/_models.py:28-46 (schema)

  Pydantic request models for the `query_webpage_data` operation: header (content-type), body (query, prompt, url, html), and body params (mode, wait_for, is_scroll_to_bottom_enabled, is_screenshot_enabled, browser_profile, proxy).

  ```python
  class QueryDataServiceV1QueryDataPostRequestHeader(StrictModel):
      content_type: str = Field(default=..., validation_alias="content-type", serialization_alias="content-type", description="MIME type of the request body, required to correctly parse the payload.")


  class QueryDataServiceV1QueryDataPostRequestBodyParams(StrictModel):
      mode: Literal["fast", "standard"] | None = Field(default=None, validation_alias="mode", serialization_alias="mode", description="Controls the response generation strategy: 'fast' prioritizes speed, 'standard' prioritizes accuracy and completeness.")
      wait_for: int | None = Field(default=None, validation_alias="wait_for", serialization_alias="wait_for", description="Number of seconds to wait for dynamic page content to load before capturing the snapshot. Maximum allowed wait time is 10 seconds.")
      is_scroll_to_bottom_enabled: bool | None = Field(default=None, validation_alias="is_scroll_to_bottom_enabled", serialization_alias="is_scroll_to_bottom_enabled", description="When enabled, the browser scrolls to the bottom of the page before capturing the snapshot, useful for triggering lazy-loaded content.")
      is_screenshot_enabled: bool | None = Field(default=None, validation_alias="is_screenshot_enabled", serialization_alias="is_screenshot_enabled", description="When enabled, a screenshot of the page is captured during the query session, which may be useful for debugging or visual verification.")
      browser_profile: Literal["light", "stealth", "tf-browser"] | None = Field(default=None, validation_alias="browser_profile", serialization_alias="browser_profile", description="Determines the browser profile used for the session: 'light' uses a fast headless browser, 'stealth' applies anti-detection techniques for bot-protected pages.")
      proxy: TetraProxy | CustomProxy | None = Field(default=None, validation_alias="proxy", serialization_alias="proxy", description="Optional proxy configuration to route the browser session through a specific proxy server, useful for geo-restricted or access-controlled pages.")


  class QueryDataServiceV1QueryDataPostRequestBody(StrictModel):
      query: str | None = Field(default=None, description="An AgentQL (AQL) query string specifying the exact data fields to extract from the page. If omitted, a query will be auto-generated from the prompt.")
      prompt: str | None = Field(default=None, description="A natural language description of the data to extract, used to auto-generate an AgentQL query when no explicit query is provided.")
      url: str | None = Field(default=None, description="The fully qualified URL of the webpage to load and query. Either url or html must be provided as the data source.")
      html: str | None = Field(default=None, description="Raw HTML content of the webpage to query, used as an alternative to providing a live URL.")
      params: QueryDataServiceV1QueryDataPostRequestBodyParams | None = None


  class QueryDataServiceV1QueryDataPostRequest(StrictModel):
      """Extracts structured data from a webpage using an AgentQL query or a natural language prompt.

      Accepts either a live URL or raw HTML as the data source."""

      header: QueryDataServiceV1QueryDataPostRequestHeader
      body: QueryDataServiceV1QueryDataPostRequestBody | None = None
  ```

- servers/agentql/_auth.py:110 (helper)

  Auth configuration mapping: `query_webpage_data` requires the APIKeyHeader authentication scheme.

  ```python
  "query_webpage_data": [["APIKeyHeader"]],
  ```