# s_fetch_pattern
Extracts content matching regex patterns from web pages while avoiding bot detection. The fetching modes (basic, stealth, max-stealth) trade speed for increasingly aggressive bot-detection evasion.
## Instructions
Extracts content matching regex patterns from web pages. Retrieves specific content from websites with bot-detection avoidance. For best performance, start with 'basic' mode (fastest), then only escalate to 'stealth' or 'max-stealth' modes if basic mode fails. Returns matched content as 'METADATA: {json}\n\n[content]' where metadata includes match statistics and truncation information. Each matched content chunk is delimited with '॥๛॥' and prefixed with '[Position: start-end]' indicating its byte position in the original document, allowing targeted follow-up requests with s-fetch-page using specific start_index values.
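Because the response is a single string, a client has to split the metadata header from the delimited chunks itself. Below is a minimal parsing sketch, assuming only the response format documented above; `parse_pattern_response` is a hypothetical helper, not part of this package:

```python
import json
import re

# Hypothetical client-side parser for the documented response format:
# 'METADATA: {json}\n\n' followed by '॥๛॥'-delimited chunks, each prefixed
# with '[Position: start-end]'.
def parse_pattern_response(response: str) -> tuple[dict, list[tuple[int, int, str]]]:
    header, _, body = response.partition("\n\n")
    metadata = json.loads(header.removeprefix("METADATA: "))
    chunks = []
    for section in body.split("॥๛॥"):
        m = re.match(r"\s*\[Position: (\d+)-(\d+)\]\n", section)
        if m:
            start, end = int(m.group(1)), int(m.group(2))
            chunks.append((start, end, section[m.end():]))
    return metadata, chunks
```

The `(start, end)` positions recovered here are what you would pass as `start_index` in a follow-up s-fetch-page request.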
## Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | URL to fetch | |
| search_pattern | Yes | Regular expression pattern to search for in the content | |
| mode | No | Fetching mode (basic, stealth, or max-stealth) | basic |
| format | No | Output format (html or markdown) | markdown |
| max_length | No | Maximum number of characters to return | 5000 |
| context_chars | No | Number of characters to include before and after each match | 200 |
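For illustration, here is the shape of the arguments a client might send for this tool. The URL and pattern are placeholder values; only `url` and `search_pattern` are required, and the defaults are shown explicitly:

```python
# Hypothetical arguments for an s_fetch_pattern call.
arguments = {
    "url": "https://example.com/changelog",
    "search_pattern": r"v\d+\.\d+\.\d+",  # e.g. match semantic version numbers
    "mode": "basic",       # escalate to "stealth" / "max-stealth" only if basic fails
    "format": "markdown",
    "max_length": 5000,
    "context_chars": 200,
}
```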
## Implementation Reference
- `src/scrapling_fetch_mcp/mcp.py:41-69` (handler): the main handler function for the `s_fetch_pattern` tool. It is registered via the `@mcp.tool()` decorator and implements the tool logic by calling the underlying `fetch_pattern_impl`, with error handling.

```python
@mcp.tool()
async def s_fetch_pattern(
    url: str,
    search_pattern: str,
    mode: str = "basic",
    format: str = "markdown",
    max_length: int = 5000,
    context_chars: int = 200,
) -> str:
    """Extracts content matching regex patterns from web pages.

    Retrieves specific content from websites with bot-detection avoidance.
    For best performance, start with 'basic' mode (fastest), then only
    escalate to 'stealth' or 'max-stealth' modes if basic mode fails.
    Returns matched content as 'METADATA: {json}\\n\\n[content]' where
    metadata includes match statistics and truncation information. Each
    matched content chunk is delimited with '॥๛॥' and prefixed with
    '[Position: start-end]' indicating its byte position in the original
    document, allowing targeted follow-up requests with s-fetch-page
    using specific start_index values.

    Args:
        url: URL to fetch
        search_pattern: Regular expression pattern to search for in the content
        mode: Fetching mode (basic, stealth, or max-stealth)
        format: Output format (html or markdown)
        max_length: Maximum number of characters to return.
        context_chars: Number of characters to include before and after each match
    """
    try:
        result = await fetch_pattern_impl(
            url, search_pattern, mode, format, max_length, context_chars
        )
        return result
    except Exception as e:
        logger = getLogger("scrapling_fetch_mcp")
        logger.error("DETAILED ERROR IN s_fetch_pattern: %s", str(e))
        logger.error("TRACEBACK: %s", format_exc())
        raise
```
- The core helper function implementing the pattern-fetching logic: it fetches the page, converts it to the requested format, searches with the regex, extracts contexts, truncates, and formats the metadata header.

```python
async def fetch_pattern_impl(
    url: str,
    search_pattern: str,
    mode: str,
    format: str,
    max_length: int,
    context_chars: int,
) -> str:
    page = await browse_url(url, mode)
    # Convert to markdown unless raw HTML was requested.
    is_markdown = format == "markdown"
    full_content = (
        _html_to_markdown(page.html_content) if is_markdown else page.html_content
    )
    original_length = len(full_content)
    matched_content, match_count = _search_content(
        full_content, search_pattern, context_chars
    )
    if not matched_content:
        metadata_json = _create_metadata(original_length, 0, False, None, 0)
        return f"METADATA: {metadata_json}\n\n"
    # Truncate the matched content to the requested maximum length.
    truncated_content = matched_content[:max_length]
    is_truncated = len(matched_content) > max_length
    metadata_json = _create_metadata(
        original_length, len(truncated_content), is_truncated, None, match_count
    )
    return f"METADATA: {metadata_json}\n\n{truncated_content}"
```
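The `_create_metadata` helper is not shown in this excerpt. Its signature can be inferred from the two call sites above (original length, returned length, truncation flag, an optional index, match count), but its body and the metadata field names below are assumptions, not the package's actual keys:

```python
import json
from typing import Optional

# Assumed reconstruction of _create_metadata; the real implementation
# and its JSON field names may differ.
def _create_metadata(
    original_length: int,
    returned_length: int,
    is_truncated: bool,
    start_index: Optional[int],
    match_count: int,
) -> str:
    metadata = {
        "original_length": original_length,
        "returned_length": returned_length,
        "is_truncated": is_truncated,
        "match_count": match_count,
    }
    if start_index is not None:
        metadata["start_index"] = start_index
    return json.dumps(metadata)
```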
- Helper function that performs the regex search on the content, extracts context around matches, merges overlapping chunks, and formats the output with position markers.

```python
def _search_content(
    content: str, pattern: str, context_chars: int = 200
) -> tuple[str, int]:
    try:
        matches = list(compile(pattern).finditer(content))
        if not matches:
            return "", 0
        # Expand each match into a (start, end) window of surrounding context,
        # clamped to the bounds of the document.
        chunks = [
            (
                max(0, match.start() - context_chars),
                min(len(content), match.end() + context_chars),
            )
            for match in matches
        ]
        # Merge windows that overlap the previous window into a single chunk.
        merged_chunks = reduce(
            lambda acc, chunk: (
                [*acc[:-1], (acc[-1][0], max(acc[-1][1], chunk[1]))]
                if acc and chunk[0] <= acc[-1][1]
                else [*acc, chunk]
            ),
            chunks,
            [],
        )
        result_sections = [
            f"॥๛॥\n[Position: {start}-{end}]\n{content[start:end]}"
            for start, end in merged_chunks
        ]
        return "\n".join(result_sections), len(matches)
    except re_error as e:
        return f"ERROR: Invalid regex pattern: {str(e)}", 0
```
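To see the chunk merging in action, here is a small driver, assuming the definitions above are in scope (the sample text is made up). Two matches whose context windows overlap produce a single merged section rather than two overlapping ones:

```python
text = "aaa MATCH bbb MATCH ccc"
result, count = _search_content(text, r"MATCH", context_chars=4)
print(count)   # 2 -- both matches are counted
print(result)  # one merged section covering both matches:
# ॥๛॥
# [Position: 0-23]
# aaa MATCH bbb MATCH ccc
```

Reporting the merged `[Position: start-end]` ranges is what makes the follow-up workflow possible: a client can feed `start` back to s-fetch-page as `start_index` to read more of the surrounding document.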