
Web Research Assistant

by elad12390

crawl_url

Fetch webpage text for quoting or analysis using crawl4ai to extract content from URLs.

Instructions

Fetch a URL with crawl4ai when you need the actual page text for quoting or analysis.

Input Schema

Name        Required   Description   Default
url         Yes
reasoning   Yes
max_chars   No                       12000

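For illustration, a call payload matching this input schema might look like the following. The URL and reasoning values are made-up examples, and the 12000 default for max_chars comes from the review notes below:

```python
# Illustrative arguments for a crawl_url call; URL and reasoning values are examples.
call_args = {
    "url": "https://example.com/docs",            # required
    "reasoning": "Quote the installation steps",  # required, recorded for analytics
    "max_chars": 12000,                           # optional, defaults to 12000
}

# Minimal check that both required fields are present.
required = {"url", "reasoning"}
missing = required - call_args.keys()
```
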
Output Schema

Name        Required   Description   Default
result      Yes

Implementation Reference

  • Primary handler and registration for the 'crawl_url' MCP tool. Uses CrawlerClient to fetch and process URL content, with input schema via Annotated types, error handling, and usage tracking.
    @mcp.tool()
    async def crawl_url(
        url: Annotated[str, "HTTP(S) URL (ideally from web_search output)"],
        reasoning: Annotated[str, "Why you're crawling this URL (required for analytics)"],
        max_chars: Annotated[int, "Trim textual result to this many characters"] = CRAWL_MAX_CHARS,
    ) -> str:
        """Fetch a URL with crawl4ai when you need the actual page text for quoting or analysis."""
    
        start_time = time.time()
        success = False
        error_msg = None
        result = ""
    
        try:
            text = await crawler_client.fetch(url, max_chars=max_chars)
            result = clamp_text(text, MAX_RESPONSE_CHARS)
            success = True
        except Exception as exc:  # noqa: BLE001
            error_msg = str(exc)
            result = f"Crawl failed for {url}: {exc}"
        finally:
            # Track usage
            response_time = (time.time() - start_time) * 1000
            tracker.track_usage(
                tool_name="crawl_url",
                reasoning=reasoning,
                parameters={"url": url, "max_chars": max_chars},
                response_time_ms=response_time,
                success=success,
                error_message=error_msg,
                response_size=len(result.encode("utf-8")),
            )
    
        return result
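The handler relies on a clamp_text helper that is not shown in this excerpt. A minimal sketch consistent with how it is called (a string plus a character limit) could look like this; the truncation marker is an assumption, and the real helper may differ:

```python
def clamp_text(text: str, limit: int) -> str:
    """Trim *text* to at most *limit* characters.

    Sketch only: the actual helper's behavior and truncation marker are not
    shown in the excerpt above.
    """
    if len(text) <= limit:
        return text
    marker = "... [truncated]"
    # Reserve room for the marker so the total stays within the limit.
    return text[: max(0, limit - len(marker))] + marker
```
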
  • Core crawling logic in CrawlerClient.fetch(), called by the tool handler. Uses crawl4ai AsyncWebCrawler to fetch, extract markdown/content, clean, and trim text.
    async def fetch(self, url: str, *, max_chars: int | None = None) -> str:
        """Fetch *url* and return cleaned markdown, trimmed to *max_chars*."""
    
        run_config = CrawlerRunConfig(cache_mode=self.cache_mode)
    
        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(url=url, config=run_config)
    
        if getattr(result, "error", None):
            raise RuntimeError(str(result.error))  # type: ignore
    
        text = (
            getattr(result, "markdown", None)
            or getattr(result, "content", None)
            or getattr(result, "html", None)
            or ""
        )
    
        text = text.strip()
        if not text:
            raise RuntimeError("Crawl completed but returned no readable content.")
    
        limit = max_chars or CRAWL_MAX_CHARS
        return clamp_text(text, limit)
  • CrawlerClient class initialization and configuration, instantiated globally as crawler_client in server.py.
    class CrawlerClient:
        """Lightweight wrapper around crawl4ai's async crawler."""
    
    def __init__(self, *, cache_mode: CacheMode = CacheMode.BYPASS) -> None:
        # Stored here and read by fetch() when building CrawlerRunConfig.
        self.cache_mode = cache_mode
  • Import and global instantiation of CrawlerClient used by the crawl_url tool.
    from .crawler import CrawlerClient
    from .errors import ErrorParser
    from .extractor import DataExtractor
    from .github import GitHubClient, RepoInfo
    from .images import PixabayClient
    from .registry import PackageInfo, PackageRegistryClient
    from .search import SearxSearcher
    from .service_health import ServiceHealthChecker
    from .tracking import get_tracker
    
    mcp = FastMCP("web-research-assistant")
    searcher = SearxSearcher()
    crawler_client = CrawlerClient()
    registry_client = PackageRegistryClient()
    github_client = GitHubClient()
    pixabay_client = PixabayClient()
    error_parser = ErrorParser()
    api_docs_detector = APIDocsDetector()
    api_docs_extractor = APIDocsExtractor()
    data_extractor = DataExtractor()
    tech_comparator = TechComparator(searcher, github_client, registry_client)
    changelog_fetcher = ChangelogFetcher(github_client, registry_client)
    service_health_checker = ServiceHealthChecker(crawler_client)
    tracker = get_tracker()
    
    
    def _format_search_hits(hits):
        lines = []
        for idx, hit in enumerate(hits, 1):
            snippet = f"\n{hit.snippet}" if hit.snippet else ""
            lines.append(f"{idx}. {hit.title} — {hit.url}{snippet}")
        body = "\n\n".join(lines)
        return clamp_text(body, MAX_RESPONSE_CHARS)
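For illustration, here is the numbering and snippet behavior of _format_search_hits with stub hits. The hit shape (title, url, snippet attributes) is inferred from the attribute accesses above, and the final clamp_text step is omitted in this sketch:

```python
from types import SimpleNamespace


def format_search_hits(hits):
    """Sketch of _format_search_hits without the final clamp_text step."""
    lines = []
    for idx, hit in enumerate(hits, 1):
        snippet = f"\n{hit.snippet}" if hit.snippet else ""
        lines.append(f"{idx}. {hit.title} — {hit.url}{snippet}")
    return "\n\n".join(lines)


# Stub hits; titles, URLs, and snippets are illustrative.
hits = [
    SimpleNamespace(title="Crawl4AI docs", url="https://example.com/a", snippet="Async crawler"),
    SimpleNamespace(title="SearXNG", url="https://example.com/b", snippet=""),
]
```
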
    
    
    @mcp.tool()
    async def web_search(
        query: Annotated[str, "Natural-language web query"],
        reasoning: Annotated[str, "Why you're using this tool (required for analytics)"],
        category: Annotated[
            str, "Optional SearXNG category (general, images, news, it, science, etc.)"
        ] = DEFAULT_CATEGORY,
        max_results: Annotated[int, "How many ranked hits to return (1-10)"] = DEFAULT_MAX_RESULTS,
    ) -> str:
        """Use this first to gather fresh web search results via the local SearXNG instance."""
    
        start_time = time.time()
        success = False
        error_msg = None
        result = ""
    
        try:
            hits = await searcher.search(query, category=category, max_results=max_results)
            if not hits:
                result = f"No results for '{query}' in category '{category}'."
            else:
                result = _format_search_hits(hits)
            success = True
        except Exception as exc:  # noqa: BLE001
            error_msg = str(exc)
            result = f"Search failed: {exc}"
        finally:
            # Track usage
            response_time = (time.time() - start_time) * 1000  # Convert to ms
            tracker.track_usage(
                tool_name="web_search",
                reasoning=reasoning,
                parameters={
                    "query": query,
                    "category": category,
                    "max_results": max_results,
                },
                response_time_ms=response_time,
                success=success,
                error_message=error_msg,
                response_size=len(result.encode("utf-8")),
            )
    
        return result
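The tracker's interface can be inferred from the two call sites above. A minimal in-memory stand-in with the same track_usage keyword signature (purely illustrative; the real get_tracker implementation is not shown in this excerpt) could be:

```python
import time


class InMemoryTracker:
    """Illustrative stand-in matching the track_usage call sites above."""

    def __init__(self):
        self.records = []

    def track_usage(self, *, tool_name, reasoning, parameters,
                    response_time_ms, success, error_message=None,
                    response_size=0):
        # Record one usage event per tool invocation.
        self.records.append({
            "tool_name": tool_name,
            "reasoning": reasoning,
            "parameters": parameters,
            "response_time_ms": response_time_ms,
            "success": success,
            "error_message": error_message,
            "response_size": response_size,
            "timestamp": time.time(),
        })


tracker = InMemoryTracker()
tracker.track_usage(
    tool_name="crawl_url",
    reasoning="demo",
    parameters={"url": "https://example.com", "max_chars": 12000},
    response_time_ms=42.0,
    success=True,
    error_message=None,
    response_size=128,
)
```
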
    
    
    @mcp.tool()
Behavior: 2/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full burden of behavioral disclosure. It mentions 'fetch' and 'crawl4ai', implying a read operation, but doesn't disclose critical traits like rate limits, authentication needs, error handling, or what 'crawl4ai' entails (e.g., whether it bypasses paywalls or handles JavaScript). The description adds minimal context beyond the basic action.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is a single, efficient sentence that front-loads the key information (action and purpose) with zero wasted words. Every part earns its place by specifying the tool, context, and use case concisely.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 3/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool has an output schema (which should cover return values), the description's job is lighter. However, with 3 parameters (2 required), 0% schema coverage, and no annotations, the description is incomplete—it doesn't explain parameter purposes or behavioral constraints. It's minimally adequate but leaves significant gaps for a tool that fetches web content.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 2/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 0%, so the description must compensate for undocumented parameters. It mentions 'URL' and implies 'page text', but doesn't explain the 'reasoning' parameter (required) or 'max_chars' (with a default of 12000). The description adds no meaningful semantics beyond what the parameter names suggest, failing to address the coverage gap.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 4/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the action ('Fetch a URL with crawl4ai') and resource ('page text'), specifying it's for 'quoting or analysis'. It distinguishes from generic web tools by mentioning crawl4ai, but doesn't explicitly differentiate from potential siblings like 'web_search' or 'extract_data' beyond the crawl4ai reference.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides clear context on when to use it ('when you need the actual page text for quoting or analysis'), which helps differentiate it from tools like 'web_search' (which might return summaries) or 'extract_data' (which might process structured data). However, it doesn't explicitly state when NOT to use it or name specific alternatives among the siblings.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
