
WebEvalAgent MCP Server

Official

web_eval_agent

Assess web application UX/UI quality by performing specific tasks and analyzing interaction flow to identify issues and provide improvement recommendations.

Instructions

Evaluate the user experience / interface of a web application.

This tool allows the AI to assess the quality of user experience and interface design of a web application by performing specific tasks and analyzing the interaction flow.

Before this tool is used, the web application should already be running locally on a port.

Args:
    url: Required. The localhost URL of the web application to evaluate, including the port number (e.g., http://localhost:3000, http://localhost:8080, http://localhost:4200, http://localhost:5173). Avoid path segments; use the root URL instead.
    task: Required. The specific UX/UI aspect to test (e.g., "test the checkout flow", "evaluate the navigation menu usability", "check form validation feedback"). Be as detailed as possible; the task description can run anywhere from two sentences to two paragraphs.
    headless_browser: Optional. Whether to hide the browser window popup during evaluation. If True, only the Operative Control Center browser is shown and no popup browser appears.

Returns:
    list[list[TextContent, ImageContent]]: A detailed evaluation of the web application's UX/UI, including observations, issues found, recommendations for improvement, and screenshots captured during the evaluation.

Input Schema

Name              Required  Description  Default
url               Yes
task              Yes
headless_browser  No
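
For orientation, here is a minimal sketch of calling this tool through the official MCP Python SDK (ClientSession.call_tool). The server launch command, the OPERATIVE_API_KEY placeholder, and the example task are illustrative assumptions rather than documented values; adjust them to your own install and application.

    import asyncio

    from mcp import ClientSession, StdioServerParameters
    from mcp.client.stdio import stdio_client

    async def main() -> None:
        # Hypothetical launch command and API key; substitute your own install details.
        params = StdioServerParameters(
            command="uvx",
            args=["webEvalAgent"],
            env={"OPERATIVE_API_KEY": "your-key-here"},
        )
        async with stdio_client(params) as (read, write):
            async with ClientSession(read, write) as session:
                await session.initialize()
                result = await session.call_tool(
                    "web_eval_agent",
                    {
                        "url": "http://localhost:3000",
                        "task": "Test the checkout flow: add an item to the cart, "
                                "proceed to checkout, and verify validation feedback.",
                        "headless_browser": True,
                    },
                )
                # The returned content mixes TextContent (the evaluation report)
                # with ImageContent (screenshots captured during the run).
                for item in result.content:
                    print(type(item).__name__)

    asyncio.run(main())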

Implementation Reference

  • Core handler function that orchestrates the web evaluation: it starts the log server, validates inputs, generates the evaluation prompt, executes the browser task via run_browser_task, formats comprehensive results (agent steps, console/network logs, timeline), and attaches screenshots.
    async def handle_web_evaluation(arguments: Dict[str, Any], ctx: Context, api_key: str) -> list[TextContent]:
        """Handle web_eval_agent tool calls
        
        This function evaluates the user experience of a web application by using
        the browser-use agent to perform specific tasks and analyze the interaction flow.
        
        Args:
            arguments: The tool arguments containing 'url' and 'task'
            ctx: The MCP context for reporting progress
            api_key: The API key for authentication with the LLM service
            
        Returns:
            list[List[Any]]: The evaluation results, including console logs, network requests, and screenshots
        """
        # Initialize log server immediately (if not already running)
        try:
            # stop_log_server() # Commented out stop_log_server
            start_log_server()
            # Give the server a moment to start
            await asyncio.sleep(1)
            # Open the dashboard in a new tab
            open_log_dashboard()
        except Exception:
            pass
        
        # Validate required arguments
        if "url" not in arguments or "task" not in arguments:
            return [TextContent(
                type="text",
                text="Error: Both 'url' and 'task' parameters are required. Please provide a URL to evaluate and a specific UX/UI task to test."
            )]
        
        url = arguments["url"]
        task = arguments["task"]
        tool_call_id = arguments.get("tool_call_id", str(uuid.uuid4()))
        headless = arguments.get("headless", True)
    
        send_log(f"Handling web evaluation call with context: {ctx}", "šŸ¤”")
    
        
        # Ensure URL has a protocol (add https:// if missing)
        if not url.startswith(("http://", "https://", "file://", "data:", "chrome:", "javascript:")):
            url = "https://" + url
            send_log(f"Added https:// protocol to URL: {url}", "šŸ”—")
        
        if not url or not isinstance(url, str):
            return [TextContent(
                type="text",
                text="Error: 'url' must be a non-empty string containing the web application URL to evaluate."
            )]
            
        if not task or not isinstance(task, str):
            return [TextContent(
                type="text",
                text="Error: 'task' must be a non-empty string describing the UX/UI aspect to test."
            )]
        
        # Send initial status to dashboard
        send_log(f"šŸš€ Received web evaluation task: {task}", "šŸš€")
        send_log(f"šŸ”— Target URL: {url}", "šŸ”—")
        
        # Update the URL and task in the dashboard
        set_url_and_task(url, task)
    
        # Get the singleton browser manager and initialize it
        browser_manager = get_browser_manager()
        if not browser_manager.is_initialized:
            # Note: browser_manager.initialize will no longer need to start the log server
            # since we've already done it above
            await browser_manager.initialize()
            
        # Get the evaluation task prompt
        evaluation_task = get_web_evaluation_prompt(url, task)
        send_log("šŸ“ Generated evaluation prompt.", "šŸ“")
        
        # Run the browser task
        agent_result_data = None # Changed to agent_result_data
        try:
            # run_browser_task now returns a dictionary with result and screenshots # Updated comment
            agent_result_data = await run_browser_task(
                evaluation_task,
                headless=headless, # Pass the headless parameter
                tool_call_id=tool_call_id,
                api_key=api_key
            )
            
            # Extract the final result string
            agent_final_result = agent_result_data.get("result", "No result provided")
            screenshots = agent_result_data.get("screenshots", []) # Added this line
    
            # Log detailed screenshot information
            send_log(f"Received {len(screenshots)} screenshots from run_browser_task", "šŸ“ø")
            for i, screenshot in enumerate(screenshots):
                if 'screenshot' in screenshot and screenshot['screenshot']:
                    b64_length = len(screenshot['screenshot'])
                    send_log(f"Processing screenshot {i+1}: Step {screenshot.get('step', 'unknown')}, {b64_length} base64 chars", "šŸ”¢")
                else:
                    send_log(f"Screenshot {i+1} missing 'screenshot' data! Keys: {list(screenshot.keys())}", "āš ļø")
    
            # Log the number of screenshots captured
            send_log(f"šŸ“ø Captured {len(screenshots)} screenshots during evaluation", "šŸ“ø")
    
        except Exception as browser_task_error:
            error_msg = f"Error during browser task execution: {browser_task_error}\n{traceback.format_exc()}"
            send_log(error_msg, "āŒ")
            agent_final_result = f"Error: {browser_task_error}" # Provide error as result
            screenshots = [] # Ensure screenshots is defined even on error
    
        # Format the agent result in a more user-friendly way, including console and network errors
        formatted_result = format_agent_result(agent_final_result, url, task, console_log_storage, network_request_storage)
        
        # Determine if the task was successful
        task_succeeded = True
        if agent_final_result.startswith("Error:"):
            task_succeeded = False
        elif "success=False" in agent_final_result and "is_done=True" in agent_final_result:
            task_succeeded = False
        
        # Use appropriate status emoji
        status_emoji = "āœ…" if task_succeeded else "āŒ"
        
        # Return a better formatted message to the MCP user
        # Including a reference to the dashboard for detailed logs
        confirmation_text = f"{formatted_result}\n\nšŸ‘ļø See the 'Operative Control Center' dashboard for detailed live logs.\nWeb Evaluation completed!"
        send_log(f"Web evaluation task completed for {url}.", status_emoji) # Also send confirmation to dashboard
        
        # Log final screenshot count before constructing response
        send_log(f"Constructing final response with {len(screenshots)} screenshots", "🧩")
        
        # Create the final response structure
        response = [TextContent(type="text", text=confirmation_text)]
        
        # Debug the screenshot data structure one last time before adding to response
        for i, screenshot_data in enumerate(screenshots[1:]):
            if 'screenshot' in screenshot_data and screenshot_data['screenshot']:
                b64_length = len(screenshot_data['screenshot'])
                send_log(f"Adding screenshot {i+1} to response ({b64_length} chars)", "āž•")
                response.append(ImageContent(
                    type="image",
                    data=screenshot_data["screenshot"],
                    mimeType="image/jpeg"
                ))
            else:
                send_log(f"Screenshot {i+1} can't be added to response - missing data!", "āŒ")
        
        send_log(f"Final response contains {len(response)} items ({len(response)-1} images)", "šŸ“¦")
        
        # MCP tool function expects list[list[TextContent, ImageContent]] - see docstring in mcp_server.py
        send_log(f"Returning wrapped response: list[ [{len(response)} items] ]", "šŸŽ")
        
        # The docstring declares the return type as list[list[TextContent, ImageContent]],
        # i.e. a list containing a single list of mixed content items.
        return [response]
  • Registers the MCP tool named 'web_eval_agent' with input schema (url:str, task:str, ctx:Context, headless_browser:bool) and detailed docstring; validates API key and delegates execution to handle_web_evaluation.
    @mcp.tool(name=BrowserTools.WEB_EVAL_AGENT)
    async def web_eval_agent(url: str, task: str, ctx: Context, headless_browser: bool = False) -> list[TextContent]:
        """Evaluate the user experience / interface of a web application.
    
        This tool allows the AI to assess the quality of user experience and interface design
        of a web application by performing specific tasks and analyzing the interaction flow.
    
        Before this tool is used, the web application should already be running locally on a port.
    
        Args:
            url: Required. The localhost URL of the web application to evaluate, including the port number.
                Example: http://localhost:3000, http://localhost:8080, http://localhost:4200, http://localhost:5173, etc.
                Try to avoid using the path segments of the URL, and instead use the root URL.
            task: Required. The specific UX/UI aspect to test (e.g., "test the checkout flow",
                 "evaluate the navigation menu usability", "check form validation feedback")
                 Be as detailed as possible in your task description. It could be anywhere from 2 sentences to 2 paragraphs.
            headless_browser: Optional. Whether to hide the browser window popup during evaluation.
            If headless_browser is True, only the operative control center browser will show, and no popup browser will be shown.
    
        Returns:
            list[list[TextContent, ImageContent]]: A detailed evaluation of the web application's UX/UI, including
                             observations, issues found, and recommendations for improvement
                             and screenshots of the web application during the evaluation
        """
        headless = headless_browser
        is_valid = await validate_api_key(api_key)
    
        if not is_valid:
            error_message_str = "āŒ Error: API Key validation failed when running the tool.\n"
            error_message_str += "   Reason: Free tier limit reached.\n"
            error_message_str += "   šŸ‘‰ Please subscribe at https://operative.sh to continue."
            return [TextContent(type="text", text=error_message_str)]
        try:
            # Generate a new tool_call_id for this specific tool call
            tool_call_id = str(uuid.uuid4())
            return await handle_web_evaluation(
                {"url": url, "task": task, "headless": headless, "tool_call_id": tool_call_id},
                ctx,
                api_key
            )
        except Exception as e:
            tb = traceback.format_exc()
            return [TextContent(
                type="text",
                text=f"Error executing web_eval_agent: {str(e)}\n\nTraceback:\n{tb}"
            )]
  • Tool schema and documentation defining inputs (url, task, headless_browser), expected usage, and output format (evaluation report with screenshots).
    """Evaluate the user experience / interface of a web application.
    
    This tool allows the AI to assess the quality of user experience and interface design
    of a web application by performing specific tasks and analyzing the interaction flow.
    
    Before this tool is used, the web application should already be running locally on a port.
    
    Args:
        url: Required. The localhost URL of the web application to evaluate, including the port number.
            Example: http://localhost:3000, http://localhost:8080, http://localhost:4200, http://localhost:5173, etc.
            Try to avoid using the path segments of the URL, and instead use the root URL.
        task: Required. The specific UX/UI aspect to test (e.g., "test the checkout flow",
             "evaluate the navigation menu usability", "check form validation feedback")
             Be as detailed as possible in your task description. It could be anywhere from 2 sentences to 2 paragraphs.
        headless_browser: Optional. Whether to hide the browser window popup during evaluation.
        If headless_browser is True, only the operative control center browser will show, and no popup browser will be shown.
    
    Returns:
        list[list[TextContent, ImageContent]]: A detailed evaluation of the web application's UX/UI, including
                         observations, issues found, and recommendations for improvement
                         and screenshots of the web application during the evaluation
    """
  • Helper function generating the specific prompt template for the browser agent to evaluate the web app's UX/UI based on the provided URL and task description.
    def get_web_evaluation_prompt(url: str, task: str) -> str:
        """
        Generate a prompt for web application evaluation.
        
        Args:
            url: The URL of the web application to evaluate
            task: The specific aspect to test
            
        Returns:
            str: The formatted evaluation prompt
        """
        return f"""VISIT: {url}
    GOAL: {task}
    
    Evaluate the UI/UX of the site. If you hit any critical errors (e.g., page fails to load, JS errors), stop and report the exact issue.
    
    If a login page appears, first try clicking "Login" — saved credentials may work.
    If login fields appear and no credentials are provided, do not guess. Stop and report that login is required. Suggest the user run setup_browser_state to log in and retry.
    
    If no errors block progress, proceed and attempt the task. Try a couple times if needed before giving up — unless blocked by missing login access.
    Make sure to click through the application from the base url, don't jump to other pages without naturally arriving there.
    
    Report any UX issues (e.g., incorrect content, broken flows), or confirm everything worked smoothly.
    Take note of any opportunities for improvement in the UI/UX, test and think about the application like a real user would.
    """
  • Enum defining the tool name constant 'web_eval_agent' used in the @mcp.tool registration.
    class BrowserTools(str, Enum):
        WEB_EVAL_AGENT = "web_eval_agent"
        SETUP_BROWSER_STATE = "setup_browser_state"  # Add new tool enum
Behavior: 3/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full burden. It discloses that the tool performs tasks and analyzes interaction flow, uses a browser (with optional headless mode), and returns evaluations with screenshots. However, it lacks details on permissions, rate limits, error handling, or whether it's read-only/destructive. The description doesn't contradict annotations (none exist).

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is appropriately sized and front-loaded with the core purpose. Each sentence adds value: purpose, prerequisites, parameter explanations, and return format. Some minor redundancy exists (e.g., repeating 'evaluate' concepts), but overall it's well-structured with zero wasted sentences.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given 3 parameters with 0% schema coverage and no output schema, the description does a good job explaining inputs and outputs. It details parameter usage and describes the return format (list of text and image content with observations, issues, recommendations). For a tool with no annotations, it's reasonably complete, though could benefit from more behavioral context like error cases.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 5/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 0%, so the description must compensate. It provides detailed semantics for all 3 parameters: 'url' (required, localhost URL with port examples, avoid path segments), 'task' (required, specific UX/UI aspect with examples and length guidance), and 'headless_browser' (optional, controls browser window visibility). This adds substantial meaning beyond the bare schema.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 4/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose: 'Evaluate the user experience / interface of a web application' and 'assess the quality of user experience and interface design'. It specifies the verb ('evaluate', 'assess') and resource ('web application'), but doesn't explicitly differentiate from the sibling tool 'setup_browser_state', which likely has a different function.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides clear context for when to use this tool: 'Before this tool is used, the web application should already be running locally on a port.' It also gives examples of tasks like 'test the checkout flow' or 'evaluate the navigation menu usability'. However, it doesn't explicitly state when NOT to use it or mention the sibling tool as an alternative.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
