tailtest_classify_failures

Parse test runner output into structured failure records with R12 classification (real_bug, environment, test_bug, unknown). Returns detailed failure info and summary counts per category.

Instructions

Parse runner output (pytest, jest, etc.) into structured failure records and apply heuristic R12 classification. Returns failures with type (real_bug / environment / test_bug / unknown), reason, test name, file, line, error type, message, and a summary count per R12 category. The agent verifies or overrides the heuristic when context warrants.
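As an illustration (the test name, file path, and error text below are hypothetical), a single pytest failure could produce a record and summary shaped like this; the exact parsing depends on the server's pytest parser, which is shown in the Implementation Reference below only by name:

    # Hypothetical excerpt of pytest output passed as runner_output:
    runner_output = (
        "FAILED tests/test_cart.py::test_total_price "
        "- TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'"
    )

    # One failure record the tool could return for this excerpt (field names per
    # the record shape documented in the handler docstring below):
    failure = {
        "type": "real_bug",   # TypeError is in the likely-real-bug error set
        "reason": "TypeError typically indicates a bug in the source under test",
        "test_name": "test_total_price",
        "file": "tests/test_cart.py",
        "line": None,
        "error_type": "TypeError",
        "message": "unsupported operand type(s) for +: 'int' and 'NoneType'",
    }

    summary = {"real_bug": 1, "environment": 0, "test_bug": 0, "unknown": 0}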

Input Schema

Name           Required  Description                                            Default
runner_output  Yes       Stdout (and optionally stderr) from the test runner.   (none)
runner         No        Runner name.                                           pytest
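A minimal sketch of a valid arguments payload (the failure text is hypothetical). Per the registered schema, only these two keys are accepted and runner must be one of pytest, jest, vitest, or mocha:

    arguments = {
        "runner_output": "FAILED tests/test_api.py::test_login - ImportError: No module named 'requests'",
        "runner": "pytest",  # optional; omitting it falls back to pytest
    }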

Implementation Reference

  • The main handler function 'classify_failures' that parses runner output and returns structured R12-classified failure records. Orchestrates parsing via _parse_pytest_failures or _parse_jest_failures.
    def classify_failures(runner_output: str, runner: str = "pytest") -> dict[str, Any]:
        """Parse runner output and return structured R12-classified failures.
    
        Args:
            runner_output: stdout (and optionally stderr) from the test runner.
            runner: one of "pytest", "jest", "mocha", "vitest". Defaults to pytest.
    
        Returns:
            Dict with `failures` (list of failure records), `summary` (counts per
            R12 type), and `runner` echoed back for the agent's reference.
    
        Each failure record:
            {
                "type": "real_bug" | "environment" | "test_bug" | "unknown",
                "reason": str,
                "test_name": str,
                "file": str,
                "line": int | None,
                "error_type": str,
                "message": str,
            }
        """
        if runner in ("jest", "vitest"):
            failures = _parse_jest_failures(runner_output)
        else:
            failures = _parse_pytest_failures(runner_output)
    
        summary = {"real_bug": 0, "environment": 0, "test_bug": 0, "unknown": 0}
        for f in failures:
            summary[f["type"]] += 1
    
        return {
            "runner": runner,
            "failures": failures,
            "summary": summary,
            "total_failures": len(failures),
        }
  • Heuristic R12 classification logic mapping error types to real_bug/environment/test_bug/unknown. Includes fixture/conftest detection to flip real_bug to test_bug. A short trace of these branches appears after this list.
    def _heuristic_classification(
        error_type: str, message: str, traceback_text: str
    ) -> tuple[str, str]:
        """Apply heuristic R12 classification.
    
        Returns (classification, reason) where classification is one of:
        real_bug, environment, test_bug, unknown.
        """
        if error_type in ENV_ERRORS:
            return ("environment", f"{error_type} typically indicates a missing dependency or system resource")
    
        if error_type in LIKELY_REAL_BUG_ERRORS:
            # Refine: if the traceback shows the error originated in test fixture
            # setup, flip to test_bug.
            if traceback_text and any(
                marker in traceback_text
                for marker in ("conftest.py", "fixture", "setup_method", "setUp(")
            ):
                return (
                    "test_bug",
                    f"{error_type} originated in test fixture or setup, not source under test",
                )
            return (
                "real_bug",
                f"{error_type} typically indicates a bug in the source under test",
            )
    
        if error_type in AMBIGUOUS_ERRORS:
            # AssertionError: try to disambiguate from message.
            msg_lower = (message or "").lower()
            # Common test_bug signals
            if any(
                phrase in msg_lower
                for phrase in (
                    "fixture not found",
                    "expected fixture",
                    "wrong expectation",
                    "stub",
                    "mock not configured",
                )
            ):
                return ("test_bug", "Assertion message indicates the test setup is wrong")
            # Common real_bug signals
            if any(
                phrase in msg_lower
                for phrase in (
                    "expected ",
                    "got ",
                    "should",
                    "to equal",
                    "to be",
                )
            ):
                return (
                    "real_bug",
                    "Assertion compares actual vs expected behavior of the source",
                )
            return (
                "real_bug",
                "AssertionError defaults to real_bug when ambiguous (per CLAUDE.md / mdc rule)",
            )
    
        return ("unknown", f"No heuristic for {error_type}; agent must classify")
  • Enumeration sets for error type categorization: ENV_ERRORS (ImportError, ConnectionError, etc.), LIKELY_REAL_BUG_ERRORS (AttributeError, TypeError, etc.), AMBIGUOUS_ERRORS (AssertionError).
    ENV_ERRORS = {
        "ImportError",
        "ModuleNotFoundError",
        "ConnectionError",
        "ConnectionRefusedError",
        "ConnectionResetError",
        "TimeoutError",
        "FileNotFoundError",
        "PermissionError",
        "OSError",
    }
    
    LIKELY_REAL_BUG_ERRORS = {
        "AttributeError",
        "TypeError",
        "KeyError",
        "ValueError",
        "IndexError",
        "ZeroDivisionError",
        "RecursionError",
        "OverflowError",
    }
    
    # AssertionError is ambiguous: real_bug if assertion is on source behavior,
    # test_bug if assertion is on test fixture / setup.
    AMBIGUOUS_ERRORS = {
        "AssertionError",
    }
  • Tool registration in the MCP server's list_tools() function: defines name, description, and inputSchema for tailtest_classify_failures.
    Tool(
        name="tailtest_classify_failures",
        description=(
            "Parse runner output (pytest, jest, etc.) into structured failure records and "
            "apply heuristic R12 classification. Returns failures with type "
            "(real_bug / environment / test_bug / unknown), reason, test name, file, "
            "line, error type, message, and a summary count per R12 category. The agent "
            "verifies or overrides the heuristic when context warrants."
        ),
        inputSchema={
            "type": "object",
            "properties": {
                "runner_output": {
                    "type": "string",
                    "description": "Stdout (and optionally stderr) from the test runner.",
                },
                "runner": {
                    "type": "string",
                    "enum": ["pytest", "jest", "vitest", "mocha"],
                    "description": "Runner name. Defaults to pytest.",
                },
            },
            "required": ["runner_output"],
            "additionalProperties": False,
        },
    ),
  • Dispatch handler in call_tool() that imports classify_failures from .tools.classify_failures and invokes it with runner_output and runner arguments. A client-side sketch of consuming the returned JSON appears after this list.
    if name == "tailtest_classify_failures":
        from .tools.classify_failures import classify_failures
        import json as _json
    
        result = classify_failures(
            runner_output=arguments["runner_output"],
            runner=arguments.get("runner", "pytest"),
        )
        return [TextContent(type="text", text=_json.dumps(result, indent=2))]
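As a quick trace of the heuristic branches shown above (the inputs are illustrative; the returned reason strings come directly from the function body):

    _heuristic_classification("ImportError", "No module named 'foo'", "")
    # -> ("environment",
    #     "ImportError typically indicates a missing dependency or system resource")

    _heuristic_classification(
        "AttributeError",
        "'NoneType' object has no attribute 'save'",
        "tests/conftest.py, line 12, in user_fixture",
    )
    # -> ("test_bug",
    #     "AttributeError originated in test fixture or setup, not source under test")

    _heuristic_classification("AssertionError", "expected 3, got 2", "")
    # -> ("real_bug",
    #     "Assertion compares actual vs expected behavior of the source")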
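From the client side, the classification arrives as a single JSON text block. A minimal sketch using the MCP Python SDK's stdio client; the server launch command and module name are assumptions, not taken from the source:

    import asyncio
    import json

    from mcp import ClientSession, StdioServerParameters
    from mcp.client.stdio import stdio_client

    async def main() -> None:
        # Hypothetical launch command for the tailtest MCP server.
        params = StdioServerParameters(command="python", args=["-m", "tailtest_server"])
        async with stdio_client(params) as (read, write):
            async with ClientSession(read, write) as session:
                await session.initialize()
                response = await session.call_tool(
                    "tailtest_classify_failures",
                    {"runner_output": "FAILED tests/test_x.py::test_y - KeyError: 'id'"},
                )
                # The dispatch handler returns one TextContent whose text is JSON.
                result = json.loads(response.content[0].text)
                print(result["summary"])

    asyncio.run(main())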
Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

No annotations are provided, so the description carries the full burden. It discloses the heuristic classification and that the agent can verify or override results. It does not mention side effects or persistence, but given the context (classifying test failures), this is acceptable.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is two sentences, front-loading the main action and output. It is reasonably concise, but could be slightly tighter by merging the output list into the first sentence.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given that the tool has two parameters (one optional) and no output schema, the description provides comprehensive context: it lists output fields, explains the heuristic classification, and notes the agent's role. It is sufficient for the agent to use the tool correctly.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 100%, so the baseline is 3. The description adds value by explaining that runner_output is parsed, that runner defaults to pytest, and that the output includes structured failure records with heuristic classification. This goes beyond the schema definitions.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose: parsing runner output into structured failure records and applying heuristic R12 classification. It lists specific output fields (type, reason, test name, etc.), distinguishing it from sibling tools like tailtest_pick_template or tailtest_ping.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description implies the tool should be used when the agent needs to classify test failures, but it does not explicitly state when to use or avoid it. It mentions no alternatives or exclusions, which would help differentiate it from sibling tools.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
