notasandy

MCP Code Sanitizer

compare_code

Compares two versions of code to identify improvements, regressions, and neutral changes. Returns a merge recommendation for code review and refactoring validation.

Instructions

Compares two versions of code and evaluates whether the change is an improvement.

Performs a structured diff analysis: identifies what improved, what regressed, and what changed neutrally. Returns a merge recommendation based on the findings. Useful for code review, refactoring validation, and AI-generated code verification.

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| code_before | Yes | The original version of the code (before changes). Include complete function or class -- not just the diff. | |
| code_after | Yes | The new version of the code (after changes). Must be the same scope as code_before for accurate comparison. | |
| language | No | Programming language of both versions. Examples: "python", "javascript", "go", "typescript". Defaults to "python". | python |
| context | No | Optional description of the intent behind the change. Helps distinguish intentional trade-offs from bugs. Example: "Optimized for memory usage at the cost of readability" | |
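
A minimal sketch of assembling the arguments for a compare_code call from the parameters listed above. The helper name `build_compare_args` is hypothetical and not part of the server; it only mirrors the documented required/optional split and the "python" default.

```python
def build_compare_args(code_before: str, code_after: str,
                       language: str = "python", context: str = "") -> dict:
    """Hypothetical helper: builds the arguments dict for a compare_code call."""
    # Both code versions are required, so reject blank input early.
    if not code_before.strip() or not code_after.strip():
        raise ValueError("Both code_before and code_after must be provided.")
    args = {
        "code_before": code_before,
        "code_after": code_after,
        "language": language,
    }
    # context is optional, so include it only when the caller supplied one.
    if context:
        args["context"] = context
    return args


args = build_compare_args(
    "def get(id): return db.query(f'SELECT * WHERE id={id}')",
    "def get(id): return db.query('SELECT * WHERE id=?', [id])",
    context="Fixed SQL injection vulnerability",
)
```

Passing complete functions for both versions, as here, keeps the two inputs at the same scope, which the parameter descriptions ask for.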

Output Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| result | Yes | | |
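
The result is a JSON string. As a sketch of how a caller might read it, assuming the field names given in the tool's docstring (verdict, regressions, score_before, score_after, recommendation, plus "error" on failure): the function name `summarize_result` and the sample payload are illustrative only, not real tool output.

```python
import json


def summarize_result(raw: str) -> str:
    """Hypothetical helper: turns a compare_code result into a one-line summary."""
    data = json.loads(raw)
    # Error responses are still valid JSON and carry an "error" field.
    if "error" in data:
        return f"error: {data['error']}"
    delta = data["score_after"] - data["score_before"]
    n_regressions = len(data.get("regressions", []))
    return (f"{data['recommendation']} ({data['verdict']}, "
            f"score {delta:+d}, {n_regressions} regression(s))")


# Made-up sample payload, shaped like the documented fields.
sample = json.dumps({
    "verdict": "improvement",
    "summary": "Parameterized query removes SQL injection risk.",
    "improvements": [{"title": "Parameterized query", "description": "Uses placeholders."}],
    "regressions": [],
    "neutral_changes": [],
    "score_before": 40,
    "score_after": 85,
    "recommendation": "merge",
})
print(summarize_result(sample))  # merge (improvement, score +45, 0 regression(s))
```

Checking for "error" first matters because the tool is documented to return valid JSON even when the comparison fails.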

Implementation Reference

  • The main handler function for compare_code tool. Takes code_before and code_after as inputs, sends both to Groq with a structured prompt, caches results by SHA256 hash, and returns JSON with verdict, summary, improvements, regressions, scores, and merge recommendation.
    async def compare_code(
        code_before: str, code_after: str,
        language: str = "python", context: str = "",
    ) -> str:
        """
        Compares two versions of code and evaluates whether the change is an improvement.
    
        Performs a structured diff analysis: identifies what improved, what regressed,
        and what changed neutrally. Returns a merge recommendation based on the findings.
        Useful for code review, refactoring validation, and AI-generated code verification.
    
        Behavior:
            - Sends both versions to Groq with a code review prompt focused on regressions
            - Detects improvements: better performance, fixed bugs, improved readability
            - Detects regressions: new bugs, security issues, reduced maintainability
            - Neutral changes: formatting, renaming, restructuring without quality impact
            - recommendation field maps to standard PR actions:
                "merge"             -- change is safe and beneficial
                "request_changes"   -- regressions found, must be fixed before merging
                "needs_discussion"  -- trade-offs present, team should decide
            - Results are cached by SHA256 hash of (code_before + code_after + language + context)
            - Returns valid JSON even on errors (with an "error" field)
    
        Args:
            code_before: The original version of the code (before changes).
                         Include complete function or class -- not just the diff.
            code_after:  The new version of the code (after changes).
                         Must be the same scope as code_before for accurate comparison.
            language:    Programming language of both versions.
                         Examples: "python", "javascript", "go", "typescript".
                         Defaults to "python".
            context:     Optional description of the intent behind the change.
                         Helps distinguish intentional trade-offs from bugs.
                         Example: "Optimized for memory usage at the cost of readability"
    
        Returns:
            JSON string with the following fields:
            - verdict (str): "improvement" | "regression" | "neutral change"
            - summary (str): One-sentence conclusion about the overall change
            - improvements (list): What got better, each with title and description
            - regressions (list): What got worse, each with severity, title, description
            - neutral_changes (list): Changes with no quality impact (strings)
            - score_before (int): Quality score of the original code (0-100)
            - score_after (int): Quality score of the new code (0-100)
            - recommendation (str): "merge" | "request_changes" | "needs_discussion"
    
        Usage guidelines:
            - Use before merging a PR to get an objective quality comparison
            - Pass context when the change is intentional (e.g., trading speed for memory)
            - Works best when code_before and code_after cover the same function/class scope
            - If score_after < score_before, check regressions before merging
            - Combine with analyze_code on code_after to get detailed issue breakdown
    
        Example:
            compare_code(
                code_before="def get(id): return db.query(f'SELECT * WHERE id={id}')",
                code_after="def get(id): return db.query('SELECT * WHERE id=?', [id])",
                language="python",
                context="Fixed SQL injection vulnerability"
            )
        """
        if not code_before.strip() or not code_after.strip():
            return error_response("Both code_before and code_after must be provided.")
    
        key = cache.make_key("compare_code", code_before, code_after, language, context)
        if hit := cache.get(key):
            return hit
    
        context_block = f"\nChange context: {context}" if context else ""
        user = (
            f"Language: {language}{context_block}\n\n"
            f"OLD CODE:\n```{language}\n{code_before}\n```\n\n"
            f"NEW CODE:\n```{language}\n{code_after}\n```"
        )
    
        try:
            raw = await call(COMPARE, user)
            result = json.loads(raw)
        except httpx.HTTPStatusError as e:
            return error_response(f"Groq API error {e.response.status_code}", e.response.text[:300])
        except json.JSONDecodeError as e:
            return error_response("Groq returned invalid JSON", str(e))
        except ValueError as e:
            return error_response(str(e))
    
        out = json.dumps(result, ensure_ascii=True, indent=2)
        cache.set(key, out)
        return out
  • Re-exports compare_code from tools/compare.py to make it accessible from the tools package.
    from .compare   import compare_code
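
The handler relies on two helpers that the excerpt does not show: cache.make_key and error_response. The sketch below is one way they could look, assuming a SHA256 digest over the key parts (as the description states) and an error payload carrying an "error" field (as the docstring states). The length-prefixing scheme and the "details" field name are assumptions for illustration, not the server's actual code.

```python
import hashlib
import json


def make_key(*parts: str) -> str:
    """Hypothetical cache key: SHA256 over all parts, each length-prefixed."""
    h = hashlib.sha256()
    for part in parts:
        data = part.encode("utf-8")
        # Prefix each part with its byte length so ("ab", "c") and ("a", "bc")
        # do not collapse into the same key.
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()


def error_response(message: str, details: str = "") -> str:
    """Hypothetical error payload: valid JSON with an "error" field."""
    payload = {"error": message}
    if details:
        payload["details"] = details  # field name is an assumption
    return json.dumps(payload, ensure_ascii=True, indent=2)


key = make_key("compare_code", "old code", "new code", "python", "")
```

Including context in the key, as the handler does, means the same pair of code versions is analyzed again when the stated intent changes.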
Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

No annotations are provided, so the description bears full responsibility. It discloses the tool performs 'structured diff analysis' and returns a 'merge recommendation' with categories of improvements, regressions, and neutrals. It does not mention destructive actions or side effects, which aligns with a read-only analysis tool. A minor omission: no mention of output schema details, but the presence of an output schema is noted.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is two paragraphs with clear, front-loaded purpose. Every sentence adds value: first sentence states purpose, second paragraph outlines analysis outputs. No fluff or redundancy.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the presence of an output schema (not shown), and the schema's 100% coverage, the description sufficiently explains what the tool does, its outputs (improved, regressed, neutral, recommendation), and appropriate use cases. It is complete for a comparison tool.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 100%, so baseline is 3. The description adds value by explaining that the 'context' parameter 'helps distinguish intentional trade-offs from bugs' and that the language parameter defaults to Python. This enriches the schema descriptions.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states that it 'Compares two versions of code' and 'evaluates whether the change is an improvement'. The verb 'compares' and the resource 'code versions' are specific. It distinguishes itself from sibling tools such as analyze_code and explain_code by focusing on comparison.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description explicitly lists use cases: 'code review, refactoring validation, and AI-generated code verification'. This gives clear context for when to use the tool. It does not state when not to use it or name alternative tools, but the listed use cases are sufficient guidance.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/notasandy/mcp-code-sanitizer'
