
compare_models

Compare 2-5 LLM/VLM models side-by-side for pricing, benchmarks, and capabilities. Returns a compact Markdown comparison table to help select the right model.

Instructions

Compare 2-5 LLM/VLM models side-by-side: pricing, benchmarks, capabilities. Returns a compact Markdown comparison table (~400 tokens).

Input Schema

| Name | Required | Description | Default |
|------|----------|-------------|---------|
| models | Yes | Model IDs or partial names (e.g., ["claude-sonnet-4.6", "gpt-5.2", "gemini-3-pro"]) | n/a |
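
For orientation, a call from an MCP client might look like the sketch below. It assumes the @modelcontextprotocol/sdk TypeScript client and a stdio-launched build of this server; the command, args, client name, and model IDs are illustrative placeholders, not taken from this page.

    import { Client } from "@modelcontextprotocol/sdk/client/index.js";
    import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

    // Spawn the server over stdio (command and args are placeholders).
    const transport = new StdioClientTransport({
      command: "node",
      args: ["dist/index.js"],
    });

    const client = new Client({ name: "example-client", version: "1.0.0" });
    await client.connect(transport);

    // Partial names are allowed; the tool resolves them against its registry
    // and reports unresolved names in the result, erroring only when fewer
    // than two models resolve.
    const result = await client.callTool({
      name: "compare_models",
      arguments: { models: ["claude-sonnet-4.6", "gpt-5.2", "gemini-3-pro"] },
    });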

Implementation Reference

  • registerCompareTool registers the 'compare_models' tool with the MCP server, supplying its name, description, Zod input schema (a 'models' array of 2-5 strings, each a model ID or partial name), and async handler. The handler resolves each query against the registry, returns an error with up to five similar-name suggestions when fewer than 2 models resolve, and otherwise returns a compact Markdown comparison table. (A minimal wiring sketch appears after this reference list.)
    export function registerCompareTool(
      server: McpServer,
      registry: ModelRegistry
    ): void {
      server.tool(
        "compare_models",
        "Compare 2-5 LLM/VLM models side-by-side: pricing, benchmarks, capabilities. " +
          "Returns a compact Markdown comparison table (~400 tokens).",
        {
          models: z
            .array(z.string())
            .min(2)
            .max(5)
            .describe(
              'Model IDs or partial names (e.g., ["claude-sonnet-4.6", "gpt-5.2", "gemini-3-pro"])'
            ),
        },
        async ({ models: modelQueries }) => {
          // Make sure the model registry is loaded before resolving queries.
          await registry.ensureLoaded();
    
          // Partition queries into resolved models and unresolved names.
          const resolved: UnifiedModel[] = [];
          const notFound: string[] = [];
    
          for (const query of modelQueries) {
            const found = registry.getModel(query);
            if (found) {
              resolved.push(found);
            } else {
              notFound.push(query);
            }
          }
    
          // A comparison needs at least two models; otherwise return an error
          // with up to five similar-name suggestions.
          if (resolved.length < 2) {
            const similar = notFound
              .flatMap((q) => registry.findSimilar(q))
              .slice(0, 5);
            return {
              content: [
                {
                  type: "text" as const,
                  text:
                    `Need at least 2 models to compare. Not found: ${notFound.join(", ")}.` +
                    (similar.length > 0
                      ? ` Did you mean: ${similar.join(", ")}?`
                      : ""),
                },
              ],
              isError: true,
            };
          }
    
          // Build the Markdown table, noting unresolved names and data age.
          const fetchedAt = registry.getCacheFreshnessMs();
          const output = formatComparison(resolved, notFound, fetchedAt);
    
          return {
            content: [{ type: "text" as const, text: output }],
          };
        }
      );
    }
  • Helper functions for compare_models: formatComparison builds the Markdown table, row renders a single table row, and highlightBest bolds the highest numeric value in each benchmark row. (A short behavior demo appears after this reference list.)
    function formatComparison(
      models: UnifiedModel[],
      notFound: string[],
      fetchedAt?: number
    ): string {
      const lines: string[] = [];
    
      lines.push(`## Model Comparison (${models.length} models)`);
      if (notFound.length > 0) {
        lines.push(`\n> Not found: ${notFound.join(", ")}`);
      }
      lines.push("");
    
      // Header row: Feature | Model1 | Model2 | ...
      const header = `| | ${models.map((m) => `**${m.id}**`).join(" | ")} |`;
      const sep = `|------|${models.map(() => "------").join("|")}|`;
    
      const rows: string[] = [];
    
      // Pricing
      rows.push(row("Input $/1M", models.map((m) => fmtPrice(m.pricing.input))));
      rows.push(row("Output $/1M", models.map((m) => fmtPrice(m.pricing.output))));
      if (models.some((m) => m.pricing.cacheRead !== undefined)) {
        rows.push(
          row("Cache Read $/1M", models.map((m) => fmtPrice(m.pricing.cacheRead)))
        );
      }
    
      // Context & output
      rows.push(
        row("Context", models.map((m) => fmtContext(m.capabilities.contextLength)))
      );
      rows.push(
        row(
          "Max Output",
          models.map((m) => fmtContext(m.capabilities.maxOutputTokens))
        )
      );
    
      // Benchmarks — only show rows where at least one model has data
      const benchmarks: [string, (m: UnifiedModel) => string][] = [
        ["SWE-bench", (m) => fmtScore(m.benchmarks.sweBenchVerified)],
        ["Aider Polyglot", (m) => fmtScore(m.benchmarks.aiderPolyglot)],
        ["Arena Elo", (m) => fmtElo(m.benchmarks.arenaElo)],
        ["MMLU-Pro", (m) => fmtScore(m.benchmarks.mmluPro)],
        ["GPQA Diamond", (m) => fmtScore(m.benchmarks.gpqaDiamond)],
        ["MATH-500", (m) => fmtScore(m.benchmarks.math500)],
        ["MMMU", (m) => fmtScore(m.benchmarks.mmmu)],
      ];
    
      for (const [label, extractor] of benchmarks) {
        const values = models.map(extractor);
        if (values.some((v) => v !== "n/a")) {
          // Bold the best value
          rows.push(row(label, highlightBest(values)));
        }
      }
    
      // Capabilities
      rows.push(
        row("Vision", models.map((m) =>
          m.capabilities.inputModalities.includes("image") ? "Yes" : "No"
        ))
      );
      rows.push(
        row("Tools", models.map((m) =>
          m.capabilities.supportsTools ? "Yes" : "No"
        ))
      );
      rows.push(
        row("Reasoning", models.map((m) =>
          m.capabilities.supportsReasoning ? "Yes" : "No"
        ))
      );
      rows.push(
        row("Open Source", models.map((m) =>
          m.metadata.isOpenSource ? "Yes" : "No"
        ))
      );
      rows.push(
        row("Released", models.map((m) => m.metadata.releaseDate ?? "n/a"))
      );
    
      lines.push(header);
      lines.push(sep);
      lines.push(...rows);
      lines.push(freshnessFooter(fetchedAt));
    
      return lines.join("\n");
    }
    
    function row(label: string, values: string[]): string {
      return `| ${label} | ${values.join(" | ")} |`;
    }
    
    /** Bold the best (highest) numeric value in the array */
    function highlightBest(values: string[]): string[] {
      const nums = values.map((v) => {
        const n = parseFloat(v.replace(/[^0-9.\-]/g, ""));
        return isNaN(n) ? -Infinity : n;
      });
      const max = Math.max(...nums);
      if (max === -Infinity) return values;
    
      return values.map((v, i) => (nums[i] === max && v !== "n/a" ? `**${v}**` : v));
    }
  • Formatter utilities used by compare_models: fmtPrice ($X.XX or "free"), fmtContext (128K/1M format), fmtScore (percentage), fmtElo (rounded integer), fmtModalities (joined modality list), and freshnessFooter (data-age timestamp). (Spot-checks appear after this list.)
    /** Format price as "$X.XX" or "free" */
    export function fmtPrice(price: number | undefined): string {
      if (price === undefined || price === null) return "n/a";
      if (price === 0) return "free";
      if (price < 0.01) return `$${price.toFixed(4)}`;
      return `$${price.toFixed(2)}`;
    }
    
    /** Format large context lengths: 128000 → "128K", 1000000 → "1M" */
    export function fmtContext(tokens: number | undefined): string {
      if (!tokens) return "n/a";
      if (tokens >= 1_000_000) return `${(tokens / 1_000_000).toFixed(tokens % 1_000_000 === 0 ? 0 : 1)}M`;
      if (tokens >= 1_000) return `${Math.round(tokens / 1_000)}K`;
      return String(tokens);
    }
    
    /** Format benchmark score: 72.1 → "72.1%" or "n/a" */
    export function fmtScore(score: number | undefined): string {
      if (score === undefined || score === null) return "n/a";
      return `${score.toFixed(1)}%`;
    }
    
    /** Format Elo rating */
    export function fmtElo(elo: number | undefined): string {
      if (elo === undefined || elo === null) return "n/a";
      return String(Math.round(elo));
    }
    
    /** Format modalities as short string: ["text", "image"] → "text+image" */
    export function fmtModalities(mods: string[]): string {
      return mods.join("+") || "text";
    }
    
    /** Data freshness footer */
    export function freshnessFooter(fetchedAt?: number): string {
      if (!fetchedAt) return "";
      const date = new Date(fetchedAt).toISOString().replace(/\.\d{3}Z$/, "Z");
      const ageMin = Math.round((Date.now() - fetchedAt) / 60_000);
      return `\n**Data freshness**: ${date} (${ageMin}min ago)`;
    }
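
To see how registerCompareTool is used, here is a minimal wiring sketch. The McpServer and StdioServerTransport imports follow the @modelcontextprotocol/sdk package layout; the server name/version and the ModelRegistry constructor call are placeholders, since this project's actual entry point is not shown on this page.

    import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
    import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";

    // Placeholder metadata; the real server name and version are not shown here.
    const server = new McpServer({ name: "llm-advisor", version: "0.0.0" });

    // Assumed constructor; ModelRegistry's real initialization may differ.
    const registry = new ModelRegistry();

    registerCompareTool(server, registry);

    // Serve over stdio so MCP clients can spawn this process directly.
    await server.connect(new StdioServerTransport());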
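
Because highlightBest parses the already-formatted cell strings, its behavior is easy to spot-check. The inputs below are made up; the expected results follow directly from the definition above.

    // The highest value wins the bold; "n/a" parses to -Infinity and never wins.
    highlightBest(["72.1%", "68.4%", "n/a"]); // ["**72.1%**", "68.4%", "n/a"]

    // Ties: every value equal to the max is bolded.
    highlightBest(["50.0%", "50.0%"]); // ["**50.0%**", "**50.0%**"]

    // All "n/a": the max is -Infinity, so the row is returned unchanged.
    highlightBest(["n/a", "n/a"]); // ["n/a", "n/a"]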
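
Likewise, a few spot-checks of the formatter utilities; the inputs are arbitrary, and each expected output follows from the code above.

    fmtPrice(0);                       // "free"
    fmtPrice(0.0005);                  // "$0.0005" (sub-cent prices keep 4 decimals)
    fmtPrice(3);                       // "$3.00"
    fmtContext(128_000);               // "128K"
    fmtContext(1_000_000);             // "1M" (whole millions drop the decimal)
    fmtContext(1_500_000);             // "1.5M"
    fmtScore(72.1);                    // "72.1%"
    fmtElo(1287.6);                    // "1288"
    fmtModalities(["text", "image"]);  // "text+image"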
Behavior: 3/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full burden. It discloses the key behavioral traits: the tool only compares (it creates and modifies nothing), returns a Markdown table covering specific content areas, and keeps output to a stated budget (~400 tokens). However, it says nothing about data freshness, error behavior, or authentication requirements, which leaves gaps.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is extremely concise and front-loaded, with every sentence earning its place: the first defines the core functionality, and the second specifies the output format and constraints. There's no wasted text or redundancy.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's moderate complexity (comparing models across multiple dimensions) and the absence of annotations or an output schema, the description is mostly complete. It covers purpose, scope, and output format, and since there is no output schema it helpfully spells out the return format (a Markdown table); mentioning data sources or update frequency would close the remaining gap.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 100%: the 'models' parameter is fully documented in the schema (an array of strings, 2-5 items, with example values). The tool description adds no parameter semantics beyond what the schema already states, so it earns the baseline of 3, neither compensating for gaps nor adding value.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description states the tool's purpose with a specific verb ('compare') and resource ('2-5 LLM/VLM models'), and spells out both what is compared (pricing, benchmarks, capabilities) and the output format (a compact Markdown comparison table). That scope cleanly separates it from sibling tools such as 'get_model_info' (single-model details), 'list_top_models' (ranking), and 'recommend_model' (suggestions).

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description makes clear when to use the tool (comparing multiple models side-by-side) and implies its valid scope through the 2-5 model range. However, it never says when not to use it, nor does it name alternatives such as 'get_model_info' for single-model details, which a perfect score would require.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
