
ask_model

Query any AI model with a prompt and receive its response with metadata including latency and token usage. Optionally limit response tokens with automatic distillation.

Instructions

Query any AI model with a prompt. Returns the model's response with metadata.

OUTPUT: Markdown with the model's response, latency, and token usage. If max_response_tokens is set and compression occurred, includes distillation metadata (original tokens, compressed tokens, compressor model, compressor latency). Shows "Saved: X tokens (Y% smaller)" when compression is active. Shows "(cached)" when response is served from cache.

WHEN TO USE: When you need another model's perspective, analysis, or capabilities. Set max_response_tokens to control how much of your context window this response consumes — the response will be distilled by a fast model to fit the budget while preserving code, file paths, errors, and actionable details. Set include_raw=true to see both compressed and original responses for quality verification.

FAILURE MODES:

  • "Model query failed (4xx/5xx)" → The model or provider is unavailable. Try a different model or check that CLIProxyAPI/Ollama is running.

  • "circuit breaker open" → The model failed too many times recently. Try a different model or wait for automatic recovery.

  • Compression silently skipped → If the compressor model is unavailable or the response already fits the budget, the raw response is returned unchanged. This is not an error.
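
EXAMPLE OUTPUT (illustrative): a detailed-format response with compression active might render like the sketch below. Every model name, latency, and token count here is made up; the shape follows `formatResponse` in the Implementation Reference.

    ## Response from gpt-4o

    <distilled answer text>

    ---
    **Latency:** 842ms
    **Tokens:** 120 in → 512 out (632 total)
    **Distilled:** 512 → 198 tokens by gemini-2.5-flash (310ms)
    **Saved:** 314 tokens (61% smaller)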

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| model | Yes | Model ID to query (e.g. 'gpt-4o', 'gemini-2.5-pro') | |
| prompt | Yes | The prompt to send to the model | |
| system_prompt | No | Optional system prompt to set model behavior | |
| temperature | No | Sampling temperature (0 = deterministic, 2 = creative) | |
| max_tokens | No | Maximum tokens in response (default: 1024) | |
| max_response_tokens | No | Maximum tokens in the response returned to you. If the model's response exceeds this, it will be distilled by a fast model to fit, preserving code, file paths, errors, and actionable details while stripping filler. Omit for no compression. | |
| format | No | Response format: 'brief' for token-efficient summary, 'detailed' for full response | detailed |
| include_raw | No | When true and compression is active, include the original uncompressed response for quality comparison. Use this to verify distillation preserved critical details. | |
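
For example, a call that caps the returned response at 500 tokens and keeps the original for comparison would pass arguments like these (the model and prompt are illustrative):

    {
      "model": "gpt-4o",
      "prompt": "Review this function for race conditions: ...",
      "max_response_tokens": 500,
      "include_raw": true
    }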

Implementation Reference

  • The `askModel` function is the core handler: it queries the model via `provider.query(...)` with the model ID, prompt, and options, compresses the response via `compressResponse` when `max_response_tokens` is set, and formats the result via `formatResponse`.
    export async function askModel(
      provider: Provider,
      input: AskModelInput
    ): Promise<string> {
      const response = await provider.query(input.model, input.prompt, {
        system_prompt: input.system_prompt,
        temperature: input.temperature,
        max_tokens: input.max_tokens,
      });
    
      // Compress if max_response_tokens is set
      let compression: CompressionResult | undefined;
      if (input.max_response_tokens) {
        compression = await compressResponse(
          provider,
          response,
          input.max_response_tokens
        );
      }
    
      return formatResponse(
        response,
        input.format ?? "detailed",
        compression,
        input.include_raw ?? false
      );
    }
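  • `compressResponse` and the `CompressionResult` type are not shown on this page. The sketch below is the contract that `askModel` and `formatResponse` appear to assume, inferred from the call sites and field accesses in the surrounding code; it is not the repo's actual definition.
    // Shape inferred from formatResponse's field accesses (illustrative)
    interface CompressionResult {
      content: string;           // distilled text; preferred over response.content when present
      compressed: boolean;       // false when the response already fit or the compressor was unavailable
      compressorModel: string;   // fast model that performed the distillation
      compressorLatency: number; // compressor round-trip in ms
      originalTokens?: number;   // token count before distillation
      compressedTokens?: number; // token count after distillation
      rawContent?: string;       // original response, kept for the include_raw escape hatch
    }

    // Signature inferred from the call in askModel (illustrative)
    declare function compressResponse(
      provider: Provider,
      response: QueryResponse,
      maxResponseTokens: number
    ): Promise<CompressionResult>;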
  • `askModelSchema` — Zod schema defining the input shape for the ask_model tool. Fields: model (string), prompt (string), system_prompt (optional string), temperature (optional number 0-2), max_tokens (optional int, default 1024), max_response_tokens (optional int for compression budget), format (optional enum 'brief'|'detailed', default 'detailed'), include_raw (optional boolean, default false).
    export const askModelSchema = z.object({
      model: z.string().describe("Model ID to query (e.g. 'gpt-4o', 'gemini-2.5-pro')"),
      prompt: z.string().describe("The prompt to send to the model"),
      system_prompt: z
        .string()
        .optional()
        .describe("Optional system prompt to set model behavior"),
      temperature: z
        .number()
        .min(0)
        .max(2)
        .optional()
        .describe("Sampling temperature (0 = deterministic, 2 = creative)"),
      max_tokens: z
        .number()
        .int()
        .positive()
        .optional()
        .default(1024)
        .describe("Maximum tokens in response (default: 1024)"),
      max_response_tokens: z
        .number()
        .int()
        .positive()
        .optional()
        .describe(
          "Maximum tokens in the response returned to you. If the model's response exceeds this, " +
          "it will be distilled by a fast model to fit — preserving code, file paths, errors, " +
          "and actionable details while stripping filler. Omit for no compression."
        ),
      format: z
        .enum(["brief", "detailed"])
        .optional()
        .default("detailed")
        .describe(
          "Response format — 'brief' for token-efficient summary, 'detailed' for full response"
        ),
      include_raw: z
        .boolean()
        .optional()
        .default(false)
        .describe(
          "When true and compression is active, include the original uncompressed response " +
          "for quality comparison. Use this to verify distillation preserved critical details."
        ),
    });
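  A side note on the schema above: the `.default(...)` calls mean a minimal input parses with defaults applied. A quick sketch of that behavior (standard Zod semantics, not code from the repo):
    const parsed = askModelSchema.parse({ model: "gpt-4o", prompt: "Hello" });
    // parsed.max_tokens === 1024
    // parsed.format === "detailed"
    // parsed.include_raw === false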
  • `formatResponse` — formats the query result into a markdown string. Supports 'brief' format (compact with model name, latency tag, and optional distillation note) and 'detailed' format (full markdown with response content, latency, token usage, compression/distillation metadata, finish reason, and optional raw response escape hatch via HTML details tag).
    function formatResponse(
      response: QueryResponse,
      format: "brief" | "detailed",
      compression?: CompressionResult,
      includeRaw?: boolean
    ): string {
      const content = compression?.content ?? response.content;
      const isCached = response.latency_ms === 0;
    
      if (format === "brief") {
        const latencyTag = isCached ? "cached" : `${response.latency_ms}ms`;
        const lines = [
          `**${response.model}** (${latencyTag})`,
          "",
          content,
        ];
        if (compression?.compressed) {
          const saved = (compression.originalTokens ?? 0) - (compression.compressedTokens ?? 0);
          lines.push("");
          lines.push(
            `*Distilled by ${compression.compressorModel} — saved ${saved} tokens*`
          );
        }
        return lines.join("\n");
      }
    
      // Detailed format
      const latencyText = isCached
        ? `**Latency:** 0ms (cached)`
        : `**Latency:** ${response.latency_ms}ms`;
    
      const lines = [
        `## Response from ${response.model}`,
        "",
        content,
        "",
        "---",
        latencyText,
      ];
    
      if (response.usage) {
        lines.push(
          `**Tokens:** ${response.usage.prompt_tokens} in → ${response.usage.completion_tokens} out (${response.usage.total_tokens} total)`
        );
      }
    
      if (compression?.compressed) {
        const orig = compression.originalTokens ?? 0;
        const comp = compression.compressedTokens ?? 0;
        const saved = orig - comp;
        const pct = orig > 0 ? Math.round((saved / orig) * 100) : 0;
    
        lines.push(
          `**Distilled:** ${orig} → ${comp} tokens by ${compression.compressorModel} (${compression.compressorLatency}ms)`
        );
        lines.push(`**Saved:** ${saved} tokens (${pct}% smaller)`);
      }
    
      if (response.finish_reason && response.finish_reason !== "stop") {
        lines.push(`**Note:** Response ended due to: ${response.finish_reason}`);
      }
    
      // Escape hatch: include raw uncompressed response
      if (includeRaw && compression?.compressed && compression.rawContent) {
        lines.push("");
        lines.push(
          `<details>\n<summary>Raw response (${compression.originalTokens ?? "?"} tokens, before distillation)</summary>\n\n${compression.rawContent}\n\n</details>`
        );
      }
    
      return lines.join("\n");
    }
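  • `QueryResponse` is also not defined on this page. The shape below is reconstructed from the field accesses in `formatResponse` above; a sketch, not the repo's definition.
    interface QueryResponse {
      model: string;          // model that produced the answer
      content: string;        // raw response text
      latency_ms: number;     // 0 is treated as a cache hit
      finish_reason?: string; // surfaced in the output when not "stop"
      usage?: {
        prompt_tokens: number;
        completion_tokens: number;
        total_tokens: number;
      };
    }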
  • src/server.ts:93-121 (registration)
    Registration of the 'ask_model' tool on the MCP server via `server.tool(...)`. Includes tool name, description with output format, usage guidance, and failure modes. Uses `askModelSchema.shape` for input validation and calls `askModel(provider, input)` as the handler, with error handling returning structured error responses.
      // --- ask_model ---
      server.tool(
        "ask_model",
        `Query any AI model with a prompt. Returns the model's response with metadata.
    
    OUTPUT: Markdown with the model's response, latency, and token usage. If max_response_tokens is set and compression occurred, includes distillation metadata (original tokens, compressed tokens, compressor model, compressor latency). Shows "Saved: X tokens (Y% smaller)" when compression is active. Shows "(cached)" when response is served from cache.
    
    WHEN TO USE: When you need another model's perspective, analysis, or capabilities. Set max_response_tokens to control how much of your context window this response consumes — the response will be distilled by a fast model to fit the budget while preserving code, file paths, errors, and actionable details. Set include_raw=true to see both compressed and original responses for quality verification.
    
    FAILURE MODES:
    - "Model query failed (4xx/5xx)" → The model or provider is unavailable. Try a different model or check that CLIProxyAPI/Ollama is running.
    - "circuit breaker open" → The model failed too many times recently. Try a different model or wait for automatic recovery.
    - Compression silently skipped → If the compressor model is unavailable or the response already fits the budget, the raw response is returned unchanged. This is not an error.`,
        askModelSchema.shape,
        async (input) => {
          logger.info(`ask_model: querying ${input.model}`);
          try {
            const result = await askModel(provider, input);
            return { content: [{ type: "text" as const, text: result }] };
          } catch (err) {
            const message = err instanceof Error ? err.message : String(err);
            logger.error(`ask_model failed: ${message}`);
            return {
              content: [{ type: "text" as const, text: `Error: ${message}` }],
              isError: true,
            };
          }
        }
      );
  • src/server.ts:13-13 (registration)
    Import of `askModelSchema` and `askModel` from the ask-model tool module into the server registration file.
    import { askModelSchema, askModel } from "./tools/ask-model.js";
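  • Calling the tool from an MCP client: a minimal sketch using the official TypeScript SDK. The launch command, file path, and model ID are placeholders, not HydraMCP's documented install steps.
    import { Client } from "@modelcontextprotocol/sdk/client/index.js";
    import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

    // Spawn the server over stdio (command and args are placeholders)
    const transport = new StdioClientTransport({
      command: "node",
      args: ["dist/server.js"],
    });
    const client = new Client({ name: "example-client", version: "1.0.0" });
    await client.connect(transport);

    // Ask a model, capping the returned response at 500 tokens
    const result = await client.callTool({
      name: "ask_model",
      arguments: {
        model: "gpt-4o",
        prompt: "Summarize the tradeoffs of optimistic vs pessimistic locking.",
        max_response_tokens: 500,
      },
    });
    console.log(result.content);
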
Behavior: 5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

The description thoroughly discloses behavioral traits: compression and distillation details, caching indicators ('(cached)'), failure modes including circuit breaker and model unavailability, and edge cases like compression silently skipped. This compensates fully for the lack of annotations.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is efficiently structured with clear sections (OUTPUT, WHEN TO USE, FAILURE MODES), no redundant sentences, and front-loaded with the core functionality. Every sentence adds value.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool has 8 parameters and no output schema or annotations, the description covers expected behavior, output format, parameter usage, failure modes, and edge cases comprehensively. It leaves little ambiguity and provides sufficient context for correct tool invocation.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 5/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

With 100% schema description coverage, the baseline is 3. The description adds significant value by explaining the compression mechanism for max_response_tokens, the verification purpose of include_raw, and the response metadata (latency, token usage, distillation info). It enhances parameter understanding beyond the schema.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose: 'Query any AI model with a prompt. Returns the model's response with metadata.' It uses a specific verb and resource, and the output format distinguishes it from sibling tools like list_models or compare_models.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description includes a 'WHEN TO USE' section explaining the appropriate context ('When you need another model's perspective, analysis, or capabilities') and provides guidance on parameters like max_response_tokens and include_raw. However, it does not explicitly mention when not to use this tool or compare it to alternatives like `consensus`.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
