# ask_model

Query any AI model with a prompt and receive its response with metadata, including latency and token usage. Optionally limit response tokens with automatic distillation.
## Instructions
Query any AI model with a prompt. Returns the model's response with metadata.
OUTPUT: Markdown with the model's response, latency, and token usage. If max_response_tokens is set and compression occurred, includes distillation metadata (original tokens, compressed tokens, compressor model, compressor latency). Shows "Saved: X tokens (Y% smaller)" when compression is active. Shows "(cached)" when response is served from cache.
WHEN TO USE: When you need another model's perspective, analysis, or capabilities. Set max_response_tokens to control how much of your context window this response consumes — the response will be distilled by a fast model to fit the budget while preserving code, file paths, errors, and actionable details. Set include_raw=true to see both compressed and original responses for quality verification.
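As a concrete sketch (the model ID and prompt here are illustrative, not prescriptive), a budgeted query might look like:

```ts
// Illustrative input only — any model ID your provider serves will work.
const input = {
  model: "gemini-2.5-pro",
  prompt: "Review this diff for concurrency bugs: ...",
  max_response_tokens: 500, // distill the reply to fit a 500-token budget
  include_raw: false,
};
```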
FAILURE MODES:
- "Model query failed (4xx/5xx)" → The model or provider is unavailable. Try a different model or check that CLIProxyAPI/Ollama is running.
- "circuit breaker open" → The model failed too many times recently. Try a different model or wait for automatic recovery.
- Compression silently skipped → If the compressor model is unavailable or the response already fits the budget, the raw response is returned unchanged. This is not an error.
## Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| model | Yes | Model ID to query (e.g. 'gpt-4o', 'gemini-2.5-pro') | |
| prompt | Yes | The prompt to send to the model | |
| system_prompt | No | Optional system prompt to set model behavior | |
| temperature | No | Sampling temperature (0 = deterministic, 2 = creative) | |
| max_tokens | No | Maximum tokens in response | 1024 |
| max_response_tokens | No | Maximum tokens in the response returned to you. If the model's response exceeds this, it will be distilled by a fast model to fit — preserving code, file paths, errors, and actionable details while stripping filler. Omit for no compression. | |
| format | No | Response format — 'brief' for token-efficient summary, 'detailed' for full response | detailed |
| include_raw | No | When true and compression is active, include the original uncompressed response for quality comparison. Use this to verify distillation preserved critical details. | false |
## Implementation Reference
- **src/tools/ask-model.ts:75-101 (handler)**: The `askModel` function — core handler that queries a model via the provider, optionally compresses the response, then formats and returns the result. It calls `provider.query(...)` with model, prompt, and options, compresses via `compressResponse` when `max_response_tokens` is set, and finally formats the output via `formatResponse`.

```ts
export async function askModel(
  provider: Provider,
  input: AskModelInput
): Promise<string> {
  const response = await provider.query(input.model, input.prompt, {
    system_prompt: input.system_prompt,
    temperature: input.temperature,
    max_tokens: input.max_tokens,
  });

  // Compress if max_response_tokens is set
  let compression: CompressionResult | undefined;
  if (input.max_response_tokens) {
    compression = await compressResponse(
      provider,
      response,
      input.max_response_tokens
    );
  }

  return formatResponse(
    response,
    input.format ?? "detailed",
    compression,
    input.include_raw ?? false
  );
}
```
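The handler references several types defined elsewhere in the codebase. The shapes below are a sketch inferred purely from how the fields are used in `askModel` and `formatResponse`; the real definitions may differ:

```ts
// Inferred shapes — assumptions reconstructed from usage, not the actual source.
interface QueryResponse {
  model: string;
  content: string;
  latency_ms: number; // 0 is treated as a cache hit by formatResponse
  usage?: {
    prompt_tokens: number;
    completion_tokens: number;
    total_tokens: number;
  };
  finish_reason?: string;
}

interface CompressionResult {
  compressed: boolean; // false when compression was skipped
  content: string; // distilled text (or the original, if skipped)
  rawContent?: string; // original text, retained for include_raw
  originalTokens?: number;
  compressedTokens?: number;
  compressorModel?: string;
  compressorLatency?: number;
}

interface Provider {
  query(
    model: string,
    prompt: string,
    options: {
      system_prompt?: string;
      temperature?: number;
      max_tokens?: number;
    }
  ): Promise<QueryResponse>;
}
```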
- **src/tools/ask-model.ts:26-71 (schema)**: `askModelSchema` — the Zod schema defining the input shape for the ask_model tool: model (string), prompt (string), system_prompt (optional string), temperature (optional number, 0-2), max_tokens (optional int, default 1024), max_response_tokens (optional int, the compression budget), format (optional 'brief' | 'detailed', default 'detailed'), include_raw (optional boolean, default false).

```ts
export const askModelSchema = z.object({
  model: z
    .string()
    .describe("Model ID to query (e.g. 'gpt-4o', 'gemini-2.5-pro')"),
  prompt: z.string().describe("The prompt to send to the model"),
  system_prompt: z
    .string()
    .optional()
    .describe("Optional system prompt to set model behavior"),
  temperature: z
    .number()
    .min(0)
    .max(2)
    .optional()
    .describe("Sampling temperature (0 = deterministic, 2 = creative)"),
  max_tokens: z
    .number()
    .int()
    .positive()
    .optional()
    .default(1024)
    .describe("Maximum tokens in response (default: 1024)"),
  max_response_tokens: z
    .number()
    .int()
    .positive()
    .optional()
    .describe(
      "Maximum tokens in the response returned to you. If the model's response exceeds this, " +
        "it will be distilled by a fast model to fit — preserving code, file paths, errors, " +
        "and actionable details while stripping filler. Omit for no compression."
    ),
  format: z
    .enum(["brief", "detailed"])
    .optional()
    .default("detailed")
    .describe(
      "Response format — 'brief' for token-efficient summary, 'detailed' for full response"
    ),
  include_raw: z
    .boolean()
    .optional()
    .default(false)
    .describe(
      "When true and compression is active, include the original uncompressed response " +
        "for quality comparison. Use this to verify distillation preserved critical details."
    ),
});
```
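A quick way to see the defaults in action is to run an input through the schema directly (a usage sketch; the prompt and model are arbitrary):

```ts
const input = askModelSchema.parse({
  model: "gpt-4o",
  prompt: "Summarize the tradeoffs of optimistic locking.",
});

// zod fills in the declared defaults:
// input.max_tokens === 1024
// input.format === "detailed"
// input.include_raw === false
```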
- **src/tools/ask-model.ts:103-174 (helper)**: `formatResponse` — formats the query result into a markdown string. Supports a 'brief' format (compact: model name, latency tag, optional distillation note) and a 'detailed' format (full markdown with response content, latency, token usage, compression/distillation metadata, finish reason, and an optional raw-response escape hatch rendered as an HTML details tag).

```ts
function formatResponse(
  response: QueryResponse,
  format: "brief" | "detailed",
  compression?: CompressionResult,
  includeRaw?: boolean
): string {
  const content = compression?.content ?? response.content;
  const isCached = response.latency_ms === 0;

  if (format === "brief") {
    const latencyTag = isCached ? "cached" : `${response.latency_ms}ms`;
    const lines = [`**${response.model}** (${latencyTag})`, "", content];
    if (compression?.compressed) {
      const saved =
        (compression.originalTokens ?? 0) - (compression.compressedTokens ?? 0);
      lines.push("");
      lines.push(
        `*Distilled by ${compression.compressorModel} — saved ${saved} tokens*`
      );
    }
    return lines.join("\n");
  }

  // Detailed format
  const latencyText = isCached
    ? `**Latency:** 0ms (cached)`
    : `**Latency:** ${response.latency_ms}ms`;
  const lines = [
    `## Response from ${response.model}`,
    "",
    content,
    "",
    "---",
    latencyText,
  ];
  if (response.usage) {
    lines.push(
      `**Tokens:** ${response.usage.prompt_tokens} in → ${response.usage.completion_tokens} out (${response.usage.total_tokens} total)`
    );
  }
  if (compression?.compressed) {
    const orig = compression.originalTokens ?? 0;
    const comp = compression.compressedTokens ?? 0;
    const saved = orig - comp;
    const pct = orig > 0 ? Math.round((saved / orig) * 100) : 0;
    lines.push(
      `**Distilled:** ${orig} → ${comp} tokens by ${compression.compressorModel} (${compression.compressorLatency}ms)`
    );
    lines.push(`**Saved:** ${saved} tokens (${pct}% smaller)`);
  }
  if (response.finish_reason && response.finish_reason !== "stop") {
    lines.push(`**Note:** Response ended due to: ${response.finish_reason}`);
  }

  // Escape hatch: include raw uncompressed response
  if (includeRaw && compression?.compressed && compression.rawContent) {
    lines.push("");
    lines.push(
      `<details>\n<summary>Raw response (${compression.originalTokens ?? "?"} tokens, before distillation)</summary>\n\n${compression.rawContent}\n\n</details>`
    );
  }

  return lines.join("\n");
}
```
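For illustration only, a 'detailed' response with compression active would render roughly like this (the model names, timings, and token counts below are invented):

```text
## Response from gpt-4o

...model answer here...

---
**Latency:** 2140ms
**Tokens:** 312 in → 841 out (1153 total)
**Distilled:** 841 → 390 tokens by gemini-flash (480ms)
**Saved:** 451 tokens (54% smaller)
```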
- **src/server.ts:93-121 (registration)**: Registers the 'ask_model' tool on the MCP server via `server.tool(...)`, passing the tool name, the description (output format, usage guidance, failure modes), `askModelSchema.shape` for input validation, and a handler that calls `askModel(provider, input)` and returns structured error responses on failure.

```ts
// --- ask_model ---
server.tool(
  "ask_model",
  `Query any AI model with a prompt. Returns the model's response with metadata.

OUTPUT: Markdown with the model's response, latency, and token usage. If max_response_tokens is set and compression occurred, includes distillation metadata (original tokens, compressed tokens, compressor model, compressor latency). Shows "Saved: X tokens (Y% smaller)" when compression is active. Shows "(cached)" when response is served from cache.

WHEN TO USE: When you need another model's perspective, analysis, or capabilities. Set max_response_tokens to control how much of your context window this response consumes — the response will be distilled by a fast model to fit the budget while preserving code, file paths, errors, and actionable details. Set include_raw=true to see both compressed and original responses for quality verification.

FAILURE MODES:
- "Model query failed (4xx/5xx)" → The model or provider is unavailable. Try a different model or check that CLIProxyAPI/Ollama is running.
- "circuit breaker open" → The model failed too many times recently. Try a different model or wait for automatic recovery.
- Compression silently skipped → If the compressor model is unavailable or the response already fits the budget, the raw response is returned unchanged. This is not an error.`,
  askModelSchema.shape,
  async (input) => {
    logger.info(`ask_model: querying ${input.model}`);
    try {
      const result = await askModel(provider, input);
      return { content: [{ type: "text" as const, text: result }] };
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      logger.error(`ask_model failed: ${message}`);
      return {
        content: [{ type: "text" as const, text: `Error: ${message}` }],
        isError: true,
      };
    }
  }
);
```
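On the client side, the registered tool is reachable through a standard MCP tool call. A hedged sketch using the TypeScript MCP SDK, assuming an already-connected `client` (setup and transport omitted, and the exact result shape depends on the SDK version):

```ts
const result = await client.callTool({
  name: "ask_model",
  arguments: {
    model: "gemini-2.5-pro",
    prompt: "What are the failure modes of exponential backoff?",
    max_response_tokens: 400, // budget the reply for our context window
  },
});

// The returned text content is the markdown produced by formatResponse.
```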
- **src/server.ts:13 (registration)**: Import of `askModelSchema` and `askModel` from the ask-model tool module into the server registration file.

```ts
import { askModelSchema, askModel } from "./tools/ask-model.js";
```