
CI-1T Prediction Stability Engine

fleet_evaluate

Evaluate prediction stability across multiple model nodes. Detects warnings, faults, and ghost confirmations in score streams, providing per-node episodes and fleet summary for monitoring and alerting.

Instructions

Evaluate a fleet of model nodes for prediction stability. Each node provides a score stream. Returns per-node episodes and aggregate fleet stats. Accepts floats (0.0–1.0) or Q0.16 integers (0–65535) — auto-converts per node. Response: { nodes: [{ node_id, episodes: [{ ci_out, ci_ema_out, al_out, warn, fault, ghost_confirmed, ... }] }], fleet_summary, credits_used, credits_remaining }. Chain per-node episodes → visualize, alert_check, or compare_windows. For persistent multi-round fleet monitoring, use fleet_session_create instead.
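
The response shape listed above could be consumed along these lines. This is a minimal sketch, not the server's own types: the field types (numbers for ci_out, ci_ema_out and al_out; booleans for warn, fault and ghost_confirmed) are assumptions, since the description gives only field names, and countFlags and the sample data are hypothetical.

```typescript
// Assumed shapes: the tool description names these fields but not their types.
interface Episode {
  ci_out: number;
  ci_ema_out: number;
  al_out: number;
  warn: boolean;
  fault: boolean;
  ghost_confirmed: boolean;
  [extra: string]: unknown; // the description elides further fields with "..."
}

interface NodeResult {
  node_id: number | string; // type is not stated in the description
  episodes: Episode[];
}

interface FlagCounts {
  warn: number;
  fault: number;
  ghost: number;
}

// Hypothetical helper: tally flagged episodes for each node.
function countFlags(nodes: NodeResult[]): Record<string, FlagCounts> {
  const out: Record<string, FlagCounts> = {};
  for (const node of nodes) {
    const counts: FlagCounts = { warn: 0, fault: 0, ghost: 0 };
    for (const ep of node.episodes) {
      if (ep.warn) counts.warn += 1;
      if (ep.fault) counts.fault += 1;
      if (ep.ghost_confirmed) counts.ghost += 1;
    }
    out[String(node.node_id)] = counts;
  }
  return out;
}

// Made-up sample data for illustration only; not real API output.
const sampleNodes: NodeResult[] = [
  {
    node_id: 0,
    episodes: [
      { ci_out: 100, ci_ema_out: 90, al_out: 0, warn: false, fault: false, ghost_confirmed: false },
      { ci_out: 900, ci_ema_out: 400, al_out: 2, warn: true, fault: true, ghost_confirmed: false },
    ],
  },
  {
    node_id: 1,
    episodes: [
      { ci_out: 50, ci_ema_out: 45, al_out: 0, warn: false, fault: false, ghost_confirmed: true },
    ],
  },
];

const flagCounts = countFlags(sampleNodes);
console.log(flagCounts);
```

A per-node tally like this is one way to decide which nodes' episodes are worth passing on to visualize, alert_check, or compare_windows.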

Input Schema

Name | Required | Description | Default
nodes | Yes | Array of node score arrays; each inner array is one node's scores (floats or Q0.16). Max 16 nodes, 10,000 scores per node. | -
n | No | Episode length | 3
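
The limits in this schema can be checked on the client before spending a call. The sketch below mirrors the constraints in the zod definition shown under Implementation Reference (1 to 16 nodes, 1 to 10,000 scores per node, each score between 0 and 65535, n an integer from 2 to 8). checkFleetInput is a hypothetical helper written without zod; it is not part of the server.

```typescript
// Limits taken from the zod schema in the Implementation Reference.
const MAX_NODES = 16;
const MAX_SCORES_PER_NODE = 10000;
const MAX_SCORE = 65535;

// Hypothetical helper: returns a list of problems; an empty list means the input fits the schema.
function checkFleetInput(nodes: number[][], n?: number): string[] {
  const errors: string[] = [];
  if (nodes.length < 1 || nodes.length > MAX_NODES) {
    errors.push(`expected 1 to ${MAX_NODES} nodes, got ${nodes.length}`);
  }
  nodes.forEach((scores, i) => {
    if (scores.length < 1 || scores.length > MAX_SCORES_PER_NODE) {
      errors.push(`node ${i}: expected 1 to ${MAX_SCORES_PER_NODE} scores, got ${scores.length}`);
    }
    if (scores.some((s) => !Number.isFinite(s) || s < 0 || s > MAX_SCORE)) {
      errors.push(`node ${i}: every score must be a number between 0 and ${MAX_SCORE}`);
    }
  });
  if (n !== undefined && (!Number.isInteger(n) || n < 2 || n > 8)) {
    errors.push(`n must be an integer from 2 to 8, got ${n}`);
  }
  return errors;
}

const okInput = checkFleetInput([[0.1, 0.2, 0.9], [100, 200, 300]], 3);
const badInput = checkFleetInput([[]], 9);
console.log(okInput, badInput);
```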

Implementation Reference

  • src/index.ts:484-504 (registration)
    The tool 'fleet_evaluate' is registered via server.tool() with its schema (zod validation for nodes and n) and handler function.
    server.tool(
      "fleet_evaluate",
      "Evaluate a fleet of model nodes for prediction stability. Each node provides a score stream. Returns per-node episodes and aggregate fleet stats. Accepts floats (0.0–1.0) or Q0.16 integers (0–65535) — auto-converts per node. Response: { nodes: [{ node_id, episodes: [{ ci_out, ci_ema_out, al_out, warn, fault, ghost_confirmed, ... }] }], fleet_summary, credits_used, credits_remaining }. Chain per-node episodes → visualize, alert_check, or compare_windows. For persistent multi-round fleet monitoring, use fleet_session_create instead.",
      {
        nodes: z.array(z.array(z.number().min(0).max(65535)).min(1).max(10000)).min(1).max(16).describe("Array of node score arrays — each inner array is one node's scores (floats or Q0.16). Max 16 nodes, 10,000 scores per node."),
        n: z.number().int().min(2).max(8).optional().describe("Episode length (default: 3)"),
      },
      async ({ nodes, n }) => {
        const guard = requireApiKey();
        if (guard) return guard;
        const q16Nodes = nodes.map((nodeScores) => toQ16(nodeScores));
        const body: Record<string, unknown> = { nodes: q16Nodes };
        if (n !== undefined) body.config = { n };
        const result = await apiFetch("/api/fleet-evaluate", {
          method: "POST",
          headers: apiKeyHeaders(),
          body,
        });
        return formatResult(result);
      }
    );
  • The handler function for fleet_evaluate: checks API key auth, converts scores to Q0.16 via toQ16, POSTs to /api/fleet-evaluate, and returns the formatted API result.
    async ({ nodes, n }) => {
      const guard = requireApiKey();
      if (guard) return guard;
      const q16Nodes = nodes.map((nodeScores) => toQ16(nodeScores));
      const body: Record<string, unknown> = { nodes: q16Nodes };
      if (n !== undefined) body.config = { n };
      const result = await apiFetch("/api/fleet-evaluate", {
        method: "POST",
        headers: apiKeyHeaders(),
        body,
      });
      return formatResult(result);
    }
  • Input schema: nodes (array of number arrays, min 1 node, max 16 nodes, each 1-10000 scores), optional n (episode length, 2-8).
    {
      nodes: z.array(z.array(z.number().min(0).max(65535)).min(1).max(10000)).min(1).max(16).describe("Array of node score arrays — each inner array is one node's scores (floats or Q0.16). Max 16 nodes, 10,000 scores per node."),
      n: z.number().int().min(2).max(8).optional().describe("Episode length (default: 3)"),
    },
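  • toQ16 helper: normalizes each node's scores to Q0.16 integers (0-65535) before they are sent to the API. A node is scaled from floats only when all of its values lie in 0-1 and at least one has a fractional part; otherwise its values are rounded and clamped as-is, so a node containing only 0s and 1s is treated as Q0.16 integers.
    /** Auto-convert: if every value is within 0–1 and at least one has a fractional part, scale to Q0.16. Otherwise round and clamp to 0–65535. */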
    function toQ16(scores: number[]): number[] {
      const hasDecimals = scores.some((s) => s % 1 !== 0);
      const allInUnit = scores.every((s) => s >= 0 && s <= 1);
      const isFloat = hasDecimals && allInUnit;
      return isFloat
        ? scores.map((s) => Math.round(Math.max(0, Math.min(1, s)) * Q16))
        : scores.map((s) => Math.round(Math.max(0, Math.min(Q16, s))));
    }
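
The per-node detection rule in toQ16 has a few edge cases that are easier to see with concrete inputs. The sketch below repeats the helper so it runs on its own; the value Q16 = 65535 is an assumption (the constant is defined outside the excerpt, and 65535 matches the documented 0–65535 range).

```typescript
// Assumption: Q16 is the maximum Q0.16 value; its definition is not shown in the excerpt.
const Q16 = 65535;

// Copied from the excerpt above so the example is self-contained.
function toQ16(scores: number[]): number[] {
  const hasDecimals = scores.some((s) => s % 1 !== 0);
  const allInUnit = scores.every((s) => s >= 0 && s <= 1);
  const isFloat = hasDecimals && allInUnit;
  return isFloat
    ? scores.map((s) => Math.round(Math.max(0, Math.min(1, s)) * Q16))
    : scores.map((s) => Math.round(Math.max(0, Math.min(Q16, s))));
}

// All values in 0–1 with fractional parts: scaled to Q0.16.
console.log(toQ16([0.5, 0.25])); // [32768, 16384]

// Only whole numbers: treated as already Q0.16, so 1 stays 1 rather than becoming 65535.
console.log(toQ16([0, 1, 1])); // [0, 1, 1]

// Mixed node (a value above 1): not treated as floats, so 0.5 is just rounded.
console.log(toQ16([0.5, 2])); // [1, 2]

// Values above the range are clamped (the zod schema would reject 70000 before this point).
console.log(toQ16([70000, 100])); // [65535, 100]
```

Because the decision is made per node, a float node whose scores all happen to be exactly 0 or 1 is not scaled; including at least one fractional value, or sending Q0.16 integers directly, avoids the ambiguity.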
Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations, the description carries full burden. It discloses input acceptance (floats or Q0.16 integers, auto-conversion), output structure (nodes, episodes with fields, fleet_summary, credits), and behavior ('auto-converts per node'). No side effects or limitations are mentioned, but the core behavior is well-described.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is concise yet information-rich, front-loading the purpose and covering input, output, and usage in a few short sentences with no redundancy. Every sentence earns its place.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (2 parameters, no output schema), the description sufficiently covers input format, output structure, conversion behavior, and usage context. It also distinguishes from the persistent alternative, making it complete for one-shot fleet evaluation.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 100%, but the description adds value by explaining the conversion semantics for scores, the episode length default, and the response structure. It also suggests chaining, which goes beyond the schema constraints.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's action ('Evaluate'), resource ('fleet of model nodes'), and purpose ('prediction stability'). It also distinguishes itself from siblings by explicitly mentioning 'For persistent multi-round fleet monitoring, use fleet_session_create instead.'

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 5/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides explicit guidance on when to use this tool vs alternatives: it suggests chaining to 'visualize, alert_check, or compare_windows' and advises using 'fleet_session_create' for persistent multi-round monitoring.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
