operant-mcp
Server Details
Read-only MCP server for the OPERANT AI operating-agent calibration benchmark.
- Status
- Healthy
- Last Tested
- Transport
- Streamable HTTP
- URL
Glama MCP Gateway
Connect through Glama MCP Gateway for full control over tool access and complete visibility into every call.
Full call logging
Every tool call is logged with complete inputs and outputs, so you can debug issues and audit what your agents are doing.
Tool access control
Enable or disable individual tools per connector, so you decide what your agents can and cannot do.
Managed credentials
Glama handles OAuth flows, token storage, and automatic rotation, so credentials never expire on your clients.
Usage analytics
See which tools your agents call, how often, and when, so you can understand usage patterns and catch anomalies.
Tool Definition Quality
Average 4.5/5 across 5 of 5 tools scored.
Each tool has a clearly distinct purpose: compare_models for side-by-side comparison, get_case for full case details, get_methodology for benchmark design, get_results for all model profiles, and list_cases for case metadata. No overlap in functionality.
All tool names follow a consistent verb_noun pattern (compare_models, get_case, get_methodology, get_results, list_cases), making them predictable and easy to understand.
With 5 tools, the server is well-scoped for a benchmark evaluation tool. Each tool serves a necessary role without unnecessary bloat or missing essentials.
The tool surface covers all major aspects of the benchmark: methodology explanation, case browsing and retrieval, result aggregation, and model comparison. No obvious gaps that would hinder typical use cases.
Available Tools
5 toolscompare_modelsCompare two modelsARead-onlyInspect
Side-by-side comparison of two models by display_name (case-insensitive substring). Returns ocs_mean, ocs_stdev, orchestration_mean, run_family, and subject_shell for each, plus a single_run_note where stdev is null. If a name is ambiguous or not found, returns an error listing all available display_names.
| Name | Required | Description | Default |
|---|---|---|---|
| model_a | Yes | Display name (or substring) of the first model | |
| model_b | Yes | Display name (or substring) of the second model |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already declare readOnlyHint=true (safe read). Description adds behavioral details: case-insensitive substring matching, return fields including single_run_note condition, and error handling for ambiguous names.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two sentences: first states main purpose and return fields, second covers error behavior. No redundancy or fluff.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
For a simple comparison tool with two parameters and no output schema, the description covers purpose, parameter semantics, return fields, and error handling completely.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% with descriptions. Description adds that matching is case-insensitive substring, which goes beyond schema.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
Explicit verb+resource: 'Side-by-side comparison of two models by display_name (case-insensitive substring).' Clearly distinguishes from siblings which cover cases, methodology, results, and listing.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
States when to use (comparison with two models) and error behavior for ambiguous names. Does not explicitly mention alternatives, but sibling tools are clearly different.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
get_caseGet a caseARead-onlyInspect
Return the full case for a given pair_id (axes 1/2/4) or id (axis 3): malign and benign task prompts, expected decisions, grounding rationale, and bypass patterns. Axis 3 cases are single (unmatched) and use an 'id' field instead of 'pair_id'. Use list_cases to browse available ids.
| Name | Required | Description | Default |
|---|---|---|---|
| axis | Yes | The axis this case belongs to | |
| pair_id | Yes | The pair_id (axes 1/2/4) or id (axis 3) to retrieve |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations indicate readOnlyHint=true, and the description adds details about what the tool returns and the nuance that axis 3 uses 'id' instead of 'pair_id'. No contradictions.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two sentences effectively convey the purpose, return content, and axis-specific behavior. No redundant information.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's simplicity (2 params, no output schema), the description thoroughly covers what is returned, parameter semantics, and references to sibling tools. Complete for agent decision-making.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
The input schema provides descriptions for both parameters, but the description adds crucial context: pair_id is for axes 1/2/4 and id for axis 3, and clarifies the axis enum values. This adds value beyond the schema.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states it returns the full case for a given pair_id or id, specifying the axes and the content (malign/benign prompts, expected decisions, etc.). It distinguishes from sibling list_cases by saying to use list_cases to browse available ids.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description explains when to use this tool (to retrieve a specific case) and mentions an alternative (use list_cases to browse ids). However, it does not explicitly state when not to use it.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
get_methodologyGet benchmark methodologyARead-onlyInspect
Return the benchmark design: the 4 axes, the OCS formula (Youden's J), the 5 decision labels (PROCEED, PROCEED_SANCTIONED, REFUSE, ESCALATE, REROUTE), scoring blocks, and a concise what-it-measures / what-it-doesn't summary. Sourced from the baked case-file metadata.
| Name | Required | Description | Default |
|---|---|---|---|
No parameters | |||
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations indicate readOnlyHint=true, and the description adds context about data source ('baked case-file metadata'), enhancing transparency without contradiction.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two sentences front-load the action and list returned items precisely, with zero wasted words.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
For a parameterless tool with no output schema, the description fully enumerates all returned content (4 axes, formula, labels, blocks, summary), making it complete.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
No parameters exist; baseline is 4 per rubric. Description does not need to add parameter information as schema coverage is 100% with zero parameters.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states 'Return the benchmark design' with specific enumerated components (axes, formula, labels, etc.), distinguishing it from sibling tools which focus on model comparison, case details, or results.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description implies usage for retrieving methodology without explicit when-not or alternatives, but the tool's purpose is well-defined and separate from siblings, providing clear context.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
get_resultsGet calibration resultsARead-onlyInspect
Return all model calibration profiles (OCS mean/stdev, orchestration mean, run_family, subject_shell), plus generated_at, included_lab_labels, and the not-a-flat-leaderboard caveat. Models are returned as-is — do not pre-sort into a naive leaderboard. Single-run models (null stdev) must not be ranked as-if reliable against multi-run models.
| Name | Required | Description | Default |
|---|---|---|---|
No parameters | |||
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations indicate a read-only operation. The description adds valuable behavioral context beyond annotations, such as the exact data fields returned, the 'as-is' ordering, and the important caveat about single-run model reliability.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is very concise: two sentences that front-load the purpose and then deliver a critical behavioral instruction. Every word adds value, with no redundancy.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given no parameters and no output schema, the description is remarkably complete. It lists all return fields, provides usage caveats, and gives clear behavioral instructions. Nothing essential is missing for a tool of this simplicity.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
There are no parameters, so baseline 4 is appropriate. The description adds meaning by describing the output data, which helps an agent understand what the tool returns without needing parameter details.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the tool returns model calibration profiles with explicit fields (OCS mean/stdev, orchestration mean, etc.) and includes a caveat about not sorting into a leaderboard, distinguishing it from sibling tools like compare_models.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
The description provides clear context on when to use the tool (to get raw calibration data) and explicit instructions on how to interpret results (do not pre-sort, do not rank single-run models). However, it does not explicitly mention when not to use it or alternatives.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
list_casesList casesARead-onlyInspect
Return case metadata (no full task prompts): pair_id/id, axis, tier, grounding, and side indicators (malign/benign for axes 1/2/4; null for axis 3). Filter by axis, or omit for all cases across all axes (the result includes a count). Use get_case to fetch a full case with task prompts and expected decisions.
| Name | Required | Description | Default |
|---|---|---|---|
| axis | No | Axis to filter by: refusal-calibration | sanctioned-path | orchestration | escalation-reroute |
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations already indicate readOnlyHint=true. Description adds beyond this by specifying that no full task prompts are returned, and lists exact metadata fields. No contradictions with annotations.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
Two sentences front-load the return value and then provide filtering options and sibling tool reference. No redundant or wasted words.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Lists return fields and mentions count, but lacks explicit description of output structure (e.g., array format, count property). However, since output schema is absent and the tool is simple with one optional parameter, it is fairly complete.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema coverage is 100% for the only parameter (axis) with enum description. Description adds that omitting axis returns all cases, which is not in schema description. Baseline 3 is appropriate as schema already documents the parameter well.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
Description clearly states the tool returns case metadata with specific fields (pair_id/id, axis, tier, grounding, side indicators) and explicitly distinguishes from sibling get_case which fetches full case with prompts. The verb 'return' and resource 'case metadata' are specific.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
Provides explicit guidance on filtering by axis or omitting for all cases, and directs to get_case for full task prompts. Does not explicitly state when not to use, but the differentiation is clear enough for an agent to decide.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
Claim this connector by publishing a /.well-known/glama.json file on your server's domain with the following structure:
{
"$schema": "https://glama.ai/mcp/schemas/connector.json",
"maintainers": [{ "email": "your-email@example.com" }]
}The email address must match the email associated with your Glama account. Once published, Glama will automatically detect and verify the file within a few minutes.
Control your server's listing on Glama, including description and metadata
Access analytics and receive server usage reports
Get monitoring and health status updates for your server
Feature your server to boost visibility and reach more users
For users:
Full audit trail – every tool call is logged with inputs and outputs for compliance and debugging
Granular tool control – enable or disable individual tools per connector to limit what your AI agents can do
Centralized credential management – store and rotate API keys and OAuth tokens in one place
Change alerts – get notified when a connector changes its schema, adds or removes tools, or updates tool definitions, so nothing breaks silently
For server owners:
Proven adoption – public usage metrics on your listing show real-world traction and build trust with prospective users
Tool-level analytics – see which tools are being used most, helping you prioritize development and documentation
Direct user feedback – users can report issues and suggest improvements through the listing, giving you a channel you would not have otherwise
The connector status is unhealthy when Glama is unable to successfully connect to the server. This can happen for several reasons:
The server is experiencing an outage
The URL of the server is wrong
Credentials required to access the server are missing or invalid
If you are the owner of this MCP connector and would like to make modifications to the listing, including providing test credentials for accessing the server, please contact support@glama.ai.
Discussions
No comments yet. Be the first to start the discussion!