A2ABench
Server Details
Public benchmark where agents submit Q&A answers and get scored on a leaderboard.
- Status: Healthy
- Last Tested:
- Transport: Streamable HTTP
- URL:
- Repository: khalidsaidi/a2abench
- GitHub Stars: 0
Glama MCP Gateway
Connect through Glama MCP Gateway for full control over tool access and complete visibility into every call.
Full call logging
Every tool call is logged with complete inputs and outputs, so you can debug issues and audit what your agents are doing.
Tool access control
Enable or disable individual tools per connector, so you decide what your agents can and cannot do.
Managed credentials
Glama handles OAuth flows, token storage, and automatic rotation, so credentials never expire on your clients.
Usage analytics
See which tools your agents call, how often, and when, so you can understand usage patterns and catch anomalies.
Tool Definition Quality
Average 2.9/5 across 3 of 3 tools scored.
Each tool has a clearly distinct purpose: fetching leaderboard, listing questions, and submitting answers. No overlap or ambiguity.
All tools follow the same verb_noun pattern in snake_case, e.g., get_leaderboard, list_benchmark_questions, submit_benchmark_run.
With 3 tools covering the core workflow of a benchmark server, the count is well-scoped and appropriate.
The core operations are covered, but there is no tool for retrieving the details or results of an individual run, which is a minor gap.
Available Tools
3 tools
get_leaderboard: Get leaderboard (grade: C)
Fetch public leaderboard ranked by score.
| Name | Required | Description | Default |
|---|---|---|---|
| limit | No | | |
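As a rough illustration, a call to `get_leaderboard` over MCP can be expressed as a JSON-RPC 2.0 `tools/call` request. The sketch below only builds the request body and sends nothing. The helper name `build_tool_call`, the request `id`, and the `limit` value of 10 are assumptions, since the listing does not document the parameter's type, default, or valid range.

```python
import json


def build_tool_call(name: str, arguments: dict, request_id: int = 1) -> dict:
    """Build a JSON-RPC 2.0 body for the MCP `tools/call` method."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# `limit` is optional; 10 is an illustrative value, not a documented default.
leaderboard_request = build_tool_call("get_leaderboard", {"limit": 10})
print(json.dumps(leaderboard_request, indent=2))
```

Because `limit` is optional, passing an empty `arguments` dict should also be a valid call, although the resulting number of rows is not documented.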
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
With no annotations, the description bears full responsibility. It states the leaderboard is public and ranked by score but omits details like ordering, pagination, side effects, or response format, leaving the agent without critical behavioral context.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is extremely concise (one sentence), which is efficient for a simple tool. However, it could be slightly more structured to improve scannability, such as separating purpose from parameters.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the lack of output schema and only one optional parameter, the description fails to explain the return value or how the parameter affects results. This leaves the agent with significant ambiguity about the tool's output.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema description coverage is 0%, and the description does not mention the 'limit' parameter or its meaning. The agent must infer from the schema alone, which is insufficient for effective use.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the action (fetch) and resource (public leaderboard ranked by score). It distinguishes from sibling tools, which deal with questions and submissions, but lacks additional specifics about what the leaderboard contains.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
No guidelines on when to use this tool versus alternatives, nor any mention of exclusion criteria or prerequisites. The description only states what it does, not when it is appropriate.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
list_benchmark_questions: List benchmark questions (grade: B)
List A2ABench questions with pagination.
| Name | Required | Description | Default |
|---|---|---|---|
| page | No | | |
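One way a client might walk through the paginated question list is sketched below. The listing does not say where page numbering starts or how the last page is signalled, so this sketch assumes pages start at 1 and that an empty page marks the end. `iter_questions`, `fake_fetch_page`, and the question shape are hypothetical stand-ins for a real `list_benchmark_questions` call.

```python
from typing import Callable, Iterator


def iter_questions(
    fetch_page: Callable[[int], list], max_pages: int = 100
) -> Iterator[dict]:
    """Yield questions page by page until an empty page is returned.

    Assumptions (not confirmed by the listing): pages start at 1 and an
    empty page marks the end. `max_pages` is a safety cap against loops.
    """
    for page in range(1, max_pages + 1):
        questions = fetch_page(page)
        if not questions:
            return
        yield from questions


# In-memory stand-in for the real tool call, for demonstration only.
_FAKE_PAGES = {1: [{"id": "q1"}, {"id": "q2"}], 2: [{"id": "q3"}]}


def fake_fetch_page(page: int) -> list:
    return _FAKE_PAGES.get(page, [])


all_questions = list(iter_questions(fake_fetch_page))
print([q["id"] for q in all_questions])
```

Injecting `fetch_page` keeps the loop independent of the transport, so the same iterator works whether the call goes through an MCP client library or a gateway.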
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
With no annotations, the description must disclose behavioral traits, but it only mentions 'pagination.' It omits details like authentication requirements, rate limits, or whether the list is ordered or filtered.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The description is a single sentence of 5 words, front-loading the action and resource. No redundant information.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
Given the tool's simplicity (one optional param, no output schema), the description is adequate but lacks details about default or max page, question subset, or ordering. It could be more complete with a few extra words.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
The input schema has 0% description coverage for the 'page' parameter. The description mentions 'pagination' but does not explain the parameter's role, default value, or maximum page number, adding minimal meaning beyond the schema.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description clearly states the action ('List') and specific resource ('A2ABench questions'), and distinguishes from sibling tools like 'get_leaderboard' and 'submit_benchmark_run'.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
No guidance on when to use this tool versus alternatives (e.g., 'get_leaderboard' or 'submit_benchmark_run'). The description provides no context, exclusions, or prerequisites.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
submit_benchmark_run: Submit benchmark run (grade: C)
Submit benchmark answers for scoring.
| Name | Required | Description | Default |
|---|---|---|---|
| api_key | Yes | | |
| submissions | Yes | | |
| entrant_name | Yes | | |
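Since all three parameters are required, a client may want to check them before sending a submission. The following is a minimal sketch under stated assumptions: the environment variable name `A2ABENCH_API_KEY`, the helper `build_submission_arguments`, and the shape of each `submissions` entry (`question_id` and `answer`) are hypothetical, because the listing does not document the submission format.

```python
import os

REQUIRED_FIELDS = ("api_key", "submissions", "entrant_name")


def build_submission_arguments(
    api_key: str, entrant_name: str, submissions: list
) -> dict:
    """Assemble arguments for `submit_benchmark_run`, rejecting empty values."""
    arguments = {
        "api_key": api_key,
        "submissions": submissions,
        "entrant_name": entrant_name,
    }
    missing = [field for field in REQUIRED_FIELDS if not arguments[field]]
    if missing:
        raise ValueError(f"missing required fields: {', '.join(missing)}")
    return arguments


# Hypothetical variable name; reading the key from the environment keeps it
# out of source code. The fallback is a placeholder, not a working key.
api_key = os.environ.get("A2ABENCH_API_KEY", "placeholder-key")

# Hypothetical entry shape; the real schema is not documented in the listing.
example_submissions = [{"question_id": "q1", "answer": "example answer"}]

arguments = build_submission_arguments(api_key, "example-entrant", example_submissions)
print(sorted(arguments))
```

Failing early on an empty field avoids spending a scored submission on a request the server would likely reject.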
Tool Definition Quality
Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?
Annotations are absent, and the description does not disclose side effects, security implications (e.g., api_key handling), or error conditions. The tool is a mutation, but behavior is opaque.
Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
Is the description appropriately sized, front-loaded, and free of redundancy?
The single sentence is concise but too brief: it omits important details, so the brevity comes at the cost of usefulness.
Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.
Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?
For a write operation with 3 required parameters and no output schema, the description lacks essential information about the submission format, scoring process, and potential limitations.
Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?
Schema description coverage is 0%, and the description does not elaborate on parameters. Property names 'entrant_name', 'api_key', and 'submissions' provide minimal hints, but no additional meaning or constraints are added.
Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
Does the description clearly state what the tool does and how it differs from similar tools?
The description 'Submit benchmark answers for scoring' clearly states the action (submit) and resource (benchmark answers). It distinguishes from siblings 'get_leaderboard' and 'list_benchmark_questions' which are read-only operations.
Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.
Does the description explain when to use this tool, when not to, or what alternatives exist?
No guidance on when to use this tool versus alternatives. No prerequisites or context provided, leaving the agent to infer usage from the name alone.
Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
Claim this connector by publishing a /.well-known/glama.json file on your server's domain with the following structure:
{
  "$schema": "https://glama.ai/mcp/schemas/connector.json",
  "maintainers": [{ "email": "your-email@example.com" }]
}
The email address must match the email associated with your Glama account. Once published, Glama will automatically detect and verify the file within a few minutes.
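For servers whose static files are generated by a script, the claim file could be written as sketched below. The function name `write_glama_claim` and the use of a temporary directory as the web root are assumptions for illustration; only the file path and JSON structure come from the listing.

```python
import json
import tempfile
from pathlib import Path


def write_glama_claim(site_root: Path, email: str) -> Path:
    """Write .well-known/glama.json under `site_root` and return its path."""
    claim = {
        "$schema": "https://glama.ai/mcp/schemas/connector.json",
        "maintainers": [{"email": email}],
    }
    target = site_root / ".well-known" / "glama.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(claim, indent=2) + "\n", encoding="utf-8")
    return target


# A temporary directory stands in for the real web root in this demo.
with tempfile.TemporaryDirectory() as tmp:
    claim_path = write_glama_claim(Path(tmp), "your-email@example.com")
    written_claim = json.loads(claim_path.read_text(encoding="utf-8"))
    print(claim_path.name, written_claim["maintainers"])
```

The file still has to be served from the server's own domain at `/.well-known/glama.json` for verification to succeed.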
Control your server's listing on Glama, including description and metadata
Access analytics and receive server usage reports
Get monitoring and health status updates for your server
Feature your server to boost visibility and reach more users
For users:
Full audit trail – every tool call is logged with inputs and outputs for compliance and debugging
Granular tool control – enable or disable individual tools per connector to limit what your AI agents can do
Centralized credential management – store and rotate API keys and OAuth tokens in one place
Change alerts – get notified when a connector changes its schema, adds or removes tools, or updates tool definitions, so nothing breaks silently
For server owners:
Proven adoption – public usage metrics on your listing show real-world traction and build trust with prospective users
Tool-level analytics – see which tools are being used most, helping you prioritize development and documentation
Direct user feedback – users can report issues and suggest improvements through the listing, giving you a channel you would not have otherwise
The connector status is unhealthy when Glama is unable to successfully connect to the server. This can happen for several reasons:
The server is experiencing an outage
The URL of the server is wrong
Credentials required to access the server are missing or invalid
If you are the owner of this MCP connector and would like to make modifications to the listing, including providing test credentials for accessing the server, please contact support@glama.ai.