
Scholar Feed MCP Server

by YGao2005

get_benchmark_stats

Retrieve statistical metrics for research benchmarks to contextualize paper performance. Provides distribution data including min, max, median, mean, and standard deviation for dataset-metric combinations.

Instructions

Get score distribution statistics for a dataset+metric across all papers. Returns min, max, median, mean, p25, p75, stddev, and count. Use this to contextualize a paper's claims — e.g., 'For MMLU accuracy, the median is 72.5% across 45 papers, range 33%-95%.' No judgment or outlier flags — just raw statistics.

Input Schema

| Name | Required | Description | Default |
|------|----------|-------------|---------|
| dataset | Yes | Dataset/benchmark name, e.g. 'ImageNet', 'MMLU', 'SWE-bench Verified' | |
| metric | Yes | Metric name, e.g. 'accuracy', 'F1', 'pass@1' | |
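
To make the call shape concrete, here is a minimal sketch using the MCP TypeScript SDK's `callTool` method. The argument names come straight from the schema above; the client setup is elided and assumed to be already connected.

```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";

// Minimal sketch: invoke get_benchmark_stats over an already-connected
// MCP client. Argument names come from the input schema above.
async function fetchBenchmarkStats(client: Client) {
  return client.callTool({
    name: "get_benchmark_stats",
    arguments: {
      dataset: "MMLU",    // required: dataset/benchmark name
      metric: "accuracy", // required: metric name
    },
  });
}
```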
Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full burden of behavioral disclosure. It effectively describes key traits: it's a read-only operation (implied by 'Get'), returns aggregated statistics across all papers, and explicitly states it provides 'raw statistics' without judgment or outlier flags. However, it doesn't mention potential limitations like rate limits, error handling, or data freshness, leaving some gaps in behavioral context.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is front-loaded with the core purpose in the first sentence, followed by usage guidance and exclusions. Every sentence earns its place: the first defines the tool, the second provides a usage example, and the third clarifies limitations. It's efficiently structured with zero waste, making it easy to parse quickly.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's moderate complexity (2 parameters, no output schema, no annotations), the description is largely complete. It covers purpose, usage, and behavioral traits well. However, without an output schema, it doesn't detail the return format (e.g., structure of the statistics object), which could leave ambiguity for the agent. This minor gap prevents a perfect score.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
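
For illustration, one plausible reading of the return shape, based solely on the fields the description enumerates. These field names are assumptions, not a published contract:

```typescript
// Assumed shape only: the server publishes no output schema, so these
// field names are inferred from the description and may not match reality.
interface BenchmarkStats {
  min: number;    // lowest score observed across papers
  max: number;    // highest score observed
  median: number;
  mean: number;
  p25: number;    // 25th percentile
  p75: number;    // 75th percentile
  stddev: number; // standard deviation
  count: number;  // number of papers contributing scores
}
```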

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 100%: both parameters (dataset and metric) carry clear descriptions with examples, so the schema already fully documents them. The description adds no parameter semantics beyond the schema, such as format constraints on names or interdependencies between the two fields. A baseline score of 3 is appropriate: the schema does the heavy lifting, but the description itself does not enhance parameter understanding.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose with specific verbs ('Get score distribution statistics') and resources ('for a dataset+metric across all papers'), distinguishing it from siblings like get_leaderboard (which might show rankings) or get_paper_results (which focuses on individual papers). It explicitly mentions what it returns (min, max, median, etc.), making the function unambiguous.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 5/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides explicit guidance on when to use this tool: 'Use this to contextualize a paper's claims' with a concrete example. It also specifies exclusions: 'No judgment or outlier flags — just raw statistics,' which helps differentiate it from tools like compare_methods or get_leaderboard that might involve analysis or ranking. This clearly defines its role versus alternatives.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
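
As a usage sketch, the "contextualize a paper's claims" workflow might look like the following, where `BenchmarkStats` is the assumed shape from the Completeness section and the z-score framing is the caller's own analysis (the tool itself returns raw statistics, no judgment):

```typescript
// Sketch: place a paper's claimed score against the benchmark distribution.
// The interpretation happens client-side; the tool supplies only raw stats.
function contextualizeClaim(claim: number, stats: BenchmarkStats): string {
  const z = stats.stddev > 0 ? (claim - stats.mean) / stats.stddev : 0;
  return (
    `A claim of ${claim} sits ${z.toFixed(1)} standard deviations from the ` +
    `mean across ${stats.count} papers (median ${stats.median}, ` +
    `range ${stats.min}-${stats.max}).`
  );
}
```

Fed the example numbers from the tool's own description (median 72.5, range 33-95 across 45 papers), this would let an agent phrase a claim relative to the field rather than in isolation.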
