# Research Agent Benchmark v1.0.0

**Purpose**: Test LLM research agents on multi-source verification, synthesis quality, citation accuracy, and hallucination prevention.

**Difficulty**: Advanced (requires 10-20 sources, 5+ synthesis tasks, explicit citations)

**Estimated Time**: 30-60 minutes for a competent agent

---

## PART 1: BENCHMARK PROMPT (Copy-Pastable)

````markdown
# Research Task: Multi-Paradigm State Management in Modern Web Applications

You are a senior research analyst tasked with investigating the current state of client-side state management in modern web applications (React, Vue, Angular, Svelte) as of October 2025.

## Research Questions (5 total - investigate ALL 5):

1. **Evolution & Paradigms** (2020-2025):
   - How has state management philosophy evolved from 2020 to 2025?
   - What are the current dominant paradigms (flux, atomic, signals, proxy-based, etc.)?
   - Which paradigms are gaining or losing adoption, and why?

2. **Library Landscape**:
   - What are the top 5 state management libraries by npm downloads (provide exact numbers)?
   - For each library: current version, release date, bundle size, key features
   - Compare developer satisfaction scores (State of JS survey or similar)

3. **Performance Characteristics**:
   - Benchmarks: Which libraries have the best/worst performance for:
     - Large state trees (10,000+ items)
     - Frequent updates (100+ updates/second)
     - Memory efficiency
   - Provide specific numbers from official benchmarks (not opinions)

4. **Framework Integration**:
   - React: Built-in solutions (Context, useReducer) vs external libraries
   - Vue: Pinia vs Vuex vs Composition API alone
   - Angular: NgRx vs Akita vs signals-based state
   - Svelte: Stores vs Svelte 5 runes
   - For each: official recommendations, adoption trends

5. **Edge Cases & Limitations**:
   - Server-side rendering (SSR) compatibility issues
   - Time-travel debugging support
   - DevTools integration quality
   - TypeScript type safety (strict mode compatibility)
   - Concurrent rendering support (React 18+)

## Requirements:

### Source Verification:
- ✅ Use ONLY official documentation (no blogs, no Stack Overflow)
- ✅ Cite EVERY source with format: "Per [Source] v[Version] ([Date]): [Finding]"
- ✅ Verify claims across 2-3 sources minimum
- ✅ Mark confidence: FACT (1 authoritative source), VERIFIED (2-3 sources), CONSENSUS (5+ sources)

### Synthesis:
- ✅ Don't just list sources - synthesize findings into a coherent narrative
- ✅ Identify consensus vs. disagreements explicitly
- ✅ Provide actionable recommendations per question
- ✅ Note any version-specific differences

### Anti-Hallucination:
- ✅ If data is not found in official sources, state: "Unable to verify: [claim]"
- ✅ Never assume or extrapolate without evidence
- ✅ Mark any extrapolations as "OPINION" or "INFERENCE"

### Output Format:

For each question (1-5):

```
Question [N/5]: [Question text]

FINDING: [One-sentence summary]
Confidence: [FACT / VERIFIED / CONSENSUS / MIXED / UNVERIFIED]

Detailed Synthesis:
[2-4 paragraphs integrating findings from multiple sources]

Sources:
1. [Source Name] v[Version] ([Date]): [Key point]
   URL: [full URL]
2. [Source Name] v[Version] ([Date]): [Key point]
   URL: [full URL]
3. [...]

Recommendation: [Actionable insight]
```

## Success Criteria:
- [ ] ALL 5 questions researched with synthesis
- [ ] Minimum 15 unique sources cited (3+ per question)
- [ ] ALL claims have source citations
- [ ] Zero hallucinated facts (all claims verifiable)
- [ ] Zero repetition across questions
- [ ] Actionable recommendations provided

**Time limit**: 60 minutes.

Begin research now.
````
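The citation format and confidence labels above are mechanical enough that an evaluator can spot-check them automatically. The following TypeScript sketch illustrates one way to do that; the regular expression and the source-count thresholds mirror the prompt's rules, but the function names are illustrative and not part of the benchmark, and the handling of the 4-source case (which the prompt does not define) is an assumption.

```typescript
// Sketch: spot-check the "Per [Source] v[Version] ([Date]): [Finding]" citation
// format and derive a confidence label from the number of corroborating sources.

// Matches lines such as: Per Example Docs v1.2.3 (2025-01-01): Example finding.
const CITATION_PATTERN = /^Per .+ v\S+ \(.+\): .+$/;

export function isWellFormedCitation(line: string): boolean {
  return CITATION_PATTERN.test(line.trim());
}

// Thresholds follow the prompt: FACT = 1 authoritative source,
// VERIFIED = 2-3 sources, CONSENSUS = 5+ sources.
// The prompt does not define the 4-source case; treating it as VERIFIED is an assumption.
export function confidenceLabel(
  sourceCount: number
): "UNVERIFIED" | "FACT" | "VERIFIED" | "CONSENSUS" {
  if (sourceCount >= 5) return "CONSENSUS";
  if (sourceCount >= 2) return "VERIFIED";
  if (sourceCount === 1) return "FACT";
  return "UNVERIFIED";
}
```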
---

## PART 2: BENCHMARK SCORING RUBRIC

**Total Score**: 100 points

### Category 1: Source Verification (25 points)

**Source Quality (10 points)**:
- 10 pts: All sources are official documentation (product docs, academic papers, standards)
- 7 pts: 80%+ official sources, the rest reputable secondary sources (technical books, established blogs)
- 4 pts: 50%+ official sources, some tertiary (tutorials, community docs)
- 0 pts: Uses blogs, Stack Overflow, or unverified sources as primary evidence

**Citation Completeness (10 points)**:
- 10 pts: ALL claims cited with format "Per [Source] v[Version] ([Date]): [Finding]"
- 7 pts: 80%+ claims cited, format mostly correct
- 4 pts: 50%+ claims cited, format inconsistent
- 0 pts: Few/no citations, or missing version/date

**Multi-Source Verification (5 points)**:
- 5 pts: Each claim verified across 2-3+ sources, confidence levels marked
- 3 pts: Most claims have 2+ sources, some confidence levels marked
- 1 pt: Mostly single-source claims
- 0 pts: No cross-referencing

### Category 2: Synthesis Quality (25 points)

**Integration (10 points)**:
- 10 pts: Findings synthesized into a coherent narrative, not just listed
- 7 pts: Good synthesis with minor listing
- 4 pts: Mostly lists sources, minimal synthesis
- 0 pts: Pure source listing, no synthesis

**Consensus Identification (5 points)**:
- 5 pts: Explicitly identifies consensus vs. disagreements with evidence
- 3 pts: Notes some agreements/disagreements
- 1 pt: Vague about consensus
- 0 pts: Doesn't distinguish consensus

**Actionable Insights (10 points)**:
- 10 pts: Clear, specific, actionable recommendations per question
- 7 pts: Good recommendations, but some are vague
- 4 pts: Generic recommendations
- 0 pts: No recommendations, or unhelpful ones

### Category 3: Anti-Hallucination (25 points)

**Factual Accuracy (15 points)**:
- 15 pts: ALL facts verifiable in cited sources, zero hallucinations
- 10 pts: 1-2 unverifiable claims, but marked as opinion/inference
- 5 pts: 3-5 unverifiable claims
- 0 pts: 6+ unverifiable claims or major factual errors

**Claim Labeling (5 points)**:
- 5 pts: All claims labeled (FACT, VERIFIED, CONSENSUS, OPINION, INFERENCE)
- 3 pts: Most claims labeled
- 1 pt: Few claims labeled
- 0 pts: No labeling

**Handling Unknowns (5 points)**:
- 5 pts: Explicitly states "Unable to verify: [claim]" when data is not found
- 3 pts: Sometimes notes gaps
- 1 pt: Rarely notes gaps
- 0 pts: Never admits gaps, fills with speculation

### Category 4: Completeness (15 points)

**Question Coverage (10 points)**:
- 10 pts: All 5 questions researched with detailed synthesis
- 7 pts: 4/5 questions complete
- 4 pts: 3/5 questions complete
- 0 pts: 2 or fewer questions complete

**Source Count (5 points)**:
- 5 pts: 15+ unique sources cited
- 3 pts: 10-14 sources cited
- 1 pt: 5-9 sources cited
- 0 pts: <5 sources cited

### Category 5: Technical Quality (10 points)

**Specificity (5 points)**:
- 5 pts: Provides exact numbers (npm downloads, bundle sizes, versions, dates)
- 3 pts: Some specific numbers
- 1 pt: Mostly qualitative claims
- 0 pts: Vague throughout

**Version Awareness (5 points)**:
- 5 pts: Notes version-specific differences when relevant
- 3 pts: Sometimes notes versions
- 1 pt: Rarely notes versions
- 0 pts: No version awareness

### Deductions (up to -20 points)

**Repetition (-5 points)**:
- Agent repeats the same information across multiple questions

**Format Violations (-5 points)**:
- Doesn't use the required output format per question

**Time Violations (-10 points)**:
- Exceeds 60 minutes significantly (>90 minutes)

**Incomplete Execution (-10 points)**:
- Stops mid-research without completing all questions
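Totaling a completed rubric is simple arithmetic, but note one subtlety: the four deductions listed above sum to -30, while the heading caps deductions at -20. The sketch below assumes the -20 cap is intended as a hard floor on total deductions; the interface and function names are illustrative only.

```typescript
// Sketch: aggregate category subtotals and deductions into a total score.
// Category maxima and deduction values are taken from the rubric above;
// capping deductions at -20 is an assumption based on the "up to -20 points" heading.

interface CategoryScores {
  sourceVerification: number; // out of 25
  synthesisQuality: number;   // out of 25
  antiHallucination: number;  // out of 25
  completeness: number;       // out of 15
  technicalQuality: number;   // out of 10
}

interface Deductions {
  repetition: number;          // 0 to -5
  formatViolations: number;    // 0 to -5
  timeViolations: number;      // 0 to -10
  incompleteExecution: number; // 0 to -10
}

export function totalScore(categories: CategoryScores, deductions: Deductions): number {
  const earned =
    categories.sourceVerification +
    categories.synthesisQuality +
    categories.antiHallucination +
    categories.completeness +
    categories.technicalQuality;

  const rawDeductions =
    deductions.repetition +
    deductions.formatViolations +
    deductions.timeViolations +
    deductions.incompleteExecution;

  // Assumed cap: deductions never remove more than 20 points in total.
  const cappedDeductions = Math.max(rawDeductions, -20);

  // Clamp to the benchmark's 0-100 range.
  return Math.min(100, Math.max(0, earned + cappedDeductions));
}
```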
---

## SCORING INTERPRETATION

### Tier S+ (90-100): World-Class Research Agent
- Zero hallucinations
- Comprehensive multi-source verification
- Excellent synthesis with actionable insights
- All 5 questions complete with 15+ citations
- Professional-grade output

**Characteristics**:
- Uses only official documentation
- Every claim cited with version/date
- Clear consensus identification
- Notes gaps honestly
- Provides specific data (numbers, versions, dates)

### Tier A (80-89): Strong Research Agent
- Minimal hallucinations (1-2 minor)
- Good multi-source verification
- Strong synthesis quality
- 4-5 questions complete with 10+ citations

**Characteristics**:
- Mostly official sources
- Most claims cited
- Good synthesis
- Some specific data

### Tier B (70-79): Competent Research Agent
- Some hallucinations (3-5)
- Partial multi-source verification
- Adequate synthesis
- 3-4 questions complete with 8+ citations

**Characteristics**:
- Mix of official and secondary sources
- Some claims lack citations
- Basic synthesis
- Limited specific data

### Tier C (60-69): Basic Research Agent
- Frequent hallucinations (6-10)
- Minimal source verification
- Weak synthesis (mostly listing)
- 2-3 questions complete

**Characteristics**:
- Some tertiary sources
- Many uncited claims
- Source listing instead of synthesis
- Vague recommendations

### Tier F (<60): Inadequate Research Agent
- Extensive hallucinations (10+)
- No source verification
- No synthesis
- Incomplete research

**Characteristics**:
- Uses blogs and forums as primary sources
- Few/no citations
- Pure listing
- No recommendations
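The tier boundaries above map directly onto score ranges, so tier assignment can be automated once the total is known. A minimal sketch follows; `tierFor` and `passes` are illustrative names, and the pass threshold comes from the VERDICT section of the scoring template below.

```typescript
// Sketch: map a total score (0-100) to the tier labels defined above.
export function tierFor(score: number): "S+" | "A" | "B" | "C" | "F" {
  if (score >= 90) return "S+"; // 90-100: World-Class
  if (score >= 80) return "A";  // 80-89: Strong
  if (score >= 70) return "B";  // 70-79: Competent
  if (score >= 60) return "C";  // 60-69: Basic
  return "F";                   // <60: Inadequate
}

// PASS/FAIL threshold per the VERDICT section of the scoring template below.
export const passes = (score: number): boolean => score >= 70;
```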
---

## SCORING TEMPLATE

```markdown
# Research Agent Benchmark Score Report

**Agent Tested**: [Agent Name/Model]
**Date**: [Date]
**Evaluator**: [Your Name]

---

## CATEGORY SCORES

### 1. Source Verification (25 points)
- Source Quality: ___/10
- Citation Completeness: ___/10
- Multi-Source Verification: ___/5
**Subtotal**: ___/25

### 2. Synthesis Quality (25 points)
- Integration: ___/10
- Consensus Identification: ___/5
- Actionable Insights: ___/10
**Subtotal**: ___/25

### 3. Anti-Hallucination (25 points)
- Factual Accuracy: ___/15
- Claim Labeling: ___/5
- Handling Unknowns: ___/5
**Subtotal**: ___/25

### 4. Completeness (15 points)
- Question Coverage: ___/10
- Source Count: ___/5
**Subtotal**: ___/15

### 5. Technical Quality (10 points)
- Specificity: ___/5
- Version Awareness: ___/5
**Subtotal**: ___/10

### 6. Deductions
- Repetition: ___/-5
- Format Violations: ___/-5
- Time Violations: ___/-10
- Incomplete Execution: ___/-10
**Subtotal**: ___/0

---

## TOTAL SCORE: ___/100

**Tier**: [S+ / A / B / C / F]

---

## DETAILED EVALUATION

### Strengths:
1. [Specific strength with example]
2. [Specific strength with example]
3. [Specific strength with example]

### Weaknesses:
1. [Specific weakness with example]
2. [Specific weakness with example]
3. [Specific weakness with example]

### Hallucination Examples (if any):
1. **Claim**: "[Uncited or false claim]"
   **Evidence**: [Why this is a hallucination - no source found, contradicts official docs, etc.]
2. **Claim**: "[Uncited or false claim]"
   **Evidence**: [Why this is a hallucination]

### Best Practice Examples:
1. **Citation Example**: "[Good citation format example from output]"
2. **Synthesis Example**: "[Good synthesis example showing integration]"
3. **Recommendation Example**: "[Actionable recommendation example]"

### Recommendations for Improvement:
1. [Specific actionable improvement]
2. [Specific actionable improvement]
3. [Specific actionable improvement]

---

## COMPARISON TO BASELINE

| Metric | This Agent | Baseline (Tier A) | Gap |
|--------|-----------|------------------|-----|
| Hallucinations | X | 1-2 | +/- |
| Sources Cited | X | 10+ | +/- |
| Questions Complete | X/5 | 4-5 | +/- |
| Synthesis Quality | X/10 | 7-10 | +/- |

---

## VERDICT

**Pass/Fail**: [PASS if ≥70, FAIL if <70]

**Summary**: [1-2 sentence overall assessment]

**Ready for Production?**: [YES/NO with rationale]
```

---

## BENCHMARK VALIDATION

This benchmark was designed to test:
1. **Multi-Source Verification**: Requires 15+ sources, 2-3 per claim
2. **Synthesis**: Must integrate findings, not just list sources
3. **Citation Accuracy**: Every claim must use the format "Per [Source] v[Version] ([Date])"
4. **Anti-Hallucination**: All facts must be verifiable in cited sources
5. **Completeness**: All 5 questions must be researched
6. **Specificity**: Must provide exact numbers (npm downloads, bundle sizes, etc.)

**Difficulty Factors**:
- The topic is complex (state management spans 4 frameworks)
- Requires current data (October 2025)
- Performance benchmarks need exact numbers
- Version-specific differences must be noted
- Official sources can disagree (must synthesize)

**Expected Time**: 30-60 minutes for a Tier A agent

**Why This Benchmark Works**:
- Tests all Claudette research principles
- Impossible to pass with hallucinations
- Requires actual source fetching (can't rely on training data)
- Forces synthesis (can't just list sources)
- Has clear, objective scoring criteria

---

## USAGE INSTRUCTIONS

### For Testing:
1. Copy the Part 1 prompt exactly
2. Paste it into the agent interface
3. Run the agent (set a timer for 60 minutes)
4. Capture the complete output
5. Score it using the Part 2 rubric

### For Comparison:
- Test multiple agents with the same prompt
- Score each using the same rubric
- Compare scores objectively (a small tabulation sketch follows at the end of this document)
- Identify patterns in failures

### For Iteration:
- Identify low-scoring categories
- Adjust agent prompts to address weaknesses
- Re-test with the same benchmark
- Track improvement over versions

---

**Version**: 1.0.0
**Created**: 2025-10-15
**Framework**: Claudette Research Agent Principles
**Difficulty**: Advanced
**Estimated Time**: 30-60 minutes
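For the "For Comparison" step above, the COMPARISON TO BASELINE rows can be tabulated mechanically. The sketch below assumes a simple `AgentResult` shape of my own devising; the baseline (Tier A) ranges are taken from the comparison table in the scoring template, and everything else is illustrative.

```typescript
// Sketch: build the "Comparison to Baseline" table rows for a scored agent.
// Baseline (Tier A) ranges follow the table in the scoring template above.

interface AgentResult {
  name: string;
  hallucinations: number;
  sourcesCited: number;
  questionsComplete: number; // out of 5
  synthesisQuality: number;  // out of 10
}

const BASELINE = {
  hallucinations: { min: 1, max: 2, label: "1-2" },
  sourcesCited: { min: 10, max: Infinity, label: "10+" },
  questionsComplete: { min: 4, max: 5, label: "4-5" },
  synthesisQuality: { min: 7, max: 10, label: "7-10" },
} as const;

// The sign shows position relative to the baseline range; whether that is good
// or bad depends on the metric (e.g. fewer hallucinations is better).
function gap(value: number, range: { min: number; max: number }): string {
  if (value < range.min) return `-${range.min - value}`;
  if (value > range.max) return `+${value - range.max}`;
  return "0";
}

export function comparisonRows(agent: AgentResult): string[] {
  return [
    `| Hallucinations | ${agent.hallucinations} | ${BASELINE.hallucinations.label} | ${gap(agent.hallucinations, BASELINE.hallucinations)} |`,
    `| Sources Cited | ${agent.sourcesCited} | ${BASELINE.sourcesCited.label} | ${gap(agent.sourcesCited, BASELINE.sourcesCited)} |`,
    `| Questions Complete | ${agent.questionsComplete}/5 | ${BASELINE.questionsComplete.label} | ${gap(agent.questionsComplete, BASELINE.questionsComplete)} |`,
    `| Synthesis Quality | ${agent.synthesisQuality}/10 | ${BASELINE.synthesisQuality.label} | ${gap(agent.synthesisQuality, BASELINE.synthesisQuality)} |`,
  ];
}
```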
