# Research Agent Benchmark v1.0.0

**Purpose**: Test LLM research agents on multi-source verification, synthesis quality, citation accuracy, and hallucination prevention.

**Difficulty**: Advanced (requires 10-20 sources, 5+ synthesis tasks, explicit citations)

**Estimated Time**: 30-60 minutes for a competent agent

---

## PART 1: BENCHMARK PROMPT (Copy-Pastable)

````markdown
# Research Task: Multi-Paradigm State Management in Modern Web Applications

You are a senior research analyst tasked with investigating the current state of client-side state management in modern web applications (React, Vue, Angular, Svelte) as of October 2025.

## Research Questions (5 total - investigate ALL 5):

1. **Evolution & Paradigms** (2020-2025):
   - How has state management philosophy evolved from 2020 to 2025?
   - What are the current dominant paradigms (flux, atomic, signals, proxy-based, etc.)?
   - Which paradigms are gaining or losing adoption, and why?

2. **Library Landscape**:
   - What are the top 5 state management libraries by npm downloads (provide exact numbers)?
   - For each library: current version, release date, bundle size, key features
   - Compare developer satisfaction scores (State of JS survey or similar)

3. **Performance Characteristics**:
   - Benchmarks: Which libraries have the best/worst performance for:
     - Large state trees (10,000+ items)
     - Frequent updates (100+ updates/second)
     - Memory efficiency
   - Provide specific numbers from official benchmarks (not opinions)

4. **Framework Integration**:
   - React: Built-in solutions (Context, useReducer) vs external libraries
   - Vue: Pinia vs Vuex vs Composition API alone
   - Angular: NgRx vs Akita vs signals-based state
   - Svelte: Stores vs Svelte 5 runes
   - For each: official recommendations, adoption trends

5. **Edge Cases & Limitations**:
   - Server-side rendering (SSR) compatibility issues
   - Time-travel debugging support
   - DevTools integration quality
   - TypeScript type safety (strict mode compatibility)
   - Concurrent rendering support (React 18+)

## Requirements:

### Source Verification:
- ✅ Use ONLY official documentation (no blogs, no Stack Overflow)
- ✅ Cite EVERY source with format: "Per [Source] v[Version] ([Date]): [Finding]"
- ✅ Verify claims across 2-3 sources minimum
- ✅ Mark confidence: FACT (1 authoritative source), VERIFIED (2-3 sources), CONSENSUS (5+ sources)

### Synthesis:
- ✅ Don't just list sources - synthesize findings into a coherent narrative
- ✅ Identify consensus vs. disagreements explicitly
- ✅ Provide actionable recommendations per question
- ✅ Note any version-specific differences

### Anti-Hallucination:
- ✅ If data is not found in official sources, state: "Unable to verify: [claim]"
- ✅ Never assume or extrapolate without evidence
- ✅ Mark any extrapolations as "OPINION" or "INFERENCE"

### Output Format:

For each question (1-5):

```
Question [N/5]: [Question text]

FINDING: [One-sentence summary]
Confidence: [FACT / VERIFIED / CONSENSUS / MIXED / UNVERIFIED]

Detailed Synthesis:
[2-4 paragraphs integrating findings from multiple sources]

Sources:
1. [Source Name] v[Version] ([Date]): [Key point]
   URL: [full URL]
2. [Source Name] v[Version] ([Date]): [Key point]
   URL: [full URL]
3. [...]

Recommendation: [Actionable insight]
```

## Success Criteria:
- [ ] ALL 5 questions researched with synthesis
- [ ] Minimum 15 unique sources cited (3+ per question)
- [ ] ALL claims have source citations
- [ ] Zero hallucinated facts (all claims verifiable)
- [ ] Zero repetition across questions
- [ ] Actionable recommendations provided

**Time limit**: 60 minutes.

Begin research now.
````
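The citation format and confidence labels above are mechanical enough that an evaluator can spot-check them automatically. The following TypeScript sketch illustrates one way to do that; the regular expression and the source-count thresholds mirror the prompt's rules, but the function names are illustrative and not part of the benchmark, and the handling of the 4-source case (which the prompt does not define) is an assumption.

```typescript
// Sketch: spot-check the "Per [Source] v[Version] ([Date]): [Finding]" citation
// format and derive a confidence label from the number of corroborating sources.

// Matches lines such as: Per Example Docs v1.2.3 (2025-01-01): Example finding.
const CITATION_PATTERN = /^Per .+ v\S+ \(.+\): .+$/;

export function isWellFormedCitation(line: string): boolean {
  return CITATION_PATTERN.test(line.trim());
}

// Thresholds follow the prompt: FACT = 1 authoritative source,
// VERIFIED = 2-3 sources, CONSENSUS = 5+ sources.
// The prompt does not define the 4-source case; treating it as VERIFIED is an assumption.
export function confidenceLabel(
  sourceCount: number
): "UNVERIFIED" | "FACT" | "VERIFIED" | "CONSENSUS" {
  if (sourceCount >= 5) return "CONSENSUS";
  if (sourceCount >= 2) return "VERIFIED";
  if (sourceCount === 1) return "FACT";
  return "UNVERIFIED";
}
```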
---

## PART 2: BENCHMARK SCORING RUBRIC

**Total Score**: 100 points

### Category 1: Source Verification (25 points)

**Source Quality (10 points)**:
- 10 pts: All sources are official documentation (product docs, academic papers, standards)
- 7 pts: 80%+ official sources, the rest reputable secondary sources (technical books, established blogs)
- 4 pts: 50%+ official sources, some tertiary (tutorials, community docs)
- 0 pts: Uses blogs, Stack Overflow, or unverified sources as primary evidence

**Citation Completeness (10 points)**:
- 10 pts: ALL claims cited with format "Per [Source] v[Version] ([Date]): [Finding]"
- 7 pts: 80%+ claims cited, format mostly correct
- 4 pts: 50%+ claims cited, format inconsistent
- 0 pts: Few/no citations, or missing version/date

**Multi-Source Verification (5 points)**:
- 5 pts: Each claim verified across 2-3+ sources, confidence levels marked
- 3 pts: Most claims have 2+ sources, some confidence levels marked
- 1 pt: Mostly single-source claims
- 0 pts: No cross-referencing

### Category 2: Synthesis Quality (25 points)

**Integration (10 points)**:
- 10 pts: Findings synthesized into a coherent narrative, not just listed
- 7 pts: Good synthesis with minor listing
- 4 pts: Mostly lists sources, minimal synthesis
- 0 pts: Pure source listing, no synthesis

**Consensus Identification (5 points)**:
- 5 pts: Explicitly identifies consensus vs. disagreements with evidence
- 3 pts: Notes some agreements/disagreements
- 1 pt: Vague about consensus
- 0 pts: Doesn't distinguish consensus

**Actionable Insights (10 points)**:
- 10 pts: Clear, specific, actionable recommendations per question
- 7 pts: Good recommendations, but some are vague
- 4 pts: Generic recommendations
- 0 pts: No recommendations, or unhelpful ones

### Category 3: Anti-Hallucination (25 points)

**Factual Accuracy (15 points)**:
- 15 pts: ALL facts verifiable in cited sources, zero hallucinations
- 10 pts: 1-2 unverifiable claims, but marked as opinion/inference
- 5 pts: 3-5 unverifiable claims
- 0 pts: 6+ unverifiable claims or major factual errors

**Claim Labeling (5 points)**:
- 5 pts: All claims labeled (FACT, VERIFIED, CONSENSUS, OPINION, INFERENCE)
- 3 pts: Most claims labeled
- 1 pt: Few claims labeled
- 0 pts: No labeling

**Handling Unknowns (5 points)**:
- 5 pts: Explicitly states "Unable to verify: [claim]" when data is not found
- 3 pts: Sometimes notes gaps
- 1 pt: Rarely notes gaps
- 0 pts: Never admits gaps, fills with speculation

### Category 4: Completeness (15 points)

**Question Coverage (10 points)**:
- 10 pts: All 5 questions researched with detailed synthesis
- 7 pts: 4/5 questions complete
- 4 pts: 3/5 questions complete
- 0 pts: 2 or fewer questions complete

**Source Count (5 points)**:
- 5 pts: 15+ unique sources cited
- 3 pts: 10-14 sources cited
- 1 pt: 5-9 sources cited
- 0 pts: <5 sources cited

### Category 5: Technical Quality (10 points)

**Specificity (5 points)**:
- 5 pts: Provides exact numbers (npm downloads, bundle sizes, versions, dates)
- 3 pts: Some specific numbers
- 1 pt: Mostly qualitative claims
- 0 pts: Vague throughout

**Version Awareness (5 points)**:
- 5 pts: Notes version-specific differences when relevant
- 3 pts: Sometimes notes versions
- 1 pt: Rarely notes versions
- 0 pts: No version awareness

### Deductions (up to -20 points)

**Repetition (-5 points)**:
- Agent repeats the same information across multiple questions

**Format Violations (-5 points)**:
- Doesn't use the required output format per question

**Time Violations (-10 points)**:
- Exceeds 60 minutes significantly (>90 minutes)

**Incomplete Execution (-10 points)**:
- Stops mid-research without completing all questions
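Totaling a completed rubric is simple arithmetic, but note one subtlety: the four deductions listed above sum to -30, while the heading caps deductions at -20. The sketch below assumes the -20 cap is intended as a hard floor on total deductions; the interface and function names are illustrative only.

```typescript
// Sketch: aggregate category subtotals and deductions into a total score.
// Category maxima and deduction values are taken from the rubric above;
// capping deductions at -20 is an assumption based on the "up to -20 points" heading.

interface CategoryScores {
  sourceVerification: number; // out of 25
  synthesisQuality: number;   // out of 25
  antiHallucination: number;  // out of 25
  completeness: number;       // out of 15
  technicalQuality: number;   // out of 10
}

interface Deductions {
  repetition: number;          // 0 to -5
  formatViolations: number;    // 0 to -5
  timeViolations: number;      // 0 to -10
  incompleteExecution: number; // 0 to -10
}

export function totalScore(categories: CategoryScores, deductions: Deductions): number {
  const earned =
    categories.sourceVerification +
    categories.synthesisQuality +
    categories.antiHallucination +
    categories.completeness +
    categories.technicalQuality;

  const rawDeductions =
    deductions.repetition +
    deductions.formatViolations +
    deductions.timeViolations +
    deductions.incompleteExecution;

  // Assumed cap: deductions never remove more than 20 points in total.
  const cappedDeductions = Math.max(rawDeductions, -20);

  // Clamp to the benchmark's 0-100 range.
  return Math.min(100, Math.max(0, earned + cappedDeductions));
}
```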
---

## SCORING INTERPRETATION

### Tier S+ (90-100): World-Class Research Agent
- Zero hallucinations
- Comprehensive multi-source verification
- Excellent synthesis with actionable insights
- All 5 questions complete with 15+ citations
- Professional-grade output

**Characteristics**:
- Uses only official documentation
- Every claim cited with version/date
- Clear consensus identification
- Notes gaps honestly
- Provides specific data (numbers, versions, dates)

### Tier A (80-89): Strong Research Agent
- Minimal hallucinations (1-2 minor)
- Good multi-source verification
- Strong synthesis quality
- 4-5 questions complete with 10+ citations

**Characteristics**:
- Mostly official sources
- Most claims cited
- Good synthesis
- Some specific data

### Tier B (70-79): Competent Research Agent
- Some hallucinations (3-5)
- Partial multi-source verification
- Adequate synthesis
- 3-4 questions complete with 8+ citations

**Characteristics**:
- Mix of official and secondary sources
- Some claims lack citations
- Basic synthesis
- Limited specific data

### Tier C (60-69): Basic Research Agent
- Frequent hallucinations (6-10)
- Minimal source verification
- Weak synthesis (mostly listing)
- 2-3 questions complete

**Characteristics**:
- Some tertiary sources
- Many uncited claims
- Source listing instead of synthesis
- Vague recommendations

### Tier F (<60): Inadequate Research Agent
- Extensive hallucinations (10+)
- No source verification
- No synthesis
- Incomplete research

**Characteristics**:
- Uses blogs and forums as primary sources
- Few/no citations
- Pure listing
- No recommendations
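The tier boundaries above map directly onto score ranges, so tier assignment can be automated once the total is known. A minimal sketch follows; `tierFor` and `passes` are illustrative names, and the pass threshold comes from the VERDICT section of the scoring template below.

```typescript
// Sketch: map a total score (0-100) to the tier labels defined above.
export function tierFor(score: number): "S+" | "A" | "B" | "C" | "F" {
  if (score >= 90) return "S+"; // 90-100: World-Class
  if (score >= 80) return "A";  // 80-89: Strong
  if (score >= 70) return "B";  // 70-79: Competent
  if (score >= 60) return "C";  // 60-69: Basic
  return "F";                   // <60: Inadequate
}

// PASS/FAIL threshold per the VERDICT section of the scoring template below.
export const passes = (score: number): boolean => score >= 70;
```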
---

## SCORING TEMPLATE

```markdown
# Research Agent Benchmark Score Report

**Agent Tested**: [Agent Name/Model]
**Date**: [Date]
**Evaluator**: [Your Name]

---

## CATEGORY SCORES

### 1. Source Verification (25 points)
- Source Quality: ___/10
- Citation Completeness: ___/10
- Multi-Source Verification: ___/5
**Subtotal**: ___/25

### 2. Synthesis Quality (25 points)
- Integration: ___/10
- Consensus Identification: ___/5
- Actionable Insights: ___/10
**Subtotal**: ___/25

### 3. Anti-Hallucination (25 points)
- Factual Accuracy: ___/15
- Claim Labeling: ___/5
- Handling Unknowns: ___/5
**Subtotal**: ___/25

### 4. Completeness (15 points)
- Question Coverage: ___/10
- Source Count: ___/5
**Subtotal**: ___/15

### 5. Technical Quality (10 points)
- Specificity: ___/5
- Version Awareness: ___/5
**Subtotal**: ___/10

### 6. Deductions
- Repetition: ___/-5
- Format Violations: ___/-5
- Time Violations: ___/-10
- Incomplete Execution: ___/-10
**Subtotal**: ___/0

---

## TOTAL SCORE: ___/100

**Tier**: [S+ / A / B / C / F]

---

## DETAILED EVALUATION

### Strengths:
1. [Specific strength with example]
2. [Specific strength with example]
3. [Specific strength with example]

### Weaknesses:
1. [Specific weakness with example]
2. [Specific weakness with example]
3. [Specific weakness with example]

### Hallucination Examples (if any):
1. **Claim**: "[Uncited or false claim]"
   **Evidence**: [Why this is a hallucination - no source found, contradicts official docs, etc.]
2. **Claim**: "[Uncited or false claim]"
   **Evidence**: [Why this is a hallucination]

### Best Practice Examples:
1. **Citation Example**: "[Good citation format example from output]"
2. **Synthesis Example**: "[Good synthesis example showing integration]"
3. **Recommendation Example**: "[Actionable recommendation example]"

### Recommendations for Improvement:
1. [Specific actionable improvement]
2. [Specific actionable improvement]
3. [Specific actionable improvement]

---

## COMPARISON TO BASELINE

| Metric | This Agent | Baseline (Tier A) | Gap |
|--------|-----------|------------------|-----|
| Hallucinations | X | 1-2 | +/- |
| Sources Cited | X | 10+ | +/- |
| Questions Complete | X/5 | 4-5 | +/- |
| Synthesis Quality | X/10 | 7-10 | +/- |

---

## VERDICT

**Pass/Fail**: [PASS if ≥70, FAIL if <70]

**Summary**: [1-2 sentence overall assessment]

**Ready for Production?**: [YES/NO with rationale]
```

---

## BENCHMARK VALIDATION

This benchmark was designed to test:
1. **Multi-Source Verification**: Requires 15+ sources, 2-3 per claim
2. **Synthesis**: Must integrate findings, not just list sources
3. **Citation Accuracy**: Every claim must use the format "Per [Source] v[Version] ([Date])"
4. **Anti-Hallucination**: All facts must be verifiable in cited sources
5. **Completeness**: All 5 questions must be researched
6. **Specificity**: Must provide exact numbers (npm downloads, bundle sizes, etc.)

**Difficulty Factors**:
- The topic is complex (state management spans 4 frameworks)
- Requires current data (October 2025)
- Performance benchmarks need exact numbers
- Version-specific differences must be noted
- Official sources can disagree (must synthesize)

**Expected Time**: 30-60 minutes for a Tier A agent

**Why This Benchmark Works**:
- Tests all Claudette research principles
- Impossible to pass with hallucinations
- Requires actual source fetching (can't rely on training data)
- Forces synthesis (can't just list sources)
- Has clear, objective scoring criteria

---

## USAGE INSTRUCTIONS

### For Testing:
1. Copy the Part 1 prompt exactly
2. Paste it into the agent interface
3. Run the agent (set a timer for 60 minutes)
4. Capture the complete output
5. Score it using the Part 2 rubric

### For Comparison:
- Test multiple agents with the same prompt
- Score each using the same rubric
- Compare scores objectively (a small tabulation sketch follows at the end of this document)
- Identify patterns in failures

### For Iteration:
- Identify low-scoring categories
- Adjust agent prompts to address weaknesses
- Re-test with the same benchmark
- Track improvement over versions

---

**Version**: 1.0.0
**Created**: 2025-10-15
**Framework**: Claudette Research Agent Principles
**Difficulty**: Advanced
**Estimated Time**: 30-60 minutes
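For the "For Comparison" step above, the COMPARISON TO BASELINE rows can be tabulated mechanically. The sketch below assumes a simple `AgentResult` shape of my own devising; the baseline (Tier A) ranges are taken from the comparison table in the scoring template, and everything else is illustrative.

```typescript
// Sketch: build the "Comparison to Baseline" table rows for a scored agent.
// Baseline (Tier A) ranges follow the table in the scoring template above.

interface AgentResult {
  name: string;
  hallucinations: number;
  sourcesCited: number;
  questionsComplete: number; // out of 5
  synthesisQuality: number;  // out of 10
}

const BASELINE = {
  hallucinations: { min: 1, max: 2, label: "1-2" },
  sourcesCited: { min: 10, max: Infinity, label: "10+" },
  questionsComplete: { min: 4, max: 5, label: "4-5" },
  synthesisQuality: { min: 7, max: 10, label: "7-10" },
} as const;

// The sign shows position relative to the baseline range; whether that is good
// or bad depends on the metric (e.g. fewer hallucinations is better).
function gap(value: number, range: { min: number; max: number }): string {
  if (value < range.min) return `-${range.min - value}`;
  if (value > range.max) return `+${value - range.max}`;
  return "0";
}

export function comparisonRows(agent: AgentResult): string[] {
  return [
    `| Hallucinations | ${agent.hallucinations} | ${BASELINE.hallucinations.label} | ${gap(agent.hallucinations, BASELINE.hallucinations)} |`,
    `| Sources Cited | ${agent.sourcesCited} | ${BASELINE.sourcesCited.label} | ${gap(agent.sourcesCited, BASELINE.sourcesCited)} |`,
    `| Questions Complete | ${agent.questionsComplete}/5 | ${BASELINE.questionsComplete.label} | ${gap(agent.questionsComplete, BASELINE.questionsComplete)} |`,
    `| Synthesis Quality | ${agent.synthesisQuality}/10 | ${BASELINE.synthesisQuality.label} | ${gap(agent.synthesisQuality, BASELINE.synthesisQuality)} |`,
  ];
}
```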
