
M.I.M.I.R - Multi-agent Intelligent Memory & Insight Repository

by orneryd
CHANGELOG.md • 22.1 kB

# Claudette Debug Agent - Changelog

## v2.0.0 (2025-10-15) 🎯 **SURGICAL IMPROVEMENTS** (Ready for Testing)

**Goal**: Address v1.0.0's 4 gaps to reach 90-95/100 (Tier S)
**Based On**: v1.0.0 (82/100) with surgical additions ONLY

**Surgical Additions** (+18 points target):

1. ✅ **Baseline Test Enforcement** (+5 pts) - MANDATORY RULE #1, Phase 0 step 4
2. ✅ **Debug Marker Enforcement** (+5 pts) - MANDATORY RULE #3, Phase 2 required (see the sketch at the end of this entry)
3. ✅ **Multi-Bug Requirement** (+5 pts) - MANDATORY RULE #5, 3+ bugs minimum
4. ✅ **Autonomous Completion** (+3 pts) - Work Style in CORE IDENTITY

**What Was PRESERVED** (all 10 core elements):

- ✅ Phase 0: Verify Context
- ✅ Phase-based workflow (Phases 1-5)
- ✅ Chain-of-thought template
- ✅ Complete example (translation timeout)
- ✅ Anti-patterns section
- ✅ Cleanup emphasis
- ✅ Evidence-driven principle
- ✅ Output filtering examples
- ✅ MANDATORY RULES section
- ✅ File scope clarity

**Changes Summary**:

- Lines: 688 → 769 (+81 lines, +12%)
- MANDATORY RULES: 4 → 7 (+3 critical requirements)
- Phases: 5 → 5 (structure preserved)
- Total removals: 0 (nothing deleted from v1.0.0)

**Target Score**: 90-95/100 (Tier S)
**Status**: 🧪 Ready for benchmark testing
**If v2.0.0 scores lower than v1.0.0 (82/100)**: Revert and try a different approach
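As a concrete illustration of what rules #1 and #3 require, here is a minimal sketch of the debug-marker discipline. The file, function, and planted bug are hypothetical, not taken from the benchmark repo; baseline tests run before any marker goes in, and every marker is deleted once evidence is captured:

```typescript
// discount.ts - hypothetical source file under investigation.
// Baseline tests run first (MANDATORY RULE #1); only then are
// temporary markers added (MANDATORY RULE #3). The greppable
// [DEBUG-BUG1] prefix lets output filtering isolate marker lines,
// and all markers are removed once the evidence is captured.
export function applyDiscount(total: number, percent: number): number {
  console.log(`[DEBUG-BUG1] input: total=${total}, percent=${percent}`);
  const discounted = total - total * percent; // suspect line: missing "/ 100"
  console.log(`[DEBUG-BUG1] output: ${discounted}`);
  return discounted;
}
```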
---

## 🏆 Gold Standard: v1.0.0 (82/100 - Tier A)

**Status**: Production-ready baseline - preserved in v2.0.0

**Proven Performance**:

- Score: 82/100 (Tier A)
- Created an actual test file
- Showed evidence
- Correct file scope
- Precise analysis

**What Makes It Work** (DO NOT CHANGE):

1. ✅ Clear file scope prevents wrong edits
2. ✅ Evidence requirements push toward action
3. ✅ Chain-of-thought template provides structure
4. ✅ Complete example shows what "done" looks like
5. ✅ Phase-based workflow with clear steps
6. ✅ Anti-patterns explicitly listed

**Known Gaps** (can improve):

- Skipped baseline tests (-5 pts)
- No debug markers (-5 pts)
- Stopped after 1 bug (needs 3+)
- Asked for direction (needs autonomy)

**Target for next version**: 90+ (Tier S) by addressing the gaps WITHOUT breaking what works

---

## Experimental Versions (v1.1.0 - v1.5.1): Learning What NOT to Do

All experimental versions scored LOWER than v1.0.0 (82/100). They serve as cautionary examples.

---

## v1.5.1 (2025-10-15) 🎯 **TASK INTERPRETATION FIX** (Not Tested)

**Goal**: Fix v1.5.0's critical failure (edited AGENTS.md instead of debugging code)

**Problem**: The v1.5.0 agent edited AGENTS.md when given "investigate AGENTS.md"

- Interpreted the task as "improve AGENTS.md" ❌
- Should have been "debug the code mentioned in AGENTS.md" ✅
- Same error as v1.2.0 (editing the wrong files)

**Critical Additions**:

1. ✅ **Task Interpretation** - "Find bugs in CODE, NOT improve documentation"
2. ✅ **FILE SCOPE Section** - Explicit about what can and cannot be edited (see the sketch after the key changes below):
   - ✅ CREATE: Test files (reproduction-*.test.ts)
   - ✅ EDIT TEMPORARILY: Source code (debug markers)
   - ❌ NEVER EDIT: Documentation (AGENTS.md, README.md)
   - ❌ NEVER EDIT: Config files (package.json)
3. ✅ **Task Clarification** - "When given 'investigate AGENTS.md', read it to understand the bugs, then debug the source code"
4. ✅ **Common Mistakes** - Explicit anti-patterns: "❌ Editing AGENTS.md or README.md"
5. ✅ **Step 1: Understand Task** - New step before baseline tests explaining that AGENTS.md contains instructions, not bugs
6. ✅ **Updated Example** - Shows reading AGENTS.md but NOT editing it

**Key Changes**:

- Added "YOUR TASK" clarification in CORE IDENTITY
- Added "FILE SCOPE" constraints with ✅/❌ markers
- Added "Common Mistakes to Avoid" in PRODUCTIVE BEHAVIORS
- Added Step 1 "Understand Task" before baseline tests
- Updated the example to show AGENTS.md read but not edited
- Added reminder notes throughout the workflow
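To illustrate what the FILE SCOPE permits, here is a minimal sketch of a reproduction test the agent may create. The file name, module path, and the bug it targets are hypothetical, reusing the `applyDiscount` example from the v2.0.0 entry above:

```typescript
// reproduction-discount.test.ts - hypothetical reproduction test.
// Creating files like this is explicitly allowed by FILE SCOPE;
// documentation such as AGENTS.md is read but never edited.
import { test } from 'node:test';
import assert from 'node:assert/strict';
import { applyDiscount } from './discount'; // hypothetical module path

test('reproduces Bug #1: 10% discount on 100 should yield 90', () => {
  const result = applyDiscount(100, 10);
  assert.equal(result, 90); // fails while the bug is present
});
```

Run before any fix, a test like this fails and supplies the evidence output the prompt demands; the temporary source markers are removed once that evidence is captured.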
**Why This Should Work**:

- Explicit file scope prevents wrong edits
- Task interpretation clarifies that "investigate AGENTS.md" ≠ "edit AGENTS.md"
- Multiple reinforcements throughout the document
- The example shows the correct behavior

**Token Cost**: ~1,350 tokens (+12.5% from v1.5.0)

**Expected Impact**:

- ✅ Reads AGENTS.md (understands the bugs)
- ✅ Runs baseline tests
- ✅ Creates test files
- ✅ Debugs source code
- ❌ Does NOT edit AGENTS.md

**Status**: 🧪 Ready for re-testing

---

## v1.5.0 (2025-10-15) ⚡ **EXECUTION PATTERNS** (FAILED - edited the wrong file)

**Goal**: Apply claudette-auto.md's proven execution patterns to overcome v1.4.0's architectural constraint

**Actual Result**: ❌ **CRITICAL FAILURE** - Score ~5/100 (Tier F)

- The agent edited AGENTS.md instead of debugging code
- Worse than v1.4.0 (66/100)
- Same error as v1.2.0 (editing the wrong files)

**What Went Wrong**: The auto patterns emphasized action but not task understanding

- The task "investigate AGENTS.md" was ambiguous
- The agent interpreted it as "improve AGENTS.md" ❌
- Lacked file scope constraints
- Lacked task interpretation guidance

**Why This Was Expected to Work**: claudette-auto.md achieves autonomous execution. Analysis identified its key patterns:

- "Continue until complete" (auto:10)
- "IMMEDIATELY make tool calls" (auto:12)
- "Move to next step immediately" (auto:22)
- "Only stop when ALL TODO checked" (auto:456)
- Action-pair examples: "I'll do X" + immediate call (auto:462-472)

**Critical Changes**:

1. ✅ **Execution Mindset** - Added Think/Act/Continue/Debug/Clean/Finish (modeled on auto:444-456)
2. ✅ **Productive Behaviors** - Added Always/Replace patterns (modeled on auto:16-35)
3. ✅ **Action Pairs** - Every instruction now paired with an example command
4. ✅ **Response Patterns** - Concrete examples throughout (modeled on auto:462-472)
5. ✅ **Strong Completion Gate** - "Only stop when 3+ bugs complete" (modeled on auto:379-389)
6. ✅ **Immediate Execution** - "IMMEDIATELY make it" language (from auto:12)

**Key Additions**:

- **EXECUTION MINDSET** section (6 imperatives: Think/Act/Continue/Debug/Clean/Finish)
- **PRODUCTIVE BEHAVIORS** section (Always do / Replace patterns)
- **EFFECTIVE RESPONSE PATTERNS** section (6 concrete examples with commands)
- **Strengthened CORE IDENTITY** ("Continue until 3+ bugs", "IMMEDIATELY make calls")
- **Action-first workflow** (every step shows the command to run)

**Token Cost**: ~1,200 tokens (+20% from v1.4.0's 1,000)

**Hypothesis (pre-test)**: If auto's patterns transferred to the debug context, v1.5.0 would:

- ✅ Run baseline tests immediately (not read files)
- ✅ Create test files (not propose them)
- ✅ Add markers to source (not suggest adding them)
- ✅ Show actual evidence (not theoretical analysis)
- ✅ Move to the next bug (not stop at "If you'd like...")

**Expected Score**: 85-90/100 (Tier A-S) if the execution patterns transferred; in practice they did not (see Actual Result above)

**ROI**: +200 tokens for +20-25 points, had it worked

**Status**: ❌ Tested - the auto patterns did not overcome the execution barrier; they produced wrong actions instead

---

## v1.4.0 (2025-10-15)

**Goal**: Remove conflicting directives causing analysis paralysis (v1.3.1 scored 47/100, worse than v1.3.0's 63/100)

### Critical Changes (Conflict Removal & Positive Framing)

1. **Removed 10 conflicting directives**
   - Removed "Phase 0: Verify Context" (conflicted with "FIRST ACTION")
   - Converted 5 phases → continuous flow
   - Removed "DO NOT", "FAILED", "FORBIDDEN" (negative framing)
   - Removed "help you", "assist" (collaborative tone)
   - Removed "implementation agents fix it" (handoff mentality)
   - Removed failure anti-patterns (created self-doubt)
2. **Positive framing only**
   - "Your investigation includes: baseline → evidence → analysis"
   - "Create tests, run them, show output"
   - Direct commands, no prohibitions
3. **Autonomous tone**
   - "You work independently"
   - Removed collaborative language
   - Direct: "Create test. Run test. Show evidence."
4. **Continuous flow** (not phases)
   - "Start → Baseline → Evidence → Analysis → Documentation → Complete"
   - No sequential phases with checkboxes
   - Single narrative arc
5. **Removed self-doubt triggers**
   - No "previous agents failed" examples
   - Only positive examples
   - Confidence-building language
6. **Immediate action examples**
   - Full investigation shown (baseline to cleanup)
   - No planning/proposal examples
   - Continuous flow demonstrated

### Performance Impact (⚠️ UPDATED: Actual Test Results)

- **Token Cost**: ~1,000 tokens (from ~3,250, **-69%**)
- **Actual Score**: **66/100 (Tier B)** - NOT the expected 85-90
- **Improvement**: +19 points from v1.3.1 (47 → 66)
- **Gap**: -24 points from the Tier S target (66 vs 90)

**What Improved**:

- ✅ Analysis quality (18/20 in Root Cause Analysis)
- ✅ Bug identification (30/35 in Bug Discovery)
- ✅ Structure and clarity
- ✅ Zero conflicts

**What Didn't Improve**:

- ❌ Autonomous execution (5/30 in Methodology)
- ❌ Baseline tests never run
- ❌ Zero test files created
- ❌ Still stops at "If you'd like, I can..."

### Behavior Change

**Before (v1.3.1 - 47/100)**:

```
1. Read AGENTS.md
2. Read source files
3. "What I plan to do next..." ← STOPPED
4. Waited (ignored all enforcement)
Result: Analysis only, ZERO tests, ZERO evidence (47/100)
```

**After (v1.4.0 - ACTUAL 66/100)**:

```
1. Load AGENTS.md and source files (not baseline tests!)
2. Analyze code, identify bugs with line numbers
3. Propose reproduction tests
4. "If you'd like, I can draft concrete test files..." ← STOPPED
Result: Excellent analysis, ZERO tests, ZERO evidence (66/100)
```
**Expected vs Actual**:

| Action | Expected | Actual | Status |
|--------|----------|--------|--------|
| Baseline tests | ✅ Run first | ❌ Never run | FAILED |
| Reproduction tests | ✅ Create | ❌ Proposed only | FAILED |
| Debug markers | ✅ Add & show | ❌ Never added | FAILED |
| Evidence | ✅ Gather | ❌ None shown | FAILED |
| Autonomy | ✅ Continuous | ❌ Waits for approval | FAILED |

### Key Insight: Architectural Constraint Discovered

**The problem wasn't**: Weak enforcement language
**The problem was**: 10 conflicting directives creating decision paralysis

**v1.3.1 Paradox**: Stricter enforcement → worse performance (63 → 47)

- More requirements = more conflicts
- The agent chose the safest path: understand → plan → stop

**v1.4.0 Result**: Zero conflicts and positive framing → better analysis but the SAME execution barrier

- Clear path forward ✅
- No decision points ✅
- Autonomous execution ❌ (NOT ACHIEVED)

**Critical Discovery**: All versions (v1.2.0 → v1.4.0) exhibit the same pattern:

- Analyze code ✅
- Propose actions ✅
- Stop at "If you'd like..." ❌
- Wait for an external prompt ❌

**Conclusion**: The agent has a persistent "analyst mindset" that prompt engineering cannot override. Conflict removal improved quality (+19 points) but did NOT enable autonomous execution.

### Conflicts Removed

1. ~~"FIRST ACTION" vs "Phase 0: Verify Context"~~
2. ~~"Do it now" vs "5 sequential phases"~~
3. ~~"Autonomous detective" vs "help you" language~~
4. ~~"Don't say 'Next I'll...'" vs "Narrate what you're doing next"~~
5. ~~"Create tests" vs "Previous agents failed"~~
6. ~~"Your role: detective" vs "implementation agents fix it"~~
7. ~~"DO NOT", "FAILED", "FORBIDDEN"~~ (negative framing)
8. ~~"FOR LEARNING" examples~~ (study mode)
9. ~~"After each bug"~~ (implies future, not now)
10. ~~Multiple "START HERE" points~~ (confusing)

---

## v1.3.1 (2025-10-15)

**Goal**: Enforce action (the v1.3.0 agent scored 63/100: it analyzed but didn't create tests)

### Critical Changes (Action Enforcement)

1. **"DO THIS NOW"** - Strengthened FIRST ACTION with a failure warning
   - Changed: "FIRST ACTION" → "FIRST ACTION (DO THIS NOW, before reading ANY code)"
   - Added: "If you skip this baseline run, you FAILED."
   - Impact: Makes the baseline test run non-optional
2. **Completion criteria** - "YOUR FIRST RESPONSE MUST INCLUDE"
   - Added an explicit checklist of 6 required items
   - Test output, files, and evidence must be shown (not proposed)
   - Impact: Clear expectations for what "done" looks like
3. **Updated anti-pattern** - Shows v1.3.0 stopping at "Next, I'll..."
   - Added an example of the v1.3.0 agent's behavior (63/100)
   - Shows the progression: v1.2.0 (45) → v1.3.0 (63) → correct (90)
   - Impact: The agent sees the exact failure mode to avoid
4. **"DO IT IN THIS RESPONSE"** - Explicit: don't end with "Next, I'll..."
   - Directly addresses the observed failure pattern
   - Impact: Prevents waiting for confirmation

### Performance Impact

- **Token Cost**: +100 tokens (+3%)
- **Expected Score**: 90/100 (Tier S); the actual result was 47/100 (see the v1.4.0 entry above)
- **Expected Improvement**: +27 points from v1.3.0
- **Expected ROI**: 27 points per 100 tokens

### Behavior Change

**Before (v1.3.0)**:

```
1. Read code ✓
2. Identify bugs ✓
3. "Next, I'll help you create a test..." ← STOPPED
4. Waited for confirmation
Result: Good analysis, ZERO tests, ZERO evidence (63/100)
```

**After (v1.3.1, intended)**:

```
1. RUN baseline tests IMMEDIATELY
2. Find Bug #1 (line 136)
3. CREATE reproduction-test.ts (shows path)
4. RUN test → FAILS (shows output)
5. ADD markers → shows evidence
6. CLEAN up → Bug #1 DONE
7. Move to Bug #2... (all in the same response)
Result: Tests + evidence + cleanup (90/100)
```

(In actual testing, v1.3.1 ignored this enforcement and scored 47/100.)
---

## v1.3.0 (2025-10-15)

**Goal**: Fix role confusion (the v1.2.0 agent scored 45/100 after editing source code instead of debugging)

### Critical Changes (Token-Efficient, Front-Loaded)

1. **Explicit role boundary** - "DON'T FIX" in the first 400 tokens
   - Was buried at line 77 (~token 350)
   - Now at lines 36-43 (~token 200)
   - Added: "YOUR ROLE: DETECTIVE 🔍 NOT MECHANIC 🔧"
   - Impact: Prevents role confusion via the primacy effect
2. **Anti-pattern** - Shows an actual agent failure
   - Added an example of the v1.2.0 agent editing source code
   - Result: ZERO tests, ZERO evidence, edited source (FORBIDDEN)
   - Impact: A concrete example of what NOT to do
3. **Role metaphor** - Detective 🔍 vs Mechanic 🔧
   - Clear visual metaphor for role identity
   - "YOU ARE: Detective (find & prove bugs)"
   - "YOU ARE NOT: Mechanic (fix bugs)"
   - Impact: Instant mental model
4. **Verification command** - Check for source edits after each bug
   - `git diff src/ | grep -v ".test.ts" | grep -v "console.log"`
   - Must return: EMPTY (no source changes except test files)
   - Impact: A concrete way to verify compliance

### Performance Impact

- **Token Cost**: +250 tokens (+8%)
- **Actual Score**: 63/100 (Tier B)
- **Improvement**: +18 points from v1.2.0
- **ROI**: 7.2 points per 100 tokens

### Behavior Change

**Before (v1.2.0)**:

```
1. Read code
2. Understood bugs
3. "Plan: implement targeted fixes" ← STOPPED, APPLIED FIXES
4. Edited 4 source files
5. Re-ran tests (pass)
6. "What I changed..."
Result: ZERO tests, edited source (FORBIDDEN) (45/100)
```

**After (v1.3.0)**:

```
1. Read code ✓
2. Identified bugs with line numbers ✓
3. Mapped to complaints ✓
4. Proposed reproduction tests ✓
5. "Next, I'll help you create..." ← STOPPED HERE
Result: Good analysis, but ZERO tests (63/100)
```

**Key Win**: The agent did NOT edit source code (the role boundary worked!)

---

## v1.2.0 (2025-10-14)

**Goal**: Push agents from Tier A (analysis) to Tier S (evidence-based debugging)

### Key Changes from v1.1.0

1. **ACTION-FIRST approach** - Don't ask permission; just start debugging critical bugs
2. **ONE BUG AT A TIME** - Complete each bug fully before moving to the next
3. **Priority-driven** - Attack CRITICAL bugs first (payment, inventory, financial)
4. **No catalog-first** - Don't analyze all bugs before starting

### Performance Impact

- **Actual Score**: 45/100 (Tier D)
- **Issue**: Role confusion - the agent edited source code instead of debugging
- **Root Cause**: The critical "don't fix" constraint was buried at ~token 350

---

## v1.1.0 (2025-10-13)

**Goal**: Move from proposals to actual implementation

### Key Changes from v1.0.0

1. **CREATE, don't propose** - Agents must create test files and add markers, not just suggest them
2. **Progress tracker** - A checklist to track each bug from discovery to evidence
3. **Evidence requirements strengthened** - Must show actual test output and debug logs
4. **Completion criteria updated** - Clear verification steps for each bug

### Performance Impact

- **Expected Score**: 75-80/100 (Tier A)
- **Behavior**: Created tests but asked for permission first

---

## v1.0.0 (2025-10-12)

**Initial Release** - Basic debugging framework

### Core Features

1. Debug specialist role definition
2. 5-phase workflow (Context, Evidence, Instrumentation, Trace, Document)
3. Debugging techniques (binary search, state snapshot, differential, timeline, assumption validation)
4. Output filtering strategies (see the sketch below)
5. Chain-of-thought documentation template
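The pipe patterns themselves live in the agent prompt, not in this changelog. As a rough TypeScript stand-in for that grep-style filtering (a hypothetical helper, not part of the repo), a marker-only filter could look like this:

```typescript
// filter-debug.ts - hypothetical output filter. Once compiled, it can be
// used as: npm test 2>&1 | node filter-debug.js
// It drops everything except [DEBUG-*] marker lines, so noisy test
// output is reduced to just the evidence the debug markers produce.
import * as readline from 'node:readline';

const rl = readline.createInterface({ input: process.stdin });
rl.on('line', (line) => {
  if (line.includes('[DEBUG-')) {
    console.log(line);
  }
});
```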
### Performance Impact

- **Expected Score**: 55-60/100 (Tier C)
- **Expected Behavior**: Analysis only, no active debugging
- **Actual benchmark result**: 82/100 (Tier A) - see the Gold Standard section above

---

## Version Comparison

| Version | Score | Tier | Key Behavior | Issue |
|---------|-------|------|--------------|-------|
| v1.0.0 | 82 | A | Created a test file, correct scope | Skipped baselines, stopped after 1 bug |
| v1.1.0 | 75-80 (exp) | A | Creates tests (with permission) | Asks first |
| v1.2.0 | 45 | D | Edited source code | Role confusion |
| v1.3.0 | 63 | B | Analyzed, proposed | Stopped short |
| v1.3.1 | 47 | D | Analysis only, ignored enforcement | Conflicting directives |
| v1.4.0 | 66 | B | Excellent analysis, zero execution | Analyst mindset |
| v1.5.0 | ~5 | F | Edited AGENTS.md | Task misinterpretation |
| v1.5.1 | untested | - | Adds FILE SCOPE + task interpretation | TBD |
| v2.0.0 | 90-95 (target) | S (target) | Surgical additions to v1.0.0 | TBD |

### Total Progress

- **Gold standard**: v1.0.0 (82/100, Tier A) - still the best actual score
- **Best experimental result**: v1.4.0 (66/100, Tier B) - 16 points below the baseline
- **Worst experimental result**: v1.5.0 (~5/100, Tier F) - execution patterns without task clarity
- **Next step**: v2.0.0, targeting 90-95/100 via surgical additions to v1.0.0

---

## Research & Methodology

### Token Optimization (v1.3.0)

**Research Findings**:

- Optimal system prompt length: 500-1,500 tokens
- Primacy effect: the first 200-300 tokens are weighted most heavily
- Risk zone: >2,000 tokens = instruction loss

**Strategy**:

- Front-load critical constraints in the first 400 tokens
- "DON'T FIX" at ~token 200 (was ~350)
- Role boundary visible immediately (primacy effect)

### Action Enforcement (v1.3.1)

**Research Findings**:

- Agents interpret "help user" as collaborative
- Stopping at "Next, I'll..." indicates waiting for confirmation
- Explicit completion criteria are needed, not just guidelines

**Strategy**:

- "DO THIS NOW" (urgency)
- "If you skip this, you FAILED" (consequence)
- "YOUR FIRST RESPONSE MUST INCLUDE" (checklist)
- "DO IT IN THIS RESPONSE" (no waiting)

**RESULT**: ❌ Created MORE conflicts → the score dropped to 47/100 (worse than v1.3.0's 63/100)

---

## Summary: v1.4.0 Final Assessment

### What We Achieved ✅

1. **Best Prompt Structure** (zero conflicts, positive framing)
2. **69% Token Reduction** (~3,250 → ~1,000 tokens)
3. **Better Analysis Quality** (+19 points from v1.3.1)
4. **Excellent Bug Discovery** (30/35 points)
5. **Strong Root Cause Analysis** (18/20 points)

### What We Discovered ❌

**Architectural Constraint**: The agent has a persistent "analyst mindset"

- All versions (v1.2.0 → v1.4.0) stop at "analyze → propose → wait"
- Prompt engineering improved quality but NOT execution
- Score ceiling: ~66/100 (Tier B) with the current architecture

### Recommendations

**Option 1: Accept Tier B (66/100)**

- Use v1.4.0 as an analysis-phase agent
- The agent identifies bugs → a human creates tests
- Best achievable with prompt engineering alone

**Option 2: Multi-Turn Enforcement**

- After the agent stops, the user says: "Don't ask. Create the test now."
- Achieves execution with an external prompt
- Not autonomous, but practical

**Option 3: Different Agent Architecture**

- Try a different LLM (Claude Opus, GPT-4)
- Custom agent framework with enforced execution
- May require architectural changes beyond prompts

### Key Learnings

**Prompt Engineering Limits**:

- ✅ Can improve analysis quality
- ✅ Can remove conflicts
- ✅ Can reduce tokens
- ❌ Cannot override the architectural execution barrier

**The Persistent Pattern**:

- Conflict removal (v1.4.0) improved analysis
- But execution still requires an external trigger
- The agent waits for approval regardless of prompt

---

## 📚 Lessons Learned: What NOT to Change from v1.0.0

### ❌ FAILED Approach #1: Remove Structure (v1.4.0)

**What changed**: Removed the 5-phase workflow, condensed to a simple flow
**Result**: 66/100 (worse than 82/100)
**Lesson**: The phase-based structure was helping, not hurting

### ❌ FAILED Approach #2: Add Execution Patterns (v1.5.0)

**What changed**: Added claudette-auto.md's "IMMEDIATELY make calls" patterns
**Result**: ~5/100 (edited AGENTS.md instead of debugging)
**Lesson**: Execution patterns without task clarity = wrong actions, fast

### ❌ FAILED Approach #3: Over-Enforcement (v1.3.1)

**What changed**: Added "DO THIS NOW", "If you skip you FAILED", strict requirements
**Result**: 47/100 (worse than 63/100)
**Lesson**: More enforcement = more conflicts = worse performance

### ✅ What v1.0.0 Has That Should NOT Be Removed

1. **Phase 0: Verify Context** - Prevents misunderstanding the task
2. **Phase-based workflow** - Clear structure through the investigation
3. **Chain-of-thought template** - Structured documentation format
4. **Complete example** - Shows a full investigation end-to-end
5. **Anti-patterns section** - Explicit "don't do this" examples
6. **Cleanup emphasis** - Multiple reminders to remove markers
7. **Evidence-driven principle** - "Never guess" mantra
8. **Output filtering examples** - Concrete pipe patterns
9. **MANDATORY RULES** - Critical constraints at the top
10. **File scope clarity** - "Create test files" instruction

### 🎯 Strategy for v2.0.0 (If Attempted)

**KEEP from v1.0.0**:

- All 10 items above
- 688-line structure (proven length)
- YAML frontmatter with tool list
- Teaching-example approach

**ADD to reach Tier S**:

- Baseline test enforcement (first action)
- Debug marker enforcement (not optional)
- Multi-bug requirement (3+, not 1)
- Continuous flow (don't stop at questions)

**Method**: Surgical additions only; preserve the core structure
**Target**: 90-95/100 (Tier S) without breaking the 82/100 baseline

---

**Last Updated**: 2025-10-15
**Gold Standard**: v1.0.0 (82/100 - Tier A)
**Status**: Preserve v1.0.0; learn from the v1.1-v1.5 failures
**Next Step**: Surgical improvements to v1.0.0 only
