# Research-Backed Agentic Prompting Principles (2025) v1.2.0

**With Lessons from claudette-debug, claudette-auto & Advanced Discipline Principles**

---

## 🎯 CORE PRINCIPLES (Research-Backed + Validated)

**Foundation Principles** (1-7):
- Proven through claudette-debug (92/100) and claudette-auto (92/100)
- Validated across multiple agent types and tasks

**Advanced Discipline Principles** (8-10) 🆕:
- Adopted from Autonomous Agent Prompting Framework analysis
- Research-backed with citations (Perez et al., Wang et al., Yao et al.)
- Focus on professional communication, verification, and autonomy

### 1. CHAIN-OF-THOUGHT (CoT) WITH EXECUTION

**Research**: CoT prompting improves accuracy and logical coherence by guiding the LLM through intermediate reasoning steps.

**claudette-debug application** ✅:
- Phase 0-4 workflow (explicit reasoning steps)
- "After each action, state what you found and what you're doing next"
- Multi-Bug Workflow Example showing step-by-step progression
- Debugging Techniques section (5 explicit types)

**claudette-auto application** ✅:
- Phase 1-3 execution protocol
- "Before performing any task, briefly list the sub-steps you intend to follow"
- Detailed TODO with sub-phases (1.1, 1.2, etc.)
- "As you perform each step, state what you are checking or changing"

**PROVEN PATTERN**:
```markdown
✅ DO: Provide explicit phase structure with sub-steps
✅ DO: Require the agent to narrate progress as it works
✅ DO: Use numbered hierarchies (Phase → Step → Action)
❌ DON'T: Ask for reasoning without requiring execution
❌ DON'T: Allow silent execution without progress updates
```

---

### 2. CLEAR ROLE DEFINITION (Identity Over Instructions)

**Research**: Assigning specific roles helps structure interactions and ensures the agent operates within defined parameters.

**claudette-debug application** ✅:
- "Debug Specialist" (not "implementation agent")
- "Detective, not surgeon" (memorable metaphor)
- "Role: Investigate and document, don't fix"
- MANDATORY RULE #6: NO IMPLEMENTATION

**claudette-auto application** ✅:
- "Enterprise Software Development Agent"
- "Continue working until the problem is completely solved"
- Identity stated in the first paragraph (not buried)
- Role includes emotional tone: "conversational, feminine, empathetic"

**PROVEN PATTERN**:
```markdown
✅ DO: State the role in the first 3 lines (identity before instructions)
✅ DO: Use memorable metaphors ("Detective, not surgeon")
✅ DO: Include emotional/tonal guidance in the identity
✅ DO: Reinforce the role at decision points (Completion Criteria)
❌ DON'T: Bury the role definition after 500 tokens
❌ DON'T: Use generic "you are an assistant" language
```

---

### 3. AGENTIC PROMPTING (Step Sequence During Reasoning)

**Research**: Providing specific step sequences during reasoning enables LLMs to execute complex tasks autonomously.

**claudette-debug application** ✅:
- MANDATORY RULES at top (10 explicit rules)
- Phase 0: Verify Context (5-step checklist)
- Phase 1-4: Explicit workflows
- "Bug N/M" tracking format (concrete progress marker)

**claudette-auto application** ✅:
- PRODUCTIVE BEHAVIORS section (always do these)
- Replace patterns: "❌ X → ✅ Y" (explicit alternatives)
- Phase 1: MANDATORY Repository Analysis (7-step checklist)
- Memory Management: 5-step retrieval protocol

**PROVEN PATTERN**:
```markdown
✅ DO: Provide step-by-step checklists at critical phases
✅ DO: Use checkboxes [ ] for multi-step workflows
✅ DO: Show concrete examples of step sequences
✅ DO: Require explicit progress markers ("Bug 1/8", "Phase 2 complete")
❌ DON'T: Provide goals without concrete steps to achieve them
❌ DON'T: Allow the agent to self-define the workflow structure
```
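A minimal sketch of this principle in code, assuming a TypeScript harness around the agent; the `Phase`, `Step`, and `progressMarker` names are hypothetical, not part of any claudette prompt:

```typescript
// Representing a phase/step workflow as data so a harness can enforce
// explicit step sequences and emit the "Step N/M complete" progress
// markers this principle requires, rather than trusting free-form narration.

interface Step {
  id: string;     // e.g. "0.1"
  action: string; // concrete action, not a goal
  done: boolean;
}

interface Phase {
  name: string;
  steps: Step[];
}

// Emit the structured progress marker instead of letting the agent improvise.
function progressMarker(phase: Phase, stepIndex: number): string {
  const step = phase.steps[stepIndex];
  return `${phase.name}: Step ${stepIndex + 1}/${phase.steps.length} complete (${step.action})`;
}

// Example: a 5-step "Phase 0: Verify Context" checklist as data.
const phase0: Phase = {
  name: "Phase 0: Verify Context",
  steps: [
    { id: "0.1", action: "Read the user's request", done: false },
    { id: "0.2", action: "Identify the actual failing test", done: false },
    { id: "0.3", action: "Confirm examples are not the task", done: false },
    { id: "0.4", action: "List files in scope", done: false },
    { id: "0.5", action: "State the plan as sub-steps", done: false },
  ],
};

for (let i = 0; i < phase0.steps.length; i++) {
  phase0.steps[i].done = true;            // the harness marks completion...
  console.log(progressMarker(phase0, i)); // ...and narrates it, never silently
}
```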
---

### 4. REFLECTION MECHANISMS (Self-Verification)

**Research**: Encouraging LLMs to review and verify outputs reduces errors and improves reliability.

**claudette-debug application** ✅:
- MANDATORY RULE #4: CLEAN UP EVERYTHING
- Completion Criteria: 7/7 checklist (must verify all)
- Cleanup Verification: "git diff" command required
- "Final reminder: Before saying you're done, verify ZERO debug markers"

**claudette-auto application** ✅:
- Phase 3: "Run tests after each significant change"
- "Continue working until ALL requirements satisfied"
- TODO Management: "Review remaining TODO items after each phase"
- Context Refresh Triggers (5 explicit reminders)

**PROVEN PATTERN**:
```markdown
✅ DO: Require explicit verification steps before completion
✅ DO: Provide concrete verification commands (git diff, test suite)
✅ DO: Use checklists for self-verification (Completion Criteria)
✅ DO: Add verification triggers throughout workflow
❌ DON'T: Trust agent to "figure out" when it's done
❌ DON'T: Allow completion without explicit verification evidence
```

---

### 5. CONTEXTUAL ADAPTABILITY (Recovery Paths)

**Research**: Ensuring agents can adjust behavior based on context and recover from misunderstandings is vital for robust interactions.

**claudette-debug application** ✅:
- MANDATORY RULE #9: Verify context FIRST
- Phase 0: "Read user's request, identify actual failing test"
- Anti-Pattern warnings: "Taking example scenarios as your task"
- MANDATORY RULE #8: "Examples are NOT your task"

**claudette-auto application** ✅:
- Memory Management: Check .agents/memory.instruction.md at task start
- "When resuming, summarize what you remember"
- Context Refresh Triggers: "After any pause/interruption"
- "Before asking user: Let me check my TODO list first..."

**PROVEN PATTERN**:
```markdown
✅ DO: Require context verification as first action
✅ DO: Provide explicit warnings about common misunderstandings
✅ DO: Add recovery triggers ("Before asking user...")
✅ DO: Show anti-patterns with ❌ vs ✅ examples
❌ DON'T: Assume agent will understand context correctly
❌ DON'T: Allow agent to proceed without verification step
```

---

### 6. ESCALATION PROTOCOLS (When to Stop vs Continue)

**Research**: Defining clear escalation rules helps agents recognize when to seek additional information or continue autonomously.

**claudette-debug application** ✅:
- MANDATORY RULE #5: "Don't stop after finding one bug"
- MANDATORY RULE #7: NO SUMMARY AFTER ONE BUG
- MANDATORY RULE #10: "Don't stop until N = M"
- Work Style: "Without stopping to ask for direction"
- Completion Criteria: "IMMEDIATELY start next bug"

**claudette-auto application** ✅:
- CORE IDENTITY: "Continue working until the problem is completely solved"
- PRODUCTIVE BEHAVIORS: "Continue until ALL requirements are met"
- Replace patterns: "❌ Would you like me to proceed? → ✅ Now updating"
- "Execute plans as you create them" (no permission needed)

**PROVEN PATTERN** (THIS IS THE BREAKTHROUGH):
```markdown
✅ DO: Use negative prohibitions ("Don't stop", "Don't ask")
✅ DO: Replace questions with actions (see "Permission-Seeking Anti-Pattern" Rule #4)
✅ DO: State stop condition explicitly ("until N = M", "ALL requirements")
✅ DO: Add continuation triggers at decision points
❌ DON'T: Use collaborative language (see "Permission-Seeking Anti-Pattern")
❌ DON'T: Allow ambiguous stop conditions ("when done", "if complete")
❌ DON'T: Use language that implies optional continuation
```

**CRITICAL INSIGHT** (From claudette-debug v1.0.0 → 92/100):

**What WORKED** (+17 points, 75 → 92):
1. "Don't stop after finding one bug" (negative prohibition)
2. "Continue investigating until..." (explicit end condition)
3. "Bug 1/8 complete. Investigating Bug 2/8 now..." (template)
4. "Don't implement. Don't ask about implementation." (double negative)
5. "immediately start next bug" (no gap for pausing)

**What FAILED** (v1.4.0 → 66/100, -26 points):
1. "Continue investigating each bug" (positive framing)
2. "Work through all bugs" (vague stop condition)
3. Removed "Don't stop" language
4. Used "when complete" instead of "until N = M"

**LESSON**: Negative prohibitions ("Don't stop", "Don't ask") are MORE effective than positive instructions ("Continue", "Keep going") for preventing premature stopping.
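A minimal sketch of the same idea enforced in code, assuming a hypothetical harness where `investigateBug` stands in for one unit of agent work: the loop, not the model, owns the quantifiable stop condition.

```typescript
// Encodes "don't stop until N = M" as a control loop instead of trusting
// the model to decide when it is done.

async function investigateBug(n: number): Promise<string> {
  return `Root cause for bug ${n} documented.`; // placeholder for agent work
}

async function runCampaign(totalBugs: number): Promise<void> {
  let completed = 0; // N

  // Stop condition is structural: the loop cannot exit until N = M.
  while (completed < totalBugs) {
    const result = await investigateBug(completed + 1);
    completed += 1;

    // Progress marker in the exact "Bug N/M" format the prompt requires.
    console.log(`Bug ${completed}/${totalBugs} complete. ${result}`);
    if (completed < totalBugs) {
      console.log(`Investigating Bug ${completed + 1}/${totalBugs} now...`);
      // No "Shall I continue?" branch exists here by construction.
    }
  }
  console.log(`All ${totalBugs}/${totalBugs} bugs documented. Stop condition met.`);
}

runCampaign(8);
```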
---

### 7. STRUCTURED OUTPUTS (Reproducible Results)

**Research**: Crafting prompts that lead to predictable, structured outputs is essential for integrating LLM agents into larger systems.

**claudette-debug application** ✅:
- "Bug N/M" format (concrete tracking structure)
- Completion Criteria: 7-point checklist
- Root Cause Template: "One sentence" + "Specific line numbers"
- Multi-Bug Workflow Example (shows exact format)

**claudette-auto application** ✅:
- TODO format: "Phase X: Name" with sub-items
- Memory Structure Template (YAML front matter + sections)
- "Before performing any task, briefly list the sub-steps"
- Segue format: "Step X/N complete. Now doing Y..."

**PROVEN PATTERN**:
```markdown
✅ DO: Provide exact output templates (not descriptions)
✅ DO: Show examples with real data (not placeholders)
✅ DO: Require progress markers in specific format ("Bug 1/8")
✅ DO: Use checklists for structured completion
❌ DON'T: Say "report results" without showing format
❌ DON'T: Allow free-form progress updates
```
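A minimal sketch of consuming this structure downstream, with hypothetical names: a validator that accepts only the exact "Bug N/M" template, so a larger system can parse progress instead of guessing from free-form text.

```typescript
// Validates that agent output follows the "Bug N/M complete." marker format.

interface ProgressMarker {
  current: number; // N
  total: number;   // M
}

const MARKER = /^Bug (\d+)\/(\d+) complete\./;

function parseMarker(line: string): ProgressMarker | null {
  const match = MARKER.exec(line);
  if (!match) return null; // free-form updates are rejected, not tolerated
  return { current: Number(match[1]), total: Number(match[2]) };
}

// "Bug 1/?" (unknown total) fails by design: the regex requires a concrete
// denominator, which is exactly what blocks the "Am I done?" trigger.
console.log(parseMarker("Bug 1/8 complete. Investigating Bug 2/8 now..."));
// → { current: 1, total: 8 }
console.log(parseMarker("Bug 1/? complete."));   // → null
console.log(parseMarker("Made some progress!")); // → null
```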
---

### 8. ANTI-SYCOPHANCY (Professional Communication) 🆕

**Research**: Addresses sycophancy bias in RLHF models ([Perez et al., 2022](https://arxiv.org/abs/2212.09251)): models validate user statements regardless of accuracy, reducing trust.

**Problem**: Agents waste tokens on flattery, creating false confidence and unprofessional communication.

**Solution**:
```markdown
❌ NEVER use:
- "You're absolutely right!"
- "Excellent point!"
- "Perfect!"
- "Great idea!"

✅ Use instead:
- "Got it." (brief acknowledgment)
- "Confirmed: [factual statement]" (only when verifiable)
- Proceed silently with action
```

**PROVEN PATTERN**:
```markdown
✅ DO: Brief, factual acknowledgments only
✅ DO: Validate only verifiable facts ("Tests pass", "File exists")
✅ DO: Proceed with action instead of praising
❌ DON'T: Validate non-factual user statements
❌ DON'T: Use conversational filler or flattery
❌ DON'T: Say "right" when the user made no claim to evaluate
```

**Why This Matters**:
- Reduces token waste (10-15% in some agents)
- Maintains professional, technical communication
- Improves user trust in agent accuracy
- Prevents misrepresentation of statements as evaluable claims

**Application in Agents**:
- All agents: Replace praise with action
- Worker agents: "Got it." then execute
- QC agents: Only factual pass/fail, never "Great work!"
- Research agents: Cite findings, don't validate questions

---

### 9. SELF-AUDIT MANDATORY (Evidence-Based Completion) 🆕

**Research**: Self-consistency and Constitutional AI self-critique ([Wang et al., 2022](https://arxiv.org/abs/2203.11171), [Bai et al., 2022](https://arxiv.org/abs/2212.08073)).

**Problem**: Agents report completion without verifying that the work meets requirements, leading to hallucinated success.

**Solution**:
```markdown
**RULE #X: SELF-AUDIT MANDATORY**
Don't stop until you PROVE your work is correct:
- [ ] All requirements verified (with evidence)
- [ ] Tests executed and passed (output shown)
- [ ] No regressions introduced (checked)
- [ ] Output matches specification (compared)

Show evidence for each verification step.
```

**PROVEN PATTERN**:
```markdown
✅ DO: Run tests and show output before claiming done
✅ DO: Provide verification commands with results
✅ DO: Show "git diff" or equivalent proof
✅ DO: Compare actual vs expected (with data)
❌ DON'T: Say "tests pass" without showing output
❌ DON'T: Claim completion without proof
❌ DON'T: Skip verification steps
```

**Why This Matters**:
- Catches errors before QC/user review (30-40% error reduction)
- Provides an audit trail for decisions
- Reduces retry loops (get it right the first time)
- Builds confidence in agent reliability

**Application in Agents**:
- Worker agents: Run tests, show output, verify file changes
- Research agents: Verify sources exist, cite with URLs
- PM agents: Verify task graph consistency with commands
- Debug agents: Verify reproduction tests actually fail/pass

**Example Evidence**:
```markdown
✅ GOOD:
- Tests: `npm test` output shows "All 23 tests passed"
- Files: `git status` shows 3 files modified
- Verification: `curl localhost:3000/health` returns {"status":"ok"}

❌ BAD:
- "I ran the tests and they passed"
- "The code looks correct"
- "Everything should work now"
```
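A minimal sketch of a self-audit step, assuming a Node.js environment; the command list is illustrative and should be swapped for the project's real test and diff commands:

```typescript
// Collects verifiable evidence before an agent is allowed to claim
// completion: each check's command, captured output, and pass/fail status.

import { execSync } from "node:child_process";

interface Evidence {
  command: string;
  output: string;
  passed: boolean;
}

function runCheck(command: string): Evidence {
  try {
    const output = execSync(command, { encoding: "utf8" });
    return { command, output: output.trim(), passed: true };
  } catch (err) {
    const output = err instanceof Error ? err.message : String(err);
    return { command, output, passed: false }; // non-zero exit = failed check
  }
}

function selfAudit(): boolean {
  const checks = ["npm test", "git status --short"].map(runCheck);

  // Show evidence for each verification step, not a bare "tests pass" claim.
  for (const c of checks) {
    console.log(`$ ${c.command}\n${c.output}\n(${c.passed ? "PASS" : "FAIL"})\n`);
  }
  return checks.every((c) => c.passed);
}

// Completion is only claimable when every check produced passing evidence.
if (!selfAudit()) {
  console.log("Audit failed: do not report completion. Fix and re-run.");
}
```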
---

### 10. CLARIFICATION LADDER (Exhaust Research Before Asking) 🆕

**Research**: ReAct (Reasoning + Acting) framework ([Yao et al., 2022](https://arxiv.org/abs/2210.03629)): exhaust reasoning before escalating.

**Problem**: Agents ask for clarification prematurely when they could infer from context, breaking autonomous flow.

**Solution**:
```markdown
**CLARIFICATION LADDER** (Exhaust in order):
1. Check local files (README, package.json, docs/)
2. Pull from knowledge graph (memory_get_node, memory_get_subgraph)
3. Search web for official documentation
4. Infer from industry standards/conventions
5. Make educated assumption (document reasoning)
6. ONLY THEN: Ask user with specific question

❌ WRONG: "What framework should I use?"
✅ CORRECT: Check package.json → Find React 18 → Proceed
```

**PROVEN PATTERN**:
```markdown
✅ DO: Exhaust all 5 research rungs before asking
✅ DO: Document what you checked at each rung
✅ DO: Make educated assumptions with reasoning
✅ DO: Only ask when truly blocked (security, ambiguous requirements)
❌ DON'T: Ask without checking local files first
❌ DON'T: Skip research steps
❌ DON'T: Ask vague questions ("What should I do?")
```

**Why This Matters**:
- Reduces user interruptions (50-70% fewer clarifying questions)
- Demonstrates research capability
- Increases autonomous execution percentage (60% → 85%+)
- Documents the decision-making process

**Application in Agents**:
- All agents: Must check local files + search before asking
- Ecko: Check package.json + README + web search before inferring
- PM agents: Research architecture files before asking about patterns
- Worker agents: Check dependencies before asking about tooling

**Exception**: Security-critical decisions (deployment, auth, data access) may ask immediately.

**Example Flow**:
```markdown
✅ GOOD:
"Need to add authentication. Checked:
1. README.md: No auth mentioned
2. package.json: No auth library installed
3. Searched 'Express auth best practices 2025' → Found Passport.js standard
4. Assuming Passport.js with JWT strategy (industry standard)
Proceeding with implementation..."

❌ BAD:
"Need to add authentication. What library should I use?"
[Didn't check anything]
```
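A minimal sketch of the ladder as an algorithm, with hypothetical resolver names; each rung either answers the question or returns null, and the user is asked only when every research rung is exhausted:

```typescript
// Walks the clarification ladder in order, documenting each rung checked.

type Rung = (question: string) => Promise<string | null>;

// Stand-ins for real lookups: local files, knowledge graph, web search,
// conventions, and finally a documented assumption.
const ladder: Array<[string, Rung]> = [
  ["local files",     async (_q) => null],
  ["knowledge graph", async (_q) => null],
  ["web search",      async (_q) => null],
  ["conventions",     async (_q) => null],
  ["assumption",      async (q) => `Assuming industry default for: ${q}`],
];

async function resolve(question: string): Promise<string> {
  const checked: string[] = [];
  for (const [name, rung] of ladder) {
    const answer = await rung(question);
    checked.push(name); // document what was checked at each rung
    if (answer !== null) {
      return `${answer} (checked: ${checked.join(" → ")})`;
    }
  }
  // Rung 6: only reached when all research rungs returned nothing.
  return `BLOCKED, asking user: "${question}" (checked: ${checked.join(" → ")})`;
}

resolve("Which state management library does this repo use?").then(console.log);
```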
---

## 🚨 CRITICAL ANTI-PATTERNS (Research + Validated)

### Stopping Triggers (From the claudette-debug campaign)

**Patterns that cause premature stopping:**

1. **"Am I done?" triggers**:
   - ❌ Writing "Summary" after one task
   - ❌ Writing "Next steps if you want..."
   - ❌ Using "Bug 1/?" (unknown total)
   - ❌ Asking "Should I continue?"

2. **Collaborative language**: (See "ANTI-PATTERN: Permission-Seeking Mindset" for detailed examples)

3. **Vague completion conditions**:
   - ❌ "When analysis is complete"
   - ❌ "After investigating"
   - ❌ "Once I've examined"
   - ❌ "If done"

**Fixes that worked:**

1. **Explicit stop conditions**:
   - ✅ "Don't stop until N = M"
   - ✅ "Continue until ALL requirements met"
   - ✅ "Only terminate when the problem is completely solved"

2. **Negative prohibitions**:
   - ✅ "Do NOT write 'Summary'"
   - ✅ "Do NOT ask about implementation"
   - ✅ "Don't stop after one bug"
   - ✅ "NEVER use 'Bug 1/?'"

3. **Immediate continuation patterns**:
   - ✅ "Bug 1/8 complete. Investigating Bug 2/8 now..."
   - ✅ "After documenting, IMMEDIATELY start next bug"
   - ✅ "Now updating" instead of "Would you like me to update"

---

### ANTI-PATTERN: PERMISSION-SEEKING MINDSET (Research Agents)

**What it looks like:**
- Agent stops mid-task to ask "Shall I proceed?"
- Agent offers to fetch data but waits for user approval
- Agent repeats offers without executing
- Agent treats authoritative sources as requiring permission

**Evidence**: BeastMode research agent (76/100) vs Claudette (90/100)
**Cost**: -5 to -7 points (incomplete execution + repetition)

**Failing Pattern:**
```markdown
Question 2/5: Library Landscape
- Fetched official docs for Redux, Zustand, Jotai
- "I will now fetch the live npm pages for those five packages and return an exact snapshot of weekly download counts, latest version, and publish date. Proceed?"
  [WAITS FOR USER RESPONSE]
- "Action required (choose one)"
  [WAITS FOR USER RESPONSE]
- "Shall I proceed to fetch npm pages?"
  [WAITS FOR USER RESPONSE - 3rd time asking]
```

**Working Pattern:** See the complete example under "PATTERN: Autonomous Data Collection" (lines 403-424).

**Rules to Prevent Permission-Seeking:**

1. **NEVER ask "Shall I proceed?" mid-task**
   ```markdown
   ❌ WRONG: "I will fetch [data]. Shall I proceed?"
   ❌ WRONG: "Would you like me to continue with [action]?"
   ❌ WRONG: "Action required (choose one)"
   ❌ WRONG: "Let me know if you want me to [fetch data]"

   ✅ CORRECT: "Fetching [data]... [executes immediately]"
   ✅ CORRECT: "Question 2/5: [researches + fetches + synthesizes]"
   ✅ CORRECT: "Now fetching npm registry data..."
   ```

2. **Treat authoritative sources as requiring NO approval**
   (See "PATTERN: Autonomous Data Collection" Rule #4 for the complete categorized list.)
   Rule: If a source is authoritative, fetch it during research (not as a follow-up). Don't offer to fetch. Don't ask permission. Just fetch.

3. **Replace questions with actions**
   ```markdown
   ❌ "Should I fetch [data]?" → ✅ "Fetching [data]..."
   ❌ "Would you like me to [action]?" → ✅ "Now doing [action]..."
   ❌ "Shall I proceed?" → ✅ "Proceeding with [next step]..."
   ❌ "Let me know if..." → ✅ [Just do it, then report the result]
   ```

4. **No repeated offers without execution**
   ```markdown
   ❌ WRONG: Offer to fetch data 3+ times without executing
   ✅ CORRECT: Mention the plan once, then execute immediately
   ```

**Why This Matters:**
- **User Experience**: Requires 3-4 user interactions instead of 1
- **Autonomy**: Breaks the continuous workflow
- **Production Readiness**: Not deployable (needs handholding)
- **Score Impact**: -5 pts (incomplete execution) + -2 pts (repetition)

**Detection Pattern:**
If your agent's output contains ANY of these phrases, you have permission-seeking (a detector sketch follows this section):
- "Shall I proceed?"
- "Would you like me to..."
- "Action required"
- "Let me know if you want..."
- "I can [do X] if you approve"

**Fix**: Remove the question, execute the action, report the result.
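A minimal detector sketch built from the phrase list above; a benchmark harness or CI check on prompt changes could fail a run whenever a match appears mid-task:

```typescript
// Scans agent output for the permission-seeking phrases listed above.

const PERMISSION_SEEKING: RegExp[] = [
  /shall i proceed/i,
  /would you like me to/i,
  /action required/i,
  /let me know if you want/i,
  /i can .* if you approve/i,
];

function findPermissionSeeking(output: string): string[] {
  return PERMISSION_SEEKING
    .filter((pattern) => pattern.test(output))
    .map((pattern) => pattern.source);
}

const sample = "I will now fetch the npm pages. Shall I proceed?";
const hits = findPermissionSeeking(sample);
if (hits.length > 0) {
  // Fail fast: the fix is to remove the question and execute the action.
  console.log(`Permission-seeking detected: ${hits.join(", ")}`);
}
```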
---

### PATTERN: AUTONOMOUS DATA COLLECTION (Research Agents)

**Core Principle**: Fetch all required data DURING the research phase, not as a follow-up offer.

**Complete Research Workflow:**
```markdown
Question N/M: [Research Question]

1. Identify required data types:
   - Official docs (for features, architecture)
   - Package registries (for versions, downloads, dates)
   - Industry surveys (for satisfaction, adoption)
   - Benchmarks (for performance data)

2. Fetch ALL authoritative sources (no approval needed):
   - Fetch official docs
   - Fetch npm/PyPI/registry pages
   - Fetch survey data (if available)
   - Fetch benchmark reports (if available)

3. Synthesize findings with complete citations

4. Mark question complete and continue

Question N/M complete. Question (N+1)/M starting now...
```

**Example - Complete Data Collection:**
```markdown
Question 2/5: Top 5 state management libraries

Step 1: Fetching official docs (Redux, Zustand, Jotai, MobX, Recoil)...
Step 2: Fetching npm registry pages for download counts...
Step 3: Fetching State of JS Survey 2024 for satisfaction scores...

FINDING: Top 5 by npm downloads (as of 2025-10-15):
1. Redux: v5.0.1, 8.5M weekly downloads, 72% satisfaction (State of JS 2024)
2. Zustand: v4.5.0, 2.1M weekly downloads, 86% satisfaction (State of JS 2024)
3. Jotai: v2.6.0, 890K weekly downloads, 82% satisfaction (State of JS 2024)
4. MobX: v6.12.0, 760K weekly downloads, 68% satisfaction (State of JS 2024)
5. Recoil: v0.7.7, 520K weekly downloads, 71% satisfaction (State of JS 2024)

Sources:
1. Per npm registry (2025-10-15): [download counts and versions]
2. Per State of JS Survey 2024 (20,435 responses): [satisfaction scores]
3. Per official docs: [feature comparisons]

Question 2/5 complete. Question 3/5 starting now...
```

**Anti-Pattern - Incomplete Collection:**
```markdown
Question 2/5: Top 5 state management libraries

Step 1: Fetched official docs (Redux, Zustand, Jotai, MobX, Recoil)

FINDING: Common libraries are Redux, Zustand, Jotai, MobX, Recoil.

Note: Exact npm download counts and satisfaction scores are not in official docs.
I can fetch npm registry pages if you'd like.

Shall I proceed to fetch exact numbers?
[WAITS - INCOMPLETE]
```

**Rules for Data Collection:**

1. **Identify data requirements upfront**
   ```markdown
   Question requires:
   - [ ] Official docs (features, architecture)
   - [ ] Package metadata (versions, downloads, dates)
   - [ ] Survey data (satisfaction, adoption)
   - [ ] Benchmarks (performance numbers)

   Fetch all checked items before marking the question complete.
   ```

2. **No placeholders - fetch actual data**
   ```markdown
   ❌ WRONG: "Per Redux v[X.Y.Z] ([Date])"
   ✅ CORRECT: "Per Redux v5.0.1 (2024-01-15)"

   If you can't fetch actual data, state explicitly:
   "Unable to verify: [what's missing] - [why]"
   ```

3. **Fetch during research, not after**
   ```markdown
   ❌ WRONG: "Question 2: Listed libraries. I can fetch npm data if you'd like."
   ✅ CORRECT: "Question 2: Fetching library docs and npm registry data... [complete]"
   ```

4. **Use authoritative sources (no approval needed)**
   ```markdown
   Authoritative sources to fetch autonomously:

   For package data:
   - npm registry (npmjs.com/package/[name])
   - PyPI (pypi.org/project/[name])
   - GitHub releases (github.com/[org]/[repo]/releases)

   For satisfaction/adoption data:
   - State of JS Survey (10k+ responses, published methodology)
   - Stack Overflow Developer Survey (90k+ responses)

   For performance data:
   - Official benchmarks (maintained by projects)
   - js-framework-benchmark (standardized tests)

   Rule: If a source is authoritative, fetch it. No permission needed.
   ```

5. **Complete citations (version + date + data)**
   ```markdown
   ✅ Package registry citation:
   "Per npm registry (2025-10-15): Zustand v4.5.0, 2.1M weekly downloads"

   ✅ Survey citation:
   "Per State of JS Survey 2024 (20,435 responses): Redux satisfaction 72%"

   ✅ Benchmark citation:
   "Per js-framework-benchmark v2024.1 (2024-09): Solid 1.2x faster than React"
   ```

**Why This Matters:**
- **Completeness**: Research questions require numeric data, not just qualitative findings
- **Autonomy**: No user follow-up prompts needed
- **Production Readiness**: A single interaction completes the research
- **Score Impact**: +3 to +7 points (complete data + no incomplete-execution penalty)

**Expected Pattern in Output:**
```markdown
Question 1/N: [research + synthesize] → complete
Question 2/N: [research + fetch npm + fetch survey + synthesize] → complete
Question 3/N: [research + fetch benchmarks + synthesize] → complete
...
Question N/N: [research + synthesize] → complete

All N/N questions researched. Final summary...
```

**NOT:**
```markdown
Question 1/N: [research] → complete
Question 2/N: [partial research] "I can fetch npm data. Shall I proceed?"
[INCOMPLETE - WAITING]
```
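A minimal sketch of autonomous package-data collection, assuming Node 18+ global `fetch`; the endpoint shapes are from the public npm registry and downloads APIs and should be re-verified before relying on them:

```typescript
// Fetches package metadata during the research pass, with no approval step.

interface PackageSnapshot {
  name: string;
  version: string;
  weeklyDownloads: number;
}

async function snapshot(name: string): Promise<PackageSnapshot> {
  // Latest published version from the registry.
  const meta = await fetch(`https://registry.npmjs.org/${name}/latest`)
    .then((r) => r.json() as Promise<{ version: string }>);

  // Weekly download count from the downloads API.
  const dl = await fetch(`https://api.npmjs.org/downloads/point/last-week/${name}`)
    .then((r) => r.json() as Promise<{ downloads: number }>);

  return { name, version: meta.version, weeklyDownloads: dl.downloads };
}

// Fetch ALL candidates in one research pass: no offer-then-wait in between.
async function collect(names: string[]): Promise<void> {
  const date = new Date().toISOString().slice(0, 10);
  for (const s of await Promise.all(names.map(snapshot))) {
    // Complete citation: source + date + version + data, no placeholders.
    console.log(`Per npm registry (${date}): ${s.name} v${s.version}, ` +
                `${s.weeklyDownloads.toLocaleString()} weekly downloads`);
  }
}

collect(["redux", "zustand", "jotai"]);
```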
βœ… "Continue working until completely solved" (clear identity) **Structure**: - Lines: 475 - Tokens: ~3,500 - MANDATORY sections: CORE IDENTITY, PRODUCTIVE BEHAVIORS at top - Pattern: Identity β†’ Behaviors β†’ Execution β†’ Examples --- ### claudette-debug (92/100 validated) **Strengths**: 1. βœ… 10 MANDATORY RULES at top (first 500 tokens) 2. βœ… Multi-bug autonomy ("Don't stop until N = M") 3. βœ… Negative prohibitions ("Don't implement", "Don't ask") 4. βœ… Concrete tracking format ("Bug 1/8 complete") 5. βœ… Explicit role boundary ("Detective, not surgeon") **Structure**: - Lines: 718 - Tokens: ~5,300 - MANDATORY sections: MANDATORY RULES (10), CORE IDENTITY at top - Pattern: Rules β†’ Identity β†’ Workflow β†’ Examples β†’ Completion --- ## πŸ† UNIFIED PROMPTING FRAMEWORK (Gold Standard Pattern) ### Top 500 Tokens (CRITICAL - Place Enforcement Here) ```markdown 1. **CORE IDENTITY** (3-5 lines) - Role with memorable metaphor - Emotional/tonal guidance - Primary objective ("Continue until X") 2. **MANDATORY RULES** (5-10 rules) - FIRST ACTION (what to do immediately) - Critical prohibitions (negative language) - Output format requirements - Stop conditions (explicit, quantifiable) 3. **PRODUCTIVE BEHAVIORS** or **OPERATING PRINCIPLES** - Replace patterns (❌ vs βœ…) - Immediate action templates - Progress update format ``` ### Workflow Structure (Middle Section) ```markdown 4. **PHASE-BY-PHASE EXECUTION** - Phase 0: Context verification (mandatory first steps) - Phase 1-N: Work phases with checklists - Each phase: [ ] checkboxes + concrete actions 5. **CONCRETE EXAMPLES** - Multi-task workflow example (show progression) - Anti-patterns with ❌ DON'T - Correct patterns with βœ… DO ``` ### Completion Enforcement (Last 200 Tokens) ```markdown 6. **COMPLETION CRITERIA** - Checklist of required evidence - Verification commands (git diff, test suite) - Stop condition restated 7. **FINAL REMINDERS** - Role restatement - "AFTER EACH X: immediately start next X" - Negative prohibitions ("Don't stop", "Don't ask") ``` --- ## 🎯 ACTIONABLE RECOMMENDATIONS ### For New Agent Development: 1. **Start with identity** (first 50 tokens) - State role, tone, and primary objective - Use memorable metaphor if possible 2. **Front-load enforcement** (tokens 50-500) - MANDATORY RULES section (5-10 rules) - FIRST ACTION rule (what to do immediately) - Stop conditions (explicit, quantifiable) 3. **Use negative prohibitions** (throughout) - "Don't stop after X" - "Do NOT write Y" - "NEVER use Z pattern" 4. **Provide concrete examples** (middle section) - Show full workflow with real data - Include anti-patterns with ❌ - Show correct patterns with βœ… 5. **Add multiple reinforcement points** (throughout) - Restate stop conditions at decision points - Add continuation triggers ("IMMEDIATELY start next") - End with role restatement + prohibitions 6. **Require structured outputs** (throughout) - Explicit templates (not descriptions) - Progress markers ("Step N/M") - Checklists for verification --- ## πŸ“ˆ EXPECTED PERFORMANCE BY PATTERN ADHERENCE | Pattern Adherence | Expected Score | Tier | Notes | |-------------------|---------------|------|-------| | All 10 principles | 95-100 | S++ | Gold standard + discipline | | 7-9 principles | 90-94 | S+ | Gold standard | | 5-6 principles | 80-89 | S | Production ready | | 3-4 principles | 70-79 | A | Good, needs refinement | | 1-2 principles | 60-69 | B | Significant gaps | | 0 principles | <60 | C-F | Likely to fail | **Critical principles** (must have for 80+): 1. 
---

## 🎯 ACTIONABLE RECOMMENDATIONS

### For New Agent Development:

1. **Start with identity** (first 50 tokens)
   - State role, tone, and primary objective
   - Use a memorable metaphor if possible

2. **Front-load enforcement** (tokens 50-500)
   - MANDATORY RULES section (5-10 rules)
   - FIRST ACTION rule (what to do immediately)
   - Stop conditions (explicit, quantifiable)

3. **Use negative prohibitions** (throughout)
   - "Don't stop after X"
   - "Do NOT write Y"
   - "NEVER use Z pattern"

4. **Provide concrete examples** (middle section)
   - Show a full workflow with real data
   - Include anti-patterns with ❌
   - Show correct patterns with ✅

5. **Add multiple reinforcement points** (throughout)
   - Restate stop conditions at decision points
   - Add continuation triggers ("IMMEDIATELY start next")
   - End with role restatement + prohibitions

6. **Require structured outputs** (throughout)
   - Explicit templates (not descriptions)
   - Progress markers ("Step N/M")
   - Checklists for verification

---

## 📈 EXPECTED PERFORMANCE BY PATTERN ADHERENCE

| Pattern Adherence | Expected Score | Tier | Notes |
|-------------------|----------------|------|-------|
| All 10 principles | 95-100 | S++ | Gold standard + discipline |
| 7-9 principles | 90-94 | S+ | Gold standard |
| 5-6 principles | 80-89 | S | Production ready |
| 3-4 principles | 70-79 | A | Good, needs refinement |
| 1-2 principles | 60-69 | B | Significant gaps |
| 0 principles | <60 | C-F | Likely to fail |

**Critical principles** (must have for 80+):
1. ✅ Clear role definition (identity first)
2. ✅ Negative prohibitions (don't stop, don't ask)
3. ✅ Explicit stop conditions (until N = M)
4. ✅ Structured outputs (templates + examples)
5. ✅ Multiple reinforcement points (top + middle + end)

**Advanced principles** (add +5-10 points for 90+):
6. ✅ Anti-sycophancy (professional communication)
7. ✅ Self-audit mandatory (evidence-based completion)
8. ✅ Clarification ladder (exhaust research first)

---

## 🔬 VALIDATION METHODOLOGY

### How to Test New Agent Prompts:

1. **Baseline run** (first test)
   - Run the agent on a benchmark task
   - Score: bug discovery, methodology, autonomy
   - Expected: 70-75 (decent but stops prematurely)

2. **Add negative prohibitions** (iteration 1)
   - Add "Don't stop after X"
   - Add "Do NOT ask about Y"
   - Expected: +10-15 points (85-90)

3. **Add explicit stop conditions** (iteration 2)
   - Add "until N = M" or "ALL requirements"
   - Add progress tracking ("Step 1/N complete")
   - Expected: +5-10 points (90-95)

4. **Add continuation triggers** (iteration 3)
   - Add "IMMEDIATELY start next"
   - Add "After X: Document, then start Y"
   - Expected: +0-5 points (95-100, peak performance)

**Regression test**: If the score drops >5 points, revert the last change.
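A minimal sketch of this iterate-and-revert loop, where `scorePrompt` is a hypothetical stand-in for an actual benchmark run against the representative tasks:

```typescript
// Applies prompt changes one at a time, reverting any change that
// regresses the benchmark score by more than 5 points.

async function scorePrompt(_prompt: string): Promise<number> {
  return 75; // placeholder: run benchmark tasks, score autonomy/accuracy
}

async function iterate(base: string, changes: Array<[string, string]>) {
  let current = base;
  let best = await scorePrompt(current);
  console.log(`Baseline: ${best}/100`);

  for (const [label, addition] of changes) {
    const candidate = current + "\n" + addition;
    const score = await scorePrompt(candidate);
    if (score < best - 5) {
      console.log(`${label}: ${score}/100, regression >5 points. Reverting.`);
      continue; // keep the previous prompt
    }
    console.log(`${label}: ${score}/100. Keeping change.`);
    current = candidate;
    best = Math.max(best, score);
  }
  return current;
}

iterate("You are a Debug Specialist...", [
  ["Iteration 1 (negative prohibitions)", "Don't stop after one bug."],
  ["Iteration 2 (explicit stop condition)", "Don't stop until N = M."],
  ["Iteration 3 (continuation triggers)", "IMMEDIATELY start the next bug."],
]);
```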
---

## 💡 KEY INSIGHTS (Research + Validated)

### 1. Negative Language > Positive Language (for continuity)

**Research says**: Positive framing is better for user experience.
**Reality check**: For LLM agent autonomy, negative prohibitions work better.

**Why?**
- "Continue" is ambiguous (continue how long? until what?)
- "Don't stop" is concrete (stop = terminate turn)
- Positive language creates decision points ("Should I continue?")
- Negative language blocks premature stopping triggers

**Evidence**: claudette-debug v1.0.0 (92/100) vs v1.4.0 (66/100)
- v1.0.0: Used "Don't stop", "Don't ask" → autonomous multi-bug investigation
- v1.4.0: Removed negatives, used "Continue" → stopped after 1 bug

### 2. Stop Conditions Must Be Quantifiable

**Vague (fails)**:
- "When analysis is complete"
- "After investigating issues"
- "Once you've examined the code"

**Quantifiable (works)**:
- "Don't stop until N = M" (bug counting)
- "Continue until ALL requirements met" (checklist)
- "Only terminate when the problem is completely solved" (objective)

**Why?**
- LLMs need concrete stop signals
- Subjective conditions trigger the "Am I done?" loop
- Quantifiable conditions are verifiable

### 3. Multiple Reinforcement Points Required

**Single mention (fails)**:
- Put "Continue until done" once at the top
- Agent forgets after 20-30 tool calls

**Multiple reinforcement (works)**:
- MANDATORY RULE #5: "Don't stop after one bug"
- MANDATORY RULE #10: "Don't stop until N = M"
- Work Style: "Without stopping to ask"
- Completion Criteria: "IMMEDIATELY start next"
- End of document: "Continue until all documented"

**Why?**
- LLMs have recency bias (recent context matters more)
- Decision points need local reinforcement
- Multiple angles prevent misinterpretation

### 4. Examples Must Show Continuity

**Static example (limited value)**:
```
Bug 1: [investigation]
Root cause: X
```

**Continuity example (high value)**:
```
Bug 1/8: [investigation] → "Bug 1/8 complete. Bug 2/8 now..."
Bug 2/8: [investigation] → "Bug 2/8 complete. Bug 3/8 now..."

❌ DON'T: "Bug 1/?: ...unless you want me to stop"
```

**Why?**
- Shows the transition moment (critical decision point)
- Demonstrates progress tracking
- Explicitly forbids the stopping pattern

### 5. Role Identity > Detailed Instructions

**Instruction-heavy (moderate success)**:
- "After finding a bug, document it, then..."
- "When investigation is complete, move to..."
- "For each bug, create reproduction test, add markers..."

**Identity-heavy (high success)**:
- "Debug Specialist that investigates... don't fix"
- "Detective, not surgeon"
- "Your role: Investigate and document"

**Why?**
- Identity creates persistent behavior (not step-dependent)
- Metaphors are memorable across long context
- Role boundaries are clearer than process rules

---

## 🚀 NEXT STEPS FOR FRAMEWORK APPLICATION

1. **Audit existing agents**:
   - Check placement of identity (first 50 tokens?)
   - Count reinforcement points for critical behaviors
   - Identify vague vs quantifiable stop conditions

2. **Create agent templates**:
   - Gold standard structure (identity → rules → workflow → examples)
   - Boilerplate sections (MANDATORY RULES, PRODUCTIVE BEHAVIORS)
   - Reusable patterns (❌ vs ✅, checklists, progress markers)

3. **Establish a benchmarking process**:
   - Define 3-5 representative tasks per agent type
   - Score: autonomy, accuracy, completion, methodology
   - Track: baseline → iteration 1 → iteration 2 → peak

4. **Document lessons learned**:
   - What worked (+X points)
   - What failed (-X points)
   - Which patterns are reusable across agents

5. **Build a prompt library**:
   - Proven MANDATORY RULES (copy-paste ready)
   - Proven continuation patterns
   - Proven role definitions
   - Proven example templates

---

## 📚 REFERENCES

### Research Sources (2025):
- **Chain-of-Thought Prompting**: SolutionBrick, "Prompt Engineering and Model Context Protocol"
- **Agentic Prompting Techniques**: Press.ai, "Agentic Prompting for LLMs: The Hype it Deserves"
- **Role-Based Prompting**: Clarifai, "Agentic Prompt Engineering"
- **Reflection Mechanisms**: Medium, "Stop Prompting, Start Designing: 5 Agentic AI Patterns That Actually Work"
- **Contextual Adaptability**: Medium, "Architecting Prompts for Agentic Systems"

### Validated Implementations:
- **claudette-debug v1.0.0**: 718 lines, ~5,300 tokens, 92/100 score (Tier S)
- **claudette-auto v5.2.1**: 475 lines, ~3,500 tokens, 92/100 equivalent (Tier S)

### Campaign Results:
- **Baseline** (v1.0.0 initial): 75/100
- **Peak** (v1.0.0 final): 92/100 (+17 points, +23%)
- **Failure** (v1.4.0): 66/100 (-26 points from peak)
- **Key breakthrough**: Negative prohibitions > positive framing

---

**Last Updated**: 2025-10-16
**Version**: 1.2.0
**Status**: Production Framework
**Changelog**:
- v1.2.0: Added 3 Advanced Discipline Principles (Anti-Sycophancy, Self-Audit Mandatory, Clarification Ladder) with research citations; expanded from 7 to 10 core principles
- v1.1.0: Added "Permission-Seeking Anti-Pattern" and "Autonomous Data Collection" patterns; consolidated duplications (~5% reduction)

**Maintainer**: CVS Health Enterprise AI Team
