# GPT-4.1 Prompting Research for Multi-Agent Task Automation

**Research Date:** 2025-01-21
**Focus:** Identifying gaps and improvements for Mimir's worker agent prompting strategies

---

## 📚 Key Sources

1. **Steve Kinney** - "Getting the Most Out of GPT-4.1" (stevekinney.com)
   - Comprehensive practical guide from an experienced practitioner
   - Focus on production automation workflows
2. **OpenAI Cookbook** - "GPT-4.1 Prompting Guide" (cookbook.openai.com)
   - Official OpenAI guidance on agentic workflows
   - Emphasis on autonomous agent patterns
3. **Communeify** - "GPT-4.1 Prompt Engineering Practical Guide" (communeify.com)
   - Industry best practices and real-world examples

---

## 🎯 Core Principles for Task Automation

### 1. **Consistent Prompt Structure (CRITICAL)**

**Recommended Template:**

```markdown
# Role and Objective
<Define agent role and task goal>

# Instructions
<Detailed step-by-step guidelines>

# Reasoning Steps
<Encourage explicit thinking process>

# Output Format
<Specify exact format expected>

# Examples
<Include sample inputs/outputs>

# Context
<Provide relevant background>

# Final Instructions
<Reiterate critical directives>
```

**Why This Matters:**
- Reduces ambiguity by 60-80%
- Provides predictable structure for model parsing
- Separates concerns (what vs. how vs. format)

**Mimir Gap Analysis:**
- ✅ We have: Role definition, instructions, context
- ⚠️ Missing: Explicit "Reasoning Steps" section
- ⚠️ Missing: "Final Instructions" bookending
- ⚠️ Inconsistent: Output format specifications

---

### 2. **Explicit and Literal Instructions**

**Best Practice:**

```markdown
❌ BAD: "Explain error handling"
✅ GOOD: "List three specific strategies to handle null pointer exceptions in TypeScript, with code samples"

❌ BAD: "Check if the code works"
✅ GOOD: "Execute the following validation commands and report actual output:
1. run_terminal_cmd('npm test')
2. read_file('test-results.json')
3. Verify exit code is 0"
```

**Key Insight:** GPT-4.1 follows directions with **high fidelity** - vague instructions produce unpredictable results.

**Mimir Gap Analysis:**
- ✅ Task 0 uses explicit commands (good!)
- ⚠️ Worker prompts sometimes say "implement" without specifying tool usage
- ❌ Missing: Explicit tool call sequences for common patterns

---

### 3. **Agentic Prompting Pattern (ESSENTIAL FOR AUTOMATION)**

**Three Core Reminders:**

```markdown
## PERSISTENCE REMINDER
You are an agent - keep working until the task is completely resolved.
Only terminate your turn when you are CERTAIN the problem is solved.
Do not stop to ask for permission - proceed autonomously.

## TOOL-USE REMINDER
When uncertain, USE YOUR TOOLS to gather information:
- read_file() to check existing code
- run_terminal_cmd() to execute validation
- grep() to search for patterns
DO NOT guess or assume - verify with tools.

## PLANNING REMINDER
Before each major step:
1. State what you're about to do
2. Execute the action using tools
3. Verify the outcome
4. Proceed to next step or adjust plan
```

**Mimir Gap Analysis:**
- ✅ We have persistence reminders in Claudette preamble
- ⚠️ Tool-use reminders are implicit, not explicit
- ❌ Missing: Explicit "plan → execute → verify" loop structure
- ❌ Missing: "Do not stop to ask permission" directive
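The plan → execute → verify loop that these reminders describe can also be enforced by the orchestrator itself, not just in prose. Below is a minimal TypeScript sketch of such a loop; the `ToolCall` and `PlannedStep` shapes, `runAgentLoop` name, and retry policy are hypothetical illustrations, not Mimir's actual API.

```typescript
// Minimal sketch of a plan → execute → verify loop (hypothetical types; not Mimir's real interfaces).
interface ToolCall {
  tool: "read_file" | "run_terminal_cmd" | "grep" | "search_replace";
  args: string[];
}

interface PlannedStep {
  description: string; // what the agent is about to do
  action: ToolCall;    // the tool call that performs it
  verify: ToolCall;    // a tool call whose success confirms the step worked
}

type ToolRunner = (call: ToolCall) => Promise<{ ok: boolean; output: string }>;

async function runAgentLoop(plan: PlannedStep[], runTool: ToolRunner, maxRetries = 2): Promise<void> {
  for (const step of plan) {
    console.log(`PLAN: ${step.description}`);
    let verified = false;
    for (let attempt = 0; attempt <= maxRetries && !verified; attempt++) {
      const result = await runTool(step.action); // execute with a real tool call, never a description
      const check = await runTool(step.verify);  // verify the outcome before proceeding
      verified = result.ok && check.ok;
      if (!verified) console.warn(`Step failed verification (attempt ${attempt + 1}); adjusting.`);
    }
    if (!verified) throw new Error(`Could not verify step: ${step.description}`);
  }
}
```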
---

### 4. **Step-by-Step Reasoning (ACCURACY BOOSTER)**

**Technique:**

```markdown
## MANDATORY: SHOW YOUR WORK
Before executing any action, you MUST:
1. State your understanding of the requirement
2. List the steps you will take
3. Explain why each step is necessary
4. Execute each step with tool calls
5. Verify each step completed successfully

Example:
"I understand I need to validate the database connection.
Steps:
1. Check if .env file exists (read_file)
2. Verify DATABASE_URL is set (grep)
3. Test connection (run_terminal_cmd)
4. Confirm successful connection (parse output)
Executing step 1..."
```

**Research Finding:** This reduces hidden assumptions and increases accuracy by 40-60% on complex tasks.

**Mimir Gap Analysis:**
- ⚠️ Workers sometimes skip straight to execution
- ❌ Missing: Mandatory "state your plan" requirement
- ❌ Missing: Verification step after each action

---

### 5. **Formatting and Delimiters (PARSING CLARITY)**

**Best Practices:**

````markdown
Use clear delimiters to separate sections:

# Markdown Headers
## For major sections
### For subsections

<XML Tags>
<instructions>
Detailed task instructions here
</instructions>
<context>
Background information here
</context>
</XML Tags>

```Code Fences```
Use triple backticks for code
Specify language for syntax highlighting
````

**Why:** Helps model distinguish between:
- Instructions (what to do)
- Context (background info)
- Examples (reference material)
- Output (what to generate)

**Mimir Gap Analysis:**
- ✅ We use markdown headers
- ⚠️ Inconsistent use of XML tags
- ❌ Missing: Clear separation of "instructions" vs "context" vs "examples"

---

### 6. **Bookending Critical Instructions**

**Pattern:**

```markdown
# START: CRITICAL INSTRUCTION
🚨 YOU MUST USE TOOLS - DO NOT DESCRIBE ACTIONS
🚨 EXECUTE ACTUAL COMMANDS - NOT PSEUDO-CODE

<task instructions here>

# END: CRITICAL INSTRUCTION
🚨 REMINDER: DID YOU USE ACTUAL TOOLS?
🚨 VERIFY: Did you execute real commands, not descriptions?
```

**Research Finding:** Placing critical directives at BOTH start and end prevents them from being overlooked in long prompts.

**Mimir Gap Analysis:**
- ⚠️ We have critical instructions at start
- ❌ Missing: Bookending at end of prompts
- ❌ Missing: Verification checklist at end

---

### 7. **Knowledge Access Control**

**Two Modes:**

```markdown
## MODE 1: Context-Only (Compliance/Accuracy)
"ONLY use the provided context. Do not rely on your general knowledge.
If information is not in the provided documents, state 'Information not found in context.'"

## MODE 2: Hybrid (Broader Insights)
"Combine the provided information with your general knowledge.
Clearly distinguish between:
- Facts from provided context (cite source)
- General knowledge (label as 'general knowledge')
- Inferences (label as 'inferred')"
```

**Mimir Gap Analysis:**
- ⚠️ Workers have context but no explicit mode instruction
- ❌ Missing: Clear directive on when to use general knowledge vs. context only
- ❌ Missing: Citation requirements

---

### 8. **Retrieval Optimization for Grounding**

**Pattern:**

```markdown
## BEFORE ANSWERING:
1. Identify which provided documents are relevant
2. List the specific sections/paragraphs to reference
3. Cite sources in your answer: [Document: filename.md, Section: 2.3]

## CITATION FORMAT:
Every fact MUST include:
- Document title
- Section or paragraph number
- Direct quote if applicable
```

**Research Finding:** Keeps answers firmly grounded, reduces hallucination by 70-80%.

**Mimir Gap Analysis:**
- ⚠️ Workers receive context but no citation requirement
- ❌ Missing: Pre-answer document identification step
- ❌ Missing: Structured citation format

---

### 9. **Context Window Management**

**Best Practices:**

```markdown
## For Large Tasks:
1. Break into logical chunks (max 50k tokens per chunk)
2. Process iteratively:
   - Chunk 1 → Summary 1
   - Chunk 2 → Summary 2
   - Combine summaries → Final output
3. Use intermediate summaries to maintain coherence

## For Long-Running Tasks:
1. Periodic state checkpoints
2. Progress summaries every N steps
3. Context refresh if approaching token limits
```

**Mimir Gap Analysis:**
- ✅ We use context pre-fetching (good!)
- ⚠️ No chunking strategy for very large tasks
- ❌ Missing: Periodic state checkpoints for long tasks
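As a concrete illustration of the chunk → summarize → combine pattern above, here is a small TypeScript sketch. The `summarize` callback stands in for whatever model call Mimir would make, and the 4-characters-per-token approximation and helper names are assumptions for the example, not part of any real API.

```typescript
// Sketch of iterative chunked processing (assumes ~4 chars per token; summarize() is a stand-in for a model call).
type Summarize = (text: string, instruction: string) => Promise<string>;

function splitIntoChunks(text: string, maxTokens = 50_000): string[] {
  const maxChars = maxTokens * 4; // rough heuristic, not a real tokenizer
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += maxChars) {
    chunks.push(text.slice(i, i + maxChars));
  }
  return chunks;
}

async function processLargeTask(text: string, summarize: Summarize): Promise<string> {
  const chunks = splitIntoChunks(text);
  const summaries: string[] = [];
  for (const [index, chunk] of chunks.entries()) {
    // Each chunk produces an intermediate summary so later steps stay coherent.
    summaries.push(await summarize(chunk, `Summarize chunk ${index + 1} of ${chunks.length}.`));
  }
  // Combine the intermediate summaries into the final output.
  return summarize(summaries.join("\n\n"), "Combine these chunk summaries into one final answer.");
}
```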
---

### 10. **Code Edit Patterns**

**Diff-Style Edits:**

````markdown
When requesting code changes, use this format:

File: src/example.ts
```diff
 function oldFunction() {
-  const result = array.map(x => x * 2);
+  const result = array.reduce((acc, x) => acc + x * 2, 0);
   return result;
 }
```

Include:
- 2 lines of unchanged context above
- 2 lines of unchanged context below
- Clear + and - markers
````

**Mimir Gap Analysis:**
- ⚠️ Workers use various edit formats
- ❌ Missing: Standardized diff format requirement
- ❌ Missing: Context line requirements
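One way to make the context-line requirement operational is to have the edit tool itself refuse ambiguous edits. The TypeScript sketch below (Node's built-in `fs` only) shows a hypothetical `searchReplaceEdit` helper that applies a change only when the old snippet, including its surrounding context lines, matches exactly once; it illustrates the principle and is not Mimir's actual `search_replace` tool.

```typescript
import { readFileSync, writeFileSync } from "node:fs";

// Hypothetical edit helper: apply a change only if the snippet (with its
// surrounding context lines) occurs exactly once in the target file.
function searchReplaceEdit(filePath: string, oldSnippet: string, newSnippet: string): void {
  const original = readFileSync(filePath, "utf8");

  const occurrences = original.split(oldSnippet).length - 1;
  if (occurrences === 0) {
    throw new Error(`Edit rejected: snippet not found in ${filePath}. Include more context lines.`);
  }
  if (occurrences > 1) {
    throw new Error(`Edit rejected: snippet is ambiguous (${occurrences} matches). Include more context lines.`);
  }

  writeFileSync(filePath, original.replace(oldSnippet, newSnippet));

  // Read the file back to verify the change landed, mirroring the "verify each step" rule.
  if (!readFileSync(filePath, "utf8").includes(newSnippet)) {
    throw new Error(`Edit verification failed for ${filePath}.`);
  }
}
```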
---

### 11. **Correction Patterns**

**Simple and Direct:**

```markdown
❌ BAD: "The previous approach was wrong. Let me explain the entire architecture again and redesign everything..."

✅ GOOD: "Replace the `map` call with `reduce`, keeping variable names unchanged."

Key: One clear, specific correction > lengthy redesign
```

**Mimir Gap Analysis:**
- ✅ QC provides specific feedback
- ⚠️ Feedback sometimes too verbose
- ❌ Missing: "One correction at a time" directive

---

## 🔍 Mimir-Specific Gaps Identified

### Gap 1: Missing Explicit Reasoning Section

**Current:** Workers jump straight to execution

**Should Be:**

```markdown
## STEP 1: ANALYZE & PLAN (MANDATORY)
Before ANY tool calls, you MUST:
1. Restate the requirement in your own words
2. List the files you need to check
3. Outline your implementation approach
4. Identify potential edge cases

<reasoning>
[Your analysis here]
</reasoning>

## STEP 2: EXECUTE
[Tool calls here]
```

---

### Gap 2: Implicit Tool Usage

**Current:** "Implement the feature"

**Should Be:**

```markdown
## TOOL-BASED EXECUTION (MANDATORY)
You MUST use these tools:
1. read_file() - Check existing implementation
2. grep() - Search for related patterns
3. write() or search_replace() - Make changes
4. run_terminal_cmd() - Validate changes

DO NOT:
- Describe what should be done
- Write pseudo-code
- Assume file contents
```

---

### Gap 3: No Verification Loop

**Current:** Execute → Done

**Should Be:**

```markdown
## VERIFICATION LOOP (MANDATORY)
After EVERY significant change:
1. State what you just did
2. Verify it worked:
   - Read the file back
   - Run relevant tests
   - Check for errors
3. If verification fails:
   - Diagnose the issue
   - Fix it
   - Verify again
4. Only proceed when verified
```

---

### Gap 4: Missing Bookending

**Current:** Critical instructions only at start

**Should Be:**

```markdown
# START OF PROMPT
🚨 CRITICAL: USE ACTUAL TOOLS

<task instructions>

# END OF PROMPT
🚨 VERIFICATION CHECKLIST:
- [ ] Did you use actual tool calls (not descriptions)?
- [ ] Did you verify each step completed successfully?
- [ ] Did you test the final result?
- [ ] Are all requirements satisfied?
```

---

### Gap 5: No Citation Requirements

**Current:** Workers use context implicitly

**Should Be:**

```markdown
## CONTEXT USAGE RULES
When referencing provided context:
1. Cite the source: [File: path/to/file.ts, Lines: 10-15]
2. Quote directly when possible
3. If inferring, label as: [Inferred from: ...]
4. If using general knowledge, label as: [General Knowledge]
```

---

### Gap 6: Vague Success Criteria

**Current:** "Implement feature X"

**Should Be:**

```markdown
## SUCCESS CRITERIA (MANDATORY)
This task is complete ONLY when:
- [ ] All files modified/created as specified
- [ ] Tests pass (run_terminal_cmd('npm test'))
- [ ] No linter errors (run_terminal_cmd('npm run lint'))
- [ ] Changes match requirements exactly
- [ ] Verification commands executed successfully

DO NOT mark complete until ALL criteria verified.
```

---

## 📋 Recommended Prompt Template for Mimir Workers

Based on research findings, here's an improved template:

```markdown
# WORKER AGENT PROMPT TEMPLATE

## 🎯 ROLE & OBJECTIVE
You are a <specific role> responsible for <specific task goal>.

## 🚨 CRITICAL RULES (READ FIRST)
1. USE ACTUAL TOOLS - Do not describe actions
2. VERIFY EACH STEP - Check results before proceeding
3. WORK UNTIL COMPLETE - Only stop when ALL criteria met
4. SHOW YOUR REASONING - Explain before executing

## 📋 TASK REQUIREMENTS
<Specific, measurable requirements>

## 🔧 MANDATORY EXECUTION PATTERN

### STEP 1: ANALYZE & PLAN
<reasoning>
1. Restate requirement in your own words
2. List files to check/modify
3. Outline implementation approach
4. Identify edge cases
</reasoning>

### STEP 2: GATHER CONTEXT
Use tools to understand current state:
- read_file(<relevant files>)
- grep(<search patterns>)
- run_terminal_cmd(<info commands>)

### STEP 3: IMPLEMENT
For each change:
1. State what you're doing
2. Execute tool call
3. Verify result
4. Proceed or fix

### STEP 4: VALIDATE
Run verification commands:
- run_terminal_cmd('npm test')
- run_terminal_cmd('npm run lint')
- read_file(<modified files>)  # Confirm changes

### STEP 5: REPORT
Provide structured output:
- Files modified: [list]
- Commands executed: [list]
- Test results: [pass/fail]
- Issues encountered: [list]

## ✅ SUCCESS CRITERIA
- [ ] <Specific criterion 1>
- [ ] <Specific criterion 2>
- [ ] <Specific criterion 3>
- [ ] All tests pass
- [ ] No linter errors

## 📚 PROVIDED CONTEXT
<context>
<Relevant files, documentation, examples>
</context>

## 🚨 FINAL REMINDER
- Did you use ACTUAL tool calls?
- Did you VERIFY each step?
- Are ALL success criteria met?
```

---

## 🎯 Priority Improvements for Mimir

### Priority 1: Add Explicit Reasoning Section
**Impact:** High
**Effort:** Low
**Action:** Add `<reasoning>` section to all worker prompts

### Priority 2: Mandate Tool Usage
**Impact:** High
**Effort:** Low
**Action:** Add explicit tool call requirements with examples

### Priority 3: Implement Verification Loop
**Impact:** High
**Effort:** Medium
**Action:** Add "verify after each step" pattern to prompts

### Priority 4: Add Bookending
**Impact:** Medium
**Effort:** Low
**Action:** Add verification checklist at end of prompts

### Priority 5: Require Citations
**Impact:** Medium
**Effort:** Low
**Action:** Add citation format and requirements

### Priority 6: Standardize Success Criteria
**Impact:** High
**Effort:** Medium
**Action:** PM must generate specific, measurable criteria for each task

---

## 📊 Expected Impact

**Current Success Rate:** ~40-60% (based on execution reports)

**Predicted Success Rate with Improvements:**
- Priority 1-3 implemented: **70-80%**
- All priorities implemented: **85-95%**

**Key Metrics to Track:**
1. Tool call rate (should be >5 per task)
2. Verification command usage (should be 100%)
3. First-attempt success rate
4. QC pass rate on first attempt

---

## 🎯 PM-Specific: Task Planning & Decomposition

**Research Date:** 2025-01-21
**Focus:** Best practices for automated task planning, decomposition, and dependency management

### Core PM Responsibilities in Mimir

1. **Requirements Analysis** - Parse user intent into actionable specifications
2. **Task Decomposition** - Break complex goals into atomic, executable tasks
3. **Dependency Mapping** - Identify task relationships and execution order
4. **Role Assignment** - Define worker and QC roles for each task
5. **Success Criteria Definition** - Establish measurable, verifiable outcomes
6. **Estimation** - Predict tool calls, duration, and complexity

---

### 1. **Task Decomposition Strategies**

#### Hierarchical Task Network (HTN) Approach

**Pattern:**

```markdown
<reasoning>
## Goal Analysis
[What is the high-level objective?]

## Decomposition Strategy
1. Identify major phases (2-5 phases)
2. Break each phase into atomic tasks (3-8 tasks per phase)
3. Ensure each task is independently executable
4. Verify each task has clear success criteria

## Atomic Task Criteria
- Can be completed by single worker agent
- Has measurable success criteria
- Requires 5-50 tool calls (not 1, not 200)
- Duration: 5-30 minutes (not 1 min, not 2 hours)
- Clear inputs and outputs
</reasoning>
```

**Best Practices:**
- **Atomic Tasks:** Each task should be completable in one worker session
- **Dependency Clarity:** Explicitly state what must complete before this task
- **Avoid Over-Decomposition:** Don't create tasks that take 1-2 tool calls
- **Avoid Under-Decomposition:** Don't create tasks requiring >100 tool calls

**Example:**

```markdown
❌ BAD: "Implement caching system" (too broad, 200+ tool calls)

✅ GOOD:
- Task 1.1: "Create cache service class with get/set/clear methods"
- Task 1.2: "Add TTL logic and timestamp tracking"
- Task 1.3: "Integrate cache into API service"
- Task 1.4: "Add unit tests for cache service"
- Task 1.5: "Add integration tests for cached API calls"
```
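To make Tasks 1.1 and 1.2 above concrete, here is a minimal TypeScript sketch of the kind of cache service the decomposed tasks describe. The class shape and default TTL are illustrative assumptions; the actual implementation would be produced by the worker agent.

```typescript
// Illustrative sketch of Tasks 1.1/1.2: a cache service with get/set/clear and TTL tracking.
interface CacheEntry<T> {
  value: T;
  expiresAt: number; // epoch milliseconds after which the entry is stale
}

export class CacheService<T = unknown> {
  private store = new Map<string, CacheEntry<T>>();

  constructor(private defaultTtlMs = 60_000) {}

  set(key: string, value: T, ttlMs = this.defaultTtlMs): void {
    this.store.set(key, { value, expiresAt: Date.now() + ttlMs });
  }

  get(key: string): T | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {
      this.store.delete(key); // evict expired entries lazily on read
      return undefined;
    }
    return entry.value;
  }

  clear(): void {
    this.store.clear();
  }
}
```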
---

### 2. **Dependency Management**

#### Critical Path Method (CPM)

**Pattern:**

```markdown
**Dependencies:** task-0, task-1.1
**Blocks:** task-1.3, task-2.1
**Parallel With:** task-1.2 (can execute simultaneously)
```

**Best Practices:**
- **Task-0 Always First:** Environment validation must precede all work
- **Linear Dependencies:** If B needs A's output, A must complete first
- **Parallel Opportunities:** Identify independent tasks that can run simultaneously
- **Avoid Circular Dependencies:** Never create A→B→A cycles

**Dependency Types:**
1. **Sequential:** B requires A's output (A → B)
2. **Parallel:** A and B are independent (A || B)
3. **Convergent:** C requires both A and B (A → C ← B)
4. **Divergent:** A enables both B and C (A → B, A → C)

**Example:**

```markdown
Task 0: Environment validation
            ↓
Task 1.1: Create service class ──┐
Task 1.2: Create test utils ─────┤ (parallel)
            ↓                    ↓
Task 1.3: Integrate service (requires both 1.1 and 1.2)
            ↓
Task 1.4: Run integration tests
```
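The dependency rules above (task-0 first, no cycles, parallel where possible) map directly onto a topological ordering of the task graph. The TypeScript sketch below shows one way an orchestrator might derive execution "waves" of parallelizable tasks; the `Task` shape is a simplified assumption, not Mimir's actual schema.

```typescript
// Sketch: group tasks into waves where every task's dependencies are already satisfied.
interface Task {
  id: string;
  dependencies: string[]; // ids of tasks that must finish first
}

function executionWaves(tasks: Task[]): string[][] {
  const remaining = new Map<string, Set<string>>();
  for (const task of tasks) remaining.set(task.id, new Set(task.dependencies));

  const waves: string[][] = [];
  const done = new Set<string>();

  while (remaining.size > 0) {
    // Every task whose dependencies are all done can run in parallel in this wave.
    const wave = [...remaining.entries()]
      .filter(([, deps]) => [...deps].every((d) => done.has(d)))
      .map(([id]) => id);

    if (wave.length === 0) {
      throw new Error("Circular dependency detected"); // e.g. A→B→A
    }
    for (const id of wave) {
      remaining.delete(id);
      done.add(id);
    }
    waves.push(wave);
  }
  return waves;
}

// Example: task-0 runs alone, 1.1 and 1.2 run in parallel, then 1.3, then 1.4.
console.log(executionWaves([
  { id: "task-0", dependencies: [] },
  { id: "task-1.1", dependencies: ["task-0"] },
  { id: "task-1.2", dependencies: ["task-0"] },
  { id: "task-1.3", dependencies: ["task-1.1", "task-1.2"] },
  { id: "task-1.4", dependencies: ["task-1.3"] },
]));
```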
---

### 3. **Role Definition for Agentinator**

#### Worker Role Descriptions

**Pattern:**

```markdown
**Agent Role Description:** [Domain expert] with [specific skills] who specializes in [task type]

Examples:
✅ "Backend engineer with Node.js and TypeScript expertise, specializing in API service implementation"
✅ "Frontend developer with React and CSS skills, focusing on component architecture"
✅ "DevOps engineer with Docker and CI/CD experience, specializing in deployment automation"
✅ "QA engineer with Jest and integration testing expertise, focusing on API validation"
```

**Best Practices:**
- **Be Specific:** Include exact technologies (not just "developer")
- **Task-Relevant:** Mention the specific work type (implementation, testing, configuration)
- **Concise:** 10-20 words (Agentinator expands to 1000+ tokens)
- **Domain-Focused:** Reference the actual domain (backend, frontend, database, etc.)

**Anti-Patterns:**

```markdown
❌ "Software engineer" (too generic)
❌ "Expert developer with 10 years experience in everything" (unrealistic)
❌ "Person who writes code and tests and deploys and documents" (too broad)
```

#### QC Role Descriptions

**Pattern:**

```markdown
**QC Role Description:** [Verification specialist] who adversarially verifies [specific aspect] using [verification methods]

Examples:
✅ "Senior QA engineer who verifies API implementations by running tests and checking error handling"
✅ "Code reviewer who validates TypeScript code quality, type safety, and adherence to patterns"
✅ "Integration tester who verifies service connections using actual API calls and database queries"
```

**Best Practices:**
- **Adversarial Stance:** Emphasize verification and skepticism
- **Verification Methods:** Specify how they verify (tests, linting, manual checks)
- **Domain-Specific:** Match the worker's domain
- **Tool-Focused:** Mention actual verification tools/commands

---

### 4. **Success Criteria Definition**

#### SMART Criteria Pattern

**Pattern:**

```markdown
**Success Criteria:**
- [ ] Specific: [Exact deliverable with file paths]
- [ ] Measurable: [Quantifiable metric or test result]
- [ ] Achievable: [Within worker's tool capabilities]
- [ ] Relevant: [Directly supports task goal]
- [ ] Testable: [Can be verified with tools]

Example:
- [ ] File `src/cache-service.ts` exists with CacheService class
- [ ] Class has get(), set(), clear() methods with type signatures
- [ ] Unit tests in `src/cache-service.spec.ts` pass (100%)
- [ ] Linting passes with 0 errors
- [ ] Build completes successfully
```

**Best Practices:**
- **File-Specific:** Name exact files that must exist/change
- **Quantifiable:** Use numbers (100% pass rate, 0 errors, 3 methods)
- **Tool-Verifiable:** Each criterion can be checked with a tool
- **Avoid Subjective:** Don't say "good quality" or "well-documented"

**Anti-Patterns:**

```markdown
❌ "Code should be clean" (subjective, not verifiable)
❌ "Implement caching" (not specific about what/where)
❌ "Make it work" (not measurable)
✅ "All 5 unit tests in cache-service.spec.ts pass with 100% coverage"
```

---

### 5. **Verification Criteria for QC**

#### Verification Rubric Pattern

**Pattern:**

```markdown
**Verification Criteria:**

**Critical (60 points):**
- [ ] (30 pts) All tests pass: `run_terminal_cmd('npm test')`
- [ ] (30 pts) Required files exist: `read_file('src/cache-service.ts')`

**Major (30 points):**
- [ ] (15 pts) Linting passes: `run_terminal_cmd('npm run lint')`
- [ ] (15 pts) Type safety: `run_terminal_cmd('npm run type-check')`

**Minor (10 points):**
- [ ] (10 pts) Documentation comments present: `grep('/**', 'src/cache-service.ts')`

**Automatic Fail Conditions:**
- Any critical criterion fails
- Worker used descriptions instead of tools (0 tool calls)
- Tests were not actually executed
```

**Best Practices:**
- **Weighted Scoring:** Critical items worth more points
- **Tool-Based:** Every criterion has a verification command
- **Automatic Fails:** Define non-negotiable failures
- **Objective:** No subjective judgment, only pass/fail checks
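A rubric like the one above is easy to evaluate mechanically once each criterion carries a verification command. The TypeScript sketch below scores a task by running each command with Node's built-in `child_process.spawnSync` and treating a non-zero exit code as a failed check; the `Criterion` shape and the 70-point pass threshold are illustrative assumptions, not Mimir's QC implementation.

```typescript
import { spawnSync } from "node:child_process";

// Illustrative QC rubric scorer: weighted, tool-based, with automatic-fail conditions.
interface Criterion {
  description: string;
  points: number;
  critical: boolean;
  command: string; // shell command; exit code 0 means the check passed
}

function scoreTask(criteria: Criterion[], toolCallCount: number): { score: number; passed: boolean } {
  if (toolCallCount === 0) {
    return { score: 0, passed: false }; // automatic fail: worker described actions instead of using tools
  }

  let score = 0;
  let criticalFailure = false;
  for (const criterion of criteria) {
    const result = spawnSync(criterion.command, { shell: true, encoding: "utf8" });
    const ok = result.status === 0;
    if (ok) score += criterion.points;
    else if (criterion.critical) criticalFailure = true; // automatic fail: a critical criterion failed
  }

  return { score, passed: !criticalFailure && score >= 70 }; // pass threshold is an assumed value
}
```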
---

### 6. **Estimation Strategies**

#### Tool Call Estimation

**Pattern:**

```markdown
**Estimated Tool Calls:** [Number]

Estimation Guidelines:
- File creation: 1-2 calls (write, read to verify)
- File modification: 2-3 calls (read, search_replace, read to verify)
- Test execution: 1-2 calls (run_terminal_cmd, check results)
- Research/exploration: 3-5 calls (grep, read_file multiple times)
- Complex implementation: 10-20 calls
- Full feature with tests: 20-40 calls

Example Task Estimates:
- "Create cache service class": 15 tool calls
  - 2 calls: Read existing service patterns
  - 1 call: Create file
  - 1 call: Verify file created
  - 3 calls: Add methods incrementally
  - 2 calls: Run linter
  - 3 calls: Create test file
  - 3 calls: Run tests
```

**Best Practices:**
- **Be Conservative:** Estimate 1.5x what you think (agents explore more)
- **Include Verification:** Count verification tool calls
- **Account for Iteration:** Workers often iterate 2-3 times
- **Use for Circuit Breaker:** Set limit at 10x estimate (safety margin)

**Estimation Formula:**

```
  Base Work:      [N] calls
+ Verification:   [N * 0.3] calls
+ Iteration:      [N * 0.5] calls
= Total Estimate: [N * 1.8] calls (round up)

Circuit Breaker Limit: [Total * 10]
```
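The estimation formula and circuit-breaker rule above reduce to a couple of lines of arithmetic; a small TypeScript helper makes the rounding and the 10x safety margin explicit. The function name and return shape are illustrative.

```typescript
// Sketch of the estimation formula: total = base + 30% verification + 50% iteration, i.e. 1.8 × base.
function estimateToolCalls(baseWorkCalls: number): { totalEstimate: number; circuitBreakerLimit: number } {
  const verification = baseWorkCalls * 0.3;
  const iteration = baseWorkCalls * 0.5;
  const totalEstimate = Math.ceil(baseWorkCalls + verification + iteration); // round up
  return { totalEstimate, circuitBreakerLimit: totalEstimate * 10 };
}

// Example: a 15-call base estimate becomes 27 estimated calls with a 270-call circuit breaker.
console.log(estimateToolCalls(15));
```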
---

### 7. **Task-0 (Environment Validation) Pattern**

#### Pre-Flight Check Template

**Pattern:**

```markdown
**Task ID:** task-0
**Title:** Environment Validation

**Description:**
🚨 **EXECUTE ENVIRONMENT VALIDATION NOW**

Verify all required dependencies and tools are available before proceeding with main tasks.

**CRITICAL:** Use actual commands, not descriptions. Execute each validation and report results.

**Required Validations:**

1. **Tool Availability:**
   - Execute: `run_terminal_cmd('which node')`
   - Execute: `run_terminal_cmd('which npm')`
   - Expected: Paths to executables

2. **Dependency Installation:**
   - Execute: `run_terminal_cmd('npm list --depth=0')`
   - Expected: All package.json dependencies listed

3. **Build System:**
   - Execute: `run_terminal_cmd('npm run build')`
   - Expected: Exit code 0, no errors

4. **Test Framework:**
   - Execute: `run_terminal_cmd('npm test -- --listTests')`
   - Expected: List of test files discovered

**Success Criteria:**
- [ ] All commands executed (not described)
- [ ] All validations passed
- [ ] Any failures reported with specific errors
- [ ] Environment confirmed ready for main tasks

**Agent Role Description:** DevOps engineer with system validation expertise
**QC Role Description:** Infrastructure validator who verifies actual command execution
**Dependencies:** None (always first)
**Estimated Duration:** 5 minutes
**Estimated Tool Calls:** 8
**Max Retries:** 1
```

**Best Practices:**
- **Always Include:** Task-0 should be in every plan
- **Imperative Language:** "EXECUTE NOW" not "Check if"
- **Actual Commands:** Provide exact commands to run
- **Expected Output:** Tell worker what success looks like
- **Fail Fast:** If Task-0 fails, entire plan should halt

---

### 8. **Output Format for PM**

#### Structured Plan Format

**Pattern:**

```markdown
# Task Decomposition Plan

## Project Overview
**Goal:** [One sentence high-level objective]
**Complexity:** Simple | Medium | Complex
**Total Tasks:** [N] tasks
**Estimated Duration:** [Total time]
**Estimated Tool Calls:** [Total across all tasks]

## Task Graph

**Task ID:** task-0
**Title:** Environment Validation
**Description:** [Imperative validation instructions]
**Agent Role Description:** [10-20 word role for Agentinator]
**QC Role Description:** [10-20 word QC role for Agentinator]
**Dependencies:** None
**Estimated Duration:** [Time]
**Estimated Tool Calls:** [Number]
**Max Retries:** 1
**Success Criteria:**
- [ ] [Specific, measurable criterion]
- [ ] [Specific, measurable criterion]
**Verification Criteria:**
- [ ] (Points) [Verification with tool command]
- [ ] (Points) [Verification with tool command]

---

**Task ID:** task-1.1
**Title:** [Task name]
**Description:** [Detailed instructions]
**Agent Role Description:** [Role for Agentinator]
**QC Role Description:** [QC role for Agentinator]
**Dependencies:** task-0
**Estimated Duration:** [Time]
**Estimated Tool Calls:** [Number]
**Max Retries:** 2

[... repeat for all tasks ...]
```

**Best Practices:**
- **Consistent Format:** Every task has same fields
- **Parseable:** Use `**Field Name:**` format for easy extraction
- **Complete:** No optional fields - all tasks have all fields
- **Ordered:** Tasks listed in dependency order when possible
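Because every field uses the `**Field Name:** value` convention, the plan can be extracted with a few lines of parsing. Below is a minimal TypeScript sketch of that extraction; the field names mirror the template above, while the function name and return shape are assumptions for illustration.

```typescript
// Sketch: parse "**Field Name:** value" lines from a plan into per-task records.
type TaskRecord = Record<string, string>;

function parsePlan(markdown: string): TaskRecord[] {
  const tasks: TaskRecord[] = [];
  let current: TaskRecord | null = null;

  for (const line of markdown.split("\n")) {
    const match = line.match(/^\*\*(.+?):\*\*\s*(.*)$/); // e.g. "**Task ID:** task-1.1"
    if (!match) continue;
    const [, field, value] = match;
    if (field === "Task ID") {
      current = {}; // each "**Task ID:**" line starts a new task record
      tasks.push(current);
    }
    if (current) current[field] = value;
  }
  return tasks;
}

// Example usage: parsePlan(planMarkdown).map(t => t["Task ID"]) lists the task ids in plan order.
```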
---

### 9. **Common PM Anti-Patterns**

#### What NOT to Do

**Anti-Pattern 1: Vague Task Descriptions**

```markdown
❌ BAD: "Implement caching"
✅ GOOD: "Create CacheService class in src/cache-service.ts with get/set/clear methods, TTL support, and unit tests"
```

**Anti-Pattern 2: Missing Dependencies**

```markdown
❌ BAD: Task 1.3 requires Task 1.1's output but doesn't list it
✅ GOOD: **Dependencies:** task-0, task-1.1
```

**Anti-Pattern 3: Unmeasurable Success Criteria**

```markdown
❌ BAD: "Code should be high quality"
✅ GOOD: "Linting passes with 0 errors: `npm run lint`"
```

**Anti-Pattern 4: Generic Role Descriptions**

```markdown
❌ BAD: "Software developer"
✅ GOOD: "Backend engineer with Node.js expertise specializing in API service implementation"
```

**Anti-Pattern 5: No Tool Call Estimates**

```markdown
❌ BAD: **Estimated Tool Calls:** (missing)
✅ GOOD: **Estimated Tool Calls:** 15
```

**Anti-Pattern 6: Skipping Task-0**

```markdown
❌ BAD: Starting with task-1.1 (no environment validation)
✅ GOOD: Always start with task-0 (environment validation)
```

**Anti-Pattern 7: Monolithic Tasks**

```markdown
❌ BAD: "Implement entire feature with tests and documentation" (200+ tool calls)
✅ GOOD: Split into 5-8 atomic tasks (15-30 tool calls each)
```

---

### 10. **PM Reasoning Pattern**

#### Explicit Planning Process

**Pattern:**

```markdown
<reasoning>
## Requirements Analysis
[What is the user asking for?]
- Core requirement: [Primary goal]
- Constraints: [Limitations or requirements]
- Success definition: [What does done look like?]

## Complexity Assessment
- Scope: [Small/Medium/Large]
- Technical complexity: [Low/Medium/High]
- Dependencies: [External systems, APIs, databases]
- Estimated tasks: [N tasks]

## Decomposition Strategy
1. Phase 1: [Setup/Validation] - [N] tasks
2. Phase 2: [Core Implementation] - [N] tasks
3. Phase 3: [Testing/Verification] - [N] tasks
4. Phase 4: [Integration/Documentation] - [N] tasks

## Dependency Analysis
- Critical path: [task-0 → task-1.1 → task-1.3 → task-2.1]
- Parallel opportunities: [task-1.2 || task-1.4]
- Blocking risks: [What could delay everything?]

## Role Requirements
- Domain: [Backend/Frontend/DevOps/QA]
- Technologies: [Specific tools/frameworks needed]
- Expertise level: [Junior/Mid/Senior skills required]

## Risk Assessment
- Technical risks: [Complex algorithms, new tech]
- Dependency risks: [External APIs, databases]
- Estimation risks: [Uncertainty in scope]
- Mitigation: [How to reduce risks]
</reasoning>
```

**Best Practices:**
- **Show Your Work:** Make planning process transparent
- **Justify Decisions:** Explain why you chose this decomposition
- **Identify Risks:** Call out potential problems early
- **Estimate Conservatively:** Better to over-estimate than under-estimate

---

### 11. **PM Success Metrics**

#### What to Measure

**Decomposition Quality:**
- Task atomicity: Are tasks independently executable?
- Dependency accuracy: Do dependencies reflect actual requirements?
- Estimation accuracy: Are tool call estimates within 2x of actual?
- Role specificity: Are roles detailed enough for Agentinator?

**Execution Outcomes:**
- First-attempt success rate: % of tasks passing QC on first try
- Retry success rate: % of tasks passing after 1 retry
- Circuit breaker rate: % of tasks hitting tool call limits
- Task-0 pass rate: % of projects with valid environments

**Quality Indicators:**
- Success criteria clarity: Can QC verify objectively?
- Verification coverage: Does every criterion have a tool check?
- Role-task alignment: Do worker skills match task requirements?

**Target Metrics (v2 Goals):**
- Task atomicity: >90% of tasks complete in 10-50 tool calls
- Estimation accuracy: 80% within 2x of estimate
- First-attempt success: >70%
- Task-0 pass rate: >95%
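The execution-outcome metrics above can be computed directly from per-task execution records. The TypeScript sketch below shows the arithmetic; the `TaskResult` shape is an assumed simplification of whatever Mimir actually logs.

```typescript
// Sketch: compute the execution-outcome metrics from assumed per-task execution records.
interface TaskResult {
  passedFirstAttempt: boolean;
  passedAfterRetry: boolean;
  hitCircuitBreaker: boolean;
  toolCalls: number;
  estimatedToolCalls: number;
}

function pmMetrics(results: TaskResult[]) {
  const n = results.length || 1; // avoid division by zero on an empty run
  const pct = (count: number) => Math.round((count / n) * 100);
  return {
    firstAttemptSuccess: pct(results.filter((r) => r.passedFirstAttempt).length),
    retrySuccess: pct(results.filter((r) => !r.passedFirstAttempt && r.passedAfterRetry).length),
    circuitBreakerRate: pct(results.filter((r) => r.hitCircuitBreaker).length),
    // Estimation accuracy: share of tasks whose actual tool calls stayed within 2x of the estimate.
    estimationAccuracy: pct(results.filter((r) => r.toolCalls <= r.estimatedToolCalls * 2).length),
  };
}
```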
---

### 12. **PM Prompt Template (Summary)**

```markdown
# PM Agent Preamble

## Role & Objective
You are a Project Manager who decomposes user requirements into executable task graphs.

## Critical Rules
1. ALWAYS create task-0 (environment validation) first
2. Break work into atomic tasks (10-50 tool calls each)
3. Define specific, measurable success criteria
4. Provide role descriptions for Agentinator (10-20 words)
5. Estimate tool calls conservatively (1.5-2x expected)
6. Map dependencies explicitly
7. Use structured output format (parseable)

## Execution Pattern
1. Analyze requirements (show reasoning)
2. Assess complexity and scope
3. Decompose into phases and tasks
4. Define dependencies and critical path
5. Assign roles and success criteria
6. Estimate tool calls and duration
7. Output structured plan

## Success Criteria
- [ ] Task-0 included with imperative validation
- [ ] All tasks have role descriptions
- [ ] All tasks have measurable success criteria
- [ ] All tasks have tool call estimates
- [ ] Dependencies mapped correctly
- [ ] Output follows exact format

## Output Format
[Use structured plan format from section 8]

## Verification Checklist
- [ ] Did you create task-0?
- [ ] Are all tasks atomic (10-50 tool calls)?
- [ ] Are success criteria measurable?
- [ ] Are role descriptions specific?
- [ ] Are dependencies correct?
- [ ] Did you estimate tool calls?
```

---

## 🔗 References

1. Steve Kinney - "Getting the Most Out of GPT-4.1" - https://stevekinney.com/writing/getting-the-most-out-of-gpt-4-1
2. OpenAI Cookbook - "GPT-4.1 Prompting Guide" - https://cookbook.openai.com/examples/gpt4-1_prompting_guide
3. Communeify - "GPT-4.1 Prompt Engineering Practical Guide" - https://www.communeify.com/en/blog/gpt-4-1-prompt-engineering-practical-guide
4. Folio3 AI - "Prompt Engineering Best Practices" - https://www.folio3.ai/blog/prompt-engineering-best-practices/
5. AppAI Flow - "AI Prompt Techniques for Automation" - https://appaiflow.com/tips/ai-prompt-techniques/
6. Mojasim - "Prompt Engineering to Improve Workflows" - https://www.mojasim.com/blogs/prompt-engineering-to-improve-workflows
7. ProjectManager.com - "Project Management Best Practices" - https://www.projectmanager.com/blog/project-management-best-practices
8. Celoxis - "Task Management for Project Managers" - https://www.celoxis.com/article/task-management-project-managers

---

**Next Steps:**
1. Review current PM preamble against this research
2. Fill in PM preamble (01-pm-preamble.md) with these patterns
3. Update Agentinator to use role descriptions correctly
4. Test with actual execution
5. Measure decomposition quality and success rates
6. Iterate based on results
