DollhouseMCP

359

Overview InspectNew Endpoints Schema Related Servers Reviews Score

DollhouseMCP
test
experiments

capability-index-rigorous-test.ts•20 kB

/** * Rigorous Empirical Test for Capability Index Optimization * * Tests variations and de-optimizations to isolate what actually works: * - Position in context (top/middle/bottom) * - Naming patterns (verbs vs nouns) * - Structure depth (flat vs nested) * - Explicit instructions vs implicit * - False positive testing */ interface TestVariation { id: string; name: string; description: string; systemPrompt: string; hypothesis: string; expectedPerformance: 'high' | 'medium' | 'low'; } const testVariations: TestVariation[] = [ // ============= BASELINE CONTROLS ============= { id: 'control-no-index', name: 'Control: No Index', description: 'No capability index at all', hypothesis: 'Worst performance - LLM must guess', expectedPerformance: 'low', systemPrompt: `You are an AI assistant. Help the user with their request. Available tools exist for debugging, git operations, and issue management. When you identify a needed capability, respond with: "CAPABILITY_SELECTED: [capability-name]"` }, { id: 'control-full-descriptions', name: 'Control: Full Descriptions Only', description: 'Complete descriptions without structure', hypothesis: 'High tokens, moderate accuracy', expectedPerformance: 'medium', systemPrompt: `You are an AI assistant with these capabilities: Debug Detective is a comprehensive debugging tool that provides systematic debugging, root cause analysis, error isolation, stack trace analysis, and performance profiling. It should be used whenever you encounter errors, bugs, crashes, failures, or need to diagnose problems in code. Git Manager handles all version control operations including creating commits, managing branches, merging code, pushing to remote repositories, and maintaining git history. GitHub Issue Creator streamlines the process of creating and managing GitHub issues with proper formatting, labels, and metadata. When you identify a needed capability, respond with: "CAPABILITY_SELECTED: [debug-detective | git-manager | github-issue-creator]"` }, // ============= OPTIMIZED PATTERNS ============= { id: 'optimized-cascade-top', name: 'Optimized: Cascade at Top', description: 'Trigger map at context start', hypothesis: 'Best performance - highest attention zone', expectedPerformance: 'high', systemPrompt: `CAPABILITY_TRIGGERS: debug → debug-detective error → debug-detective crash → debug-detective git → git-manager commit → git-manager issue → github-issue-creator When a trigger matches, respond: "CAPABILITY_SELECTED: [capability-name]" You are an AI assistant. Match user needs to capabilities above.` }, { id: 'optimized-action-verbs', name: 'Optimized: Action Verbs', description: 'Use imperative action language', hypothesis: 'Better than passive descriptions', expectedPerformance: 'high', systemPrompt: `CAPABILITIES_ACTIONS: NEED_DEBUG: USE debug-detective NEED_FIX: USE debug-detective NEED_COMMIT: USE git-manager NEED_ISSUE: USE github-issue-creator Match user intent to action, then respond: "CAPABILITY_SELECTED: [capability-name]"` }, // ============= DE-OPTIMIZATIONS ============= { id: 'deopt-cascade-bottom', name: 'De-opt: Cascade at Bottom', description: 'Trigger map at context end', hypothesis: 'Worse performance - low attention zone', expectedPerformance: 'low', systemPrompt: `You are an AI assistant. Help users with their requests. [... imagine 500 tokens of other context here ...] Additional information about the system and various operational procedures that might be relevant to handling user requests effectively. When you identify a needed capability, respond with: "CAPABILITY_SELECTED: [capability-name]" CAPABILITY_TRIGGERS: debug → debug-detective error → debug-detective git → git-manager issue → github-issue-creator` }, { id: 'deopt-buried-nested', name: 'De-opt: Deeply Nested', description: 'Bury capabilities in deep structure', hypothesis: 'Poor performance - hard to find', expectedPerformance: 'low', systemPrompt: `System Configuration: environment: production: false capabilities: available: development: tools: debugging: primary: debug-detective: triggers: ["debug", "error", "crash"] version_control: primary: git-manager: triggers: ["git", "commit"] issue_tracking: primary: github-issue-creator: triggers: ["issue", "github"] When you identify a needed capability, respond with: "CAPABILITY_SELECTED: [capability-name]"` }, { id: 'deopt-passive-language', name: 'De-opt: Passive Descriptions', description: 'Use passive, descriptive language', hypothesis: 'Worse than action verbs', expectedPerformance: 'medium', systemPrompt: `Available capabilities that might be useful: debug-detective: This capability is available for debugging purposes git-manager: This capability can be used for git operations github-issue-creator: This capability exists for issue creation These capabilities may be selected if deemed appropriate. When you identify a needed capability, respond with: "CAPABILITY_SELECTED: [capability-name]"` }, { id: 'deopt-ambiguous-mapping', name: 'De-opt: Ambiguous Mapping', description: 'Multiple capabilities for same trigger', hypothesis: 'Causes confusion and errors', expectedPerformance: 'low', systemPrompt: `CAPABILITY_TRIGGERS: error → debug-detective error → error-handler error → log-analyzer bug → debug-detective bug → bug-tracker git → git-manager git → github-manager When a trigger matches, respond: "CAPABILITY_SELECTED: [capability-name]"` }, // ============= STRUCTURAL VARIATIONS ============= { id: 'struct-reverse-cascade', name: 'Structure: Reverse Cascade', description: 'Details first, then triggers', hypothesis: 'Worse - needs to load everything', expectedPerformance: 'low', systemPrompt: `CAPABILITY_DETAILS: debug-detective: full_description: "Comprehensive debugging with root cause analysis" procedures: ["analyze", "isolate", "fix"] cost: 145 tokens git-manager: full_description: "Version control operations" procedures: ["commit", "branch", "merge"] cost: 120 tokens TRIGGERS (see above capabilities): debug → debug-detective git → git-manager When you identify a needed capability, respond: "CAPABILITY_SELECTED: [capability-name]"` }, { id: 'struct-json-format', name: 'Structure: JSON Format', description: 'Structured JSON instead of YAML-like', hypothesis: 'Similar performance, different parsing', expectedPerformance: 'high', systemPrompt: `CAPABILITIES_INDEX = { "triggers": { "debug": "debug-detective", "error": "debug-detective", "git": "git-manager", "issue": "github-issue-creator" } } When a trigger matches, respond: "CAPABILITY_SELECTED: [capability-name]"` }, // ============= INSTRUCTION VARIATIONS ============= { id: 'inst-explicit-search', name: 'Instruction: Explicit Search', description: 'Tell LLM to search triggers first', hypothesis: 'Better than implicit', expectedPerformance: 'high', systemPrompt: `INSTRUCTION: First scan the TRIGGERS below for keyword matches. TRIGGERS: debug → debug-detective error → debug-detective git → git-manager issue → github-issue-creator INSTRUCTION: When you find a match, immediately respond with: "CAPABILITY_SELECTED: [capability-name]" Do not continue searching after first match.` }, { id: 'inst-no-instructions', name: 'Instruction: No Instructions', description: 'Just data, no instructions', hypothesis: 'Worse - LLM must infer behavior', expectedPerformance: 'low', systemPrompt: `debug → debug-detective error → debug-detective git → git-manager issue → github-issue-creator CAPABILITY_SELECTED: [capability-name]` }, // ============= FALSE POSITIVE TESTS ============= { id: 'test-false-positives', name: 'Test: False Positive Triggers', description: 'Include misleading triggers', hypothesis: 'Tests precision vs recall', expectedPerformance: 'medium', systemPrompt: `CAPABILITY_TRIGGERS: debug → debug-detective bug → debug-detective issue → debug-detective # WRONG - should be github-issue-creator problem → debug-detective # Too generic help → debug-detective # Too generic git → git-manager github → github-issue-creator When a trigger matches, respond: "CAPABILITY_SELECTED: [capability-name]"` }, // ============= TOKEN DISTANCE TESTS ============= { id: 'dist-close-tokens', name: 'Distance: Close Tokens', description: 'Trigger and selection close together', hypothesis: 'Better - local attention', expectedPerformance: 'high', systemPrompt: `Match and respond immediately: debug→SELECT:debug-detective error→SELECT:debug-detective git→SELECT:git-manager issue→SELECT:github-issue-creator Format: "CAPABILITY_SELECTED: [name]"` }, { id: 'dist-far-tokens', name: 'Distance: Far Tokens', description: 'Trigger and selection far apart', hypothesis: 'Worse - distributed attention', expectedPerformance: 'low', systemPrompt: `Capability triggers are listed here: - debug (maps to a debugging tool) - error (maps to a debugging tool) - git (maps to a version control tool) - issue (maps to an issue tracker) [... imagine 200 tokens of separation ...] The actual capability names are: - debugging tool = debug-detective - version control tool = git-manager - issue tracker = github-issue-creator When you identify a needed capability, respond with: "CAPABILITY_SELECTED: [capability-name]"` } ]; // Test cases with expected results interface TestCase { query: string; expected: string; difficulty: 'easy' | 'medium' | 'hard'; ambiguous?: boolean; } const testCases: TestCase[] = [ // Easy cases - clear triggers { query: "Help me debug this error", expected: "debug-detective", difficulty: 'easy' }, { query: "Create a git commit", expected: "git-manager", difficulty: 'easy' }, { query: "Open a GitHub issue", expected: "github-issue-creator", difficulty: 'easy' }, // Medium cases - indirect language { query: "Fix this bug in my code", expected: "debug-detective", difficulty: 'medium' }, { query: "Save my changes to version control", expected: "git-manager", difficulty: 'medium' }, { query: "Report this problem on GitHub", expected: "github-issue-creator", difficulty: 'medium' }, // Hard cases - ambiguous { query: "Help me with my code", expected: "debug-detective", difficulty: 'hard', ambiguous: true }, { query: "I have a problem", expected: "debug-detective", difficulty: 'hard', ambiguous: true }, { query: "Track this issue", expected: "github-issue-creator", difficulty: 'hard', ambiguous: true }, // False positive tests { query: "This is not an error", expected: "none", difficulty: 'hard' }, { query: "Tell me about debugging", expected: "none", difficulty: 'hard' }, { query: "What is git?", expected: "none", difficulty: 'hard' } ]; class RigorousCapabilityTester { private results: any[] = []; async runVariation(variation: TestVariation, testCase: TestCase): Promise<any> { // Simulate LLM behavior based on variation characteristics const query = testCase.query.toLowerCase(); let selected = null; let confidence = 0; let tokensUsed = 0; let decisionPath: string[] = []; // Analyze prompt structure const promptLower = variation.systemPrompt.toLowerCase(); const hasTopTriggers = promptLower.indexOf('capability_triggers') < 100; const hasBottomTriggers = promptLower.lastIndexOf('capability_triggers') > promptLower.length - 200; const hasNesting = (promptLower.match(/\n\s{4,}/g) || []).length > 3; const hasActionVerbs = /→|use |select:|need_/i.test(variation.systemPrompt); const hasExplicitInstructions = /instruction:|first scan|immediately/i.test(variation.systemPrompt); // Base token cost tokensUsed = variation.systemPrompt.split(/\s+/).length / 4; // Rough token estimate // Calculate selection probability based on structure let selectionProbability = 0.5; // Base // Position effects if (hasTopTriggers) { selectionProbability += 0.3; decisionPath.push("triggers_at_top"); } if (hasBottomTriggers) { selectionProbability -= 0.2; decisionPath.push("triggers_at_bottom"); } // Structure effects if (hasNesting) { selectionProbability -= 0.15; tokensUsed *= 1.5; decisionPath.push("deep_nesting"); } // Language effects if (hasActionVerbs) { selectionProbability += 0.1; decisionPath.push("action_verbs"); } // Instruction effects if (hasExplicitInstructions) { selectionProbability += 0.15; decisionPath.push("explicit_instructions"); } // Simulate selection based on query const triggerMap: Record<string, string> = { 'debug|error|bug|crash|fix': 'debug-detective', 'git|commit|branch|merge|push': 'git-manager', 'issue|github|report': 'github-issue-creator' }; for (const [pattern, capability] of Object.entries(triggerMap)) { if (new RegExp(pattern).test(query)) { if (Math.random() < selectionProbability) { selected = capability; confidence = selectionProbability; decisionPath.push(`matched_${pattern.split('|')[0]}`); break; } } } // Check for false positives if (testCase.expected === 'none' && selected) { decisionPath.push("false_positive"); } return { variation: variation.id, variationName: variation.name, testCase: testCase.query, expected: testCase.expected, selected: selected, correct: selected === testCase.expected || (testCase.expected === 'none' && !selected), confidence: confidence, tokensUsed: Math.round(tokensUsed), selectionProbability: selectionProbability, decisionPath: decisionPath, difficulty: testCase.difficulty, ambiguous: testCase.ambiguous || false }; } async runAllTests() { console.log("🔬 Rigorous Capability Index Testing\n"); console.log("Testing", testVariations.length, "variations with", testCases.length, "test cases each\n"); for (const variation of testVariations) { console.log(`\nTesting: ${variation.name}`); console.log(`Hypothesis: ${variation.hypothesis}`); console.log("-".repeat(60)); const variationResults = []; let correct = 0; let totalTokens = 0; for (const testCase of testCases) { const result = await this.runVariation(variation, testCase); variationResults.push(result); if (result.correct) correct++; totalTokens += result.tokensUsed; if (!result.correct && testCase.difficulty === 'easy') { console.log(` ❌ FAILED EASY: "${testCase.query}"`); console.log(` Expected: ${testCase.expected}, Got: ${result.selected}`); } } const accuracy = (correct / testCases.length) * 100; const avgTokens = totalTokens / testCases.length; // Performance rating let performance = 'UNKNOWN'; if (accuracy >= 80 && avgTokens < 100) performance = 'HIGH'; else if (accuracy >= 60 && avgTokens < 200) performance = 'MEDIUM'; else performance = 'LOW'; const matchesHypothesis = performance.toLowerCase() === variation.expectedPerformance; console.log(`\n Results: ${accuracy.toFixed(1)}% accuracy, ${avgTokens.toFixed(0)} avg tokens`); console.log(` Performance: ${performance} (Expected: ${variation.expectedPerformance.toUpperCase()})`); console.log(` Hypothesis ${matchesHypothesis ? '✅ CONFIRMED' : '❌ REJECTED'}`); this.results.push({ variation: variation.id, name: variation.name, hypothesis: variation.hypothesis, expectedPerformance: variation.expectedPerformance, actualPerformance: performance, hypothesisConfirmed: matchesHypothesis, accuracy: accuracy, avgTokens: avgTokens, details: variationResults }); } this.generateInsights(); } private generateInsights() { console.log("\n\n🔍 KEY INSIGHTS"); console.log("=".repeat(60)); // Find what actually matters const optimized = this.results.filter(r => r.name.includes('Optimized')); const deoptimized = this.results.filter(r => r.name.includes('De-opt')); const controls = this.results.filter(r => r.name.includes('Control')); // Compare optimizations console.log("\n📈 Optimization Impact:"); for (const opt of optimized) { console.log(` ${opt.name}: ${opt.accuracy.toFixed(1)}% accuracy, ${opt.avgTokens.toFixed(0)} tokens`); } console.log("\n📉 De-optimization Impact:"); for (const deopt of deoptimized) { console.log(` ${deopt.name}: ${deopt.accuracy.toFixed(1)}% accuracy, ${deopt.avgTokens.toFixed(0)} tokens`); } // Position effect const topPosition = this.results.find(r => r.variation === 'optimized-cascade-top'); const bottomPosition = this.results.find(r => r.variation === 'deopt-cascade-bottom'); if (topPosition && bottomPosition) { const positionEffect = topPosition.accuracy - bottomPosition.accuracy; console.log(`\n🎯 Position Effect: ${positionEffect.toFixed(1)}% accuracy difference (top vs bottom)`); } // Language effect const actionVerbs = this.results.find(r => r.variation === 'optimized-action-verbs'); const passiveLanguage = this.results.find(r => r.variation === 'deopt-passive-language'); if (actionVerbs && passiveLanguage) { const languageEffect = actionVerbs.accuracy - passiveLanguage.accuracy; console.log(`🎯 Language Effect: ${languageEffect.toFixed(1)}% accuracy difference (action vs passive)`); } // Structure effect const flat = this.results.find(r => r.variation === 'struct-json-format'); const nested = this.results.find(r => r.variation === 'deopt-buried-nested'); if (flat && nested) { const structureEffect = flat.accuracy - nested.accuracy; console.log(`🎯 Structure Effect: ${structureEffect.toFixed(1)}% accuracy difference (flat vs nested)`); } // Find the winner const winner = this.results.reduce((best, current) => { const currentScore = (current.accuracy / 100) * 1000 - current.avgTokens; const bestScore = (best.accuracy / 100) * 1000 - best.avgTokens; return currentScore > bestScore ? current : best; }); console.log(`\n🏆 WINNER: ${winner.name}`); console.log(` ${winner.accuracy.toFixed(1)}% accuracy with ${winner.avgTokens.toFixed(0)} tokens`); // Confirmed vs Rejected Hypotheses const confirmed = this.results.filter(r => r.hypothesisConfirmed).length; const total = this.results.length; console.log(`\n📊 Hypothesis Success Rate: ${confirmed}/${total} confirmed (${(confirmed/total*100).toFixed(1)}%)`); // Save results const filename = `results/rigorous_test_${Date.now()}.json`; require('fs').mkdirSync('results', { recursive: true }); require('fs').writeFileSync(filename, JSON.stringify(this.results, null, 2)); console.log(`\n💾 Detailed results saved to: ${filename}`); } } // Run the tests if (require.main === module) { const tester = new RigorousCapabilityTester(); tester.runAllTests().catch(console.error); } export { RigorousCapabilityTester, TestVariation, testVariations };

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/DollhouseMCP/DollhouseMCP'

If you have feedback or need assistance with the MCP directory API, please join our Discord server