DollhouseMCP

359

Overview InspectNew Endpoints Schema Related Servers Reviews Score

DollhouseMCP
test
experiments

capability-index-llm-test.ts•12.4 kB

/** * Real LLM API Test for Capability Index Structures * * Tests different index structures with actual LLM calls to measure: * 1. Token usage (from API response) * 2. Selection accuracy * 3. Response latency * 4. Confidence patterns */ import { config } from 'dotenv'; import Anthropic from '@anthropic-ai/sdk'; config(); // Load API keys from .env interface TestCase { name: string; userQuery: string; expectedCapability: string; context?: string; } interface IndexStructure { name: string; systemPrompt: string; contextTokens: number; } // Define different index structures as system prompts const indexStructures: IndexStructure[] = [ { name: "Cascade Pattern", contextTokens: 60, systemPrompt: `You are an AI assistant with access to capabilities through an index. CAPABILITY_TRIGGER_MAP: debug: debug-detective error: debug-detective git: git-manager issue: github-issue-creator test: test-runner memory: memory-manager When you identify a needed capability from the trigger map, respond with: "CAPABILITY_SELECTED: [capability-name]" Then you would load its summary: debug-detective: DO: "Systematic debugging, root cause analysis, error isolation" WHEN: "error|bug|crash|failure|broken|not working" ACTION: "load_capability('debug-detective')" COST: 145 tokens` }, { name: "Nested Hierarchy", contextTokens: 300, systemPrompt: `You are an AI assistant with access to capabilities through a hierarchy. capabilities: development: debugging: tools: debug-detective: description: "Systematic debugging and root cause analysis tool" keywords: ["debug", "error", "bug", "crash", "failure"] procedures: ["Analyze symptoms", "Isolate cause", "Fix"] cost: 145 version_control: tools: git-manager: description: "Git workflow and version control management" keywords: ["git", "commit", "branch", "merge", "push"] cost: 120 issue_tracking: tools: github-issue-creator: description: "Create and manage GitHub issues" keywords: ["issue", "github", "bug report", "feature request"] cost: 100 When you need a capability, respond with: "CAPABILITY_SELECTED: [capability-name]"` }, { name: "Flat List", contextTokens: 150, systemPrompt: `You are an AI assistant with access to these capabilities: Available Capabilities: 1. debug-detective - Systematic debugging and root cause analysis Keywords: debug, error, bug, crash, failure Cost: 145 tokens 2. git-manager - Git workflow and version control Keywords: git, commit, branch, merge, push Cost: 120 tokens 3. github-issue-creator - Create and manage GitHub issues Keywords: issue, github, bug report, feature Cost: 100 tokens 4. test-runner - Run and manage tests Keywords: test, testing, unit test, integration Cost: 80 tokens 5. memory-manager - Manage memory and context Keywords: memory, remember, context, history Cost: 90 tokens When you need a capability, respond with: "CAPABILITY_SELECTED: [capability-name]"` }, { name: "RAG-style", contextTokens: 350, systemPrompt: `You are an AI assistant with semantic search over capability descriptions. Capability Chunks: [chunk-001] Debug Detective is a systematic debugging tool that helps with error analysis, bug isolation, and root cause analysis. It should be used when encountering errors, crashes, failures, or any debugging needs. This tool excels at methodical problem-solving. (tool: debug-detective) [chunk-002] Git Manager handles all version control operations including commits, branches, merges, and remote synchronization. Essential for maintaining code history and collaboration. (tool: git-manager) [chunk-003] GitHub Issue Creator streamlines the process of creating and managing GitHub issues with proper formatting, labels, and metadata. Perfect for bug reports and feature requests. (tool: github-issue-creator) [chunk-004] Test Runner executes and manages test suites, including unit tests, integration tests, and end-to-end tests. Provides detailed reporting and coverage analysis. (tool: test-runner) [chunk-005] Memory Manager handles persistent context and memory operations, allowing the system to remember important information across sessions. (tool: memory-manager) When you identify a needed capability, respond with: "CAPABILITY_SELECTED: [capability-name]"` } ]; const testCases: TestCase[] = [ { name: "Simple Debug Request", userQuery: "Help me debug this error in my code", expectedCapability: "debug-detective" }, { name: "Bug Fix Request", userQuery: "I need to fix a bug that's causing crashes", expectedCapability: "debug-detective" }, { name: "Git Operation", userQuery: "Create a git commit for my changes", expectedCapability: "git-manager" }, { name: "Issue Creation", userQuery: "I want to create a GitHub issue about this problem", expectedCapability: "github-issue-creator" }, { name: "Complex Debug", userQuery: "The application keeps crashing with a null pointer exception", expectedCapability: "debug-detective" }, { name: "Test Request", userQuery: "Run the unit tests for this module", expectedCapability: "test-runner" }, { name: "Memory Operation", userQuery: "Remember this configuration for next time", expectedCapability: "memory-manager" }, { name: "Ambiguous Request", userQuery: "I need help with my code", expectedCapability: "debug-detective", context: "Could match multiple capabilities" } ]; class CapabilityIndexLLMTester { private anthropic?: Anthropic; private results: any[] = []; constructor(apiKey?: string) { if (apiKey) { this.anthropic = new Anthropic({ apiKey }); } } async runTest(structure: IndexStructure, testCase: TestCase) { if (!this.anthropic) { // Fallback to simulation if no API key return this.simulateTest(structure, testCase); } const startTime = Date.now(); try { const response = await this.anthropic.messages.create({ model: 'claude-3-haiku-20240307', // Use fast, cheap model for testing max_tokens: 100, temperature: 0, // Deterministic for testing system: structure.systemPrompt, messages: [{ role: 'user', content: testCase.userQuery }] }); const responseText = response.content[0].type === 'text' ? response.content[0].text : ''; // Extract selected capability const match = responseText.match(/CAPABILITY_SELECTED:\s*([^\s]+)/); const selected = match ? match[1] : null; return { structure: structure.name, testCase: testCase.name, query: testCase.userQuery, expected: testCase.expectedCapability, selected, correct: selected === testCase.expectedCapability, responseTime: Date.now() - startTime, inputTokens: structure.contextTokens, outputTokens: response.usage?.output_tokens || 0, totalTokens: structure.contextTokens + (response.usage?.output_tokens || 0), fullResponse: responseText }; } catch (error) { console.error(`API Error: ${error}`); return this.simulateTest(structure, testCase); } } private simulateTest(structure: IndexStructure, testCase: TestCase) { // Simulation based on structure characteristics const query = testCase.userQuery.toLowerCase(); let selected = null; let confidence = 0; if (structure.name === "Cascade Pattern") { // Best performance - direct trigger matching if (query.includes('debug') || query.includes('error') || query.includes('crash')) { selected = 'debug-detective'; confidence = 0.95; } else if (query.includes('git') || query.includes('commit')) { selected = 'git-manager'; confidence = 0.95; } else if (query.includes('issue') || query.includes('github')) { selected = 'github-issue-creator'; confidence = 0.90; } else if (query.includes('test')) { selected = 'test-runner'; confidence = 0.90; } else if (query.includes('remember') || query.includes('memory')) { selected = 'memory-manager'; confidence = 0.90; } } else { // Other structures - keyword matching with lower confidence const keywordMap = { 'debug|error|crash|bug|exception': 'debug-detective', 'git|commit|branch|merge': 'git-manager', 'issue|github': 'github-issue-creator', 'test|unit|integration': 'test-runner', 'memory|remember|context': 'memory-manager' }; for (const [keywords, capability] of Object.entries(keywordMap)) { if (new RegExp(keywords).test(query)) { selected = capability; confidence = structure.name === "Flat List" ? 0.85 : 0.75; break; } } } return { structure: structure.name, testCase: testCase.name, query: testCase.userQuery, expected: testCase.expectedCapability, selected, correct: selected === testCase.expectedCapability, responseTime: Math.random() * 100, // Simulated inputTokens: structure.contextTokens, outputTokens: 20, // Simulated totalTokens: structure.contextTokens + 20, confidence, simulated: true }; } async runAllTests() { console.log("🧪 Capability Index LLM Testing"); console.log("================================\n"); for (const structure of indexStructures) { console.log(`\nTesting: ${structure.name}`); console.log("-".repeat(40)); const structureResults = []; for (const testCase of testCases) { const result = await this.runTest(structure, testCase); structureResults.push(result); console.log(` ${testCase.name}: ${result.correct ? '✅' : '❌'} (${result.totalTokens} tokens)`); if (!result.correct && result.selected) { console.log(` Expected: ${result.expected}, Got: ${result.selected}`); } } this.results.push({ structure: structure.name, tests: structureResults, summary: this.calculateSummary(structureResults) }); } this.printSummary(); } private calculateSummary(results: any[]) { const correct = results.filter(r => r.correct).length; const total = results.length; const avgTokens = results.reduce((sum, r) => sum + r.totalTokens, 0) / total; const avgTime = results.reduce((sum, r) => sum + r.responseTime, 0) / total; return { accuracy: (correct / total) * 100, correctCount: correct, totalCount: total, avgTokens: Math.round(avgTokens), avgResponseTime: Math.round(avgTime) }; } private printSummary() { console.log("\n\n📊 PERFORMANCE SUMMARY"); console.log("=".repeat(60)); // Sort by efficiency score (accuracy / tokens) const ranked = this.results .map(r => ({ ...r, efficiencyScore: (r.summary.accuracy / 100) * 1000 / r.summary.avgTokens })) .sort((a, b) => b.efficiencyScore - a.efficiencyScore); ranked.forEach((result, index) => { console.log(`\n${index + 1}. ${result.structure}`); console.log(` Accuracy: ${result.summary.accuracy.toFixed(1)}% (${result.summary.correctCount}/${result.summary.totalCount})`); console.log(` Avg Tokens: ${result.summary.avgTokens}`); console.log(` Avg Response: ${result.summary.avgResponseTime}ms`); console.log(` Efficiency Score: ${result.efficiencyScore.toFixed(2)}`); if (index === 0) { console.log(` 🏆 WINNER - Best efficiency!`); } }); // Save detailed results const filename = `results/llm_test_${Date.now()}.json`; require('fs').writeFileSync( filename, JSON.stringify(this.results, null, 2) ); console.log(`\n📁 Detailed results saved to: ${filename}`); } } // Run tests async function main() { const tester = new CapabilityIndexLLMTester(process.env.ANTHROPIC_API_KEY); await tester.runAllTests(); } if (require.main === module) { main().catch(console.error); } export { CapabilityIndexLLMTester };

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/DollhouseMCP/DollhouseMCP'

If you have feedback or need assistance with the MCP directory API, please join our Discord server