# Critical Missing Tests for Capability Index

## What We've Been Missing

### 1. We're Only Testing MCP Tools, Not Elements

**Current tests:**
- list_elements()
- search_collection()
- search_portfolio()

**MISSING tests:**
- **activate_element('persona-name')** - Activating personas
- **create_element(type='memories')** - Creating memories
- **deactivate_element()** - Deactivating elements
- **render_template()** - Using templates
- **execute_agent()** - Running agents

### 2. We're Not Testing Heavy-Context Scenarios

**Need to test with:**
- 25,000+ tokens of context BEFORE the query
- Multiple competing priorities in context
- Conflicting information that the index should resolve

### 3. Capability Index Should Cover ALL Elements

```yaml
CAPABILITY_INDEX:
  # PERSONAS (behavioral activation)
  debug → activate_element('debug-detective')
  creative → activate_element('creative-writer')
  security → activate_element('security-analyst')

  # MEMORIES (context management)
  save → create_element(type='memories', content=current)
  recall → search_portfolio('memory')
  session → get_element_details('session-*')

  # SKILLS (capability enhancement)
  code_review → activate_element('code-review-skill')
  testing → activate_element('test-writer-skill')

  # TEMPLATES (structured output)
  report → render_template('status-report')
  pr → render_template('pull-request')

  # AGENTS (autonomous tasks)
  research → execute_agent('research-agent', goal)
  refactor → execute_agent('refactor-agent', code)

  # ENSEMBLES (combined elements)
  team → activate_element('dev-team-ensemble')
```

## Tests We SHOULD Run

### Test A: Persona Activation Under Load

```bash
# 25,000 tokens of documentation
# Query: "Help me debug this error"
# Expected: activate_element('debug-detective')
# Measure: Time to activation, tokens used
```

### Test B: Memory Creation with Context

```bash
# 25,000 tokens of conversation history
# Query: "Save this for tomorrow's session"
# Expected: create_element(type='memories', ...)
# Measure: What gets saved, how much context
```

### Test C: Progressive Disclosure

```bash
# Index with multiple levels:
CAPABILITY_INDEX:
  personas → list_names_only
  personas_detailed → list_with_descriptions
  personas_full → get_all_metadata
# Test if Claude uses the minimal option first
```

### Test D: Index vs. No Index with Heavy Load

```bash
# Scenario 1: 25k tokens + NO index
# Scenario 2: 25k tokens + index at TOP
# Scenario 3: 25k tokens + index at BOTTOM
# Scenario 4: 25k tokens + index REPEATED
# Measure: Response time, correct element activation
```

### Test E: Conflicting Context

```bash
# Context says: "Always use creative-writer"
# Query: "Help me debug"
# Index says: debug → debug-detective
# What wins? Context or index?
```
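As a concrete starting point, below is a minimal harness sketch for Test D (swapping the query and expected call turns it into Test A or Test E). Everything in it is illustrative: `call_model` is a placeholder to wire to whatever client drives the MCP session, the filler text stands in for real documentation, and token counts are estimated at roughly 4 characters per token.

```python
"""Sketch for Test D: capability index vs. no index under ~25k tokens of
prior context. `call_model` is a placeholder, not a real client."""
import time

CAPABILITY_INDEX = """CAPABILITY_INDEX:
  debug -> activate_element('debug-detective')
  creative -> activate_element('creative-writer')
  save -> create_element(type='memories', content=current)
"""

QUERY = "Help me debug this error"
EXPECTED_CALL = "activate_element('debug-detective')"


def filler(tokens: int) -> str:
    """Roughly `tokens` worth of distractor documentation (~4 chars/token)."""
    line = "Unrelated project documentation the model must read past. "
    return line * (tokens * 4 // len(line))


def build_context(scenario: str, heavy: str) -> str:
    """Place the index according to Test D's four scenarios."""
    return {
        "no_index": heavy,
        "index_top": CAPABILITY_INDEX + heavy,
        "index_bottom": heavy + CAPABILITY_INDEX,
        "index_repeated": CAPABILITY_INDEX + heavy + CAPABILITY_INDEX,
    }[scenario]


def call_model(system: str, user: str) -> str:
    """Placeholder: replace the stub with a real model call so the reply
    contains whatever tool invocation the model actually chose."""
    return "stub reply - wire this to your client"


def run_test_d(context_tokens: int = 25_000) -> None:
    heavy = filler(context_tokens)
    for scenario in ("no_index", "index_top", "index_bottom", "index_repeated"):
        system = build_context(scenario, heavy)
        start = time.monotonic()
        reply = call_model(system, QUERY)
        elapsed = time.monotonic() - start
        print(
            f"{scenario:15s} activated={EXPECTED_CALL in reply} "
            f"latency={elapsed:.2f}s approx_input_tokens={len(system) // 4}"
        )


if __name__ == "__main__":
    run_test_d()
```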
## Why Current Tests Are Insufficient

### 1. Binary Tool Execution
All our tests just check whether `list_elements` runs. We're not testing:
- Which persona gets activated
- Whether memories are created/retrieved
- Template rendering
- Agent execution

### 2. Clean Context
Testing with empty/minimal context doesn't reflect reality, where:
- Context is already full of information
- Multiple possible interpretations exist
- Performance degrades with size

### 3. Single Element Type
Only testing "list personas" doesn't cover:
- Cross-element workflows
- Combined operations
- Progressive refinement

## Proposed Comprehensive Test

### Setup
1. Load 25,000 tokens of mixed content
2. Include conflicting instructions
3. Add noise/distractions

### Test Matrix

| Context Size | Index Position | Element Type | Query Type |
|--------------|----------------|--------------|------------|
| 0 tokens     | None           | Persona      | Direct     |
| 10k tokens   | Top            | Memory       | Indirect   |
| 25k tokens   | Bottom         | Skill        | Ambiguous  |
| 50k tokens   | Both           | Template     | Complex    |

### Measurements
1. **Activation accuracy**: Right element activated?
2. **Response time**: How long to decide?
3. **Token usage**: Input + output tokens
4. **Cascade behavior**: Progressive disclosure?
5. **Conflict resolution**: What wins?

## The Real Question

**Does a capability index help Claude navigate a heavy context to activate the right elements (personas, memories, skills) more efficiently?**

Current answer: **We don't know, because we haven't tested it properly.**

---

*We've been testing the wrong thing in the wrong way.*
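### Appendix: Sketch of the Test Matrix as Data

A minimal, hypothetical sketch of how the matrix and the five measurements above could be recorded before any model client is wired in. Scenario values mirror the table; all names are placeholders, not existing DollhouseMCP code.

```python
"""Enumerate the proposed test matrix and hold one Measurement per run."""
from dataclasses import dataclass
from itertools import product
from typing import Optional, Tuple

CONTEXT_SIZES = (0, 10_000, 25_000, 50_000)            # tokens of prior context
INDEX_POSITIONS = ("none", "top", "bottom", "both")
ELEMENT_TYPES = ("persona", "memory", "skill", "template")
QUERY_TYPES = ("direct", "indirect", "ambiguous", "complex")


@dataclass
class Measurement:
    scenario: Tuple[int, str, str, str]
    activation_correct: Optional[bool] = None      # 1. right element activated?
    response_seconds: Optional[float] = None       # 2. how long to decide
    tokens_in: Optional[int] = None                # 3. input tokens
    tokens_out: Optional[int] = None               # 3. output tokens
    progressive_disclosure: Optional[bool] = None  # 4. cascade behavior
    conflict_winner: Optional[str] = None          # 5. "context" or "index"


def all_scenarios() -> list:
    """Full cross-product (256 rows); in practice, sample rows as in the table."""
    return [Measurement(scenario=s)
            for s in product(CONTEXT_SIZES, INDEX_POSITIONS,
                             ELEMENT_TYPES, QUERY_TYPES)]


if __name__ == "__main__":
    print(f"{len(all_scenarios())} scenarios defined")
```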
