# Feature Specification: Automated Overflow-Safety Test Suite

**Feature ID**: 006-overflow-safety-test-suite
**Status**: Draft
**Created**: 2025-10-16
**Last Updated**: 2025-10-16

---

## 1. Overview

### 1.1 Problem Statement

In production environments, the Hostaway MCP server can return oversized payloads or error bodies that bloat Claude's context window and potentially halt conversations:

- **Unbounded Data Returns**: Some endpoints (e.g., `list_properties`, `search_bookings`) can return hundreds of records without pagination
- **Large Error Bodies**: 500 errors may include full HTML stack traces (10-50KB+)
- **Context Window Exhaustion**: Large payloads consume Claude's token budget, preventing further conversation
- **Cost Overruns**: Excessive token usage increases API costs and degrades performance

**Current State**:
- No automated testing for payload size constraints
- Manual verification of pagination behavior
- Production incidents discovered only when context windows are exhausted
- No regression detection for endpoints that start returning unbounded data

**Desired State**:
- Automated, repeatable tests that validate payload size constraints
- Fast CI feedback before deployment
- Comprehensive coverage of all high-volume endpoints
- Regression detection for overflow scenarios

### 1.2 Goals & Objectives

**Primary Goals**:
1. **Safety by Default**: Guarantee all high-volume tools are paginated, chunked, or summarized
2. **Regression Detection**: Detect when endpoints start returning unbounded data
3. **Fast Feedback**: Provide CI validation before production deployment
4. **Production Confidence**: Ensure context window safety across all endpoints

**Success Metrics**:
- ✅ 0 tests exceed the hard token cap in steady state
- ✅ ≥95% of eligible endpoints return paginated/summarized results under forced-small caps
- ✅ CI runs complete in <5 minutes
- ✅ Test flake rate <1%
- ✅ 100% coverage of high-volume endpoints

### 1.3 Scope

**In Scope**:

**Endpoint Categories**:
1. **Properties**
   - `list_properties` - Pagination required
   - `get_property_details` - Projection/field limits
2. **Availability**
   - `check_availability` - Bounded time windows
3. **Bookings**
   - `search_bookings` - Paging + filters
   - `get_booking_details` - Projection/chunking
   - `get_guest_info` - Projection
4. **Financial**
   - `get_financial_reports` - Preview/summary + drilldown for large datasets
5. **MCP Resources** (if exposed)
   - `resources/list` - Cursor/nextCursor contract
   - `resources/read` - Size-bounded reads

**Cross-Cutting Behaviors**:
- Token-cap enforcement
- Preview mode activation
- Dynamic chunking
- Failure hygiene (500/429 errors)
- Rate-limit guidance

**Out of Scope**:
- ❌ Rewriting business logic or changing public schemas (beyond additive metadata)
- ❌ UI/agent-level changes
- ❌ Performance optimization (speed focus is on CI runtime, not endpoint performance)
- ❌ Schema validation (covered by existing contract tests)

### 1.4 Target Users

**Primary Users**:
1. **Platform/Agent Engineers**
   - Need: Prevent context overflows and runaway costs
   - Benefit: Automated validation before deployment
2. **SRE/QA Teams**
   - Need: Reproducible checks for pagination/summarization
   - Benefit: Fast regression detection in CI
3. **MCP Server Maintainers**
   - Need: Confidence in endpoint safety
   - Benefit: Clear test coverage metrics

**Secondary Users**:
- Product managers (cost control visibility)
- Security team (failure hygiene validation)
---

## 2. Requirements

### 2.1 Functional Requirements

#### FR1: Pagination Testing

**FR1.1**: `list_properties` Pagination
- **Requirement**: Endpoint MUST respect the `limit` parameter
- **Validation**:
  - Request with `limit=10` returns ≤10 properties
  - Returns `nextCursor` when more results are available
  - Subsequent calls with `cursor` return disjoint pages
  - Total results across pages match expected count
- **Priority**: P0 (Critical)

**FR1.2**: `search_bookings` Pagination
- **Requirement**: Endpoint MUST implement cursor-based pagination
- **Validation**:
  - Default page size respects `MCP_DEFAULT_PAGE_SIZE`
  - `nextCursor` present when `hasMore=true`
  - Cursor navigation returns stable, non-overlapping results
- **Priority**: P0 (Critical)

**FR1.3**: MCP Resources List Pagination
- **Requirement**: If resources are exposed, they MUST use the cursor/nextCursor contract
- **Validation**:
  - `resources/list` returns cursor metadata
  - Cursor-based traversal completes without duplicates
- **Priority**: P1 (High)

#### FR2: Preview & Summarization

**FR2.1**: Financial Reports Preview
- **Requirement**: Large date ranges MUST trigger preview mode
- **Validation**:
  - Projected tokens above the threshold return a summary + next steps
  - Summary includes: total records, date range, top insights
  - No raw oversized dump in preview mode
  - Provides cursor for drilldown pagination
- **Priority**: P0 (Critical)

**FR2.2**: Token Threshold Enforcement
- **Requirement**: Responses exceeding the soft threshold activate preview
- **Validation**:
  - Configure `MCP_OUTPUT_TOKEN_THRESHOLD=1000`
  - Large responses return a preview with `meta.summary`
  - Preview includes actionable next steps
- **Priority**: P0 (Critical)

#### FR3: Field Projection & Chunking

**FR3.1**: Property Details Projection
- **Requirement**: Support field selection to limit payload size
- **Validation**:
  - `fields` parameter filters the response
  - Projected response significantly smaller than the full payload
  - Large nested sections (reviews, availability) are chunked
- **Priority**: P1 (High)

**FR3.2**: Booking Details Chunking
- **Requirement**: Large booking records are chunked intelligently
- **Validation**:
  - Core booking data always present
  - Nested collections (line items, guests) paginated separately
  - Chunk metadata indicates more data available
- **Priority**: P1 (High)

**FR3.3**: Guest Info Projection
- **Requirement**: Return minimal guest data by default
- **Validation**:
  - Default fields exclude large collections
  - Optional `includeHistory` parameter for full data
- **Priority**: P2 (Medium)

#### FR4: Error Hygiene

**FR4.1**: 500 Error Compactness
- **Requirement**: 500 errors MUST return compact JSON-RPC errors (see the sketch following FR4.2)
- **Validation**:
  - No HTML stack traces in response
  - Error payload <2KB
  - Includes correlation ID for debugging
  - Actionable error message for client
- **Priority**: P0 (Critical)

**FR4.2**: 429 Rate Limit Metadata
- **Requirement**: Rate limit errors include retry guidance
- **Validation**:
  - Returns `retryAfter` in seconds
  - Includes `rateLimitRemaining` if available
  - Error message explains limit and retry timing
- **Priority**: P1 (High)
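To make FR4 concrete, here is a minimal Vitest sketch of how the suite could assert error hygiene. It assumes a hypothetical `mcpClient.callTool()` helper exposed by the harness and a mock that can be told to simulate upstream failures; the `simulate` parameter and the error field names (`correlationId`, `retryAfter`) are illustrative, not confirmed server behavior.

```typescript
// tests/overflow-safety/scenarios/error-hygiene.test.ts
// Sketch only: `mcpClient.callTool`, `simulate`, and the error field names are assumed.
import { describe, it, expect } from 'vitest'
import { mcpClient } from '../harness' // hypothetical shared client fixture

describe('error hygiene (FR4)', () => {
  it('returns a compact JSON-RPC error on upstream 500', async () => {
    // `simulate` is a hypothetical mock-only parameter that forces an upstream 500
    const error = await mcpClient.callTool('list_properties', { simulate: '500' })

    const body = JSON.stringify(error)
    expect(body.length).toBeLessThan(2048)           // FR4.1: payload <2KB
    expect(body).not.toMatch(/<html|<!doctype/i)     // FR4.1: no HTML stack traces
    expect(error.data?.correlationId).toBeDefined()  // FR4.1: correlation ID for debugging
  })

  it('includes retry guidance on upstream 429', async () => {
    const error = await mcpClient.callTool('list_properties', { simulate: '429' })

    expect(error.data?.retryAfter).toBeGreaterThan(0) // FR4.2: seconds until retry
    expect(error.message).toMatch(/rate limit/i)      // FR4.2: actionable message
  })
})
```

The 2 KB bound mirrors the FR4.1 validation criterion, and the `<html` check targets the stack-trace pages described in the problem statement.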
#### FR5: Token Cap Enforcement

**FR5.1**: Hard Cap Validation
- **Requirement**: No response exceeds `MCP_HARD_OUTPUT_TOKEN_CAP`
- **Validation**:
  - All endpoints tested with cap=5000 tokens
  - Responses automatically truncate/preview when approaching the cap
  - Truncation metadata indicates original size
- **Priority**: P0 (Critical)

**FR5.2**: Soft Cap Preview
- **Requirement**: Soft threshold triggers preview without truncation
- **Validation**:
  - `MCP_OUTPUT_TOKEN_THRESHOLD` triggers preview mode
  - Preview includes summary + next steps
  - No data loss, just deferred detail retrieval
- **Priority**: P0 (Critical)

### 2.2 Non-Functional Requirements

#### NFR1: Performance
- Test suite completes in <5 minutes in CI
- Individual tests complete in <10 seconds
- Parallel execution supported for test isolation

#### NFR2: Reliability
- Flake rate <1% over 100 runs
- Tests are deterministic and repeatable
- Mock failures don't cascade

#### NFR3: Maintainability
- Clear test names describing the scenario
- Reusable fixtures and utilities
- Documented test setup and teardown

#### NFR4: Coverage
- 100% of high-volume endpoints tested
- All pagination paths exercised
- Error scenarios comprehensively covered

#### NFR5: CI Integration
- Runs on every PR
- Blocks merge if tests fail
- Clear failure messages in PR comments

---

## 3. Assumptions & Constraints

### 3.1 Assumptions

1. **MCP Communication**: Server communicates over stdio (existing pattern)
2. **Hostaway API Access**: API is reachable OR can be mocked
3. **Configuration**: Environment variables available for cap tuning (see the config sketch following Section 3.3):
   - `MCP_OUTPUT_TOKEN_THRESHOLD` (default: 50000)
   - `MCP_HARD_OUTPUT_TOKEN_CAP` (default: 100000)
   - `MCP_DEFAULT_PAGE_SIZE` (default: 50)
4. **Backward Compatibility**: Only additive changes allowed (e.g., `nextCursor`, `meta.summary`)
5. **Test Environment**: Dedicated test database/fixtures available

### 3.2 Constraints

1. **Schema Stability**: Cannot break existing client contracts
2. **Performance**: CI time budget is 5 minutes total
3. **External Dependencies**: Hostaway API rate limits apply to live tests
4. **Token Estimation**: Approximation acceptable (±10% accuracy)
5. **Test Data**: Must not use real customer data

### 3.3 Dependencies

**Internal**:
- Existing MCP server (`src/api/main.py`)
- Token estimator utility (`src/utils/token_estimator.py`)
- Pagination service (`src/services/pagination_service.py`)
- Summarization service (`src/services/summarization_service.py`)

**External**:
- Hostaway API (for live integration tests)
- Vitest/Jest (test framework)
- TypeScript (test language)
- Mock server (http-mock-server or msw)
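As a sketch of how the harness could consume the cap-tuning variables from Section 3.1 (the file path and the `TestConfig` shape are illustrative; the variable names and defaults come from the assumption above):

```typescript
// tests/overflow-safety/config.ts
// Sketch only: reads the cap-tuning variables listed in Section 3.1.
export interface TestConfig {
  outputTokenThreshold: number // soft threshold that triggers preview mode
  hardOutputTokenCap: number   // hard cap that must never be exceeded
  defaultPageSize: number      // default pagination page size
}

function envInt(name: string, fallback: number): number {
  const raw = process.env[name]
  const parsed = raw ? Number.parseInt(raw, 10) : NaN
  return Number.isNaN(parsed) ? fallback : parsed
}

export function loadTestConfig(): TestConfig {
  return {
    outputTokenThreshold: envInt('MCP_OUTPUT_TOKEN_THRESHOLD', 50000),
    hardOutputTokenCap: envInt('MCP_HARD_OUTPUT_TOKEN_CAP', 100000),
    defaultPageSize: envInt('MCP_DEFAULT_PAGE_SIZE', 50),
  }
}
```

CI jobs can then force small values (e.g., `MCP_OUTPUT_TOKEN_THRESHOLD=1000`, as in FR2.2) to exercise preview paths without generating huge fixtures.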
---

## 4. Acceptance Criteria

### 4.1 Core Acceptance Criteria

**AC1**: Pagination Validation (see the test sketch at the end of Section 5)
- ✅ `list_properties` with `limit=10` returns ≤10 results
- ✅ `nextCursor` present when `hasMore=true`
- ✅ Cursor navigation returns disjoint pages
- ✅ All pages combined match expected total

**AC2**: Preview Mode Activation
- ✅ Financial reports with large date ranges return a preview
- ✅ Preview includes summary + cursor for drilldown
- ✅ No oversized raw dumps

**AC3**: Token Cap Enforcement
- ✅ Hard cap never exceeded across all endpoints
- ✅ Soft threshold triggers preview mode
- ✅ Truncation metadata present when needed

**AC4**: Error Hygiene
- ✅ 500 errors return compact JSON (<2KB)
- ✅ 429 errors include `retryAfter` metadata
- ✅ No HTML stack traces in any error response

**AC5**: CI Integration
- ✅ Test suite runs in <5 minutes
- ✅ Flake rate <1%
- ✅ Clear failure reporting in PRs

### 4.2 Test Coverage Requirements

**Coverage Targets**:
- ✅ 100% of high-volume endpoints tested
- ✅ All pagination paths exercised (first page, middle page, last page)
- ✅ Preview mode tested with various thresholds
- ✅ Error scenarios (500, 429, 400) covered
- ✅ Token cap boundaries tested (at threshold, over threshold, at hard cap)

### 4.3 Edge Cases

**EC1**: Empty Results
- Pagination metadata correct for 0 results
- No `nextCursor` when no results

**EC2**: Exactly One Page
- Single page returns all results without a cursor
- Subsequent cursor call returns empty

**EC3**: Token Estimation Edge Cases
- Very small responses (<100 tokens)
- Responses near the soft threshold (±5%)
- Responses at the hard cap boundary

**EC4**: Concurrent Requests
- Multiple paginated requests don't interfere
- Cursors remain stable across parallel calls

---

## 5. User Experience

### 5.1 Developer Experience

**Test Execution**:

```bash
# Run all overflow-safety tests
npm test -- overflow-safety

# Run specific endpoint tests
npm test -- overflow-safety/properties

# Run with custom caps (for debugging)
MCP_OUTPUT_TOKEN_THRESHOLD=1000 npm test -- overflow-safety

# Run live integration tests (slower)
TEST_MODE=live npm test -- overflow-safety/integration
```

**Test Output**:

```
✓ list_properties respects limit parameter (45ms)
✓ list_properties returns nextCursor when hasMore (52ms)
✓ list_properties cursor pagination returns disjoint pages (89ms)
✓ financial_reports activates preview for large date ranges (124ms)
✓ get_property_details respects field projection (38ms)
✗ search_bookings exceeds token threshold (234ms)
    Expected: <50000 tokens
    Actual: 67543 tokens
    Endpoint: search_bookings
    Parameters: {status: 'all', limit: 100}
```

### 5.2 CI/CD Experience

**PR Check Output**:

```markdown
## 🔒 Overflow-Safety Tests

**Status**: ❌ Failed

### Failed Tests (1)
- `search_bookings` exceeded soft token threshold
  - Expected: ≤50,000 tokens
  - Actual: 67,543 tokens
  - Fix: Reduce default page size or enable preview mode

### Coverage
- Endpoints tested: 12/12 (100%)
- Pagination paths: 36/36 (100%)
- Error scenarios: 8/8 (100%)

### Performance
- Duration: 3m 42s
- Flakes: 0
```

### 5.3 Monitoring & Alerts

**Production Metrics** (future enhancement):
- Token usage per endpoint (P95, P99)
- Preview mode activation rate
- Pagination cursor utilization
- Error response sizes
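For illustration, the disjoint-pages test named in the output above (and required by AC1, EC1, and EC2) might look roughly like the following. This is a sketch only: the `mcpClient.callTool()` helper, the response fields (`items`, `nextCursor`), and the empty-result filter are all assumptions about the eventual harness API.

```typescript
// tests/overflow-safety/scenarios/pagination.test.ts
// Sketch only: `mcpClient.callTool` and the response shape are assumed.
import { describe, it, expect } from 'vitest'
import { mcpClient } from '../harness' // hypothetical shared client fixture

describe('list_properties cursor pagination (AC1)', () => {
  it('respects limit and returns disjoint pages', async () => {
    const first = await mcpClient.callTool('list_properties', { limit: 10 })
    expect(first.items.length).toBeLessThanOrEqual(10)

    if (first.nextCursor) {
      const second = await mcpClient.callTool('list_properties', {
        limit: 10,
        cursor: first.nextCursor,
      })
      const firstIds = new Set(first.items.map((p: { id: string }) => p.id))
      // No overlap between consecutive pages
      expect(second.items.some((p: { id: string }) => firstIds.has(p.id))).toBe(false)
    }
  })

  it('omits nextCursor when there are no results (EC1)', async () => {
    // Hypothetical filter chosen to produce an empty result set
    const empty = await mcpClient.callTool('list_properties', { limit: 10, city: 'no-such-city' })
    expect(empty.items).toHaveLength(0)
    expect(empty.nextCursor).toBeUndefined()
  })
})
```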
---

## 6. Technical Approach

### 6.1 Architecture Overview

```
┌───────────────────────────────────────────────────┐
│              Test Harness (Vitest)                │
│                                                   │
│  ┌────────────────┐   ┌──────────────────┐        │
│  │ Test Scenarios │   │ Token Estimator  │        │
│  │ - Pagination   │   │ - Count tokens   │        │
│  │ - Preview      │   │ - Validate caps  │        │
│  │ - Errors       │   │                  │        │
│  └───────┬────────┘   └────────┬─────────┘        │
│          │                     │                  │
│          ▼                     ▼                  │
│  ┌─────────────────────────────────────────────┐  │
│  │             MCP Client (stdio)              │  │
│  └──────────────────────┬──────────────────────┘  │
└─────────────────────────┼─────────────────────────┘
                          │
                          ▼
┌───────────────────────────────────────────────────┐
│               MCP Server (Python)                 │
│                                                   │
│  ┌────────────┐ ┌──────────┐ ┌───────────┐        │
│  │ Pagination │ │ Preview  │ │ Token     │        │
│  │ Service    │ │ Service  │ │ Estimator │        │
│  └─────┬──────┘ └────┬─────┘ └─────┬─────┘        │
│        │             │             │              │
│        ▼             ▼             ▼              │
│  ┌─────────────────────────────────────────────┐  │
│  │            Endpoint Handlers                │  │
│  └──────────────────────┬──────────────────────┘  │
└─────────────────────────┼─────────────────────────┘
                          │
                          ▼
              ┌──────────────────────┐
              │  Mock Hostaway API   │
              │  (for unit tests)    │
              └──────────────────────┘
```

### 6.2 Test Categories

**Unit Tests** (fast, mocked):
- Pagination logic
- Token estimation accuracy
- Preview mode activation
- Error formatting

**Integration Tests** (slower, live API):
- End-to-end pagination flows
- Real token counts
- Error scenarios with actual API
- Rate limit handling

**Contract Tests**:
- Schema validation
- Backward compatibility
- Metadata presence (`nextCursor`, `meta.summary`)

### 6.3 Key Components

**Test Harness** (`tests/overflow-safety/`; a usage sketch follows at the end of this section):

```typescript
// tests/overflow-safety/harness.ts
export class OverflowTestHarness {
  constructor(
    private mcpClient: MCPClient,
    private tokenEstimator: TokenEstimator,
    private config: TestConfig
  ) {}

  async testPagination(
    endpoint: string,
    params: Record<string, any>
  ): Promise<PaginationTestResult> {
    // Test pagination behavior
  }

  async testPreview(
    endpoint: string,
    largeParams: Record<string, any>
  ): Promise<PreviewTestResult> {
    // Test preview mode activation
  }

  async testTokenCap(
    endpoint: string,
    params: Record<string, any>
  ): Promise<TokenCapTestResult> {
    // Validate token caps
  }
}
```

**Token Estimator** (existing, enhanced):

```python
# src/utils/token_estimator.py
class TokenEstimator:
    def estimate(self, text: str) -> int:
        """Estimate token count (±10% accuracy)"""

    def will_exceed_threshold(
        self,
        response: dict,
        threshold: int
    ) -> bool:
        """Check if response will exceed threshold"""

    def get_truncation_point(
        self,
        response: dict,
        max_tokens: int
    ) -> int:
        """Find safe truncation point"""
```

**Mock Server** (`tests/mocks/hostaway-api.ts`):

```typescript
// Mock Hostaway API for deterministic testing
export const mockHostawayAPI = setupServer(
  rest.get('/v1/listings', (req, res, ctx) => {
    const limit = parseInt(req.url.searchParams.get('limit') || '50')
    const offset = parseInt(req.url.searchParams.get('offset') || '0')

    return res(
      ctx.json({
        result: generateProperties(limit),
        count: 500, // Total available
        limit,
        offset
      })
    )
  })
)
```
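A sketch of how a scenario file might wire the harness and mock server above together. The `mcpClient`/`tokenEstimator` exports, the `loadTestConfig()` helper (from the Section 3 sketch), and the result fields `capExceeded`/`maxObservedTokens` are assumptions, not settled API.

```typescript
// tests/overflow-safety/scenarios/token-cap.test.ts
// Sketch only: exports beyond OverflowTestHarness and the result shape are assumed.
import { beforeAll, afterAll, describe, it, expect } from 'vitest'
import { OverflowTestHarness, mcpClient, tokenEstimator } from '../harness'
import { loadTestConfig } from '../config'
import { mockHostawayAPI } from '../../mocks/hostaway-api'

const config = loadTestConfig()

// Serve deterministic fixtures for the duration of the suite
beforeAll(() => mockHostawayAPI.listen())
afterAll(() => mockHostawayAPI.close())

describe('token cap enforcement (FR5.1)', () => {
  it('never exceeds the hard cap for unfiltered booking searches', async () => {
    const harness = new OverflowTestHarness(mcpClient, tokenEstimator, config)
    const result = await harness.testTokenCap('search_bookings', { status: 'all', limit: 100 })

    // Result fields are illustrative; the harness decides their final shape
    expect(result.capExceeded).toBe(false)
    expect(result.maxObservedTokens).toBeLessThanOrEqual(config.hardOutputTokenCap)
  })
})
```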
---

## 7. Testing Strategy

### 7.1 Test Matrix

| Endpoint | Pagination | Preview | Projection | Error | Token Cap |
|----------|-----------|---------|------------|-------|-----------|
| `list_properties` | ✅ | N/A | ✅ | ✅ | ✅ |
| `get_property_details` | N/A | ✅ | ✅ | ✅ | ✅ |
| `check_availability` | ✅ | N/A | N/A | ✅ | ✅ |
| `search_bookings` | ✅ | ✅ | ✅ | ✅ | ✅ |
| `get_booking_details` | N/A | ✅ | ✅ | ✅ | ✅ |
| `get_guest_info` | N/A | N/A | ✅ | ✅ | ✅ |
| `get_financial_reports` | ✅ | ✅ | ✅ | ✅ | ✅ |

### 7.2 Test Scenarios

**Pagination Scenarios**:
1. First page with limit
2. Middle page navigation
3. Last page (no nextCursor)
4. Empty results
5. Single page results
6. Cursor stability across retries

**Preview Scenarios**:
1. Below threshold (no preview)
2. At threshold boundary
3. Above threshold (preview activated)
4. Preview with drilldown cursor
5. Preview summary accuracy

**Token Cap Scenarios**:
1. Well below soft threshold
2. At soft threshold (±5%)
3. Between soft and hard cap
4. At hard cap boundary
5. Forced truncation

**Error Scenarios**:
1. 500 with correlation ID
2. 429 with retryAfter
3. 400 with validation details
4. Network timeout
5. Invalid cursor

### 7.3 Test Data Strategy

**Fixtures** (`tests/fixtures/`):

```
fixtures/
├── properties/
│   ├── small-set.json      (10 properties)
│   ├── medium-set.json     (100 properties)
│   └── large-set.json      (1000 properties)
├── bookings/
│   ├── minimal.json        (1 booking, core fields)
│   ├── detailed.json       (1 booking, all fields)
│   └── collection.json     (50 bookings)
└── errors/
    ├── 500-html.html       (large error body)
    ├── 500-compact.json    (desired format)
    └── 429-rate-limit.json
```

**Generators**:

```typescript
// tests/fixtures/generators.ts
export function generateProperties(count: number): Property[] {
  // Generate realistic property data
}

export function generateLargeFinancialReport(): FinancialReport {
  // Generate report that exceeds threshold
}
```

---

## 8. Risks & Mitigations

### 8.1 Technical Risks

**Risk 1**: Flaky External API
- **Impact**: High (unreliable tests)
- **Probability**: Medium
- **Mitigation**:
  - Use HTTP mocks for unit tests
  - Run live integration tests nightly only
  - Implement retry logic with exponential backoff (see the sketch following Section 8.3)

**Risk 2**: Token Estimation Inaccuracy
- **Impact**: Medium (false positives/negatives)
- **Probability**: Low
- **Mitigation**:
  - Validate estimator against the real Claude API
  - Use ±10% tolerance bands
  - Log discrepancies for tuning

**Risk 3**: Test Performance Degradation
- **Impact**: Medium (slow CI)
- **Probability**: Medium
- **Mitigation**:
  - Parallel test execution
  - Mock by default, live tests opt-in
  - Optimize fixture generation

### 8.2 Product Risks

**Risk 4**: Breaking Client Compatibility
- **Impact**: High (production incidents)
- **Probability**: Low
- **Mitigation**:
  - Additive-only changes
  - Contract tests validate schemas
  - Feature flags for gradual rollout

**Risk 5**: Hidden Client Reliance on Unbounded Payloads
- **Impact**: High (broken integrations)
- **Probability**: Low
- **Mitigation**:
  - Audit existing client usage
  - Provide migration guide
  - Support legacy mode temporarily

### 8.3 Operational Risks

**Risk 6**: CI Pipeline Instability
- **Impact**: Medium (blocked PRs)
- **Probability**: Low
- **Mitigation**:
  - Quarantine flaky tests
  - Manual override for known issues
  - Test health dashboard
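The retry-with-exponential-backoff mitigation for Risk 1 could be as simple as the helper below (a generic sketch; the module path, attempt count, and base delay are illustrative):

```typescript
// tests/overflow-safety/utils/retry.ts
// Sketch only: generic backoff wrapper for flaky live-API calls.
export async function withBackoff<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 500
): Promise<T> {
  let lastError: unknown
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await fn()
    } catch (err) {
      lastError = err
      // Exponential backoff: 500ms, 1s, 2s, ...
      const delay = baseDelayMs * 2 ** attempt
      await new Promise((resolve) => setTimeout(resolve, delay))
    }
  }
  throw lastError
}
```

Live integration tests would wrap Hostaway calls in `withBackoff(...)` so a transient 429 or timeout does not fail the nightly run outright.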
---

## 9. Implementation Plan

### 9.1 Phases

**Phase 1: Foundation** (Week 1)
- Set up test harness infrastructure
- Create mock Hostaway API
- Implement token estimator utility
- Basic pagination tests for 2 endpoints

**Phase 2: Core Coverage** (Week 2)
- Complete pagination tests for all endpoints
- Implement preview mode tests
- Add error hygiene tests
- CI integration

**Phase 3: Edge Cases** (Week 3)
- Token cap boundary tests
- Concurrent request tests
- Field projection validation
- Performance optimization

**Phase 4: Integration & Documentation** (Week 4)
- Live integration tests
- Test documentation
- Runbook for CI failures
- Metrics dashboard

### 9.2 Deliverables

**Code**:
- [ ] Test harness (`tests/overflow-safety/harness.ts`)
- [ ] Test scenarios (`tests/overflow-safety/scenarios/`)
- [ ] Mock server (`tests/mocks/hostaway-api.ts`)
- [ ] Token estimator enhancements (`src/utils/token_estimator.py`)
- [ ] CI workflow (`.github/workflows/overflow-safety.yml`)

**Documentation**:
- [ ] Test guide (`docs/OVERFLOW_SAFETY_TESTING.md`)
- [ ] Runbook (`docs/OVERFLOW_SAFETY_RUNBOOK.md`)
- [ ] Configuration reference (`docs/TOKEN_CAP_CONFIG.md`)

**Infrastructure**:
- [ ] Test fixtures (`tests/fixtures/`)
- [ ] CI pipeline configuration
- [ ] Metrics collection (optional)

---

## 10. Success Criteria & Validation

### 10.1 Launch Criteria

**Before Merge to Main**:
- ✅ All P0 tests passing
- ✅ CI runtime <5 minutes
- ✅ Flake rate <1% over 100 runs
- ✅ Code review approved
- ✅ Documentation complete

**Before Production**:
- ✅ Live integration tests passing
- ✅ No regressions in existing functionality
- ✅ Metrics dashboard deployed (if included)

### 10.2 Post-Launch Validation

**Week 1**:
- Monitor CI success rate (target: >99%)
- Track test execution time (target: <5 min)
- Collect feedback from engineers

**Month 1**:
- Review production token usage metrics
- Validate preview mode activation rates
- Assess regression detection effectiveness

### 10.3 Rollback Plan

**Triggers**:
- CI success rate <95%
- Test execution time >10 minutes
- False positive rate >5%

**Actions**:
1. Disable failing tests in CI
2. Investigate root cause
3. Fix or remove problematic tests
4. Re-enable after validation

---

## 11. Open Questions

1. **Token Estimator Accuracy**: What is an acceptable tolerance for token estimation? (Proposal: ±10%; see the validation sketch following this list)
2. **Live Test Frequency**: Should live integration tests run on every PR or nightly? (Proposal: Nightly, for cost/speed)
3. **Failure Threshold**: Should CI block on soft threshold violations or only on the hard cap? (Proposal: Block on hard cap, warn on soft)
4. **Mock Data Volume**: How many test fixtures are needed per endpoint? (Proposal: 3 sizes - small, medium, large)
5. **Metric Collection**: Should we collect production metrics for token usage? (Proposal: Yes, via existing telemetry)
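If Open Question 1 settles on ±10%, that tolerance could be pinned down by a test like the sketch below. Everything here is assumed: `estimateTokens` stands in for however the TypeScript harness reaches the Python estimator (a port or a bridge), and `token-samples.json` would be a hypothetical fixture of texts with pre-counted reference token counts.

```typescript
// tests/overflow-safety/scenarios/token-estimator.test.ts
// Sketch only: the estimator binding and the reference fixture are assumed.
import { describe, it, expect } from 'vitest'
import { estimateTokens } from '../token-estimator'               // hypothetical bridge to src/utils/token_estimator.py
import referenceSamples from '../../fixtures/token-samples.json'  // hypothetical: [{ text, actualTokens }]

describe('token estimator accuracy (Open Question 1)', () => {
  it('stays within ±10% of reference token counts', () => {
    for (const { text, actualTokens } of referenceSamples) {
      const estimate = estimateTokens(text)
      const relativeError = Math.abs(estimate - actualTokens) / actualTokens
      expect(relativeError).toBeLessThanOrEqual(0.1)
    }
  })
})
```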
---

## 12. Appendix

### 12.1 Related Documents

- [Token Budget Management](../003-we-need-to/spec.md)
- [Pagination Service](../../src/services/pagination_service.py)
- [Summarization Service](../../src/services/summarization_service.py)
- [API Contract Tests](../../tests/integration/)

### 12.2 Glossary

- **Hard Cap**: Maximum token limit that cannot be exceeded (triggers truncation)
- **Soft Threshold**: Token limit that triggers preview mode (no truncation)
- **Preview Mode**: Summarized response with a cursor for full data retrieval
- **Cursor**: Opaque token for pagination navigation
- **Field Projection**: Filtering a response to include only specified fields
- **Chunking**: Breaking large responses into smaller logical pieces

### 12.3 References

- [MCP Specification](https://spec.modelcontextprotocol.io/)
- [Claude Context Window Limits](https://docs.anthropic.com/claude/docs/models-overview)
- [Hostaway API Documentation](https://docs.hostaway.com/)
- [Vitest Documentation](https://vitest.dev/)

---

**Document History**:
- 2025-10-16: Initial draft (v1.0)
