# Feature Specification: MCP Server Context Window Protection

**Feature Branch**: `005-project-brownfield-hardening`
**Created**: 2025-10-15
**Status**: Draft
**Input**: User description: "Project: Brownfield hardening of an existing MCP server used by Claude. Context & Problem: We have an existing MCP server integrated with Claude; large tool/resource outputs are overrunning the model's context window, degrading quality and causing truncation or hard errors..."

## User Scenarios & Testing *(mandatory)*

### User Story 1 - Paginated List Results (Priority: P1)

Claude requests a list of resources (e.g., listings, bookings, transactions) from the MCP server and receives a manageable first page of results with clear instructions on how to fetch additional pages if needed.

**Why this priority**: This addresses the most common cause of context overflow: unbounded list operations. Without pagination, a single tool call can return hundreds or thousands of items, immediately consuming the entire context window. This is the foundation for all other improvements.

**Independent Test**: Can be fully tested by calling any list endpoint with a large dataset and verifying it returns only a limited number of items plus pagination metadata, delivering immediate value by preventing context overflow on the most common operations.

**Acceptance Scenarios**:

1. **Given** Claude requests a list of 500 bookings, **When** the server processes the request, **Then** the response contains the first 50 bookings, a total count of 500, and a cursor for fetching the next page (a response-shape sketch follows these scenarios)
2. **Given** Claude uses a pagination cursor from a previous response, **When** requesting the next page, **Then** the server returns the next batch of items with an updated cursor
3. **Given** Claude requests a list with only 10 total items, **When** the server processes the request, **Then** all items are returned without pagination metadata (no unnecessary overhead for small datasets)
4. **Given** Claude reaches the final page of results, **When** processing the response, **Then** no next cursor is provided, indicating completion
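As a concrete illustration of the response shape this story implies, here is a minimal TypeScript sketch of cursor-based pagination over an in-memory result set. Only the field names (`limit`, `cursor`, `totalCount`, `nextCursor`) and the 50/200 page-size defaults come from the requirements later in this spec; the `paginate`, `encodeCursor`, and `decodeCursor` helpers are hypothetical.

```typescript
// Minimal sketch: cursor-based pagination over an in-memory result set.
// Field names and the 50/200 defaults mirror FR-002..FR-005; the helpers
// and cursor encoding are assumptions, not the settled implementation.
interface PaginatedResponse<T> {
  items: T[];
  totalCount: number;
  nextCursor?: string; // omitted on the final page (FR-004)
}

const DEFAULT_LIMIT = 50;
const MAX_LIMIT = 200;

// Opaque cursor: base64-encoded offset plus a snapshot timestamp. A real
// cursor would also be signed and carry ordering context (see Key Entities).
function encodeCursor(offset: number, snapshotMs: number): string {
  return Buffer.from(JSON.stringify({ offset, snapshotMs })).toString("base64");
}

function decodeCursor(cursor: string): { offset: number; snapshotMs: number } {
  return JSON.parse(Buffer.from(cursor, "base64").toString("utf8"));
}

function paginate<T>(
  all: T[],
  limit: number = DEFAULT_LIMIT,
  cursor?: string,
): PaginatedResponse<T> {
  const pageSize = Math.min(limit, MAX_LIMIT);
  const offset = cursor ? decodeCursor(cursor).offset : 0;
  const next = offset + pageSize;
  return {
    items: all.slice(offset, next),
    totalCount: all.length,
    // Only issue a cursor when more pages exist (scenario 4).
    ...(next < all.length ? { nextCursor: encodeCursor(next, Date.now()) } : {}),
  };
}
```

For brevity the sketch omits scenario 3's shortcut (returning small result sets without pagination metadata at all).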
---

### User Story 2 - Automatic Response Summarization (Priority: P1)

When Claude requests data that would exceed safe token limits, the server automatically provides a condensed summary with key information and instructions for accessing complete details on demand.

**Why this priority**: Even with pagination, individual items can be verbose (e.g., a booking with extensive guest details, property amenities, and pricing breakdowns). Summarization prevents single-item bloat while keeping the most important information accessible. This complements pagination by handling per-item verbosity.

**Independent Test**: Can be fully tested by requesting a resource with extensive details and verifying the response contains only essential fields with instructions for accessing full details, delivering value by preventing single-item context overflow.

**Acceptance Scenarios**:

1. **Given** Claude requests booking details that exceed 2000 tokens when fully expanded, **When** the server processes the request, **Then** a summary with essential fields (ID, status, guest name, dates, property) is returned with instructions to request specific detail categories (a summary-shape sketch follows these scenarios)
2. **Given** Claude requests full financial details after receiving a summary, **When** the server processes the detailed request, **Then** only the financial section is returned in full detail
3. **Given** Claude requests a small resource under 500 tokens, **When** the server processes the request, **Then** full details are returned without summarization (no unnecessary complexity for small responses)
4. **Given** a summary response includes token budget information, **When** Claude processes the response, **Then** the metadata shows estimated tokens used and available budget remaining
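A minimal sketch of what a summarized booking could look like, with hypothetical field values. The essential fields mirror scenario 1 and the `tokenBudget` keys mirror FR-012; the `detailSections` mechanism is one assumed way to express "fetch details on demand", not a settled design.

```typescript
// Hypothetical summarized-booking response shape. Values are illustrative.
const summarizedBooking = {
  summary: {
    id: "bk_1842",
    status: "confirmed",
    guestName: "A. Rivera",
    checkIn: "2025-11-03",
    checkOut: "2025-11-08",
    property: "lst_0291",
  },
  // Request one of these by name to receive that section in full detail
  // (scenario 2); the mechanism is an assumption.
  detailSections: ["financial", "guestProfile", "amenities"],
  tokenBudget: {
    estimatedTokens: 410,
    budgetUsed: 410,
    budgetRemaining: 3590, // against the default 4000-token threshold
  },
};
```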
---

### User Story 3 - Configurable Token Budgets (Priority: P2)

Operations teams can configure token limits and response behaviors through environment variables or configuration files without code changes, allowing tailored limits for different deployment environments or use cases.

**Why this priority**: Different environments and workflows have different needs. Development may need larger limits for debugging, while production prioritizes strict controls. This enables the same codebase to serve multiple use cases safely.

**Independent Test**: Can be fully tested by modifying configuration values and verifying the server respects new limits without restart, delivering value by allowing runtime adaptation to different operational requirements.

**Acceptance Scenarios**:

1. **Given** an operations engineer updates the token limit configuration from 4000 to 8000 tokens, **When** the configuration is reloaded, **Then** subsequent responses respect the new 8000-token limit
2. **Given** a feature flag enables pagination for a specific endpoint, **When** that endpoint is called, **Then** pagination is applied while other endpoints remain unchanged
3. **Given** default configuration values are documented, **When** a new deployment uses defaults, **Then** safe production limits (4000 tokens per response, 50 items per page) are automatically applied
4. **Given** invalid configuration is provided, **When** the server attempts to reload, **Then** the current configuration remains active and an error is logged

---

### User Story 4 - Response Chunking for Large Content (Priority: P2)

When Claude requests large text content (logs, descriptions, documents), the server intelligently chunks the content into digestible pieces with semantic boundaries, allowing progressive reading without overwhelming the context.

**Why this priority**: Some content is inherently large (multi-page descriptions, extensive logs) but cannot be meaningfully summarized. Chunking provides a way to access this content progressively while respecting context limits.

**Independent Test**: Can be fully tested by requesting a 50KB log file and verifying it's returned in appropriately sized chunks with continuation cursors, delivering value by making large content accessible without context overflow.

**Acceptance Scenarios**:

1. **Given** Claude requests a log file with 10,000 lines, **When** the server processes the request, **Then** the first 200 lines are returned with a cursor for the next chunk and the total line count
2. **Given** the server chunks content at semantic boundaries (log entry completion, paragraph breaks), **When** returning chunks, **Then** no chunks split mid-sentence or mid-entry
3. **Given** Claude requests a specific line range in a large file, **When** the server processes the request, **Then** only the requested range is returned efficiently
4. **Given** content is under 1000 tokens, **When** the server processes the request, **Then** the full content is returned without chunking

---

### User Story 5 - Telemetry and Monitoring (Priority: P3)

Engineers can observe token usage patterns, pagination adoption, and context overflow attempts through dashboards and logs, enabling data-driven optimization and early detection of issues.

**Why this priority**: While not directly user-facing, observability enables continuous improvement and proactive issue detection. This builds on the foundation of pagination and summarization by providing insights into their effectiveness.

**Independent Test**: Can be fully tested by making various API calls and verifying metrics are accurately captured and queryable, delivering value by enabling operational visibility into context management.

**Acceptance Scenarios**:

1. **Given** the server processes 100 requests, **When** reviewing metrics, **Then** each request records estimated tokens, actual bytes, item count, and processing latency
2. **Given** a request exceeds the token budget and triggers summarization, **When** the event is logged, **Then** the log includes the original size, summarized size, and reduction percentage
3. **Given** pagination is used on a request, **When** viewing metrics, **Then** pagination usage is tracked separately from non-paginated requests
4. **Given** the operations team sets up alerts, **When** oversized response attempts exceed 5% of traffic, **Then** an alert is triggered for investigation

---

### Edge Cases

- What happens when a cursor becomes invalid (data changed, expired)? → Server returns a clear error indicating cursor invalidity and suggests re-querying from the start
- How does the system handle malformed pagination parameters? → Server validates parameters and returns descriptive errors (e.g., "invalid cursor format", "limit exceeds maximum of 200"); see the validation sketch after this list
- What if token estimation is significantly wrong (estimates 1000, actual 5000)? → Server monitors estimation accuracy and logs large discrepancies for model tuning; falls back to hard byte limits as a safety net
- How does summarization handle data types with no clear summary format (binary, complex nested objects)? → Server uses field projection (select subset of fields) rather than summarization; provides metadata about available fields
- What happens when multiple pages are requested in rapid succession? → Server caches pagination state temporarily (10 minutes) to ensure consistency even as underlying data changes
- How does chunking handle real-time streaming data? → Server snapshots data at request time for consistent pagination; real-time updates require a new query
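The parameter-validation behavior above might look like the following TypeScript sketch. The error strings come from the edge cases themselves; the `PaginationError` shape and the `isWellFormedCursor` helper are assumptions.

```typescript
// Illustrative validation for malformed pagination parameters. Error strings
// follow the edge cases above; the shapes and helpers are hypothetical.
interface PaginationError {
  error: string;
  hint?: string;
}

function validatePaginationParams(limit?: number, cursor?: string): PaginationError | null {
  if (limit !== undefined && (!Number.isInteger(limit) || limit < 1)) {
    return { error: "invalid limit: must be a positive integer" };
  }
  if (limit !== undefined && limit > 200) {
    return { error: "limit exceeds maximum of 200" };
  }
  if (cursor !== undefined && !isWellFormedCursor(cursor)) {
    return { error: "invalid cursor format", hint: "re-query from the start to obtain a fresh cursor" };
  }
  return null; // parameters are acceptable
}

// Assumes the base64/JSON cursor encoding sketched earlier.
function isWellFormedCursor(cursor: string): boolean {
  try {
    JSON.parse(Buffer.from(cursor, "base64").toString("utf8"));
    return true;
  } catch {
    return false;
  }
}
```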
## Requirements *(mandatory)*

### Functional Requirements

#### Pagination

- **FR-001**: System MUST support cursor-based pagination on all list/search endpoints that can return more than 50 items
- **FR-002**: System MUST accept an optional `limit` parameter (default: 50, maximum: 200) specifying items per page
- **FR-003**: System MUST accept an optional `cursor` parameter (opaque string) for fetching subsequent pages
- **FR-004**: Paginated responses MUST include a `nextCursor` field when additional pages exist, omitted on the final page
- **FR-005**: Paginated responses MUST include a `totalCount` field showing total items available across all pages
- **FR-006**: System MUST return items in consistent order within a pagination sequence (sorted by creation time descending by default)
- **FR-007**: Pagination cursors MUST remain valid for at least 10 minutes after issuance
- **FR-008**: System MUST return a descriptive error when a cursor is invalid or expired

#### Token Budget Management

- **FR-009**: System MUST estimate the token count for response payloads before sending
- **FR-010**: System MUST switch to summarization mode when the estimated response exceeds the configured token threshold (default: 4000 tokens)
- **FR-011**: Summarized responses MUST include instructions for fetching detailed sections on demand
- **FR-012**: System MUST include token budget metadata in responses: `estimatedTokens`, `budgetUsed`, `budgetRemaining`
- **FR-013**: Token estimation MUST use character-based approximation (1 token ≈ 4 characters) with a 20% safety margin, as sketched below
- **FR-014**: System MUST log actual vs. estimated token counts for accuracy monitoring
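A minimal sketch of the estimation rule in FR-013 and the mode switch in FR-010; the function names are illustrative, and the constants come straight from the requirements above.

```typescript
// Character-based token estimation per FR-013: 1 token ≈ 4 characters,
// inflated by a 20% safety margin. Function names are hypothetical.
const CHARS_PER_TOKEN = 4;
const SAFETY_MARGIN = 1.2;

function estimateTokens(payload: string): number {
  return Math.ceil((payload.length / CHARS_PER_TOKEN) * SAFETY_MARGIN);
}

// FR-010: switch to summarization when the estimate exceeds the threshold
// (default 4000 tokens).
function shouldSummarize(payload: string, thresholdTokens = 4000): boolean {
  return estimateTokens(payload) > thresholdTokens;
}
```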
#### Summarization and Compression

- **FR-015**: System MUST provide condensed field sets for verbose data structures (bookings, listings, financial records)
- **FR-016**: Summaries MUST retain all identifying fields (IDs, names, primary status) while compressing verbose details
- **FR-017**: System MUST support field projection via a `fields` parameter allowing selective retrieval (e.g., `fields=id,status,total`)
- **FR-018**: Long text fields (descriptions, notes) MUST be truncated in summaries with a character count and continuation instructions
- **FR-019**: System MUST strip redundant metadata from tool responses (verbose schemas, debug info) in production mode

#### Content Chunking

- **FR-020**: System MUST chunk large text content (logs, documents) when it exceeds 2000 tokens
- **FR-021**: Chunks MUST respect semantic boundaries (complete log entries, paragraph breaks, sentence completion); see the sketch after this list
- **FR-022**: Chunked responses MUST include chunk position metadata: `chunkIndex`, `totalChunks`, `nextCursor`
- **FR-023**: System MUST support range-based requests for large content (e.g., `startLine=100&endLine=200`)
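One possible boundary-respecting chunker satisfying FR-020 through FR-022, sketched under the 4-characters-per-token heuristic from FR-013. Splitting oversized single paragraphs at sentence boundaries (also required by FR-021) is omitted for brevity, and the cursor format is a placeholder.

```typescript
// Sketch: split text on paragraph breaks so no chunk ends mid-entry, and
// attach the position metadata named in FR-022. Helper names are illustrative.
interface ContentChunk {
  text: string;
  chunkIndex: number;
  totalChunks: number;
  nextCursor?: string;
}

function chunkText(content: string, maxTokensPerChunk = 2000): ContentChunk[] {
  const maxChars = maxTokensPerChunk * 4; // 4-chars-per-token heuristic
  const paragraphs = content.split(/\n\n+/);
  const pieces: string[] = [];
  let current = "";
  for (const p of paragraphs) {
    // Start a new chunk rather than splitting a paragraph across chunks.
    // (A single paragraph larger than maxChars would still need
    // sentence-level splitting, omitted here.)
    if (current && current.length + p.length + 2 > maxChars) {
      pieces.push(current);
      current = "";
    }
    current = current ? `${current}\n\n${p}` : p;
  }
  if (current) pieces.push(current);
  return pieces.map((text, i) => ({
    text,
    chunkIndex: i,
    totalChunks: pieces.length,
    // Placeholder cursor format; a real one would be opaque and signed.
    ...(i + 1 < pieces.length ? { nextCursor: `chunk:${i + 1}` } : {}),
  }));
}
```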
#### Configuration

- **FR-024**: System MUST read configuration from environment variables with documented defaults
- **FR-025**: System MUST support a configuration file (YAML/JSON) for complex settings
- **FR-026**: System MUST allow runtime configuration reload without a server restart
- **FR-027**: Configuration MUST include: `defaultPageSize`, `maxPageSize`, `tokenBudgetThreshold`, `chunkSize`, and feature flags per endpoint
- **FR-028**: Invalid configuration MUST be rejected with clear error messages, falling back to the previous valid config

#### Telemetry

- **FR-029**: System MUST record per-request metrics: `estimatedTokens`, `responseBytes`, `itemCount`, `latency`, `paginationUsed`, `summarizationUsed`
- **FR-030**: System MUST log oversized response attempts (exceeded budget before summarization) with original and final sizes
- **FR-031**: Metrics MUST be queryable by endpoint, time range, and outcome (paginated/summarized/chunked)
- **FR-032**: System MUST expose a health endpoint reporting pagination adoption rate and average response size

#### Compatibility

- **FR-033**: All new fields (pagination metadata, token budgets) MUST be additive to existing response structures
- **FR-034**: Existing clients not using pagination MUST continue receiving the first page by default (backwards compatible)
- **FR-035**: Feature flags MUST allow gradual rollout per endpoint without affecting other endpoints
- **FR-036**: System MUST support an A/B testing mode where responses can be compared with and without hardening features

### Key Entities

- **PaginationCursor**: Opaque token encoding a position in the result set, valid for 10 minutes; includes snapshot timestamp and ordering context
- **TokenBudget**: Configuration and runtime tracking of token limits per response; includes threshold, estimation method, and actual usage
- **ResponseSummary**: Condensed representation of verbose data with essential fields retained and instructions for accessing detailed sections
- **ContentChunk**: Segment of large text content with semantic boundaries, position metadata, and a continuation cursor
- **TelemetryRecord**: Per-request metrics capturing token usage, response characteristics, and optimization strategies applied (illustrative TypeScript shapes follow this list)
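Rendered as TypeScript, the key entities might look like the following. Field names follow the descriptions above plus FR-012, FR-022, and FR-029; the exact shapes are assumptions, not a settled schema.

```typescript
// Hypothetical renderings of three of the key entities above.
interface PaginationCursor {
  position: string;    // opaque encoding of the position in the result set
  snapshotAt: number;  // snapshot timestamp (epoch ms)
  ordering: string;    // ordering context, e.g. "createdAt:desc" (FR-006)
  expiresAt: number;   // issuance time plus the 10-minute TTL (FR-007)
}

interface TokenBudget {
  threshold: number;          // default 4000 tokens (FR-010)
  estimationMethod: string;   // e.g. "chars/4 + 20% margin" (FR-013)
  estimatedTokens: number;    // FR-012 metadata
  budgetUsed: number;
  budgetRemaining: number;
}

interface TelemetryRecord {
  endpoint: string;
  estimatedTokens: number;    // per-request metrics from FR-029
  responseBytes: number;
  itemCount: number;
  latencyMs: number;
  paginationUsed: boolean;
  summarizationUsed: boolean;
}
```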
## Success Criteria *(mandatory)*

### Measurable Outcomes

- **SC-001**: Claude workflows complete multi-step tasks without context overflow errors in 99.9% of sessions (up from ~60% today due to overflows)
- **SC-002**: Average token usage per tool response drops by 60% or more (from ~6000 tokens to ~2400 tokens)
- **SC-003**: 95% of list endpoints use pagination within 2 weeks of deployment
- **SC-004**: Support tickets related to "truncated responses" or "context errors" decrease by 80% within 1 month
- **SC-005**: Response time increases by no more than 10% (p95 latency) due to token estimation and summarization overhead
- **SC-006**: Engineers can diagnose context issues using telemetry dashboards within 5 minutes (vs. 30+ minutes currently)
- **SC-007**: Users successfully navigate paginated results in 95% of attempts without confusion or errors
- **SC-008**: Token estimation accuracy is within 20% of actual usage for 90% of responses

## Assumptions *(mandatory)*

- Current MCP server uses JSON for request/response serialization
- Claude Desktop is the primary client and supports opaque pagination cursors
- Server has access to the response payload before sending (can intercept and analyze)
- Token estimation can use simple character-based heuristics (4 chars/token) with acceptable accuracy
- Most data access patterns are read-heavy (lists, searches, reads) vs. write-heavy
- Pagination cursors stored in memory (Redis/similar) with a 10-minute TTL are acceptable
- Existing endpoints return predictable JSON structures amenable to field projection
- Operations team can deploy configuration changes independently of code deployments
- Telemetry can be stored in a time-series database or log aggregation system
- Backwards compatibility requirement applies to external clients only (internal tools can adapt immediately)

## Dependencies *(include if applicable)*

- Token estimation library or service capable of approximating Claude token counts from text
- Configuration management system supporting hot-reload (e.g., file watchers, environment variable updates)
- Metrics collection infrastructure (e.g., Prometheus, StatsD, CloudWatch)
- Pagination cursor storage with TTL support (e.g., Redis, in-memory cache)
- Existing MCP server framework extensible via middleware or interceptors

## Open Questions *(include if blocking issues exist)*

None; all critical decisions have reasonable defaults documented in the Requirements and Assumptions sections.

## Non-Functional Requirements *(include if applicable)*

### Performance

- Pagination overhead MUST NOT exceed 50ms per request (cursor lookup, state management)
- Token estimation MUST complete within 20ms for responses up to 100KB
- Configuration reload MUST complete within 100ms without dropping in-flight requests
- Metric recording MUST NOT add more than 10ms of latency to request processing

### Reliability

- Pagination cursor expiration MUST be handled gracefully with clear user guidance (re-query instructions)
- Token estimation failures MUST fall back to byte-based limits (safe default) rather than failing requests
- Configuration errors MUST NOT crash the server or drop traffic (fail safe to the previous config)
- Telemetry failures MUST NOT affect request success (fire-and-forget metric recording)

### Security

- Pagination cursors MUST NOT expose internal data structures or query details (opaque, signed)
- Token budget metadata MUST NOT leak sensitive information about server internals
- Configuration changes MUST be validated before application (schema validation, range checks)

### Observability

- All pagination, summarization, and chunking decisions MUST be logged at INFO level
- Token estimation accuracy MUST be sampled (1% of requests) and reported to metrics
- Configuration changes MUST be logged with old/new values and a timestamp
- Oversized response attempts MUST trigger alerts when exceeding 5% of traffic

## Risks & Mitigations

### Risk 1: Hidden Client Dependencies on Unbounded Payloads

**Likelihood**: Medium | **Impact**: High

**Description**: External clients may rely on receiving complete datasets in single responses (e.g., automated scripts expecting all 1000 listings at once).

**Mitigation**:

- Feature flags per endpoint for gradual rollout
- Default page size starts at 200 items during rollout (higher than the eventual 50-item default) for a soft landing
- Backwards compatibility: clients not using pagination still get the first page
- Contract tests validate no breaking changes to response schemas
- Canary deployment: 5% → 25% → 50% → 100% over 2 weeks with rollback capability

### Risk 2: Pagination Increases Round-trips

**Likelihood**: High | **Impact**: Medium

**Description**: Multi-page workflows require multiple API calls, potentially slowing down Claude's analysis.

**Mitigation**:

- Intelligent default page size (50 items) balances single-page and multi-page scenarios
- Server-side prefetch for common patterns (e.g., the "next page" cursor pre-generated)
- Batch operations where possible (e.g., "get pages 1-3" in a single call)
- Telemetry identifies endpoints with excessive pagination and allows tuning

### Risk 3: Token Estimation Inaccuracy

**Likelihood**: Medium | **Impact**: Medium

**Description**: Character-based token estimation may be significantly wrong for certain content types (code, special characters, non-English text).

**Mitigation**:

- 20% safety margin built into estimation
- Continuous monitoring of estimated vs. actual token counts, with model retraining
- Hard byte limits as a secondary safety net (10KB regardless of token estimate)
- Sampling mode for accuracy testing (1% of responses validated against the actual tokenizer)

### Risk 4: Configuration Errors

**Likelihood**: Low | **Impact**: High

**Description**: Invalid configuration could break pagination, disable summarization, or cause cascading failures.

**Mitigation**:

- Schema validation for all configuration values before application
- Fail-safe defaults: invalid config reverts to the previous known-good state
- Configuration changes logged and audited
- Dry-run mode for testing configuration before applying to production

## Out of Scope

- Changing the MCP protocol or transport mechanism
- Client-side UI changes (beyond what Claude Desktop already supports)
- Complete server rewrite or framework migration
- Real-time streaming response handling (deferred to a future iteration)
- Multi-tenant cursor sharing or cross-organization pagination
- GraphQL or alternative query language support
- Binary/media content chunking (focus on text/JSON initially)
