# Token Counting Analysis for Flutter MCP

## Executive Summary

For the Flutter MCP documentation server, implementing accurate and performant token counting is crucial to ensure responses fit within LLM context windows. This analysis recommends using model-specific tokenizers with intelligent caching for optimal accuracy and performance.

## Recommended Approach

### Primary Strategy: Model-Specific Tokenizers with Caching

Use official tokenizers for each supported model family:

- **OpenAI Models**: `tiktoken` library
- **Claude Models**: `anthropic` library's `count_tokens()` API
- **Gemini Models**: `google-genai` library's `count_tokens()` method

### Implementation Architecture

```python
# utils/token_counter.py
import tiktoken
import anthropic
from google import genai
from google.genai.types import HttpOptions
from typing import Union, Dict, Any
import structlog

logger = structlog.get_logger()


class TokenCounter:
    """Unified token counter with model-specific tokenizer support."""

    def __init__(self):
        self._tokenizer_cache: Dict[str, Any] = {}
        self._anthropic_client = None
        self._genai_client = None

    def _get_openai_tokenizer(self, model: str):
        """Get or create an OpenAI tokenizer, reusing cached instances."""
        if model not in self._tokenizer_cache:
            try:
                # Try model-specific encoding
                self._tokenizer_cache[model] = tiktoken.encoding_for_model(model)
            except KeyError:
                # Fallback to cl100k_base for newer models
                self._tokenizer_cache[model] = tiktoken.get_encoding("cl100k_base")
        return self._tokenizer_cache[model]

    def _get_anthropic_client(self):
        """Get or create the Anthropic client (singleton)."""
        if self._anthropic_client is None:
            self._anthropic_client = anthropic.Anthropic()
        return self._anthropic_client

    def _get_genai_client(self):
        """Get or create the Google GenAI client (singleton)."""
        if self._genai_client is None:
            self._genai_client = genai.Client(
                http_options=HttpOptions(api_version="v1")
            )
        return self._genai_client

    def count_tokens(self, text: str, model: str = "gpt-4") -> int:
        """
        Count tokens for the given text and model.

        Args:
            text: The text to count tokens for
            model: The model name (e.g., "gpt-4", "claude-3-opus", "gemini-1.5-pro")

        Returns:
            Number of tokens
        """
        try:
            # OpenAI models
            if model.startswith(("gpt-", "text-embedding-")):
                tokenizer = self._get_openai_tokenizer(model)
                return len(tokenizer.encode(text))

            # Claude models
            elif model.startswith("claude-"):
                client = self._get_anthropic_client()
                response = client.beta.messages.count_tokens(
                    model=model,
                    messages=[{"role": "user", "content": text}]
                )
                return response.input_tokens

            # Gemini models
            elif model.startswith("gemini-"):
                client = self._get_genai_client()
                response = client.models.count_tokens(
                    model=model,
                    contents=text
                )
                return response.total_tokens

            # Unknown model - use cl100k_base as fallback
            else:
                logger.warning(f"Unknown model {model}, using cl100k_base tokenizer")
                tokenizer = self._get_openai_tokenizer("cl100k_base")
                return len(tokenizer.encode(text))

        except Exception as e:
            logger.error(f"Error counting tokens for {model}: {e}")
            # Fallback to character-based approximation with safety margin
            return int(len(text) / 3.5)  # Conservative estimate


# Global instance for reuse
token_counter = TokenCounter()
```

## Performance Optimization Strategies

### 1. Tokenizer Caching

- **Critical**: Cache tokenizer instances to avoid initialization overhead
- OpenAI's `tiktoken` has minimal overhead, but still benefits from caching (see the sketch after this list)
- Anthropic and Google clients should be singleton instances
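As an illustration of the caching guidance above, here is a minimal sketch of module-level encoder caching built only on `tiktoken` and `functools.lru_cache`. The helper names (`get_encoder`, `count_openai_tokens`) are illustrative; the `TokenCounter` class above achieves the same effect with its internal dictionary.

```python
# Minimal sketch of tokenizer caching outside the TokenCounter class.
# Assumes only tiktoken; helper names are illustrative.
from functools import lru_cache

import tiktoken


@lru_cache(maxsize=None)
def get_encoder(model: str) -> tiktoken.Encoding:
    """Return a cached encoder for the model, falling back to cl100k_base."""
    try:
        return tiktoken.encoding_for_model(model)
    except KeyError:
        return tiktoken.get_encoding("cl100k_base")


def count_openai_tokens(text: str, model: str = "gpt-4") -> int:
    """Count tokens with the cached encoder; repeated calls reuse one instance."""
    return len(get_encoder(model).encode(text))
```

Because `lru_cache` keys on the model string, repeated calls for the same model never re-run the encoder lookup, which is where the initialization cost lives.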
### 2. Batch Processing

When processing multiple documents:

```python
# For tiktoken (OpenAI)
encoding = tiktoken.get_encoding("cl100k_base")
token_counts = [len(tokens) for tokens in encoding.encode_batch(texts)]

# For other providers, implement parallel processing
from concurrent.futures import ThreadPoolExecutor
from typing import List

from utils.token_counter import token_counter


def batch_count_tokens(texts: List[str], model: str) -> List[int]:
    with ThreadPoolExecutor(max_workers=5) as executor:
        futures = [executor.submit(token_counter.count_tokens, text, model)
                   for text in texts]
        return [future.result() for future in futures]
```

### 3. Redis Integration for Token Count Caching

```python
# Inside the documentation service class; assumes `import json` and
# `from typing import Tuple` at module level.
async def get_documentation_with_token_count(
    self,
    query: str,
    model: str = "gpt-4"
) -> Tuple[str, int]:
    """Get documentation with pre-calculated token count."""
    # Check Redis for cached result with token count
    cache_key = f"flutter_doc:{query}:{model}"
    cached = await self.redis.get(cache_key)

    if cached:
        data = json.loads(cached)
        return data["content"], data["token_count"]

    # Fetch and process documentation
    content = await self.fetch_documentation(query)

    # Count tokens on final formatted content
    token_count = token_counter.count_tokens(content, model)

    # Cache with token count
    await self.redis.set(
        cache_key,
        json.dumps({
            "content": content,
            "token_count": token_count
        }),
        ex=86400  # 24 hour TTL
    )

    return content, token_count
```

## Markdown Formatting Considerations

### Count Tokens on Final Output

Always count tokens on the exact string sent to the LLM:

```python
def prepare_response(raw_content: str, max_tokens: int, model: str) -> str:
    """Prepare and truncate a response to fit the token limit."""
    # Apply all formatting transformations
    formatted_content = format_markdown(raw_content)

    # Count tokens on the formatted content
    token_count = token_counter.count_tokens(formatted_content, model)

    # Truncate if necessary
    if token_count > max_tokens:
        # Intelligent truncation - keep complete sections
        formatted_content = truncate_intelligently(
            formatted_content,
            max_tokens,
            model
        )

    return formatted_content
```

### Token Impact of Markdown Elements

- **Code blocks**: Very token-intensive (backticks + language + indentation)
- **Links**: The full markdown syntax `[text](url)` counts toward the total
- **Headers**: All `#` characters count as tokens
- **Lists**: Bullets and indentation consume tokens

## Approximation Methods (Fallback Only)

When model-specific tokenizers are unavailable:

```python
def approximate_tokens(text: str, model_family: str = "general") -> int:
    """
    Approximate token count with model-specific adjustments.
    Use only as a fallback when proper tokenizers are unavailable.
    """
    # Base approximations (characters per token)
    char_ratio = {
        "gpt": 4.0,      # GPT models: ~4 chars/token
        "claude": 3.8,   # Claude: slightly more tokens
        "gemini": 4.2,   # Gemini: slightly fewer tokens
        "general": 3.5   # Conservative default
    }

    ratio = char_ratio.get(model_family, 3.5)
    base_count = len(text) / ratio

    # Adjust for code content (more tokens)
    code_blocks = text.count("```")
    if code_blocks > 0:
        base_count *= 1.15

    # Safety margin
    return int(base_count * 1.2)
```

## Implementation Timeline

### Phase 1: Core Implementation (2 hours)

1. Implement `TokenCounter` class with OpenAI support
2. Add fallback approximation method
3. Integrate with existing response pipeline

### Phase 2: Multi-Model Support (2 hours)

1. Add Anthropic client support
2. Add Google GenAI client support
3. Implement model detection logic

### Phase 3: Optimization (1 hour)

1. Add Redis caching for token counts
2. Implement batch processing
3. Add performance monitoring (see the sketch after this list)
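As a companion to Phase 3's monitoring step (and Recommendation 5 below), here is a minimal sketch of how token-counting latency could be tracked, assuming the `structlog` logger already used by `TokenCounter`; the `timed_count_tokens` wrapper and its log fields are illustrative, not part of the existing pipeline.

```python
# Minimal sketch of performance monitoring around token counting.
# Assumes the structlog logger used above; timed_count_tokens is illustrative.
import time

import structlog

from utils.token_counter import token_counter

logger = structlog.get_logger()


def timed_count_tokens(text: str, model: str = "gpt-4") -> int:
    """Count tokens and log how long the count took, in milliseconds."""
    start = time.perf_counter()
    count = token_counter.count_tokens(text, model)
    duration_ms = (time.perf_counter() - start) * 1000
    logger.info(
        "token_count_timed",
        model=model,
        tokens=count,
        duration_ms=round(duration_ms, 2),
    )
    return count
```

Logging per-call durations makes it easy to spot when API-backed counts (Claude, Gemini) dominate response time and should be cached more aggressively.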
## Testing Strategy

```python
# tests/test_token_counter.py
import pytest

from utils.token_counter import token_counter


class TestTokenCounter:

    @pytest.mark.parametrize("model,text,expected_range", [
        ("gpt-4", "Hello, world!", (3, 5)),
        ("claude-3-opus-20240229", "Hello, world!", (3, 5)),
        ("gemini-1.5-pro", "Hello, world!", (3, 5)),
    ])
    def test_basic_counting(self, model, text, expected_range):
        count = token_counter.count_tokens(text, model)
        assert expected_range[0] <= count <= expected_range[1]

    def test_markdown_formatting(self):
        markdown = "# Header\n```python\nprint('hello')\n```"
        count = token_counter.count_tokens(markdown, "gpt-4")
        # Markdown should produce more tokens than plain text
        plain_count = token_counter.count_tokens("Header print hello", "gpt-4")
        assert count > plain_count

    def test_fallback_approximation(self):
        # Test with an unknown model
        count = token_counter.count_tokens("Test text", "unknown-model")
        assert count > 0
```

## Recommendations

1. **Use Model-Specific Tokenizers**: Accuracy is worth the minimal performance cost
2. **Cache Everything**: Both tokenizer instances and token counts
3. **Count Final Output**: Always count tokens on the exact formatted string
4. **Plan for Growth**: Design the system to easily add new model support
5. **Monitor Performance**: Track token counting time in your metrics

## Conclusion

For the Flutter MCP project, implementing proper token counting with model-specific tokenizers will ensure accurate context window management while maintaining the fast response times required by the Context7-style architecture. The recommended approach balances accuracy, performance, and maintainability while providing graceful fallbacks for edge cases.
