
M.I.M.I.R - Multi-agent Intelligent Memory & Insight Repository

by orneryd
heimdall-context.md (9.82 kB)
# Heimdall Context & Token Budget Guide

This guide explains how Heimdall manages context and token allocation with the default qwen2.5-0.5b-instruct model.

## Overview

Heimdall uses a **single-shot command architecture** - each request is independent, and no conversation history accumulates between requests. This maximizes the context available for rich system prompts while keeping responses fast.

## Token Budget Allocation

```
┌────────────────────────────────────────────────────────────────┐
│                       32K CONTEXT WINDOW                       │
│                  (qwen2.5-0.5b-instruct max)                   │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │  SYSTEM PROMPT (12K budget)                              │  │
│  │  ┌────────────────────────────────────────────────────┐  │  │
│  │  │ Identity & Role (~50 tokens)                       │  │  │
│  │  │   "You are Heimdall, the AI assistant..."          │  │  │
│  │  ├────────────────────────────────────────────────────┤  │  │
│  │  │ Available Actions (~200-500 tokens)                │  │  │
│  │  │   - heimdall.watcher.status                        │  │  │
│  │  │   - heimdall.watcher.query                         │  │  │
│  │  │   - [plugin-registered actions]                    │  │  │
│  │  ├────────────────────────────────────────────────────┤  │  │
│  │  │ Cypher Query Primer (~400 tokens)                  │  │  │
│  │  │   - Basic patterns, filtering, aggregations        │  │  │
│  │  │   - Path queries, modifications, subqueries        │  │  │
│  │  ├────────────────────────────────────────────────────┤  │  │
│  │  │ Response Modes (~100 tokens)                       │  │  │
│  │  │   - ACTION MODE: JSON for operations               │  │  │
│  │  │   - HELP MODE: Conversational for questions        │  │  │
│  │  ├────────────────────────────────────────────────────┤  │  │
│  │  │ Plugin Instructions (~variable)                    │  │  │
│  │  │   - AdditionalInstructions from plugins            │  │  │
│  │  ├────────────────────────────────────────────────────┤  │  │
│  │  │ Examples (~500 tokens)                             │  │  │
│  │  │   - 20 built-in examples for common commands       │  │  │
│  │  └────────────────────────────────────────────────────┘  │  │
│  │                                                          │  │
│  │  Total base system: ~1,200 tokens                        │  │
│  │  Available for plugins: ~10,800 tokens                   │  │
│  └──────────────────────────────────────────────────────────┘  │
│                                                                │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │  USER MESSAGE (4K budget)                                │  │
│  │  Single-shot command from Bifrost UI                     │  │
│  └──────────────────────────────────────────────────────────┘  │
│                                                                │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │  RESPONSE (1K max tokens)                                │  │
│  │  JSON action OR conversational help                      │  │
│  └──────────────────────────────────────────────────────────┘  │
│                                                                │
└────────────────────────────────────────────────────────────────┘
```

## Default Configuration

| Setting | Value | Purpose |
|---------|-------|---------|
| `NORNICDB_HEIMDALL_CONTEXT_SIZE` | 32768 | Full 32K context window |
| `NORNICDB_HEIMDALL_BATCH_SIZE` | 8192 | 8K batch for prefill |
| `NORNICDB_HEIMDALL_MAX_TOKENS` | 1024 | 1K response limit |

## How Multi-Batch Prefill Works

When the prompt (system prompt plus user message) exceeds the batch size, Heimdall automatically splits it into multiple batches. For example, with the default 8K batch size:

```
System Prompt (10K tokens) + User Message (500 tokens) = 10.5K total

Batch 1: [System prompt tokens 0-8191]     → KV cache stores
Batch 2: [Remaining tokens + user message] → KV cache accumulates
                                           → Generation starts
```

The KV cache accumulates across batches, so the model "sees" the entire context when generating.
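The splitting step can be illustrated with a minimal sketch. This is not Heimdall's actual implementation: `splitIntoBatches` is a hypothetical helper, and the 10,500-token input is an assumed prompt size chosen only so that it exceeds the default 8192-token batch size.

```go
package main

import "fmt"

// splitIntoBatches is a hypothetical helper (not part of Heimdall's API).
// It cuts a token sequence into consecutive chunks of at most batchSize
// tokens, preserving order, the way a multi-batch prefill would feed them.
func splitIntoBatches(tokens []int, batchSize int) [][]int {
	if batchSize <= 0 {
		return nil
	}
	var batches [][]int
	for start := 0; start < len(tokens); start += batchSize {
		end := start + batchSize
		if end > len(tokens) {
			end = len(tokens)
		}
		batches = append(batches, tokens[start:end])
	}
	return batches
}

func main() {
	const batchSize = 8192 // default NORNICDB_HEIMDALL_BATCH_SIZE

	// Assumed prompt: 10,500 placeholder token IDs (system prompt + user message).
	tokens := make([]int, 10500)

	for i, b := range splitIntoBatches(tokens, batchSize) {
		fmt.Printf("batch %d: %d tokens\n", i+1, len(b))
	}
	// prints:
	// batch 1: 8192 tokens
	// batch 2: 2308 tokens
}
```

Each batch would be decoded in order against the same KV cache, so generation only begins once the final batch has been processed.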
## Token Budget Constants

These constants define the allocation in `pkg/heimdall/types.go`:

```go
const (
	MaxContextTokens      = 16384 // 16K total context budget
	MaxSystemPromptTokens = 12000 // 12K for system + plugins
	MaxUserMessageTokens  = 4000  // 4K for user commands
	TokensPerChar         = 0.25  // ~4 chars per token estimate
)
```

Note that the 16K prompt budget is half of the 32K context window set by `NORNICDB_HEIMDALL_CONTEXT_SIZE`.

## What Fits in the System Prompt

| Component | Estimated Tokens | Notes |
|-----------|-----------------|-------|
| Base identity | ~50 | Fixed header |
| Available actions | 200-500 | Depends on plugin count |
| Cypher primer | ~400 | Reference guide |
| Response modes | ~100 | Action + Help modes |
| Built-in examples | ~500 | 20 comprehensive examples |
| **Base total** | **~1,200** | Before plugins |
| Plugin instructions | ~10,800 available | Plugins can add context |

## Fallback Behavior

If plugins add too many instructions and the system prompt exceeds the 12K budget, Heimdall automatically falls back to a minimal prompt:

```go
// Minimal fallback prompt (~200 tokens).
// The constant name is illustrative; only the prompt text comes from Heimdall.
const minimalFallbackPrompt = `You are Heimdall, AI assistant for NornicDB graph database.
ACTIONS:
[plugin actions only]
For queries: {"action": "heimdall.watcher.query", "params": {"cypher": "..."}}
Respond with JSON only.`
```

## Performance Characteristics

### What Affects Speed

| Factor | Impact | Notes |
|--------|--------|-------|
| **MaxTokens** | High | Each output token takes roughly the same time to generate |
| **GPU vs CPU** | Very High | GPU is 10-50x faster |
| **Prompt size** | Low | Only affects prefill, not generation |
| **Context/Batch size** | Minimal | Memory allocation only |

### Why Context Size Doesn't Slow Down Inference

1. **KV Cache is lazy** - Only allocates for actual tokens used
2. **Prefill is fast** - Parallel processing of input tokens
3. **Generation dominates** - Roughly 90% of the time is spent in token generation
4. **Your prompts are small** - ~2K tokens vs 32K capacity

## Model Specifications

### qwen2.5-0.5b-instruct

| Spec | Value |
|------|-------|
| Parameters | 500M |
| Context Length | 32,768 tokens |
| Quantization | Q4_K_M recommended |
| VRAM (GPU) | ~500MB |
| RAM (CPU) | ~1GB |
| License | Apache 2.0 |

## Configuration Examples

### Default (Balanced)

```bash
NORNICDB_HEIMDALL_ENABLED=true
# Uses all defaults - 32K context, 8K batch, 1K output
```

### Memory Constrained

```bash
NORNICDB_HEIMDALL_ENABLED=true
NORNICDB_HEIMDALL_CONTEXT_SIZE=8192  # Reduce if low RAM
NORNICDB_HEIMDALL_BATCH_SIZE=2048
NORNICDB_HEIMDALL_MAX_TOKENS=512     # Shorter responses
```

### Verbose Responses

```bash
NORNICDB_HEIMDALL_ENABLED=true
NORNICDB_HEIMDALL_MAX_TOKENS=2048  # Allow longer explanations
```

## Monitoring Token Usage

The handler logs token budget information:

```
[Bifrost] Token budget: system=1247, user=156, total=1403/16384
```

If you see truncation errors, check:

1. Is `NORNICDB_HEIMDALL_MAX_TOKENS` high enough for the response?
2. Are plugins adding too many instructions?
3. Is the user message within its 4K budget?

## See Also

- [Heimdall AI Assistant](./heimdall-ai-assistant.md) - Overview and configuration
- [Heimdall Plugins](./heimdall-plugins.md) - Writing custom plugins
- [Operations - Monitoring](../operations/monitoring.md) - Prometheus metrics
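
As a supplement to the token budget constants and fallback behavior described in this guide, the sketch below shows how a character-based estimate might drive the fallback decision. It is not Heimdall's actual code: `estimateTokens` and `choosePrompt` are hypothetical helpers, the prompt strings are placeholders, and only the constant values and the log line format are taken from this guide.

```go
package main

import (
	"fmt"
	"strings"
)

// Values copied from the pkg/heimdall/types.go excerpt in this guide.
const (
	MaxContextTokens      = 16384
	MaxSystemPromptTokens = 12000
	MaxUserMessageTokens  = 4000
	TokensPerChar         = 0.25
)

// estimateTokens is a hypothetical helper: it applies the ~4 chars per token
// heuristic to the byte length of s (a rough proxy for character count).
func estimateTokens(s string) int {
	return int(float64(len(s)) * TokensPerChar)
}

// choosePrompt is a hypothetical helper: it returns the full system prompt
// when its estimate fits the 12K budget, otherwise the minimal fallback.
// The boolean reports whether the fallback was used.
func choosePrompt(full, minimal string) (string, bool) {
	if estimateTokens(full) > MaxSystemPromptTokens {
		return minimal, true
	}
	return full, false
}

func main() {
	// Placeholder prompts, assumed for illustration only.
	minimal := "You are Heimdall, AI assistant for NornicDB graph database."
	full := strings.Repeat("plugin instruction text. ", 2400) // 60,000 bytes
	user := "how many nodes are in the graph?"

	system, usedFallback := choosePrompt(full, minimal)
	sysTok, userTok := estimateTokens(system), estimateTokens(user)

	fmt.Printf("fallback=%v\n", usedFallback)
	fmt.Printf("[Bifrost] Token budget: system=%d, user=%d, total=%d/%d\n",
		sysTok, userTok, sysTok+userTok, MaxContextTokens)
}
```

Because the estimate is a heuristic, a real tokenizer may report different counts; a sketch like this is best read as a cheap pre-check rather than an exact measurement.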
