M.I.M.I.R - Multi-agent Intelligent Memory & Insight Repository

# Heimdall Context & Token Budget Guide

This guide explains how Heimdall manages context and token allocation with the default qwen2.5-0.5b-instruct model.

## Overview

Heimdall uses a **single-shot command architecture**: each request is independent, and no conversation history accumulates between requests. This leaves more of the context window for rich system prompts while keeping responses fast.

## Token Budget Allocation

```
┌─────────────────────────────────────────────────────────────────┐
│                    32K CONTEXT WINDOW                            │
│                  (qwen2.5-0.5b-instruct max)                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │            SYSTEM PROMPT (12K budget)                     │   │
│  │  ┌────────────────────────────────────────────────────┐  │   │
│  │  │ Identity & Role (~50 tokens)                       │  │   │
│  │  │ "You are Heimdall, the AI assistant..."            │  │   │
│  │  ├────────────────────────────────────────────────────┤  │   │
│  │  │ Available Actions (~200-500 tokens)                │  │   │
│  │  │ - heimdall.watcher.status                          │  │   │
│  │  │ - heimdall.watcher.query                           │  │   │
│  │  │ - [plugin-registered actions]                      │  │   │
│  │  ├────────────────────────────────────────────────────┤  │   │
│  │  │ Cypher Query Primer (~400 tokens)                  │  │   │
│  │  │ - Basic patterns, filtering, aggregations          │  │   │
│  │  │ - Path queries, modifications, subqueries          │  │   │
│  │  ├────────────────────────────────────────────────────┤  │   │
│  │  │ Response Modes (~100 tokens)                       │  │   │
│  │  │ - ACTION MODE: JSON for operations                 │  │   │
│  │  │ - HELP MODE: Conversational for questions          │  │   │
│  │  ├────────────────────────────────────────────────────┤  │   │
│  │  │ Plugin Instructions (~variable)                    │  │   │
│  │  │ - AdditionalInstructions from plugins              │  │   │
│  │  ├────────────────────────────────────────────────────┤  │   │
│  │  │ Examples (~500 tokens)                             │  │   │
│  │  │ - 20 built-in examples for common commands         │  │   │
│  │  └────────────────────────────────────────────────────┘  │   │
│  │                                                           │   │
│  │  Total base system: ~1,200 tokens                         │   │
│  │  Available for plugins: ~10,800 tokens                    │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                  │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │            USER MESSAGE (4K budget)                       │   │
│  │  Single-shot command from Bifrost UI                      │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                  │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │            RESPONSE (1K max tokens)                       │   │
│  │  JSON action OR conversational help                       │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
```

## Default Configuration

| Setting | Value | Purpose |
|---------|-------|---------|
| `NORNICDB_HEIMDALL_CONTEXT_SIZE` | 32768 | Full 32K context window |
| `NORNICDB_HEIMDALL_BATCH_SIZE` | 8192 | 8K batch for prefill |
| `NORNICDB_HEIMDALL_MAX_TOKENS` | 1024 | 1K response limit |

## How Multi-Batch Prefill Works

When the prompt (system prompt plus user message) exceeds the batch size, Heimdall automatically splits it into multiple batches. With the default 8K batch, a prompt of 8,192 tokens or fewer is processed in a single batch. A larger prompt is split like this:

```
System Prompt (10K tokens) + User Message (500 tokens) = 10.5K total

Batch 1: [System prompt tokens 0-8191]      → KV cache stores
Batch 2: [Remaining tokens + user message]  → KV cache accumulates
                                            → Generation starts
```

The KV cache accumulates across batches, so the model "sees" the entire context when generating.
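As a rough illustration of the splitting step, the sketch below chunks a token sequence into batches and tracks how many tokens the KV cache would hold after each one. It is not Heimdall's implementation: `splitIntoBatches` is a hypothetical helper, the 10,500-token prompt is a made-up figure, and the "cache" is only a running count.

```go
package main

import "fmt"

// splitIntoBatches chunks a token sequence into consecutive slices of at most
// batchSize tokens. Hypothetical helper, for illustration only.
func splitIntoBatches(tokens []int, batchSize int) [][]int {
	if batchSize <= 0 {
		return nil
	}
	var batches [][]int
	for start := 0; start < len(tokens); start += batchSize {
		end := start + batchSize
		if end > len(tokens) {
			end = len(tokens)
		}
		batches = append(batches, tokens[start:end])
	}
	return batches
}

func main() {
	// 10,500 placeholder token IDs: more than one 8,192-token batch can hold.
	prompt := make([]int, 10500)

	kvCacheLen := 0 // stand-in for the number of positions held in the KV cache
	for i, batch := range splitIntoBatches(prompt, 8192) {
		kvCacheLen += len(batch) // the cache grows with every processed batch
		fmt.Printf("batch %d: %d tokens, KV cache holds %d\n", i+1, len(batch), kvCacheLen)
	}
}
```

Generation would start only after the last batch, at which point the cache covers the whole prompt.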

## Token Budget Constants

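These constants in `pkg/heimdall/types.go` define the allocation. The 16K budget covers the input (system prompt plus user message) and sits inside the model's 32K context window: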

```go
const (
    MaxContextTokens      = 16384  // 16K total context budget
    MaxSystemPromptTokens = 12000  // 12K for system + plugins
    MaxUserMessageTokens  = 4000   // 4K for user commands
    TokensPerChar         = 0.25   // ~4 chars per token estimate
)
```
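To show how these constants could be applied, here is a minimal sketch that estimates token counts from character counts and checks them against the limits. The constants are copied from the listing above; `estimateTokens` and `checkBudget` are hypothetical names, and rounding up is an assumption rather than documented behavior.

```go
package main

import (
	"fmt"
	"math"
	"unicode/utf8"
)

// Budget constants as listed in this guide.
const (
	MaxContextTokens      = 16384
	MaxSystemPromptTokens = 12000
	MaxUserMessageTokens  = 4000
	TokensPerChar         = 0.25
)

// estimateTokens is a hypothetical helper: it converts a character count into
// a rough token count using TokensPerChar, rounding up.
func estimateTokens(text string) int {
	chars := utf8.RuneCountInString(text)
	return int(math.Ceil(float64(chars) * TokensPerChar))
}

// checkBudget is also hypothetical. It reports the first limit that the
// estimated sizes exceed, or nil if everything fits.
func checkBudget(systemPrompt, userMessage string) error {
	sys, usr := estimateTokens(systemPrompt), estimateTokens(userMessage)
	switch {
	case sys > MaxSystemPromptTokens:
		return fmt.Errorf("system prompt: ~%d tokens exceeds %d", sys, MaxSystemPromptTokens)
	case usr > MaxUserMessageTokens:
		return fmt.Errorf("user message: ~%d tokens exceeds %d", usr, MaxUserMessageTokens)
	case sys+usr > MaxContextTokens:
		return fmt.Errorf("total: ~%d tokens exceeds %d", sys+usr, MaxContextTokens)
	}
	return nil
}

func main() {
	query := "MATCH (n) RETURN count(n)"
	fmt.Println("estimated tokens:", estimateTokens(query))
	fmt.Println("budget check:", checkBudget("You are Heimdall...", query))
}
```

A character-based estimate is cheap but approximate. A real tokenizer can produce noticeably different counts, especially for code and non-English text.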

## What Fits in the System Prompt

| Component | Estimated Tokens | Notes |
|-----------|-----------------|-------|
| Base identity | ~50 | Fixed header |
| Available actions | 200-500 | Depends on plugin count |
| Cypher primer | ~400 | Reference guide |
| Response modes | ~100 | Action + Help modes |
| Built-in examples | ~500 | 20 comprehensive examples |
| **Base total** | **~1,200** | Before plugins |
| Plugin instructions | ~10,800 available | Plugins can add context |

## Fallback Behavior

If plugins add too many instructions and the system prompt exceeds the 12K budget, Heimdall automatically falls back to a minimal prompt of roughly 200 tokens:

```
You are Heimdall, AI assistant for NornicDB graph database.

ACTIONS:
[plugin actions only]

For queries: {"action": "heimdall.watcher.query", "params": {"cypher": "..."}}
Respond with JSON only.
```
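The fallback decision can be sketched as a size check on the assembled prompt. This is an illustrative sketch under stated assumptions, not the actual handler code: `buildSystemPrompt` and `estimateTokens` are hypothetical names, the prompt strings are abbreviated placeholders, and the estimate uses roughly 4 bytes per token.

```go
package main

import "fmt"

const (
	maxSystemPromptTokens = 12000 // system prompt budget from this guide
	charsPerToken         = 4     // inverse of the ~0.25 tokens-per-char estimate
)

// estimateTokens rounds up len(text)/charsPerToken. It counts bytes, which
// equals characters only for ASCII text.
func estimateTokens(text string) int {
	return (len(text) + charsPerToken - 1) / charsPerToken
}

// buildSystemPrompt is a hypothetical helper. It appends plugin instructions
// to the base prompt and returns the fallback instead when the estimate
// exceeds the budget. The bool reports whether the fallback was used.
func buildSystemPrompt(base, fallback string, pluginInstructions []string) (string, bool) {
	full := base
	for _, extra := range pluginInstructions {
		full += "\n\n" + extra
	}
	if estimateTokens(full) > maxSystemPromptTokens {
		return fallback, true
	}
	return full, false
}

func main() {
	base := "You are Heimdall, the AI assistant..."
	fallback := "You are Heimdall, AI assistant for NornicDB graph database."
	oversized := string(make([]byte, 60000)) // stand-in for ~15K tokens of plugin text

	_, usedFallback := buildSystemPrompt(base, fallback, []string{"Prefer read-only queries."})
	fmt.Println("small plugin uses fallback:", usedFallback)

	_, usedFallback = buildSystemPrompt(base, fallback, []string{oversized})
	fmt.Println("oversized plugin uses fallback:", usedFallback)
}
```

Returning the flag alongside the prompt lets a caller log when the fallback is triggered, which makes an over-budget plugin easier to spot.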

## Performance Characteristics

### What Affects Speed

| Factor | Impact | Notes |
|--------|--------|-------|
| **MaxTokens** | High | Each output token takes ~same time |
| **GPU vs CPU** | Very High | GPU is 10-50x faster |
| **Prompt size** | Low | Only affects prefill, not generation |
| **Context/Batch size** | Minimal | Memory allocation only |

### Why Context Size Doesn't Slow Down Inference

1. **KV Cache is lazy** - Only allocates for actual tokens used
2. **Prefill is fast** - Parallel processing of input tokens
3. **Generation dominates** - 90% of time is in token generation
4. **Your prompts are small** - ~2K tokens vs 32K capacity

## Model Specifications

### qwen2.5-0.5b-instruct

| Spec | Value |
|------|-------|
| Parameters | 500M |
| Context Length | 32,768 tokens |
| Quantization | Q4_K_M recommended |
| VRAM (GPU) | ~500MB |
| RAM (CPU) | ~1GB |
| License | Apache 2.0 |

## Configuration Examples

### Default (Balanced)
```bash
NORNICDB_HEIMDALL_ENABLED=true
# Uses all defaults - 32K context, 8K batch, 1K output
```

### Memory Constrained
```bash
NORNICDB_HEIMDALL_ENABLED=true
NORNICDB_HEIMDALL_CONTEXT_SIZE=8192   # Reduce if low RAM
NORNICDB_HEIMDALL_BATCH_SIZE=2048
NORNICDB_HEIMDALL_MAX_TOKENS=512      # Shorter responses
```

### Verbose Responses
```bash
NORNICDB_HEIMDALL_ENABLED=true
NORNICDB_HEIMDALL_MAX_TOKENS=2048     # Allow longer explanations
```
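For readers wiring these variables into their own tooling, the sketch below reads the three settings with the defaults listed in this guide. It is a sketch under stated assumptions: the `heimdallConfig` struct and helper names are hypothetical, and falling back to the default on a missing, non-numeric, or non-positive value is an assumption, not documented NornicDB behavior.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// heimdallConfig is a hypothetical struct; the field names are not taken
// from the NornicDB source.
type heimdallConfig struct {
	ContextSize int
	BatchSize   int
	MaxTokens   int
}

// lookupFunc has the same shape as os.LookupEnv so values can be injected.
type lookupFunc func(key string) (string, bool)

// intSetting returns the positive integer stored under key, or def when the
// variable is unset or cannot be parsed (assumed behavior).
func intSetting(lookup lookupFunc, key string, def int) int {
	raw, ok := lookup(key)
	if !ok {
		return def
	}
	n, err := strconv.Atoi(strings.TrimSpace(raw))
	if err != nil || n <= 0 {
		return def
	}
	return n
}

// loadConfig applies the defaults from the Default Configuration table.
func loadConfig(lookup lookupFunc) heimdallConfig {
	return heimdallConfig{
		ContextSize: intSetting(lookup, "NORNICDB_HEIMDALL_CONTEXT_SIZE", 32768),
		BatchSize:   intSetting(lookup, "NORNICDB_HEIMDALL_BATCH_SIZE", 8192),
		MaxTokens:   intSetting(lookup, "NORNICDB_HEIMDALL_MAX_TOKENS", 1024),
	}
}

func main() {
	cfg := loadConfig(os.LookupEnv)
	fmt.Printf("context=%d batch=%d max_tokens=%d\n", cfg.ContextSize, cfg.BatchSize, cfg.MaxTokens)
}
```

Taking the lookup function as a parameter keeps the parsing logic testable without touching the process environment.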

## Monitoring Token Usage

The handler logs token budget information:

```
[Bifrost] Token budget: system=1247, user=156, total=1403/16384
```

If you see truncation errors, check:
1. Is MaxTokens high enough for the response?
2. Are plugins adding too many instructions?
3. Is the user message within budget?
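
To check the second and third points from logs, a line in the format shown above can be parsed and compared against the budget. This is a minimal sketch that assumes the log format stays exactly as in the sample line; `parseBudgetLine` and `tokenUsage` are hypothetical names.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// budgetLine matches the sample log format shown in this guide.
var budgetLine = regexp.MustCompile(`Token budget: system=(\d+), user=(\d+), total=(\d+)/(\d+)`)

// tokenUsage holds the four numbers from one log line (hypothetical type).
type tokenUsage struct {
	System int
	User   int
	Total  int
	Limit  int
}

// parseBudgetLine extracts the counts from a log line. The bool is false
// when the line does not match or a number does not fit in an int.
func parseBudgetLine(line string) (tokenUsage, bool) {
	m := budgetLine.FindStringSubmatch(line)
	if m == nil {
		return tokenUsage{}, false
	}
	nums := make([]int, 4)
	for i := range nums {
		n, err := strconv.Atoi(m[i+1])
		if err != nil {
			return tokenUsage{}, false
		}
		nums[i] = n
	}
	return tokenUsage{System: nums[0], User: nums[1], Total: nums[2], Limit: nums[3]}, true
}

func main() {
	usage, ok := parseBudgetLine("[Bifrost] Token budget: system=1247, user=156, total=1403/16384")
	if !ok {
		fmt.Println("line did not match the expected format")
		return
	}
	fmt.Printf("system=%d user=%d, %.1f%% of the input budget used\n",
		usage.System, usage.User, 100*float64(usage.Total)/float64(usage.Limit))
}
```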

## See Also

- [Heimdall AI Assistant](./heimdall-ai-assistant.md) - Overview and configuration
- [Heimdall Plugins](./heimdall-plugins.md) - Writing custom plugins
- [Operations - Monitoring](../operations/monitoring.md) - Prometheus metrics
