# Metadata-Enriched Embeddings & Image Indexing
## Overview
This guide covers two complementary features that enhance Mimir's semantic search capabilities:
1. **Metadata-Enriched Embeddings** - Embeds file metadata (filename, path, language, directory) alongside content for better file discovery
2. **Image Embeddings** - Indexes images using Vision-Language models to generate searchable text descriptions
Both features work together to provide comprehensive semantic search across text files, documents, and images.
## Problem Solved
**Before**: Searching "authentication config" would only match files containing those words in their content.
**After**: The same search also matches:
- Files named `auth-config.ts`
- Files in `auth/` or `config/` directories
- Files with `AuthConfig` class names
- Any semantically related file paths
## Implementation
### Phase 1: Metadata Formatting (✅ COMPLETE)
Added to `src/indexing/EmbeddingsService.ts`:
```typescript
export interface FileMetadata {
  name: string;
  relativePath: string;
  language: string;
  extension: string;
  directory?: string;
  sizeBytes?: number;
}

export function formatMetadataForEmbedding(metadata: FileMetadata): string {
  // Returns natural language description of file
  // Example: "This is a typescript file named auth-api.ts located at src/api/auth-api.ts in the src/api directory."
}
```
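For reference, here is a minimal sketch of how such a formatter can produce the sentence shown in the comment above. It is illustrative only; the actual implementation in `EmbeddingsService.ts` may word the description differently.

```typescript
// Illustrative sketch; the shipped formatMetadataForEmbedding in
// src/indexing/EmbeddingsService.ts may phrase the description differently.
export function formatMetadataForEmbedding(metadata: FileMetadata): string {
  const parts = [
    `This is a ${metadata.language} file named ${metadata.name}`,
    `located at ${metadata.relativePath}`,
  ];
  if (metadata.directory && metadata.directory !== '.') {
    parts.push(`in the ${metadata.directory} directory`);
  }
  // End with a period and newline so the prefix reads as its own sentence
  // before the file content that follows.
  return parts.join(' ') + '.\n';
}
```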
### Phase 2: FileIndexer Integration (⏳ TODO)
Modify `src/indexing/FileIndexer.ts` to prepend metadata before generating embeddings:
```typescript
// Around line 197-199
const metadata: FileMetadata = {
  name: path.basename(filePath),
  relativePath: relativePath,
  language: language,
  extension: extension,
  directory: path.dirname(relativePath),
  sizeBytes: stats.size
};

// Prepend metadata to content
const metadataPrefix = formatMetadataForEmbedding(metadata);
const enrichedContent = metadataPrefix + content;

// Generate embeddings with enriched content
const chunkEmbeddings = await this.embeddingsService.generateChunkEmbeddings(enrichedContent);
```
## Benefits
### Improved Search Examples
| Query | Additional Matches |
|-------|-------------------|
| "authentication config" | `auth-config.ts`, `src/auth/config.ts`, `authentication.config.json` |
| "user database models" | `src/users/db/models.ts`, `user-model.py`, `database/users/*` |
| "typescript API routes" | All `.ts` files in `api/` or `routes/` directories |
| "markdown documentation" | All `.md` files, files in `docs/` directory |
| "test authentication" | `auth.test.ts`, `test/auth/*`, test files with "auth" in path |
### Natural Language Format
The metadata is formatted as natural language for optimal embedding quality:
```
This is a typescript file named auth-api.ts located at src/api/auth-api.ts in the src/api directory.
export class AuthService {
  async authenticate(user: User) {
    // ... actual content ...
  }
}
```
## Standard Behavior
Metadata enrichment is **always enabled** - it's the standard way Mimir embeds files. This ensures optimal semantic search across your entire codebase.
## Backward Compatibility
✅ **Fully backward compatible**
- Existing embeddings continue to work
- New embeddings generated on next file modification
- No schema changes required
- No data migration needed
- Gradual rollout as files are re-indexed
## Storage Impact
- **File nodes**: No change (metadata already stored)
- **Chunk nodes**: ~50-100 characters larger (metadata prefix)
- **Embeddings**: Same dimensions, enriched content
- **Estimated increase**: ~5% more text in chunks
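As a rough worked example: assuming a typical chunk of ~1,500 characters, a 75-character metadata prefix adds about 5% more text, which matches the estimate above.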
## Testing
After implementation, verify the following (a scripted sketch of steps 2-4 follows the list):
1. Index a sample file with metadata
2. Search by filename only
3. Search by directory path
4. Search by file type/language
5. Verify metadata appears in search results
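A minimal scripted version of steps 2-4, assuming the REST endpoints shown later in this guide (`/api/search/semantic` on `localhost:3000`); the response shape is not specified here, so the script just prints whatever comes back.

```typescript
// Hypothetical smoke test for metadata-enriched search. The endpoint and
// payload mirror the curl examples later in this guide; adjust to your setup.
const BASE = 'http://localhost:3000';

async function search(query: string) {
  const res = await fetch(`${BASE}/api/search/semantic`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ query, limit: 5 }),
  });
  return res.json();
}

// Filename-only, directory, and file-type/language queries from the table above.
for (const query of ['auth-config', 'files in the auth directory', 'typescript API routes']) {
  const results = await search(query);
  console.log(query, '→', JSON.stringify(results).slice(0, 200));
}
```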
## Image Embeddings Configuration
Mimir supports separate configuration for image embeddings using Vision-Language (VL) models.
### Configuration Hierarchy
**VL-specific config → General embedding config → Defaults**
If VL-specific environment variables are set, they override general settings for images. Otherwise, images use the same config as text.
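In code, this fallback is just a coalescing lookup. The sketch below is illustrative (the helper name and built-in defaults are not Mimir's actual internals):

```typescript
// Illustrative resolution of the VL config hierarchy:
// VL-specific env var → general embeddings env var → built-in default.
function resolveVlSetting(vlKey: string, generalKey: string, fallback: string): string {
  return process.env[vlKey] ?? process.env[generalKey] ?? fallback;
}

const vlConfig = {
  provider: resolveVlSetting('MIMIR_EMBEDDINGS_VL_PROVIDER', 'MIMIR_EMBEDDINGS_PROVIDER', 'openai'),
  api: resolveVlSetting('MIMIR_EMBEDDINGS_VL_API', 'MIMIR_EMBEDDINGS_API', 'http://llama-server:8080'),
  model: resolveVlSetting('MIMIR_EMBEDDINGS_VL_MODEL', 'MIMIR_EMBEDDINGS_MODEL', 'mxbai-embed-large'),
};
```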
### Environment Variables
```bash
# Image Indexing Control
MIMIR_EMBEDDINGS_IMAGES=true # Enable/disable image indexing (default: false)
# Optional VL-Specific Config (if not set, falls back to general embedding config)
MIMIR_EMBEDDINGS_VL_PROVIDER=openai # VL provider (defaults to MIMIR_EMBEDDINGS_PROVIDER)
MIMIR_EMBEDDINGS_VL_API=http://llama-vl-server:8081 # VL API endpoint (defaults to MIMIR_EMBEDDINGS_API)
MIMIR_EMBEDDINGS_VL_API_PATH=/v1/embeddings # VL API path (defaults to MIMIR_EMBEDDINGS_API_PATH)
MIMIR_EMBEDDINGS_VL_API_KEY=dummy-key # VL API key (defaults to MIMIR_EMBEDDINGS_API_KEY)
MIMIR_EMBEDDINGS_VL_MODEL=nomic-embed-multimodal # VL model name (defaults to MIMIR_EMBEDDINGS_MODEL)
MIMIR_EMBEDDINGS_VL_DIMENSIONS=768 # VL dimensions (defaults to MIMIR_EMBEDDINGS_DIMENSIONS)
```
### Configuration Examples
#### Option 1: Single Unified Model (Simplest)
Use same model for both text and images:
```bash
MIMIR_EMBEDDINGS_IMAGES=true
MIMIR_EMBEDDINGS_MODEL=nomic-embed-multimodal
MIMIR_EMBEDDINGS_DIMENSIONS=768
# No VL-specific vars needed
```
#### Option 2: Separate Models
Use different models for text vs images:
```bash
MIMIR_EMBEDDINGS_IMAGES=true
# Text embeddings
MIMIR_EMBEDDINGS_MODEL=mxbai-embed-large
MIMIR_EMBEDDINGS_DIMENSIONS=1024
# Image embeddings
MIMIR_EMBEDDINGS_VL_MODEL=nomic-embed-multimodal
MIMIR_EMBEDDINGS_VL_DIMENSIONS=768
```
#### Option 3: Separate Servers
Run text and image models on different servers:
```bash
# Text on port 8080
MIMIR_EMBEDDINGS_API=http://llama-server:8080
MIMIR_EMBEDDINGS_MODEL=mxbai-embed-large
# Images on port 8081
MIMIR_EMBEDDINGS_VL_API=http://llama-vl-server:8081
MIMIR_EMBEDDINGS_VL_MODEL=nomic-embed-multimodal
```
#### Option 4: No Images
Disable image indexing entirely:
```bash
MIMIR_EMBEDDINGS_IMAGES=false
# Image files are simply skipped during indexing
```
### Image Description Mode (Default)
**How it works:**
Since multimodal GGUF embedding models are hard to find, Mimir uses a **description-based approach**:
1. **Image preprocessing** - Images larger than 3.2 MP are automatically resized to fit Qwen2.5-VL limits
2. **VL analysis** - The qwen2.5-vl Vision-Language model analyzes the image
3. **Description** - It generates a detailed text description of the image content
4. **Text embedding** - A text embedding model (e.g., mxbai-embed-large) embeds the description
5. **Storage** - Both the description and its embedding are stored in Neo4j
**Benefits:**
- ✅ Works with existing text embedding infrastructure
- ✅ No need for rare multimodal GGUF models
- ✅ Semantic image search via descriptions
- ✅ Human-readable descriptions stored alongside embeddings
- ✅ Automatic handling of large images (no manual chunking)
**Configuration:**
```bash
MIMIR_EMBEDDINGS_IMAGES_DESCRIBE_MODE=true # Default: true
MIMIR_EMBEDDINGS_VL_MODEL=qwen2.5-vl # VL model for descriptions
MIMIR_EMBEDDINGS_VL_API=http://llama-vl-server:8080
MIMIR_EMBEDDINGS_VL_PROVIDER=llama.cpp
# Image processing
MIMIR_IMAGE_MAX_PIXELS=3211264 # Qwen2.5-VL limit (~1792×1792)
MIMIR_IMAGE_TARGET_SIZE=1536 # Conservative resize target
```
### Image Processing Strategy
**Automatic Downscaling (No Chunking Required):**
Qwen2.5-VL has built-in dynamic resolution handling with these limits:
- **Maximum**: ~1792×1792 pixels (3.2 megapixels)
- **Minimum**: ~79×79 pixels (6,272 pixels)
**For images within limits** (most photos, screenshots):
- Sent directly to VL model without modification
- Examples: 1920×1080 (Full HD), 1280×720, phone photos
**For images exceeding limits** (4K, 8K, high-res scans):
- Automatically resized to fit within 3.2 MP
- Aspect ratio preserved
- Resize time negligible (~50-300ms vs ~12-35s VL processing)
**Why no chunking:**
- ✅ Qwen2.5-VL uses Dynamic Resolution ViT (auto-segments into 14×14 patches)
- ✅ MRoPE preserves spatial relationships across entire image
- ✅ Single API call = faster, simpler, more reliable
- ✅ Semantic search needs "gist", not pixel-perfect detail
**Processing pipeline:**
```
Image File → Check size → Resize if >3.2MP → Base64 encode →
Qwen2.5-VL → Text description → Add metadata → Embed → Store
```
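The same pipeline sketched in TypeScript. The helper functions are passed in as parameters because their real counterparts live inside Mimir's services; names and signatures here are placeholders, not Mimir's actual API.

```typescript
import { promises as fs } from 'node:fs';

// Hypothetical end-to-end flow for one image; helper names are placeholders.
type Helpers = {
  resizeIfNeeded: (img: Buffer, maxPixels: number) => Promise<Buffer>;
  describeImage: (dataUrl: string) => Promise<string>;
  formatMetadata: (filePath: string) => string;
  embedText: (text: string) => Promise<number[]>;
  store: (filePath: string, description: string, embedding: number[]) => Promise<void>;
};

async function indexImage(filePath: string, h: Helpers): Promise<void> {
  const original = await fs.readFile(filePath);

  // 1. Downscale only if the image exceeds ~3.2 MP (aspect ratio preserved).
  const prepared = await h.resizeIfNeeded(original, Number(process.env.MIMIR_IMAGE_MAX_PIXELS ?? 3211264));

  // 2. Base64-encode and ask the VL model for a description.
  const dataUrl = `data:image/jpeg;base64,${prepared.toString('base64')}`;
  const description = await h.describeImage(dataUrl);

  // 3. Prepend file metadata, embed the enriched text, and persist both.
  const enriched = h.formatMetadata(filePath) + description;
  const embedding = await h.embedText(enriched);
  await h.store(filePath, description, embedding);
}
```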
### Image Metadata Format
When images are indexed, their metadata + AI description is formatted:
```typescript
const imageMetadata: ImageMetadata = {
  ...fileMetadata,
  format: 'jpeg',
  width: 1920,
  height: 1080,
  description: 'A screenshot showing a terminal window with code execution...' // AI-generated
};
// Example output for embedding:
// "This is a JPEG image named screenshot.jpg located at docs/images/screenshot.jpg,
// 1920x1080 pixels. Description: A screenshot showing a terminal window with code execution
// and colorful syntax highlighting. The terminal displays Python code with import statements
// and function definitions."
```
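A sketch of a formatter that could produce the example output above. The `ImageMetadata` shape shown here is illustrative; Mimir's actual type and wording may differ.

```typescript
// Illustrative shape; Mimir's actual ImageMetadata type may differ.
interface ImageMetadata extends FileMetadata {
  format: string;
  width?: number;
  height?: number;
  description: string; // AI-generated by the VL model
}

// Produces text like the example output above.
function formatImageMetadataForEmbedding(meta: ImageMetadata): string {
  const dims = meta.width && meta.height ? `, ${meta.width}x${meta.height} pixels` : '';
  return (
    `This is a ${meta.format.toUpperCase()} image named ${meta.name} ` +
    `located at ${meta.relativePath}${dims}. ` +
    `Description: ${meta.description}`
  );
}
```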
---
## 🖼️ Image Embeddings (VL Description Method)
### Overview
Mimir can index and search images using a Vision-Language Model (VLM) to generate text descriptions, which are then embedded alongside text content. This enables semantic search across both text and images.
**Status**: ✅ **Production Ready** (Disabled by default)
### How It Works
Mimir supports **two modes** for image embeddings:
#### Mode 1: VL Description Method (Default) ⭐
Uses a Vision-Language Model to generate text descriptions:
1. **Image Detection** - Automatically identifies image files (JPG, PNG, WEBP, GIF, BMP, TIFF)
2. **Preprocessing** - Resizes images >3.2 MP to fit model limits (aspect ratio preserved)
3. **VL Analysis** - Qwen2.5-VL generates detailed text description
4. **Metadata Enrichment** - Adds file metadata to description
5. **Text Embedding** - Standard text embedding model embeds the enriched description
6. **Storage** - Both description and embedding stored in Neo4j
**Benefits:**
- ✅ Works with existing text embedding infrastructure
- ✅ No need for rare multimodal GGUF models
- ✅ Semantic image search via descriptions
- ✅ Human-readable descriptions stored alongside embeddings
- ✅ Automatic handling of large images (no manual chunking)
**Enable with:**
```bash
MIMIR_EMBEDDINGS_IMAGES=true
MIMIR_EMBEDDINGS_IMAGES_DESCRIBE_MODE=true # Default
```
#### Mode 2: Direct Multimodal Embedding
Sends images directly to a multimodal embeddings endpoint:
1. **Image Detection** - Automatically identifies image files
2. **Preprocessing** - Resizes images if needed
3. **Direct Embedding** - Sends image as data URL to multimodal embeddings API
4. **Storage** - Image embedding stored in Neo4j
**Benefits:**
- ✅ True multimodal embeddings (if model supports it)
- ✅ No intermediate text description
- ✅ Can use any OpenAI-compatible multimodal embeddings API
- ✅ Faster processing (no VL model inference)
**Enable with:**
```bash
MIMIR_EMBEDDINGS_IMAGES=true
MIMIR_EMBEDDINGS_IMAGES_DESCRIBE_MODE=false # Direct mode
# Point to your multimodal embeddings endpoint
MIMIR_EMBEDDINGS_API=http://your-multimodal-api:8080
MIMIR_EMBEDDINGS_API_PATH=/v1/embeddings
MIMIR_EMBEDDINGS_MODEL=your-multimodal-model
```
**Note**: Mode 2 requires a multimodal embeddings endpoint that accepts images in OpenAI format:
```json
{
  "model": "multimodal-model",
  "input": [{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}]
}
```
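A hedged sketch of what that request looks like from TypeScript, assuming an OpenAI-compatible `/v1/embeddings` endpoint and the standard response shape; whether your server accepts `image_url` inputs depends on the model behind it.

```typescript
// Sketch of a direct multimodal embedding call (Mode 2). The body mirrors the
// JSON above; support for image_url inputs depends on your embeddings server.
async function embedImageDirect(dataUrl: string): Promise<number[]> {
  const base = process.env.MIMIR_EMBEDDINGS_API ?? 'http://your-multimodal-api:8080';
  const path = process.env.MIMIR_EMBEDDINGS_API_PATH ?? '/v1/embeddings';

  const res = await fetch(`${base}${path}`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${process.env.MIMIR_EMBEDDINGS_API_KEY ?? 'dummy-key'}`,
    },
    body: JSON.stringify({
      model: process.env.MIMIR_EMBEDDINGS_MODEL,
      input: [{ type: 'image_url', image_url: { url: dataUrl } }],
    }),
  });

  const json = (await res.json()) as { data: { embedding: number[] }[] };
  return json.data[0].embedding; // standard OpenAI embeddings response shape
}
```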
### Quick Start
Choose your mode based on your needs:
- **Mode 1 (VL Description)**: Best for most users, generates human-readable descriptions
- **Mode 2 (Direct Multimodal)**: For advanced users with multimodal embeddings endpoints
#### Mode 1: VL Description Method (Recommended)
**1. Enable Image Indexing**
```bash
# In your .env file or docker-compose
MIMIR_EMBEDDINGS_IMAGES=true # Enable image indexing
MIMIR_EMBEDDINGS_IMAGES_DESCRIBE_MODE=true # Use VL description method (default)
```
**2. Uncomment VL Server in docker-compose.arm64.yml**
```yaml
llama-vl-server:
  image: timothyswt/llama-cpp-server-arm64-qwen2.5-vl-7b:latest # or :2b for lighter
  container_name: llama_vl_server
  ports:
    - "8081:8080"
  # ... rest of config
```
**3. Start Services**
```bash
docker compose -f docker-compose.arm64.yml up -d
```
#### Mode 2: Direct Multimodal Embedding (Advanced)
**1. Enable Image Indexing with Direct Mode**
```bash
# In your .env file or docker-compose
MIMIR_EMBEDDINGS_IMAGES=true # Enable image indexing
MIMIR_EMBEDDINGS_IMAGES_DESCRIBE_MODE=false # Use direct multimodal mode
# Point to your multimodal embeddings endpoint
MIMIR_EMBEDDINGS_API=http://your-multimodal-api:8080
MIMIR_EMBEDDINGS_API_PATH=/v1/embeddings
MIMIR_EMBEDDINGS_MODEL=your-multimodal-model
MIMIR_EMBEDDINGS_DIMENSIONS=1024 # Your model's dimensions
```
**2. No VL Server Needed**
Direct mode sends images to your embeddings endpoint, so you don't need the llama-vl-server.
**3. Start Services**
```bash
docker compose -f docker-compose.arm64.yml up -d
```
---
#### Common Steps (Both Modes)
**Index a Folder with Images**
```bash
curl -X POST http://localhost:3000/api/index/folder \
  -H "Content-Type: application/json" \
  -d '{
    "path": "/workspace/images",
    "generateEmbeddings": true
  }'
```
**Search for Images**
```bash
curl -X POST http://localhost:3000/api/search/semantic \
  -H "Content-Type: application/json" \
  -d '{
    "query": "photo of food",
    "limit": 10
  }'
```
### Model Selection
Mimir provides two pre-built Docker images for different resource requirements:
#### 7B Model (Recommended) ⭐
**Image**: `timothyswt/llama-cpp-server-arm64-qwen2.5-vl-7b:latest`
**Specs:**
- **Size**: 6.88 GB
- **RAM Required**: ~8 GB
- **Context**: 128K tokens
- **Speed**: ~30-60 seconds per image (ARM64)
- **Quality**: Excellent descriptions
**Best for:**
- Production environments
- High-quality image descriptions
- Detailed scene understanding
- Complex images with multiple objects
**Configuration:**
```yaml
llama-vl-server:
  image: timothyswt/llama-cpp-server-arm64-qwen2.5-vl-7b:latest
  environment:
    - LLAMA_ARG_CTX_SIZE=131072 # 128K tokens
```
#### 2B Model (Resource-Constrained)
**Image**: `timothyswt/llama-cpp-server-arm64-qwen2.5-vl-2b:latest`
**Specs:**
- **Size**: 2.86 GB
- **RAM Required**: ~4 GB
- **Context**: 32K tokens
- **Speed**: ~15-30 seconds per image (ARM64)
- **Quality**: Good descriptions
**Best for:**
- Development environments
- Resource-constrained systems
- Faster processing
- Simple images
**Configuration:**
```yaml
llama-vl-server:
  image: timothyswt/llama-cpp-server-arm64-qwen2.5-vl-2b:latest
  environment:
    - LLAMA_ARG_CTX_SIZE=32768 # 32K tokens
```
**To switch models:**
1. Update the `image:` line in `docker-compose.arm64.yml`
2. Update `LLAMA_ARG_CTX_SIZE` to match model capacity
3. Restart: `docker compose -f docker-compose.arm64.yml up -d llama-vl-server`
### Configuration Reference
#### Image Processing
```bash
# Enable/disable image indexing
MIMIR_EMBEDDINGS_IMAGES=false # Default: disabled for safety
# VL description mode (true) or direct multimodal embedding (false); see Mode 1 vs Mode 2 above
MIMIR_EMBEDDINGS_IMAGES_DESCRIBE_MODE=true # Default: true
# Image preprocessing
MIMIR_IMAGE_MAX_PIXELS=3211264 # Qwen2.5-VL limit (~1792×1792)
MIMIR_IMAGE_TARGET_SIZE=1536 # Conservative resize target
MIMIR_IMAGE_RESIZE_QUALITY=90 # JPEG quality after resize
```
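For illustration, here is how these settings might be applied with the `sharp` image library; this is a sketch under that assumption, not necessarily the library Mimir uses internally.

```typescript
import sharp from 'sharp';

// Downscale only when the pixel count exceeds MIMIR_IMAGE_MAX_PIXELS,
// preserving aspect ratio and re-encoding at MIMIR_IMAGE_RESIZE_QUALITY.
async function resizeIfNeeded(input: Buffer): Promise<Buffer> {
  const maxPixels = Number(process.env.MIMIR_IMAGE_MAX_PIXELS ?? 3211264);
  const target = Number(process.env.MIMIR_IMAGE_TARGET_SIZE ?? 1536);
  const quality = Number(process.env.MIMIR_IMAGE_RESIZE_QUALITY ?? 90);

  const { width = 0, height = 0 } = await sharp(input).metadata();
  if (width * height <= maxPixels) return input; // already within Qwen2.5-VL limits

  return sharp(input)
    .resize(target, target, { fit: 'inside', withoutEnlargement: true })
    .jpeg({ quality })
    .toBuffer();
}
```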
#### VL Provider Settings
```bash
# VL server configuration
MIMIR_EMBEDDINGS_VL_PROVIDER=llama.cpp
MIMIR_EMBEDDINGS_VL_API=http://llama-vl-server:8080
MIMIR_EMBEDDINGS_VL_API_PATH=/v1/chat/completions
MIMIR_EMBEDDINGS_VL_API_KEY=dummy-key # Not required for local llama.cpp
# Model settings
MIMIR_EMBEDDINGS_VL_MODEL=qwen2.5-vl
MIMIR_EMBEDDINGS_VL_CONTEXT_SIZE=131072 # 128K for 7b, 32K for 2b
MIMIR_EMBEDDINGS_VL_MAX_TOKENS=2048 # Max description length
MIMIR_EMBEDDINGS_VL_TEMPERATURE=0.7
MIMIR_EMBEDDINGS_VL_TIMEOUT=180000 # 3 minutes (VL is slow)
```
**Fallback Hierarchy**: VL-specific settings override the general embedding settings; any VL setting left unset falls back to its general counterpart.
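Putting these settings together, the description request is an OpenAI-compatible chat completion with an image part. The prompt text and response parsing below are illustrative, not Mimir's exact implementation.

```typescript
// Illustrative chat-completions call for the VL description mode.
async function describeImage(dataUrl: string): Promise<string> {
  const base = process.env.MIMIR_EMBEDDINGS_VL_API ?? 'http://llama-vl-server:8080';
  const path = process.env.MIMIR_EMBEDDINGS_VL_API_PATH ?? '/v1/chat/completions';

  const res = await fetch(`${base}${path}`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: process.env.MIMIR_EMBEDDINGS_VL_MODEL ?? 'qwen2.5-vl',
      max_tokens: Number(process.env.MIMIR_EMBEDDINGS_VL_MAX_TOKENS ?? 2048),
      temperature: Number(process.env.MIMIR_EMBEDDINGS_VL_TEMPERATURE ?? 0.7),
      messages: [{
        role: 'user',
        content: [
          { type: 'text', text: 'Describe this image in detail for semantic search.' },
          { type: 'image_url', image_url: { url: dataUrl } },
        ],
      }],
    }),
    // Abort if the VL server takes longer than MIMIR_EMBEDDINGS_VL_TIMEOUT.
    signal: AbortSignal.timeout(Number(process.env.MIMIR_EMBEDDINGS_VL_TIMEOUT ?? 180000)),
  });

  const json = (await res.json()) as { choices: { message: { content: string } }[] };
  return json.choices[0].message.content;
}
```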
### Performance Expectations
#### qwen2.5-vl-7b on ARM64:
- Small images (<1MP): ~15-30 seconds
- Medium images (1-3MP): ~30-60 seconds
- Large images (>3MP, auto-resized): ~30-60 seconds
#### qwen2.5-vl-2b on ARM64:
- Small images: ~8-15 seconds
- Medium images: ~15-30 seconds
- Large images: ~15-30 seconds
**Note**: First image may take longer due to model loading. Subsequent images are faster.
### Example Use Cases
**1. Find screenshots:**
```bash
vector_search_nodes(query='screenshot of terminal with code', types=['file'])
```
**2. Locate diagrams:**
```bash
vector_search_nodes(query='architecture diagram showing microservices', types=['file'])
```
**3. Search photos:**
```bash
vector_search_nodes(query='photo of food on a plate', types=['file'])
```
**4. Find UI mockups:**
```bash
vector_search_nodes(query='user interface design with buttons and forms', types=['file'])
```
### Troubleshooting
#### Images timing out
**Symptom**: `TimeoutError: The operation was aborted due to timeout`
**Solution**: Increase timeout (default 3 minutes)
```bash
MIMIR_EMBEDDINGS_VL_TIMEOUT=300000 # 5 minutes
```
#### VL server not responding
**Check health:**
```bash
docker logs llama_vl_server
curl http://localhost:8081/health
```
**Restart:**
```bash
docker compose -f docker-compose.arm64.yml restart llama-vl-server
```
#### Out of memory
**Symptom**: Container crashes or system becomes unresponsive
**Solution**: Switch to 2B model or increase Docker memory limit
```bash
# In Docker Desktop: Settings → Resources → Memory → 8GB+
```
#### Wrong server URL
**Symptom**: `ECONNREFUSED` or connection errors
**Fix**: Ensure correct URLs in docker-compose:
- Text embeddings: `http://llama-server:8080`
- Image embeddings: `http://llama-vl-server:8080`
### Building Custom Images
If you need to build the images yourself:
```bash
# Build 7B image
./scripts/build-llama-cpp-qwen-vl.sh 7b
# Build 2B image
./scripts/build-llama-cpp-qwen-vl.sh 2b
# Push to your registry
docker tag timothyswt/llama-cpp-server-arm64-qwen2.5-vl-7b:latest your-registry/qwen2.5-vl:7b
docker push your-registry/qwen2.5-vl:7b
```
See `scripts/build-llama-cpp-qwen-vl.sh` for details.
---
## Related
- [Knowledge Graph Guide](./KNOWLEDGE_GRAPH.md)
- [File Indexing System](../architecture/FILE_INDEXING_SYSTEM.md)
- [Qwen2.5-VL Setup Guide](../configuration/QWEN_VL_SETUP.md)
- [Vector Search](../../README.md#vector-search)