# Metadata-Enriched Embeddings & Image Indexing

## Overview

This guide covers two complementary features that enhance Mimir's semantic search capabilities:

1. **Metadata-Enriched Embeddings** - Embeds file metadata (filename, path, language, directory) alongside content for better file discovery
2. **Image Embeddings** - Indexes images using Vision-Language models to generate searchable text descriptions

Both features work together to provide comprehensive semantic search across text files, documents, and images.

## Problem Solved

**Before**: Searching "authentication config" would only match files containing those words in their content.

**After**: The same search also matches:

- Files named `auth-config.ts`
- Files in `auth/` or `config/` directories
- Files with `AuthConfig` class names
- Any semantically related file paths

## Implementation

### Phase 1: Metadata Formatting (✅ COMPLETE)

Added to `src/indexing/EmbeddingsService.ts`:

```typescript
export interface FileMetadata {
  name: string;
  relativePath: string;
  language: string;
  extension: string;
  directory?: string;
  sizeBytes?: number;
}

export function formatMetadataForEmbedding(metadata: FileMetadata): string {
  // Returns natural language description of file
  // Example: "This is a typescript file named auth-api.ts located at src/api/auth-api.ts in the src/api directory."
}
```

### Phase 2: FileIndexer Integration (⏳ TODO)

Modify `src/indexing/FileIndexer.ts` to prepend metadata before generating embeddings:

```typescript
// Around line 197-199
const metadata: FileMetadata = {
  name: path.basename(filePath),
  relativePath: relativePath,
  language: language,
  extension: extension,
  directory: path.dirname(relativePath),
  sizeBytes: stats.size
};

// Prepend metadata to content
const metadataPrefix = formatMetadataForEmbedding(metadata);
const enrichedContent = metadataPrefix + content;

// Generate embeddings with enriched content
const chunkEmbeddings = await this.embeddingsService.generateChunkEmbeddings(enrichedContent);
```

## Benefits

### Improved Search Examples

| Query | Additional Matches |
|-------|-------------------|
| "authentication config" | `auth-config.ts`, `src/auth/config.ts`, `authentication.config.json` |
| "user database models" | `src/users/db/models.ts`, `user-model.py`, `database/users/*` |
| "typescript API routes" | All `.ts` files in `api/` or `routes/` directories |
| "markdown documentation" | All `.md` files, files in `docs/` directory |
| "test authentication" | `auth.test.ts`, `test/auth/*`, test files with "auth" in path |

### Natural Language Format

The metadata is formatted as natural language for optimal embedding quality:

```
This is a typescript file named auth-api.ts located at src/api/auth-api.ts in the src/api directory.

export class AuthService {
  async authenticate(user: User) {
    // ... actual content ...
  }
}
```

## Standard Behavior

Metadata enrichment is **always enabled** - it's the standard way Mimir embeds files. This ensures optimal semantic search across your entire codebase.

## Backward Compatibility

✅ **Fully backward compatible**

- Existing embeddings continue to work
- New embeddings generated on next file modification
- No schema changes required
- No data migration needed
- Gradual rollout as files are re-indexed

## Storage Impact

- **File nodes**: No change (metadata already stored)
- **Chunk nodes**: ~50-100 characters larger (metadata prefix)
- **Embeddings**: Same dimensions, enriched content
- **Estimated increase**: ~5% more text in chunks

## Testing

After implementation:

1. Index a sample file with metadata
2. Search by filename only
3. Search by directory path
4. Search by file type/language
5. Verify metadata appears in search results
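One quick way to exercise the first checks is to reproduce the enrichment step directly and inspect the text that would actually be embedded. The snippet below is a minimal sketch that assumes only the `FileMetadata` interface and `formatMetadataForEmbedding` helper from Phase 1; the import path and assertions are illustrative and not part of Mimir's test suite.

```typescript
// verify-enrichment.ts - illustrative check that metadata is prepended before embedding.
// Assumes the Phase 1 exports from EmbeddingsService.ts; import path may differ in your setup.
import { formatMetadataForEmbedding, FileMetadata } from './src/indexing/EmbeddingsService';

const metadata: FileMetadata = {
  name: 'auth-api.ts',
  relativePath: 'src/api/auth-api.ts',
  language: 'typescript',
  extension: '.ts',
  directory: 'src/api',
};

const content = 'export class AuthService { /* ... */ }';

// Same enrichment step as Phase 2: metadata prefix + original content.
const enriched = formatMetadataForEmbedding(metadata) + content;

// The embedded text should mention the filename and path, so a query like
// "authentication api" can match on metadata even if the content never says "auth".
console.assert(enriched.includes('auth-api.ts'), 'filename missing from enriched text');
console.assert(enriched.includes('src/api/auth-api.ts'), 'path missing from enriched text');
console.log(enriched.slice(0, 200));
```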
## Image Embeddings Configuration

Mimir supports separate configuration for image embeddings using Vision-Language (VL) models.

### Configuration Hierarchy

**VL-specific config → General embedding config → Defaults**

If VL-specific environment variables are set, they override general settings for images. Otherwise, images use the same config as text.

### Environment Variables

```bash
# Image Indexing Control
MIMIR_EMBEDDINGS_IMAGES=true                          # Enable/disable image indexing (default: true)

# Optional VL-Specific Config (if not set, falls back to general embedding config)
MIMIR_EMBEDDINGS_VL_PROVIDER=openai                   # VL provider (defaults to MIMIR_EMBEDDINGS_PROVIDER)
MIMIR_EMBEDDINGS_VL_API=http://llama-vl-server:8081   # VL API endpoint (defaults to MIMIR_EMBEDDINGS_API)
MIMIR_EMBEDDINGS_VL_API_PATH=/v1/embeddings           # VL API path (defaults to MIMIR_EMBEDDINGS_API_PATH)
MIMIR_EMBEDDINGS_VL_API_KEY=dummy-key                 # VL API key (defaults to MIMIR_EMBEDDINGS_API_KEY)
MIMIR_EMBEDDINGS_VL_MODEL=nomic-embed-multimodal      # VL model name (defaults to MIMIR_EMBEDDINGS_MODEL)
MIMIR_EMBEDDINGS_VL_DIMENSIONS=768                    # VL dimensions (defaults to MIMIR_EMBEDDINGS_DIMENSIONS)
```
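The fallback hierarchy amounts to a simple lookup: check the `MIMIR_EMBEDDINGS_VL_*` variable first, then the general `MIMIR_EMBEDDINGS_*` variable, then a default. The helper below is only a sketch of that behavior, not Mimir's actual configuration code, and the default values shown are illustrative.

```typescript
// Sketch of the VL → general → default fallback described above (illustrative only).
function resolveEmbeddingSetting(suffix: string, fallback?: string): string | undefined {
  // e.g. suffix = "MODEL" checks MIMIR_EMBEDDINGS_VL_MODEL, then MIMIR_EMBEDDINGS_MODEL.
  return (
    process.env[`MIMIR_EMBEDDINGS_VL_${suffix}`] ??
    process.env[`MIMIR_EMBEDDINGS_${suffix}`] ??
    fallback
  );
}

// Images use the VL model if one is configured, otherwise the general text model.
const vlModel = resolveEmbeddingSetting('MODEL');
const vlApi = resolveEmbeddingSetting('API', 'http://llama-server:8080'); // fallback value is illustrative
const vlDimensions = Number(resolveEmbeddingSetting('DIMENSIONS', '768'));
```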
### Configuration Examples

#### Option 1: Single Unified Model (Simplest)

Use the same model for both text and images:

```bash
MIMIR_EMBEDDINGS_IMAGES=true
MIMIR_EMBEDDINGS_MODEL=nomic-embed-multimodal
MIMIR_EMBEDDINGS_DIMENSIONS=768
# No VL-specific vars needed
```

#### Option 2: Separate Models

Use different models for text vs images:

```bash
MIMIR_EMBEDDINGS_IMAGES=true

# Text embeddings
MIMIR_EMBEDDINGS_MODEL=mxbai-embed-large
MIMIR_EMBEDDINGS_DIMENSIONS=1024

# Image embeddings
MIMIR_EMBEDDINGS_VL_MODEL=nomic-embed-multimodal
MIMIR_EMBEDDINGS_VL_DIMENSIONS=768
```

#### Option 3: Separate Servers

Run text and image models on different servers:

```bash
# Text on port 8080
MIMIR_EMBEDDINGS_API=http://llama-server:8080
MIMIR_EMBEDDINGS_MODEL=mxbai-embed-large

# Images on port 8081
MIMIR_EMBEDDINGS_VL_API=http://llama-vl-server:8081
MIMIR_EMBEDDINGS_VL_MODEL=nomic-embed-multimodal
```

#### Option 4: No Images

Disable image indexing entirely:

```bash
MIMIR_EMBEDDINGS_IMAGES=false
# Image files are skipped entirely
```

### Image Description Mode (Default)

**How it works:**

Since multimodal GGUF embedding models are hard to find, Mimir uses a **description-based approach**:

1. **Image preprocessing**: Automatically resize images >3.2 MP to fit Qwen2.5-VL limits
2. **qwen2.5-vl** (Vision-Language model) analyzes the image
3. Generates a detailed text description of the image content
4. **Text embedding model** (e.g., mxbai-embed-large) embeds the description
5. Both description and embedding are stored in Neo4j

**Benefits:**

- ✅ Works with existing text embedding infrastructure
- ✅ No need for rare multimodal GGUF models
- ✅ Semantic image search via descriptions
- ✅ Human-readable descriptions stored alongside embeddings
- ✅ Automatic handling of large images (no manual chunking)

**Configuration:**

```bash
MIMIR_EMBEDDINGS_IMAGES_DESCRIBE_MODE=true   # Default: true
MIMIR_EMBEDDINGS_VL_MODEL=qwen2.5-vl         # VL model for descriptions
MIMIR_EMBEDDINGS_VL_API=http://llama-vl-server:8080
MIMIR_EMBEDDINGS_VL_PROVIDER=llama.cpp

# Image processing
MIMIR_IMAGE_MAX_PIXELS=3211264               # Qwen2.5-VL limit (~1792×1792)
MIMIR_IMAGE_TARGET_SIZE=1536                 # Conservative resize target
```

### Image Processing Strategy

**Automatic Downscaling (No Chunking Required):**

Qwen2.5-VL has built-in dynamic resolution handling with these limits:

- **Maximum**: ~1792×1792 pixels (3.2 megapixels)
- **Minimum**: ~79×79 pixels (6,272 pixels)

**For images within limits** (most photos, screenshots):

- Sent directly to the VL model without modification
- Examples: 1920×1080 (Full HD), 1280×720, phone photos

**For images exceeding limits** (4K, 8K, high-res scans):

- Automatically resized to fit within 3.2 MP
- Aspect ratio preserved
- Resize time negligible (~50-300ms vs ~12-35s VL processing)

**Why no chunking:**

- ✅ Qwen2.5-VL uses Dynamic Resolution ViT (auto-segments into 14×14 patches)
- ✅ MRoPE preserves spatial relationships across the entire image
- ✅ Single API call = faster, simpler, more reliable
- ✅ Semantic search needs the "gist", not pixel-perfect detail

**Processing pipeline:**

```
Image File → Check size → Resize if >3.2MP → Base64 encode → Qwen2.5-VL → Text description → Add metadata → Embed → Store
```
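The resize step of this pipeline can be implemented with any image library. The sketch below uses `sharp` purely as an example (Mimir's actual implementation may use something else); the `MIMIR_IMAGE_*` values come from the configuration above.

```typescript
import sharp from 'sharp';

// Illustrative preprocessing step: downscale images above the Qwen2.5-VL pixel budget.
// The choice of the sharp library here is an assumption, not necessarily what Mimir ships.
const MAX_PIXELS = Number(process.env.MIMIR_IMAGE_MAX_PIXELS ?? 3211264);
const TARGET_SIZE = Number(process.env.MIMIR_IMAGE_TARGET_SIZE ?? 1536);

export async function prepareImageForVl(path: string): Promise<Buffer> {
  const image = sharp(path);
  const { width = 0, height = 0 } = await image.metadata();

  // Within limits: send as-is (covers most screenshots and photos).
  if (width * height <= MAX_PIXELS) {
    return image.toBuffer();
  }

  // Too large: shrink the longest side to the target, preserving aspect ratio.
  return image
    .resize(TARGET_SIZE, TARGET_SIZE, { fit: 'inside', withoutEnlargement: true })
    .jpeg({ quality: Number(process.env.MIMIR_IMAGE_RESIZE_QUALITY ?? 90) })
    .toBuffer();
}
```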
### Image Metadata Format

When images are indexed, their metadata + AI description is formatted:

```typescript
const imageMetadata: ImageMetadata = {
  ...fileMetadata,
  format: 'jpeg',
  width: 1920,
  height: 1080,
  description: 'A screenshot showing a terminal window with code execution...' // AI-generated
};

// Example output for embedding:
// "This is a JPEG image named screenshot.jpg located at docs/images/screenshot.jpg,
// 1920x1080 pixels. Description: A screenshot showing a terminal window with code execution
// and colorful syntax highlighting. The terminal displays Python code with import statements
// and function definitions."
```

---

## 🖼️ Image Embeddings (VL Description Method)

### Overview

Mimir can index and search images using a Vision-Language Model (VLM) to generate text descriptions, which are then embedded alongside text content. This enables semantic search across both text and images.

**Status**: ✅ **Production Ready** (Disabled by default)

### How It Works

Mimir supports **two modes** for image embeddings:

#### Mode 1: VL Description Method (Default) ⭐

Uses a Vision-Language Model to generate text descriptions:

1. **Image Detection** - Automatically identifies image files (JPG, PNG, WEBP, GIF, BMP, TIFF)
2. **Preprocessing** - Resizes images >3.2 MP to fit model limits (aspect ratio preserved)
3. **VL Analysis** - Qwen2.5-VL generates a detailed text description
4. **Metadata Enrichment** - Adds file metadata to the description
5. **Text Embedding** - Standard text embedding model embeds the enriched description
6. **Storage** - Both description and embedding stored in Neo4j

**Benefits:**

- ✅ Works with existing text embedding infrastructure
- ✅ No need for rare multimodal GGUF models
- ✅ Semantic image search via descriptions
- ✅ Human-readable descriptions stored alongside embeddings
- ✅ Automatic handling of large images (no manual chunking)

**Enable with:**

```bash
MIMIR_EMBEDDINGS_IMAGES=true
MIMIR_EMBEDDINGS_IMAGES_DESCRIBE_MODE=true   # Default
```

#### Mode 2: Direct Multimodal Embedding

Sends images directly to a multimodal embeddings endpoint:

1. **Image Detection** - Automatically identifies image files
2. **Preprocessing** - Resizes images if needed
3. **Direct Embedding** - Sends the image as a data URL to the multimodal embeddings API
4. **Storage** - Image embedding stored in Neo4j

**Benefits:**

- ✅ True multimodal embeddings (if the model supports it)
- ✅ No intermediate text description
- ✅ Can use any OpenAI-compatible multimodal embeddings API
- ✅ Faster processing (no VL model inference)

**Enable with:**

```bash
MIMIR_EMBEDDINGS_IMAGES=true
MIMIR_EMBEDDINGS_IMAGES_DESCRIBE_MODE=false  # Direct mode

# Point to your multimodal embeddings endpoint
MIMIR_EMBEDDINGS_API=http://your-multimodal-api:8080
MIMIR_EMBEDDINGS_API_PATH=/v1/embeddings
MIMIR_EMBEDDINGS_MODEL=your-multimodal-model
```

**Note**: Mode 2 requires a multimodal embeddings endpoint that accepts images in OpenAI format:

```json
{
  "model": "multimodal-model",
  "input": [{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}]
}
```
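For illustration, a Mode 2 request of that shape can be issued in a few lines of TypeScript. This is only a sketch assuming an OpenAI-style response body; the endpoint, model, and API key come from the environment variables shown above and are not validated here.

```typescript
// Sketch of a Mode 2 call: embed an image directly via an OpenAI-compatible
// multimodal embeddings API. Response handling assumes a standard OpenAI-style payload.
import { readFile } from 'node:fs/promises';

async function embedImageDirectly(imagePath: string): Promise<number[]> {
  const base64 = (await readFile(imagePath)).toString('base64');

  const response = await fetch(
    `${process.env.MIMIR_EMBEDDINGS_API}${process.env.MIMIR_EMBEDDINGS_API_PATH ?? '/v1/embeddings'}`,
    {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        Authorization: `Bearer ${process.env.MIMIR_EMBEDDINGS_API_KEY ?? 'dummy-key'}`,
      },
      body: JSON.stringify({
        model: process.env.MIMIR_EMBEDDINGS_MODEL,
        input: [
          { type: 'image_url', image_url: { url: `data:image/jpeg;base64,${base64}` } },
        ],
      }),
    },
  );

  const data = await response.json();
  return data.data[0].embedding; // OpenAI-style response shape (assumed)
}
```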
### Quick Start

Choose your mode based on your needs:

- **Mode 1 (VL Description)**: Best for most users; generates human-readable descriptions
- **Mode 2 (Direct Multimodal)**: For advanced users with multimodal embeddings endpoints

#### Mode 1: VL Description Method (Recommended)

**1. Enable Image Indexing**

```bash
# In your .env file or docker-compose
MIMIR_EMBEDDINGS_IMAGES=true                 # Enable image indexing
MIMIR_EMBEDDINGS_IMAGES_DESCRIBE_MODE=true   # Use VL description method (default)
```

**2. Uncomment VL Server in docker-compose.arm64.yml**

```yaml
llama-vl-server:
  image: timothyswt/llama-cpp-server-arm64-qwen2.5-vl-7b:latest  # or :2b for lighter
  container_name: llama_vl_server
  ports:
    - "8081:8080"
  # ... rest of config
```

**3. Start Services**

```bash
docker compose -f docker-compose.arm64.yml up -d
```

#### Mode 2: Direct Multimodal Embedding (Advanced)

**1. Enable Image Indexing with Direct Mode**

```bash
# In your .env file or docker-compose
MIMIR_EMBEDDINGS_IMAGES=true                 # Enable image indexing
MIMIR_EMBEDDINGS_IMAGES_DESCRIBE_MODE=false  # Use direct multimodal mode

# Point to your multimodal embeddings endpoint
MIMIR_EMBEDDINGS_API=http://your-multimodal-api:8080
MIMIR_EMBEDDINGS_API_PATH=/v1/embeddings
MIMIR_EMBEDDINGS_MODEL=your-multimodal-model
MIMIR_EMBEDDINGS_DIMENSIONS=1024             # Your model's dimensions
```

**2. No VL Server Needed**

Direct mode sends images to your embeddings endpoint, so you don't need the llama-vl-server.

**3. Start Services**

```bash
docker compose -f docker-compose.arm64.yml up -d
```

---

#### Common Steps (Both Modes)

**Index a Folder with Images**

```bash
curl -X POST http://localhost:3000/api/index/folder \
  -H "Content-Type: application/json" \
  -d '{
    "path": "/workspace/images",
    "generateEmbeddings": true
  }'
```

**Search for Images**

```bash
curl -X POST http://localhost:3000/api/search/semantic \
  -H "Content-Type: application/json" \
  -d '{
    "query": "photo of food",
    "limit": 10
  }'
```

### Model Selection

Mimir provides two pre-built Docker images for different resource requirements:

#### 7B Model (Recommended) ⭐

**Image**: `timothyswt/llama-cpp-server-arm64-qwen2.5-vl-7b:latest`

**Specs:**

- **Size**: 6.88 GB
- **RAM Required**: ~8 GB
- **Context**: 128K tokens
- **Speed**: ~30-60 seconds per image (ARM64)
- **Quality**: Excellent descriptions

**Best for:**

- Production environments
- High-quality image descriptions
- Detailed scene understanding
- Complex images with multiple objects

**Configuration:**

```yaml
llama-vl-server:
  image: timothyswt/llama-cpp-server-arm64-qwen2.5-vl-7b:latest
  environment:
    - LLAMA_ARG_CTX_SIZE=131072  # 128K tokens
```

#### 2B Model (Resource-Constrained)

**Image**: `timothyswt/llama-cpp-server-arm64-qwen2.5-vl-2b:latest`

**Specs:**

- **Size**: 2.86 GB
- **RAM Required**: ~4 GB
- **Context**: 32K tokens
- **Speed**: ~15-30 seconds per image (ARM64)
- **Quality**: Good descriptions

**Best for:**

- Development environments
- Resource-constrained systems
- Faster processing
- Simple images

**Configuration:**

```yaml
llama-vl-server:
  image: timothyswt/llama-cpp-server-arm64-qwen2.5-vl-2b:latest
  environment:
    - LLAMA_ARG_CTX_SIZE=32768  # 32K tokens
```

**To switch models:**

1. Update the `image:` line in `docker-compose.arm64.yml`
2. Update `LLAMA_ARG_CTX_SIZE` to match the model's capacity
3. Restart: `docker compose -f docker-compose.arm64.yml up -d llama-vl-server`

### Configuration Reference

#### Image Processing

```bash
# Enable/disable image indexing
MIMIR_EMBEDDINGS_IMAGES=false                # Default: disabled for safety

# VL description mode (vs direct multimodal embedding, Mode 2)
MIMIR_EMBEDDINGS_IMAGES_DESCRIBE_MODE=true   # Default: true

# Image preprocessing
MIMIR_IMAGE_MAX_PIXELS=3211264               # Qwen2.5-VL limit (~1792×1792)
MIMIR_IMAGE_TARGET_SIZE=1536                 # Conservative resize target
MIMIR_IMAGE_RESIZE_QUALITY=90                # JPEG quality after resize
```

#### VL Provider Settings

```bash
# VL server configuration
MIMIR_EMBEDDINGS_VL_PROVIDER=llama.cpp
MIMIR_EMBEDDINGS_VL_API=http://llama-vl-server:8080
MIMIR_EMBEDDINGS_VL_API_PATH=/v1/chat/completions
MIMIR_EMBEDDINGS_VL_API_KEY=dummy-key        # Not required for local llama.cpp

# Model settings
MIMIR_EMBEDDINGS_VL_MODEL=qwen2.5-vl
MIMIR_EMBEDDINGS_VL_CONTEXT_SIZE=131072      # 128K for 7B, 32K for 2B
MIMIR_EMBEDDINGS_VL_MAX_TOKENS=2048          # Max description length
MIMIR_EMBEDDINGS_VL_TEMPERATURE=0.7
MIMIR_EMBEDDINGS_VL_TIMEOUT=180000           # 3 minutes (VL inference is slow)
```

**Fallback Hierarchy**: VL-specific settings override the general embedding settings; any VL setting that is not provided falls back to its general counterpart.
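Putting the VL provider settings together, Mode 1 comes down to one chat-completions call for the description followed by one embeddings call for the resulting text. The sketch below approximates that flow; the prompt wording, response parsing, and lack of error handling are illustrative assumptions rather than Mimir's actual code.

```typescript
// Simplified Mode 1 flow: the VL model describes the image, the text model embeds the description.
// Uses the VL provider settings above; prompt text and response parsing are illustrative.
async function describeAndEmbed(imageBase64: string): Promise<{ description: string; embedding: number[] }> {
  // 1. Ask the VL server (OpenAI-compatible chat completions) for a description.
  const vlResponse = await fetch(
    `${process.env.MIMIR_EMBEDDINGS_VL_API}${process.env.MIMIR_EMBEDDINGS_VL_API_PATH ?? '/v1/chat/completions'}`,
    {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        model: process.env.MIMIR_EMBEDDINGS_VL_MODEL ?? 'qwen2.5-vl',
        max_tokens: Number(process.env.MIMIR_EMBEDDINGS_VL_MAX_TOKENS ?? 2048),
        temperature: Number(process.env.MIMIR_EMBEDDINGS_VL_TEMPERATURE ?? 0.7),
        messages: [{
          role: 'user',
          content: [
            { type: 'text', text: 'Describe this image in detail for semantic search.' },
            { type: 'image_url', image_url: { url: `data:image/jpeg;base64,${imageBase64}` } },
          ],
        }],
      }),
    },
  );
  const description: string = (await vlResponse.json()).choices[0].message.content;

  // 2. Embed the description with the regular text embeddings endpoint.
  const embedResponse = await fetch(`${process.env.MIMIR_EMBEDDINGS_API}/v1/embeddings`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model: process.env.MIMIR_EMBEDDINGS_MODEL, input: description }),
  });
  const embedding: number[] = (await embedResponse.json()).data[0].embedding;

  return { description, embedding };
}
```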
### Image Processing Strategy

#### Automatic Downscaling (No Chunking Required)

Qwen2.5-VL has built-in dynamic resolution handling with these limits:

- **Maximum**: ~1792×1792 pixels (3.2 megapixels)
- **Minimum**: ~79×79 pixels (6,272 pixels)

**For images within limits** (most photos, screenshots):

- Sent directly to the VL model without modification
- Examples: 1920×1080 (Full HD), 1280×720, phone photos

**For images exceeding limits** (4K, 8K, high-res scans):

- Automatically resized to fit within 3.2 MP
- Aspect ratio preserved
- Resize time negligible (~50-300ms vs ~12-35s VL processing)

**Why no chunking:**

- ✅ Qwen2.5-VL uses Dynamic Resolution ViT (auto-segments into 14×14 patches)
- ✅ MRoPE preserves spatial relationships across the entire image
- ✅ Single API call = faster, simpler, more reliable
- ✅ Semantic search needs the "gist", not pixel-perfect detail

**Processing pipeline:**

```
Image File → Check size → Resize if >3.2MP → Base64 encode → Qwen2.5-VL → Text description → Add metadata → Embed → Store
```

### Performance Expectations

#### qwen2.5-vl-7b on ARM64

- Small images (<1MP): ~15-30 seconds
- Medium images (1-3MP): ~30-60 seconds
- Large images (>3MP, auto-resized): ~30-60 seconds

#### qwen2.5-vl-2b on ARM64

- Small images: ~8-15 seconds
- Medium images: ~15-30 seconds
- Large images: ~15-30 seconds

**Note**: The first image may take longer due to model loading. Subsequent images are faster.

### Example Use Cases

**1. Find screenshots:**

```bash
vector_search_nodes(query='screenshot of terminal with code', types=['file'])
```

**2. Locate diagrams:**

```bash
vector_search_nodes(query='architecture diagram showing microservices', types=['file'])
```

**3. Search photos:**

```bash
vector_search_nodes(query='photo of food on a plate', types=['file'])
```

**4. Find UI mockups:**

```bash
vector_search_nodes(query='user interface design with buttons and forms', types=['file'])
```

### Troubleshooting

#### Images timing out

**Symptom**: `TimeoutError: The operation was aborted due to timeout`

**Solution**: Increase the timeout (default 3 minutes)

```bash
MIMIR_EMBEDDINGS_VL_TIMEOUT=300000  # 5 minutes
```

#### VL server not responding

**Check health:**

```bash
docker logs llama_vl_server
curl http://localhost:8081/health
```

**Restart:**

```bash
docker compose -f docker-compose.arm64.yml restart llama-vl-server
```

#### Out of memory

**Symptom**: Container crashes or the system becomes unresponsive

**Solution**: Switch to the 2B model or increase the Docker memory limit

```bash
# In Docker Desktop: Settings → Resources → Memory → 8GB+
```

#### Wrong server URL

**Symptom**: `ECONNREFUSED` or connection errors

**Fix**: Ensure correct URLs in docker-compose:

- Text embeddings: `http://llama-server:8080`
- Image embeddings: `http://llama-vl-server:8080`

### Building Custom Images

If you need to build the images yourself:

```bash
# Build 7B image
./scripts/build-llama-cpp-qwen-vl.sh 7b

# Build 2B image
./scripts/build-llama-cpp-qwen-vl.sh 2b

# Push to your registry
docker tag timothyswt/llama-cpp-server-arm64-qwen2.5-vl-7b:latest your-registry/qwen2.5-vl:7b
docker push your-registry/qwen2.5-vl:7b
```

See `scripts/build-llama-cpp-qwen-vl.sh` for details.

---

## Related

- [Knowledge Graph Guide](./KNOWLEDGE_GRAPH.md)
- [File Indexing System](../architecture/FILE_INDEXING_SYSTEM.md)
- [Qwen2.5-VL Setup Guide](../configuration/QWEN_VL_SETUP.md)
- [Vector Search](../../README.md#vector-search)
