# Building CUDA-Enabled llama.cpp Images for Windows/AMD64
This guide covers building GPU-accelerated Docker images for llama.cpp on Windows with NVIDIA GPUs.
## Available Images
### 1. Embeddings Server (CUDA)
- **Image:** `timothyswt/llama-cpp-server-amd64-mxbai:latest`
- **Model:** mxbai-embed-large (1024 dimensions)
- **Size:** ~640MB model + 2GB Docker image
- **Memory:** ~1.5GB VRAM when running
- **Build command:** `npm run llama:build-mxbai-cuda`
### 2. Vision Server 2B (CUDA)
- **Image:** `timothyswt/llama-cpp-server-amd64-qwen-vl-2b:latest`
- **Model:** Qwen2.5-VL-2B-Instruct (multimodal)
- **Size:** ~1.5GB model + 2GB Docker image
- **Memory:** ~4-6GB VRAM when running
- **Context:** 32K tokens
- **Build command:** `npm run llama:build-qwen-2b-cuda`
### 3. Vision Server 7B (CUDA)
- **Image:** `timothyswt/llama-cpp-server-amd64-qwen-vl-7b:latest`
- **Model:** Qwen2.5-VL-7B-Instruct (multimodal)
- **Size:** ~4.5GB model + 2GB Docker image
- **Memory:** ~8-10GB VRAM when running
- **Context:** 128K tokens
- **Build command:** `npm run llama:build-qwen-7b-cuda`
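If these images have already been published to Docker Hub, they can be pulled directly instead of built locally:
```powershell
# Pull only the image you plan to run; each is several GB
docker pull timothyswt/llama-cpp-server-amd64-mxbai:latest
docker pull timothyswt/llama-cpp-server-amd64-qwen-vl-2b:latest
docker pull timothyswt/llama-cpp-server-amd64-qwen-vl-7b:latest
```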
## Prerequisites
### 1. NVIDIA GPU with CUDA Support
```powershell
# Check if NVIDIA GPU is detected
nvidia-smi
```
### 2. Docker Desktop with WSL2 Backend
- Enable "Use WSL 2 based engine" in Docker Desktop settings
- Install NVIDIA Container Toolkit in WSL2:
```bash
# Inside WSL2 terminal
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
# With Docker Desktop, restart Docker Desktop from Windows instead of restarting the daemon here
sudo systemctl restart docker
```
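Before building anything, confirm that containers can actually see the GPU. This is a minimal smoke test; the CUDA base image tag below is just an example, any recent `nvidia/cuda` `*-base` tag works:
```powershell
# Run nvidia-smi inside a throwaway CUDA container; it should print the same GPU as the host
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```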
### 3. Ollama (for embeddings model only)
```powershell
# Pull the mxbai-embed-large model
ollama pull mxbai-embed-large
```
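The embeddings build extracts the GGUF weights from Ollama's local store, so verify the model is present before building:
```powershell
# mxbai-embed-large must appear in this list before running the build script
ollama list
```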
## Build Instructions
### Embeddings Server (mxbai-embed-large)
```powershell
# Build the image (extracts model from Ollama)
npm run llama:build-mxbai-cuda
# Or manually:
.\scripts\build-llama-cpp-mxbai-cuda.ps1
# Push to Docker Hub (optional)
.\scripts\build-llama-cpp-mxbai-cuda.ps1 -Push
```
**What it does:**
1. Extracts the mxbai-embed-large model from `~/.ollama/models` (see the sketch after this list)
2. Builds Docker image with CUDA support
3. Embeds the model in the image (~3.5GB total)
4. Optionally pushes to Docker Hub
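A rough sketch of what the extraction step does, assuming Ollama's usual on-disk layout (a JSON manifest whose model layer digest names a blob file). The destination path is illustrative; the actual build script may differ:
```powershell
# Assumed Ollama layout: the manifest's model layer points at the GGUF blob via its sha256 digest
$modelsDir = Join-Path $env:USERPROFILE ".ollama\models"
$manifest  = Get-Content (Join-Path $modelsDir "manifests\registry.ollama.ai\library\mxbai-embed-large\latest") -Raw | ConvertFrom-Json
$digest    = ($manifest.layers | Where-Object { $_.mediaType -eq "application/vnd.ollama.image.model" }).digest
$blobPath  = Join-Path $modelsDir ("blobs\" + ($digest -replace ":", "-"))
# Destination path is illustrative; the real script chooses its own staging folder
Copy-Item $blobPath "docker\llama-cpp\models\mxbai-embed-large.gguf"
```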
### Vision Server 2B (Qwen2.5-VL)
```powershell
# Build the image (auto-downloads models)
npm run llama:build-qwen-2b-cuda
# Or manually:
.\scripts\build-llama-cpp-qwen-vl-cuda.ps1 -ModelSize 2b
# Skip download if models already exist:
.\scripts\build-llama-cpp-qwen-vl-cuda.ps1 -ModelSize 2b -SkipDownload
# Push to Docker Hub:
.\scripts\build-llama-cpp-qwen-vl-cuda.ps1 -ModelSize 2b -Push
```
**What it does:**
1. Downloads Qwen2.5-VL-2B-Instruct (Q4_K_M) from HuggingFace (~1.5GB)
2. Downloads vision projector (~300MB)
3. Builds Docker image with CUDA support
4. Embeds models in the image (~4GB total)
5. Optionally pushes to Docker Hub
### Vision Server 7B (Qwen2.5-VL)
```powershell
# Build the image (auto-downloads models)
npm run llama:build-qwen-7b-cuda
# Or manually:
.\scripts\build-llama-cpp-qwen-vl-cuda.ps1 -ModelSize 7b
# Skip download if models already exist:
.\scripts\build-llama-cpp-qwen-vl-cuda.ps1 -ModelSize 7b -SkipDownload
# Push to Docker Hub:
.\scripts\build-llama-cpp-qwen-vl-cuda.ps1 -ModelSize 7b -Push
```
**What it does:**
1. Downloads Qwen2.5-VL-7B-Instruct (Q4_K_M) from HuggingFace (~4.5GB)
2. Downloads vision projector (~600MB)
3. Builds Docker image with CUDA support
4. Embeds models in the image (~7.5GB total)
5. Optionally pushes to Docker Hub
## Build Time Estimates
| Image | Download Time | Build Time | Total |
|-------|--------------|------------|-------|
| mxbai-embed-large | N/A (from Ollama) | 10-15 min | 10-15 min |
| Qwen 2B | 5-10 min | 15-20 min | 20-30 min |
| Qwen 7B | 15-20 min | 15-20 min | 30-40 min |
*Times vary based on internet speed and CPU performance*
## Using the Images
### Update docker-compose.amd64.yml
The images are already configured in `docker-compose.amd64.yml`. Uncomment the service you want to use:
```yaml
# For embeddings (default - already enabled):
llama-server:
image: timothyswt/llama-cpp-server-amd64-mxbai:latest
# GPU support already configured
# For vision 2B (uncomment to enable):
# llama-vl-server-2b:
# image: timothyswt/llama-cpp-server-amd64-qwen-vl-2b:latest
# # ... rest of config ...
# For vision 7B (uncomment to enable):
# llama-vl-server-7b:
# image: timothyswt/llama-cpp-server-amd64-qwen-vl-7b:latest
# # ... rest of config ...
```
### Restart Services
```powershell
# Rebuild and restart
npm run stop
npm run build:docker
npm run start
# Or just restart specific service:
docker-compose -f docker-compose.amd64.yml up -d llama-server
```
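To sanity-check the embeddings image outside of compose, it can also be run directly with GPU access. The port mapping here is an assumption based on the host port used elsewhere in this guide; check `docker-compose.amd64.yml` for the real container port:
```powershell
# Run the embeddings server standalone with all GPUs exposed
docker run --rm --gpus all -p 11434:11434 `
  timothyswt/llama-cpp-server-amd64-mxbai:latest
```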
## Verifying GPU Usage
### Check GPU Access
```powershell
# Verify GPU is accessible in container
docker exec llama_server nvidia-smi
# Expected output: GPU info with ~1-2GB memory used
```
### Check Model Loading
```powershell
# Check startup logs for CUDA initialization
docker logs llama_server | Select-String "CUDA","GPU","offload"
# Expected output (exact wording varies by llama.cpp version):
# - "CUDA support enabled"
# - "offloading X layers to GPU"
# - "GPU backend initialized"
```
### Test Embeddings Performance
```powershell
# Before GPU (CPU only): ~200-500ms per request
# After GPU: ~20-50ms per request (10x faster)
Measure-Command {
$body = @{input='Test sentence';model='mxbai-embed-large'} | ConvertTo-Json
Invoke-RestMethod -Uri 'http://localhost:11434/v1/embeddings' -Method Post -Body $body -ContentType 'application/json'
}
```
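A single request is noisy; for a steadier number, average a short batch against the same endpoint (a minimal sketch, assuming the service above is reachable):
```powershell
# Average latency over 20 embedding requests
$body = @{ input = 'Test sentence'; model = 'mxbai-embed-large' } | ConvertTo-Json
$times = 1..20 | ForEach-Object {
    (Measure-Command {
        Invoke-RestMethod -Uri 'http://localhost:11434/v1/embeddings' -Method Post -Body $body -ContentType 'application/json' | Out-Null
    }).TotalMilliseconds
}
"Average over 20 requests: {0:N1} ms" -f ($times | Measure-Object -Average).Average
```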
### Test Vision API
```powershell
# Test 2B model (port 8081); use curl.exe so PowerShell doesn't resolve `curl` to Invoke-WebRequest
curl.exe http://localhost:8081/v1/chat/completions `
-H "Content-Type: application/json" `
-d '{
"model": "vision-2b",
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": "What is in this image?"},
{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
]
}]
}'
# Test 7B model (port 8082)
curl.exe http://localhost:8082/v1/chat/completions `
-H "Content-Type: application/json" `
-d '{
"model": "vision-7b",
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": "What is in this image?"},
{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
]
}]
}'
```
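The `base64,...` placeholder above has to be a real encoded image. A PowerShell sketch that reads a local file (the path is illustrative) and sends it to the 2B endpoint:
```powershell
# Hypothetical local image; replace with a real path
$imagePath = "C:\temp\example.jpg"
$b64 = [Convert]::ToBase64String([IO.File]::ReadAllBytes($imagePath))
$payload = @{
    model    = "vision-2b"
    messages = @(
        @{
            role    = "user"
            content = @(
                @{ type = "text"; text = "What is in this image?" },
                @{ type = "image_url"; image_url = @{ url = "data:image/jpeg;base64,$b64" } }
            )
        }
    )
} | ConvertTo-Json -Depth 10
Invoke-RestMethod -Uri "http://localhost:8081/v1/chat/completions" -Method Post -Body $payload -ContentType "application/json"
```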
## Troubleshooting
### Issue: "CUDA not found" during build
**Solution:**
- Ensure NVIDIA GPU drivers are installed
- Verify Docker Desktop is using WSL2 backend
- Install NVIDIA Container Toolkit in WSL2 (see Prerequisites)
### Issue: Model loads on CPU instead of GPU
**Symptoms:**
```
load_tensors: CPU_Mapped model buffer
```
**Solution:**
- This usually means the image wasn't built with CUDA support (or no layers were offloaded)
- Rebuild using the `-cuda` scripts: `npm run llama:build-mxbai-cuda`
- Verify the startup logs show CUDA initialization and GPU layer offloading
### Issue: "Out of memory" when loading model
**Solution:**
- Check available VRAM with `nvidia-smi` (see the query below)
- For 2B models: Need 4-6GB VRAM
- For 7B models: Need 8-10GB VRAM
- Reduce `--parallel` parameter in Dockerfile (default: 4)
- Use smaller model variant (2B instead of 7B)
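A quick way to see headroom is the query form of `nvidia-smi`:
```powershell
# Total vs. currently used VRAM per GPU
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv
```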
### Issue: Download fails from HuggingFace
**Solution:**
```powershell
# Download manually using curl.exe or a browser
curl.exe -L -o docker/llama-cpp/models/qwen2.5-vl-2b.gguf `
"https://huggingface.co/Qwen/Qwen2.5-VL-2B-Instruct-GGUF/resolve/main/qwen2.5-vl-2b-instruct-q4_k_m.gguf"
# Then run build with -SkipDownload
.\scripts\build-llama-cpp-qwen-vl-cuda.ps1 -ModelSize 2b -SkipDownload
```
### Issue: GPU utilization is low
**Check:**
```powershell
# Watch GPU usage in real-time
nvidia-smi -l 1
# Verify layers are offloaded
docker logs llama_server | Select-String "ngl"
# "-ngl 99" (more layers than the model has) offloads all layers to the GPU
```
## Performance Comparisons
### Embeddings (mxbai-embed-large, 1024 dims)
| Backend | Time per Request | GPU Memory |
|---------|-----------------|------------|
| CPU only | 200-500ms | 0 MB |
| **CUDA** | **20-50ms** | **~1.5 GB** |
**Speedup: ~10x faster**
### Vision Models (Qwen2.5-VL)
| Model | Backend | Time per Image | GPU Memory |
|-------|---------|---------------|------------|
| 2B | CPU only | 5-15 seconds | 0 MB |
| **2B** | **CUDA** | **1-3 seconds** | **~4-6 GB** |
| 7B | CPU only | 15-45 seconds | 0 MB |
| **7B** | **CUDA** | **2-5 seconds** | **~8-10 GB** |
**Speedup: ~5-10x faster**
## Publishing Images
After building, you can publish to Docker Hub:
```powershell
# Login to Docker Hub
docker login
# Build and push embeddings
.\scripts\build-llama-cpp-mxbai-cuda.ps1 -Push
# Build and push vision 2B
.\scripts\build-llama-cpp-qwen-vl-cuda.ps1 -ModelSize 2b -Push
# Build and push vision 7B
.\scripts\build-llama-cpp-qwen-vl-cuda.ps1 -ModelSize 7b -Push
```
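Beyond `latest`, it can help to push an immutable version tag as well; the version string here is just an example:
```powershell
# Tag the freshly built image with a version and push it alongside latest
docker tag timothyswt/llama-cpp-server-amd64-mxbai:latest timothyswt/llama-cpp-server-amd64-mxbai:v1.0.0
docker push timothyswt/llama-cpp-server-amd64-mxbai:v1.0.0
```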
## Advanced Configuration
### Custom Layer Offloading
Edit Dockerfile CMD to control GPU layers:
```dockerfile
# Full GPU (default):
CMD ["--ngl", "99"]
# Partial GPU (offload only 32 layers to save VRAM):
CMD ["--ngl", "32"]
# CPU only:
CMD ["--ngl", "0"]
```
### Custom Context Size
```dockerfile
# For 2B model - increase from 32K to 64K:
CMD ["--ctx-size", "65536"]
# For 7B model - reduce from 128K to 64K (saves VRAM):
CMD ["--ctx-size", "65536"]
```
### Custom Parallelism
```dockerfile
# More parallel requests (needs more VRAM):
CMD ["--parallel", "8"]
# Fewer parallel requests (saves VRAM):
CMD ["--parallel", "2"]
```
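If the image's Dockerfile keeps the llama-server binary and model path in `ENTRYPOINT` and only the tunable flags in `CMD`, as the snippets above suggest, the same flags can be overridden at run time without rebuilding. That layout is an assumption; also note that arguments passed to `docker run` replace the default CMD entirely, so pass the full flag set you want:
```powershell
# Override GPU layers, context size, and parallelism for a one-off run
docker run --rm --gpus all -p 11434:11434 `
  timothyswt/llama-cpp-server-amd64-mxbai:latest `
  --ngl 32 --ctx-size 8192 --parallel 2
```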
## References
- [llama.cpp GitHub](https://github.com/ggerganov/llama.cpp)
- [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)
- [Qwen2.5-VL Models](https://huggingface.co/Qwen)
- [Docker GPU Support](https://docs.docker.com/config/containers/resource_constraints/#gpu)