# Research: AWS GPU-based NVIDIA NIM RAG Deployment
**Feature**: 003-aws-nim-deployment
**Date**: 2025-11-09
**Status**: Complete
## Research Questions Resolved
### 1. AWS GPU Instance Selection for NVIDIA NIM
**Decision**: Use g5.xlarge instance type
**Rationale**:
- NVIDIA A10G GPU (24GB VRAM) sufficient for meta/llama-3.1-8b-instruct (requires ~16GB)
- Cost-effective for development/testing ($1.006/hour on-demand vs g5.2xlarge at $1.212/hour)
- 4 vCPUs and 16GB system RAM adequate for vectorization pipelines and IRIS database
- NVMe SSD provides fast local storage for Docker layers and model caching
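**Provisioning Sketch** (illustrative only; the AMI ID, key pair, and tag value are placeholders, not values from the deployment scripts):
```python
# Sketch: launch a single g5.xlarge instance (all identifiers are placeholders)
import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')
resp = ec2.run_instances(
    ImageId='ami-XXXXXXXXXXXX',   # Ubuntu 24.04 LTS AMI for the target region
    InstanceType='g5.xlarge',
    MinCount=1,
    MaxCount=1,
    KeyName='my-key-pair',        # placeholder key pair
    TagSpecifications=[{
        'ResourceType': 'instance',
        # Tag used for the idempotency checks described in section 7
        'Tags': [{'Key': 'Name', 'Value': 'nim-rag-gpu'}],
    }],
)
print(resp['Instances'][0]['InstanceId'])
```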
**Alternatives Considered**:
- p3.2xlarge (V100 GPU): More expensive ($3.06/hour), overkill for 8B parameter model
- g4dn.xlarge (T4 GPU): Cheaper but only 16GB VRAM, insufficient headroom for concurrent LLM + vectorization
- g5.2xlarge: Double the cost for minimal performance gain in single-user scenario
**References**:
- AWS EC2 G5 Instances: https://aws.amazon.com/ec2/instance-types/g5/
- NVIDIA NIM Requirements: https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html
---
### 2. NVIDIA Driver and CUDA Version Compatibility
**Decision**: nvidia-driver-535 with CUDA 12.2
**Rationale**:
- Driver 535 is LTS (Long Term Support) release with proven stability on Ubuntu 24.04
- CUDA 12.2 required for NVIDIA Container Toolkit and NIM containers
- Matches versions used in successful local development testing
- Compatible with NVIDIA A10G GPU (Ampere architecture)
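**Verification Sketch** (a minimal post-install check; assumes `nvidia-smi` is on the PATH and prints its usual header line):
```python
# Sketch: confirm the expected driver/CUDA pair is active after installation
import subprocess

def check_gpu_stack(expected_driver='535', expected_cuda='12.2'):
    # nvidia-smi's header reports both versions, e.g.
    # "Driver Version: 535.183.01   CUDA Version: 12.2"
    out = subprocess.run(['nvidia-smi'], capture_output=True, text=True,
                         check=True).stdout
    assert f'Driver Version: {expected_driver}' in out, 'unexpected driver version'
    assert f'CUDA Version: {expected_cuda}' in out, 'unexpected CUDA version'

check_gpu_stack()
```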
**Alternatives Considered**:
- nvidia-driver-550 (latest): Newer but less tested, potential compatibility issues
- CUDA 11.8: Older, not supported by latest NIM containers
**References**:
- NVIDIA Driver Downloads: https://www.nvidia.com/Download/index.aspx
- CUDA Toolkit Archive: https://developer.nvidia.com/cuda-toolkit-archive
---
### 3. Vector Database Scaling Strategy
**Decision**: IRIS native VECTOR type with B-tree indexes, no specialized vector index initially
**Rationale**:
- IRIS VECTOR(DOUBLE, 1024) provides optimized storage for high-dimensional vectors
- For 100K vectors, brute-force VECTOR_COSINE search performs adequately (<1 second)
- Simpler implementation without external vector index dependencies
- Can add HNSW or IVF indexes later if performance degrades at scale
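**Query Sketch** (a minimal brute-force similarity search, assuming the intersystems-irispython DB-API driver; table and column names are illustrative):
```python
# Sketch: brute-force cosine similarity search over an IRIS VECTOR column
# (RAG.DocumentVectors and its columns are illustrative names)
import iris  # intersystems-irispython DB-API driver

conn = iris.connect(hostname='localhost', port=1972, namespace='USER',
                    username='_SYSTEM', password='SYS')
cur = conn.cursor()

query_embedding = [0.1] * 1024  # placeholder; real vector comes from the embeddings API
query_vec = ','.join(str(x) for x in query_embedding)

cur.execute(
    """
    SELECT TOP 5 DocID, TextChunk
    FROM RAG.DocumentVectors
    ORDER BY VECTOR_COSINE(Embedding, TO_VECTOR(?, DOUBLE, 1024)) DESC
    """,
    [query_vec],
)
for doc_id, chunk in cur.fetchall():
    print(doc_id, chunk[:80])
```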
**Alternatives Considered**:
- pgvector extension: Would require PostgreSQL instead of IRIS, loses native FHIR integration
- Specialized vector DB (Pinecone, Weaviate): Additional cost and operational complexity
**References**:
- IRIS Vector Search Documentation: https://docs.intersystems.com/irislatest/csp/docbook/Doc.View.cls?KEY=GSQL_vecsearch
- Vector Search Performance: Internal IRIS benchmarks show sub-second search for 100K-1M vectors
---
### 4. Batch Processing and Resumability Pattern
**Decision**: Checkpoint-based resumable pipeline with SQLite state tracking
**Rationale**:
- Lightweight SQLite DB tracks processing state (document ID, status, timestamp)
- After interruption, pipeline queries state DB to find last successful batch
- Idempotent: Re-processing same document generates identical vector, safe to reinsert
- No external dependencies beyond Python standard library
**Implementation Pattern**:
```python
# Simplified resumable batch processor (state-tracking helpers sketched below)
import sqlite3

def process_batch(documents, batch_size=50):
    state_db = sqlite3.connect('vectorization_state.db')
    for i in range(0, len(documents), batch_size):
        batch = documents[i:i + batch_size]
        # Skip documents already recorded as processed in the state DB
        unprocessed = filter_unprocessed(batch, state_db)
        if unprocessed:
            vectors = generate_embeddings(unprocessed)
            insert_vectors(vectors)
            mark_processed(unprocessed, state_db)
```
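The state-tracking helpers are left abstract in the pattern above; a minimal sketch of the SQLite side (the `processed` table schema and the `doc['id']` shape are illustrative assumptions):
```python
# Sketch: SQLite state-tracking helpers assumed by the pattern above
import sqlite3

def init_state_db(path='vectorization_state.db'):
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS processed ("
        "doc_id TEXT PRIMARY KEY, "
        "status TEXT, "
        "updated_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    return db

def filter_unprocessed(batch, state_db):
    done = {row[0] for row in
            state_db.execute("SELECT doc_id FROM processed WHERE status = 'done'")}
    return [doc for doc in batch if doc['id'] not in done]

def mark_processed(batch, state_db):
    # INSERT OR REPLACE keeps re-marking idempotent, matching the re-run safety above
    state_db.executemany(
        "INSERT OR REPLACE INTO processed (doc_id, status) VALUES (?, 'done')",
        [(doc['id'],) for doc in batch],
    )
    state_db.commit()
```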
**Alternatives Considered**:
- File-based checkpointing: Harder to query for status, prone to corruption
- No resumability: Unacceptable for 50K+ document processing (could lose hours of work)
**References**:
- Python sqlite3 module: https://docs.python.org/3/library/sqlite3.html
- Idempotent data pipeline patterns: Martin Kleppmann, "Designing Data-Intensive Applications" Chapter 11
---
### 5. NVIDIA NIM Embeddings: Local vs Cloud API
**Decision**: Use NVIDIA NIM Cloud API (nvidia/nv-embedqa-e5-v5) for embeddings
**Rationale**:
- Cloud API provides 1024-dimensional embeddings optimized for retrieval tasks
- Faster deployment (no local embedding model management)
- GPU resources freed for LLM inference
- Acceptable latency for batch processing (~100ms per request for 50 documents)
- Cost: Included in NVIDIA NGC developer tier for reasonable usage
**Trade-offs**:
- Network dependency: Requires retry logic and offline fallback strategy
- API rate limits: Batch size tuning required (found 50 docs/request optimal)
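**API Call Sketch** (assumes NVIDIA's OpenAI-compatible embeddings endpoint and an `NVIDIA_API_KEY` environment variable; verify parameter names against the current API reference):
```python
# Sketch: batch embedding request with exponential-backoff retry
import os
import time
import requests

URL = 'https://integrate.api.nvidia.com/v1/embeddings'
HEADERS = {'Authorization': f"Bearer {os.environ['NVIDIA_API_KEY']}"}

def embed_batch(texts, retries=3):
    payload = {
        'model': 'nvidia/nv-embedqa-e5-v5',
        'input': texts,           # up to 50 texts per request (see rate-limit note above)
        'input_type': 'passage',  # use 'query' when embedding search queries
    }
    for attempt in range(retries):
        resp = requests.post(URL, headers=HEADERS, json=payload, timeout=30)
        if resp.ok:
            return [item['embedding'] for item in resp.json()['data']]
        time.sleep(2 ** attempt)  # back off on rate limits or transient errors
    resp.raise_for_status()
```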
**Alternatives Considered**:
- Local sentence-transformers: Requires GPU memory, slower than cloud-optimized inference
- OpenAI ada-002 embeddings: More expensive, 1536-dim (larger storage footprint)
**References**:
- NVIDIA NIM Embeddings API: https://build.nvidia.com/nvidia/nv-embedqa-e5-v5
- Embedding quality benchmarks: MTEB leaderboard shows nv-embedqa-e5-v5 competitive with OpenAI
---
### 6. Docker GPU Runtime Configuration
**Decision**: NVIDIA Container Toolkit with Docker runtime configuration
**Rationale**:
- Officially supported by NVIDIA for GPU passthrough to containers
- Enables `--gpus all` flag for automatic GPU discovery
- Works with both Docker Compose and standalone docker run commands
- Handles GPU driver library mounting automatically
**Installation Steps**:
```bash
# Add the NVIDIA Container Toolkit repository and signing key
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# Install toolkit
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
# Configure Docker daemon to use the NVIDIA runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Verify GPU passthrough from inside a container
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```
**Alternatives Considered**:
- nvidia-docker2 (deprecated): Legacy tool, replaced by Container Toolkit
- Podman with crun: Less mature GPU support, smaller ecosystem
**References**:
- NVIDIA Container Toolkit Installation: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
---
### 7. Deployment Idempotency Strategy
**Decision**: Resource existence checks before creation, fail-safe defaults
**Rationale**:
- Check if EC2 instance exists before launching (via instance tags)
- Check if Docker containers running before starting (docker ps)
- Check if IRIS tables exist before CREATE TABLE (use `IF NOT EXISTS` in the DDL)
- Safe to re-run deployment scripts multiple times without errors
**Implementation Example**:
```bash
# Idempotent container deployment: start the container only if not already running
if ! docker ps --format '{{.Names}}' | grep -qx 'nim-llm'; then
  docker run -d --name nim-llm --gpus all ...
else
  echo "nim-llm container already running"
fi
```
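A corresponding sketch of the EC2 existence check via instance tags (boto3; the `Name` tag value and region are illustrative):
```python
# Sketch: skip the launch step when a tagged instance is already pending/running
import boto3

def instance_exists(tag_value, region='us-east-1'):
    ec2 = boto3.client('ec2', region_name=region)
    resp = ec2.describe_instances(
        Filters=[
            {'Name': 'tag:Name', 'Values': [tag_value]},
            {'Name': 'instance-state-name', 'Values': ['pending', 'running']},
        ]
    )
    return any(res['Instances'] for res in resp['Reservations'])

if not instance_exists('nim-rag-gpu'):
    print('No tagged instance found; safe to launch g5.xlarge')
```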
**Alternatives Considered**:
- Infrastructure as Code (Terraform): Overkill for single-instance deployment, adds complexity
- Ansible playbooks: More appropriate for multi-server deployments
**References**:
- Idempotent script patterns: https://en.wikipedia.org/wiki/Idempotence
---
## Summary of Technology Choices
| Component | Technology | Version | Rationale |
|-----------|------------|---------|-----------|
| Cloud Platform | AWS EC2 | g5.xlarge | Cost-effective GPU instance for NIM workloads |
| Operating System | Ubuntu LTS | 24.04 | Long-term support, NVIDIA driver compatibility |
| GPU Driver | nvidia-driver | 535 (LTS) | Stability and CUDA 12.2 support |
| Container Runtime | Docker + NVIDIA Toolkit | 24+ | Official GPU container support |
| Vector Database | InterSystems IRIS | 2025.1 Community | Native VECTOR type, FHIR integration |
| LLM Service | NVIDIA NIM | meta/llama-3.1-8b-instruct | GPU-accelerated, containerized deployment |
| Embeddings | NVIDIA NIM Cloud API | nvidia/nv-embedqa-e5-v5 | 1024-dim, optimized for retrieval |
| Scripting | Bash | 5.x | Standard for infrastructure automation |
| Data Processing | Python | 3.10+ | IRIS driver, NumPy, requests libraries |
| State Tracking | SQLite | 3.x | Resumable pipeline checkpointing |
## Open Questions (None Remaining)
All technical questions from the feature spec have been resolved through this research phase.