Skip to main content
Glama
spec.md•19 kB
# Feature Specification: AWS GPU-based NVIDIA NIM RAG Deployment Automation **Feature Branch**: `003-aws-nim-deployment` **Created**: 2025-11-09 **Status**: Draft **Input**: User description: "AWS GPU-based NVIDIA NIM RAG System - Create comprehensive deployment automation including EC2 g5.xlarge setup, NVIDIA driver installation (driver-535, CUDA 12.2, nvidia-container-toolkit), Docker GPU configuration, IRIS vector database deployment (1024-dim vectors), NVIDIA NIM LLM container (meta/llama-3.1-8b-instruct on port 8001), NVIDIA NIM embeddings (nvidia/nv-embedqa-e5-v5 Cloud API), clinical note vectorization pipeline for 50K+ Synthea documents, MIMIC-CXR image vectorization with NIM Vision, and vector search testing. All components must run locally on GPU except embeddings Cloud API, with full documentation for repeatable deployment." ## User Scenarios & Testing ### User Story 1 - Complete RAG System Deployment from Scratch (Priority: P1) Infrastructure engineers and ML teams need to deploy a production-ready GPU-accelerated RAG system on cloud infrastructure with minimal manual intervention. They require a single repeatable process that provisions compute resources, installs necessary GPU drivers and runtimes, deploys all required services, and validates the complete stack. **Why this priority**: This is the foundation of the entire system. Without reliable deployment automation, teams cannot build or test the RAG capabilities. This delivers immediate value by reducing deployment time from hours to minutes and eliminating configuration errors. **Independent Test**: Can be fully tested by running the deployment automation on a fresh cloud account and verifying that all services start successfully, GPU is accessible, and health checks pass for all components. **Acceptance Scenarios**: 1. **Given** a fresh cloud account with appropriate credentials, **When** deployment automation is executed, **Then** all infrastructure is provisioned, GPU drivers are installed, and all services are running within 30 minutes 2. **Given** the deployment has completed, **When** health checks are run on each service, **Then** GPU-accelerated LLM service responds successfully, vector database accepts connections, and GPU utilization can be monitored 3. **Given** deployment automation has run previously, **When** the same automation is re-executed, **Then** existing resources are detected, no duplicate services are created, and the system remains in a consistent state --- ### User Story 2 - Clinical Note Vectorization Pipeline (Priority: P2) Data scientists need to vectorize large volumes of clinical text documents using GPU-accelerated embeddings to enable semantic search capabilities. They need to process tens of thousands of documents efficiently with batch processing, progress tracking, and error handling for failed documents. **Why this priority**: Vectorization is the core data transformation that enables the RAG system's search capabilities. This must work reliably at scale to handle production medical datasets. **Independent Test**: Can be tested independently by providing a sample dataset of clinical notes and verifying that vectorized representations are generated, stored correctly in the vector database, and retrievable for similarity search. **Acceptance Scenarios**: 1. **Given** a collection of 50,000+ clinical notes in supported format, **When** the vectorization pipeline is executed, **Then** all documents are processed, embeddings are generated using GPU acceleration, and vectors are stored in the database with appropriate metadata 2. **Given** vectorization is in progress, **When** a system failure occurs mid-processing, **Then** the pipeline can resume from the last successful batch without reprocessing completed documents 3. **Given** a clinical note contains unusual formatting or characters, **When** vectorization is attempted, **Then** the document is either successfully processed with normalization or logged as failed with detailed error information 4. **Given** vectorization has completed, **When** similarity search is performed on the vector database, **Then** relevant clinical notes are retrieved based on semantic similarity with sub-second response times --- ### User Story 3 - Medical Image Vectorization with Vision Models (Priority: P3) Radiologists and imaging teams need to vectorize medical images (chest X-rays, CT scans) to enable multi-modal search combining text and image data. The system must handle large image files, extract meaningful visual features using GPU-accelerated vision models, and link image vectors to associated clinical reports. **Why this priority**: Multi-modal capabilities significantly enhance the RAG system's value but can be implemented after text-based search is working. This represents an advanced feature that builds on the foundational text vectorization. **Independent Test**: Can be tested by providing a sample set of medical images with associated metadata and verifying that image embeddings are generated, stored with correct linkage to patient records, and retrievable through visual similarity search. **Acceptance Scenarios**: 1. **Given** a collection of medical images in standard formats (DICOM, PNG, JPG), **When** the image vectorization pipeline runs, **Then** visual embeddings are generated using GPU-accelerated vision models and stored with appropriate patient and study metadata 2. **Given** a chest X-ray image, **When** visual similarity search is performed, **Then** similar chest X-ray images are retrieved ranked by visual similarity, along with their associated clinical reports 3. **Given** an image fails to process due to corruption or format issues, **When** the pipeline encounters the error, **Then** the failure is logged with diagnostic information, and processing continues with remaining images --- ### User Story 4 - Multi-Modal RAG Query Processing (Priority: P2) Clinical researchers need to ask natural language questions that retrieve relevant information from both clinical text notes and medical images. The system must understand the query intent, perform semantic search across text vectors, optionally search image vectors for visual queries, generate contextual responses using the local LLM, and cite sources for retrieved information. **Why this priority**: This is the end-user facing functionality that delivers direct value. It depends on successful vectorization (P2) and deployment (P1) but is more critical than image vectorization (P3) because text-only RAG provides immediate utility. **Independent Test**: Can be tested by submitting various clinical questions and verifying that relevant documents are retrieved, responses are generated with accurate information, and source citations are provided. **Acceptance Scenarios**: 1. **Given** a vectorized database of clinical notes, **When** a user submits a natural language query about a medical condition, **Then** relevant clinical notes are retrieved, a coherent response is generated using the local LLM, and source documents are cited 2. **Given** a query that has no relevant matches in the database, **When** the RAG system processes the query, **Then** the system indicates no relevant information was found rather than generating unsupported responses 3. **Given** a complex multi-part question, **When** the query is processed, **Then** the system retrieves information addressing all parts of the question and generates a comprehensive response that integrates the retrieved context --- ### User Story 5 - Deployment Validation and Health Monitoring (Priority: P1) DevOps engineers need to verify that all deployed services are functioning correctly and monitor system health over time. They require automated validation tests that check GPU accessibility, service connectivity, database operations, and end-to-end query processing. **Why this priority**: Without reliable validation, teams cannot confidently deploy to production or troubleshoot issues. This is foundational infrastructure that supports all other functionality. **Independent Test**: Can be tested immediately after deployment by running the validation suite and verifying that all health checks report successful status and any failures provide actionable diagnostic information. **Acceptance Scenarios**: 1. **Given** a freshly deployed system, **When** validation tests are executed, **Then** GPU drivers respond with device information, all containerized services report healthy status, vector database accepts test queries, and LLM generates test responses 2. **Given** a service has failed or is unresponsive, **When** validation tests run, **Then** the specific failing service is identified, relevant logs are collected, and remediation steps are suggested 3. **Given** the system is running under load, **When** continuous health monitoring executes, **Then** GPU utilization metrics are collected, service response times are tracked, and alerts are generated when thresholds are exceeded --- ### Edge Cases - What happens when GPU memory is exhausted during large batch vectorization? System should detect memory pressure, reduce batch size automatically, or queue remaining batches for sequential processing - How does the system handle AWS instance stop/start cycles where public IPs change? Deployment automation should detect IP changes and update configurations, or use elastic IPs for stable addressing - What happens when NVIDIA NIM service containers fail to start due to license validation issues? System should provide clear error messages about API key configuration and validation status - How does vectorization handle extremely large clinical documents that exceed token limits? Documents should be chunked into overlapping segments with metadata preserved to maintain context - What happens when the vector database runs out of disk space during bulk ingestion? System should monitor disk usage, provide early warnings, and gracefully handle out-of-space conditions with clear recovery procedures - How does the system behave when network connectivity to NVIDIA Cloud API is lost during embedding generation? Pipeline should implement retry logic with exponential backoff, cache successful embeddings, and allow resumption after connectivity is restored - What happens when multiple vectorization jobs are submitted concurrently? System should implement job queuing or fail-fast validation to prevent resource contention and data corruption ## Requirements ### Functional Requirements - **FR-001**: Deployment automation MUST provision GPU-enabled cloud compute instances with appropriate instance types for running large language models - **FR-002**: System MUST install and configure GPU drivers compatible with NVIDIA GPU hardware (A10G, A100, or equivalent) - **FR-003**: System MUST configure container runtime to enable GPU passthrough to containerized applications - **FR-004**: System MUST deploy a vector database capable of storing high-dimensional vectors (1024+ dimensions) with efficient similarity search capabilities - **FR-005**: System MUST deploy NVIDIA NIM language model service configured to use available GPU resources for inference acceleration - **FR-006**: System MUST provide vectorization pipeline for processing clinical text documents with batch processing and progress tracking - **FR-007**: System MUST store generated text embeddings in the vector database with associated document metadata (patient ID, document type, source bundle) - **FR-008**: System MUST provide image vectorization pipeline for processing medical images using GPU-accelerated vision models - **FR-009**: System MUST store generated image embeddings linked to appropriate patient records and image metadata - **FR-010**: System MUST handle vectorization failures gracefully with error logging and ability to retry failed documents - **FR-011**: System MUST provide validation tests that verify GPU accessibility, service health, database connectivity, and end-to-end query functionality - **FR-012**: System MUST support resumable vectorization pipelines that can continue from last successful batch after interruption - **FR-013**: Deployment automation MUST be idempotent, detecting existing resources and avoiding duplicate deployments - **FR-014**: System MUST provide clear documentation for initial deployment, configuration, and troubleshooting - **FR-015**: System MUST monitor and log GPU utilization metrics for performance optimization - **FR-016**: RAG query pipeline MUST retrieve relevant documents from vector database using semantic similarity search - **FR-017**: RAG query pipeline MUST generate responses using local GPU-accelerated LLM with retrieved context - **FR-018**: RAG responses MUST include citations to source documents used in generating the answer ### Key Entities - **Deployment Environment**: Represents the cloud infrastructure (EC2 instance), including instance type, region, security groups, storage volumes, and network configuration - **GPU Configuration**: Represents NVIDIA GPU setup including driver version, CUDA version, container toolkit configuration, and device accessibility - **Vector Database**: Stores document embeddings with metadata, supports similarity search queries, includes separate tables for text vectors and image vectors with 1024-dimensional vector fields - **Clinical Document**: Represents a medical text document with resource ID, patient ID, document type, text content, source bundle reference, associated vector embedding, and embedding model identifier - **Medical Image**: Represents a medical image file with image ID, patient ID, study metadata, image format, file location, associated vector embedding, and linkage to related clinical reports - **Vectorization Job**: Represents a batch processing job with job ID, source data location, processing status, progress metrics, error logs, and timestamps for start/completion - **NIM Service**: Represents deployed NVIDIA NIM container (LLM or Vision) with service type, model identifier, port configuration, GPU allocation, API endpoint, and health status - **RAG Query**: Represents a user question with query text, search mode (text-only or multi-modal), retrieved document references, generated response, source citations, and processing timestamps ## Success Criteria ### Measurable Outcomes - **SC-001**: Infrastructure engineers can deploy the complete RAG system on a fresh cloud environment in under 30 minutes from start to finish - **SC-002**: Deployment automation succeeds on first attempt without manual intervention for 95% of standard configurations - **SC-003**: Clinical note vectorization processes 50,000+ documents with 99.9% success rate, with failed documents clearly logged and retrievable - **SC-004**: Vectorization throughput achieves at least 100 documents per minute when using GPU acceleration - **SC-005**: Medical image vectorization processes standard chest X-ray images within 2 seconds per image on average - **SC-006**: Vector similarity search returns top-K results in under 1 second for 95% of queries against databases with 100,000+ vectors - **SC-007**: RAG query processing generates complete responses within 5 seconds from question submission to answer delivery - **SC-008**: GPU utilization reaches at least 70% during active vectorization or inference workloads, indicating effective hardware usage - **SC-009**: System maintains 99.9% uptime for deployed services over 30-day operational period - **SC-010**: Validation test suite completes full system health check in under 3 minutes and correctly identifies component failures with 100% accuracy - **SC-011**: Documentation enables new team members to deploy the system successfully within 1 hour without external assistance - **SC-012**: Deployed system handles at least 100 concurrent similarity search queries without response time degradation beyond 2x baseline - **SC-013**: Vectorization pipeline successfully resumes processing within 1 minute of interruption recovery with zero data loss - **SC-014**: Storage costs remain under $50/month for typical deployment with 100,000 vectorized documents and associated metadata ## Assumptions - Cloud provider account has appropriate permissions and quota for GPU instance types - NVIDIA API keys for NIM services are available and have sufficient API quota for expected usage - Clinical documents are provided in structured JSON format with consistent schema - Medical images are in standard formats (DICOM, PNG, JPG) with associated metadata available - Network connectivity to NVIDIA Cloud services is available for embedding API calls - Data privacy and compliance requirements (HIPAA, GDPR) are handled at the infrastructure/network level outside this automation - Clinical notes text content is in English language (non-English content may have reduced embedding quality) - Users executing deployment have basic familiarity with command-line tools and cloud infrastructure concepts - System will process de-identified patient data or operate in appropriate compliance environment - GPU driver versions are compatible with Ubuntu 24.04 LTS operating system - Docker container engine is either pre-installed or can be installed as part of automation ## Dependencies - AWS account with EC2 access and ability to launch g5.xlarge or larger GPU instances - NVIDIA NGC account and API keys for accessing NIM container registry and Cloud API services - InterSystems IRIS Community Edition or licensed version for vector database functionality - Ubuntu 24.04 LTS or compatible Linux distribution for EC2 instances - Sufficient AWS storage quota for vector database (recommend 500GB+ for large datasets) - Network security groups configured to allow SSH access and required service ports - Python 3.10+ runtime environment for vectorization scripts - Source data: Synthea-generated clinical notes (50K+) and optionally MIMIC-CXR image dataset - SSH key pair for secure access to deployed cloud instances ## Out of Scope - Multi-region deployment or high-availability configurations across availability zones - Production-grade security hardening, penetration testing, or compliance certification - User authentication and authorization systems for controlling access to the RAG service - Web-based user interface for submitting queries (command-line and API only) - Automated scaling based on load (single instance deployment only) - Integration with existing hospital EHR systems or clinical workflows - Real-time vectorization of streaming clinical data - Fine-tuning or customization of NVIDIA NIM models - Backup and disaster recovery automation - Cost optimization or spot instance handling - Support for non-NVIDIA GPU hardware - Multi-language support for non-English clinical documents - De-identification or anonymization of patient data (assumes pre-processed data) - DICOM image processing beyond basic format conversion - Integration with medical imaging PACS systems

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/isc-tdyar/medical-graphrag-assistant'

If you have feedback or need assistance with the MCP directory API, please join our Discord server