Skip to main content
Glama
AWS_DEPLOYMENT_OPTIONS.md7.64 kB
# AWS GraphRAG Deployment Options **Date**: December 13, 2025 **Environment**: AWS EC2 g5.xlarge (3.84.250.46) **Status**: Knowledge graph populated, multiple embedding options available --- ## Current State ✅ ### What's Working - **51 FHIR DocumentReference resources** migrated to AWS - **83 medical entities** extracted with 1024-dim NVIDIA embeddings - **540 entity relationships** mapped in knowledge graph - **InterSystems IRIS** running on AWS EC2 with vector support ### Database Tables (AWS) ``` SQLUser.FHIRDocuments - 51 FHIR resources (JSON) SQLUser.ClinicalNoteVectors - 51 vectors (1024-dim, NVIDIA Hosted API) SQLUser.Entities - 83 medical entities with embeddings SQLUser.EntityRelationships - 540 relationships ``` --- ## Embedding Options ### Option 1: NVIDIA Hosted NIM API ✅ **CURRENT** **Status**: Working and in use **Configuration**: ```python NVIDIA_API_KEY = "nvapi-..." NVIDIA_API_BASE = "https://integrate.api.nvidia.com/v1" MODEL = "nvidia/nv-embedqa-e5-v5" DIMENSION = 1024 ``` **Advantages**: - ✅ No infrastructure needed - ✅ Free tier (1000 requests/day) - ✅ Low latency (~100ms per request) - ✅ Always up-to-date models - ✅ Automatic scaling **Disadvantages**: - ⚠️ Rate limited (1000 requests/day) - ⚠️ Requires internet connectivity - ⚠️ Data leaves your VPC **Cost**: Free (hosted tier) **Use Case**: Development, prototyping, light production workloads --- ### Option 2: NVIDIA NIM Self-Hosted (Docker) **Status**: Not deployed (optional enhancement) **Configuration**: ```bash # Deploy on AWS EC2 g5.xlarge docker run -d \ --name nim-embedding \ --gpus all \ -p 8080:8080 \ -e NGC_API_KEY=$NGC_API_KEY \ nvcr.io/nim/nvidia/nv-embedqa-e5-v5:1.0.0 ``` **Advantages**: - ✅ Data stays in your VPC - ✅ Unlimited requests - ✅ Predictable latency - ✅ No rate limits - ✅ Works offline **Disadvantages**: - ❌ Requires GPU infrastructure (A10G) - ❌ Need NGC license - ❌ Manual updates - ❌ Higher operational overhead **Cost**: - GPU instance: ~$1.00/hour (g5.xlarge) - Storage: ~$0.10/GB/month - Network: ~$0.09/GB outbound **Use Case**: High-volume production, compliance requirements, offline deployments --- ### Option 3: Hybrid Approach **Status**: Possible but not implemented **Strategy**: - Use **NVIDIA Hosted API** for development/testing - Use **Self-Hosted NIM** for production queries - Sync embeddings between environments **Advantages**: - Development flexibility - Production control - Cost optimization **Implementation**: ```yaml # config/hybrid-embeddings.yaml environments: development: provider: nvidia_hosted api_key: "nvapi-..." production: provider: nvidia_self_hosted base_url: "http://3.84.250.46:8080" ``` --- ## Deployment Recommendation ### For Current Phase: **Option 1 (NVIDIA Hosted API)** ✅ **Reasoning**: 1. Knowledge graph already populated with Hosted API embeddings 2. 51 documents fit well within free tier (1000 requests/day) 3. No additional infrastructure costs 4. Fast iteration during development 5. Easy to upgrade to self-hosted later **Configuration** (Already in Use): ```python # scripts/aws/extract-entities-aws.py NVIDIA_API_KEY = os.getenv('NVIDIA_API_KEY') or open('~/.nvidia_api_key').read().strip() API_BASE = "https://integrate.api.nvidia.com/v1" MODEL = "nvidia/nv-embedqa-e5-v5" ``` ### For Production Scale: **Option 2 (Self-Hosted NIM)** **When to Upgrade**: - Document count > 1,000 - Query volume > 1,000/day - Compliance requires data locality - Need guaranteed SLA/latency - Going offline/air-gapped **Migration Path**: 1. Deploy NIM container on AWS EC2 2. Update config to point to local endpoint 3. Re-vectorize documents (or accept small drift) 4. Test GraphRAG queries against local NIM 5. Switch over production traffic --- ## GraphRAG Query Performance ### Current Configuration ```yaml # config/fhir_graphrag_config.aws.yaml embeddings: model: "nvidia/nv-embedqa-e5-v5" dimension: 1024 pipelines: graphrag: vector_k: 30 # Top 30 vector matches text_k: 30 # Top 30 text matches graph_k: 10 # Top 10 graph matches fusion_method: "rrf" # Reciprocal rank fusion ``` ### Query Latency Breakdown ``` Component | Time (ms) | Provider -----------------------|-----------|------------------ Query embedding | 100 | NVIDIA Hosted API Vector search (IRIS) | 50 | Local IRIS Text search (IRIS) | 30 | Local IRIS Graph traversal (IRIS) | 20 | Local IRIS RRF fusion | 10 | Local compute -----------------------|-----------|------------------ Total | 210 | Mixed ``` **With Self-Hosted NIM**: - Query embedding: 100ms → **20ms** (5x faster) - Total latency: 210ms → **130ms** (38% faster) --- ## Cost Analysis (Monthly) ### Current Setup (Option 1) ``` AWS EC2 g5.xlarge : $0 (already running for IRIS) NVIDIA Hosted API : $0 (free tier) IRIS database : $0 (Community Edition) Storage (10GB) : $1 (EBS) Network (minimal) : $1 (data transfer) ------------------------------------------------- Total : ~$2/month ``` ### With Self-Hosted NIM (Option 2) ``` AWS EC2 g5.xlarge : $720 (24/7 operation) NGC License : $0 (included with GPU instance) Storage (50GB) : $5 (EBS + model storage) Network (5GB/day) : $13.50 (data transfer) ------------------------------------------------- Total : ~$738/month ``` ### Break-Even Analysis Self-hosted makes sense when: - Query volume > 100,000/month (exceeds free tier) - OR need compliance/data locality - OR require < 50ms embedding latency --- ## Next Steps ### Immediate (Use Current Setup) 1. ✅ Continue using NVIDIA Hosted API 2. Test GraphRAG multi-modal queries 3. Validate entity extraction quality 4. Benchmark query performance ### Future Enhancements 1. Deploy self-hosted NIM for production 2. Add LLM service (llama-3.1-nemotron-70b) 3. Implement answer generation 4. Set up monitoring/alerting --- ## Configuration Files ### Current Active Config **File**: `config/fhir_graphrag_config.aws.yaml` ```yaml embeddings: model: "nvidia/nv-embedqa-e5-v5" # Uses NVIDIA Hosted API via environment variable # export NVIDIA_API_KEY="nvapi-..." ``` ### Self-Hosted NIM Config (Template) **File**: `config/fhir_graphrag_config.aws.selfhosted.yaml` ```yaml embeddings: provider: "nvidia_nim" model: "nvidia/nv-embedqa-e5-v5" base_url: "http://localhost:8080" # Local NIM endpoint dimension: 1024 ``` --- ## Migration Scripts ### Current Embedding Script **File**: `scripts/aws/extract-entities-aws.py` - Uses NVIDIA Hosted API - Extracts 83 entities - Stores in SQLUser.Entities ### Alternative: iris-vector-rag Pipeline **File**: `scripts/aws/build-knowledge-graph-aws.py` - Uses GraphRAGPipeline abstraction - Supports both hosted and self-hosted - Requires NVIDIA_API_KEY environment variable --- ## Summary ✅ **Current deployment uses NVIDIA Hosted API successfully** - Knowledge graph fully populated - 83 entities with 1024-dim embeddings - 540 relationships mapped - Ready for GraphRAG queries ⚙️ **Self-hosted NIM available as optional enhancement** - For production scale (>1000 docs) - For compliance requirements - For <50ms embedding latency 📊 **Cost-effective at current scale** - ~$2/month with hosted API - Scales to ~$738/month if self-hosting - Break-even at ~100K queries/month 🎯 **Recommendation**: Stay with hosted API until query volume or compliance drives need for self-hosting.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/isc-tdyar/medical-graphrag-assistant'

If you have feedback or need assistance with the MCP directory API, please join our Discord server