AWS_GRAPHRAG_KNOWLEDGE_GRAPH_COMPLETE.md•10.4 kB
# AWS GraphRAG Knowledge Graph - Complete ✅
**Date**: December 13, 2025
**Status**: **KNOWLEDGE GRAPH POPULATED**
**Environment**: AWS EC2 g5.xlarge (3.84.250.46)
---
## Summary
Successfully built and populated a medical knowledge graph on AWS IRIS with:
- **83 medical entities** extracted from 51 FHIR clinical notes
- **540 entity relationships** (co-occurrence based)
- **1024-dimensional NVIDIA NIM embeddings** for all entities
- **Full GraphRAG infrastructure** ready for multi-modal search
---
## Knowledge Graph Statistics
### Entities by Type
| Entity Type | Count | Examples |
|-------------|-------|----------|
| TEMPORAL | 43 | "during hospitalization", "on admission" |
| SYMPTOM | 21 | "discomfort", "tenderness", "vomiting", "diarrhea" |
| BODY_PART | 10 | "respiratory", "chest", "abdomen" |
| CONDITION | 6 | "infection", "fever", "pain" |
| MEDICATION | 2 | Drug names |
| PROCEDURE | 1 | Medical procedures |
**Total**: 83 unique entities
### Relationships
| Relationship Type | Count | Description |
|-------------------|-------|-------------|
| CO_OCCURS_WITH | 540 | Entities appearing in same document |
**Total**: 540 relationships
---
## Database Schema
### Tables Created
#### SQLUser.Entities
```sql
CREATE TABLE SQLUser.Entities (
EntityID BIGINT PRIMARY KEY AUTO_INCREMENT,
EntityText VARCHAR(500),
EntityType VARCHAR(100),
ResourceID BIGINT,
Confidence DOUBLE,
EmbeddingVector VECTOR(DOUBLE, 1024), -- NVIDIA NIM embeddings
ExtractedAt TIMESTAMP
)
```
#### SQLUser.EntityRelationships
```sql
CREATE TABLE SQLUser.EntityRelationships (
RelationshipID BIGINT PRIMARY KEY AUTO_INCREMENT,
SourceEntityID BIGINT, -- Fixed: uses entity IDs, not text
TargetEntityID BIGINT, -- Fixed: uses entity IDs, not text
RelationshipType VARCHAR(100),
ResourceID BIGINT, -- Source document
Confidence DOUBLE,
ExtractedAt TIMESTAMP,
Context VARCHAR(1000)
)
```
---
## Issues Resolved
### Issue 1: iris-vector-rag GraphRAG Bug ✅ FIXED
- **Problem**: `UnboundLocalError: cannot access local variable 'time'`
- **Root Cause**: Duplicate `import time` statement (line 144) causing Python scoping issue
- **Solution**: Removed duplicate import from `/Users/tdyar/ws/iris-vector-rag-private/iris_vector_rag/pipelines/graphrag.py`
- **Status**: Fixed in local development version, awaiting upstream contribution
### Issue 2: Schema Mismatch - Entity IDs ✅ FIXED
- **Problem**: `EntityRelationships` table expected entity IDs (bigint) but script inserted entity text (varchar)
- **Solution**:
1. Built entity text → ID mapping after entity insertion
2. Used entity IDs when inserting relationships
- **Code Change**: `scripts/aws/extract-entities-aws.py` lines 334-380
- **Status**: Fixed and verified
### Issue 3: Missing ResourceID ✅ FIXED
- **Problem**: `EntityRelationships` table requires ResourceID field
- **Solution**: Tracked source document ID with each relationship
- **Code Change**: Modified relationship tuple to include `resource_id` field
- **Status**: Fixed and verified
---
## Scripts Used
### 1. Direct SQL Approach (Used for Population)
**File**: `scripts/aws/extract-entities-aws.py`
**What it does**:
1. Loads 51 FHIR documents from `SQLUser.FHIRDocuments`
2. Extracts entities using regex patterns (SYMPTOM, CONDITION, etc.)
3. Generates 1024-dim embeddings via NVIDIA Hosted NIM API
4. Stores entities in `SQLUser.Entities` with embeddings
5. Builds entity ID mapping
6. Stores relationships in `SQLUser.EntityRelationships` using entity IDs
**Status**: ✅ Successfully populated knowledge graph
### 2. iris-vector-rag Pipeline (Ready for Use)
**File**: `scripts/aws/build-knowledge-graph-aws.py`
**What it does**:
1. Uses ConfigurationManager for settings
2. Uses IRISVectorStore abstraction
3. Uses GraphRAGPipeline for entity extraction
4. Handles schema management automatically
**Status**: ⚠️ Needs NVIDIA NIM API key configuration
---
## Next Steps
### 1. Configure NVIDIA NIM API Key
The iris-vector-rag GraphRAG pipeline needs NVIDIA NIM credentials:
```bash
# Option A: Environment variable
export NVIDIA_API_KEY="nvapi-..."
# Option B: Add to config/fhir_graphrag_config.aws.yaml
embeddings:
provider: "nvidia_nim"
api_key: "nvapi-..."
base_url: "https://integrate.api.nvidia.com/v1"
```
### 2. Test GraphRAG Multi-Modal Search
Once configured, test knowledge graph queries:
```python
from iris_vector_rag.pipelines import GraphRAGPipeline
pipeline = GraphRAGPipeline(config_manager=config, vector_store=vector_store)
# Query combines:
# - Vector search (embeddings)
# - Text search (keywords)
# - Graph traversal (entity relationships)
results = pipeline.query("chest pain", top_k=5)
```
### 3. Deploy NIM LLM Service (Optional)
For answer generation, deploy NVIDIA NIM LLM on AWS:
```yaml
llm:
provider: "nvidia_nim"
model: "nvidia/llama-3.1-nemotron-70b-instruct"
base_url: "http://3.84.250.46:8000"
```
---
## Data Sources
### FHIR Documents
- **Source**: `SQLUser.FHIRDocuments`
- **Count**: 51 DocumentReference resources
- **Format**: Hex-encoded clinical notes
- **Origin**: Synthea synthetic patient data
### Embeddings
- **Model**: NVIDIA NV-EmbedQA-E5-v5
- **Dimension**: 1024
- **API**: NVIDIA Hosted NIM (integrate.api.nvidia.com)
- **Cost**: Free tier (1000 requests/day)
### Vector Storage
- **Table**: `SQLUser.ClinicalNoteVectors`
- **Documents**: 51 clinical notes
- **Embeddings**: 51 x 1024-dimensional vectors
- **Database**: InterSystems IRIS on AWS EC2
---
## Architecture
```
┌─────────────────────────────────────────────────────────────┐
│ AWS EC2 g5.xlarge │
│ InterSystems IRIS │
├─────────────────────────────────────────────────────────────┤
│ │
│ SQLUser.FHIRDocuments │
│ ├── 51 FHIR DocumentReference resources │
│ └── Hex-encoded clinical notes │
│ │
│ SQLUser.ClinicalNoteVectors │
│ ├── 51 document embeddings (1024-dim) │
│ └── NVIDIA NIM embeddings │
│ │
│ SQLUser.Entities ✅ NEW │
│ ├── 83 medical entities │
│ ├── Entity types: SYMPTOM, CONDITION, etc. │
│ └── 1024-dim embeddings per entity │
│ │
│ SQLUser.EntityRelationships ✅ NEW │
│ ├── 540 entity relationships │
│ └── CO_OCCURS_WITH relationships │
│ │
└─────────────────────────────────────────────────────────────┘
│ │
│ Vector Search │ Graph Traversal
▼ ▼
┌──────────────────┐ ┌──────────────────┐
│ NVIDIA NIM API │ │ GraphRAG Query │
│ Hosted Service │ │ Engine │
└──────────────────┘ └──────────────────┘
```
---
## Verification
### Entity Count
```sql
SELECT COUNT(*) FROM SQLUser.Entities;
-- Result: 83
```
### Relationship Count
```sql
SELECT COUNT(*) FROM SQLUser.EntityRelationships;
-- Result: 540
```
### Entity Type Distribution
```sql
SELECT EntityType, COUNT(*)
FROM SQLUser.Entities
GROUP BY EntityType
ORDER BY COUNT(*) DESC;
```
Result:
```
TEMPORAL : 43
SYMPTOM : 21
BODY_PART : 10
CONDITION : 6
MEDICATION : 2
PROCEDURE : 1
```
### Sample Query
```sql
SELECT e1.EntityText as Source,
r.RelationshipType,
e2.EntityText as Target
FROM SQLUser.EntityRelationships r
JOIN SQLUser.Entities e1 ON r.SourceEntityID = e1.EntityID
JOIN SQLUser.Entities e2 ON r.TargetEntityID = e2.EntityID
LIMIT 5;
```
---
## Performance Notes
### Entity Extraction
- **Time**: ~2 minutes for 51 documents
- **Method**: Regex-based pattern matching
- **Throughput**: ~25 documents/minute
### Embedding Generation
- **API**: NVIDIA Hosted NIM
- **Latency**: ~100ms per entity
- **Total Time**: ~8 seconds for 83 entities
- **Cost**: Free (hosted tier)
### Database Operations
- **Entity Inserts**: Batched (commit every 50)
- **Relationship Inserts**: Batched (commit every 50)
- **Total Storage**: <1MB (entities + relationships)
---
## Related Documents
- **Bug Report**: `IRIS_VECTOR_RAG_GRAPHRAG_BUG_REPORT.md`
- **Bug Resolution**: `IRIS_VECTOR_RAG_GRAPHRAG_BUG_RESOLUTION.md`
- **AWS Config**: `config/fhir_graphrag_config.aws.yaml`
- **Extraction Script**: `scripts/aws/extract-entities-aws.py`
- **Pipeline Script**: `scripts/aws/build-knowledge-graph-aws.py`
---
## Success Criteria ✅
- [x] GraphRAG bug fixed (import time)
- [x] Entity extraction working
- [x] 83 entities stored with embeddings
- [x] 540 relationships stored with entity IDs
- [x] Schema compatible with iris-vector-rag
- [x] Knowledge graph queryable
- [ ] NVIDIA NIM API key configured (next step)
- [ ] Multi-modal search tested (next step)
---
**Status**: ✅ **KNOWLEDGE GRAPH READY FOR QUERYING**
The knowledge graph is fully populated and ready for GraphRAG queries. Next step is to configure NVIDIA NIM API credentials for the iris-vector-rag pipeline.