RAGStack-Lambda


# Architecture

## System Design

```
Upload → OCR → Bedrock KB (embeddings + indexing)
                    ↓
         UI/Chat ←→ Query Bedrock KB
```

**Principles:**
- Serverless (auto-scaling, no servers)
- Cost-optimized (S3 vectors ~$1/mo vs OpenSearch $50+/mo)
- Error handling (DLQ, 3x retry), CloudWatch metrics

## Components

| Component | Purpose |
|-----------|---------|
| DetectFileType Lambda | Detect file type, count pages, and route to appropriate processor |
| ProcessDocument Lambda | OCR extraction (Textract/Bedrock) for PDF/images |
| ProcessMedia Lambda | Video/audio transcription via AWS Transcribe, 30s segmentation |
| ProcessText Lambda | Text extraction for HTML, CSV, JSON, XML, EML, EPUB, DOCX, XLSX |
| EnqueueBatches Lambda | Queue batch jobs to SQS |
| BatchProcessor Lambda | Process 10-page batches (max 10 concurrent) |
| CombinePages Lambda | Merge partial outputs into final document |
| IngestToKB Lambda | Trigger Bedrock KB ingestion (Nova Multimodal embeddings) |
| IngestMedia Lambda | Ingest transcribed media segments to KB |
| QueryKB Lambda | Query documents, chat with sources |
| SearchKB Lambda | Direct KB search (no chat context) |
| ProcessImage Lambda | Image ingestion with captions |
| Scrape Lambdas | Web scraping pipeline (start/discover/process/status) |
| ReindexKB Lambda | Orchestrate KB reindexing with new metadata settings |
| MetadataAnalyzer Lambda | Sample KB vectors and generate filter examples |
| SyncCoordinator Lambda | Coordinate KB sync operations |
| SyncStatusChecker Lambda | Check KB sync completion status |
| ConfigurationResolver Lambda | Resolve DynamoDB configuration |
| AppSyncResolvers Lambda | GraphQL resolver implementations |
| ApiKeyResolver Lambda | API key validation and management |
| Step Functions | Orchestrate document/scrape/reindex workflows |
| Bedrock KB | Vector storage & retrieval (S3 backend) |
| S3 | File storage (input/, output/, images/) |
| DynamoDB | Document tracking, config, conversations, scrape jobs |
| AppSync | GraphQL API with subscriptions |
| React UI | Web dashboard (Cloudscape) |
| ragstack-chat | AI chat web component |

## Data Flow

### Document Processing

Documents are automatically routed to the appropriate processor based on file type detection:

```
Upload → DetectFileType → Route by Type:
    │
    ├── Text files (HTML, TXT, CSV, JSON, XML, EML, EPUB, DOCX, XLSX)
    │   └── ProcessText → IngestToKB → Bedrock KB
    │
    ├── OCR files (PDF, images)
    │   └── ProcessDocument → IngestToKB → Bedrock KB
    │
    ├── Media files (MP4, WebM, MP3, WAV, M4A, OGG, FLAC)
    │   └── ProcessMedia → AWS Transcribe → 30s segments → IngestToKB → Bedrock KB
    │
    └── Passthrough (Markdown)
        └── ProcessDocument → IngestToKB → Bedrock KB
```

**Supported File Types:**

| Category | Types | Processing |
|----------|-------|------------|
| **Text** | HTML, TXT, CSV, JSON, XML, EML, EPUB, DOCX, XLSX | Direct text extraction with smart analysis |
| **OCR** | PDF, JPG, PNG, TIFF, GIF, BMP, WebP, AVIF | Textract or Bedrock vision OCR (WebP/AVIF require Bedrock) |
| **Media** | MP4, WebM, MP3, WAV, M4A, OGG, FLAC | AWS Transcribe speech-to-text, 30s segments with timestamps |
| **Passthrough** | Markdown (.md) | Copy directly to output |

**Text Processing:** Content sniffing detects actual file type regardless of extension. Structured formats (CSV, JSON, XML) get smart extraction with schema analysis.

**Large PDFs (>20 pages):**

1. **Upload:** User → S3 input/ → EventBridge → Step Functions
2. **Page Info:** DetectFileType counts pages, creates 10-page batches
3. **Queue:** EnqueueBatches → SQS batch queue
4. **Process:** BatchProcessor Lambda (max 10 concurrent) → partial files
5. **Combine:** Last batch triggers CombinePages → merged output
6. **Indexing:** IngestToKB → Bedrock KB

**95% threshold:** Ingestion proceeds if ≥95% of pages processed successfully. Failed batches retry 3x before DLQ.

### Web Scraping

1. **Start:** User → AppSync → ScrapeStart Lambda → SQS discovery queue
2. **Discover:** ScrapeDiscover finds links → SQS processing queue
3. **Process:** ScrapeProcess fetches content → S3 input/ (.scraped.md)
4. **Index:** Step Functions → ProcessDocument → IngestToKB

### Image Processing

1. **Upload:** User → S3 images/ → EventBridge → ProcessImage
2. **Indexing:** ProcessImage ingests image + caption to Bedrock KB
3. **Cross-modal:** Both visual and text vectors share image_id

### Media Processing (Video/Audio)

1. **Upload:** User → S3 input/ → EventBridge → DetectFileType
2. **Transcribe:** ProcessMedia → AWS Transcribe batch job → transcript with timestamps
3. **Segment:** Transcript split into 30-second chunks
4. **Metadata:** Each segment tagged with `timestamp_start`, `timestamp_end`, `speaker` (if diarization enabled)
5. **Indexing:** Segments ingested to Bedrock KB with timestamp metadata
6. **Query:** Sources include timestamp ranges, URLs with `#t=start,end` fragment for HTML5 playback

**Speaker diarization:** When enabled, Transcribe identifies up to 10 speakers. Each segment tracks the primary speaker for filtering.

**Source format:** Chat responses show timestamps like "1:30-2:00" with clickable links that open the media at that position.

### Knowledge Base Reindex

1. **Trigger:** User → AppSync → startReindex mutation → Step Functions
2. **Init:** Create new S3 Vectors bucket + Knowledge Base
3. **Process:** Map state iterates documents, regenerates metadata, ingests to new KB
4. **Finalize:** Update SSM parameter to new KB ID
5. **Cleanup:** Delete old KB and S3 Vectors bucket

**Note:** Reindex regenerates metadata only - does NOT re-run OCR/text extraction.

### Chat Query

1. **Query:** User → AppSync → QueryKB Lambda
2. **Quota Check:** Atomic DynamoDB transaction (global + per-user limits)
3. **History:** Load last 5 conversation turns for context
4. **Retrieve:** bedrock_agent.retrieve() → top 5 KB results
5. **Generate:** bedrock_runtime.converse() → answer with citations
6. **Sources:** KB URIs resolved to original files via tracking table
7. **Store:** Save turn to conversation history (14-day TTL)

**Media sources:** Results from video/audio include `timestampStart`, `timestampEnd` (seconds), `timestampDisplay` ("1:30-2:00"), and `segmentUrl` with `#t=start,end` fragment for direct playback positioning.

### Real-time Updates

All state changes publish via GraphQL subscriptions:
- `onDocumentUpdate` - Document processing progress
- `onImageUpdate` - Image processing progress
- `onScrapeUpdate` - Web scraping progress
- `onReindexUpdate` - Knowledge Base reindex progress

UI subscribes on load, updates automatically without polling.

## Architecture Decisions

**Why SAM?** Local testing, simpler Lambda packaging

**Why S3 vectors?** ~$1/month vs $50+/month for OpenSearch

**Why DynamoDB config?** Changes apply immediately, no redeployment

**Why shared library?** `lib/ragstack_common/` eliminates duplication

**Error handling:** Lambda retry → Bedrock retry → DLQ

## Security

- HTTPS/TLS everywhere
- S3 SSE, DynamoDB encryption
- Cognito auth + optional MFA
- API key for programmatic access (all operations)
- API key regeneration (manual, via Settings UI)
- Least-privilege IAM
- Public S3 blocked

## API Access

All operations support both API key and Cognito authentication:

| Operation | Endpoint | Auth |
|-----------|----------|------|
| Search KB | `searchKnowledgeBase` | API key / Cognito |
| Chat | `queryKnowledgeBase` | API key / Cognito |
| Upload docs | `createUploadUrl` | API key / Cognito |
| Upload images | `createImageUploadUrl`, `submitImage` | API key / Cognito |
| Scrape | `startScrape`, `getScrapeJob` | API key / Cognito |

**In-app documentation:** Each UI tab includes an expandable section with GraphQL queries and code examples.
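
As a rough illustration of API-key access, the sketch below builds a GraphQL POST request for the `searchKnowledgeBase` operation using only the Python standard library. The `x-api-key` header is the standard AppSync API-key header; the endpoint URL, the `$query` argument, and the `results { content source }` selection are assumptions made for illustration, so copy the real query text from the in-app documentation before using it.

```python
import json
import urllib.request

# Hypothetical query document: the operation name appears in the API Access
# table, but the argument and result field names are assumptions.
SEARCH_QUERY = """
query Search($query: String!) {
  searchKnowledgeBase(query: $query) {
    results { content source }
  }
}
"""


def build_graphql_request(endpoint, api_key, query, variables):
    """Build (but do not send) a GraphQL POST request authenticated by API key."""
    payload = json.dumps({"query": query, "variables": variables}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=payload,
        headers={
            "Content-Type": "application/json",
            "x-api-key": api_key,  # AppSync API-key authentication header
        },
        method="POST",
    )


def execute(request, timeout=30):
    """Send a prepared request and return the decoded JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

To try it, pass your deployment's AppSync endpoint and API key to `build_graphql_request(...)` and hand the result to `execute(...)`. Keeping request construction separate from sending makes the payload easy to inspect before any network call is made.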
## Performance

**10-page PDF:**
- Upload: <5 sec
- OCR: 2-15 min
- KB Sync: 1-10 min

**Optimization:**
- Text-native PDFs: 50% faster (skip OCR)
- Smaller docs: scales linearly

## Cost

1000 docs/month (5 pages each):
- Textract + Haiku: **$7-10/month**
- Bedrock OCR + Haiku: **$25-75/month**

See [Configuration](CONFIGURATION.md)

## Stack

- **Infrastructure:** SAM, Lambda, Step Functions
- **Storage:** S3, DynamoDB, Bedrock KB
- **APIs:** AppSync, Bedrock, Textract, Transcribe
- **Frontend:** React 19, Vite, Cloudscape
- **Chat:** ragstack-chat web component
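
## Appendix: Batching Sketch

The large-PDF flow under Document Processing splits a document into 10-page batches and proceeds to ingestion once at least 95% of pages have been processed. The sketch below shows one way that arithmetic could look. The function names and the inclusive 1-based page ranges are assumptions for illustration and are not taken from the project's code.

```python
BATCH_SIZE = 10          # pages per batch, as stated in the data flow
THRESHOLD_PERCENT = 95   # ingestion proceeds at >= 95% of pages processed


def plan_batches(page_count, batch_size=BATCH_SIZE):
    """Return inclusive, 1-based (first_page, last_page) ranges covering the document."""
    return [
        (start, min(start + batch_size - 1, page_count))
        for start in range(1, page_count + 1, batch_size)
    ]


def can_ingest(pages_ok, page_count, threshold_percent=THRESHOLD_PERCENT):
    """Return True when the share of successfully processed pages meets the threshold."""
    if page_count <= 0:
        return False
    # Integer comparison avoids floating-point rounding at the boundary.
    return pages_ok * 100 >= page_count * threshold_percent


batches = plan_batches(45)
print(batches)             # five ranges, the last one shorter than BATCH_SIZE
print(can_ingest(43, 45))  # 43 of 45 pages is about 95.6%
```

With a 45-page document, losing two pages still clears the threshold, while losing three (42 of 45, about 93.3%) does not.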
