RAGStack-Lambda


MIGRATION.md•8.81 KiB

# Migration Guide: v1.x to v2.0

This guide covers migrating RAGStack deployments from v1.x (deployed after December 24, 2025) to v2.0.

## Breaking Changes in v2.0

### Architecture Changes

| Component | v1.x | v2.0 |
|-----------|------|------|
| **S3 Prefixes** | `output/` (docs), `images/` (images) | `content/` (unified) |
| **Data Sources** | 2 (TextDataSource + ImageDataSource) | 1 (DataSource) |
| **Env Vars** | `TEXT_DATA_SOURCE_ID`, `IMAGE_DATA_SOURCE_ID` | `DATA_SOURCE_ID` |
| **Image Metadata** | `IN_LINE_ATTRIBUTE` | `S3_LOCATION` (.metadata.json files) |
| **CloudFormation Outputs** | `DataSourceId`, `TextDataSourceId`, `ImageDataSourceId` | `DataSourceId` only |

### Why Migrate?

- **Simplified architecture**: Single data source reduces complexity
- **Unified content handling**: All content types use consistent metadata format
- **Improved metadata extraction**: New LLM-based metadata extraction for all content
- **Better filtering**: `content_type` field enables filtering by document/image/web_page

## Prerequisites

- Python 3.13+ with `boto3` installed
- AWS CLI configured with appropriate permissions
- Access to the deployed stack

## Migration Process

### Step 1: Run the Migration Script (Dry Run)

First, preview what changes will be made:

```bash
python scripts/migrate_v1_to_v2.py --stack-name <your-stack-name> --dry-run
```

This will show:

- Files that will be copied from `output/` and `images/` to `content/`
- DynamoDB tracking records that will be updated

### Step 2: Run the Actual Migration

Once satisfied with the dry run output:

```bash
python scripts/migrate_v1_to_v2.py --stack-name <your-stack-name>
```

The script will:

1. Copy all files from `output/` to `content/`
2. Copy all files from `images/` to `content/`
3. Update tracking table records with new S3 URIs

**Note:** The script is idempotent - it skips files that already exist in `content/` and records that are already updated.
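The copy and record-update steps above amount to a prefix rewrite plus a skip-if-already-present check. The following is a minimal sketch of that logic only, not the contents of `migrate_v1_to_v2.py`: the function names (`map_key`, `plan_copies`, `map_uri`), the bucket name, and the example keys are hypothetical, and the real script may differ in how it lists objects and detects already-migrated records.

```python
from typing import Iterable, List, Optional, Tuple

# Prefixes taken from the table above; everything else here is illustrative.
OLD_PREFIXES = ("output/", "images/")
NEW_PREFIX = "content/"


def map_key(key: str) -> Optional[str]:
    """Return the content/ key for a v1.x key, or None if it is not under an old prefix."""
    for old in OLD_PREFIXES:
        if key.startswith(old):
            return NEW_PREFIX + key[len(old):]
    return None


def plan_copies(source_keys: Iterable[str], existing_keys: Iterable[str]) -> List[Tuple[str, str]]:
    """List (source, target) copies, skipping targets that already exist (idempotent)."""
    existing = set(existing_keys)
    plan = []
    for key in source_keys:
        target = map_key(key)
        if target is None or target in existing:
            continue  # not a v1.x key, or already migrated on a previous run
        plan.append((key, target))
    return plan


def map_uri(uri: str, bucket: str) -> str:
    """Rewrite an s3:// URI to the content/ prefix; already-updated URIs are returned unchanged."""
    base = f"s3://{bucket}/"
    if not uri.startswith(base):
        return uri
    target = map_key(uri[len(base):])
    return uri if target is None else base + target
```

Because `map_key` returns `None` for keys already under `content/`, running the plan a second time yields no additional copies and leaves updated URIs untouched, which is the property the note above describes.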
### Step 3: Deploy v2.0 Stack

Pull the latest code and deploy:

```bash
git pull origin main
sam build
sam deploy --stack-name <your-stack-name>
```

This updates:

- Lambda functions with new single data source logic
- Knowledge Base custom resource to create unified data source
- EventBridge rules to watch `content/` prefix

### Step 4: Trigger Reindex

Open the RAGStack dashboard and navigate to **Settings**:

1. Scroll to the **Knowledge Base Reindex** section
2. Click **Start Reindex**
3. Wait for the reindex to complete

The reindex will:

- Create a new Knowledge Base with the unified `content/` data source
- Re-extract metadata for all content using the new extraction system
- Ingest all documents, images, and scraped pages with fresh embeddings
- Delete the old Knowledge Base

### Step 5: Verify Migration

After reindex completes:

1. **Test chat**: Query the Knowledge Base and verify responses
2. **Check sources**: Ensure source attribution shows correct paths
3. **Test image search**: Verify image captions are searchable
4. **Test filters**: Use the `content_type` filter to search specific content types

## Migration Script Details

### What the Script Does

```text
migrate_v1_to_v2.py
├── Get stack outputs (bucket name, table name)
├── Step 1: Copy output/* → content/*
├── Step 2: Copy images/* → content/*
└── Step 3: Update DynamoDB tracking records
    ├── output_s3_uri: output/ → content/
    ├── input_s3_uri: images/ → content/ (for images)
    └── caption_s3_uri: images/ → content/ (for images)
```

### What the Script Does NOT Do

- Does NOT delete old files (`output/`, `images/` remain intact)
- Does NOT modify the Knowledge Base (reindex handles this)
- Does NOT generate metadata files (reindex handles this)

### Options

```bash
python scripts/migrate_v1_to_v2.py --help

Options:
  --stack-name    CloudFormation stack name (required)
  --region        AWS region (default: us-east-1)
  --dry-run       Preview changes without making them
  --verbose, -v   Enable debug logging
```

## Reindex Details

The reindex process handles all content types with type-specific logic:

| Type | Text Source | Metadata Extraction | Ingestion |
|------|-------------|---------------------|-----------|
| Documents | `output_s3_uri` | LLM extracts from text | 1 document |
| Images | `caption_s3_uri` | LLM extracts from caption | 2 documents (image + caption) |
| Scraped | `output_s3_uri` | Job-aware (see below) | 1 document |

### Job-Aware Scraped Content Reindex

Scraped content uses a special two-level metadata extraction:

1. **Job-level metadata**: Extracted from the **seed document** (first page scraped)
   - Applied to ALL pages in the scrape job
   - Provides semantic context (e.g., "AWS Lambda documentation")
2. **Page-level metadata**: Deterministic fields for each page
   - `source_url`, `source_domain`, `scraped_date`, `job_id`

**How it works:**

```text
Scraped Page → S3 metadata → job_id
    ↓
ScrapeJobs table → base_url
    ↓
Find seed document (source_url == base_url)
    ↓
Re-extract job metadata from seed (using NEW settings)
    ↓
Merge job metadata + page metadata
```

This ensures:

- All pages in a job share semantic metadata from the seed
- Metadata uses the NEW extraction settings (not preserved from original scrape)
- Job metadata is cached per-batch to avoid redundant LLM calls

### Common Metadata

All content gets:

- Fresh metadata extraction using configured LLM model
- `content_type` field for filtering ("document", "image", "web_page")
- Base metadata (`document_id`, `filename`, `file_type`)

## Rollback

If issues occur after migration:

1. **Before stack deploy**: Old files still exist in `output/` and `images/` - no rollback needed
2. **After stack deploy but before reindex**: Re-deploy old code version
3. **After reindex**: The old KB is deleted; you'd need to re-upload content

## Troubleshooting

### Migration Script Errors

**"Stack not found"**

- Verify stack name is correct
- Check you're using the right AWS region

**"Access Denied" errors**

- Ensure AWS credentials have S3 read/write and DynamoDB permissions
- Check the IAM user/role has access to the stack's resources

### Reindex Errors

**"Failed to read text"**

- Check the `output_s3_uri` or `caption_s3_uri` paths are correct
- Verify files were copied to `content/` prefix

**"Metadata extraction failed"**

- Check Bedrock model access (ensure your region supports the configured model)
- Review CloudWatch logs for the ReindexKB Lambda

### Post-Migration Issues

**Chat returns no results**

- Wait for reindex to fully complete
- Check Knowledge Base status in AWS console
- Verify data source has correct `content/` prefix

**Images not searchable**

- Ensure caption files exist at `content/{imageId}/caption.txt`
- Check the image's tracking record has `caption_s3_uri` field

### SAM Layer Caching Issue

**Symptoms:**

- Lambda functions fail with `No module named 'ragstack_common'` or `No module named 'crhelper'`
- CloudFormation update gets stuck on `CodeBuildRun` or `WCCodeBuildRun` custom resources
- Stack enters `UPDATE_ROLLBACK_FAILED` state

**Cause:** SAM uses content hashing to skip S3 uploads. After a reindex creates a new Knowledge Base, if you redeploy, SAM may reuse a stale/corrupted layer artifact from S3 instead of uploading the freshly built layer. The local build shows ~121MB but S3 only has ~120KB.

**Diagnosis:**

```bash
# Check local build size (should be ~121MB)
du -sh .aws-sam/build/RagstackCommonLayer/

# Check deployed layer size in bytes (zipped, so smaller than the local build, but far more than ~120KB)
aws lambda get-function-configuration --function-name <stack>-sync-status-checker \
  --query "Layers[0].CodeSize" --output text
```

**Fix:**

1. If stack is stuck in `UPDATE_IN_PROGRESS`, cancel and wait for rollback:

   ```bash
   aws cloudformation cancel-update-stack --stack-name <stack> --region us-east-1
   ```

2. If stack is in `UPDATE_ROLLBACK_FAILED`, continue rollback skipping failed resources:

   ```bash
   aws cloudformation continue-update-rollback --stack-name <stack> --region us-east-1 \
     --resources-to-skip CodeBuildRun WCCodeBuildRun BatchProcessorFunction \
     AppSyncResolverFunction ConfigurationResolverFunction
   ```

3. Once stack is in `UPDATE_ROLLBACK_COMPLETE`, clear caches and redeploy:

   ```bash
   # Delete SAM build cache
   rm -rf .aws-sam/

   # Delete stale S3 artifacts (keep UI source zips)
   aws s3 rm s3://<stack>-artifacts-<account-id>/ --recursive --exclude "*.zip"

   # Fresh build and deploy
   sam build --parallel
   python publish.py --stack-name <stack> --admin-email <email>
   ```

**Prevention:** After running reindex, always clear the SAM cache before redeploying:

```bash
rm -rf .aws-sam/
```

## Support

For issues:

- Check [TROUBLESHOOTING.md](./TROUBLESHOOTING.md)
