# Vector Store Workflow
This document explains the complete workflow for using the AWS MCP Server with vector store capabilities.
## Quick Start
### 1. Setup Data Source
```bash
# Create directory and add PDF documents
mkdir -p datasource/aws-docs
# Copy your PDF files to datasource/ or subdirectories
```
### 2. Start Server with Vector Store
```bash
# Start with automatic document ingestion
python -m aws_mcp_server.server --enable-vector-store
# Or with custom data source path
python -m aws_mcp_server.server --enable-vector-store --data-source /path/to/your/documents
```
### 3. Server Initialization Flow
When you start the server with `--enable-vector-store`, it automatically:
#### Step 1: Vector Store Check
- ✅ Checks if vector store database exists
- Reports current document count if found
- Creates new vector store if needed
#### Step 2: File Index Loading
- Loads existing file tracking index (`vector_store_index.json`)
- Contains hashes and metadata for previously processed files
- Creates new index if none exists
#### Step 3: Data Source Scan
- Scans data source directory for supported files (.pdf)
- Lists all found files with their paths
- 🏷️ Prepares automatic categorization based on path/filename
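The scan step can be sketched roughly as below. This is an illustration only: the function name `scan_data_source` and the case-insensitive suffix match are assumptions, not the server's actual code.

```python
from pathlib import Path

SUPPORTED_SUFFIXES = {".pdf"}  # the only format this document lists


def scan_data_source(root: Path) -> list[Path]:
    """Recursively collect supported files under root, in a stable order."""
    return sorted(
        path
        for path in root.rglob("*")
        # Assumption: suffix matching ignores case, so "REPORT.PDF" is accepted.
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES
    )
```

Calling `scan_data_source(Path("datasource"))` would return every PDF in the directory and its subdirectories.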
#### Step 4: Incremental Sync
- Compares file hashes to detect new/modified files
- ⏭️ Skips unchanged files to avoid duplicate processing
- 📥 Ingests only new or modified documents
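A minimal sketch of hash-based change detection, assuming the index maps file paths to entries with a `hash` field, as in the File Index Structure section. The helper names `file_sha256` and `classify_files` are hypothetical, and the server may implement this differently.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, block_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, reading it in blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def classify_files(paths: list[Path], index: dict[str, dict]) -> dict[str, list[Path]]:
    """Sort files into new, modified, and unchanged by comparing hashes to the index."""
    result: dict[str, list[Path]] = {"new": [], "modified": [], "unchanged": []}
    for path in paths:
        entry = index.get(path.as_posix())  # assumption: keys are POSIX-style paths
        if entry is None:
            result["new"].append(path)
        elif entry.get("hash") != file_sha256(path):
            result["modified"].append(path)
        else:
            result["unchanged"].append(path)
    return result
```

Only the files under `new` and `modified` would then be passed on to ingestion.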
#### Step 5: Auto-Categorization
Files are automatically categorized based on:
- **Directory structure**: `aws-docs/` → technical/documentation
- **Filename keywords**: `wellarchitected` → technical/documentation + aws tags
- **Path patterns**: `policies/` → business/policy
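The three rules above could be expressed as a small function like the following. It is a sketch: the name `categorize`, the rule ordering, and the `uncategorized`/`document` fallback are assumptions for illustration, since this document does not say how unmatched files are labeled.

```python
from pathlib import PurePosixPath


def categorize(path: str) -> dict:
    """Derive category, doc_type, and tags from a file's path and name."""
    parts = PurePosixPath(path).parts
    name = PurePosixPath(path).name.lower()

    # Hypothetical fallback for files that match no rule.
    category, doc_type = "uncategorized", "document"
    tags: list[str] = []

    if "aws-docs" in parts:
        category, doc_type = "technical", "documentation"
    elif "policies" in parts:
        category, doc_type = "business", "policy"

    # Assumption: the filename keyword rule is applied last and takes precedence.
    if "wellarchitected" in name:
        category, doc_type = "technical", "documentation"
        tags.append("aws")

    return {"category": category, "doc_type": doc_type, "tags": tags}
```

For example, `categorize("datasource/policies/data-privacy-policy.pdf")` yields the `business` category with the `policy` document type.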
#### Step 6: Index Update
- 💾 Updates file index with new hashes and metadata
- Tracks ingestion timestamps and chunk counts
- 🏷️ Stores category and tag information
#### Step 7: Ready to Serve
- ✅ Vector store is ready for semantic queries
- 🎯 All MCP tools are available
- Reports final document count and status
## Example Server Output
```
Initializing Vector Store
Data Source: datasource/
Checking vector store status...
✅ Vector store found with 0 documents
No existing index found, creating new one
Scanning data source directory: datasource/
Found 2 files
• datasource/aws-docs/wellarchitected-framework.pdf
• datasource/policies/data-privacy-policy.pdf
Syncing files with vector store...
📥 Ingesting wellarchitected-framework.pdf (new)
✅ Success! Added 1247 chunks
📥 Ingesting data-privacy-policy.pdf (new)
✅ Success! Added 89 chunks
Sync Summary:
New files: 2
Updated files: 0
⏭️ Skipped files: 0
✅ Vector Store Ready!
Total documents: 1336
Tracked files: 2
🎯 Ready to serve document queries!
```
## File Management
### Adding New Documents
1. **Copy PDF files** to `datasource/` (or subdirectories)
2. **Restart the server** - new files will be automatically detected and ingested
3. **Check the console output** to see ingestion progress
### Updating Existing Documents
1. **Replace the PDF file** with the updated version
2. **Restart the server** - modified files will be re-ingested
3. **Old chunks are replaced** with new content
### Directory Organization
```
datasource/
├── aws-docs/              # AWS technical documentation
│   ├── wellarchitected-framework.pdf
│   └── security-best-practices.pdf
├── policies/              # Business policies
│   ├── data-privacy-policy.pdf
│   └── security-policy.pdf
├── manuals/               # Technical manuals
│   └── api-reference.pdf
└── research/              # Research papers
    └── cloud-security-study.pdf
```
## Querying Documents
Once ingested, use MCP tools to search documents:
### Search by Content
```python
# Search across all documents
await document_search(
    query="What are the Well-Architected pillars?",
    n_results=5
)
```
### Search by Category
```python
# Search only business policies
await document_search(
    query="data privacy requirements",
    category="business",
    doc_type="policy"
)
```
### Search by Tags
```python
# Search documents with specific tags
await document_search(
    query="security best practices",
    tags=["aws", "security"]
)
```
## File Index Structure
The `vector_store_index.json` file tracks:
```json
{
  "datasource/aws-docs/wellarchitected-framework.pdf": {
    "hash": "sha256_hash_of_file",
    "size": 2847392,
    "modified_time": 1704567890.123,
    "ingested_at": "2024-01-06T15:30:00",
    "document_title": "wellarchitected-framework",
    "chunks_created": 1247,
    "category": "technical",
    "doc_type": "documentation",
    "tags": ["aws", "framework", "best-practices", "auto_synced"]
  }
}
```
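As an illustration of how an entry with these fields might be assembled after a file is ingested, here is a sketch. The function `build_index_entry` is hypothetical, and taking `document_title` from the file stem is an assumption based on the example above.

```python
import hashlib
from datetime import datetime
from pathlib import Path


def build_index_entry(
    path: Path,
    chunks_created: int,
    category: str,
    doc_type: str,
    tags: list[str],
) -> dict:
    """Build one index entry using the field names shown in the example index."""
    stat = path.stat()
    return {
        "hash": hashlib.sha256(path.read_bytes()).hexdigest(),
        "size": stat.st_size,
        "modified_time": stat.st_mtime,
        "ingested_at": datetime.now().isoformat(timespec="seconds"),
        "document_title": path.stem,  # assumption: title is the filename without suffix
        "chunks_created": chunks_created,
        "category": category,
        "doc_type": doc_type,
        "tags": list(tags),
    }
```

The resulting dictionary could be stored under the file's path and written out with `json.dump`.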
## Performance Notes
- **Hash-based change detection** means only modified files are re-processed
- **Large documents** are automatically chunked for optimal embedding
- **Memory usage** scales with the number of documents in the vector store
- **Startup time** depends on the number of new files to process
## Troubleshooting
### No Files Found
```
ℹ️ No supported files found in datasource/
Supported formats: .pdf
```
**Solution**: Add PDF files to the data source directory
### File Access Errors
```
⚠️ Error reading file.pdf: Permission denied
```
**Solution**: Check file permissions and ensure the server can read the files
### Memory Issues
**Symptoms**: Slow performance or out of memory errors
**Solution**: Process large documents in batches or increase system memory
### Index Corruption
**Symptoms**: Files are re-ingested on every startup
**Solution**: Delete `vector_store_index.json` and restart the server so the index is rebuilt from scratch
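Before deleting the index, it can help to confirm that it is actually damaged. The following is a minimal diagnostic sketch, not a tool shipped with the server. The function `check_index` is hypothetical, and it assumes only that each entry is expected to carry a `hash` field.

```python
import json
from pathlib import Path


def check_index(index_path: Path) -> list[str]:
    """Return a list of problems found in the index; an empty list means it looks usable."""
    if not index_path.exists():
        return ["index file is missing"]
    try:
        data = json.loads(index_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems = []
    for key, entry in data.items():
        # Without a stored hash, a file can never be recognized as unchanged.
        if not isinstance(entry, dict) or "hash" not in entry:
            problems.append(f"{key}: entry has no hash")
    return problems
```

An empty result from `check_index(Path("vector_store_index.json"))` suggests the re-ingestion has another cause, such as files being rewritten between runs.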
## Advanced Configuration
### Custom Data Source
```bash
# Use a different directory
python -m aws_mcp_server.server --enable-vector-store --data-source /company/documents
```
### Environment Variables
- `ENABLE_VECTOR_STORE` - Enable vector store features (default: false)
- `CHROMA_DB_PATH` - Vector store database location (default: ./chroma_db)
- `DATA_SOURCE_PATH` - Default data source directory (overridden by CLI)
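The precedence described above (CLI flag over environment variable over default) can be sketched as follows. The function `resolve_settings`, the accepted truthy strings, and the `datasource` fallback directory are assumptions for illustration. Only the variable names and the `false` and `./chroma_db` defaults come from this document.

```python
import os
from typing import Mapping, Optional

TRUTHY = {"1", "true", "yes", "on"}  # assumption: values treated as "enabled"


def resolve_settings(
    cli_enable: bool = False,
    cli_data_source: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> dict:
    """Combine CLI arguments and environment variables into effective settings."""
    env = os.environ if env is None else env
    env_enabled = env.get("ENABLE_VECTOR_STORE", "false").strip().lower() in TRUTHY
    return {
        "enable_vector_store": cli_enable or env_enabled,
        "chroma_db_path": env.get("CHROMA_DB_PATH", "./chroma_db"),
        # The CLI value wins over DATA_SOURCE_PATH; "datasource" is an assumed fallback.
        "data_source": cli_data_source or env.get("DATA_SOURCE_PATH", "datasource"),
    }
```

Passing an explicit `env` mapping makes the precedence rules easy to test without touching the real environment.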
This workflow ensures your documents are automatically available for semantic search with minimal manual intervention!