# CLAUDE.md - Project Context for AI Assistant
> **Purpose**: This file provides context for Claude AI sessions to understand the project without re-exploration.
> **Audience**: Claude (AI Assistant)
> **Last Updated**: 2026-01-06
---
## 1. PROJECT IDENTITY
**Project Name**: GCP Sales Analytics POC
**Location**: `/Users/matthewiames/Desktop/gcp_mcp`
**Status**: ✅ Fully implemented and documented
**GCP Project**: `iames-dq`
**GCP Region**: `us-central1`
**Purpose**: Demonstrate an intelligent AI agent (powered by Claude) that queries both Google Cloud SQL and BigQuery, automatically selecting the right data source based on user questions about sales data.
**Key Achievement**: Agent intelligently routes queries to Cloud SQL (transactional data) or BigQuery (analytics data) without the user needing to specify which database to query.
---
## 2. ARCHITECTURE OVERVIEW
```
┌─────────────────────────────────────────────┐
│           USER ASKS SALES QUESTION          │
└──────────────────────┬──────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────┐
│ CLAUDE AGENT (Anthropic Claude 3.5 Sonnet)  │
│  - Analyzes question keywords and intent    │
│  - Selects appropriate data source          │
│  - Formulates SQL query                     │
│  - Executes via MCP tools                   │
└──────────┬───────────────────┬──────────────┘
           │                   │
           ▼                   ▼
┌────────────────────┐  ┌─────────────────────────┐
│ CLOUD SQL          │  │ BIGQUERY                │
│ (PostgreSQL)       │  │ (thelook_ecommerce)     │
│                    │  │                         │
│ Tables:            │  │ Tables:                 │
│ • customers (50)   │  │ • products              │
│ • orders (50)      │  │ • users                 │
│ • vendors (50)     │  │ • events                │
│                    │  │ • inventory_items       │
│ Use for:           │  │ • order_items           │
│ - Customer queries │  │ • distribution_centers  │
│ - Recent orders    │  │                         │
│ - Vendor info      │  │ Use for:                │
│ - Transactions     │  │ - Product analytics     │
│                    │  │ - User behavior         │
│                    │  │ - Inventory analysis    │
│                    │  │ - Historical trends     │
└────────────────────┘  └─────────────────────────┘
```
### Data Source Selection Rules
The agent decides which data source to use based on:
| User Query Contains | → Use This Data Source | Reason |
|---------------------|------------------------|---------|
| "customer", "customers" | Cloud SQL | Customer table exists here |
| "vendor", "vendors" | Cloud SQL | Vendor table exists here |
| "order", "recent orders" | Cloud SQL | Transactional order data |
| "product", "products" | BigQuery | Product catalog in BigQuery |
| "inventory" | BigQuery | Inventory data in BigQuery |
| "user", "users" | BigQuery | User behavior data |
| "events", "analytics" | BigQuery | Event tracking data |
| "trends", "historical" | BigQuery | Large-scale analytics |
---
## 3. CRITICAL FILE MAP
### 📁 Project Root: `/Users/matthewiames/Desktop/gcp_mcp/`
**Documentation Files** (Read-only, informational):
- `README.md` - Complete user guide (10KB) - setup, architecture, usage
- `QUICKSTART.md` - 15-minute quick start guide (4KB)
- `PROJECT_SUMMARY.md` - Technical deep-dive (10KB)
- `CLAUDE.md` - **THIS FILE** - Context for Claude AI sessions
**Configuration Files**:
- `.env.example` - Environment variable template (COPY to `.env` for deployment)
- `.gitignore` - Git exclusions
- `requirements.txt` - Python dependencies (anthropic, google-cloud-bigquery, psycopg2, mcp, faker)
**Core Application** (Modify these for features):
- `adk_agent/integrated_agent.py` - **PRIMARY AGENT** (functional, 15KB)
- **MODIFY HERE**: To change system prompt or routing logic
- Uses Claude 3.5 Sonnet (`claude-3-5-sonnet-20241022`)
- Direct database connections (not via MCP server)
- Lines 42-66: System prompt (guides data source selection)
- `adk_agent/agent.py` - Conceptual framework (10KB)
- Reference implementation with placeholder tool execution
- Shows agentic architecture pattern
- **DO NOT MODIFY**: This is for reference only
- `mcp_server/server.py` - MCP Server implementation (12KB)
- **MODIFY HERE**: To add new MCP tools
- Provides 6 tools: query_cloudsql, query_bigquery, list_cloudsql_tables, list_bigquery_tables, get_cloudsql_schema, get_bigquery_schema
- Standards-compliant Model Context Protocol
**Infrastructure** (Modify for deployment changes):
- `terraform/main.tf` - Cloud SQL resource definitions (2KB)
- PostgreSQL 15, db-f1-micro instance
- Public IP with authorized networks
- `terraform/variables.tf` - Configurable parameters (1KB)
- Region, instance name, database name, credentials
- `terraform/outputs.tf` - Connection details (1KB)
- Instance name, connection string, public IP
- `terraform/terraform.tfvars.example` - Variable values template
**Data & Schema**:
- `data/schema.sql` - Database schema (2KB)
- 3 tables: customers, orders, vendors
- Includes indexes for performance
- **MODIFY HERE**: To change database structure
- `scripts/generate_seed_data.py` - Synthetic data generator (6KB)
- Uses Faker library
- Generates 50 records per table (configurable)
**Automation Scripts**:
- `scripts/deploy.sh` - One-command deployment (5KB, Bash)
- Runs Terraform, creates schema, seeds data
- **RUN THIS**: To deploy infrastructure
- `scripts/cleanup.sh` - Resource teardown (1KB, Bash)
- Destroys Cloud SQL instance
- **RUN THIS**: When done to avoid charges
- `scripts/test_agent.py` - Automated tests (2KB, Python)
---
## 4. KEY IMPLEMENTATION DETAILS
### System Prompt (Primary Agent Behavior Controller)
**Location**: `adk_agent/integrated_agent.py` lines 42-66
```python
self.system_prompt = """You are a sales analytics assistant with access to two data sources:
1. **Cloud SQL Database** (query_cloudsql, get_cloudsql_schema, list_cloudsql_tables):
- Contains recent transactional data
- Tables: customers, orders, vendors
- Use for: specific customer queries, recent order details, vendor information, current sales data
2. **BigQuery thelook_ecommerce Dataset** (query_bigquery, get_bigquery_schema, list_bigquery_tables):
- Public e-commerce analytics dataset
- Tables: products, users, events, inventory_items, order_items, orders, distribution_centers
- Use for: product analytics, user behavior, inventory analysis, historical trends, large-scale analytics
**Decision Guidelines:**
- For questions about "customers", "vendors", or recent "orders" → use Cloud SQL
- For questions about "products", "inventory", "users", "events", or broad analytics → use BigQuery
- When uncertain, first list tables from both sources to understand the data
- Always explain which data source you're using and why
- Combine data from both sources if needed to answer comprehensively
**Important:**
- Use descriptive SQL queries with clear column selections
- Always limit results to reasonable amounts (use LIMIT clause)
- Format numeric results clearly (e.g., currency with $ symbol)
- Provide context and insights, not just raw data
- When showing data, format it in a readable way (tables, lists, etc.)"""
```
**To modify agent behavior**: Edit this system prompt in `adk_agent/integrated_agent.py`
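The prompt above asks the agent to present results readably (tables, lists). As an illustrative aside, a small helper like the hypothetical `to_markdown_table` below (not in the repo) shows one way tool results could be rendered:

```python
# Illustrative only: render a list of row dicts as a markdown table,
# the kind of formatting the system prompt asks the agent to produce.
def to_markdown_table(rows):
    """rows: list of dicts sharing the same keys, in display order."""
    if not rows:
        return "(no rows)"
    headers = list(rows[0])
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(row[h]) for h in headers) + " |")
    return "\n".join(lines)
```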
### MCP Tools (6 Tools Available)
**Location**: `mcp_server/server.py` lines 38-128
1. **query_cloudsql**
- Execute SQL against Cloud SQL PostgreSQL
- Input: `{"query": "SELECT ..."}`
- Security: SELECT-only queries enforced
- Returns: JSON with row_count and data array
2. **query_bigquery**
- Execute SQL against BigQuery thelook_ecommerce
- Input: `{"query": "SELECT ..."}`
- Auto-prefixes dataset path if needed
- Returns: JSON with row_count, bytes_processed, data
3. **list_cloudsql_tables**
- List all Cloud SQL tables with column info
- Input: none
- Returns: Table names with column types
4. **list_bigquery_tables**
- List all BigQuery dataset tables
- Input: none
- Returns: Table names with row counts
5. **get_cloudsql_schema**
- Get detailed schema for specific Cloud SQL table
- Input: `{"table_name": "customers"}`
- Returns: Columns with data types, nullability, defaults
6. **get_bigquery_schema**
- Get detailed schema for specific BigQuery table
- Input: `{"table_name": "products"}`
- Returns: Columns with types, mode, description
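`query_cloudsql` enforces SELECT-only access. A minimal sketch of what such a guard could look like (the actual check lives in `mcp_server/server.py` and may differ; `ensure_select_only` is an assumed name):

```python
# Hedged sketch of a SELECT-only guard for query_cloudsql.
def ensure_select_only(query: str) -> str:
    """Return the cleaned query, or raise ValueError for non-SELECT SQL."""
    stripped = query.strip().rstrip(";").strip()
    if not stripped or stripped.split(None, 1)[0].upper() != "SELECT":
        raise ValueError("Only SELECT queries are allowed")
    # Reject statement chaining like "SELECT 1; DROP TABLE customers"
    if ";" in stripped:
        raise ValueError("Multiple statements are not allowed")
    return stripped
```

Note this is string-level screening only; as section 8 says, production use would need real query validation and sanitization.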
### Database Schema
**Cloud SQL Tables** (defined in `data/schema.sql`):
```sql
-- CUSTOMERS table (50 records)
CREATE TABLE customers (
customer_id SERIAL PRIMARY KEY,
first_name VARCHAR(100),
last_name VARCHAR(100),
email VARCHAR(255) UNIQUE,
phone VARCHAR(20),
address VARCHAR(255),
city VARCHAR(100),
state VARCHAR(50),
zip_code VARCHAR(20),
country VARCHAR(100),
created_at TIMESTAMP,
updated_at TIMESTAMP
);
-- VENDORS table (50 records)
CREATE TABLE vendors (
vendor_id SERIAL PRIMARY KEY,
vendor_name VARCHAR(255),
contact_name VARCHAR(100),
email VARCHAR(255),
phone VARCHAR(20),
address VARCHAR(255),
city VARCHAR(100),
state VARCHAR(50),
zip_code VARCHAR(20),
country VARCHAR(100),
created_at TIMESTAMP,
updated_at TIMESTAMP
);
-- ORDERS table (50 records)
CREATE TABLE orders (
order_id SERIAL PRIMARY KEY,
customer_id INTEGER REFERENCES customers(customer_id),
vendor_id INTEGER REFERENCES vendors(vendor_id),
order_date TIMESTAMP,
ship_date TIMESTAMP,
total_amount DECIMAL(10, 2),
status VARCHAR(50), -- 'pending', 'processing', 'shipped', 'delivered', 'cancelled'
product_name VARCHAR(255),
quantity INTEGER,
unit_price DECIMAL(10, 2),
created_at TIMESTAMP,
updated_at TIMESTAMP
);
```
**Indexes**: Created on customer_id, vendor_id, order_date, status for performance
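As a worked example against this schema, here is the kind of SQL the agent is expected to generate for "top customers by spending" (the `top_customers_query` helper is illustrative, not part of the repo):

```python
# Illustrative: build a LIMIT-bounded SELECT against the schema above,
# following the system prompt's rule to always include a LIMIT clause.
def top_customers_query(limit: int = 5) -> str:
    """Return SQL for the top customers by total spend (Cloud SQL schema)."""
    if not isinstance(limit, int) or limit <= 0:
        raise ValueError("limit must be a positive integer")
    return (
        "SELECT c.customer_id, c.first_name, c.last_name, "
        "SUM(o.total_amount) AS total_spent "
        "FROM customers c JOIN orders o USING (customer_id) "
        "GROUP BY c.customer_id, c.first_name, c.last_name "
        f"ORDER BY total_spent DESC LIMIT {limit}"
    )
```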
### Why Two Agent Versions?
1. **`adk_agent/agent.py`** (Conceptual):
- Demonstrates agentic architecture pattern
- Uses placeholder tool execution
- Shows how MCP tool schemas are defined
- Useful for understanding the pattern
2. **`adk_agent/integrated_agent.py`** (Functional):
- **THIS IS THE ONE TO USE**
- Direct database connections (psycopg2, BigQuery client)
- Actually executes queries
- Full implementation with error handling
- Interactive CLI for testing
**When to use which**:
- Run `integrated_agent.py` to actually query databases
- Read `agent.py` to understand the architecture pattern
---
## 5. ENVIRONMENT & CONFIGURATION
### Required Environment Variables
**File**: `.env` (created from `.env.example`)
```bash
# GCP Configuration
GCP_PROJECT_ID=iames-dq
GCP_REGION=us-central1
CLOUDSQL_INSTANCE_NAME=sales-poc-db
CLOUDSQL_DATABASE=salesdb
CLOUDSQL_USER=salesuser
CLOUDSQL_PASSWORD=your-password-here
# Database connection (set after deployment)
DB_HOST=<Cloud SQL Public IP>
# BigQuery
BIGQUERY_DATASET=bigquery-public-data.thelook_ecommerce
# Anthropic
ANTHROPIC_API_KEY=your-api-key-here
```
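A minimal sketch of how these variables might be assembled into connection parameters (the `build_db_config` helper is hypothetical; the agent assembles similar values in `integrated_agent.py`):

```python
# Illustrative: turn the .env variables above into psycopg2 connection kwargs.
import os

def build_db_config(env=None):
    """Build connection parameters from environment variables (or a dict)."""
    env = os.environ if env is None else env
    return {
        "host": env.get("DB_HOST", "localhost"),
        "database": env.get("CLOUDSQL_DATABASE", "salesdb"),
        "user": env.get("CLOUDSQL_USER", "salesuser"),
        "password": env["CLOUDSQL_PASSWORD"],  # required; fail fast if missing
    }
```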
### Python Dependencies
**File**: `requirements.txt`
```
google-cloud-bigquery>=3.14.0
cloud-sql-python-connector>=1.5.0
pg8000>=1.30.0
sqlalchemy>=2.0.0
psycopg2-binary>=2.9.9
mcp>=1.0.0
anthropic>=0.39.0
faker>=22.0.0
python-dotenv>=1.0.0
```
### GCP Resources
- **Cloud SQL Instance**: PostgreSQL 15, db-f1-micro tier
- **Cost**: ~$7-10/month (if not cleaned up!)
- **Security**: Public IP with authorized networks (POC-level, not production)
- **BigQuery**: Uses public dataset (minimal cost)
---
## 6. COMMON MODIFICATION PATTERNS
### How to Add a New MCP Tool
**Edit**: `mcp_server/server.py`
1. Add tool definition in `list_tools()` (line ~38):
```python
Tool(
name="your_tool_name",
description="What this tool does and when to use it",
inputSchema={
"type": "object",
"properties": {
"param_name": {
"type": "string",
"description": "Parameter description"
}
},
"required": ["param_name"]
}
)
```
2. Add handler in `call_tool()` (line ~90):
```python
elif name == "your_tool_name":
result = await self._your_tool_handler(arguments["param_name"])
return [TextContent(type="text", text=json.dumps(result))]
```
3. Implement handler method:
```python
async def _your_tool_handler(self, param_name: str) -> Dict[str, Any]:
# Tool implementation
return {"result": "data"}
```
### How to Modify System Prompt (Change Routing Logic)
**Edit**: `adk_agent/integrated_agent.py` lines 42-66
Change the `self.system_prompt` string to modify:
- Which keywords trigger which data source
- Query formatting instructions
- Response formatting preferences
- Additional constraints or guidelines
**Example**: To add a new keyword routing rule:
```python
**Decision Guidelines:**
- For questions about "customers", "vendors", or recent "orders" → use Cloud SQL
- For questions about "products", "inventory", "users", "events", or broad analytics → use BigQuery
- For questions about "shipments" → use Cloud SQL # NEW RULE
```
### How to Add a New Data Source
1. **Add database connection** in `integrated_agent.py` `__init__()`:
```python
self.new_db_config = {
'host': os.getenv('NEW_DB_HOST'),
'database': os.getenv('NEW_DB_NAME')
}
```
2. **Add tool definition** to `self.tools` list:
```python
{
"name": "query_new_db",
"description": "Execute SQL against new database...",
"input_schema": {...}
}
```
3. **Implement query method**:
```python
async def _query_new_db(self, query: str) -> str:
# Connection and query logic
return json.dumps(result)
```
4. **Update system prompt** to include new data source in decision guidelines
### How to Change Database Schema
**Edit**: `data/schema.sql`
1. Add new table or modify existing:
```sql
CREATE TABLE new_table (
id SERIAL PRIMARY KEY,
...
);
```
2. Update indexes if needed:
```sql
CREATE INDEX idx_new_table_field ON new_table(field);
```
3. Update `scripts/generate_seed_data.py` to generate data for new table
4. Re-run deployment:
```bash
./scripts/deploy.sh
```
---
## 7. OPERATIONAL CONTEXT
### Deployment Workflow
```bash
# 1. Configure environment
cp .env.example .env
# Edit .env with ANTHROPIC_API_KEY and verify GCP_PROJECT_ID
# 2. Authenticate with GCP
gcloud auth login
gcloud auth application-default login
# 3. Deploy everything (one command)
./scripts/deploy.sh
# This runs: Terraform apply → Create schema → Seed data → Verify BigQuery
# Takes ~5-10 minutes
```
### Running the Agent
```bash
# Option 1: Integrated agent (functional)
python3 adk_agent/integrated_agent.py
# Option 2: MCP server (separate terminal)
python3 mcp_server/server.py
```
### Example Queries to Test
```
What are the total sales from Cloud SQL?
How many customers do we have?
Show me orders with status 'delivered'
What's the average order value?
List the top 5 customers by total spending
How many products are in BigQuery?
Show me user events from BigQuery
```
### Cleanup (Important!)
```bash
# Destroy all GCP resources
./scripts/cleanup.sh
# Confirms before destroying
# Prevents ongoing charges (~$7-10/month)
```
### Testing
```bash
# Run automated tests
python3 scripts/test_agent.py
# Manual database connection test
psql -h $DB_HOST -U salesuser -d salesdb
```
---
## 8. IMPORTANT CONSTRAINTS & CONSIDERATIONS
### Security Model (POC-Level)
⚠️ **This is a proof of concept with simplified security**:
- **Public IP**: Cloud SQL has public IP (uses authorized networks)
- **Credentials**: Stored in `.env` file (not Secret Manager)
- **SQL Injection**: Basic prevention (SELECT-only enforcement)
- **No VPC**: No private networking
- **No IAM Auth**: Uses password authentication
**For Production**, implement:
- Cloud SQL Proxy or private IP
- Secret Manager for credentials
- VPC networking
- IAM database authentication
- Query validation and sanitization
- Rate limiting
- Audit logging
### Data Constraints
- **Cloud SQL**: Only 50 records per table (small dataset for POC)
- **Synthetic Data**: Generated with Faker (not real customer data)
- **Two Separate Systems**: Cloud SQL and BigQuery are not synchronized
- **No Transactions**: Each query is independent
- **SELECT-Only**: No INSERT/UPDATE/DELETE allowed (safety)
### Cost Considerations
- **Cloud SQL**: ~$7-10/month (db-f1-micro instance)
- **BigQuery**: Pay per query (~$5/TB scanned; the monthly free query tier typically covers POC-scale usage of the public dataset)
- **Anthropic API**: Pay per token used
- **Total**: ~$10-15/month if left running
**Always run `./scripts/cleanup.sh` when done!**
### Known Limitations
1. **Agent doesn't combine data across sources** (could be added)
2. **No caching** (queries re-execute every time)
3. **No query optimization** (agent may generate inefficient SQL)
4. **No streaming responses** (all results returned at once)
5. **No pagination** (relies on LIMIT clauses)
---
## 9. QUICK REFERENCE
### MCP Tools Cheat Sheet
| Tool Name | Purpose | When to Use | Input | Output |
|-----------|---------|-------------|-------|--------|
| `query_cloudsql` | Execute SQL | Customer/vendor/order queries | SQL string | JSON with data |
| `query_bigquery` | Execute SQL | Product/analytics queries | SQL string | JSON with data |
| `list_cloudsql_tables` | Show schema | Discover available tables | None | Table list |
| `list_bigquery_tables` | Show schema | Discover BQ tables | None | Table list |
| `get_cloudsql_schema` | Table details | Understand table structure | Table name | Column details |
| `get_bigquery_schema` | Table details | Understand BQ table | Table name | Column details |
### File Modification Guide
| To Do This | Edit This File | Look For |
|------------|----------------|----------|
| Change agent routing logic | `adk_agent/integrated_agent.py` | Line 42: `self.system_prompt` |
| Add new MCP tool | `mcp_server/server.py` | Line 38: `list_tools()` |
| Change database schema | `data/schema.sql` | Table definitions |
| Modify deployment config | `terraform/variables.tf` | Variable defaults |
| Add new data source | `adk_agent/integrated_agent.py` | `__init__()` and tools list |
| Change instance size | `terraform/main.tf` | Line 27: `tier` setting |
| Update seed data count | `scripts/generate_seed_data.py` | Lines 19-21: NUM_* constants |
### Command Cheat Sheet
```bash
# Deployment
./scripts/deploy.sh # Deploy everything
./scripts/cleanup.sh # Destroy resources
# Running
python3 adk_agent/integrated_agent.py # Run agent (interactive)
python3 mcp_server/server.py # Run MCP server
python3 scripts/test_agent.py # Run tests
# Database
psql -h $DB_HOST -U salesuser -d salesdb # Connect to Cloud SQL
psql -h $DB_HOST -U salesuser -d salesdb -f data/schema.sql # Apply schema
# Terraform
cd terraform && terraform plan # Preview changes
cd terraform && terraform apply # Apply changes
cd terraform && terraform destroy # Destroy resources
# GCP
gcloud sql instances list # List Cloud SQL instances
gcloud sql instances describe INSTANCE_NAME # Get instance details
bq ls bigquery-public-data:thelook_ecommerce # List BigQuery tables
```
### Data Source Selection Quick Reference
| User Query Keyword | → Data Source | Reason |
|--------------------|---------------|---------|
| customer, buyer | Cloud SQL | `customers` table |
| vendor, supplier | Cloud SQL | `vendors` table |
| order (recent/specific) | Cloud SQL | `orders` table (transactional) |
| sales, revenue | Cloud SQL | Sum `orders.total_amount` |
| product, item | BigQuery | `products` table |
| inventory, stock | BigQuery | `inventory_items` table |
| user, shopper | BigQuery | `users` table |
| event, behavior | BigQuery | `events` table |
| analytics, trends | BigQuery | Large-scale data |
---
## 10. AGENTIC LOOP IMPLEMENTATION
### How the Agent Works (Step-by-Step)
**Location**: `adk_agent/integrated_agent.py` `process_message()` method (line ~193)
1. **User Input**: User asks a question
2. **API Call**: Send to Claude with system prompt + tools
3. **Claude Response**: Returns text and/or tool calls
4. **Tool Execution**: If tool calls present, execute them
5. **Result Formatting**: Format tool results as JSON
6. **Follow-up**: Send tool results back to Claude
7. **Repeat**: Continue loop until no more tool calls
8. **Final Response**: Return accumulated text to user
```python
# Simplified agentic loop (see process_message() for the full version)
iteration = 0
while iteration < max_iterations:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=4096,
        system=system_prompt,
        tools=self.tools,
        messages=conversation_history,
    )
    conversation_history.append(
        {"role": "assistant", "content": response.content}
    )

    tool_calls = [b for b in response.content if b.type == "tool_use"]
    if not tool_calls:
        break  # No more tool calls: Claude's text is the final answer

    # Execute each requested tool and feed results back to Claude
    tool_results = [
        {
            "type": "tool_result",
            "tool_use_id": call.id,
            "content": self._execute_tool(call.name, call.input),
        }
        for call in tool_calls
    ]
    conversation_history.append({"role": "user", "content": tool_results})
    iteration += 1
```
### Conversation History Management
The agent maintains conversation history to:
- Enable multi-turn conversations
- Allow follow-up questions
- Provide context across queries
**Reset with**: `agent.reset_conversation()` (clears history)
---
## 11. TROUBLESHOOTING CONTEXT
### Common Issues
**"Connection refused" to Cloud SQL**:
- Check `DB_HOST` in `.env` (should be public IP from Terraform output)
- Verify instance is running: `gcloud sql instances list`
- Check authorized networks include your IP
**"Anthropic API key not found"**:
- Ensure `ANTHROPIC_API_KEY` is set in `.env`
- Load environment: `source .env` or restart terminal
**"Table doesn't exist"**:
- Run `psql -h $DB_HOST -U salesuser -d salesdb -f data/schema.sql`
- Or re-run `./scripts/deploy.sh`
**"BigQuery permission denied"**:
- Ensure BigQuery API is enabled: `gcloud services enable bigquery.googleapis.com`
- Check authentication: `gcloud auth application-default login`
**Agent always chooses wrong data source**:
- Check system prompt in `integrated_agent.py` lines 42-66
- Verify tool descriptions are clear
- Add more specific keywords to decision guidelines
---
## 12. EXTENSION IDEAS (For Future Development)
### Short-term Enhancements
1. **Caching**: Add Redis for query result caching
2. **Visualization**: Generate charts/graphs from query results
3. **Data Sync**: Sync Cloud SQL data to BigQuery for unified queries
4. **Query History**: Store and retrieve past queries
5. **Explain Plans**: Show SQL execution plans
### Medium-term Features
1. **Web Interface**: Build React frontend for agent
2. **Scheduled Reports**: Generate automated reports
3. **Alerts**: Send notifications for anomalies
4. **Multi-tenant**: Support multiple users/databases
5. **Advanced Analytics**: Add ML predictions
### Long-term Vision
1. **Natural Language to SQL**: Improve query generation
2. **Data Discovery**: Auto-suggest interesting queries
3. **Performance Optimization**: Query optimization suggestions
4. **Data Governance**: Add access controls and audit logs
5. **Production Hardening**: Full security implementation
---
## SUMMARY FOR CLAUDE
**When you see this project**:
1. ✅ You're working with a fully implemented sales analytics POC
2. ✅ Primary file to run: `adk_agent/integrated_agent.py`
3. ✅ To change behavior: Edit system prompt at line 42 of integrated_agent.py
4. ✅ To add features: Add MCP tools in `mcp_server/server.py`
5. ✅ Data sources: Cloud SQL (transactional) + BigQuery (analytics)
6. ✅ Deploy: `./scripts/deploy.sh`, Cleanup: `./scripts/cleanup.sh`
**Most important files**:
- `adk_agent/integrated_agent.py` - The agent (modify system prompt here)
- `mcp_server/server.py` - MCP tools (add new tools here)
- `data/schema.sql` - Database schema
- `.env` - Configuration (API keys, DB host)
**This POC demonstrates**: Intelligent multi-source data querying with Claude-powered decision making using the Model Context Protocol standard.
---
**End of CLAUDE.md**