# ADR-0002: NL-to-SQL Generation Strategy
**Date:** 2026-01-22
**Status:** Accepted
**Deciders:** Architecture team
## Context
The Jana MCP Server needs to translate natural language questions into data queries. There are two fundamental approaches:
1. **API-Mediated:** Claude translates questions into eko-client-python method calls, which hit the REST API
2. **NL-to-SQL:** Claude generates SQL directly, executed against the database
SQL generation offers maximum flexibility—any question the data can answer becomes answerable. However, it introduces significant security risks (SQL injection, data exposure, schema leakage) and complexity (query validation, performance management, schema coupling).
The existing Jana backend already provides comprehensive APIs:
- Unified ESG API with cross-source queries
- Geographic queries (bbox, point+radius, country codes)
- Temporal filtering and aggregations
- Correlation and trend analysis
- Source-specific endpoints for detailed access
These APIs cover the majority of expected user queries, but some advanced analytical queries (e.g., window functions, complex aggregations, ad-hoc groupings) may not be expressible through the current API surface.
## Decision Drivers
- **Security:** SQL injection and data exposure are critical risks
- **Time to MVP:** SQL generation requires significant guardrails
- **API coverage:** Existing APIs handle 80%+ of expected queries
- **User needs:** Power users may want SQL flexibility; most users don't need it
- **Maintenance:** SQL generation couples tightly to database schema
- **Backend capabilities:** Jana backend already has infrastructure for secure query execution
## Considered Options
### Option 1: API-Only (No SQL Generation)
**Description:** All queries go through eko-client-python → REST API. No direct SQL.
**Pros:**
- Simplest and safest approach
- Uses battle-tested, validated API layer
- No security risks from generated SQL
- No schema coupling in MCP server
- Fastest path to MVP
**Cons:**
- Limited to what APIs expose
- Can't answer complex analytical queries
- Must wait for API enhancements for new query types
### Option 2: SQL Generation in MCP Server
**Description:** MCP server generates SQL and executes directly against database.
**Pros:**
- Maximum query flexibility
- No backend changes needed
- Direct control over query execution
**Cons:**
- Requires DB credentials in MCP server
- Must implement all security guardrails in MCP
- Schema changes break SQL generation
- Highest security risk
- Most complex option
### Option 3: SQL Generation via Backend Endpoint
**Description:** Add a `/api/v1/esg/query/` endpoint to Jana backend that accepts natural language or validated SQL, with backend handling security.
**Pros:**
- Security controls in one place (backend)
- Backend already has DB connection, auth, rate limiting
- MCP server stays simple
- Can leverage backend's existing security infrastructure
- Easier to audit and monitor
**Cons:**
- Requires backend changes
- Still need query validation logic
- Adds new attack surface to backend
### Option 4: Phased Approach
**Description:** Start with API-only, add SQL generation later via backend endpoint.
**Pros:**
- Fast MVP with API-only
- Learn what queries users actually need
- Add SQL only if justified by real demand
- Backend team can build guardrails properly
- Risk is deferred, not accepted upfront
**Cons:**
- May need to tell users "can't do that yet"
- Deferred complexity is still complexity
## Decision
**We will follow a phased approach:**
| Phase | Approach | Scope |
|-------|----------|-------|
| **MVP** | API-only | No SQL generation; all queries via eko-client-python |
| **V2** | Backend SQL endpoint | Add `/api/v1/esg/query/` with guardrails if needed |
| **Future** | Transparency features | Consider "show SQL" to let users see generated queries |
### Phase 1: MVP (API-Only)
The MCP server will translate natural language to eko-client-python method calls exclusively. No SQL generation.
**Rationale:**
- Existing APIs cover most use cases
- Fastest path to useful product
- Zero SQL injection risk
- Learn what users actually need before building SQL generation
### Phase 2: Backend SQL Endpoint (If Needed)
If user demand justifies it, add SQL generation via a new backend endpoint:
```
POST /api/v1/esg/query/
{
"sql": "SELECT ... validated query ...",
"timeout_seconds": 30,
"limit": 10000
}
```
**Required guardrails:**
1. Read-only database connection (separate credentials)
2. Table allowlist: only `openaq_*`, `climatetrace_*`, `edgar_*` tables
3. Query parsing and validation before execution
4. Timeout enforcement (default 30s)
5. Row limit injection (max 10,000 rows)
6. Block system catalog access (`pg_catalog`, `information_schema`)
7. Query audit logging
8. Rate limiting per user
### Phase 3: Transparency Features (Future)
Consider exposing the "conceptual SQL" for queries, even when using APIs:
```
User: "What's PM2.5 in NYC?"
Assistant: "Here's what I found...
[Equivalent SQL: SELECT value, measured_at FROM ...]"
```
This helps power users understand and verify queries without the security risks of actual SQL execution.
## Consequences
### Positive
- MVP ships faster without SQL complexity
- No SQL injection risk in initial release
- Security controls centralized in backend (Phase 2)
- Learn from real usage before investing in SQL generation
- Clear upgrade path if SQL is needed
### Negative
- Some complex queries not possible in MVP
- May need to tell users "the API doesn't support that"
- Phase 2 requires backend development effort
### Neutral
- Need to track which queries fail due to API limitations (informs Phase 2 priority)
- Backend team should be aware of potential Phase 2 work
- May want to enhance APIs as alternative to SQL generation
## Implementation Notes
### MVP Implementation
MCP tools will map to eko-client-python methods:
| User Intent | MCP Tool | eko-client Method |
|-------------|----------|-------------------|
| Get air quality data | `get_air_quality` | `client.get_data(sources=['openaq'], ...)` |
| Get emissions | `get_emissions` | `client.get_data(sources=['climatetrace'], ...)` |
| Find nearby sources | `find_nearby` | `client.get_data(location_point=..., radius_km=...)` |
| Analyze trends | `get_trends` | `client.get_trends(...)` |
| Get correlations | `get_correlations` | `client.get_correlations(...)` |
### Tracking API Limitations
Log queries that fail or require workarounds:
- What did the user ask?
- Why couldn't the API answer it?
- What SQL would have answered it?
This informs whether Phase 2 is needed and what guardrails to prioritize.
### Phase 2 Trigger Criteria
Consider adding SQL generation when:
- Multiple users request queries APIs can't handle
- Common analytical patterns emerge that need SQL
- Adding API endpoints for each case becomes unsustainable
## References
- [MCP Server Brainstorm](../MCP_SERVER_BRAINSTORM.md) - Wild Ideas section mentions SQL generation
- [API Endpoints Documentation](../../../Jana/docs/architecture_docs/API_ENDPOINTS.md) - Current API capabilities
- [OWASP SQL Injection Prevention](https://cheatsheetseries.owasp.org/cheatsheets/SQL_Injection_Prevention_Cheat_Sheet.html)
- Discussion on NL-to-SQL patterns in design session 2026-01-22