# Knowledge & Context: Design Discussion
**Date:** 2026-01-22
**Status:** Discussion / Pre-ADR
**Purpose:** Capture brainstorming on implementing contextual knowledge capabilities
---
## Overview
The "Knowledge & Context" capability aims to help users understand what environmental data means, not just retrieve it. This document captures the design discussion, including analysis of regulatory data sources and implementation recommendations.
**The Dream:** Answer questions like:
- "What is PM2.5 and why does it matter?"
- "Is 35 µg/m³ PM2.5 safe?"
- "What's the regulatory limit for CO2 emissions?"
- "How does this plant's emissions compare to industry average?"
- "What caused the spike in emissions in Q3?"
---
## Question Types & Implementation Strategy
### Classification of Knowledge Questions
| Question Type | Example | Data Needed | Recommended Source |
|---------------|---------|-------------|-------------------|
| **Educational** | "What is PM2.5?" | General knowledge | LLM (Claude's training) |
| **Threshold Interpretation** | "Is 35 µg/m³ safe?" | Regulatory limits | Static KB + LLM synthesis |
| **Regulatory Lookup** | "CO2 limit in California?" | Jurisdiction-specific rules | LLM + authoritative links |
| **Contextual Benchmarking** | "How does this plant compare?" | Live data + statistics | API + LLM synthesis |
| **Investigative** | "What caused the Q3 spike?" | Historical data + reasoning | API + LLM reasoning |
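As a rough illustration, the table above could be encoded as a small routing map. This is only a sketch: the `QuestionType` enum, the source labels, and `sources_for` are hypothetical names chosen for this example, and classifying a user question into a type is assumed to happen elsewhere.

```python
from enum import Enum


class QuestionType(Enum):
    """The five question types from the classification table."""
    EDUCATIONAL = "educational"
    THRESHOLD_INTERPRETATION = "threshold_interpretation"
    REGULATORY_LOOKUP = "regulatory_lookup"
    CONTEXTUAL_BENCHMARKING = "contextual_benchmarking"
    INVESTIGATIVE = "investigative"


# Recommended sources per question type, copied from the table.
# The string labels are illustrative, not an existing API.
SOURCES = {
    QuestionType.EDUCATIONAL: ("llm",),
    QuestionType.THRESHOLD_INTERPRETATION: ("static_kb", "llm"),
    QuestionType.REGULATORY_LOOKUP: ("llm", "authoritative_links"),
    QuestionType.CONTEXTUAL_BENCHMARKING: ("api", "llm"),
    QuestionType.INVESTIGATIVE: ("api", "llm"),
}


def sources_for(question_type: QuestionType) -> tuple:
    """Return the knowledge sources recommended for a question type."""
    return SOURCES[question_type]
```

Keeping the mapping as data rather than branching logic means a change in recommendation is a one-line edit.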
### Implementation by Question Type
#### Type 1: Educational Explanations
> "What is PM2.5 and why does it matter?"
**Implementation:** Pure LLM — Claude already knows this well.
- Environmental science fundamentals change slowly
- LLM explanations are rich and contextual
- Zero implementation cost
- Risk: Minimal — this is factual, stable knowledge
#### Type 2: Threshold Interpretation
> "Is 35 µg/m³ PM2.5 safe?"
**Implementation:** Static KB (thresholds) + LLM (synthesis)
**Why Static KB for thresholds:**
- Accuracy is critical (health/safety implications)
- Deterministic — same question should give same thresholds
- Thresholds change rarely (WHO roughly every 10-15 years, EPA NAAQS every 5-10 years)
- Small dataset (~50-100 parameter/threshold combinations)
**Proposed KB structure:**
```yaml
# knowledge/thresholds.yaml
parameters:
  pm25:
    name: "PM2.5"
    full_name: "Fine Particulate Matter (≤2.5 micrometers)"
    unit: "µg/m³"
    guidelines:
      who_2021:
        annual_mean: 5
        daily_mean: 15
        source: "WHO Global Air Quality Guidelines 2021"
      epa_naaqs:
        annual_mean: 9  # lowered from 12 in the 2024 NAAQS revision; confirm current status before shipping
        daily_mean: 35
        source: "US EPA National Ambient Air Quality Standards"
```
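A minimal sketch of how a lookup against this structure might work. To stay self-contained it mirrors only the 24-hour PM2.5 values inline instead of loading the YAML file; `compare_to_guidelines` and its return shape are assumptions for illustration, not an existing MCP tool.

```python
# Inline mirror of a subset of the proposed thresholds.yaml structure.
# In practice this dict would be loaded from the YAML file.
THRESHOLDS = {
    "pm25": {
        "unit": "µg/m³",
        "guidelines": {
            "who_2021": {"daily_mean": 15},
            "epa_naaqs": {"daily_mean": 35},
        },
    }
}


def compare_to_guidelines(parameter, value, averaging="daily_mean"):
    """Compare a measured value against every guideline defining `averaging`.

    Returns {guideline_name: {"limit", "ratio", "exceeds"}}. Raises KeyError
    for parameters that are not in the KB, so the caller can fall back to
    an explicit "no threshold on file" response instead of guessing.
    """
    guidelines = THRESHOLDS[parameter]["guidelines"]
    result = {}
    for name, limits in guidelines.items():
        if averaging not in limits:
            continue  # this guideline has no limit for the averaging period
        limit = limits[averaging]
        result[name] = {
            "limit": limit,
            "ratio": round(value / limit, 2),
            "exceeds": value > limit,
        }
    return result
```

The lookup only returns numbers. Wording such as "at the EPA standard but above the WHO guideline" is left to the LLM synthesis step.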
#### Type 3: Regulatory Lookup
> "What's the regulatory limit for CO2 emissions in California?"
**Implementation:** LLM knowledge + authoritative source links
This is complex because regulations are:
- Jurisdiction-specific (federal, state, local, international)
- Sector-specific (power plants vs vehicles vs buildings)
- Complex (different metrics, timelines, exemptions)
**Recommended approach:** Provide general context from LLM, link to authoritative sources, don't claim comprehensive coverage.
#### Type 4: Contextual Benchmarking
> "How does this plant's emissions compare to industry average?"
**Implementation:** API data + LLM synthesis
Requires live data from the Jana backend:
1. Get the facility's emissions from API
2. Get sector/peer statistics from API
3. LLM synthesizes the comparison
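
The comparison step could be reduced to a few summary statistics that the LLM then puts into words. The sketch below assumes the two API calls have already returned a facility total and a list of peer totals in the same unit; `benchmark` and its output keys are hypothetical.

```python
from statistics import mean, median


def benchmark(facility_emissions, peer_emissions):
    """Summarize how one facility compares with its sector peers.

    Both arguments are assumed to use the same unit and reporting period.
    """
    if not peer_emissions:
        raise ValueError("peer_emissions must not be empty")
    peer_mean = mean(peer_emissions)
    # Share of peers emitting strictly less than this facility.
    below = sum(1 for p in peer_emissions if p < facility_emissions)
    return {
        "peer_mean": peer_mean,
        "peer_median": median(peer_emissions),
        "ratio_to_mean": facility_emissions / peer_mean,
        "percentile": 100 * below / len(peer_emissions),
    }
```

Passing precomputed statistics to the LLM, rather than raw peer lists, keeps the arithmetic deterministic and leaves only the phrasing to the model.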
#### Type 5: Investigative Explanation
> "What caused the spike in emissions in Q3?"
**Implementation:** API data + LLM reasoning
Requires:
1. Detecting the anomaly (trend analysis)
2. Correlating with potential causes
3. Reasoning about plausible explanations
**Note:** The MCP server provides the data and Claude proposes explanations. Responses must state clearly that these are hypotheses, not proven causes.
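
One simple way to do the detection step is to flag any period that sits well above the median of the other periods. This is a sketch of one possible heuristic, not a settled design. The `find_spikes` name and the 1.5x default factor are assumptions, and a median baseline is used here because z-scores are unreliable with only four quarterly points.

```python
from statistics import median


def find_spikes(series, factor=1.5):
    """Return labels of periods whose value exceeds `factor` times the
    median of all other periods.

    `series` is a list of (label, value) pairs, e.g. [("Q1", 100.0), ...].
    """
    if len(series) < 3:
        return []  # too few points for a meaningful baseline
    spikes = []
    for i, (label, value) in enumerate(series):
        others = [v for j, (_, v) in enumerate(series) if j != i]
        baseline = median(others)
        if baseline > 0 and value > factor * baseline:
            spikes.append(label)
    return spikes
```

A flagged period is only a prompt for investigation. Correlating it with possible causes and reasoning about them remains with the LLM.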
---
## Knowledge Source Options Analysis
### Option A: Embed Knowledge in MCP Server (Static KB)
Bundle YAML/JSON files in the MCP server containing parameter definitions, thresholds, etc.
| Pros | Cons |
|------|------|
| Fast - no API calls needed | Maintenance burden - must update manually |
| Works offline/disconnected | Duplicates what LLM already knows |
| Deterministic - same answer every time | Limited scope - only what's in KB |
| Version controlled with code | Gets stale (regulations change) |
| No hallucination risk for critical thresholds | Adds complexity to MCP server |
### Option B: Add Knowledge Endpoint to Backend
Add `/api/v1/esg/knowledge/` endpoints to the Jana backend.
| Pros | Cons |
|------|------|
| Single source of truth | More backend work |
| Reusable across clients (not just MCP) | Another API to maintain |
| Can be updated without MCP release | Adds latency (API call) |
| Queryable/filterable | Requires data curation effort |
| Supports i18n, versioning, jurisdiction | Over-engineering if only MCP needs it |
### Option C: Use LLM Knowledge (Claude's Training Data)
Don't build a knowledge base. Let Claude answer from training data.
| Pros | Cons |
|------|------|
| Zero maintenance | Potential hallucination on specifics |
| Broad coverage (can explain anything) | Non-deterministic answers |
| Handles nuance and edge cases | Can't guarantee regulatory accuracy |
| Natural explanations | Training data may be outdated |
| Already available - no build cost | Can't update without model change |
### Option D: Hybrid Approach (Recommended)
| Knowledge Type | Best Source | Rationale |
|----------------|-------------|-----------|
| Parameter definitions | Backend API | Already exists via `get_parameter_definitions()` |
| Regulatory thresholds | Static KB in MCP | Critical accuracy, deterministic |
| Health effects/explanations | LLM | Broad, nuanced, handles questions well |
| Data interpretation | LLM + thresholds | Combine threshold lookup with LLM synthesis |
---
## Regulatory Data API Landscape
### The Core Problem
**The measurements have APIs. The thresholds don't.**
| Source | API Available? | What It Provides | Auto-Updates? |
|--------|---------------|------------------|---------------|
| **EPA AirNow** | ✅ Yes | Real-time AQI, forecasts, current conditions | ✅ Real-time |
| **EPA AQS** | ✅ Yes | Historical air quality data, monitoring stations | ✅ Scheduled |
| **OpenAQ** | ✅ Yes (Jana uses it) | Global air quality measurements | ✅ Real-time |
| **EPA NAAQS** | ❌ No API | Published as regulations/PDFs | ❌ Manual |
| **WHO Guidelines** | ❌ No API | Published as reports | ❌ Manual (rare) |
| **EU Air Quality Directive** | ❌ No API | Published as legal documents | ❌ Manual |
### Why Thresholds Don't Have APIs
1. **They rarely change** - WHO updated guidelines in 2021 (prior: 2005). EPA NAAQS updates are similarly infrequent (the PM2.5 annual standard was revised in 2012 and again in 2024).
2. **They're legal documents** - Regulations are published as law, not data feeds
3. **No business model** - We found no one offering a regulatory threshold API as a service
### Web Search Results
We searched the web for existing regulatory data APIs and MCP servers. Key findings:
- **No existing MCP servers found** for environmental regulations or compliance data
- **No public APIs found** from EPA, WHO, or other bodies specifically for regulatory thresholds
- **Pattern:** Data APIs exist (measurements); policy APIs don't (thresholds)
### Maintenance Reality for Air Quality Thresholds
| Factor | Reality |
|--------|---------|
| WHO guideline changes | Every 10-15 years |
| EPA NAAQS changes | Every 5-10 years |
| AQI breakpoint changes | Rare (PM2.5 breakpoints were revised in 2024 alongside the NAAQS update) |
| Parameter definitions | Stable |
**A 100-line YAML file reviewed annually is not a significant maintenance burden.**
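
The annual review could be enforced with a small check in CI. The sketch below assumes each KB file records a `last_reviewed` date, which is a hypothetical field not present in the proposed structure. The function name and the 365-day default are likewise illustrative.

```python
from datetime import date, timedelta


def is_review_overdue(last_reviewed: date, today: date, max_age_days: int = 365) -> bool:
    """Return True when the KB was last reviewed more than `max_age_days` ago."""
    return today - last_reviewed > timedelta(days=max_age_days)
```

Taking `today` as a parameter, rather than calling `date.today()` inside the function, keeps the check easy to test.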
---
## Air Quality Thresholds vs. Emissions Regulations
### Key Differences
| Aspect | Air Quality (PM2.5, O3, etc.) | Emissions (CO2, CH4, etc.) |
|--------|-------------------------------|---------------------------|
| **Scope** | Health-based, universal | Economic/environmental policy |
| **Complexity** | Simple thresholds | Multi-layered (caps, permits, trading) |
| **Jurisdictions** | WHO, EPA, EU (few) | Federal, 50 states, EU, 190+ countries |
| **Sectors** | N/A - applies to air | Power, transport, industry, buildings... |
| **Change frequency** | Rare (5-15 years) | Frequent (annual adjustments) |
| **Data structure** | Simple numbers | Complex (formulas, exemptions, phase-ins) |
### Types of Emissions Regulations
**1. Absolute Limits (Simpler)**
```
"Power plants must emit < 1000 lbs CO2/MWh"
```
**2. Cap-and-Trade Systems (Complex)**
```
"California Cap-and-Trade: Annual cap reduced by X% per year"
"EU ETS: Allowance prices fluctuate daily"
```
**3. Reporting Requirements (Process)**
```
"Facilities emitting >25,000 tonnes CO2e must report annually"
```
**4. Sector-Specific Standards (Many)**
```
"Vehicle emissions: EPA Tier 3 standards"
"Methane from oil/gas: EPA 40 CFR Part 60 Subparts OOOOa/OOOOb/OOOOc ('Quad O')"
```
### The Scale Problem
If we tried to capture emissions regulations comprehensively:
| Dimension | Scale |
|-----------|-------|
| Countries | 190+ with varying regulations |
| US States | 50 states plus DC and territories, with different programs |
| Sectors | 10+ major sectors |
| Gases | 6+ greenhouse gases |
| Facility types | 100s of categories |
| Time periods | Regulations phase in over years |
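**Conservative estimate:** 10,000+ distinct regulatory rules globally. Countries × sectors × gases alone gives 190 × 10 × 6 = 11,400 possible combinations, before counting sub-national programs and facility types.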
**This is not a YAML file problem. This is a database problem.**
---
## Recommendations
### For MVP: Hybrid Approach (Air Quality Focus)
| Category | MVP Approach |
|----------|--------------|
| **Air quality thresholds** | Small static KB (~100 lines): WHO 2021, EPA NAAQS, AQI breakpoints |
| **Parameter explanations** | LLM knowledge |
| **Emissions regulations** | LLM + authoritative source links |
| **Compliance determination** | **Out of scope** (liability risk) |
**What we explicitly DON'T do in MVP:**
- Claim to know current emissions regulatory limits with certainty
- Provide compliance determinations
- Maintain a comprehensive regulatory database
**What we DO well:**
- Provide air quality interpretation with accurate thresholds
- Explain what parameters mean and health effects
- Show emissions data (Climate TRACE, EDGAR)
- Help users understand their data
- Point to authoritative regulatory sources
### Proposed MVP Static KB Scope
```
knowledge/
├── parameters.yaml # ~20 parameters (PM2.5, CO2, NO2, O3, etc.)
├── thresholds.yaml # WHO, EPA, EU guidelines for air quality
├── aqi_breakpoints.yaml # US AQI calculation tables
└── glossary.yaml # Common terms and acronyms
```
**Total: ~500-1000 lines of YAML** — maintainable, versionable, auditable.
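
A startup check that the bundled KB is complete might look like the sketch below. The file names come from the layout above; `missing_kb_files` is a hypothetical helper, and parsing or validating the YAML contents is left out.

```python
from pathlib import Path

# File names taken from the proposed knowledge/ layout.
EXPECTED_FILES = (
    "parameters.yaml",
    "thresholds.yaml",
    "aqi_breakpoints.yaml",
    "glossary.yaml",
)


def missing_kb_files(kb_dir):
    """Return the expected KB file names that are absent from `kb_dir`."""
    root = Path(kb_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

Failing fast on a missing file avoids silently falling back to LLM knowledge for thresholds, which is the case the static KB exists to prevent.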
### For Phase 2: Emissions Regulatory Context (If Demanded)
If users demand regulatory context for emissions:
1. Survey users on which frameworks matter most
2. Curate top 5-10 frameworks only:
- EPA GHG Reporting Program (GHGRP)
- EU ETS
- California Cap-and-Trade
- Paris Agreement NDCs
- SEC Climate Disclosure Rules
3. Build as backend endpoint (not MCP-embedded)
4. Establish quarterly review cadence
### Example MVP Interaction Patterns
**Air Quality (Full Support):**
```
User: "Is 35 µg/m³ PM2.5 safe?"
MCP looks up thresholds from static KB:
- WHO 24hr guideline: 15 µg/m³
- EPA 24hr standard: 35 µg/m³
Claude synthesizes: "35 µg/m³ PM2.5 is at the US EPA's 24-hour standard
but 2.3x the WHO guideline. This level is classified as 'Moderate' on
the AQI scale. Sensitive groups should consider reducing prolonged
outdoor exertion."
```
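
The AQI category in the response above would come from `aqi_breakpoints.yaml`. The sketch below shows the standard linear interpolation between breakpoints for the lowest three PM2.5 bands only. The breakpoint values reflect our understanding of the 2024 EPA revision and are an assumption to verify against EPA's published tables before use. `pm25_aqi` is a hypothetical name.

```python
from decimal import Decimal, ROUND_DOWN

# (conc_low, conc_high, aqi_low, aqi_high, category) for 24-hour PM2.5.
# ASSUMPTION: illustrative values for the lowest three bands only;
# verify against the official EPA breakpoint table before shipping.
PM25_BREAKPOINTS = [
    (Decimal("0.0"), Decimal("9.0"), 0, 50, "Good"),
    (Decimal("9.1"), Decimal("35.4"), 51, 100, "Moderate"),
    (Decimal("35.5"), Decimal("55.4"), 101, 150, "Unhealthy for Sensitive Groups"),
]


def pm25_aqi(concentration):
    """Return (aqi_index, category), or None if outside the bands listed here."""
    # Truncate to one decimal place before the lookup; Decimal avoids
    # floating-point surprises at band edges.
    c = Decimal(str(concentration)).quantize(Decimal("0.1"), rounding=ROUND_DOWN)
    for c_lo, c_hi, i_lo, i_hi, category in PM25_BREAKPOINTS:
        if c_lo <= c <= c_hi:
            # Linear interpolation within the band.
            index = Decimal(i_hi - i_lo) / (c_hi - c_lo) * (c - c_lo) + i_lo
            return round(index), category
    return None
```

For the 35 µg/m³ example this lookup lands in the "Moderate" band, matching the response text above.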
**Emissions Regulations (Limited Support):**
```
User: "What are the CO2 limits for power plants in California?"
Claude (from LLM knowledge): "California power plants are subject to
multiple regulations including the CARB Cap-and-Trade program, EPA
Clean Air Act requirements, and SB 100 mandates.
For current compliance requirements, consult:
- CARB: https://ww2.arb.ca.gov/...
- EPA: https://www.epa.gov/...
I can help you analyze emissions data for specific facilities using
Jana's Climate TRACE data. Would you like to look up a specific plant?"
```
---
## Open Questions for Future Discussion
1. **Is annual review of static KB acceptable?** Or must it be truly zero-touch?
2. **Should we wrap the AirNow API** for real-time US air quality assessment with built-in categories?
3. **Which emissions regulatory frameworks would be highest value** if we curate them in Phase 2?
4. **Is there a liability concern** with providing regulatory interpretation? (This may argue for an LLM-only approach.)
5. **Should the static KB live in the MCP server or Jana backend?** Backend makes it reusable across clients.
6. **How do we handle international users?** EU, Asia, etc. have different standards.
---
## Related Documents
- [MCP_SERVER_BRAINSTORM.md](MCP_SERVER_BRAINSTORM.md) - Original capability brainstorm
- [ADR-0001: Start Without MCP-Side Memory](adr/0001-start-without-mcp-side-memory.md)
- [ADR-0002: NL-to-SQL Generation Strategy](adr/0002-nl-to-sql-generation-strategy.md)
---
## Changelog
| Date | Change |
|------|--------|
| 2026-01-22 | Initial discussion document created |