# Product Requirements Document (PRD)
## Product: Salesforce Metadata-Aware RAG MCP
### Author: Fascinating Concepts
### Date: August 2025
---
## 1. Overview
The Salesforce Metadata-Aware RAG MCP is a system that consumes Salesforce metadata and code using the Metadata API, Tooling API, and REST/Describe endpoints. It normalizes and chunks this content into RAG-ready artifacts stored in hybrid indexes (vector + BM25/FTS). An MCP interface exposes ingestion, retrieval, and where-used tools to developers and AI copilots, enabling Salesforce org-aware assistance with citations.
---
## 2. Goals & Objectives
- **Enable metadata ingestion**: Apex, Triggers, Layouts, Flows, Validation Rules, Profiles, PermissionSets, Custom Objects & Fields.
- **Support Salesforce-aware intelligence**: Ensure answers reflect org-specific configuration and data models.
- **Improve developer productivity**: Reduce time spent searching metadata/code and resolving errors.
- **Provide RAG foundation for copilots**: Supply grounded knowledge to VS Code, Slack, or in-org copilots via MCP.
---
## 3. Use Cases
1. **Where-Used Lookups**: “Where is `Account.Industry` used?” → Apex, layouts, flows, validation rules, profiles/permsets.
2. **Schema Exploration**: “What fields exist on Opportunity?” → REST sObject Describe (`/sobjects/Opportunity/describe`; `describeSObject` in the SOAP API) with citations.
3. **Error Tracing**: Developer pastes error → map stack trace to Apex classes and relevant metadata.
4. **SOQL Guidance**: Retrieve metadata to generate safe SOQL with field validation and picklist awareness.
5. **FLS/CRUD Checks**: Validate code snippets against profile and permission set access.
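
For the SOQL Guidance use case, one way to keep generated queries safe is to accept only field names that appear in a describe result. The sketch below assumes a trimmed-down describe payload that keeps only field names; `OPPORTUNITY_DESCRIBE` and `build_soql` are hypothetical names used for illustration, not part of any Salesforce SDK.

```python
import re

# Hypothetical, trimmed-down describe result: only the field names are kept.
OPPORTUNITY_DESCRIBE = {
    "name": "Opportunity",
    "fields": [{"name": "Id"}, {"name": "Name"}, {"name": "StageName"}, {"name": "Amount"}],
}

# Plain API names only, so free text can never reach the query string.
_IDENTIFIER = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def build_soql(describe, requested_fields, limit=10):
    """Build a SELECT statement using only fields present in the describe result."""
    known = {f["name"].lower(): f["name"] for f in describe["fields"]}
    selected = []
    for field in requested_fields:
        if not _IDENTIFIER.match(field):
            raise ValueError(f"Not a valid field identifier: {field!r}")
        canonical = known.get(field.lower())
        if canonical is None:
            raise ValueError(f"Unknown field on {describe['name']}: {field}")
        if canonical not in selected:
            selected.append(canonical)  # keep request order, drop repeats
    if not selected:
        raise ValueError("At least one field is required")
    return f"SELECT {', '.join(selected)} FROM {describe['name']} LIMIT {int(limit)}"
```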
---
## 4. Functional Requirements
### 4.1 Metadata Ingestion
- **Metadata API**: `listMetadata`, `readMetadata`, `retrieve(package.xml)` for Layouts, Flows, CustomObjects, Profiles, PermissionSets.
- **Tooling API**: ApexClass, ApexTrigger, ValidationRule.
- **REST/Describe APIs**: Object schema, FLS/CRUD, picklists.
- **SOQL (optional)**: Sample data retrieval for context.
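
A `retrieve` call is driven by a `package.xml` manifest. As a rough sketch, the manifest can be generated from a mapping of metadata type to member names. The helper name `build_package_xml` and the API version `61.0` are assumptions for illustration; use whatever version the target org supports.

```python
from xml.sax.saxutils import escape

# XML namespace used by Metadata API manifests.
METADATA_NS = "http://soap.sforce.com/2006/04/metadata"


def build_package_xml(types, api_version="61.0"):
    """Render a package.xml manifest from {metadata_type: [member, ...]}."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        f'<Package xmlns="{METADATA_NS}">',
    ]
    for type_name in sorted(types):  # sorted for stable, diff-friendly output
        lines.append("    <types>")
        for member in types[type_name]:
            lines.append(f"        <members>{escape(member)}</members>")
        lines.append(f"        <name>{escape(type_name)}</name>")
        lines.append("    </types>")
    lines.append(f"    <version>{escape(api_version)}</version>")
    lines.append("</Package>")
    return "\n".join(lines)


manifest = build_package_xml({
    "Layout": ["*"],
    "CustomObject": ["Account", "Invoice__c"],  # Invoice__c is a made-up example
})
```

Standard objects such as `Account` are listed by name here because the `*` wildcard for `CustomObject` does not include standard objects.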
### 4.2 Normalization & Chunking
- Apex → per method (include signature + docblock).
- Triggers → per event section (e.g., the `before insert` or `after update` branch).
- LWC → pair `.js + .html`; keep `*-meta.xml` separate.
- Layouts → per section.
- Validation Rules → per rule.
- Flows → per element (with connectors).
- Profiles/PermSets → per object CRUD/FLS block + ApexClassAccess.
- CustomObject/Field → per field (include picklists).
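
As an illustration of the "Layouts → per section" rule, the sketch below splits a Layout metadata file into one chunk per `<layoutSections>` element. The chunk shape (`id`, `type`, `text`, `symbols`) and the function name `chunk_layout` are assumptions made for this example, not a fixed schema.

```python
import xml.etree.ElementTree as ET

# Namespace used by Metadata API XML files.
NS = {"md": "http://soap.sforce.com/2006/04/metadata"}

SAMPLE_LAYOUT = """<?xml version="1.0" encoding="UTF-8"?>
<Layout xmlns="http://soap.sforce.com/2006/04/metadata">
    <layoutSections>
        <label>Account Information</label>
        <layoutColumns>
            <layoutItems><field>Name</field></layoutItems>
            <layoutItems><field>Industry</field></layoutItems>
        </layoutColumns>
    </layoutSections>
    <layoutSections>
        <label>Address Information</label>
        <layoutColumns>
            <layoutItems><field>BillingAddress</field></layoutItems>
        </layoutColumns>
    </layoutSections>
</Layout>
"""


def chunk_layout(xml_text, object_name, layout_name):
    """Return one chunk dict per <layoutSections> element."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    chunks = []
    for index, section in enumerate(root.findall("md:layoutSections", NS)):
        label = section.findtext("md:label", default=f"section-{index}", namespaces=NS)
        fields = [el.text for el in section.findall(".//md:field", NS) if el.text]
        chunks.append({
            "id": f"layout:{object_name}/{layout_name}#{label}",
            "type": "LayoutSection",
            "text": f"Layout {layout_name} on {object_name}, section {label}: " + ", ".join(fields),
            # Fully qualified symbols make the chunk findable by where-used queries.
            "symbols": [f"{object_name}.{name}" for name in fields],
        })
    return chunks
```

Each chunk carries its `Object.Field` symbols so that later enrichment and keyword indexing can match on exact names.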
### 4.3 Feature Enrichment
- Extract symbols: `Object`, `Object.Field`, `Schema.SObjectType.X`.
- Identify DML ops, SOQL tables, FLS/CRUD usage.
- Build edges: `(Object.Field) → (Apex/Flow/Layout/Profile)`.
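
The symbol-to-artifact edges can be prototyped with a regular expression before investing in a real parser. The sketch below is deliberately naive: it only catches literal `Object.Field` tokens and would miss access through variables (for example `acc.Industry`). The names `extract_field_symbols` and `build_where_used`, and the chunk shape, are hypothetical.

```python
import re
from collections import defaultdict

# Matches tokens such as Account.Industry; both parts must start with a capital.
FIELD_REF = re.compile(r"\b([A-Z][A-Za-z0-9_]*)\.([A-Z][A-Za-z0-9_]*)\b")


def extract_field_symbols(text, known_objects):
    """Return the set of Object.Field symbols whose object is a known sObject."""
    symbols = set()
    for obj, field in FIELD_REF.findall(text):
        if obj in known_objects:  # filters out things like Schema.SObjectField
            symbols.add(f"{obj}.{field}")
    return symbols


def build_where_used(chunks, known_objects):
    """Map each Object.Field symbol to the sorted ids of chunks that mention it."""
    edges = defaultdict(set)
    for chunk in chunks:
        for symbol in extract_field_symbols(chunk["text"], known_objects):
            edges[symbol].add(chunk["id"])
    return {symbol: sorted(ids) for symbol, ids in edges.items()}
```

Restricting matches to known object names (for example, those returned by describe calls) keeps namespace-style tokens out of the edge list.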
### 4.4 Indexing
- **Vector store** (pgvector/Qdrant) for semantic retrieval.
- **Keyword index** (Postgres FTS/OpenSearch) for symbol precision.
- Hybrid retrieval (union → dedup → rerank).
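
The union → dedup → rerank flow can be approximated without a learned reranker by using reciprocal rank fusion (RRF). Note that RRF is a swapped-in stand-in for the rerank step, not something the requirements above mandate; a cross-encoder could replace it later. The function name `hybrid_merge` and the constant `k=60` are assumptions.

```python
def hybrid_merge(vector_hits, keyword_hits, k=60, top_n=5):
    """Fuse two ranked lists of chunk ids with reciprocal rank fusion.

    Each list contributes 1 / (k + rank) per chunk; ids found by both
    retrievers accumulate score, which also deduplicates the union.
    """
    scores = {}
    for hits in (vector_hits, keyword_hits):
        seen = set()
        rank = 0
        for chunk_id in hits:
            if chunk_id in seen:
                continue  # ignore repeats inside one result list
            seen.add(chunk_id)
            rank += 1
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for deterministic output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [chunk_id for chunk_id, _ in ranked[:top_n]]
```

A chunk returned by both the vector store and the keyword index outranks one returned by only a single retriever at a similar position.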
### 4.5 MCP Tools
- `sf.metadata.list`, `sf.metadata.read`, `sf.metadata.retrieve`
- `sf.tooling.soql`, `sf.describe.object`, `sf.soql`
- `rag.ingest_org`, `rag.search`, `rag.where_used`, `rag.open`, `rag.status`
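
To make the tool surface concrete, here is a dependency-free sketch of a name-to-handler registry. It does not use an MCP SDK; the decorator, the argument names, and the stubbed return values are all hypothetical and only illustrate how the dotted tool names could map to handlers.

```python
TOOLS = {}


def tool(name):
    """Register a handler under a dotted tool name such as 'rag.status'."""
    def register(fn):
        if name in TOOLS:
            raise ValueError(f"Duplicate tool name: {name}")
        TOOLS[name] = fn
        return fn
    return register


@tool("rag.status")
def rag_status(org_id):
    # Stubbed value; a real handler would read ingestion state from storage.
    return {"org_id": org_id, "state": "idle"}


@tool("rag.where_used")
def rag_where_used(org_id, symbol):
    # Stubbed edge index; a real handler would query the where-used edges.
    index = {"Account.Industry": ["apex:AccountService.updateIndustry"]}
    return {"org_id": org_id, "symbol": symbol, "references": index.get(symbol, [])}


def call_tool(name, arguments):
    """Dispatch a tool call by name with keyword arguments."""
    if name not in TOOLS:
        raise KeyError(f"Unknown tool: {name}")
    return TOOLS[name](**arguments)
```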
---
## 5. Non-Functional Requirements
- **Performance**: Retrieval in <1.0 s (p95); batch ingestion in <15 min for a mid-sized org.
- **Scalability**: Handle orgs with 100k+ metadata items.
- **Reliability**: Retry failed API calls with exponential backoff and respect API quotas.
- **Security**: OAuth 2.0 JWT bearer authentication with a read-only integration user; tenant isolation by `org_id`; encrypted storage; PII stripped before indexing.
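
The reliability requirement can be sketched as a small retry wrapper. The exception type `TransientApiError`, the default delays, and the 10% jitter are assumptions chosen for illustration; a real client would map its own timeout and rate-limit errors onto the retryable case.

```python
import random
import time


class TransientApiError(Exception):
    """Hypothetical marker for retryable failures (timeouts, rate limits)."""


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call fn(), retrying transient failures with exponential backoff.

    The delay doubles after each failed attempt (1s, 2s, 4s, ...), is capped
    at max_delay, and gets up to 10% random jitter added.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientApiError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay * 0.1))
```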
---
## 6. Roadmap
### Phase 1 (MVP)
- Basic ingestion (Apex, Layouts, Validation Rules, Objects).
- Chunk + embed + index.
- Expose MCP: `sf.metadata.*`, `rag.ingest_org`, `rag.search`, `rag.status`.
### Phase 2
- Add Profiles, PermissionSets, Flows.
- Implement hybrid retrieval with rerank.
- Add `rag.where_used`.
### Phase 3
- VS Code & Slack integration.
- Error-to-metadata mapping.
- PR impact summarization.
### Phase 4
- Multi-org tenancy.
- Governance layer (citation enforcement, hallucination guard).
- Graph DB for advanced where-used analysis.
---
## 7. Success Metrics
- **Coverage**: ≥ 90% of key metadata types ingested.
- **Accuracy**: ≥ 90% of answers cite valid metadata/code.
- **Adoption**: ≥ 50% of the dev team using it weekly within 3 months.
- **Efficiency**: ≥ 30% reduction in time spent searching code/metadata.
---
## 8. Risks & Mitigations
- **API Limits**: Use incremental deltas and batch `listMetadata` queries (up to three queries per call).
- **Large Orgs**: Parallelize ingestion; shard indexes.
- **Security**: Strip PII before embedding; enforce `org_id` isolation.
- **Complexity Drift**: Enforce modular chunkers; validate with evaluation set.
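
The incremental-delta mitigation can be sketched by comparing each component's `lastModifiedDate` (part of the file properties that `listMetadata` returns) with the time of the last successful sync. The dict shape and the ISO 8601 timestamp format with a trailing `Z` are assumptions about how the client library surfaces those properties; `select_changed` is a hypothetical helper.

```python
from datetime import datetime, timezone


def parse_timestamp(value):
    """Parse an ISO 8601 timestamp such as 2025-08-02T09:30:00.000Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def select_changed(file_properties, last_sync):
    """Return sorted (type, fullName) pairs modified after last_sync."""
    changed = []
    for props in file_properties:
        if parse_timestamp(props["lastModifiedDate"]) > last_sync:
            changed.append((props["type"], props["fullName"]))
    return sorted(changed)


# Only components touched since the previous run need to be re-read.
example = select_changed(
    [
        {"type": "ApexClass", "fullName": "AccountService",
         "lastModifiedDate": "2025-08-02T09:30:00.000Z"},
        {"type": "Layout", "fullName": "Account-Account Layout",
         "lastModifiedDate": "2025-07-15T00:00:00.000Z"},
    ],
    last_sync=datetime(2025, 8, 1, tzinfo=timezone.utc),
)
```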
---
## 9. Recommended Libraries & Projects
### 9.1 Salesforce Connectivity
- **[JSforce](https://github.com/jsforce/jsforce)** (Node.js)
Mature and well-maintained; covers **Metadata API, Tooling API, REST, Bulk** in one SDK.
→ Recommended as the single ingestion layer.
### 9.2 Document Parsing & Normalization
- **[xml2js](https://github.com/Leonidas-from-XIV/node-xml2js)** (Node)
Reliable XML parsing for Salesforce metadata (Layouts, Flows, Profiles, etc.).
→ Use alongside custom **Salesforce-aware chunkers** (per method, layout section, flow element, etc.).
### 9.3 Embeddings
- **[sentence-transformers](https://github.com/UKPLab/sentence-transformers)** (Python)
Hugging Face embedding models (e.g., `all-MiniLM-L6-v2`, 384-dimensional): open source, fast, and a good balance of accuracy vs. cost.
→ Recommended default for local/self-hosted; can swap to OpenAI/Cohere if SaaS is acceptable.
### 9.4 Vector & Keyword Indexing
- **[pgvector](https://github.com/pgvector/pgvector)** (Postgres extension)
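Adds **vector similarity search** to Postgres, so embeddings and built-in full-text search (FTS) live in one database. Native Postgres FTS ranks with `ts_rank`/`ts_rank_cd` rather than BM25; BM25 scoring needs an additional extension or a separate engine such as OpenSearch.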
→ Simplifies ops (one datastore instead of Qdrant + Elastic).
### 9.5 Retrieval & RAG Orchestration
- **[LangChain](https://github.com/langchain-ai/langchain)**
Widely adopted; has retrievers, rerankers, chunkers, and vector store integrations.
→ Use only as plumbing inside MCP tools (`rag.search`, `rag.where_used`), not as the outer protocol.
### 9.6 MCP (Model Context Protocol)
- **[MCP TypeScript SDK](https://github.com/modelcontextprotocol/ts-sdk)** (Node.js)
Official TypeScript SDK; integrates cleanly with JSforce ingestion and Postgres/pgvector retrieval.
→ Recommended for implementing the `sf.*` and `rag.*` tools.
### 9.7 Observability & Evaluation
- **[DeepEval](https://github.com/confident-ai/deepeval)**
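An LLM evaluation framework with RAG-oriented metrics (e.g., contextual recall, contextual precision, faithfulness); ranking metrics such as Recall@k and MRR may need to be computed separately.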
→ Lightweight and open source; ideal for building a Salesforce-specific golden set.
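
Independently of the evaluation library, Recall@k and MRR over a golden set are simple to compute directly. The sketch below assumes each golden-set case lists a query and the chunk ids considered relevant; the case shape and the `search` callable are hypothetical.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant ids that appear in the top-k retrieved ids."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant id, or 0.0 if none was retrieved."""
    relevant = set(relevant)
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(golden_set, search, k=5):
    """Average Recall@k and MRR of search(query) over all golden-set cases."""
    if not golden_set:
        return {"recall@k": 0.0, "mrr": 0.0}
    recalls, ranks = [], []
    for case in golden_set:
        retrieved = search(case["query"])
        recalls.append(recall_at_k(retrieved, case["relevant"], k))
        ranks.append(reciprocal_rank(retrieved, case["relevant"]))
    n = len(golden_set)
    return {"recall@k": sum(recalls) / n, "mrr": sum(ranks) / n}
```

Here `search` would wrap a retrieval call such as `rag.search` and return chunk ids in ranked order.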