# RooCode-RAG-Lookup
RooCode MCP Server for performing RAG (Retrieval-Augmented Generation) lookups in documents and code repositories using vector embeddings and semantic search.
## Example Usage
Ask a question, e.g. "What is the maximum number of entries* in a Word document?", and add "use rag" to the prompt. The LLM is usually a decent judge of when a tool is needed, so it may also call the tool on its own.
<img width="1458" height="686" alt="image" src="https://github.com/user-attachments/assets/45bdd266-1f23-42e5-9c2f-34d1dd23a179" />
*This is related to the maximum number of XML properties and elements addressable in Word
## Features
- **Full RAG Implementation**: Complete vector-based semantic search using ChromaDB and Haystack
- **Document Indexing**: Automatic text extraction and chunking from PDF documents
- **Vector Embeddings**: Sentence transformer embeddings for semantic similarity
- **RAG Lookup Tool**: Search through documents and code repositories with relevance scoring
- **Test Tool**: Simple hello world tool to verify MCP server connectivity
- **Async MCP Protocol**: Full JSON-RPC 2.0 support via stdio
## Installation
1. Install Python dependencies:
```bash
pip install -r requirements.txt
```
2. Configure RooCode to use this MCP server by adding the configuration from `mcp_config.json` to your RooCode settings.
## Configuration
1. Add the contents of `mcp_config.json` to your RooCode MCP server settings (open the MCP tools menu and edit the global settings). When the tool is ready to use, it shows a green status.
2. Set the following environment variables:
- `RAG_LOOKUP_PATH`: Path to this project directory
- `PYTHON_PATH`: Path to your Python executable
3. Configure parameters in [`parameters.py`](parameters.py:1):
- `EMBEDDING_MODEL`: Sentence transformer model (default: all-mpnet-base-v2)
- `COLLECTION_NAME`: ChromaDB collection name
- `CHUNK_SIZE`: Text chunk size in words (default: 500)
- `CHUNK_OVERLAP`: Overlap between chunks (default: 50)
- `DEFAULT_TOP_K`: Number of results to return (default: 5)
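
As a rough illustration, the parameters above could be laid out as follows. This is a sketch, not the actual contents of `parameters.py`: the `COLLECTION_NAME` value and the `validate_parameters` helper are hypothetical, and the overlap is assumed to be measured in words like `CHUNK_SIZE`.

```python
# Hypothetical sketch of parameters.py; the real file may differ.
EMBEDDING_MODEL = "all-mpnet-base-v2"  # default named in this README
COLLECTION_NAME = "rag_documents"      # placeholder, the real name is not given here
CHUNK_SIZE = 500                       # chunk size in words
CHUNK_OVERLAP = 50                     # assumed to be in words, like CHUNK_SIZE
DEFAULT_TOP_K = 5                      # number of results to return


def validate_parameters(chunk_size: int, chunk_overlap: int, top_k: int) -> None:
    """Hypothetical sanity check: raise ValueError on inconsistent settings."""
    if chunk_size <= 0:
        raise ValueError("CHUNK_SIZE must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("CHUNK_OVERLAP must be >= 0 and smaller than CHUNK_SIZE")
    if top_k <= 0:
        raise ValueError("DEFAULT_TOP_K must be positive")


validate_parameters(CHUNK_SIZE, CHUNK_OVERLAP, DEFAULT_TOP_K)
```

An overlap equal to or larger than the chunk size would prevent the splitter from advancing through the text, which is why the check rejects it.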
## Available Tools
### 1. `rag_lookup`
Perform semantic search using RAG in documents and code repositories. Returns relevant chunks with similarity scores and metadata.
**Parameters:**
- `query` (required): The search query
- `source` (optional): Where to search - "documents", "repos", or "both" (default: "both")
**Returns:**
- Relevant text chunks with similarity scores
- Source file information and metadata
- Statistics on documents searched
**Example:**
```json
{
"query": "authentication implementation",
"source": "both"
}
```
**Response Format:**
```json
{
"status": "success",
"query": "authentication implementation",
"results": [
{
"content": "...",
"score": 0.85,
"metadata": {
"file_name": "document.txt",
"source_file": "/path/to/document.txt"
}
}
],
"metadata": {
"documents_searched": 5,
"repos_searched": 3,
"total_matches": 5
}
}
```
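
A client consuming this response might reduce it to a short ranked summary. The sketch below assumes only the field names shown in the response format above; `summarize_response` and its `min_score` parameter are hypothetical helpers, not part of this project.

```python
import json


def summarize_response(raw: str, min_score: float = 0.0) -> list[str]:
    """Return 'score  file_name' lines, best match first (hypothetical helper)."""
    response = json.loads(raw)
    if response.get("status") != "success":
        raise ValueError(f"lookup failed with status {response.get('status')!r}")
    hits = sorted(
        response.get("results", []),
        key=lambda hit: hit.get("score", 0.0),
        reverse=True,
    )
    lines = []
    for hit in hits:
        score = hit.get("score", 0.0)
        if score < min_score:
            continue  # drop weak matches
        file_name = hit.get("metadata", {}).get("file_name", "<unknown>")
        lines.append(f"{score:.2f}  {file_name}")
    return lines


example = """
{
  "status": "success",
  "query": "authentication implementation",
  "results": [
    {"content": "...", "score": 0.85,
     "metadata": {"file_name": "document.txt",
                  "source_file": "/path/to/document.txt"}}
  ],
  "metadata": {"documents_searched": 5, "repos_searched": 3, "total_matches": 5}
}
"""
print(summarize_response(example))
```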
### 2. `say_hello`
Simple test tool that returns a greeting message with a timestamp.
**Parameters:**
- `name` (optional): Name to include in greeting (default: "World")
**Example:**
```json
{
"name": "RooCode"
}
```
## Usage
### 1. Extract and Index Documents
Place PDF documents in the `Documents/` or `Repos/` folders, then run:
```bash
# Extract text from PDFs
python extraction/parse_pdf.py
# Populate the vector database
python extraction/populate_database.py
```
### 2. Query the RAG System
```bash
# Test RAG lookup directly
python query_rag.py
# Or ask RooCode a question and add "use rag" to the prompt (see Example Usage)
```
### 3. Use via MCP Server
Once configured in RooCode, use the `rag_lookup` tool through the MCP interface. To configure it, open the MCP menu in the RooCode settings and edit the global settings, which opens a JSON file containing `{"mcpServers":{}}`. Copy the contents of `mcp_config.json` into that file.
## Testing
Test the MCP server locally:
```bash
# Using MCP inspector
npx @modelcontextprotocol/inspector python mcp_tool.py
# Direct stdio test
echo '{"jsonrpc":"2.0","id":1,"method":"tools/list"}' | python mcp_tool.py
```
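
To exercise a tool rather than just list them, a `tools/call` request can be built in the same JSON-RPC 2.0 shape. The sketch below only constructs the message; `build_tool_call` is a hypothetical helper. Note that MCP servers generally expect an `initialize` handshake before other requests, so a bare message piped to the server may be rejected depending on the implementation. The MCP inspector handles that handshake for you.

```python
import json


def build_tool_call(request_id: int, tool_name: str, arguments: dict) -> str:
    """Serialize a JSON-RPC 2.0 'tools/call' request (hypothetical helper)."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(message)


request = build_tool_call(
    2,
    "rag_lookup",
    {"query": "authentication implementation", "source": "both"},
)
print(request)
```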
## Project Structure
```
RooCode-RAG-Lookup/
├── mcp_tool.py # Main MCP server implementation
├── query_rag.py # RAG query functions
├── parameters.py # Configuration parameters
├── run_rag_lookup.bat # Windows batch launcher
├── mcp_config.json # Example RooCode configuration
├── requirements.txt # Python dependencies
├── extraction/
│ ├── parse_pdf.py # PDF text extraction
│ └── populate_database.py # Database population and indexing
├── ExtractedText/ # Extracted text files (.txt + .meta.json)
├── chroma_db/ # ChromaDB vector database
└── README.md # This file
```
## Technology Stack
- **MCP Python SDK**: Protocol implementation for RooCode integration
- **Haystack**: Document processing and RAG pipeline framework
- **ChromaDB**: Vector database for embeddings storage
- **Sentence Transformers**: Semantic embeddings (all-mpnet-base-v2)
- **PDFPlumber**: PDF text extraction with layout preservation
- **Async/Await**: Concurrent request handling
- **JSON-RPC 2.0**: Communication protocol
- **Stdio Transport**: RooCode integration
## How It Works
1. **Document Extraction**: PDFs are parsed using [`parse_pdf.py`](extraction/parse_pdf.py:1) which extracts text and metadata
2. **Text Chunking**: Documents are split into overlapping chunks using [`DocumentSplitter`](extraction/populate_database.py:70)
3. **Embedding Generation**: Text chunks are converted to 768-dimensional vectors using sentence transformers
4. **Vector Storage**: Embeddings are stored in ChromaDB with metadata for retrieval
5. **Semantic Search**: Queries are embedded and matched against stored vectors using cosine similarity
6. **Result Ranking**: Top-K most relevant chunks are returned with scores and metadata
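
The chunking and ranking steps can be illustrated in plain Python. This is a simplified sketch of the idea, not the project's code: the real pipeline uses Haystack's `DocumentSplitter`, sentence-transformer embeddings, and ChromaDB, whereas the helpers below (`chunk_words`, `cosine_similarity`, `top_k`) are hypothetical and work on hand-written toy vectors.

```python
import math


def chunk_words(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into word-based chunks that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], stored: list[tuple[str, list[float]]], k: int = 5):
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in stored]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


print(chunk_words("a b c d e f g h i j", chunk_size=4, overlap=1))
print(top_k([1.0, 0.0], [("x", [0.0, 1.0]), ("y", [1.0, 0.0])], k=1))
```

A vector database performs the same ranking with an index instead of a linear scan, which is what keeps lookups fast as the collection grows.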
## Requirements
See [`requirements.txt`](requirements.txt:1) for full dependencies. Key packages:
- `mcp>=1.0.0` - MCP protocol support
- `haystack-ai` - RAG framework
- `chroma-haystack` - ChromaDB integration
- `sentence-transformers` - Embedding models
- `pdfplumber` - PDF extraction
## License
MIT