# MCP VectorStore Server
A Model Context Protocol (MCP) server that provides advanced vector store operations for document search, PDF processing, and information retrieval. This server wraps the functionality from `vectorstore.py` into a standardized MCP interface.
## Features
- **Vector Store Operations**: Create, search, and manage document vector stores
- **PDF Processing**: Extract and index content from PDF documents using LLMSherpa
- **Semantic Search**: Advanced document search using HuggingFace embeddings
- **Web Search Integration**: Google, Wikipedia, and DuckDuckGo search capabilities
- **File Operations**: Read and process local files
- **Mathematical Calculations**: Built-in calculator functionality
## Prerequisites
### System Requirements
- **Python**: 3.8 or higher
- **Operating System**: Linux, macOS, or Windows
- **Memory**: Minimum 4GB RAM (8GB+ recommended for large document collections)
- **Storage**: At least 2GB free space for models and vector stores
- **Network**: Internet connection for downloading models and web searches
### Optional GPU Support
For improved performance with large document collections:
- **CUDA**: 11.8 or higher
- **GPU**: NVIDIA GPU with 4GB+ VRAM
- **cuDNN**: Compatible version for your CUDA installation
## Installation
### Step 1: Clone or Download the Repository
```bash
# If you have the files locally, navigate to the directory
cd /path/to/McpDocServer
# Or clone from a repository (if available)
# git clone <repository-url>
# cd McpDocServer
```
### Step 2: Create a Virtual Environment
```bash
# Create a virtual environment
python3 -m venv venv
# Activate the virtual environment
# On Linux/macOS:
source venv/bin/activate
# On Windows:
# venv\Scripts\activate
```
### Step 3: Install Dependencies
```bash
# Upgrade pip
pip install --upgrade pip
# Install all required packages
pip install -r requirements.txt
```
### Step 4: Install LLMSherpa (Optional but Recommended)
For optimal PDF processing, run the LLMSherpa parsing backend locally. Note that `pip install llmsherpa` provides only the client library; the parsing backend itself is the separate `nlm-ingestor` service:
```bash
# Install the LLMSherpa client library
pip install llmsherpa
```
Then follow the `nlm-ingestor` project's instructions to start the parsing service on port 5001 (in a separate terminal) so it matches the `LLMSHERPA_API_URL` configured below.
### Step 5: Download Embedding Models
The server will automatically download the required embedding model on first use, but you can pre-download it:
```bash
# Download the embedding model
python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('sentence-transformers/all-mpnet-base-v2')"
```
## Configuration
### Environment Variables
Create a `.env` file in the project directory:
```bash
# LLMSherpa API URL (use local if available, otherwise cloud)
LLMSHERPA_API_URL=http://localhost:5001/api/parseDocument?renderFormat=all
# Vector store directory
VECTORSTORE_DIR=/path/to/your/documents
# User agent for web scraping
USER_AGENT=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36
# Optional: CUDA device for GPU acceleration
CUDA_VISIBLE_DEVICES=0
```
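At startup these variables can be read with `os.getenv`, falling back to the defaults shown above when a variable is unset. A minimal sketch (the actual lookup logic lives in `vectorstore.py` and may differ):

```python
import os

# Fallbacks mirror the sample .env above; adjust paths for your machine.
LLMSHERPA_API_URL = os.getenv(
    "LLMSHERPA_API_URL",
    "http://localhost:5001/api/parseDocument?renderFormat=all",
)
VECTORSTORE_DIR = os.getenv("VECTORSTORE_DIR", "./documents")
USER_AGENT = os.getenv("USER_AGENT", "Mozilla/5.0")
```

If you use a `.env` file rather than exported variables, `python-dotenv`'s `load_dotenv()` can populate `os.environ` before these lookups run.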
### Directory Structure
Prepare your document directory:
```
your_documents/
├── pdfs/
│   ├── document1.pdf
│   ├── document2.pdf
│   └── ...
├── text_files/
│   ├── notes.txt
│   └── ...
└── other_documents/
    └── ...
```
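A loader can then walk this tree and pick up indexable files by extension. A minimal sketch (the extension list is an assumption; the server's actual loader may filter differently):

```python
from pathlib import Path
from typing import List, Tuple

def collect_documents(root: str, exts: Tuple[str, ...] = (".pdf", ".txt")) -> List[Path]:
    """Recursively gather files under the document root whose suffix is indexable."""
    return sorted(p for p in Path(root).rglob("*") if p.suffix.lower() in exts)
```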
## Usage
### Starting the MCP Server
```bash
# Make the server executable (Linux/macOS)
chmod +x mcp_vectorstore_server.py

# Start the server on Linux
python /home/em/McpDocServer/mcp_vectorstore_server.py

# Or on Windows via WSL
wsl -d Ubuntu-24.04 bash -c "/mnt/c/Users/emanu/Desktop/McpDocServer/start_mcp.sh"
```
### Using with MCP Clients
#### 0. Claude Desktop
Add to your MCP configuration:
```json
{
  "mcpServers": {
    "vectorstore": {
      "command": "python",
      "args": ["/home/em/McpDocServer/mcp_vectorstore_server.py"],
      "env": {
        "PYTHONPATH": "/home/em/McpDocServer/McpDocServer"
      }
    }
  }
}
```
#### 1. GitHub Copilot
1. Click **Configure Tools** in the GitHub Copilot Chat window.
2. Click **Add More Tools** in the picker at the top.
3. Click **Add MCP Server**.
4. Choose **command (stdio)**.
5. Enter the command to run:
6. On Linux: `python /home/em/McpDocServer/mcp_vectorstore_server.py`
   or on Windows: `wsl -d Ubuntu-24.04 /mnt/c/Users/emanu/Desktop/McpDocServer/start_mcp.sh`
7. Enter the MCP server ID/name, e.g. `McpDocServer-19be5552`.
8. Configure `settings.json`:
```json
{
  "security.workspace.trust.untrustedFiles": "open",
  "python.defaultInterpreterPath": "/mnt/c/Users/emanu/Desktop/LLM/venv/venv/bin/python",
  "terminal.integrated.inheritEnv": false,
  "git.openRepositoryInParentFolders": "never",
  "terminal.integrated.scrollback": 100000,
  "mcp": {
    "servers": {
      "McpDocServer-19be5552": {
        "type": "stdio",
        "command": "python",
        "args": [
          "/mnt/c/Users/emanu/Desktop/McpDocServer/mcp_vectorstore_server.py"
        ]
      }
    }
  }
}
```
9. Click **Configure Tools** in the GitHub Copilot Chat window, scroll to the bottom, and check that the following tools appear in the MCP server's tool list:
   - `vectorstore_search`
   - `vectorstore_create`
   - `vectorstore_info`
   - `vectorstore_clear`
   - `read_file`
   - `google_search`
   - `wikipedia_search`
   - `duckduckgo_search`
   - `calculate`
10. Select **Agent** mode in the GitHub Copilot Chat window and query the server, e.g.: `use vectorstore_search to get information on unit testing`
11. Confirm the tool call when prompted.
#### 2. Continue MCP Client
```yaml
name: McpDocServer
version: 1.0.1
schema: v1
mcpServers:
  - name: McpDocServer
    command: wsl -d Ubuntu-24.04
    args:
      - "/mnt/c/Users/emanu/Desktop/McpDocServer/start_mcp.sh"
    env: {}
mcp_timeout: 180 # set timeout to 180 sec
timeout: 9999
connectionTimeout: 120000 # 120 seconds = 2 minutes
```
#### 3. Other MCP Clients
Configure your MCP client to use the server:
```bash
# Example with a generic MCP client
mcp-client --server python --args /path/to/McpDocServer/mcp_vectorstore_server.py
```
## Available Tools
### Vector Store Operations
#### `vectorstore_search`
Search the vector store for relevant documents.
**Parameters:**
- `query` (string, required): Search query
- `k` (integer, optional): Number of results (default: 2)
**Example:**
```json
{
  "name": "vectorstore_search",
  "arguments": {
    "query": "machine learning algorithms",
    "k": 5
  }
}
```
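For reference, MCP clients deliver such invocations as JSON-RPC 2.0 `tools/call` requests over stdio, so the example above travels roughly as follows (the `id` is chosen by the client):

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "vectorstore_search",
    "arguments": {
      "query": "machine learning algorithms",
      "k": 5
    }
  }
}
```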
#### `vectorstore_create`
Create a new vector store from documents in a directory.
**Parameters:**
- `directory_path` (string, required): Path to directory containing documents
**Example:**
```json
{
  "name": "vectorstore_create",
  "arguments": {
    "directory_path": "/home/user/documents/research_papers"
  }
}
```
#### `vectorstore_info`
Get information about the current vector store.
**Example:**
```json
{
  "name": "vectorstore_info",
  "arguments": {}
}
```
#### `vectorstore_clear`
Clear all documents from the vector store.
**Example:**
```json
{
  "name": "vectorstore_clear",
  "arguments": {}
}
```
### File Operations
#### `read_file`
Read the contents of a file on the system.
**Parameters:**
- `filename` (string, required): Path to the file to read
**Example:**
```json
{
  "name": "read_file",
  "arguments": {
    "filename": "/home/user/documents/notes.txt"
  }
}
```
### Web Search Operations
#### `google_search`
Search Google for information.
**Parameters:**
- `query` (string, required): Search query
- `max_results` (integer, optional): Maximum number of results (default: 3)
**Example:**
```json
{
  "name": "google_search",
  "arguments": {
    "query": "latest AI developments 2024",
    "max_results": 5
  }
}
```
#### `wikipedia_search`
Search Wikipedia for information.
**Parameters:**
- `query` (string, required): Search query
**Example:**
```json
{
  "name": "wikipedia_search",
  "arguments": {
    "query": "artificial intelligence"
  }
}
```
#### `duckduckgo_search`
Search DuckDuckGo for information.
**Parameters:**
- `query` (string, required): Search query
**Example:**
```json
{
  "name": "duckduckgo_search",
  "arguments": {
    "query": "privacy-focused search engines"
  }
}
```
### Utility Operations
#### `calculate`
Perform mathematical calculations.
**Parameters:**
- `operation` (string, required): Mathematical operation to perform
**Example:**
```json
{
  "name": "calculate",
  "arguments": {
    "operation": "2 + 2 * 3"
  }
}
```
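A calculator tool like this must not pass user input to `eval()`. One safe pattern, shown here as a sketch rather than the server's actual implementation, walks the expression's AST and only applies whitelisted arithmetic operators:

```python
import ast
import operator

# Only these operators are permitted; eval() on raw input would execute arbitrary code.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
    ast.USub: operator.neg,
}

def safe_calculate(expression: str):
    """Evaluate a basic arithmetic expression without calling eval()."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"Unsupported expression: {expression!r}")
    return _eval(ast.parse(expression, mode="eval"))
```

`safe_calculate("2 + 2 * 3")` returns `8`, while anything outside the whitelist (names, calls, attribute access) raises `ValueError`.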
## Resources
The server provides the following resources:
### `vectorstore://info`
Returns information about the current vector store in JSON format.
**Example Response:**
```json
{
  "num_documents": 150,
  "directory": "/home/user/documents",
  "embeddings_model": "sentence-transformers/all-mpnet-base-v2"
}
```
## Troubleshooting
### Common Issues
#### 1. Import Errors
**Problem:** `ModuleNotFoundError` for various packages
**Solution:** Ensure all dependencies are installed:
```bash
pip install -r requirements.txt
```
#### 2. CUDA/GPU Issues
**Problem:** CUDA-related errors
**Solution:** Install CPU-only versions:
```bash
pip uninstall faiss-gpu torch
pip install faiss-cpu
# Reinstall a CPU-only build of PyTorch
pip install torch --index-url https://download.pytorch.org/whl/cpu
```
#### 3. LLMSherpa Connection Issues
**Problem:** Cannot connect to LLMSherpa API
**Solution:**
- Verify that the LLMSherpa parsing backend (the `nlm-ingestor` service) is running locally and listening at the URL configured in `LLMSHERPA_API_URL`
- Or fall back to the hosted LLMSherpa API by updating the URL in the configuration
#### 4. Memory Issues
**Problem:** Out of memory errors with large documents
**Solution:**
- Reduce chunk size in the text splitter
- Use smaller embedding models
- Process documents in batches
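The batching suggestion can be sketched as a generic helper (illustrative only, not the server's code): load and embed one slice at a time so peak memory is bounded by the batch size rather than the whole collection.

```python
from typing import Iterator, List, Sequence

def batched(items: Sequence, batch_size: int = 10) -> Iterator[List]:
    """Yield successive fixed-size slices of items; the last slice may be shorter."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])
```

Each yielded batch would be embedded and added to the vector store before the next one is read from disk.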
#### 5. Permission Issues
**Problem:** Cannot read files or directories
**Solution:** Check file permissions:
```bash
chmod 644 /path/to/documents/*
chmod 755 /path/to/documents/
```
### Performance Optimization
#### For Large Document Collections
1. **Use GPU acceleration:**
```python
# In vectorstore.py, ensure CUDA is enabled
model_kwargs={'device': 'cuda'}
```
2. **Optimize chunk size:**
```python
# Adjust in PDFVectorStoreTool.__init__
chunk_size=1000, # Smaller chunks for better performance
chunk_overlap=100,
```
3. **Batch processing:**
```python
# Process documents in smaller batches
batch_size = 10
```
#### For Better Search Results
1. **Adjust similarity threshold:**
```python
# In vectorstore_search method
similarity_threshold = 0.7
```
2. **Use different embedding models:**
```python
# Try different models for better results
model_name="sentence-transformers/all-MiniLM-L6-v2" # Faster
model_name="sentence-transformers/all-mpnet-base-v2" # Better quality
```
## Development
### Project Structure
```
McpDocServer/
├── mcp_vectorstore_server.py # Main MCP server
├── vectorstore.py # Original vectorstore implementation
├── requirements.txt # Python dependencies
├── README.md # This documentation
└── .env # Environment variables (create this)
```
### Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request
### Testing
```bash
# Run basic functionality tests
python -c "
from mcp_vectorstore_server import *
print('Server imports successfully')
"
# Test vector store operations
python -c "
from vectorstore import PDFVectorStoreTool
tool = PDFVectorStoreTool()
print(f'Vector store initialized with {tool.vectorstore_get_num_items()} documents')
"
```
## License
This project is provided as-is for educational and research purposes. Please ensure you comply with the licenses of all included dependencies.
## Support
For issues and questions:
1. Check the troubleshooting section above
2. Review the error logs
3. Ensure all dependencies are correctly installed
4. Verify your system meets the requirements
## Changelog
### Version 1.0.0
- Initial release
- MCP server implementation
- Vector store operations
- Web search integration
- File operations
- Mathematical calculations