Caseware AI Procurement Knowledge Platform
AI-ready procurement knowledge platform built with a local data pipeline, hybrid retrieval, and the Model Context Protocol (MCP).
Overview
This project implements an end-to-end AI-ready data platform for procurement and inventory documents.
The solution demonstrates how structured and unstructured business documents can be ingested, transformed into searchable knowledge, and exposed through an MCP (Model Context Protocol) server, allowing AI assistants to retrieve evidence, reason across related documents, and generate grounded responses with source references.
The implementation intentionally remains lightweight and fully local while showcasing modern AI Data Engineering concepts, including:
PDF and image ingestion
OCR fallback for scanned documents
Structured metadata extraction
Semantic embeddings
Hybrid retrieval
Cross-document relationship matching
MCP tool integration
Grounded AI responses
The architecture prioritizes simplicity, explainability, and reproducibility, following the challenge recommendation to avoid over-engineering.
Key Capabilities
PDF document ingestion
OCR using Tesseract
Native PDF parsing with PyMuPDF
Structured metadata extraction
SQLite metadata store
ChromaDB vector database
SentenceTransformers embeddings
Hybrid retrieval (metadata + semantic search)
Cross-document relationship matching
Procurement document comparison
MCP server integration
Claude Desktop integration
Grounded source citations
Design Goals
The solution was intentionally designed to demonstrate the core architectural components of an AI-ready data platform while keeping the implementation easy to understand and reproduce.
Primary goals include:
Reproducible local execution
AI-ready document preparation
Explainable retrieval
Hybrid search combining deterministic metadata and semantic similarity
Modular architecture with clear separation of concerns
Agent integration through MCP
Rather than focusing on production-scale infrastructure, the implementation emphasizes engineering decisions, maintainability, and retrieval quality.
High-Level Architecture
flowchart TD
A[Raw Procurement Documents]
A --> B[PDF Parser]
A --> C[OCR - Tesseract]
B --> D[Extracted Text]
C --> D
D --> E[Chunking]
E --> F[Metadata Extraction]
E --> G[SentenceTransformers Embeddings]
F --> H[(SQLite)]
G --> I[(ChromaDB)]
H --> J[Hybrid Retrieval Layer]
I --> J
J --> K[FastMCP Server]
K --> L[Claude Desktop]
Data Flow
The ingestion pipeline performs the following steps:
Load procurement documents from the local filesystem.
Parse native PDFs using PyMuPDF.
Apply OCR to scanned documents using Tesseract.
Normalize extracted text.
Extract structured procurement metadata.
Split documents into retrieval-ready chunks.
Generate semantic embeddings.
Store structured metadata in SQLite.
Store semantic vectors in ChromaDB.
Expose retrieval capabilities through an MCP server.
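The normalization and chunking steps above can be sketched as follows. This is a minimal illustration, not the project's `chunk.py`: the function name `chunk_text`, the character-based windows, and the default sizes are all assumptions made for the example.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap slightly.

    Hypothetical helper: the real pipeline may chunk by page, sentence, or token.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")

    # Collapse runs of whitespace, a simple stand-in for the normalization step.
    normalized = " ".join(text.split())

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(normalized), step):
        chunks.append(normalized[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(normalized):
            break
    return chunks
```

The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of some duplicated text in the index.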
Technology Stack
| Layer | Technology |
| --- | --- |
| Language | Python |
| PDF Parsing | PyMuPDF |
| OCR | Tesseract |
| Embeddings | SentenceTransformers |
| Metadata Store | SQLite |
| Vector Database | ChromaDB |
| MCP Framework | FastMCP |
| AI Client | Claude Desktop |
Project Structure
The project is organized into independent modules following a clear separation of concerns. Each component has a single responsibility, making the solution easier to understand, maintain, and extend.
caseware-ai-data-mcp/
│
├── app/
│   ├── pipeline/
│   │   ├── extract.py      # PDF parsing and OCR
│   │   ├── chunk.py        # Document chunking
│   │   ├── model.py        # Metadata extraction
│   │   ├── ingest.py       # End-to-end ingestion pipeline
│   │   └── index.py        # ChromaDB indexing
│   │
│   ├── retrieval/
│   │   ├── search.py       # Semantic retrieval
│   │   ├── matching.py     # Cross-document matching
│   │   ├── hybrid.py       # Hybrid retrieval
│   │   └── citations.py    # Source references
│   │
│   ├── db.py               # SQLite initialization
│   └── server.py           # FastMCP server
│
├── data/
│   └── raw/                # Procurement documents
│
├── storage/
│   ├── knowledge.db        # SQLite metadata store
│   └── chroma/             # ChromaDB vector index
│
├── run_pipeline.py
├── requirements.txt
└── README.md
Module Responsibilities
Pipeline
The pipeline transforms raw procurement documents into AI-ready knowledge.
Responsibilities include:
Reading procurement documents
Parsing PDF files
Running OCR when required
Extracting structured metadata
Chunking document content
Generating semantic embeddings
Populating SQLite
Building the ChromaDB vector index
Retrieval
The retrieval layer is responsible for answering user questions.
It combines two complementary strategies:
Deterministic metadata lookup
Semantic vector search
This hybrid approach improves retrieval precision while maintaining flexibility for natural language queries.
Storage
Structured and semantic information are intentionally stored separately.
# SQLite
Stores:
Document metadata
Extracted procurement fields
Chunk metadata
Document relationships
# ChromaDB
Stores:
Sentence embeddings
Semantic vector index
Separating these responsibilities keeps the architecture simple while allowing each technology to focus on its strengths.
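As an illustration of the SQLite side of this split, here is a minimal schema sketch. The table and column names are assumptions for the example; the README does not document the actual schema created by `app/db.py`.

```python
import sqlite3

# Hypothetical schema: documents carry extracted fields, chunks point back to them.
SCHEMA = """
CREATE TABLE IF NOT EXISTS documents (
    id        INTEGER PRIMARY KEY,
    file_name TEXT NOT NULL UNIQUE,
    doc_type  TEXT NOT NULL,
    order_id  TEXT
);
CREATE TABLE IF NOT EXISTS chunks (
    id          INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES documents(id),
    page        INTEGER NOT NULL,
    chunk_index INTEGER NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_documents_order ON documents(order_id);
"""


def init_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the metadata store and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


conn = init_db()
conn.execute(
    "INSERT INTO documents (file_name, doc_type, order_id) VALUES (?, ?, ?)",
    ("invoice_10248.pdf", "invoice", "10248"),
)
rows = conn.execute(
    "SELECT file_name FROM documents WHERE order_id = ?", ("10248",)
).fetchall()
```

Indexing `order_id` is what makes identifier lookups cheap; the embeddings themselves never need to live in this database.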
MCP Server
The FastMCP server exposes business-oriented retrieval capabilities rather than direct database access.
Available operations include:
Search procurement documents
Retrieve supporting documents for an order
Compare procurement documents
Detect missing purchase orders
Execute hybrid retrieval
This abstraction allows AI assistants to interact with procurement knowledge through natural language instead of SQL queries.
Installation
Prerequisites
Before running the project, install the following software:
| Dependency | Version |
| --- | --- |
| Python | 3.11+ |
| Git | Latest |
| Tesseract OCR | Latest |
| Claude Desktop (optional) | Latest |
Clone the Repository
git clone <repository-url>
cd caseware-ai-data-mcp
Create a Virtual Environment
macOS / Linux
python -m venv env
source env/bin/activate
Windows
python -m venv env
env\Scripts\activate
Install Python Dependencies
pip install -r requirements.txt
Install OCR
# macOS
brew install tesseract
# Ubuntu
sudo apt install tesseract-ocr
# Windows
Download and install Tesseract from:
https://github.com/UB-Mannheim/tesseract/wiki
Verify the installation:
tesseract --version
Preparing the Dataset
Place the procurement documents inside the data/raw/ directory.
data/
└── raw/
    ├── contracts/
    ├── invoices/
    ├── purchase_orders/
    ├── shipping_orders/
    └── inventory_reports/
Supported document formats:
PDF
PNG
JPG
JPEG
TIFF
BMP
Note
The original procurement documents are not included in this repository because they are part of the challenge dataset. Place the provided files under data/raw/ before running the ingestion pipeline.
Running the Pipeline
Build the local knowledge base by executing:
python run_pipeline.py
Example output:
{
  "documents_processed": 45,
  "chunks_indexed": 179
}
The ingestion pipeline performs the following tasks:
Reads procurement documents
Parses PDF files
Applies OCR when required
Extracts structured metadata
Generates retrieval-ready chunks
Creates semantic embeddings
Stores metadata in SQLite
Builds the ChromaDB vector index
Creates document relationships
The pipeline is idempotent and may be executed multiple times.
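The README does not say how idempotency is achieved. One common approach, sketched below under that caveat, is to key each document by file name and skip or overwrite the row based on a content hash. The `upsert_document` helper and the two-column table are hypothetical.

```python
import hashlib
import sqlite3


def upsert_document(conn: sqlite3.Connection, file_name: str, content: bytes) -> bool:
    """Insert or refresh a document row; return True only if something changed."""
    digest = hashlib.sha256(content).hexdigest()
    row = conn.execute(
        "SELECT content_hash FROM documents WHERE file_name = ?", (file_name,)
    ).fetchone()
    if row is not None and row[0] == digest:
        return False  # same bytes as last run, nothing to re-index

    # SQLite upsert: a second run updates the existing row instead of duplicating it.
    conn.execute(
        "INSERT INTO documents (file_name, content_hash) VALUES (?, ?) "
        "ON CONFLICT(file_name) DO UPDATE SET content_hash = excluded.content_hash",
        (file_name, digest),
    )
    conn.commit()
    return True


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (file_name TEXT PRIMARY KEY, content_hash TEXT NOT NULL)"
)
```

With this pattern, re-running the pipeline on an unchanged `data/raw/` directory leaves the row count stable, and the boolean return value can be used to decide whether a document's chunks need re-embedding.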
Running the MCP Server
Start the MCP server:
python -m app.server
The server exposes procurement retrieval capabilities through the Model Context Protocol (MCP).
Rather than exposing raw database queries, the MCP server provides business-oriented tools that allow AI assistants to retrieve grounded procurement evidence using natural language.
Verifying the Installation
After executing the ingestion pipeline, verify that the following artifacts have been created:
storage/
├── knowledge.db
└── chroma/
The SQLite database contains:
Documents
Extracted metadata
Chunk metadata
Document relationships
The Chroma directory contains the semantic vector index.
If both artifacts exist, the knowledge base has been successfully created.
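The same check can be scripted. The sketch below only tests that the two artifacts exist and are non-empty; it does not validate their contents. The function name `knowledge_base_ready` is made up for this example.

```python
from pathlib import Path


def knowledge_base_ready(storage_dir: str = "storage") -> bool:
    """Return True when knowledge.db and a non-empty chroma/ directory are present."""
    root = Path(storage_dir)
    db_file = root / "knowledge.db"
    chroma_dir = root / "chroma"

    db_ok = db_file.is_file() and db_file.stat().st_size > 0
    chroma_ok = chroma_dir.is_dir() and any(chroma_dir.iterdir())
    return db_ok and chroma_ok
```

Calling `knowledge_base_ready()` from the project root after `python run_pipeline.py` gives a quick pass/fail signal before starting the MCP server.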
Claude Desktop Integration
The MCP server can be consumed directly from Claude Desktop, enabling natural language interaction with the procurement knowledge base.
Configure Claude Desktop
Open the Claude Desktop configuration file.
macOS
~/Library/Application Support/Claude/claude_desktop_config.json
Add the following configuration:
{
  "mcpServers": {
    "caseware-ai-data-mcp": {
      "command": "/absolute/path/to/env/bin/python",
      "args": [
        "-m",
        "app.server"
      ],
      "cwd": "/absolute/path/to/caseware-ai-data-mcp",
      "env": {
        "PYTHONPATH": "/absolute/path/to/caseware-ai-data-mcp"
      }
    }
  }
}
Replace the placeholder paths with your local project paths.
Restart Claude Desktop after saving the configuration.
Available MCP Tools
Tool | Description |
| Semantic retrieval across indexed procurement documents |
| Hybrid metadata + semantic retrieval |
| Retrieves supporting procurement documents for an Order ID |
| Performs a lightweight procurement audit |
| Detects invoices without matching purchase orders |
Quick Validation
After connecting Claude Desktop, ask the following question:
Which documents support order 10248?
Expected response:
Invoice
Purchase Order
Shipping Order
This confirms that:
the ingestion pipeline executed successfully
SQLite contains the extracted metadata
ChromaDB contains the semantic index
the MCP server is running correctly
Claude Desktop can retrieve grounded procurement evidence
Retrieval Strategy
The platform implements a lightweight Hybrid Retrieval architecture that combines deterministic metadata lookup with semantic vector search.
Metadata Retrieval
During ingestion, structured procurement entities are extracted and stored in SQLite.
Examples include:
Order IDs
Invoice Numbers
Purchase Order Numbers
Vendor Names
Dates
Amounts
Queries containing explicit identifiers are resolved through deterministic lookups, providing fast and highly accurate results.
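To make the idea of deterministic extraction concrete, here is a small regex sketch for one field, the order ID. The pattern and the `extract_order_ids` helper are assumptions for illustration; the project's `model.py` may use different rules and covers more fields.

```python
import re

# Hypothetical pattern: "order", an optional label (ID, no., number, #), then 4+ digits.
ORDER_ID = re.compile(
    r"\border\s*(?:id|no\.?|number|#)?\s*[:#]?\s*(\d{4,})\b",
    re.IGNORECASE,
)


def extract_order_ids(text: str) -> list[str]:
    """Return the distinct order IDs found in text, in order of first appearance."""
    seen: list[str] = []
    for match in ORDER_ID.finditer(text):
        order_id = match.group(1)
        if order_id not in seen:
            seen.append(order_id)
    return seen
```

Because the same pattern can be applied to both documents at ingestion time and questions at query time, an identifier in a question maps directly to an indexed column rather than to a similarity score.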
Semantic Retrieval
Natural language questions are answered using semantic similarity search.
Document chunks are embedded using SentenceTransformers and indexed in ChromaDB.
Typical semantic queries include:
Summarize payment terms.
Find supplier obligations.
What inventory reports mention warehouse damage?
Which contracts discuss delivery conditions?
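In the project, ChromaDB performs the similarity search. The toy sketch below only illustrates the underlying idea of ranking chunks by cosine similarity; the three-dimensional vectors and chunk IDs are invented for the example and are not real SentenceTransformers embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 2):
    """Rank chunk IDs by similarity to the query vector, best first."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up vectors standing in for embedded chunks.
toy_index = {
    "contract:p1:c0": [0.9, 0.1, 0.0],
    "invoice:p1:c0": [0.1, 0.9, 0.0],
    "inventory:p2:c1": [0.0, 0.2, 0.9],
}
results = top_k([1.0, 0.0, 0.0], toy_index, k=2)
```

A real query would first be embedded with the same model used at indexing time, so that query and chunk vectors live in the same space.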
Hybrid Retrieval
The retrieval layer automatically selects the most appropriate strategy based on the query.
For example:
Which documents support order 10248?
The system:
Detects the Order ID.
Retrieves matching procurement documents from SQLite.
Complements the response with semantic evidence when applicable.
Returns grounded citations.
This approach provides better precision than relying exclusively on vector search.
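The routing described above can be sketched roughly as follows. This is not the project's `hybrid.py`: the five-digit ID pattern, the table layout, the result shape, and the stubbed `semantic_search` callable are all assumptions, and `po_10248.pdf` is an illustrative file name.

```python
import re
import sqlite3
from typing import Callable

# Assumption: order IDs are five-digit numbers such as 10248.
ORDER_ID = re.compile(r"\b(\d{5})\b")


def hybrid_search(
    query: str,
    conn: sqlite3.Connection,
    semantic_search: Callable[[str], list[dict]],
) -> dict:
    """Prefer an exact metadata match; always attach semantic evidence."""
    evidence = semantic_search(query)
    match = ORDER_ID.search(query)
    if match:
        rows = conn.execute(
            "SELECT file_name, doc_type FROM documents "
            "WHERE order_id = ? ORDER BY file_name",
            (match.group(1),),
        ).fetchall()
        if rows:
            documents = [{"file": f, "type": t} for f, t in rows]
            return {"strategy": "metadata+semantic", "documents": documents, "evidence": evidence}
    # No identifier, or no row for it: fall back to semantic results only.
    return {"strategy": "semantic", "documents": [], "evidence": evidence}


def fake_semantic_search(query: str) -> list[dict]:
    """Stub standing in for a vector search call."""
    return [{"file": "contract.pdf", "score": 0.5}]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (file_name TEXT, doc_type TEXT, order_id TEXT)")
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?)",
    [("invoice_10248.pdf", "invoice", "10248"), ("po_10248.pdf", "purchase_order", "10248")],
)
```

Passing the semantic search in as a callable keeps the routing logic testable without loading an embedding model.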
Citation Strategy
Every retrieval result includes references to the original source document whenever possible.
Example:
{
  "file": "invoice_10248.pdf",
  "page": 1,
  "chunk": 0
}
This enables AI assistants to generate grounded and explainable responses rather than unsupported summaries.
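A citation with the three fields shown above could be modeled as a small dataclass. The class name, the `label` format, and the `attach_citation` helper are hypothetical; only the `file`, `page`, and `chunk` keys come from the example.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Citation:
    file: str
    page: int
    chunk: int

    def label(self) -> str:
        """Short human-readable reference for display next to an answer."""
        return f"{self.file}, p. {self.page}, chunk {self.chunk}"


def attach_citation(text: str, citation: Citation) -> dict:
    """Bundle a retrieved passage with the source it came from."""
    return {"text": text, "source": asdict(citation)}


hit = attach_citation("example chunk text", Citation("invoice_10248.pdf", 1, 0))
```

Keeping the citation as structured data, rather than a preformatted string, lets the client decide how to render the reference.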
Example Questions
Once connected through Claude Desktop (or another MCP-compatible client), the following questions can be executed:
Which documents support order 10248?
Compare procurement documents for order 10248.
Which invoices are missing purchase orders?
Summarize the payment terms in the supplier contract.
Find evidence related to vendor Paul Henriot.
What inventory reports are available?
Design Decisions
The implementation intentionally favors simplicity over unnecessary complexity while demonstrating the architectural patterns expected from an AI-ready data platform.
Key design decisions include:
SQLite provides a lightweight metadata store requiring no external infrastructure.
ChromaDB enables local semantic retrieval without requiring managed vector databases.
SentenceTransformers generates embeddings locally without external AI services.
FastMCP exposes business-oriented capabilities through the Model Context Protocol.
Hybrid Retrieval combines deterministic matching with semantic similarity to improve retrieval accuracy.
These choices keep the project reproducible, easy to understand, and aligned with the challenge scope.
Engineering Trade-offs
This implementation intentionally prioritizes:
Simplicity over production-scale infrastructure.
Explainability over complex AI pipelines.
Local execution over cloud deployment.
Modular design over tightly coupled components.
Deterministic metadata extraction combined with semantic retrieval.
The objective is to demonstrate sound AI Data Engineering principles rather than build a production-ready enterprise platform.
Future Improvements
Potential production enhancements include:
Schema-constrained LLM-based metadata extraction.
Confidence scoring for extracted fields.
BM25 + Vector hybrid ranking.
Line-item reconciliation across procurement documents.
Human review workflows for low-confidence matches.
OpenSearch or AWS Bedrock Knowledge Bases for cloud deployment.
Observability with LangFuse, LangSmith, or OpenTelemetry.
AI-Assisted Development
This project was developed with AI-assisted development support for architectural brainstorming, implementation scaffolding, documentation, and code refinement.
All generated code was manually reviewed, integrated, executed locally, and validated by:
Running the ingestion pipeline.
Verifying SQLite outputs.
Validating ChromaDB indexing.
Testing metadata extraction.
Executing semantic and hybrid retrieval.
Testing all MCP tools.
Validating end-to-end integration with Claude Desktop.
The final implementation, architecture, and engineering decisions were manually reviewed to ensure correctness, reproducibility, and alignment with the challenge requirements.
Why this Architecture?
The solution intentionally separates:
Ingestion
Knowledge Storage
Retrieval
MCP Interface
This modular architecture minimizes coupling and allows each layer to evolve independently.
For example:
SQLite can be replaced with PostgreSQL.
ChromaDB can be replaced with OpenSearch or another vector database.
The embedding model can be replaced without changing the retrieval layer.
OCR can be replaced without impacting downstream processing.
This design improves maintainability, extensibility, and testability while remaining intentionally lightweight for the scope of the exercise.
License
This project was developed exclusively for the Caseware AI Data Platform Take-Home Assessment.
Acknowledgements
This project was developed as part of the Caseware AI Data Platform technical assessment.
The goal was to demonstrate an AI-ready procurement knowledge platform using lightweight, explainable, and reproducible engineering practices.