caseware-ai-procurement-knowledge-platform

by MrTechi-Dev

Caseware AI Procurement Knowledge Platform

AI-ready procurement knowledge platform built with a local data pipeline, hybrid retrieval, and the Model Context Protocol (MCP).

Overview

This project implements an end-to-end AI-ready data platform for procurement and inventory documents.

The solution demonstrates how structured and unstructured business documents can be ingested, transformed into searchable knowledge, and exposed through an MCP (Model Context Protocol) server, allowing AI assistants to retrieve evidence, reason across related documents, and generate grounded responses with source references.

The implementation intentionally remains lightweight and fully local while showcasing modern AI Data Engineering concepts, including:

PDF and image ingestion
OCR fallback for scanned documents
Structured metadata extraction
Semantic embeddings
Hybrid retrieval
Cross-document relationship matching
MCP tool integration
Grounded AI responses

The architecture prioritizes simplicity, explainability, and reproducibility, following the challenge recommendation to avoid over-engineering.

Key Capabilities

PDF document ingestion
OCR using Tesseract
Native PDF parsing with PyMuPDF
Structured metadata extraction
SQLite metadata store
ChromaDB vector database
SentenceTransformers embeddings
Hybrid retrieval (metadata + semantic search)
Cross-document relationship matching
Procurement document comparison
MCP server integration
Claude Desktop integration
Grounded source citations

Design Goals

The solution was intentionally designed to demonstrate the core architectural components of an AI-ready data platform while keeping the implementation easy to understand and reproduce.

Primary goals include:

Reproducible local execution
AI-ready document preparation
Explainable retrieval
Hybrid search combining deterministic metadata and semantic similarity
Modular architecture with clear separation of concerns
Agent integration through MCP

Rather than focusing on production-scale infrastructure, the implementation emphasizes engineering decisions, maintainability, and retrieval quality.

High-Level Architecture

flowchart TD

    A[Raw Procurement Documents]

    A --> B[PDF Parser]
    A --> C[OCR - Tesseract]

    B --> D[Extracted Text]
    C --> D

    D --> E[Chunking]

    E --> F[Metadata Extraction]
    E --> G[SentenceTransformers Embeddings]

    F --> H[(SQLite)]
    G --> I[(ChromaDB)]

    H --> J[Hybrid Retrieval Layer]
    I --> J

    J --> K[FastMCP Server]

    K --> L[Claude Desktop]

Data Flow

The ingestion pipeline performs the following steps:

Load procurement documents from the local filesystem.
Parse native PDFs using PyMuPDF.
Apply OCR to scanned documents using Tesseract.
Normalize extracted text.
Extract structured procurement metadata.
Split documents into retrieval-ready chunks.
Generate semantic embeddings.
Store structured metadata in SQLite.
Store semantic vectors in ChromaDB.
Expose retrieval capabilities through an MCP server.
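The chunking step in the list above can be sketched as follows. The source does not describe the chunking strategy, so the fixed-size word window with overlap, the function name `chunk_text`, and the default sizes are all assumptions chosen for illustration.

```python
def chunk_text(text: str, max_words: int = 120, overlap: int = 20) -> list[dict]:
    """Split text into overlapping word windows (hypothetical strategy)."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("require max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap  # how far each window advances
    chunks = []
    for index, start in enumerate(range(0, len(words), step)):
        window = words[start:start + max_words]
        chunks.append({"chunk": index, "text": " ".join(window)})
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of some duplicated text in the index.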

Technology Stack

Layer	Technology
Language	Python
PDF Parsing	PyMuPDF
OCR	Tesseract
Embeddings	SentenceTransformers
Metadata Store	SQLite
Vector Database	ChromaDB
MCP Framework	FastMCP
AI Client	Claude Desktop

Project Structure

The project is organized into independent modules following a clear separation of concerns. Each component has a single responsibility, making the solution easier to understand, maintain, and extend.

caseware-ai-data-mcp/
│
├── app/
│   ├── pipeline/
│   │   ├── extract.py          # PDF parsing and OCR
│   │   ├── chunk.py            # Document chunking
│   │   ├── model.py            # Metadata extraction
│   │   ├── ingest.py           # End-to-end ingestion pipeline
│   │   └── index.py            # ChromaDB indexing
│   │
│   ├── retrieval/
│   │   ├── search.py           # Semantic retrieval
│   │   ├── matching.py         # Cross-document matching
│   │   ├── hybrid.py           # Hybrid retrieval
│   │   └── citations.py        # Source references
│   │
│   ├── db.py                   # SQLite initialization
│   └── server.py               # FastMCP server
│
├── data/
│   └── raw/                    # Procurement documents
│
├── storage/
│   ├── knowledge.db            # SQLite metadata store
│   └── chroma/                 # ChromaDB vector index
│
├── run_pipeline.py
├── requirements.txt
└── README.md

Module Responsibilities

Pipeline

The pipeline transforms raw procurement documents into AI-ready knowledge.

Responsibilities include:

Reading procurement documents
Parsing PDF files
Running OCR when required
Extracting structured metadata
Chunking document content
Generating semantic embeddings
Populating SQLite
Building the ChromaDB vector index
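A minimal sketch of the "OCR when required" decision, assuming the fallback is triggered when native parsing returns very little text. The threshold `min_chars`, the function name `extract_text`, and the two injected callables are hypothetical; in the project they would presumably wrap PyMuPDF and Tesseract, which are not called here.

```python
from typing import Callable

# Image formats listed as supported; these always need OCR.
IMAGE_SUFFIXES = (".png", ".jpg", ".jpeg", ".tiff", ".bmp")


def extract_text(
    path: str,
    parse_native: Callable[[str], str],
    run_ocr: Callable[[str], str],
    min_chars: int = 50,  # assumed threshold, not taken from the source
) -> tuple[str, str]:
    """Return (text, method), where method is "native" or "ocr"."""
    if path.lower().endswith(IMAGE_SUFFIXES):
        return run_ocr(path), "ocr"
    text = parse_native(path)
    if len(text.strip()) >= min_chars:
        return text, "native"
    # A PDF with almost no text layer is treated as a scan.
    return run_ocr(path), "ocr"
```

Passing the parser and OCR engine in as callables keeps the routing logic testable without either dependency installed.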

Retrieval

The retrieval layer is responsible for answering user questions.

It combines two complementary strategies:

Deterministic metadata lookup
Semantic vector search

This hybrid approach improves retrieval precision while maintaining flexibility for natural language queries.

Storage

Structured and semantic information are intentionally stored separately.

SQLite

Stores:

Document metadata
Extracted procurement fields
Chunk metadata
Document relationships

ChromaDB

Stores:

Sentence embeddings
Semantic vector index

Separating these responsibilities keeps the architecture simple while allowing each technology to focus on its strengths.
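One way the SQLite side of this split could look is sketched below. The table and column names are assumptions made for illustration; the source lists what is stored but not the actual schema.

```python
import sqlite3

# Hypothetical schema covering the four kinds of data listed above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS documents (
    id INTEGER PRIMARY KEY,
    file TEXT NOT NULL UNIQUE,
    doc_type TEXT NOT NULL,      -- e.g. invoice, purchase_order
    order_id TEXT,               -- extracted procurement field
    vendor TEXT                  -- extracted procurement field
);
CREATE TABLE IF NOT EXISTS chunks (
    id INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES documents(id),
    page INTEGER NOT NULL,
    chunk INTEGER NOT NULL
);
CREATE TABLE IF NOT EXISTS relationships (
    source_id INTEGER NOT NULL REFERENCES documents(id),
    target_id INTEGER NOT NULL REFERENCES documents(id),
    relation TEXT NOT NULL       -- e.g. same_order
);
"""


def init_db(path: str = ":memory:") -> sqlite3.Connection:
    """Create the tables if they do not exist and return the connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

Embeddings are deliberately absent from this schema: in the design described above they live only in ChromaDB, keyed by chunk.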

MCP Server

The FastMCP server exposes business-oriented retrieval capabilities rather than direct database access.

Available operations include:

Search procurement documents
Retrieve supporting documents for an order
Compare procurement documents
Detect missing purchase orders
Execute hybrid retrieval

This abstraction allows AI assistants to interact with procurement knowledge through natural language instead of SQL queries.

Installation

Prerequisites

Before running the project, install the following software:

Dependency	Version
Python	3.11+
Git	Latest
Tesseract OCR	Latest
Claude Desktop (optional)	Latest

Clone the Repository

git clone <repository-url>

cd caseware-ai-data-mcp

Create a Virtual Environment

macOS / Linux

python -m venv env

source env/bin/activate

Windows

python -m venv env

env\Scripts\activate

Install Python Dependencies

pip install -r requirements.txt

Install OCR

macOS

brew install tesseract

Ubuntu

sudo apt install tesseract-ocr

Windows

Download and install Tesseract from:

https://github.com/UB-Mannheim/tesseract/wiki

Verify the installation:

tesseract --version

Preparing the Dataset

Place the procurement documents inside the data/raw/ directory.

data/
└── raw/
    ├── contracts/
    ├── invoices/
    ├── purchase_orders/
    ├── shipping_orders/
    └── inventory_reports/

Supported document formats:

PDF
PNG
JPG
JPEG
TIFF
BMP

Note
The original procurement documents are not included in this repository because they are part of the challenge dataset. Place the provided files under data/raw/ before running the ingestion pipeline.

Running the Pipeline

Build the local knowledge base by executing:

python run_pipeline.py

Example output:

{
    "documents_processed": 45,
    "chunks_indexed": 179
}

The ingestion pipeline performs the following tasks:

Reads procurement documents
Parses PDF files
Applies OCR when required
Extracts structured metadata
Generates retrieval-ready chunks
Creates semantic embeddings
Stores metadata in SQLite
Builds the ChromaDB vector index
Creates document relationships

The pipeline is idempotent and may be executed multiple times.
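The source does not say how idempotency is achieved. One common approach, sketched here as an assumption, is to key each document by file name and skip it when its content hash is unchanged. The table layout and the helper names `open_store` and `upsert_document` are hypothetical.

```python
import hashlib
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open a store with one row per source file (hypothetical layout)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "file TEXT PRIMARY KEY, sha256 TEXT NOT NULL)"
    )
    return conn


def upsert_document(conn: sqlite3.Connection, file: str, content: bytes) -> bool:
    """Insert or refresh one document; return True if anything changed."""
    digest = hashlib.sha256(content).hexdigest()
    row = conn.execute(
        "SELECT sha256 FROM documents WHERE file = ?", (file,)
    ).fetchone()
    if row is not None and row[0] == digest:
        return False  # same content as last run, nothing to do
    # The primary key on file makes this replace rather than duplicate.
    conn.execute(
        "INSERT OR REPLACE INTO documents (file, sha256) VALUES (?, ?)",
        (file, digest),
    )
    conn.commit()
    return True
```

With this pattern a second run over the same files leaves the row count unchanged, and only modified files would need to be re-chunked and re-embedded.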

Running the MCP Server

Start the MCP server:

python -m app.server

The server exposes procurement retrieval capabilities through the Model Context Protocol (MCP).

Rather than exposing raw database queries, the MCP server provides business-oriented tools that allow AI assistants to retrieve grounded procurement evidence using natural language.

Verifying the Installation

After executing the ingestion pipeline, verify that the following artifacts have been created:

storage/
├── knowledge.db
└── chroma/

The SQLite database contains:

Documents
Extracted metadata
Chunk metadata
Document relationships

The Chroma directory contains the semantic vector index.

If both artifacts exist, the knowledge base has been successfully created.
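The check can also be scripted. The sketch below uses only the standard library and the paths named above; the function name `check_artifacts` is hypothetical, and the script only lists whatever tables it finds rather than assuming their names.

```python
import sqlite3
from pathlib import Path


def check_artifacts(storage_dir: str = "storage") -> dict:
    """Report whether knowledge.db and a non-empty chroma/ directory exist."""
    root = Path(storage_dir)
    db_path = root / "knowledge.db"
    chroma_dir = root / "chroma"
    tables: list[str] = []
    if db_path.is_file():
        conn = sqlite3.connect(db_path)
        try:
            rows = conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
            tables = [row[0] for row in rows]
        finally:
            conn.close()
    return {
        "knowledge_db": db_path.is_file(),
        "chroma_dir": chroma_dir.is_dir() and any(chroma_dir.iterdir()),
        "tables": tables,
    }
```

Calling `check_artifacts()` from the project root after `python run_pipeline.py` should report both artifacts as present if ingestion completed.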

Claude Desktop Integration

The MCP server can be consumed directly from Claude Desktop, enabling natural language interaction with the procurement knowledge base.

Configure Claude Desktop

Open the Claude Desktop configuration file.

macOS

~/Library/Application Support/Claude/claude_desktop_config.json

Add the following configuration:

{
  "mcpServers": {
    "caseware-ai-data-mcp": {
      "command": "/absolute/path/to/env/bin/python",
      "args": [
        "-m",
        "app.server"
      ],
      "cwd": "/absolute/path/to/caseware-ai-data-mcp",
      "env": {
        "PYTHONPATH": "/absolute/path/to/caseware-ai-data-mcp"
      }
    }
  }
}

Replace the placeholder paths with your local project paths.

Restart Claude Desktop after saving the configuration.
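If the configuration file already lists other MCP servers, the entry has to be merged rather than pasted over the file. A small sketch of that merge follows; the helper name `add_server_entry` is hypothetical, and the entry mirrors the JSON shown above.

```python
import json


def add_server_entry(config_text: str, project_dir: str, python_path: str) -> str:
    """Return config JSON with the caseware-ai-data-mcp entry added or updated."""
    config = json.loads(config_text) if config_text.strip() else {}
    servers = config.setdefault("mcpServers", {})  # keep existing servers
    servers["caseware-ai-data-mcp"] = {
        "command": python_path,
        "args": ["-m", "app.server"],
        "cwd": project_dir,
        "env": {"PYTHONPATH": project_dir},
    }
    return json.dumps(config, indent=2)
```

Both arguments should be absolute paths, for example the project directory and the interpreter inside its `env` virtual environment.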

Available MCP Tools

Tool	Description
`search_documents`	Semantic retrieval across indexed procurement documents
`hybrid_document_search`	Hybrid metadata + semantic retrieval
`get_documents_for_order`	Retrieves supporting procurement documents for an Order ID
`compare_documents_for_order`	Performs a lightweight procurement audit
`get_invoices_missing_purchase_orders`	Detects invoices without matching purchase orders
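As an illustration of what a tool such as `get_invoices_missing_purchase_orders` might do internally, the sketch below expresses the check as a SQL anti-join. The `documents` table, its columns, and the sample rows are assumptions; the project's real schema and query are not shown in the source.

```python
import sqlite3

# Invoices whose order_id has no purchase_order row (LEFT JOIN + IS NULL).
MISSING_PO_QUERY = """
SELECT inv.file
FROM documents AS inv
LEFT JOIN documents AS po
  ON po.order_id = inv.order_id AND po.doc_type = 'purchase_order'
WHERE inv.doc_type = 'invoice' AND po.id IS NULL
ORDER BY inv.file
"""


def build_demo_db() -> sqlite3.Connection:
    """In-memory database with made-up rows, for demonstration only."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents (id INTEGER PRIMARY KEY, file TEXT, "
        "doc_type TEXT, order_id TEXT)"
    )
    conn.executemany(
        "INSERT INTO documents (file, doc_type, order_id) VALUES (?, ?, ?)",
        [
            ("invoice_10248.pdf", "invoice", "10248"),
            ("purchase_order_10248.pdf", "purchase_order", "10248"),
            ("invoice_10249.pdf", "invoice", "10249"),  # no matching PO
        ],
    )
    return conn


def invoices_missing_purchase_orders(conn: sqlite3.Connection) -> list[str]:
    """Return file names of invoices without a matching purchase order."""
    return [row[0] for row in conn.execute(MISSING_PO_QUERY)]
```

An invoice whose `order_id` could not be extracted (NULL) is also reported by this query, since NULL never satisfies the join condition.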

Quick Validation

After connecting Claude Desktop, ask the following question:

Which documents support order 10248?

Expected response:

Invoice
Purchase Order
Shipping Order

This confirms that:

the ingestion pipeline executed successfully
SQLite contains the extracted metadata
ChromaDB contains the semantic index
the MCP server is running correctly
Claude Desktop can retrieve grounded procurement evidence

Retrieval Strategy

The platform implements a lightweight Hybrid Retrieval architecture that combines deterministic metadata lookup with semantic vector search.

Metadata Retrieval

During ingestion, structured procurement entities are extracted and stored in SQLite.

Examples include:

Order IDs
Invoice Numbers
Purchase Order Numbers
Vendor Names
Dates
Amounts

Queries containing explicit identifiers are resolved through deterministic lookups against these stored fields, so the results are fast and do not depend on similarity ranking.
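A minimal sketch of how such entities could be pulled from extracted text with regular expressions. The patterns, field names, and sample text are assumptions for illustration; the source does not show the extraction rules used in `model.py`.

```python
import re

# Hypothetical patterns; real documents may need different or extra rules.
PATTERNS = {
    "order_id": re.compile(
        r"\bOrder\s*(?:ID|No\.?|#)?\s*:?\s*(\d{4,})", re.IGNORECASE
    ),
    "invoice_number": re.compile(
        r"\bInvoice\s*(?:Number|No\.?|#)\s*:?\s*([A-Z0-9-]+)", re.IGNORECASE
    ),
    "date": re.compile(r"\b(\d{4}-\d{2}-\d{2})\b"),
    "amount": re.compile(r"\bTotal\s*:?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE),
}


def extract_metadata(text: str) -> dict:
    """Return the first match for each field, or None when absent."""
    found = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(text)
        found[field] = match.group(1) if match else None
    if found["amount"] is not None:
        # Drop thousands separators before converting to a number.
        found["amount"] = float(found["amount"].replace(",", ""))
    return found


# Made-up sample text, not taken from the challenge dataset.
SAMPLE = "Invoice No: INV-10248\nOrder ID: 10248\nDate: 2016-07-04\nTotal: $1,863.40"
```

Fields that are not found stay `None` rather than being guessed, which keeps later lookups deterministic.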

Semantic Retrieval

Natural language questions are answered using semantic similarity search.

Document chunks are embedded using SentenceTransformers and indexed in ChromaDB.

Typical semantic queries include:

Summarize payment terms.
Find supplier obligations.
What inventory reports mention warehouse damage?
Which contracts discuss delivery conditions?
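The ranking behind a similarity search can be sketched with plain Python. This is not the project's code: in the project ChromaDB performs the search over SentenceTransformers embeddings, and the distance metric it is configured with is not stated. Cosine similarity and the tiny hand-written vectors below are assumptions used only to show the idea.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: list[float], indexed: dict[str, list[float]], k: int = 3
) -> list[tuple[str, float]]:
    """Return the k chunk ids most similar to the query, best first."""
    scored = [
        (chunk_id, cosine_similarity(query_vec, vec))
        for chunk_id, vec in indexed.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 2-dimensional vectors standing in for real embeddings.
TOY_INDEX = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
```

A vector database replaces the linear scan in `top_k` with an index, but the notion of "closest chunks to the query embedding" is the same.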

Hybrid Retrieval

The retrieval layer automatically selects the most appropriate strategy based on the query.

For example:

Which documents support order 10248?

The system:

Detects the Order ID.
Retrieves matching procurement documents from SQLite.
Complements the response with semantic evidence when applicable.
Returns grounded citations.

Because identifier matches are deterministic, this approach is more precise for identifier-based queries than relying exclusively on vector search.
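The four steps can be sketched as a small router. The regular expression, the function name `hybrid_search`, and the shape of the result dictionaries are assumptions; the two search backends are passed in as callables so the sketch stays independent of SQLite and ChromaDB.

```python
import re
from typing import Callable

# Hypothetical pattern for queries such as "order 10248" or "Order ID #10248".
ORDER_ID = re.compile(r"\border\s*(?:id\s*)?#?\s*(\d{4,})\b", re.IGNORECASE)


def hybrid_search(
    query: str,
    lookup_by_order: Callable[[str], list[dict]],
    semantic_search: Callable[[str], list[dict]],
) -> dict:
    """Prefer exact metadata matches, then add non-duplicate semantic hits."""
    match = ORDER_ID.search(query)
    if match:
        exact = lookup_by_order(match.group(1))
        if exact:
            seen = {hit["file"] for hit in exact}
            extra = [h for h in semantic_search(query) if h["file"] not in seen]
            return {"strategy": "metadata+semantic", "results": exact + extra}
    # No identifier found, or it matched nothing: fall back to semantic search.
    return {"strategy": "semantic", "results": semantic_search(query)}
```

Placing exact matches first means an identifier query is answered from structured data even when the vector search ranks other documents higher.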

Citation Strategy

Every retrieval result includes references to the original source document whenever possible.

Example:

{
  "file": "invoice_10248.pdf",
  "page": 1,
  "chunk": 0
}

This enables AI assistants to generate grounded and explainable responses rather than unsupported summaries.
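A sketch of how retrieval hits could be turned into citation records of the shape shown above. The `Citation` class and `build_citations` helper are hypothetical names; only the three fields come from the example.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Citation:
    file: str
    page: int
    chunk: int


def build_citations(hits: list[dict]) -> list[dict]:
    """Convert hits to citation dicts, dropping repeats and keeping order."""
    seen: set[Citation] = set()
    citations = []
    for hit in hits:
        citation = Citation(hit["file"], hit["page"], hit["chunk"])
        if citation not in seen:  # frozen dataclasses are hashable
            seen.add(citation)
            citations.append(asdict(citation))
    return citations
```

Deduplicating matters in a hybrid setup, where the same chunk can be returned by both the metadata lookup and the semantic search.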

Example Questions

Once connected through Claude Desktop (or another MCP-compatible client), the following questions can be asked:

Which documents support order 10248?

Compare procurement documents for order 10248.

Which invoices are missing purchase orders?

Summarize the payment terms in the supplier contract.

Find evidence related to vendor Paul Henriot.

What inventory reports are available?

Design Decisions

The implementation intentionally favors simplicity over unnecessary complexity while demonstrating the architectural patterns expected from an AI-ready data platform.

Key design decisions include:

SQLite provides a lightweight metadata store requiring no external infrastructure.
ChromaDB enables local semantic retrieval without requiring managed vector databases.
SentenceTransformers generates embeddings locally without external AI services.
FastMCP exposes business-oriented capabilities through the Model Context Protocol.
Hybrid Retrieval combines deterministic matching with semantic similarity to improve retrieval accuracy.

These choices keep the project reproducible, easy to understand, and aligned with the challenge scope.

Engineering Trade-offs

This implementation intentionally prioritizes:

Simplicity over production-scale infrastructure.
Explainability over complex AI pipelines.
Local execution over cloud deployment.
Modular design over tightly coupled components.
Deterministic metadata extraction combined with semantic retrieval.

The objective is to demonstrate sound AI Data Engineering principles rather than build a production-ready enterprise platform.

Future Improvements

Potential production enhancements include:

Schema-constrained LLM-based metadata extraction.
Confidence scoring for extracted fields.
BM25 + Vector hybrid ranking.
Line-item reconciliation across procurement documents.
Human review workflows for low-confidence matches.
OpenSearch or AWS Bedrock Knowledge Bases for cloud deployment.
Observability with LangFuse, LangSmith, or OpenTelemetry.

AI-Assisted Development

This project was developed with AI assistance for architectural brainstorming, implementation scaffolding, documentation, and code refinement.

All generated code was manually reviewed, integrated, executed locally, and validated by:

Running the ingestion pipeline.
Verifying SQLite outputs.
Validating ChromaDB indexing.
Testing metadata extraction.
Executing semantic and hybrid retrieval.
Testing all MCP tools.
Validating end-to-end integration with Claude Desktop.

The final implementation, architecture, and engineering decisions were manually reviewed to ensure correctness, reproducibility, and alignment with the challenge requirements.

Why this Architecture?

The solution intentionally separates:

Ingestion
Knowledge Storage
Retrieval
MCP Interface

This modular architecture minimizes coupling and allows each layer to evolve independently.

For example:

SQLite can be replaced with PostgreSQL.
ChromaDB can be replaced with OpenSearch or another vector database.
The embedding model can be replaced without changing the retrieval layer.
OCR can be replaced without impacting downstream processing.

This design improves maintainability, extensibility, and testability while remaining intentionally lightweight for the scope of the exercise.
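One way to make a layer replaceable is to have the retrieval code depend on a small interface rather than on a concrete database. The sketch below is an assumption about how that could be expressed, not the project's actual structure; `VectorStore`, `InMemoryVectorStore`, and `retrieve` are hypothetical names.

```python
from typing import Protocol


class VectorStore(Protocol):
    """The only operations the retrieval layer is allowed to rely on."""

    def add(self, chunk_id: str, vector: list[float]) -> None: ...

    def query(self, vector: list[float], k: int) -> list[str]: ...


class InMemoryVectorStore:
    """Toy implementation ranked by dot product, useful in tests."""

    def __init__(self) -> None:
        self._vectors: dict[str, list[float]] = {}

    def add(self, chunk_id: str, vector: list[float]) -> None:
        self._vectors[chunk_id] = vector

    def query(self, vector: list[float], k: int) -> list[str]:
        def score(item: tuple[str, list[float]]) -> float:
            return sum(x * y for x, y in zip(vector, item[1]))

        ranked = sorted(self._vectors.items(), key=score, reverse=True)
        return [chunk_id for chunk_id, _ in ranked[:k]]


def retrieve(store: VectorStore, query_vector: list[float], k: int = 2) -> list[str]:
    """Retrieval code sees only the protocol, never the backend."""
    return store.query(query_vector, k)
```

A ChromaDB-backed or OpenSearch-backed class with the same two methods could then be passed to `retrieve` without changing it.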

License

This project was developed exclusively for the Caseware AI Data Platform Take-Home Assessment.

Acknowledgements

This project was developed as part of the Caseware AI Data Platform technical assessment.

The goal was to demonstrate an AI-ready procurement knowledge platform using lightweight, explainable, and reproducible engineering practices.
