caseware-ai-procurement-knowledge-platform

by MrTechi-Dev

Caseware AI Procurement Knowledge Platform

AI-ready procurement knowledge platform built with a local data pipeline, hybrid retrieval, and the Model Context Protocol (MCP).


Overview

This project implements an end-to-end AI-ready data platform for procurement and inventory documents.

The solution demonstrates how structured and unstructured business documents can be ingested, transformed into searchable knowledge, and exposed through an MCP (Model Context Protocol) server. AI assistants can then retrieve evidence, reason across related documents, and generate grounded responses with source references.

The implementation intentionally remains lightweight and fully local while showcasing modern AI Data Engineering concepts, including:

  • PDF and image ingestion

  • OCR fallback for scanned documents

  • Structured metadata extraction

  • Semantic embeddings

  • Hybrid retrieval

  • Cross-document relationship matching

  • MCP tool integration

  • Grounded AI responses

The architecture prioritizes simplicity, explainability, and reproducibility, following the challenge recommendation to avoid over-engineering.


Key Capabilities

  • PDF document ingestion

  • OCR using Tesseract

  • Native PDF parsing with PyMuPDF

  • Structured metadata extraction

  • SQLite metadata store

  • ChromaDB vector database

  • SentenceTransformers embeddings

  • Hybrid retrieval (metadata + semantic search)

  • Cross-document relationship matching

  • Procurement document comparison

  • MCP server integration

  • Claude Desktop integration

  • Grounded source citations


Design Goals

The solution was intentionally designed to demonstrate the core architectural components of an AI-ready data platform while keeping the implementation easy to understand and reproduce.

Primary goals include:

  • Reproducible local execution

  • AI-ready document preparation

  • Explainable retrieval

  • Hybrid search combining deterministic metadata and semantic similarity

  • Modular architecture with clear separation of concerns

  • Agent integration through MCP

Rather than focusing on production-scale infrastructure, the implementation emphasizes engineering decisions, maintainability, and retrieval quality.


High-Level Architecture

flowchart TD

    A[Raw Procurement Documents]

    A --> B[PDF Parser]
    A --> C[OCR - Tesseract]

    B --> D[Extracted Text]
    C --> D

    D --> E[Chunking]

    E --> F[Metadata Extraction]
    E --> G[SentenceTransformers Embeddings]

    F --> H[(SQLite)]
    G --> I[(ChromaDB)]

    H --> J[Hybrid Retrieval Layer]
    I --> J

    J --> K[FastMCP Server]

    K --> L[Claude Desktop]

Data Flow

The ingestion pipeline performs the following steps:

  1. Load procurement documents from the local filesystem.

  2. Parse native PDFs using PyMuPDF.

  3. Apply OCR to scanned documents using Tesseract.

  4. Normalize extracted text.

  5. Extract structured procurement metadata.

  6. Split documents into retrieval-ready chunks.

  7. Generate semantic embeddings.

  8. Store structured metadata in SQLite.

  9. Store semantic vectors in ChromaDB.

  10. Expose retrieval capabilities through an MCP server.
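Steps 2 to 4 hinge on deciding, page by page, whether the native text layer is usable or OCR is needed. The following is a minimal sketch of that decision, not the project's actual code: `extract_page_text`, `MIN_NATIVE_CHARS`, and the 50-character threshold are hypothetical, and the OCR call is passed in as a function so the logic stays independent of any specific library.

```python
from typing import Callable

MIN_NATIVE_CHARS = 50  # hypothetical threshold; tune it for the dataset


def extract_page_text(
    native_text: str,
    run_ocr: Callable[[], str],
    min_chars: int = MIN_NATIVE_CHARS,
) -> tuple[str, str]:
    """Return (text, method) for one page, running OCR only if the text layer is too thin."""
    cleaned = " ".join(native_text.split())  # normalize whitespace (step 4)
    if len(cleaned) >= min_chars:
        return cleaned, "native"
    # Scanned or image-only page: fall back to OCR and normalize its output too.
    return " ".join(run_ocr().split()), "ocr"


# Stand-in for a scanned page: no text layer, so the OCR callable is used.
text, method = extract_page_text("", lambda: "Scanned   invoice\n10248")
print(method, "->", text)
```

In a real pipeline, `native_text` would presumably come from PyMuPDF's `page.get_text()` and `run_ocr` would wrap a Tesseract call on the rendered page image.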


Technology Stack

| Layer | Technology |
| --- | --- |
| Language | Python |
| PDF Parsing | PyMuPDF |
| OCR | Tesseract |
| Embeddings | SentenceTransformers |
| Metadata Store | SQLite |
| Vector Database | ChromaDB |
| MCP Framework | FastMCP |
| AI Client | Claude Desktop |


Project Structure

The project is organized into independent modules following a clear separation of concerns. Each component has a single responsibility, making the solution easier to understand, maintain, and extend.

caseware-ai-data-mcp/
│
├── app/
│   ├── pipeline/
│   │   ├── extract.py          # PDF parsing and OCR
│   │   ├── chunk.py            # Document chunking
│   │   ├── model.py            # Metadata extraction
│   │   ├── ingest.py           # End-to-end ingestion pipeline
│   │   └── index.py            # ChromaDB indexing
│   │
│   ├── retrieval/
│   │   ├── search.py           # Semantic retrieval
│   │   ├── matching.py         # Cross-document matching
│   │   ├── hybrid.py           # Hybrid retrieval
│   │   └── citations.py        # Source references
│   │
│   ├── db.py                   # SQLite initialization
│   └── server.py               # FastMCP server
│
├── data/
│   └── raw/                    # Procurement documents
│
├── storage/
│   ├── knowledge.db            # SQLite metadata store
│   └── chroma/                 # ChromaDB vector index
│
├── run_pipeline.py
├── requirements.txt
└── README.md

Module Responsibilities

Pipeline

The pipeline transforms raw procurement documents into AI-ready knowledge.

Responsibilities include:

  • Reading procurement documents

  • Parsing PDF files

  • Running OCR when required

  • Extracting structured metadata

  • Chunking document content

  • Generating semantic embeddings

  • Populating SQLite

  • Building the ChromaDB vector index
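Chunking is the step that most directly shapes retrieval quality, so a small illustration may help. This is a sketch of one common approach (fixed-size word windows with overlap), not necessarily what `chunk.py` does; the function name and the default sizes are assumptions.

```python
def chunk_text(text: str, max_words: int = 120, overlap: int = 20) -> list[str]:
    """Split text into word windows of at most max_words, sharing `overlap` words."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
print(chunk_text(sample, max_words=4, overlap=1))
# -> ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

The overlap keeps a sentence that straddles a boundary retrievable from either neighboring chunk, at the cost of some duplicated text in the index.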


Retrieval

The retrieval layer is responsible for answering user questions.

It combines two complementary strategies:

  • Deterministic metadata lookup

  • Semantic vector search

Metadata lookup returns exact matches for explicit identifiers, while vector search handles free-form natural language questions. Combining the two improves precision without giving up flexibility.


Storage

Structured and semantic information are intentionally stored separately.

SQLite

Stores:

  • Document metadata

  • Extracted procurement fields

  • Chunk metadata

  • Document relationships

ChromaDB

Stores:

  • Sentence embeddings

  • Semantic vector index

Separating these responsibilities keeps the architecture simple while allowing each technology to focus on its strengths.
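To make the SQLite side concrete, here is a sketch of what a minimal schema could look like. The table and column names are assumptions for illustration and are not taken from `db.py`; in this sketch, document relationships are derived from a shared `order_id` rather than stored in a separate table.

```python
import sqlite3

# Hypothetical schema: names and columns are illustrative only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS documents (
    id        INTEGER PRIMARY KEY,
    file_name TEXT NOT NULL UNIQUE,
    doc_type  TEXT NOT NULL,
    order_id  TEXT,
    vendor    TEXT,
    total     REAL
);
CREATE TABLE IF NOT EXISTS chunks (
    id          INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES documents(id),
    page        INTEGER NOT NULL,
    chunk_index INTEGER NOT NULL,
    UNIQUE (document_id, page, chunk_index)
);
CREATE INDEX IF NOT EXISTS idx_documents_order ON documents(order_id);
"""


def init_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the tables if they do not exist yet."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


conn = init_db()
print([row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
)])
```

The `(document_id, page, chunk_index)` triple mirrors the fields a citation needs, so a vector hit can be traced back to its source row.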


MCP Server

The FastMCP server exposes business-oriented retrieval capabilities rather than direct database access.

Available operations include:

  • Search procurement documents

  • Retrieve supporting documents for an order

  • Compare procurement documents

  • Detect missing purchase orders

  • Execute hybrid retrieval

This abstraction allows AI assistants to interact with procurement knowledge through natural language instead of SQL queries.
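As an illustration of what a business-oriented operation might compute, the sketch below performs a lightweight comparison of the documents found for one order. It is an assumption about the kind of checks involved, not the logic in `matching.py`: the expected document types, the dictionary keys, and the tolerance are all hypothetical.

```python
from math import isclose

# Hypothetical set of document types expected for a complete order.
EXPECTED_TYPES = {"invoice", "purchase_order", "shipping_order"}


def compare_documents(documents: list[dict], tolerance: float = 0.01) -> dict:
    """Flag missing document types and an invoice/purchase-order total mismatch."""
    present = {doc["doc_type"] for doc in documents}
    totals = {
        doc["doc_type"]: doc["total"]
        for doc in documents
        if doc.get("total") is not None
    }
    findings = {"missing_types": sorted(EXPECTED_TYPES - present), "total_mismatch": False}
    if "invoice" in totals and "purchase_order" in totals:
        findings["total_mismatch"] = not isclose(
            totals["invoice"], totals["purchase_order"], abs_tol=tolerance
        )
    return findings


# Made-up sample values for demonstration only.
docs = [
    {"doc_type": "invoice", "total": 440.0},
    {"doc_type": "purchase_order", "total": 450.0},
]
print(compare_documents(docs))
# -> {'missing_types': ['shipping_order'], 'total_mismatch': True}
```

Returning structured findings instead of prose lets the AI assistant phrase the audit result while the checks themselves stay deterministic.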


Installation

Prerequisites

Before running the project, install the following software:

| Dependency | Version |
| --- | --- |
| Python | 3.11+ |
| Git | Latest |
| Tesseract OCR | Latest |
| Claude Desktop (optional) | Latest |


Clone the Repository

git clone <repository-url>

cd caseware-ai-data-mcp

Create a Virtual Environment

macOS / Linux

python -m venv env

source env/bin/activate

Windows

python -m venv env

env\Scripts\activate

Install Python Dependencies

pip install -r requirements.txt

Install OCR

macOS

brew install tesseract

Ubuntu

sudo apt install tesseract-ocr

Windows

Download and install Tesseract from:

https://github.com/UB-Mannheim/tesseract/wiki

Verify the installation:

tesseract --version

Preparing the Dataset

Place the procurement documents inside the data/raw/ directory.

data/
└── raw/
    ├── contracts/
    ├── invoices/
    ├── purchase_orders/
    ├── shipping_orders/
    └── inventory_reports/

Supported document formats:

  • PDF

  • PNG

  • JPG

  • JPEG

  • TIFF

  • BMP
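A loader for this layout can be sketched as follows. It assumes, hypothetically, that the parent folder name is used as the document type; the function and constant names are illustrative and not taken from the project.

```python
from pathlib import Path

# File extensions for the formats listed above.
SUPPORTED_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp"}


def discover_documents(raw_dir: Path) -> list[tuple[Path, str]]:
    """Return (path, doc_type) pairs, using the parent folder name as the type."""
    found = []
    for path in sorted(raw_dir.rglob("*")):
        # Compare suffixes case-insensitively so "SCAN.PDF" is picked up as well.
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES:
            found.append((path, path.parent.name))
    return found
```

Calling `discover_documents(Path("data/raw"))` from the project root would then yield entries such as a path under `invoices/` paired with the type `invoices`.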

Note

The original procurement documents are not included in this repository because they are part of the challenge dataset. Place the provided files under data/raw/ before running the ingestion pipeline.


Running the Pipeline

Build the local knowledge base by executing:

python run_pipeline.py

Example output:

{
    "documents_processed": 45,
    "chunks_indexed": 179
}

The ingestion pipeline performs the following tasks:

  • Reads procurement documents

  • Parses PDF files

  • Applies OCR when required

  • Extracts structured metadata

  • Generates retrieval-ready chunks

  • Creates semantic embeddings

  • Stores metadata in SQLite

  • Builds the ChromaDB vector index

  • Creates document relationships

The pipeline is idempotent and may be executed multiple times.
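One way such idempotence can be achieved is to key each document by file name and skip work when its content hash is unchanged. The sketch below shows that idea with SQLite's upsert syntax; the `content_hash` column and the function name are assumptions, and the project may use a different mechanism.

```python
import hashlib
import sqlite3


def upsert_document(conn: sqlite3.Connection, file_name: str, content: bytes) -> bool:
    """Insert or refresh a document row; return True if the content is new or changed."""
    digest = hashlib.sha256(content).hexdigest()
    row = conn.execute(
        "SELECT content_hash FROM documents WHERE file_name = ?", (file_name,)
    ).fetchone()
    if row and row[0] == digest:
        return False  # unchanged: re-chunking and re-embedding can be skipped
    conn.execute(
        "INSERT INTO documents (file_name, content_hash) VALUES (?, ?) "
        "ON CONFLICT(file_name) DO UPDATE SET content_hash = excluded.content_hash",
        (file_name, digest),
    )
    conn.commit()
    return True


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (file_name TEXT PRIMARY KEY, content_hash TEXT NOT NULL)"
)
print(upsert_document(conn, "invoice_10248.pdf", b"v1"))  # first run: True
print(upsert_document(conn, "invoice_10248.pdf", b"v1"))  # rerun: False
```

With this pattern a second run leaves row counts unchanged, and only modified files trigger new embeddings.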


Running the MCP Server

Start the MCP server:

python -m app.server

The server exposes procurement retrieval capabilities through the Model Context Protocol (MCP).

Rather than exposing raw database queries, the MCP server provides business-oriented tools that allow AI assistants to retrieve grounded procurement evidence using natural language.


Verifying the Installation

After executing the ingestion pipeline, verify that the following artifacts have been created:

storage/
├── knowledge.db
└── chroma/

The SQLite database contains:

  • Documents

  • Extracted metadata

  • Chunk metadata

  • Document relationships

The Chroma directory contains the semantic vector index.

If both artifacts exist and the database tables contain rows, the knowledge base has been created successfully.
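The check can also be scripted. The following sketch makes no assumption about table names: it lists whatever tables exist in `knowledge.db` and counts their rows. The function name `verify_storage` is hypothetical.

```python
import sqlite3
from contextlib import closing
from pathlib import Path


def verify_storage(storage_dir: Path) -> dict:
    """Report whether both artifacts exist and how many rows each SQLite table holds."""
    db_path = storage_dir / "knowledge.db"
    chroma_dir = storage_dir / "chroma"
    report = {
        "db_exists": db_path.is_file(),
        "chroma_populated": chroma_dir.is_dir() and any(chroma_dir.iterdir()),
        "row_counts": {},
    }
    if report["db_exists"]:
        with closing(sqlite3.connect(db_path)) as conn:
            tables = [row[0] for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
            )]
            for table in tables:
                # Names come from sqlite_master itself, so quoting them is enough here.
                count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
                report["row_counts"][table] = count
    return report
```

Running `print(verify_storage(Path("storage")))` from the project root after ingestion should show non-zero row counts; an empty `row_counts` or a zero count suggests the pipeline did not find any documents.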


Claude Desktop Integration

The MCP server can be consumed directly from Claude Desktop, enabling natural language interaction with the procurement knowledge base.

Configure Claude Desktop

Open the Claude Desktop configuration file.

macOS

~/Library/Application Support/Claude/claude_desktop_config.json

Windows

%APPDATA%\Claude\claude_desktop_config.json

Add the following configuration:

{
  "mcpServers": {
    "caseware-ai-data-mcp": {
      "command": "/absolute/path/to/env/bin/python",
      "args": [
        "-m",
        "app.server"
      ],
      "cwd": "/absolute/path/to/caseware-ai-data-mcp",
      "env": {
        "PYTHONPATH": "/absolute/path/to/caseware-ai-data-mcp"
      }
    }
  }
}

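Replace the placeholder paths with your local project paths. On Windows, the virtual environment's interpreter is env\Scripts\python.exe rather than env/bin/python.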

Restart Claude Desktop after saving the configuration.
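To avoid typing absolute paths by hand, the configuration can be generated. This is a convenience sketch rather than part of the project; `build_claude_config` is a hypothetical helper that reproduces the structure shown above.

```python
import json
import sys
from pathlib import Path


def build_claude_config(project_dir: Path, python_path: Path) -> dict:
    """Build the mcpServers entry with absolute paths for this checkout."""
    project = str(project_dir.resolve())
    return {
        "mcpServers": {
            "caseware-ai-data-mcp": {
                "command": str(python_path),
                "args": ["-m", "app.server"],
                "cwd": project,
                "env": {"PYTHONPATH": project},
            }
        }
    }


# Run from the project root with the virtual environment activated, so that
# sys.executable points at the environment's interpreter.
print(json.dumps(build_claude_config(Path.cwd(), Path(sys.executable)), indent=2))
```

The printed JSON can be merged into the existing configuration file; if the file already defines other servers, add only the `caseware-ai-data-mcp` entry under `mcpServers`.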


Available MCP Tools

| Tool | Description |
| --- | --- |
| search_documents | Semantic retrieval across indexed procurement documents |
| hybrid_document_search | Hybrid metadata + semantic retrieval |
| get_documents_for_order | Retrieves supporting procurement documents for an Order ID |
| compare_documents_for_order | Performs a lightweight procurement audit |
| get_invoices_missing_purchase_orders | Detects invoices without matching purchase orders |
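A check like get_invoices_missing_purchase_orders maps naturally onto a SQL anti-join. The sketch below shows the idea against a hypothetical `documents` table; the column names and the sample rows (including order 10249) are made up for illustration and do not describe the project's schema or dataset.

```python
import sqlite3

# Anti-join: invoices for which no purchase order shares the same order_id.
MISSING_PO_SQL = """
SELECT inv.file_name, inv.order_id
FROM documents AS inv
WHERE inv.doc_type = 'invoice'
  AND NOT EXISTS (
      SELECT 1 FROM documents AS po
      WHERE po.doc_type = 'purchase_order' AND po.order_id = inv.order_id
  )
ORDER BY inv.file_name
"""


def invoices_missing_purchase_orders(conn: sqlite3.Connection) -> list[tuple]:
    """Return (file_name, order_id) for every invoice without a matching purchase order."""
    return conn.execute(MISSING_PO_SQL).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (file_name TEXT, doc_type TEXT, order_id TEXT)")
conn.executemany("INSERT INTO documents VALUES (?, ?, ?)", [
    ("invoice_10248.pdf", "invoice", "10248"),
    ("po_10248.pdf", "purchase_order", "10248"),
    ("invoice_10249.pdf", "invoice", "10249"),
])
print(invoices_missing_purchase_orders(conn))
# -> [('invoice_10249.pdf', '10249')]
```

Because the comparison is `po.order_id = inv.order_id`, an invoice whose order ID could not be extracted (NULL) is also reported, which is usually the desired behavior for an audit.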


Quick Validation

After connecting Claude Desktop, ask the following question:

Which documents support order 10248?

Expected response:

  • Invoice

  • Purchase Order

  • Shipping Order

This confirms that:

  • the ingestion pipeline executed successfully

  • SQLite contains the extracted metadata

  • ChromaDB contains the semantic index

  • the MCP server is running correctly

  • Claude Desktop can retrieve grounded procurement evidence


Retrieval Strategy

The platform implements a lightweight Hybrid Retrieval architecture that combines deterministic metadata lookup with semantic vector search.

Metadata Retrieval

During ingestion, structured procurement entities are extracted and stored in SQLite.

Examples include:

  • Order IDs

  • Invoice Numbers

  • Purchase Order Numbers

  • Vendor Names

  • Dates

  • Amounts

Queries containing explicit identifiers are resolved through deterministic lookups, which are fast and return exact matches without depending on embedding similarity.


Semantic Retrieval

Natural language questions are answered using semantic similarity search.

Document chunks are embedded using SentenceTransformers and indexed in ChromaDB.

Typical semantic queries include:

  • Summarize payment terms.

  • Find supplier obligations.

  • What inventory reports mention warehouse damage?

  • Which contracts discuss delivery conditions?
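Conceptually, semantic retrieval ranks chunks by the similarity between the query embedding and each chunk embedding. The sketch below illustrates that ranking with brute-force cosine similarity over toy three-dimensional vectors. It stands in for ChromaDB's index and for real SentenceTransformers embeddings, neither of which is used here; the chunk IDs and vectors are invented.

```python
from math import sqrt


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], indexed: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the IDs of the k chunks most similar to the query vector."""
    ranked = sorted(indexed, key=lambda cid: cosine(query_vec, indexed[cid]), reverse=True)
    return ranked[:k]


# Toy vectors standing in for real sentence embeddings.
indexed = {
    "contract_p1_c0": [0.9, 0.1, 0.0],
    "invoice_p1_c0": [0.1, 0.9, 0.0],
    "report_p2_c1": [0.0, 0.2, 0.9],
}
print(top_k([1.0, 0.2, 0.0], indexed))
# -> ['contract_p1_c0', 'invoice_p1_c0']
```

A vector database performs the same ranking with an approximate index, which matters at scale but does not change the idea.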


Hybrid Retrieval

The retrieval layer automatically selects the most appropriate strategy based on the query.

For example:

Which documents support order 10248?

The system:

  1. Detects the Order ID.

  2. Retrieves matching procurement documents from SQLite.

  3. Complements the response with semantic evidence when applicable.

  4. Returns grounded citations.
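The four steps above can be sketched as a small router. This is a simplified illustration, not the code in `hybrid.py`: the five-digit order ID pattern, the function signature, and the result dictionaries are assumptions, and the two retrieval backends are passed in as callables.

```python
import re
from typing import Callable

ORDER_ID = re.compile(r"\b\d{5}\b")  # assumption: order IDs are five-digit numbers


def hybrid_search(
    query: str,
    metadata_lookup: Callable[[str], list[dict]],
    semantic_search: Callable[[str, int], list[dict]],
    k: int = 3,
) -> list[dict]:
    """Put exact metadata hits first, then add semantic hits for files not yet seen."""
    results = []
    match = ORDER_ID.search(query)
    if match:
        results = [dict(hit, source="metadata") for hit in metadata_lookup(match.group())]
    seen = {hit["file"] for hit in results}
    for hit in semantic_search(query, k):
        if hit["file"] not in seen:
            results.append(dict(hit, source="semantic"))
            seen.add(hit["file"])
    return results


# Stub backends for demonstration; real ones would query SQLite and ChromaDB.
def lookup(order_id: str) -> list[dict]:
    return [{"file": f"invoice_{order_id}.pdf"}]


def search(query: str, k: int) -> list[dict]:
    return [{"file": "invoice_10248.pdf"}, {"file": "contract.pdf"}][:k]


print(hybrid_search("Which documents support order 10248?", lookup, search))
```

Tagging each hit with its `source` keeps the response explainable: the assistant can tell which evidence came from an exact match and which from similarity search.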

This approach provides better precision than relying exclusively on vector search, because identifiers such as order numbers are matched exactly rather than by embedding similarity.


Citation Strategy

Every retrieval result includes references to the original source document whenever possible.

Example:

{
  "file": "invoice_10248.pdf",
  "page": 1,
  "chunk": 0
}

This enables AI assistants to generate grounded and explainable responses rather than unsupported summaries.
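A citation with these three fields can be modeled as a small immutable record. The class and helper below are a sketch with hypothetical names, not the contents of `citations.py`.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Citation:
    """Points a retrieved passage back to its file, page, and chunk."""
    file: str
    page: int
    chunk: int

    def label(self) -> str:
        """Human-readable form for display next to an answer."""
        return f"{self.file}, p. {self.page}, chunk {self.chunk}"


def attach_citation(text: str, citation: Citation) -> dict:
    """Bundle a retrieved passage with its source reference."""
    return {"text": text, "citation": asdict(citation)}


cite = Citation(file="invoice_10248.pdf", page=1, chunk=0)
print(cite.label())  # -> invoice_10248.pdf, p. 1, chunk 0
print(attach_citation("Example passage text", cite))
```

Keeping the citation as structured data rather than a formatted string lets each client decide how to render it.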


Example Questions

Once connected through Claude Desktop (or another MCP-compatible client), the following questions can be asked:

Which documents support order 10248?

Compare procurement documents for order 10248.

Which invoices are missing purchase orders?

Summarize the payment terms in the supplier contract.

Find evidence related to vendor Paul Henriot.

What inventory reports are available?

Design Decisions

The implementation intentionally favors simplicity over unnecessary complexity while demonstrating the architectural patterns expected from an AI-ready data platform.

Key design decisions include:

  • SQLite provides a lightweight metadata store requiring no external infrastructure.

  • ChromaDB enables local semantic retrieval without requiring managed vector databases.

  • SentenceTransformers generates embeddings locally without external AI services.

  • FastMCP exposes business-oriented capabilities through the Model Context Protocol.

  • Hybrid Retrieval combines deterministic matching with semantic similarity to improve retrieval accuracy.

These choices keep the project reproducible, easy to understand, and aligned with the challenge scope.


Engineering Trade-offs

This implementation intentionally prioritizes:

  • Simplicity over production-scale infrastructure.

  • Explainability over complex AI pipelines.

  • Local execution over cloud deployment.

  • Modular design over tightly coupled components.

  • Deterministic metadata extraction combined with semantic retrieval.

The objective is to demonstrate sound AI Data Engineering principles rather than build a production-ready enterprise platform.


Future Improvements

Potential production enhancements include:

  • Schema-constrained LLM-based metadata extraction.

  • Confidence scoring for extracted fields.

  • BM25 + Vector hybrid ranking.

  • Line-item reconciliation across procurement documents.

  • Human review workflows for low-confidence matches.

  • OpenSearch or AWS Bedrock Knowledge Bases for cloud deployment.

  • Observability with LangFuse, LangSmith, or OpenTelemetry.


AI-Assisted Development

This project was developed with AI-assisted development support for architectural brainstorming, implementation scaffolding, documentation, and code refinement.

All generated code was manually reviewed, integrated, executed locally, and validated by:

  • Running the ingestion pipeline.

  • Verifying SQLite outputs.

  • Validating ChromaDB indexing.

  • Testing metadata extraction.

  • Executing semantic and hybrid retrieval.

  • Testing all MCP tools.

  • Validating end-to-end integration with Claude Desktop.

The architecture and engineering decisions were likewise reviewed by hand for correctness, reproducibility, and alignment with the challenge requirements.


Why this Architecture?

The solution intentionally separates:

  • Ingestion

  • Knowledge Storage

  • Retrieval

  • MCP Interface

This modular architecture minimizes coupling and allows each layer to evolve independently.

For example:

  • SQLite can be replaced with PostgreSQL.

  • ChromaDB can be replaced with OpenSearch or another vector database.

  • The embedding model can be replaced without changing the retrieval layer.

  • OCR can be replaced without impacting downstream processing.

This design improves maintainability, extensibility, and testability while remaining intentionally lightweight for the scope of the exercise.
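The replaceability described above usually comes from coding the retrieval layer against an interface instead of a concrete library. The sketch below shows that pattern with `typing.Protocol`; `VectorIndex` and `InMemoryIndex` are hypothetical names, and the in-memory class is a stand-in rather than a wrapper around ChromaDB.

```python
from typing import Protocol


class VectorIndex(Protocol):
    """What the retrieval layer needs from any vector store."""

    def add(self, chunk_id: str, vector: list[float]) -> None: ...

    def query(self, vector: list[float], k: int) -> list[str]: ...


class InMemoryIndex:
    """Toy implementation; a ChromaDB or OpenSearch adapter would expose the same methods."""

    def __init__(self) -> None:
        self._vectors: dict[str, list[float]] = {}

    def add(self, chunk_id: str, vector: list[float]) -> None:
        self._vectors[chunk_id] = vector

    def query(self, vector: list[float], k: int) -> list[str]:
        def dot(other: list[float]) -> float:
            return sum(a * b for a, b in zip(vector, other))

        ranked = sorted(self._vectors, key=lambda cid: dot(self._vectors[cid]), reverse=True)
        return ranked[:k]


def retrieve(index: VectorIndex, query_vector: list[float], k: int = 1) -> list[str]:
    """Retrieval code depends only on the protocol, not on a specific backend."""
    return index.query(query_vector, k)


index = InMemoryIndex()
index.add("a", [1.0, 0.0])
index.add("b", [0.0, 1.0])
print(retrieve(index, [0.9, 0.1]))  # -> ['a']
```

Swapping the backend then means writing one new adapter class, while `retrieve` and everything above it stay unchanged.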


License

This project was developed exclusively for the Caseware AI Data Platform Take-Home Assessment.


Acknowledgements

This project was developed as part of the Caseware AI Data Platform technical assessment.

The goal was to demonstrate an AI-ready procurement knowledge platform using lightweight, explainable, and reproducible engineering practices.
