
Procurement Knowledge MCP

Local data pipeline and MCP server for querying a procurement and inventory document corpus (invoices, purchase orders, shipping orders, inventory reports, contracts).

Status: Pipeline and MCP server complete. Run make ingest once, then connect an MCP client to query preprocessed artifacts.

Design decisions and trade-offs: CAPABILITY_TRACKER.md.

Requirements

  • Python 3.11+

  • uv package manager

  • Tesseract OCR (brew install tesseract on macOS)

  • Cursor or another MCP-compatible client

Setup

Extract corpus data

From the project root, unzip the bundled archive. It creates data/ in place (no copying or rearranging files):

unzip test-data.zip

Expected layout: data/invoices/, data/purchase_orders/, data/shipping_orders/, data/inventory_reports/, data/contracts/ (~45 documents).
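A quick way to confirm the archive unpacked as expected is to count the files in each folder. The sketch below is not part of the repository: the helper name count_corpus_files is hypothetical, and it skips dot-prefixed files on the assumption that they should be ignored in the same way the ingest pipeline ignores hidden files.

```python
from pathlib import Path

# Folder names taken from the expected layout above.
EXPECTED_FOLDERS = (
    "invoices",
    "purchase_orders",
    "shipping_orders",
    "inventory_reports",
    "contracts",
)


def count_corpus_files(data_dir: Path) -> dict[str, int]:
    """Return the number of visible files in each expected folder.

    A missing folder is reported as 0. Hypothetical helper for illustration.
    """
    counts: dict[str, int] = {}
    for name in EXPECTED_FOLDERS:
        folder = data_dir / name
        if not folder.is_dir():
            counts[name] = 0
            continue
        # Dot-prefixed names cover ._* AppleDouble files and .DS_Store.
        counts[name] = sum(
            1
            for p in folder.iterdir()
            if p.is_file() and not p.name.startswith(".")
        )
    return counts


if __name__ == "__main__":
    for folder, n in count_corpus_files(Path("data")).items():
        print(f"{folder}: {n}")
```

A folder that reports 0 usually means the archive was unzipped somewhere other than the project root.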

Install uv

curl -LsSf https://astral.sh/uv/install.sh | sh

Virtual environment on external drives

If the project lives on an external volume, uv may fail to create .venv on that drive because of macOS ._* (AppleDouble) files. To work around this, point uv at a virtual environment on the local disk:

cp .env.example .env

Edit .env:

UV_PROJECT_ENVIRONMENT=/Users/you/.venvs/procurement-knowledge-mcp

Then create the parent directory:

mkdir -p ~/.venvs

Run commands through make, or source .env before running bare uv run commands.

Install dependencies

make sync
# or: uv sync --all-groups

Development workflow

make          # same as make check
make check    # ruff + mypy + full pytest with coverage (CI / pre-commit gate)
make test-fast # unit tests only, no coverage (skips corpus ingest)
make test     # full pytest with terminal coverage
make test-integration # slow corpus / ingest tests only
make test-cov # pytest + htmlcov/
make cursor-setup # sync deps + verify MCP launcher
make smoke    # MCP tool smoke test (requires make ingest)
make ingest   # test-fast, then full ingest
make mcp      # test-fast, then MCP server (stdio)

Coverage artifacts (.coverage, htmlcov/) are gitignored.

Architecture

flowchart LR
    subgraph ingest [ingest.py]
        D[discover] --> E[extract] --> M[model]
        M --> C[compare] --> CH[chunk] --> I[index]
    end

    subgraph artifacts [processed/ and index/]
        DJ[documents.jsonl]
        OI[order_index.json]
        CAT[catalog.json]
        OC[order_comparisons.json]
        CJ[chunks.jsonl]
        DB[knowledge.db]
    end

    subgraph mcp [mcp_server.py]
        T1[search_documents]
        T2[list_documents]
        T3[get_document]
        T4[find_document_gaps]
        T5[compare_order_documents]
        T6[get_knowledge_base_summary]
    end

    data[(data/)] --> D
    M --> DJ
    M --> OI
    C --> CAT
    C --> OC
    CH --> CJ
    I --> DB
    DJ --> mcp
    CAT --> mcp
    OC --> mcp
    DB --> T1

Pipeline

discover -> extract -> model -> compare -> chunk -> index

| Stage | Module | Output |
| --- | --- | --- |
| discover | discover.py | File registry, stable doc_id, basename collision index |
| extract | extract.py | PyMuPDF text; Tesseract OCR for JPGs / empty PDFs |
| model | parse.py, model.py | documents.jsonl, order_index.json |
| compare | compare.py | catalog.json, order_comparisons.json |
| chunk | chunk.py | chunks.jsonl |
| index | index.py | knowledge.db (SQLite FTS5) |

Typical ingest result: 45 documents, 19 orders, 130 chunks.

Data modeling

  • Content-primary fields: order_id, totals, line items, inventory period parsed from document body (or OCR), with field_provenance on each field.

  • Path-based identity only: doc_id, source_path, doc_type (from folder). Filename patterns are fallbacks, never silent authority for linking.

  • Order index: maps content-derived order_id to doc_id lists. Scanned JPG invoices whose OCR text yields no order_id are excluded and listed in catalog.json.

  • Comparisons: precomputed per-order presence, total matching (±0.01), and line-item checks in order_comparisons.json.
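The presence and total-matching checks can be illustrated with a minimal sketch. This is not the code in compare.py: the names OrderComparison, compare_order, and the total_mismatch flag are hypothetical, and the sketch assumes totals are held as Decimal so that the 0.01 tolerance is not affected by binary floating-point rounding. It also treats a missing total as a missing document, which the real pipeline may distinguish.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from typing import Optional

# Absolute tolerance from the comparison rule above (totals within 0.01 match).
TOLERANCE = Decimal("0.01")


@dataclass
class OrderComparison:
    order_id: str
    present: dict[str, bool]
    totals_match: Optional[bool]  # None when fewer than two totals exist
    flags: list[str] = field(default_factory=list)


def compare_order(
    order_id: str, totals: dict[str, Optional[Decimal]]
) -> OrderComparison:
    """Compare per-document-type totals for one order.

    totals maps a doc type (e.g. "invoice") to its parsed total, or None
    when no document of that type was linked to the order.
    """
    present = {doc_type: total is not None for doc_type, total in totals.items()}
    flags = [f"missing_{doc_type}" for doc_type, found in present.items() if not found]

    known = [t for t in totals.values() if t is not None]
    totals_match: Optional[bool] = None
    if len(known) >= 2:
        # All totals agree when the spread between them is within tolerance.
        totals_match = max(known) - min(known) <= TOLERANCE
        if not totals_match:
            flags.append("total_mismatch")  # hypothetical flag name
    return OrderComparison(order_id, present, totals_match, flags)


if __name__ == "__main__":
    # Illustrative amounts only; not values from the corpus.
    result = compare_order(
        "10687",
        {
            "invoice": Decimal("100.00"),
            "purchase_order": None,
            "shipping_order": Decimal("100.00"),
        },
    )
    print(result.flags, result.totals_match)
```

Precomputing this per order at ingest time means the MCP tool only has to read order_comparisons.json rather than re-parse documents on every query.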

Retrieval strategy

| Query type | Mechanism |
| --- | --- |
| Structured lookup (order, doc type, gaps) | documents.jsonl, order_index.json, catalog.json |
| Cross-document reconciliation | order_comparisons.json |
| Text evidence (contracts, keywords) | FTS5 over chunks.jsonl in knowledge.db |

Chunking: one chunk per page (invoice, PO, shipping, contract); one chunk per inventory report; contract boilerplate stripped.
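To make the text-evidence path concrete, here is a minimal sketch of keyword search over chunks using SQLite FTS5 from Python's standard library. The table name, column names, and sample rows are assumptions made for illustration; the actual schema of knowledge.db is defined in index.py and may differ. The sketch also assumes the local SQLite build has FTS5 enabled.

```python
import sqlite3
from typing import Optional


def build_index(conn: sqlite3.Connection, chunks: list[dict]) -> None:
    """Create an FTS5 table and load chunk rows into it.

    Metadata columns are UNINDEXED: stored and filterable, but not tokenized.
    """
    conn.execute(
        "CREATE VIRTUAL TABLE chunks USING fts5("
        "chunk_id UNINDEXED, doc_id UNINDEXED, doc_type UNINDEXED, text)"
    )
    conn.executemany(
        "INSERT INTO chunks (chunk_id, doc_id, doc_type, text) "
        "VALUES (:chunk_id, :doc_id, :doc_type, :text)",
        chunks,
    )


def search(
    conn: sqlite3.Connection,
    query: str,
    doc_type: Optional[str] = None,
    limit: int = 5,
) -> list[tuple]:
    """Return (chunk_id, doc_id, snippet) rows, best match first."""
    sql = (
        # Column 3 is `text`; snippet() marks matched terms with [ ].
        "SELECT chunk_id, doc_id, snippet(chunks, 3, '[', ']', '...', 8) "
        "FROM chunks WHERE chunks MATCH ?"
    )
    params: list = [query]
    if doc_type is not None:
        sql += " AND doc_type = ?"
        params.append(doc_type)
    sql += " ORDER BY rank LIMIT ?"
    params.append(limit)
    return conn.execute(sql, params).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # Made-up sample chunks, one per page, as in the chunking rule above.
    build_index(
        conn,
        [
            {"chunk_id": "c1", "doc_id": "contracts__sample", "doc_type": "contract",
             "text": "The seller agrees to supply goods as ordered."},
            {"chunk_id": "c2", "doc_id": "invoices__sample", "doc_type": "invoice",
             "text": "Invoice for the supply of goods."},
        ],
    )
    for row in search(conn, "supply goods", doc_type="contract"):
        print(row)
```

A plain multi-word query such as "supply goods" is an implicit AND in FTS5, so both terms must occur in the chunk. There is no stemming with the default tokenizer, which is one reason keyword search misses paraphrased text.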

Identity and duplicate handling

| Rule | Behavior |
| --- | --- |
| source_path | Normalized relative path; duplicate paths in one run are skipped |
| doc_id | Slug from folder + stem (invoices__invoice_10687); hash suffix on collision |
| basename_index | Recorded in catalog.json when the same filename appears in multiple folders |
| Hidden / junk files | ._*, .DS_Store, and hidden files are not ingested |
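The doc_id and junk-file rules can be sketched as follows. This is an illustration rather than the code in discover.py: the function names, the slug normalization, the choice of SHA-1, and the 8-character suffix joined with a double underscore are all assumptions. Only the folder__stem shape and the "hash suffix on collision" behavior come from the rules above.

```python
import hashlib
import re
from pathlib import PurePosixPath


def is_ingestable(rel_path: str) -> bool:
    """Reject ._* AppleDouble files, .DS_Store, and anything hidden.

    All of these start with a dot, so one check on every path part suffices.
    """
    return not any(part.startswith(".") for part in PurePosixPath(rel_path).parts)


def make_doc_id(rel_path: str, taken: set[str]) -> str:
    """Build a folder__stem slug, adding a short hash suffix on collision.

    rel_path is a normalized path relative to data/. The hash is derived
    from the path itself, so the same file gets the same id on every run.
    """
    path = PurePosixPath(rel_path)
    folder = path.parent.name
    stem = re.sub(r"[^a-z0-9]+", "_", path.stem.lower()).strip("_")
    doc_id = f"{folder}__{stem}"
    if doc_id in taken:
        digest = hashlib.sha1(rel_path.encode("utf-8")).hexdigest()[:8]
        doc_id = f"{doc_id}__{digest}"
    taken.add(doc_id)
    return doc_id


if __name__ == "__main__":
    taken: set[str] = set()
    for rel in ["invoices/invoice_10687.pdf", "invoices/._invoice_10687.pdf"]:
        if is_ingestable(rel):
            print(rel, "->", make_doc_id(rel, taken))
        else:
            print(rel, "-> skipped")
```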

Run the ingest pipeline

make ingest

Custom paths:

uv run python ingest.py --data ./data --processed ./processed --index ./index

Run the MCP server

Smoke test all six tools against ingested artifacts (no client required):

make ingest   # once, if processed/ and index/ are missing
make smoke    # or: uv run python scripts/smoke_test_mcp.py

Start the stdio server for Cursor or another MCP client (make mcp runs test-fast first, then blocks):

make mcp

Press Ctrl+C to stop the server.

Required artifacts

make ingest must complete successfully before MCP tools work:

  • processed/documents.jsonl

  • processed/order_index.json

  • processed/catalog.json

  • processed/order_comparisons.json

  • index/knowledge.db
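If a tool call fails right after setup, a pre-flight check along these lines can confirm that ingest produced every file in the list above. The helper name missing_artifacts is hypothetical and is not part of the repository.

```python
from pathlib import Path

# Paths copied from the required-artifacts list, relative to the project root.
REQUIRED_ARTIFACTS = (
    "processed/documents.jsonl",
    "processed/order_index.json",
    "processed/catalog.json",
    "processed/order_comparisons.json",
    "index/knowledge.db",
)


def missing_artifacts(project_root: Path) -> list[str]:
    """Return the required artifact paths that do not exist as files."""
    return [rel for rel in REQUIRED_ARTIFACTS if not (project_root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_artifacts(Path("."))
    if missing:
        print("Missing artifacts; run make ingest:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All required artifacts are present.")
```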

MCP tools

Every tool returns a top-level sources[] array (citations). Use these for grounded answers.

| Tool | Purpose |
| --- | --- |
| search_documents | FTS keyword search; optional doc_type, order_id, period filters |
| list_documents | List documents by metadata |
| get_document | Fetch one record by doc_id; include_text=false by default |
| find_document_gaps | Gap lists from catalog (invoices_missing_po, etc.) |
| compare_order_documents | Precomputed order comparison + field citations |
| get_knowledge_base_summary | Corpus counts, inventory periods, ingest metadata |

Citation schema

Each entry in sources[] includes:

doc_id, source_path, doc_type, page, chunk_id, field, value, snippet, field_provenance, extraction_method, confidence, citation_label

Built by procurement/citations.py (build_citation, build_citation_from_chunk).
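For client code that consumes sources[], the field list above can be written down as a typed record. The field names come from the schema; the types, the Citation class, and the render_sources helper are assumptions made for this sketch and do not describe the actual procurement/citations.py implementation.

```python
from typing import Optional, TypedDict


class Citation(TypedDict):
    """One entry of sources[]. Types are guesses; only the keys are documented."""

    doc_id: str
    source_path: str
    doc_type: str
    page: Optional[int]
    chunk_id: Optional[str]
    field: Optional[str]
    value: Optional[str]
    snippet: Optional[str]
    field_provenance: Optional[str]
    extraction_method: Optional[str]
    confidence: Optional[float]
    citation_label: str


def render_sources(sources: list[Citation]) -> list[str]:
    """Format citations as numbered reference lines, one per distinct label."""
    lines: list[str] = []
    seen: set[str] = set()
    for citation in sources:
        label = citation["citation_label"]
        if label in seen:
            continue  # the same document and field may be cited more than once
        seen.add(label)
        page = citation.get("page")
        location = citation["source_path"]
        if page is not None:
            location += f", p. {page}"
        lines.append(f"[{len(lines) + 1}] {label} ({location})")
    return lines
```

Deduplicating by citation_label keeps an answer's reference list short when several fields of the same document are cited.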

Connect Cursor

Project MCP config is committed at .cursor/mcp.json. It runs scripts/run_mcp.sh, which uses uv and loads .env (for UV_PROJECT_ENVIRONMENT on external drives).

make cursor-setup   # uv sync + chmod launcher
make ingest         # required once before MCP queries work

In Cursor, run Developer: Reload Window, then open Settings → MCP and confirm that procurement-knowledge is connected. In Agent mode, ask the agent to use the procurement-knowledge tools.

To override paths locally, copy the server block to ~/.cursor/mcp.json or edit the project file.

Validate basic questions (Cursor Agent)

Use tests/prompts/test_questions.txt for a manual end-to-end check after ingest and MCP connect:

  1. Run make ingest and confirm procurement-knowledge is connected (see above).

  2. Open Agent chat and attach or paste the prompt file (@tests/prompts/test_questions.txt).

  3. Ask the agent to follow the file instructions: one question at a time, grounded answers with procurement-knowledge tools, and show the answer before the next question.

The prompt covers assignment-style questions (missing POs, order 10687, shipment vs invoice, contract terms, mismatches, inventory periods). Answers should cite sources[] from tool results.

| Prompt question (summary) | Primary tool(s) |
| --- | --- |
| Invoices missing a PO | find_document_gaps |
| PO for invoice 10248 | list_documents |
| Shipment 10603 vs invoice | compare_order_documents |
| Contract supply of goods | search_documents (doc_type=contract) |
| Documents for order 10687 | list_documents |
| Mismatches across doc types | find_document_gaps, compare_order_documents |
| Inventory reports and periods | get_knowledge_base_summary, list_documents |

Automated coverage of the same paths lives in tests/test_mcp_tools.py; make smoke exercises tools without a client.

Example queries (assignment-style)

After make ingest, automated tests in tests/test_mcp_tools.py exercise these paths. Expected results on the bundled corpus:

| Question | Tool | Expected |
| --- | --- | --- |
| Which invoices lack a PO? | find_document_gaps(gap_type="invoices_missing_po") | 10436, 10687, 10839 |
| Documents for order 10687? | list_documents(order_id="10687") | invoice + shipping_order |
| Does shipment match invoice for 10687? | compare_order_documents(order_id="10687") | missing_purchase_order; invoice/shipping totals match |
| Contract supply terms? | search_documents(query="supply goods", doc_type="contract") | Hits on TotalEnergies master contract |
| Inventory periods? | get_knowledge_base_summary() | 2016-07 → 2018-01 |

Sample compare_order_documents("10687") summary:

Order 10687: invoice and shipping totals match; purchase order missing.

Project layout

procurement-knowledge-mcp/
  test-data.zip         # Corpus archive; unzip at repo root → data/
  data/                 # Source documents (read-only; from test-data.zip)
  processed/            # Generated JSON/JSONL (gitignored)
  index/                # SQLite FTS index (gitignored)
  procurement/          # Pipeline library
  ingest.py             # Ingest CLI
  mcp_server.py         # MCP server (stdio)
  tests/                # pytest suite
    prompts/
      test_questions.txt  # Cursor Agent manual validation prompt
  pyproject.toml
  Makefile

Testing

make test-fast        # quick unit loop during development
make test             # full suite with coverage
make test-integration # corpus / full-ingest tests only
make check            # lint + typecheck + full suite with coverage

Current suite: 130 passed, 100% coverage on measured source (make check).

Integration tests (marked @pytest.mark.integration) run discover → ingest on data/ and validate known orders (10687, 10248) and gap lists. make test-fast skips them for a faster feedback loop.

For interactive validation in Cursor, use tests/prompts/test_questions.txt (see Validate basic questions under Connect Cursor).

Verbose:

set -a && source .env && set +a && uv run pytest -v

Troubleshooting

| Issue | Solution |
| --- | --- |
| uv run fails with ._ruff on external drive | Use .env with UV_PROJECT_ENVIRONMENT on local disk; prefer make |
| Processed artifacts missing | Run make ingest |
| make mcp hangs | Normal: the stdio server is waiting for a client; press Ctrl+C to exit |
| SQLite readonly on external drive | Ingest builds knowledge.db in a temp dir then moves it (handled in index.py) |
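The temp-dir workaround in the last row follows a common pattern: create the database where SQLite can write freely, close it, then move the finished file into place. The sketch below shows that pattern with a plain table and hypothetical names; it is not the code in index.py, which builds an FTS5 index.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path


def build_db_then_move(target: Path, rows: list[tuple[str, str]]) -> Path:
    """Build a SQLite file in a local temp dir, then move it to target.

    SQLite needs to create journal files next to the database while writing,
    which can fail on some external volumes. Writing locally avoids that;
    the finished file is moved once the connection is closed.
    """
    target.parent.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as tmp:
        tmp_db = Path(tmp) / target.name
        conn = sqlite3.connect(tmp_db)
        try:
            conn.execute("CREATE TABLE chunks (chunk_id TEXT PRIMARY KEY, text TEXT)")
            conn.executemany("INSERT INTO chunks VALUES (?, ?)", rows)
            conn.commit()
        finally:
            conn.close()  # close before moving so no journal file is left behind
        if target.exists():
            target.unlink()  # replace a previous build
        shutil.move(str(tmp_db), str(target))
    return target
```

shutil.move falls back to copy-and-delete when the temp dir and the target are on different filesystems, which is the usual case for an external drive.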

Known limitations

  • No vector embeddings (FTS keyword search only).

  • Scanned JPG invoices whose OCR text yields no order_id are excluded from the order index.

  • Line-item matching is best-effort regex parsing, not layout-aware extraction.

  • Contract text is page-chunked plain text (no clause segmentation).

AI-assisted development

This project was developed with AI assistance (Cursor). Design decisions follow a content-primary modeling policy with explicit trade-offs for a 3-5 hour take-home scope. Validation: make check (lint, types, 100% coverage), make smoke, pytest in tests/test_mcp_tools.py, and manual Agent runs via tests/prompts/test_questions.txt.
