PocketMCP

more-ingest.md•6.11 KiB

**ROLE:** You are an expert TypeScript/Node developer tasked with extending a local-first MCP server (“PocketMCP”) that ingests files from a watched folder, chunks text, embeds with `Transformers.js` (`Xenova/all-MiniLM-L6-v2`), and indexes into SQLite(+vec).

**DELIVERABLE:** Production-ready changes to add **DOCX** and **text-based PDF** ingestion with strict guardrails, minimal dependencies, and clear observability. **No OCR**, **no legacy .doc**. Keep existing TXT/MD behavior unchanged.

### Constraints & Non-Goals

* Keep it **local-first**, no cloud calls.
* **No OCR** for scanned PDFs. If a PDF has negligible text, mark “needs\_ocr” and skip.
* Do **not** introduce heavy native deps; prefer pure-JS/TS libs.
* Avoid schema bloat. Prefer a small, normalized model with good traceability.
* Backwards compatible: existing data remains valid after migration.

### Dependencies to Use (no code, just integrate)

* PDF: `pdf-parse` (simple, text-first)
* DOCX: `mammoth` (DOCX → text; headings optional)
* CSV is **out of scope** for this task.
* Existing: `@xenova/transformers`, sqlite/vec stack already in the repo.

### Configuration (env)

* `PDF_MAX_PAGES` (default: 300)
* `DOC_MAX_BYTES` (default: 10\_000\_000)
* `PDF_MIN_TEXT_CHARS` (default: 500) → below this per file → mark as `needs_ocr` and skip
* `CHUNK_SIZE` and `CHUNK_OVERLAP` already exist; keep using them.
* `VERBOSE_LOGGING` already exists; extend logs accordingly.

### Data Model

Add/confirm three levels: **documents → segments → chunks**.

* `documents`: `doc_id` (stable per file version), `path`, `content_type`, `size_bytes`, `sha256`, `mtime`, `ingest_status` (`ok|skipped|needs_ocr|too_large|error`), `notes` (short reason)
* `segments`: `segment_id`, `doc_id`, `kind` (`page|section`), `page` (int, null for DOCX), `meta` (JSON; e.g., `{heading:"...",level:2}`), `text`
* `chunks`: `chunk_id`, `segment_id`, `start_char`, `end_char`, `text`, `embedding` (vec)

**Rules**

* **PDF**: one **segment per page** (`kind=page`, `page=n`).
* **DOCX**: default to **one segment per document** unless heading splits are trivial; if splitting, `kind=section` and store heading in `meta`.
* Keep existing chunker; ensure chunks **do not cross segment boundaries**.

### Ingestion Pipeline (detector → extractor → normalizer → chunker → embedder → indexer)

1. **Detect** by extension + light magic sniff:
   * `.pdf` → PDF extractor
   * `.docx` → DOCX extractor
   * `.txt` / `.md` → existing path
   * Everything else: unchanged (ignored)
2. **Extract**
   * **PDF**: get **page-wise text**. If total chars < `PDF_MIN_TEXT_CHARS`, mark document `needs_ocr` and skip.
   * **DOCX**: produce linear text; optionally capture headings if trivial.
3. **Normalize** to `segments[]` per the rules above.
4. **Chunk** using existing `CHUNK_SIZE` / `CHUNK_OVERLAP`, **bounded within each segment**.
5. **Embed** with existing MiniLM model and batching.
6. **Index**: upsert; previous data for the same `doc_id` must be replaced atomically.

### Watcher Behavior

* Compute `sha256`. If unchanged, skip.
* On change, delete old rows for `doc_id` (cascade segments/chunks), then re-ingest.
* Respect caps:
  * PDFs with pages > `PDF_MAX_PAGES` → `ingest_status=too_large` (skip).
  * DOCX with size > `DOC_MAX_BYTES` → `too_large` (skip).
* Encrypted/secured PDFs → `ingest_status=skipped`, `notes="encrypted"`.

### Retrieval & UI Impacts (no UI rewrite; small tweaks)

* Search pipeline unchanged.
* When rendering results, add a **source badge**:
  * PDF: `filename.pdf · p.<page>`
  * DOCX: `filename.docx` (if `meta.heading` exists, append `· § <heading>`)
* When a document is skipped, surface a row in the existing **ingest status**/stats view.

### Logging & Metrics

* For each file: log `content_type`, `size`, `pages` (PDF), `text_chars`, `status`, and reason.
* Count parse failures by type (PDF/DOCX) and show in existing stats endpoint.
* For `VERBOSE_LOGGING=true`, log per-segment char counts and per-batch embedding sizes.

### Error Handling

* All extraction wrapped with clear error classes: `ParseError`, `EncryptedPdfError`, `TooLargeError`.
* On error, set `ingest_status=error` and capture short `notes` (truncate to 200 chars).

### Minimal Public API Adjustments

* Existing search endpoints remain.
* Extend any “doc detail” endpoint to include segments count and per-segment pointers (`page` or `meta.heading`).

### Acceptance Tests (automate)

Prepare fixtures (place under a `fixtures/ingest/` folder) and implement automated tests (no snapshots of large text):

* **DOCX happy path**: medium doc with headings → segments count ≥ 1, total text > 3k chars, status `ok`.
* **PDF happy path**: 10-page digital PDF → segments count == 10, cumulative chars > 5k, status `ok`.
* **PDF low-text**: 5-page scanned PDF (almost no text) → status `needs_ocr`, zero segments.
* **PDF too large**: synthetic with `pages > PDF_MAX_PAGES` → status `too_large`.
* **DOCX too large**: filesize > `DOC_MAX_BYTES` → status `too_large`.
* **Re-ingest**: modify a DOCX; sha changes → old rows removed; new rows inserted; counts updated.
* **Chunk locality**: assert all chunk `segment_id` are consistent and `start_char/end_char` within segment text length.

### Developer Ergonomics

* Add a CLI subcommand to **re-ingest a single path** and to **print an ingest summary** (counts, statuses, reasons).
* Add a feature-flag `DOCX_SPLIT_ON_HEADINGS` (default: false).
  If true, split on headings h1/h2 only.
* Document all env vars in README.

### Performance Guardrails

* Stream extraction where possible (PDF).
* Embed in batches of a reasonable size (keep current defaults).
* Avoid loading entire multi-hundred page PDFs into memory at once.

### Definition of Done

* Schema migration applied and idempotent.
* Unit/integration tests for all **Acceptance Tests** above are passing.
* Ingest status UI shows the new statuses and reasons.
* Searching returns results with correct **page/section** badges.
* README updated with:
  * Supported types: TXT, MD, DOCX, PDF (text-based only)
  * Limits (pages/bytes), skip reasons, and how to override envs
  * Caveat: scanned PDFs require OCR (not implemented)
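
### Illustrative Sketch: Segment-Bounded Chunking

The chunk-locality rule (chunks never cross segment boundaries, and `start_char`/`end_char` stay within the segment text) could be sketched roughly as follows. This is an illustration only, not the repository's existing chunker: `Segment`, `Chunk`, and `chunkSegments` are hypothetical names, and measuring `CHUNK_SIZE` / `CHUNK_OVERLAP` in characters is an assumption.

```typescript
// Hypothetical shapes for illustration; the real schema may differ.
interface Segment {
  segmentId: string;
  text: string;
}

interface Chunk {
  segmentId: string;
  startChar: number; // inclusive offset into the segment text
  endChar: number; // exclusive offset into the segment text
  text: string;
}

// Slide a fixed-size window over each segment independently, so no chunk
// can span two segments. Assumes sizes are measured in characters.
function chunkSegments(
  segments: Segment[],
  chunkSize: number,
  chunkOverlap: number,
): Chunk[] {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error("require chunkSize > 0 and 0 <= chunkOverlap < chunkSize");
  }
  const step = chunkSize - chunkOverlap;
  const chunks: Chunk[] = [];
  for (const seg of segments) {
    for (let start = 0; start < seg.text.length; start += step) {
      const end = Math.min(start + chunkSize, seg.text.length);
      chunks.push({
        segmentId: seg.segmentId,
        startChar: start,
        endChar: end,
        text: seg.text.slice(start, end),
      });
      if (end === seg.text.length) break; // window reached the segment end
    }
  }
  return chunks;
}
```

Because the window restarts at offset 0 for every segment, a PDF page boundary always ends the current chunk, which is what the **Chunk locality** acceptance test checks. An empty segment yields no chunks under this sketch.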

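### Illustrative Sketch: Limits and PDF Status Classification

The env defaults and the PDF skip rules (encrypted, too many pages, too little text) reduce to a small decision function. The sketch below is one possible reading, assuming hypothetical names (`IngestLimits`, `PdfProbe`, `limitsFromEnv`, `classifyPdf`) and assuming the page count, encryption flag, and extracted character count have already been obtained from the extractor. The order of the checks is also an assumption; the spec does not fix it.

```typescript
type IngestStatus = "ok" | "skipped" | "needs_ocr" | "too_large" | "error";

interface IngestLimits {
  pdfMaxPages: number;
  docMaxBytes: number;
  pdfMinTextChars: number;
}

// Read limits from an env-like map, falling back to the documented defaults
// when a variable is missing, non-numeric, or not positive.
function limitsFromEnv(env: Record<string, string | undefined>): IngestLimits {
  const num = (name: string, fallback: number): number => {
    const raw = env[name];
    const parsed = raw === undefined ? NaN : Number(raw);
    return Number.isFinite(parsed) && parsed > 0 ? parsed : fallback;
  };
  return {
    pdfMaxPages: num("PDF_MAX_PAGES", 300),
    docMaxBytes: num("DOC_MAX_BYTES", 10000000),
    pdfMinTextChars: num("PDF_MIN_TEXT_CHARS", 500),
  };
}

// Hypothetical summary of what the PDF extractor learned about a file.
interface PdfProbe {
  pages: number;
  textChars: number;
  encrypted: boolean;
}

// Map a probed PDF to an ingest status plus a short reason for `notes`.
function classifyPdf(
  probe: PdfProbe,
  limits: IngestLimits,
): { status: IngestStatus; notes: string } {
  if (probe.encrypted) {
    return { status: "skipped", notes: "encrypted" };
  }
  if (probe.pages > limits.pdfMaxPages) {
    return { status: "too_large", notes: `pages ${probe.pages} > ${limits.pdfMaxPages}` };
  }
  if (probe.textChars < limits.pdfMinTextChars) {
    return { status: "needs_ocr", notes: `text_chars ${probe.textChars} < ${limits.pdfMinTextChars}` };
  }
  return { status: "ok", notes: "" };
}
```

In a Node process this might be called as `classifyPdf(probe, limitsFromEnv(process.env))`. Keeping the decision pure makes the `too_large` and `needs_ocr` acceptance tests cheap to write without real fixtures.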

