# TDD Task Map: Unified Binary Handler (Option C)

## Overview

**Bug**: URL sources pointing to PDF files store raw binary data instead of extracted text.

**Root Cause**: In [`fetchSource()`](../src/tools/projects.ts:963), the URL handler:

1. Calls `response.text()` for ALL URLs (line 973)
2. Only handles `text/html` content-type (line 981)
3. Falls through to push raw binary content for non-HTML (lines 1008-1009)

**Fix**: Create a unified `extractTextFromResponse()` helper that routes to the correct extractor based on content-type and URL extension.

---

## Interface Design

### `extractTextFromResponse()` Signature

```typescript
interface ExtractTextOptions {
  url: string;
  response: Response;
  maxSizeBytes?: number;
}

interface ExtractTextResult {
  text: string;
  contentType: string;
  extractorUsed: 'html' | 'pdf' | 'plain' | 'jina';
}

async function extractTextFromResponse(
  options: ExtractTextOptions
): Promise<ExtractTextResult>
```

### Content-Type Detection Logic

| Content-Type Header | URL Extension | Action |
|---------------------|---------------|--------|
| `application/pdf` | any | PDF extractor |
| `text/html` | any | HTML extractor (with Jina fallback) |
| `text/plain` | any | Return as-is |
| missing/unknown | `.pdf` | PDF extractor |
| missing/unknown | `.txt` | Return as-is |
| missing/unknown | other | Attempt HTML, fallback to error |
| `application/octet-stream` | `.pdf` | PDF extractor |
| other binary | any | Throw meaningful error |

---

## Task Map

### Phase 0: Setup & Scaffolding

| Task ID | Objective | Mode | Dependencies | Acceptance Criteria |
|---------|-----------|------|--------------|---------------------|
| `UBH-000` | Create test file `tests/binary-handler.test.ts` with imports and describe blocks | code | none | File exists with vitest imports, describe blocks for each extractor type |
| `UBH-001` | Add mock Response factory for tests | code | UBH-000 | Helper function `createMockResponse()` can create Response objects with configurable content-type, body, and status |

---

### Phase 1: Red Phase - Write Failing Tests

All tests MUST fail initially with clear, meaningful error messages.

#### 1.1 Content-Type Detection Tests

| Task ID | Objective | Mode | Dependencies | Acceptance Criteria |
|---------|-----------|------|--------------|---------------------|
| `UBH-100` | Test: PDF content-type returns extracted text | red-phase | UBH-001 | Test fails because `extractTextFromResponse` doesn't exist; test expects `extractorUsed: 'pdf'` |
| `UBH-101` | Test: HTML content-type returns extracted text | red-phase | UBH-001 | Test fails; expects `extractorUsed: 'html'` |
| `UBH-102` | Test: Plain text content-type returns text as-is | red-phase | UBH-001 | Test fails; expects `extractorUsed: 'plain'` |
| `UBH-103` | Test: Unknown binary content-type throws descriptive error | red-phase | UBH-001 | Test fails; expects error with message containing content-type and URL |

#### 1.2 URL Extension Fallback Tests

| Task ID | Objective | Mode | Dependencies | Acceptance Criteria |
|---------|-----------|------|--------------|---------------------|
| `UBH-110` | Test: Missing content-type with `.pdf` extension → PDF extractor | red-phase | UBH-001 | Test fails; expects PDF extraction when URL ends in `.pdf` |
| `UBH-111` | Test: Missing content-type with `.txt` extension → plain text | red-phase | UBH-001 | Test fails; expects plain text pass-through |
| `UBH-112` | Test: `application/octet-stream` with `.pdf` extension → PDF extractor | red-phase | UBH-001 | Test fails; expects PDF extraction to be attempted |

#### 1.3 PDF Extraction Tests

| Task ID | Objective | Mode | Dependencies | Acceptance Criteria |
|---------|-----------|------|--------------|---------------------|
| `UBH-120` | Test: Valid PDF buffer extracts text correctly | red-phase | UBH-001 | Test fails; expects extracted text to contain known PDF content |
| `UBH-121` | Test: Corrupted PDF throws meaningful error | red-phase | UBH-001 | Test fails; expects error message mentioning PDF extraction failure |
| `UBH-122` | Test: Scanned/image-only PDF throws "insufficient text" error | red-phase | UBH-001 | Test fails; expects error about insufficient extractable text |

#### 1.4 HTML Extraction Tests

| Task ID | Objective | Mode | Dependencies | Acceptance Criteria |
|---------|-----------|------|--------------|---------------------|
| `UBH-130` | Test: HTML with rich content extracts text | red-phase | UBH-001 | Test fails; expects extracted text without HTML tags |
| `UBH-131` | Test: JS-rendered shell HTML triggers Jina fallback | red-phase | UBH-001 | Test fails; expects `extractorUsed: 'jina'` when shell detected |
| `UBH-132` | Test: HTML with insufficient content uses Jina fallback | red-phase | UBH-001 | Test fails; expects Jina fallback for < 200 chars extracted |

#### 1.5 Error Handling Tests

| Task ID | Objective | Mode | Dependencies | Acceptance Criteria |
|---------|-----------|------|--------------|---------------------|
| `UBH-140` | Test: Content exceeding size limit throws error | red-phase | UBH-001 | Test fails; expects error about size limit |
| `UBH-141` | Test: Error messages include URL and content-type | red-phase | UBH-001 | Test fails; expects descriptive errors with context |

---

### Phase 2: Green Phase - Implementation

Write minimal code to make each test pass. No premature optimization.
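As one possible starting point for the routing tasks (UBH-201 and UBH-202), the rules in the Content-Type Detection Logic table can be sketched as a pure function that takes only the header value and the URL. The names `ExtractorKind`, `urlExtension`, and `pickExtractor` are hypothetical and do not come from the codebase. The sketch assumes that "missing/unknown" means an absent or empty header, and that `application/octet-stream` without a `.pdf` extension counts as "other binary". The Jina path is left out because the table treats it as a fallback inside HTML extraction rather than a routing outcome.

```typescript
// Hypothetical helper names; only the routing rules come from the table above.
type ExtractorKind = 'html' | 'pdf' | 'plain';

// Returns the lowercased extension (including the dot) of the last path segment, or ''.
function urlExtension(url: string): string {
  // Drop the query string and fragment before looking for an extension.
  const path = url.split(/[?#]/)[0] ?? '';
  const lastSegment = path.substring(path.lastIndexOf('/') + 1);
  const dot = lastSegment.lastIndexOf('.');
  return dot === -1 ? '' : lastSegment.substring(dot).toLowerCase();
}

function pickExtractor(contentTypeHeader: string | null, url: string): ExtractorKind {
  // Normalize e.g. "Text/HTML; charset=utf-8" to "text/html".
  const mime = ((contentTypeHeader ?? '').split(';')[0] ?? '').trim().toLowerCase();
  const ext = urlExtension(url);

  // An explicit, recognized content-type wins over the extension.
  if (mime === 'application/pdf') return 'pdf';
  if (mime === 'text/html') return 'html';
  if (mime === 'text/plain') return 'plain';

  // Assumption: an absent or empty header is what the table calls "missing/unknown".
  if (mime === '') {
    if (ext === '.pdf') return 'pdf';
    if (ext === '.txt') return 'plain';
    return 'html'; // attempt HTML; the caller decides how to fail if that yields nothing
  }

  if (mime === 'application/octet-stream' && ext === '.pdf') return 'pdf';

  // Anything else is treated as unsupported binary content.
  throw new Error(`Unsupported content-type "${mime}" for URL ${url}`);
}
```

Keeping the routing free of network and parsing concerns means the Phase 1 detection tests can exercise it with plain strings, while the extractors themselves are tested separately against mock responses.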
#### 2.1 Core Function Structure

| Task ID | Objective | Mode | Dependencies | Acceptance Criteria |
|---------|-----------|------|--------------|---------------------|
| `UBH-200` | Implement `extractTextFromResponse()` function skeleton | green-phase | UBH-100..UBH-103 | Function exists, throws "not implemented" for all cases |
| `UBH-201` | Implement content-type detection and routing | green-phase | UBH-200 | Content-type parsing works; routes to correct extractor stub |
| `UBH-202` | Implement URL extension fallback detection | green-phase | UBH-201, UBH-110..UBH-112 | Falls back to extension when content-type missing/ambiguous |

#### 2.2 Extractors

| Task ID | Objective | Mode | Dependencies | Acceptance Criteria |
|---------|-----------|------|--------------|---------------------|
| `UBH-210` | Implement PDF extractor (reuse existing `pdfParse` logic) | green-phase | UBH-120..UBH-122 | PDF tests pass; extracts text from valid PDFs |
| `UBH-211` | Implement HTML extractor (reuse existing `extractTextFromHtml`) | green-phase | UBH-130 | HTML tests pass; strips tags, preserves structure |
| `UBH-212` | Implement plain text extractor | green-phase | UBH-102, UBH-111 | Plain text tests pass; returns content as-is |
| `UBH-213` | Implement Jina fallback integration | green-phase | UBH-131, UBH-132 | Jina fallback tests pass; shell detection works |

#### 2.3 Error Handling

| Task ID | Objective | Mode | Dependencies | Acceptance Criteria |
|---------|-----------|------|--------------|---------------------|
| `UBH-220` | Implement size limit validation | green-phase | UBH-140 | Size limit test passes; early rejection of large content |
| `UBH-221` | Implement descriptive error messages | green-phase | UBH-141, UBH-103 | Error tests pass; messages include URL and content-type |

---

### Phase 3: Blue Phase - Refactor & Integration

Improve code quality while maintaining passing tests.
#### 3.1 Code Quality

| Task ID | Objective | Mode | Dependencies | Acceptance Criteria |
|---------|-----------|------|--------------|---------------------|
| `UBH-300` | Extract extractor functions to separate helpers | blue-phase | UBH-210..UBH-213 | Clean separation of concerns; all tests still pass |
| `UBH-301` | Add JSDoc documentation to public functions | blue-phase | UBH-300 | All exported functions have JSDoc with examples |
| `UBH-302` | Add TypeScript types for all parameters/returns | blue-phase | UBH-300 | No `any` types; full type safety |

#### 3.2 Integration with `fetchSource()`

| Task ID | Objective | Mode | Dependencies | Acceptance Criteria |
|---------|-----------|------|--------------|---------------------|
| `UBH-310` | Replace URL case in `fetchSource()` with `extractTextFromResponse()` | blue-phase | UBH-300 | URL handling uses new helper; existing tests pass |
| `UBH-311` | Remove duplicated PDF/HTML logic from `fetchSource()` | blue-phase | UBH-310 | No duplicate extraction logic; DRY principle |
| `UBH-312` | Update sitemap handler to use `extractTextFromResponse()` | blue-phase | UBH-310 | Sitemap URL fetching uses unified handler |

#### 3.3 Performance & Polish

| Task ID | Objective | Mode | Dependencies | Acceptance Criteria |
|---------|-----------|------|--------------|---------------------|
| `UBH-320` | Add logging for extractor selection | blue-phase | UBH-310 | Debug logs show which extractor was used |
| `UBH-321` | Ensure streaming-friendly for large responses | blue-phase | UBH-320 | Memory efficient; no full buffer for content-type check |

---

### Phase 4: Data Cleanup & Validation

| Task ID | Objective | Mode | Dependencies | Acceptance Criteria |
|---------|-----------|------|--------------|---------------------|
| `UBH-400` | Create script to identify corrupted PDF chunks | code | UBH-310 | Script lists source_ids with binary content markers |
| `UBH-401` | Create data migration tool to purge bad chunks | code | UBH-400 | Tool removes chunks/vectors for specified sources |
| `UBH-402` | Document rebuild procedure for affected projects | architect | UBH-401 | README updated with recovery instructions |
| `UBH-403` | Rebuild affected project data (dnd-reference) | orchestrator | UBH-401, UBH-402 | Project rebuilt with correct PDF extraction |

---

## Dependency Graph

```
Phase 0 (Setup)
│
├── UBH-000 ──► UBH-001
│                  │
│                  ▼
Phase 1 (Red)
│
│   ┌──────────────┴───────────────┐
│   │                              │
│   ▼                              ▼
│   Content-Type Tests             Extension Fallback Tests
│   UBH-100..103                   UBH-110..112
│   │                              │
│   │   ┌──────────────────────────┤
│   │   │                          │
│   ▼   ▼                          ▼
│   PDF Tests                      HTML Tests
│   UBH-120..122                   UBH-130..132
│   │                              │
│   └──────────────┬───────────────┘
│                  │
│                  ▼
│      Error Tests UBH-140..141
│                  │
│                  ▼
Phase 2 (Green)
│
│   ┌──────────────┴───────────────┐
│   │                              │
│   ▼                              │
│   UBH-200 (skeleton)             │
│   │                              │
│   ▼                              │
│   UBH-201 (content-type routing) │
│   │                              │
│   ▼                              │
│   UBH-202 (extension fallback)   │
│   │                              │
│   ├──────────────┬───────────────┤
│   │              │               │
│   ▼              ▼               ▼
│   UBH-210        UBH-211         UBH-212
│   (PDF)          (HTML)          (plain)
│   │              │               │
│   │              ▼               │
│   │              UBH-213         │
│   │              (Jina)          │
│   │              │               │
│   └──────────────┼───────────────┘
│                  │
│                  ▼
│          UBH-220, UBH-221
│          (error handling)
│                  │
│                  ▼
Phase 3 (Blue)
│
│   ┌──────────────┴───────────────┐
│   │                              │
│   ▼                              ▼
│   UBH-300 (extract helpers)      UBH-301 (docs)
│   │                              │
│   ▼                              ▼
│   UBH-302 (types)                │
│   │                              │
│   ▼                              │
│   UBH-310 (integrate URL)        │
│   │                              │
│   ├───────────────►──────────────┘
│   │
│   ▼
│   UBH-311 (remove duplication)
│   │
│   ▼
│   UBH-312 (sitemap integration)
│   │
│   ▼
│   UBH-320, UBH-321 (polish)
│   │
│   ▼
Phase 4 (Cleanup)
│
│   ┌──────────────┴───────────────┐
│   │                              │
│   ▼                              ▼
│   UBH-400 (identify bad data)    UBH-402 (docs)
│   │                              │
│   ▼                              │
│   UBH-401 (purge tool)           │
│   │                              │
│   └──────────────┬───────────────┘
│                  │
│                  ▼
│               UBH-403
│          (rebuild project)
```

---

## Effort Estimates

| Phase | Tasks | Estimated Effort |
|-------|-------|------------------|
| Phase 0: Setup | 2 | 30 minutes |
| Phase 1: Red | 15 | 2 hours |
| Phase 2: Green | 9 | 3 hours |
| Phase 3: Blue | 8 | 2 hours |
| Phase 4: Cleanup | 4 | 1 hour |
| **Total** | **38** | **~8.5 hours** |

---

## Risk Mitigation

| Risk | Mitigation |
|------|------------|
| Jina Reader API rate limits | Add exponential backoff; cache results |
| Large PDFs causing OOM | Implement streaming; enforce size limits |
| Edge cases in content-type parsing | Add charset handling; normalize variations |
| Existing data format changes | Run migration script in dry-run first |

---

## Success Criteria

1. **All 15+ tests pass** after Green Phase
2. **No regressions** in existing `fetchSource()` behavior
3. **PDF URLs correctly extract text** (verified with real URL)
4. **Error messages are actionable** for debugging
5. **Corrupted project data is cleaned** and rebuilt

---

## Files Affected

| File | Changes |
|------|---------|
| `tests/binary-handler.test.ts` | New test file |
| `src/tools/projects.ts` | Refactored `fetchSource()`, new helpers |
| `src/utils.ts` | Possible helper additions |
| `projects/dnd-reference/data/*` | Rebuilt after cleanup |

---

*Generated by Planner Mode • IndexFoundry TDD Workflow*
