Fetch Document Text
fetch_document_text

Extract full text from Australian legislation and case law documents, falling back to OCR for scanned PDFs. Supports four output formats: JSON, text, markdown, and HTML.
Instructions
Fetch full text for a legislation or case URL, with OCR fallback for scanned PDFs.
Input Schema
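| Name | Required | Description | Default |
|---|---|---|---|
| format | No | Output format: json, text, markdown, or html | json |
| url | Yes | URL of the legislation or case document to fetch; must be a valid URL | |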
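As a rough illustration of how a client might assemble arguments for this tool, the sketch below builds a JSON-RPC `tools/call` request of the kind MCP clients send. The `buildFetchRequest` helper, the `OutputFormat` type, and the example URL are hypothetical and not part of the server. The checks only approximate the server-side validation (a valid URL and one of the four formats), and the `json` default mirrors the fallback used when `format` is omitted.

```typescript
// Hypothetical client-side helper; names here are illustrative only.
const FORMATS = ["json", "text", "markdown", "html"] as const;
type OutputFormat = (typeof FORMATS)[number];

interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: {
    name: "fetch_document_text";
    arguments: { url: string; format: OutputFormat };
  };
}

function buildFetchRequest(
  id: number,
  url: string,
  format: OutputFormat = "json",
): ToolCallRequest {
  // new URL() throws a TypeError on malformed input.
  new URL(url);
  if (!FORMATS.includes(format)) {
    throw new Error(`Unsupported format: ${format}`);
  }
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: "fetch_document_text", arguments: { url, format } },
  };
}

// Placeholder URL, used only to show the request shape.
const request = buildFetchRequest(1, "https://example.com/act.pdf", "markdown");
console.log(JSON.stringify(request, null, 2));
```

Validating on the client is optional, since the server parses the input again, but it surfaces a malformed URL before a network round trip.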
Implementation Reference
- src/services/fetcher.ts:98-167 (handler) Core handler function that fetches the document from the URL, detects its content type, extracts the text (PDF with OCR fallback, HTML parsing, or plain text), and returns a FetchResponse.
  ```typescript
  export async function fetchDocumentText(url: string): Promise<FetchResponse> {
    try {
      const response = await axios.get(url, {
        responseType: "arraybuffer",
        headers: {
          "User-Agent": "auslaw-mcp/0.1.0 (legal research tool)",
        },
        timeout: 30000,
        maxContentLength: 50 * 1024 * 1024, // 50MB limit
      });

      const buffer = Buffer.from(response.data);
      const contentType = response.headers["content-type"] || "";

      // Detect file type from buffer
      const detectedType = await fileTypeFromBuffer(buffer);

      let text: string;
      let ocrUsed = false;

      // Handle PDF documents
      if (
        contentType.includes("application/pdf") ||
        detectedType?.mime === "application/pdf"
      ) {
        const result = await extractTextFromPdf(buffer, url);
        text = result.text;
        ocrUsed = result.ocrUsed;
      }
      // Handle HTML documents
      else if (
        contentType.includes("text/html") ||
        detectedType?.mime === "text/html"
      ) {
        const html = buffer.toString("utf-8");
        text = extractTextFromHtml(html, url);
      }
      // Handle plain text
      else if (contentType.includes("text/plain")) {
        text = buffer.toString("utf-8");
      }
      // Unsupported format
      else {
        throw new Error(
          `Unsupported content type: ${contentType}${detectedType ? ` (detected: ${detectedType.mime})` : ""}`,
        );
      }

      // Extract basic metadata
      const metadata: Record<string, string> = {
        contentLength: String(buffer.length),
        contentType: contentType || detectedType?.mime || "unknown",
      };

      return {
        text,
        contentType: contentType || detectedType?.mime || "unknown",
        sourceUrl: url,
        ocrUsed,
        metadata,
      };
    } catch (error) {
      if (axios.isAxiosError(error)) {
        throw new Error(
          `Failed to fetch document from ${url}: ${error.message}`,
        );
      }
      throw error;
    }
  }
  ```
- src/index.ts:90-103 (registration) Registers the 'fetch_document_text' tool with the MCP server, providing the input schema and a handler that parses the input, calls fetchDocumentText, and formats the output.
  ```typescript
  server.registerTool(
    "fetch_document_text",
    {
      title: "Fetch Document Text",
      description:
        "Fetch full text for a legislation or case URL, with OCR fallback for scanned PDFs.",
      inputSchema: fetchDocumentShape,
    },
    async (rawInput) => {
      const { url, format } = fetchDocumentParser.parse(rawInput);
      const response = await fetchDocumentText(url);
      return formatFetchResponse(response, format ?? "json");
    },
  );
  ```
- src/index.ts:84-88 (schema) Defines the input schema for the fetch_document_text tool using Zod: a required URL and an optional format (json/text/markdown/html).
  ```typescript
  const fetchDocumentShape = {
    url: z.string().url("URL must be valid."),
    format: formatEnum.optional(),
  };
  const fetchDocumentParser = z.object(fetchDocumentShape);
  ```
- src/services/fetcher.ts:9-15 (schema) TypeScript interface defining the structure of the output returned by fetchDocumentText.
  ```typescript
  export interface FetchResponse {
    text: string;
    contentType: string;
    sourceUrl: string;
    ocrUsed: boolean;
    metadata?: Record<string, string>;
  }
  ```
- src/services/fetcher.ts:17-38 (helper) Helper function that extracts text from PDF buffers, first trying direct extraction and falling back to OCR if the extracted text is insufficient.
  ```typescript
  async function extractTextFromPdf(
    buffer: Buffer,
    url: string,
  ): Promise<{ text: string; ocrUsed: boolean }> {
    try {
      // First try to extract text from PDF directly
      const pdfData = await pdf(buffer);
      const extractedText = pdfData.text.trim();

      // If we got substantial text, return it
      if (extractedText.length > 100) {
        return { text: extractedText, ocrUsed: false };
      }

      // Otherwise, fall back to OCR
      console.warn(`PDF at ${url} has minimal text, attempting OCR...`);
      return await performOcr(buffer);
    } catch (error) {
      console.warn(`PDF parsing failed for ${url}, attempting OCR:`, error);
      return await performOcr(buffer);
    }
  }
  ```
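The fallback rule in `extractTextFromPdf` can be isolated as a small function for experimentation. The sketch below is a simplified restatement under stated assumptions, not the project's code: the `Extractor` type, `extractWithFallback`, and the stub extractors are hypothetical, and the OCR step is assumed to return plain text. Only the 100-character threshold and the direct-then-OCR order are taken from the helper.

```typescript
// Hypothetical types and names; only the threshold and the
// try-direct-then-OCR order come from extractTextFromPdf.
type Extractor = (buffer: Uint8Array) => Promise<string>;

interface ExtractionResult {
  text: string;
  ocrUsed: boolean;
}

const MIN_TEXT_LENGTH = 100;

async function extractWithFallback(
  buffer: Uint8Array,
  direct: Extractor,
  ocr: Extractor,
): Promise<ExtractionResult> {
  try {
    const text = (await direct(buffer)).trim();
    // Strictly more than 100 characters counts as a usable text layer.
    if (text.length > MIN_TEXT_LENGTH) {
      return { text, ocrUsed: false };
    }
  } catch {
    // A parse failure is treated like a missing text layer.
  }
  return { text: await ocr(buffer), ocrUsed: true };
}

// Stub extractors standing in for a real PDF parser and OCR engine.
const scannedPdf: Extractor = async () => "   ";
const fakeOcr: Extractor = async () => "text recovered by OCR";

extractWithFallback(new Uint8Array(), scannedPdf, fakeOcr).then((result) => {
  console.log(result.ocrUsed, result.text);
});
```

Passing the extractors in as parameters keeps the threshold logic testable without a PDF library or OCR engine installed.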