fetch_document_text
Extract full text from Australian legislation and case law documents using OCR for scanned PDFs. Supports multiple output formats including JSON, text, markdown, and HTML.
Instructions
Fetch full text for a legislation or case URL, with OCR fallback for scanned PDFs.
Input Schema
TableJSON Schema
| Name | Required | Description | Default |
|---|---|---|---|
| format | No | json | |
| url | Yes |
Implementation Reference
- src/services/fetcher.ts:98-167 (handler)Core handler function that fetches document from URL, detects content type, extracts text (PDF with OCR fallback, HTML parsing), and returns FetchResponse.export async function fetchDocumentText(url: string): Promise<FetchResponse> { try { const response = await axios.get(url, { responseType: "arraybuffer", headers: { "User-Agent": "auslaw-mcp/0.1.0 (legal research tool)", }, timeout: 30000, maxContentLength: 50 * 1024 * 1024, // 50MB limit }); const buffer = Buffer.from(response.data); const contentType = response.headers["content-type"] || ""; // Detect file type from buffer const detectedType = await fileTypeFromBuffer(buffer); let text: string; let ocrUsed = false; // Handle PDF documents if ( contentType.includes("application/pdf") || detectedType?.mime === "application/pdf" ) { const result = await extractTextFromPdf(buffer, url); text = result.text; ocrUsed = result.ocrUsed; } // Handle HTML documents else if ( contentType.includes("text/html") || detectedType?.mime === "text/html" ) { const html = buffer.toString("utf-8"); text = extractTextFromHtml(html, url); } // Handle plain text else if (contentType.includes("text/plain")) { text = buffer.toString("utf-8"); } // Unsupported format else { throw new Error( `Unsupported content type: ${contentType}${detectedType ? ` (detected: ${detectedType.mime})` : ""}`, ); } // Extract basic metadata const metadata: Record<string, string> = { contentLength: String(buffer.length), contentType: contentType || detectedType?.mime || "unknown", }; return { text, contentType: contentType || detectedType?.mime || "unknown", sourceUrl: url, ocrUsed, metadata, }; } catch (error) { if (axios.isAxiosError(error)) { throw new Error( `Failed to fetch document from ${url}: ${error.message}`, ); } throw error; } }
- src/index.ts:90-103 (registration)Registers the 'fetch_document_text' tool with MCP server, providing input schema and a handler that parses input, calls fetchDocumentText, and formats output.server.registerTool( "fetch_document_text", { title: "Fetch Document Text", description: "Fetch full text for a legislation or case URL, with OCR fallback for scanned PDFs.", inputSchema: fetchDocumentShape, }, async (rawInput) => { const { url, format } = fetchDocumentParser.parse(rawInput); const response = await fetchDocumentText(url); return formatFetchResponse(response, format ?? "json"); }, );
- src/index.ts:84-88 (schema)Defines the input schema for fetch_document_text tool using Zod: required URL and optional format (json/text/markdown/html).const fetchDocumentShape = { url: z.string().url("URL must be valid."), format: formatEnum.optional(), }; const fetchDocumentParser = z.object(fetchDocumentShape);
- src/services/fetcher.ts:9-15 (schema)TypeScript interface defining the structure of the output returned by fetchDocumentText.export interface FetchResponse { text: string; contentType: string; sourceUrl: string; ocrUsed: boolean; metadata?: Record<string, string>; }
- src/services/fetcher.ts:17-38 (helper)Helper function to extract text from PDF buffers, first trying direct extraction, falling back to OCR if insufficient text.async function extractTextFromPdf( buffer: Buffer, url: string, ): Promise<{ text: string; ocrUsed: boolean }> { try { // First try to extract text from PDF directly const pdfData = await pdf(buffer); const extractedText = pdfData.text.trim(); // If we got substantial text, return it if (extractedText.length > 100) { return { text: extractedText, ocrUsed: false }; } // Otherwise, fall back to OCR console.warn(`PDF at ${url} has minimal text, attempting OCR...`); return await performOcr(buffer); } catch (error) { console.warn(`PDF parsing failed for ${url}, attempting OCR:`, error); return await performOcr(buffer); } }