ingest_file

Adds document files (PDF, DOCX, TXT, MD) to a local vector database for semantic search, enabling private document retrieval without cloud services.

Instructions

Ingest a document file (PDF, DOCX, TXT, MD) into the vector database for semantic search. File path must be an absolute path. Supports re-ingestion to update existing documents.

Input Schema

Name     | Required | Description                                                                        | Default
filePath | Yes      | Absolute path to the file to ingest. Example: "/Users/user/documents/manual.pdf"  | (none)
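
The tool takes a single required argument. As a rough illustration, the sketch below shows one way a client might call ingest_file over stdio with the TypeScript MCP SDK and read back the result; the launch command (npx mcp-local-rag) and the client metadata are placeholder assumptions rather than details taken from this page, while the file path reuses the example from the input schema.

    import { Client } from '@modelcontextprotocol/sdk/client/index.js'
    import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js'

    // Assumed launch command for the server; adjust to your installation.
    const transport = new StdioClientTransport({ command: 'npx', args: ['mcp-local-rag'] })
    const client = new Client({ name: 'example-client', version: '1.0.0' })
    await client.connect(transport)

    // ingest_file takes only the absolute filePath argument.
    const response = await client.callTool({
      name: 'ingest_file',
      arguments: { filePath: '/Users/user/documents/manual.pdf' },
    })

    // The handler serializes IngestResult as JSON inside a single text content block.
    const [block] = response.content as Array<{ type: 'text'; text: string }>
    const result = JSON.parse(block.text) as { filePath: string; chunkCount: number; timestamp: string }
    console.log(`Ingested ${result.chunkCount} chunks from ${result.filePath} at ${result.timestamp}`)

Calling the tool again with the same filePath re-ingests the document: the handler deletes the previously stored chunks and replaces them with freshly chunked and embedded content.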

Implementation Reference

  • Type definition for the input parameters of the ingest_file tool.
    export interface IngestFileInput {
      /** File path */
      filePath: string
    }
  • Type definition for the output returned by the ingest_file tool.
    export interface IngestResult {
      /** File path */
      filePath: string
      /** Chunk count */
      chunkCount: number
      /** Timestamp */
      timestamp: string
    }
  • MCP tool registration in the listTools response, defining the name, description, and input schema for ingest_file.
    {
      name: 'ingest_file',
      description:
        'Ingest a document file (PDF, DOCX, TXT, MD) into the vector database for semantic search. File path must be an absolute path. Supports re-ingestion to update existing documents.',
      inputSchema: {
        type: 'object',
        properties: {
          filePath: {
            type: 'string',
            description: 'Absolute path to the file to ingest. Example: "/Users/user/documents/manual.pdf"',
          },
        },
        required: ['filePath'],
      },
    },
  • Core handler for the 'ingest_file' tool: parses the file (with special handling for PDFs and raw-data paths), splits it into semantic chunks, generates embeddings, backs up and deletes any existing chunks for the file, inserts the new vector chunks into LanceDB (rolling back to the backup on failure), and returns the ingestion result.
    async handleIngestFile(
      args: IngestFileInput
    ): Promise<{ content: [{ type: 'text'; text: string }] }> {
      let backup: VectorChunk[] | null = null
      try {
        // Parse file (with header/footer filtering for PDFs)
        // For raw-data files (from ingest_data), read directly without validation
        // since the path is internally generated and content is already processed
        const isPdf = args.filePath.toLowerCase().endsWith('.pdf')
        let text: string
        if (isRawDataPath(args.filePath)) {
          // Raw-data files: skip validation, read directly
          text = await readFile(args.filePath, 'utf-8')
          console.error(`Read raw-data file: ${args.filePath} (${text.length} characters)`)
        } else if (isPdf) {
          text = await this.parser.parsePdf(args.filePath, this.embedder)
        } else {
          text = await this.parser.parseFile(args.filePath)
        }

        // Split text into semantic chunks
        const chunks = await this.chunker.chunkText(text, this.embedder)

        // Generate embeddings for final chunks
        const embeddings = await this.embedder.embedBatch(chunks.map((chunk) => chunk.text))

        // Create backup (if existing data exists)
        try {
          const existingFiles = await this.vectorStore.listFiles()
          const existingFile = existingFiles.find((file) => file.filePath === args.filePath)
          if (existingFile && existingFile.chunkCount > 0) {
            // Backup existing data (retrieve via search)
            const queryVector = embeddings[0] || []
            if (queryVector.length > 0) {
              const allChunks = await this.vectorStore.search(queryVector, undefined, 20) // Retrieve max 20 items
              backup = allChunks
                .filter((chunk) => chunk.filePath === args.filePath)
                .map((chunk) => ({
                  id: randomUUID(),
                  filePath: chunk.filePath,
                  chunkIndex: chunk.chunkIndex,
                  text: chunk.text,
                  vector: queryVector, // Use dummy vector since actual vector cannot be retrieved
                  metadata: chunk.metadata,
                  timestamp: new Date().toISOString(),
                }))
            }
            console.error(`Backup created: ${backup?.length || 0} chunks for ${args.filePath}`)
          }
        } catch (error) {
          // Backup creation failure is warning only (for new files)
          console.warn('Failed to create backup (new file?):', error)
        }

        // Delete existing data
        await this.vectorStore.deleteChunks(args.filePath)
        console.error(`Deleted existing chunks for: ${args.filePath}`)

        // Create vector chunks
        const timestamp = new Date().toISOString()
        const vectorChunks: VectorChunk[] = chunks.map((chunk, index) => {
          const embedding = embeddings[index]
          if (!embedding) {
            throw new Error(`Missing embedding for chunk ${index}`)
          }
          return {
            id: randomUUID(),
            filePath: args.filePath,
            chunkIndex: chunk.index,
            text: chunk.text,
            vector: embedding,
            metadata: {
              fileName: args.filePath.split('/').pop() || args.filePath,
              fileSize: text.length,
              fileType: args.filePath.split('.').pop() || '',
            },
            timestamp,
          }
        })

        // Insert vectors (transaction processing)
        try {
          await this.vectorStore.insertChunks(vectorChunks)
          console.error(`Inserted ${vectorChunks.length} chunks for: ${args.filePath}`)
          // Delete backup on success
          backup = null
        } catch (insertError) {
          // Rollback on error
          if (backup && backup.length > 0) {
            console.error('Ingestion failed, rolling back...', insertError)
            try {
              await this.vectorStore.insertChunks(backup)
              console.error(`Rollback completed: ${backup.length} chunks restored`)
            } catch (rollbackError) {
              console.error('Rollback failed:', rollbackError)
              throw new Error(
                `Failed to ingest file and rollback failed: ${(insertError as Error).message}`
              )
            }
          }
          throw insertError
        }

        // Result
        const result: IngestResult = {
          filePath: args.filePath,
          chunkCount: chunks.length,
          timestamp,
        }
        return {
          content: [
            {
              type: 'text',
              text: JSON.stringify(result, null, 2),
            },
          ],
        }
      } catch (error) {
        // Error handling: suppress stack trace in production
        const errorMessage =
          process.env['NODE_ENV'] === 'production'
            ? (error as Error).message
            : (error as Error).stack || (error as Error).message
        console.error('Failed to ingest file:', errorMessage)
        throw new Error(`Failed to ingest file: ${errorMessage}`)
      }
    }
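
The handler relies on a VectorChunk record type that is not reproduced on this page. Inferring its shape from the fields the handler actually constructs (id, filePath, chunkIndex, text, vector, metadata, timestamp), it is roughly the interface below; treat this as a reconstruction, not the project's own definition.

    // Shape inferred from handleIngestFile above; the real definition may differ in detail.
    export interface VectorChunk {
      /** Unique row id (generated with randomUUID()) */
      id: string
      /** Absolute path of the source file */
      filePath: string
      /** Position of the chunk within the document */
      chunkIndex: number
      /** Chunk text stored alongside its embedding */
      text: string
      /** Embedding vector inserted into LanceDB */
      vector: number[]
      /** File name, size in characters, and extension, as set by the handler */
      metadata: {
        fileName: string
        fileSize: number
        fileType: string
      }
      /** ISO 8601 timestamp of the ingestion */
      timestamp: string
    }

Worth noting is the transaction-like flow around these records: existing chunks are backed up on a best-effort basis (via a similarity search capped at 20 results, using dummy vectors), the old chunks are deleted, and if inserting the new chunks fails the backup is restored, so a failed re-ingestion does not silently remove a document from the index.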
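
The page shows the listTools registration and the core handler but not the glue between them. With the TypeScript MCP SDK, that wiring typically dispatches CallTool requests by tool name, along the lines of the sketch below; the server info, the toolHandler object, and the abridged schema are illustrative assumptions, not code from this project.

    import { Server } from '@modelcontextprotocol/sdk/server/index.js'
    import { CallToolRequestSchema, ListToolsRequestSchema } from '@modelcontextprotocol/sdk/types.js'

    // Hypothetical stand-in for whatever object implements handleIngestFile.
    declare const toolHandler: {
      handleIngestFile(args: { filePath: string }): Promise<{ content: [{ type: 'text'; text: string }] }>
    }

    const server = new Server(
      { name: 'mcp-local-rag', version: '1.0.0' }, // illustrative server info
      { capabilities: { tools: {} } }
    )

    // Advertise the tool; the full registration object is shown above (abridged here).
    server.setRequestHandler(ListToolsRequestSchema, async () => ({
      tools: [
        {
          name: 'ingest_file',
          description: 'Ingest a document file (PDF, DOCX, TXT, MD) into the vector database for semantic search.',
          inputSchema: {
            type: 'object',
            properties: { filePath: { type: 'string' } },
            required: ['filePath'],
          },
        },
      ],
    }))

    // Route CallTool requests by name to the core handler.
    server.setRequestHandler(CallToolRequestSchema, async (request) => {
      const { name, arguments: args } = request.params
      if (name === 'ingest_file') {
        return toolHandler.handleIngestFile(args as unknown as { filePath: string })
      }
      throw new Error(`Unknown tool: ${name}`)
    })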

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/shinpr/mcp-local-rag'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.