ingest_file

Adds document files (PDF, DOCX, TXT, MD) to a local vector database for semantic search, enabling private document retrieval without cloud services.

Instructions

Ingest a document file (PDF, DOCX, TXT, MD) into the vector database for semantic search. File path must be an absolute path. Supports re-ingestion to update existing documents.

Input Schema

filePath (string, required): Absolute path to the file to ingest. Example: "/Users/user/documents/manual.pdf"
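
A typical MCP tools/call request for this tool could look like the following sketch; only the `name` and `arguments` fields are defined by the schema above, and the surrounding request framing follows the MCP protocol:

```json
{
  "method": "tools/call",
  "params": {
    "name": "ingest_file",
    "arguments": {
      "filePath": "/Users/user/documents/manual.pdf"
    }
  }
}
```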

Implementation Reference

  • Type definition for the input parameters of the ingest_file tool.
    export interface IngestFileInput {
      /** File path */
      filePath: string
    }
  • Type definition for the output returned by the ingest_file tool.
    export interface IngestResult {
      /** File path */
      filePath: string
      /** Chunk count */
      chunkCount: number
      /** Timestamp */
      timestamp: string
    }
  • MCP tool registration in the listTools response, defining name, description, and input schema for ingest_file.
    {
      name: 'ingest_file',
      description:
        'Ingest a document file (PDF, DOCX, TXT, MD) into the vector database for semantic search. File path must be an absolute path. Supports re-ingestion to update existing documents.',
      inputSchema: {
        type: 'object',
        properties: {
          filePath: {
            type: 'string',
            description:
              'Absolute path to the file to ingest. Example: "/Users/user/documents/manual.pdf"',
          },
        },
        required: ['filePath'],
      },
    },
  • Core handler for the 'ingest_file' tool: parses the file (with PDF-specific handling), performs semantic chunking and embedding, backs up existing data for rollback, deletes old chunks, inserts the new vector chunks into LanceDB, and returns the ingestion result.
    async handleIngestFile(
      args: IngestFileInput
    ): Promise<{ content: [{ type: 'text'; text: string }] }> {
      let backup: VectorChunk[] | null = null
    
      try {
        // Parse file (with header/footer filtering for PDFs)
        // For raw-data files (from ingest_data), read directly without validation
        // since the path is internally generated and content is already processed
        const isPdf = args.filePath.toLowerCase().endsWith('.pdf')
        let text: string
        if (isRawDataPath(args.filePath)) {
          // Raw-data files: skip validation, read directly
          text = await readFile(args.filePath, 'utf-8')
          console.error(`Read raw-data file: ${args.filePath} (${text.length} characters)`)
        } else if (isPdf) {
          text = await this.parser.parsePdf(args.filePath, this.embedder)
        } else {
          text = await this.parser.parseFile(args.filePath)
        }
    
        // Split text into semantic chunks
        const chunks = await this.chunker.chunkText(text, this.embedder)
    
        // Generate embeddings for final chunks
        const embeddings = await this.embedder.embedBatch(chunks.map((chunk) => chunk.text))
    
        // Create backup (if existing data exists)
        try {
          const existingFiles = await this.vectorStore.listFiles()
          const existingFile = existingFiles.find((file) => file.filePath === args.filePath)
          if (existingFile && existingFile.chunkCount > 0) {
            // Backup existing data (retrieve via search)
            const queryVector = embeddings[0] || []
            if (queryVector.length > 0) {
              const allChunks = await this.vectorStore.search(queryVector, undefined, 20) // Retrieve max 20 items
              backup = allChunks
                .filter((chunk) => chunk.filePath === args.filePath)
                .map((chunk) => ({
                  id: randomUUID(),
                  filePath: chunk.filePath,
                  chunkIndex: chunk.chunkIndex,
                  text: chunk.text,
                  vector: queryVector, // Use dummy vector since actual vector cannot be retrieved
                  metadata: chunk.metadata,
                  timestamp: new Date().toISOString(),
                }))
            }
            console.error(`Backup created: ${backup?.length || 0} chunks for ${args.filePath}`)
          }
        } catch (error) {
          // Backup creation failure is warning only (for new files)
          console.warn('Failed to create backup (new file?):', error)
        }
    
        // Delete existing data
        await this.vectorStore.deleteChunks(args.filePath)
        console.error(`Deleted existing chunks for: ${args.filePath}`)
    
        // Create vector chunks
        const timestamp = new Date().toISOString()
        const vectorChunks: VectorChunk[] = chunks.map((chunk, index) => {
          const embedding = embeddings[index]
          if (!embedding) {
            throw new Error(`Missing embedding for chunk ${index}`)
          }
          return {
            id: randomUUID(),
            filePath: args.filePath,
            chunkIndex: chunk.index,
            text: chunk.text,
            vector: embedding,
            metadata: {
              fileName: args.filePath.split('/').pop() || args.filePath,
              fileSize: text.length,
              fileType: args.filePath.split('.').pop() || '',
            },
            timestamp,
          }
        })
    
        // Insert vectors (transaction processing)
        try {
          await this.vectorStore.insertChunks(vectorChunks)
          console.error(`Inserted ${vectorChunks.length} chunks for: ${args.filePath}`)
    
          // Delete backup on success
          backup = null
        } catch (insertError) {
          // Rollback on error
          if (backup && backup.length > 0) {
            console.error('Ingestion failed, rolling back...', insertError)
            try {
              await this.vectorStore.insertChunks(backup)
              console.error(`Rollback completed: ${backup.length} chunks restored`)
            } catch (rollbackError) {
              console.error('Rollback failed:', rollbackError)
              throw new Error(
                `Failed to ingest file and rollback failed: ${(insertError as Error).message}`
              )
            }
          }
          throw insertError
        }
    
        // Result
        const result: IngestResult = {
          filePath: args.filePath,
          chunkCount: chunks.length,
          timestamp,
        }
    
        return {
          content: [
            {
              type: 'text',
              text: JSON.stringify(result, null, 2),
            },
          ],
        }
      } catch (error) {
        // Error handling: suppress stack trace in production
        const errorMessage =
          process.env['NODE_ENV'] === 'production'
            ? (error as Error).message
            : (error as Error).stack || (error as Error).message
    
        console.error('Failed to ingest file:', errorMessage)
    
        throw new Error(`Failed to ingest file: ${errorMessage}`)
      }
    }
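
The delete-then-insert transaction with rollback used by the handler above can be illustrated with a minimal in-memory sketch. The `MemoryStore` class and `replaceChunks` helper here are illustrative stand-ins, not part of the server's API; the real store is backed by LanceDB and its backup is reconstructed via a similarity search with dummy vectors, so this sketch shows only the control flow:

```typescript
// Minimal illustration of the backup -> delete -> insert -> rollback flow.
// MemoryStore stands in for the real LanceDB-backed vector store.
interface Chunk {
  id: string
  filePath: string
  text: string
}

class MemoryStore {
  private chunks: Chunk[] = []
  failNextInsert = false // test hook to simulate an insert failure

  listChunks(filePath: string): Chunk[] {
    return this.chunks.filter((c) => c.filePath === filePath)
  }

  deleteChunks(filePath: string): void {
    this.chunks = this.chunks.filter((c) => c.filePath !== filePath)
  }

  insertChunks(newChunks: Chunk[]): void {
    if (this.failNextInsert) {
      this.failNextInsert = false
      throw new Error('insert failed')
    }
    this.chunks.push(...newChunks)
  }
}

// Replace all chunks for a file, restoring the backup if the insert fails.
function replaceChunks(store: MemoryStore, filePath: string, next: Chunk[]): void {
  const backup = store.listChunks(filePath) // backup before the destructive step
  store.deleteChunks(filePath)
  try {
    store.insertChunks(next)
  } catch (err) {
    store.insertChunks(backup) // rollback: restore the old chunks
    throw err
  }
}
```

On a failed insert the caller still sees the error (it is re-thrown after the rollback), mirroring how the handler re-throws `insertError` after restoring the backup.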

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/shinpr/mcp-local-rag'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.