ingest_file

Adds document files (PDF, DOCX, TXT, MD) to a local vector database for semantic search, enabling private document retrieval without cloud services.

Instructions

Ingest a document file (PDF, DOCX, TXT, MD) into the vector database for semantic search. File path must be an absolute path. Supports re-ingestion to update existing documents.

Input Schema

Name:        filePath (string)
Required:    Yes
Default:     (none)
Description: Absolute path to the file to ingest. Example: "/Users/user/documents/manual.pdf"
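
A tools/call request carrying the single required argument might look like the following sketch (the id and surrounding JSON-RPC plumbing are illustrative; only the name and arguments fields are fixed by the schema):

    const request = {
      jsonrpc: '2.0',
      id: 1,
      method: 'tools/call',
      params: {
        name: 'ingest_file',
        arguments: {
          filePath: '/Users/user/documents/manual.pdf', // must be an absolute path
        },
      },
    }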

Implementation Reference

  • Type definition for the input parameters of the ingest_file tool.
    export interface IngestFileInput {
      /** File path */
      filePath: string
    }
  • Type definition for the output returned by the ingest_file tool.
    export interface IngestResult {
      /** File path */
      filePath: string
      /** Chunk count */
      chunkCount: number
      /** Timestamp */
      timestamp: string
    }
  • MCP tool registration in the listTools response, defining name, description, and input schema for ingest_file.
    {
      name: 'ingest_file',
      description:
        'Ingest a document file (PDF, DOCX, TXT, MD) into the vector database for semantic search. File path must be an absolute path. Supports re-ingestion to update existing documents.',
      inputSchema: {
        type: 'object',
        properties: {
          filePath: {
            type: 'string',
            description:
              'Absolute path to the file to ingest. Example: "/Users/user/documents/manual.pdf"',
          },
        },
        required: ['filePath'],
      },
    }
  • Core handler for the 'ingest_file' tool: parses the file (with PDF-specific handling), splits it into semantic chunks, generates embeddings, backs up and deletes any existing chunks for the file, inserts the new vector chunks into LanceDB (rolling back on failure), and returns the ingestion result.
    async handleIngestFile(
      args: IngestFileInput
    ): Promise<{ content: [{ type: 'text'; text: string }] }> {
      let backup: VectorChunk[] | null = null
    
      try {
        // Parse file (with header/footer filtering for PDFs)
        // For raw-data files (from ingest_data), read directly without validation
        // since the path is internally generated and content is already processed
        const isPdf = args.filePath.toLowerCase().endsWith('.pdf')
        let text: string
        if (isRawDataPath(args.filePath)) {
          // Raw-data files: skip validation, read directly
          text = await readFile(args.filePath, 'utf-8')
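          // Note: info logs go to stderr; with the stdio transport, stdout is reserved for MCP JSON-RPC messages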
          console.error(`Read raw-data file: ${args.filePath} (${text.length} characters)`)
        } else if (isPdf) {
          text = await this.parser.parsePdf(args.filePath, this.embedder)
        } else {
          text = await this.parser.parseFile(args.filePath)
        }
    
        // Split text into semantic chunks
        const chunks = await this.chunker.chunkText(text, this.embedder)
    
        // Generate embeddings for final chunks
        const embeddings = await this.embedder.embedBatch(chunks.map((chunk) => chunk.text))
    
        // Create backup (if existing data exists)
        try {
          const existingFiles = await this.vectorStore.listFiles()
          const existingFile = existingFiles.find((file) => file.filePath === args.filePath)
          if (existingFile && existingFile.chunkCount > 0) {
            // Backup existing data (retrieve via search)
            const queryVector = embeddings[0] || []
            if (queryVector.length > 0) {
              const allChunks = await this.vectorStore.search(queryVector, undefined, 20) // Retrieve max 20 items
              backup = allChunks
                .filter((chunk) => chunk.filePath === args.filePath)
                .map((chunk) => ({
                  id: randomUUID(),
                  filePath: chunk.filePath,
                  chunkIndex: chunk.chunkIndex,
                  text: chunk.text,
                  vector: queryVector, // Use dummy vector since actual vector cannot be retrieved
                  metadata: chunk.metadata,
                  timestamp: new Date().toISOString(),
                }))
            }
            console.error(`Backup created: ${backup?.length || 0} chunks for ${args.filePath}`)
          }
        } catch (error) {
          // Backup creation failure is warning only (for new files)
          console.warn('Failed to create backup (new file?):', error)
        }
    
        // Delete existing data
        await this.vectorStore.deleteChunks(args.filePath)
        console.error(`Deleted existing chunks for: ${args.filePath}`)
    
        // Create vector chunks
        const timestamp = new Date().toISOString()
        const vectorChunks: VectorChunk[] = chunks.map((chunk, index) => {
          const embedding = embeddings[index]
          if (!embedding) {
            throw new Error(`Missing embedding for chunk ${index}`)
          }
          return {
            id: randomUUID(),
            filePath: args.filePath,
            chunkIndex: chunk.index,
            text: chunk.text,
            vector: embedding,
            metadata: {
              fileName: args.filePath.split('/').pop() || args.filePath,
              fileSize: text.length,
              fileType: args.filePath.split('.').pop() || '',
            },
            timestamp,
          }
        })
    
        // Insert vectors (transaction processing)
        try {
          await this.vectorStore.insertChunks(vectorChunks)
          console.error(`Inserted ${vectorChunks.length} chunks for: ${args.filePath}`)
    
          // Delete backup on success
          backup = null
        } catch (insertError) {
          // Rollback on error
          if (backup && backup.length > 0) {
            console.error('Ingestion failed, rolling back...', insertError)
            try {
              await this.vectorStore.insertChunks(backup)
              console.error(`Rollback completed: ${backup.length} chunks restored`)
            } catch (rollbackError) {
              console.error('Rollback failed:', rollbackError)
              throw new Error(
                `Failed to ingest file and rollback failed: ${(insertError as Error).message}`
              )
            }
          }
          throw insertError
        }
    
        // Result
        const result: IngestResult = {
          filePath: args.filePath,
          chunkCount: chunks.length,
          timestamp,
        }
    
        return {
          content: [
            {
              type: 'text',
              text: JSON.stringify(result, null, 2),
            },
          ],
        }
      } catch (error) {
        // Error handling: suppress stack trace in production
        const errorMessage =
          process.env['NODE_ENV'] === 'production'
            ? (error as Error).message
            : (error as Error).stack || (error as Error).message
    
        console.error('Failed to ingest file:', errorMessage)
    
        throw new Error(`Failed to ingest file: ${errorMessage}`)
      }
    }
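
As a rough sketch of how a client might drive this handler end to end, the snippet below uses the MCP TypeScript SDK over stdio; the launch command and package invocation are assumptions about the local setup, not part of this server's documentation, and the top-level awaits assume an ESM context.

    import { Client } from '@modelcontextprotocol/sdk/client/index.js'
    import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js'

    // Spawn the server over stdio (command/args are assumed; adjust to your install)
    const transport = new StdioClientTransport({
      command: 'npx',
      args: ['mcp-local-rag'], // hypothetical launch command
    })

    const client = new Client({ name: 'example-client', version: '1.0.0' })
    await client.connect(transport)

    // Call ingest_file with an absolute path, as the description requires
    const result = await client.callTool({
      name: 'ingest_file',
      arguments: { filePath: '/Users/user/documents/manual.pdf' },
    })

    // On success, the single text content item holds the serialized IngestResult,
    // e.g. {"filePath": "...", "chunkCount": 42, "timestamp": "..."}
    console.log(result.content)

    await client.close()
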
Behavior 3/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full burden of behavioral disclosure. It does reveal that the tool performs write operations ('ingest', 'update') and mentions the re-ingestion capability, but doesn't disclose important behavioral traits like required permissions, rate limits, error conditions, or what happens during concurrent operations. The description adds some context but leaves significant gaps.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is perfectly concise with two sentences that each earn their place. The first sentence establishes the core functionality and constraints, while the second adds important behavioral context about re-ingestion. No wasted words or redundant information.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness 3/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (file ingestion with format constraints and update capability), no annotations, and no output schema, the description is moderately complete. It covers the basic purpose and some behavioral aspects but lacks details about return values, error handling, performance characteristics, or how it differs from sibling tools like 'ingest_data'.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

With 100% schema description coverage for the single parameter, the baseline would be 3. However, the description adds meaningful context beyond the schema by specifying that the file path must be absolute and listing supported file formats (PDF, DOCX, TXT, MD), which helps the agent understand parameter constraints not fully captured in the schema.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose with specific verbs ('ingest', 'update') and resources ('document file', 'vector database'), and distinguishes it from siblings by mentioning semantic search capabilities. It explicitly identifies supported file formats (PDF, DOCX, TXT, MD) and the database destination.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides clear context about when to use the tool (for ingesting documents into the vector database for semantic search) and mentions re-ingestion for updates. However, it doesn't explicitly state when NOT to use it or name specific alternatives among the sibling tools like 'ingest_data' or 'list_files'.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
