ingest_data

Add text, HTML, or Markdown content to a local search index for private document retrieval, using a source identifier to update existing entries.

Instructions

Ingest content as a string, not from a file. Use for: fetched web pages (format: html), copied text (format: text), or markdown strings (format: markdown). The source identifier enables re-ingestion to update existing content. For files on disk, use ingest_file instead.
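As a rough illustration of the source identifier convention given in the tool's schema (the URL for web pages, otherwise "{type}://{date}" or "{type}://{date}/{detail}"), the sketch below builds such identifiers. The helper name `buildSourceId` is hypothetical and not part of the server; only the format strings come from the schema description.

```typescript
// Hypothetical helper (an assumption, not part of the server): builds a source
// identifier in the "{type}://{date}" or "{type}://{date}/{detail}" shape.
function buildSourceId(type: string, date: Date, detail?: string): string {
  // Keep only the date portion of the ISO timestamp (UTC), e.g. "2024-12-30"
  const day = date.toISOString().slice(0, 10)
  return detail ? `${type}://${day}/${detail}` : `${type}://${day}`
}

const noteId = buildSourceId('note', new Date('2024-12-30T00:00:00Z'), 'meeting')
console.log(noteId) // note://2024-12-30/meeting
```

Reusing the same identifier on a later call is what lets re-ingestion update the existing entry rather than create a new one.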

Input Schema

Name      Required  Description                                      Default
content   Yes       The content to ingest (text, HTML, or Markdown)
metadata  Yes
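A minimal sketch of what a valid argument object might look like, together with a runtime check that mirrors the schema's required fields and the `format` enum. The `isIngestDataInput` guard is a hypothetical helper for illustration; the server's own validation is not shown on this page.

```typescript
type ContentFormat = 'text' | 'html' | 'markdown'

interface IngestDataInput {
  content: string
  metadata: { source: string; format: ContentFormat }
}

const FORMATS: readonly string[] = ['text', 'html', 'markdown']

// Hypothetical guard (an assumption): checks the required fields and the enum.
function isIngestDataInput(value: unknown): value is IngestDataInput {
  if (typeof value !== 'object' || value === null) return false
  const v = value as Record<string, unknown>
  if (typeof v['content'] !== 'string') return false
  const meta = v['metadata']
  if (typeof meta !== 'object' || meta === null) return false
  const m = meta as Record<string, unknown>
  const format = m['format']
  return typeof m['source'] === 'string' && typeof format === 'string' && FORMATS.includes(format)
}

// Example arguments for a fetched web page: the URL doubles as the source identifier.
const exampleArgs: IngestDataInput = {
  content: '<h1>Title</h1><p>Body text</p>',
  metadata: { source: 'https://example.com/page', format: 'html' },
}

console.log(isIngestDataInput(exampleArgs))
```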

Implementation Reference

  • Primary execution logic for the ingest_data tool. Handles HTML to Markdown conversion if needed, saves processed content to a raw-data file using saveRawData, then delegates ingestion to handleIngestFile with rollback on failure.
    async handleIngestData(
      args: IngestDataInput
    ): Promise<{ content: [{ type: 'text'; text: string }] }> {
      try {
        let contentToSave = args.content
        let formatToSave: ContentFormat = args.metadata.format
    
        // For HTML content, convert to Markdown first
        if (args.metadata.format === 'html') {
          console.error(`Parsing HTML from: ${args.metadata.source}`)
          const markdown = await parseHtml(args.content, args.metadata.source)
    
          if (!markdown.trim()) {
            throw new Error(
              'Failed to extract content from HTML. The page may have no readable content.'
            )
          }
    
          contentToSave = markdown
          formatToSave = 'markdown' // Save as .md file
          console.error(`Converted HTML to Markdown: ${markdown.length} characters`)
        }
    
        // Save content to raw-data directory
        const rawDataPath = await saveRawData(
          this.dbPath,
          args.metadata.source,
          contentToSave,
          formatToSave
        )
    
        console.error(`Saved raw data: ${args.metadata.source} -> ${rawDataPath}`)
    
        // Call existing ingest_file internally with rollback on failure
        try {
          return await this.handleIngestFile({ filePath: rawDataPath })
        } catch (ingestError) {
          // Rollback: delete the raw-data file if ingest fails
          try {
            await unlink(rawDataPath)
            console.error(`Rolled back raw-data file: ${rawDataPath}`)
          } catch {
            console.warn(`Failed to rollback raw-data file: ${rawDataPath}`)
          }
          throw ingestError
        }
      } catch (error) {
        // Error handling: suppress stack trace in production
        const errorMessage =
          process.env['NODE_ENV'] === 'production'
            ? (error as Error).message
            : (error as Error).stack || (error as Error).message
    
        console.error('Failed to ingest data:', errorMessage)
    
        throw new Error(`Failed to ingest data: ${errorMessage}`)
      }
    }
  • MCP tool registration in listTools handler, defining name, description, and detailed inputSchema for ingest_data.
    {
      name: 'ingest_data',
      description:
        'Ingest content as a string, not from a file. Use for: fetched web pages (format: html), copied text (format: text), or markdown strings (format: markdown). The source identifier enables re-ingestion to update existing content. For files on disk, use ingest_file instead.',
      inputSchema: {
        type: 'object',
        properties: {
          content: {
            type: 'string',
            description: 'The content to ingest (text, HTML, or Markdown)',
          },
          metadata: {
            type: 'object',
            properties: {
              source: {
                type: 'string',
                description:
                  'Source identifier. For web pages, use the URL (e.g., "https://example.com/page"). For other content, use URL-scheme format: "{type}://{date}" or "{type}://{date}/{detail}". Examples: "clipboard://2024-12-30", "chat://2024-12-30/project-discussion", "note://2024-12-30/meeting".',
              },
              format: {
                type: 'string',
                enum: ['text', 'html', 'markdown'],
                description: 'Content format: "text", "html", or "markdown"',
              },
            },
            required: ['source', 'format'],
          },
        },
        required: ['content', 'metadata'],
      },
    },
  • TypeScript interfaces defining the input structure for ingest_data: IngestDataMetadata and IngestDataInput, used for type safety and schema validation.
    /**
     * ingest_data tool input metadata
     */
    export interface IngestDataMetadata {
      /** Source identifier: URL ("https://...") or custom ID ("clipboard://2024-12-30") */
      source: string
      /** Content format */
      format: ContentFormat
    }
    
    /**
     * ingest_data tool input
     */
    export interface IngestDataInput {
      /** Content to ingest (text, HTML, or Markdown) */
      content: string
      /** Content metadata */
      metadata: IngestDataMetadata
    }
  • Core helper function called by the handler to persist the ingested content to a secure raw-data file path derived from the source identifier.
    export async function saveRawData(
      dbPath: string,
      source: string,
      content: string,
      format: ContentFormat
    ): Promise<string> {
      const filePath = generateRawDataPath(dbPath, source, format)
    
      // Ensure directory exists
      await mkdir(dirname(filePath), { recursive: true })
    
      // Write content to file
      await writeFile(filePath, content, 'utf-8')
    
      return filePath
    }
  • Generates the deterministic file path for raw-data storage using base64url encoding of normalized source, ensuring uniqueness and security.
    export function generateRawDataPath(dbPath: string, source: string, format: ContentFormat): string {
      const normalizedSource = normalizeSource(source)
      const encoded = encodeBase64Url(normalizedSource)
      const extension = formatToExtension(format)
      // Use resolve to ensure absolute path (required by validateFilePath)
      return resolve(getRawDataDir(dbPath), `${encoded}.${extension}`)
    }
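The path generation above relies on helpers whose bodies are not shown on this page (`normalizeSource`, `encodeBase64Url`, `formatToExtension`, `getRawDataDir`). The sketch below uses stand-in versions, all of them assumptions, to illustrate why the same source identifier always maps to the same file, which is what makes re-ingestion overwrite earlier content. Only the `.md` extension for Markdown is confirmed by the handler's comments; the other extensions and the trimming step are guesses.

```typescript
import { Buffer } from 'node:buffer'
import { resolve } from 'node:path'

type ContentFormat = 'text' | 'html' | 'markdown'

// ASSUMPTION: trimming stands in for the server's real normalizeSource.
function normalizeSourceSketch(source: string): string {
  return source.trim()
}

// Node's Buffer supports the 'base64url' encoding, whose alphabet avoids
// '+', '/' and '=' and is therefore safe to use in a file name.
function encodeBase64UrlSketch(value: string): string {
  return Buffer.from(value, 'utf-8').toString('base64url')
}

// ASSUMPTION: only 'markdown' -> 'md' is confirmed by the handler's comment.
function formatToExtensionSketch(format: ContentFormat): string {
  return format === 'markdown' ? 'md' : format === 'html' ? 'html' : 'txt'
}

// Stand-in for generateRawDataPath; rawDataDir replaces getRawDataDir(dbPath).
function rawDataPathSketch(rawDataDir: string, source: string, format: ContentFormat): string {
  const encoded = encodeBase64UrlSketch(normalizeSourceSketch(source))
  return resolve(rawDataDir, `${encoded}.${formatToExtensionSketch(format)}`)
}

const first = rawDataPathSketch('/tmp/raw-data', 'note://2024-12-30/meeting', 'markdown')
const second = rawDataPathSketch('/tmp/raw-data', 'note://2024-12-30/meeting', 'markdown')
console.log(first === second) // true
```

Because the encoding is reversible, the original source identifier can in principle be recovered from the file name, though whether the server does so is not shown here.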
Behavior: 3/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full burden of behavioral disclosure. It explains that the source identifier 'enables re-ingestion to update existing content,' which reveals important behavioral context about idempotency and content updates. However, it doesn't describe what 'ingest' actually means operationally, what happens to the content after ingestion, or any limitations like size constraints or rate limits.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is efficiently structured with four sentences that each serve a distinct purpose: stating the core function, listing use cases, explaining the source identifier's purpose, and providing the sibling alternative. Every sentence adds value without redundancy, making it appropriately sized and front-loaded.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

For a tool with 2 parameters, no annotations, and no output schema, the description provides good contextual coverage. It explains the tool's purpose, usage guidelines, and key parameter semantics. The main gap is the lack of information about what 'ingest' operationally means and what the tool returns, but given the tool's relative simplicity, the description is reasonably complete.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

With 50% schema description coverage (only the 'content' parameter has a description in the schema), the description adds significant value by explaining the purpose of the source identifier ('enables re-ingestion to update existing content') and listing the accepted formats (html, text, markdown). While it doesn't detail every parameter, it gives meaningful context that compensates for the schema coverage gap.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose: 'Ingest content as a string, not from a file' with specific examples (fetched web pages, copied text, markdown strings). It distinguishes from sibling 'ingest_file' by specifying string-based ingestion versus file-based ingestion, providing clear differentiation.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 5/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description explicitly states when to use this tool ('Use for: fetched web pages, copied text, or markdown strings') and when not to use it ('For files on disk, use ingest_file instead'). It names the alternative tool and provides clear context for appropriate usage scenarios.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/shinpr/mcp-local-rag'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.