extract_from_text

Extract structured data like emails, URLs, phone numbers, dates, and addresses from raw text to parse documents, emails, or web content into clean formats.

Instructions

Extract structured data from raw text: emails, URLs, phone numbers, dates, currencies, addresses, names, or JSON blocks. Useful for parsing documents, emails, web content, or any unstructured text into clean structured data.

Input Schema

| Name | Required | Description | Default |
|---|---|---|---|
| text | Yes | The raw text to extract data from | |
| extractors | Yes | Which types of data to extract. Choose one or more: emails, urls, phone_numbers, dates, currencies, addresses, names, json_blocks | |
| deduplicate | No | Remove duplicate results within each type | true |
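
As a rough illustration of the input schema, the parameters can be modelled as a plain TypeScript type. The names `EXTRACTOR_TYPES`, `ExtractInput`, and `normalizeInput` are hypothetical and not part of the tool. The length limits and the `true` default for `deduplicate` are taken from the Zod schema in the Implementation Reference; the real tool validates with Zod, not with this hand-written check.

```typescript
// Hypothetical names for illustration; only the field names and the
// extractor values come from the input schema.
const EXTRACTOR_TYPES = [
  "emails", "urls", "phone_numbers", "dates",
  "currencies", "addresses", "names", "json_blocks",
] as const;

type ExtractorType = (typeof EXTRACTOR_TYPES)[number];

interface ExtractInput {
  text: string;                // required: raw text to scan
  extractors: ExtractorType[]; // required: one or more extractor types
  deduplicate?: boolean;       // optional: treated as true when omitted
}

// Check the documented constraints and fill in the default for `deduplicate`.
function normalizeInput(input: ExtractInput): Required<ExtractInput> {
  if (input.text.length < 1 || input.text.length > 10_000) {
    throw new RangeError("text must be 1 to 10,000 characters long");
  }
  if (input.extractors.length < 1) {
    throw new RangeError("extractors must contain at least one type");
  }
  return { ...input, deduplicate: input.deduplicate ?? true };
}

const request = normalizeInput({
  text: "Contact john@example.com or call (555) 123-4567.",
  extractors: ["emails", "phone_numbers"],
});
console.log(request.deduplicate); // true
```

Omitting `deduplicate` is the common case, so the sketch resolves the default in one place instead of checking for `undefined` at every use.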

Implementation Reference

  • The handler function that performs the extraction logic using the helper `extract` function.
    async function handler(input: Input) {
      const extracted = extract(input.text, input.extractors, input.deduplicate);
    
      const totalFound = Object.values(extracted).reduce((sum, arr) => sum + arr.length, 0);
    
      return {
        extracted,
        summary: {
          totalItemsFound: totalFound,
          byType: Object.fromEntries(Object.entries(extracted).map(([k, v]) => [k, v.length])),
        },
      };
    }
  • The extraction helper function that executes regex matching and post-processing based on requested extractors.
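    function extract(text: string, extractors: string[], deduplicate: boolean) {
      // Maps each requested extractor name to the list of matches found.
      const results: Record<string, string[]> = {};
    
      for (const extractor of extractors) {
        const pattern = PATTERNS[extractor];
        // Extractors without a PATTERNS entry are skipped and get no key in the result.
        if (!pattern) continue;
    
        // String.prototype.match returns null when nothing matches, so fall back to [].
        const matches = text.match(pattern.regex) || [];
        let processed = pattern.postProcess ? matches.map(pattern.postProcess) : matches;
    
        if (deduplicate) {
          // A Set drops repeats while keeping first-seen order.
          processed = [...new Set(processed)];
        }
    
        results[extractor] = processed;
      }
    
      return results;
    }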
  • Zod schema definition for input validation of the text-extractor tool.
    const inputSchema = z.object({
      text: z
        .string()
        .min(1)
        .max(10_000)
        .describe("Raw text to extract structured data from"),
      extractors: z
        .array(
          z.enum([
            "emails",
            "urls",
            "phone_numbers",
            "dates",
            "currencies",
            "addresses",
            "names",
            "json_blocks",
          ])
        )
        .min(1)
        .describe("Which types of data to extract"),
      deduplicate: z.boolean().default(true).describe("Remove duplicate results"),
    });
  • Registration of the text-extractor tool within the registry. Note that the registration key used in the server is "extract_from_text" while the internal tool name is "text-extractor".
    const textExtractorTool: ToolDefinition<Input> = {
      name: "text-extractor",
      description:
        "Extract structured data (emails, URLs, phone numbers, dates, currencies, addresses, names, JSON blocks) from raw text. Essential for agents processing unstructured documents, emails, or web content.",
      version: "1.0.0",
      inputSchema,
      handler,
      metadata: {
        tags: ["extraction", "parsing", "nlp", "data-transformation"],
        pricing: "$0.0005 per call",
        exampleInput: {
          text: "Contact John Smith at john@example.com or call (555) 123-4567. Meeting on Jan 15, 2025 at 123 Main St, Springfield, IL 62701. Budget: $5,000.00 USD.",
          extractors: ["emails", "phone_numbers", "dates", "addresses", "currencies"],
          deduplicate: true,
        },
      },
    };
    
    registerTool(textExtractorTool);
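
The excerpts above do not show the `PATTERNS` table that `extract` reads from. The following is a minimal end-to-end sketch, assuming a table with just two entries. The `Pattern` interface and both regular expressions are simplified, hypothetical stand-ins, not the tool's actual patterns. `extract` and `handler` follow the reference code, with types loosened so the block runs on its own.

```typescript
// Hypothetical pattern table: the real PATTERNS object is not shown in
// the reference, so these two regexes are simplified stand-ins.
interface Pattern {
  regex: RegExp; // needs the `g` flag so String.match returns every match
  postProcess?: (match: string) => string;
}

const PATTERNS: Record<string, Pattern> = {
  emails: {
    regex: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g,
    postProcess: (m) => m.toLowerCase(), // normalise case before deduplication
  },
  phone_numbers: {
    regex: /\(\d{3}\)\s?\d{3}-\d{4}/g, // only the "(555) 123-4567" shape
  },
};

function extract(text: string, extractors: string[], deduplicate: boolean) {
  const results: Record<string, string[]> = {};
  for (const extractor of extractors) {
    const pattern = PATTERNS[extractor];
    if (!pattern) continue; // unknown extractor: no key in the result
    const matches: string[] = text.match(pattern.regex) ?? [];
    const post = pattern.postProcess;
    let processed = post ? matches.map((m) => post(m)) : matches;
    if (deduplicate) processed = [...new Set(processed)];
    results[extractor] = processed;
  }
  return results;
}

function handler(input: { text: string; extractors: string[]; deduplicate: boolean }) {
  const extracted = extract(input.text, input.extractors, input.deduplicate);
  const totalFound = Object.values(extracted).reduce((sum, arr) => sum + arr.length, 0);
  return {
    extracted,
    summary: {
      totalItemsFound: totalFound,
      byType: Object.fromEntries(Object.entries(extracted).map(([k, v]) => [k, v.length])),
    },
  };
}

const result = handler({
  text: "Mail John@Example.com or john@example.com, or call (555) 123-4567.",
  extractors: ["emails", "phone_numbers"],
  deduplicate: true,
});
console.log(result.extracted["emails"]);        // [ 'john@example.com' ]
console.log(result.extracted["phone_numbers"]); // [ '(555) 123-4567' ]
console.log(result.summary.totalItemsFound);    // 2
```

Two details matter in this sketch. Without the `g` flag, `String.prototype.match` returns only the first match plus its capture groups, so every regex in the table is global. The sketch also wraps `postProcess` in an arrow function so it receives only the match string; passing a function directly to `map` also hands it the index and the array.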

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/marras0914/agent-toolbelt'
