rag
Search the web and extract content using intelligent RAG techniques to retrieve, process, and summarize information from web pages for research and analysis.
Instructions
Web search with intelligent content extraction (similar to the Apify RAG Web Browser).
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| query | Yes | Search term to look up on the web | |
| maxResults | No | Maximum number of pages to process (1-10) | 5 |
| outputFormat | No | Output format for the extracted content | markdown |
| contentMode | No | Content mode: preview = truncated summary (~300 chars), full = complete content, summary = intelligent LLM summarization | full |
| summaryModel | No | Ollama model used for summarization (default: llama3.2:1b), e.g. mistral:7b, qwen2.5:0.5b | |
| useJavascript | No | Render JavaScript on the pages | false |
| timeout | No | Scraping timeout in milliseconds | 30000 |
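
As a quick illustration of these parameters, the sketch below shows a typical argument object for the tool; the query and model name are placeholder values, not part of the tool itself.

```ts
// Illustrative argument object for the "rag" tool (values are hypothetical).
const ragArgs = {
  query: "vector databases comparison", // required search term
  maxResults: 3,                        // process at most 3 result pages (1-10)
  outputFormat: "markdown",             // "markdown" | "text" | "html"
  contentMode: "summary",               // "preview" | "full" | "summary"
  summaryModel: "llama3.2:1b",          // Ollama model used when contentMode is "summary"
  useJavascript: false,                 // skip JS rendering for faster scraping
  timeout: 30000,                       // scraping timeout in milliseconds
};
```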
Implementation Reference
- src/tools/rag.ts:69-180 (handler) — core implementation of the `rag` tool handler. It searches the web, scrapes and extracts content from the top results, applies caching, processes the content according to the selected mode (preview/full/summary), and returns formatted results.

```ts
export async function rag(params: RagParams): Promise<RagResult> {
  const {
    query,
    maxResults = 5,
    outputFormat = "markdown",
    contentMode = "full",
    summaryModel,
    useJavascript = false,
    timeout = 30000,
  } = params;

  let searchResults: Array<{
    url: string;
    title: string;
    description: string;
  }>;

  try {
    searchResults = await searchWeb(query, maxResults);
  } catch (error) {
    return {
      query,
      error: `Busca falhou: ${(error as Error).message}. Tente novamente em alguns minutos.`,
      results: [],
      totalResults: 0,
      searchedAt: new Date().toISOString(),
    };
  }

  const pages = await Promise.all(
    searchResults.map(async (result) => {
      const cached = getFromCache(result.url);
      if (cached) {
        return {
          url: result.url,
          title: cached.title,
          markdown: cached.markdown,
          text: cached.content,
          html: cached.content,
          excerpt: "",
          fromCache: true,
        };
      }

      try {
        const { html } = await scrapePage(result.url, {
          javascript: useJavascript,
          timeout,
        });
        const extracted = await extractContent(html, result.url);
        if (extracted) {
          saveToCache(result.url, {
            content: extracted.textContent,
            markdown: extracted.markdown,
            title: extracted.title,
          });
          return {
            url: result.url,
            title: extracted.title,
            markdown: extracted.markdown,
            text: extracted.textContent,
            html: extracted.content,
            excerpt: extracted.excerpt,
            fromCache: false,
          };
        }
      } catch (e) {
        console.error(`Failed to scrape ${result.url}:`, e);
      }
      return null;
    })
  );

  const validPages = pages.filter(Boolean) as PageResult[];

  const formattedResults = await Promise.all(
    validPages.map(async (p) => {
      const result: any = {
        url: p.url,
        title: p.title,
        fromCache: p.fromCache,
      };

      if (contentMode === "preview") {
        result.contentHandle = generateContentHandle(p.url);
      }

      if (outputFormat === "markdown") {
        result.markdown = await applyContentMode(p.markdown, contentMode, summaryModel);
      } else if (outputFormat === "text") {
        result.text = await applyContentMode(p.text, contentMode, summaryModel);
      } else if (outputFormat === "html") {
        result.html = await applyContentMode(p.html, contentMode, summaryModel);
      }

      if (p.excerpt && contentMode === "full") {
        result.excerpt = p.excerpt;
      }

      return result;
    })
  );

  return {
    query,
    results: formattedResults,
    totalResults: validPages.length,
    searchedAt: new Date().toISOString(),
  };
}
```
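
For reference, a minimal sketch of calling the handler directly (assuming `rag` is imported from src/tools/rag.ts and run inside an async context; the query value is a placeholder):

```ts
import { rag } from "./tools/rag";

async function demo() {
  const result = await rag({
    query: "model context protocol", // placeholder query
    maxResults: 3,
    contentMode: "preview",          // truncated content plus a contentHandle per page
  });

  if (result.error) {
    // Search failed; results will be empty.
    console.error(result.error);
    return;
  }

  for (const page of result.results) {
    console.log(page.title, page.url, page.fromCache ? "(cache)" : "(scraped)");
  }
}

demo();
```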
- src/tools/rag.ts:7-34 (schema) — TypeScript interfaces defining the input parameters (RagParams), the intermediate PageResult, and the output RagResult for the rag tool.

```ts
interface RagParams {
  query: string;
  maxResults?: number;
  outputFormat?: "markdown" | "text" | "html";
  contentMode?: "preview" | "full" | "summary";
  summaryModel?: string;
  useJavascript?: boolean;
  timeout?: number;
}

interface PageResult {
  url: string;
  title: string;
  markdown?: string;
  text?: string;
  html?: string;
  excerpt?: string;
  contentHandle?: string;
  fromCache: boolean;
}

interface RagResult {
  query: string;
  results: PageResult[];
  totalResults: number;
  searchedAt: string;
  error?: string;
}
```
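
To make the output shape concrete, here is an illustrative `RagResult` value; every URL, title, and text fragment is made up:

```ts
// Hypothetical example of a RagResult returned for a single scraped page.
const example: RagResult = {
  query: "model context protocol",
  results: [
    {
      url: "https://example.com/article", // placeholder page
      title: "Example article",
      markdown: "# Example article\n\nBody text...",
      excerpt: "Body text...",
      fromCache: false,
    },
  ],
  totalResults: 1,
  searchedAt: "2024-01-01T00:00:00.000Z",
};
```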
- src/index.ts:17-59 (registration) — MCP server registration of the `rag` tool, including name, description, Zod input schema for validation, and a thin wrapper handler that invokes the rag function and returns the result as MCP-formatted content.

```ts
server.tool(
  "rag",
  "Busca web com extração inteligente de conteúdo (igual Apify RAG Web Browser)",
  {
    query: z
      .string()
      .describe("Termo de busca para pesquisar na web"),
    maxResults: z
      .number()
      .int()
      .min(1)
      .max(10)
      .default(5)
      .describe("Máximo de páginas a processar"),
    outputFormat: z
      .enum(["markdown", "text", "html"])
      .default("markdown")
      .describe("Formato de saída do conteúdo"),
    contentMode: z
      .enum(["preview", "full", "summary"])
      .default("full")
      .describe("Modo de conteúdo: preview=resumo truncado (~300 chars), full=conteúdo completo, summary=sumarização inteligente via LLM"),
    summaryModel: z
      .string()
      .optional()
      .describe("Modelo Ollama para sumarização (default: llama3.2:1b). Ex: mistral:7b, qwen2.5:0.5b"),
    useJavascript: z
      .boolean()
      .default(false)
      .describe("Renderizar JavaScript nas páginas"),
    timeout: z
      .number()
      .int()
      .default(30000)
      .describe("Timeout para scraping em milissegundos"),
  },
  async (params) => {
    const result = await rag(params);
    return {
      content: [{ type: "text", text: JSON.stringify(result, null, 2) }],
    };
  }
);
```
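
On the client side, the tool can be invoked through the MCP TypeScript SDK. A minimal sketch, assuming the server is launched as a local stdio process (the launch command and `dist/index.js` path are assumptions about the build setup):

```ts
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Spawn the MCP server over stdio (assumed build output of src/index.ts).
const transport = new StdioClientTransport({
  command: "node",
  args: ["dist/index.js"],
});

const client = new Client({ name: "rag-demo-client", version: "1.0.0" });
await client.connect(transport);

// Call the "rag" tool; the handler returns the result JSON as one text content block.
const response = await client.callTool({
  name: "rag",
  arguments: { query: "model context protocol", maxResults: 3 },
});

console.log(response.content);
```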
- src/index.ts:20-52 (schema) — runtime Zod schema for input validation of the `rag` tool parameters (shown verbatim inside the registration excerpt above).