Skip to main content
Glama
jina-ai

Jina AI Remote MCP Server

Official
by jina-ai

deduplicate_strings

Eliminate redundant strings and retain semantically diverse subsets using Jina embeddings and submodular optimization. Ideal for deduplication, content sampling, or extracting unique representations from similar text data.

Instructions

Get top-k semantically unique strings from a list using Jina embeddings and submodular optimization. Use this when you have many similar strings and want to select the most diverse subset that covers the semantic space. Perfect for removing duplicates, selecting representative samples, or finding diverse content.

Input Schema

TableJSON Schema
NameRequiredDescriptionDefault
kNoNumber of unique strings to return. If not provided, automatically finds optimal k by looking at diminishing return
stringsYesArray of strings to deduplicate

Implementation Reference

  • Primary handler implementation for the 'deduplicate_strings' tool. Includes inline schema validation with Zod, fetches semantic embeddings from Jina AI API, applies submodular greedy selection for diversity, and returns top-k unique strings with indices.
    server.tool( "deduplicate_strings", "Get top-k semantically unique strings from a list using Jina embeddings and submodular optimization. Use this when you have many similar strings and want to select the most diverse subset that covers the semantic space. Perfect for removing duplicates, selecting representative samples, or finding diverse content.", { strings: z.array(z.string()).describe("Array of strings to deduplicate"), k: z.number().optional().describe("Number of unique strings to return. If not provided, automatically finds optimal k by looking at diminishing return") }, async ({ strings, k }: { strings: string[]; k?: number }) => { try { const props = getProps(); const tokenError = checkBearerToken(props.bearerToken); if (tokenError) { return tokenError; } if (strings.length === 0) { throw new Error("No strings provided for deduplication"); } if (k !== undefined && (k <= 0 || k > strings.length)) { throw new Error(`Invalid k value: ${k}. Must be between 1 and ${strings.length}`); } // Get embeddings from Jina API const response = await fetch('https://api.jina.ai/v1/embeddings', { method: 'POST', headers: { 'Accept': 'application/json', 'Content-Type': 'application/json', 'Authorization': `Bearer ${props.bearerToken}`, }, body: JSON.stringify({ model: 'jina-embeddings-v3', task: 'text-matching', input: strings }), }); if (!response.ok) { return handleApiError(response, "Getting embeddings"); } const data = await response.json() as any; if (!data.data || !Array.isArray(data.data)) { throw new Error("Invalid response format from embeddings API"); } // Extract embeddings const embeddings = data.data.map((item: any) => item.embedding); // Use submodular optimization to select diverse strings let selectedIndices: number[]; let optimalK: number; let values: number[]; if (k !== undefined) { // Use specified k selectedIndices = lazyGreedySelection(embeddings, k); values = []; } else { // Automatically find optimal k using saturation point const result = lazyGreedySelectionWithSaturation(embeddings); selectedIndices = result.selected; values = result.values; } // Get the selected strings const selectedStrings = selectedIndices.map(idx => ({ index: idx, text: strings[idx] })); // Return each deduplicated string as individual text items for consistency const contentItems: Array<{ type: 'text'; text: string }> = []; for (const selectedString of selectedStrings) { contentItems.push({ type: "text" as const, text: yamlStringify(selectedString), }); } return { content: contentItems, }; } catch (error) { return createErrorResponse(`Error: ${error instanceof Error ? error.message : String(error)}`); } }, );
  • Zod schema defining input parameters: array of strings to deduplicate and optional k for number of results.
    { strings: z.array(z.string()).describe("Array of strings to deduplicate"), k: z.number().optional().describe("Number of unique strings to return. If not provided, automatically finds optimal k by looking at diminishing return") },
  • Core helper function for fixed-k greedy submodular selection of diverse embeddings using facility location objective and lazy greedy updates.
    export function lazyGreedySelection(embeddings: number[][], k: number): number[] { const n = embeddings.length; if (k >= n) return Array.from({ length: n }, (_, i) => i); const selected: number[] = []; const remaining = new Set(Array.from({ length: n }, (_, i) => i)); // Pre-compute similarity matrix const similarityMatrix: number[][] = []; for (let i = 0; i < n; i++) { similarityMatrix[i] = []; for (let j = 0; j < n; j++) { // Clamp to non-negative to ensure monotone submodularity of facility-location objective const sim = cosineSimilarity(embeddings[i], embeddings[j]); similarityMatrix[i][j] = sim > 0 ? sim : 0; } } // Maintain current coverage vector (max similarity to selected set for each element) const currentCoverage = new Array(n).fill(0); // Priority queue implementation using array (simplified) const pq: Array<[number, number, number]> = []; // Initialize priority queue for (let i = 0; i < n; i++) { const gain = computeMarginalGainDiversity(i, currentCoverage, similarityMatrix); pq.push([-gain, 0, i]); } // Sort by gain (descending) pq.sort((a, b) => a[0] - b[0]); for (let iteration = 0; iteration < k; iteration++) { while (pq.length > 0) { const [negGain, lastUpdated, bestIdx] = pq.shift()!; if (!remaining.has(bestIdx)) continue; if (lastUpdated === iteration) { selected.push(bestIdx); remaining.delete(bestIdx); // Update coverage in O(n) const row = similarityMatrix[bestIdx]; for (let i = 0; i < n; i++) { if (row[i] > currentCoverage[i]) currentCoverage[i] = row[i]; } break; } const currentGain = computeMarginalGainDiversity(bestIdx, currentCoverage, similarityMatrix); pq.push([-currentGain, iteration, bestIdx]); pq.sort((a, b) => a[0] - b[0]); } } return selected; }
  • Core helper function for automatic-k determination via saturation detection in submodular coverage gains, used when k is not specified.
    export function lazyGreedySelectionWithSaturation( embeddings: number[][], threshold: number = 1e-2 ): { selected: number[], optimalK: number, values: number[] } { const n = embeddings.length; const selected: number[] = []; const remaining = new Set(Array.from({ length: n }, (_, i) => i)); const values: number[] = []; // Pre-compute similarity matrix const similarityMatrix: number[][] = []; for (let i = 0; i < n; i++) { similarityMatrix[i] = []; for (let j = 0; j < n; j++) { const sim = cosineSimilarity(embeddings[i], embeddings[j]); similarityMatrix[i][j] = sim > 0 ? sim : 0; } } const currentCoverage = new Array(n).fill(0); // Priority queue implementation using array (simplified) const pq: Array<[number, number, number]> = []; // Initialize priority queue for (let i = 0; i < n; i++) { const gain = computeMarginalGainDiversity(i, currentCoverage, similarityMatrix); pq.push([-gain, 0, i]); } // Sort by gain (descending) pq.sort((a, b) => a[0] - b[0]); let earlyStopK: number | null = null; for (let iteration = 0; iteration < n; iteration++) { while (pq.length > 0) { const [negGain, lastUpdated, bestIdx] = pq.shift()!; if (!remaining.has(bestIdx)) continue; if (lastUpdated === iteration) { selected.push(bestIdx); remaining.delete(bestIdx); // Compute current function value (coverage) const row = similarityMatrix[bestIdx]; for (let i = 0; i < n; i++) { if (row[i] > currentCoverage[i]) currentCoverage[i] = row[i]; } const functionValue = currentCoverage.reduce((sum, val) => sum + val, 0) / n; values.push(functionValue); // Early stop when the marginal gain (delta of normalized objective) falls below threshold if (values.length >= 2) { const delta = values[values.length - 1] - values[values.length - 2]; if (delta < threshold) { earlyStopK = values.length; // k is count of selected items } } break; } const currentGain = computeMarginalGainDiversity(bestIdx, currentCoverage, similarityMatrix); pq.push([-currentGain, iteration, bestIdx]); pq.sort((a, b) => a[0] - b[0]); } if (earlyStopK !== null) break; } // Choose k: prefer early stop detection; otherwise, use all collected values const optimalK = earlyStopK ?? values.length; const finalSelected = selected.slice(0, optimalK); return { selected: finalSelected, optimalK, values }; }
  • src/index.ts:21-22 (registration)
    Invokes registerJinaTools which includes the registration of deduplicate_strings tool.
    registerJinaTools(this.server, () => this.props); }

Latest Blog Posts

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/jina-ai/MCP'

If you have feedback or need assistance with the MCP directory API, please join our Discord server