Jina AI Remote MCP Server

by acchuang

deduplicate_strings

Remove duplicate strings and select semantically diverse content from lists using Jina embeddings and submodular optimization to cover the semantic space.

Instructions

Get top-k semantically unique strings from a list using Jina embeddings and submodular optimization. Use this when you have many similar strings and want to select the most diverse subset that covers the semantic space. Perfect for removing duplicates, selecting representative samples, or finding diverse content.
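Reading the helpers under Implementation Reference, the "submodular optimization" appears to be greedy maximization of a facility-location coverage objective over clamped cosine similarities. The notation below is an assumption introduced for illustration: $e_i$ is the embedding of string $i$, $S$ is the set of selected indices, and $n$ is the number of input strings. The $1/n$ normalization shows up only in the automatic-k variant and does not change which item the greedy step picks.

```latex
f(S) = \frac{1}{n} \sum_{i=1}^{n} \max_{j \in S} s_{ij},
\qquad
s_{ij} = \max\bigl(0, \cos(e_i, e_j)\bigr),
\qquad
f(\emptyset) = 0

% Marginal gain of adding candidate c to the current selection S
f(S \cup \{c\}) - f(S)
  = \frac{1}{n} \sum_{i=1}^{n} \max\Bigl(0,\; s_{ci} - \max_{j \in S} s_{ij}\Bigr)
```

A near-duplicate of an already selected string has $s_{ci}$ close to the existing coverage for every $i$, so its gain is close to zero and it is picked late or not at all.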

Input Schema

Name | Required | Description | Default
strings | Yes | Array of strings to deduplicate |
k | No | Number of unique strings to return. If omitted, the optimal k is found automatically from the point of diminishing returns |
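As a rough illustration of what valid arguments look like, the sketch below mirrors the emptiness and range checks that the handler under Implementation Reference performs. The names `DeduplicateArgs` and `checkArgs` are hypothetical, and the integer check is an added assumption: the published schema uses `z.number()`, which would also accept a fractional k.

```typescript
// Hypothetical argument type mirroring the input schema.
interface DeduplicateArgs {
  strings: string[];
  k?: number;
}

// Returns an error message, or null when the arguments are acceptable.
// The emptiness and range checks follow the handler; the integer check is an assumption.
function checkArgs({ strings, k }: DeduplicateArgs): string | null {
  if (strings.length === 0) {
    return "No strings provided for deduplication";
  }
  if (k !== undefined && (!Number.isInteger(k) || k <= 0 || k > strings.length)) {
    return `Invalid k value: ${k}. Must be between 1 and ${strings.length}`;
  }
  return null;
}

const exampleArgs: DeduplicateArgs = {
  strings: [
    "How do I reset my password?",
    "Password reset steps",
    "Where is my invoice?",
  ],
  k: 2,
};

console.log(checkArgs(exampleArgs)); // null, so the call would proceed
```

Leaving out `k` is also valid; in that case the server chooses the number of results itself.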

Implementation Reference

  • Registers the 'deduplicate_strings' tool on the MCP server, defining its description, input schema, and handler function.
    server.tool(
    	"deduplicate_strings",
    	"Get top-k semantically unique strings from a list using Jina embeddings and submodular optimization. Use this when you have many similar strings and want to select the most diverse subset that covers the semantic space. Perfect for removing duplicates, selecting representative samples, or finding diverse content. Returns the selected strings with their indices.",
    	{
    		strings: z.array(z.string()).describe("Array of strings to deduplicate"),
    		k: z.number().optional().describe("Number of unique strings to return. If not provided, automatically finds optimal k by looking at diminishing return")
    	},
    	async ({ strings, k }: { strings: string[]; k?: number }) => {
    		try {
    			const props = getProps();
    
    			const tokenError = checkBearerToken(props.bearerToken);
    			if (tokenError) {
    				return tokenError;
    			}
    
    			if (strings.length === 0) {
    				return {
    					content: [
    						{
    							type: "text" as const,
    							text: "No strings provided for deduplication",
    						},
    					],
    					isError: true,
    				};
    			}
    
    			if (k !== undefined && (k <= 0 || k > strings.length)) {
    				return {
    					content: [
    						{
    							type: "text" as const,
    							text: `Invalid k value: ${k}. Must be between 1 and ${strings.length}`,
    						},
    					],
    					isError: true,
    				};
    			}
    
    			// Get embeddings from Jina API
    			const response = await fetch('https://api.jina.ai/v1/embeddings', {
    				method: 'POST',
    				headers: {
    					'Accept': 'application/json',
    					'Content-Type': 'application/json',
    					'Authorization': `Bearer ${props.bearerToken}`,
    				},
    				body: JSON.stringify({
    					model: 'jina-embeddings-v3',
    					task: 'text-matching',
    					input: strings
    				}),
    			});
    
    			if (!response.ok) {
    				return handleApiError(response, "Getting embeddings");
    			}
    
    			const data = await response.json() as any;
    
    			if (!data.data || !Array.isArray(data.data)) {
    				return {
    					content: [
    						{
    							type: "text" as const,
    							text: "Invalid response format from embeddings API",
    						},
    					],
    					isError: true,
    				};
    			}
    
    			// Extract embeddings
    			const embeddings = data.data.map((item: any) => item.embedding);
    
    			// Use submodular optimization to select diverse strings
    			let selectedIndices: number[];
    			let optimalK: number;
    			let values: number[];
    
    			if (k !== undefined) {
    				// Use specified k
    				selectedIndices = lazyGreedySelection(embeddings, k);
    				values = [];
    			} else {
    				// Automatically find optimal k using saturation point
    				const result = lazyGreedySelectionWithSaturation(embeddings);
    				selectedIndices = result.selected;
    				values = result.values;
    			}
    
    			// Get the selected strings
    			const selectedStrings = selectedIndices.map(idx => ({
    				index: idx,
    				text: strings[idx]
    			}));
    
    			return {
    				content: [
    					{
    						type: "text" as const,
    						text: yamlStringify({
    							// values: values,
    							deduplicated_strings: selectedStrings,
    						}),
    					},
    				],
    			};
    		} catch (error) {
    			return {
    				content: [
    					{
    						type: "text" as const,
    						text: `Error: ${error instanceof Error ? error.message : String(error)}`,
    					},
    				],
    				isError: true,
    			};
    		}
    	},
    );
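The handler reads the response as `any` and only checks that `data.data` is an array. A stricter parse might look like the sketch below. The function name `extractEmbeddings` is hypothetical, the response shape is taken from how the handler reads it (`data[i].embedding`), and the length check against the number of inputs is an added assumption.

```typescript
// Assumed response shape, as read by the handler: { data: [{ embedding: number[] }, ...] }
// Returns one vector per input string, or null if the payload does not match.
function extractEmbeddings(payload: unknown, expected: number): number[][] | null {
  if (typeof payload !== "object" || payload === null) return null;

  const data = (payload as { data?: unknown }).data;
  // Assumption: the API returns exactly one embedding per input string.
  if (!Array.isArray(data) || data.length !== expected) return null;

  const embeddings: number[][] = [];
  for (const item of data) {
    const embedding = (item as { embedding?: unknown } | null)?.embedding;
    const isVector =
      Array.isArray(embedding) && embedding.every((x) => typeof x === "number");
    if (!isVector) return null;
    embeddings.push(embedding as number[]);
  }
  return embeddings;
}

const parsed = extractEmbeddings(
  { data: [{ embedding: [1, 0] }, { embedding: [0, 1] }] },
  2
);
console.log(parsed !== null ? parsed.length : "invalid payload");
```

Rejecting a mismatched count early avoids indexing `strings[idx]` with a selection computed over the wrong number of vectors.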
  • The handler function (the async function passed as the final argument to server.tool in the registration above) implements the tool logic: it validates input, fetches semantic embeddings from the Jina API, applies submodular greedy selection for diversity, and returns the deduplicated strings with their indices.
  • The Zod schema (the object passed just before the handler in the same registration) defines the input parameters: a required array of strings and an optional k.
  • Helper function performing lazy greedy submodular optimization to select exactly k diverse embeddings based on cosine similarity.
    export function lazyGreedySelection(embeddings: number[][], k: number): number[] {
        const n = embeddings.length;
        if (k >= n) return Array.from({ length: n }, (_, i) => i);
    
        const selected: number[] = [];
        const remaining = new Set(Array.from({ length: n }, (_, i) => i));
    
        // Pre-compute similarity matrix
        const similarityMatrix: number[][] = [];
        for (let i = 0; i < n; i++) {
            similarityMatrix[i] = [];
            for (let j = 0; j < n; j++) {
                // Clamp to non-negative to ensure monotone submodularity of facility-location objective
                const sim = cosineSimilarity(embeddings[i], embeddings[j]);
                similarityMatrix[i][j] = sim > 0 ? sim : 0;
            }
        }
    
        // Maintain current coverage vector (max similarity to selected set for each element)
        const currentCoverage = new Array(n).fill(0);
    
        // Priority queue implementation using array (simplified)
        const pq: Array<[number, number, number]> = [];
    
        // Initialize priority queue
        for (let i = 0; i < n; i++) {
            const gain = computeMarginalGainDiversity(i, currentCoverage, similarityMatrix);
            pq.push([-gain, 0, i]);
        }
    
        // Sort by gain (descending)
        pq.sort((a, b) => a[0] - b[0]);
    
        for (let iteration = 0; iteration < k; iteration++) {
            while (pq.length > 0) {
                const [negGain, lastUpdated, bestIdx] = pq.shift()!;
    
                if (!remaining.has(bestIdx)) continue;
    
                if (lastUpdated === iteration) {
                    selected.push(bestIdx);
                    remaining.delete(bestIdx);
                    // Update coverage in O(n)
                    const row = similarityMatrix[bestIdx];
                    for (let i = 0; i < n; i++) {
                        if (row[i] > currentCoverage[i]) currentCoverage[i] = row[i];
                    }
                    break;
                }
    
                const currentGain = computeMarginalGainDiversity(bestIdx, currentCoverage, similarityMatrix);
                pq.push([-currentGain, iteration, bestIdx]);
                pq.sort((a, b) => a[0] - b[0]);
            }
        }
    
        return selected;
    }
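`lazyGreedySelection` calls two helpers, `cosineSimilarity` and `computeMarginalGainDiversity`, that are not shown on this page. The sketch below is one plausible way to write them, inferred from how they are called and from the facility-location comment in the code; the actual implementations may differ.

```typescript
// Cosine similarity of two equal-length vectors.
// Assumption: a zero-norm vector is treated as having similarity 0 to everything.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Facility-location marginal gain: how much total coverage would increase
// if `candidate` were added to the current selection.
function computeMarginalGainDiversity(
  candidate: number,
  currentCoverage: number[],
  similarityMatrix: number[][]
): number {
  const row = similarityMatrix[candidate];
  let gain = 0;
  for (let i = 0; i < row.length; i++) {
    const improvement = row[i] - currentCoverage[i];
    if (improvement > 0) gain += improvement;
  }
  return gain;
}

// Toy similarity matrix: items 0 and 1 are near-duplicates, item 2 is unrelated.
const toySimilarity = [
  [1, 0.9, 0],
  [0.9, 1, 0],
  [0, 0, 1],
];
// Coverage after item 0 has been selected (row 0 of the matrix).
const coverageAfterFirstPick = [1, 0.9, 0];

const gainDuplicate = computeMarginalGainDiversity(1, coverageAfterFirstPick, toySimilarity);
const gainUnrelated = computeMarginalGainDiversity(2, coverageAfterFirstPick, toySimilarity);
console.log(gainDuplicate < gainUnrelated); // true: the unrelated item is the better next pick
```

Under these assumptions the near-duplicate adds only about 0.1 of coverage while the unrelated item adds 1, which is why the greedy loop prefers strings that are far from everything already selected.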
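  • Helper function that determines k automatically: it runs the same lazy greedy selection, records the normalized objective after each pick, and stops once the gain between two consecutive picks falls below the threshold (default 1e-2). It returns the selected indices, the chosen k, and the recorded objective values.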
    export function lazyGreedySelectionWithSaturation(
        embeddings: number[][],
        threshold: number = 1e-2
    ): { selected: number[], optimalK: number, values: number[] } {
        const n = embeddings.length;
    
        const selected: number[] = [];
        const remaining = new Set(Array.from({ length: n }, (_, i) => i));
        const values: number[] = [];
    
        // Pre-compute similarity matrix
        const similarityMatrix: number[][] = [];
        for (let i = 0; i < n; i++) {
            similarityMatrix[i] = [];
            for (let j = 0; j < n; j++) {
                const sim = cosineSimilarity(embeddings[i], embeddings[j]);
                similarityMatrix[i][j] = sim > 0 ? sim : 0;
            }
        }
    
        const currentCoverage = new Array(n).fill(0);
    
        // Priority queue implementation using array (simplified)
        const pq: Array<[number, number, number]> = [];
    
        // Initialize priority queue
        for (let i = 0; i < n; i++) {
            const gain = computeMarginalGainDiversity(i, currentCoverage, similarityMatrix);
            pq.push([-gain, 0, i]);
        }
    
        // Sort by gain (descending)
        pq.sort((a, b) => a[0] - b[0]);
    
        let earlyStopK: number | null = null;
        for (let iteration = 0; iteration < n; iteration++) {
            while (pq.length > 0) {
                const [negGain, lastUpdated, bestIdx] = pq.shift()!;
    
                if (!remaining.has(bestIdx)) continue;
    
                if (lastUpdated === iteration) {
                    selected.push(bestIdx);
                    remaining.delete(bestIdx);
    
                    // Compute current function value (coverage)
                    const row = similarityMatrix[bestIdx];
                    for (let i = 0; i < n; i++) {
                        if (row[i] > currentCoverage[i]) currentCoverage[i] = row[i];
                    }
                    const functionValue = currentCoverage.reduce((sum, val) => sum + val, 0) / n;
                    values.push(functionValue);
    
                    // Early stop when the marginal gain (delta of normalized objective) falls below threshold
                    if (values.length >= 2) {
                        const delta = values[values.length - 1] - values[values.length - 2];
                        if (delta < threshold) {
                            earlyStopK = values.length; // k is count of selected items
                        }
                    }
    
                    break;
                }
    
                const currentGain = computeMarginalGainDiversity(bestIdx, currentCoverage, similarityMatrix);
                pq.push([-currentGain, iteration, bestIdx]);
                pq.sort((a, b) => a[0] - b[0]);
            }
            if (earlyStopK !== null) break;
        }
    
        // Choose k: prefer early stop detection; otherwise, use all collected values
        const optimalK = earlyStopK ?? values.length;
        const finalSelected = selected.slice(0, optimalK);
    
        return { selected: finalSelected, optimalK, values };
    }
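The stopping rule in `lazyGreedySelectionWithSaturation` can be looked at in isolation. The sketch below applies the same rule to a list of objective values; `saturationK` is a hypothetical name, and the sample values are made up for illustration.

```typescript
// Given the normalized objective after each greedy pick, return how many picks to keep.
// Mirrors the rule above: stop at the first pick whose improvement over the previous
// value is below `threshold`. That low-gain pick is still counted in the result.
function saturationK(values: number[], threshold: number = 1e-2): number {
  for (let i = 1; i < values.length; i++) {
    const delta = values[i] - values[i - 1];
    if (delta < threshold) return i + 1;
  }
  // Never saturated: keep every pick.
  return values.length;
}

// Illustrative values: large early gains, then a plateau.
console.log(saturationK([0.5, 0.8, 0.9, 0.905, 0.906])); // 4
console.log(saturationK([0.2, 0.5, 0.9])); // 3
```

Note that the first pick is never tested against the threshold, so the result is always at least 1 for a non-empty input, and the pick that triggers the stop is included rather than discarded.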

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/acchuang/jina-mcp'
