deduplicate_strings
Select semantically unique strings from a list using Jina embeddings and submodular optimization. Ideal for removing duplicates, finding representative samples, or extracting diverse content. Returns chosen strings with indices for efficient analysis.
Instructions
Get top-k semantically unique strings from a list using Jina embeddings and submodular optimization. Use this when you have many similar strings and want to select the most diverse subset that covers the semantic space. Perfect for removing duplicates, selecting representative samples, or finding diverse content. Returns the selected strings with their indices.
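As a rough illustration of what selecting a diverse subset through submodular optimization can look like, the sketch below greedily maximizes a facility-location objective over cosine similarities. It is not the tool's actual implementation: the objective, the `gain_ratio` stopping rule for the automatic-k case, the function names, and the hand-written `toy_vectors` (standing in for real Jina embeddings) are all assumptions made for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def greedy_select(embeddings, k=None, gain_ratio=0.1):
    """Greedily pick indices that maximize a facility-location objective.

    If k is None, stop once the best marginal gain drops below
    gain_ratio times the first gain (an assumed diminishing-returns rule).
    """
    n = len(embeddings)
    # Pairwise similarities, with negatives clamped to 0.
    sim = [[max(0.0, cosine(embeddings[i], embeddings[j])) for j in range(n)]
           for i in range(n)]
    coverage = [0.0] * n  # best similarity of each item to the selected set
    selected = []
    first_gain = None
    limit = n if k is None else min(k, n)

    while len(selected) < limit:
        best_i, best_gain = -1, -1.0
        for i in range(n):
            if i in selected:
                continue
            # Marginal gain: how much item i improves coverage of every item.
            gain = sum(max(sim[i][j] - coverage[j], 0.0) for j in range(n))
            if gain > best_gain:
                best_i, best_gain = i, gain
        if first_gain is None:
            first_gain = best_gain
        elif k is None and best_gain < gain_ratio * first_gain:
            break  # diminishing returns: further picks add little coverage
        selected.append(best_i)
        coverage = [max(coverage[j], sim[best_i][j]) for j in range(n)]
    return selected


# Toy data: two near-duplicate pairs, with made-up 2-D "embeddings".
strings = ["hello world", "hello, world!", "goodbye", "farewell for now"]
toy_vectors = [(1.0, 0.0), (0.98, 0.2), (0.0, 1.0), (0.2, 0.98)]

picked = greedy_select(toy_vectors, k=2)
auto_picked = greedy_select(toy_vectors)  # k chosen by the stopping rule
for idx in picked:
    print(idx, strings[idx])
```

With `k=2` the sketch picks one string from each near-duplicate pair, and with `k` omitted the stopping rule halts after the same two picks because a third string would add almost no new coverage.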
Input Schema
Name | Required | Description | Default |
---|---|---|---|
k | No | Number of unique strings to return. If omitted, the optimal k is chosen automatically from the point of diminishing returns. | |
strings | Yes | Array of strings to deduplicate | |
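A minimal sketch of an arguments payload that matches the schema above. Only the `strings` and `k` field names come from the table. The variable names and sample strings are illustrative, and how the payload reaches the tool (client library, transport) is deliberately left out.

```python
import json

# Arguments for deduplicate_strings, following the input schema.
arguments = {
    "strings": [
        "How do I reset my password?",
        "How can I reset my password?",
        "Where is my invoice?",
    ],
    "k": 2,  # optional; leave it out to let the tool choose k
}

# Same call without k, relying on the automatic choice.
arguments_auto_k = {"strings": arguments["strings"]}

payload = json.dumps(arguments)
print(payload)
```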