split_large_claim_groups_v1_2

Split large claim groups in academic literature into smaller subgroups using TF-IDF vectorization and KMeans clustering.

Instructions

Split oversized claim groups (using TF-IDF + KMeans)

Input Schema

Name             Required  Description  Default
split_threshold  No        -            -
target_size      No        -            -
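Both parameters are optional and undocumented in the schema; when omitted, the handler falls back to configured defaults. As a hedged sketch, that None-means-use-config resolution can be mirrored like this (the config key names echo the implementation below, but the numeric values here are purely illustrative assumptions):

```python
# Illustrative sketch of how the two optional parameters resolve to
# configured defaults. The numeric values are assumptions, not the
# server's real configuration.
ASSUMED_DEFAULTS = {"claim_split_threshold": 200, "claim_target_size": 50}

def resolve_params(split_threshold=None, target_size=None):
    """Mirror the handler's None-means-use-config behavior."""
    if split_threshold is None:
        split_threshold = ASSUMED_DEFAULTS["claim_split_threshold"]
    if target_size is None:
        target_size = ASSUMED_DEFAULTS["claim_target_size"]
    return split_threshold, target_size

print(resolve_params())          # both fall back to defaults
print(resolve_params(300, 40))   # explicit values win
```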

Output Schema

Name             Required  Description  Default
(no fields documented)

Implementation Reference

  • The core handler function for the 'split_large_claim_groups_v1_2' tool. It identifies large claim groups exceeding a threshold, uses TF-IDF vectorization on claim texts, applies KMeans clustering to subgroup them, and updates the database with new subgroups.
    @mcp.tool()
    def split_large_claim_groups_v1_2(
        split_threshold: int | None = None,
        target_size: int | None = None,
    ) -> dict[str, Any]:
        """拆分超大 claim groups (使用 TF-IDF + KMeans)"""
        try:
            # Fall back to configured defaults
            if split_threshold is None:
                split_threshold = config.claim_split_threshold()
            if target_size is None:
                target_size = config.claim_target_size()
            
            # Find the large groups that need splitting
            large_groups = query_all("""
                SELECT g.group_id, g.group_key, COUNT(*) as n
                FROM claim_groups g
                JOIN claim_group_members m ON m.group_id = g.group_id
                WHERE g.parent_group_id IS NULL
                GROUP BY g.group_id, g.group_key
                HAVING COUNT(*) > %s
            """, (split_threshold,))
            
            if not large_groups:
                return {"message": "No groups exceed threshold", "split_count": 0}
    
            from sklearn.feature_extraction.text import TfidfVectorizer
            from sklearn.cluster import KMeans
            import numpy as np
    
            split_count = 0
            
            for lg in large_groups:
                # Fetch the claims in this group
                claims = query_all("""
                    SELECT c.claim_id, c.claim_text
                    FROM claim_group_members m
                    JOIN claims c ON c.claim_id = m.claim_id
                    WHERE m.group_id = %s
                    ORDER BY c.claim_id
                """, (lg["group_id"],))
                
                if len(claims) < 2:
                    continue
                
                # TF-IDF vectorization
                texts = [normalize_text(c["claim_text"]) for c in claims]
                vectorizer = TfidfVectorizer(max_features=500, min_df=2, max_df=0.9)
                try:
                    tfidf_matrix = vectorizer.fit_transform(texts)
                except ValueError:
                    continue  # too few texts, or texts too similar
                
                # KMeans clustering
                k = max(2, int(np.ceil(len(claims) / target_size)))
                kmeans = KMeans(n_clusters=k, random_state=0, n_init=10)
                labels = kmeans.fit_predict(tfidf_matrix)
                
                # Create a subgroup for each cluster
                with get_db() as conn:
                    for cluster_id in range(k):
                        cluster_claims = [claims[i]["claim_id"] for i, l in enumerate(labels) if l == cluster_id]
                        if not cluster_claims:
                            continue
                        
                        # Subgroup key
                        subgroup_key = f"kmeans|cluster_{cluster_id}"
                        
                        with conn.cursor() as cur:
                            with conn.transaction():
                                cur.execute("""
                                    INSERT INTO claim_groups 
                                    (group_key, parent_group_id, subgroup_key, topic_entity_id, sign, setting, id_family, params_json)
                                    SELECT 
                                        group_key || '|' || %s,
                                        group_id,
                                        %s,
                                        topic_entity_id, sign, setting, id_family, params_json
                                    FROM claim_groups WHERE group_id = %s
                                    RETURNING group_id
                                """, (subgroup_key, subgroup_key, lg["group_id"]))
                                subgroup_id = cur.fetchone()["group_id"]
                                
                                # Move members into the subgroup
                                for cid in cluster_claims:
                                    cur.execute("""
                                        UPDATE claim_group_members SET group_id = %s
                                        WHERE claim_id = %s AND group_id = %s
                                    """, (subgroup_id, cid, lg["group_id"]))
                        
                        split_count += 1
    
            return {"split_count": split_count, "large_groups_processed": len(large_groups)}
        except ImportError:
            return {"error": "scikit-learn not installed. Run: pip install scikit-learn"}
        except Exception as e:
            return {"error": str(e)}
  • Top-level registration of the graph_v12 tools module, which includes the split_large_claim_groups_v1_2 tool, by calling register_graph_v12_tools on the MCP instance.
    register_graph_v12_tools(mcp)
  • Import of the register_graph_v12_tools function used to register the tool.
    from paperlib_mcp.tools.graph_v12 import register_graph_v12_tools
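Setting the database writes aside, the handler's clustering bookkeeping reduces to two steps: pick the number of clusters k from the target subgroup size, then partition claim IDs by cluster label. A minimal stdlib-only sketch of those two steps (the claim IDs and labels below are made up; in the real handler the labels come from `kmeans.fit_predict`):

```python
import math
from collections import defaultdict

def choose_k(n_claims: int, target_size: int) -> int:
    # At least 2 subgroups, aiming for roughly target_size claims each,
    # mirroring k = max(2, ceil(n / target_size)) in the handler.
    return max(2, math.ceil(n_claims / target_size))

def partition_by_label(claim_ids, labels):
    # Group claim IDs by cluster label, as the handler does before
    # creating one subgroup per non-empty cluster.
    clusters = defaultdict(list)
    for cid, label in zip(claim_ids, labels):
        clusters[label].append(cid)
    return dict(clusters)

claim_ids = [101, 102, 103, 104, 105]
labels = [0, 1, 0, 1, 1]            # stand-in for kmeans.fit_predict output
print(choose_k(len(claim_ids), 2))  # 3
print(partition_by_label(claim_ids, labels))  # {0: [101, 103], 1: [102, 104, 105]}
```

Note that because k is derived from the group size rather than capped, a tiny target_size simply yields more, smaller subgroups.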
Behavior: 2/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

No annotations are provided, so the description carries full burden. It mentions the algorithm (TF-IDF + KMeans) but doesn't disclose behavioral traits like whether this is a read-only or destructive operation, what permissions are needed, how long it takes, or what happens to the original claim groups. For a tool with 'split' in its name and no annotations, this is a significant gap in transparency.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is extremely concise—a single sentence in Chinese that directly states the action and method. It's front-loaded with the core purpose and wastes no words. Every part of the description earns its place by specifying the algorithm used.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 3/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (splitting operations often involve data transformation), lack of annotations, and 0% schema coverage, the description is incomplete. However, the presence of an output schema mitigates some need to explain return values. The description covers the 'what' and 'how' at a high level but misses crucial details about behavior and parameters.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 2/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 0%, so the description must compensate for undocumented parameters. It adds no information about the two parameters (split_threshold, target_size)—not explaining what they control, their units, typical values, or how they interact. The description fails to provide meaningful semantics beyond what the bare schema offers.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 3/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description states the tool splits large claim groups using TF-IDF + KMeans, which gives a general purpose. However, it doesn't specify what 'claim groups' are in this context or what resources are being manipulated, making it somewhat vague. It doesn't clearly differentiate from sibling tools like 'build_claim_groups_v1' or 'merge_entities'.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides no guidance on when to use this tool versus alternatives. There's no mention of prerequisites (e.g., needing existing claim groups), when-not-to-use scenarios, or how it relates to sibling tools like 'build_claim_groups_v1' or 'merge_entities'. The agent must infer usage from the name alone.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
