split_large_claim_groups_v1_2

Split large claim groups in academic literature using TF-IDF and KMeans clustering to manage and organize research data effectively.

Instructions

Split oversized claim groups (using TF-IDF + KMeans)

Input Schema

Name             Required  Description                                           Default
split_threshold  No        Member count above which a top-level group is split   config.claim_split_threshold()
target_size      No        Approximate number of claims per resulting subgroup   config.claim_target_size()
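
Both parameters are optional integers; when omitted, the handler falls back to the project's configured defaults (see the implementation below). A call that overrides both might pass arguments like this sketch; the values are made up for illustration, not the shipped defaults:

    # Illustrative tool-call arguments; omit either key to use the configured default.
    arguments = {"split_threshold": 200, "target_size": 50}

On success the tool returns split_count and large_groups_processed; if no group exceeds the threshold it returns a message with split_count 0.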

Implementation Reference

  • The core handler function for the 'split_large_claim_groups_v1_2' tool. It identifies large claim groups exceeding a threshold, uses TF-IDF vectorization on claim texts, applies KMeans clustering to subgroup them, and updates the database with new subgroups.
    @mcp.tool()
    def split_large_claim_groups_v1_2(
        split_threshold: int | None = None,
        target_size: int | None = None,
    ) -> dict[str, Any]:
        """拆分超大 claim groups (使用 TF-IDF + KMeans)"""
        try:
            # Fall back to configured defaults when arguments are omitted
            if split_threshold is None:
                split_threshold = config.claim_split_threshold()
            if target_size is None:
                target_size = config.claim_target_size()
            
            # Find the top-level groups whose membership exceeds the threshold
            large_groups = query_all("""
                SELECT g.group_id, g.group_key, COUNT(*) as n
                FROM claim_groups g
                JOIN claim_group_members m ON m.group_id = g.group_id
                WHERE g.parent_group_id IS NULL
                GROUP BY g.group_id, g.group_key
                HAVING COUNT(*) > %s
            """, (split_threshold,))
            
            if not large_groups:
                return {"message": "No groups exceed threshold", "split_count": 0}
    
            from sklearn.feature_extraction.text import TfidfVectorizer
            from sklearn.cluster import KMeans
            import numpy as np
    
            split_count = 0
            
            for lg in large_groups:
                # Fetch the claims belonging to this group
                claims = query_all("""
                    SELECT c.claim_id, c.claim_text
                    FROM claim_group_members m
                    JOIN claims c ON c.claim_id = m.claim_id
                    WHERE m.group_id = %s
                    ORDER BY c.claim_id
                """, (lg["group_id"],))
                
                if len(claims) < 2:
                    continue
                
                # TF-IDF vectorization of the normalized claim texts
                texts = [normalize_text(c["claim_text"]) for c in claims]
                vectorizer = TfidfVectorizer(max_features=500, min_df=2, max_df=0.9)
                try:
                    tfidf_matrix = vectorizer.fit_transform(texts)
                except ValueError:
                    continue  # texts too few or too similar (empty vocabulary)
                
                # KMeans clustering: roughly target_size claims per cluster, at least 2 clusters
                k = max(2, int(np.ceil(len(claims) / target_size)))
                kmeans = KMeans(n_clusters=k, random_state=0, n_init=10)
                labels = kmeans.fit_predict(tfidf_matrix)
                
                # Create a subgroup for each cluster
                with get_db() as conn:
                    for cluster_id in range(k):
                        cluster_claims = [claims[i]["claim_id"] for i, l in enumerate(labels) if l == cluster_id]
                        if not cluster_claims:
                            continue
                        
                        # Subgroup key
                        subgroup_key = f"kmeans|cluster_{cluster_id}"
                        
                        with conn.cursor() as cur:
                            with conn.transaction():
                                cur.execute("""
                                    INSERT INTO claim_groups 
                                    (group_key, parent_group_id, subgroup_key, topic_entity_id, sign, setting, id_family, params_json)
                                    SELECT 
                                        group_key || '|' || %s,
                                        group_id,
                                        %s,
                                        topic_entity_id, sign, setting, id_family, params_json
                                    FROM claim_groups WHERE group_id = %s
                                    RETURNING group_id
                                """, (subgroup_key, subgroup_key, lg["group_id"]))
                                subgroup_id = cur.fetchone()["group_id"]
                                
                                # Move the cluster's members into the new subgroup
                                for cid in cluster_claims:
                                    cur.execute("""
                                        UPDATE claim_group_members SET group_id = %s
                                        WHERE claim_id = %s AND group_id = %s
                                    """, (subgroup_id, cid, lg["group_id"]))
                        
                        split_count += 1  # counts subgroups created, not parent groups
    
            return {"split_count": split_count, "large_groups_processed": len(large_groups)}
        except ImportError:
            return {"error": "scikit-learn not installed. Run: pip install scikit-learn"}
        except Exception as e:
            return {"error": str(e)}
  • Top-level registration of the graph_v12 tools module, which includes the split_large_claim_groups_v1_2 tool, by calling register_graph_v12_tools on the MCP instance.
    register_graph_v12_tools(mcp)
  • Import of the register_graph_v12_tools function used to register the tool.
    from paperlib_mcp.tools.graph_v12 import register_graph_v12_tools
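  • A hedged sketch of what the server entrypoint wiring might look like when the pieces above are put together. The FastMCP import path and the server name are assumptions, not confirmed by this page.
    # Hypothetical entrypoint sketch showing where registration happens; the
    # FastMCP import path and server name are assumptions, not from this page.
    from mcp.server.fastmcp import FastMCP
    from paperlib_mcp.tools.graph_v12 import register_graph_v12_tools

    mcp = FastMCP("paperlib-mcp")  # assumed server name
    register_graph_v12_tools(mcp)  # registers split_large_claim_groups_v1_2 among others

    if __name__ == "__main__":
        mcp.run()  # stdio transport by default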

MCP directory API

We provide all the information about MCP servers via our MCP directory API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/h-lu/paperlib-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.