split_large_claim_groups_v1_2
Split oversized claim groups into topical subgroups using TF-IDF vectorization and KMeans clustering, keeping claim groups in the literature database at a manageable size.
Instructions
Split oversized claim groups (using TF-IDF + KMeans).
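The splitting idea can be seen in isolation before reading the full handler. A minimal sketch assuming scikit-learn; the `claims` list and `target_size` here are made-up toy data, and the real tool reads claim texts from the database and normalizes them first:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np

claims = [
    "warming increases drought frequency",
    "drought frequency rises with warming",
    "a carbon tax reduces emissions",
    "emissions fall under a carbon tax",
]
target_size = 2  # desired claims per subgroup

X = TfidfVectorizer().fit_transform(claims)  # the handler also caps features and df
k = max(2, int(np.ceil(len(claims) / target_size)))
labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(X)

for label, text in zip(labels, claims):
    print(label, text)  # claims sharing a label go into the same subgroup
```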
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| split_threshold | No | Maximum member count for a top-level group; groups with more claims than this are split. | `config.claim_split_threshold()` |
| target_size | No | Desired number of claims per resulting subgroup; determines the KMeans cluster count. | `config.claim_target_size()` |
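Both parameters fall back to configured defaults when omitted. `target_size` determines how many KMeans clusters each oversized group is split into; a small sketch of that mapping (the helper name `cluster_count` is invented here for illustration, but the formula is the one used in the handler below):

```python
import numpy as np

def cluster_count(n_claims: int, target_size: int) -> int:
    # Mirrors the handler: k = max(2, ceil(n / target_size)),
    # so every split produces at least two subgroups.
    return max(2, int(np.ceil(n_claims / target_size)))

assert cluster_count(120, 25) == 5  # 120 claims at ~25 per subgroup -> 5 clusters
assert cluster_count(60, 50) == 2   # never fewer than 2 clusters
```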
Implementation Reference
- The core handler function for the `split_large_claim_groups_v1_2` tool. It identifies claim groups whose member count exceeds a threshold, vectorizes their claim texts with TF-IDF, clusters them with KMeans, and writes the resulting subgroups back to the database.

```python
from typing import Any

# Module-level helpers assumed from the surrounding paperlib_mcp module:
# config, query_all, get_db, normalize_text, and the mcp server instance.

@mcp.tool()
def split_large_claim_groups_v1_2(
    split_threshold: int | None = None,
    target_size: int | None = None,
) -> dict[str, Any]:
    """Split oversized claim groups (using TF-IDF + KMeans)."""
    try:
        # Fall back to configured defaults
        if split_threshold is None:
            split_threshold = config.claim_split_threshold()
        if target_size is None:
            target_size = config.claim_target_size()

        # Find the large groups that need splitting
        large_groups = query_all("""
            SELECT g.group_id, g.group_key, COUNT(*) as n
            FROM claim_groups g
            JOIN claim_group_members m ON m.group_id = g.group_id
            WHERE g.parent_group_id IS NULL
            GROUP BY g.group_id, g.group_key
            HAVING COUNT(*) > %s
        """, (split_threshold,))

        if not large_groups:
            return {"message": "No groups exceed threshold", "split_count": 0}

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.cluster import KMeans
        import numpy as np

        split_count = 0
        for lg in large_groups:
            # Fetch this group's claims
            claims = query_all("""
                SELECT c.claim_id, c.claim_text
                FROM claim_group_members m
                JOIN claims c ON c.claim_id = m.claim_id
                WHERE m.group_id = %s
                ORDER BY c.claim_id
            """, (lg["group_id"],))

            if len(claims) < 2:
                continue

            # TF-IDF vectorization
            texts = [normalize_text(c["claim_text"]) for c in claims]
            vectorizer = TfidfVectorizer(max_features=500, min_df=2, max_df=0.9)
            try:
                tfidf_matrix = vectorizer.fit_transform(texts)
            except ValueError:
                continue  # too few texts, or texts too similar

            # KMeans clustering
            k = max(2, int(np.ceil(len(claims) / target_size)))
            kmeans = KMeans(n_clusters=k, random_state=0, n_init=10)
            labels = kmeans.fit_predict(tfidf_matrix)

            # Create a subgroup for each cluster
            with get_db() as conn:
                for cluster_id in range(k):
                    cluster_claims = [claims[i]["claim_id"]
                                      for i, l in enumerate(labels) if l == cluster_id]
                    if not cluster_claims:
                        continue

                    # Subgroup key
                    subgroup_key = f"kmeans|cluster_{cluster_id}"
                    with conn.cursor() as cur:
                        with conn.transaction():
                            cur.execute("""
                                INSERT INTO claim_groups
                                    (group_key, parent_group_id, subgroup_key, topic_entity_id,
                                     sign, setting, id_family, params_json)
                                SELECT group_key || '|' || %s, group_id, %s, topic_entity_id,
                                       sign, setting, id_family, params_json
                                FROM claim_groups WHERE group_id = %s
                                RETURNING group_id
                            """, (subgroup_key, subgroup_key, lg["group_id"]))
                            subgroup_id = cur.fetchone()["group_id"]

                            # Move members into the subgroup
                            for cid in cluster_claims:
                                cur.execute("""
                                    UPDATE claim_group_members
                                    SET group_id = %s
                                    WHERE claim_id = %s AND group_id = %s
                                """, (subgroup_id, cid, lg["group_id"]))

            split_count += 1

        return {"split_count": split_count, "large_groups_processed": len(large_groups)}

    except ImportError:
        return {"error": "scikit-learn not installed. Run: pip install scikit-learn"}
    except Exception as e:
        return {"error": str(e)}
```
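The handler returns one of three dict shapes, all visible in the code above: a no-op message, a success summary, or an error. A sketch of how a caller might branch on the result; how `result` is obtained (direct call or via an MCP client) is left open here:

```python
# Keys below are taken from the handler's return statements.
result = split_large_claim_groups_v1_2(split_threshold=100, target_size=25)

if "error" in result:
    raise RuntimeError(result["error"])  # sklearn missing, or a runtime failure
elif result.get("split_count", 0) == 0:
    print(result.get("message", "No groups exceed threshold"))
else:
    print(f"Split {result['split_count']} of "
          f"{result['large_groups_processed']} large groups")
```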
- src/paperlib_mcp/server.py:52 (registration): top-level registration of the graph_v12 tools module, which includes the split_large_claim_groups_v1_2 tool, by calling register_graph_v12_tools on the MCP instance: `register_graph_v12_tools(mcp)`
- src/paperlib_mcp/server.py:31 (registration): import of the register_graph_v12_tools function used to register the tool: `from paperlib_mcp.tools.graph_v12 import register_graph_v12_tools`
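The handler itself is defined inside that module's registration hook. A hypothetical sketch of the module's shape, assuming a FastMCP-style `mcp` instance; apart from `register_graph_v12_tools` and the tool shown above, everything here is an assumption, and the actual graph_v12 module may register additional tools:

```python
from typing import Any

def register_graph_v12_tools(mcp) -> None:
    """Attach the v1.2 graph tools to an MCP server instance."""

    @mcp.tool()
    def split_large_claim_groups_v1_2(
        split_threshold: int | None = None,
        target_size: int | None = None,
    ) -> dict[str, Any]:
        ...  # full body shown in the Implementation Reference above
```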