select_high_value_chunks

Extract key research findings, methods, and results from academic documents or evidence packs using keyword filtering to identify the most relevant content chunks.

Instructions

Select high-value chunks

Selects chunks containing key method-, identification-, and result-related content from a given document or evidence pack.

Args:
  • doc_id: document ID (provide either this or pack_id)
  • pack_id: evidence pack ID (provide either this or doc_id)
  • max_chunks: maximum number of chunks to return; defaults to 60
  • keyword_mode: keyword set to use, "default" or "strict"

Returns: a list of high-value chunks, each with chunk_id, doc_id, page range, and the reason it was selected.
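
For concreteness, a successful result serialized by the output model (shown under Implementation Reference below) looks like the following; the IDs and page numbers are invented for illustration:

```python
# Hypothetical return value; field names follow SelectHighValueChunksOut
# and HighValueChunk, while the concrete values are made up.
{
    "chunks": [
        {
            "chunk_id": 1042,
            "doc_id": "doc-123",
            "page_start": 7,
            "page_end": 8,
            "reason": "keyword match: identification, placebo",
        }
    ],
    "error": None,
}
```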

Input Schema

| Name | Required | Description | Default |
|------|----------|-------------|---------|
| doc_id | No | Document ID (provide either this or pack_id) | |
| pack_id | No | Evidence pack ID (provide either this or doc_id) | |
| max_chunks | No | Maximum number of chunks to return | 60 |
| keyword_mode | No | Keyword set to use, "default" or "strict" | default |

Implementation Reference

  • Main handler function for the select_high_value_chunks tool, decorated with @mcp.tool(). It calls the internal helper and wraps the output in Pydantic models.

```python
@mcp.tool()
def select_high_value_chunks(
    doc_id: str | None = None,
    pack_id: int | None = None,
    max_chunks: int = 60,
    keyword_mode: str = "default",
) -> dict[str, Any]:
    """Select high-value chunks.

    Selects chunks containing key method-, identification-, and
    result-related content from a given document or evidence pack.

    Args:
        doc_id: Document ID (provide either this or pack_id).
        pack_id: Evidence pack ID (provide either this or doc_id).
        max_chunks: Maximum number of chunks to return; defaults to 60.
        keyword_mode: Keyword set to use, "default" or "strict".

    Returns:
        A list of high-value chunks, each with chunk_id, doc_id,
        page range, and the reason it was selected.
    """
    try:
        # Delegate to the internal helper
        result = _select_high_value_chunks_internal(doc_id, pack_id, max_chunks, keyword_mode)
        if result.get("error"):
            return SelectHighValueChunksOut(
                error=MCPErrorModel(**result["error"]),
            ).model_dump()

        # Convert raw dicts into Pydantic models
        chunks = [
            HighValueChunk(
                chunk_id=c["chunk_id"],
                doc_id=c["doc_id"],
                page_start=c.get("page_start"),
                page_end=c.get("page_end"),
                reason=c["reason"],
            )
            for c in result.get("chunks", [])
        ]
        return SelectHighValueChunksOut(chunks=chunks).model_dump()
    except Exception as e:
        return SelectHighValueChunksOut(
            error=MCPErrorModel(code="DB_CONN_ERROR", message=str(e)),
        ).model_dump()
```
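  • MCPErrorModel is referenced above but not defined in these excerpts. A minimal sketch consistent with how it is constructed here (a code plus a message) might look like this; the actual definition may carry more fields.

```python
from pydantic import BaseModel

class MCPErrorModel(BaseModel):
    """Sketch only: error envelope as used above; real definition not shown."""
    code: str     # e.g. "VALIDATION_ERROR" or "DB_CONN_ERROR"
    message: str
```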
  • Pydantic schema definitions for the tool's input (SelectHighValueChunksIn), the HighValueChunk model, and the output (SelectHighValueChunksOut).

```python
# ============================================================
# Models for the select_high_value_chunks tool
# ============================================================

class SelectHighValueChunksIn(BaseModel):
    """Input for select_high_value_chunks."""
    doc_id: Optional[str] = None
    pack_id: Optional[int] = None
    max_chunks: int = 60
    keyword_mode: Literal["default", "strict"] = "default"


class HighValueChunk(BaseModel):
    """A single high-value chunk."""
    chunk_id: int
    doc_id: str
    page_start: Optional[int] = None
    page_end: Optional[int] = None
    reason: str


class SelectHighValueChunksOut(BaseModel):
    """Output of select_high_value_chunks."""
    chunks: list[HighValueChunk] = Field(default_factory=list)
    error: Optional[MCPErrorModel] = None
```
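  • A brief usage sketch of the input model (the doc_id value is invented). Because keyword_mode is typed as a Literal, an unrecognized mode is rejected at validation time when this model is used to parse raw arguments.

```python
from pydantic import ValidationError

args = SelectHighValueChunksIn(doc_id="doc-123")  # hypothetical document ID
print(args.max_chunks, args.keyword_mode)         # 60 default

try:
    SelectHighValueChunksIn(keyword_mode="fuzzy")
except ValidationError:
    print("keyword_mode must be 'default' or 'strict'")
```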
  • Invocation of register_graph_extract_tools(mcp) in the main server setup, which defines and registers the tool using @mcp.tool() decorators.
```python
register_graph_extract_tools(mcp)
```
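  • The surrounding setup is not shown in these excerpts. Assuming the server uses FastMCP from the official MCP Python SDK (suggested by the @mcp.tool() decorator, but not confirmed here), the wiring would look roughly like this:

```python
# Sketch only; everything except register_graph_extract_tools and the
# handler signature is an assumption, not taken from this page.
from typing import Any
from mcp.server.fastmcp import FastMCP

def register_graph_extract_tools(mcp: FastMCP) -> None:
    @mcp.tool()
    def select_high_value_chunks(
        doc_id: str | None = None,
        pack_id: int | None = None,
        max_chunks: int = 60,
        keyword_mode: str = "default",
    ) -> dict[str, Any]:
        ...  # body as shown in the handler above

mcp = FastMCP("paperlib-mcp")
register_graph_extract_tools(mcp)
```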
  • Internal helper function implementing the core logic for selecting high-value chunks using keyword matching and FTS queries.
```python
def _select_high_value_chunks_internal(
    doc_id: str | None = None,
    pack_id: int | None = None,
    max_chunks: int = 60,
    keyword_mode: str = "default",
) -> dict[str, Any]:
    """Core logic for high-value chunk selection (internal use)."""
    if not doc_id and not pack_id:
        return {
            "chunks": [],
            "error": {"code": "VALIDATION_ERROR", "message": "Must provide either doc_id or pack_id"},
        }

    # Pick the keyword set
    keywords = HIGH_VALUE_KEYWORDS_STRICT if keyword_mode == "strict" else HIGH_VALUE_KEYWORDS_DEFAULT

    # Build the FTS query
    fts_query = " OR ".join(f"'{kw}'" for kw in keywords)

    if pack_id:
        # Take chunks directly from the evidence pack
        results = query_all(
            """
            SELECT c.chunk_id, c.doc_id, c.page_start, c.page_end, c.text
            FROM evidence_pack_items i
            JOIN chunks c ON c.chunk_id = i.chunk_id
            WHERE i.pack_id = %s
            LIMIT %s
            """,
            (pack_id, max_chunks),
        )
        reason = "from evidence pack"
    else:
        # Filter the document's chunks via full-text search
        results = query_all(
            """
            SELECT chunk_id, doc_id, page_start, page_end, text,
                   ts_rank(tsv, websearch_to_tsquery('english', %s)) AS rank
            FROM chunks
            WHERE doc_id = %s
              AND tsv @@ websearch_to_tsquery('english', %s)
            ORDER BY rank DESC
            LIMIT %s
            """,
            (fts_query, doc_id, fts_query, max_chunks),
        )
        reason = "keyword match"

    # Assemble the result list
    chunks = []
    for r in results:
        # Record which keywords actually hit
        text_lower = r["text"].lower() if r.get("text") else ""
        matched_keywords = [kw for kw in keywords if kw in text_lower]
        chunk_reason = f"{reason}: {', '.join(matched_keywords[:3])}" if matched_keywords else reason
        chunks.append({
            "chunk_id": r["chunk_id"],
            "doc_id": r["doc_id"],
            "page_start": r.get("page_start"),
            "page_end": r.get("page_end"),
            "reason": chunk_reason,
        })

    return {"chunks": chunks, "error": None}
```
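  • Two behaviors are visible directly in the helper above: missing identifiers produce an error envelope rather than an exception, and the reason string records up to three matched keywords. For example:

```python
# Grounded in the helper above: calling with neither identifier returns
# the VALIDATION_ERROR envelope instead of raising.
out = _select_high_value_chunks_internal()
assert out["chunks"] == []
assert out["error"]["code"] == "VALIDATION_ERROR"
```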
  • Keyword lists used for identifying high-value chunks in default and strict modes.
```python
# Keywords for high-value chunk selection
HIGH_VALUE_KEYWORDS_DEFAULT = [
    "identification", "strategy", "instrument", "did", "difference-in-differences",
    "event study", "rdd", "regression discontinuity", "robustness", "placebo",
    "measurement", "proxy", "data", "we measure", "limitation", "threat", "caveat",
    "mechanism", "channel", "heterogeneous", "heterogeneity", "endogeneity",
    "instrumental variable", "iv", "fixed effect", "panel", "causal",
]

HIGH_VALUE_KEYWORDS_STRICT = [
    "identification strategy", "instrumental variable", "difference-in-differences",
    "regression discontinuity", "event study", "placebo test", "robustness check",
    "measurement error", "proxy variable", "causal effect", "endogeneity",
]
```
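  • As a quick illustration of how these lists feed the reason field in the helper above: the match labels come from plain substring search on lowercased text, so short entries such as "iv" can also hit inside words like "derive". The sentence below is made up.

```python
# Illustrative only: reproduces the reason-labeling logic from
# _select_high_value_chunks_internal on an invented sentence.
text = "We use a difference-in-differences design with placebo tests."
matched = [kw for kw in HIGH_VALUE_KEYWORDS_DEFAULT if kw in text.lower()]
reason = f"keyword match: {', '.join(matched[:3])}" if matched else "keyword match"
print(reason)  # keyword match: difference-in-differences, placebo
```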
