select_high_value_chunks

Extract key research findings, methods, and results from academic documents or evidence packs using keyword filtering to identify the most relevant content chunks.

Instructions

Select high-value chunks

Selects chunks containing key methodology-, identification-, and results-related content from a specified document or evidence pack.

Args: doc_id: document ID (mutually exclusive with pack_id); pack_id: evidence pack ID (mutually exclusive with doc_id); max_chunks: maximum number of chunks to return, default 60; keyword_mode: keyword set, "default" or "strict"

Returns: a list of high-value chunks, each containing chunk_id, doc_id, page range, and the reason the chunk matched
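A hypothetical request and response illustrating the shapes described above (the doc_id and all values are invented; field names follow the input schema and return description):

```python
# Hypothetical request/response for select_high_value_chunks.
request = {
    "doc_id": "doc_0001",      # mutually exclusive with pack_id
    "max_chunks": 10,
    "keyword_mode": "strict",
}

response = {
    "chunks": [
        {
            "chunk_id": 42,
            "doc_id": "doc_0001",
            "page_start": 7,
            "page_end": 8,
            "reason": "keyword match: identification strategy, placebo test",
        }
    ],
    "error": None,
}
```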

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| doc_id | No | Document ID (mutually exclusive with pack_id) | |
| pack_id | No | Evidence pack ID (mutually exclusive with doc_id) | |
| max_chunks | No | Maximum number of chunks to return | 60 |
| keyword_mode | No | Keyword set: "default" or "strict" | default |

Implementation Reference

  • Main handler function for select_high_value_chunks tool, decorated with @mcp.tool(). Calls internal helper and wraps output in Pydantic model.
    @mcp.tool()
    def select_high_value_chunks(
        doc_id: str | None = None,
        pack_id: int | None = None,
        max_chunks: int = 60,
        keyword_mode: str = "default",
    ) -> dict[str, Any]:
        """筛选高价值 chunks
        
        从指定文档或证据包中筛选包含关键方法/识别/结果相关内容的 chunks。
        
        Args:
            doc_id: 文档 ID(与 pack_id 二选一)
            pack_id: 证据包 ID(与 doc_id 二选一)
            max_chunks: 最大返回数量,默认 60
            keyword_mode: 关键词模式,"default" 或 "strict"
            
        Returns:
            高价值 chunk 列表,每个包含 chunk_id、doc_id、页码和命中原因
        """
        try:
            # Delegate to the internal helper
            result = _select_high_value_chunks_internal(doc_id, pack_id, max_chunks, keyword_mode)
            
            if result.get("error"):
                return SelectHighValueChunksOut(
                    error=MCPErrorModel(**result["error"]),
                ).model_dump()
            
            # Convert raw dicts into Pydantic models
            chunks = [
                HighValueChunk(
                    chunk_id=c["chunk_id"],
                    doc_id=c["doc_id"],
                    page_start=c.get("page_start"),
                    page_end=c.get("page_end"),
                    reason=c["reason"],
                )
                for c in result.get("chunks", [])
            ]
            
            return SelectHighValueChunksOut(chunks=chunks).model_dump()
            
        except Exception as e:
            return SelectHighValueChunksOut(
                error=MCPErrorModel(code="DB_CONN_ERROR", message=str(e)),
            ).model_dump()
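The handler above follows a never-raise envelope pattern: success populates `chunks`, any failure is converted into a structured `error`. A minimal Pydantic-free sketch of that pattern (the names `MCPError`, `ToolResult`, and `run_tool` are hypothetical stand-ins):

```python
from dataclasses import asdict, dataclass, field
from typing import Optional

@dataclass
class MCPError:
    code: str
    message: str

@dataclass
class ToolResult:
    chunks: list = field(default_factory=list)
    error: Optional[MCPError] = None

def run_tool(fn):
    # Mirror the handler's pattern: never raise, always return a dict envelope.
    try:
        return asdict(ToolResult(chunks=fn()))
    except Exception as e:
        return asdict(ToolResult(error=MCPError(code="DB_CONN_ERROR", message=str(e))))
```

Callers can then branch on `result["error"]` without wrapping every tool call in try/except themselves.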
  • Pydantic schema definitions for the tool's input (SelectHighValueChunksIn), HighValueChunk model, and output (SelectHighValueChunksOut).
    # ============================================================
    # select_high_value_chunks tool models
    # ============================================================
    
    
    class SelectHighValueChunksIn(BaseModel):
        """select_high_value_chunks 输入"""
        doc_id: Optional[str] = None
        pack_id: Optional[int] = None
        max_chunks: int = 60
        keyword_mode: Literal["default", "strict"] = "default"
    
    
    class HighValueChunk(BaseModel):
        """高价值 chunk"""
        chunk_id: int
        doc_id: str
        page_start: Optional[int] = None
        page_end: Optional[int] = None
        reason: str
    
    
    class SelectHighValueChunksOut(BaseModel):
        """select_high_value_chunks 输出"""
        chunks: list[HighValueChunk] = Field(default_factory=list)
        error: Optional[MCPErrorModel] = None
  • Invocation of register_graph_extract_tools(mcp) in the main server setup, which defines and registers the tool using @mcp.tool() decorators.
    register_graph_extract_tools(mcp)
  • Internal helper function implementing the core logic for selecting high-value chunks using keyword matching and FTS queries.
    def _select_high_value_chunks_internal(
        doc_id: str | None = None,
        pack_id: int | None = None,
        max_chunks: int = 60,
        keyword_mode: str = "default",
    ) -> dict[str, Any]:
        """高价值 chunk 筛选的核心逻辑(内部使用)"""
        if not doc_id and not pack_id:
            return {
                "chunks": [],
                "error": {"code": "VALIDATION_ERROR", "message": "Must provide either doc_id or pack_id"},
            }
        
        # Pick the keyword set for the requested mode
        keywords = HIGH_VALUE_KEYWORDS_STRICT if keyword_mode == "strict" else HIGH_VALUE_KEYWORDS_DEFAULT

        # Build the FTS query: double-quote each keyword so multi-word
        # keywords match as phrases in websearch_to_tsquery, then OR them
        fts_query = " OR ".join(f'"{kw}"' for kw in keywords)
        
        if pack_id is not None:
            # Fetch chunks from the evidence pack
            results = query_all(
                """
                SELECT c.chunk_id, c.doc_id, c.page_start, c.page_end, c.text
                FROM evidence_pack_items i
                JOIN chunks c ON c.chunk_id = i.chunk_id
                WHERE i.pack_id = %s
                LIMIT %s
                """,
                (pack_id, max_chunks)
            )
            reason = "from evidence pack"
        else:
            # Filter by full-text search within the document
            results = query_all(
                """
                SELECT chunk_id, doc_id, page_start, page_end, text,
                       ts_rank(tsv, websearch_to_tsquery('english', %s)) AS rank
                FROM chunks
                WHERE doc_id = %s
                  AND tsv @@ websearch_to_tsquery('english', %s)
                ORDER BY rank DESC
                LIMIT %s
                """,
                (fts_query, doc_id, fts_query, max_chunks)
            )
            reason = "keyword match"
        
        # Build the result list
        chunks = []
        for r in results:
            # Record which keywords actually hit this chunk's text
            text_lower = r["text"].lower() if r.get("text") else ""
            matched_keywords = [kw for kw in keywords if kw in text_lower]
            chunk_reason = f"{reason}: {', '.join(matched_keywords[:3])}" if matched_keywords else reason
            
            chunks.append({
                "chunk_id": r["chunk_id"],
                "doc_id": r["doc_id"],
                "page_start": r.get("page_start"),
                "page_end": r.get("page_end"),
                "reason": chunk_reason,
            })
        
        return {"chunks": chunks, "error": None}
  • Keyword lists used for identifying high-value chunks in default and strict modes.
    # Keywords for high-value chunk selection
    HIGH_VALUE_KEYWORDS_DEFAULT = [
        "identification", "strategy", "instrument", "did", "difference-in-differences",
        "event study", "rdd", "regression discontinuity", "robustness", "placebo",
        "measurement", "proxy", "data", "we measure", "limitation", "threat", "caveat",
        "mechanism", "channel", "heterogeneous", "heterogeneity", "endogeneity",
        "instrumental variable", "iv", "fixed effect", "panel", "causal",
    ]
    
    HIGH_VALUE_KEYWORDS_STRICT = [
        "identification strategy", "instrumental variable", "difference-in-differences",
        "regression discontinuity", "event study", "placebo test", "robustness check",
        "measurement error", "proxy variable", "causal effect", "endogeneity",
    ]
    
    # Tables required by M2
    REQUIRED_TABLES = [
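The `reason` strings in the output are produced by checking which keywords actually occur in a chunk's text and echoing back at most the first three hits. A self-contained sketch of that matching step (the sample text is invented):

```python
HIGH_VALUE_KEYWORDS_STRICT = [
    "identification strategy", "instrumental variable", "difference-in-differences",
]

text = "Our identification strategy exploits a difference-in-differences design."
# Case-insensitive substring match against the keyword set
matched = [kw for kw in HIGH_VALUE_KEYWORDS_STRICT if kw in text.lower()]
# At most three matched keywords are echoed back in the reason string
reason = f"keyword match: {', '.join(matched[:3])}" if matched else "keyword match"
```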
