
select_high_value_chunks

Extract key research findings, methods, and results from academic documents or evidence packs using keyword filtering to identify the most relevant content chunks.

Instructions

Filter high-value chunks

From the specified document or evidence pack, select chunks whose content relates to key methods, identification, or results.

Args:
- doc_id: document ID (mutually exclusive with pack_id)
- pack_id: evidence-pack ID (mutually exclusive with doc_id)
- max_chunks: maximum number of chunks to return; defaults to 60
- keyword_mode: keyword set to use, "default" or "strict"

Returns: a list of high-value chunks, each containing chunk_id, doc_id, page numbers, and the reason it matched
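The doc_id/pack_id selector rule above can be sketched as a standalone check. `validate_args` is illustrative only, not part of the server's API; the real check lives inside the server's internal helper.

```python
# Hypothetical sketch of the doc_id/pack_id validation described above.
def validate_args(doc_id=None, pack_id=None):
    """Return an error dict when neither selector is given, else None."""
    if not doc_id and not pack_id:
        return {"code": "VALIDATION_ERROR",
                "message": "Must provide either doc_id or pack_id"}
    return None
```

At least one of the two selectors must be supplied; otherwise the tool returns a VALIDATION_ERROR payload rather than raising.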

Input Schema

Name          Required  Description  Default
doc_id        No
pack_id       No
max_chunks    No
keyword_mode  No                     default

Output Schema

No fields documented.

Implementation Reference

  • Main handler function for the select_high_value_chunks tool, decorated with @mcp.tool(). It calls the internal helper and wraps the output in a Pydantic model.
    @mcp.tool()
    def select_high_value_chunks(
        doc_id: str | None = None,
        pack_id: int | None = None,
        max_chunks: int = 60,
        keyword_mode: str = "default",
    ) -> dict[str, Any]:
        """Filter high-value chunks.

        From the specified document or evidence pack, select chunks whose
        content relates to key methods, identification, or results.

        Args:
            doc_id: document ID (mutually exclusive with pack_id)
            pack_id: evidence-pack ID (mutually exclusive with doc_id)
            max_chunks: maximum number of chunks to return; defaults to 60
            keyword_mode: keyword set to use, "default" or "strict"

        Returns:
            A list of high-value chunks, each with chunk_id, doc_id,
            page numbers, and the reason it matched.
        """
        try:
            # Call the internal helper
            result = _select_high_value_chunks_internal(doc_id, pack_id, max_chunks, keyword_mode)
            
            if result.get("error"):
                return SelectHighValueChunksOut(
                    error=MCPErrorModel(**result["error"]),
                ).model_dump()
            
            # Convert to Pydantic models
            chunks = [
                HighValueChunk(
                    chunk_id=c["chunk_id"],
                    doc_id=c["doc_id"],
                    page_start=c.get("page_start"),
                    page_end=c.get("page_end"),
                    reason=c["reason"],
                )
                for c in result.get("chunks", [])
            ]
            
            return SelectHighValueChunksOut(chunks=chunks).model_dump()
            
        except Exception as e:
            return SelectHighValueChunksOut(
                error=MCPErrorModel(code="DB_CONN_ERROR", message=str(e)),
            ).model_dump()
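The handler's try/except envelope can be exercised without the server. `call_with_envelope` and `boom` are illustrative stand-ins, and plain dicts replace the Pydantic models here:

```python
# Sketch of the handler's error envelope: exceptions become an error
# payload instead of propagating to the MCP client.
def call_with_envelope(fn):
    try:
        return {"chunks": fn(), "error": None}
    except Exception as e:
        return {"chunks": [],
                "error": {"code": "DB_CONN_ERROR", "message": str(e)}}

def boom():
    raise RuntimeError("connection refused")
```

This mirrors the design choice above: the tool always returns a well-formed output object, with failures surfaced in the error field.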
  • Pydantic schema definitions for the tool's input (SelectHighValueChunksIn), HighValueChunk model, and output (SelectHighValueChunksOut).
    # ============================================================
    # select_high_value_chunks tool models
    # ============================================================
    
    
    class SelectHighValueChunksIn(BaseModel):
        """select_high_value_chunks input."""
        doc_id: Optional[str] = None
        pack_id: Optional[int] = None
        max_chunks: int = 60
        keyword_mode: Literal["default", "strict"] = "default"
    
    
    class HighValueChunk(BaseModel):
        """A high-value chunk."""
        chunk_id: int
        doc_id: str
        page_start: Optional[int] = None
        page_end: Optional[int] = None
        reason: str
    
    
    class SelectHighValueChunksOut(BaseModel):
        """select_high_value_chunks output."""
        chunks: list[HighValueChunk] = Field(default_factory=list)
        error: Optional[MCPErrorModel] = None
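The serialized shape these models produce can be sketched with stdlib dataclasses, an assumption made here only so the example runs without Pydantic installed; the server itself uses Pydantic's model_dump().

```python
from dataclasses import dataclass, field, asdict
from typing import Optional

# Stand-ins for the Pydantic models above; asdict() plays the role of
# model_dump(), turning nested dataclasses into plain dicts.
@dataclass
class HighValueChunkSketch:
    chunk_id: int
    doc_id: str
    reason: str
    page_start: Optional[int] = None
    page_end: Optional[int] = None

@dataclass
class OutSketch:
    chunks: list = field(default_factory=list)
    error: Optional[dict] = None

payload = asdict(OutSketch(
    chunks=[HighValueChunkSketch(1, "doc-a", "keyword match: robustness")]))
```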
  • Invocation of register_graph_extract_tools(mcp) in the main server setup, which defines and registers the tool using @mcp.tool() decorators.
    register_graph_extract_tools(mcp)
  • Internal helper function implementing the core logic for selecting high-value chunks using keyword matching and FTS queries.
    def _select_high_value_chunks_internal(
        doc_id: str | None = None,
        pack_id: int | None = None,
        max_chunks: int = 60,
        keyword_mode: str = "default",
    ) -> dict[str, Any]:
        """Core logic for high-value chunk selection (internal use)."""
        if not doc_id and not pack_id:
            return {
                "chunks": [],
                "error": {"code": "VALIDATION_ERROR", "message": "Must provide either doc_id or pack_id"},
            }
        
        # Select the keyword set
        keywords = HIGH_VALUE_KEYWORDS_STRICT if keyword_mode == "strict" else HIGH_VALUE_KEYWORDS_DEFAULT
        
        # Build the FTS query
        fts_query = " OR ".join(f"'{kw}'" for kw in keywords)
        
        if pack_id:
            # Fetch from the evidence pack
            results = query_all(
                """
                SELECT c.chunk_id, c.doc_id, c.page_start, c.page_end, c.text
                FROM evidence_pack_items i
                JOIN chunks c ON c.chunk_id = i.chunk_id
                WHERE i.pack_id = %s
                LIMIT %s
                """,
                (pack_id, max_chunks)
            )
            reason = "from evidence pack"
        else:
            # Filter with FTS
            results = query_all(
                """
                SELECT chunk_id, doc_id, page_start, page_end, text,
                       ts_rank(tsv, websearch_to_tsquery('english', %s)) AS rank
                FROM chunks
                WHERE doc_id = %s
                  AND tsv @@ websearch_to_tsquery('english', %s)
                ORDER BY rank DESC
                LIMIT %s
                """,
                (fts_query, doc_id, fts_query, max_chunks)
            )
            reason = "keyword match"
        
        # Build the result list
        chunks = []
        for r in results:
            # Identify the keywords that matched
            text_lower = r["text"].lower() if r.get("text") else ""
            matched_keywords = [kw for kw in keywords if kw in text_lower]
            chunk_reason = f"{reason}: {', '.join(matched_keywords[:3])}" if matched_keywords else reason
            
            chunks.append({
                "chunk_id": r["chunk_id"],
                "doc_id": r["doc_id"],
                "page_start": r.get("page_start"),
                "page_end": r.get("page_end"),
                "reason": chunk_reason,
            })
        
        return {"chunks": chunks, "error": None}
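The per-chunk reason string built at the end of the helper can be reproduced in isolation. The toy keyword list and chunk text below are hypothetical:

```python
# Mirrors the reason-building step: list up to three matched keywords.
keywords = ["robustness", "placebo", "endogeneity"]
text = "We run a placebo test and several robustness checks."

text_lower = text.lower()
matched = [kw for kw in keywords if kw in text_lower]
reason = (f"keyword match: {', '.join(matched[:3])}"
          if matched else "keyword match")
```

Matched keywords are reported in keyword-list order, truncated to three, so the reason stays short even for keyword-dense chunks.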
  • Keyword lists used for identifying high-value chunks in default and strict modes.
    # Keywords for high-value chunk selection
    HIGH_VALUE_KEYWORDS_DEFAULT = [
        "identification", "strategy", "instrument", "did", "difference-in-differences",
        "event study", "rdd", "regression discontinuity", "robustness", "placebo",
        "measurement", "proxy", "data", "we measure", "limitation", "threat", "caveat",
        "mechanism", "channel", "heterogeneous", "heterogeneity", "endogeneity",
        "instrumental variable", "iv", "fixed effect", "panel", "causal",
    ]
    
    HIGH_VALUE_KEYWORDS_STRICT = [
        "identification strategy", "instrumental variable", "difference-in-differences",
        "regression discontinuity", "event study", "placebo test", "robustness check",
        "measurement error", "proxy variable", "causal effect", "endogeneity",
    ]
    
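The string the helper passes to websearch_to_tsquery can be checked directly from a keyword list; only the first three strict keywords are reproduced here:

```python
# Each keyword is quoted so multi-word phrases survive
# websearch_to_tsquery's parsing as a single unit.
strict = [
    "identification strategy",
    "instrumental variable",
    "difference-in-differences",
]
fts_query = " OR ".join(f"'{kw}'" for kw in strict)
```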
Behavior 2/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

No annotations are provided, so the description carries the full burden of behavioral disclosure. It describes the filtering action and return format, but misses critical details: whether the operation is read-only, potential side effects (e.g., caching), performance considerations, and error handling. For a tool with no annotations, this is a significant gap, though it's not misleading—just incomplete.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is well-structured and front-loaded: it starts with the core purpose, lists parameters with explanations, and ends with return details. It's concise with no wasted sentences, though the mix of Chinese and English might slightly affect clarity for some users. Overall, it's efficient and earns its place, but minor language consistency issues prevent a perfect score.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's moderate complexity (4 parameters, no annotations, but with an output schema), the description is reasonably complete. It covers purpose, parameters, and return format, and the output schema handles return values, so no need to explain those in detail. However, it could improve by addressing behavioral aspects like idempotency or error cases, which are missing. This makes it mostly adequate but not fully comprehensive.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 0%, so the description must compensate. It adds meaningful context for all parameters: it explains that 'doc_id' and 'pack_id' are mutually exclusive choices, 'max_chunks' has a default of 60, and 'keyword_mode' has options 'default' or 'strict'. This goes beyond the schema's basic types and defaults, providing practical usage insights. However, it doesn't detail the exact criteria for 'high-value' or how modes differ, keeping it from a perfect 5.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose 4/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose: '筛选高价值 chunks' (filter high-value chunks) from specified documents or evidence packs, focusing on content related to key methods/identification/results. It uses specific verbs ('筛选' - filter) and resources ('文档或证据包' - documents or evidence packs). However, it doesn't explicitly differentiate from sibling tools like 'get_document_chunks' or 'search_hybrid', which might also retrieve chunks, so it doesn't reach a perfect 5.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines 3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description implies usage by specifying the target resources (documents or evidence packs) and the type of content to filter (key methods/identification/results). However, it lacks explicit guidance on when to use this tool versus alternatives like 'search_hybrid' or 'get_document_chunks', and doesn't mention prerequisites or exclusions. This leaves room for ambiguity, fitting a score of 3 for implied context without clear alternatives.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.


MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/h-lu/paperlib-mcp'
