extract_graph_missing

Identify documents lacking mentions and process them with extract_graph_v1 to build knowledge graphs from academic papers.

Instructions

Batch backfill of unextracted documents

Find documents with no mentions and run extract_graph_v1 on them.

Args:
  limit_docs: maximum number of documents to process (default 50)
  llm_model: LLM model; defaults to the LLM_MODEL environment variable
  min_confidence: minimum confidence threshold

Returns: the number of documents processed and the list of document IDs
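A typical invocation, sketched as the argument dict an MCP client might send. The parameter names come from the input schema and the values shown are the defaults from the handler signature; a real client would override only what it needs.

```python
# Illustrative arguments for an extract_graph_missing call; every value
# below is the documented default from the handler signature.
args = {
    "limit_docs": 50,        # process at most this many documents per run
    "llm_model": None,       # None -> fall back to the LLM_MODEL env setting
    "min_confidence": 0.8,   # minimum confidence for extracted mentions
    "concurrency": 30,       # parallel extraction calls within one document
    "doc_concurrency": 15,   # documents processed in parallel
    "max_chunks": 60,        # chunks considered per document
}
print(sorted(args))
```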

Input Schema

Name             Required  Description  Default
limit_docs       No                     50
llm_model        No                     LLM_MODEL env var
min_confidence   No                     0.8
concurrency      No                     30
doc_concurrency  No                     15
max_chunks       No                     60

Output Schema

Name            Required  Description  Default
processed_docs  Yes
doc_ids         No                     []
error           No                     None
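Matching the Returns note, a successful response serializes to a plain dict shaped as below; the document IDs here are made up for illustration.

```python
# Shape of a successful extract_graph_missing result; the two doc IDs
# are illustrative, not real identifiers.
result = {
    "processed_docs": 2,
    "doc_ids": ["doc_0a1b", "doc_9f3c"],
    "error": None,  # populated with code/message fields on failure
}
print(result["processed_docs"])
```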

Implementation Reference

  • Main handler function that implements the tool logic: finds missing documents, selects high-value chunks, and extracts the graph via parallel calls to extract_graph_v1_run.
    @mcp.tool()
    async def extract_graph_missing(
        limit_docs: int = 50,
        llm_model: str | None = None,
        min_confidence: float = 0.8,
        concurrency: int = 30,
        doc_concurrency: int = 15,
        max_chunks: int = 60,
    ) -> dict[str, Any]:
        """Batch backfill of unextracted documents.

        Find documents with no mentions and run extract_graph_v1 on them.

        Args:
            limit_docs: Maximum number of documents to process (default 50).
            llm_model: LLM model; defaults to the LLM_MODEL environment variable.
            min_confidence: Minimum confidence threshold.

        Returns:
            The number of documents processed and the list of document IDs.
        """
        try:
            settings = get_settings()
            actual_llm_model = llm_model or settings.llm_model
            # Find documents that have not been extracted yet
            missing_docs = query_all(
                """
                SELECT d.doc_id
                FROM documents d
                LEFT JOIN mentions m ON m.doc_id = d.doc_id
                GROUP BY d.doc_id
                HAVING COUNT(m.mention_id) = 0
                ORDER BY d.created_at DESC
                LIMIT %s
                """,
                (limit_docs,)
            )
            
            if not missing_docs:
                return ExtractGraphMissingOut(
                    processed_docs=0,
                    doc_ids=[],
                ).model_dump()
            
            # Import the extraction tooling
            from paperlib_mcp.tools.graph_extract import (
                extract_graph_v1_run,
                HIGH_VALUE_KEYWORDS_DEFAULT,
            )
            
            # Use a Semaphore to cap document-level concurrency
            sem = asyncio.Semaphore(doc_concurrency)
    
            async def process_single_doc(doc_row):
                async with sem:
                    doc_id = doc_row["doc_id"]
                    
                    # Fetch this document's high-value chunks
                    fts_query = " OR ".join(f"'{kw}'" for kw in HIGH_VALUE_KEYWORDS_DEFAULT)
                    chunks = query_all(
                        """
                        SELECT chunk_id, doc_id, page_start, page_end, text
                        FROM chunks
                        WHERE doc_id = %s
                        AND tsv @@ websearch_to_tsquery('english', %s)
                        ORDER BY ts_rank(tsv, websearch_to_tsquery('english', %s)) DESC
                        LIMIT %s
                        """,
                        (doc_id, fts_query, fts_query, max_chunks)
                    )
                    
                    if not chunks:
                        # No high-value chunks; fall back to all chunks
                        chunks = query_all(
                            """
                            SELECT chunk_id, doc_id, page_start, page_end, text
                            FROM chunks
                            WHERE doc_id = %s
                            ORDER BY chunk_index
                            LIMIT %s
                            """,
                            (doc_id, max_chunks)
                        )
                    
                    if not chunks:
                        return None
                    
                    # Call the optimized extract_graph_v1_run directly
                    # (it parallelizes chunk extraction internally)

                    # Collect the chunk IDs to pass through
                    chunk_ids = [c["chunk_id"] for c in chunks]
                    
                    await extract_graph_v1_run(
                        doc_id=doc_id,
                        chunk_ids=chunk_ids,
                        mode="custom",  # use the chunk_ids passed in
                        max_chunks=len(chunks),
                        llm_model=actual_llm_model,
                        min_confidence=min_confidence,
                        concurrency=concurrency,
                    )
                            
                    return doc_id
    
            # Run all documents concurrently
            tasks = [process_single_doc(doc) for doc in missing_docs]
            results = await asyncio.gather(*tasks)
            
            # Collect the doc_ids that completed successfully
            processed_doc_ids = [r for r in results if r is not None]
            
            return ExtractGraphMissingOut(
                processed_docs=len(processed_doc_ids),
                doc_ids=processed_doc_ids,
            ).model_dump()
            
        except Exception as e:
            return ExtractGraphMissingOut(
                processed_docs=0,
                doc_ids=[],
                error=MCPErrorModel(code="LLM_ERROR", message=str(e)),
            ).model_dump()
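The document-level throttling above follows the standard asyncio pattern: wrap each task body in a Semaphore, then hand all tasks to gather. A minimal self-contained sketch of that pattern (the names bounded_gather, demo, and square are illustrative, not from paperlib_mcp):

```python
import asyncio

async def bounded_gather(items, worker, limit):
    """Run worker(item) for every item, at most `limit` at a time."""
    sem = asyncio.Semaphore(limit)

    async def run_one(item):
        async with sem:  # at most `limit` workers inside at any moment
            return await worker(item)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(run_one(i) for i in items))

async def demo():
    async def square(n):
        await asyncio.sleep(0)  # stand-in for the real async work
        return n * n

    return await bounded_gather(range(5), square, limit=2)

print(asyncio.run(demo()))  # [0, 1, 4, 9, 16]
```

Note that plain `gather` propagates the first exception, so one failing document aborts the whole batch; passing `return_exceptions=True` is the usual way to let the rest finish.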
  • Pydantic input and output schema definitions for the extract_graph_missing tool.
    class ExtractGraphMissingIn(BaseModel):
        """Input for extract_graph_missing"""
        limit_docs: int = 50
        llm_model: Optional[str] = None  # defaults to the LLM_MODEL env var
        min_confidence: float = 0.8
    
    
    class ExtractGraphMissingOut(BaseModel):
        """Output for extract_graph_missing"""
        processed_docs: int
        doc_ids: list[str] = Field(default_factory=list)
        error: Optional[MCPErrorModel] = None
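Since the handler returns `model_dump()` output, the wire format is a plain dict. A minimal sketch of that round trip, using a stand-in MCPErrorModel that mirrors only the code/message fields the handler uses (the real model may define more):

```python
from typing import Optional

from pydantic import BaseModel, Field

class MCPErrorModel(BaseModel):
    # Stand-in mirroring the fields used by the handler above.
    code: str
    message: str

class ExtractGraphMissingOut(BaseModel):
    processed_docs: int
    doc_ids: list[str] = Field(default_factory=list)
    error: Optional[MCPErrorModel] = None

ok = ExtractGraphMissingOut(processed_docs=1, doc_ids=["d1"])
print(ok.model_dump())
# {'processed_docs': 1, 'doc_ids': ['d1'], 'error': None}

failed = ExtractGraphMissingOut(
    processed_docs=0,
    error=MCPErrorModel(code="LLM_ERROR", message="upstream timeout"),
)
print(failed.model_dump()["error"]["code"])  # LLM_ERROR
```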
  • Registration call in the main MCP server setup that registers the graph maintenance tools, including extract_graph_missing.
    register_graph_maintenance_tools(mcp)
  • The registration function that defines and registers the tool using @mcp.tool() decorator.
    def register_graph_maintenance_tools(mcp: FastMCP) -> None:
        """Register GraphRAG maintenance tools"""

Behavior 2/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

No annotations are provided, so the description carries full burden. It mentions batch processing and finding documents without mentions, but lacks critical behavioral details: whether this is a read-only or mutating operation, what permissions are required, how errors are handled, or any rate limits. For a tool with 6 parameters and no annotation coverage, this is inadequate.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness 3/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is reasonably concise with a purpose statement, parameter list, and return explanation. However, it's not optimally structured—the parameter explanations are brief and lack formatting, and the return statement could be more integrated. It avoids redundancy but misses opportunities for clarity.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness 2/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (6 parameters, no annotations, but has an output schema), the description is incomplete. It covers basic purpose and some parameters but omits behavioral context, usage guidelines, and documentation for half the parameters. The output schema helps, but the description doesn't fully bridge the gap for effective agent use.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters 2/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 0%, so the description must compensate. It only documents 3 out of 6 parameters (limit_docs, llm_model, min_confidence), leaving concurrency, doc_concurrency, and max_chunks completely undocumented. The parameter explanations are minimal (e.g., '最大处理文档数' for limit_docs) without deeper context on trade-offs or constraints.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose 4/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose: '批量补跑未抽取的文档' (batch process documents that haven't been extracted) and specifies it finds documents without mentions and runs extract_graph_v1 on them. This is a specific verb+resource combination. However, it doesn't explicitly differentiate from sibling tools like extract_graph_v1, though the purpose is distinct enough.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines 2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides no guidance on when to use this tool versus alternatives. It mentions running extract_graph_v1, but doesn't explain when to use extract_graph_missing instead of extract_graph_v1 directly, nor does it mention prerequisites or exclusions. This leaves the agent with minimal context for tool selection.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
