extract_graph_missing

Identify documents lacking mentions and process them with extract_graph_v1 to build knowledge graphs from academic papers.

Instructions

Batch re-run extraction for unprocessed documents.

Finds documents that have no mentions and runs extract_graph_v1 on them.

Args: limit_docs: maximum number of documents to process, default 50; llm_model: LLM model, defaults to the LLM_MODEL environment variable; min_confidence: minimum confidence threshold

Returns: the number of documents processed and the list of document IDs
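The missing-document lookup is a standard anti-join: left-join documents to mentions and keep rows with zero matches. A minimal sketch of that pattern using Python's stdlib SQLite driver (table and column names mirror the handler's query; the real server runs against PostgreSQL):

```python
import sqlite3

# In-memory fixture mirroring the documents/mentions tables
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents (doc_id TEXT PRIMARY KEY, created_at INTEGER);
    CREATE TABLE mentions (mention_id INTEGER PRIMARY KEY, doc_id TEXT);
    INSERT INTO documents VALUES ('d1', 1), ('d2', 2), ('d3', 3);
    INSERT INTO mentions VALUES (1, 'd1');  -- only d1 has been extracted
""")

# Anti-join: documents with zero mentions, newest first
missing = conn.execute("""
    SELECT d.doc_id
    FROM documents d
    LEFT JOIN mentions m ON m.doc_id = d.doc_id
    GROUP BY d.doc_id
    HAVING COUNT(m.mention_id) = 0
    ORDER BY d.created_at DESC
    LIMIT 50
""").fetchall()

print([row[0] for row in missing])  # ['d3', 'd2']
```

Counting `m.mention_id` (not `*`) is what makes the `HAVING` clause work: unmatched left-join rows contribute NULL, which `COUNT` skips.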

Input Schema

Name             Required  Default  Description
limit_docs       No        50       Maximum number of documents to process
llm_model        No        (none)   LLM model; falls back to the LLM_MODEL environment variable
min_confidence   No        0.8      Minimum confidence threshold
concurrency      No        30       Chunk-level extraction concurrency
doc_concurrency  No        15       Document-level concurrency
max_chunks       No        60       Maximum chunks selected per document
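Since every argument has a default, a typical call passes only the values it wants to override. A hedged example of the arguments object an MCP client might send (the values below are illustrative, not recommended settings):

```json
{
  "name": "extract_graph_missing",
  "arguments": {
    "limit_docs": 20,
    "min_confidence": 0.85,
    "doc_concurrency": 10
  }
}
```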

Implementation Reference

  • Main handler function that implements the tool logic: finds documents with no mentions, selects high-value chunks, and extracts graph data via parallel calls to extract_graph_v1_run.
    @mcp.tool()
    async def extract_graph_missing(
        limit_docs: int = 50,
        llm_model: str | None = None,
        min_confidence: float = 0.8,
        concurrency: int = 30,
        doc_concurrency: int = 15,
        max_chunks: int = 60,
    ) -> dict[str, Any]:
        """Batch re-run extraction for unprocessed documents.

        Finds documents that have no mentions and runs extract_graph_v1 on them.

        Args:
            limit_docs: Maximum number of documents to process (default 50).
            llm_model: LLM model; defaults to the LLM_MODEL environment variable.
            min_confidence: Minimum confidence threshold.
            concurrency: Chunk-level concurrency passed to extract_graph_v1_run.
            doc_concurrency: Maximum documents processed in parallel.
            max_chunks: Maximum chunks selected per document.

        Returns:
            The number of documents processed and the list of document IDs.
        """
        try:
            settings = get_settings()
            actual_llm_model = llm_model or settings.llm_model
            # Find documents with no extracted mentions
            missing_docs = query_all(
                """
                SELECT d.doc_id
                FROM documents d
                LEFT JOIN mentions m ON m.doc_id = d.doc_id
                GROUP BY d.doc_id
                HAVING COUNT(m.mention_id) = 0
                ORDER BY d.created_at DESC
                LIMIT %s
                """,
                (limit_docs,)
            )
            
            if not missing_docs:
                return ExtractGraphMissingOut(
                    processed_docs=0,
                    doc_ids=[],
                ).model_dump()
            
            # Import the extraction helpers
            from paperlib_mcp.tools.graph_extract import (
                extract_graph_v1_run,
                HIGH_VALUE_KEYWORDS_DEFAULT,
            )
            
            # Limit document-level concurrency with a semaphore
            sem = asyncio.Semaphore(doc_concurrency)
    
            async def process_single_doc(doc_row):
                async with sem:
                    doc_id = doc_row["doc_id"]
                    
                    # Fetch this document's high-value chunks
                    fts_query = " OR ".join(f"'{kw}'" for kw in HIGH_VALUE_KEYWORDS_DEFAULT)
                    chunks = query_all(
                        """
                        SELECT chunk_id, doc_id, page_start, page_end, text
                        FROM chunks
                        WHERE doc_id = %s
                        AND tsv @@ websearch_to_tsquery('english', %s)
                        ORDER BY ts_rank(tsv, websearch_to_tsquery('english', %s)) DESC
                        LIMIT %s
                        """,
                        (doc_id, fts_query, fts_query, max_chunks)
                    )
                    
                    if not chunks:
                        # No high-value chunks; fall back to all chunks
                        chunks = query_all(
                            """
                            SELECT chunk_id, doc_id, page_start, page_end, text
                            FROM chunks
                            WHERE doc_id = %s
                            ORDER BY chunk_index
                            LIMIT %s
                            """,
                            (doc_id, max_chunks)
                        )
                    
                    if not chunks:
                        return None
                    
                    # Call the parallel-capable extract_graph_v1_run directly,
                    # passing the selected chunk IDs
                    chunk_ids = [c["chunk_id"] for c in chunks]
                    
                    await extract_graph_v1_run(
                        doc_id=doc_id,
                        chunk_ids=chunk_ids,
                        mode="custom",  # use the provided chunk_ids
                        max_chunks=len(chunks),
                        llm_model=actual_llm_model,
                        min_confidence=min_confidence,
                        concurrency=concurrency,
                    )
                            
                    return doc_id
    
            # Run all documents concurrently
            tasks = [process_single_doc(doc) for doc in missing_docs]
            results = await asyncio.gather(*tasks)
            
            # Collect the doc_ids that were processed successfully
            processed_doc_ids = [r for r in results if r is not None]
            
            return ExtractGraphMissingOut(
                processed_docs=len(processed_doc_ids),
                doc_ids=processed_doc_ids,
            ).model_dump()
            
        except Exception as e:
            return ExtractGraphMissingOut(
                processed_docs=0,
                doc_ids=[],
                error=MCPErrorModel(code="LLM_ERROR", message=str(e)),
            ).model_dump()
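The document-level fan-out above is a common asyncio pattern: each task acquires a semaphore with `async with sem:` so at most `doc_concurrency` coroutines run at once, while `asyncio.gather` preserves input order. A self-contained sketch of that pattern under illustrative names (`bounded_gather` and `demo` are not part of paperlib_mcp):

```python
import asyncio

async def bounded_gather(items, worker, limit):
    # At most `limit` workers run concurrently; gather keeps input order.
    sem = asyncio.Semaphore(limit)

    async def run(item):
        async with sem:
            return await worker(item)

    return await asyncio.gather(*(run(i) for i in items))

async def demo():
    active = peak = 0

    async def worker(i):
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)  # simulate an LLM extraction call
        active -= 1
        return i * 2

    results = await bounded_gather(range(10), worker, limit=3)
    return results, peak

results, peak = asyncio.run(demo())
print(results, peak)
```

Note that a plain `gather` propagates the first exception, which is why the handler wraps everything in one `try`/`except`; per-document error tolerance would require `return_exceptions=True` and filtering.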
  • Pydantic input and output schema definitions for the extract_graph_missing tool.
    class ExtractGraphMissingIn(BaseModel):
        """Input schema for extract_graph_missing."""
        limit_docs: int = 50
        llm_model: Optional[str] = None  # defaults to the LLM_MODEL env var
        min_confidence: float = 0.8
    
    
    class ExtractGraphMissingOut(BaseModel):
        """Output schema for extract_graph_missing."""
        processed_docs: int
        doc_ids: list[str] = Field(default_factory=list)
        error: Optional[MCPErrorModel] = None
  • Registration call in the main MCP server setup that registers the graph maintenance tools, including extract_graph_missing.
    register_graph_maintenance_tools(mcp)
  • The registration function that defines and registers the tool using @mcp.tool() decorator.
    def register_graph_maintenance_tools(mcp: FastMCP) -> None:
        """Register the GraphRAG maintenance tools."""
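Because the handler returns `ExtractGraphMissingOut(...).model_dump()`, callers always receive a plain dict in one of two shapes. A sketch of both (the doc IDs and error message are made-up examples; the field names come from the model above):

```python
# Success: error is None and the counters agree
success = {
    "processed_docs": 2,
    "doc_ids": ["doc_a1b2", "doc_c3d4"],  # hypothetical IDs
    "error": None,
}

# Failure: empty result plus an error envelope
failure = {
    "processed_docs": 0,
    "doc_ids": [],
    "error": {"code": "LLM_ERROR", "message": "upstream timeout"},  # illustrative
}

print(success["processed_docs"], failure["error"]["code"])
```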
