
rechunk_document

Extract text from PDF documents, split into manageable chunks, and generate new embeddings for improved search and analysis in academic literature management.

Instructions

Re-chunk a document

Fetches the PDF from MinIO, re-extracts the text and splits it into chunks, then generates new embeddings. Old chunks and embeddings are deleted.

Args: doc_id: unique identifier of the document; strategy: chunking strategy, currently only "page_v1" (chunk by page) is supported; force: whether to run even if chunks already exist, defaults to False

Returns: the processing result, including the new chunk count
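
The return value is a plain dict. Based on the return statements in the handler shown under Implementation Reference, a sketch of the two most common result shapes (the field values here are illustrative, not real output):

```python
# Illustrative result payloads for rechunk_document; field names come from
# the handler's return statements, values are made-up examples.
success_result = {
    "success": True,
    "doc_id": "doc_123",
    "strategy": "page_v1",
    "n_pages": 12,
    "n_chunks": 12,
    "embedded_chunks": 12,
}

# When chunks already exist and force=False, the tool refuses to run:
refused_result = {
    "success": False,
    "error": "Document already has 12 chunks. Use force=True to rechunk.",
    "doc_id": "doc_123",
    "existing_chunks": 12,
}
```

Agents should branch on "success" first; the failure shape carries "error" and, for the already-chunked case, "existing_chunks".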

Input Schema

Name       Required   Description   Default
doc_id     Yes
strategy   No                       page_v1
force      No

Output Schema

No arguments

Implementation Reference

  • The core handler function for the 'rechunk_document' tool. It re-downloads the PDF, extracts text, chunks it, deletes old chunks/embeddings, inserts new ones, and generates embeddings.
    @mcp.tool()
    def rechunk_document(
        doc_id: str,
        strategy: str = "page_v1",
        force: bool = False,
    ) -> dict[str, Any]:
        """重新分块文档
        
        从 MinIO 获取 PDF,重新提取文本并分块,然后生成新的 embeddings。
        会删除旧的 chunks 和 embeddings。
        
        Args:
            doc_id: 文档的唯一标识符
            strategy: 分块策略,目前支持 "page_v1"(按页分块)
            force: 是否强制执行(即使已有 chunks),默认 False
            
        Returns:
            处理结果,包含新的 chunk 数量
        """
        try:
            # Check that the document exists
            doc = query_one(
                "SELECT doc_id, pdf_key FROM documents WHERE doc_id = %s",
                (doc_id,)
            )
            
            if not doc:
                return {
                    "success": False,
                    "error": f"Document not found: {doc_id}",
                    "doc_id": doc_id,
                }
            
            # Check whether chunks already exist
            existing = query_one(
                "SELECT COUNT(*) as count FROM chunks WHERE doc_id = %s",
                (doc_id,)
            )
            
            if existing and existing["count"] > 0 and not force:
                return {
                    "success": False,
                    "error": f"Document already has {existing['count']} chunks. Use force=True to rechunk.",
                    "doc_id": doc_id,
                    "existing_chunks": existing["count"],
                }
            
            settings = get_settings()
            
            # Fetch the PDF from MinIO
            pdf_content = get_object(doc["pdf_key"])
            
            # Save to a temporary file
            with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
                tmp.write(pdf_content)
                tmp_path = tmp.name
            
            try:
                # Extract text
                pdf_result = extract_pdf(tmp_path)
                
                # Delete old chunks (embeddings are removed via cascade)
                execute("DELETE FROM chunks WHERE doc_id = %s", (doc_id,))
                
                # Chunk the text
                pages = [(p.page_num, p.text) for p in pdf_result.pages if not p.is_empty]
                chunks = chunk_document(pages)
                
                if not chunks:
                    return {
                        "success": True,
                        "doc_id": doc_id,
                        "n_chunks": 0,
                        "message": "No text content extracted from PDF",
                    }
                
                # Insert into the chunks table
                chunk_ids = []
                with get_db() as conn:
                    with conn.cursor() as cur:
                        for chunk in chunks:
                            cur.execute(
                                """
                                INSERT INTO chunks (doc_id, chunk_index, page_start, page_end, text, token_count)
                                VALUES (%s, %s, %s, %s, %s, %s)
                                RETURNING chunk_id
                                """,
                                (
                                    doc_id,
                                    chunk["chunk_index"],
                                    chunk["page_start"],
                                    chunk["page_end"],
                                    chunk["text"],
                                    chunk["token_count"],
                                )
                            )
                            result = cur.fetchone()
                            if result:
                                chunk_ids.append(result["chunk_id"])
                
                # Generate embeddings
                texts = [c["text"] for c in chunks]
                embeddings = get_embeddings_chunked(texts)
                
                # Insert the embeddings
                embedded_count = 0
                with get_db() as conn:
                    with conn.cursor() as cur:
                        for chunk_id, embedding in zip(chunk_ids, embeddings):
                            embedding_str = "[" + ",".join(str(x) for x in embedding) + "]"
                            cur.execute(
                                """
                                INSERT INTO chunk_embeddings (chunk_id, embedding_model, embedding)
                                VALUES (%s, %s, %s::vector)
                                """,
                                (chunk_id, settings.embedding_model, embedding_str)
                            )
                            embedded_count += 1
                
                return {
                    "success": True,
                    "doc_id": doc_id,
                    "strategy": strategy,
                    "n_pages": pdf_result.total_pages,
                    "n_chunks": len(chunks),
                    "embedded_chunks": embedded_count,
                }
                
            finally:
                # Clean up the temporary file
                Path(tmp_path).unlink(missing_ok=True)
            
        except Exception as e:
            return {
                "success": False,
                "error": str(e),
                "doc_id": doc_id,
            }
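    The handler above serializes each embedding as a pgvector text literal before the `%s::vector` cast. That formatting step can be shown standalone (the helper name is ours, not from the source):

    ```python
    def to_pgvector_literal(embedding: list[float]) -> str:
        """Format a float vector as a pgvector text literal, e.g. "[0.1,0.2,0.3]".

        This mirrors the inline join in rechunk_document; pgvector parses this
        string when the bound parameter is cast with ::vector.
        """
        return "[" + ",".join(str(x) for x in embedding) + "]"

    print(to_pgvector_literal([0.1, 0.25, -0.5]))  # → [0.1,0.25,-0.5]
    ```

    Building the literal in Python keeps the SQL a simple parameterized insert; the alternative is registering a vector adapter with the driver.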
  • Registers the fetch tools, including rechunk_document, by calling register_fetch_tools on the MCP instance.
    register_fetch_tools(mcp)
  • Helper function called by rechunk_document to perform the actual page-based chunking of document text.
    def chunk_document(pages: list[tuple[int, str]]) -> list[dict]:
        """对文档按页分块(返回字典格式,便于数据库存储)
        
        Args:
            pages: 页面列表,每项为 (page_num, text)
            
        Returns:
            chunk 字典列表,包含 chunk_index, page_start, page_end, text, token_count
        """
        chunks = chunk_pages(pages)
        return [
            {
                "chunk_index": c.chunk_index,
                "page_start": c.page_start,
                "page_end": c.page_end,
                "text": c.text,
                "token_count": c.estimated_tokens,
            }
            for c in chunks
        ]
  • Import and invocation of the fetch tools registration in the main server file.
    from paperlib_mcp.tools.fetch import register_fetch_tools
    from paperlib_mcp.tools.writing import register_writing_tools
    
    # M2 GraphRAG tools
    from paperlib_mcp.tools.graph_extract import register_graph_extract_tools
    from paperlib_mcp.tools.graph_canonicalize import register_graph_canonicalize_tools
    from paperlib_mcp.tools.graph_community import register_graph_community_tools
    from paperlib_mcp.tools.graph_summarize import register_graph_summarize_tools
    from paperlib_mcp.tools.graph_maintenance import register_graph_maintenance_tools
    
    # M3 Review tools
    from paperlib_mcp.tools.review import register_review_tools
    
    # M4 Canonicalization & Grouping tools
    from paperlib_mcp.tools.graph_relation_canonicalize import register_graph_relation_canonicalize_tools
    from paperlib_mcp.tools.graph_claim_grouping import register_graph_claim_grouping_tools
    from paperlib_mcp.tools.graph_v12 import register_graph_v12_tools
    
    register_health_tools(mcp)
    register_import_tools(mcp)
    register_search_tools(mcp)
    register_fetch_tools(mcp)
    register_writing_tools(mcp)
    
    # Register M2 GraphRAG tools
    register_graph_extract_tools(mcp)
    register_graph_canonicalize_tools(mcp)
    register_graph_community_tools(mcp)
    register_graph_summarize_tools(mcp)
    register_graph_maintenance_tools(mcp)
    
    # Register M3 Review tools
    register_review_tools(mcp)
    
    # Register M4 Canonicalization & Grouping tools
    register_graph_relation_canonicalize_tools(mcp)
    register_graph_claim_grouping_tools(mcp)
    register_graph_v12_tools(mcp)
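  • chunk_document delegates the actual splitting to chunk_pages, whose implementation is not shown on this page. A minimal self-contained sketch of a "page_v1"-style chunker is given below — one chunk per non-empty page, with a rough ~4-characters-per-token estimate; both the function name and the token heuristic are our assumptions, not the source's.

    ```python
    def chunk_pages_sketch(pages: list[tuple[int, str]]) -> list[dict]:
        """One chunk per non-empty page, in the dict shape chunk_document
        returns. Token counts are a rough estimate (~4 chars per token);
        the real chunk_pages may count differently or merge pages."""
        chunks = []
        non_empty = (p for p in pages if p[1].strip())
        for index, (page_num, text) in enumerate(non_empty):
            chunks.append({
                "chunk_index": index,
                "page_start": page_num,
                "page_end": page_num,
                "text": text,
                "token_count": max(1, len(text) // 4),
            })
        return chunks

    pages = [(1, "Introduction text."), (2, ""), (3, "Methods text.")]
    print([c["page_start"] for c in chunk_pages_sketch(pages)])  # → [1, 3]
    ```

    Empty pages are skipped before indexing, so chunk_index stays dense even when page numbers have gaps.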
Behavior 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full burden of behavioral disclosure. It effectively describes key behaviors: fetching from MinIO, re-extracting text, re-chunking, generating new embeddings, and deleting old data. It also mentions the 'force' parameter for overriding existing chunks. However, it lacks details on permissions, rate limits, or error handling, which would be helpful for a destructive operation.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is well-structured and appropriately sized, with a clear purpose statement followed by Args and Returns sections. Every sentence adds value, but the Chinese text might require translation for broader usability, and the strategy explanation could be slightly more detailed without losing conciseness.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (destructive operation with 3 parameters) and no annotations, the description does a good job covering purpose, parameters, and behavior. The output schema exists, so return values don't need explanation. However, it could improve by mentioning side effects (e.g., data deletion impact) or dependencies (e.g., MinIO access), making it more complete for safe agent use.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 0%, so the description must compensate. It adds meaningful semantics for all three parameters: 'doc_id' as the document identifier, 'strategy' as the chunking strategy with the only supported option 'page_v1', and 'force' as a boolean to enforce re-chunking even if chunks exist. This fully explains parameter purposes beyond the bare schema, though it could provide more context on strategy implications.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose with specific verbs ('Fetches the PDF from MinIO, re-extracts the text and chunks it, then generates new embeddings') and resources ('document'), distinguishing it from siblings like 'reembed_document' (which likely only updates embeddings) and 'import_pdf' (which imports new documents). It explicitly mentions deleting old chunks and embeddings, which further clarifies its unique function.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines 3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description implies usage when a document needs re-chunking (e.g., due to strategy changes or errors), as it mentions 'force' for overriding existing chunks. However, it lacks explicit guidance on when to use this versus alternatives like 'reembed_document' (which might update embeddings without re-chunking) or 'delete_document' followed by re-importing. No clear exclusions or prerequisites are stated.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
