
reembed_document

Regenerate embeddings for document chunks in Paperlib MCP to update vector representations for improved semantic search and analysis.

Instructions

Regenerate a document's embeddings

Generates embeddings for a document's chunks. By default, only chunks that are missing embeddings are processed; set force=True to regenerate all embeddings.

Args:
doc_id: Unique identifier of the document
batch_size: Batch size, default 64
force: Whether to regenerate all embeddings, default False

Returns: Processing result, including the number of chunks processed
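The batch_size argument controls how many chunk texts are sent to the embedding model per request. A minimal sketch of that batching logic, assuming a simple fixed-size split (the helper name `chunked` is illustrative; the server's actual `get_embeddings_chunked` may differ):

```python
from typing import Iterator

def chunked(items: list[str], batch_size: int = 64) -> Iterator[list[str]]:
    """Yield successive batches of at most batch_size items each."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# 150 chunk texts split with the default batch size of 64
texts = [f"chunk {i}" for i in range(150)]
batches = list(chunked(texts))
print([len(b) for b in batches])  # [64, 64, 22]
```

With this scheme a document's chunks produce ceil(n / batch_size) embedding requests, which bounds per-request payload size at the cost of extra round trips for small batch sizes.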

Input Schema

| Name | Required | Description | Default |
|------|----------|-------------|---------|
| doc_id | Yes | Unique identifier of the document | |
| batch_size | No | Batch size for embedding generation | 64 |
| force | No | Regenerate all embeddings, not only missing ones | False |
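A call to this tool supplies arguments matching the schema above. A sketch of the argument payload as it might appear inside an MCP tools/call request (the doc_id value is illustrative, and the surrounding JSON-RPC envelope is omitted):

```python
import json

# Illustrative arguments for a reembed_document tool call
arguments = {"doc_id": "doc-123", "batch_size": 64, "force": False}
payload = json.dumps({"name": "reembed_document", "arguments": arguments})
print(payload)
```

Only doc_id is required; omitted optional fields fall back to the defaults shown in the table.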

Implementation Reference

  • The core handler function for the 'reembed_document' MCP tool. It regenerates embeddings for a document's chunks (optionally forcing all of them), batching the embedding calls and storing the results in the database.
    @mcp.tool()
    def reembed_document(
        doc_id: str,
        batch_size: int = 64,
        force: bool = False,
    ) -> dict[str, Any]:
        """Regenerate a document's embeddings.
        
        Generates embeddings for the document's chunks. By default only
        chunks missing embeddings are processed; set force=True to
        regenerate all embeddings.
        
        Args:
            doc_id: Unique identifier of the document
            batch_size: Batch size, default 64
            force: Whether to regenerate all embeddings, default False
            
        Returns:
            Processing result, including the number of chunks processed
        """
        try:
            # Check that the document exists
            doc = query_one(
                "SELECT doc_id FROM documents WHERE doc_id = %s",
                (doc_id,)
            )
            
            if not doc:
                return {
                    "success": False,
                    "error": f"Document not found: {doc_id}",
                    "doc_id": doc_id,
                }
            
            settings = get_settings()
            
            # Find the chunks that need processing
            if force:
                # Delete existing embeddings
                execute(
                    """
                    DELETE FROM chunk_embeddings 
                    WHERE chunk_id IN (SELECT chunk_id FROM chunks WHERE doc_id = %s)
                    """,
                    (doc_id,)
                )
                chunks = query_all(
                    "SELECT chunk_id, text FROM chunks WHERE doc_id = %s ORDER BY chunk_index",
                    (doc_id,)
                )
            else:
                # Only select chunks that are missing embeddings
                chunks = query_all(
                    """
                    SELECT c.chunk_id, c.text 
                    FROM chunks c
                    LEFT JOIN chunk_embeddings ce ON c.chunk_id = ce.chunk_id
                    WHERE c.doc_id = %s AND ce.chunk_id IS NULL
                    ORDER BY c.chunk_index
                    """,
                    (doc_id,)
                )
            
            if not chunks:
                return {
                    "success": True,
                    "doc_id": doc_id,
                    "processed_chunks": 0,
                    "message": "No chunks need embedding",
                }
            
            # Generate embeddings in batches
            chunk_ids = [c["chunk_id"] for c in chunks]
            texts = [c["text"] for c in chunks]
            embeddings = get_embeddings_chunked(texts, batch_size=batch_size)
            
            # Write embeddings to the database
            embedded_count = 0
            with get_db() as conn:
                with conn.cursor() as cur:
                    for chunk_id, embedding in zip(chunk_ids, embeddings):
                        embedding_str = "[" + ",".join(str(x) for x in embedding) + "]"
                        cur.execute(
                            """
                            INSERT INTO chunk_embeddings (chunk_id, embedding_model, embedding)
                            VALUES (%s, %s, %s::vector)
                            ON CONFLICT (chunk_id) DO UPDATE SET
                                embedding_model = EXCLUDED.embedding_model,
                                embedding = EXCLUDED.embedding
                            """,
                            (chunk_id, settings.embedding_model, embedding_str)
                        )
                        embedded_count += 1
            
            return {
                "success": True,
                "doc_id": doc_id,
                "processed_chunks": embedded_count,
                "total_chunks": len(chunks),
                "embedding_model": settings.embedding_model,
            }
            
        except Exception as e:
            return {
                "success": False,
                "error": str(e),
                "doc_id": doc_id,
            }
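Before the upsert, the handler serializes each embedding as a pgvector text literal and casts it with ::vector. That formatting step in isolation (the helper name `to_pgvector_literal` is illustrative, not part of the server's code):

```python
def to_pgvector_literal(embedding: list[float]) -> str:
    """Format a float list as a pgvector text literal, e.g. [0.1,0.2,0.3]."""
    return "[" + ",".join(str(x) for x in embedding) + "]"

print(to_pgvector_literal([0.1, 0.2, 0.3]))  # [0.1,0.2,0.3]
```

Passing this string with a %s placeholder and an explicit ::vector cast lets the driver parameterize the value while Postgres parses it into the vector type.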
  • Top-level registration call that registers the fetch tools module, including reembed_document, on the MCP server instance.
    register_fetch_tools(mcp)
  • Module-level registration function that defines and registers all fetch tools (including reembed_document via @mcp.tool() decorators) to the MCP instance.
    def register_fetch_tools(mcp: FastMCP) -> None:
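The decorator-based registration pattern above can be illustrated with a minimal stand-in for the server object. This sketch assumes only that @mcp.tool() records the decorated function by name; the ToolRegistry class and the simplified tool body are hypothetical, not FastMCP's actual implementation:

```python
from typing import Any, Callable

class ToolRegistry:
    """Minimal stand-in for an MCP server's tool-registration interface."""
    def __init__(self) -> None:
        self.tools: dict[str, Callable[..., Any]] = {}

    def tool(self) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
        def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
            # Register the function under its own name, then return it unchanged
            self.tools[fn.__name__] = fn
            return fn
        return decorator

def register_fetch_tools(mcp: ToolRegistry) -> None:
    @mcp.tool()
    def reembed_document(doc_id: str, batch_size: int = 64, force: bool = False) -> dict:
        # Simplified stand-in body; the real handler queries and updates the database
        return {"success": True, "doc_id": doc_id}

mcp = ToolRegistry()
register_fetch_tools(mcp)
print(sorted(mcp.tools))  # ['reembed_document']
```

Calling register_fetch_tools(mcp) at module level, as the source does, is what makes reembed_document visible to MCP clients once the server starts.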
