
reembed_document

Regenerates embeddings for a document's chunks in Paperlib MCP, refreshing the vector representations used for semantic search and analysis.

Instructions

Regenerate a document's embeddings.

Generates embeddings for the document's chunks. By default only chunks missing embeddings are processed; set force=True to regenerate all embeddings.

Args:
    doc_id: Unique identifier of the document
    batch_size: Batch size, default 64
    force: Whether to force regeneration of all embeddings, default False

Returns:
    Processing result, including the number of chunks processed
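
A minimal sketch of invoking this tool from Python with the MCP client SDK. The server launch command (python -m paperlib_mcp) and the example doc_id are assumptions for illustration, not taken from this page:

    import asyncio

    from mcp import ClientSession, StdioServerParameters
    from mcp.client.stdio import stdio_client

    async def main() -> None:
        # Assumed launch command for the Paperlib MCP server
        params = StdioServerParameters(command="python", args=["-m", "paperlib_mcp"])
        async with stdio_client(params) as (read, write):
            async with ClientSession(read, write) as session:
                await session.initialize()
                # Regenerate every embedding for a hypothetical document id
                result = await session.call_tool(
                    "reembed_document",
                    arguments={"doc_id": "doc-123", "force": True},
                )
                print(result.content)

    asyncio.run(main())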

Input Schema

Name        Required  Description                                       Default
doc_id      Yes       Unique identifier of the document                 —
batch_size  No        Batch size for embedding generation               64
force       No        Regenerate all embeddings, not only missing ones  False

Implementation Reference

  • The core handler for the 'reembed_document' MCP tool. It regenerates embeddings for a document's chunks (optionally all of them when force=True), generating them in batches and upserting the results into the database. Sketches of the embedding helper and the server wiring follow this list.

    @mcp.tool()
    def reembed_document(
        doc_id: str,
        batch_size: int = 64,
        force: bool = False,
    ) -> dict[str, Any]:
        """Regenerate a document's embeddings.

        Generates embeddings for the document's chunks. By default only chunks
        missing embeddings are processed; set force=True to regenerate all
        embeddings.

        Args:
            doc_id: Unique identifier of the document
            batch_size: Batch size, default 64
            force: Whether to force regeneration of all embeddings, default False

        Returns:
            Processing result, including the number of chunks processed
        """
        try:
            # Verify that the document exists
            doc = query_one(
                "SELECT doc_id FROM documents WHERE doc_id = %s", (doc_id,)
            )
            if not doc:
                return {
                    "success": False,
                    "error": f"Document not found: {doc_id}",
                    "doc_id": doc_id,
                }

            settings = get_settings()

            # Find the chunks that need processing
            if force:
                # Delete the existing embeddings
                execute(
                    """
                    DELETE FROM chunk_embeddings
                    WHERE chunk_id IN (SELECT chunk_id FROM chunks WHERE doc_id = %s)
                    """,
                    (doc_id,),
                )
                chunks = query_all(
                    "SELECT chunk_id, text FROM chunks WHERE doc_id = %s ORDER BY chunk_index",
                    (doc_id,),
                )
            else:
                # Select only the chunks that are missing embeddings
                chunks = query_all(
                    """
                    SELECT c.chunk_id, c.text
                    FROM chunks c
                    LEFT JOIN chunk_embeddings ce ON c.chunk_id = ce.chunk_id
                    WHERE c.doc_id = %s AND ce.chunk_id IS NULL
                    ORDER BY c.chunk_index
                    """,
                    (doc_id,),
                )

            if not chunks:
                return {
                    "success": True,
                    "doc_id": doc_id,
                    "processed_chunks": 0,
                    "message": "No chunks need embedding",
                }

            # Generate embeddings in batches
            chunk_ids = [c["chunk_id"] for c in chunks]
            texts = [c["text"] for c in chunks]
            embeddings = get_embeddings_chunked(texts, batch_size=batch_size)

            # Write the vectors to the database
            embedded_count = 0
            with get_db() as conn:
                with conn.cursor() as cur:
                    for chunk_id, embedding in zip(chunk_ids, embeddings):
                        embedding_str = "[" + ",".join(str(x) for x in embedding) + "]"
                        cur.execute(
                            """
                            INSERT INTO chunk_embeddings (chunk_id, embedding_model, embedding)
                            VALUES (%s, %s, %s::vector)
                            ON CONFLICT (chunk_id) DO UPDATE SET
                                embedding_model = EXCLUDED.embedding_model,
                                embedding = EXCLUDED.embedding
                            """,
                            (chunk_id, settings.embedding_model, embedding_str),
                        )
                        embedded_count += 1

            return {
                "success": True,
                "doc_id": doc_id,
                "processed_chunks": embedded_count,
                "total_chunks": len(chunks),
                "embedding_model": settings.embedding_model,
            }

        except Exception as e:
            return {
                "success": False,
                "error": str(e),
                "doc_id": doc_id,
            }
  • Top-level registration call that registers the fetch tools module, including reembed_document, on the MCP server instance.
    register_fetch_tools(mcp)
  • Module-level registration function that defines and registers all fetch tools (including reembed_document, via @mcp.tool() decorators) on the MCP instance; a wiring sketch follows this list.
    def register_fetch_tools(mcp: FastMCP) -> None:
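
The handler relies on project-internal helpers (query_one, query_all, execute, get_db, get_settings, get_embeddings_chunked) that are not shown on this page. As a rough sketch of what the batching helper might look like, assuming an OpenAI-compatible embeddings client and a placeholder model name:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def get_embeddings_chunked(
        texts: list[str], batch_size: int = 64
    ) -> list[list[float]]:
        """Embed texts in batches so each request stays within API limits."""
        embeddings: list[list[float]] = []
        for i in range(0, len(texts), batch_size):
            batch = texts[i : i + batch_size]
            resp = client.embeddings.create(
                model="text-embedding-3-small",  # placeholder; the real model comes from settings
                input=batch,
            )
            # The API returns vectors in input order, so they can be appended directly
            embeddings.extend(item.embedding for item in resp.data)
        return embeddings

Batching keeps each request within provider size limits and means a failed batch can be retried without re-embedding the whole document.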
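
For context, the two registration snippets above typically wire together as follows in a FastMCP server; the server name and the import path here are assumptions, not taken from this page:

    from mcp.server.fastmcp import FastMCP

    # Assumed import path for the module that defines register_fetch_tools
    from paperlib_mcp.tools.fetch import register_fetch_tools

    mcp = FastMCP("paperlib-mcp")  # assumed server name
    register_fetch_tools(mcp)      # registers reembed_document and its peer tools

    if __name__ == "__main__":
        mcp.run()  # serves over stdio by default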


MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/h-lu/paperlib-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.