# reembed_document
Regenerates embeddings for a document's chunks in Paperlib MCP, refreshing the vector representations used for semantic search and analysis.
## Instructions
Regenerate a document's embeddings.

Generates embeddings for the document's chunks. By default, only chunks missing an embedding are processed; set force=True to regenerate all embeddings.

Args:
- doc_id: Unique identifier of the document
- batch_size: Batch size, default 64
- force: Whether to force regeneration of all embeddings, default False

Returns: Processing result, including the number of chunks processed
## Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| doc_id | Yes | Unique identifier of the document | |
| batch_size | No | Number of chunks to embed per batch | 64 |
| force | No | Regenerate all embeddings instead of only missing ones | False |
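For illustration, here is what the arguments and a typical success payload look like. The result fields mirror the handler's return dictionary shown below; the doc_id and embedding model values are made-up examples:

```python
# Hypothetical arguments for a reembed_document call
args = {
    "doc_id": "doc_7f3a9c",  # made-up identifier
    "batch_size": 64,        # default batch size
    "force": False,          # only embed chunks that lack an embedding
}

# A typical success result (shape taken from the handler's return value)
result = {
    "success": True,
    "doc_id": "doc_7f3a9c",
    "processed_chunks": 42,
    "total_chunks": 42,
    "embedding_model": "text-embedding-3-small",  # example model name
}
```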
## Implementation Reference
- `src/paperlib_mcp/tools/fetch.py:580-687` (handler): The core handler function for the `reembed_document` MCP tool. It regenerates embeddings for a document's chunks (optionally forcing all), processing in batches and storing results in the database.

```python
@mcp.tool()
def reembed_document(
    doc_id: str,
    batch_size: int = 64,
    force: bool = False,
) -> dict[str, Any]:
    """Regenerate a document's embeddings.

    Generates embeddings for the document's chunks. By default, only
    chunks missing an embedding are processed; set force=True to
    regenerate all embeddings.

    Args:
        doc_id: Unique identifier of the document
        batch_size: Batch size, default 64
        force: Whether to force regeneration of all embeddings, default False

    Returns:
        Processing result, including the number of chunks processed
    """
    try:
        # Check that the document exists
        doc = query_one(
            "SELECT doc_id FROM documents WHERE doc_id = %s", (doc_id,)
        )
        if not doc:
            return {
                "success": False,
                "error": f"Document not found: {doc_id}",
                "doc_id": doc_id,
            }

        settings = get_settings()

        # Find the chunks that need processing
        if force:
            # Delete existing embeddings
            execute(
                """
                DELETE FROM chunk_embeddings
                WHERE chunk_id IN (SELECT chunk_id FROM chunks WHERE doc_id = %s)
                """,
                (doc_id,)
            )
            chunks = query_all(
                "SELECT chunk_id, text FROM chunks WHERE doc_id = %s ORDER BY chunk_index",
                (doc_id,)
            )
        else:
            # Only select chunks that are missing an embedding
            chunks = query_all(
                """
                SELECT c.chunk_id, c.text
                FROM chunks c
                LEFT JOIN chunk_embeddings ce ON c.chunk_id = ce.chunk_id
                WHERE c.doc_id = %s AND ce.chunk_id IS NULL
                ORDER BY c.chunk_index
                """,
                (doc_id,)
            )

        if not chunks:
            return {
                "success": True,
                "doc_id": doc_id,
                "processed_chunks": 0,
                "message": "No chunks need embedding",
            }

        # Generate embeddings in batches
        chunk_ids = [c["chunk_id"] for c in chunks]
        texts = [c["text"] for c in chunks]
        embeddings = get_embeddings_chunked(texts, batch_size=batch_size)

        # Write to the database
        embedded_count = 0
        with get_db() as conn:
            with conn.cursor() as cur:
                for chunk_id, embedding in zip(chunk_ids, embeddings):
                    embedding_str = "[" + ",".join(str(x) for x in embedding) + "]"
                    cur.execute(
                        """
                        INSERT INTO chunk_embeddings (chunk_id, embedding_model, embedding)
                        VALUES (%s, %s, %s::vector)
                        ON CONFLICT (chunk_id) DO UPDATE SET
                            embedding_model = EXCLUDED.embedding_model,
                            embedding = EXCLUDED.embedding
                        """,
                        (chunk_id, settings.embedding_model, embedding_str)
                    )
                    embedded_count += 1

        return {
            "success": True,
            "doc_id": doc_id,
            "processed_chunks": embedded_count,
            "total_chunks": len(chunks),
            "embedding_model": settings.embedding_model,
        }

    except Exception as e:
        return {
            "success": False,
            "error": str(e),
            "doc_id": doc_id,
        }
```
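Two details worth noting: the `ON CONFLICT ... DO UPDATE` upsert keeps re-embedding idempotent, and the actual batching happens inside `get_embeddings_chunked`, whose implementation is not shown above. A minimal sketch of what such a helper could look like, assuming a `get_embeddings` function that embeds a single batch (both the helper's internals and that function are assumptions, not Paperlib MCP's actual code):

```python
def get_embeddings_chunked(
    texts: list[str], batch_size: int = 64
) -> list[list[float]]:
    """Embed texts in fixed-size batches, returning one vector per text.

    Sketch only: assumes a get_embeddings(batch) helper that calls the
    embedding API once and returns a list of vectors in input order.
    """
    embeddings: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start : start + batch_size]
        embeddings.extend(get_embeddings(batch))  # hypothetical per-batch call
    return embeddings
```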
- `src/paperlib_mcp/server.py:36` (registration): Top-level call that registers the fetch tools module, including `reembed_document`, on the MCP server instance.

```python
register_fetch_tools(mcp)
```
- `src/paperlib_mcp/tools/fetch.py:52` (registration): Module-level registration function that defines and registers all fetch tools (including `reembed_document` via `@mcp.tool()` decorators) on the MCP instance.

```python
def register_fetch_tools(mcp: FastMCP) -> None:
```
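Taken together, the two registration hooks follow FastMCP's decorator pattern: the module-level function closes over the server instance, and each `@mcp.tool()` decorator registers one tool on it. A simplified sketch of that shape (the server name and the elided body are assumptions; the real module registers several other fetch tools as well):

```python
from mcp.server.fastmcp import FastMCP


def register_fetch_tools(mcp: FastMCP) -> None:
    """Define fetch tools and attach them to the given MCP server."""

    @mcp.tool()
    def reembed_document(
        doc_id: str, batch_size: int = 64, force: bool = False
    ) -> dict:
        ...  # full handler body as shown above


mcp = FastMCP("paperlib-mcp")  # server name is an assumption
register_fetch_tools(mcp)
```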