get_document_text_extract
Extract the text content of an IBM FileNet document, identified by document ID or path, from its text-extract annotations for downstream processing and analysis.
Instructions
Retrieves a document's text extract content.
:param identifier: The document id or path (required). This can be either the document's ID (GUID) or its path in the repository (e.g., "/Folder1/document.pdf").
:returns: The text content of the document's text extract annotation. If multiple text extracts are found, they will be concatenated. Returns an empty string if no text extract is found.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| identifier | Yes | The document's ID (GUID) or its path in the repository (e.g., "/Folder1/document.pdf"). | — |
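The `identifier` parameter accepts either form. As a rough illustration of how the two forms differ, the hypothetical helper below (not part of the server code) distinguishes a GUID-style ID from a repository path; the regex is an assumption about FileNet's GUID format:

```python
import re

# Hypothetical sketch: FileNet document IDs are GUIDs, optionally brace-wrapped.
GUID_RE = re.compile(
    r"^\{?[0-9A-Fa-f]{8}(-[0-9A-Fa-f]{4}){3}-[0-9A-Fa-f]{12}\}?$"
)

def classify_identifier(identifier: str) -> str:
    """Return 'id' for a GUID-style identifier, 'path' otherwise."""
    return "id" if GUID_RE.match(identifier) else "path"
```

Either value can be passed to the tool unchanged; the server resolves it against the configured object store.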
Implementation Reference
- The handler function for the `get_document_text_extract` tool. It queries the GraphQL API for document annotations matching `TEXT_EXTRACT_ANNOTATION_CLASS`, downloads the text content from each matching annotation's content elements using `download_text_async`, concatenates them with separators, and returns the full text.

```python
@mcp.tool(
    name="get_document_text_extract",
)
async def get_document_text_extract(identifier: str) -> str:
    """
    Retrieves a document's text extract content.

    :param identifier: The document id or path (required). This can be either
        the document's ID (GUID) or its path in the repository
        (e.g., "/Folder1/document.pdf").
    :returns: The text content of the document's text extract annotation. If
        multiple text extracts are found, they will be concatenated. Returns
        an empty string if no text extract is found.
    """
    query = """
        query getDocumentTextExtract($object_store_name: String!, $identifier: String!) {
            document(repositoryIdentifier: $object_store_name, identifier: $identifier) {
                annotations {
                    annotations {
                        id
                        name
                        className
                        annotatedContentElement
                        descriptiveText
                        contentElements {
                            ... on ContentTransfer {
                                downloadUrl
                                retrievalName
                                contentSize
                            }
                        }
                    }
                }
            }
        }
    """
    variables = {
        "identifier": identifier,
        "object_store_name": graphql_client.object_store,
    }

    # First run execute_async and wait for the result
    result = await graphql_client.execute_async(query=query, variables=variables)

    # Initialize an empty string to store all text content
    all_text_content = ""

    # Check if we have a valid result with annotations
    if (
        result
        and "data" in result
        and result["data"]
        and "document" in result["data"]
        and result["data"]["document"]
        and "annotations" in result["data"]["document"]
        and result["data"]["document"]["annotations"]
        and "annotations" in result["data"]["document"]["annotations"]
    ):
        annotations = result["data"]["document"]["annotations"]["annotations"]

        # Process each annotation
        for annotation in annotations:
            if (
                "contentElements" in annotation
                and annotation["className"] == TEXT_EXTRACT_ANNOTATION_CLASS
                and annotation["annotatedContentElement"] is not None
            ):
                # Process each content element
                for content_element in annotation["contentElements"]:
                    if (
                        "downloadUrl" in content_element
                        and content_element["downloadUrl"]
                    ):
                        # Download the text content using the downloadUrl
                        download_url = content_element["downloadUrl"]
                        text_content = await graphql_client.download_text_async(
                            download_url
                        )

                        # Append the text content to our result string
                        if text_content:
                            if all_text_content:
                                all_text_content += TEXT_EXTRACT_SEPARATOR
                            all_text_content += text_content

    return all_text_content
```
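The accumulation step (separator only between non-empty chunks, empty string when nothing is found) can be reduced to a small pure function. This is a sketch for illustration only; `TEXT_EXTRACT_SEPARATOR` is defined elsewhere in the server, and its value here is an assumption:

```python
TEXT_EXTRACT_SEPARATOR = "\n\n"  # assumed value; the real constant lives in the server code

def join_extracts(chunks):
    """Concatenate non-empty text chunks with the separator, mirroring the
    handler's accumulation loop: no leading/trailing separator, and an empty
    string when no chunk has content."""
    return TEXT_EXTRACT_SEPARATOR.join(c for c in chunks if c)
```

For example, `join_extracts(["page 1", "", "page 2"])` yields the two pages joined by a single separator, and `join_extracts([])` yields `""`, matching the documented return contract.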
- `src/cs_mcp_server/mcp_server_main.py:219-223` (registration): The `register_server_tools` function calls `register_document_tools` (which registers the `get_document_text_extract` tool among others) for CORE and FULL server types.

```python
if server_type == ServerType.CORE:
    register_document_tools(mcp, graphql_client, metadata_cache)
    register_folder_tools(mcp, graphql_client)
    register_class_tools(mcp, graphql_client, metadata_cache)
    register_search_tools(mcp, graphql_client, metadata_cache)
```
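The snippet above only shows the CORE branch; per the description, FULL servers register the same document tools. A minimal sketch of that dispatch, with an assumed `ServerType` enum (the real enum's members and values are not shown in the source):

```python
from enum import Enum, auto

class ServerType(Enum):
    # Assumed shape of the real enum; only CORE appears in the excerpt above.
    CORE = auto()
    FULL = auto()

def registered_tool_groups(server_type: ServerType) -> list[str]:
    """Sketch: document tools (including get_document_text_extract) are
    registered for both CORE and FULL server types."""
    if server_type in (ServerType.CORE, ServerType.FULL):
        return ["document", "folder", "class", "search"]
    return []
```

In either mode, a connected MCP client will therefore see `get_document_text_extract` in the tool list.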