get_document_text_extract
Extract the text content of an IBM FileNet document, identified by document ID or path, from its text-extract annotations for downstream processing and analysis.
Instructions
Retrieves a document's text extract content.
:param identifier: The document id or path (required). This can be either the document's ID (GUID) or its path in the repository (e.g., "/Folder1/document.pdf").
:returns: The text content of the document's text extract annotation. If multiple text extracts are found, they will be concatenated. Returns an empty string if no text extract is found.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| identifier | Yes | The document's ID (GUID) or its path in the repository (e.g., "/Folder1/document.pdf"). | — |
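The `identifier` parameter accepts either form. As a rough illustration of how the two forms differ, the hypothetical helper below (not part of the server code) distinguishes a GUID-style ID from a repository path; the regex is an assumption about FileNet's GUID format:

```python
import re

# Hypothetical sketch: FileNet document IDs are GUIDs, optionally brace-wrapped.
GUID_RE = re.compile(
    r"^\{?[0-9A-Fa-f]{8}(-[0-9A-Fa-f]{4}){3}-[0-9A-Fa-f]{12}\}?$"
)

def classify_identifier(identifier: str) -> str:
    """Return 'id' for a GUID-style identifier, 'path' otherwise."""
    return "id" if GUID_RE.match(identifier) else "path"
```

Either value can be passed to the tool unchanged; the server resolves it against the configured object store.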
Implementation Reference
- The handler function for the `get_document_text_extract` tool. It queries the GraphQL API for document annotations matching `TEXT_EXTRACT_ANNOTATION_CLASS`, downloads the text content from each matching annotation's content elements using `download_text_async`, concatenates them with separators, and returns the full text.

```python
@mcp.tool(
    name="get_document_text_extract",
)
async def get_document_text_extract(identifier: str) -> str:
    """
    Retrieves a document's text extract content.

    :param identifier: The document id or path (required). This can be either
        the document's ID (GUID) or its path in the repository
        (e.g., "/Folder1/document.pdf").
    :returns: The text content of the document's text extract annotation. If
        multiple text extracts are found, they will be concatenated. Returns
        an empty string if no text extract is found.
    """
    query = """
        query getDocumentTextExtract($object_store_name: String!, $identifier: String!) {
            document(repositoryIdentifier: $object_store_name, identifier: $identifier) {
                annotations {
                    annotations {
                        id
                        name
                        className
                        annotatedContentElement
                        descriptiveText
                        contentElements {
                            ... on ContentTransfer {
                                downloadUrl
                                retrievalName
                                contentSize
                            }
                        }
                    }
                }
            }
        }
    """
    variables = {
        "identifier": identifier,
        "object_store_name": graphql_client.object_store,
    }

    # First run execute_async and wait for the result
    result = await graphql_client.execute_async(query=query, variables=variables)

    # Initialize an empty string to store all text content
    all_text_content = ""

    # Check if we have a valid result with annotations
    if (
        result
        and "data" in result
        and result["data"]
        and "document" in result["data"]
        and result["data"]["document"]
        and "annotations" in result["data"]["document"]
        and result["data"]["document"]["annotations"]
        and "annotations" in result["data"]["document"]["annotations"]
    ):
        annotations = result["data"]["document"]["annotations"]["annotations"]

        # Process each annotation
        for annotation in annotations:
            if (
                "contentElements" in annotation
                and annotation["className"] == TEXT_EXTRACT_ANNOTATION_CLASS
                and annotation["annotatedContentElement"] is not None
            ):
                # Process each content element
                for content_element in annotation["contentElements"]:
                    if (
                        "downloadUrl" in content_element
                        and content_element["downloadUrl"]
                    ):
                        # Download the text content using the downloadUrl
                        download_url = content_element["downloadUrl"]
                        text_content = await graphql_client.download_text_async(
                            download_url
                        )

                        # Append the text content to our result string
                        if text_content:
                            if all_text_content:
                                all_text_content += TEXT_EXTRACT_SEPARATOR
                            all_text_content += text_content

    return all_text_content
```
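The accumulation step (separator only between non-empty chunks, empty string when nothing is found) can be reduced to a small pure function. This is a sketch for illustration only; `TEXT_EXTRACT_SEPARATOR` is defined elsewhere in the server, and its value here is an assumption:

```python
TEXT_EXTRACT_SEPARATOR = "\n\n"  # assumed value; the real constant lives in the server code

def join_extracts(chunks):
    """Concatenate non-empty text chunks with the separator, mirroring the
    handler's accumulation loop: no leading/trailing separator, and an empty
    string when no chunk has content."""
    return TEXT_EXTRACT_SEPARATOR.join(c for c in chunks if c)
```

For example, `join_extracts(["page 1", "", "page 2"])` yields the two pages joined by a single separator, and `join_extracts([])` yields `""`, matching the documented return contract.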
- `src/cs_mcp_server/mcp_server_main.py:219-223` (registration): The `register_server_tools` function calls `register_document_tools` (which registers the `get_document_text_extract` tool among others) for CORE and FULL server types.

```python
if server_type == ServerType.CORE:
    register_document_tools(mcp, graphql_client, metadata_cache)
    register_folder_tools(mcp, graphql_client)
    register_class_tools(mcp, graphql_client, metadata_cache)
    register_search_tools(mcp, graphql_client, metadata_cache)
```
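The snippet above only shows the CORE branch; per the description, FULL servers register the same document tools. A minimal sketch of that dispatch, with an assumed `ServerType` enum (the real enum's members and values are not shown in the source):

```python
from enum import Enum, auto

class ServerType(Enum):
    # Assumed shape of the real enum; only CORE appears in the excerpt above.
    CORE = auto()
    FULL = auto()

def registered_tool_groups(server_type: ServerType) -> list[str]:
    """Sketch: document tools (including get_document_text_extract) are
    registered for both CORE and FULL server types."""
    if server_type in (ServerType.CORE, ServerType.FULL):
        return ["document", "folder", "class", "search"]
    return []
```

In either mode, a connected MCP client will therefore see `get_document_text_extract` in the tool list.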