Core Content Services MCP Server

Official
by ibm-ecm

get_document_text_extract

Retrieve the text extract content of a document by its ID or repository path. Returns the concatenated text, or an empty string if no text extract is found.

Instructions

Retrieves a document's text extract content.

:param identifier: The document id or path (required). This can be either the document's ID (GUID) or its path in the repository (e.g., "/Folder1/document.pdf").

:returns: The text content of the document's text extract annotation. If multiple text extracts are found, they will be concatenated. Returns an empty string if no text extract is found.
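
The two accepted identifier forms can be illustrated with a small sketch. `classify_identifier` is a hypothetical helper and is not part of the server; the excerpts on this page show the identifier being passed to the GraphQL query unchanged. The sketch assumes that repository paths start with "/" and that anything else is meant to be a GUID.

```python
import uuid


def classify_identifier(identifier: str) -> str:
    """Return "path" or "guid" for a document identifier (illustrative only)."""
    # Assumption: repository paths always begin with a slash.
    if identifier.startswith("/"):
        return "path"
    try:
        # uuid.UUID accepts GUIDs with or without surrounding braces.
        uuid.UUID(identifier)
    except ValueError:
        raise ValueError(
            f"not a GUID or repository path: {identifier!r}"
        ) from None
    return "guid"


print(classify_identifier("/Folder1/document.pdf"))
print(classify_identifier("{12345678-1234-5678-1234-567812345678}"))
```

A check like this would belong on the caller's side, for example to reject malformed input before invoking the tool.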

Input Schema

Name        Required  Description  Default
identifier  Yes       -            -

Output Schema

Name    Required  Description  Default
result  Yes       -            -

Implementation Reference

  • Registration of the 'get_document_text_extract' tool via the @mcp.tool decorator. The function get_document_text_extract(identifier) delegates to the helper function get_document_text_extract_content in cs_mcp_server/utils/utils.py.
    @mcp.tool(
        name="get_document_text_extract",
    )
    async def get_document_text_extract(identifier: str) -> str:
        """
        Retrieves a document's text extract content.
    
        :param identifier: The document id or path (required). This can be either the document's ID (GUID)
                          or its path in the repository (e.g., "/Folder1/document.pdf").
    
        :returns: The text content of the document's text extract annotation.
                 If multiple text extracts are found, they will be concatenated.
                 Returns an empty string if no text extract is found.
        """
        return await get_document_text_extract_content(
            graphql_client=graphql_client, identifier=identifier
        )
  • The actual implementation/helper function 'get_document_text_extract_content'. It executes a GraphQL query to fetch document annotations, filters for text extract annotations (by className), downloads text content from each content element's downloadUrl, and concatenates them with a separator.
    async def get_document_text_extract_content(
        graphql_client: GraphQLClient, identifier: str
    ) -> str:
        """
        Retrieves a document's text extract content.
    
        This utility function queries the document's annotations, filters for text extract
        annotations, and downloads the text content from each annotation's content elements.
    
        :param graphql_client: GraphQL client instance
        :param identifier: The document id or path (GUID or repository path)
        :returns: The concatenated text content from all text extract annotations.
                 Returns empty string if no text extract is found.
        """
        query = """
        query getDocumentTextExtract($object_store_name: String!, $identifier: String!) {
            document(repositoryIdentifier: $object_store_name, identifier: $identifier) {
                annotations{
                    annotations{
                        id
                        name
                        className
                        annotatedContentElement
                        descriptiveText
                        contentElements{
                            ... on ContentTransfer{
                                downloadUrl
                                retrievalName
                                contentSize
                            }
                        }
                    }
                }
            }
        }
        """
    
        variables = {
            "identifier": identifier,
            "object_store_name": graphql_client.object_store,
        }
    
        # Execute query
        result = await graphql_client.execute_async(query=query, variables=variables)
    
        # Initialize empty string for text content
        all_text_content = ""
    
        # Check if we have valid result with annotations
        if (
            result
            and "data" in result
            and result["data"]
            and "document" in result["data"]
            and result["data"]["document"]
            and "annotations" in result["data"]["document"]
            and result["data"]["document"]["annotations"]
            and "annotations" in result["data"]["document"]["annotations"]
        ):
            annotations = result["data"]["document"]["annotations"]["annotations"]
    
            # Process each annotation
            for annotation in annotations:
                if (
                    "contentElements" in annotation
                    and annotation["className"] == TEXT_EXTRACT_ANNOTATION_CLASS
                    and annotation["annotatedContentElement"] is not None
                ):
                    # Process each content element
                    for content_element in annotation["contentElements"]:
                        if (
                            "downloadUrl" in content_element
                            and content_element["downloadUrl"]
                        ):
                            # Download the text content
                            download_url = content_element["downloadUrl"]
                            text_content = await graphql_client.download_text_async(
                                download_url
                            )
    
                            # Append text content with separator
                            if text_content:
                                if all_text_content:
                                    all_text_content += TEXT_EXTRACT_SEPARATOR
                                all_text_content += text_content
    
        return all_text_content
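  • A separate helper function '_fetch_text_extract_by_identifier' in the resources module that runs the same query and filtering logic (used as a resource, not a tool). Unlike the tool helper, it logs GraphQL errors and, if an exception is raised, returns an error message string instead of propagating the exception.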
    async def _fetch_text_extract_by_identifier(
        graphql_client: GraphQLClient, identifier: str
    ) -> str:
        """Fetch text extract content for a document by ID or path."""
    
        query = """
        query getDocumentTextExtract($object_store_name: String!, $identifier: String!) {
            document(repositoryIdentifier: $object_store_name, identifier: $identifier) {
                annotations{
                    annotations{
                        id
                        name
                        className
                        annotatedContentElement
                        descriptiveText
                        contentElements{
                            ... on ContentTransfer{
                                downloadUrl
                                retrievalName
                                contentSize
                            }
                        }
                    }
                }
            }
        }
        """
    
        variables = {
            "identifier": identifier,
            "object_store_name": graphql_client.object_store,
        }
    
        try:
            result = await graphql_client.execute_async(query=query, variables=variables)
    
            if "errors" in result:
                logger.error("GraphQL errors in text extract query: %s", result["errors"])
    
            all_text_content = ""
    
            if (
                result
                and "data" in result
                and result["data"]
                and "document" in result["data"]
                and result["data"]["document"]
                and "annotations" in result["data"]["document"]
                and result["data"]["document"]["annotations"]
                and "annotations" in result["data"]["document"]["annotations"]
            ):
                annotations = result["data"]["document"]["annotations"]["annotations"]
    
                for annotation in annotations:
                    if (
                        "contentElements" in annotation
                        and annotation["className"] == TEXT_EXTRACT_ANNOTATION_CLASS
                        and annotation["annotatedContentElement"] is not None
                    ):
                        for content_element in annotation["contentElements"]:
                            if (
                                "downloadUrl" in content_element
                                and content_element["downloadUrl"]
                            ):
                                download_url = content_element["downloadUrl"]
                                text_content = await graphql_client.download_text_async(
                                    download_url
                                )
    
                                if text_content:
                                    if all_text_content:
                                        all_text_content += TEXT_EXTRACT_SEPARATOR
                                    all_text_content += text_content
    
            return all_text_content
    
        except Exception as e:
            logger.error("Error fetching text extract for %s: %s", identifier, str(e))
            return f"Error retrieving text extract: {str(e)}"
  • The property_extraction tool uses the same get_document_text_extract_content helper as part of its workflow, demonstrating reuse of the text extract logic in another tool context.
    @mcp.tool(
        name="property_extraction",
    )
    async def property_extraction(identifier: str) -> Union[dict, ToolError]:
        """
        Use this tool for property extraction workflow when you need to extract property values from a document's content/text.
    
        This tool first determines the document's class, then fetches the class metadata to identify
        all available properties specific to that document class. It filters out system properties and
        hidden properties.
    
        It also retrieves a document's text extract content.
    
        :param identifier: The document id or path (required). This can be either the document's ID (GUID)
                          or its path in the repository (e.g., "/Folder1/document.pdf").
    
        :returns: A dictionary containing:
                 - text_extract: The text content of the document's text extract annotation.
                   If multiple text extracts are found, they will be concatenated.
                 - properties: A list of property metadata dictionaries with symbolicName, displayName,
                   descriptiveText, dataType, and cardinality fields.
        """
        text_extract = await get_document_text_extract_content(
            graphql_client=graphql_client, identifier=identifier
        )
    
        # First, get the class name of the document
        query = """
        query getDocument($object_store_name: String!, $identifier: String!){
            document(repositoryIdentifier: $object_store_name, identifier: $identifier){
                className
            }
        }
        """
    
        variables: dict[str, Any] = {
            "identifier": identifier,
            "object_store_name": graphql_client.object_store,
        }
    
        # Await the async variant, consistent with get_document_text_extract_content
        response = await graphql_client.execute_async(query=query, variables=variables)
    
        if "errors" in response:
            return response
    
        class_name = response["data"]["document"]["className"]
    
        properties = await get_class_specific_property_names(
            graphql_client=graphql_client,
            metadata_cache=metadata_cache,
            class_name=class_name,
        )
    
        return {"text_extract": text_extract, "properties": properties}
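
The filtering and concatenation steps shared by the helpers above can be sketched without a GraphQL client. This is a minimal illustration, not the server's code: the two constants are placeholders (their real values are defined in the server package and are not shown in these excerpts), `text_extract_urls` and `join_extracts` are hypothetical names, and the sample result is invented. Unlike the original, the sketch also tolerates a missing or null `contentElements` field.

```python
from typing import Any, Optional

# Placeholder values: the real constants live in the server package
# and are not shown in the excerpts above.
TEXT_EXTRACT_ANNOTATION_CLASS = "ExampleTextExtractClass"
TEXT_EXTRACT_SEPARATOR = "\n\n"


def text_extract_urls(result: Optional[dict[str, Any]]) -> list[str]:
    """Collect download URLs from text extract annotations in a query result."""
    # Each step falls back to an empty container, replacing the long
    # chain of "key in dict and dict[key]" checks.
    data = (result or {}).get("data") or {}
    document = data.get("document") or {}
    annotations = (document.get("annotations") or {}).get("annotations") or []

    urls = []
    for annotation in annotations:
        if annotation.get("className") != TEXT_EXTRACT_ANNOTATION_CLASS:
            continue
        # Index 0 is a valid content element, so compare against None.
        if annotation.get("annotatedContentElement") is None:
            continue
        for element in annotation.get("contentElements") or []:
            if element.get("downloadUrl"):
                urls.append(element["downloadUrl"])
    return urls


def join_extracts(texts: list[str]) -> str:
    """Join non-empty extracts with the separator, skipping empty ones."""
    return TEXT_EXTRACT_SEPARATOR.join(t for t in texts if t)


# Invented sample shaped like the query result used above.
sample = {
    "data": {
        "document": {
            "annotations": {
                "annotations": [
                    {
                        "className": TEXT_EXTRACT_ANNOTATION_CLASS,
                        "annotatedContentElement": 0,
                        "contentElements": [{"downloadUrl": "/content/1"}],
                    },
                    {
                        "className": "OtherAnnotation",
                        "annotatedContentElement": 0,
                        "contentElements": [{"downloadUrl": "/content/2"}],
                    },
                ]
            }
        }
    }
}

print(text_extract_urls(sample))
print(repr(join_extracts(["first extract", "", "second extract"])))
```

Separating URL collection from downloading keeps the traversal testable on plain dictionaries; the downloads themselves would still go through the client's `download_text_async`.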

Behavior: 3/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations, the description carries the full burden. It adds transparency by noting that multiple text extracts are concatenated and that an empty string is returned if none exist. However, it does not disclose other behavioral aspects like access rights or performance implications.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is concise and well structured, with separate :param and :returns fields for the parameter and return value. Every sentence provides useful information, with no redundancy or wasted words.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given that an output schema exists (not shown), the description appropriately omits detailed return format but still explains concatenation and empty string behavior. For a single-parameter tool, the description covers essential aspects adequately.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

The input schema has 0% description coverage, so the description adds crucial meaning by explaining that 'identifier' can be a document ID (GUID) or a path (e.g., '/Folder1/document.pdf'). It also notes the parameter is required, adding value beyond the schema.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 4/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the action ('Retrieves') and the resource ('document's text extract content'), making the purpose unambiguous. It is specific enough to distinguish from sibling tools like get_document_properties or get_document_versions, though it does not explicitly differentiate.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

No guidance is provided on when to use this tool versus alternatives. There is no mention of context, prerequisites, or exclusions, which would help an agent decide appropriately.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/ibm-ecm/cs-mcp-server'
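
The same request can be sketched in Python using only the standard library. The function names are hypothetical, the `Accept` header is an addition that the curl command does not send, and the sketch assumes the endpoint responds with a JSON object; none of this is confirmed by the page.

```python
import json
import urllib.request

SERVER_URL = "https://glama.ai/api/mcp/v1/servers/ibm-ecm/cs-mcp-server"


def build_request(url: str = SERVER_URL) -> urllib.request.Request:
    """Build the GET request without sending it."""
    return urllib.request.Request(
        url,
        method="GET",
        # Assumption: the API returns JSON; this header is not in the curl call.
        headers={"Accept": "application/json"},
    )


def fetch_server_info(url: str = SERVER_URL, timeout: float = 10.0) -> dict:
    """Send the request and decode the body, assuming a JSON response."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_server_info()` performs the network request; `build_request()` can be inspected on its own without any network access.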

If you have feedback or need assistance with the MCP directory API, please join our Discord server.