get_document_text_extract
Retrieve the text extract content of a document using its ID or repository path. Returns the concatenated text, or an empty string if no text extract is found.
Instructions
Retrieves a document's text extract content.
:param identifier: The document id or path (required). This can be either the document's ID (GUID) or its path in the repository (e.g., "/Folder1/document.pdf").
:returns: The text content of the document's text extract annotation. If multiple text extracts are found, they will be concatenated. Returns an empty string if no text extract is found.
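As a rough illustration of the return contract described above, the sketch below joins several extracts with a separator and yields an empty string when there are none. The function name `join_text_extracts` and the separator value are assumptions made for this example only; the server uses its own TEXT_EXTRACT_SEPARATOR constant, whose value is not shown in this document.

```python
from typing import Iterable

# Hypothetical separator for illustration only; the server's actual
# TEXT_EXTRACT_SEPARATOR value is not shown in this document.
EXAMPLE_SEPARATOR = "\n\n"


def join_text_extracts(parts: Iterable[str], separator: str = EXAMPLE_SEPARATOR) -> str:
    """Join non-empty extracts with a separator; return "" when there are none."""
    combined = ""
    for part in parts:
        if not part:
            continue  # skip empty or missing extracts
        if combined:
            combined += separator  # separator goes only between extracts
        combined += part
    return combined


if __name__ == "__main__":
    print(repr(join_text_extracts([])))
    print(repr(join_text_extracts(["page one", "", "page two"])))
```

A caller can therefore treat an empty string as "no text extract available" rather than as an error.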
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| identifier | Yes | The document ID (GUID) or its path in the repository (e.g., "/Folder1/document.pdf"). | |
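For reference, an MCP client invokes a tool with a JSON-RPC `tools/call` request. The sketch below builds such a payload for this tool using only the standard library. The helper name `build_tool_call` and the request id are illustrative assumptions; in practice an MCP client SDK would normally construct and send this message.

```python
import json
from typing import Any


def build_tool_call(identifier: str, request_id: int = 1) -> dict[str, Any]:
    """Build a JSON-RPC 2.0 'tools/call' request for get_document_text_extract.

    `identifier` may be a document GUID or a repository path.
    """
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "get_document_text_extract",
            "arguments": {"identifier": identifier},
        },
    }


if __name__ == "__main__":
    # The path below is the example path from the parameter description.
    print(json.dumps(build_tool_call("/Folder1/document.pdf"), indent=2))
```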
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
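| result | Yes | The concatenated text content of the document's text extract annotations; an empty string if no text extract is found. | |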
Implementation Reference
- src/cs_mcp_server/tools/documents.py:108-124 (registration) Registers the 'get_document_text_extract' tool via the @mcp.tool decorator. The function get_document_text_extract(identifier) delegates to the helper function get_document_text_extract_content in cs_mcp_server/utils/utils.py.

      @mcp.tool(
          name="get_document_text_extract",
      )
      async def get_document_text_extract(identifier: str) -> str:
          """
          Retrieves a document's text extract content.

          :param identifier: The document id or path (required). This can be either the document's ID (GUID)
                            or its path in the repository (e.g., "/Folder1/document.pdf").

          :returns: The text content of the document's text extract annotation.
                   If multiple text extracts are found, they will be concatenated.
                   Returns an empty string if no text extract is found.
          """
          return await get_document_text_extract_content(
              graphql_client=graphql_client, identifier=identifier
          )

- src/cs_mcp_server/utils/utils.py:92-178 (handler) The actual implementation, the helper function 'get_document_text_extract_content'. It executes a GraphQL query to fetch the document's annotations, filters for text extract annotations (by className), downloads the text content from each content element's downloadUrl, and concatenates the results with a separator.

      async def get_document_text_extract_content(
          graphql_client: GraphQLClient, identifier: str
      ) -> str:
          """
          Retrieves a document's text extract content.

          This utility function queries the document's annotations, filters for text extract
          annotations, and downloads the text content from each annotation's content elements.

          :param graphql_client: GraphQL client instance
          :param identifier: The document id or path (GUID or repository path)

          :returns: The concatenated text content from all text extract annotations.
                   Returns empty string if no text extract is found.
          """
          query = """
          query getDocumentTextExtract($object_store_name: String!, $identifier: String!) {
              document(repositoryIdentifier: $object_store_name, identifier: $identifier) {
                  annotations{
                      annotations{
                          id
                          name
                          className
                          annotatedContentElement
                          descriptiveText
                          contentElements{
                              ... on ContentTransfer{
                                  downloadUrl
                                  retrievalName
                                  contentSize
                              }
                          }
                      }
                  }
              }
          }
          """

          variables = {
              "identifier": identifier,
              "object_store_name": graphql_client.object_store,
          }

          # Execute query
          result = await graphql_client.execute_async(query=query, variables=variables)

          # Initialize empty string for text content
          all_text_content = ""

          # Check if we have valid result with annotations
          if (
              result
              and "data" in result
              and result["data"]
              and "document" in result["data"]
              and result["data"]["document"]
              and "annotations" in result["data"]["document"]
              and result["data"]["document"]["annotations"]
              and "annotations" in result["data"]["document"]["annotations"]
          ):
              annotations = result["data"]["document"]["annotations"]["annotations"]

              # Process each annotation
              for annotation in annotations:
                  if (
                      "contentElements" in annotation
                      and annotation["className"] == TEXT_EXTRACT_ANNOTATION_CLASS
                      and annotation["annotatedContentElement"] is not None
                  ):
                      # Process each content element
                      for content_element in annotation["contentElements"]:
                          if (
                              "downloadUrl" in content_element
                              and content_element["downloadUrl"]
                          ):
                              # Download the text content
                              download_url = content_element["downloadUrl"]
                              text_content = await graphql_client.download_text_async(
                                  download_url
                              )

                              # Append text content with separator
                              if text_content:
                                  if all_text_content:
                                      all_text_content += TEXT_EXTRACT_SEPARATOR
                                  all_text_content += text_content

          return all_text_content

- The resources module defines a separate helper function, '_fetch_text_extract_by_identifier', that performs nearly identical logic for fetching text extract content (used as a resource, not a tool). Unlike the handler above, it logs GraphQL errors and, if an exception is raised, returns an error message string instead of propagating the exception.

      async def _fetch_text_extract_by_identifier(
          graphql_client: GraphQLClient, identifier: str
      ) -> str:
          """Fetch text extract content for a document by ID or path."""
          query = """
          query getDocumentTextExtract($object_store_name: String!, $identifier: String!) {
              document(repositoryIdentifier: $object_store_name, identifier: $identifier) {
                  annotations{
                      annotations{
                          id
                          name
                          className
                          annotatedContentElement
                          descriptiveText
                          contentElements{
                              ... on ContentTransfer{
                                  downloadUrl
                                  retrievalName
                                  contentSize
                              }
                          }
                      }
                  }
              }
          }
          """

          variables = {
              "identifier": identifier,
              "object_store_name": graphql_client.object_store,
          }

          try:
              result = await graphql_client.execute_async(query=query, variables=variables)

              if "errors" in result:
                  logger.error("GraphQL errors in text extract query: %s", result["errors"])

              all_text_content = ""

              if (
                  result
                  and "data" in result
                  and result["data"]
                  and "document" in result["data"]
                  and result["data"]["document"]
                  and "annotations" in result["data"]["document"]
                  and result["data"]["document"]["annotations"]
                  and "annotations" in result["data"]["document"]["annotations"]
              ):
                  annotations = result["data"]["document"]["annotations"]["annotations"]

                  for annotation in annotations:
                      if (
                          "contentElements" in annotation
                          and annotation["className"] == TEXT_EXTRACT_ANNOTATION_CLASS
                          and annotation["annotatedContentElement"] is not None
                      ):
                          for content_element in annotation["contentElements"]:
                              if (
                                  "downloadUrl" in content_element
                                  and content_element["downloadUrl"]
                              ):
                                  download_url = content_element["downloadUrl"]
                                  text_content = await graphql_client.download_text_async(
                                      download_url
                                  )

                                  if text_content:
                                      if all_text_content:
                                          all_text_content += TEXT_EXTRACT_SEPARATOR
                                      all_text_content += text_content

              return all_text_content

          except Exception as e:
              logger.error("Error fetching text extract for %s: %s", identifier, str(e))
              return f"Error retrieving text extract: {str(e)}"

- The property_extraction tool uses the same get_document_text_extract_content helper as part of its workflow, demonstrating reuse of the text extract logic in another tool context.

      @mcp.tool(
          name="property_extraction",
      )
      async def property_extraction(identifier: str) -> Union[dict, ToolError]:
          """
          Use this tool for property extraction workflow when you need to extract
          property values from a document's content/text.

          This tool first determines the document's class, then fetches the class metadata
          to identify all available properties specific to that document class.
          It filters out system properties and hidden properties.
          It also retrieves a document's text extract content.

          :param identifier: The document id or path (required). This can be either the document's ID (GUID)
                            or its path in the repository (e.g., "/Folder1/document.pdf").

          :returns: A dictionary containing:
                   - text_extract: The text content of the document's text extract annotation.
                     If multiple text extracts are found, they will be concatenated.
                   - properties: A list of property metadata dictionaries with symbolicName,
                     displayName, descriptiveText, dataType, and cardinality fields.
          """
          text_extract = await get_document_text_extract_content(
              graphql_client=graphql_client, identifier=identifier
          )

          # First, get the class name of the document
          query = """
          query getDocument($object_store_name: String!, $identifier: String!){
              document(repositoryIdentifier: $object_store_name, identifier: $identifier){
                  className
              }
          }
          """
          variables: dict[str, Any] = {
              "identifier": identifier,
              "object_store_name": graphql_client.object_store,
          }

          response = graphql_client.execute(query=query, variables=variables)
          if "errors" in response:
              return response

          class_name = response["data"]["document"]["className"]

          properties = await get_class_specific_property_names(
              graphql_client=graphql_client,
              metadata_cache=metadata_cache,
              class_name=class_name,
          )

          return {"text_extract": text_extract, "properties": properties}
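The handler and the resource helper listed above both walk the same nested GraphQL response before downloading anything. One possible way to express that walk as a small pure function is sketched below. This is not code from the repository: the name `collect_text_extract_urls` is hypothetical, the annotation class is passed in as a parameter because the value of TEXT_EXTRACT_ANNOTATION_CLASS is not shown here, and the use of `.get()` makes the sketch more tolerant of missing keys than the original chained checks.

```python
from typing import Any, Optional


def collect_text_extract_urls(
    result: Optional[dict[str, Any]], annotation_class: str
) -> list[str]:
    """Return download URLs of text extract content elements, in response order."""
    # Each level falls back to an empty container when absent or null.
    data = (result or {}).get("data") or {}
    document = data.get("document") or {}
    annotation_set = document.get("annotations") or {}
    annotations = annotation_set.get("annotations") or []

    urls: list[str] = []
    for annotation in annotations:
        # Keep only annotations of the requested class...
        if annotation.get("className") != annotation_class:
            continue
        # ...that are attached to a content element of the document.
        if annotation.get("annotatedContentElement") is None:
            continue
        for element in annotation.get("contentElements") or []:
            url = element.get("downloadUrl")
            if url:  # skip elements without a usable download URL
                urls.append(url)
    return urls
```

With a function like this, each caller would only need to download the returned URLs and join the results, which keeps the filtering rules in one place and easy to test without a live GraphQL endpoint.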