All Voice Lab MCP Server

Official
by allvoicelab

text_to_speech

Convert text to speech audio files using specified voices and models, saving results to your chosen directory for accessibility or content creation.

Instructions

[AllVoiceLab Tool] Generate speech from provided text.

This tool converts text to speech using the specified voice and model. The generated audio file is saved to the specified directory.

Args:
    text: Target text for speech synthesis. Maximum 5,000 characters.
    voice_id: Voice ID to use for synthesis. Required. Must be a valid voice ID from the available voices (use get_voices tool to retrieve).
    model_id: Model ID to use for synthesis. Required. Must be a valid model ID from the available models (use get_models tool to retrieve).
    speed: Speech rate adjustment, range [0.5, 1.5], where 0.5 is slowest and 1.5 is fastest. Default value is 1.
    output_dir: Output directory for the generated audio file. Default is user's desktop.
    
Returns:
    TextContent containing file path to the generated audio file.
    
Limitations:
    - Text must not exceed 5,000 characters
    - Both voice_id and model_id must be valid and provided
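The argument rules above can be checked on the caller's side before the tool is invoked. The following is a minimal sketch, not part of the AllVoiceLab server: the function name `check_tts_args` and the constants are hypothetical, the error strings mirror those returned by the handler shown in the Implementation Reference, and the speed range check is an added assumption based on the documented range (the handler itself does not appear to enforce it).

```python
from typing import Optional

MAX_TEXT_CHARS = 5000        # limit stated in the tool description
SPEED_RANGE = (0.5, 1.5)     # documented range for `speed`


def check_tts_args(text: str, voice_id: str, model_id: str,
                   speed: float = 1.0) -> Optional[str]:
    """Return an error message, or None when the arguments look valid.

    Hypothetical helper: it only mirrors the documented constraints and
    does not contact the AllVoiceLab API.
    """
    if not text:
        return "text parameter cannot be empty"
    if len(text) > MAX_TEXT_CHARS:
        return "text parameter cannot exceed 5,000 characters"
    if not voice_id:
        return "voice_id parameter cannot be empty"
    if not voice_id.isdigit():
        return "voice_id parameter must be a numeric value"
    if not model_id:
        return "model_id parameter cannot be empty"
    low, high = SPEED_RANGE
    if not low <= speed <= high:
        # Assumption: local check derived from the documented range.
        return f"speed must be between {low} and {high}"
    return None
```

A check like this cannot confirm that `voice_id` or `model_id` actually exist; that still requires the get_voices and get_models tools.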

Input Schema

Name        Required  Description  Default
text        Yes
voice_id    Yes
model_id    Yes
speed       No
output_dir  No
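As an illustration of how these five parameters might be passed, the sketch below builds an MCP `tools/call` request as a JSON-RPC 2.0 message. The helper name `build_tool_call` and the example values (`"123"`, `"example-model"`) are placeholders, not real AllVoiceLab IDs; real IDs come from the get_voices and get_models tools.

```python
import json
from typing import Any, Dict, Optional


def build_tool_call(request_id: int, text: str, voice_id: str, model_id: str,
                    speed: float = 1.0,
                    output_dir: Optional[str] = None) -> Dict[str, Any]:
    """Build a JSON-RPC 2.0 `tools/call` message for text_to_speech."""
    arguments: Dict[str, Any] = {
        "text": text,
        "voice_id": voice_id,   # required, numeric string
        "model_id": model_id,   # required
        "speed": speed,         # optional, documented range [0.5, 1.5]
    }
    # output_dir is optional; omit it to let the server pick its default.
    if output_dir is not None:
        arguments["output_dir"] = output_dir
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "text_to_speech", "arguments": arguments},
    }


if __name__ == "__main__":
    # Placeholder IDs for illustration only.
    print(json.dumps(build_tool_call(1, "Hello", "123", "example-model"), indent=2))
```

In practice an MCP client library usually builds this envelope itself, so only the `arguments` mapping needs to be supplied.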

Implementation Reference

  • Main handler function for the text_to_speech MCP tool. Performs input validation (text length, voice_id, model_id), validates model against available models, calls the AllVoiceLab client to generate speech, and returns the file path or error.
    def text_to_speech(
        text: str,
        voice_id: str,
        model_id: str,
        speed: float = 1.0,
        output_dir: str = None
    ) -> TextContent:
        """
        Convert text to speech
        
        Args:
            text: Target text for speech synthesis. Maximum 5,000 characters.
            voice_id: Voice ID to use for synthesis. Required. Must be a valid voice ID from the available voices (use get_voices tool to retrieve).
            model_id: Model ID to use for synthesis. Required. Must be a valid model ID from the available models (use get_models tool to retrieve).
            speed: Speech rate adjustment, range [0.5, 1.5], where 0.5 is slowest and 1.5 is fastest. Default value is 1.
            output_dir: Output directory for the generated audio file. Default is user's desktop.
            
        Returns:
            TextContent: Contains the file path to the generated audio file.
        """
        all_voice_lab = get_client()
        output_dir = all_voice_lab.get_output_path(output_dir)
        logging.info(f"Tool called: text_to_speech, voice_id: {voice_id}, model_id: {model_id}, speed: {speed}")
        logging.info(f"Output directory: {output_dir}")
    
        # Validate parameters
        if not text:
            logging.warning("Text parameter is empty")
            return TextContent(
                type="text",
                text="text parameter cannot be empty"
            )
        if len(text) > 5000:
            logging.warning(f"Text parameter exceeds maximum length: {len(text)} characters")
            return TextContent(
                type="text",
                text="text parameter cannot exceed 5,000 characters"
            )
        if not voice_id:
            logging.warning("voice_id parameter is empty")
            return TextContent(
                type="text",
                text="voice_id parameter cannot be empty"
            )
        # Validate voice_id is numeric
        if not voice_id.isdigit():
            logging.warning(f"Invalid voice_id format: {voice_id}, not a numeric value")
            return TextContent(
                type="text",
                text="voice_id parameter must be a numeric value"
            )
        if not model_id:
            logging.warning("model_id parameter is empty")
            return TextContent(
                type="text",
                text="model_id parameter cannot be empty"
            )
    
        
    
        # Validate model_id against available models
        try:
            logging.info(f"Validating model_id: {model_id}")
            model_resp = all_voice_lab.get_supported_voice_model()
            available_models = model_resp.models
            valid_model_ids = [model.model_id for model in available_models]
    
            if model_id not in valid_model_ids:
                logging.warning(f"Invalid model_id: {model_id}, available models: {valid_model_ids}")
                return TextContent(
                    type="text",
                    text=f"Invalid model_id: {model_id}. Please use a valid model ID."
                )
            logging.info(f"Model ID validation successful: {model_id}")
        except Exception as e:
            logging.error(f"Failed to validate model_id: {str(e)}")
            # Continue with the process even if validation fails
            # to maintain backward compatibility
    
        try:
            logging.info(f"Starting text-to-speech processing, text length: {len(text)} characters")
            file_path = all_voice_lab.text_to_speech(text, voice_id, model_id, output_dir, speed)
            logging.info(f"Text-to-speech successful, file saved at: {file_path}")
            return TextContent(
                type="text",
                text=f"Speech generation completed, file saved at: {file_path}\n"
            )
        except Exception as e:
            logging.error(f"Text-to-speech failed: {str(e)}")
            return TextContent(
                type="text",
                text="Synthesis failed, tool temporarily unavailable"
            )
  • MCP tool registration for text_to_speech, including name, detailed description with input schema and limitations, bound to the handler function.
    mcp.tool(
        name="text_to_speech",
        description="""[AllVoiceLab Tool] Generate speech from provided text.
        
        This tool converts text to speech using the specified voice and model. The generated audio file is saved to the specified directory.
        
        Args:
            text: Target text for speech synthesis. Maximum 5,000 characters.
            voice_id: Voice ID to use for synthesis. Required. Must be a valid voice ID from the available voices (use get_voices tool to retrieve).
            model_id: Model ID to use for synthesis. Required. Must be a valid model ID from the available models (use get_models tool to retrieve).
            speed: Speech rate adjustment, range [0.5, 1.5], where 0.5 is slowest and 1.5 is fastest. Default value is 1.
            output_dir: Output directory for the generated audio file. Default is user's desktop.
            
        Returns:
            TextContent containing file path to the generated audio file.
            
        Limitations:
            - Text must not exceed 5,000 characters
            - Both voice_id and model_id must be valid and provided
        """
    )(text_to_speech)
  • Helper function in AllVoiceLab client that performs the actual HTTP POST request to the text-to-speech API endpoint, saves the audio file locally, and returns the file path. Called by the MCP handler.
    def text_to_speech(self, text: str, voice_id, model_id: str, output_dir: str, speed: float = 1.0) -> str:
        """
        Call API to convert text to speech and save as file
    
        Args:
            text: Text to convert
            voice_id: Voice ID
            model_id: Model ID
            output_dir: Output directory
            speed: Speech speed
    
        Returns:
            Saved audio file path
        """
        # Build request body
        request_body = {
            "text": text,
            "language_code": "auto",
            "voice_id": int(voice_id),
            "model_id": model_id,
            "voice_settings": {
                "speed": float(speed)
            }
        }
    
        # API endpoint
        url = f"{self.api_domain}/v1/text-to-speech/create"
    
        # Send request and get response
        response = requests.post(
            url=url,
            json=request_body,
            headers=self._get_headers(),
            stream=True  # Use streaming for large files
        )
        logging.info(f"text to speech response: {response.headers}")
        # Check response status
        response.raise_for_status()
    
        # Try to get filename from response headers
        filename = None
        content_disposition = response.headers.get('Content-Disposition')
        if content_disposition:
            filename_match = re.search(r'filename="?([^"]+)"?', content_disposition)
            if filename_match:
                filename = filename_match.group(1)
    
        # If filename not obtained from response headers, generate a unique filename
        if not filename:
            timestamp = int(time.time())
            random_suffix = ''.join(random.choices('abcdefghijklmnopqrstuvwxyz0123456789', k=6))
            filename = f"tts_{timestamp}_{random_suffix}.mp3"
    
        # Build complete file path
        output_dir = Path(output_dir)
        output_dir.mkdir(parents=True, exist_ok=True)
        file_path = output_dir / filename
    
        # Save response content to file
        with open(file_path, 'wb') as f:
            for chunk in response.iter_content(chunk_size=8192):
                if chunk:
                    f.write(chunk)
    
        # Return file path
        return str(file_path)
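The filename logic in the client helper above can be isolated for testing. This is a small sketch under the assumption that the logic is lifted into a standalone function; the name `pick_filename` is hypothetical, while the regular expression and the fallback naming scheme are the ones shown in the client code.

```python
import random
import re
import string
import time
from typing import Optional


def pick_filename(content_disposition: Optional[str]) -> str:
    """Choose an output filename for the downloaded audio.

    Prefers the name in a Content-Disposition header value; otherwise
    generates `tts_<unix time>_<6 random chars>.mp3`.
    """
    if content_disposition:
        # Same pattern as the client: optional quotes around the name.
        match = re.search(r'filename="?([^"]+)"?', content_disposition)
        if match:
            return match.group(1)
    timestamp = int(time.time())
    suffix = "".join(random.choices(string.ascii_lowercase + string.digits, k=6))
    return f"tts_{timestamp}_{suffix}.mp3"
```

One behavior worth knowing: when the filename is unquoted and followed by further header parameters, this pattern captures everything up to the end of the header value, so a stricter pattern may be preferable if the API ever sends such headers.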
Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full burden and does well by disclosing key behaviors: it generates and saves an audio file, specifies default values and ranges (speed, output_dir), mentions character limits, and references required validation tools. It doesn't cover rate limits or authentication needs, but provides substantial operational context.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is well-structured with clear sections (Args, Returns, Limitations), front-loaded with the core purpose, and every sentence adds value. No redundant information—each part serves to clarify usage, parameters, or constraints efficiently.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (5 parameters, no annotations, no output schema), the description is complete: it explains the tool's purpose, how to use it with sibling references, all parameter semantics, return value (file path), and limitations. This provides everything needed for an agent to invoke it correctly without structured fields.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 5/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 0%, so the description must compensate fully. It adds significant meaning beyond the schema: explains each parameter's purpose, provides constraints (max 5,000 characters, valid IDs from specific tools, speed range), default values, and output directory behavior. This comprehensively documents all 5 parameters.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose with specific verbs ('generate speech from provided text', 'converts text to speech') and identifies the resource (audio file). It distinguishes itself from siblings like speech_to_speech or subtitle_extraction by focusing on text-to-speech synthesis.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides clear context for usage by referencing sibling tools (get_voices, get_models) to obtain valid IDs, and mentions limitations that guide when to use it. However, it doesn't explicitly state when NOT to use this tool versus alternatives like speech_to_speech or when text_translation might be needed first.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/allvoicelab/AllVoiceLab-MCP'
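The same request can be issued from Python with the standard library. This is a minimal sketch: the helper names are hypothetical, and it assumes the endpoint returns a JSON body, which the page does not state explicitly.

```python
import json
import urllib.request

API_URL = "https://glama.ai/api/mcp/v1/servers/allvoicelab/AllVoiceLab-MCP"


def build_request(url: str = API_URL) -> urllib.request.Request:
    """Build a GET request equivalent to the curl command above."""
    return urllib.request.Request(
        url, method="GET", headers={"Accept": "application/json"}
    )


def fetch_server_info(url: str = API_URL, timeout: float = 10.0):
    """Send the request and decode the body.

    Assumption: the response body is JSON; the schema is not documented here.
    """
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return json.load(response)
```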

If you have feedback or need assistance with the MCP directory API, please join our Discord server.