Glama

Clone a voice with Replicate

replicate_clone_voice

Synthesize speech using a cloned voice from a short reference audio sample. Provide a URL to the voice sample and the text to speak.

Instructions

Synthesize speech in a cloned voice. Provide a short reference audio sample (~5-30 s) and the text to speak; the model reproduces the voice characteristics.

DISPLAY REQUIREMENT — after this tool returns successfully, include the URL printed in the tool's text content as a markdown link [Audio](URL) so the user can play it. URLs expire in ~24h.
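The display requirement above can be sketched as a small helper. This is only an illustration: the exact layout of the tool's text content is not documented here, so the code assumes the text contains a plain http(s) URL somewhere, and `audio_markdown_link` is a hypothetical name.

```python
import re
from typing import Optional

# Assumption: the tool's text content includes a plain http(s) URL.
URL_PATTERN = re.compile(r"https?://[^\s)\]>\"']+")


def audio_markdown_link(tool_text: str) -> Optional[str]:
    """Return a markdown link for the first URL found in the tool's text content.

    Returns None when no URL is present, so the caller can fall back
    to showing the raw text instead.
    """
    match = URL_PATTERN.search(tool_text)
    if match is None:
        return None
    # Drop trailing sentence punctuation that is unlikely to be part of the URL.
    url = match.group(0).rstrip(".,;")
    return f"[Audio]({url})"
```

Because the URLs expire in roughly 24 hours, the link is best rendered right away rather than stored for later use.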

Args:

  • text (string, 1-5000): Text to synthesize in the cloned voice.

  • reference_audio_url (URL): URL of the voice sample to clone from. Use replicate_upload_file to upload a local file first.

  • language (string, optional): ISO-639 code (e.g. "en", "es", "it"). Default "en".

  • model (string, default "xtts-v2"): Curated key (xtts-v2, openvoice-v2) or "owner/name[:version]".

  • extra_input (object, optional): Model-specific extras.

  • download (boolean, default true).

  • timeout_ms: Default 300000.
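The two accepted forms of the model argument (a curated key, or "owner/name[:version]") can be told apart with a few lines of string handling. The sketch below is not the server's actual resolution logic; `parse_model` and `ModelRef` are hypothetical names, and the validation rules are assumptions drawn from the format shown above.

```python
from typing import NamedTuple, Optional

# Curated keys listed in the tool description.
CURATED_MODELS = {"xtts-v2", "openvoice-v2"}


class ModelRef(NamedTuple):
    curated_key: Optional[str]
    owner: Optional[str]
    name: Optional[str]
    version: Optional[str]


def parse_model(value: str = "xtts-v2") -> ModelRef:
    """Split a model argument into a curated key or owner/name[:version] parts."""
    if value in CURATED_MODELS:
        return ModelRef(value, None, None, None)
    # Separate the optional ":version" suffix, then the "owner/name" pair.
    ref, colon, version = value.partition(":")
    owner, slash, name = ref.partition("/")
    malformed = (
        not slash
        or not owner
        or not name
        or "/" in name
        or (colon and not version)
    )
    if malformed:
        raise ValueError(
            f"expected a curated key or 'owner/name[:version]', got {value!r}"
        )
    return ModelRef(None, owner, name, version or None)
```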

Returns: PredictionResult. local_paths contain WAV/MP3 files.

Examples:

  • text="Hello world, this is my cloned voice.", reference_audio_url="<url-to-your-voice-sample.wav>"

  • text="Buongiorno a tutti!", reference_audio_url="<url-to-your-voice-sample.wav>", language="it"
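As a rough illustration of how the arguments in these examples fit together, the sketch below assembles an argument dictionary with the documented defaults. `build_clone_voice_args` is a hypothetical helper, not part of the server. It assumes that "1-5000" refers to the character length of the text and that the reference URL uses http or https; neither is confirmed by the description.

```python
from typing import Any, Dict, Optional


def build_clone_voice_args(
    text: str,
    reference_audio_url: str,
    language: str = "en",
    model: str = "xtts-v2",
    extra_input: Optional[Dict[str, Any]] = None,
    download: bool = True,
    timeout_ms: int = 300_000,
) -> Dict[str, Any]:
    """Assemble arguments for replicate_clone_voice using the documented defaults."""
    # Assumption: the 1-5000 constraint is a character count.
    if not 1 <= len(text) <= 5000:
        raise ValueError("text must be between 1 and 5000 characters")
    # Assumption: only http(s) URLs are accepted for the voice sample.
    if not reference_audio_url.startswith(("http://", "https://")):
        raise ValueError("reference_audio_url must be an http(s) URL")
    args: Dict[str, Any] = {
        "text": text,
        "reference_audio_url": reference_audio_url,
        "language": language,
        "model": model,
        "download": download,
        "timeout_ms": timeout_ms,
    }
    # Only include model-specific extras when the caller supplies some.
    if extra_input:
        args["extra_input"] = dict(extra_input)
    return args
```

For the Italian example, a call such as `build_clone_voice_args("Buongiorno a tutti!", "https://example.com/voice.wav", language="it")` would yield a dictionary ready to pass to whichever MCP client is in use; the example.com URL is a stand-in.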

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| text | Yes | Text to synthesize in the cloned voice. | |
| model | No | Voice cloning model. Curated: xtts-v2, openvoice-v2. Or "owner/name". | xtts-v2 |
| download | No | | |
| language | No | ISO-639 language code (e.g. 'en', 'es', 'it'). Default: 'en'. | |
| timeout_ms | No | Max ms to wait for the prediction. If exceeded, returns the prediction ID so you can poll via replicate_get_prediction. Default: 300000 (5min). | |
| extra_input | No | Additional model-specific inputs. | |
| reference_audio_url | Yes | URL of a short voice sample (~5-30s) to clone. Use replicate_upload_file if you only have a local file. | |
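The timeout behavior described for timeout_ms (a prediction ID is returned so the caller can poll via replicate_get_prediction) suggests a loop along the following lines. This is a sketch under several assumptions: `call_tool` stands for whatever function your MCP client uses to invoke a tool, and the result fields `status` and `prediction_id`, as well as the `prediction_id` argument name, are hypothetical because the page documents no output schema. The terminal status names follow Replicate's prediction lifecycle.

```python
import time
from typing import Any, Callable, Dict

# Hypothetical client hook: (tool_name, arguments) -> result dictionary.
ToolCaller = Callable[[str, Dict[str, Any]], Dict[str, Any]]

# Terminal states in Replicate's prediction lifecycle.
TERMINAL_STATUSES = {"succeeded", "failed", "canceled"}


def clone_voice_with_polling(
    call_tool: ToolCaller,
    args: Dict[str, Any],
    poll_interval_s: float = 5.0,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call replicate_clone_voice, then poll until the prediction finishes."""
    result = call_tool("replicate_clone_voice", args)
    for _ in range(max_polls):
        if result.get("status") in TERMINAL_STATUSES:
            break
        # Assumed field name; stop if there is nothing to poll with.
        prediction_id = result.get("prediction_id")
        if prediction_id is None:
            break
        sleep(poll_interval_s)
        result = call_tool(
            "replicate_get_prediction", {"prediction_id": prediction_id}
        )
    return result
```

Passing `sleep` as a parameter keeps the loop easy to exercise in tests without real delays.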
Behavior: 5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

The description adds substantial value beyond annotations: it discloses URL expiry (~24h), timeout polling behavior, and a display requirement. It is consistent with the annotations (readOnlyHint=false, destructiveHint=false) and does not contradict them.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is well-structured with a summary, display requirement, parameter list, returns note, and examples. It is appropriately concise without missing essential details.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

For a tool with 7 parameters, nested objects, and no output schema, the description covers key aspects: input constraints, timeout handling, and return format. It could mention how to extract the URL from results or handle multiple files, but overall it is sufficiently complete.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Despite high schema coverage (86%), the description adds meaning: it clarifies the role of reference_audio_url, lists default values, and explains timeout behavior. The Args section provides context not captured in the schema alone.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description starts with 'Synthesize speech in a cloned voice,' clearly stating the tool's core function. It distinguishes itself from sibling tools like replicate_generate_speech by focusing on voice cloning from a reference sample.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides clear usage context: it specifies the reference audio length (~5-30 s) and directs users to replicate_upload_file for local files. However, it does not explicitly contrast this tool with alternatives like replicate_generate_speech.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/sena-labs/replicate-mcp-server'
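The same request can be made from Python with only the standard library. The sketch below mirrors the curl command above; it assumes the endpoint returns JSON, and the helper names `build_server_request` and `fetch_server_info` are hypothetical.

```python
import json
import urllib.request
from typing import Any

API_BASE = "https://glama.ai/api/mcp/v1/servers"


def build_server_request(owner: str, name: str) -> urllib.request.Request:
    """Build a GET request for one server entry in the MCP directory API."""
    url = f"{API_BASE}/{owner}/{name}"
    return urllib.request.Request(
        url, method="GET", headers={"Accept": "application/json"}
    )


def fetch_server_info(owner: str, name: str, timeout_s: float = 10.0) -> Any:
    """Fetch and decode the server entry (assumes a JSON response body)."""
    request = build_server_request(owner, name)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_server_info("sena-labs", "replicate-mcp-server")` would then request the same URL as the curl example.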

If you have feedback or need assistance with the MCP directory API, please join our Discord server.