fish_audio_tts
Convert text to speech with voice selection and streaming options using Fish Audio's TTS API. Supports various audio formats and real-time playback.
Instructions
Generate speech from text using Fish Audio TTS API
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| text | Yes | Text to convert to speech | |
| reference_id | No | Voice model reference ID | |
| reference_name | No | Voice model name to search for | |
| reference_tag | No | Voice model tag to search for | |
| streaming | No | Enable HTTP streaming mode | |
| websocket_streaming | No | Enable WebSocket streaming mode | |
| realtime_play | No | Enable real-time audio playback during streaming | |
| format | No | Output audio format | mp3 |
| mp3_bitrate | No | MP3 bitrate in kbps. Only applies when format=mp3. | |
| opus_bitrate | No | Opus bitrate in bps; -1000 = auto. Only applies when format=opus. | |
| sample_rate | No | Audio sample rate in Hz. Defaults to format-native rate when omitted. | |
| normalize | No | Enable text normalization | |
| latency | No | Latency mode: low=lowest latency, balanced=reduced latency, normal=best quality | balanced |
| output_path | No | Custom output file path | |
| auto_play | No | Automatically play the generated audio | |
| speed | No | Speaking rate multiplier (0.5=half speed, 1.0=normal, 2.0=double speed) | |
| volume | No | Volume adjustment in dB (0=no change, positive=louder, negative=quieter) | |
| normalize_loudness | No | Normalize output loudness for consistent perceived volume (s2-pro only) | |
| temperature | No | Expressiveness/emotion control (0=consistent and calm, 1=varied and emotional) | |
| top_p | No | Nucleus sampling diversity (0..1) | |
| chunk_length | No | Target text segment size for processing (100-300) | |
| max_new_tokens | No | Maximum audio tokens to generate per text chunk | |
| repetition_penalty | No | Penalty for repeating audio patterns; values >1.0 reduce repetition | |
| min_chunk_length | No | Minimum characters before splitting into a new chunk (0-100) | |
| condition_on_previous_chunks | No | Use previous audio as context for voice consistency across chunks | |
| early_stop_threshold | No | Early stopping threshold for batch processing (0..1) | |
| speakers | No | Multi-speaker mode (s2-pro only). Ordered list of speaker identifiers. Each entry is resolved against FISH_REFERENCES by id, then name, then tag (or treated as a raw reference_id if no references are configured). The order maps to speaker tags `<\|speaker:0\|>`, `<\|speaker:1\|>`, ... in `text`. Provide at least 2 entries to engage multi-speaker; a single entry is equivalent to `reference_id`. | |
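
The parameters above can be assembled into an arguments object before calling the tool. The following is a minimal sketch, not the server's own code: `build_arguments` and `tag_dialogue` are hypothetical helper names, the speaker identifiers `alice` and `bob` are placeholders, and joining tagged turns without a separator is an assumption. Only ranges stated explicitly in the table are checked.

```python
import json

# Ranges copied from the Input Schema table above.
RANGES = {
    "top_p": (0.0, 1.0),
    "chunk_length": (100, 300),
    "min_chunk_length": (0, 100),
    "early_stop_threshold": (0.0, 1.0),
}
LATENCY_MODES = {"low", "balanced", "normal"}


def build_arguments(text, **options):
    """Hypothetical helper: return an arguments dict for fish_audio_tts."""
    if not text:
        raise ValueError("text is required")
    for name, (lo, hi) in RANGES.items():
        if name in options and not lo <= options[name] <= hi:
            raise ValueError(f"{name} must be within {lo}..{hi}")
    # The table lists "balanced" as the default latency mode.
    if options.get("latency", "balanced") not in LATENCY_MODES:
        raise ValueError("latency must be low, balanced, or normal")
    return {"text": text, **options}


def tag_dialogue(turns):
    """Hypothetical helper: prefix each (speaker_index, line) turn with its tag.

    Assumption: turns are concatenated with no separator between them.
    """
    return "".join(f"<|speaker:{index}|>{line}" for index, line in turns)


# Single-voice request with an explicit format and latency mode.
single = build_arguments("Hello from Fish Audio.", format="mp3", latency="low")

# Multi-speaker request: the order of `speakers` maps to speaker tags 0, 1, ...
multi = build_arguments(
    tag_dialogue([(0, "Hi there."), (1, "Hello!")]),
    speakers=["alice", "bob"],  # placeholder identifiers
)
print(json.dumps(multi, indent=2))
```

Validating on the client side is optional; it only surfaces out-of-range values before the request is sent.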