Glama

speak

Convert a klattsch phoneme string into synthesized speech audio. Returns base64-encoded WAV using retro-style formant synthesis.

Instructions

Synthesize speech from a klattsch phoneme string. Returns base64 WAV audio.

What This Is

klattsch is a formant speech synthesizer (late-70s/early-80s style — think Votrax, SAM). You give it a string of ARPAbet phoneme codes with optional voice control directives, and it renders a WAV audio file.

How To Use This Tool

Step 1: Build a phoneme string

Write ARPAbet phonemes separated by spaces, with control directives mixed in. Use the text_to_phonemes tool first to convert English text, then refine by hand.

Step 2 (optional): Set voice character

Prefix your utterance with control directives to set the voice:

  • bN: base pitch in Hz (b120 = default male, b200 = female, b280 = child)

  • rN: per-phoneme duration in ms (r80 = fast, r110 = normal, r250+ = sung)

  • sN: formant scale (1.0 = male, 1.17 = female, 1.3 = child)

  • vN: vibrato depth in Hz (v3-v6 = expressive, v0 = off)

  • hN: breathiness 0..1 (h0.3 = airy/whispery)

  • gN: vocal effort 0=lax..1=tense (default 0.5)

  • tN: spectral tilt -0.9=darker..+0.9=brighter (t-0.4 = warm, t0.3 = bright)
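As a rough illustration of how the directives above combine into a prefix, here is a minimal Python sketch. The helper name `voice_prefix` and its keyword arguments are hypothetical and not part of klattsch; only the directive letters and the ranges stated in the list (h and g in 0..1, t in -0.9..+0.9) come from the description.

```python
def voice_prefix(pitch_hz=None, rate_ms=None, scale=None, vibrato_hz=None,
                 breath=None, effort=None, tilt=None):
    """Build a directive prefix such as 'b120 r100 s1.0' (hypothetical helper)."""
    # Only the ranges stated in the directive list are checked here.
    if breath is not None and not 0.0 <= breath <= 1.0:
        raise ValueError("breathiness (h) must be within 0..1")
    if effort is not None and not 0.0 <= effort <= 1.0:
        raise ValueError("vocal effort (g) must be within 0..1")
    if tilt is not None and not -0.9 <= tilt <= 0.9:
        raise ValueError("spectral tilt (t) must be within -0.9..0.9")

    parts = []
    # Directive order follows the list above; unset values are skipped.
    for letter, value in (("b", pitch_hz), ("r", rate_ms), ("s", scale),
                          ("v", vibrato_hz), ("h", breath), ("g", effort),
                          ("t", tilt)):
        if value is not None:
            parts.append(f"{letter}{value}")
    return " ".join(parts)


print(voice_prefix(pitch_hz=200, rate_ms=100, scale=1.17, vibrato_hz=2))
# prints: b200 r100 s1.17 v2
```

Whether klattsch itself rejects out-of-range values is not stated in the description, so the checks here are a convenience rather than a mirror of the synthesizer's behavior.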

Step 3: Add prosody (intonation)

  • ! after a vowel for stress: DH AE! T = "THAT" with emphasis

  • +N/-N on vowels for pitch changes: AY+20 = rising "I", D AH N(-30) = falling "done"

  • (+N)/(-N) for transient ornaments (don't carry forward)

  • Pause marks: , (100ms), ; (200ms), . (300ms)
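To make the pause marks concrete, the sketch below totals the pause time in an utterance. It assumes the three durations map to `,` `;` `.` in the order listed, and that pause marks appear as standalone whitespace-separated tokens, as they do in the examples on this page. The function name `pause_ms` is hypothetical.

```python
# Assumed mapping, taken from the order in which the marks and durations are listed.
PAUSE_MS = {",": 100, ";": 200, ".": 300}


def pause_ms(utterance):
    """Sum the pause durations (in ms) of standalone pause tokens."""
    return sum(PAUSE_MS.get(token, 0) for token in utterance.split())


print(pause_ms("b120 AY . AE M . N AO T! . IH M P R EH S T(-20)"))
# prints: 900
```

This counts only explicit pauses; it says nothing about phoneme durations or the automatic silence that stop consonants add.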

Step 4: Render

Pass the complete string to this tool.
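Since the tool returns base64-encoded WAV, a caller will usually want to decode it before playback. The following is a minimal sketch using only the Python standard library. The helper names `save_wav` and `fake_payload` are hypothetical, and the stand-in payload is generated silence, not real tool output; the shape of the actual response envelope is not described on this page.

```python
import base64
import io
import os
import tempfile
import wave


def save_wav(b64_audio, path):
    """Decode a base64 WAV payload, write it to disk, return (sample_rate, frames)."""
    raw = base64.b64decode(b64_audio)
    with open(path, "wb") as f:
        f.write(raw)
    # Re-open from memory to read the header as a sanity check.
    with wave.open(io.BytesIO(raw), "rb") as w:
        return w.getframerate(), w.getnframes()


def fake_payload(rate=22050, frames=2205):
    """Stand-in for a tool response: 16-bit mono silence, base64-encoded."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * frames)
    return base64.b64encode(buf.getvalue()).decode("ascii")


out_path = os.path.join(tempfile.gettempdir(), "speech.wav")
print(save_wav(fake_payload(), out_path))
# prints: (22050, 2205)
```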

Quick-Reference Voice Presets

| Preset | Directives | Description |
| --- | --- | --- |
| Male natural | b120 r100 s1.0 v2 | Default voice |
| Male deep | b90 r95 s0.92 v1 t-0.3 g0.6 | Deep, authoritative |
| Male bright | b130 r105 s1.0 v2 t0.2 | Clear, energetic |
| Female natural | b200 r100 s1.17 v2 | Natural female |
| Female warm | b185 r105 s1.15 v3 t-0.2 | Warm, friendly |
| Female bright | b220 r100 s1.18 v2 t0.2 | Bright, cheery |
| Child | b280 r90 s1.3 v1 | Young, higher pitch |
| Robot | b120 r90 s1.0 v0 h0 g0.8 t0.5 | Flat, mechanical |
| Whisper | b120 r100 s1.0 v0 h0.6 g0.1 | Breathy whisper |
| Dramatic | b100 r130 s1.0 v5 | Slow, theatrical |
| Singing male | bC4 r300 s1.0 v5 | For sung notes |
| Singing female | bG4 r300 s1.17 v4 | For sung notes |
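The presets lend themselves to a simple lookup. The sketch below copies the directive strings from the preset list into a dictionary and prepends one to a phoneme string. The dictionary keys and the `with_preset` helper are hypothetical names chosen for illustration; klattsch itself does not define named presets as far as this page shows.

```python
# Directive strings copied from the quick-reference presets; keys are made up.
PRESETS = {
    "male_natural": "b120 r100 s1.0 v2",
    "male_deep": "b90 r95 s0.92 v1 t-0.3 g0.6",
    "male_bright": "b130 r105 s1.0 v2 t0.2",
    "female_natural": "b200 r100 s1.17 v2",
    "female_warm": "b185 r105 s1.15 v3 t-0.2",
    "female_bright": "b220 r100 s1.18 v2 t0.2",
    "child": "b280 r90 s1.3 v1",
    "robot": "b120 r90 s1.0 v0 h0 g0.8 t0.5",
    "whisper": "b120 r100 s1.0 v0 h0.6 g0.1",
    "dramatic": "b100 r130 s1.0 v5",
    "singing_male": "bC4 r300 s1.0 v5",
    "singing_female": "bG4 r300 s1.17 v4",
}


def with_preset(name, phonemes):
    """Prefix a phoneme string with the directives of a named preset."""
    return f"{PRESETS[name]} {phonemes}"


print(with_preset("female_natural", "HH AH L OW . W ER L D"))
# prints: b200 r100 s1.17 v2 HH AH L OW . W ER L D
```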

Intonation Patterns That Sound Natural

  • Falling statement (period): last vowel gets -20 to -30, e.g. D AH N(-25)

  • Rising question: last vowel gets +20 to +30, e.g. R EH D IY(+25)

  • Listing items: each item rises, last falls, e.g. AE(+15) P AH L Z(+15) AO R AH N JH(-20)

  • Excited: higher base pitch, faster, e.g. b140 r85 ...

  • Serious/deep: lower base pitch, slower, e.g. b95 r115 ...

  • Sarcastic: exaggerated pitch swings, e.g. AY+30 M . S OW(-30) . S AH R K AE S T IH K
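The statement and question patterns can be applied mechanically. The sketch below appends a transient ornament to the final phoneme token, mirroring the placement used in the examples D AH N(-25) and R EH D IY(+25). The function name `end_contour` and the fixed offsets of 25 are assumptions picked from the middle of the suggested ranges.

```python
PAUSES = {",", ";", "."}
# Offsets chosen from the ranges given for statements and questions.
OFFSETS = {"statement": -25, "question": 25}


def end_contour(phonemes, kind):
    """Attach a (+N)/(-N) ornament to the last non-pause token."""
    tokens = phonemes.split()
    i = len(tokens) - 1
    while i >= 0 and tokens[i] in PAUSES:  # skip trailing pause marks
        i -= 1
    if i < 0:
        raise ValueError("no phoneme token to modify")
    tokens[i] = f"{tokens[i]}({OFFSETS[kind]:+d})"
    return " ".join(tokens)


print(end_contour("R EH D IY", "question"))
# prints: R EH D IY(+25)
```

Hand-tuning is still likely to be needed for anything longer than a short phrase, since natural intonation depends on more than the final token.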

Singing With Note Names

Instead of Hz for b, use note names: bC4, bD#4, bEb4, bF4, bG4, bA4, bB4.

Middle C = C4 (about 261.6 Hz), A4 = 440 Hz.

Set r250-r400 per phoneme and group notes with parentheses: bC4 r300 ( HH AH ) ( L OW ) bE4 ( W ER L D )
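For reference, note names relate to frequencies as follows under standard 12-tone equal temperament with A4 = 440 Hz. That klattsch uses exactly this tuning is an assumption, although it agrees with the two anchor values given above. The function `note_to_hz` is a hypothetical helper, not part of the tool.

```python
import re

# Semitone offset of each natural note within an octave, counted from C.
SEMITONES = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def note_to_hz(note):
    """Convert a note name like 'C4', 'D#4' or 'Eb4' to Hz (equal temperament)."""
    m = re.fullmatch(r"([A-G])([#b]?)(-?\d+)", note)
    if not m:
        raise ValueError(f"not a note name: {note!r}")
    letter, accidental, octave = m.groups()
    semitone = SEMITONES[letter] + {"#": 1, "b": -1, "": 0}[accidental]
    midi = 12 * (int(octave) + 1) + semitone  # MIDI numbering: A4 = 69
    return 440.0 * 2 ** ((midi - 69) / 12)


print(round(note_to_hz("C4"), 2))
# prints: 261.63
```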

Example Strings

  1. "Hello world" (male): b120 r100 s1.0 HH AH L OW . W ER L D

  2. "How are you?" (female, rising): b200 s1.17 HH AW . AA R . Y UW(+25)

  3. "I am NOT impressed" (stress on NOT): b120 AY . AE M . N AO T! . IH M P R EH S T(-20)

  4. "The quick brown fox" (energetic): b135 r90 t0.2 DH AH . K W IH K . B R AW N . F AA K S

  5. Sing "Twinkle twinkle" (two notes): bC4 r300 ( T W IH NG ) ( K AH L ) bG4 r300 ( T W IH NG ) ( K AH L )

  6. Dramatic movie trailer voice: b95 r140 s0.95 v4 t-0.3 g0.7 IH N . AH . W ER L D(-25) .

  7. Robot announcement: b130 r85 s1.0 v0 h0 g0.8 t0.4 AH T EH N SH AH N . P L IY Z

  8. Whispered secret: b110 r105 v0 h0.5 g0.1 s1.0 P S T . D OW N T . T EH L . EH N IY W AH N

Phoneme Categories (all 39 phonemes)

  • Vowels: IY IH EH AE AA AO AH UH UW ER AY AW EY OW OY

  • Sonorants: W Y R L M N NG

  • Fricatives: F TH S SH V DH Z ZH HH

  • Stops: P B T D K G (these get automatic burst + silence)

  • Affricates: CH JH

⚠️ The stops (P, B, T, D, K, G) and affricates (CH, JH) include an automatic silence-burst pattern. Don't add extra pauses after them.
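A quick pre-flight check can catch misspelled phoneme codes before rendering. The sketch below flags any token whose leading run of uppercase letters is not one of the 39 codes listed above. It is a rough filter under stated assumptions: directives start with a lowercase letter, and stress or pitch modifiers follow the phoneme code. The real klattsch parser may accept or reject input differently, and `unknown_phonemes` is a hypothetical name.

```python
import re

# The 39 phoneme codes from the category list above.
PHONEMES = set(
    "IY IH EH AE AA AO AH UH UW ER AY AW EY OW OY "
    "W Y R L M N NG F TH S SH V DH Z ZH HH "
    "P B T D K G CH JH".split()
)


def unknown_phonemes(utterance):
    """Return tokens that look like phoneme codes but are not in PHONEMES."""
    bad = []
    for token in utterance.split():
        m = re.match(r"[A-Z]+", token)  # leading uppercase run, e.g. 'AE' in 'AE!'
        if m and m.group(0) not in PHONEMES:
            bad.append(token)
    return bad


print(unknown_phonemes("b120 r100 HH EH L O . W ER L D(-25)"))
# prints: ['O']
```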

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| utterance | Yes | The klattsch phoneme string. ARPAbet codes + control directives, whitespace-separated. Use text_to_phonemes to convert English first, then tweak. | |
| sampleRate | No | | |
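For clients that speak MCP directly, a call to this tool is a JSON-RPC `tools/call` request. The sketch below builds such a request body in Python. The `speak_request` helper is hypothetical, transport details (stdio or HTTP) are left out, and `sampleRate` is omitted because its accepted values are not documented on this page.

```python
import json


def speak_request(utterance, request_id=1):
    """Build an MCP tools/call request body for the speak tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "speak",
            # Only the required argument is sent; sampleRate is optional.
            "arguments": {"utterance": utterance},
        },
    }


print(json.dumps(speak_request("b120 r100 s1.0 HH AH L OW . W ER L D"), indent=2))
```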
Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

No annotations are provided, so the description carries full burden. It discloses output format (base64 WAV), control directives, stop consonant auto-burst behavior, and intonation patterns. It does not cover error handling or rate limits, but the behavioral detail is extensive.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is long but well-structured with headers, tables, and examples. Every section adds value for the complex phoneme input. It is front-loaded with the purpose and quick-reference guide. Slightly verbose but justified by the domain.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the complexity of speech synthesis and no output schema, the description is extremely thorough. It covers input format, voice presets, prosody patterns, phoneme categories, and even special behaviors like stop consonants. No significant gaps remain.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 50% (only utterance has a description). The description greatly elaborates on the utterance parameter, but does not mention sampleRate at all, leaving its semantics to schema min/max/default. This partial coverage results in a moderate score.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description states the tool synthesizes speech from a klattsch phoneme string and returns base64 WAV audio. It clearly identifies the specific verb and resource, and distinguishes from siblings like text_to_phonemes and speak_file by focusing on phoneme input.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides step-by-step instructions on building phoneme strings, setting voice parameters, and adding prosody. It also suggests using text_to_phonemes first. However, it does not explicitly mention when to use alternatives like speak_file or list_phonemes, so it misses some exclusion guidance.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/Endeavor-DoxiDoxi/klattsch-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.