Babulus + ElevenLabs Guide for AI Agents
Target Audience: AI coding agents
Purpose: Enable YAML generation for Babulus projects using ElevenLabs TTS, SFX, and Music
ESSENTIALS
Quick Config: Two-Level YAML
Babulus loads config in this order:
1. ./.babulus/config.yml (project-specific)
2. ~/.babulus/config.yml (global)
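Assuming per-key merging (project values take precedence over global ones), a project config can stay minimal. A hypothetical sketch — the voice ID is a placeholder:

```yaml
# ./.babulus/config.yml — project-specific overrides only
providers:
  elevenlabs:
    voice_id: "project_voice_id_here"  # overrides the global default voice
    # api_key and model_id are inherited from ~/.babulus/config.yml
```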
Global Config (~/.babulus/config.yml):
providers:
elevenlabs:
api_key: "sk_..." # <req> ElevenLabs API key
voice_id: "abc123..." # [opt] Default voice for TTS
model_id: "eleven_multilingual_v2" # [opt] TTS model
sfx_model_id: "eleven_text_to_sound_v2" # [opt] SFX model
music_model_id: "music_v1" # [opt] Music model

Project DSL (.babulus.yml):
voiceover:
provider: elevenlabs # Use ElevenLabs for TTS
audio:
sfx_provider: elevenlabs # [opt] For sound effects
music_provider: elevenlabs # [opt] For background music
scenes:
- id: intro
title: "Introduction"
cues:
- id: hook
label: "Hook"
voice: "Narration text here."

MODEL TIER SELECTION
ElevenLabs offers multiple model tiers with different trade-offs between quality, speed, and cost. Babulus supports all ElevenLabs TTS models through configuration.
Available Models
Commonly used models include:
| Model ID | Tier | Description |
|---|---|---|
| eleven_v3 ⭐ | Premium | Latest v3 model - Recommended for production |
| eleven_multilingual_v2 | Premium | Multilingual support, high quality |
| eleven_turbo_v2_5 | Turbo | Faster generation, good quality balance |
| eleven_turbo_v2 | Turbo | Older turbo model |
| eleven_flash_v2_5 | Flash | Fastest generation, lower quality |
| eleven_monolingual_v1 | Premium | English-only, premium quality |
Note: Model availability and names may change. Check ElevenLabs documentation for the most current list.
When to Use Each Tier
Development/Iteration:
- Use faster models (turbo/flash tiers) or switch to OpenAI for rapid iteration
- Saves on API costs during frequent regeneration
Production:
- Use eleven_v3 for highest quality (recommended)
- Use eleven_multilingual_v2 if you need multilingual support
Quick prototypes:
- Use flash tier models for fastest generation when quality is less critical
Configuration: Three Ways
1. Global Default (Config File)
Set model_id in .babulus/config.yml or ~/.babulus/config.yml:
providers:
elevenlabs:
api_key: "sk_xxxxx"
voice_id: "lxYfHSkYm1EzQzGhdbfc"
model_id: "eleven_turbo_v2_5" # Global default model

2. Per-Video Override (DSL)
Override model for specific videos in your .babulus.yml:
voiceover:
provider: elevenlabs
model: "eleven_flash_v2_5" # Override for this video only
# voice: "different_voice_id" # Optional: override voice too

3. Environment-Based Switching (DSL)
Use different models for development vs production:
voiceover:
provider:
development: openai # Or elevenlabs with flash model
production: elevenlabs
model:
development: "tts-1" # Fast OpenAI for iteration
production: "eleven_v3" # Best quality ElevenLabs for final render
voice:
development: "alloy" # OpenAI voice
production: "lxYfHSkYm1EzQzGhdbfc" # ElevenLabs voice ID

Set environment:
# Development (fast iteration)
BABULUS_ENV=development babulus generate content/video.babulus.yml
# Production (highest quality)
BABULUS_ENV=production babulus generate content/video.babulus.yml

Key insight: Use faster/cheaper models during iteration, upgrade to best quality for the final render only.
Voice Selection vs Model Selection
Model (model_id or model): Which TTS inference model to use
- Affects quality, speed, cost, language support
- Examples: eleven_v3, eleven_multilingual_v2, eleven_turbo_v2_5
Voice (voice_id or voice): Which voice personality to use
- Affects voice characteristics (gender, accent, tone)
- Get voice IDs from your ElevenLabs account
- Example: lxYfHSkYm1EzQzGhdbfc
Both are independent:
voiceover:
provider: elevenlabs
model: "eleven_v3" # Quality/model tier
voice: "lxYfHSkYm1EzQzGhdbfc" # Voice personality

TTS Pattern 1: Minimal Voiceover
voiceover:
provider: elevenlabs # Activate ElevenLabs TTS
scenes:
- id: intro
title: "Introduction"
cues:
- id: hook
label: "Hook" # Timeline marker
voice: "Welcome to the video." # Synthesized text

TTS Pattern 2: Multi-Segment with Pauses
scenes:
- id: content
title: "Content"
cues:
- id: explanation
label: "Explanation"
voice:
segments:
- voice: "First sentence here."
- pause_seconds: 0.5 # Mid-narration pause
- voice: "Second sentence after pause."
- pause_seconds: 0.35
- voice: "Final thought."

Use case: Control pacing and add dramatic pauses mid-narration.
TTS Pattern 3: Pronunciation Dictionary (Auto-Managed)
Method 1: Alias (Simple)
voiceover:
provider: elevenlabs
pronunciation_dictionary:
name: myproject
pronunciations:
- lexeme:
grapheme: "Babulus"
alias: "bab-u-lus" # Simple phonetic guide

Method 2: CMU Arpabet (RECOMMENDED)
voiceover:
provider: elevenlabs
pronunciation_dictionary:
name: myproject
pronunciations:
- lexeme:
grapheme: "Tactus"
phoneme: "T AE1 K T AH0 S" # CMU Arpabet notation
alphabet: "cmu-arpabet" # MUST specify alphabet

Method 3: IPA (Alternative)
voiceover:
provider: elevenlabs
pronunciation_dictionary:
name: myproject
pronunciations:
- lexeme:
grapheme: "Tactus"
phoneme: "ˈtæktəs" # IPA notation
alphabet: "ipa" # MUST specify alphabet

Behavior: Babulus auto-creates/updates an ElevenLabs pronunciation dictionary and attaches it to every TTS request. The dictionary is cached in .babulus/out/<video>/manifest.json.
Key Point: When using phoneme, you MUST specify alphabet: "cmu-arpabet" or alphabet: "ipa". CMU Arpabet is recommended for better AI model reliability.
PRONUNCIATION DICTIONARIES
When to Use
Fix mispronunciation of project-specific terms (brand names, technical jargon, acronyms).
Three Pronunciation Methods
| Method | Format | Alphabet | Example | Use Case |
|---|---|---|---|---|
| alias | Plain text phonetic guide | N/A | "tack-tus" | Simple, good for basic cases |
| phoneme (CMU Arpabet) ⭐ | CMU phoneme notation | "cmu-arpabet" | "T AE1 K T AH0 S" | RECOMMENDED - Most reliable with AI models |
| phoneme (IPA) | International Phonetic Alphabet | "ipa" | "ˈtæktəs" | Alternative, less reliable |
Note: CMU Arpabet is recommended by ElevenLabs for better consistency with AI voice models. IPA is supported but may produce less reliable results.
Auto-Managed vs Pre-Existing Dictionaries
Option A: Auto-Managed (Recommended)
Method 1: Alias (simple)
voiceover:
pronunciation_dictionary:
name: myproject
pronunciations:
- lexeme:
grapheme: "BrandName"
alias: "brand-name"

Method 2: CMU Arpabet (RECOMMENDED for precision)
voiceover:
pronunciation_dictionary:
name: myproject
pronunciations:
- lexeme:
grapheme: "Tactus"
phoneme: "T AE1 K T AH0 S"
alphabet: "cmu-arpabet"

Method 3: IPA (alternative)
voiceover:
pronunciation_dictionary:
name: myproject
pronunciations:
- lexeme:
grapheme: "Tactus"
phoneme: "ˈtæktəs"
alphabet: "ipa"

Option B: Reference Existing Dictionary
voiceover:
pronunciation_dictionaries:
- id: "pd_your_dictionary_id" # Pre-existing ElevenLabs dict
version_id: null # null = latest version

Limits: Max 3 dictionaries per TTS request. The auto-managed dictionary counts as 1.
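Within that limit, both options can be combined — an auto-managed project dictionary plus pre-existing ones. A hypothetical sketch (the dictionary ID is a placeholder, and it is assumed Babulus attaches all listed dictionaries to each request):

```yaml
voiceover:
  provider: elevenlabs
  pronunciation_dictionary:     # auto-managed (counts as 1 of the 3)
    name: myproject
    pronunciations:
      - lexeme:
          grapheme: "Tactus"
          phoneme: "T AE1 K T AH0 S"
          alphabet: "cmu-arpabet"
  pronunciation_dictionaries:   # pre-existing (up to 2 more)
    - id: "pd_example_id"       # placeholder ID
      version_id: null          # null = latest version
```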
Real-World Example (from Tactus project)
Current (using CMU Arpabet for precision):
voiceover:
provider: elevenlabs
pronunciation_dictionary:
name: tactus
pronunciations:
- lexeme:
grapheme: "Tactus"
phoneme: "T AE1 K T AH0 S" # CMU Arpabet - most reliable
alphabet: "cmu-arpabet"

CMU Arpabet Notes:
- Numeric stress indicators: 1 = primary stress, 0 = no stress
- Space-separated phonemes: T AE1 K = "tack"
- Full example: T AE1 K T AH0 S = "TAK-tus" (stress on first syllable)
PAUSES & TIMING
Delay Cue Start (pause before narration begins)
cues:
- label: "Hook"
voice:
pause_seconds: 1 # Wait 1 second before speaking
segments:
- voice: "Content starts after delay."

Mid-Narration Pauses (between segments)
voice:
segments:
- voice: "First part."
- pause_seconds: 0.5 # Explicit pause
- voice: "Second part."

Gaussian Pause Distribution (Natural Variance)
Adds subtle, stochastic pausing between segments for natural-sounding narration.
voiceover:
seed: 1337 # Deterministic randomness (same seed = same pauses)
pause_between_items_gaussian:
mean_seconds: 0.18 # Average pause duration
std_seconds: 0.07 # Standard deviation
min_seconds: 0.06 # Floor value
max_seconds: 0.5 # Ceiling value

Use case: Avoid the robotic sound that fixed, uniform pauses produce.
Prevent First-Word Clipping
voiceover:
lead_in_seconds: 0.25 # Add silence before first word

Why: Prevents audio decode/playback startup from clipping the first word.
Real-World Example (from Tactus project)
voiceover:
provider: elevenlabs
seed: 1337 # Reproducible variation
lead_in_seconds: 0.25 # Prevent clipping
trim_end_seconds: 0 # Disable global trimming (safety)
pause_between_items_gaussian:
mean_seconds: 0.18
std_seconds: 0.07
min_seconds: 0.06
max_seconds: 0.5

AUDIO CLIPS: SOUND EFFECTS (SFX)
Inline SFX Definition
scenes:
- id: transition
title: "Transition"
cues:
- label: "Scene Change"
audio:
- kind: sfx # <req> Clip type
id: whoosh # <req> Unique identifier
prompt: "Fast cinematic whoosh transition" # <req> Text description
duration_seconds: 3 # <req> SFX length
volume: 65% # [opt] 0-1, 0-100, or "65%" (default: 1.0)
at: "+1.5s" # [opt] Start time (default: scene start)

Variants/Pick Workflow (Generate Multiple, Select Best)
audio:
- kind: sfx
id: whoosh
prompt: "Fast cinematic whoosh transition, deep bass, airy, clean"
duration_seconds: 3
variants: 8 # Generate 8 different SFX variants
pick: 2 # Use variant #2 (0-indexed: 0, 1, 2, ..., 7)
volume: 65%

Use case: Generate multiple options, audition them, then lock in the best one with pick.
Caching: Variants keyed by prompt + duration + seed. Stored in .babulus/out/<video>/cache/.
Audio Library Pattern (Define Once, Use Many)
audio:
sfx_provider: elevenlabs
library:
whoosh: # Reusable clip ID
kind: sfx
prompt: "Fast cinematic whoosh"
duration_seconds: 3
variants: 8
scenes:
- id: scene1
audio:
- use: whoosh # Reference library clip
pick: 2
volume: 35%
- id: scene2
audio:
- use: whoosh
pick: 5
volume: 80%

Volume Formats
All three formats are valid:
- 0.8 (float: 0-1)
- 80 (integer: 0-100)
- "80%" (string with %)
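For example, these three clips all play at the same level (prompts and IDs below are illustrative):

```yaml
audio:
  - kind: sfx
    id: hit_a
    prompt: "Short percussive hit"
    duration_seconds: 1
    volume: 0.8     # float, 0-1 scale
  - kind: sfx
    id: hit_b
    prompt: "Short percussive hit"
    duration_seconds: 1
    volume: 80      # integer, 0-100 scale
  - kind: sfx
    id: hit_c
    prompt: "Short percussive hit"
    duration_seconds: 1
    volume: "80%"   # percent string
```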
Timing Options
audio:
- kind: sfx
at: "2.5s" # Absolute time (from scene start)
# OR
at: "+1.5s" # Relative to previous audio/cue
# OR
pause_seconds: 3.5 # Delay after 'at' time

Real-World Example (from Tactus project)
scenes:
- id: problem
title: "Problem"
cues:
- label: "Problem"
audio:
- kind: sfx
id: whoosh
prompt: "Fast cinematic whoosh transition, deep bass, airy, clean, no harsh distortion, no voice"
duration_seconds: 3
variants: 8 # Generated 8 options
pick: 2 # Selected variant #2
volume: 65%
pause_seconds: 3.5 # Play 3.5 seconds after cue start

AUDIO CLIPS: MUSIC
Music Generation
scenes:
- id: intro
title: "Introduction"
audio:
- kind: music # <req> Clip type
id: bed # <req> Unique identifier
prompt: "Warm ambient background music, light percussion, no vocals" # <req> Description
duration_seconds: 30 # [opt] Length (default: scene duration)
volume: 80% # [opt] Playback volume

Fade To (Gradual Volume Change)
audio:
- kind: music
id: bed
prompt: "Ambient music, no vocals"
volume: 80% # Starting volume
fade_to:
volume: 10% # Target volume (0-1 scale)
after_seconds: 4 # Start fade at 4s
fade_duration_seconds: 4 # Fade over 4 seconds

Behavior: From 4s to 8s, volume transitions 80% → 10%.
Fade Out (Before End)
audio:
- kind: music
id: bed
volume: 80%
fade_out:
volume: 92% # Starting volume for fade (not target!)
before_end_seconds: 4 # Start fade 4s before clip ends
fade_duration_seconds: 4 # Fade over 4 seconds

Behavior: 4 seconds before the clip ends, fade from 92% → 0% over 4 seconds.
Play Through (Extend Beyond Scene)
audio:
- kind: music
id: bed
play_through: true # Extend to end of video (not just scene)

Default: Music clips only play for the current scene. Use play_through for continuous background music.
Force Instrumental
audio:
- kind: music
id: bed
prompt: "Upbeat electronic music"
force_instrumental: true # No vocals

Real-World Example (from Tactus project)
scenes:
- id: title
title: "Title"
audio:
- kind: music
id: bed
prompt: "Warm ambient background music, light percussion, deep bass, energetic, no vocals, clean, unobtrusive"
play_through: true # Continue through entire video
volume: 80%
fade_to:
volume: 10% # Duck volume for narration
after_seconds: 4
fade_duration_seconds: 4
fade_out:
volume: 92% # Fade out at end
before_end_seconds: 4
fade_duration_seconds: 4

REFERENCE: COMPLETE SCHEMAS
VoiceoverConfig
| Field | Type | Default | Description | Example |
|---|---|---|---|---|
| provider | string | "dry-run" | TTS provider | "elevenlabs" |
| voice | string? | null | Voice ID override (per-video) | "EXAVITQu4vr4xnSDxMaL" |
| model | string? | null | Model override (per-video) | "eleven_turbo_v2_5", "eleven_flash_v2_5", "eleven_multilingual_v2", "eleven_v3" |
| format | string | "wav" | Output format | "wav" |
| sample_rate_hz | int | 44100 | Audio sample rate | 22050, 24000, 44100 |
| seed | int | 0 | Deterministic variation | 1337 |
| lead_in_sec | float | 0.0 | Silence before first word | 0.25 |
| default_trim_end_sec | float | 0.0 | Global end trimming | 0.0 (disabled) |
| pause_between_items_sec | float | 0.0 | Fixed pause between segments | 0.5 |
| pause_between_items_gaussian | GaussianPause? | null | Natural pause variance | See GaussianPause schema |
| pronunciations | list | [] | Auto-managed lexemes | See PronunciationLexemeSpec |
| pronunciation_dictionary.name | string? | null | Auto-managed dict name | "myproject" |
| pronunciation_dictionary.workspace_access | string? | null | Workspace access level | "private" |
| pronunciation_dictionary.description | string? | null | Dictionary description | "Project pronunciations" |
| pronunciation_dictionaries | list? | null | Pre-existing dict IDs | [{"id": "pd_...", "version_id": null}] |
GaussianPause
| Field | Type | Required | Description |
|---|---|---|---|
| mean_seconds | float | yes | Average pause duration |
| std_seconds | float | yes | Standard deviation |
| min_seconds | float? | no | Floor value |
| max_seconds | float? | no | Ceiling value |
Example:
pause_between_items_gaussian:
mean_seconds: 0.18
std_seconds: 0.07
min_seconds: 0.06
max_seconds: 0.5

AudioClipSpec (SFX / Music / File)
| Field | Type | Default | Description | Example |
|---|---|---|---|---|
| kind | "file" \| "sfx" \| "music" | required | Clip type | "sfx" |
| id | string | required | Unique identifier | "whoosh" |
| at | string? | scene start | Start time (absolute or relative) | "+1.5s", "2.0s" |
| pause_seconds | float? | 0 | Delay after at time | 3.5 |
| volume | float \| int \| string | 1.0 | Playback volume | 0.8, 80, "80%" |
| fade_to | VolumeFadeToSpec? | null | Gradual volume change | See VolumeFadeToSpec |
| fade_out | VolumeFadeOutSpec? | null | Fade before end | See VolumeFadeOutSpec |
| play_through | bool | false | Extend to video end (music only) | true |
| source_id | string? | null | Variant cache key override | "shared_clip" |
SFX-specific fields:
| Field | Type | Default | Description | Example |
|---|---|---|---|---|
| prompt | string | required* | SFX description | "Fast whoosh transition..." |
| duration_seconds | float | required* | SFX length | 3 |
| variants | int | 1 | Number to generate | 8 |
| pick | int | 0 | Which variant to use (0-indexed) | 2 |
Music-specific fields:
| Field | Type | Default | Description | Example |
|---|---|---|---|---|
| prompt | string | required* | Music description | "Ambient background..." |
| duration_seconds | float | scene length | Music length | 30 |
| model_id | string? | config default | ElevenLabs music model | "music_v1" |
| force_instrumental | bool? | null | No vocals | true |
File-specific fields:
| Field | Type | Default | Description | Example |
|---|---|---|---|---|
| src | string | required* | File path | "public/audio/clip.mp3" |

\* Required for clips of that kind.
VolumeFadeToSpec
| Field | Type | Default | Description |
|---|---|---|---|
| volume | float | required | Target volume (0-1) |
| after_seconds | float | required | Start fade at this time |
| fade_duration_seconds | float | 2.0 | Fade transition length |
Example:
fade_to:
volume: 0.1 # Target 10%
after_seconds: 4 # Start at 4s
fade_duration_seconds: 4 # Fade over 4s

VolumeFadeOutSpec
| Field | Type | Default | Description |
|---|---|---|---|
| volume | float | required | Starting volume for fade (NOT target) |
| before_end_seconds | float | required | Start fade this long before end |
| fade_duration_seconds | float | 2.0 | Fade transition length |
Example:
fade_out:
volume: 0.92 # Starting volume
before_end_seconds: 4 # Start 4s before end
fade_duration_seconds: 4 # Fade over 4s

PronunciationLexemeSpec
| Field | Type | Default | Description | Example |
|---|---|---|---|---|
| grapheme | string | required | Text to replace | "Tactus" |
| alias | string? | null | Phonetic guide (simple method) | "tack-tus" |
| phoneme | string? | null | IPA or CMU Arpabet notation | "T AE1 K T AH0 S", "ˈtæktəs" |
| alphabet | string | "ipa" | Phoneme alphabet (required if phoneme used) | "cmu-arpabet" ⭐, "ipa" |
Note: Provide EITHER alias OR (phoneme + alphabet). Not both.
Example (alias - simple):
pronunciations:
- lexeme:
grapheme: "BrandName"
alias: "brand-name"

Example (CMU Arpabet - RECOMMENDED):
pronunciations:
- lexeme:
grapheme: "Tactus"
phoneme: "T AE1 K T AH0 S"
alphabet: "cmu-arpabet"

Example (IPA - alternative):
pronunciations:
- lexeme:
grapheme: "Tactus"
phoneme: "ˈtæktəs"
alphabet: "ipa"

CMU Arpabet Quick Reference:
- Vowels: AA, AE, AH, AO, AW, AY, EH, ER, EY, IH, IY, OW, OY, UH, UW
- Stress: 0 (none), 1 (primary), 2 (secondary) - appended to vowels (e.g., AE1)
- Consonants: Standard letters (T, K, S, etc.)
- Example: "actually" = AE1 K CH UW0 AH0 L IY0
TROUBLESHOOTING
| Error/Issue | Cause | Solution |
|---|---|---|
| Silent audio | Bad phoneme notation or missing alphabet | Try CMU Arpabet with alphabet: "cmu-arpabet", or use alias |
| Wrong SFX variant plays | Stale cache or incorrect pick index | Verify pick value (0-indexed). Clear .babulus/out/<video>/cache/ if stale. |
| First word clipped | Audio decode startup delay | Add lead_in_seconds: 0.25 |
| Unnatural pauses | Fixed pause_between_items_sec | Use pause_between_items_gaussian for natural variance |
| Pronunciation ignored | Dictionary not attached | Check .babulus/out/<video>/manifest.json for dictionary ID |
| Music doesn't extend | Default behavior is scene-duration | Add play_through: true |
| Volume too loud/quiet | Format mismatch | Use a consistent format: 0.8, 80, or "80%" |
| SFX not generated | Missing kind, prompt, or duration_seconds | Ensure all required fields are present |
| Variants not cached | source_id conflict | Use a unique id per clip, or set source_id for shared caching |
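The last troubleshooting entry — sharing one variant cache across clips — can be sketched as follows, assuming variants are keyed by source_id when it is present (prompts and IDs are illustrative):

```yaml
audio:
  - kind: sfx
    id: whoosh_intro      # unique timeline id
    source_id: whoosh     # shared cache key: both clips reuse the same variants
    prompt: "Fast cinematic whoosh"
    duration_seconds: 3
    variants: 8
    pick: 2
  - kind: sfx
    id: whoosh_outro
    source_id: whoosh     # same key, so no regeneration for this clip
    prompt: "Fast cinematic whoosh"
    duration_seconds: 3
    variants: 8
    pick: 5
```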
ELEVENLABS BEST PRACTICES SUMMARY
Babulus exposes ElevenLabs features via YAML. For API-level details, see ElevenLabs official docs.
Text Normalization
ElevenLabs: Preprocess phone numbers, currency, dates, abbreviations. Babulus: Handle in voice segments before synthesis.
Example:
voice: "Call five five five, one two three four." # Not "555-1234"

Pause Control
ElevenLabs: <break time="x.xs" /> tags (v2/Turbo models). Babulus: Use pause_seconds in segments.
Example:
segments:
- voice: "First part."
- pause_seconds: 1.5 # Equivalent to <break time="1.5s" />
- voice: "Second part."

Pronunciation
ElevenLabs: CMU Arpabet (recommended) or IPA phonemes via pronunciation dictionaries. Babulus: Auto-managed dictionaries with three methods:
- alias - Simple phonetic guide
- phoneme with alphabet: "cmu-arpabet" - RECOMMENDED for precision
- phoneme with alphabet: "ipa" - Alternative
Example (CMU Arpabet - RECOMMENDED):
pronunciations:
- lexeme:
grapheme: "Tactus"
phoneme: "T AE1 K T AH0 S"
alphabet: "cmu-arpabet" # Most reliable with AI models

Emotion Control (v3 Audio Tags)
ElevenLabs: [whispers], [excited], [sarcastic], etc. Babulus: Not directly exposed. Rely on text phrasing and context.
Pacing
ElevenLabs: Speed parameter (0.7-1.2x). Babulus: Not directly exposed. Use model/voice selection.
Stability Settings (v3)
ElevenLabs: Creative / Natural / Robust modes. Babulus: Configured via voice_settings in global config (advanced).
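A possible shape for that config, assuming Babulus passes voice_settings through to the ElevenLabs API unchanged. The field names follow the ElevenLabs API; the exact Babulus key path and supported fields are assumptions — verify against your Babulus version:

```yaml
providers:
  elevenlabs:
    api_key: "sk_..."
    voice_settings:            # assumed pass-through to the ElevenLabs API
      stability: 0.5           # lower = more expressive, higher = more consistent
      similarity_boost: 0.75   # adherence to the original voice character
```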
SUMMARY: AI AGENT CHECKLIST
When generating .babulus.yml with ElevenLabs features:
- Set the provider in the voiceover section: provider: elevenlabs
- Choose a model tier based on use case:
  - model: "eleven_v3" for production (best quality)
  - Faster models (turbo/flash) for development/iteration
  - model: "eleven_multilingual_v2" for multilingual content
- Use environment-based switching to optimize the iteration workflow
- Add a pronunciation dictionary if custom terms exist
- Use CMU Arpabet with alphabet: "cmu-arpabet" for pronunciation (most reliable)
- Always specify alphabet when using the phoneme field
- Add lead_in_seconds: 0.25 to prevent first-word clipping
- Use pause_between_items_gaussian for natural-sounding narration
- Generate SFX variants with variants: 8 and audition with pick
- Use fade_to / fade_out for professional music bed mixing
- Add play_through: true for continuous background music
- Use a consistent volume format: 0.8, 80, or "80%"
- Reference the audio library for reusable clips with use: <id>
End of Guide
Last updated: 2026-01-15