Azure Speech TTS Quick Start
Azure Speech (also called Azure Cognitive Services Speech) is Microsoft's text-to-speech service. It offers high-quality neural voices and supports many languages.
Prerequisites
- Azure Account with Cognitive Services access
- Azure Speech API Key from Azure Portal
- Region where your Speech resource is deployed (e.g., eastus, westus2)
- Python requests library (should already be installed)
Getting Your Credentials
- Go to Azure Portal
- Create a "Speech" resource (Cognitive Services → Speech)
- After creation, go to "Keys and Endpoint"
- Copy:
  - Key 1 or Key 2 (your API key)
  - Location/Region (e.g., eastus)
Configuration
1. Add Azure Speech to your config
Edit .babulus/config.yml:
providers:
  azure_speech:
    api_key: "your_azure_api_key_here"
    region: "eastus" # Your Azure region
    voice: "en-US-JennyNeural" # Default voice
2. Set provider in your DSL
Edit your .babulus.yml file:
voiceover:
  provider: azure-speech # or just "azure"
  sample_rate_hz: 24000 # Azure supports: 8000, 16000, 24000, 44100
3. Environment-Based Configuration (Recommended)
For multi-environment workflows:
voiceover:
  provider:
    development: openai
    aws: aws
    azure: azure # Use Azure Speech for testing
    production: elevenlabs
  sample_rate_hz:
    development: 24000 # OpenAI
    aws: 16000 # Polly
    azure: 24000 # Azure Speech (good balance)
    production: 44100 # ElevenLabs
Then generate with:
BABULUS_ENV=azure babulus generate your-video.babulus.yml
Available Voices
Azure offers 400+ voices across 140+ languages. Here are popular English options:
US English Neural Voices
- en-US-JennyNeural (Female, friendly)
- en-US-GuyNeural (Male, warm)
- en-US-AriaNeural (Female, professional)
- en-US-DavisNeural (Male, professional)
- en-US-AmberNeural (Female, young adult)
- en-US-AnaNeural (Female, child)
- en-US-AndrewNeural (Male, young adult)
- en-US-BrianNeural (Male, child)
UK English Neural Voices
- en-GB-SoniaNeural (Female)
- en-GB-RyanNeural (Male)
- en-GB-LibbyNeural (Female, young)
Other English Variants
- en-AU (Australian): Natasha, William, Annette
- en-CA (Canadian): Clara, Liam
- en-IN (Indian): Neerja, Prabhat
- en-IE (Irish): Emily, Connor
See full voice list for all languages.
Voice Selection
Override the voice per DSL:
voiceover:
  provider: azure
  voice: en-US-GuyNeural # Override default voice
  sample_rate_hz: 24000
Or omit voice to use the default from config.
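The fallback described above can be sketched in a few lines of Python. This is an illustration only, not Babulus's actual code: `resolve_voice` and the dictionary shapes are hypothetical and simply mirror the YAML keys shown in this guide.

```python
from typing import Optional


def resolve_voice(voiceover: dict, provider_config: dict) -> Optional[str]:
    """Pick the DSL-level voice if present, else the provider default.

    Both arguments are plain dicts shaped like the YAML in this guide
    (hypothetical structure, assumed for illustration).
    """
    return voiceover.get("voice") or provider_config.get("voice")


provider_config = {"region": "eastus", "voice": "en-US-JennyNeural"}

# DSL overrides the voice
print(resolve_voice({"provider": "azure", "voice": "en-US-GuyNeural"}, provider_config))
# DSL omits the voice, so the config default applies
print(resolve_voice({"provider": "azure"}, provider_config))
```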
Sample Rate Options
Azure Speech supports flexible sample rates:
- 8000 Hz: Telephony quality (smallest files)
- 16000 Hz: Good quality (medium files)
- 24000 Hz: High quality (recommended, good balance)
- 44100 Hz: Maximum quality (largest files)
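To make the file-size trade-off concrete, here is a rough estimate of audio size per sample rate. It assumes uncompressed 16-bit mono PCM, which is an assumption for illustration; the format Babulus actually requests may differ. The function names are hypothetical.

```python
SUPPORTED_SAMPLE_RATES_HZ = (8000, 16000, 24000, 44100)


def check_sample_rate(sample_rate_hz: int) -> int:
    """Raise ValueError unless the rate is one of the four listed above."""
    if sample_rate_hz not in SUPPORTED_SAMPLE_RATES_HZ:
        allowed = ", ".join(str(r) for r in SUPPORTED_SAMPLE_RATES_HZ)
        raise ValueError(f"sample_rate_hz must be one of {allowed}")
    return sample_rate_hz


def pcm_bytes(sample_rate_hz: int, seconds: float,
              bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Approximate size of raw PCM audio (assumes 16-bit mono by default)."""
    rate = check_sample_rate(sample_rate_hz)
    return int(rate * seconds * bytes_per_sample * channels)


# Size of one minute of audio at each supported rate
for rate in SUPPORTED_SAMPLE_RATES_HZ:
    print(f"{rate} Hz: {pcm_bytes(rate, 60) / 1_000_000:.2f} MB per minute")
```

Under these assumptions, 44100 Hz audio is about 5.5 times larger than 8000 Hz audio of the same duration.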
Pricing
- Standard voices: Free for first 0.5M characters/month, then $4/1M chars
- Neural voices: Free for first 0.5M characters/month, then $16/1M chars ($0.016/1K chars)
- More expensive than OpenAI ($0.015/1K) and AWS Polly ($0.004-0.016/1K)
- Less expensive than ElevenLabs (~$0.30/1K)
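As a worked example of the neural-voice pricing above, the sketch below estimates a monthly bill from a character count. The rates are the ones quoted in this section and may change; the function name is hypothetical.

```python
FREE_CHARS_PER_MONTH = 500_000        # free tier quoted above
NEURAL_USD_PER_MILLION_CHARS = 16.0   # neural rate quoted above


def neural_monthly_cost_usd(chars: int) -> float:
    """Estimated monthly cost: only characters beyond the free tier are billed."""
    billable = max(0, chars - FREE_CHARS_PER_MONTH)
    return billable * NEURAL_USD_PER_MILLION_CHARS / 1_000_000


# 2M characters in a month: 1.5M billable characters at $16 per 1M
print(neural_monthly_cost_usd(2_000_000))
# Usage within the free tier costs nothing
print(neural_monthly_cost_usd(300_000))
```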
Troubleshooting
Error: "Azure TTS failed (401)"
Solution: Check your API key and region are correct in config.
providers:
  azure_speech:
    api_key: "YOUR_ACTUAL_KEY" # Get from Azure Portal
    region: "YOUR_REGION" # e.g., eastus, westus2
Error: "Azure TTS sample_rate_hz must be one of 8000, 16000, 24000, 44100"
Solution: Use a supported sample rate:
voiceover:
  sample_rate_hz: 24000 # Change to 8000, 16000, 24000, or 44100
Voice Not Found
Solution: Check that the voice name is correct. Voice names are case-sensitive and must include the language prefix:
- ✅ Correct: en-US-JennyNeural
- ❌ Wrong: jenny, JennyNeural, en-us-jennyneural
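A quick local check can catch the mistakes listed above before a request is sent. The pattern below is a heuristic written for this guide, not an official Azure rule: it only covers the common `language-REGION-NameNeural` shape, so a valid voice whose name deviates from that shape would be rejected.

```python
import re

# Heuristic: lowercase language, uppercase region, capitalized name ending in "Neural"
VOICE_NAME_RE = re.compile(r"[a-z]{2,3}-[A-Z]{2}-[A-Z][A-Za-z]*Neural")


def looks_like_voice_name(name: str) -> bool:
    """Return True if the name has the common Azure neural voice shape."""
    return VOICE_NAME_RE.fullmatch(name) is not None


for candidate in ["en-US-JennyNeural", "jenny", "JennyNeural", "en-us-jennyneural"]:
    status = "ok" if looks_like_voice_name(candidate) else "rejected"
    print(f"{candidate}: {status}")
```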
SSML Support
Azure Speech uses SSML (Speech Synthesis Markup Language) internally. Currently, Babulus sends plain text, which is wrapped in basic SSML. Future versions may expose more SSML features for:
- Prosody control (rate, pitch, volume)
- Pauses and breaks
- Pronunciation hints
- Emphasis and expression
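For reference, a basic SSML wrapper around plain text might look like the sketch below. It is not taken from Babulus; `wrap_in_ssml` is a hypothetical helper that shows the minimal `speak`/`voice` structure and escapes characters that would otherwise break the XML.

```python
from xml.sax.saxutils import escape, quoteattr


def wrap_in_ssml(text: str, voice: str, lang: str = "en-US") -> str:
    """Wrap plain text in a minimal SSML document for a single voice."""
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f"xml:lang={quoteattr(lang)}>"
        f"<voice name={quoteattr(voice)}>{escape(text)}</voice>"
        "</speak>"
    )


# The ampersand is escaped so the result stays well-formed XML
print(wrap_in_ssml("Questions & answers about Azure Speech.", "en-US-GuyNeural"))
```

Features such as prosody or breaks would be added as extra elements inside the `voice` element.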
Comparison with Other Providers
| Feature | Azure Speech | OpenAI TTS | AWS Polly | ElevenLabs |
|---|---|---|---|---|
| Price (per 1K chars) | $0.016 | $0.015 | $0.004-0.016 | ~$0.30 |
| Quality | Excellent | Very Good | Good | Excellent |
| Sample rates | 8000-44100 Hz | 24000 Hz | 8000, 16000 Hz | 22050-44100 Hz |
| Voices | 400+ voices, 140+ languages | 6 voices | 60+ voices | 1000+ voices |
| Credentials | API key + region | API key | AWS credentials | API key |
| Free tier | 0.5M chars/month | None | None | 10K chars/month |
| Best for | Multilingual, enterprise | Development | Cost-conscious | Final production |
Example: Complete DSL
voiceover:
  provider: azure
  voice: en-US-GuyNeural
  sample_rate_hz: 24000
  seed: 1337
  lead_in_seconds: 0.25
scenes:
  - id: intro
    title: "Introduction"
    cues:
      - id: welcome
        label: "Welcome"
        voice: "Welcome to our presentation about Azure Speech services."
Advanced: Multiple Voices
You can mix voices within a video by overriding voice per cue:
voiceover:
  provider: azure
  voice: en-US-JennyNeural # Default female voice
  sample_rate_hz: 24000
scenes:
  - id: dialog
    cues:
      - id: host
        voice: "Hello, I'm the host."
        # Uses default: en-US-JennyNeural
      - id: guest
        voice:
          model: en-US-GuyNeural # Override to male voice
          segments:
            - voice: "And I'm the guest speaker."
Regional Endpoints
Azure Speech uses regional endpoints. Common regions:
- eastus - East US (Virginia)
- westus2 - West US 2 (Washington)
- westeurope - West Europe (Netherlands)
- southeastasia - Southeast Asia (Singapore)
The region in your config must match the region where you created the Speech resource, because API keys are tied to that region. For lower latency, create the resource in the region closest to you.
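To show how the region feeds into a request, the sketch below builds the endpoint URL and headers for the Azure Speech REST text-to-speech API. It only constructs the values and sends nothing. The helper names are hypothetical, the output format is fixed to 24 kHz as an example, and Babulus may build its requests differently.

```python
def tts_endpoint(region: str) -> str:
    """Regional text-to-speech endpoint for a Speech resource."""
    return f"https://{region.strip().lower()}.tts.speech.microsoft.com/cognitiveservices/v1"


def tts_headers(api_key: str) -> dict:
    """Headers for an SSML synthesis request (24 kHz WAV chosen as an example)."""
    return {
        "Ocp-Apim-Subscription-Key": api_key,
        "Content-Type": "application/ssml+xml",
        "X-Microsoft-OutputFormat": "riff-24khz-16bit-mono-pcm",
    }


print(tts_endpoint("eastus"))
print(sorted(tts_headers("YOUR_ACTUAL_KEY")))
```

With the requests library, these values would be passed to `requests.post(url, headers=headers, data=ssml_bytes)`, where the body is an SSML document.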