ElevenLabs Multilingual Voiceover at Scale: Dub Your Video Ads Into 29 Languages
A SaaS client came to me last quarter with a problem I would have called "impossible" three years ago. They had a 60-second English hero video, performing well in the US, and they wanted to launch it as paid social in Brazil, Mexico, Germany, France, Spain, Italy, Japan, and Korea. Same creative, same edit, same on-screen talent. Just dubbed. In 8 languages. In two weeks, ahead of their Q4 push.
The old way: book 8 voice actors, schedule studio time in 8 time zones, sync subtitles, pray the agency doesn't blow the budget. The new way: ElevenLabs' Dubbing API, one 60-second video, and a Friday afternoon. We shipped all 8 versions in under 3 hours of total human time, and the language-team lead told me the German and Japanese variants outperformed the US original in their first 7 days. (Caveat: small budget, single test, treat as a data point, not a verdict.)
This is the actual pipeline I built for that project. The one I'd run again tomorrow for any global paid video push.
Why Dubbing Is the Unlocked Lever for Global Paid Media
Most global paid-media budgets still live in one language. That's not because the teams don't want to expand — it's because the last-mile cost of localization is brutal. A 60-second video in 8 languages, professionally voiced, used to mean $8K–$15K in studio costs and 2–3 weeks of project management. The math made it sensible to just run English in non-English geos and hope for the best.
The best was not great. In my client's Meta data, English-only video in a Spanish-speaking market underperformed native-language creative on view-through rate (VTR, the percentage of viewers who watch to completion) by 30–50%. The translate-as-subtitle approach helped but never closed the gap, because reading subtitles is a different cognitive task from hearing a real voice in your language.
ElevenLabs' Dubbing API changed the cost curve. Same speaker identity preserved, 29 target languages, output in minutes instead of weeks, and a per-character cost that comes out to roughly $15–$60 per 60-second spot at the time I'm writing this. The cost is no longer the bottleneck. The workflow is.
The Five-Step Pipeline
This is the workflow, end to end. Skip a step and the output falls apart in one of three ways: voices that don't match the on-screen person, translations that read like a Google Translate job, or audio that's audibly AI (which still costs you trust, especially in older demographics).
- Source the original voice — record a clean 60–90s sample OR use your brand's on-screen talent's existing read
- Clone it in ElevenLabs' Voice Library (one-time per speaker)
- Run the Dubbing API with the source language and target languages
- QA the translation and timing in a spreadsheet (this is the step that separates "AI output" from "shippable creative")
- Re-mux audio + video and ship to your ad platform
I'll walk through each one with the actual config.
Step 1: Source the Original Voice (Don't Skip This)
The biggest mistake I see: people trying to clone a voice from a video that already has music, sound effects, and background noise baked into the audio. The clone works, but it carries the room tone and music into every target language. Your German dub ends up sounding like the speaker is in a coffee shop.
The fix: pull a clean 60–90 second WAV of the speaker reading the script, recorded in a quiet room. Phone-on-a-pillow works in a pinch. The cleaner the source, the cleaner every dub.
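Before uploading, it can help to confirm the recording is actually in the 60–90 second range. The following is a minimal sketch using only Python's standard `wave` module; the function name `sample_report` and its thresholds are illustrative, and it only checks length and format, not background noise.

```python
import wave


def sample_report(path, min_seconds=60.0, max_seconds=90.0):
    """Return basic facts about a WAV file and whether its length is in range."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        channels = wav.getnchannels()
    duration = frames / float(rate)
    return {
        "duration_s": round(duration, 2),
        "sample_rate_hz": rate,
        "channels": channels,
        # True only when the clip falls inside the recommended window
        "length_ok": min_seconds <= duration <= max_seconds,
    }
```

Calling `sample_report("sarah_clean.wav")` (a hypothetical filename) returns a small dict you can eyeball or log before cloning.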
If you're cloning a brand spokesperson or executive, get their explicit written consent first. ElevenLabs' terms require it, and California and a growing list of US states have right-of-publicity laws (covering a person's likeness and voice) that make cloning someone's voice without permission a real legal issue, not just a ToS footnote.
Step 2: Clone Once, Reuse Forever
In the ElevenLabs dashboard, go to Voices → Add Voice → Instant Voice Cloning. Upload the clean WAV, name it (e.g. "Sarah_EN_Hero"), and the system extracts a voice fingerprint in roughly 30 seconds.
For most marketing uses, Instant Voice Cloning is fine. It costs nothing extra and produces a voice that the Dubbing API can re-speak in 29 languages while preserving the speaker's identity. If you have a flagship brand voice you'll use across hundreds of ads and want more control over the timbre, Professional Voice Cloning is the next tier — but it costs more and requires more sample data.
The part most tutorials skip: generate a few test utterances in your target languages first. Before committing to a 29-language batch, I do a quick test in Spanish, German, and Japanese. If the cloned voice sounds reasonably like the source in those three, it has held up in the other 26 in my experience.
Step 3: The Dubbing API Call (Where the Work Actually Happens)
The Dubbing API is a single endpoint. The Python SDK is the cleanest way to drive it from a script.
```python
import time

from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key="YOUR_API_KEY")

# Method and field names below follow recent versions of the Python SDK.
# They have changed between releases, so check them against the version you install.

# 1. Start the dub
dub = client.dubbing.create(
    source_url="https://your-cdn.com/hero-video-en.mp4",
    source_lang="en",
    target_lang="es",   # loop over your 8 markets
    num_speakers=1,
    watermark=True,     # explained in the notes below
)
dubbing_id = dub.dubbing_id

# 2. Poll until it's done
while True:
    status = client.dubbing.get(dubbing_id).status
    if status == "dubbed":
        break
    if status != "dubbing":
        # Anything other than "dubbing" or "dubbed" means the job did not complete
        raise RuntimeError(f"Dubbing ended with status {status!r}")
    time.sleep(10)

# 3. Download the dubbed audio (returned as a stream of byte chunks)
with open("hero-video-es.mp3", "wb") as f:
    for chunk in client.dubbing.audio.get(dubbing_id, "es"):
        f.write(chunk)
```

A few things in this snippet that matter:
- `source_url` is an HTTPS URL, not a local file. Upload your video to S3, Cloudflare R2, or any CDN with a public link. Local file uploads work but are slower and not recommended for batches.
- `num_speakers`: ElevenLabs' auto-detection works well for 1–2 speakers and breaks down at 3+. For 3+ speakers, transcribe and segment first.
- `watermark=True` is the setting I use for marketing assets. The API reference describes this flag as a watermark applied to the output video rather than an inaudible audio signal, so read the current reference before relying on it for provenance.
For an 8-language batch, the structure is the same: loop `target_lang` over `["es", "pt", "de", "fr", "it", "ja", "ko", "..."]`. The total wall-clock time is dominated by translation, not synthesis. For a 60-second source, expect 2–4 minutes per language.
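One way to structure that loop is to start every job first and then poll them together, so the per-language wait overlaps instead of adding up. This is a sketch under assumptions, not the SDK's own batch API: `start_dub` and `get_status` are hypothetical wrappers you would write around the create and status calls, `"dubbing"` is assumed to be the in-progress status string, and it presumes your plan allows several dubs to run at once.

```python
import time


def run_batch(languages, start_dub, get_status,
              poll_seconds=10, timeout_seconds=900, sleep=time.sleep):
    """Start one dub per language, then poll until each one finishes.

    start_dub(lang) -> job id and get_status(job_id) -> status string are
    caller-supplied (hypothetical) wrappers around the SDK calls.
    Returns {language: final status}, with "timeout" for jobs still running.
    """
    pending = {lang: start_dub(lang) for lang in languages}
    results = {}
    waited = 0
    while pending:
        for lang, job_id in list(pending.items()):
            status = get_status(job_id)
            if status != "dubbing":  # "dubbed" or any failure state ends the wait
                results[lang] = status
                del pending[lang]
        if pending:
            if waited >= timeout_seconds:
                for lang in pending:
                    results[lang] = "timeout"
                break
            sleep(poll_seconds)
            waited += poll_seconds
    return results
```

Anything in the result that is not `"dubbed"` is a language to retry or investigate before moving on to QA.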
Step 4: Translation QA (The Step That Actually Matters)
This is the step that separates a usable dub from a shippable one. ElevenLabs' built-in translation is good — better than Google Translate, better than DeepL for spoken cadence — but it is not a marketing translator. A few failure modes I've seen in production:
- Slogans get translated literally. A US brand tagline like "Built to outlast" came back in Spanish as "Construido para durar más" — grammatically correct, brand-wrong. The actual brand Spanish is "Hecho para resistir."
- CTAs change meaning. "Sign up free" became "Registrar gratis," which in Brazilian Portuguese reads as a spammy affiliate offer, not a SaaS trial.
- Cultural references don't land. A US pop-culture quip in the original fell flat in Japan and drew confused click-throughs.
The fix: export the transcript, send it to a native-speaker marketer in each market for a 15-minute pass, then re-run the dub with the corrected transcript as the source. The API accepts a transcript override so you don't have to re-translate from the video each time.
This is where the human-in-the-loop earns its keep. The 15 minutes of human review per language is the difference between a dub that sounds like a real campaign and one that sounds like AI trying too hard.
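The review spreadsheet itself can be generated rather than hand-built. The sketch below is one possible layout, not a format ElevenLabs prescribes: the column names and both helper functions are my own, and it assumes you already have the source script and the machine translation as parallel lists of lines.

```python
import csv
import io

COLUMNS = ["language", "line", "source_text",
           "machine_translation", "reviewer_fix", "approved"]


def build_qa_sheet(language, source_lines, translated_lines):
    """Return CSV text pairing each source line with its machine translation."""
    if len(source_lines) != len(translated_lines):
        raise ValueError("source and translation must have the same number of lines")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(COLUMNS)
    for number, (src, mt) in enumerate(zip(source_lines, translated_lines), start=1):
        # reviewer_fix and approved are left blank for the native speaker to fill in
        writer.writerow([language, number, src, mt, "", ""])
    return buffer.getvalue()


def corrected_lines(csv_text):
    """Read a reviewed sheet back; a reviewer fix wins over the machine translation."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [row["reviewer_fix"].strip() or row["machine_translation"] for row in rows]
```

The reviewer only touches the `reviewer_fix` column, and `corrected_lines` gives you the final script to feed into the re-run.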
Step 5: Re-mux and Ship
The Dubbing API returns audio only. To get a full video, you have two options:
- Easy: Use a tool like FFmpeg (a free command-line video processing tool) to swap the audio track. One-liner: `ffmpeg -i hero-video-en.mp4 -i hero-video-es.mp3 -c:v copy -c:a aac -map 0:v:0 -map 1:a:0 hero-video-es.mp4`
- Smarter: If you want lip-sync, with the dubbed mouth matching the new language, pair ElevenLabs with a tool like HeyGen (a video tool that re-animates mouth movements to match new audio), D-ID, or Wav2Lip (an open-source lip-sync model).

Each has tradeoffs:
| Tool | Lip-sync quality | Cost per 60s | Best for |
|---|---|---|---|
| HeyGen | Excellent (re-animates face) | ~$30–60 | Spokesperson-led ads, talking-head videos |
| D-ID | Good | ~$5–10 | Budget-conscious, mostly straight-to-camera |
| Wav2Lip (open source) | Adequate | Free (self-hosted) | When budget is zero and you can run a GPU |
For paid social, lip-sync is genuinely worth it. A 2024 Meta internal study (cited in a few industry talks) suggested native-language lip-synced creative lifted brand recall by 12–18% over the same creative with mismatched lip movement. Your mileage will vary, but the directional signal is real.
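For the plain audio-swap route, the FFmpeg one-liner can be generated per language instead of retyped. This sketch builds the same argument list as the command in the Easy option; the helper names and the `hero-video-<lang>` naming pattern are assumptions taken from the filenames used in this walkthrough.

```python
def remux_command(video_in, audio_in, video_out):
    """Build the FFmpeg argument list that swaps in a dubbed audio track."""
    return [
        "ffmpeg",
        "-i", video_in,    # input 0: original video
        "-i", audio_in,    # input 1: dubbed audio
        "-c:v", "copy",    # keep the video stream untouched
        "-c:a", "aac",     # re-encode the audio as AAC for MP4
        "-map", "0:v:0",   # take video from input 0
        "-map", "1:a:0",   # take audio from input 1
        video_out,
    ]


def remux_plan(languages, stem="hero-video", source_lang="en"):
    """Return one FFmpeg command per target language."""
    source = f"{stem}-{source_lang}.mp4"
    return {
        lang: remux_command(source, f"{stem}-{lang}.mp3", f"{stem}-{lang}.mp4")
        for lang in languages
    }
```

Each value can be passed straight to `subprocess.run(cmd, check=True)` on a machine with FFmpeg installed; passing a list avoids shell quoting problems with filenames.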
What I Wish I'd Known on Project One
A few things that would have saved me 4 hours on the first run:
The voice clone doesn't age well across re-uses. I re-used the same cloned voice for 4 different videos in the same campaign, and the timbre drifted slightly on the 3rd and 4th. For a flagship campaign, re-clone per major creative, or use the original source for every new dub.
Some languages need a different pacing model. German and Japanese tend to need slower delivery than the source English to feel natural, and the API will match the source pacing if you don't intervene. The fix: in the source script, add strategic punctuation or short parenthetical breath markers (...) where you want the model to slow down. For German specifically, breaking compound words with hyphens gives the model a hint to re-pace.
The "free" tier is for testing, not for production. ElevenLabs' free plan gives you ~10K characters/month. A single 60-second dub in 8 languages burns about 20K–30K characters. You'll need the Starter tier ($5/month at the time I'm writing this) minimum, and the Creator tier ($22/month) for anything resembling real campaign volume.
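The arithmetic behind that is worth a quick check against whatever quota you are on. This sketch uses only the approximate figures quoted above (about 10K free characters per month, 20K–30K characters for an 8-language batch); the function names are illustrative, and you should substitute the real quota for your plan.

```python
import math

FREE_QUOTA = 10_000                      # approximate free-tier figure quoted above
BATCH_LOW, BATCH_HIGH = 20_000, 30_000   # 8-language estimate quoted above
LANGUAGES = 8


def batch_fits(chars_per_batch, monthly_quota):
    """True if one full batch fits inside a single month's quota."""
    return chars_per_batch <= monthly_quota


def months_of_quota(chars_per_batch, monthly_quota):
    """How many months of quota one batch would consume, rounded up."""
    return math.ceil(chars_per_batch / monthly_quota)


per_language = (BATCH_LOW // LANGUAGES, BATCH_HIGH // LANGUAGES)

print(per_language)                             # (2500, 3750)
print(batch_fits(BATCH_LOW, FREE_QUOTA))        # False
print(months_of_quota(BATCH_HIGH, FREE_QUOTA))  # 3
```

Even the low estimate is double the free allowance, which is why a single campaign batch already needs a paid tier.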
Subtitles still matter. Even with a great native-language dub, on platforms where 60–70% of mobile users watch on mute by default (which is most of them), burned-in or platform-side subtitles are still doing real conversion work. Dub + subtitles is the new baseline, not dub or subtitles.
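If you already have the reviewed script with rough timings, producing a subtitle file for each language is a small step. The sketch below writes the standard SRT layout; the `(start, end, text)` segment tuples are an assumed input shape, not something the Dubbing API is claimed to return in this form.

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments):
    """Render (start_seconds, end_seconds, text) tuples as SRT text."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    # A blank line separates consecutive cues
    return "\n".join(blocks)
```

Save the result as a `.srt` file and upload it alongside the dubbed video, or hand it to FFmpeg if you prefer burned-in captions.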
The Real Win
Three years ago, "localize a video into 8 languages" was a project. Today it's a Friday afternoon. The work moved from production (studio, talent, post) to QA (translation review, lip-sync check, brand fit). That's a better place for the work to live — the production step was always the easy part; the part that needed humans was the judgment call, and the judgment call is now all that's left.
If you're running global paid video and still shipping English-only creative, the cheapest test you can run this week is: take your top-performing US ad, dub it into your top 3 non-English markets with the pipeline above, and let the platform's auction do the rest. Don't overthink the creative. The localization alone will move the numbers.