How to Create AI Voiceovers From a Script in One Flow
Paste a rough script. Get back a clean, natural-sounding voiceover in under 30 seconds. I built this flow on PlugNode's canvas in about five minutes, and it costs less than $0.01 per run.
This tutorial covers every node, every wire, and every config field. You'll walk away with a working flow you can trigger manually or publish as an API endpoint for your app, store, or podcast pipeline.
What you'll build
A 5-node flow that chains two AI models:
- Gemini rewrites your rough script for natural speech (expanding abbreviations, removing URLs, adding pause cues)
- ElevenLabs generates a studio-quality voiceover from the cleaned script
The output: a downloadable MP3 file ready for your video editor, podcast host, or ad platform.
| Node | Type | Model / Provider | Purpose |
|---|---|---|---|
| manual-trigger | Trigger | None | Starts the flow |
| text-input | Input | None | Your raw script |
| text | Generation | Gemini 2.5 Flash | Cleans and rewrites for speech |
| audio | Generation | ElevenLabs | Synthesizes the voiceover |
| output | Output | None | Collects the audio file |
Prerequisites
- A PlugNode account (free tier works)
- API keys added in Settings: Gemini, ElevenLabs
- A script or rough notes (even bullet points work)
Open a blank canvas from your dashboard. Everything below happens on that canvas.
Step 1: Add the trigger and input
Drag a Manual Trigger node onto the canvas. This fires the flow when you click Run.
Add a Text Input node. Label it "Raw Script." Paste your rough script here. It can be messy. Bullet points, half-sentences, URLs, abbreviations, all fine. The next node handles cleanup.
Example value:
Intro for ep 47. Topic: why DTC brands should test short-form video ads on TikTok before scaling to Meta. Mention avg CPM diff ($4-6 TikTok vs $10-14 Meta). Keep it under 30 sec. Casual tone, not salesy.
Wire the Manual Trigger to the Text Input. Wire the Text Input to the next generation node.
Step 2: Clean the script with Gemini
Add a Text node. Open its config panel and select Gemini 2.5 Flash as the model.
Set the system prompt:
You are a script editor for spoken audio. Given rough notes or a draft script,
rewrite it as a clean voiceover script optimized for text-to-speech.
Rules:
- Expand all abbreviations (DTC → direct-to-consumer, CPM → cost per mille)
- Remove URLs and markdown formatting
- Write numbers as words when under 100, digits when over
- Add [pause] markers between sections for natural breathing
- Keep the original tone and intent
- Do not add intros like "Welcome to..." unless the input asks for one
- Output only the final script, no commentary
Wire the Text Input's output to the Text node's prompt port.
I tested this with the rough notes above. Gemini returned a clean 28-second script in 0.9 seconds:
Here's something most DTC brands get wrong. They scale video ads on Meta
before testing on TikTok. [pause] The numbers tell the story. TikTok's
average cost per mille sits between four and six dollars. Meta? Ten to
fourteen. [pause] Test the creative on TikTok first. If it works there,
scale it on Meta with confidence. You'll spend less finding what converts.
The rough bullet points became a conversational script with natural pacing. That is the point of this node: you skip the manual rewriting step.
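If you want a quick check against a length target such as "under 30 sec" before spending synthesis credits, a rough estimate from word count is usually enough. The sketch below is not part of the flow and does not reproduce how the 28-second figure was measured. The speaking rate and pause length are assumptions chosen for illustration, and the actual duration depends on the voice you pick.

```python
import re

WORDS_PER_MINUTE = 150  # assumed speaking rate, not an ElevenLabs figure
PAUSE_SECONDS = 0.5     # assumed length of each [pause] marker


def estimate_duration_seconds(script: str) -> float:
    """Rough spoken length of a cleaned script, in seconds."""
    pauses = script.count("[pause]")
    # Drop the markers so "pause" is not counted as a spoken word.
    spoken = script.replace("[pause]", " ")
    words = re.findall(r"[A-Za-z0-9']+", spoken)
    return len(words) / WORDS_PER_MINUTE * 60 + pauses * PAUSE_SECONDS
```

Treat the result as a sanity check only. If the estimate is well over your target, tighten the raw notes before running the Audio node.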
Step 3: Generate the voiceover with ElevenLabs
Add an Audio node. Select ElevenLabs as the provider.
Wire the Text node's output to the Audio node's text port. Pick a voice from the dropdown. I used "Drew" for a casual, mid-range male voice. Other good options:
- Rachel: warm, professional female voice (good for product demos)
- Sarah: clear, neutral female voice (good for explainers)
- Antoni: deep male voice (good for brand ads)
- Fin: energetic male voice (good for YouTube intros)
The Audio node sends the cleaned script to ElevenLabs and returns an MP3 file. Synthesis takes 2-5 seconds depending on script length.
For the 28-second script above, the output was a 26-second MP3 at 128 kbps, with clean pronunciation, natural cadence, and correct emphasis on the numbers.
Step 4: Collect the output
Add an Output node. Wire two connections into it:
- Audio node output to the Output node (audio port)
- Text node output to the Output node (text port)
Wiring the text output alongside the audio gives you the final script for reference, subtitles, or show notes.
Step 5: Run the flow
Click Run in the toolbar. The canvas executes in dependency order:
- Trigger fires
- Text Input resolves (your raw script)
- Gemini rewrites for speech (~1s)
- ElevenLabs synthesizes audio (~3s)
- Output collects everything
Total wall-clock time on my test run: 4.2 seconds. Open the Execution Log in the bottom panel to see per-node timing, token counts, and any errors.
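"Dependency order" means a node runs only after every node wired into it has finished. The sketch below illustrates that idea with Python's standard `graphlib` module, using the node names from the table at the top. It is only a model of the ordering. How PlugNode schedules nodes internally is not described here, so treat the graph structure as an assumption drawn from the wiring in Steps 1 to 4.

```python
from graphlib import TopologicalSorter

# Each node maps to the set of nodes it depends on (its incoming wires).
flow = {
    "manual-trigger": set(),
    "text-input": {"manual-trigger"},
    "text": {"text-input"},
    "audio": {"text"},
    "output": {"audio", "text"},  # Step 4 wires both audio and text in
}

# static_order() yields each node only after all of its dependencies.
order = list(TopologicalSorter(flow).static_order())
print(order)
```

Because the Output node depends on both the Audio and Text nodes, it always resolves last, which matches the order in the list above.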
Download the MP3 from the execution panel. Drop it into your video editor, podcast DAW, or ad platform.
Publishing as an API
Once the flow works manually, you can automate it. Replace the Manual Trigger with an HTTP Trigger node. Add a Respond to Webhook node wired to your Output.
Hit Publish in the top bar. PlugNode generates a signed URL:
POST https://plugnode.ai/api/trigger/{secret}/{nodeId}
Send a JSON body with your raw script:
{
  "script": "Your rough notes or bullet points here..."
}
The endpoint returns the audio file and cleaned script in the response. Append ?wait=true for synchronous delivery.
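A minimal client sketch for the published endpoint, using only the Python standard library. The URL shape, the `script` field, and the `?wait=true` parameter come from the text above. The function names are hypothetical, and the response format is not specified in this tutorial, so the sketch returns the raw response bytes and leaves parsing to you.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "https://plugnode.ai/api/trigger"


def build_trigger_request(secret: str, node_id: str, script: str,
                          wait: bool = True) -> urllib.request.Request:
    """Build the POST request for a published flow (hypothetical helper)."""
    url = f"{BASE_URL}/{quote(secret, safe='')}/{quote(node_id, safe='')}"
    if wait:
        url += "?wait=true"  # synchronous delivery, per the text above
    body = json.dumps({"script": script}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def trigger_voiceover(secret: str, node_id: str, script: str,
                      timeout: float = 60.0) -> bytes:
    """Send the request and return the raw response body.

    The response layout is an unknown here; inspect the Content-Type
    header on a real call before deciding how to decode it.
    """
    req = build_trigger_request(secret, node_id, script, wait=True)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()

# Usage (needs a real secret and node ID from your published flow):
# payload = trigger_voiceover("YOUR_SECRET", "YOUR_NODE_ID", "Intro for ep 47 ...")
```

Keeping request construction separate from sending makes the URL and body easy to inspect or log without touching the network.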
Use cases for the API version:
- E-commerce: trigger from your product catalog webhook. Every new SKU auto-generates a voiceover for its product video.
- Podcasts: batch-process episode scripts from your CMS. Push a button, get all voiceovers for the week.
- YouTube: wire it into your production pipeline. Paste the script in your project management tool, webhook fires, voiceover appears in your shared drive.
Rate limit: 60 requests per minute per trigger.
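For batch jobs, it is simpler to pace requests on the client than to recover from rejected calls. The sketch below is one way to stay under 60 requests per minute with a sliding window. The class name is hypothetical, and how the server responds when the limit is exceeded is not covered in this tutorial, so this handles only the client side.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `limit` calls per `window` seconds (client-side only)."""

    def __init__(self, limit=60, window=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.limit = limit
        self.window = window
        self.clock = clock    # injectable so the limiter can be tested
        self.sleep = sleep
        self.sent = deque()   # timestamps of recent calls, oldest first

    def acquire(self):
        """Block until another call is allowed, then record it."""
        now = self.clock()
        # Forget calls that have already left the window.
        while self.sent and now - self.sent[0] >= self.window:
            self.sent.popleft()
        if len(self.sent) >= self.limit:
            # Wait until the oldest recorded call ages out.
            self.sleep(self.window - (now - self.sent[0]))
            now = self.clock()
            self.sent.popleft()
        self.sent.append(now)
```

Call `acquire()` immediately before each request in your batch loop. One limiter instance per trigger matches the per-trigger limit stated above.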
Cost comparison: AI voiceover vs. hiring voice talent
Here is what I paid across five test runs of varying script lengths:
| Script length | Gemini cost | ElevenLabs cost | Total | Time |
|---|---|---|---|---|
| 15 seconds | $0.0005 | $0.003 | ~$0.004 | 3s |
| 30 seconds | $0.001 | $0.005 | ~$0.006 | 4s |
| 60 seconds | $0.001 | $0.009 | ~$0.01 | 6s |
| 2 minutes | $0.002 | $0.018 | ~$0.02 | 10s |
| 5 minutes | $0.003 | $0.04 | ~$0.04 | 18s |
Compare that to hiring voice talent:
| Method | Cost per minute | Turnaround |
|---|---|---|
| Freelance (Fiverr/Upwork) | $25-75 | 1-3 days |
| Professional studio | $100-300 | 3-7 days |
| PlugNode flow | ~$0.01 | 4-6 seconds |
The quality gap has narrowed. ElevenLabs voices sound natural enough for product demos, podcast intros, internal training, and social ads. For flagship brand campaigns where a specific voice actor matters, hire a human. For everything else, this flow handles it.
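The ~$0.01 per-minute figure in the comparison table can be checked against the five test runs. The snippet below uses only the numbers from the first table and normalizes each run to a per-minute cost.

```python
# (script length in seconds, Gemini cost, ElevenLabs cost) from the test runs
runs = [
    (15, 0.0005, 0.003),
    (30, 0.001, 0.005),
    (60, 0.001, 0.009),
    (120, 0.002, 0.018),
    (300, 0.003, 0.04),
]


def cost_per_minute(seconds: float, gemini: float, elevenlabs: float) -> float:
    """Combined cost of one run, scaled to one minute of audio."""
    return (gemini + elevenlabs) / seconds * 60


for seconds, gemini, elevenlabs in runs:
    rate = cost_per_minute(seconds, gemini, elevenlabs)
    print(f"{seconds:>4}s  ${rate:.4f}/min")
```

The per-minute cost falls from about $0.014 for the 15-second run to about $0.009 for the 5-minute run, so ~$0.01 is a fair summary across lengths.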
Troubleshooting
ElevenLabs returns "quota exceeded." Check your ElevenLabs dashboard for character limits on your plan tier. The free tier allows 10,000 characters per month. Upgrade or wait for the monthly reset.
Gemini rewrites too aggressively. Add a constraint to the system prompt: "Preserve the original wording as much as possible. Only fix formatting for speech." This keeps the model from paraphrasing your content.
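Audio sounds robotic or rushed. Tighten the [pause] rule in the Gemini system prompt so the model inserts markers more often, for example between sentences rather than only between sections. You can also try a different ElevenLabs voice. Some voices handle casual scripts better than others.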
The flow runs but the output node is empty. Check that the Audio node's output wire connects to the Output node's audio port. A common mistake is wiring to the text port instead.
What's next
This flow handles single-voice narration. For more complex production, consider these extensions:
- Add a second Audio node with a different voice to generate A/B variants of the same script
- Chain a Music node (Lyria) after the Audio node to generate background music that matches the voiceover tone
- Wire the API version into a Zapier zap or Make scenario to trigger voiceovers from Google Sheets rows
The full use case page is at /use-cases/ai-voiceover-generator.
FAQ
Can I use my own ElevenLabs custom voice?
Yes. If you have cloned voices on your ElevenLabs account, they appear in the Audio node's voice dropdown. Select your custom voice the same way you'd select a stock voice.
What audio formats does the output support?
The Audio node returns MP3 by default. ElevenLabs also supports PCM and OGG, but the node currently outputs MP3. Convert downstream if you need WAV or FLAC.
Can I generate voiceovers in languages other than English?
Yes. Gemini handles multilingual script cleanup, and ElevenLabs supports 29 languages. Set the language in the Audio node config. The voice selection changes based on the language you pick.
How long can the script be?
ElevenLabs accepts up to 5,000 characters per request on most plan tiers. That covers roughly 3-4 minutes of spoken audio. For longer scripts, split them into chunks and run the flow once per chunk.
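Splitting is best done at sentence boundaries so no chunk ends mid-phrase. The sketch below is one way to do that. The function name is hypothetical, the 5,000-character default comes from the limit quoted above, and the sentence detection is a simple punctuation rule that will not handle every abbreviation.

```python
import re


def split_script(script: str, max_chars: int = 5000) -> list[str]:
    """Split at sentence boundaries so each chunk fits in one request."""
    sentences = re.split(r"(?<=[.!?])\s+", script.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is hard-cut as a last resort.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Run the flow once per returned chunk and join the MP3 files in your editor. Since the cleaned script ends sentences before each [pause] marker, the markers stay attached to the start of the following sentence.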
Does PlugNode store my audio files?
Audio files are stored in your workspace storage for download. They count toward your workspace quota (default 5GB). You can delete them from the Files page after downloading.