Gemini vs OpenAI: Test Both on Your Own Prompts in One Flow
Run the same prompt through Gemini and OpenAI in parallel and compare output quality, speed, and cost on your actual content tasks.
Public benchmarks rank models on tasks you will never run. I needed to know which model writes better product descriptions for my catalog, not which one scores higher on MMLU. So I built a flow that sends the same prompt to Gemini 2.5 Pro and GPT-4.1 at the same time, collects both responses, and lets me compare them side by side.
This tutorial shows you how to build that flow, run it on a real prompt, and read the execution data that tells you which model is faster, cheaper, and better for your specific content.
What you'll build
A 5-node flow that fans one prompt out to two models in parallel:
- Manual Trigger starts the run
- Text Input holds your prompt
- Text node (Gemini 2.5 Pro) generates response A
- Text node (OpenAI GPT-4.1) generates response B
- Output collects both results for comparison
Both Text nodes run at the same time. No waiting for one to finish before the other starts.
| Node | Type | Model / Provider | Purpose |
|---|---|---|---|
| manual-trigger | Trigger | None | Starts the flow |
| text-input | Input | None | Holds your test prompt |
| gemini-text | Generation | Gemini 2.5 Pro | Response A |
| openai-text | Generation | OpenAI GPT-4.1 | Response B |
| output | Output | None | Collects both responses |
Why benchmarks don't answer your question
Benchmark suites test coding, math, reasoning, and academic trivia. None of them test "write a product description for a leather jacket in a confident, minimal tone." None test "generate three Instagram captions for a new skincare line."
Your prompts have brand voice rules, length constraints, and format requirements that no leaderboard captures. The only benchmark that matters is running your prompt through both models and reading the output.
That is what this flow does.
Prerequisites
- A PlugNode account (free tier works)
- API keys added in Settings: one for Gemini, one for OpenAI
- A prompt you want to test (product description, ad copy, social caption, whatever you produce regularly)
Open a blank canvas from your dashboard. Everything below happens on that canvas.
Step 1: Add the trigger and text input
Drag a Manual Trigger node onto the canvas. This fires the flow when you click Run.
Add a Text Input node. Label it "Test Prompt." Paste the prompt you want to compare. For this walkthrough, I used:
Write a product description for a full-grain leather jacket.
Target audience: men aged 28-40 who value craftsmanship.
Tone: confident, minimal, no hype.
Length: 3-4 sentences.
Wire the Manual Trigger to the Text Input node.
Step 2: Fan out to both models
Add two Text nodes side by side on the canvas. Open the first one's config panel and select Gemini 2.5 Pro. Open the second one's config panel and select OpenAI GPT-4.1.
Wire the Text Input node's output to both Text nodes. This is the fan-out. One prompt, two destinations, parallel execution.
No system prompt is needed here. The prompt itself contains all the instructions. If you want to test system prompts, add them in each Text node's config. Keep them identical across both nodes so you are testing the model, not the instructions.
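If you want a mental model of what the fan-out does, it amounts to handing one prompt to two callables at the same time and timing each. The sketch below is a minimal illustration under stated assumptions: `call_gemini` and `call_openai` are hypothetical stand-ins that return canned text, not real SDK calls, and nothing here reflects how PlugNode implements the step internally.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def call_gemini(prompt: str) -> str:
    # Hypothetical stand-in for a real Gemini request.
    return f"[gemini] {prompt}"


def call_openai(prompt: str) -> str:
    # Hypothetical stand-in for a real OpenAI request.
    return f"[openai] {prompt}"


def timed_call(fn, prompt):
    # Run one model call and record how long it took, in seconds.
    start = time.perf_counter()
    text = fn(prompt)
    return {"text": text, "duration_s": time.perf_counter() - start}


def fan_out(prompt, models):
    # Submit the same prompt to every model at once, then gather results by name.
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        futures = {
            name: pool.submit(timed_call, fn, prompt) for name, fn in models.items()
        }
        return {name: future.result() for name, future in futures.items()}


results = fan_out(
    "Write a product description for a full-grain leather jacket.",
    {"gemini-text": call_gemini, "openai-text": call_openai},
)
for name, result in results.items():
    print(name, round(result["duration_s"], 4), result["text"])
```

Both callables receive the identical string, which is the property that makes the comparison fair.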
Step 3: Collect both outputs
Add an Output node. Wire both Text nodes into it. The Output node receives two text results, one from each model.
Your canvas should now look like a diamond: trigger at the top, input below it, two Text nodes branching left and right, output at the bottom.
Step 4: Run and compare
Click Run in the toolbar. The flow executes in order:
- Trigger fires
- Text Input resolves your prompt
- Gemini and OpenAI process the prompt in parallel
- Output collects both responses
On my test run with the leather jacket prompt, here is what came back.
Gemini 2.5 Pro response:
Full-grain leather, vegetable-tanned in Tuscany. The hide develops a patina
over years, not months. Four interior pockets, YKK hardware, clean seams
throughout. This jacket fits like a decision you won't second-guess.
OpenAI GPT-4.1 response:
Built from full-grain cowhide with a matte finish that ages well. The cut
is trim without being tight, with enough room to layer a henley underneath.
Interior zip pocket. Lined in cotton twill. A jacket for people who skip
the logo and keep the quality.
Both are solid. Gemini leaned into craft details (tanning origin, patina). OpenAI focused on fit and wearability. Which one is "better" depends on your brand and your audience. That is the whole point: you decide with your own eyes, on your own prompt.
Step 5: Read the execution data
Open the Execution Log in the bottom panel after the run completes. Each node shows:
- Duration: how long the model took to respond
- Token count: input tokens + output tokens
- Status: success or error
On my leather jacket run:
| Metric | Gemini 2.5 Pro | OpenAI GPT-4.1 |
|---|---|---|
| Duration | 1.4s | 1.8s |
| Input tokens | 52 | 52 |
| Output tokens | 61 | 68 |
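Gemini was 0.4s faster on this prompt. Token counts were close: identical input, and 7 more output tokens from OpenAI. Cost depends on your provider pricing, but you can estimate it from the token counts: input tokens times the input rate, plus output tokens times the output rate. At published rates, this single run cost under $0.01 total for both models combined.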
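The estimate is simple arithmetic, sketched below. The token counts come from the run above; the rates are placeholders, not real prices, so swap in the current per-million-token rates from each provider's pricing page before trusting the result.

```python
def estimate_cost(input_tokens, output_tokens, input_rate_per_m, output_rate_per_m):
    # Rates are in USD per one million tokens.
    return (
        input_tokens * input_rate_per_m + output_tokens * output_rate_per_m
    ) / 1_000_000


# Placeholder rates (USD per 1M tokens). These are NOT real prices.
PLACEHOLDER_RATES = {
    "gemini-text": {"input": 2.0, "output": 10.0},
    "openai-text": {"input": 2.0, "output": 10.0},
}

# Token counts from the leather jacket run.
RUN_TOKENS = {
    "gemini-text": {"input": 52, "output": 61},
    "openai-text": {"input": 52, "output": 68},
}

costs = {
    node: estimate_cost(
        tokens["input"],
        tokens["output"],
        PLACEHOLDER_RATES[node]["input"],
        PLACEHOLDER_RATES[node]["output"],
    )
    for node, tokens in RUN_TOKENS.items()
}
total = sum(costs.values())

for node, cost in costs.items():
    print(f"{node}: ${cost:.6f}")
print(f"total: ${total:.6f}")
```

With these placeholder rates the combined total comes to roughly $0.0015, which is consistent with the "under $0.01" figure but should not be read as an actual bill.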
Click into any node in the execution log to see the full input/output pair. This is useful when debugging longer prompts or spotting differences in how each model interprets your instructions.
When to pick Gemini vs OpenAI
After running dozens of prompts through this flow across different content types, here are the patterns I found.
Product descriptions: Gemini tends to produce tighter copy with more specific details (materials, origin, construction). OpenAI writes longer, more narrative descriptions. Pick Gemini for spec-heavy listings. Pick OpenAI for storytelling-style product pages.
Ad copy: Both perform well on short-form ads. OpenAI is slightly better at matching casual tone for social ads. Gemini sticks closer to the brief, which is better for Google Ads where character limits matter.
Social captions: OpenAI produces more varied sentence structures across multiple generations. Gemini repeats patterns more often. For batching 10+ captions, OpenAI gave me less duplicate phrasing.
Speed: Gemini 2.5 Pro was faster on 7 out of 10 test prompts. The gap ranged from 0.2s to 1.1s per call. For single runs this does not matter. For batch workflows hitting the API hundreds of times, it adds up.
These patterns held on my prompts with my instructions. Yours may differ. That is why you run the test yourself.
Extending the flow
Add a third model
Drop another Text node onto the canvas. Select a different model (Claude, for example, if supported in your setup). Wire the same Text Input into it. Wire its output into the same Output node. Now you are comparing three models per run.
Add a judge model
Want an automated evaluation? Add one more Text node after the Output node. Wire all the model outputs into it (three, if you added the third model above). Set its system prompt to:
You are a copy editor. You will receive outputs from multiple AI models
responding to the same prompt. Rank them 1 to N on: accuracy to the brief,
tone match, and conciseness. Explain your ranking in 2-3 sentences.
This gives you a structured comparison every run. It is not a replacement for human judgment, but it catches obvious misses (wrong tone, ignored constraints, hallucinated details).
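The judge needs every response in a single message alongside the original prompt. One way to assemble that message is sketched below. The function name, the letter labels, and the idea of hiding model names from the judge are all assumptions made for illustration, not something PlugNode does for you.

```python
def build_judge_input(original_prompt, responses):
    # Label each response with a letter so the judge sees A, B, C... rather than model names.
    lines = ["Original prompt:", original_prompt, ""]
    label_key = {}
    for index, (model, text) in enumerate(responses.items()):
        label = chr(ord("A") + index)
        label_key[label] = model
        lines += [f"Response {label}:", text, ""]
    return "\n".join(lines).rstrip(), label_key


judge_input, label_key = build_judge_input(
    "Write a product description for a full-grain leather jacket.",
    {
        "gemini-text": "Full-grain leather, vegetable-tanned in Tuscany.",
        "openai-text": "Built from full-grain cowhide with a matte finish that ages well.",
    },
)
print(judge_input)
print(label_key)
```

Keeping the label-to-model mapping outside the judge's input lets you map the ranking back to models afterwards without the judge knowing which model wrote which response.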
Batch multiple prompts
Create a spreadsheet of your 20 most common prompt types. Run each one through the flow. After 20 runs, open the execution history and compare aggregate timing and token usage across models. This gives you a data-backed recommendation for your team, not a gut feeling.
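If you copy the per-node numbers out of the execution history, the aggregation itself is a few lines. The sketch below assumes a hypothetical record shape (one dict per node per run, entered by hand). Run 1 uses the leather jacket figures from this tutorial; run 2 is made-up data included only to show the aggregation.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical record shape: one dict per node per run.
runs = [
    # Run 1: the leather jacket run from this tutorial.
    {"run": 1, "node": "gemini-text", "duration_s": 1.4, "input_tokens": 52, "output_tokens": 61},
    {"run": 1, "node": "openai-text", "duration_s": 1.8, "input_tokens": 52, "output_tokens": 68},
    # Run 2: made-up values, included only to show aggregation.
    {"run": 2, "node": "gemini-text", "duration_s": 2.0, "input_tokens": 80, "output_tokens": 90},
    {"run": 2, "node": "openai-text", "duration_s": 1.6, "input_tokens": 80, "output_tokens": 100},
]


def summarize(records):
    # Group records by node, then compute run count, mean duration, and total tokens.
    by_node = defaultdict(list)
    for record in records:
        by_node[record["node"]].append(record)
    summary = {}
    for node, rows in by_node.items():
        summary[node] = {
            "runs": len(rows),
            "mean_duration_s": round(mean(r["duration_s"] for r in rows), 3),
            "total_tokens": sum(r["input_tokens"] + r["output_tokens"] for r in rows),
        }
    return summary


summary = summarize(runs)
for node, stats in summary.items():
    print(node, stats)
```

Feed the total token counts into your cost estimate to get a per-model cost for the whole batch.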
Publishing as an API
Replace the Manual Trigger with an HTTP Trigger and add a Respond to Webhook node wired to your Output. Hit Publish. PlugNode generates a signed URL:
POST https://plugnode.ai/api/trigger/{secret}/{nodeId}
Send a JSON body with your prompt. The endpoint returns both model responses in one payload. Use this to integrate model comparison into your content pipeline, CI checks, or quality assurance workflows.
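A call to the published endpoint could look like the sketch below, using only the Python standard library. The URL pattern is the one shown above. The `prompt` field name and the `YOUR_SECRET` / `YOUR_NODE_ID` values are assumptions, so check your flow's HTTP Trigger configuration for the body shape it actually expects.

```python
import json
import urllib.request


def build_trigger_request(secret, node_id, prompt):
    # URL pattern from the article; the "prompt" field name is an assumption.
    url = f"https://plugnode.ai/api/trigger/{secret}/{node_id}"
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_trigger_request(
    "YOUR_SECRET",
    "YOUR_NODE_ID",
    "Write a product description for a full-grain leather jacket.",
)
print(request.get_method(), request.full_url)

# To actually send it (requires a published flow and real values):
# with urllib.request.urlopen(request, timeout=60) as response:
#     payload = json.loads(response.read().decode("utf-8"))
```

Because the secret is part of the URL, treat the whole URL like a credential and keep it out of logs and version control.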
FAQ
Gemini vs OpenAI Side by Side: Common Questions
How much does one comparison run cost?
Both providers charge per token. A typical prompt (50-100 input tokens) with a short response (50-150 output tokens) costs under $0.01 total across both models. PlugNode adds no markup. You pay Gemini and OpenAI at their published rates.
Can I compare image or video models the same way?
Not with this exact flow. Image and Video nodes use different output types. You could build a similar fan-out with two Image nodes (one using Nano Banana, one using GPT Image) and collect both in an Output node. The comparison principle is the same.
Do both models always get identical inputs?
Yes. The Text Input feeds the same string to both Text nodes. If you add a system prompt, make sure you copy it identically into both node configs. Any difference in input means you are testing the prompt, not the model.
Can I test more than two models at once?
Yes. Add as many Text nodes as you need. Wire them all from the same input. Wire them all into the same output. The flow runs them in parallel regardless of count.
How do I see historical runs for comparison?
Open the Run History panel from the flow toolbar. Each run is logged with per-node timing, token counts, and full input/output data. You can scroll back through previous runs to compare how the same prompt performed across different sessions or after you changed your system prompt.
What if one model fails and the other succeeds?
The flow continues. A failed node shows an error status in the execution log, and the Output node still collects the successful response. You will see exactly which model failed and why (rate limit, invalid key, timeout) in the log detail.
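The same behavior is easy to reproduce if you ever run the comparison in your own code: give each branch its own status so one failure does not discard the other result. A minimal sketch, assuming two hypothetical stand-in functions (one succeeds, one raises); this illustrates the pattern and is not PlugNode's implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def working_model(prompt):
    # Hypothetical stand-in for a call that succeeds.
    return f"response to: {prompt}"


def failing_model(prompt):
    # Hypothetical stand-in for a call that hits a rate limit.
    raise RuntimeError("rate limit")


def collect(prompt, models):
    # Each branch gets its own status, so one failure does not discard the other result.
    results = {}
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        futures = {name: pool.submit(fn, prompt) for name, fn in models.items()}
        for name, future in futures.items():
            try:
                results[name] = {"status": "success", "text": future.result()}
            except Exception as error:
                results[name] = {"status": "error", "reason": str(error)}
    return results


results = collect("test prompt", {"model-a": working_model, "model-b": failing_model})
print(results)
```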
The full use case page is at /use-cases/multi-model-ab-testing. For details on publishing flows as APIs, see How to Publish an AI Flow as a Production API.