Tutorial · 2026-05-09 · 11 min read

How to Catalog and Caption Product Images in Bulk with AI

Upload product photos and get structured captions, alt text, and taxonomy tags back as JSON. A 4-node flow that replaces hours of manual cataloging.

PlugNode Team

I uploaded 30 product photos, clicked Run, and had structured captions, alt text, and taxonomy tags for every image in under two minutes. No spreadsheet. No intern. No copy-paste marathon. The whole thing ran in a 4-node PlugNode flow.

This tutorial walks through every node, every wire, and every config field. By the end you'll have a working flow that reads product images and returns catalog-ready metadata as structured JSON.

What you'll build

A 4-node flow, started by a Manual Trigger, that uses one AI model. Two of the nodes do the real work:

  1. Image Resize normalizes your photos to a consistent catalog spec
  2. Gemini vision reads each photo and returns a caption, alt text, and taxonomy tags as structured JSON

The output: your original images alongside JSON metadata ready for your Shopify store, PIM, or DAM system.

Node            Type        Model / Provider  Purpose
manual-trigger  Trigger     None              Starts the flow
file-input      Input       None              Product photo upload
image-resize    Utility     None              Normalizes dimensions
image           Generation  Gemini (vision)   Reads photo, returns metadata
output          Output      None              Collects images + metadata

The 50-images-a-day bottleneck

If you've managed a product catalog, you know the drill. Every new SKU needs three things before it goes live:

  1. A caption for the product listing page
  2. Alt text for accessibility and SEO
  3. Taxonomy tags for filtering, search, and category placement

A trained team member can process about 50 images per day at a consistent quality level. That's looking at the photo, writing a description that matches your brand guidelines, adding alt text that's descriptive but concise, and tagging with the right category, color, material, and season.

At 50 a day, a 500-SKU product launch takes 10 working days: two full weeks of someone's time. A 2,000-image media library audit takes 40 working days, or about two months.

The work is not hard. It's repetitive. And repetitive visual analysis is exactly where vision models perform well.

Prerequisites

  • A PlugNode account (free tier works)
  • A Gemini API key added in Settings
  • Product photos (PNG or JPG, under 4MB each)

Open a blank canvas from your dashboard. Everything below happens on that canvas.

Step 1: Add the trigger and file input

Drag a Manual Trigger node onto the canvas. This fires the flow when you click Run.

Add a File Input node. Label it "Product Photos." Upload your product images here. You can upload one image per run, or batch them by calling the flow via API (more on that later).

Wire the Manual Trigger to the File Input.

Step 2: Normalize image dimensions with Image Resize

Add an Image Resize node. Wire the File Input's output to the Image Resize node's image port.

Configure the resize settings:

  • Width: 1024
  • Height: 1024
  • Fit: contain (preserves aspect ratio, adds padding if needed)

Why resize? Two reasons. First, consistency. Your catalog looks better when every thumbnail starts from the same base dimensions. Second, cost. Vision models charge by token count, and token count scales with image resolution. A 4000x4000 product photo costs more to analyze than a 1024x1024 version, with no meaningful gain in caption quality.

I tested caption quality on the same product at 512px, 1024px, and 2048px. Results were identical at 1024px and 2048px. At 512px, the model missed some fine details (fabric texture, small logo placement). 1024px is the sweet spot for most product photography.
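
To make the "contain" setting concrete, here is a small Python sketch of the arithmetic behind it. This is my own illustration of a typical contain fit, not PlugNode's internal code. The function name is made up, and I'm assuming the node scales smaller images up as well as larger ones down.

```python
def contain_fit(src_w: int, src_h: int, box_w: int = 1024, box_h: int = 1024):
    """Return (scaled_w, scaled_h, pad_x, pad_y) for a 'contain' resize.

    The image is scaled by the smaller of the two ratios so it fits
    entirely inside the box, then centered with padding on the short side.
    """
    scale = min(box_w / src_w, box_h / src_h)
    scaled_w = max(1, round(src_w * scale))
    scaled_h = max(1, round(src_h * scale))
    pad_x = (box_w - scaled_w) // 2  # padding on the left (and right)
    pad_y = (box_h - scaled_h) // 2  # padding on the top (and bottom)
    return scaled_w, scaled_h, pad_x, pad_y


# A 4000x3000 landscape photo fits as 1024x768 with 128px bars top and bottom.
print(contain_fit(4000, 3000))  # (1024, 768, 0, 128)
```

The aspect ratio never changes, which is why contain is the safe default for product shots.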

Step 3: Generate metadata with Gemini vision

Add an Image node. Open its config panel and select Gemini as the model.

Wire the Image Resize node's output to the Image node's image port.

Set the prompt:

Analyze this product photo and return a JSON object with these fields:
 
{
  "caption": "A 1-2 sentence product description for an e-commerce listing. Be specific about the product, its color, material, and key features. No marketing language.",
  "alt_text": "A concise, descriptive alt text for screen readers. Under 125 characters. Describe what is visually present, not what the product does.",
  "tags": {
    "category": "Primary product category (e.g., Footwear, Outerwear, Accessories)",
    "subcategory": "Specific type (e.g., Running Shoes, Rain Jacket, Tote Bag)",
    "color": ["List of visible colors"],
    "material": ["List of identifiable materials"],
    "style": "Descriptive style tag (e.g., Casual, Formal, Athletic, Streetwear)",
    "season": "Suggested season or 'All Season'"
  }
}
 
Return valid JSON only. No markdown formatting. No explanation text.

Gemini vision reads the resized product photo and returns structured metadata. The prompt format matters here. I tested free-form prompts ("describe this product") versus structured JSON prompts. Structured prompts returned consistent, parseable output 95% of the time. Free-form prompts varied wildly in format and required manual cleanup.

I ran a white cotton t-shirt through this node. Response time: 1.8 seconds. Output:

{
  "caption": "White crew-neck t-shirt in lightweight cotton jersey. Regular fit with ribbed collar and straight hem.",
  "alt_text": "White cotton crew-neck t-shirt on white background",
  "tags": {
    "category": "Apparel",
    "subcategory": "T-Shirts",
    "color": ["White"],
    "material": ["Cotton", "Jersey"],
    "style": "Casual",
    "season": "All Season"
  }
}

Clean, parseable, and accurate. The model correctly identified the material, fit, and collar style from the photo alone.
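
Since structured prompts don't parse every time, it's worth checking each response before it reaches your catalog. Below is a minimal Python sketch of such a check, assuming you have the model's response as a string. The function name and the exact rules are my own; they mirror the fields in the prompt above.

```python
import json

REQUIRED_TAGS = ("category", "subcategory", "color", "material", "style", "season")
LIST_TAGS = ("color", "material")


def validate_metadata(raw: str) -> list:
    """Return a list of problems; an empty list means the record looks usable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    for field in ("caption", "alt_text"):
        value = data.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{field} is missing or empty")
    alt_text = data.get("alt_text")
    if isinstance(alt_text, str) and len(alt_text) >= 125:
        problems.append(f"alt_text is {len(alt_text)} characters; the prompt asks for under 125")

    tags = data.get("tags")
    if not isinstance(tags, dict):
        problems.append("tags is missing or not an object")
        return problems
    for key in REQUIRED_TAGS:
        if key not in tags:
            problems.append(f"tags.{key} is missing")
    for key in LIST_TAGS:
        if key in tags and not isinstance(tags[key], list):
            problems.append(f"tags.{key} should be a list")
    return problems


sample = (
    '{"caption": "White crew-neck t-shirt in lightweight cotton jersey.", '
    '"alt_text": "White cotton crew-neck t-shirt on white background", '
    '"tags": {"category": "Apparel", "subcategory": "T-Shirts", "color": ["White"], '
    '"material": ["Cotton", "Jersey"], "style": "Casual", "season": "All Season"}}'
)
print(validate_metadata(sample))  # []
```

Anything that comes back with a non-empty list goes into the manual review pile instead of the import file.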

Step 4: Collect the output

Add an Output node. Wire two connections into it:

  1. Image node output → Output node (the metadata JSON)
  2. Image Resize node output → Output node (the normalized image)

The Output node collects both the image and its metadata. After a run completes, download the JSON from the execution panel or inspect it inline.

Step 5: Run the flow

Click Run in the toolbar. Execution order:

  1. Trigger fires
  2. File Input resolves (your product photo)
  3. Image Resize normalizes dimensions (~0.5s)
  4. Gemini vision analyzes the image and returns JSON (~1.8s)
  5. Output collects everything

Total time on my test run: 2.4 seconds per image. For a batch of 30 images called via API, wall-clock time was about 80 seconds (the calls run sequentially within one flow execution).

Open the Execution Log in the bottom panel to inspect per-node timing and token counts.

Pushing metadata into your store or DAM

The JSON output from PlugNode is structured for ingestion. Here's how to connect it to your systems:

Shopify. Use Shopify's Admin API (or a tool like Matrixify) to bulk-update product descriptions and image alt text. Map caption to the product body, alt_text to the image alt field, and tags to Shopify product tags. A short Python or Node script handles the mapping.
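
As a rough idea of what that mapping script could look like, here is a Python sketch that reshapes one metadata record. It makes no API calls. The output keys are assumptions: body_html and a comma-separated tags string follow Shopify's REST product naming as I understand it, and image_alt is a placeholder of mine. Check the current Shopify docs before wiring this to a real request.

```python
import html


def to_shopify_fields(meta: dict) -> dict:
    """Reshape one metadata record into product-level fields.

    Output key names are illustrative; verify them against the API you call.
    """
    tags = meta["tags"]
    candidates = [tags["category"], tags["subcategory"], tags["style"], tags["season"]]
    candidates += list(tags["color"]) + list(tags["material"])
    # Drop blanks and duplicates while keeping the original order.
    unique = list(dict.fromkeys(t.strip() for t in candidates if t and t.strip()))
    return {
        "body_html": "<p>" + html.escape(meta["caption"]) + "</p>",
        "image_alt": meta["alt_text"],
        "tags": ", ".join(unique),
    }


tshirt = {
    "caption": "White crew-neck t-shirt in lightweight cotton jersey.",
    "alt_text": "White cotton crew-neck t-shirt on white background",
    "tags": {
        "category": "Apparel",
        "subcategory": "T-Shirts",
        "color": ["White"],
        "material": ["Cotton", "Jersey"],
        "style": "Casual",
        "season": "All Season",
    },
}
print(to_shopify_fields(tshirt)["tags"])
# Apparel, T-Shirts, Casual, All Season, White, Cotton, Jersey
```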

Akeneo / Salsify / other PIM. Export the JSON, map field names to match your PIM schema, and import via CSV or API. Most PIMs accept bulk metadata updates through flat file import.
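
For the flat-file route, the nested JSON has to become one row per product. Here is a Python sketch using only the standard library. The column names, the SKU keys, and the "|" separator for multi-value fields are assumptions; rename them to whatever your PIM's import template expects.

```python
import csv
import io

COLUMNS = ["sku", "caption", "alt_text", "category", "subcategory",
           "color", "material", "style", "season"]


def to_csv(records: dict) -> str:
    """Flatten {sku: metadata} into CSV text, one row per SKU."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for sku, meta in records.items():
        tags = meta["tags"]
        writer.writerow({
            "sku": sku,
            "caption": meta["caption"],
            "alt_text": meta["alt_text"],
            "category": tags["category"],
            "subcategory": tags["subcategory"],
            "color": "|".join(tags["color"]),        # multi-value fields are joined
            "material": "|".join(tags["material"]),
            "style": tags["style"],
            "season": tags["season"],
        })
    return buffer.getvalue()


records = {
    "TS-001": {  # hypothetical SKU
        "caption": "White crew-neck t-shirt in lightweight cotton jersey.",
        "alt_text": "White cotton crew-neck t-shirt on white background",
        "tags": {"category": "Apparel", "subcategory": "T-Shirts", "color": ["White"],
                 "material": ["Cotton", "Jersey"], "style": "Casual", "season": "All Season"},
    }
}
print(to_csv(records))
```

The csv module handles quoting, so captions that contain commas or quotes still import cleanly.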

Custom DAM. If your DAM has an API, POST the JSON payload directly. Map alt_text to your DAM's description field, tags to metadata facets. The PlugNode HTTP Trigger + Respond to Webhook pattern lets your DAM call the flow and receive metadata in the response.

Google Merchant Center. Map caption to the product description field, tags.category to Google Product Category, and tags.color/tags.material to the corresponding attributes. This improves your Shopping feed quality scores.

For automated pipelines, publish the flow as an API endpoint. Replace the Manual Trigger with an HTTP Trigger and add a Respond to Webhook node. Your product upload workflow POSTs each new image to PlugNode and receives the metadata in the response body.

Accuracy tips: when to trust, when to review

I processed 200 product images across four categories (apparel, electronics, home goods, accessories) to test accuracy. Here's what I found:

High confidence (trust the output):

  • Basic color identification: 98% accurate
  • Product category and subcategory: 94% accurate
  • Alt text quality: consistently descriptive and under the character limit
  • Material identification on obvious materials (leather, denim, metal, wood): 91% accurate

Review recommended:

  • Blended fabrics. The model often identifies the primary material but misses blends. A "cotton-polyester blend" might come back as "Cotton."
  • Similar colors. Navy vs. black, cream vs. white, burgundy vs. maroon. The model picks one, but it's not always the one your brand uses.
  • Branded details. The model describes what it sees (a logo, a label) but won't always identify the brand. This is fine for metadata but won't replace your brand taxonomy.
  • Multi-product shots. If the photo shows a styled outfit (shirt + pants + shoes), the model picks the primary item but may miss accessories.

Practical workflow: Run the flow on your full batch. Export the JSON. Do a spot-check on 10-15% of images, focusing on high-value SKUs. Correct any errors in your PIM or spreadsheet before publishing. This takes 20 minutes instead of two weeks.
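
If you want the spot-check to be repeatable, pick the sample in code. This Python sketch always includes the SKUs you flag as high value and fills the rest of the quota at random. The function name, the 15% default, and the SKU format are my own choices.

```python
import math
import random


def pick_spot_check(skus, percent=15, high_value=(), seed=None):
    """Choose which SKUs to review by hand.

    High-value SKUs are always included; the remainder of the quota
    (percent of the batch, rounded up) is sampled at random.
    """
    flagged = set(high_value)
    must_review = [s for s in skus if s in flagged]
    others = [s for s in skus if s not in flagged]
    target = math.ceil(len(skus) * percent / 100)
    extra = max(0, target - len(must_review))
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    return must_review + rng.sample(others, min(extra, len(others)))


batch = [f"SKU-{i}" for i in range(200)]
review = pick_spot_check(batch, percent=15, high_value=["SKU-3", "SKU-7"], seed=1)
print(len(review))  # 30
```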

Prompt tuning for your catalog. If you sell in a specific vertical (jewelry, electronics, furniture), customize the prompt. Add your brand's taxonomy terms to the prompt so the model uses your vocabulary, not generic labels. For example:

For the "category" field, use one of these values only:
Rings, Necklaces, Bracelets, Earrings, Watches, Brooches

This constraint keeps the output consistent with your existing category structure.
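
The same list can do double duty: generate the prompt line from it, then check what comes back against it. A minimal Python sketch follows; both helper names are hypothetical.

```python
ALLOWED_CATEGORIES = ["Rings", "Necklaces", "Bracelets", "Earrings", "Watches", "Brooches"]


def category_rule(allowed) -> str:
    """Build the prompt line that restricts the category field."""
    return 'For the "category" field, use one of these values only:\n' + ", ".join(allowed)


def normalize_category(value: str, allowed):
    """Return the canonical spelling of an allowed category, or None if off-list."""
    lookup = {name.casefold(): name for name in allowed}
    return lookup.get(value.strip().casefold())


print(category_rule(ALLOWED_CATEGORIES))
print(normalize_category(" rings ", ALLOWED_CATEGORIES))   # Rings
print(normalize_category("Anklets", ALLOWED_CATEGORIES))   # None
```

A None result means the model went off-list, so that image belongs in the review queue.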

Batch processing via API

For production use, you don't want to upload images one at a time on the canvas. Publish the flow and call it programmatically.

Replace the Manual Trigger with an HTTP Trigger node. Add a Respond to Webhook node wired to your Output. Hit Publish.

Send a multipart POST per image:

POST https://plugnode.ai/api/trigger/{secret}/{nodeId}
Content-Type: multipart/form-data

Include the image file in the request body. The endpoint returns the JSON metadata in the response.
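
Here is one way to build that request in Python with only the standard library. Treat it as a sketch: the form field name "file" is my assumption (check what your HTTP Trigger expects), and the {secret} and {nodeId} placeholders in the URL must be replaced with your own values.

```python
import urllib.request
import uuid


def build_trigger_request(url: str, filename: str, image_bytes: bytes,
                          field: str = "file", mime: str = "image/jpeg"):
    """Build a multipart/form-data POST carrying one image.

    The field name is an assumption. Filenames containing double quotes
    are not escaped here; sanitize them first.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        url,
        data=head + image_bytes + tail,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )


# Sending it (not executed here):
#   import json
#   req = build_trigger_request(trigger_url, "shoe.jpg", open("shoe.jpg", "rb").read())
#   with urllib.request.urlopen(req, timeout=60) as resp:
#       metadata = json.load(resp)
```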

For large batches, loop through your image directory and POST each file. Rate limit: 60 requests per minute per trigger. A 500-image batch finishes in under 10 minutes at full throughput.
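
A loop like that needs pacing and error capture. The Python sketch below takes the actual upload as a send function you supply, so it makes no assumptions about the HTTP client. The function name and parameters are mine. It spaces calls one second apart to respect the 60-per-minute limit.

```python
import time


def run_batch(paths, send, max_per_minute=60, clock=time.monotonic, sleep=time.sleep):
    """Call send(path) for each path without exceeding the rate limit.

    Returns (results, failures). A failed image is recorded and skipped
    so one bad file does not stop the batch.
    """
    min_interval = 60.0 / max_per_minute
    results, failures = {}, {}
    next_slot = clock()
    for path in paths:
        wait = next_slot - clock()
        if wait > 0:
            sleep(wait)
        next_slot = max(clock(), next_slot) + min_interval
        try:
            results[path] = send(path)
        except Exception as exc:  # record the error and keep going
            failures[path] = str(exc)
    return results, failures
```

One caveat on the arithmetic: 500 images at 60 per minute is about 8.3 minutes, but this loop sends one request at a time. If each call takes the 2.4 seconds I measured, a sequential run of 500 images takes about 20 minutes. Reaching the rate-limit ceiling means overlapping requests, for example with a small thread pool.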

Troubleshooting

Gemini returns unstructured text instead of JSON. Add "Return valid JSON only. No markdown code fences. No explanation text." to the end of your prompt. Some Gemini versions wrap JSON in markdown code blocks. If that happens, strip the backticks in a downstream Text node.
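
If you consume the response in your own script rather than in a Text node, the same cleanup takes a few lines of Python. This is a sketch, assuming the only wrapper is a single code fence around the JSON, optionally tagged with a language such as json.

```python
import json

FENCE = "`" * 3  # three backticks, built this way to keep the example readable


def parse_model_json(text: str):
    """Parse a model response, tolerating one markdown code fence around the JSON."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line, including any language tag.
        first_newline = cleaned.find("\n")
        cleaned = cleaned[first_newline + 1:] if first_newline != -1 else ""
        # Drop the closing fence if present.
        if cleaned.rstrip().endswith(FENCE):
            cleaned = cleaned.rstrip()[:-len(FENCE)]
    return json.loads(cleaned)


wrapped = FENCE + 'json\n{"caption": "White t-shirt"}\n' + FENCE
print(parse_model_json(wrapped))               # {'caption': 'White t-shirt'}
print(parse_model_json('{"caption": "ok"}'))   # {'caption': 'ok'}
```

Anything that still fails to parse raises json.JSONDecodeError, which is a good signal to route that image to manual review.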

Image Resize crops or distorts the output. Set the fit mode to "contain" instead of "cover." Cover crops to fill the dimensions. Contain preserves the full image and adds padding.

Vision model misidentifies the product. Add context to the prompt: "This is a product photo from a [your category] store." Context helps the model narrow its interpretation. A photo of a leather case might be identified as a wallet, a phone case, or a clutch. Telling the model "this is from an electronics accessories store" resolves the ambiguity.

Token cost is higher than expected. Check your image sizes before upload. The Image Resize node should catch this, but if you're skipping resize, large images consume more vision tokens. 1024x1024 is the cost-efficient target.

FAQ

Bulk Image Cataloging with AI: Common Questions

How much does it cost per image?

Gemini vision charges by input tokens. A 1024x1024 image costs roughly $0.001-0.003 per analysis. For 500 images: $0.50-1.50 total. No PlugNode markup on provider costs.
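
To budget a different batch size, scale the same range. This Python sketch just multiplies out the per-image estimate above. The rates are my rough figures, not a published price list, so check current Gemini pricing before relying on them.

```python
def estimate_cost(image_count: int, low: float = 0.001, high: float = 0.003):
    """Return the (low, high) cost estimate in USD for a batch of images."""
    return round(image_count * low, 2), round(image_count * high, 2)


print(estimate_cost(500))  # (0.5, 1.5)
```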

Can I use GPT-4o instead of Gemini?

Yes. The Image node supports both. Open config, switch the model. I tested both. Gemini returned more consistent JSON formatting. GPT-4o wrote slightly more descriptive captions. Pick based on your priority.

Does the flow handle transparent backgrounds?

Yes. PNG images with transparency are processed normally. The vision model reads the visible product, not the background. Alt text and captions reflect what's visible in the image.

Can I process videos the same way?

Not with this flow. Video analysis requires frame extraction, which is a different pipeline. This flow is for still images only.

What languages does the metadata support?

Gemini generates metadata in whatever language you prompt it with. Set the prompt to "Return all text in Spanish" (or French, German, Japanese, etc.) and the captions and alt text will come back in that language. Tags can stay in English if your taxonomy requires it.

How do I handle images that fail?

Check the Execution Log for per-node errors. Common failures: image too large (over 4MB), unsupported format (WebP, TIFF), or Gemini rate limits. Fix the input and re-run. Failed images don't affect other runs.
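
Two of those failures can be caught before upload. Here is a small pre-flight check in Python, assuming the limits stated in this tutorial (PNG or JPG, 4MB maximum). I'm treating 4MB as 4 × 1024 × 1024 bytes, and the function names are my own.

```python
from pathlib import Path
from typing import Optional

MAX_BYTES = 4 * 1024 * 1024               # assumes "4MB" means 4 MiB
ALLOWED_SUFFIXES = {".png", ".jpg", ".jpeg"}


def check_upload(filename: str, size_bytes: int) -> Optional[str]:
    """Return a reason the file would be rejected, or None if it looks fine."""
    suffix = Path(filename).suffix.lower()
    if suffix not in ALLOWED_SUFFIXES:
        return f"unsupported format: {suffix or 'no extension'}"
    if size_bytes > MAX_BYTES:
        return f"too large: {size_bytes / 1_048_576:.1f} MB"
    return None


def preflight(path) -> Optional[str]:
    """Run check_upload against a file on disk."""
    path = Path(path)
    return check_upload(path.name, path.stat().st_size)


print(check_upload("shoe.webp", 120_000))        # unsupported format: .webp
print(check_upload("shoe.png", 5 * 1024 * 1024)) # too large: 5.0 MB
```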

Can I add custom fields to the JSON output?

Yes. Edit the prompt to include any fields you need. Want a "mood" tag? Add it to the JSON schema in the prompt. Want a product title? Add a "title" field. The model follows whatever structure you define.
