The generative AI video landscape is exploding. In the span of a few months, we've seen quantum leaps in quality from models like OpenAI's Sora, Google's Veo, Runway, and Kling. For creatives and developers, this brings a critical new challenge: with so many powerful tools, how do you objectively compare them?

Testing one model with one prompt is simple. But how do you efficiently test four different models with ten different camera angles? How do you move beyond a "gut feeling" and gather real data on prompt adherence, motion fidelity, and creative range?

To solve this, I've designed and implemented a "one-to-many" parallel processing pipeline. This workflow, built on the Weavy node-based platform, acts as a powerful R&D testbed, allowing for the rapid, large-scale A/B testing of multiple generative video models simultaneously.

The Challenge: Moving Beyond Single-Prompt Tests

Traditional 1:1 testing is slow and unreliable. By the time you've run a prompt on Model A, tweaked it, run it on Model B, and tried to remember the nuances of the output, you're comparing apples to oranges. This manual process is a bottleneck to creative R&D.

I needed a scalable framework that could answer two key questions at the same time:

  1. Model vs. Model: Which model (Kling, Runway, Veo, etc.) produces the most coherent, realistic, and faithful result for the exact same prompt?
  2. Prompt vs. Prompt: How well does one specific model understand and execute different creative instructions (e.g., "dolly left" vs. "slow zoom in")?

Architecting the Parallel Benchmarking Pipeline

My solution is a node-based workflow that automates this entire A/B testing matrix. It takes one set of inputs and "fans out" to dozens of generation tasks that all run in parallel.
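Conceptually, the test matrix is just the cross product of models and prompt variations. The snippet below is an illustrative sketch of that idea in Python, not the actual Weavy graph; the model IDs and prompt strings are placeholders:

```python
from itertools import product

# Placeholder IDs and prompts -- in the real workflow these are Weavy nodes.
models = ["kling-video", "runway-gen4-turbo", "sora-2", "veo-3.1"]
prompts = [
    "a slow zoom in",
    "dolly shot from the left",
    "a slight upward tilt",
    "pan right",
]

# One input image fans out to every (model, prompt) combination.
test_matrix = list(product(models, prompts))
print(f"{len(test_matrix)} parallel generation tasks from a single input")
# -> 16 parallel generation tasks from a single input
```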

This graph provides a high-level view of the architecture. You can explore the interactive workflow directly on Weavy.

View the Live Workflow Graph on Weavy


The "one-to-many" workflow, fanning out from a single input to dozens of parallel generation tasks.

Step 1: Dynamic Prompt Generation

The pipeline kicks off with a single Input Image (like the octopus) and a Base Prompt. Rather than using that base prompt directly, the pipeline first feeds it into an LLM node (like anthropic/claude-3.5-sonnet) acting as a "Prompt Enhancer."

This enhanced description is then passed to a second LLM, a "Prompt Generator." This node's job is to create a structured list of distinct, ready-to-use prompts, testing key variables like the following (see the sketch after this list):

  • Camera Movements: "a slow zoom in," "dolly shot from the left," "a slight upward tilt," "pan right."
  • Aesthetic Styles: "cinematic 4k, high detail," "grainy 35mm film," "bioluminescent glow."
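A minimal sketch of that two-stage chain, assuming a generic `call_llm(model, prompt)` helper (hypothetical; in the actual workflow both stages are Weavy LLM nodes):

```python
import json
from typing import Callable

def generate_prompt_list(base_prompt: str, call_llm: Callable[[str, str], str]) -> list[str]:
    """Two-stage chain: enhance the base prompt, then expand it into a prompt list.

    `call_llm(model, prompt)` is a stand-in for whatever LLM client you use.
    """
    # Stage 1: "Prompt Enhancer" -- turn the terse base prompt into a rich description.
    enhanced = call_llm(
        "anthropic/claude-3.5-sonnet",
        f"Expand this into a detailed visual description for video generation: {base_prompt}",
    )
    # Stage 2: "Prompt Generator" -- produce a structured, ready-to-use prompt list,
    # varying camera movement and aesthetic style.
    raw = call_llm(
        "anthropic/claude-3.5-sonnet",
        "Return a JSON array of video prompts based on the description below, varying "
        "camera movement (zoom in, dolly left, tilt up, pan right) and aesthetic style "
        "(cinematic 4k, grainy 35mm film, bioluminescent glow).\n\n" + enhanced,
    )
    return json.loads(raw)
```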

Step 2: The "Fan-Out" & Parallel Processing

This is where the real power lies. The structured list of prompts is fed into an Array (Splitter) node, which triggers the "fan-out": each unique prompt is routed to its own dedicated set of generation tasks, while the original Input Image is passed to every generation node.
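Weavy handles the parallel dispatch inside the graph itself. As a rough mental model only, the same fan-out could be written with a thread pool, assuming a hypothetical generate_video(model, prompt, image) call:

```python
from concurrent.futures import ThreadPoolExecutor

def generate_video(model: str, prompt: str, image_path: str) -> str:
    """Hypothetical stand-in for a video-generation API call; returns an output path/URL."""
    raise NotImplementedError

def fan_out(models: list[str], prompts: list[str], image_path: str) -> dict[tuple[str, str], str]:
    """Dispatch every (model, prompt) pair in parallel against the same input image."""
    results: dict[tuple[str, str], str] = {}
    with ThreadPoolExecutor(max_workers=8) as pool:
        # Submit every combination up front so they all run concurrently.
        futures = {
            pool.submit(generate_video, m, p, image_path): (m, p)
            for m in models
            for p in prompts
        }
        for future, key in futures.items():
            results[key] = future.result()  # blocks until that task finishes
    return results
```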

Step 3: Simultaneous Two-Axis Comparison

The workflow grid is intentionally structured to provide a "two-axis" comparison in a single run. The output is a comprehensive batch of video variations that provides immediate, high-volume visual data.

As seen in the workflow, the test is set up to compare models like Kling Video, Runway Gen-4 Turbo, Sora 2, and Veo 3.1 side-by-side.
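Because every output is keyed by its (model, prompt) pair, assembling the two-axis view is straightforward. Below is a small illustrative helper that lays out the results dictionary from the previous sketch with prompts as rows and models as columns, mirroring the workflow grid:

```python
def print_grid(results: dict[tuple[str, str], str], models: list[str], prompts: list[str]) -> None:
    """Rows = prompt variations, columns = models -- the same two-axis layout as the workflow."""
    print("prompt".ljust(30) + "".join(m.ljust(24) for m in models))
    for p in prompts:
        cells = "".join(str(results.get((m, p), "-")).ljust(24) for m in models)
        print(p.ljust(30) + cells)
```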


Side-by-side comparison: Kling 1.6 vs. Kling 2.1 vs. Runway Gen-4 Turbo, all handling the same input and prompt.

Key R&D Insights from This Workflow

This automated pipeline moves generative video testing from subjective art to objective science. It unlocks several key advantages:

  • Objective Data: I can now definitively see that Model A is better at interpreting "dolly" shots, while Model B offers superior motion fidelity on "zoom" shots.
  • True Prompt Adherence Testing: It's easy to spot which models actually listen to specific cinematic language versus those that just generate a generic, pleasing motion.
  • Rapid Iteration: I can test dozens of creative ideas and model combinations in the time it would have taken to run 2-3 manual tests.

See the Results in Action

This R&D is an ongoing process. You can see one of the first outputs from this benchmarking pipeline on my Instagram, where a single static image was animated using this parallel workflow.

Here is an example output from one of these test batches.

Conclusion: The Future of Creative AI Workflows

As generative models become more numerous and complex, our methods for testing and integrating them must also evolve. A single-prompt-to-single-model approach is no longer sufficient for serious creative R&D.

By leveraging node-based platforms like Weavy to build parallel-processing pipelines, we can automate our experimentation, gather objective data, and ultimately make faster, more informed decisions about which tools to use for our creative projects.

I'll be sharing more findings and outputs from this workflow. You can follow my ongoing research and see more results on my Instagram at @chaipeau.