TL;DR: VISTA is a multi-agent framework that improves text-to-video generation at inference time. It plans structured prompts as timed scenes, runs a pairwise tournament to select the best candidate video, critiques that video with specialized judges across visual, audio, and context dimensions, then rewrites the prompt with a Deep Thinking Prompting Agent. The method shows consistent gains over strong prompt-optimization baselines in single-scene and multi-scene settings, and human raters prefer its outputs.


What is VISTA?
VISTA stands for Video Iterative Self-improvemenT Agent. It is a black-box, multi-agent loop that refines prompts and regenerates videos at test time. The system targets three aspects jointly: visual, audio, and context. It follows four steps: structured video prompt planning, pairwise tournament selection, multi-dimensional multi-agent critiques, and a Deep Thinking Prompting Agent for prompt rewriting.
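To make the loop concrete, here is a minimal Python sketch of the outer test-time cycle. The helper functions (plan_scene_prompts, generate_video, tournament_select, multi_agent_critique, deep_thinking_rewrite) are hypothetical placeholders for the four steps described below, not the authors' actual API.

```python
def vista_loop(user_prompt: str, iterations: int = 5, samples_per_prompt: int = 2):
    # Step 1: decompose the user prompt into structured scene prompts,
    # keeping the original prompt in the candidate set as well.
    prompts = plan_scene_prompts(user_prompt) + [user_prompt]
    best_video, best_prompt = None, user_prompt
    for _ in range(iterations):
        # Sample several (video, prompt) candidates from the black-box generator.
        candidates = [(generate_video(p), p)
                      for p in prompts for _ in range(samples_per_prompt)]
        # Step 2: a pairwise tournament picks a champion candidate.
        best_video, best_prompt = tournament_select(candidates, user_prompt)
        # Step 3: visual / audio / context judge triads critique the champion.
        critiques = multi_agent_critique(best_video, best_prompt)
        # Step 4: the Deep Thinking Prompting Agent rewrites the prompt.
        prompts = deep_thinking_rewrite(user_prompt, best_prompt, critiques)
    return best_video, best_prompt
```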
The research team evaluates VISTA on a single-scene benchmark and on an internal multi-scene set. They report consistent improvements, a pairwise win rate of up to 60 percent against state-of-the-art baselines in some settings, and a 66.4 percent human preference over the strongest baseline.


Understanding the key problem
Text-to-video models like Veo 3 can produce high-quality video and audio, yet outputs remain sensitive to exact prompt phrasing, adherence to physics can fail, and alignment with user goals can drift, which forces manual trial and error. VISTA frames this as a test-time optimization problem: it seeks unified improvement across visual signals, audio signals, and contextual alignment.
How VISTA works, step by step
Step 1: structured video prompt planning
The user prompt is decomposed into timed scenes. Each scene carries nine properties: duration, scene type, characters, actions, dialogues, visual environment, camera, sounds, and moods. A multimodal LLM fills in missing properties and, by default, enforces constraints on realism, relevancy, and creativity. The system also keeps the original user prompt in the candidate set to accommodate models that do not benefit from decomposition.
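As a rough illustration of what the structured plan could look like, here is a hypothetical Python schema for one timed scene with the nine properties listed above; the field names and the prompt rendering are illustrative assumptions, not the paper's exact format.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical container for one timed scene; the fields mirror the nine
# properties listed above but are illustrative, not the authors' exact schema.
@dataclass
class Scene:
    duration: str                  # e.g. "0s-4s"
    scene_type: str                # e.g. "establishing shot"
    characters: List[str] = field(default_factory=list)
    actions: str = ""
    dialogues: str = ""
    visual_environment: str = ""
    camera: str = ""
    sounds: str = ""
    moods: str = ""

def render_scene_prompt(scenes: List[Scene]) -> str:
    """Flatten the structured scenes back into a text prompt for the video model."""
    return "\n".join(
        f"[{s.duration}] {s.scene_type}: {s.actions} "
        f"| characters: {', '.join(s.characters)} | dialogue: {s.dialogues} "
        f"| environment: {s.visual_environment} | camera: {s.camera} "
        f"| sounds: {s.sounds} | mood: {s.moods}"
        for s in scenes
    )
```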
Step 2: pairwise tournament video selection
The system samples multiple (video, prompt) pairs. An MLLM acts as a judge in binary tournaments with bidirectional swapping to reduce token-order bias. The default criteria include visual fidelity, physical commonsense, text-video alignment, audio-video alignment, and engagement. The judge first elicits probing critiques to support analysis, then performs the pairwise comparison, and applies customizable penalties for common text-to-video failures.
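Below is a minimal sketch of how a bidirectional pairwise tournament could be implemented. judge_mllm is an assumed client for the multimodal judge, and the tie-breaking behavior is a guess rather than the authors' exact procedure.

```python
import random

def judge_pair(video_a, video_b, prompt) -> str:
    """Ask the MLLM judge which candidate is better; returns 'A' or 'B'.
    judge_mllm.critique / judge_mllm.compare are hypothetical calls."""
    critique = judge_mllm.critique([video_a, video_b], prompt)  # probing critiques first
    return judge_mllm.compare(video_a, video_b, prompt, critique)

def bidirectional_winner(cand_a, cand_b, prompt):
    """Compare both orderings to reduce token-order bias; break disagreements randomly."""
    first = judge_pair(cand_a[0], cand_b[0], prompt)
    second = judge_pair(cand_b[0], cand_a[0], prompt)  # swapped order
    if first == "A" and second == "B":
        return cand_a
    if first == "B" and second == "A":
        return cand_b
    return random.choice([cand_a, cand_b])               # inconsistent verdict => tie

def tournament_select(candidates, prompt):
    """Single-elimination tournament over (video, prompt) candidates."""
    pool = list(candidates)
    while len(pool) > 1:
        nxt = [bidirectional_winner(pool[i], pool[i + 1], prompt)
               for i in range(0, len(pool) - 1, 2)]
        if len(pool) % 2:                                 # odd candidate gets a bye
            nxt.append(pool[-1])
        pool = nxt
    return pool[0]
```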
Step 3: multi dimensional multi agent critiques
The champion video and prompt receive critiques along three dimensions: visual, audio, and context. Each dimension uses a triad of judges: a normal judge, an adversarial judge, and a meta judge that consolidates both sides. The visual metrics include visual fidelity, motions and dynamics, temporal consistency, camera focus, and visual safety; the audio metrics include audio fidelity, audio-video alignment, and audio safety; the context metrics include situational appropriateness, semantic coherence, text-video alignment, physical commonsense, engagement, and video format. Scores are on a 1 to 10 scale, which supports targeted error discovery.
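A compact sketch of one judging triad is shown below. score_with_prompt stands in for an MLLM call that returns per-metric 1-to-10 scores plus a rationale; the role instructions are paraphrased assumptions, not the authors' prompts.

```python
# Hypothetical sketch of one critique dimension (visual, audio, or context):
# a normal judge, an adversarial judge, and a meta judge that consolidates both.
VISUAL_METRICS = ["visual fidelity", "motions and dynamics", "temporal consistency",
                  "camera focus", "visual safety"]

def critique_dimension(video, prompt, metrics):
    # Normal judge: balanced assessment with justified 1-10 scores per metric.
    normal = score_with_prompt(video, prompt, metrics,
                               role="Assess strengths and justify each 1-10 score.")
    # Adversarial judge: actively hunts for flaws and scores harshly.
    adversarial = score_with_prompt(video, prompt, metrics,
                                    role="Actively look for flaws and score critically.")
    # Meta judge: reconciles the two critiques into consolidated scores.
    meta = score_with_prompt(video, prompt, metrics,
                             role="Reconcile the two critiques into final scores.",
                             context=[normal, adversarial])
    return meta  # consolidated scores feed the Deep Thinking Prompting Agent
```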
Step 4: Deep Thinking Prompting Agent
The reasoning module reads the meta critiques and runs a six-step introspection: it identifies low-scoring metrics, clarifies the expected outcomes, checks whether the prompt is sufficient, separates model limitations from prompt issues, detects conflicts or vagueness, and proposes modification actions; it then samples refined prompts for the next generation cycle.
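The sketch below shows how the six-step introspection could be scaffolded as a prompt. The step wording is paraphrased from the description above, and llm.generate is an assumed text-LLM call, so the exact prompts used by the authors will differ.

```python
# Illustrative scaffold for the six-step introspection; not the paper's verbatim prompts.
DEEP_THINKING_STEPS = [
    "1. Which metrics received the lowest scores, and why?",
    "2. What outcome did the user's prompt expect for those metrics?",
    "3. Does the current prompt contain enough information to achieve it?",
    "4. Are the failures due to model limitations or to the prompt itself?",
    "5. Is any part of the prompt conflicting, redundant, or vague?",
    "6. Propose concrete prompt modifications that address the issues above.",
]

def deep_thinking_rewrite(user_prompt, current_prompt, meta_critiques, num_candidates=3):
    # Run the introspection over the consolidated critiques.
    reasoning = llm.generate("\n".join([
        f"User goal: {user_prompt}",
        f"Current prompt: {current_prompt}",
        f"Meta critiques: {meta_critiques}",
        "Work through the following steps before rewriting:",
        *DEEP_THINKING_STEPS,
    ]))
    # Sample several refined prompts conditioned on the introspection trace.
    return [llm.generate(f"{reasoning}\nRewrite the prompt accordingly:")
            for _ in range(num_candidates)]
```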


Understanding the results
Automatic evaluation: The study reports win, tie, and loss rates on ten criteria using an MLLM as a judge, with bidirectional comparisons. VISTA's win rate over direct prompting rises across iterations, reaching 45.9 percent in the single-scene setting and 46.3 percent in the multi-scene setting at iteration 5. It also wins directly against each baseline under the same compute budget.
Human studies: Annotators with prompt-optimization experience prefer VISTA in 66.4 percent of head-to-head trials against the best baseline at iteration 5. Experts rate VISTA's optimization trajectories higher and score its visual and audio quality above direct prompting.
Cost and scaling: Average token use is about 0.7 million tokens per iteration across the two datasets, excluding generation tokens. Most of the tokens come from selection and critiques, which process videos as long-context inputs. Win rate tends to increase as the number of sampled videos and tokens per iteration grows.
Ablations: Removing prompt planning weakens initialization. Removing tournament selection destabilizes later iterations. Using only one judge type reduces performance. Removing the Deep Thinking Prompting Agent lowers final win rates.
Evaluators: The research team repeated the evaluation with alternative evaluator models and observed similar iterative improvements, which supports the robustness of the trend.




Key Takeaways
- VISTA is a test-time, multi-agent loop that jointly optimizes visual, audio, and contextual quality for text-to-video generation.
- It plans prompts as timed scenes with nine attributes: duration, scene type, characters, actions, dialogues, visual environment, camera, sounds, and moods.
- Candidate videos are selected via pairwise tournaments by an MLLM judge with bidirectional swapping, scored on visual fidelity, physical commonsense, text-video alignment, audio-video alignment, and engagement.
- A triad of judges per dimension (normal, adversarial, meta) produces 1-to-10 scores that guide the Deep Thinking Prompting Agent to rewrite the prompt and iterate.
- Results show 45.9 percent wins on single-scene and 46.3 percent on multi-scene prompts at iteration 5 over direct prompting, human raters prefer VISTA in 66.4 percent of trials, and the average token cost is about 0.7 million per iteration.
VISTA is a practical step toward reliable text-to-video generation: it treats inference as an optimization loop and keeps the generator as a black box. The structured video prompt planning is useful for early engineers, since the nine scene attributes give a concrete checklist. The pairwise tournament selection with a multimodal LLM judge and bidirectional swapping is a sensible way to reduce ordering bias, and the criteria target real failure modes: visual fidelity, physical commonsense, text-video alignment, audio-video alignment, and engagement. The multi-dimensional critiques separate visual, audio, and context, and the normal, adversarial, and meta judges expose weaknesses that single judges miss. The Deep Thinking Prompting Agent turns those diagnostics into targeted prompt edits. The use of Gemini 2.5 Flash and Veo 3 clarifies the reference setup, and the Veo 2 study provides a helpful lower bound. The reported 45.9 and 46.3 percent win rates and the 66.4 percent human preference indicate repeatable gains. The 0.7-million-token cost per iteration is non-trivial, yet transparent and scalable.
Check out the Paper and Project Page.