Hear What You See: Video-to-Audio Generation with Diffusion Transformer and Semantic-Temporal Alignment-Ranked Direct Preference Optimization

CVPR 2026
TL;DR We introduce VisioSonic, a video-aligned sound generation framework that produces semantically rich and temporally synchronized audio, delivering a more immersive and coherent listening experience.


Abstract

Generating high-fidelity audio that is both semantically meaningful and temporally synchronized with silent videos remains a challenging problem in video-to-audio generation. Existing approaches often fail to capture the fine-grained temporal correspondence between visual events and audio dynamics, leading to unrealistic or desynchronized outputs. To address these limitations, we propose VisioSonic, a Video-Aligned Sound generation framework that unifies flow-matching diffusion and preference-guided alignment. VisioSonic introduces a multimodal conditioning module that jointly leverages video frames and textual cues to provide semantic and frame-level temporal guidance. A co-attention diffusion transformer efficiently fuses visual and audio representations, enabling content-aware sound synthesis at low computational cost. To further enhance alignment beyond supervised training, we introduce Semantic-Temporal Alignment Ranked Direct Preference Optimization (STAR-DPO), a novel preference-learning paradigm that automatically generates audio candidates, ranks them by both semantic and temporal alignment, and then fine-tunes the diffusion model on the derived preference pairs. Extensive experiments on multiple benchmarks demonstrate that VisioSonic achieves state-of-the-art audio-video synchronization and audio fidelity while using the fewest trainable parameters among competing approaches.
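The candidate-ranking step of STAR-DPO described above can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the `Candidate` record, the score ranges, the weighted-sum `combined_score`, and the choice of the best and worst candidates as the preference pair. The actual scoring models and pairing rule used by VisioSonic may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    """One generated audio clip with hypothetical alignment scores."""
    name: str
    semantic: float  # assumed: higher means a better match to the video/text content
    temporal: float  # assumed: higher means tighter audio-visual synchronization


def combined_score(c: Candidate, w_semantic: float = 0.5) -> float:
    # Assumption: a simple weighted sum of the two alignment scores.
    return w_semantic * c.semantic + (1.0 - w_semantic) * c.temporal


def build_preference_pair(
    candidates: List[Candidate], w_semantic: float = 0.5
) -> Tuple[Candidate, Candidate]:
    """Rank candidates and return (preferred, rejected) as the best and worst."""
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")
    ranked = sorted(
        candidates, key=lambda c: combined_score(c, w_semantic), reverse=True
    )
    return ranked[0], ranked[-1]


# Toy candidates for one silent video (scores are made up for the example).
candidates = [
    Candidate("take_a", semantic=0.80, temporal=0.40),
    Candidate("take_b", semantic=0.70, temporal=0.90),
    Candidate("take_c", semantic=0.30, temporal=0.20),
]
winner, loser = build_preference_pair(candidates)
print(winner.name, loser.name)
```

With equal weights, `take_b` ranks first because its strong temporal score outweighs its slightly lower semantic score. Pairs of this kind would then serve as the preferred and rejected samples for preference fine-tuning, with no human labels involved.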

Method

Figure 1: Overview of the proposed VisioSonic: base model architecture (left) and STAR-DPO pipeline (right). VisioSonic concatenates video and audio latents along the temporal dimension to exploit their inherent synchronization and introduces a video-text-audio co-attention mechanism for effective multimodal interaction. A multimodal conditioner provides both semantic and temporal guidance, enabling precise audio-visual synchronization. We further propose STAR-DPO to optimize the model toward improved semantic and temporal alignment without requiring human-annotated preference labels.
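As a rough illustration of concatenating video and audio latents along the temporal dimension, the sketch below joins two token sequences into one. It assumes that both latents are laid out as (time steps, features) and already share the same feature dimension; the helper name `concat_along_time` and the toy values are hypothetical, and the real model presumably operates on projected tensors rather than Python lists.

```python
from typing import List

Latent = List[List[float]]  # layout (T, D): T time steps, D features per step


def concat_along_time(video: Latent, audio: Latent) -> Latent:
    """Join two latent sequences along the time axis into one token sequence."""
    if not video or not audio:
        raise ValueError("both latent sequences must be non-empty")
    dim = len(video[0])
    if any(len(step) != dim for step in video + audio):
        raise ValueError("all time steps must share the same feature dimension")
    # The feature dimension is unchanged; only the sequence length grows.
    return video + audio


video_latents = [[0.1, 0.2], [0.3, 0.4]]              # 2 video steps, D = 2
audio_latents = [[1.0, 1.1], [1.2, 1.3], [1.4, 1.5]]  # 3 audio steps, D = 2
joint = concat_along_time(video_latents, audio_latents)
print(len(joint), len(joint[0]))
```

A single joint sequence lets an attention layer relate every audio step to every video step, which is one plausible reading of how the co-attention mechanism exploits the shared timeline.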

Main Results

Table 1: Comparison with existing state-of-the-art video-to-audio models on the VGGSound test set. The best results are shown in bold, and the second-best results are underlined.
Text Caption | Ground Truth | VisioSonic (ours) | MMAudio | SAH | FoleyCrafter | Frieren | VATT
train whistling
sloshing water
skateboarding

 


Out-of-Domain (MovieGen)

Text Caption | MovieGen Audio | VisioSonic (ours)
Wheels spinning, and a slamming sound as the skateboard lands on concrete.
Whistling sounds, followed by a sharp explosion and loud crackling.
Rhythmic splashing and lapping of water.
Ice cracking with sharp snapping sound, and metal tool scraping against the ice surface.
ATV engine roars and accelerates, with guitar music.
Shovel scrapes against dry earth.

 


Out-of-Domain (Sora Video Samples)

Ships riding waves
Train (no text prompt given)
Seashore (no text prompt given)
Surfing

 


Out-of-Domain (Hunyuan Video Samples)

Typing
Water is rushing down a stream and pouring
Waves on beach
Water droplet