Hear What You See: Video-to-Audio Generation with Diffusion Transformer and Semantic-Temporal Alignment-Ranked Direct Preference Optimization

Abstract

Generating high-fidelity audio that is both semantically meaningful and temporally synchronized with silent video remains a challenging problem in video-to-audio generation. Existing approaches often fail to capture the fine-grained temporal correspondence between visual events and audio dynamics, which leads to unrealistic or desynchronized outputs. To address these limitations, we propose VisioSonic, a video-aligned sound generation framework that unifies flow-matching diffusion and preference-guided alignment. VisioSonic introduces a multimodal conditioning module that jointly leverages video frames and textual cues to provide semantic and frame-level temporal guidance. A co-attention diffusion transformer efficiently fuses visual and audio representations, enabling content-aware sound synthesis at low computational cost. To further enhance alignment beyond supervised training, we introduce Semantic-Temporal Alignment-Ranked Direct Preference Optimization (STAR-DPO), a novel preference-learning paradigm that automatically generates audio candidates, ranks them by both semantic and temporal alignment, and then fine-tunes the diffusion model on the derived preference pairs. Extensive experiments on various benchmarks demonstrate that VisioSonic achieves state-of-the-art audio-video synchronization and audio fidelity while using the fewest trainable parameters among competing approaches.
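
The candidate-ranking step of STAR-DPO described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the `Candidate` record, the two score fields, the equal weighting, and the choice of best versus worst candidate as the preference pair are all hypothetical. How the scores are actually computed and combined is not specified on this page.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    """One generated audio clip for a given video (hypothetical record)."""
    audio_id: str     # illustrative identifier for the generated clip
    semantic: float   # assumed semantic-alignment score, higher is better
    temporal: float   # assumed temporal-alignment score, higher is better


def rank_candidates(cands: List[Candidate],
                    w_sem: float = 0.5,
                    w_tmp: float = 0.5) -> List[Candidate]:
    """Sort candidates from best to worst by a weighted sum of both scores.

    The weighted sum is an assumption; any other combination rule
    could be substituted here.
    """
    return sorted(
        cands,
        key=lambda c: w_sem * c.semantic + w_tmp * c.temporal,
        reverse=True,
    )


def build_preference_pair(cands: List[Candidate]) -> Tuple[str, str]:
    """Return (preferred_id, rejected_id) as the top and bottom ranked clips."""
    if len(cands) < 2:
        raise ValueError("need at least two candidates to form a pair")
    ranked = rank_candidates(cands)
    return ranked[0].audio_id, ranked[-1].audio_id


# Example with made-up scores for three candidates of one video.
candidates = [
    Candidate("a", semantic=0.9, temporal=0.8),
    Candidate("b", semantic=0.2, temporal=0.3),
    Candidate("c", semantic=0.6, temporal=0.5),
]
preferred, rejected = build_preference_pair(candidates)
```

Pairs of this form (a preferred and a rejected clip for the same video) are the kind of input a DPO-style fine-tuning stage consumes; the fine-tuning objective itself is outside the scope of this sketch.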

In-Domain Samples (VGGSound)

Text Caption	Ground Truth	VisioSonic (ours)	MMAudio	SAH	FoleyCrafter	Frieren	VATT
train whistling
sloshing water
skateboarding

Out-of-Domain Samples (MovieGen)

Text Caption	MovieGen Audio	VisioSonic (ours)
Wheels spinning, and a slamming sound as the skateboard lands on concrete.
Whistling sounds, followed by a sharp explosion and loud crackling.
Rhythmic splashing and lapping of water.
Ice cracking with sharp snapping sound, and metal tool scraping against the ice surface.
ATV engine roars and accelerates, with guitar music.
Shovel scrapes against dry earth.