🎊 Welcome to my website!

My name is Kai (Alan) Wang. I’m a Ph.D candidate at The Edward S. Rogers Sr. Department of Electrical and Computer Engineering (ECE), University of Toronto, supervised by Prof. Dimitrios Hatzinakos. I also work closely with Prof. Yapeng Tian on Audio-Visual Learning, Generation, and Editing.

My research interests lie in Multimodal Learning (Image/Video/Audio), Generative Models, Video-Audio Understanding and Generation, and Multimodal World Model.

I’m always looking for research collaboration. If you are interested in my research, please don’t hesitate to reach out to me.

🔥 News

2026.07: 🎉 One paper is accepted by COLM 2026.
2026.05: 🎉 We released a audio-visual intelligence survey paper.
2026.02: 🎉 Two papers (1 first-author and 1 co-author) are accepted by CVPR 2026 (Main Track).
2026.01: 🎉 One co-first-author paper is accepted by ICLR 2026.
2025.08: 🎉 One paper is accepted by EMNLP 2025 (Findings).
2025.07: 🎉 One first-author paper is accepted by ACM MM 2025.
2025.07: 🎉 One first-author paper is accepted by BMVC 2025.
2025.01: 🎉 One paper is accepted by ICLR 2025.
2024.09: 🎉 One paper is accepted by NeurIPS 2024 (Spotlight).
2024.09: 🎉 One paper is accepted by EMNLP 2024 (Main).
2024.06: 🎉 One paper is appeared by ArXiv.
2024.05: 🎉 One co-first author paper is accepted by Pattern Recognition Letter.
2024.04: 🎉 One first-author paper is accepted by CVPR 2024.
2023.12: 🎉 One first-author paper is accepted by ICASSP 2024.
2023.07: 🎉 One first-author paper is accepted by APSIPA 2023.
2022.09: 📖 Starting my Ph.D study at the University of Toronto.

📝 Selected Publications

( ^* equal contribution)

Arxiv

Audio-Visual Intelligence in Large Foundation Models

You Qin, Kai Liu, Shengqiong Wu, Kai Wang, Shijian Deng, Yapeng Tian, Junbin Xiao, Yazhou Xing, Yinghao Ma, Bobo Li, Roger Zimmermann, Lei Cui, Furu Wei, Jiebo Luo, Hao Fei.

Technical Report

CVPR 2026

Hear What You See: Video-to-Audio Generation with Diffusion Transformer and Semantic-Temporal Alignment-Ranked Direct Preference Optimization

Kai Wang, Tao Zhou, Jiayi Lei, Jing Wang, Jinman Zhao, Weiguo Pian, Yuan Cheng, Yapeng Tian, Peng Gao, Bin Fu, Yihao Liu, Dimitrios Hatzinakos, Yuewen Cao.

CVPR 2026 (Main)

CVPR 2026

OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text

Weiguo Pian, Saksham Singh Kushwaha, Zhimin Chen, Shijian Deng, Kai Wang, Yunhui Guo, Yapeng Tian.

CVPR 2026 (Main)

ICLR 2026

JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation

Kai Liu^*, Yanhao Zheng^*, Kai Wang^*, Shengqiong Wu, Rongjunchen Zhang, Jiebo Luo, Dimitrios Hatzinakos, Ziwei Liu, Hao Fei, Tat-Seng Chua

ICLR 2026

EMNLP 2025

Self-Improvement in Multimodal Large Language Models: A Survey

Shijian Deng, Kai Wang, Tianyu Yang, Harsh Singh, Yapeng Tian.

EMNLP 2025 (Findings)

ArXiv

Lumina-Video: Efficient and Flexible Video Generation with Multi-scale Next-DiT

Dongyang Liu^*, Shicheng Li^*, Yutong Liu^*, Zhen Li^*, Kai Wang^*, Xinyue Li^*, Qi Qin, Yufei Liu, Yi Xin, Zhongyu Li, Bin Fu, Chenyang Si, Yuewen Cao, Conghui He, Ziwei Liu, Yu Qiao, Qibin Hou, Hongsheng Li, Peng Gao

Under Review

BMVC 2025

Prompt Image to Watch and Hear: Multimodal Prompting for Parameter-Efficient Audio-Visual Learning

Kai Wang, Shentong Mo, Yapeng Tian, Dimitrios Hatzinakos.

BMVC 2025 (Poster)

ACM MM 2025

AV-DiT: Taming Image Diffusion Transformers for Efficient Joint Audio and Video Generation

Kai Wang, Shijian Deng, Jing Shi, Dimitrios Hatzinakos, Yapeng Tian.

ACM Multimedia 2025 (Oral)

We design an efficient audio-visual diffusion transformer generate high-quality, realistic videos with both visual and audio tracks.

NeurIPS 2024

MMLU-Pro: A more robust and challenging multi-task language understanding benchmark

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, Wenhu Chen.

NeurIPS 2024 (Spotlight)

EMNLP 2024

VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation

Xuan He, Dongfu Jiang, Ge Zhang, Max Ku, Achint Soni, Sherman Siu, Haonan Chen, Abhranil Chandra, Ziyan Jiang, Aaran Arulraj, Kai Wang, Quy Duc Do, Yuansheng Ni, Bohan Lyu, Yaswanth Narsupalli, Rongqi Fan, Zhiheng Lyu, Yuchen Lin, Wenhu Chen

EMNLP 2024 (Main)

CVPR 2024

Towards Efficient Audio-Visual Learners via Empowering Pre-trained Vision Transformers with Cross-Modal Adaptation

Kai Wang,Yapeng Tian, Dimitrios Hatzinakos.

CVPR 2024

We propose a Spatial-Temporal-Global Cross-Modal Adaptation (STG-CMA) to gradually equip the frozen ViTs with the capability for learning audio-visual representation.

Pattern Recognition Letter

HARWE: A multi-modal large-scale dataset for context-aware human activity recognition in smart working environments

Alireza Esmaeilzehi^*, Ensieh Khazaei^*, Kai Wang^*, Navjot Kaur Kalsi, Pai Chet Ng, Huan Liu, Yuanhao Yu, Dimitrios Hatzinakos, Konstantinos Plataniotis.

Pattern Recognition Letter

We propose a novel dataset for the task of human activity recognition, in which the labels are specified for the working environments.

ICASSP 2024

MoMA: Mixture-of-Modality-Adaptations for Transferring Knowledge from Image Models Towards Efficient Audio-Visual Action Recognition

Kai Wang, Dimitrios Hatzinakos.

ICASSP 2024 (Oral)

We propose a novel parameter-efficient scheme called Mixture-of-Modality-Adaptations (MoMA) for audio-visual action recognition.

We propose the SEformer, an efficient dual-path conformer neural network for speech enhancement.

IWAENC 2022

Cptnn: Cross-parallel transformer neural network for time-domain speech enhancement

Kai Wang, Bengbeng He and Wei-Ping Zhu.

IWAENC 2022

ICASSP 2021

TSTNN: Two-stage transformer based neural network for speech enhancement in the time domain

Kai Wang, Bengbeng He and Wei-Ping Zhu.

ICASSP 2021

🎖 Honors and Awards

2024: School of Graduate Studies (SGS) Conference Grant, University of Toronto
2022 - Present: Edward S. Rogers Sr. Graduate Scholarships, University of Toronto
2022 - Present: Research Fellowship, University of Toronto
2021: Conference and Exposition Award
2015: Meritorious Award of International Mathematical Contest in Modeling

🏁 Services

Conference Reviewer: NeurIPS, ECCV, CVPR, ICLR, ACM MM, BMVC, CVPRW, ICASSP, ICME, ICJNN
Journal Reviewer: IJCV, Systems, and Signal Processing (CSSP), Speech Communication

🧑‍🏫 Teaching

Winter (2023, 2024), Fall(2023, 2024): ECE421 Introduction to Machine Learning
Winter 2025: ECE462 Multimedia Systems
Fall 2025: ECE 1786 Creative Applications of Natural Language Processing