Biography
I am currently a Research Scientist at Alibaba Qwen VL Team, working on unified multimodal large models that integrate visual understanding and generation, as well as agentic approaches that leverage VLMs to orchestrate visual creation. My research is driven by a passion for Artificial General Intelligence (AGI), with a focus on building omni-modal foundation models that can perceive, reason, and create across vision, language, and beyond.
I obtained my Ph.D. from the Multimedia Laboratory (MMLab) of The Chinese University of Hong Kong (CUHK) in 2025, where I was fortunate to be supervised by Prof. Hongsheng Li. I also worked closely with Prof. Xihui Liu.
Previously, I was a visiting scholar at MIT CSAIL, advised by Prof. Dina Katabi. I obtained my B.Eng. degree from Shanghai Jiao Tong University, where I ranked 1st out of 157 and was advised by Prof. Bingbing Ni.
News
- [Jan. 2026] Two papers accepted to ICLR 2026.
- [Sep. 2025] One paper accepted to NeurIPS 2025.
- [Jul. 2025] One paper accepted to ICCV 2025.
- [Feb. 2025] One paper accepted to CVPR 2025.
Education
- [2021 - 2025] Ph.D. at MMLab, The Chinese University of Hong Kong.
- [2016 - 2020] B.Eng. in Information Engineering, Shanghai Jiao Tong University (Ranking: 1st/157).
- [2019 - 2020] Visiting Scholar at CSAIL, Massachusetts Institute of Technology.
Publications
(* indicates equal contribution)
- MathCanvas: Intrinsic Visual Chain-of-Thought for Multimodal Mathematical Reasoning
  W Shi*, A Yu*, Rongyao Fang*, H Ren, K Wang, A Zhou, C Tian, X Fu, Y Hu, Z Lu, L Huang, S Liu, R Liu, H Li
- CodePlot-CoT: Mathematical Visual Reasoning by Thinking with Code-Driven Images
  C Duan*, K Sun*, Rongyao Fang*, M Zhang, Y Feng, Y Luo, Y Liu, K Wang, P Pei, X Cai, H Li, Y Ma, X Liu
  arXiv preprint, 2025 [Paper]
- FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark
  International Conference on Learning Representations (ICLR), 2026 [Paper]
- T2I-ReasonBench: Benchmarking Reasoning-Informed Text-to-Image Generation
  arXiv preprint, 2025 [Paper]
- GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning
  International Conference on Learning Representations (ICLR), 2026 [Paper]
- GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing
  Rongyao Fang, C Duan, K Wang, L Huang, H Li, S Yan, H Tian, X Zeng, R Zhao, J Dai, X Liu, H Li
  Conference on Neural Information Processing Systems (NeurIPS), 2025 [Paper]
- CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms
  arXiv preprint, 2025 [Paper]
- SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving
  Conference on Computer Vision and Pattern Recognition (CVPR), 2025 [Paper]
- StreamChat: Chatting with Streaming Video
  arXiv preprint, 2024 [Paper]
- Puma: Empowering Unified MLLM with Multi-Granular Visual Generation
  International Conference on Computer Vision (ICCV), 2025 [Paper]
- Mimic Before Reconstruct: Enhancing Masked Autoencoders with Feature Mimicking
  International Journal of Computer Vision (IJCV), 2024 [Paper]
- FeatAug-DETR: Enriching One-to-Many Matching for DETRs with Feature Augmentation
  Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024 [Paper]
- FouriScale: A Frequency Perspective on Training-Free High-Resolution Image Synthesis
  European Conference on Computer Vision (ECCV), 2024 [Paper]
- Clip-Adapter: Better Vision-Language Models with Feature Adapters
  International Journal of Computer Vision (IJCV), 2024 [Paper]
- InstructSeq: Unifying Vision Tasks with Instruction-Conditioned Multi-Modal Sequence Generation
  arXiv preprint, 2023 [Paper]
- Point-M2AE: Multi-Scale Masked Autoencoders for Hierarchical Point Cloud Pre-Training
  Conference on Neural Information Processing Systems (NeurIPS), 2022 [Paper]
- RBGNet: Ray-Based Grouping for 3D Object Detection
  Conference on Computer Vision and Pattern Recognition (CVPR), 2022 [Paper]
- Tip-Adapter: Training-Free CLIP-Adapter for Better Vision-Language Modeling
  European Conference on Computer Vision (ECCV), 2022 [Paper]
- Learning Longterm Representations for Person Re-Identification Using Radio Signals
  Conference on Computer Vision and Pattern Recognition (CVPR), 2020 [Paper]
- Probabilistic Radiomics: Ambiguous Diagnosis with Controllable Shape Analysis
  International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2019 [Paper]
- Adversarial Attack and Defense on Point Sets
  arXiv preprint, 2019 [Paper]
Experience
- [Nov. 2025 - Present] Research Scientist, Alibaba Qwen VL Team.
- [Feb. 2024 - Aug. 2025] Research Intern, SenseTime.
- [Jun. 2022 - Apr. 2023] Research Intern, Shanghai AI Laboratory.
Selected Awards
- [2021] Hong Kong PhD Fellowship.
- [2020] Outstanding Graduates of Shanghai (Top 1%).
- [2017 & 2018] National Scholarship (Top 1%).
- [2017 & 2018] Zhiyuan College Honors Scholarship (Top 5%).