Biography
I am a Ph.D. candidate at the Multimedia Laboratory (MMLab) of The Chinese University of Hong Kong (CUHK), fortunate to be supervised by Prof. Hongsheng Li. I also work closely with Prof. Xihui Liu. I anticipate completing my doctorate in 2025.
My research is driven by a passion for Artificial General Intelligence (AGI), with a focus on visual understanding and generation. I work on integrated systems that perceive, understand, and generate visual content, built on Multimodal Large Language Models (MLLMs).
Previously, I was a visiting scholar at MIT CSAIL, advised by Prof. Dina Katabi. I obtained my B.Eng. degree from Shanghai Jiao Tong University, where I ranked 1st out of 157 and was advised by Prof. Bingbing Ni.
Education
-
[2021 - 2025 (Exp.)] Ph.D. candidate at MMLab, The Chinese University of Hong Kong.
-
[2016 - 2020] B.Eng. in Information Engineering, Shanghai Jiao Tong University (Ranking: 1st/157).
-
[2019 - 2020] Visiting Scholar at CSAIL, Massachusetts Institute of Technology.
Publications
(* indicates equal contribution)
-
GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning
arXiv preprint, 2025
-
GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing
Rongyao Fang, C Duan, K Wang, L Huang, H Li, S Yan, H Tian, X Zeng, R Zhao, J Dai, X Liu, H Li
arXiv preprint, 2025
-
CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms
arXiv preprint, 2025
-
SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving
Conference on Computer Vision and Pattern Recognition (CVPR), 2025
-
StreamChat: Chatting with Streaming Video
arXiv preprint, 2024
-
Puma: Empowering Unified MLLM with Multi-Granular Visual Generation
International Conference on Computer Vision (ICCV), 2025
-
Mimic Before Reconstruct: Enhancing Masked Autoencoders with Feature Mimicking
International Journal of Computer Vision (IJCV), 2024
-
FeatAug-DETR: Enriching One-to-Many Matching for DETRs with Feature Augmentation
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024
-
FouriScale: A Frequency Perspective on Training-Free High-Resolution Image Synthesis
European Conference on Computer Vision (ECCV), 2024
-
CLIP-Adapter: Better Vision-Language Models with Feature Adapters
International Journal of Computer Vision (IJCV), 2024
-
InstructSeq: Unifying Vision Tasks with Instruction-Conditioned Multi-Modal Sequence Generation
arXiv preprint, 2023
-
Point-M2AE: Multi-Scale Masked Autoencoders for Hierarchical Point Cloud Pre-Training
Conference on Neural Information Processing Systems (NeurIPS), 2022
-
RBGNet: Ray-Based Grouping for 3D Object Detection
Conference on Computer Vision and Pattern Recognition (CVPR), 2022
-
Tip-Adapter: Training-Free CLIP-Adapter for Better Vision-Language Modeling
European Conference on Computer Vision (ECCV), 2022
-
Learning Longterm Representations for Person Re-Identification Using Radio Signals
Conference on Computer Vision and Pattern Recognition (CVPR), 2020
-
Probabilistic Radiomics: Ambiguous Diagnosis with Controllable Shape Analysis
International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2019
-
Adversarial Attack and Defense on Point Sets
arXiv preprint, 2019
Experience
-
[Feb. 2024 - Present] Research Intern, SenseTime.
-
[Jun. 2022 - Apr. 2023] Research Intern, Shanghai AI Laboratory.
Selected Awards
-
[2021] Hong Kong PhD Fellowship.
-
[2020] Outstanding Graduate of Shanghai (Top 1%).
-
[2017 & 2018] National Scholarship (Top 1%).
-
[2017 & 2018] Zhiyuan College Honors Scholarship (Top 5%).