Shangzhe Di   狄尚哲
Hi, I am a second-year PhD candidate at Shanghai Jiao Tong University (SJTU), where I am fortunate to be advised by Prof. Weidi Xie. My research focuses on video understanding and multimodal learning, driven by a passion for exploring the unknowns in these fields.
Before joining SJTU, I earned my master's and bachelor's degrees from Beihang University (BUAA). During this period, I explored video background music generation and visual object tracking under the guidance of Prof. Si Liu.
I’m always eager to connect, exchange ideas, and collaborate on innovative research.
Please feel free to reach out!
Email / CV / Github / Google Scholar
Unlocking Video-LLM via Agent-of-Thoughts Distillation
Yudi Shi, Shangzhe Di, Qirui Chen, Weidi Xie
In Submission, 2024.
paper / project page / code
Distills multi-step reasoning and spatial-temporal understanding into a generative video-language model.
Streaming Video Question-Answering with In-context Video KV-Cache Retrieval
Shangzhe Di, Zhelun Yu, et al.
In Submission, 2024.
A training-free approach that enables Video-LLMs to perform streaming video question answering.
Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos
Qirui Chen, Shangzhe Di, Weidi Xie
In AAAI, 2025.
paper / project page / code
Pinpoints scattered visual evidence across long egocentric videos while answering questions.
Grounded Question-Answering in Long Egocentric Videos
Shangzhe Di, Weidi Xie
In CVPR, 2024.
paper / project page / code / bibtex
Simultaneous query grounding and answering in long egocentric videos.
Video Background Music Generation with Controllable Music Transformer
Shangzhe Di*, Zeren Jiang*, Si Liu, Zhaokai Wang, Leyan Zhu, Zexin He, Hongming Liu, Shuicheng Yan
In ACM MM, 2021. (Best Paper Award)
paper / project page / code / colab notebook / bibtex
The first method to generate satisfying background music for videos.
The website template is borrowed from here.