Shangzhe Di   狄尚哲
Hi, I am a second-year PhD candidate at Shanghai Jiao Tong University (SJTU), where I am fortunate to be advised by Prof. Weidi Xie. My research focuses on video understanding and multimodal learning, driven by a passion for exploring the unknowns in these fields.
Before joining SJTU, I earned my master's and bachelor's degrees from Beihang University (BUAA). During this period, I explored video background music generation and visual object tracking under the guidance of Prof. Si Liu.
I’m always eager to connect, exchange ideas, and collaborate on innovative research.
Please feel free to reach out!
Email / CV / Github / Google Scholar
Unlocking Video-LLM via Agent-of-Thoughts Distillation
Yudi Shi, Shangzhe Di, Qirui Chen, Weidi Xie
In Submission, 2024.
paper / project page / code
Distills multi-step reasoning and spatial-temporal understanding into a generative video-language model.
Streaming Video Question-Answering with In-context Video KV-Cache Retrieval
Shangzhe Di, Zhelun Yu, et al.
In Submission, 2024.
A training-free approach that enables Video-LLMs to perform streaming video question answering.
Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos
Qirui Chen, Shangzhe Di, Weidi Xie
In AAAI, 2025.
paper / project page / code
Pinpoints scattered visual evidence across long egocentric videos while answering questions.
Grounded Question-Answering in Long Egocentric Videos
Shangzhe Di, Weidi Xie
In CVPR, 2024.
paper / project page / code / bibtex
Simultaneous query grounding and answering in long egocentric videos.
Video Background Music Generation with Controllable Music Transformer
Shangzhe Di*, Zeren Jiang*, Si Liu, Zhaokai Wang, Leyan Zhu, Zexin He, Hongming Liu, Shuicheng Yan
In ACM MM, 2021. (Best Paper Award)
paper / project page / code / colab notebook / bibtex
The first method to generate satisfying background music for videos.
The website template is borrowed from here.