I am a Ph.D. student in Data Science and Analytics at HKUST Guangzhou, supervised by Prof. Yuxuan Liang.
My research focuses on spatio-temporal representation learning with cross-domain, multimodal data. I am also interested in efficient AI, aiming to create effective, balanced datasets and develop lightweight, transferable models.
Previously, I was a Research Intern at XPENG, working on visual multimodal research for the XNGP System, and a Software Engineer at Tencent, enhancing QQ’s performance and developing cloud-native tools.
",
Siru Zhong, Xixuan Hao, Yibo Yan, Ying Zhang, Yangqiu Song, Yuxuan Liang
ACM International Conference on Multimedia (ACM MM) 2024 Poster
Introduce UrbanCross, a cross-domain satellite image-text retrieval framework that leverages multimodal enhancements and adaptive domain adaptation techniques to bridge diverse urban landscapes, achieving up to a 15% improvement in retrieval performance.
Yutong Feng, Qiongyan Wang, Yutong Xia, Junlin Huang, Siru Zhong, Kun Wang, Shifen Cheng, Yuxuan Liang
The International Joint Conference on Artificial Intelligence (IJCAI) 2024
Present the Spatio-Temporal Field Neural Network and Pyramidal Inference framework, which integrate field and graph perspectives to achieve state-of-the-art nationwide air quality inference in Mainland China.
Huaiwu Zhang, Yutong Xia, Siru Zhong, Kun Wang, Zekun Tong, Qingsong Wen, Roger Zimmermann, Yuxuan Liang
The International Joint Conference on Artificial Intelligence (IJCAI) 2024
Introduce DeepPA, a deep-learning framework, and the SINPA dataset for accurately predicting real-time parking availability across Singapore, outperforming existing models and supporting urban planning through a deployed web platform.
Yibo Yan, Haomin Wen, Siru Zhong, Wei Chen, Haodong Chen, Qingsong Wen, Roger Zimmermann, Yuxuan Liang
The International World Wide Web Conference (WWW) 2024 Oral
Introduce UrbanCLIP, the first large language model–enhanced framework that integrates textual descriptions with satellite imagery through contrastive language-image pretraining, significantly improving urban region profiling performance across major cities.