Siru Zhong
Ph.D. Student @ HKUST(GZ) | Research Intern @ Huawei | Former Software Engineer @ Tencent & Algorithm Intern @ XPENG

I am a first-year Ph.D. student in Data Science & Analytics at the Hong Kong University of Science and Technology (Guangzhou), supervised by Prof. Yuxuan Liang and Prof. Yangqiu Song. My research focuses on spatio-temporal data mining, multimodal learning, and time series analysis. I am also interning at Huawei 2012 Labs, where I conduct research on foundation models for spatio-temporal data.

I hold an M.Phil. in Data Science & Analytics from HKUST (Guangzhou) and a B.Eng. in Computer Science & Information Engineering from Hefei University of Technology. Before starting my Ph.D., I gained industry experience as an Algorithm Research Intern at the Autonomous Driving Center of XPENG and spent one year as a full-time Software Engineer at the CloudIDE & Workflow Engine Center of Tencent.

Education
  • Hong Kong University of Science and Technology (GZ)
    Ph.D. in Data Science & Analytics
    Feb. 2025 - Jan. 2028
  • Hong Kong University of Science and Technology (GZ)
    M.Phil. in Data Science & Analytics
    Aug. 2023 - Jan. 2025
  • Hefei University of Technology
    B.Eng. in Computer Science & Information Engineering
    Sep. 2018 - Jul. 2022
Experience
  • Huawei
    Research Intern, HUAWEI 2012 Labs, Service Lab
    Feb. 2025 - Present
  • XPENG
    Algorithm Intern, Autonomous Driving Center
    May 2024 - Aug. 2024
  • HKUST(GZ)
    Research Assistant, CityMind Lab
    May 2023 - Aug. 2023
  • Tencent
    Software Engineer, CloudIDE Center
    Jul. 2022 - May 2023
  • Tencent
    Software Intern, Workflow Engine Center
    Jun. 2021 - Sep. 2023
Honors & Awards
  • Runner-Up Prize for DSA Excellent Research Award, HKUST(GZ)
    2024
  • Best Project Award (1st/15) in Data Science Computing, HKUST(GZ)
    2023
  • Outstanding Student Award (Top 10%) in Red Bird Summer Camp, HKUST(GZ)
    2023
  • iCode Certification of R&D Engineering Competency Evaluation, Tencent
    2022
  • Silver Award (2nd/12) in Code World Program, Tencent
    2022
  • Outstanding Student Award (8th/100) in New Employee Training, Tencent
    2022
  • Outstanding Graduation Thesis Award (Top 2%), HFUT
    2022
  • Third Prize in National College English Competition
    2021
  • First Prize (Top 10) in CSDN Technology Blogger Competition
    2020
  • Second Prize in 2D Robotics Competition, HFUT
    2020
  • Scholarship, HFUT
    2018-2021
Extracurricular Activities
  • Greater Bay Area Science Forum attendee, Guangzhou, China
    2024
  • ACM Multimedia Conference attendee, Melbourne, Australia
    2024
  • China National Computer Conference attendee, Yiwu, China
    2024
  • HKUST-GZ System Hub Welcome Party performer, Guangzhou, China
    2023
  • Tencent New Year Gala performer, Shenzhen, China
    2023
  • HFUT Chorus member, Hefei, China
    2018-2022
  • HFUT External Relations Department member, Hefei, China
    2018-2019
News
2025
One tutorial titled 'Multimodal Learning for Spatio-Temporal Data Mining' was accepted by ACM MM 2025 Tutorials!
Feb 28
I am embarking on my PhD studies at the Hong Kong University of Science and Technology (Guangzhou).
Feb 05
I was awarded the Runner-Up Prize in the 2024 HKUST(GZ) DSA Excellent Research Award.
Jan 27
2024
Two papers on Urban Indicator Prediction and Air Quality Inference have been accepted to AAAI 2025.
Dec 10
I successfully passed the MPhil thesis defense in Data Science and Analytics at HKUST(GZ).
Nov 25
I was honored to be invited to serve as a reviewer for ICLR 2025.
Aug 23
One paper on Urban Multimodal Image-Text Retrieval has been accepted to MM 2024.
Jul 18
Two papers on Spatio-Temporal Prediction and Neural Networks have been accepted to IJCAI 2024.
May 14
One paper on LLM-Enhanced Urban Region Profiling has been accepted to WWW 2024.
Jan 23
2023
I am starting my MPhil studies at the Hong Kong University of Science and Technology (Guangzhou).
Aug 24
Selected Publications
Time-VLM: Exploring Multimodal Vision-Language Models for Augmented Time Series Forecasting

Siru Zhong, Weilin Ruan, Min Jin, Huan Li, Qingsong Wen, Yuxuan Liang

Under review. 2025

Propose Time-VLM, a novel multimodal framework that leverages pre-trained Vision-Language Models (VLMs) to bridge temporal, visual, and textual modalities for enhanced time series forecasting.

Vision-Enhanced Time Series Forecasting via Latent Diffusion Models

Weilin Ruan, Siru Zhong, Haomin Wen, Yuxuan Liang

Under review. 2025

Propose LDM4TS, a novel framework that leverages the powerful image reconstruction capabilities of latent diffusion models for vision-enhanced time series forecasting.

UrbanVLP: A Multi-Granularity Vision-Language Pre-Trained Model for Urban Indicator Prediction

Xixuan Hao, Wei Chen, Yibo Yan, Siru Zhong, Kun Wang, Qingsong Wen, Yuxuan Liang

AAAI Conference on Artificial Intelligence (AAAI) 2025 Poster

Present UrbanVLP, a novel vision-language pretraining framework that integrates both macro-level and micro-level urban data and enhances interpretability through automatic text generation, achieving superior performance in urban region profiling.

AirRadar: Inferring Nationwide Air Quality in China with Deep Neural Networks

Qiongyan Wang, Yutong Xia, Siru Zhong, Weichuang Li, Yuankai Wu, Shifen Cheng, Junbo Zhang, Yu Zheng, Yuxuan Liang

AAAI Conference on Artificial Intelligence (AAAI) 2025 Poster

Introduce AirRadar, a deep neural network that infers air quality at unmonitored locations. It uses learnable mask tokens in a two-stage process to reconstruct features, and it outperforms baselines in dataset-based validation.

UrbanCross: Enhancing Satellite Image-Text Retrieval with Cross-Domain Adaptation

Siru Zhong, Xixuan Hao, Yibo Yan, Ying Zhang, Yangqiu Song, Yuxuan Liang

ACM International Conference on Multimedia (ACM MM) 2024 Poster

Introduce UrbanCross, a cross-domain satellite image-text retrieval framework that leverages multimodal enhancements and adaptive domain adaptation techniques to bridge diverse urban landscapes, achieving up to a 15% improvement in retrieval performance.

Spatio-Temporal Field Neural Networks for Air Quality Inference

Yutong Feng, Qiongyan Wang, Yutong Xia, Junlin Huang, Siru Zhong, Kun Wang, Shifen Cheng, Yuxuan Liang

The International Joint Conference on Artificial Intelligence (IJCAI) 2024 Spotlight

Present the Spatio-Temporal Field Neural Network and Pyramidal Inference framework, which integrate field and graph perspectives to achieve state-of-the-art nationwide air quality inference in Mainland China.

Predicting Parking Availability in Singapore with Cross-Domain Data: A New Dataset and A Data-Driven Approach

Huaiwu Zhang, Yutong Xia, Siru Zhong, Kun Wang, Zekun Tong, Qingsong Wen, Roger Zimmermann, Yuxuan Liang

The International Joint Conference on Artificial Intelligence (IJCAI) 2024 Spotlight

Introduce DeepPA, a deep-learning framework, and the SINPA dataset for accurately predicting real-time parking availability across Singapore, outperforming existing models and supporting urban planning through a deployed web platform.

UrbanCLIP: Learning Text-enhanced Urban Region Profiling with Contrastive Language-Image Pretraining from the Web

Yibo Yan, Haomin Wen, Siru Zhong, Wei Chen, Haodong Chen, Qingsong Wen, Roger Zimmermann, Yuxuan Liang

The International World Wide Web Conference (WWW) 2024 Oral

Introduce UrbanCLIP, the first large language model–enhanced framework that integrates textual descriptions with satellite imagery through contrastive language-image pretraining, significantly improving urban region profiling performance across major cities.
