2025-10-20
Fine-tuning an LVLM to enhance spatial and temporal understanding in traffic videos
I would like to sincerely thank all the members of my team: TA Tinh Anh, TA Tien Huy, Dao Tran Minh Triet, Le Gia Phuc, and Vo Lan Tuan.
Disclaimer: This blog is not an official post for the paper or project; it is based on my own experience and reflects what I learned from this challenge.
Official description of AI City Challenge:
The AI City Challenge, hosted at ICCV 2025, is designed to push the boundaries of computer vision and AI in real-world settings, driving innovative solutions for smarter transportation, large-scale industrial analytics, and public safety. By tackling multifaceted data sources—from cameras and lidars in roadway intersections to high-fidelity synthetic warehouses—participants in this challenge will develop and benchmark new methods capable of gleaning actionable insights under real-time or near real-time constraints.
In other words, the AI City Challenge is an annual workshop held to promote the application of AI in real-world settings. This year (2025), four tracks are available.
Our team participated in Track 2, which requires captioning and answering questions about traffic scenarios. Track 2's official description:
Challenge Track 2. Traffic Safety Description and Analysis: Using multiple cameras and viewpoints, participants are challenged to describe both the moments leading up to incidents and the normal traffic flow, capturing all relevant details about pedestrian and vehicle behavior. The task also includes a video question-answering component to assess fine-grained understanding. The dataset has been enhanced with 3D gaze annotations and traffic video question answering. Quantitative scores will reflect accuracy on question answering, caption quality, and the fidelity of scene reconstruction in terms of directions, actions, and attributes.
The provided datasets are WTS and BDD-PC-5K (filtered from BDD100K). Both datasets include videos and annotations for the scenarios. In each scenario, multiple cameras placed at different locations provide multiple views of the scene. Captions are provided and divided into five phases, from the moment the pedestrian has not yet noticed the vehicle in phase 1 to the end of the simulated accident in phase 5. The multiple-choice questions cover a wide range of information, such as the pedestrian's age group, direction of travel, or environmental context. Other annotations, such as bounding boxes for both pedestrians and vehicles, are included, together with the pedestrian's gaze direction.
To be specific, the WTS dataset consists of 249 distinct scenarios, and BDD-PC-5K consists of 3,402 videos recorded from a vehicle's dashboard camera.
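To make the five-phase structure concrete, here is a minimal sketch of how one scenario's annotations could be organized and read. The field names (`phase`, `start_time`, `caption`) and camera names are illustrative assumptions, not the datasets' actual schema.

```python
# Hypothetical annotation layout for one scenario; the real WTS/BDD-PC-5K
# schema may differ -- the field and camera names here are illustrative only.
scenario = {
    "scenario_id": "example_0001",
    "cameras": ["overhead_1", "overhead_2", "vehicle_dashcam"],
    "phases": [
        {"phase": 1, "start_time": 0.0, "end_time": 2.1,
         "caption": "The pedestrian has not yet noticed the vehicle."},
        {"phase": 5, "start_time": 8.4, "end_time": 10.0,
         "caption": "The simulated collision occurs."},
    ],
}

def captions_in_order(scenario):
    """Return the phase captions sorted by phase index."""
    return [p["caption"] for p in sorted(scenario["phases"],
                                         key=lambda p: p["phase"])]
```

Each phase thus carries its own time window, which is what later makes per-phase frame sampling possible.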
To build a foundation, we read previous works on the problem and extracted key ideas from the papers. After researching many papers on video captioning, traffic-related methods, and so on, we decided that our first main contribution would be a caption-decomposition strategy (more detail can be found in the paper). Building on previous work, we used an LLM such as Qwen-72B to split each scenario's descriptions into two parts, spatial (unchanged over time) and temporal (rapidly changing over time), instead of four parts (Appearance, Location, Environment and Attention). In this way, we reduce the required computation by using only three models instead of five.
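A minimal sketch of this decomposition step, assuming the LLM is asked to answer in JSON: build a prompt with the caption, then parse the model's reply into the two parts. The prompt wording and the JSON keys are my assumptions, and the actual call to the model (e.g. Qwen-72B) is left out.

```python
import json

# Prompt template for the decomposition; the wording is an assumption,
# not the exact prompt used in the paper.
DECOMPOSE_PROMPT = (
    "Split the following traffic-scene caption into two parts and answer "
    'in JSON with keys "spatial" and "temporal".\n'
    "- spatial: attributes that stay constant over time "
    "(appearance, environment)\n"
    "- temporal: attributes that change over time "
    "(actions, motion, attention)\n"
    "Caption: {caption}"
)

def build_prompt(caption: str) -> str:
    """Fill the template with one scenario caption."""
    return DECOMPOSE_PROMPT.format(caption=caption)

def parse_decomposition(llm_reply: str) -> tuple[str, str]:
    """Parse the model's JSON reply into (spatial, temporal) strings."""
    data = json.loads(llm_reply)
    return data["spatial"], data["temporal"]
```

The resulting spatial part can then be handled by a model that sees only a few frames, while the temporal part goes to a model that reasons over the frame sequence.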
The next question is how to integrate temporal information into the model. This is where we introduce a novel frame-selection strategy with best-view filtering to include only meaningful and relevant frames.
More details can be found in the paper.