Huijuan Xu, BU. “Video Understanding with Localization”


Position:  PhD Student

Current Institution:  Boston University

Abstract:  Video Understanding with Localization

Millions of videos are uploaded every day, and many are long and untrimmed with only sparse events inside, such as surveillance footage. Automatically identifying events would be very helpful for monitoring such large amounts of video data. However, previous temporal activity detection models localize events with sliding-window approaches, which produce inflexible activity boundaries and are time-consuming. We propose the Region Convolutional 3D Network (R-C3D) for temporal activity detection, which can detect activities of arbitrary length and runs fast by using a proposal stage to filter out irrelevant background segments. Furthermore, we combine event localization in video with natural language components, enhancing video understanding with richer language description through two topics: (1) dense video captioning, which localizes distinct events in a video stream and generates captions for the localized events, and (2) natural language localization in video, which temporally localizes an input query sentence in the video.
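The proposal-then-classify idea in the abstract can be illustrated with a small sketch. This is not the R-C3D implementation; the thresholds, the scoring interface, and the helper names below are illustrative assumptions. It shows how a proposal stage discards likely-background segments and how temporal non-maximum suppression keeps non-overlapping detections of arbitrary length:

```python
# Illustrative sketch of proposal-based temporal activity detection
# (hypothetical helper names and thresholds; not the actual R-C3D code).

def temporal_iou(a, b):
    """Intersection-over-union of two (start, end) time segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def detect(proposals, score_thresh=0.5, nms_thresh=0.4):
    """proposals: list of ((start, end), activity_score) pairs.

    1. Proposal-stage filtering: drop segments scored as background.
    2. Temporal NMS: keep the highest-scoring segment among overlaps.
    """
    kept = [p for p in proposals if p[1] >= score_thresh]
    kept.sort(key=lambda p: p[1], reverse=True)
    results = []
    for seg, score in kept:
        if all(temporal_iou(seg, r[0]) < nms_thresh for r in results):
            results.append((seg, score))
    return results

# Segments of very different lengths survive, unlike a fixed sliding window.
dets = detect([((0.0, 5.0), 0.9), ((0.5, 5.5), 0.8),
               ((20.0, 60.0), 0.7), ((10.0, 11.0), 0.2)])
# → [((0.0, 5.0), 0.9), ((20.0, 60.0), 0.7)]
```

The low-scoring (10.0, 11.0) segment is removed by the proposal filter, and (0.5, 5.5) is suppressed because it overlaps the stronger (0.0, 5.0) detection.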

Huijuan Xu is a PhD student in the computer science department at Boston University, advised by Professor Kate Saenko. Her research focuses on deep learning, computer vision, natural language processing, and video understanding with localization. Specifically, her work explores visual question answering, video language description, cross-modal retrieval, and temporal activity detection. Her work “Region Convolutional 3D Network (R-C3D) for Temporal Activity Detection” won the Most Innovative Award in the ActivityNet Challenge 2017.